RackHD Overview

RackHD serves as an abstraction layer between other management and orchestration (M&O) layers and the underlying physical hardware. Developers can use the RackHD API to create a user interface that serves as a single point of access for managing hardware services, regardless of the specific hardware in place.

RackHD can discover existing hardware resources, catalog each component, and retrieve detailed telemetry information from each resource. The retrieved information can then be used to perform low-level hardware management tasks, such as BIOS configuration, OS installation, and firmware management.

RackHD sits between the other M&O layers and the underlying physical hardware devices. User interfaces at the higher M&O layers can request hardware services from RackHD. RackHD handles the details of connecting to and managing the hardware devices.

The RackHD API allows you to automate a wide range of management tasks, including:

Install, configure, and monitor bare metal hardware (compute servers, PDUs, DAEs, network switches).
Provision and erase server OSes.
Install and upgrade firmware.
Monitor bare metal hardware through out-of-band management interfaces.
Provide data feeds for alerts and raw telemetry from hardware.

Vision

Feature	Description
Discovery and Cataloging	Discovers the compute, network, and storage resources and catalogs their attributes and capabilities.
Telemetry and Genealogy	Telemetry data includes genealogical details, such as hardware revisions, serial numbers, and dates of manufacture.
Device Management	Powers devices on and off. Manages the firmware, power, OS installation, and base configuration of the resources.
Configuration	Configures the hardware per application requirements. This can range from the BIOS configuration on compute devices to the port configurations in a network switch.
Provisioning	Provisions a node to support the intended application workflow, for example, by laying down ESXi from an image repository. Reprovisions a node to support a different workload, for example, by replacing the ESXi platform with bare-metal CentOS.
Firmware Management	Manages all infrastructure firmware versioning.
Logging	Log information can be retrieved for particular elements or collated into a single timeline for multiple elements within the management neighborhood.
Environmental Monitoring	Aggregates environmental data from hardware resources. The data to monitor is configurable and can include power information, component status, fan performance, and other information provided by the resource.
Fault Detection	Monitors compute and storage devices for both hard and soft faults. Performs suitable responses based on pre-defined policies.
Analytics Data	Data generated by environmental and fault monitoring can be provided to analytic tools for analysis, particularly around predictive failure.

Goals

The primary goals of RackHD are to provide REST APIs and live data feeds that enable automated solutions for managing hardware resources. The technology and architecture are built to provide a platform-agnostic solution.

The combination of these services is intended to provide a REST API based service to:

Install, configure, and monitor bare metal hardware, such as compute servers, power distribution units (PDUs), direct attached extenders (DAE) for storage, and network switches.
Provision, erase, and reprovision a compute server’s OS.
Install and upgrade firmware for qualified hardware.
Monitor bare metal hardware, and raise alerts about it, through out-of-band management interfaces.
Provide RESTful APIs for convenient access to knowledge about both common and vendor-specific hardware.
Provide pub/sub data feeds for alerts and raw telemetry from hardware.

The RackHD Project

The original motive centered on maximizing the automation of firmware and BIOS updates in the data center, thereby reducing the extensive manual processes that are still required for these operations.

Existing open source solutions do an admirable job of inventory and bare OS provisioning, but the ability to upgrade firmware is beyond the technology stacks currently available (such as xCat, Cobbler, Razor, or Hanlon). By adding an event-based workflow engine that works in conjunction with classical PXE booting, RackHD makes it possible to architect different deployment configurations, as described in How It Works and Deployment Environment.

RackHD extends automation beyond simple PXE booting. It can perform highly customizable tasks on machines, as is illustrated by the following sequence:

PXE boot the server
Interrogate the hardware to determine if it has the correct firmware version
If needed, flash the firmware to the correct version
Reboot (required after operations such as BIOS and BMC flashing)
PXE boot again
Interrogate the hardware to ensure it has the correct firmware version
SCORE!
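
The sequence above can be sketched as plain control flow. This is only an illustration of the ordering of the steps, not RackHD code: `FakeNode`, its methods, and `ensure_firmware` are hypothetical names invented for this example, and a real workflow would be expressed as RackHD tasks rather than direct method calls.

```python
from dataclasses import dataclass, field


@dataclass
class FakeNode:
    """Hypothetical stand-in for a server; not a RackHD object."""
    firmware: str
    log: list = field(default_factory=list)

    def pxe_boot(self):
        self.log.append("pxe_boot")

    def read_firmware_version(self):
        self.log.append("interrogate")
        return self.firmware

    def flash_firmware(self, version):
        self.log.append("flash")
        self.firmware = version

    def reboot(self):
        self.log.append("reboot")


def ensure_firmware(node, wanted):
    """Boot, check, flash if needed, reboot, and verify. Returns True on success."""
    node.pxe_boot()
    if node.read_firmware_version() == wanted:
        return True  # already at the correct version, nothing to flash
    node.flash_firmware(wanted)
    node.reboot()  # the new firmware only takes effect after a reboot
    node.pxe_boot()
    return node.read_firmware_version() == wanted


node = FakeNode(firmware="1.0")
ok = ensure_firmware(node, "2.0")
print(ok, node.log)
```

The final interrogation is what distinguishes this from a one-shot update: the result of the flash is verified after the second boot instead of being assumed.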

In effect, RackHD combines open source tools with a declarative, event-based workflow engine. It is similar to Razor and Hanlon in that it sets up and boots a microkernel that can perform predefined tasks. However, it extends this model by adding a remote agent that communicates with the workflow engine to dynamically determine the tasks to perform on the target machine, such as zeroing out disks, interrogating the PCI bus, or resetting the IPMI settings through the host’s internal KCS channel.

Along with this agent-to-workflow integration, RackHD optimizes the path for interrogating and gathering data. It leverages existing Linux tools and parses outputs that are sent back and stored as free-form JSON data structures.
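
As a rough illustration of turning tool output into a free-form JSON structure, the sketch below parses simple `Key: value` lines into a dictionary and serializes it. The function name, the sample text, and the resulting field names are all made up for this example; the actual parsers and catalog formats used by RackHD may look quite different.

```python
import json


def parse_key_value_output(text):
    """Turn 'Key: value' lines into a dict, skipping lines without a colon."""
    catalog = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            catalog[key.strip()] = value.strip()
    return catalog


# Invented sample text, shaped like the output of a typical inventory tool.
SAMPLE_OUTPUT = """\
Manufacturer: Example Corp
Product Name: Example Server 1000
Serial Number: ABC123
"""

catalog_json = json.dumps(parse_key_value_output(SAMPLE_OUTPUT), indent=2)
print(catalog_json)
```

Because the result is free-form, no schema is imposed here: whatever keys the tool reports become keys in the stored document.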

The workflow engine was extended to support polling via out-of-band interfaces in order to capture sensor information and other data that can be retrieved using IPMI. In RackHD these become pollers that periodically capture telemetry data from the hardware interfaces.
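
The idea of a poller can be sketched as a loop that reads a value at a fixed interval and records each sample. The names `run_poller` and `fake_inlet_temperature` are hypothetical, and the sensor read is a stub; a real poller would query the hardware over an out-of-band interface such as IPMI and would typically run until stopped, not for a fixed number of samples.

```python
import time


def run_poller(read_sensor, interval_seconds, samples, sleep=time.sleep):
    """Call read_sensor `samples` times, pausing between calls, and collect readings."""
    readings = []
    for i in range(samples):
        readings.append({"sample": i, "value": read_sensor()})
        if i < samples - 1:
            sleep(interval_seconds)  # wait before the next poll
    return readings


def fake_inlet_temperature():
    """Hypothetical sensor read; returns a constant instead of querying a BMC."""
    return 24.5


readings = run_poller(fake_inlet_temperature, interval_seconds=0.01, samples=3)
print(readings)
```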

What RackHD Does Well

RackHD focuses on being the lowest level of automation: it interrogates hardware in a vendor-agnostic way and provisions machines with operating systems. The API can pass data into a workflow through variables in the workflow configuration, so you can parameterize workflows. Because workflows also have access to all of the SKU information and other catalogs, they can be authored to react to that information.

The real power of RackHD, therefore, is that you can develop your own workflows and use the REST API to pass in dynamic configuration details. This allows you to execute a specific sequence of arbitrary tasks that satisfy your requirements.
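
A minimal sketch of passing dynamic configuration to a workflow over REST might look like the following. The base URL, node identifier, endpoint path, workflow name, and payload shape are all assumptions made for illustration; consult the RackHD API reference for the actual routes and option names. The request is only constructed here, not sent.

```python
import json
import urllib.request

# Illustrative values only; none of these are confirmed RackHD names.
BASE_URL = "http://rackhd.example.com:8080/api/2.0"
NODE_ID = "example-node-id"


def build_workflow_request(base_url, node_id, workflow_name, options):
    """Build (but do not send) a POST request that starts a workflow on a node."""
    body = json.dumps({"name": workflow_name, "options": options}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/nodes/{node_id}/workflows",  # assumed endpoint path
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_workflow_request(
    BASE_URL,
    NODE_ID,
    "Graph.Example.InstallOs",  # hypothetical workflow name
    {"defaults": {"hostname": "node-01", "version": "7"}},
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it to a live server.
```

Keeping the per-run details in the `options` payload means the same workflow definition can be reused across machines with different parameters.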

When creating your initial workflows, it is recommended that you use the existing workflows in our code repository to see how different actions can be performed.

What RackHD Doesn’t Do

RackHD is a comparatively passive system. Workflows do not contain the complex logic for functionality that is implemented in the layers above hardware management and orchestration. For example, workflows do not provide scheduling functionality or choose which machines to allocate to particular services.

We document and expose the events around the workflow engine so that they can be used, extended, and incorporated into an infrastructure management system, but we did not extend RackHD itself into the infrastructure layer.

Comparison with Other Projects

Comparison to other open source technologies:

Cobbler comparison

Grand-daddy of open source tools to enable PXE imaging
Original workhorse of datacenter PXE automation
XML-RPC interface for automation, no REST interface
No dynamic events or control for TFTP, DHCP
Extensive manual and OS level configuration needed to utilize
One-shot operations: not structured to change personalities (the installed OS) on a target machine, or to handle the multiple reboots that some firmware updates need
No workflow engine or concept of orchestration with multiple reboots

Razor/Hanlon comparison

HTTP wrapper around stock open source tools to enable PXE booting (DHCP, TFTP, HTTP)
Razor and Hanlon extended beyond Cobbler’s concepts to include a microkernel that interrogates the remote host and uses that information with policies to choose what to PXE boot
Razor is not set up to make dynamic responses through TFTP or DHCP, whereas RackHD uses dynamic PXE responses based on current state to enable workflows
Catalog and policy are roughly equivalent to the RackHD default/discovery workflow and SKU mechanism, but oriented toward a single OS deployment for a piece or type of hardware
Razor and Hanlon are often focused on hardware inventory to choose and enable OS installation through Razor’s policy mechanisms.
No workflow engine or concept of orchestration with multiple reboots
Tightly bound to and maintained by Puppet
Forked variant Hanlon used for Chef Metal driver

xCat comparison

HPC cluster-centric tool focused on IBM-supported hardware
Firmware update features restricted to IBM/Lenovo proprietary hardware where firmware was made to “one-shot-update”, not explicitly requiring a reboot
Has no concept of workflow or sequencing
Has no obvious mechanism for failure recovery
Competing with Puppet/Chef/Ansible/CFEngine to own the config management story
Extensibility model tied exclusively to Perl code
REST API is extremely light, with a focus on CLI management
Built as a master controller of infrastructure vs an element in the process