Sequence Diagram for the Discovery Workflow
The diagram is made with WebSequenceDiagrams.
To see if the DHCP request was received by ISC DHCP, look in /var/log/syslog of the RackHD host. grep DHCP /var/log/syslog works reasonably well - you’re looking for a sequence like this:
Jan 8 15:43:43 rackhd-demo dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3 (xid=0x5b3b9260)
Jan 8 15:43:43 rackhd-demo dhclient: DHCPREQUEST of 10.0.2.15 on eth0 to 255.255.255.255 port 67 (xid=0x60923b5b)
Jan 8 15:43:43 rackhd-demo dhclient: DHCPOFFER of 10.0.2.15 from 10.0.2.2
Jan 8 15:43:43 rackhd-demo dhclient: DHCPACK of 10.0.2.15 from 10.0.2.2
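The same check can be scripted. The following is a minimal Python sketch, assuming the syslog lines look like the sample above; the names `dhcp_messages` and `has_full_exchange` and the regular expression are illustrative and are not part of RackHD.

```python
import re

# Matches the DHCP message type in a syslog line, e.g. "DHCPDISCOVER".
# The list of message types is an assumption based on the sample lines above.
DHCP_RE = re.compile(r"\b(DHCP(?:DISCOVER|OFFER|REQUEST|ACK|NAK))\b")


def dhcp_messages(lines):
    """Return the DHCP message types found in the given lines, in order."""
    found = []
    for line in lines:
        match = DHCP_RE.search(line)
        if match:
            found.append(match.group(1))
    return found


def has_full_exchange(lines):
    """True if DISCOVER, OFFER, REQUEST and ACK each appear at least once."""
    seen = set(dhcp_messages(lines))
    return {"DHCPDISCOVER", "DHCPOFFER", "DHCPREQUEST", "DHCPACK"} <= seen
```

Both functions accept any iterable of lines, so an open file handle for /var/log/syslog can be passed directly to `has_full_exchange`.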
You should also see the DHCP proxy return the bootfile. In the DHCP-proxy logs, look for lines with DHCP.messageHandler:
S 2016-01-08T19:31:43.268Z [on-dhcp-proxy] [DHCP.messageHandler] [Server] Unknown node 08:00:27:f3:9f:2e. Sending down default bootfile.
And immediately thereafter, you should see the server request the file from TFTP:
S 2016-01-08T19:31:43.352Z [on-tftp] [Tftp.Server] [Server] tftp: 67.300 monorail.ipxe
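To check for this pair of log lines automatically, something like the following Python sketch could work. It assumes every log line follows the layout of the two samples above (level, ISO timestamp, then three bracketed fields and a message). The field names, the function names, and the 5-second window used for "immediately thereafter" are assumptions made for illustration.

```python
import re
from datetime import datetime

# Layout inferred from the sample lines:
# "<level> <timestamp> [service] [module] [subject] <message>"
LINE_RE = re.compile(
    r"^(?P<level>\w) (?P<time>\S+) \[(?P<service>[^\]]+)\] "
    r"\[(?P<module>[^\]]+)\] \[(?P<subject>[^\]]+)\] (?P<message>.*)$"
)


def parse_line(line):
    """Split one log line into a dict of fields, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return entry


def tftp_follows_bootfile(lines, window_seconds=5.0):
    """True if an on-tftp line appears within window_seconds after a
    DHCP.messageHandler line that mentions the bootfile."""
    last_bootfile = None
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        if entry["module"] == "DHCP.messageHandler" and "bootfile" in entry["message"]:
            last_bootfile = entry["time"]
        elif entry["service"] == "on-tftp" and last_bootfile is not None:
            if (entry["time"] - last_bootfile).total_seconds() <= window_seconds:
                return True
    return False
```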
title Default Discovery Workflow
Server->RackHD: DHCP from PXE(nic or BIOS)
RackHD->Server: ISC DHCP response with IP
RackHD->Server: DHCP-proxy response with bootfile
note right of RackHD: If the node is already "known", it will only respond if there's an active workflow that's been invoked related to the node
Server->RackHD: Request to download bootfile via TFTP
RackHD->Server: TFTP sends requested file (monorail.ipxe)
note left of Server: Server loads monorail.ipxe and initiates on bootloader
Server->RackHD: iPXE script requests what to do from RackHD (http)
note right of RackHD: multiline
    RackHD looks up IP address of HTTP request from iPXE script to find the node via its mac-address.
    If the node is already "known", it will only respond if there's an active workflow that's been invoked related to the node
    If the node isn't known, it will create a workflow (default is the workflow 'Graph.Sku.Discovery') and respond with an iPXE script to initiate that
end note
RackHD->Server: iPXE script (what RackHD calls a Profile) (via http)
note left of Server: iPXE script with discovery microkernel and initrd (http)
Server->RackHD: iPXE requests static file - the vmlinuz kernel
RackHD->Server: vmlinuz (http)
Server->RackHD: iPXE requests static file - initrd
RackHD->Server: initrd (http)
note left of Server: Server loads the kernel and initrd and transfers control (boots that microkernel)
Server->RackHD: initrd loads additional file (overlay) from Server to extend microkernel
note left of Server: the discovery microkernel is set to request and launch a NodeJS task runner
Server->RackHD: requests the bootstrap.js template
RackHD->Server: bootstrap.js filled out with values specific to the node based on a lookup
note left of Server: runs node bootstrap.js
Server->RackHD: bootstrap asks for tasks (what should I do?)
RackHD->Server: data packet of tasks (via http)
note left of Server: Discovery Workflow passes down tasks to interrogate hardware
loop for each Task from RackHD
    Server->RackHD: output of task
end
note right of RackHD multiline
    task output stored as catalogs in RackHD related to the node
    if RackHD is configured with SKU definitions, it processes these catalogs to determine the SKU
    if there's a SKU specific workflow defined, control is continued to that
    the discovery workflow will create an enclosure node based on the catalog data
    the discovery workflow will also create IPMI pollers for the node if relevant information can be found in the catalog
end note
Server->RackHD: bootstrap asks for tasks (what should I do?)
RackHD->Server: Nothing more, thanks - please reboot (via http)
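The last part of the diagram (bootstrap asks for tasks, runs them, reports output, and asks again until told to reboot) can be modelled as a simple loop. The Python sketch below is only an in-memory illustration of that exchange. It is not RackHD's bootstrap.js, and `FakeRackHD`, `run_bootstrap`, and the task names are hypothetical.

```python
from collections import deque


class FakeRackHD:
    """In-memory stand-in for the RackHD side of the exchange (hypothetical)."""

    def __init__(self, tasks):
        self._pending = deque(tasks)
        self.catalogs = {}  # task output stored per task, like catalogs for a node

    def get_tasks(self):
        # Hand out all pending tasks once; an empty list afterwards
        # stands for "nothing more, please reboot".
        batch = list(self._pending)
        self._pending.clear()
        return batch

    def post_output(self, task_name, output):
        self.catalogs[task_name] = output


def run_bootstrap(server, run_task):
    """Ask for tasks, run each one, report its output, repeat until none remain."""
    while True:
        tasks = server.get_tasks()
        if not tasks:
            return "reboot"
        for task in tasks:
            server.post_output(task, run_task(task))
```

For example, `run_bootstrap(FakeRackHD(["task-a", "task-b"]), lambda t: "output of " + t)` runs both placeholder tasks, stores their output on the fake server, and then returns `"reboot"` on the second request.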
The footprint benchmark test collects system data while running the poller (15 min), node discovery, and CentOS bootstrap test cases. It can also run independently of any test case, allowing users to measure the footprint of any operation they carry out. The data includes the CPU, memory, disk, and network consumption of every RackHD process, as well as the RabbitMQ and MongoDB processes. The result is presented as HTML files. For more details, see the wiki page proposal-footprint-benchmarks.
The footprint benchmark test is integrated into the RackHD test framework. It can be executed as long as the machine running the test can access the RackHD API and manipulate the RackHD machine via SSH.
sudo socat -d -d TCP4-LISTEN:55672,reuseaddr,fork TCP4:localhost:5672
Clone the test repo from GitHub
git clone https://github.com/RackHD/RackHD.git
Enter the test directory and install the required modules in a virtual environment
cd RackHD/test
virtualenv .venv
source .venv/bin/activate
pip install -r requirements.txt
Configure the RackHD-related parameters in config.ini
Run the test. The first time a user starts the test, they are asked for the username and password of a sudoer on localhost.
To run only one of the three benchmark cases, use the following command
python benchmark.py --group=poller|discovery|bootstrap
Run footprint data collection independently
python benchmark.py --start|stop
To get the directory of the latest log file
python benchmark.py --getdir
After the test finishes, the results are in ~/benchmark, arranged by timestamp and case name. Use the command below to open Chrome
chrome.exe --user-data-dir="C:/Chrome dev session" --allow-file-access-from-files
In the “report” directory of the case, drag summary.html into Chrome. The page shows the footprint data and graphs, and you can compare them with previous runs by selecting another case from the drop-down menu on the page.
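To locate the most recent report without browsing the directories by hand, a small Python sketch like the one below may help. It assumes only what is described above: each case has a “report” directory containing summary.html somewhere under ~/benchmark. The function name `latest_summary` is illustrative, and the exact directory nesting is not assumed.

```python
from pathlib import Path


def latest_summary(root="~/benchmark"):
    """Return the most recently modified report/summary.html under root, or None."""
    base = Path(root).expanduser()
    candidates = [
        path
        for path in base.rglob("summary.html")
        if path.parent.name == "report"  # only files inside a "report" directory
    ]
    if not candidates:
        return None
    # Pick the file with the newest modification time.
    return max(candidates, key=lambda path: path.stat().st_mtime)
```

Calling `latest_summary()` with no arguments searches ~/benchmark and returns a `Path` that can be dragged into or opened from Chrome.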
I’m seeing this warning appear in the logs but it all seems to be working. What’s happening?
W 2016-01-29T21:06:22.756Z [on-tftp] [Tftp.Server] [Server] Tftp error -> /lib/server.js:57 file: monorail.ipxe remoteAddress: 172.31.128.5 remotePort: 2070
W 2016-01-29T21:12:43.783Z [on-tftp] [Tftp.Server] [Server] Tftp error -> /lib/server.js:57 file: monorail.ipxe remoteAddress: 172.31.128.5 remotePort: 2070
What I learned (so I may be wrong here, but I think it’s accurate) is that during the boot loading/PXE process the NICs will attempt to interact with TFTP in such a way that the first request almost always fails; it’s how the C code in those NICs negotiates with TFTP. So you’ll frequently see those errors in the logs, and then immediately see the same file downloaded on the second request from the NIC (or host) doing the boot loading.
When we’re bootstrapping a node (or running a workflow against a node in general) with a NUC, we sometimes see extended messages on the server’s console reading Link...... down, and depending on the network configuration we can see the node fail to bootstrap and respond to PXE.
The link down is a pernicious problem for PXE booting in general, and part of it is buried in how switches react and bring ports up and down. We’ve generally encouraged settings like “portfast”, which more aggressively bring up links that go down and come back up with a power cycle. In the NUCs you’re using, you’ll see that extensively, but it happens on all networks. If you have spanning-tree enabled, or settings like it, the delay gets longer. There’s only so much we can do to work around it, but fundamentally it means that while the relevant computer thinks things are “UP and OK” and has started a TFTP/PXE boot process, the switch hasn’t brought the NIC link up. So we added an explicit sleep in monorail.ipxe to extend the time and let networks converge, so that the process has a better chance of succeeding.
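The idea of giving the network time to converge before giving up can be sketched in Python as a poll with a timeout. This is an illustration of the general technique only: monorail.ipxe uses a fixed sleep rather than a poll, and the function `wait_for_link`, its parameters, and the `is_up` callback are hypothetical.

```python
import time


def wait_for_link(is_up, timeout=30.0, interval=1.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll is_up() until it returns True or timeout seconds have passed.

    is_up is a caller-supplied function that reports whether the link is up.
    sleep and clock can be replaced to make the function easy to test.
    """
    deadline = clock() + timeout
    while True:
        if is_up():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A longer `timeout` plays the same role as a longer sleep in the boot script: it trades a slower failure for a better chance that the switch has brought the port up.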