This story begins about a year ago, when my company decided to relocate offices. Instead of disposing of the lab equipment, they offered it to me. There was just one catch: my house wasn’t finished being built, which meant the equipment needed to be stored for a while. In my mind, it was worth the hassle, as I had always dreamed of having a lab, and this equipment was far better than buying NUCs or similar gear. I spent a weekend at the HQ, loaded the gear into a moving truck, and embarked on a one-way trip home (to storage).
Fast forward roughly six months: my house was now complete, and the equipment had found its new home. Excitement welled up within me, not just because I could finally play with the gear, but also because I no longer had to pay for the storage facility.
I decided to start by powering up the UPS to charge the batteries. However, critical alerts indicated that the UPS didn’t detect any installed batteries. I double-checked all the connections, swapped the battery positions, reset the UPS to factory defaults, and tested it with and without any load. It seemed that the batteries were dead, but I chose to address this issue later, as I didn’t feel like hunting down my multimeter and other electrical tools. At least the UPS alert could be silenced, and it supplied 208V to my rack.
I began to rack the networking gear at the top of the rack and the servers at the bottom. I plugged them into my PDU, and suddenly the room was filled with the sound of fans whirring and lights flashing. As I prepared my laptop, it dawned on me that I had neither credentials nor IP addresses for any of the equipment. Even if I had them, would the existing configuration be more complex than what I had envisioned for my own lab? I decided to reset the Mellanox SN2410 switch to factory defaults using the reset button on the back. From there, it was relatively simple to console into the unit and configure the MGMT0 port, allowing me to move on to the next step.
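For reference, the initial MGMT0 setup over the serial console looked roughly like the sketch below. The address and netmask are placeholders, and the exact syntax varies a little between MLNX-OS/ONYX releases, so treat this as an outline rather than a copy-paste recipe:

```
## Over the serial console (115200 8N1) after the factory reset:
switch > enable
switch # configure terminal
## Switch MGMT0 from DHCP to a static address (placeholder values)
switch (config) # no interface mgmt0 dhcp
switch (config) # interface mgmt0 ip address 192.168.1.10 255.255.255.0
## Save the running configuration
switch (config) # configuration write
```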
Next, I connected my Lenovo HX server, previously part of a Nutanix cluster. Due to its past use in a customer environment, the drives had been wiped and zeroed for security reasons. My plan was to perform a bare-metal foundation and deploy Nutanix AHV on the nodes.
I connected the networking side of the nodes and then the port for the Lenovo XClarity Controller (XCC). Since the drives had been wiped, there was no CVM left for Foundation’s automatic discovery to find, so that method obviously wasn’t going to work; instead, I assigned IP addresses to the nodes based on their XCC MAC addresses. One of the nodes refused to accept an IP address, and I remembered that the customer had reported hardware issues with it when it was returned to us. I assumed that was the problem and decided to postpone investigating that specific node while proceeding with the other three nodes in the block.
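If you want to reproduce that step, one common way to map MAC addresses to fixed IPs is a small DHCP server with static reservations for each XCC; the dnsmasq snippet below is a hypothetical example with placeholder MACs and addresses:

```
# /etc/dnsmasq.d/xcc.conf -- hypothetical static reservations for the XCC ports
interface=eth0
dhcp-range=192.168.1.50,192.168.1.99,12h
# Pin each XCC MAC address to a known IP (placeholder values)
dhcp-host=aa:bb:cc:00:00:01,192.168.1.21
dhcp-host=aa:bb:cc:00:00:02,192.168.1.22
dhcp-host=aa:bb:cc:00:00:03,192.168.1.23
dhcp-host=aa:bb:cc:00:00:04,192.168.1.24
```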
As soon as the Foundation tool initiated the process, it produced an error: ipmitool could not be found. I retried the deployment with some configuration changes, but the same error persisted. I considered whether it was a cabling issue, swapped out cables, switched to a different unmanaged flat switch, and even downloaded the latest version of the Foundation app, all in vain. I also tried resetting the BMC of each node to factory defaults and ensured that the legacy BIOS was selected after the reset. The same error persisted. I attempted to upgrade the BMC, UEFI, and LXPM firmware on the first node, but still encountered the same error.
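One sanity check worth doing at this stage, independent of Foundation, is confirming that each BMC actually responds to plain IPMI from the laptop; something along these lines, with placeholder IPs and credentials:

```
# Check BMC reachability directly (placeholder IP and credentials)
ipmitool -I lanplus -H 192.168.1.21 -U USERID -P PASSW0RD chassis power status
# Dump the BMC's LAN configuration for channel 1
ipmitool -I lanplus -H 192.168.1.21 -U USERID -P PASSW0RD lan print 1
```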
Frustration mounting, I decided to step away for the day.
I posted a question on the Nutanix Technology Champion (NTC) Slack channel, hoping someone might have encountered this issue before. A comment caught my attention: “I wonder if it’s looking for ipmitool inside the Foundation setup to perform ipmitool tasks on your new nodes?” It was time to delve into the Field Installation Guide and Foundation documentation. After some searching, I came across this information:
If using Foundation for Windows/Mac:
If you image nodes without discovery, the hardware support is limited as follows:
- Nutanix: Only G4 and above
- Dell
- HPE
[Link to the document: https://portal.nutanix.com/page/documents/details?targetId=Field-Installation-Guide-v5_5:fie-foundation-guidelines-c.html]
Lenovo wasn’t listed in this version of the document. When I checked a slightly older version, it explicitly stated that Lenovo was not supported.
I decided to download VirtualBox and the Standalone Foundation VM since Lenovo was supported through this method. I reconnected everything to the main SN2410 switch, and this time, the ipmitool error didn’t appear immediately. Excitement building, I took a break for food while waiting for the foundation process to complete. However, when I returned, there was another fatal error.
This time, it pertained to the software installation. I went back to basics and checked network connectivity. Strangely, I couldn’t ping anything on the network anymore. I examined the cables, and the NIC lights were no longer lit. I couldn’t even ping the switch when connected to the MGMT0 port.
I unplugged the power to the switch and let it boot back up. After some time, the same issue occurred. I disconnected all connections to the switch and power-cycled it again. This time, there was no issue. I researched and discovered a known compatibility problem between Mellanox and the NIC in the Lenovo HX chassis—the Intel X722. Of course, there was such a problem!
The notes mentioned that the issue could be overcome by disabling auto-negotiation on the ports. I attempted to run the command, but it turned out that “no-autoneg” was not a valid option on my ONYX release. After further research, I found that support for controlling that setting had been added in a later release.
I downloaded the latest release of the ONYX software and uploaded it to the switch. It applied successfully and required a reboot. While waiting for the reboot to complete, I monitored the MGMT0 interface with a ping and noticed it kept going down. To investigate, I connected the console cable again. It turned out the switch was stuck in an infinite crash loop. At this point, I could only laugh at myself and the situation.
Fortunately, I remembered from a previous experience that if I quickly pressed the reset button on the back of the switch and went to my laptop, I could choose which partition to boot into. I swiftly selected the older version, and the switch booted up normally.
I speculated that the issue might be related to the significant version difference (going from the 2017 version to the 2024 version). So, I decided to download a version from 2020 and repeated the process. This time, the installation was successful, and the switch booted up as expected.
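For anyone repeating the upgrade, the image swap itself followed the usual ONYX workflow, roughly as sketched below; the server address and image filename are placeholders, and the exact commands can differ slightly between releases:

```
## Fetch and install the new ONYX image (placeholder URL and filename)
switch # image fetch scp://admin@192.168.1.5/images/onyx-X86_64-3.9.NNNN.img
switch # image install onyx-X86_64-3.9.NNNN.img
## Point the next boot at the freshly installed partition, then reboot
switch # image boot next
switch # reload
## After it comes back, verify which partition is active
switch # show images
```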
I reviewed the configuration and was relieved to see that I could now disable auto-negotiation. I configured the ports as needed and reconnected the nodes. The lights started blinking again, giving me confidence that this might be the final step to get the foundation running smoothly.
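The port changes boil down to something like the sketch below; the port range and speed are placeholders for however your nodes are cabled:

```
switch # configure terminal
## Force the speed and disable auto-negotiation on the node-facing ports (placeholder range)
switch (config) # interface ethernet 1/1-1/4 speed 10G no-autoneg
switch (config) # interface ethernet 1/1-1/4 no shutdown
switch (config) # configuration write
```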
I booted up the Foundation VM again, restarted the process, and noticed it progressed further this time. When I saw that node 1 began copying the AOS tarball and preparing the host, a wave of relief washed over me. Finally.
However, about 30 seconds later, nodes 2 and 3 encountered a fatal error. Why? Node 1 was working perfectly. Then, I recalled one of the initial troubleshooting steps I took for the ipmitool error—upgrading the BMC, UEFI, and LXPM on node 1. So, I went ahead and completed the same process for nodes 2 and 3.
Crossing my fingers, I initiated the foundation process again. I’m happy to report that this time, everything went smoothly, and I now have an AHV-based Nutanix cluster up and running in my lab!
You may never encounter such a series of issues all at once, but I thought it would be valuable to share my experience. I hope you found my journey through these challenges both entertaining and enlightening. Happy Foundationing!