A customer of mine runs a Dell VRTX chassis with two M630 blades that serve as vSphere 6.5 hosts. Over the weekend, one blade hit a critical error, rebooted itself, and then locked up at the iDRAC initialization screen. Meanwhile, vSphere HA attempted to do its thing and restart the guests on the remaining host, but something in that process failed as well, leaving the guests already on that host running (albeit poorly) while the guests coming over from the failed blade never started. To make matters worse, I lost all manageability of the entire system. I was unable to manage vSphere via vCenter or the remaining host directly (logins were being rejected because of service unavailability), and I could not manage the VRTX via the CMC or access the iDRACs! No response to pings, either.
After a quick head-scratching pause, I determined the CMC and iDRAC issue to be with a switch used for server and device management: a Dell PowerConnect 5448 configured with a separate VLAN for that purpose. It turned out the switch had somehow been reset to factory defaults and lost its VLAN and trunk configuration (well, all of its configuration). Fixing that up took me about 15 minutes, at which point I could once again manage the VRTX via the CMC and iDRACs.
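For anyone curious, the fix was just re-creating the management VLAN, putting the right ports back in it, and re-trunking the uplink. The sketch below is from memory and uses made-up VLAN and port numbers, so treat it as the shape of the config rather than a copy of the real one (the PowerConnect 54xx CLI is close to, but not exactly, IOS):

  configure
  vlan database
  ! management VLAN - the ID here is hypothetical
  vlan 100
  exit
  ! port facing the CMC/iDRAC network
  interface ethernet g1
  switchport mode access
  switchport access vlan 100
  exit
  ! uplink port carrying the management VLAN alongside the others
  interface ethernet g48
  switchport mode trunk
  switchport trunk allowed vlan add 100
  exit
  exit
  ! save it so the next power blip doesn't undo the work again
  copy running-config startup-config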
This whole incident was simply weird. It started with a call from my client stating they’d lost access to several servers. Why I wasn’t alerted before that by our monitoring system is a blog post for another day. After a quick confirmation that stuff was indeed down, I decided that a failed host and a problem with HA was the likely cause, so I high-tailed it over to their office. At first glance at the VRTX itself, I began second-guessing my initial diagnosis. The box was showing no signs of trouble – no orange LEDs or abnormal messages on the LCD panel. OK, so maybe it was some sort of hypervisor failure – stuff happens. That’s when I discovered the lack of manageability. I had to resort to the KVM to get a look at the hosts themselves. Blade 1 looked OK, but blade 2 was stuck on the iDRAC initialization screen… not good. A reboot of that blade sent the front-panel LEDs and LCD completely nuts. Blade 2 made it past the iDRAC portion this time and stopped again with an indication of a failed memory module. Ah-ha. A look at that blade’s iDRAC echoed the console output: the DIMM in slot A1 had failed. The CMC and the other blade’s iDRAC showed no other issues. But why couldn’t I get into that other host to see what was going on with vSphere?
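(Side note: once an iDRAC is reachable again, you don’t have to stand at the console to read that failure – the blade’s System Event Log can be pulled with remote racadm. The address and credentials below are obviously placeholders:)

  # Pull the System Event Log from the failed blade's iDRAC
  # (iDRAC IP, username, and password are placeholders)
  racadm -r 192.168.100.22 -u root -p 'NotTheRealPassword' getsel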
The four guests on the remaining host were running so poorly they were just about useless. I ended up having to reboot the “good” blade after everything else I threw at it failed. Yes, even the CLI. It came back up perfectly normally, manageability was restored, and I was able to start all of the guests. HA had moved the failed host’s machines over to this blade, but they had all failed to power on for some reason. So now I was in a better place: my client was back online with all of their Windows and Linux virtual machines running on the one blade, and performance was pretty good. They also have a physical backup DC and an AS/400 that were not affected by this catastrophe.
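If you’re wondering what “even the CLI” covers, it was the standard management-agent restarts from the ESXi shell – something along these lines, which will normally revive a host whose hostd or vpxa has wedged, but didn’t help here:

  # Restart hostd (host agent) and vpxa (vCenter agent) individually
  /etc/init.d/hostd restart
  /etc/init.d/vpxa restart
  # Or restart the whole set of management services in one shot
  services.sh restart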
So then my call to Dell started. To make a long story short, they confirmed the DIMM was bad after having me swap it into a different slot, and shipped a replacement within the four-hour mission-critical ProSupport window. It was a bit odd on Dell’s part – they actually had me do all the work rather than sending a tech, though I didn’t ask for one. It’s just a little strange that they trusted someone whose technical ability they knew nothing about to mess with a $50k piece of equipment. Ninety minutes later I had the new DIMM in hand.
After replacing the DIMM and powering the blade back up, I thought I was home free. But when vCenter reported a connection error while trying to reach the host, I realized my day wasn’t over. If you’ve made it this far, I won’t bore you with every remaining troubleshooting step, but to summarize:
The internal 10Gb links between this blade and the chassis switch were in a down state and couldn’t be brought back up. Another call to Dell had me re-seating the NIC in the blade and swapping the blade into another slot (same problem), and ultimately, after almost three hours of troubleshooting, they wanted me to reload the switch module in the chassis. I did that on my own after hours so my client wouldn’t be interrupted again. The same problem persisted, so I tried putting the four 10Gb links into an administratively down state and bringing them back up. BINGO – they came back up and started passing traffic normally. BUT… then I noticed that only one of the uplinks between the VRTX and the LAN was up. VRTX-to-LAN connectivity runs through a Cisco Catalyst switch over dual 10Gb links in an EtherChannel/LAG with LACP, as set up by Dell’s deployment team. I reloaded the VRTX chassis switch once more, and everything was back to normal after the 10-15 minute reload cycle.
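For reference, the bounce that finally got the internal links passing traffic was nothing fancy – just an administrative shutdown/no shutdown on the blade-facing ports from the switch module’s CLI. The interface names below are illustrative; I’m not quoting the exact port numbering from memory:

  configure
  ! internal blade-facing 10Gb ports - range and naming are illustrative
  interface range tengigabitethernet 0/1-4
  shutdown
  no shutdown
  exit

As for the uplink, the quickest way to see how many member links are actually bundled is from the Catalyst side with “show etherchannel summary” and “show lacp neighbor,” with the VRTX switch module’s port-channel show commands telling the same story from the other end.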
It is disturbing to me that the VRTX failed so miserably following the failure of a single blade. Redundancy is the whole point of buying a product like this. The vSphere logs mark the beginning of the trouble: memory errors were registered, and connectivity to that host was lost shortly afterward. HA started doing its thing, but all logging stopped abruptly about 20 minutes after the first sign of trouble and didn’t resume until I rebooted the good blade.
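(If you want to dig this sort of thing out yourself after an incident, the memory errors land in the vmkernel log on the host – roughly this from the ESXi shell, nothing more sophisticated than grep:)

  # Machine-check and memory errors end up in the vmkernel log
  grep -iE "mce|memory" /var/log/vmkernel.log
  # hostd's log shows when management connectivity actually dropped
  less /var/log/hostd.log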
I think the internal VRTX switch is unstable and unreliable. It was only one firmware release behind the newest available, and that release appeared to be only a security patch according to the release notes. After dealing with another catastrophic VRTX failure a couple of years ago (a storage controller failure and complete data loss – another inexplicable event that was eventually chased down to a faulty riser card and an unfortunate misstep by a tech), I’m not sure I would ever recommend one to another client.