Tuesday, October 7, was a big day for me. After searching for more than three months for the cause of a repeated storage connectivity failure, I finally found a chunk of definitive data. The scientific method would be happy: I had a hypothesis, a consistently reproducible test, and a clear answer to a question that had hung unanswered in the ether for months.
My environment has never seemed eccentric or exceptional until EMC, VMware, and I were unable to explain why our ESXi hosts could not sustain a storage controller failover (June). It was a “non-disruptive update” sans the “non-“. The array, though, indicated no issues inside itself. The VMs and hosts depending on the disks didn’t agree.
As with any troubleshooting, the key follow-up is being able to reproduce the problem and to gather sufficient logs when you do, so that another downtime event isn't necessary afterward. We achieved the first part (repro) with ease, but came up short on analytical data to explain why (August). Since this was a production environment, repeatedly hard-crashing database servers wasn't in the cards.
The other participant organizations in this Easter egg hunt were suspicious of the QLogic 8262 Converged Network Adapter firmware as the culprit, apparently after receiving indications to that effect from QLogic. As that data came second-hand, I can’t say whether that was a guess or a hard-evidence-based hypothesis. Our CNAs were running the latest available from Dell’s FTP update site (via the Lifecycle Controller), but that repository stays a few revisions behind for some unknown yet intentional reason (ask Dell).
In late September, EMC was able to free up a demo XtremIO X-Brick so that we could perform concentrated testing around firmware and the behavior in general. This was a big step, as the prior three months had been consumed by e-mails ad nauseam trying to squeeze a cause out of insufficient data. EMC understood that “production” meant “not crashable at will”, but VMware didn’t seem to get that concept. That was slightly frustrating. The demo brick alleviated that.
On October 3, I inadvertently came across the finding that was solidified yesterday. In the past, my habit for setting multipathing policies on new storage was to configure them LUN by LUN in the vSphere Client. It was mundane, but we only maintain around a dozen datastores (that rarely change), so it didn’t take long. This fall, I’m trying to work on automation and finding methods to turn repetitive actions into single steps. A native multipathing (NMP) Storage Array Type Plug-in (SATP) rule doesn’t quite count as “automation”, but one line and a reboot does achieve my manual-action-reduction goal.
For reasons beyond the scope of this post, I had to rebuild the ESXi host I was using to reproduce the connectivity issue during XtremIO controller failovers. When I configured the NMP policy on the XtremIO storage, I violated the scientific method and accidentally changed something outside the test parameters: I configured NMP differently. It seemed the same, only less work, but the best-practice command for creating a custom NMP SATP rule also sets additional attributes. One of those is “IOPS=1”.
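For reference, the one-line rule takes roughly this shape on ESXi 5.x. The vendor/model strings and claim options below follow EMC's published XtremIO best practice as I understand it, but treat this as a sketch and verify the exact options against the current XtremIO host configuration guide for your versions:

```shell
# Add a custom SATP claim rule so every XtremIO LUN is automatically
# claimed with Round Robin path selection and a path switch after each
# I/O (iops=1), instead of setting the policy LUN by LUN in the client.
# New devices pick this up on claim; existing ones after a reboot.
esxcli storage nmp satp rule add \
  --satp VMW_SATP_DEFAULT_AA \
  --vendor XtremIO \
  --model XtremApp \
  --psp VMW_PSP_RR \
  --psp-option "iops=1" \
  --claim-option tpgs_off \
  --description "XtremIO Active/Active"
```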
That value tells ESXi to rotate between active storage paths for every single I/O operation, rather than the VMware default of every 1000 I/O operations. EMC and other vendors have begun adding this lower value for IOPS, as well as other performance tweaks, into their best practice recommendations (especially for all-flash arrays). I understood that suggestion, but until our old 3PAR array was/is finally put out to pasture, I have avoided those adjustments under the concept of “one change at a time”.
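For comparison, the same knob can be turned per LUN on an already-claimed device, which is effectively the CLI version of my old LUN-by-LUN habit. The `naa.` device ID below is a placeholder, not a real identifier from my environment:

```shell
# Set Round Robin on one existing device, then drop its path-switch
# threshold from the VMware default of 1000 I/Os down to 1.
# naa.514f0c5xxxxxxxxx is a hypothetical device ID; substitute your own.
esxcli storage nmp device set --device naa.514f0c5xxxxxxxxx --psp VMW_PSP_RR
esxcli storage nmp psp roundrobin deviceconfig set \
  --device naa.514f0c5xxxxxxxxx --type iops --iops 1
```

Unlike the SATP rule, this change only affects the one device and doesn't survive the device being reclaimed, which is part of why the rule is the tidier approach.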
Back to my testing, after I configured NMP on the rebuilt test host, I was unable to reproduce the issue. One very fruitless day went down the drain as I searched for the reason why. The next morning (Oct 3), I found/realized/tested it. With IOPS=1000 (and about 30,000 IOPS), my issue came back immediately. Switch to IOPS=1, and those same conditions didn’t faze it. Now for firmware.
VMware, QLogic, whoever were pretty sure that my firmware (4.16.17) was at fault. It was sending or looking for sense data that it wasn’t liking or getting (obviously I’m not an HBA engineer). It was time to change that and find the proof either way.
First, I found out how not to flash the firmware. At least in my setup, the QConvergeConsole in vSphere Client 5.5 is not a safe method for updating firmware. I followed a great blog with clear steps for using QLogic’s vSphere plug-in. Obviously it works for others, but in my case, vSphere timed out after 15 minutes of waiting for the operation to complete, told me to wait another 15 minutes, and then to reboot and carry on. I did, and upon reboot, the NIC portions of both CNAs gave ugly nasty-grams of corrupt firmware. Joy. Ask me another time how fun the Lifecycle Controller was to deal with when it couldn’t read the firmware.
Second, I found out how to safely flash the firmware the good old-fashioned way using a bootable drive (Dell Virtual Console was happy to call it a “Virtual Floppy”, but it was really a USB stick). @InTheDC pointed me to Rufus, which was a real lifesaver, since creating DOS-bootable USB drives from Windows 8.1 is nigh impossible. I even ran into himem.sys and upper-memory-block issues trying to figure it out before Rufus. Talk about nostalgia…
Using Dell’s latest QLogic firmware release (4.16.101), I finally brought the CNAs up to the present (Sept 5, 2014 release date) and returned to ESXi.
With VMs migrated back over and IOmeter pushing ~34K IOPS (a mix of everything with two big chunks of 4K / 25% Read and 64K / 50% Read) from four of those VMs, I deactivated the XtremIO storage controller. Result: connectivity lost. ESXi went on a retry journey to find itself and never came back (just like it did with the old firmware). Then I changed the NMP SATP rule to IOPS=1, and it was resilient. QLogic firmware didn’t affect the test.
The finding here is that the value of “IOPS=????” matters in VMware native multipathing, particularly in our XtremIO environment. The question is whether other storage arrays utilizing the Round Robin policy, especially those that use the VMW_SATP_DEFAULT_AA rule, are similarly affected. SANs aren’t exactly an easily attainable commodity to borrow or test, so fair, multi-vendor assessments are difficult to conduct. If you happen to be swimming in diverse arrays that are also safe to test and possibly crash, please let me know how controller failover tests go when pushing 30K or more IOPS from a single host using both IOPS=1 and IOPS=1000.
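If you do run this test, it's worth confirming which value a host is actually using before and after each failover attempt. Something like the following should show it on ESXi 5.x (the `naa.` ID is again a placeholder):

```shell
# Show the path selection policy and its device config for one LUN;
# look for "policy=iops,iops=1" versus "iops=1000" in the output.
# naa.514f0c5xxxxxxxxx is a hypothetical device ID; substitute your own.
esxcli storage nmp device list --device naa.514f0c5xxxxxxxxx

# Confirm any custom SATP rules that are in place on the host.
esxcli storage nmp satp rule list
```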
I’m looking forward to finding out where the ultimate root cause resides and will add another entry at that time. Until then, IOPS=1 is your friend :).