We began our hands-on exploration of all-flash arrays in September 2013, and for all intents and purposes, the testing has never really concluded. If I had known then what I know now, I would have conducted a number of tests quickly during the official “Proof of Concept” (POC) phases.
All of the tests below are worth doing on the named products, as well as on other similar products that officially support the actions. Some tests particularly target one product’s architecture. Where applicable, I’ll note that. As with any storage array, the first and best test is running real data (day-to-day workloads) atop it. The tests below assume you are already doing that.
1. Capacity: Fill It Up!
This test is most practically focused on Pure Storage and its history and architecture. At the same time, the concept is worth processing with XtremIO.
In 2013 and before, Pure’s array dashboard showed a capacity bar graph that extended from 0% to 100%. At 80%, the array gave a warning that space was low, but failed to indicate the significance of this threshold. The code releases up to that point put an immediate write throttle on processing when the array passed that threshold. In short, everything but reads ground to a halt. This philosophy of what percentage truly is full was reassessed and redefined around the turn of the year to better protect the array and the user experience.
Pure’s architecture still needs a space buffer for its garbage collection (GC), which I believe is guarded by the redefinition of “full”. However, I have heard of at least one user experience where running near full caused performance issues due to GC running out of space (even with the protected buffer). If you’re testing Pure, definitely fill it up with a mix of data (especially non-dedupe friendly data) to see how it goes in the 80’s and 90’s.
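A fill test like this needs data that the array cannot simply dedupe away. The following is a minimal sketch of one way to generate it from a host with a volume mounted from the array. It is not a vendor tool; the function name `write_fill_file`, the chunk size, and the unique-versus-repeated split are all assumptions chosen for illustration.

```python
import os

def write_fill_file(path, size_bytes, unique_fraction=0.7, chunk_size=1024 * 1024):
    """Write size_bytes to path.

    Roughly unique_fraction of the chunks are random bytes (unfriendly to
    dedupe and compression); the rest repeat one fixed chunk (dedupe friendly).
    Returns the number of random bytes written.
    """
    repeated_chunk = os.urandom(chunk_size)  # reused for every "duplicate" chunk
    written = 0
    unique_written = 0
    with open(path, "wb") as f:
        while written < size_bytes:
            n = min(chunk_size, size_bytes - written)
            # Emit a random chunk whenever the unique share has fallen below target.
            if unique_written < unique_fraction * (written + n):
                f.write(os.urandom(n))
                unique_written += n
            else:
                f.write(repeated_chunk[:n])
            written += n
    return unique_written
```

Writing a series of these files until the dashboard reaches the 80s and 90s, while real workloads keep running, gives a repeatable way to watch latency as the array approaches full. Raising `unique_fraction` toward 1.0 makes the fill progressively harder on data reduction.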
For XtremIO, it’s a conceptual consideration. I haven’t filled up our array, but it doesn’t do anything that requires unprotected buffer space, so the risk isn’t particularly notable (feel free to still try!). The thing here is to think about what comes next when it does get full. The product road map is supposed to support hot-expansion, but today it requires swinging data between bricks (i.e. copy from an array of 1 x-brick to 2 x-bricks, 2 x-bricks to 4 x-bricks, etc).
2. Diversify & Observe: Block Sizes
Pure and XtremIO use different block sizes for deduplication and process those block sizes differently as well. Services and applications similarly use different block sizes when writing down to arrays. Microsoft Exchange favors 32KB blocks, while SQL Server tends toward 64KB blocks. Down the line, backup applications and jobs often use blocks ranging from 256KB to 512KB. OS and miscellaneous writes stay on the smaller end around 4KB (or less).
Since Pure takes a bigger block size and then looks for duplicate patterns of various lengths, larger blocks like backup jobs have the potential to raise latency. It’s simple physics, as I mentioned in the previous post: finding matching cards in 100 decks takes longer than finding them in 2 decks (take the analogy for what it’s worth). Your environment may not create any issues for a Pure array, and Pure arrays, code, and hardware may have moved beyond that by now, but test and verify.
XtremIO uses a fixed block size, so bigger blocks don’t affect how its deduplication processes data. Everything is chopped down to 4KB (pre-3.0) or 8KB (3.0+) blocks. The thing to observe here is how deduplication and compression work. With the same data on both arrays (Pure & XtremIO), which provides the better data reduction? What are the trade-offs, if any, for that advantage?
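To get a feel for how a fixed block size affects deduplication before loading an array, you can estimate it on the host. The sketch below chops a byte string into fixed-size blocks, hashes each one, and reports logical blocks divided by unique blocks. It only illustrates the fixed-block idea; it is not how XtremIO or Pure actually implement deduplication, it ignores compression entirely, and the function name is hypothetical.

```python
import hashlib

def fixed_block_dedupe_ratio(data: bytes, block_size: int = 8192) -> float:
    """Return (number of blocks) / (number of unique blocks) for fixed-size chunking."""
    if not data:
        return 1.0
    unique_hashes = set()
    total_blocks = 0
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        unique_hashes.add(hashlib.sha256(block).digest())
        total_blocks += 1
    return total_blocks / len(unique_hashes)

# Sample data: three identical 8KB regions plus one 8KB region whose two
# 4KB halves are identical to each other.
sample = b"A" * 8192 * 3 + bytes(range(256)) * 32
print(fixed_block_dedupe_ratio(sample, 8192))  # 4 blocks, 2 unique -> 2.0
print(fixed_block_dedupe_ratio(sample, 4096))  # 8 blocks, 2 unique -> 4.0
```

The same data yields different ratios at 4KB and 8KB, which is one reason to compare reduction numbers on your own data rather than on a vendor's reference figures.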
3. Patch & Reboot: High Availability
My experiences with array software updates have almost always involved the words “non-disruptive”. In fact, since 2006 and our first EMC CLARiiON CX300, I can’t recall an update that required downtime. Sure, they recommended it and things were slower during updates, due to write-cache disabling, but one storage controller/processor was always online and serving data. Furthermore, in the storage array realm, “high availability” is pretty much a given. As the saying goes, though, “trust but verify”.
When you get your POC arrays, I’d recommend making sure that you can go through a software update during your evaluation. If the vendor doesn’t have one releasing during your POC, ask to have the POC unit loaded with the previous, minor revision of the code/software. Then, with your data fully loaded on it, schedule a time to perform that Non-Disruptive Update (NDU). This also provides the benefit of testing out the technical support experience with Pure and EMC Support (or any vendor).
Pure probably has an equivalent to this command, but you can also perform additional fail-over testing of XtremIO arrays by logging into the XMS CLI and running the following commands to see how an HA event is handled:
- Open two SSH sessions to the XMS
- In one session, run the following command. It repeats every 15 seconds. Open the XMS GUI to see more real-time data at the array level.
- Observe/verify that traffic is flowing down all initiators evenly
- In the second session, run the following command. Note that this will take a controller out of service (and may affect performance or availability).
- Watch the first SSH session and the GUI for the effects of the fail over (recommend waiting five minutes at least before re-activating)
- In the second session, run the following command to reactivate the controller:
- Observe/verify that traffic returns to an even flow across all initiators
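“Evenly” is easier to judge with a number attached. As a minimal sketch, assuming you have noted per-initiator IOPS from the CLI or GUI by whatever means, the helper below reports the largest deviation from the mean as a fraction of the mean. The function name, the initiator names, and the sample figures are hypothetical and exist only for illustration.

```python
def initiator_imbalance(iops_by_initiator: dict) -> float:
    """Return the largest deviation from mean IOPS, as a fraction of the mean.

    0.0 means perfectly even traffic; 0.5 means one initiator is 50% above
    or below the average.
    """
    values = list(iops_by_initiator.values())
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    return max(abs(v - mean) for v in values) / mean

# Hypothetical readings taken before and during a controller fail-over.
before = {"init-1": 5000, "init-2": 5000, "init-3": 5000, "init-4": 5000}
during = {"init-1": 7500, "init-2": 2500}
print(initiator_imbalance(before))  # 0.0
print(initiator_imbalance(during))  # 0.5
```

Recording this value before the fail-over, during it, and after reactivation gives you something concrete to compare between arrays and to hand to support if traffic does not rebalance.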
If real-world data on your array doesn’t generate at least 10,000 to 20,000 IOPS, I recommend running IOmeter on a few array-connected servers to create additional load. Four VMs/servers running IOmeter with the following characteristics provided roughly 34,000 IOPS in my experiments.
- Fully random I/O
- Two disks checked per VM (in different datastores; mostly just to see how IOPS patterns affected different volumes)
- Four outstanding I/Os
- Access Specification on VM 1: All-In-One
- Access Specification on VM 2: All-In-One
- Access Specification on VM 3: 4K / 25% Read (OS simulation, heavy writes)
- Access Specification on VM 4: 64K / 50% Read (SQL simulation)
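The settings above also let you sanity-check the latency you should expect, using Little's Law (I/Os in flight = IOPS × average latency). The sketch below assumes the four outstanding I/Os apply per disk and that each VM runs a single IOmeter worker, which gives 4 VMs × 2 disks × 4 = 32 I/Os in flight; both are assumptions about the setup, not something stated above. The second helper converts IOPS and block size to bandwidth, and the 10,000 IOPS figure fed to it is a hypothetical example.

```python
def avg_latency_ms(outstanding_ios: int, iops: float) -> float:
    """Little's Law: average latency = I/Os in flight / IOPS (converted to ms)."""
    return outstanding_ios / iops * 1000.0

def throughput_mib_s(iops: float, block_size_kib: float) -> float:
    """Bandwidth in MiB/s implied by a given IOPS rate and block size."""
    return iops * block_size_kib / 1024.0

in_flight = 4 * 2 * 4  # VMs x disks per VM x outstanding I/Os per disk (assumed)
print(round(avg_latency_ms(in_flight, 34000), 2))  # 0.94
print(throughput_mib_s(10000, 64))  # hypothetical 10,000 IOPS at 64KB -> 625.0
print(throughput_mib_s(10000, 4))   # same IOPS at 4KB -> 39.0625
```

If the latency the array reports is far above this estimate, the extra time is probably being spent somewhere other than the array (host queues, fabric), which is worth knowing before blaming either product. The bandwidth helper shows why large-block workloads can saturate links at IOPS rates that look modest.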
4. Other Stuff: It Depends
This last part entirely depends on your environment and how you intend to use a new all-flash array. If you are fully virtualized like we are, look at the best practices, recommendations, and supported features. Compare your backup solution and architecture with array support. Do you need things like transportable snapshots for Veeam Backup & Replication, for example? If you use snapshots, how do you create, export, and delete them? Make sure any APIs that you use (or want to use) are supported.
At the end of the day, every environment and every use case is different. Relationships also matter, so your account team and VAR may sway your feelings toward, or away from, a given product. If all of the above tests go smoothly, smaller things like the UI and implementation process may make or break it. Or if you find the chinks in both products’ armor, support may be the winning vote.
Either way, near the end of your evaluation, take some time to step back and write down the results and the pros and cons of each product tested. Chances are you’ll find what matters to your organization on the page when you do.
Appendix 1: Latency, Block Size & Bandwidth
I’m adding the below screenshot from our POC to provide an example of #2 above and for Pure to help interpret for customers who might see such behavior. Thanks for the comment, Mike!