Field Review: XtremIO Gen2

Several months ago I walked through some of the issues we faced when XtremIO hit the floor and found it not to be exactly what the marketing collateral might present. While the product was very much a 1.0 (in spite of its Gen2 name), EMC Support gave a full-court-press response to the issues, and our account team delivered on additional product. Now it’s 100% production and we live/die by its field performance. So how’s it doing?

For an organized rundown, I’ll hit the high points of Justin Warren’s Storage Field Day 5 (SFD5) review and append a few of my own notes.

  • Scale-Out vs. Scale-Up: The Impact of Sharing
  • Compression: Needed & Coming
  • Snapshots & Replication
  • XtremIO > Alternatives? It Depends

 

Scale-Out vs. Scale-Up: The Impact of Sharing

True to Justin’s review, XtremIO scales up in practice; anything beyond that is disruptive. EMC Support does their best to make up for this by readily offering swing hardware, but it’s still an impact. Storage vMotion works for us, but I’m sure spare hardware isn’t the panacea for everyone, especially those with physical servers.

The impact of sharing is key as well. XtremIO sharing everything can mean more than just the good stuff. In April, ours “shared” a panic over the InfiniBand connection when EMC replaced a storage controller to address one bad FC port. I believe they’ve fixed that issue (or at least widely publicized to their staff how to swap an SC without triggering a panic, until code can protect against it), but it was production-down for us. Thankfully we were only one foot in, so our key systems kept going on other storage. We seem to have found the InfiniBand edge cases, so I do not think this is a cause for widespread worry. ‘Just stating the facts.

I could elaborate further, but choosing XtremIO means being prepared to swing your data for disruptive activities. If you expect the need to expand, plan for that: rack space, power, connections, etc. for the swing hardware, or whatever other method you choose.

 

Compression: Needed & Coming

[Figure: XtremIO deduplication statistics]

This was the deficit that led to us needing four times the XtremIO capacity to match our Pure POC’s results. At the time, we thought Pure achieved a “deduplication” ratio of 4.5 to 1 and were sorely disappointed when XtremIO didn’t. Then we realized it was data “reduction,” which incorporated both compression and deduplication. Pure’s dedupe is likely still more efficient since it uses variable block sizes (like EMC Avamar), but variable block sizes take time and post-processing.

When compression comes in the XIOS 3.0 release later this year, I hope to see our data reduction ratio converge with what we saw on Pure. As it stands, we fluctuate around 1.4 to 1 deduplication (which feels like the wrong word–dedupe seems to imply a minimum of 2:1). I choose to ignore the “Overall Efficiency” ratio at the top, as it is a combination of dedupe and thin provisioning savings, the latter of which nearly everyone has. We’ve thin provisioned for nearly 6 years with our outgoing 3PAR, so that wasn’t a selling point; it was an assumption. As a last note on this, Pure Storage asks the pertinent question: “The new release will come with an upgrade to compression for current customers. Can I enable it non-disruptively, or do I have to migrate all my data off and start over?”
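To show why I discount that headline figure: here’s a minimal sketch of how independent savings ratios compound into a rosy “Overall Efficiency” number. The assumption that the array simply multiplies the ratios is mine, and the thin-provisioning value is illustrative, not from our array.

```python
def overall_efficiency(dedupe_ratio: float, thin_ratio: float) -> float:
    """Combine independent space-saving ratios; assumes they simply multiply."""
    return dedupe_ratio * thin_ratio

# Our ~1.4:1 dedupe, multiplied by an illustrative 3:1 thin-provisioning
# savings (which almost any modern array would also report), produces a
# headline number far better-looking than the dedupe result alone.
print(overall_efficiency(1.4, 3.0))  # ~4.2:1 "overall efficiency"
```

The dedupe ratio is the only number in that product our new array actually earned; the rest came for free with thin provisioning.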

 

Snapshots & Replication

I won’t say much on these items, because we haven’t historically used the first, and other factors have hindered the second. Given that our first EMC CX300 array even had snapshots, the feature’s arrival in 2.4 was more of an announcement that XtremIO had fully shown up to the starting line of the SAN race (it was competing extremely well in other areas, but the lag here was hard to understand). We may actually use this feature with Veeam’s Backup & Replication product, as it offers the ability to take array-level snapshots and transfer them to a backup proxy for offloaded processing.

As for replication, my colleagues and I see it as a feature with huge differentiating potential, at least where deduplication ratios are high. VDI or other clone-based deployments with 5:1, 7:1, or even higher ratios could benefit greatly if only unique data blocks were shipped to partnering array(s). For now, VPLEX is the answer (sans the dedupe).

 

XtremIO > Alternatives? It Depends

As I mentioned in the past, we started this flash journey with a Pure Storage POC. It wasn’t without challenges, or I probably wouldn’t be writing about XtremIO now, but those issues weren’t necessarily as objectively bad or unique to them as I felt at the time. Everyone has caveats and weaknesses. In our case, Pure’s issues with handling large block I/O gave us pause and cause to listen to EMC’s XtremIO claims.

Those claims held up in some ways, but not in others (at least not without more hardware). Both products can make the I/O meters scream with numbers unlikely to be seen in daily production, though it’s nice to see the potential. The rubber meets the road when your data is on their box and you see what it does as a result. No assessment tool can tell you that; only field experience can.

If unwavering low-latency metrics are the goal, XtremIO wins the prize. It doesn’t compromise or slow up for anything–the data flies in and out regardless of block size or volume. Is no-compromise ideal? It depends.

Deduplication is the magic sauce that turned us on to Pure, and XtremIO marketing said, “we can do that, too!” Without compromising speed, though, and without post-processing, the result isn’t the same. That’s the point of the compression mentioned earlier.

[Image: XtremIO vs. Pure tweet]

Then there are the availability arguments. Pure doesn’t have any backup batteries (but it protects in-flight writes with NVRAM, so that’s not a deal-breaker), which EMC can point out. EMC uses 23+2 RAID/parity, which Pure is quick to highlight as a weakness. Everyone wants to be able to fail four drives and keep flying, right?

From what I’ve heard, Hitachi will take an entirely different angle and argue that magic is unnecessary. Just use their 1.6TB and 3.2TB flash drives and swim in the ocean of space. Personally, I think that’s short-sighted, but they’re welcome to that opinion.

 

Last Thoughts

In production, day to day, notwithstanding our noted glitches, XtremIO delivers. Furthermore, it has the heft of EMC behind it, and the vibe I get is that they aren’t content with second place. Vendors may disagree on sub-component philosophies, but nothing trips up XtremIO’s performance. Is there potential for improvement, efficiencies (esp. data reduction), and even hybrid considerations (why not a little optional post-processing?)? Absolutely. And I’ve met the XtremIO engineers from Israel who aim to do just that. Time will tell.

[Figure: XtremIO latency chart]

15 Comments

  1. […] Gurley wrote up his experience with using XtremIO in a comprehensive and fair way. It looks like the impressions I got from SFD5 bear out in […]

    June 13, 2014
  2. P. KIMMEL said:

    IBM FlashSystem not mentioned here amongst the alternatives – Gartner declared them the current No. 1 in market share for flash arrays.
    Pure 2nd, EMC 4th.

    June 23, 2014
  3. Chris said:

    Thanks for the feedback. However, unless I’m missing something in the IBM FlashSystem data sheet, I don’t see any deduplication technology, which is the differentiating factor in Pure and XtremIO offerings.

    Flash of any kind is great, but what most consumers of XtremIO and Pure are seeking is intelligence and innovation that will slow or stop the unending cycle of adding raw disks. If IBM has taken that step, please feel free to share.

    June 23, 2014
    • Chris, I just ran across your excellent review, and I am curious what kind of workloads you are running on your XtremIO array. You mentioned VMware ESXi, but what are the actual apps? Doesn’t look like VDI from your data reduction ratios.

      I’m an IBMer – from the TMS acquisition, which became the FlashSystem team. It is true that we don’t have dedupe baked into our FlashSystem products (today at least). But we have very strong real-time compression that for many workloads will give you a better result than dedupe. It looks like your workload is not very dedupe-friendly. Have you ever run our Comprestimator tool?

      http://ibm.biz/Comprestimator

      September 19, 2014
      • Chris said:

        Erik,

        Thanks for the feedback/follow-up. Our workload on XtremIO is primarily MSSQL databases (as vSphere VMs), so you are correct that dedupe isn’t that effective. That’s where Pure’s compression made a difference, and where we’ll see if XtremIO’s does the same.

        I have not run IBM’s Comprestimator tool (or even heard of it until now; knowledge gained :), but I’ll take a look.

        As I understand it, Comprestimator requires an install (vs a stand-alone binary/executable) to run on ESXi or Windows. It also appears to be slightly dated (absence of latest OS versions in the support list). Any chance there’s an updated revision coming? I’d be interested in the data, but am protective of the kernels of all of my production systems.

        Thanks again!

        September 19, 2014
        • Comprestimator is actually a standalone executable with no install required – you just need to make sure you have the right permissions to run it. The .exe you download is actually an archive with a setup wizard that extracts the real binaries for various platforms, along with a README, etc.

          Thanks for bringing the support list to my attention. We regularly update that tool; e.g. Windows 8 and FlashSystem V840 were added recently. I think our documentation team may have lagged a bit from the most recent updates. I’m assuming you’re after ESXi 5.5 and Windows Server 2012 R2? Feel free to shoot me a note separately and I’ll confirm support with the development team.

          September 19, 2014
          • Chris said:

            Thanks; yeah, the “install” notes seemed to imply hooking in, etc. And yes, 5.5 and 2012 R2 (as well as Ubuntu 14) were the absent members that made it look outdated. ‘Appreciate it.

            September 19, 2014
  4. Shahar said:

    Chris, I really loved your review. It is probably very fair, as are most of your criticisms. At the end of the day I hear “In production, day to day, …, XtremIO delivers”. That’s the only thing that matters for me. In fact it’s amazing once you remember it is a first GA version. Even though I am not part of the XIO team anymore, I know for sure that most issues and missing features are already handled. Some missing features are the result of a very careful launch that sacrifices features for stability (I admire that). You are right that persistent performance is a key feature of the platform. In fact it is part of a broader philosophy:

    – The customer should not manage performance (hot spots, loads, tiering, caching, etc.).
    – The performance should be as stable as it can be across block sizes, load levels, capacity utilization levels, etc.
    – Everything should just work, without tuning and without weird behaviors.

    That’s exactly why XIO doesn’t do garbage collection or post-processing, or implement any other mechanism that can introduce variance into the system. No – that doesn’t mean you cannot do scale-out rebalancing, snapshots, replication, and compression. I am sure you will be surprised by the rate at which XIO fills the gaps.

    I truly believe that XIO’s design and architecture are superior to any other existing block array design, in the sense that it offloads the biggest part of the storage administration burden from the customer (no, I am not objective). Yes, there are glitches and issues to fix and some features to complete, but at the end of the day, XIO should be the most *trusted* platform – the one that lets the admin sleep better.

    Shahar

    July 1, 2014
  5. Chris said:

    Thanks for the feedback, Shahar. We’ve unfortunately had more issues since this review was posted, so the “XtremIO delivers” statement has a big asterisk next to it. To date, maintenance events mean unexpected downtime. We’ll see what the RCA reveals this time.

    July 1, 2014
  6. Hi,

    Can you provide the RCA or at least describe where you stand today? A friend of mine works for a company that is considering a 4-brick XtremIO, so your hands-on experience is really valuable.

    July 24, 2014
  7. Chris said:

    The RCA was indeterminate and could only raise slightly dated host HBA drivers as potential culprits. Our current course of action involves running some pathing failover tests during a weekend maintenance window, and then applying a minor XIOS update (2.4.1) in another window. We intend to update drivers on some (not all) hosts first in order to test whether hosts with the current driver version behave differently than those with a later version.

    I wish I had a more conclusive answer for you, but that’s where we stand. We’ve been through numerous storage array updates on our HP 3PAR array over the past years, and we haven’t seen any path or volume access issues like this, despite repeated, during-business-hours reboots of the 3PAR controller nodes. Perhaps we’re a fluke, but it would seem that our environment, config or state exposes a weakness in XtremIO. We’ll see. ETA on testing is the following 2-3 weekends (yay for night work…).

    July 24, 2014
  8. Chris said:

    For those tracking here, we still don’t have a resolution on the disruptive-upgrade issue, except that we solidly crashed the array again in early August during a reproduction attempt to gather more logs. Of course VMware/EMC didn’t get all that they needed, so we’re at a stalemate, presently waiting on someone to stand up a lab so we don’t keep bearing the brunt of troubleshooting. ESXi 5.5, QLogic 8262 CNAs, and XtremIO: not a winning recipe.

    August 29, 2014
  9. Muhammad said:

    Hi Chris, thanks for sharing your experience. We are in much the same situation as you were during your POC for SQL storage. We are seriously considering buying XtremIO (with the 3.0 code that has compression as well) over Pure. Were you able to fix the issues you had with XtremIO, and can you provide any more feedback on it? Thanks!

    October 21, 2014
    • Chris said:

      Hi Muhammad. First, I’d point you to my recent “Doing It Again” posts that highlight more thoughts on the XtremIO/Pure consideration. Have you been able to do a POC with your SQL data on XtremIO with 3.0? I’d be very interested in your data reduction ratio.

      Second, check out my post below on multipathing: http://www.thegurleyman.com/iops-matter-vmware-native-multipathing-rule-attribute-affects-storage-failover/

      I can’t say that we/EMC/VMware have “fixed” the issue that began in June, but setting IOPS=1 in the native multipathing rule does appear to have avoided it. Definitely create a custom SATP rule on your ESXi hosts to use Round Robin and cover that, and you should be safe.
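      In case it helps anyone reading along, the claim rule looks roughly like this (the vendor/model strings and option syntax here are from memory, so verify them against EMC’s host configuration guide for your XIOS/ESXi versions before applying):

```shell
# Claim new XtremIO volumes with Round Robin and switch paths after
# every I/O (iops=1). Run once per ESXi host; confirm the vendor/model
# values against what your array actually reports before relying on this.
esxcli storage nmp satp rule add \
  --satp VMW_SATP_DEFAULT_AA \
  --psp VMW_PSP_RR \
  --psp-option "iops=1" \
  --vendor XtremIO \
  --model XtremApp \
  --description "XtremIO Round Robin, IOPS=1"
```

      Note the rule only applies to volumes claimed after it is added; existing devices need their path selection policy set separately (or a reclaim/reboot).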

      If you have time left in your POC (or if you haven’t started it yet), I’d consider the things I’d do again in a POC and run them against the array to make sure it measures up as you’d want. Let me know how it goes!

      October 21, 2014
      • Muhammad said:

        Thanks, Chris, for the reply. We saw more data reduction on Pure than on XtremIO. With Pure we saw around 3.5:1 for SQL data (some of it was already compressed in SQL). On XtremIO we saw around 1.7:1. The XtremIO stats were taken by running their dedupe and compression measurement tool. We were not able to get the right value from the array itself, as XtremIO doesn’t give dedupe & compression stats per volume. We are hoping to get an exact value by deleting all other data from the XtremIO and placing only SQL data on the array.

        October 22, 2014
