Is The Data Efficiency Metric A Bunch Of Hot Air?

Since the advent of thin provisioning, the concept of “data efficiency” has been used to describe how storage arrays deliver on large capacity requests while only reserving what’s actually occupied by data. For me, 3PAR (pre-HP) pioneered this with their thin provisioning and “Zero Detect” feature combo, which they like to deem a sort of deduplication 1.0 (they deduped zeroes in-stream).

With the wider implementation of deduplication and compression, the “data efficiency” term has stepped (or been pushed) into marketing spotlights around the industry. HP 3PAR still promotes it, EMC XtremIO positions it at the top of their array metrics, and Pure Storage has it at the top-left of their capacity bar.

Is it bad to calculate and show? No. It’s a real statistic after all. Does it have any power or intrinsic value, though? No.

Thin provisioning is like Fibre Channel or iSCSI connectivity in enterprise arrays. An array that doesn’t include it doesn’t rate the classification of “modern enterprise storage” in my book. Do most people consider air conditioning a perk when shopping for a car? I doubt it–it’s a given today.

Beyond the fact that it should be a given, the real issue I have with “Data Efficiency” is the inappropriate and near-fraudulent use of it in sales & marketing communication.

The above tweet has a few other issues as well, but for the topic at hand, the efficiency stat of 418 to 1 is the focus. It’s incredible at face value! Who wouldn’t want an XtremIO!?

It’s a completely fabricated number, however. For testing and whatever else is going on with that array, the admin has provisioned 1.1 petabytes on top of 15 terabytes of real storage. Even with magic pixie dust, no storage vendor can achieve that squeeze job (or anything reasonably close for mainstream over-provisioning).

The number that matters is “Data Reduction”. It has a hard backing of deduplication + compression and reflects how much more data you are fitting on the array as a result of software capabilities. That said, any number can be inflated with synthetic activities like cloning, etc (as is also occurring in the above image).
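To make the distinction concrete, here’s a rough sketch of how the two ratios are typically derived. The numbers below are made up, and since there’s no industry standard (as the comments below point out), every vendor’s math will differ somewhat–but the shape of it is the same: over-subscription inflates “efficiency”, while “reduction” stays tied to what the hosts actually wrote.

```python
# Hypothetical numbers, all in TB. There is no vendor-standard formula;
# these are the common interpretations of the two ratios.

def data_reduction(logical_written: float, physical_used: float) -> float:
    """Dedupe + compression: host-written data per unit of physical storage consumed."""
    return logical_written / physical_used

def data_efficiency(provisioned: float, physical_used: float) -> float:
    """'Efficiency': provisioned capacity over physical usage. Over-subscription inflates this."""
    return provisioned / physical_used

provisioned = 1_100.0  # 1.1 PB of volumes carved out by the admin
written = 12.0         # what the hosts have actually written (hypothetical)
stored = 4.0           # what lands on flash after dedupe/compression (hypothetical)

print(f"Data reduction:  {data_reduction(written, stored):.0f}:1")      # 3:1   -- the software's doing
print(f"Data efficiency: {data_efficiency(provisioned, stored):.0f}:1")  # 275:1 -- the admin's provisioning style
```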

I’m a big proponent of integrity in the workplace and sales cycle, which I realize is idealistic when so much money is on the line with these vendors. Still, they are disgracing the worthy products they represent by using these tactics to seal deals. Let the products stand on their own two feet!

The final takeaway of all these words is that “data efficiency” is more of a reflection on a storage admin’s provisioning style than it is on the storage array’s performance.

If I only provision what I will use, my efficiency (minus dedupe+compression) will be 1. I’m not over-subscribing and I’m using it all–it’s like paying for everything with cash.

If I create massive volumes to leave the OS or hypervisor lots of overhead, then my efficiency will soar–it’s like paying with high-limit credit cards. The limit is sky-high, but trying to use it all would bankrupt your array (and resume–most arrays don’t take kindly to being completely full, which means downtime). At the end of the day, my number will be different than yours, and that’s all it is–different.

So leave “efficiency” out of your POC conversation. Keep it to what’s objective. Reduction is the name of the game. How’s yours?

7 Comments

  1. Vaughn said:

    Chris, I love where you’re going with this talk track, but even data reduction is not agreed upon across the industry. Some vendors consider zero removal as deduplication, where others look at it as a core function in support of thin provisioning.

    I’m beginning to think the only value that truly matters is the amount of storage capacity consumed to store the data.

    As always, thanks for sharing your views.

    October 2, 2015
    • Chris said:

      Vaughn, you’re right on. Vendors aren’t speaking the same language when they measure themselves and each other. To compare it to measurement systems, it’s like metric vs. imperial: certain units seem nearly interchangeable (meters & yards), but the difference gets magnified at scale (kilometers & miles). Even a single kilometer is vastly different from a mile, never mind a thousand of them.

      I resonate with Richard Arnold’s wish for a standard to exist for the terminology. It probably wouldn’t favor vendors who currently choose opportunistic definitions, but it would respect the customers who want an honest comparison.

      On the comment about zeroes, I’m for equality in the binary community: 0s and 1s should be counted equally and factored into deduplication as such. That said, I can see the dilemma of factoring eager-zeroed file systems into the equation. I’ve always considered all zeroes after the last 1 to be the definition of “thin” efficiency.

      Thanks for the feedback.

      October 2, 2015
  2. shahar said:

    Chris, I am not sure what you are saying. Over-subscription is “bad” but compression and dedupe are OK? Even if this is true, snapshots/clones are almost always thin provisioned unless they are not space efficient. If you want to say that space management is tricky, complex, and even dangerous, I totally agree. In fact, the entire idea of space reporting and data reduction ratios becomes undefined under many circumstances. For example, on some modern arrays it may be practically impossible to calculate how much space you will reclaim if you remove a volume. This pushes the admin to manage space using only simple heuristics. Sadly, this is the way most systems are going (not by choice).

    Shahar

    October 4, 2015
  3. Chris said:

    Shahar, I apologize if I was confusing or unclear–it appears you picked up a different message than I intended. Let me break it down below.

    Regarding your over-subscription question, I didn’t mean that at all. As I said in the post, “data efficiency [read: ‘over-subscription’] is more of a reflection on a storage admin’s provisioning style than it is on the storage array’s performance”. And in another part, I said it wasn’t a “good or bad” choice, just “different”. I oversubscribe and dedupe/compress, and I appreciate both.

    I wouldn’t go so far as to call space management “tricky, complex or even dangerous” any more than I would call systems or network management that. With the right information and training/practice, it can be quite predictable, albeit always with some risk (what in IT /doesn’t/ have risk?).

    My objective here was to challenge a measurement (data efficiency) that often gets abused in product marketing and sales. It has no industry standard or independent definition, and it says nothing about the intrinsic value or virtue of a given product, yet sales teams will use it to elevate one brand and slight another.

    As for your closing statement about the way most systems are going, I think this is very similar to the move from physical servers to virtualization. Memory mapping, over-subscribed CPUs, and shared datastores bore many of the same challenges for sysadmins. But we learned and continue learning, and at the end of the day, we’re better off for it.

    Thanks for your feedback!
    –Chris

    October 4, 2015
  4. Johnmh said:

    All of these measurements are taken from the array’s perspective and use relatively simple but generally inconsistent (across the industry) formulas to arrive at a savings ratio. It’s very easy to skew the numbers either way, and one person’s dedupe is another’s zero detect or pattern matching. Forget about the reported numbers; the only real measurement is how much the host thinks it has written versus how much space has been consumed on the backend storage. It doesn’t really matter how it was achieved–thin, dedupe, compression, zero detect, etc.

    October 5, 2015
    • Allen said:

      This is a great statement. We ran across the post about how Chris would do a PoC as we were doing our PoC with XIO and Pure. This was one of the big determining factors for us: putting our actual data on both arrays and seeing how much space it consumed.

      February 2, 2016
  5. Johnmh said:

    One more thought: assuming similar performance, features, and density, it doesn’t really matter what capacity you saved unless you can quantify how much it really cost to store (per GB on the backend). E.g., very high space savings may also incur very high upfront costs, or vice versa; ideally you need to balance the two.

    October 7, 2015
