Rubrik Update: Scale & Cloud

With Virtualization Field Day 5 (VFD5) coming up this week, it seems like an appropriate time for an update on Rubrik in action. For a refresher on what Rubrik is, check out Mike Preston’s “#VFD5 Preview – Rubrik”. I’ll be using some of what he shared as launching points for elaboration and on-the-ground validation.

Share Nothing – Do Everything

I believe that this shared-nothing design is both the most important and likely the most overlooked characteristic of Rubrik and its architecture. It is crucial because it defines how users manage the solution, build redundancy in and around it, and assess performance needs. I also believe it is overlooked because it is like the foundation of a great building: most of it is under the surface and behind the scenes, enabling prominent edifices like Time Machine-like simplicity.

One way that I can describe it is “multi-master management and operations”, though it falls short because Rubrik has no slaves. Every node is the master. Some data protection solutions have redundant storage nodes which all depend on a single control node. If issues arise with control, the plethora of storage behind it is helpless except to sit and maintain integrity. With Rubrik, all nodes have command authority to manage, support, and execute across the infrastructure.

A few weeks ago, one of our brik’s health checks took a node out of active status. When that happened, the rest of the nodes kept taking snapshots and protecting the infrastructure. Furthermore, I didn’t lose management access. I just pointed Chrome at a different node, logged in, and initiated a tunnel for support to check it out. Within a few minutes, the node was back online. The culprit was merely an overly sensitive check parameter, which engineering tuned in the next update.

Parallel teamwork is the grunt side of Rubrik’s “do everything” design. Whether you have a 3-node R330 or several 4-node R340 appliances, participating nodes automatically shoulder the data burden (snapshot, live mount, restore, and archive) and scale linearly. More nodes mean more concurrent snapshots, more live VM mounts, more management endpoint redundancy, and of course, more IOPS for everything.

Cloud Archive

When I last posted, the cloud archiving feature was still being vetted internally before unleashing it to the hordes (including me). Today, we’re about two weeks into archiving and have a combined 19TB of snapshots up in Amazon S3 buckets. It should have been harder to implement a hybrid data protection solution–how else are admins supposed to look smart? But it wasn’t. Rubrik’s cloud archive is as simple as a slider.

Most of my policies retain 30 days of VM snapshots, while practical usage of the entire VM (for live mount or VMDK restore) is 10 days or less. Consequently, I changed the policy from keeping all 30 days local to keeping only 10 days local, and Rubrik went to work moving the other 20 days up to the cloud.
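To make the arithmetic of that slider concrete, here is a minimal Python sketch of how a retention window divides between local and archive-only storage. It is only an illustration of the idea, not Rubrik's implementation or API: the names `RetentionSplit`, `split_retention`, and `location_for` are hypothetical, and the exact boundary handling (whether day 10 counts as local) is my assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetentionSplit:
    """Hypothetical model of a policy: days kept locally vs. only in the archive."""
    local_days: int
    archive_days: int


def split_retention(total_days: int, local_days: int) -> RetentionSplit:
    """Split a total retention window into local and archive-only portions."""
    if total_days < 0 or local_days < 0:
        raise ValueError("retention values must be non-negative")
    # Local retention can never exceed the total retention window.
    local = min(local_days, total_days)
    return RetentionSplit(local_days=local, archive_days=total_days - local)


def location_for(age_days: int, split: RetentionSplit) -> str:
    """Return where a snapshot of the given age would live under this model."""
    if age_days < 0:
        raise ValueError("age must be non-negative")
    if age_days < split.local_days:
        return "local"
    if age_days < split.local_days + split.archive_days:
        return "archive"
    return "expired"


# The policy described above: 30 days total, 10 of them kept locally.
policy = split_retention(total_days=30, local_days=10)
print(policy)
print(location_for(3, policy), location_for(15, policy), location_for(45, policy))
```

Under these assumptions, the 30-day policy with a 10-day local setting leaves 20 days of snapshots that exist only in the archive, which matches the 20 days Rubrik moved up.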

True confession: setting up the S3 buckets was the hardest part for me. I hadn’t used AWS storage yet and the combo of buckets, Identity and Access Management (IAM), policies, and access and encryption keys was initially puzzling. Then the Rubrik team sent over a guide with the steps to meet their needs, and the pieces came together. A word of caution to all those AWS newbies like me: pay special attention to the region where you create your encryption keys. They can never be deleted and they always reserve the name you give them (especially that perfect one…in the wrong region). Your key and bucket regions need to match.

Tech Tip

The principle behind Rubrik’s “Gold” SLA is that frequent snapshots lead to smaller deltas and thus faster snapshots. Accordingly, a member of the default “Gold” SLA will have backups taken every 4 hours, which is amazing for Recovery Point Objectives (RPOs), not to mention the instantaneous Recovery Time Objective (RTO) provided through Rubrik’s performant Live Mounts.

In the majority of cases, this principle is excellent to follow and delivers as expected. However, certain cases argue against it because of high data change rates. For example, I have certain SQL database servers that crunch through a lot of data, chew up tempdb, and do recurring manipulations of data through the day. In a 24-hour period, I can see large portions of this data entirely rewritten several times. What this means is that taking backups at 4-hour intervals captures many of those rewrites as separate deltas, so the space used by snapshots of these VMs grows several times faster than it would with daily backups.
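A rough back-of-the-envelope model shows why the snapshot interval matters so much for this kind of workload. The Python sketch below is my own simplification, not a description of how Rubrik stores data: it assumes a "hot" region that is rewritten sequentially at a steady rate, that each snapshot stores every block changed since the previous one, and that a block rewritten several times within one interval is stored only once. It ignores deduplication and compression, and the 200 GB and 3-rewrites-per-day figures are made-up inputs.

```python
def daily_snapshot_growth(hot_gb: float, rewrites_per_day: float,
                          interval_hours: float) -> float:
    """Estimate GB of new snapshot data per day under the simplified model."""
    if not 0 < interval_hours <= 24:
        raise ValueError("interval_hours must be in the range (0, 24]")
    if hot_gb < 0 or rewrites_per_day < 0:
        raise ValueError("hot_gb and rewrites_per_day must be non-negative")
    snapshots_per_day = 24 / interval_hours
    # Total data written between two consecutive snapshots.
    written_per_interval = hot_gb * rewrites_per_day / snapshots_per_day
    # A delta cannot be larger than the region being rewritten: blocks
    # overwritten more than once in an interval are captured only once.
    delta_gb = min(hot_gb, written_per_interval)
    return delta_gb * snapshots_per_day


# Hypothetical workload: a 200 GB hot region rewritten 3 times per day.
every_4_hours = daily_snapshot_growth(200, 3, interval_hours=4)
every_24_hours = daily_snapshot_growth(200, 3, interval_hours=24)
print(every_4_hours, every_24_hours, every_4_hours / every_24_hours)
```

In this model the daily snapshot absorbs repeated rewrites of the same blocks into a single delta, while the 4-hour schedule stores each partial rewrite separately. For a VM that changes little between snapshots, the two schedules come out nearly the same, which is why the default works well in most cases.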

This is no limitation of Rubrik or any other product that would seek to protect this data; it’s just simple physics. In my case, some of these servers are not actually time-sensitive in a manner that benefits more from 4-hour backups than it does from 24-hour backups. It also isn’t data that needs extended retention. As such, I created an SLA Domain that I call “Bronze Light” that keeps a mere 14 days of daily snapshots and ships the oldest 7 days of those to the cloud. This keeps local data slim, but gives me that insurance buffer of an extra week in case I need to reach out and touch it.

Exciting Days Ahead

I’m looking forward to hearing what Rubrik and others announce and present at VFD5 this week. It’s a great time to be a tech enthusiast and a privilege to work with innovations like Rubrik. I hope you get the opportunity.
