VMworld: vSphere Stretched Clusters, DR & Mobility (BCO2479)

Speakers: Lee Dilworth (VMware), Chad Sakac (EMC)

– lots of confusion between disaster recovery (DR) and disaster avoidance (DA)

Part 1: Disaster Avoidance vs. Disaster Recovery

Disaster Avoidance
– you know a host will go down
– Host: vMotion
– Site: vMotion

Disaster Recovery
– unplanned host outage
– Host: VMware HA
– Site: SRM
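
The two lists above boil down to a small lookup: whether the outage is planned, and whether it hits a host or a whole site, decides the mechanism. A minimal Python sketch of that matrix follows; the function and key names are illustrative only and are not part of any VMware API.

```python
# Mechanism matrix from the notes: (event type, scope) -> mechanism.
# "planned" corresponds to disaster avoidance, "unplanned" to disaster recovery.
MECHANISMS = {
    ("planned", "host"): "vMotion",
    ("planned", "site"): "vMotion",
    ("unplanned", "host"): "VMware HA",
    ("unplanned", "site"): "SRM",
}


def mechanism_for(event: str, scope: str) -> str:
    """Return the mechanism the notes associate with an event type and scope."""
    return MECHANISMS[(event, scope)]
```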

*** More content forthcoming (to fill in the blanks) ***

Type 1: Stretched Single vSphere Cluster
– single vCenter instance controlling both sites
– single cluster over both sites
– intra-cluster vMotion can be highly parallelized
– – four per datastore in vSphere 5
– network requirements
– – 622 Mbps or more, 5 ms RTT (or 10 ms in vSphere 5 Enterprise Plus)
– – Layer 2 equivalence for vmkernel and VM network traffic
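
The bandwidth and latency numbers above can be turned into a quick sanity check for a candidate inter-site link. This is only a sketch: the function name and parameters are made up for illustration, and it does not check the Layer 2 equivalence requirement at all.

```python
MIN_BANDWIDTH_MBPS = 622.0          # minimum link bandwidth from the notes
MAX_RTT_MS = 5.0                    # default round-trip limit
MAX_RTT_MS_ENTERPRISE_PLUS = 10.0   # relaxed limit with vSphere 5 Enterprise Plus


def link_meets_vmotion_requirements(bandwidth_mbps: float,
                                    rtt_ms: float,
                                    enterprise_plus: bool = False) -> bool:
    """Check a link against the bandwidth and RTT figures quoted in the session."""
    max_rtt = MAX_RTT_MS_ENTERPRISE_PLUS if enterprise_plus else MAX_RTT_MS
    return bandwidth_mbps >= MIN_BANDWIDTH_MBPS and rtt_ms <= max_rtt
```

For example, a 1 Gbps link with 8 ms RTT fails the default check but passes when `enterprise_plus=True`.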

Type 2: Inter-Cluster vMotion
– inter-cluster vMotions are serialized
– – involves additional calls into vCenter
– – lose VM cluster properties
– network requirements
– – same as Type 1

Type 3: Classic Site Recovery Manager
– two vCenters and clusters

– DA != DR
– stretched clusters are complex
– SRM and non-disruptive workload mobility are mutually exclusive right now
– – vMotion = single vCenter vs. SRM = two+ vCenter domains

Part 2: Stretched Cluster Considerations

Design Considerations
– understand the difference compared to DR
– – HA does not follow a recovery plan workflow
– – HA is not site aware for applications, moving parts, dispersed sites
– single stretched site = single vCenter
– – vCenter setting consistency across sites (DRS affinity, cluster settings, network)
– will the network support it? Layer 2 stretch? IP mobility?
– cluster split brain = big concern, how to handle?

Stretched Storage Configuration
– literally stretch SAN fabric between locations
– requires synchronous replication
– limited in distance to 100km
– read/write on one side, read-only on the other side
– similar to NetApp MetroCluster
– consider failure modes of both network and storage

Distributed Virtual Storage Configuration
– leverages new technologies
– requires synchronous mirroring
– limited to 100km in most cases
– read/write in both locations, employs data locality algorithm
– typically uses multiple controllers in scale-out fashion
– must address “split brain” scenarios

EMC VPLEX Overview
– falls into distributed virtual storage category
– keeps data synchronized and also read/write at two locations
– uses scale out architecture
– supports both EMC and non-EMC arrays behind VPLEX

Preferred Site in VPLEX Metro
– provides read/write in two locations at same time
– in failure scenario, VPLEX uses “detach rules” to prevent split brain
– invoked only by entire cluster failure, entire site failure, or cluster partition
– VMs don’t know if they are on the winning side or the losing side
– important for admins to configure the bias accordingly
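
To see why the bias setting matters, here is a toy model of a partition between two sites. It assumes the simplest case: the preferred site keeps serving I/O and the other side suspends it. The names are hypothetical and this is not the VPLEX API; real detach rules are configured on the storage side, not per VM.

```python
def vms_by_outcome(vm_sites: dict, preferred_site: str) -> dict:
    """Split VMs into winning/losing sides for a partition, given the preferred site.

    vm_sites maps a VM name to the site it is currently running at.
    Assumption: the preferred site continues I/O, the other site suspends it.
    """
    outcome = {"winning": [], "losing": []}
    for vm, site in sorted(vm_sites.items()):
        side = "winning" if site == preferred_site else "losing"
        outcome[side].append(vm)
    return outcome
```

Running this against an inventory before choosing the preferred site shows which VMs would lose their storage if the inter-site link dropped.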

Yanking and Suspending Storage
– when storage is lost, yanked, or suspended, VMs don’t always go completely down
– VMs become zombies of sorts, sometimes responding to pings, etc.
– vCenter may also be a zombie and not respond properly to SRM commands
– admin action: immediately shut down ESX hosts at down site

Stretched Cluster Considerations #1
– consideration: without read/write storage at both sites, roughly have the VMs…

– prior to and including vSphere 4.1, you can’t control HA/DRS behavior for “sidedness”

– with vSphere 5, you can use DRS host affinity rules to control HA/DRS behavior
– doesn’t address HA primary/secondary node selection (storage)
– beware of single-controller implementations
– storage latency still present in the event of controller failure

– no supported way to control VMware HA primary/secondary node selection with vSphere 4.x
– with all storage configurations:
– – limits cluster size to 8 hosts (4 in each site)
– – no supported mechanism for controlling/specifying node selection
– –
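
The 8-host limit makes sense if you assume the vSphere 4.x HA design of at most five primary nodes per cluster (that number is my assumption here, it is not in the notes). With only four hosts per site, five primaries can never all sit in one site, so a full site failure always leaves at least one primary running. A brute-force sketch of that pigeonhole argument, with illustrative names:

```python
from itertools import combinations

HA_PRIMARY_COUNT = 5  # assumption: vSphere 4.x HA uses up to five primary nodes


def primaries_always_span_sites(hosts_per_site: int,
                                primaries: int = HA_PRIMARY_COUNT) -> bool:
    """True if every possible choice of primaries includes hosts from both sites."""
    hosts = [("A", i) for i in range(hosts_per_site)]
    hosts += [("B", i) for i in range(hosts_per_site)]
    # Enumerate every way the primaries could be placed and check site coverage.
    return all(
        len({site for site, _ in chosen}) == 2
        for chosen in combinations(hosts, primaries)
    )
```

With 4 hosts per site the check holds; with 5 per site it fails, because all five primaries could end up in the same site.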

– stretched HA/DRS clusters require L2 equivalence

– network lacks site awareness, so stretched clusters introduce new networking challenges
– storage considerations:
– –


Part 3: What’s New?

New Features
– new workflows including failback
– planned migration – with replication update
– vSphere Replication Framework
– redesigned UI
– faster IP customization
– SRM specific Shadow VM icons at recovery site
– in-guest script callouts via recovery plans
– VM dependency ordering configurable

vSphere 5 HA
– completely redesigned
– heartbeat datastores

Metro vMotion – Stretched Clusters
– longer distances
– workload balancing across sites
– less latency sensitive

– use cases:
– – mobility
– – availability
– – collaboration
– local & metro for stretched clusters
– geo: not yet; can result in data loss
