Speaker: Lee Dilworth (VMware), Chad Sakac (EMC)
Premise:
– lots of confusion between disaster recovery (DR) and disaster avoidance (DA)
Part 1: Disaster Avoidance vs. Disaster Recovery
Disaster Avoidance
– you know a host will go down
– Host: vMotion
– Site: vMotion
Disaster Recovery
– unplanned host outage
– Host: VMware HA
– Site: SRM
Type 1: Stretched Single vSphere Cluster
– single vCenter instance controlling both sites
– single cluster over both sites
– intra-cluster vMotion can be highly parallelized
– – four per datastore in vSphere 5
– network requirements
– – 622 Mbps or more, 5ms RTT (or 10ms with vSphere 5 Enterprise Plus)
– – Layer 2 equivalence for vmkernel and VM network traffic
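The bandwidth/latency requirements above lend themselves to a simple pre-check. Below is a minimal sketch (not a VMware tool; thresholds taken from the notes, function and variable names are my own) that compares measured link characteristics against the stated vMotion limits:

```python
# Thresholds from the session notes; measurements are assumed to come
# from your own network monitoring.
VMOTION_MIN_BANDWIDTH_MBPS = 622   # minimum inter-site bandwidth for vMotion
RTT_LIMIT_MS_STANDARD = 5          # standard vMotion RTT limit
RTT_LIMIT_MS_METRO = 10            # vSphere 5 Enterprise Plus (Metro vMotion)

def vmotion_link_ok(bandwidth_mbps: float, rtt_ms: float,
                    metro_vmotion: bool = False) -> bool:
    """Return True if the inter-site link meets the stated vMotion requirements."""
    rtt_limit = RTT_LIMIT_MS_METRO if metro_vmotion else RTT_LIMIT_MS_STANDARD
    return bandwidth_mbps >= VMOTION_MIN_BANDWIDTH_MBPS and rtt_ms <= rtt_limit

# A 1 Gbps link with 7ms RTT fails standard vMotion but passes Metro vMotion.
print(vmotion_link_ok(1000, 7))                      # False
print(vmotion_link_ok(1000, 7, metro_vmotion=True))  # True
```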
Type 2: Inter-Cluster vMotion
– inter-cluster vMotions are serialized
– – involves additional calls into vCenter
– – lose VM cluster properties
– network requirements
– – same as Type 1
Type 3: Classic Site Recovery Manager
– two vCenters and clusters
Summary:
– DA != DR
– stretched clusters are complex
– SRM and non-disruptive workload mobility are mutually exclusive right now
– – vMotion = single vCenter vs. SRM = two+ vCenter domains
Part 2: Stretched Cluster Considerations
Design Considerations
– understand the difference compared to DR
– – HA does not follow a recovery plan workflow
– – HA is not site-aware (applications, moving parts, dispersed sites)
– single stretched cluster = single vCenter
– – vCenter setting consistency across sites (DRS affinity, cluster settings, network)
– will the network support it? Layer 2 stretch? IP mobility?
– cluster split brain = big concern, how to handle?
Stretched Storage Configuration
– literally stretch SAN fabric between locations
– requires synchronous replication
– limited in distance to 100km
– read/write on one side, read-only on the other side
– similar to NetApp Metro Cluster
– consider failure modes of both network and storage
Distributed Virtual Storage Configuration
– leverages new technologies
– requires synchronous mirroring
– limited to 100km in most cases
– read/write in both locations, employs data locality algorithm
– typically uses multiple controllers in scale-out fashion
– must address “split brain” scenarios
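The two defining behaviors above (synchronous mirroring plus a data locality algorithm) can be illustrated with a toy sketch. This is not any vendor's implementation; all names are illustrative:

```python
# Toy model of a distributed virtual volume: writes commit to both site
# legs before acknowledging (synchronous mirror), reads are served from
# the requester's own site (data locality).
class DistributedVirtualVolume:
    def __init__(self):
        self.legs = {"site_a": {}, "site_b": {}}

    def write(self, block: int, data: bytes) -> None:
        # Synchronous mirror: the write is not acknowledged until
        # both legs hold the data.
        for leg in self.legs.values():
            leg[block] = data

    def read(self, block: int, local_site: str) -> bytes:
        # Data locality: serve the read from the local leg, avoiding
        # a cross-site round trip.
        return self.legs[local_site][block]

vol = DistributedVirtualVolume()
vol.write(0, b"hello")
# Both sites see the same data; each reads its own copy.
assert vol.read(0, "site_a") == vol.read(0, "site_b") == b"hello"
```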
EMC VPLEX Overview
– falls into distributed virtual storage category
– keeps data synchronized and also read/write at two locations
– uses scale out architecture
– supports both EMC and non-EMC arrays behind VPLEX
Preferred Site in VPLEX Metro
– provides read/write in two locations at same time
– in failure scenario, VPLEX uses “detach rules” to prevent split brain
– invoked only by entire cluster failure, entire site failure, or cluster partition
– VMs don’t know if they are on the winning side or the losing side
– important for admins to configure the bias accordingly
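The detach-rule behavior described above amounts to a simple winner-selection policy on partition. A hedged sketch (illustrative only, not VPLEX code; site names are assumptions):

```python
# On a cluster partition, the preferred (bias) site keeps the volume
# read/write and every other site suspends I/O, preventing split brain.
def resolve_partition(preferred_site: str, sites: list[str]) -> dict[str, str]:
    """Return each site's volume state after a partition: the preferred
    site wins; all others suspend I/O."""
    return {site: ("read/write" if site == preferred_site else "suspended")
            for site in sites}

# Admins configure the bias per volume; the VMs themselves never know
# which side won.
print(resolve_partition("site_a", ["site_a", "site_b"]))
# {'site_a': 'read/write', 'site_b': 'suspended'}
```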
Yanking and Suspending Storage
– when storage is lost, yanked, or suspended, VMs don't always go down completely
– VMs become zombies of sorts, sometimes still responding to pings, etc.
– vCenter may also be a zombie and not respond properly to SRM commands
– admin action: immediately shut down ESX hosts at down site
Stretched Cluster Considerations #1
– consideration: without read/write storage at both sites, roughly have the VMs…
#2
– in vSphere 4.1 and earlier, you can't control HA/DRS behavior for "sidedness"
#3
– with vSphere 5, you can use DRS host affinity rules to control HA/DRS behavior
– doesn’t address HA primary/secondary node selection (storage)
– beware of single-controller implementations
– storage latency still present in the event of controller failure
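What the vSphere 5 host affinity rules buy you can be sketched as a simple placement check. This is an illustrative model only (host/VM/site names are assumptions, not the DRS API): a "should run on hosts in group" rule keeps each VM at the site where its read/write storage lives, though DRS may still override it under constraint:

```python
# Map hosts to sites and VMs to their preferred (storage-local) site.
host_site = {"esx01": "site_a", "esx02": "site_a",
             "esx03": "site_b", "esx04": "site_b"}
vm_preferred_site = {"vm_web": "site_a", "vm_db": "site_b"}

def affinity_violations(vm_placement: dict[str, str]) -> list[str]:
    """Return VMs running outside their preferred site. Mirrors a DRS
    'should' rule: a violation is possible, just undesirable."""
    return [vm for vm, host in vm_placement.items()
            if host_site[host] != vm_preferred_site[vm]]

# vm_db prefers site_b but is running on a site_a host.
print(affinity_violations({"vm_web": "esx01", "vm_db": "esx02"}))  # ['vm_db']
```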
#4
– no supported way to control VMware HA primary/secondary node selection with vSphere 4.x
– with all storage configurations:
– – limits cluster size to 8 hosts (4 in each site)
– – no supported mechanism for controlling/specifying node selection
#5
– stretched HA/DRS clusters require L2 equivalence
#6
– network lacks site awareness, so stretched clusters introduce new networking challenges
– storage considerations:
Summary:
(image)
Part 3: What’s New?
New Features
– new workflows including failback
– planned migration, with replication update
– vSphere Replication Framework
– redesigned UI
– faster IP customization
– SRM specific Shadow VM icons at recovery site
– in-guest script callouts via recovery plans
– VM dependency ordering configurable
vSphere 5 HA
– completely redesigned
– heartbeat datastores
Metro vMotion – Stretched Clusters
– longer distances
– workload balancing across sites
– less latency sensitive (tolerates up to 10ms RTT)
VPLEX 5.0
– use cases:
– – mobility
– – availability
– – collaboration
– local & metro for stretched clusters
– geo: not yet; can result in data loss