Power Outage and DFS Replication

On Monday I had the privilege of participating in an unplanned recovery drill after maintenance on our site UPS and generator tripped over itself (four times). Needless to say, a lot of our infrastructure doesn't take kindly to unexpected darkness, and with no choreography to the restart, things come back up out of order. But that's not the focus here, just the context. Almost everything was restored in relatively short order thanks to good documentation along these lines:

  1. Confirm power is reliable and not likely to fail again in the immediate future
  2. Power on network switches (core, rack, etc)
  3. Power on storage array(s) (HP, EMC, etc)
  4. Power on virtualization hosts (ESXi, Hyper-V)
  5. Connect to each virtualization host directly (vSphere Client to hostname / RDP to Hyper-V Console)
  6. Confirm presence of all storage (LUNs, datastores, CSVs)
  7. Confirm recognition/identities of hosted virtual machines (out-of-order boot may see VMs as “unknown”)
  8. If any storage is missing or VMs “unknown”, reboot hosts to confirm storage accessibility
  9. If VMs auto-power on, force power off to prevent incorrect boot order (i.e. VMs on before Active Directory)
  10. Power on Active Directory servers
  11. Reboot AD servers (one at a time) until they come up smoothly, recognize the network (as "domain network"), and serve DHCP, DNS, etc (a quick scripted check for this is sketched just after this list)
  12. Power on vCenter and VMM servers
  13. Power on DFS and shared file servers
  14. Power on System Center Operations Manager server to begin monitoring
  15. Power on load balancers
  16. Etc…
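
For Step 11, the "domain network" check is easy to eyeball in the GUI, but it can also be scripted. A minimal sketch in PowerShell, assuming Server 2012-era cmdlets and a typical set of DC service names (adjust to your roles):

# Confirm each NIC came up on the domain profile rather than Public/Private
Get-NetConnectionProfile | Select-Object InterfaceAlias, NetworkCategory

# Confirm the core AD-related services are running (service names are assumptions for a typical DC)
Get-Service NTDS, DNS, DHCPServer, Netlogon | Select-Object Name, Status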

Back on topic: this post is about DFS-R (Distributed File System Replication), mentioned in Step 13, which I only fully understood in this context today. I probably should have known this by now, but there's a reason Step 13 alone isn't enough to get DFS-R operational. It catches me by surprise every time someone reports that data is out of sync, and every time it has happened, I've had to manually reconcile the data before doing an authoritative sync, because some data comes from each side. Finally, today, I know why and how to fix it.

A July 23, 2012 TechNet Blog post explains that Microsoft changed the default DFS-R recovery behavior after a dirty (unexpected) shutdown in a hotfix for Windows Server 2008 R2, and that change became the default behavior in Windows Server 2012 and beyond. Pre-patch, it auto-recovered. Post-patch, it stays down in manual mode waiting for intervention. Thus, reboot and restart the services all you like; it will never auto-recover on its own.
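
If you're not sure which behavior a given server has, the registry value covered at the end of this post is one way to check. A quick sketch in PowerShell, with the assumption that an absent value means the post-hotfix default (stay stopped) applies:

# 0 = auto-recover after a dirty shutdown; 1 = stop replication and wait for manual recovery
# (if the value is absent, I'd assume the post-hotfix default of stopping applies)
Get-ItemProperty -Path 'HKLM:\System\CurrentControlSet\Services\DFSR\Parameters' `
    -Name 'StopReplicationOnAutoRecovery' -ErrorAction SilentlyContinue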

It's for a good cause: Microsoft wasn't just flipping a coin to mess with IT admins. The new manual default allows a backup to be taken in case the recovery replication overwrites data during cleanup. Still, it doesn't set off enough bells and whistles to properly let you know there's trouble (even in SCOM).
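
Since nothing raises an alarm on its own, one cheap post-outage habit is to scan the DFS Replication event log for the stopped-replication event directly. A minimal sketch, assuming the log name "DFS Replication" and the Event ID (2213) called out in the next paragraph:

# Any 2213 events mean a volume is sitting there waiting for a manual resume
Get-WinEvent -FilterHashtable @{ LogName = 'DFS Replication'; Id = 2213 } -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Message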

Surprisingly, the fix for manual recovery is given right in the Event Log (Event ID 2213), complete with the actual command you need to run, in this format:

wmic /namespace:\\root\microsoftdfs path dfsrVolumeConfig where volumeGuid="<volume-GUID>" call ResumeReplication

If you default to PowerShell like I do, though, your rejoicing will be short-lived. As soon as you run the command, it returns:

Unexpected switch at this level

No fun. Thankfully, another blogger out there ran into this and posted that the answer is to run it from a plain Command Prompt instead. Apparently not everything is PowerShell friendly yet. Once you run the command there (one per volume), your output should look like this:

[screenshot: wmic ResumeReplication output]
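
If you'd rather stay in PowerShell after all, the underlying WMI class should be reachable without going through wmic.exe. This is just a sketch, assuming the namespace, class, and method names are exactly the ones the wmic command above uses (I've only used the Command Prompt route myself):

# Enumerate replicated volumes, then call ResumeReplication on the stopped one(s)
$volumes = Get-WmiObject -Namespace 'root\MicrosoftDFS' -Class DfsrVolumeConfig
$volumes | Select-Object VolumeGuid, VolumePath                  # find the GUID(s) from the 2213 events
$volumes | Where-Object { $_.VolumeGuid -eq '<volume-GUID>' } |
    ForEach-Object { $_.ResumeReplication() }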

Hope this helps any other DFS-R users out there. Oh, and if backups aren't a concern, you can change the default behavior in the registry by setting the following value to "0" (which re-enables automatic recovery):

HKLM\System\CurrentControlSet\Services\DFSR\Parameters\StopReplicationOnAutoRecovery
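
A minimal sketch of that change in PowerShell; the value of 0 brings back the old auto-recovery behavior, and I'd assume the DFSR service needs a restart (or a reboot) before it takes effect:

# 0 re-enables automatic recovery after a dirty shutdown (pre-hotfix behavior)
Set-ItemProperty -Path 'HKLM:\System\CurrentControlSet\Services\DFSR\Parameters' `
    -Name 'StopReplicationOnAutoRecovery' -Value 0 -Type DWord

# Presumably needs a DFSR service restart (or reboot) to take effect
Restart-Service -Name DFSR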
