Active Directory and DFS-R Auto-Recovery

Back in January 2012, Microsoft released KB2663685 which changes the default behaviour of DFS-R replication and it effects Active Directory. Prior to the hotfix, when a DSF-R replication group member performs a dirty shutdown, the member would perform an automatic recovery when it came back online however after the hotfix, this is no longer the case. This behaviour change results in a DFS-R replication group member halting replication after a dirty shutdown awaiting manual intervention. Your intervention choices range from manually activating the recovery task to decommissioning the server and replacing it, all depending on the nature of the dirty shutdown. What we need to understand however is that a dirty shutdown can happen more often than you think so it’s important to be aware of this.

I appreciate this is an old subject but it is one that I’ve come across a couple of times recently so wanted to share and highlight the importance of it. This will be one of a few posts I have upcoming on slightly older topics but none the less important ones that need to be addressed.

How Does DFS-R Effect Active Directory

In Windows Server 2008, Microsoft made a big change to Active Directory Domain Services (AD DS) by allowing us to use DFS-R for the underlying replication technology for the Active Directory SYSVOL, replacing File Replication Service (FRS) that has been with us since the birth of Active Directory. DFS-R is a massive improvement on FRS and you can read about the changes that DFS-R brings to understand the benefits at http://technet.microsoft.com/en-us/library/cc794837(v=WS.10).aspx. If you have upgraded your domains from Windows Server 2003 to Windows Server 2008 or Windows Server 2008 R2 and you haven’t completed the FRS to DFS-R migration (and it’s easily overlooked as you have to manually complete this part of a migration in addition to upgrading or replacing your domain controllers with Windows Server 2008 servers and there are no prompts or reminders when replacing your domain controllers to do it), I’d really recommend you look at it. There is a guide available on TechNet at http://technet.microsoft.com/en-us/library/dd640019(v=WS.10).aspx to help you through the process.

Back in January 2012, Microsoft released KB2663685 which changes the default behaviour of DFS-R replication and it effects Active Directory. Prior to the hotfix, when a DSF-R replication group member performs a dirty shutdown, the member would perform an automatic recovery when it came back online however after the hotfix, this is no longer the case. This behaviour change results in a DFS-R replication group member halting replication after a dirty shutdown awaiting manual intervention. Your intervention choices range from manually activating the recovery task to decommissioning the server and replacing it, all depending on the nature of the dirty shutdown. What we need to understand however is that a dirty shutdown can happen more often than you think so it’s important to be aware of this.

Identifying Dirty DFS-R Shutdown Events

Dirty shutdown events are logged to the DFS Replication event log with the event ID of 2213 as shown below in the screenshot and it advises you that replication has been halted. If you have virtual domain controllers and you shutdown your domain controller using the Shutdown Guest Operating System options in vSphere or in Hyper-V, this will actually trigger a dirty shutdown state. Similarly, if you have a HA cluster of hypervisors and you have a host failure causing the VM to restart on another host, yep, you guessed it, that’s another dirty shutdown. The lesson here first and foremost is always shutdown domain controllers from within the guest operating system to ensure that it is done cleanly and not forcefully via a machine agent. The event ID 2213 is quite helpful in that it actually gives us the exact command to recover the replication so a simply copy and paste into an elevated command prompt will recover the server. No need to edit to taste. Once you’ve entered the command, another event is logged with the event ID 2214 to indicate that replication has recovered shown in the second screenshot.

AD DS DFS-R Dirty Shutdown 2213  AD DS DFS-R Dirty Shutdown 2214

Changing DFS-R Auto-Recovery Behaviour

So now that we understand the behaviour change, the event ID’s that lets us track this issue, how can we get back to the previous behaviour so that DFS-R can automatically recover itself? Before you do this, you need to realise that there is a risk to this change and the risk is that if you allow automatic recovery of DFS-R replication groups and the server that is coming back online is indeed dirty, it could have an impact on the sanctity of your Active Directory Domain Services SYSVOL directory.

Unless you have a very large organisation or unless you are making continuous change to your Group Policy Objects or files which are stored in SYSVOL, this shouldn’t really be a problem and I believe that the risk is outweighed by the advantages. If a domain controller restarts and you don’t pick up on the event ID 2213, you have a domain controller which is out of sync with the rest of the domain controllers. The risk to this happening is that domain members and domain users will be getting out of date versions of Group Policy Objects if they use this domain controller as the domain controller will still be active servicing clients whilst this DFS-R replication group is in an unhealthy state.

Effects Beyond Active Directory

DFS-R is a technology originally designed for replicating file server data. This change to DFS-R Auto-Recovery impacts not only Active Directory, the scope of this post but also file services. If you are using DFS-R to replicate your file servers then you may want to consider this change for those servers too. Whilst having an out of date SYSVOL can be an inconvenience, having an out of date file server can be a major problem as users will be working with out of date copies of documents or users may not even be able to find documents if the document they are looking for is new and hasn’t been replicated to their target server.

My take on this though would be to carefully consider the change for a file server. Whilst having a corrupt Group Policy can fairly easily be fixed or recovered from a GPO backup or re-created if the policy wasn’t too complex, asking a user to re-create their work because you allowed a corrupt copy of it to be brought into the environment might not go down quite so well.