Friday, January 9, 2009

Recovery of SCCM Site Failure

Although Microsoft has recovery information posted in its online support docs (http://technet.microsoft.com/en-us/library/bb680474.aspx), site recovery is still a confusing task.  I've gone through several site recoveries, and here are the notes from my last one, where the central site happened to fail.  An important point that I don't think is explained very well: you must have a functioning site set up the same way as the old site before you can recover the old site.

So before starting the site recovery, review the original site setup procedure.  If you don't have one documented, see the excellent write-up by Ying Li: http://myitforum.com/cs2/blogs/yli628/archive/2008/06/25/setup-configmgr-2007-sp1-from-start-to-finish.aspx

Notes from recovery of Central site on 12/17/08.

1. OS, server name, site code, and drive layout (OS on C: and program files on D:) should match the original hardware.  (Do not move from a 64-bit to a 32-bit OS.)
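A quick way to catch a mismatch before installing anything is to write down the original layout and compare it with the rebuild. The following is only a minimal sketch: the SiteLayout class, its field names, and the layout_mismatches helper are hypothetical names made up for illustration, and the values you feed it come from your own documentation, not from SCCM.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SiteLayout:
    """Settings that step 1 says must match the failed server (hypothetical fields)."""
    server_name: str
    site_code: str
    os_arch: str       # e.g. "x86" or "x64"
    os_drive: str      # e.g. "C:"
    install_path: str  # e.g. "D:\\sms"


def layout_mismatches(original, rebuilt):
    """Return {field: (original_value, rebuilt_value)} for each field that differs.

    The comparison ignores case, since Windows host names and paths are
    case-insensitive.
    """
    before, after = asdict(original), asdict(rebuilt)
    return {
        field: (before[field], after[field])
        for field in before
        if before[field].lower() != after[field].lower()
    }
```

An empty result means the recorded values line up. Anything else is worth fixing before you go further, especially the install path.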

2. Install the OS and configure as normal

3. Give the machine account admin rights on the SQL server

4. Install WSUS 3.0 with SP1.  Do not use the default website; use a custom website.  During the install, point it to the remote SQL server (if you use a remote SQL server to host WSUS metadata).  Do not overwrite the contents of the database.  Do not use the configuration wizard to set up WSUS; simply exit when the configuration wizard starts.

5. Start copying the backup files to the local machine now, since the copy can take half an hour.

6. Install the correct version of SCCM

a. Make sure to reuse the same program path (d:\sms for our Primary sites, since they were upgrades from SMS; d:\sccm for our Secondary sites, since they were fresh installs when we set them up).  This part is critical.  When you do the recovery step, your site will use the original setup paths that were in use prior to the site failure.  Changing the paths will cause a significant headache!

b. On a new fully patched system you will not pass the prereq check.  Double click on each item to see how to resolve the issue.  You should be able to resolve every error.

c. After you resolve the MMC SP3 issue, it will still show up as a warning in the prereq check.  That is fine if you are sure you have applied SP3 for MMC.  The setup routine does not correctly query the registry to see the latest version of the hotfix that Microsoft issued.

d. You can ignore the warning about SQL server authentication mode if you typically run SQL under the system account (without hardening SQL).

e. If you are restoring a site that has a remote provider and you point it back to the correct remote provider machine, the installer will complain that the machine already has a provider.  To clear this:

1. On the provider machine, the registry key is blocking the install of the new remote provider, so remove the HKLM\SOFTWARE\Microsoft\SMS\Providers key.

2. On the provider machine, connect to root\sms with WBEMTEST and press the 'Enum Classes' button. No input is necessary; just press 'OK' to do an 'Immediate only' search. In the Query Results dialog window, click on 'SMS_ProviderLocation' and press the 'Delete' key. Close out of all of these dialogs.

3. Delete the SMSPROV folder on the root.

4. Add the new site server machine as a local administrator and remove the old site server (if applicable).

Note: Provider fix, pasted from http://social.technet.microsoft.com/forums/en-US/configmgrsetup/thread/bb307748-7638-404d-a6a4-982827a051c8/

f. All other site servers will install the provider on themselves.

g. On the SQL DB server, detach the old DB.  Create a new DB with the same name and file locations.  Give the SMS site server db_owner on the DB.

7. After the install has completed successfully, run the site recovery wizard.

a. Close the console if open

b. Start>all programs>Microsoft System Center>Config Mgr 2007>Config Mgr Site Repair Wizard

c. Redsite and ROSsite do not have local DPs installed, so for those sites choose the option to skip package verification.

8. Reset permissions for site in AD.

a. Open AD users and Computers> System> System Management

b. Open properties and give the site server full control on the System Management container.

c. Open the advanced properties and change the permissions so that they apply to "This object and all descendant objects"  (this is not the default, so be sure to do it).

9. Restore the site control file

a. Copy sitectrl_<SiteCode>.ct0 from the site_control_files backup folder to D:\sms\inboxes\sitectrl.box

b. Rename file from sitectrl_<SiteCode>.ct0 to sitectrl.ct0
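The copy and rename in step 9 can be done in one move. This is just a sketch of the idea: restore_site_control is a hypothetical helper, and the folder arguments are whatever you use for your backup folder and for D:\sms\inboxes\sitectrl.box.

```python
import shutil
from pathlib import Path


def restore_site_control(backup_dir, site_code, sitectrl_box):
    """Copy sitectrl_<SiteCode>.ct0 into sitectrl.box under the name sitectrl.ct0.

    Raises FileNotFoundError if the backed-up file is missing, so a mistyped
    site code is caught before anything is written.
    """
    source = Path(backup_dir) / f"sitectrl_{site_code}.ct0"
    if not source.is_file():
        raise FileNotFoundError(source)
    destination = Path(sitectrl_box) / "sitectrl.ct0"
    shutil.copy2(source, destination)  # copy2 also keeps the file timestamps
    return destination
```

The backup copy is left in place, so you can repeat the step if the first attempt goes wrong.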

10. After recovery, perform a site reset

a. Rerun setup from Start>all programs>Microsoft System Center>Config Mgr 2007>

b. Choose site reset

11. Set user group permissions for recovered site and related site servers

a. Computer mgmt>local users and groups>Groups

b. Sms_sitetositeconnection_<sitecode>  should contain the parent server and any child servers that need to connect to the site.

c. Sms_siteSystemtoSiteServer_<sitecode> should contain any parent or child site that needs to write to the site's DB.

d. Sms Reporting Users should contain any domain accounts that have reporting rights.

e. Sms admins should contain your sms administrators domain accounts.

12. If this was the central site with the WSUS updates, then the WSUS update folders need to be reshared with the same share names.  Check the software updates deployment packages node.  On each package, open the package properties.  The General tab shows the share name where the package is expected to be found.  The central site's machine account will need full control of this share.

13. Reset the wsus db:

a. From the cmd prompt:  "c:\program files\Update Services\tools\wsusutil.exe" reset

b. Wait 1 hour

c. Force a Synchronization on the Update Repository

d. Verify WSUS is syncing properly: check wsyncmgr.log for errors.
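Reading a long log by eye gets old fast, so here is a rough sketch of a keyword scan you could run against wsyncmgr.log (or smsbkup.log later on). The find_error_lines helper and its keyword list are assumptions of mine. It does not parse the real log format, it only flags lines that contain one of the keywords, so treat the hits as a starting point.

```python
def find_error_lines(lines, keywords=("error", "failed")):
    """Return (line_number, text) for each line that contains any keyword.

    Matching is case-insensitive and purely textual; the keyword list is a
    guess, not something taken from the ConfigMgr log format.
    """
    lowered = tuple(keyword.lower() for keyword in keywords)
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(keyword in line.lower() for keyword in lowered):
            hits.append((number, line.rstrip("\n")))
    return hits
```

To use it, open the log with open(path, errors="replace") and pass the file object straight in, since the function accepts any iterable of lines.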

14. Verify backup share permissions for newly restored site.  Our backups are set for a share on another server, which is then backed up to tape.  This can be verified in the site maintenance node and by reviewing the smsbkup.log located in the backup share.

15. If this was the central site, recreate the backup schedule for the site control file.  We use a scheduled task that runs every 15 minutes to dump the site control file and copy it to another machine:

a. site_control_file_backup.bat

D:\sms\bin\i386\00000409\preinst.exe /Dump

xcopy d:\*.ct0 backup_location\site_control_files /C /Y
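For anyone who would rather script the copy half of that batch file, here is a minimal Python sketch of the same idea. It only mirrors the xcopy line (overwrite without prompting like /Y, keep going after a failed file like /C) and does not run preinst.exe. The backup_ct0_files name and both folder arguments are hypothetical.

```python
import shutil
from pathlib import Path


def backup_ct0_files(source_dir, backup_dir):
    """Copy every *.ct0 file from source_dir into backup_dir.

    Existing copies are overwritten, and a file that cannot be copied is
    recorded and skipped rather than aborting the run.
    Returns (copied_names, failed_names).
    """
    backup = Path(backup_dir)
    backup.mkdir(parents=True, exist_ok=True)
    copied, failed = [], []
    for source in sorted(Path(source_dir).glob("*.ct0")):
        try:
            shutil.copy2(source, backup / source.name)
            copied.append(source.name)
        except OSError:
            failed.append(source.name)
    return copied, failed
```

Returning the failed names lets the scheduled task log or alert on them instead of failing silently.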

16. Check and review system logs for errors for the next several days.

17. Monitor Site Status for errors.  The only errors should be on the central site and be related to unapproved clients trying to get policy.  Recheck in 24 hours.

18. Verify successful backups by reviewing the smsbkup.log located in the backup share.

5 comments:

MEI said...

Why would we need a WSUS reset?
Isn't that only necessary if you store updates locally?

Bill Phillips said...

You still store EULAs locally in order to approve them. The WSUS reset is optional, but it doesn't hurt anything. Just monitor wsyncmgr.log for errors. If you see errors, then I would start with a WSUS reset.

Anonymous said...

Any issues going from 32bit to 64 bit OS if we're doing a rebuild of the Central site server? Will all the roles and services still work with IIS 7?
Thanks

Bill Phillips said...

I wouldn't recommend it. A recovery is not the time to do an upgrade. The site recovery is going to assume everything is set up exactly the way it was when the backup was made (install directories and registry key entries). If you want to move from 32-bit to 64-bit, do a site migration after the recovery.

Anonymous said...

Hi. We had a Central Site, a Primary Site reporting to it, and beneath the Primary there were 5-6 Secondary sites. Our Primary Child Site Server failed (it hosted MP, DP, WSUS 3.0 SP2). We restored the failed server using the Site Repair Wizard, but at the Parent Connection page of the wizard, if we select the check box "Recover Data From Parent Site", the wizard fails (due to a WMI issue) on the next step, i.e. the Parent Recovery page, at the "Connect to Provider" step. If we deselect "Recover Data From Parent Site", the wizard does not try to connect to the Central site and thus finishes fine.

I need to know what will happen if I deselect this option and move on to finish the wizard. Can the data/settings that this option tries to retrieve from the Central site be recovered later on, and if yes, how?

What other steps (considering my scenario) do I need to do in order to get things back on track (like Site Reset, WSUS Database reset, etc.)?

Secondly, how can I validate the success of the Site Restore, i.e. can I create something on the Central Site as well as on the Child Primary site and see if they are replicating or talking to each other?

Also, what things (like Collections, Packages, etc.) get replicated from the Child Primary Site to the Central Site and vice versa? Will collections and packages created on the Child Primary appear at the Central site or not?

Regards
Taranjeet Singh