SQL max worker threads Problem When Using VSS To Backup Numerous Databases

In our ongoing (sort-of pilot) migration from VMware vSphere 5.5 to Microsoft Hyper-V 2012 R2, we encountered a very concerning and puzzling issue with backups. The transition had been smooth for the most part and we used the project to bring aging Windows/SQL 2008 servers up to 2012 R2 and 2014, respectively. Two of our SQL environments had moved over just fine and were being backed up successfully with Microsoft Data Protection Manager 2012 R2 for the time being (other products are being considered, including Veeam). The third of such SQL environments ran into a host of VSS errors once its data was populated and a backup attempted.

sqlvss_dpmfailed
DPM 2012 R2 – Job Failed

Background (before/after):

  • Hypervisor: vSphere 5.5 to Hyper-V 2012 R2
  • Guest OS: Windows Server 2008 to 2014
  • Backup product: EMC Avamar 7.0.1 to MS DPM 2012 R2
  • Backup method: Crash-consistent image to VSS-quiesced image

 

We had seen an occasional VSS-related backup failure from time to time in DPM, but most were tied to available disk space for the protection group (DPM doesn’t do so well with deduplication of images, so growing has been near-continual). Retrying didn’t make a difference this time, though. We restarted VSS writers and even took downtime to restart the VM. Still the same failure.

Digging further into the event logs, we accumulated these:

Hyper-V host: Event 10172, Hyper-V-VMMS
Hyper-V host: Event 10172, Hyper-V-VMMS

 

SQL guest: Event 8229, VSS
SQL guest: Event 8229, VSS

 

SQL guest: Event 18210, MSSQLSERVER
SQL guest: Event 18210, MSSQLSERVER

 

SQL guest: Event 24583, SQLWRITER
SQL guest: Event 24583, SQLWRITER

 

SQL guest: Event 1, SQLVDI
SQL guest: Event 1, SQLVDI

 

sqlvss_vssadminIn search for the answer, we came across MS KB 2615182, a Social TechNet thread about SQLVDI errors, and finally a blog about “volume shadow barfs with 0x80040e14 code”. The KB would have helped us narrow down the cause as SQL, but it’s a production server (so no stopping the SQL instance), and we already knew SQL was the issue thanks to “vssadmin list writers”. The TechNet thread seemed initially promising, but wasn’t exactly on track for us and also involved stopping SQL (to re-register SQLVDI.dll). Our third (more like 50th) find was the colorfully-named blog with one of our error codes.

We had most of his errors, but didn’t quite have the glaring message about “Cannot create worker thread” in our events (that we could find). Of course, it could be deep in there, but the SQL server in question happens to have 276 databases attached, so the errors are prolific.

Hopping over to MSDN, we looked up the automatically set value for max worker threads in SQL 2014 on a 2-CPU 64-bit server. That value is 512. We resorted more to speculation at this point, but given that 276 databases were attached, SQL background and agent processes also were running, and VSS would need to grab at least 276 threads of its own for the backup, we took a stab in the dark that we might be blowing that 512 cap. Plus, changing the value posed no real risk or service restarts (woot!).

sqlvss_sqlmaxBeing conservative as we try to be most of the time, we raised it to 1024 and clicked OK. Then we click “Resume Backup” on the critical event in the DPM Monitoring console. Finally, we rejoiced to see VSS snaps succeeding, SQL backups happening, and data transferring into DPM.

Event IDs, error codes, and keywords:

  • Event 1, SQLVDI, SQLVDIMemoryName
  • Event 8229, VSS writer, non-transient error, 0x800423f4, SqlServerWriter
  • Event 10172, Hyper-V-VMMS, BackupComplete, Catastrophic failure, 0x8000FFF
  • Event 18210, MSSQLSERVER, BackupIoRequest, ReportIoError, Operating system error 995
  • Event 24583, SQLWRITER, Sqllib, OLEDB, 0x80040e14, SQLSTATE 42000, Native Error 3013, Native Error 3202, BACKUP DATABASE
  • Maximum worker threads
  • Non-retryable error, State [8] Failed

——————————————————

By Chris Gurley, MCSE, CCNA
Last updated: June 5, 2014

2 Comments

  1. Glad to know the post helped. If you are not getting an error related to worker exhaustion, you should check the waits in the sys.dm_os_wait_stats output to determine if you see high waittimes for THREADPOOL waittype. This will also indicate that you might be running out of worker threads.

    June 9, 2014
    Reply
    • Chris said:

      Thanks, Amit. I’ll check that out if we hit this again or want to rein in the allocated worker threads.

      June 9, 2014
      Reply

Leave a Reply