Site Recovery Manager service fails with “Win32 exception: Access Violation” message in the vmware-dr.log
Hi all, I am happy to announce that I have earned the vExpert title for two consecutive years. It feels great and keeps me motivated to blog and contribute to the VMware community.
Today’s article covers a tricky but interesting case. My customer reported that, with no recent changes in the environment, the SRM service at the primary site had stopped running and stayed down. Since I always trust vmware-dr.log, I went ahead and examined the backtrace it was generating.
To my surprise, this was the first time I had encountered a backtrace like this, with no clear parameters defined.
[02784 trivia 'ThreadPool'] ThreadPool[idle:31, busy_io:0, busy_long:2] HandleWork(type: 0, fun: class boost::_bi::bind_t<struct boost::_bi::unspecified,class boost::function<void __cdecl(class boost::system::error_code const & __ptr64,unsigned __int64)>,class boost::_bi::list2<class boost::_bi::value<class boost::system::error_code>,class boost::_bi::value<unsigned __int64> > >)
[02784 trivia 'ThreadPool'] HandleWork() leaving
[10064 panic 'Default' opID=cb1f5d0]
--> Panic: Win32 exception: Access Violation (0xc0000005)
--> Write (1) at address 0000000000000024
--> rip: 0000000077c52964 rsp: 000000000c8be0f0 rbp: 0000000000000000
--> rax: 0000000000000000 rbx: 0000000006a453e8 rcx: 00000000fffffffc
--> rdx: 00000000000004b0 rdi: 0000000000000000 rsi: 00000000000004b0
--> r8: 000000000c8be0a8 r9: 0000000000000004 r10: 0000000000000000
--> r11: 0000000000000246 r12: 0000000000000000 r13: 0000000000000000
--> r14: 000007fffff74000 r15: 0000000000000000
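If you need to hunt for entries like this in a large vmware-dr.log, a small script can pull out just the panic blocks. The following is a minimal Python sketch, not a VMware tool: the function name extract_panics is made up for illustration, and the line layout (a bracketed header with thread ID and level, followed by arrow-prefixed continuation lines) is assumed purely from the excerpt above.

```python
import re

# Entry header, e.g. "[10064 panic 'Default' opID=cb1f5d0]".
# Leading text before the bracket (such as a timestamp) is tolerated.
# This layout is an assumption based on the excerpt in this article.
ENTRY_RE = re.compile(r"^[^\[]*\[(?P<tid>\d+)\s+(?P<level>\w+)\s")

# Continuation lines start with "-->"; some renderings show it as "–>".
CONT_RE = re.compile(r"^\s*(?:-->|–>)\s?(?P<text>.*)$")


def extract_panics(lines):
    """Return one dict per panic entry: thread id, header line, detail lines."""
    panics = []
    current = None  # the panic entry currently being collected, if any
    for raw in lines:
        line = raw.rstrip("\r\n")
        cont = CONT_RE.match(line)
        if cont:
            # Keep continuation lines only when they belong to a panic entry.
            if current is not None:
                current["details"].append(cont.group("text"))
            continue
        header = ENTRY_RE.match(line)
        if header:
            current = None
            if header.group("level") == "panic":
                current = {
                    "tid": header.group("tid"),
                    "header": line,
                    "details": [],
                }
                panics.append(current)
    return panics


if __name__ == "__main__":
    sample = [
        "[02784 trivia 'ThreadPool'] HandleWork() leaving",
        "[10064 panic 'Default' opID=cb1f5d0]",
        "--> Panic: Win32 exception: Access Violation (0xc0000005)",
        "--> Write (1) at address 0000000000000024",
    ]
    for panic in extract_panics(sample):
        print(panic["tid"], panic["details"])
```

To run it against a real file, pass an open file handle (for example, open("vmware-dr.log", encoding="utf-8", errors="replace")) instead of the sample list.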
Enabling trivia-level logging did not reveal any further detail about the cause of the issue either.
I then searched the internal bugs to see whether any other customer had hit a similar backtrace. Luckily, I managed to find one.
Guess what the cause of the issue turned out to be?
The customer had a VM in a protection group that was running on 80 snapshots. We have seen this behavior when a VM with more than 32 snapshots is protected by SRM.
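Snapshots form a tree, so counting them means walking every branch, not just the top level. Here is a minimal Python sketch of that count, assuming the tree has already been exported as nested dictionaries with a "children" list. That structure and the names count_snapshots and exceeds_limit are hypothetical; the limit of 32 is simply the figure quoted in this article, not an official constant.

```python
SNAPSHOT_LIMIT = 32  # figure quoted in this article, not an official constant


def count_snapshots(nodes):
    """Count every snapshot in a tree of nested dicts with a 'children' list."""
    total = 0
    for node in nodes:
        # One for this snapshot, plus everything below it.
        total += 1 + count_snapshots(node.get("children", []))
    return total


def exceeds_limit(root_snapshots, limit=SNAPSHOT_LIMIT):
    """Return True when the tree holds more snapshots than the limit."""
    return count_snapshots(root_snapshots) > limit


if __name__ == "__main__":
    # Hypothetical tree: one root snapshot with two children.
    tree = [{"name": "base", "children": [{"name": "a"}, {"name": "b"}]}]
    print(count_snapshots(tree), exceeds_limit(tree))
```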
The customer had a PowerShell script that reports the VMs running on snapshots. We checked its output against the list of VMs protected by SRM.
We then consolidated the snapshots and restarted the SRM service. The service started successfully, and the customer was greatly relieved.
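The customer's script is not reproduced here, but the cross-check itself is easy to sketch. The Python below assumes you already have two inputs: a mapping of VM name to snapshot count (for example, exported from a snapshot report) and a list of VM names protected by SRM. The function name risky_protected_vms and the sample VM names are hypothetical, and the default limit of 32 is the figure mentioned in this article.

```python
def risky_protected_vms(snapshot_counts, protected_vms, limit=32):
    """Return (name, count) pairs for protected VMs above the limit, highest first."""
    # Compare names case-insensitively, since the two lists may come
    # from different tools with different capitalization.
    protected = {name.strip().lower() for name in protected_vms}
    hits = [
        (name, count)
        for name, count in snapshot_counts.items()
        if name.strip().lower() in protected and count > limit
    ]
    # Sort by snapshot count (descending), then by name for stable output.
    return sorted(hits, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical sample data.
    counts = {"app01": 80, "db01": 3, "web02": 40, "test09": 50}
    protected = ["APP01", "db01", "web02"]
    for name, count in risky_protected_vms(counts, protected):
        print(name, count)
```

VMs that carry many snapshots but are not protected by SRM (test09 in the sample) are deliberately left out, since only protected VMs matter for this issue.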
I hope this article was helpful. Watch out for more.