Troubleshooting

Upgrading SRM 6.x to 6.x.x: service fails with the error “Failed to start service. VMware vCenter Site Recovery Manager service failed to start. Check that all required Windows services are running”

Hello People,

Lately I have been working mostly on products like Site Recovery Manager and vSphere Replication, so you will find that most of my posts are related to SRM and VR.

I would like to share a recent experience with a customer whose upgrade from SRM 6.1.1 to 6.1.2 kept failing. The installer registered the upgraded service with the PSC, but right at the end, when it was about to start the service, we hit the error quoted in the title.

As any good engineer would, I went straight to the logs in the path below. Every time I tried to restart the service, a backtrace was logged.

C:\ProgramData\VMware\VMware Site Recovery Manager\Logs\

vmware-dr.log

--> Panic: FAILURE:  "Deserialize failed for data item (persistence id: ##global##_pvmi.protected-vm-8094): std::exception 'class boost::archive::archive_exception' "input stream error"" @ d:/build/ob/bora-3884620/srm/src/jobs/jobs.cpp:304
--> Backtrace:
-->
--> [backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.1.1, build: build-3884620, tag: -
--> backtrace[00] vmacore.dll[0x001C568A]
--> backtrace[01] vmacore.dll[0x0005CA8F]
--> backtrace[02] vmacore.dll[0x0005DBDE]
--> backtrace[03] vmacore.dll[0x001D7405]
--> backtrace[04] vmacore.dll[0x001D74FD]
--> backtrace[05] vmacore.dll[0x0004D83C]
--> backtrace[06] dr-jobs.dll[0x00035DB7]
--> backtrace[07] MSVCR90.dll[0x00074830]
--> backtrace[08] MSVCR90.dll[0x00043B3C]
--> backtrace[09] ntdll.dll[0x00050C51]
--> backtrace[10] dr-jobs.dll[0x0000390C]
--> backtrace[11] dr-jobs.dll[0x00005408]
--> backtrace[12] dr-recovery.dll[0x0016F153]
--> backtrace[13] dr-recovery.dll[0x0016AC81]
--> backtrace[14] dr-recovery.dll[0x002B0488]
--> backtrace[15] dr-recovery.dll[0x002AEF33]
--> backtrace[16] dr-recovery.dll[0x002B545E]
--> backtrace[17] dr-recovery.dll[0x00031A19]
--> backtrace[18] functional.dll[0x00028089]
--> backtrace[19] vmacore.dll[0x00159CCE]
--> backtrace[20] vmacore.dll[0x0015D53F]
--> backtrace[21] vmacore.dll[0x0015EA91]
--> backtrace[22] vmacore.dll[0x001607C5]
--> backtrace[23] vmacore.dll[0x00065FEB]
--> backtrace[24] vmacore.dll[0x0015BC50]
--> backtrace[25] vmacore.dll[0x001D2A5B]
--> backtrace[26] MSVCR90.dll[0x00002FDF]
--> backtrace[27] MSVCR90.dll[0x00003080]
--> backtrace[28] kernel32.dll[0x0001652D]
--> backtrace[29] ntdll.dll[0x0002C541]
--> [backtrace end]
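
If the log is large, it can help to pull out just the panic lines and the data item they complain about. The snippet below is a minimal sketch of that idea in Python, not part of SRM or of the original procedure. The function name `find_deserialize_panics` and the regular expression are my own assumptions, based only on the shape of the panic line shown above, and the sample line is written with plain ASCII quotes.

```python
import re

# Matches a panic line of the shape seen in the excerpt above and captures
# the persistence id plus the source file and line number after the "@".
PANIC_RE = re.compile(
    r"Panic: FAILURE:.*?persistence id: (?P<pid>[^)]+)\)"
    r".*?@ (?P<src>\S+):(?P<line>\d+)\s*$"
)


def find_deserialize_panics(lines):
    """Return (persistence_id, source_file, line_number) for each panic line."""
    hits = []
    for text in lines:
        match = PANIC_RE.search(text)
        if match:
            hits.append(
                (match.group("pid"), match.group("src"), int(match.group("line")))
            )
    return hits


# Sample input copied from the log excerpt in this post.
SAMPLE = (
    '--> Panic: FAILURE:  "Deserialize failed for data item '
    "(persistence id: ##global##_pvmi.protected-vm-8094): "
    "std::exception 'class boost::archive::archive_exception' "
    '"input stream error"" @ d:/build/ob/bora-3884620/srm/src/jobs/jobs.cpp:304'
)

for hit in find_deserialize_panics([SAMPLE, "--> Backtrace:"]):
    print(hit)
```

Since the function accepts any iterable of lines, an open file object for the log can be passed in directly instead of the sample list.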

 

After some research I found the cause of the problem: corrupted records in the pdj_dataitem table of the SRM database.

In my case the customer was using the embedded Postgres DB, so I followed the procedure below to get rid of the stale records.

Note: Before proceeding, please consider the following points.

  • Take a backup of the database if it is external to SRM.
  • For the embedded DB, make sure you take a snapshot before you delete any records.
  • The upgrade halts while starting the service, so do not close the upgrade installer window. Minimize it, perform the steps below, then restore the installer window and click Retry.

Solution

I opened a command prompt and changed to the path below.

C:\ProgramData\VMware\VMware Site Recovery Manager Embedded Database\bin

I connected to the database with the command below. The connection details can be retrieved from the ODBC configuration.

psql -p 5678 -U user_name -d Databasename

Once connected, I ran the query below, which returned a couple of records.

select * from pdj_dataitem;

I then deleted those records by db_id, using the values returned by the query.

delete from pdj_dataitem where db_id=<value retrieved>;
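
When there are several records to remove, typing each delete by hand invites mistakes. The helper below is a rough sketch of how the statements could be generated for review before pasting them into psql. It is not part of the original procedure: the function name `build_cleanup_sql` is made up for illustration, the example ids 17 and 42 are placeholders, and it assumes db_id is an integer column, which you should confirm from your own select output.

```python
def build_cleanup_sql(db_ids, table="pdj_dataitem"):
    """Build a reviewable SQL script that deletes the given db_id values.

    Assumption: db_id is an integer column. int() raises ValueError for
    non-numeric strings, so stray text never ends up in the statement.
    """
    ids = sorted({int(value) for value in db_ids})
    if not ids:
        raise ValueError("no db_id values given")
    id_list = ", ".join(str(i) for i in ids)
    return "\n".join(
        [
            "BEGIN;",
            # Show the rows first so they can be checked before deletion.
            f"SELECT * FROM {table} WHERE db_id IN ({id_list});",
            f"DELETE FROM {table} WHERE db_id IN ({id_list});",
            "-- Review the output above, then run COMMIT; or ROLLBACK;",
        ]
    )


# Placeholder ids; use the db_id values from your own query result.
print(build_cleanup_sql([42, 17]))
```

Wrapping the delete in an explicit transaction is a design choice of this sketch: nothing is permanent until COMMIT is issued, so a wrong id can still be undone with ROLLBACK.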

 

Since the upgrade installer was still open, I clicked Retry to start the service. The service started successfully and SRM was upgraded to 6.1.2.

I hope this article was helpful. Watch out for more.

Ritesh Shenoy
Hey, my name is Ritesh Shenoy and I work as a Senior Consultant for SAP. The goal of this blog is to contribute to the VMware community and make one's life better with the necessary content in place!

1 Comment

  1. Thank you, Ritesh, this got us through a failed planned migration.

    We were running 6.1.1 and came across an unknown issue while executing a planned migration that caused the SRM service to panic.

    2018-02-03T11:23:56.969Z [03592 verbose 'Recovery' ctxID=xyz opID=xyz Found VM setting mappings; planDbId: 0, # of mappings: 0, # of protected VMs in the plan: 1
    2018-02-03T11:23:57.141Z [03592 panic 'Default' ctxID=xyz opID=xyz
    -->
    --> Panic: VERIFY d:/build/ob/bora-4535903/srm/src/recovery/settings/vmRecoverySettingsRepository.cpp:1291
    -->
    --> Backtrace:

    This is supposedly caused by duplicated UUIDs in the recovery plan (https://kb.vmware.com/s/article/2145559); however, in our environment this wasn't the case.

    The only solution VMware could offer was to attempt an upgrade to 6.1.2, which then failed with the error above.

    For us, the items in the table mentioned above were all linked to the recovery plan that was causing the SRM panic on version 6.1.1.

    So I suspect that, had we known about this table and cleared it, the plan would have executed successfully.

    Live and learn 🙂
