In the past couple of weeks I have spent some time with VMware vCloud Director 1.5. The experience yielded three blog articles: Collecting diagnostic information for VMware vCloud Director, Expanding vCloud Director Transfer Server Storage, and now this one.
In this round, the vCD cell stopped working properly (this is a single-cell environment). I could log into the vCD provider and organization portals, but vApp deployments would run for an abnormally long time and then fail after 20 minutes. One of the resulting failure messages was "Failed to receive status for task".
Doing some digging in the environment I found a few problems.
Problem #1: None of the cells have a vCenter proxy service running on the cell server.
Problem #2: Performing a Reconnect on the vCenter Server object resulted in "Error performing operation" and "Unable to find the cell running this listener".
I searched the Community Forums, talked with Chris Colotti (read his blog) for a bit, and then opened an SR with VMware support. VMware sent me a procedure along with a script to run on the Microsoft SQL Server:
- BACKUP the entire SQL Database.
- Stop all cells. (service vmware-vcd stop)
- Run the attached reset_qrtz_tables_sql_database.sql
-- shutdown all cells before executing
delete from qrtz_scheduler_state
delete from qrtz_fired_triggers
delete from qrtz_paused_trigger_grps
delete from qrtz_calendars
delete from qrtz_trigger_listeners
delete from qrtz_blob_triggers
delete from qrtz_cron_triggers
delete from qrtz_simple_triggers
delete from qrtz_triggers
delete from qrtz_job_listeners
delete from qrtz_job_details
- Start one cell and verify if issue is resolved. (service vmware-vcd start)
- Start the remaining cells.
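The backup in the first step can be taken from a query window before anything else is touched. This is a minimal sketch, not part of the procedure VMware sent: the database name vcloud and the backup path are placeholders I am assuming for illustration.

```sql
-- Assumption: [vcloud] is a placeholder for the actual vCD database name.
-- Assumption: the target folder exists and the SQL Server service account can write to it.
BACKUP DATABASE [vcloud]
TO DISK = N'D:\Backups\vcloud_before_qrtz_reset.bak'
WITH COPY_ONLY,   -- do not disturb the regular backup chain
     INIT,        -- overwrite any existing backup sets in this file
     CHECKSUM;    -- verify page checksums and write a checksum for the backup
```

COPY_ONLY is used here so that a one-off safety backup does not interfere with whatever scheduled backup sequence is already in place.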
Before running the script I knew I had to make a few modifications to select the vCloud database first.
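That modification amounts to putting a database selection at the top of the script. A sketch, assuming the database is named vcloud (a placeholder; use whatever name was chosen when the database was created):

```sql
-- Assumption: [vcloud] is a placeholder for the actual vCD database name.
USE [vcloud];
GO  -- batch separator understood by SSMS and sqlcmd, not a T-SQL statement

-- Optional sanity check: confirm the session points at the intended database
-- before any delete statement runs.
SELECT DB_NAME() AS current_database;
```

Without this, the delete statements run against whatever database the connection defaults to, which is often master.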
When I ran the script, it failed because the table names are case sensitive. Upon installation, vCD creates all tables with upper case names. When the MS SQL Server database was first created by yours truly, case sensitivity, along with accent sensitivity, was enabled with COLLATE Latin1_General_CS_AS, which comes straight from page 17 of the vCloud Director Installation and Configuration Guide.
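Both facts can be checked before editing the script. The queries below are a sketch under the same assumption as before (vcloud is a placeholder database name); they only read metadata and change nothing.

```sql
-- Assumption: [vcloud] is a placeholder for the actual vCD database name.
USE [vcloud];

-- Show the database collation; "_CS_" in the name means case sensitive.
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS database_collation;

-- List the Quartz tables with the exact casing they were created with.
-- UPPER() makes the filter match regardless of how the names are cased.
SELECT name
FROM sys.tables
WHERE UPPER(name) LIKE 'QRTZ[_]%'
ORDER BY name;
```

The names returned by the second query are the ones to use verbatim in the delete statements.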
After fixing the script, it looked like this:
-- shutdown all cells before executing
delete from QRTZ_SCHEDULER_STATE
delete from QRTZ_FIRED_TRIGGERS
delete from QRTZ_PAUSED_TRIGGER_GRPS
delete from QRTZ_CALENDARS
delete from QRTZ_TRIGGER_LISTENERS
delete from QRTZ_BLOB_TRIGGERS
delete from QRTZ_CRON_TRIGGERS
delete from QRTZ_SIMPLE_TRIGGERS
delete from QRTZ_TRIGGERS
delete from QRTZ_JOB_LISTENERS
delete from QRTZ_JOB_DETAILS
The script ran successfully, wiping out all rows in each of the named tables. A little sidebar discussion here: I talked with @sqlchicken (Jorge Segarra, read his blog here) about the delete from statements in the script. It is sometimes a best practice to use the truncate table statement instead. The delete from statement removes rows one at a time and records every deleted row in the transaction log, which makes it more resource intensive, whereas truncate table deallocates the table's data pages and logs only those deallocations. Thank you for that insight Jorge! More on MS SQL Delete vs Truncate here. Jorge was also kind enough to provide a link on the subject matter but credentials will be required to view the content.
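To make the difference concrete, here is a small sketch that uses a throwaway temporary table. The table #qrtz_demo and its columns are invented for illustration and have nothing to do with the vCD schema.

```sql
-- Throwaway temp table, used only to contrast the two statements.
CREATE TABLE #qrtz_demo (
    id   INT         NOT NULL PRIMARY KEY,
    note VARCHAR(50) NOT NULL
);

INSERT INTO #qrtz_demo (id, note) VALUES (1, 'first');
INSERT INTO #qrtz_demo (id, note) VALUES (2, 'second');

-- DELETE removes rows one at a time and logs each removed row.
DELETE FROM #qrtz_demo;
SELECT @@ROWCOUNT AS rows_deleted;  -- row count reported by the DELETE above

INSERT INTO #qrtz_demo (id, note) VALUES (3, 'third');

-- TRUNCATE TABLE deallocates the data pages and logs only the deallocations.
TRUNCATE TABLE #qrtz_demo;
SELECT COUNT(*) AS rows_remaining FROM #qrtz_demo;

DROP TABLE #qrtz_demo;
```

One caveat worth knowing: SQL Server refuses to truncate a table that is referenced by a FOREIGN KEY constraint, so whether truncate table could replace every delete from in the support script depends on how the qrtz tables reference each other, which I have not verified here.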
I was now able to restart the vCD cell and my problems were gone. Everything was working again and all errors had vanished. I thanked the VMware support staff and then tried to learn a little more about why deleting table rows resolved the problem, and what exactly the qrtz tables are. I had looked at the table rows myself before they were deleted and the information in there didn’t make a lot of sense to me (but that doesn’t necessarily classify it as transient data). This is what the support engineer had to say:
These [vCenter Proxy Service] issues are usually caused by a disconnect from the database, causing the tables to become stale. vCD constantly needs the ability to write to the database and when it cannot, the cell ends up in a state that is similar to the one that you have seen.
The qrtz tables contain information that controls the coordinator service, and lets it know when the coordinator is to be dropped and restarted, for cell to cell fail over to another cell in a multi cell environment.
When the tables are purged it forces the cell on start up to recheck its status and start the coordinator service. In your situation the cell, due to corrupt records in the table was not allowing this to happen.
So clearing them forced the cell to recheck and to restart the coordinator.
Good information to know going forward. I’m going to keep this in my back pocket. Or on my blog as it were. Have a great weekend!