@Mikemohr on Twitter tonight said it best:
“Haven’t we learned from Hollywood what happens when the machines become self-aware?”
I got a good chuckle. He took my comment about VMware becoming “self-aware” exactly where I wanted it to go: a reference to The Terminator series of films, in which a sophisticated computer defense system called Skynet becomes self-aware and things go downhill for mankind from there.
Metaphorically speaking, in today’s case Skynet is VMware vSphere and mankind is represented by VMware vSphere administrators.
During an attempt to patch my ESX(i)4 hosts, I received an error message (click the image for a larger version):

At that point, the remediation task fails and the host is not patched. The VUM log file reflects the same error in a little more detail:
[2010-03-04 14:58:04:690 ‘JobDispatcher’ 3020 INFO] [JobDispatcher, 1616] Scheduling task VciHostRemediateTask{675}
[2010-03-04 14:58:04:690 ‘JobDispatcher’ 3020 INFO] [JobDispatcher, 354] Starting task VciHostRemediateTask{675}
[2010-03-04 14:58:04:690 ‘VciHostRemediateTask.VciHostRemediateTask{675}’ 2676 INFO] [vciTaskBase, 534] Task started…
[2010-03-04 14:58:04:908 ‘VciHostRemediateTask.VciHostRemediateTask{675}’ 2676 INFO] [vciHostRemediateTask, 680] Host host-112 scheduled for patching.
[2010-03-04 14:58:05:127 ‘VciHostRemediateTask.VciHostRemediateTask{675}’ 2676 INFO] [vciHostRemediateTask, 691] Add remediate host: vim.HostSystem:host-112
[2010-03-04 14:58:13:987 ‘InventoryMonitor’ 2180 INFO] [InventoryMonitor, 427] ProcessUpdate, Enter, Update version := 15936
[2010-03-04 14:58:13:987 ‘InventoryMonitor’ 2180 INFO] [InventoryMonitor, 460] ProcessUpdate: object = vm-2642; type: vim.VirtualMachine; kind: 0
[2010-03-04 14:58:17:533 ‘VciHostRemediateTask.VciHostRemediateTask{675}’ 2676 WARN] [vciHostRemediateTask, 717] Skipping host solo.boche.mcse as it contains VM that is running VUM or VC inside it.
[2010-03-04 14:58:17:533 ‘VciHostRemediateTask.VciHostRemediateTask{675}’ 2676 INFO] [vciHostRemediateTask, 786] Skipping host 0BC5A140, none of upgrade and patching is supported.
[2010-03-04 14:58:17:533 ‘VciHostRemediateTask.VciHostRemediateTask{675}’ 2676 ERROR] [vciHostRemediateTask, 230] No supported Hosts found for Remediate.
[2010-03-04 14:58:17:737 ‘VciRemediateTask.RemediateTask{674}’ 2676 INFO] [vciTaskBase, 583] A subTask finished: VciHostRemediateTask{675}
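The telling line is the WARN entry about skipping the host. If you want to confirm whether a failed remediation hit this same condition, a short Python sketch can scan a VUM log for the skip warning. The log path and exact message format are assumptions based on the excerpt above; adjust the pattern for other VUM versions:

```python
import re

# Matches the VUM "Skipping host" warning shown in the log excerpt above.
# The message wording is assumed from this excerpt and may differ between builds.
SKIP_PATTERN = re.compile(
    r"WARN\].*Skipping host (\S+) as it contains VM that is running VUM or VC"
)

def find_skipped_hosts(log_lines):
    """Return the names of hosts VUM refused to remediate."""
    return [m.group(1) for line in log_lines if (m := SKIP_PATTERN.search(line))]

sample = [
    "[2010-03-04 14:58:17:533 'VciHostRemediateTask' 2676 WARN] "
    "[vciHostRemediateTask, 717] Skipping host solo.boche.mcse as it "
    "contains VM that is running VUM or VC inside it.",
]
print(find_skipped_hosts(sample))  # ['solo.boche.mcse']
```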
Further testing in the lab revealed that this condition is triggered by a vCenter VM and/or a VMware Update Manager (VUM) VM running on the host being remediated. I understand from other colleagues on the Twitterverse that they’ve seen the same symptoms occur with patch staging.
The workaround is to manually place the host in maintenance mode, at which time it has no problem whatsoever evacuating all VMs, including the infrastructure VMs. Once in maintenance mode, the host can be remediated.
VMware Update Manager has apparently become self-aware in that it detects when its infrastructure VMs are running on the same host hardware which is to be remediated. Self-awareness in and of itself isn’t bad, however, its feature integration is. Unfortunately for the humans, this is a step backwards in functionality and a reduction in efficiency for a task which was once automated. Previously, a remediation task had no problem evacuating all VMs from a host, infrastructure or not. What we have now is… well… consider the following pre and post “self-awareness” remediation steps:
Pre “self-awareness” remediation for a 6 host cluster containing infrastructure VMs:
- Right click the cluster object and choose Remediate
- Hosts are automatically and sequentially placed in maintenance mode, evacuated, patched, rebooted, and brought out of maintenance mode
Post “self-awareness” remediation for a 6 host cluster containing infrastructure VMs:
- Right click Host1 object and choose Enter Maintenance Mode
- Wait for evacuation to complete
- Right click Host1 object and choose Remediate
- Wait for remediation to complete
- Right click Host1 object and choose Exit Maintenance Mode
- Right click Host2 object and choose Enter Maintenance Mode
- Wait for evacuation to complete
- Right click Host2 object and choose Remediate
- Wait for remediation to complete
- Right click Host2 object and choose Exit Maintenance Mode
- Right click Host3 object and choose Enter Maintenance Mode
- Wait for evacuation to complete
- Right click Host3 object and choose Remediate
- Wait for remediation to complete
- Right click Host3 object and choose Exit Maintenance Mode
- Right click Host4 object and choose Enter Maintenance Mode
- Wait for evacuation to complete
- Right click Host4 object and choose Remediate
- Wait for remediation to complete
- Right click Host4 object and choose Exit Maintenance Mode
- Right click Host5 object and choose Enter Maintenance Mode
- Wait for evacuation to complete
- Right click Host5 object and choose Remediate
- Wait for remediation to complete
- Right click Host5 object and choose Exit Maintenance Mode
- Right click Host6 object and choose Enter Maintenance Mode
- Wait for evacuation to complete
- Right click Host6 object and choose Remediate
- Wait for remediation to complete
- Right click Host6 object and choose Exit Maintenance Mode
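To put a number on the manual loop above, here is a quick Python simulation of the post-“self-awareness” workflow. The host names and step descriptions are stand-ins for illustration, not vSphere API or PowerCLI calls:

```python
# Simulation of the manual per-host remediation loop described above.
# Host names and steps are hypothetical stand-ins, not actual vSphere API calls.

def remediate_cluster_manually(hosts):
    """Walk each host through the five manual steps VUM used to automate."""
    log = []
    for host in hosts:
        log.append(f"{host}: enter maintenance mode")
        log.append(f"{host}: wait for evacuation")
        log.append(f"{host}: remediate")
        log.append(f"{host}: wait for remediation")
        log.append(f"{host}: exit maintenance mode")
    return log

steps = remediate_cluster_manually([f"Host{i}" for i in range(1, 7)])
print(len(steps))  # 30 manual steps for a 6 host cluster, versus one right-click before
```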
It’s Saturday and your kids want to go to the park. Do the math.
Update 5/5/10: I received this response from VMware on 3/5/10 but failed to follow up to find out whether it was OK to share publicly. I’ve received the blessing now, so here it is:
[It] seems pretty tactical to me. We’re still trying to determine if this was documented publicly, and if not, correct the documentation and our processes.
We introduced this behavior in vSphere 4.0 U1 as a partial fix for a particular class of problem. The original problem is in the behavior of the remediation wizard if the user has chosen to power off or suspend virtual machines in the Failure response option.
If a stand-alone host is running a VM with VC or VUM in it and the user has selected those options, the consequences can be drastic – you usually don’t want to shut down your VC or VUM server when the remediation is in progress. The same applies to a DRS disabled cluster.
In DRS enabled cluster, it is also possible that VMs could not be migrated to other hosts for configuration or other reasons, such as a VM with Fault Tolerance enabled. In all these scenarios, it was possible that we could power off or suspend running VMs based on the user selected option in the remediation wizard.
To avoid this scenario, we decided to skip those hosts totally in first place in U1 time frame. In a future version of VUM, it will try to evacuate the VMs first, and only in cases where it can’t migrate them will the host enter a failed remediation state.
One work around would be to remove such a host from its cluster, patch the cluster, move the host back into the cluster, manually migrate the VMs to an already patched host, and then patch the original host.
It would appear VMware intends to grant us back some flexibility in future versions of vCenter/VUM. Let’s hope so. This implementation leaves much to be desired.
Update 5/6/10: LucD created a blog post titled Counter the self-aware VUM. In it you’ll find a script that finds the ESX host(s) running the VUM and/or vCenter guests and vMotions those guests to another ESX host when needed.
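The core logic of such a script boils down to: given an inventory of which VMs run on which host, find the host carrying the VUM or vCenter guest and pick a different host as the vMotion target. A minimal Python sketch of that planning step follows; the inventory structure, host names, and VM names are all hypothetical, and this is not LucD’s actual code (his is PowerCLI):

```python
def plan_evacuation(inventory, infra_vms):
    """For each infrastructure VM, find its current host and a vMotion target.

    inventory: dict mapping host name -> list of VM names (hypothetical structure).
    Returns a list of (vm, source_host, target_host) moves.
    """
    moves = []
    for vm in infra_vms:
        # Locate the host currently running this VM, if any.
        source = next((h for h, vms in inventory.items() if vm in vms), None)
        if source is None:
            continue  # VM not present in the inventory
        # Pick any other host as the target; a real script would also
        # check capacity, DRS rules, and vMotion compatibility.
        target = next((h for h in inventory if h != source), None)
        if target:
            moves.append((vm, source, target))
    return moves

inventory = {"esx1": ["vum01", "web01"], "esx2": ["db01"]}
print(plan_evacuation(inventory, ["vum01", "vc01"]))
# [('vum01', 'esx1', 'esx2')]
```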