VMware Update Manager Becomes Self-Aware

March 4th, 2010 by jason

@Mikemohr on Twitter tonight said it best:

“Haven’t we learned from Hollywood what happens when the machines become self-aware?”

I got a good chuckle.  He took my comment of VMware becoming “self-aware” exactly where I wanted it to go: a reference to The Terminator series of films, in which a sophisticated computer defense system called Skynet becomes self-aware and things go downhill for mankind from there.

Metaphorically speaking in today’s case, Skynet is VMware vSphere and mankind is represented by VMware vSphere Administrators.

During an attempt to patch my ESX(i) 4 hosts, I received an error message.

At that point, the remediation task fails and the host is not patched.  The VUM log file reflects the same error in a little more detail:

[2010-03-04 14:58:04:690 ‘JobDispatcher’ 3020 INFO] [JobDispatcher, 1616] Scheduling task VciHostRemediateTask{675}
[2010-03-04 14:58:04:690 ‘JobDispatcher’ 3020 INFO] [JobDispatcher, 354] Starting task VciHostRemediateTask{675}
[2010-03-04 14:58:04:690 ‘VciHostRemediateTask.VciHostRemediateTask{675}’ 2676 INFO] [vciTaskBase, 534] Task started…
[2010-03-04 14:58:04:908 ‘VciHostRemediateTask.VciHostRemediateTask{675}’ 2676 INFO] [vciHostRemediateTask, 680] Host host-112 scheduled for patching.
[2010-03-04 14:58:05:127 ‘VciHostRemediateTask.VciHostRemediateTask{675}’ 2676 INFO] [vciHostRemediateTask, 691] Add remediate host: vim.HostSystem:host-112
[2010-03-04 14:58:13:987 ‘InventoryMonitor’ 2180 INFO] [InventoryMonitor, 427] ProcessUpdate, Enter, Update version := 15936
[2010-03-04 14:58:13:987 ‘InventoryMonitor’ 2180 INFO] [InventoryMonitor, 460] ProcessUpdate: object = vm-2642; type: vim.VirtualMachine; kind: 0
[2010-03-04 14:58:17:533 ‘VciHostRemediateTask.VciHostRemediateTask{675}’ 2676 WARN] [vciHostRemediateTask, 717] Skipping host solo.boche.mcse as it contains VM that is running VUM or VC inside it.
[2010-03-04 14:58:17:533 ‘VciHostRemediateTask.VciHostRemediateTask{675}’ 2676 INFO] [vciHostRemediateTask, 786] Skipping host 0BC5A140, none of upgrade and patching is supported.
[2010-03-04 14:58:17:533 ‘VciHostRemediateTask.VciHostRemediateTask{675}’ 2676 ERROR] [vciHostRemediateTask, 230] No supported Hosts found for Remediate.
[2010-03-04 14:58:17:737 ‘VciRemediateTask.RemediateTask{674}’ 2676 INFO] [vciTaskBase, 583] A subTask finished: VciHostRemediateTask{675}
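The WARN entry is the line that names the skipped host and the reason. If you would rather not eyeball the log, here is a minimal Python sketch for pulling those hosts out of it. It assumes the message wording matches the excerpt above; the function name and the regular expression are my own illustration, not part of any VMware tooling.

```python
import re

# Pattern assumed from the WARN line in the log excerpt above; other VUM
# builds may word the message differently.
SKIP_PATTERN = re.compile(
    r"Skipping host (?P<host>\S+) as it contains VM that is running VUM or VC inside it"
)


def find_skipped_hosts(log_lines):
    """Return host names VUM skipped because they run the VUM or vCenter VM."""
    skipped = []
    for line in log_lines:
        match = SKIP_PATTERN.search(line)
        if match and match.group("host") not in skipped:
            skipped.append(match.group("host"))
    return skipped


sample = [
    "[2010-03-04 14:58:17:533 'VciHostRemediateTask.VciHostRemediateTask{675}' 2676 WARN] "
    "[vciHostRemediateTask, 717] Skipping host solo.boche.mcse as it contains VM that is "
    "running VUM or VC inside it.",
    "[2010-03-04 14:58:17:533 'VciHostRemediateTask.VciHostRemediateTask{675}' 2676 ERROR] "
    "[vciHostRemediateTask, 230] No supported Hosts found for Remediate.",
]
print(find_skipped_hosts(sample))  # ['solo.boche.mcse']
```

To run it against a real log, pass an open file handle as `log_lines`; the location of the VUM log varies by install, so I have not hard-coded a path.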

Further testing in the lab revealed that this condition is triggered when the host is running a vCenter VM and/or a VMware Update Manager (VUM) VM. I understand from other colleagues on the Twitterverse that they’ve seen the same symptoms occur with patch staging.

The workaround is to manually place the host in maintenance mode, at which time it has no problem whatsoever evacuating all VMs, including infrastructure VMs.  At that point, the host in maintenance mode can be remediated.

VMware Update Manager has apparently become self-aware in that it detects when its infrastructure VMs are running on the same host hardware which is to be remediated.  Self-awareness in and of itself isn’t bad, however, its feature integration is.  Unfortunately for the humans, this is a step backwards in functionality and a reduction in efficiency for a task which was once automated.  Previously, a remediation task had no problem evacuating all VMs from a host, infrastructure or not. What we have now is… well… consider the following pre and post “self-awareness” remediation steps:

Pre “self-awareness” remediation for a 6 host cluster containing infrastructure VMs:

  1. Right click the cluster object and choose Remediate
  2. Hosts are automatically and sequentially placed in maintenance mode, evacuated, patched, rebooted, and brought out of maintenance mode

Post “self-awareness” remediation for a 6 host cluster containing infrastructure VMs:

  1. Right click Host1 object and choose Enter Maintenance Mode
  2. Wait for evacuation to complete
  3. Right click Host1 object and choose Remediate
  4. Wait for remediation to complete
  5. Right click Host1 object and choose Exit Maintenance Mode
  6. Right click Host2 object and choose Enter Maintenance Mode
  7. Wait for evacuation to complete
  8. Right click Host2 object and choose Remediate
  9. Wait for remediation to complete
  10. Right click Host2 object and choose Exit Maintenance Mode
  11. Right click Host3 object and choose Enter Maintenance Mode
  12. Wait for evacuation to complete
  13. Right click Host3 object and choose Remediate
  14. Wait for remediation to complete
  15. Right click Host3 object and choose Exit Maintenance Mode
  16. Right click Host4 object and choose Enter Maintenance Mode
  17. Wait for evacuation to complete
  18. Right click Host4 object and choose Remediate
  19. Wait for remediation to complete
  20. Right click Host4 object and choose Exit Maintenance Mode
  21. Right click Host5 object and choose Enter Maintenance Mode
  22. Wait for evacuation to complete
  23. Right click Host5 object and choose Remediate
  24. Wait for remediation to complete
  25. Right click Host5 object and choose Exit Maintenance Mode
  26. Right click Host6 object and choose Enter Maintenance Mode
  27. Wait for evacuation to complete
  28. Right click Host6 object and choose Remediate
  29. Wait for remediation to complete
  30. Right click Host6 object and choose Exit Maintenance Mode

It’s Saturday and your kids want to go to the park. Do the math.
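For anyone who wants the math done for them, here is a small Python sketch that generates the manual step list for any cluster size and compares it with the two-step cluster remediation. The function name is made up for illustration, and the five steps per host are simply the ones listed above.

```python
def manual_steps(host_count):
    """Build the click-by-click list for patching each host by hand."""
    steps = []
    for n in range(1, host_count + 1):
        host = f"Host{n}"
        steps.extend([
            f"Right click {host} object and choose Enter Maintenance Mode",
            "Wait for evacuation to complete",
            f"Right click {host} object and choose Remediate",
            "Wait for remediation to complete",
            f"Right click {host} object and choose Exit Maintenance Mode",
        ])
    return steps


AUTOMATED_STEPS = 2  # right click the cluster, then let VUM do the rest

for hosts in (3, 6, 12):
    print(hosts, "hosts:", AUTOMATED_STEPS, "steps before,",
          len(manual_steps(hosts)), "steps after")
```

The manual count grows by five steps per host, while the cluster-level remediation stays at two no matter how many hosts are in the cluster.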

Update 5/5/10: I received this response back on 3/5/10 from VMware but failed to follow up with finding out if it was ok to share with the public.  I’ve received the blessing now so here it is:

[It] seems pretty tactical to me. We’re still trying to determine if this was documented publicly, and if not, correct the documentation and our processes.

We introduced this behavior in vSphere 4.0 U1 as a partial fix for a particular class of problem. The original problem is in the behavior of the remediation wizard if the user has chosen to power off or suspend virtual machines in the Failure response option.

If a stand-alone host is running a VM with VC or VUM in it and the user has selected those options, the consequences can be drastic – you usually don’t want to shut down your VC or VUM server when the remediation is in progress. The same applies to a DRS disabled cluster.

In DRS enabled cluster, it is also possible that VMs could not be migrated to other hosts for configuration or other reasons, such as a VM with Fault Tolerance enabled. In all these scenarios, it was possible that we could power off or suspend running VMs based on the user selected option in the remediation wizard.

To avoid this scenario, we decided to skip those hosts totally in first place in U1 time frame. In a future version of VUM, it will try to evacuate the VMs first, and only in cases where it can’t migrate them will the host enter a failed remediation state.

One work around would be to remove such a host from its cluster, patch the cluster, move the host back into the cluster, manually migrate the VMs to an already patched host, and then patch the original host.

It would appear VMware intends to grant us back some flexibility in future versions of vCenter/VUM.  Let’s hope so. This implementation leaves much to be desired.

Update 5/6/10: LucD created a blog post titled Counter the self-aware VUM. In this blog post you’ll find a script which finds the ESX host(s) that is/are running the VUM guest and/or the vCenter guest and will vMotion the guest(s) to another ESX host when needed.
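LucD’s script is not reproduced here. Purely to illustrate the idea, below is a Python sketch of the selection logic over a plain dictionary. The inventory layout, the host and VM names, and the function name are all hypothetical, the placement rule (least-loaded host) is my own simplification, and nothing in it talks to vCenter.

```python
def plan_infrastructure_moves(inventory, host_to_patch, infrastructure_vms):
    """Return (vm, target_host) pairs that clear infrastructure VMs off a host.

    inventory maps host name -> list of VM names running there (hypothetical layout).
    """
    other_hosts = [h for h in inventory if h != host_to_patch]
    to_move = [vm for vm in inventory.get(host_to_patch, []) if vm in infrastructure_vms]
    if to_move and not other_hosts:
        raise ValueError("no other host available to receive the infrastructure VMs")
    # Simplest possible placement: the other host currently running the fewest VMs.
    load = {h: len(inventory[h]) for h in other_hosts}
    moves = []
    for vm in to_move:
        target = min(other_hosts, key=lambda h: load[h])
        moves.append((vm, target))
        load[target] += 1
    return moves


inventory = {
    "esx1": ["vcenter", "web01"],
    "esx2": ["vum", "db01", "app01"],
    "esx3": ["file01"],
}
print(plan_infrastructure_moves(inventory, "esx1", {"vcenter", "vum"}))
# [('vcenter', 'esx3')]
```

A host with no infrastructure VMs yields an empty plan, which is the case where VUM would remediate it without complaint.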


13 comments

  1. Greg says:

    See Ian Angell was right after all. The unintended consequences of computer systems will kill us all 🙂 If Sonny the I Robot ever runs my datacenter i’ll run for the hills.

  2. I’m with you on this, Jason. Wrote about it in my vSphere4 book. Anything that stops automation sucks IMHO. I’ve been manually VMotioning VC since Vi2 days, and letting DRS handle VC/VUM since Vi3.5. And then along comes vSphere4 and up comes this “self-awareness”. Quite why VMware changed this functionality is beyond me. There must be a good reason – that’s escaping both of us – because it’s so stupid for VUM not to move VC/VUM. It almost smacks of a lack of faith in VMotion…. which we all know is super reliable…

  3. duncan says:

    Although I fully agree that this shouldn’t be happening, I think the comment about enterprises asking for a refund doesn’t make sense.

    Most Enterprise companies use a separate Cluster for Management services and thus do not need to do 32 manual actions to update.

  4. jason says:

    Good morning Duncan,

    With the utmost respect, I think you are making an assumption about Enterprise customers which I am not in agreement with. In your view, most Enterprise customers are large, with large infrastructures and management clusters dedicated to managing the infrastructure. This would seem to indicate that Enterprise licensing is mostly for these types of large customers. I disagree. I think there are customers of all sizes who can and do benefit from Enterprise features. The last shop I came from has 8 ESX hosts (4 DEV, 4 PROD). They are an Enterprise licensing customer and they reap the rewards of Enterprise features such as DRS and VMotion. This environment has no management cluster. They are not using PowerShell. They are not pinning VMs. Their VMs move around seamlessly in the private cloud as they should.

    There exist a few workarounds that can be applied to the problem. One is a management cluster. Another is pinning VMs. A third is PowerShell scripting. All of these workarounds involve extra time and/or money which was not required in the past. VMware’s best practice now is to virtualize vCenter Server so the functionality and flexibility surrounding that choice is more important than ever. In my opinion, these are all workarounds which do not directly address the problem which has occurred. Putting on my customer hat, if I didn’t have these solutions in my shop previously, why should I be forced to reactively adopt them now? The marketing message from VMware is Efficiency, Control, Choice which I have become accustomed to. People may not use all of the features VMware offers, but at least the flexibility is there in the feature sets to be leveraged as needed. I can think of another hypervisor vendor that boldly makes design choices for its customers in its lack of a memory overcommit feature; they don’t offer the flexibility that VMware does. If VMware removes or breaks features a customer depends on, that creates friction and is contrary to VMware’s marketing message.

    I do apologize for the refund comment. It was merely satire. I do not expect customers to ask for a refund. However, this is a real world problem and VMware does have the stigma of being the most expensive virtualization solution sticker price. I think it’s fair to say that customers who buy a Cadillac should expect Cadillac quality and a vehicle that will not let them down. It’s a competitive market and VMware needs to continue delivering their “A” game. They’ve had some issues with quality control in the past. These issues should be eliminated as much as possible.

  5. duncan says:

    Again I fully agree that this should be solved but there is more to this than you are discussing here. Outside of your world there are a million other companies with different implementations which need to be helped as well when they are facing issues.

    I’m not saying that everyone should have a management cluster but you are insanely exaggerating here.
    What if I would just patch the first host, move vCenter & VUM to the first host and then do a Remediate on the Cluster? The first host would be skipped as it is already patched; the others would be patched without any manual intervention. So what would we be talking about? 5 minutes extra? I don’t know about you but I can spend those 5 minutes and still have a walk in the park with my kids.

    By the way, when I talk about enterprise customers I am not talking about customers who bought the enterprise license. I am talking about the size of the company! As you know I am not a sales guy, licensing is never part of my message.

  6. Chad Skinner says:

    I completely agree with Jason on this one, and actually ran into this problem two days before this post. I run a small VMware shop for a local government. We purchased the advanced acceleration kit and have three hosts with VC/VUM running as a VM. We purchased the advanced licensing primarily for HA and VMotion. On 2/28 I set about to patch my hosts with the patch that came out on 1/5, not knowing that the 3/3 patch would be coming out. This VUM self awareness caused a ton of unnecessary work to patch my three hosts, and seems completely unnecessary. Now that the patch released on 3/3 is out, I get to do that whole, more painful than it needs to be process all over again. Hopefully with 4.1 VMware will fix this problem.

  7. Chad Skinner says:

    Please disregard my last comment. When doing the patching on 2/28 I was still using the trial licensing, now that I have applied my advanced licensing I realize that because I now do not have DRS, the process is a much more manual one. So for anyone without DRS in their licensing the VUM self awareness does not add additional steps. I would love DRS, but the price bump between Advanced and Enterprise licensing is too much for my small budget to handle.

  8. Greg says:

    I tend to look at both points. On one hand I don’t see it causing a huge problem unless you let it. It is indeed possible to pin VC to a host after the first host update or add management clusters. The job of an admin is to ‘enable’ right, we constantly work around the shortcomings or ‘failures’ for the want of a better term of software/technology. Any admin worth his salt would find a way that works in their environment, within their configuration and their limitations, that’s their job right.

    That being said, I grow increasingly frustrated by software companies and particularly one very big one who make seemingly wrong assumptions (ok sometimes based on market research) on how customers use their software. When they push even large or even small changes to the product which break functionality or current process then it becomes a problem, especially without warning.

    Take MS Exchange 2007 + as an example. They now don’t provide the ability to create mailboxes from AD which they always did. Their argument was they believed that most Exchange helpdesk people didn’t have access to AD tools. The stupid thing is you still need AD to set some properties on an Exchange account, unless you are a posh guru. Now there is more than one way to skin a cat, but that small change caused a lot of gnashing of teeth and a change in process for a lot of small, medium and large businesses alike. I think the responsibility lies somewhere in between.

  9. Hi,

    well to me it is a bug, especially as the documentation says that it will move the vms if needed IIRC.

    If it needs to be a new feature necessary for some customers vmware should make it a configurable item.

    The old way was perfect for most customers and didn’t require any manual interaction concerning VUM or VC.

    VUM is already bad at handling errors, just adding another “error” to it which requires you to browse through log files to figure out the real reason doesn’t really help to make customers feel enthusiastic about it.

    Marcus

  10. jason says:

    Blog post has been updated with new information from VMware on 5/5/10.

  11. Loren says:

    It seems this is still occurring in vSphere 5.0U2. Any update on VMware’s plans to make VUM even more self-aware so we avoid this issue?
