Posts Tagged ‘Rant’

vSphere Virtual Machine Performance Counters Integration into Perfmon

July 8th, 2009

VMware introduced the VMware Descheduled Time Accounting Service as a new VMware Tools component in ESX 3.0. The goal was to account for the inconsistent CPU cycles the VMkernel allocates to a guest VM, so that standard performance monitoring tools within the guest could report accurate statistics. Although the service was not installed or enabled by default with VMware Tools, and it never escaped the bonds of experimental support status, I found it to be both stable and reliable, and it was a standard installation component in one of my production datacenters. One caveat: the service only supported uniprocessor guest VMs with a single vCPU.

The VMware Descheduled Time Accounting Service was deprecated in VMware vSphere. More accurately, it was sort of replaced by a new vSphere feature called Virtual Machine Performance Counters (Integrated into Perfmon). To quote VMware:

“Virtual Machine Performance Counters Integration into Perfmon — vSphere 4.0 introduces the integration of virtual machine performance counters such as CPU and memory into Perfmon for Microsoft Windows guest operating systems when VMware Tools is installed. With this feature, virtual machine owners can do accurate performance analysis within the guest operating system. See the vSphere Client Online Help.”

The vSphere Client Online Help has this to say about Virtual Machine Performance:

“In a virtualized environment, physical resources are shared among multiple virtual machines. Some virtualization processes dynamically allocate available resources depending on the status, or utilization rates, of virtual machines in the environment. This can make obtaining accurate information about the resource utilization (CPU utilization, in particular) of individual virtual machines, or applications running within virtual machines, difficult. VMware now provides virtual machine-specific performance counter libraries for the Windows Performance utility. Application administrators can view accurate virtual machine resource utilization statistics from within the guest operating system’s Windows Performance utility.”

Did you notice the explicit statement about Perfmon? Perfmon is Microsoft Windows Performance Monitor, or perfmon.exe for short. Whereas the legacy VMware Descheduled Time Accounting Service supported both Windows and Linux guest VMs, its successor currently supports Perfmon, and thus Windows guest VMs, only. It seems we’ve gone backwards in functionality from a Linux guest VM perspective. Another pie in the face for shops with Linux guest VMs.


I understand that Windows guest VMs are the low hanging fruit for software development and features, but VMware needs to make sure some love is spread through the land of Linux as well. Folks with Linux shops are still struggling with basic concepts such as Linux guest customization, as well as flexibility and automation of VMware Tools installation in the Linux guest OS. If VMware is going to tout its support for Linux guest VMs, I’d like to see more of a commitment than what is currently being offered. There’s more to owning a virtualized infrastructure than powering on instances on top of a hypervisor. Building it is the easy part. Managing it can be much more difficult without the right tools. Flexibility and ease of use in the management tools are critical, especially as virtual infrastructures grow.


So, taking a look at a VMware vSphere Windows VM with current VMware Tools, I launched Perfmon. The installation of VMware Tools installs two new Performance Objects along with various associated counters:

  • VM Memory
    • Memory Active in MB
    • Memory Ballooned in MB
    • Memory Limit in MB
    • Memory Mapped in MB
    • Memory Overhead in MB
    • Memory Reservation in MB
    • Memory Shared in MB
    • Memory Shared Saved in MB
    • Memory Shares
    • Memory Swapped in MB
    • Memory Used in MB
  • VM Processor
    • % Processor Time
    • Effective VM Speed in MHz
    • Host processor speed in MHz
    • Limit in MHz
    • Reservation in MHz
    • Shares
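
These counters aren’t limited to the Perfmon GUI, either; anything that can read a standard PDH counter path can sample them. Here’s a minimal sketch in Python using pywin32 inside the guest (whether the VM Processor object exposes a (_Total) instance is my assumption; check what Perfmon’s Add Counters dialog shows for your VM):

```python
import time
import win32pdh  # pywin32; run inside the Windows guest VM

# Object and counter names come from the list above; the instance names
# are an assumption -- verify in Perfmon's Add Counters dialog first.
NATIVE = r"\Processor(_Total)\% Processor Time"
VMWARE = r"\VM Processor\% Processor Time"

query = win32pdh.OpenQuery()
counters = {name: win32pdh.AddCounter(query, path)
            for name, path in (("native Processor", NATIVE),
                               ("VM Processor", VMWARE))}

win32pdh.CollectQueryData(query)   # rate counters need two samples
time.sleep(1)
win32pdh.CollectQueryData(query)

for name, handle in counters.items():
    _, value = win32pdh.GetFormattedCounterValue(handle, win32pdh.PDH_FMT_DOUBLE)
    print(f"{name}: {value:.1f}% processor time")

win32pdh.CloseQuery(query)
```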

Observing some of the counter names, it’s interesting to see that VMware has given us direct insight into the hypervisor resource configuration settings via Performance Monitor from inside the guest VM. While this may be useful for VI Administrators who manage both the VI as well as the guest operating systems, it may be a disservice to VI Administrators in environments where guest OS administration is delegated to another support group. The reason I say this is that some of these new counters disclose an “over commit” or “thin provisioning” of virtual hardware resources which I’d rather not reveal to other support groups. What they don’t know won’t hurt them. Revealing some of the tools in our bag of virtualization tricks may bring about difficult discussions we don’t really want to get into, or perhaps provoke the finger of blame to be perpetually pointed in our direction whenever a guest OS problem is encountered.

I’ve grabbed a few screenshots from my lab which show the disparity between native Perfmon metrics and the new vSphere Virtual Machine Performance Counters. In this example, I compare % Processor Time from Perfmon’s native Processor object against % Processor Time from the VM Processor object which was injected into the VM during the vSphere VMware Tools installation. It’s interesting to note, and you should be able to clearly see it in the graph, that the VM Processor % Processor Time is consistently double that of the native Processor % Processor Time counter. Consider this when you are providing performance information for a guest VM or one of its applications. If you choose the native Perfmon counter, you could be reporting performance data with a 100% margin of error, as shown in the case below. This is significant, and if used for capacity planning purposes it could lead to all sorts of problems.

[Screenshot: 7-8-2009 9-15-20 PM]

[Screenshot: 7-8-2009 10-17-02 PM]

One other important item to note: recall I said towards the beginning that the legacy VMware Descheduled Time Accounting Service only supported uniprocessor VMs. The same appears to be true for the new vSphere Virtual Machine Performance Counters. In the lab I took a single vCPU VM which had the vSphere Virtual Machine Performance Counters, and I adjusted the vCPU count to 4. After powering on with the new vCPU count, the vSphere Virtual Machine Performance Counters disappeared from the pulldown list. VMware needs to address this shortcoming. Performance statistics on vSMP VMs are just as important, if not more important, than performance statistics on uniprocessor VMs. vSMP VM resource utilization needs to be watched more closely for vSMP justification purposes.

So VMware, in summary, here is what needs work with vSphere Virtual Machine Performance Counters:

  1. Must support vSMP VMs
  2. Must support Linux VMs
  3. Support for Solaris VMs would also be nice
  4. More objects: VM Disk and VM Networking

Update: On Friday July 11th, 2009, I received the following email response from Praveen Kannan, a VMware Product Manager. Praveen has given me permission to reprint the response here. It is an encouraging read:

Hi Jason,

I read your recent blog post on the Perfmon integration in vSphere 4.0. I’m the product manager for the feature and wanted to reach out on your findings and feedback regarding the feature.

First off, thanks for the detailed post on the intricacies of the feature and the screenshots. I think this post would be very helpful to the community! Much appreciated…

1) note on vmdesched

We’ve deprecated vmdesched in vSphere 4.0 because it was primarily an experimental feature that we didn’t recommend putting in production. More importantly, vmdesched adds overhead to the guest and is not compatible with some of the newer kernels out there, so the Perfmon integration is our answer to improve on the current state and provide accurate CPU accounting to VM owners, something that can be deployed in production and is well integrated with VMware Tools for out-of-box functionality.

2) Linux support for accurate counters

The Perfmon integration in vSphere 4.0 leverages the guest SDK API to get to the accurate counters from the hypervisor, and that is available on Linux GOS as well. All you need is to have VMware Tools installed to get access to the guest SDK interface. We couldn’t provide something like Perfmon on Linux since there aren’t many broadly used tools/APIs that we can standardize on.

There are some discussions internally to solve the accounting issue on Linux guests in a much simplified manner but I can’t go into the specific details at this time. Rest assured, I can tell you that we are looking into the problem for Linux workloads.

On a side note, the Perfmon implementation exposes the two new counter groups through WMI (you can almost think of the Perfmon integration as a WMI provider that sits on top of the guest SDK interface and provides access to the counters). What this means is any in-guest agent, benchmarking, or reporting tool can quickly adapt to use these “accurate” counters using WMI.

So for Linux guests, you can refer to the guest SDK documentation on how to modify your Linux agents, tools, etc. to talk to the “accurate” counters. The programming guide for vSphere guest SDK 4.0 is available at [link]. The list of available perf counters is on Page 11 of the PDF (Accessor functions for VM data).

You can in fact use the older 3.5 version of the guest SDK API as well if you want to implement something that works with existing VI3 environments (yes, this SDK has been around for a while!). The only difference is that the vSphere version of the API has a few extra counters but you will get access to the important counters such as CPU utilization in the older API itself.

3) over commit, thin provisioning counters

Interesting feedback that I’ll take back to engineering 🙂 This is something that we need to think about for sure.

4) uni-processor Perfmon?

I’m really surprised by your observations after moving to 4 vCPUs. Not sure what’s going on, but AFAIK we report the _Total (aggregate) of all CPU utilization in one metric in the “VM Processor” counter group in Perfmon. What that means is that regardless of how many CPUs are in the guest, we do provide the _Total of CPU utilization. You may have run into a bug. I’ll check with engineering on this anyway to confirm my understanding.

Just so you know, we have a “standalone” version of the Perfmon tool that works with existing VI 3.5 environments. We’ve posted details about this experimental tool and the binaries on our performance blog here: [link]

The reason I mentioned the standalone version is that on my test box running 3.5 with the standalone version of Perfmon, I was able to see the _Total on a 2 vCPU VM. I haven’t yet tested your findings on a vSphere test box, but I’ll look into it…

So to help us investigate this, could you please do the following?

a. re-install VMware tools on a test Windows VM after switching to 4 vCPUs and check if the problem is reproducible

b. if you have the 3.5 version of VMware Tools running on a VI3 setup, download the standalone version of the Perfmon tool, install it on a Windows VM, and check if the 4-vCPU problem is observed. I haven’t tested the same standalone version of Perfmon on a vSphere 4.0 setup (with the 4.0 version of the tools) but I wouldn’t be surprised if the standalone version does work. You may want to snapshot the VM before you attempt this though, so you can roll back.

5) more counters such as disk and networking

Some background…our main focus in 4.0 was to solve the immediate customer pain-point, namely the CPU accounting issue inside the guest for VM owners. Also, what we heard is that VI admins didn’t want to give out VI client access to VM owners whenever they wanted to look at “accurate” counters for CPU utilization. In fact, the memory counters in Perfmon were sort of a bonus since they were already available in the guest SDK interface 🙂

Importantly, other counters measured inside the guest, such as Memory, Disk and Network, don’t really suffer from accounting problems (i.e. they are accurate) as compared to CPU utilization numbers captured over a period of time (which may be accounted differently due to the scheduling and de-scheduling the hypervisor does). So the numbers for Disk, Memory and Network when captured inside the Windows guest will be the same as in the VI client.

However, I do recognize that as more and more customers start using this integration, there will soon be a need for providing disk and network counters as well. This is definitely on my radar to address in a future release.

Hope the information I provided helps in better understanding the Perfmon integration in vSphere 4.0 and also answers some of your questions in the blog post.

Looking forward to your findings with the 4 vCPU VMs. LMK if you have any questions in the interim.

P.S: Do feel free to use the information discussed here for your blog where you deem useful…

Have a good weekend…

Praveen Kannan
Product Manager
VMware, Inc.

After some more investigation in another test VM, I replied to Praveen with the following information:


In my previous test, I had a 1 vCPU Windows Server 2003 VM. The VM Memory and VM Processor objects were listed in the pulldown list in perfmon. After upgrading the VM to 4 vCPUs, the VM Memory and VM Processor objects were no longer listed in the pulldown list in perfmon. So you see, the objects were not available, thus the counters (including _Total) were not available.

Today, I deployed a 1 vCPU Windows Server 2003 VM from a 1 vCPU template. When I ran perfmon, the VM Memory and VM CPU objects were missing (VMware Tools was up to date). I closed perfmon and reopened it. Then the two VM objects were there.

Then I upgraded the VM to a 4 vCPU VM. I ran perfmon and both the VM objects were there.

Following that, I encountered more problems. I was able to choose the VM Processor object, but the counters for the object were all missing. Definitely a bug somewhere with these. Please advise.
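A practical footnote to Praveen’s point about WMI: since the integration is essentially a WMI provider layered on the guest SDK, an in-guest script can discover the counter classes without knowing their names in advance. A rough sketch using the Python wmi module (the class-name filter is a guess; Perfmon objects are generally mirrored into WMI as Win32_PerfFormattedData_<provider>_<object> classes):

```python
import wmi  # pip install wmi; run inside the Windows guest

c = wmi.WMI(namespace="root/cimv2")

# Look for whatever performance classes the VMware Tools provider registered.
# The substring filter is an assumption, not a documented class name.
candidates = [cls for cls in c.classes
              if "PerfFormattedData" in cls and "VM" in cls.upper()]
print(candidates)

# Once the class name is known, reading counters follows the usual pattern:
#   for row in getattr(c, candidates[0])():
#       print(row.Name, ...)  # property names will mirror the counter names
```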
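And on the Linux side, here is roughly what “modifying your agents to talk to the accurate counters” might look like through the guest SDK, based on the accessor functions the programming guide describes. The library name, return codes, and exact signatures are my assumptions; verify them against the guide:

```python
import ctypes
import time

# vmGuestLib ships with VMware Tools; the .so name/path is an assumption.
lib = ctypes.CDLL("libvmGuestLib.so")

handle = ctypes.c_void_p()
# 0 is assumed to be VMGUESTLIB_ERROR_SUCCESS
assert lib.VMGuestLib_OpenHandle(ctypes.byref(handle)) == 0

def sample():
    """Refresh the session and read used vs. elapsed CPU milliseconds."""
    lib.VMGuestLib_UpdateInfo(handle)
    used = ctypes.c_uint64()
    elapsed = ctypes.c_uint64()
    lib.VMGuestLib_GetCpuUsedMs(handle, ctypes.byref(used))
    lib.VMGuestLib_GetElapsedMs(handle, ctypes.byref(elapsed))
    return used.value, elapsed.value

u1, e1 = sample()
time.sleep(5)
u2, e2 = sample()

# "Accurate" utilization: CPU time the hypervisor actually granted the VM
# divided by wall-clock time, immune to in-guest descheduling blind spots.
print("CPU used: %.1f%%" % (100.0 * (u2 - u1) / (e2 - e1)))
```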

VMware Update Manager, Updates, and New Builds

June 7th, 2009

This was somewhat of a strange post to get off the ground. I had a definite purpose at the beginning and I knew what I was going to write about; however, through some lab scenarios I unexpectedly took the scenic route in getting to the end.

In my mind, the topic started out as “Effective/Efficient Use of Update Manager For New Builds”.

Then, while working in the lab, the title changed to “Gosh, Update Manager Is Slow”.

A while later it morphed into “Cripes, What In The Heck Is Update Manager Doing?!”

Finally I had a revelation and the topic came full circle back to an appropriate title of “VMware Update Manager, Updates, and New Builds”, which is more or less what I had in mind to begin with, but as I said, I picked up some information along the way which I hadn’t recognized at the beginning.

“Effective/Efficient Use of Update Manager For New Builds”

So as I said, the idea of the post started out with a predefined purpose – discussion about the use of Update Manager in host deployments. It really has more to do with host deployment methodology as a basis of discussion than it has to do with patch management. What I was going to highlight was that the deployment of an ESX host goes much quicker if you start out with the most current ESX .ISO allowed in your environment and then use VMware Update Manager to install the remaining patches to bring it current.

As an example, let’s say our current ESX platform standard is ESX 3.5.0 Update 4 with all patches up to today’s date of 6/6/09.

  • The most efficient deployment method would be to perform the initial installation of ESX using the ESX 3.5.0 Update 4 .ISO and then afterwards, use VMware Update Manager to install the remaining 15 patches through today’s date. Using Ultimate Deployment Appliance version 1.4, I can deploy ESX 3.5.0 with Update 4 in five minutes. The subsequent 15 patches using VMware Update Manager takes an additional 16 minutes, end to end including the reboot. That’s a total of less than 25 minutes to deploy a host with all patches.
  • Now let’s look at an alternative and much more time consuming method. Install ESX 3.5.0 using the original or even the Update 1 .ISO. Again, using UDA 1.4, this takes 5 minutes. Now we use Update Manager to remediate the ESX host to Update 4 plus the remaining 15 patches. If you used the original ESX .ISO, you’re looking at 149 updates. If you installed from the ESX 3.5.0 Update 1 .ISO, you’ve got 125 patches to install. This patching process takes nearly 90 minutes! Even on an IBM x3850M2 (one of the fastest hardware platforms available on the market today), the patch process is 75 minutes.

The numbers in the second bullet above speak to the deployment of one host. We always have more than one host in a high availability cluster and a typical environment might have 6, 12, or even 32 hosts in a cluster. Ideally we don’t want to be running hosts in a cluster on different patch levels for an extended duration. Suddenly we’re looking at a long day of work for a 6 node cluster (9.5 hours) and an entire weekend gone for a cluster of 12 hosts or more (18 hours +). The kicker is that this is still an automated deployment. Automation usually means efficiency right? Not in this case. Granted, there’s not a lot of manual labor involved here, but there is a lot of “hurry up and wait”.
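For anyone who wants to plug in their own host counts, the back-of-the-envelope math behind those cluster-wide figures looks like this (serial remediation, one host out at a time, using the per-host times from the bullets above):

```python
# Per-host times from the bullets above: 5 min automated ISO deploy, then
# either 16 min of patching (current ISO) or ~90 min (original/U1 ISO).
PER_HOST_CURRENT_ISO = 5 + 16   # minutes
PER_HOST_OLD_ISO     = 5 + 90   # minutes

for hosts in (6, 12, 32):
    current = hosts * PER_HOST_CURRENT_ISO / 60
    old     = hosts * PER_HOST_OLD_ISO / 60
    print(f"{hosts:>2} hosts: current ISO {current:4.1f} h, old ISO {old:4.1f} h")

# 6 hosts: ~9.5 hours the slow way; 12 hosts: 19 hours -- an entire weekend.
```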

Now before anyone jumps in and recommends rebuilding all of the hosts concurrently, let’s just count that out as an option because in this scenario, we’re rebuilding an active cluster that can only afford 1 host outage at a time (N+1). I’m actually being generous with the time durations because I’m not even accounting for host evacuations, which at the vCenter default of 2 at a time, can take a long time on densely populated clusters. It’s a real world scenario and if you don’t plan ahead for it, you may find out there is not enough time in a weekend to complete your upgrade.

Moral of this section: When deploying hosts, use the most recent .ISO possible which has all of the updates injected into it up to the release date of the .ISO.

“Gosh, Update Manager Is Slow”

I’ve heard some comments via word of mouth about how slow Update Manager is. Myself, I thought the comments were unfounded. I’ve never had major issues with Update Manager aside from a few nuisances I’ve learned to work around. Having managed ESX environments before the advent of Update Manager, I’m grateful for what Update Manager has brought to the table in lieu of manually populated and managed intranet update repositories. I never really noticed the Update Manager slowness because I was always deploying new host builds from the latest ESX .ISO as I described in the first bullet in the section above, and then applying the few incremental post deployment patches. Deploying the full boat of ESX patches using Update Manager has opened up my eyes as to how painfully slow it can be.

One interesting thing I discovered in the lab was that not only is the patch deployment process longer, the preceding scan process is as well. Interestingly, both the scan and the remediate steps seem to scale in a roughly linear fashion; whether that is actually true or just a coincidence, who knows. What I mean is:

  • An ESX 3.5.0 Update 4 host took 1 minute to scan and 16 minutes to remediate
  • An ESX 3.5.0 Update 1 host took 5 minutes to scan and 84 minutes to remediate

So we’re wasting extra time in both phases of remediation: the scan and the remediate.

Moral of this section: Update Manager or ESX patch installation or both is slow, but it doesn’t have to be. Same as the moral of the first section: Avoid this pitfall by using the most recent .ISO possible which has all of the updates injected into it up to the release date of the .ISO.

“Cripes, What In The Heck Is Update Manager Doing?!”

So then curiosity got the best of me and I took the lab experiment a little further. Of the 84 minutes spent remediating the ESX 3.5.0 Update 1 host above, how much of that time was spent installing Update 4, and how much was spent installing the 15 subsequent post-Update 4 patches? After all, I already know that remediating the 15 post-Update 4 patches by themselves takes only 16 minutes. Will the numbers jibe?

To find out, I deployed an ESX 3.5.0 Update 1 host and created a remediation baseline containing ONLY ESX 3.5.0 Update 4. Big sucker – 723MB, but because it’s just one giant service pack, perhaps it will install quicker than the sum of all its updates. Here’s where I was really wrong.

I remediated the host and expected to see 1 task in vCenter describing an installation process, and then a reboot. Instead, I saw a boatload of patches being installed:

[Screenshot: 6-7-2009 12-26-22 AM]

Which brings me to the title of this section “Cripes, What In The Heck Is Update Manager Doing?!” Did I apply the wrong baseline? Did Update Manager become self-aware like Skynet and decide to engineer its own creative solutions to datacenter problems? Turns out Update 4 is not a patch or a service pack at all. In and of itself, it doesn’t even include binary RPM data. It’s metadata that points to all ESX 3.5.0 patches dated up to and including 3/30/09.

Sure, you can download Update 4 as a 724MB offline installation package from the VMware download section, but mosey on over to their patch repository portal and you’ll see that the giant list of superseded and included updates in Update 4 is merely an 8.0K download. At first I thought that had to be a typo and I was about to drop John Troyer an email, but opening up that 8K file just for kicks was the eye opener for me. Take a look at the 8K file and you’ll see the metadata that tells Update Manager to go download many of the incremental patches leading up to 3/30/09. Same concept with the 724MB offline installation package. It’s a .ZIP file. Open it up and you won’t find a large 724MB .RPM. Instead you’ll find a directory structure containing many of the incremental updates leading up to 3/30/09.
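If you want to verify this yourself, a few lines of Python will list what’s actually inside the offline bundle (the file name below is a placeholder for whatever you downloaded from VMware):

```python
import zipfile

# List the top-level entries in the Update 4 offline installation package.
# Expect a directory per incremental patch rather than one giant RPM.
with zipfile.ZipFile("ESX350-Update04.zip") as bundle:
    top_level = sorted({name.split("/")[0] for name in bundle.namelist()})
    for entry in top_level:
        print(entry)
```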

Moral of this section: Same as the moral of the first and second sections: Avoid wasting your valuable maintenance window time by avoiding as many incremental ESX patches as possible. Use the most recent .ISO possible which has all of the updates injected into it up to the release date of the .ISO when you deploy a host.

“VMware Update Manager, Updates, and New Builds”

Connect the dots and I think we’ve got a best practice in the making for host deployments using Update Manager. Existing and new host deployments aside, look at the implications of using Update Manager to deploy a major Update (in this discussion, Update 4). It’s actually 5 times faster to rebuild the host with the integrated Update 4 .ISO than it is to patch it with Update Manager. To me that’s bizarre but it is reality if you have automated host deployment methods. For medium to large environments, automated builds are absolutely required. There’s not enough time in the weekend to patch an 18 host cluster, let alone a 32 node cluster using Update Manager. Rebuild from an updated .ISO or span your host updates over several maintenance windows. The latter could get hairy and I definitely would not recommend it.

Great day today and I got a lot accomplished in the lab. Unfortunately towards the end, this happened:

[Screenshot: 6-7-2009 1-08-09 AM]

Replacement unit is already on the way from NewEgg. Thank you vWire for funding the replacement!

VMware ESX 4.0 installation video

April 27th, 2009

Taking a cue from VCritical‘s VMware vCenter Server 4.0 installation video, following is a video showcasing the VMware ESX 4.0 GUI installation on HP ProLiant DL385 server hardware with an AMD Opteron 280 dual core 64-bit processor, 4GB RAM, two internal 15K RPM SCSI drives, two onboard Gb NICs, and SAN attached storage via two fibre HBAs.

Points of interest in the video:

  • The ESX 4.0 boot media is a DVD-ROM (not a CD-ROM).
  • 9:03:23 F2 is pressed revealing the various kernel arguments tied to each of the embedded boot options.
  • 9:05:03 Custom driver installation (NICs, HBAs, array controllers, etc.)
  • 9:05:42 As in ESX 2.x, the ability to enter host based licensing at the console.
  • 9:05:49 pNIC adapter selection reveals MAC address and connection state (both of which can be very useful).
  • 9:05:57 A “Test these settings” button for the Service Console NIC
  • 9:06:47 Detailed information is shown about SAN attached storage such as storage array type and fabric information.
  • 9:06:58 Careful selection of the target installation disk spares the existing VMFS volumes on SAN attached storage from being formatted. Only the selected disk is initialized.
  • 9:07:12 Choosing an existing or creating a new datastore to hold VMs and VMKernel swap. The volume created here did not appear to be block aligned.
  • 9:07:44 VMware auto-created partitioning. Not shown (hidden) are a 250MB /boot partition, a 110MB vmkcore partition, and the VMFS volume the installer was instructed to create. VMware’s automatic partitioning schemes over the years don’t seem to follow best practices and lessons learned in the community, and in some cases what their own instructors teach in the classroom. I think it would help if VMware read the partitioning discussions on their forums or some of the books from the accomplished authors in the community. For instance, creating a mount point for /var/log instead of /var, since /var can contain a subdirectory such as /var/xyz holding large core dumps from 3rd party agents or products. I’m all for keeping things simple, but not at the price of risking a virtual infrastructure component running 1, 10, or 100+ VMs. From what I’m seeing, manual partitioning is still needed in ESX 4.0 to be in alignment with best practices.
  • 9:09:12 I purposely chose the incorrect date here merely to demonstrate that host names may be used in lieu of IP addresses for NTP server selection. Upon entering the FQDN of an NTP server, the calendar snaps back to the correct date.
  • 9:08:50 You probably just want to fast forward the video to the end. Nothing very exciting happens during the file copy process.
  • 9:17:17 The file copy completes and the host is rebooted.
  • 9:19:47 All SAN attached volumes are still intact with data on them.

ESX 4.0 GUI Installation from Jason Boche on Vimeo.

Anti-affinity rules are not honored in cluster with more than 2 virtual machines

March 27th, 2009

We can put a man on the moon and we can hot migrate virtual machines with SMP and gigs of RAM, but we can’t create anti-affinity rules with three or more VMs. This has been a thorn in my side since 2006, long before I requested it fixed in February 2007 on the VMTN Product and Feature Suggestions forum.

VMware updated KB article 1006473 on 3/26 outlining anti-affinity rule behavior when using three or more VMs:

“This is expected behavior, as anti affinity rules can be set only for 2 virtual machines.

When a third virtual machine is added any rule becomes disabled (with 2.0.2 or earlier).

There has been a slight change in behavior with VirtualCenter 2.5, wherein input validation occurs, where a third virtual machine added produces a warning message indicating a maximum of two virtual machines only can be added to this rule.

To workaround this, create more rules to cover all of the combinations of virtual machines.

For example, create rules for (VM1 & VM2), then (VM2 & VM3), and (VM1 & VM3).”

That last sentence is what has been burning my cookies for the longest time. In my last environment, I had several NLB VMs which could not be on the same host for load balancing and redundancy purposes. Rather than create a minimal number of rules to intelligently handle all of the VMs, I was left with no choice but to create several rules for each potentially deadly combination.
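To put a number on that pain: with pairwise-only rules, keeping n VMs apart requires one rule per pair, i.e. n(n-1)/2 rules. A quick illustration (the VM names are hypothetical):

```python
from itertools import combinations
from math import comb

nlb_vms = ["NLB1", "NLB2", "NLB3", "NLB4"]  # hypothetical NLB cluster members

rules = list(combinations(nlb_vms, 2))
print(len(rules), "rules needed:", rules)   # 6 rules just for 4 VMs

# And it only gets worse as the farm grows:
for n in (4, 6, 8):
    print(f"{n} VMs -> {comb(n, 2)} anti-affinity rules")
```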

Work harder, not smarter. Come on, VMware.

Microsoft Performance Monitor tweaks

February 17th, 2009

Today I discovered the workarounds to a few issues in Microsoft Performance Monitor that have bugged me for quite a while (read: years).

Issue 1: Vertical lines are displayed in the Sysmon tool that obscure the graph view

[Screenshot: 2-17-2009 9-41-08 PM]

Cause: This behavior occurs when there are more than 100 data points to be displayed in chart view.

Resolution: Microsoft KB article 283110

To enable or disable this behavior:

  1. Start Regedit.exe.
  2. Navigate to the following key: HKEY_CURRENT_USER\Software\Microsoft\SystemMonitor
  3. On the Edit menu, click New, and then click DWORD Value.
  4. Type the following value in the Name box: DisplaySingleLogSampleValue
  5. Set the value to 1 if you do not want to view the vertical line indicators, or set the value to 0, which is the default setting, to display the vertical indicators.
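
If you’d rather script the change than click through Regedit, the same tweak is a few lines of Python (the key and value names come straight from the KB article above):

```python
import winreg  # Python 3 standard library; run as the user whose Perfmon view you want to change

key = winreg.CreateKey(winreg.HKEY_CURRENT_USER,
                       r"Software\Microsoft\SystemMonitor")
# 1 hides the vertical single-sample indicators; 0 (the default) shows them.
winreg.SetValueEx(key, "DisplaySingleLogSampleValue", 0, winreg.REG_DWORD, 1)
winreg.CloseKey(key)
```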


[Screenshot: 2-17-2009 9-47-48 PM]

Issue 2: When looking at large numbers in Performance Monitor (Windows XP), comma separators do not exist, making the values difficult to interpret.

[Screenshot: 2-17-2009 9-49-26 PM]

Cause: Microsoft

Resolution: Microsoft KB article 300884

Follow these steps, and then quit Registry Editor:

  1. Click Start, click Run, type regedit, and then click OK.
  2. Locate and then click the following key in the registry: HKEY_CURRENT_USER\Software\Microsoft\SystemMonitor\
  3. On the Edit menu, point to New, and then click DWORD Value.
  4. Type DisplayThousandsSeparator, and then press ENTER.
  5. On the Edit menu, click Modify.
  6. Type 1, and then click OK.
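
And the scripted version of this one, same idea as before:

```python
import winreg

key = winreg.CreateKey(winreg.HKEY_CURRENT_USER,
                       r"Software\Microsoft\SystemMonitor")
# 1 enables thousands separators in Performance Monitor's value display.
winreg.SetValueEx(key, "DisplayThousandsSeparator", 0, winreg.REG_DWORD, 1)
winreg.CloseKey(key)
```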


[Screenshot: 2-17-2009 9-50-51 PM]

Extra credit:  Check out Microsoft KB article 281884 for one additional tweak that deals with viewing PIDs in Performance Monitor counters.

Three VirtualCenter security tips Windows administrators should know

January 15th, 2009

Good morning!  I’d like to take the opportunity to talk a bit about something that has been somewhat of a rock in my shoe as a seasoned Windows administrator from the NT 3.5 era:  The VirtualCenter (vCenter Server, VirtualCenter Management Server, VCMS, VC, etc.) security model, or more accurately, its unfamiliar mechanics that can catch Windows administrators off guard and leave them scratching their heads.

Tip #1: The VCMS security model revolves around privileges, roles, and objects.  More than 100 privileges define rights; roles are collections of privileges; and roles are assigned to objects, which are entities in the virtual infrastructure, as shown in the diagram borrowed below:

[Diagram: 1-15-2009 11-24-45 AM]

Windows administrators will be used to the concept of assigning NTFS permissions to files, folders, and other objects in Active Directory.  It is very common for Windows objects to contain more than one Access Control Entry (ACE) which can be a group (such as “Accounting”, “Marketing”, etc.) or an explicit user (such as “Bob”, “Sally”, etc.)  The same holds true for assigning roles to objects in VC.

In some instances, which are not uncommon at all, a user may be granted permission to an object by way of more than one ACE.  For example, if both the Accounting and Marketing groups were assigned rights, and Sally was a member of both of those groups, Sally would have rights to the object through both of those groups.  Using this same example, if the two ACEs defined different permissions to an object, the end result is cumulative, so long as no ACE contains a “deny” (which is special):  Sally would have the combined set of permissions.  The same holds true in VC.

Let’s take the above example a step further.  In addition to the two groups, which Sally is a member of, being ACL’d to an object, now let’s say Sally’s user account object itself is an explicit ACE in the ACL.  In the Windows world, the effect is that Sally’s rights are still cumulative, combining the three ACEs.  This is where the fork in the road lies in the VirtualCenter security model.  Roles explicitly assigned to a user object trump all other assigned or inherited permissions on the same object.  If the explicit ACE defines fewer permissions, the effective result is that Sally will have fewer permissions than what her group membership would have provided.  If the explicit ACE defines more permissions, the effective result is that Sally will have more permissions than what her group membership would have provided.  This is where Windows-based VC administrators will be dumbfounded when a user suddenly calls with tales of things grayed out in VirtualCenter, not enough permissions, etc.  Of course the flip side of the coin is a junior administrator suddenly finding themselves with cool new options in VC.  “Let’s see what this datastore button does”
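To make the fork in the road concrete, here’s a toy model of the two evaluation rules. This is my illustration, not VMware’s code; the privilege names are merely examples:

```python
def windows_effective(group_privs, user_privs=None):
    """Windows ACLs: effective rights are the union of every matching ACE."""
    effective = set().union(*group_privs) if group_privs else set()
    if user_privs is not None:
        effective |= user_privs
    return effective

def virtualcenter_effective(group_privs, user_privs=None):
    """VirtualCenter: a role assigned explicitly to the user trumps the groups."""
    if user_privs is not None:
        return set(user_privs)
    return set().union(*group_privs) if group_privs else set()

accounting = {"VirtualMachine.Interact.PowerOn"}
marketing = {"Datastore.Browse"}
sally_explicit = {"System.Read"}  # hypothetical explicit read-only assignment

print(windows_effective([accounting, marketing], sally_explicit))
# Windows: the union of all three ACEs
print(virtualcenter_effective([accounting, marketing], sally_explicit))
# VirtualCenter: only System.Read -- Sally just lost her group-derived rights
```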

Moral of the story from a real world perspective:  Assigning explicit permissions to user accounts in VC without careful planning will yield somewhat unpredictable results when inheritance is enabled (which is typical).  To take this to extremes, assigning explicit permissions to user accounts in VC, especially where inheritance in the VC hierarchy is involved, is a security and uptime risk when a user ends up with the wrong permissions accidentally.  For security and consistency purposes, I would avoid assigning permissions explicitly to user accounts unless you have a very clear understanding of the impacts currently and down the road.

Tip #2: Beware the use of the built-in role Virtual Machine Administrator.  Its name is misleading and the permissions it has are downright scary and not much different from the built-in Administrator role.  For instance, the Virtual Machine Administrator role can modify VC and ESX host licensing, has complete control over the VC folder structure, has complete control over Datacenter objects, has complete control over datastores (short of file management), can remove networks, and has complete control over inventory items such as hosts and clusters.  The list goes on and on.  I have three words:  What The Hell?!  I don’t know – the way my brain works is those permissions stretch well beyond the boundaries of what I would delegate for a Virtual Machine Administrator.

Moral of the story from a real world perspective:  Use the Virtual Machine Administrator role with extreme caution.  There is little disparity between the Administrator role and the Virtual Machine Administrator role, minus some items for Update Manager and changing VC permissions themselves. Therefore, any user who has the Virtual Machine Administrator role is practically an administrator.  The Virtual Machine Administrator role should not be used unless you have delegations that would fit this role precisely.  Another option would be to clone the role and strip some of the more datacenter-impactful permissions out of it.

Tip #3: Audit your effective VirtualCenter permissions on a regular basis, especially if you have a large implementation with many administrators “having their hands in the cookie jar” so to speak.  If you use groups to assign roles in VC, that means you should be auditing those groups as well (above and beyond virtualization conversations, administrative-level groups should be audited anyway as a best practice).  This whitepaper has a nice Perl script for dumping VirtualCenter roles and permissions using the VMware Infrastructure Perl Toolkit.  Use of the script will automate the auditing process quite a bit and help transform a lengthy mundane task into a quicker one.  While you’re at it, it wouldn’t be a bad idea to periodically check tasks and events to see who is doing what.  There should be no surprises there.
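The whitepaper’s script uses the VI Perl Toolkit; for the Python-inclined, an equivalent dump against the vSphere API is only a few lines. A sketch assuming pyVmomi, with placeholder host and credentials:

```python
from pyVim.connect import SmartConnect, Disconnect

# Placeholders -- point these at your own vCenter and an auditing account.
si = SmartConnect(host="vcenter.example.com", user="auditor", pwd="***")
authz = si.RetrieveContent().authorizationManager

# Map role IDs to names, then dump every permission assignment in inventory.
roles = {r.roleId: r.name for r in authz.roleList}
for perm in authz.RetrieveAllPermissions():
    entity = perm.entity.name if perm.entity else "(global)"
    kind = "group" if perm.group else "user"
    print(f"{entity:30} {kind:5} {perm.principal:25} "
          f"role={roles.get(perm.roleId, perm.roleId)} propagate={perm.propagate}")

Disconnect(si)
```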

Moral of the story from a real world perspective:  Audit your VirtualCenter roles and permissions.  When an unexpected datacenter disaster occurs from users having elevated privileges, one of the first questions to be asked in the post mortem meeting will be what your audit process is.  Have a good answer prepared.  Even better, avoid the disaster and down time through the due diligence of auditing your virtual infrastructure security.

For more information about VirtualCenter security, check out this great white paper or download the .pdf version from this link.  Some of the information I posted above I gathered from this document.  The white paper was written by Charu Chaubal, a technical marketing manager at VMware and Ph.D. in numerical modeling of complex fluids, with contributions from Doug Clark, and Karl Rummelhart.

If VirtualCenter security talk really gets your juices flowing, you should check out a new podcast called Virtualization Security Round Table, launched today by well known and respected VMTN community member/moderator and book author Edward Haletky.  It is sure to be good!

Uptime lost during VMotion

January 11th, 2009

That’s right. We lose uptime during every VMotion. Relax just a bit – I’m not talking about actual uptime/downtime availability of the guest VM in the datacenter. I’m speaking to the uptime performance metric tracked in the VirtualCenter database. It’s a bug that was introduced in VirtualCenter 2.0 and has remained in the code to this day in VirtualCenter 2.5.0 Update 3. Here’s how it works:

We’ve got a VM named Exchange1. VirtualCenter displays its uptime as 28 days as indicated by the screenshots below:

[Screenshot: 1-11-2009 12-04-06 AM]

[Screenshot: 1-11-2009 12-10-14 AM]

Now the VM has been VMotioned from host SOLO to host LANDO. Notice what has happened to uptime. It has disappeared from the console:

[Screenshot: 1-11-2009 12-07-37 AM]

The real proof of what has ultimately happened: the performance chart shows the latest uptime metric has been reset from 27.99 days, as shown above, to 0.0026620 days (roughly four minutes):

[Screenshot: 1-11-2009 12-10-54 AM]

Sometimes the VIC console will show the uptime counter start over at 0 days, then on to 1 day, etc. Other times the uptime counter will remain blank for days or weeks, as you can see from my three other VMs in the first screenshot which show no uptime.

This brings us to an interesting discussion. What would you like uptime in the VIC to mean exactly? Following are my observations and thoughts on VMware’s implementation of the uptime metric in VirtualCenter.

In previous versions of VirtualCenter, a soft reboot of the VM inside the OS would reset the uptime statistic in VirtualCenter. I believe a function of VMware Tools triggered this.

Today in VirtualCenter 2.5.0 Update 3, a soft reboot inside the guest VM does not reset the uptime statistic back to zero.

A VM with no VMware Tools installed that is soft rebooted inside the OS (i.e., not via any VMware console power operation) does not reset the uptime statistic either.

I could see the community take a few different sides on this, as there are two variations of the definition of uptime we’re dealing with here: uptime of the guest VM OS, and uptime of the VM’s virtual hardware.

  1. Should uptime translate into how long the VM virtual hardware has been powered on from a virtual infrastructure standpoint?
  2. Or should uptime translate into how long the OS inside the VM has been up, tracked by VMware Tools?

The VMware administrator cares about the length of time a VM has been powered on. It is the powered on VM that consumes resources from the four resource food groups and impacts capacity.

The guest VM OS administrator, whether it be Windows or Linux, cares about uptime of the guest OS. The owner of the OS is held to SLAs by the business lines.

My personal opinion is that the intended use of the Virtual Infrastructure Client is for the VMware administrator and thus should reflect virtual infrastructure information. My preference is that the uptime statistic in VirtualCenter tracks power operations of the VM regardless of any reboot sequences of the OS inside the VM. In other words, uptime is not impacted by VMware Tools heartbeats or reboots inside the guest VM. The uptime statistic should only be reset when the VM is powered off or power reset (including instances where HA has recovered a VM).

At any rate, due to the bug uptime has in VirtualCenter 2.0 and above, it’s a fairly unreliable performance metric for any virtual infrastructure using VMotion and DRS. Furthermore, the term itself can be misleading depending on the administrator’s interpretation of uptime versus what’s written in the VirtualCenter code.

I submitted a post in VMware’s Product and Feature Suggestions forum in January of 2007 recording the uptime reset on VMotion issue. As this problem periodically bugs me, I followed up a few times. Once in a follow up post in the thread above, and at least one time externally requesting someone from VMware take a look at it. Admittedly I do not have an SR open.

VMware, can we get this bug fixed? After all, if the hypervisor has become an everyday commodity item, leaving the management tools as the real meat and potatoes, you should make sure our management tools work properly.

Thank you,