Posts Tagged ‘VMotion’

Jumbo Frames Comparison Testing with IP Storage and vMotion

January 24th, 2011

Are you thinking about implementing jumbo frames with your IP storage based vSphere infrastructure?  Have you asked yourself why or thought about the guaranteed benefits? Various credible sources discuss it (here’s a primer).  Some will highlight jumbo frames as a best practice but the majority of what I’ve seen and heard talk about the potential advantages of jumbo frames and what the technology might do to make your infrastructure more efficient.  But be careful to not interpret that as an order of magnitude increase in performance for IP based storage.  In almost all cases, that’s not what is being conveyed, or at least, that shouldn’t be the intent.  Think beyond SPEED NOM NOM NOM.  Think efficiency and reduced resource utilization which lends itself to driving down overall latency.  There are a few stakeholders when considering jumbo frames.  In no particular order:

  1. The network infrastructure team: They like network standards, best practices, a highly performing and efficient network, and zero down time.  They will likely have the most background knowledge and influence when it comes to jumbo frames.  Switches and routers have CPUs which will benefit from jumbo frames because processing fewer frames but more payload overall makes the network device inherently more efficient while using less CPU power and consequently producing less heat.  This becomes increasingly important on 10Gb networks.
  2. The server and desktop teams: They like performance and unlimited network bandwidth provided by magic stuff, dark spirits, and friendly gnomes.  These teams also like a positive end user experience.  Their platforms, which include hardware, OS, and drivers, must support jumbo frames.  Effort required to configure for jumbo frames increases with a rising number of different hardware, OS, and driver combinations.  Any systems which don’t support network infrastructure requirements will be a showstopper.  Server and desktop network endpoints benefit from jumbo frames in much the same way network infrastructure does: efficiency and less overhead which can lead to slightly measurable amounts of performance improvement.  The performance gains more often than not won’t be noticed by the end users except for processes that historically take a long time to complete.  These teams will generally follow infrastructure best practices as instructed by the network team.  In some cases, these teams will embark on an initiative which recommends or requires a change in network design (NIC teaming, jumbo frames, etc.).
  3. The budget owner:  This can be a project sponsor, departmental manager, CIO, or CEO.  They control the budget and thus spending.  Considerable spend thresholds require business justification.  This is where the benefit needs to justify the cost.  They are removed from most of the technical persuasions.  Financial impact is what matters.  Decisions should align with current and future architectural strategies to minimize costly rip and replace.
  4. The end users:  Not surprisingly, they are interested in application uptime, stability, and performance.  They couldn’t care less about the underlying technology except for how it impacts them.  Reduction in performance or slowness is highly visible.  Subtle increases in performance are rarely noticed.  End user perception is reality.
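To put a rough number on the "fewer frames, more payload" argument, here is a minimal sketch of the per-frame arithmetic.  It assumes plain Ethernet with no 802.1Q tag and IPv4/TCP headers without options; the function names are mine and purely illustrative.

```python
# Per-frame overhead on the wire for untagged Ethernet (assumption: no 802.1Q tag):
# preamble + start delimiter (8), Ethernet header (14), FCS (4), interframe gap (12).
ETHERNET_OVERHEAD = 8 + 14 + 4 + 12  # 38 bytes
# IPv4 and TCP headers without options (assumption).
IP_TCP_HEADERS = 20 + 20


def tcp_payload(mtu: int) -> int:
    """Bytes of TCP payload carried by one full-sized frame."""
    return mtu - IP_TCP_HEADERS


def wire_efficiency(mtu: int) -> float:
    """Fraction of bytes on the wire that are TCP payload."""
    return tcp_payload(mtu) / (mtu + ETHERNET_OVERHEAD)


def frames_needed(total_bytes: int, mtu: int) -> int:
    """Number of full-sized frames needed to move total_bytes of payload."""
    payload = tcp_payload(mtu)
    return -(-total_bytes // payload)  # ceiling division


if __name__ == "__main__":
    one_gib = 1024 ** 3
    for mtu in (1500, 9000):
        eff = wire_efficiency(mtu)
        frames = frames_needed(one_gib, mtu)
        print(f"mtu{mtu}: {eff:.2%} payload efficiency, {frames:,} frames per GiB")
```

Under these assumptions the payload efficiency moves from roughly 95% to roughly 99%, while the frame count drops by about a factor of six.  That is a useful way to think about it: the saving is mostly in per-frame processing, not in raw bandwidth.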

The decision to introduce jumbo frames should be carefully thought out and there should be a compelling reason, use case, or business justification which drives the decision.  Because of the end to end requirements, implementing jumbo frames can bring with it additional complexity and cost to an existing network infrastructure.  Possibly the single best one size fits all reason for a jumbo frames design is a situation where jumbo frames is already a standard in the existing network infrastructure.  In this situation, jumbo frames becomes a design constraint or requirement.  The evangelistic point to be made is VMware vSphere supports jumbo frames across the board.  Short of the previous use case, jumbo frames is a design decision where I think it’s important to weigh cost and benefit.  I can’t give you the cost component as it is going to vary quite a bit from environment to environment depending on the existing network design.  This writing speaks more to the benefit component.  Liberal estimates claim up to 30% performance increase when integrating jumbo frames with IP storage.  The numbers I came up with in lab testing are nowhere close to that.  In fact, you’ll see a few results where IO performance with jumbo frames actually decreased slightly.  Not only do I compare IO with or without jumbo frames, I’m also able to compare two storage protocols with and without jumbo frames which could prove to be an interesting sidebar discussion.

I’ve come across many opinions regarding jumbo frames.  Now that I’ve got a managed switch in the lab which supports jumbo frames and VLANs, I wanted to see some real numbers.  Although this writing is primarily regarding jumbo frames, by way of the testing regimen, it is in some ways a second edition to a post I created one year ago where I compared IO performance of the EMC Celerra NS-120 among its various protocols. So without further ado, let’s get onto the testing.

 

Lab test script:

To maintain as much consistency and integrity as possible, the following test criteria were followed:

  1. One Windows Server 2003 VM with IOMETER was used to drive IO tests.
  2. A standardized IOMETER script was leveraged from the VMTN Storage Performance Thread which is a collaboration of storage performance results on VMware virtual infrastructure provided by VMTN Community members around the world.  The thread starts here, was locked due to length, and continues on in a new thread here.  For those unfamiliar with the IOMETER script, it basically goes like this: each run consists of a two minute ramp up followed by five minutes of disk IO pounding.  Four different IO patterns are tested independently.
  3. Two runs of each test were performed to validate consistent results.  A third run was performed if the first two were not consistent.
  4. One ESXi 4.1 host with a single IOMETER VM was used to drive IO tests.
  5. For the mtu1500 tests, IO tests were isolated to one vSwitch, one vmkernel portgroup, one vmnic, one pNIC (Intel NC360T PCI Express), one Ethernet cable, and one switch port on the host side.
  6. For the mtu1500 tests, IO tests were isolated to one cge port, one datamover, one Ethernet cable, and one switch port on the Celerra side.
  7. For the mtu9000 tests, IO tests were isolated to the same vSwitch, a second vmkernel portgroup configured for mtu9000, the same vmnic, the same pNIC (Intel NC360T PCI Express), the same Ethernet cable, and the same switch port on the host side.
  8. For the mtu9000 tests, IO tests were isolated to a second cge port configured for mtu9000, the same datamover, a second Ethernet cable, and a second switch port on the Celerra side.
  9. Layer 3 routes between host and storage were removed to lessen network burden and to isolate storage traffic to the correct interfaces.
  10. 802.1Q VLANs were used to isolate traffic and categorize standard traffic versus jumbo frame traffic.
  11. RESXTOP was used to validate storage traffic was going through the correct vmknic.
  12. Microsoft Network Monitor and Wireshark were used to validate frame lengths during testing.
  13. Activities known to introduce large volumes of network or disk activity, such as backup jobs, were suspended.
  14. Dedupe was suspended on all Celerra file systems to eliminate datamover contention.
  15. All storage tests were performed on thin provisioned virtual disks and datastores.
  16. The same group of 15 spindles was used for all NFS and iSCSI tests.
  17. The uncached write mechanism was enabled on the NFS file system for all NFS tests.  You can read more about that in the following EMC best practices document: VMware ESX Using EMC Celerra Storage Systems.
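The run-twice-and-compare step, and the percentage figures quoted in the results, come down to simple arithmetic.  A minimal sketch follows; the 5% consistency tolerance and the IOPS numbers are hypothetical placeholders, not values from my lab.

```python
def percent_change(baseline: float, candidate: float) -> float:
    """Percent change of candidate relative to baseline (positive means more IOPS)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


def runs_consistent(run_a: float, run_b: float, tolerance_pct: float = 5.0) -> bool:
    """True if two runs differ by no more than tolerance_pct of their mean.

    The 5% default is an assumption chosen for illustration.
    """
    mean = (run_a + run_b) / 2.0
    if mean == 0:
        return run_a == run_b
    return abs(run_a - run_b) / mean * 100.0 <= tolerance_pct


if __name__ == "__main__":
    # Hypothetical IOPS figures, for illustration only (not lab results).
    mtu1500_runs = [1000.0, 1010.0]
    mtu9000_runs = [1030.0, 1040.0]

    for label, runs in (("mtu1500", mtu1500_runs), ("mtu9000", mtu9000_runs)):
        if not runs_consistent(runs[0], runs[1]):
            print(f"{label}: runs disagree, perform a third run")

    baseline = sum(mtu1500_runs) / len(mtu1500_runs)
    jumbo = sum(mtu9000_runs) / len(mtu9000_runs)
    print(f"jumbo frames vs. standard: {percent_change(baseline, jumbo):+.1f}% IOPS")
```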

Lab test hardware:

SERVER TYPE: Windows Server 2003 R2 VM on ESXi 4.1
CPU TYPE / NUMBER: 1 vCPU / 512MB RAM (thin provisioned)
HOST TYPE: HP DL385 G2, 24GB RAM; 2x QC AMD Opteron 2356 Barcelona
STORAGE TYPE / DISK NUMBER / RAID LEVEL: EMC Celerra NS-120 / 15x 146GB 15K / 3x RAID5 5×146
SAN TYPE / HBAs: NFS / swiSCSI / 1Gb datamover ports (sorry, no FCoE)
OTHER: 3Com SuperStack 3 3870 48x1Gb Ethernet switch

 

Lab test results:

NFS test results.  How much better is NFS performance with jumbo frames by IO workload type?  The best result seen here is about a 7% performance increase by using jumbo frames; however, 100% read is a rather unrealistic representation of a virtual machine workload.  For NFS, I’ll sum it up as a 0-3% IOPS performance improvement by using jumbo frames.

[Figures: NFS test results, mtu1500 vs. mtu9000, by IO workload type]

iSCSI test results.  How much better is iSCSI performance with jumbo frames by IO workload type?  Here we see that iSCSI doesn’t benefit from the move to jumbo frames as much as NFS.  In two workload pattern types, performance actually decreased slightly.  Discounting the unrealistic 100% read workload as I did above, we’re left with a 1% IOPS performance gain at best by using jumbo frames with iSCSI.

[Figures: iSCSI test results, mtu1500 vs. mtu9000, by IO workload type]

NFS vs iSCSI test results.  Taking the best results from each protocol type, how do the protocol types compare by IO workload type?  75% of the best results came from using jumbo frames.  The better performing protocol is a 50/50 split depending on the workload pattern.  One interesting observation to be made in this comparison is how much better one protocol performs over the other.  I’ve heard storage vendors state that the IP protocol debate is a snoozer because they perform roughly the same.  I’ll grant that in two of the workload types below, but in the other two, iSCSI pulls a significant performance lead over NFS, particularly in the Max Throughput-50%Read workload where iSCSI blows NFS away.  That said, I’m not outright recommending iSCSI over NFS.  If you’re going to take anything away from these comparisons, it should be “it depends”.  In this case, it depends on the workload pattern, among a handful of other intrinsic variables.  I really like the flexibility in IP based storage and I think it’s hard to go wrong with either NFS or iSCSI.

[Figures: NFS vs iSCSI best results by IO workload type]

vMotion test results.  Up until this point, I’ve looked at the impact of jumbo frames on IP based storage with VMware vSphere.  For curiosity’s sake, I wanted to address the question “How much better is vMotion performance with jumbo frames enabled?”  vMotion utilizes a VMkernel port on ESXi just as IP storage does so the ground work has already been established making this a quick test.  I followed roughly the same lab test script outlined above so that the most consistent and reliable results could be produced.  This test wasn’t rocket science.  I simply grabbed a few different VM workload types (Windows, Linux) with varying sizes of RAM allocated to them (2GB, 3GB, 4GB).  I then performed three batches of vMotions of two runs each on non jumbo frames (mtu1500) and jumbo frames (mtu9000).  Results varied.  The first two batches showed that jumbo frames provided a 7-15% reduction in elapsed vMotion time.  But then the third and final batch contrasted previous results with data revealing a slight decrease in vMotion efficiency with jumbo frames.  I think there are more variables at play here and this may be a case where more data sampling is needed to form any kind of reliable conclusion.  But if you want to go by these numbers, vMotion is quicker on jumbo frames more often than not.

[Figures: vMotion elapsed time, mtu1500 vs. mtu9000]

The bottom line:

So what is the bottom line on jumbo frames, at least today?  First of all, my disclaimer:  My tests were performed on an older 3Com network switch.  Mileage may vary on newer or different network infrastructure.  Unfortunately I did not have access to a 10Gb lab network to perform this same testing.  However, I believe my findings are consistent with the majority of what I’ve gathered from the various credible sources.  I’m not sold on jumbo frames as a provider of significant performance gains.  I wouldn’t break my back implementing the technology without an indisputable business justification.  If you want to please the network team and abide by the strategy of an existing jumbo frames enabled network infrastructure, then use jumbo frames with confidence.  If you want to be doing everything you possibly can to boost performance from your IP based storage network, use jumbo frames.  If you’re betting the business on IP based storage, use jumbo frames.  If you need a piece of plausible deniability when IP storage performance hits the fan, use jumbo frames.  If you’re looking for the IP based storage performance promised land, jumbo frames won’t get you there by themselves.  If you come across a source telling you otherwise, that jumbo frames are the key or sole ingredient to the Utopia of incomprehensible speeds, challenge the source.  Ask to see some real data.  If you’re in need of a considerable performance boost of your IP based storage, look beyond jumbo frames.  Look at optimizing, balancing, or upgrading your back end disk array.  Look at 10Gb.  Look at fibre channel.  Each of these alternatives is likely to get you better overall performance gains than jumbo frames alone.  And of course, consult with your vendor.

Meet the Engineer: VMware vMotion

September 14th, 2010

I caught this VMware video announcement on Twitter but didn’t see a formal blog post or landing page to provide the proper introduction which it deserves, so I’ll go ahead here and do the cheeseful.  I have no shame in this.

vMotion is a historically significant technology in VMware’s portfolio of datacenter products and has become a staple of virtualized datacenter operations.  It paves a foundation which many other key VMware technologies leverage.  Dilpreet Bindra is the Senior Engineering Manager, VM Mobility Team at VMware (which encompasses both vMotion and Storage vMotion).

Dilpreet is the star of this video and he explains some of the barriers his group has conquered in vSphere 4.1 – these are awesome improvements!  Watch the video. You’re being treated to a sizable slice of VMware history.

New ESX(i) 3.5 security patch released; scenarios and installation notes

April 11th, 2009

On Friday April 10th, VMware released two patches:

Both address the same issue:

A critical vulnerability in the virtual machine display function might allow a guest operating system to run code on the host. The Common Vulnerabilities and Exposures Project (cve.mitre.org) has assigned the name CVE-2009-1244 to this issue.

Hackers must love vulnerabilities like this because they can get a lot of mileage out of essentially a single attack. The ability to execute code on an ESX host can impact all running VMs on that host.

Although proper virtualization promises isolation, the reality is that no hardware or software vendor is perfect and from time to time we’re going to see issues like this. Products are under constant attack from hackers (both good and bad) to find exploits. In virtualized environments, it’s important to remember that guest VMs and guest operating systems are no different than their physical counterparts in that they need to be properly protected from the network. That means adequate virus protection, spyware protection, firewalls, encryption, packet filtering, etc.

This vulnerability in VMware ESX and ESXi is really a two factor attack. In order to compromise the ESX or ESXi host, the guest VM must first be vulnerable to compromise on the network to provide the entry point to the host. Once the guest VM is compromised, the next step is to get from the guest VM to the ESX(i) host. Hosts without the patch will be vulnerable to the next attack which we know from reading above will allow who knows what code to be executed on the host. If the host is patched, we maintain our guest isolation and the attack stops at the VM level. Unfortunately, the OS running in the guest VM is still compromised, again highlighting the need for adequate protection of the operating system and applications running in each VM.

The bottom line is this is an important update for your infrastructure. If your ESX or ESXi hosts are vulnerable, you’ll want to get this one tested and implemented as soon as possible.

I installed the updates today in the lab and discovered something interesting that is actually outlined in both of the KB articles above:

  • The ESXi version of the update requires a reboot. Using Update Manager, the patch process goes like this: Remediate -> Maintenance Mode -> VMotion VMs off -> Patch -> Reboot -> Exit Maintenance Mode. The duration of installation of the patch until exiting maintenance mode (including the reboot in between) took 12 minutes.
  • The ESX version of the update does not require a reboot. Using Update Manager, the patch process goes like this: Remediate -> Maintenance Mode -> VMotion VMs off -> Patch -> Exit Maintenance Mode. The duration of installation of the patch until exiting maintenance mode (with no reboot in between) took 1.5 minutes.

Given reboot times of the host, patching ESX hosts goes much quicker than patching ESXi hosts. Reboot times on HP Proliant servers aren’t too bad but I’ve been working with some powerful IBM servers lately and the reboot times on those are significantly longer than HP. Hopefully we’re not rebooting ESX hosts on a regular basis so with that in mind, reboot times aren’t a huge concern, but if you’ve got a large environment with a lot of hosts requiring reboots, the reboot times are going to be cumulative in most cases. Consider my environment above. A 6 node ESXi cluster is going to take 72 minutes to patch, not including VMotions. A 6 node ESX cluster is going to take 9 minutes to patch, not including VMotions. This may be something to really think about when weighing the decision of ESX versus ESXi for your environment.
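The cluster totals above are just hosts multiplied by per-host time, assuming hosts are patched one at a time.  A minimal sketch; the `concurrent_hosts` parameter is my own addition for illustration, not something Update Manager exposes under that name.

```python
def cluster_patch_minutes(hosts: int, minutes_per_host: float, concurrent_hosts: int = 1) -> float:
    """Estimated patch time for a cluster, excluding VMotion time.

    Assumes every host takes the same time and hosts are patched in
    batches of concurrent_hosts (1 = strictly serial, as in the text).
    """
    if hosts < 0 or concurrent_hosts < 1:
        raise ValueError("hosts must be >= 0 and concurrent_hosts >= 1")
    batches = -(-hosts // concurrent_hosts)  # ceiling division
    return batches * minutes_per_host


if __name__ == "__main__":
    print("ESXi, 6 nodes:", cluster_patch_minutes(6, 12), "minutes")
    print("ESX,  6 nodes:", cluster_patch_minutes(6, 1.5), "minutes")
```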

Update: One more item critical to note is that although the ESX version of the patch requires no reboot, the patch does require three other patches to be installed, at least one of which requires a reboot.  If you already meet the requirements, no reboot will be required for ESX to install the new patch.

In closing, while we are on the subject of performing a lot of VMotions, take a look at a guest blog post from Simon Long called VMotion Performance. Simon shows us how to modify VirtualCenter (vCenter Server) to allow more simultaneous VMotions which will significantly cut down the amount of time spent patching ESX hosts in a cluster.

Andrew Kutz joins Hyper9

February 28th, 2009

This news is a little over a week old but I just found out two nights ago while reading vExpert profiles and it’s definitely worth repeating.

Andrew Kutz is a recently named vExpert by VMware, Inc. and a well known developer in the VMware community. Andrew has authored a number of VirtualCenter plugins, of which the most famous might be his free Storage VMotion (sVMotion) plugin which provides VMware administrators a GUI interface to hot migrate VM storage from one LUN to another. Andrew has received well deserved praise for his work because he makes the lives of VI administrators easier.

Hyper9 is a startup company in Austin, TX that works in the virtualization infrastructure management space, developing tools that automate the management of virtualization in the datacenter. Hyper9 recently secured an additional round of investment funding and it would seem they are totally serious about delivering quality products to the virtualization community in the hiring of Andrew Kutz. What can we expect out of this? Given what I’ve seen from Andrew in the past, I’ll guess the future will be plugin based architecture which I think makes a lot of sense and is probably what the majority of the community wants.

Congratulations to both Andrew Kutz and Hyper9. I look forward to your accomplishments with great anticipation!

Read the official announcement from Hyper9 here.

Tripwire Announcement

February 19th, 2009

Press release from Tripwire.  I haven’t had time to take a look at the product yet but the announcement comes from a trustworthy and reputable source whom I respect.  I look forward to seeing some commentary either on the blog here or over at vwire.com.

TRIPWIRE ANNOUNCES FREE UTILITY TO HELP MANAGE VMWARE VMOTION, LAUNCHES NEW VIRTUALIZATION COMMUNITY
Tripwire OpsCheck addresses key virtual infrastructure operational issues; vWire.com offers an opportunity for virtual infrastructure professionals to share ideas and best practices

Portland, OR – Feb. 17, 2009 – Tripwire, Inc. today announced a major new initiative for virtual infrastructure (VI) professionals, which includes Tripwire OpsCheck™, a free tool to manage VMware VMotion, and an online community for VI administrators. Tripwire OpsCheck assesses common configuration problems that may prevent VMotion from operating properly, and provides troubleshooting tips for configuring VMotion based on Tripwire OpsCheck test results. To download Tripwire OpsCheck, go to www.vwire.com.

To further support the needs of VI professionals, Tripwire has unveiled vWire.com, an online community built around the concerns of VI professionals. Virtualization administrators, engineers and architects are invited to join the community and conversation to share best practices, network, and gain new resources and tools. For more information about the forum, visit www.vwire.com.

“Virtualization professionals are faced with unknown territory, requiring new tools to manage the complexities and risks of virtual environments,” said Dan Schoenbaum, chief operating officer of products, Tripwire. “That’s why Tripwire is committed to developing utilities specifically for virtualization, such as OpsCheck and ConfigCheck, and to creating a forum where VI professionals can share their experiences and knowledge.”

Tripwire ConfigCheck, released in 2008, provides an immediate assessment of the configurations of a VMware ESX hypervisor, comparing them against VMware hardening security guidelines, and then providing remediation instructions if any are needed. ConfigCheck is also available for free and can be downloaded at www.vwire.com.

About Tripwire, Inc.
Tripwire helps over 6,500 enterprises worldwide reduce security risk, attain compliance and increase operational efficiency across virtual and physical environments. With its industry leading configuration assessment and change auditing software solutions, IT organizations achieve and maintain configuration control. Tripwire is headquartered in Portland, Ore. with offices worldwide. http://www.tripwire.com/.

Great iSCSI info!

January 27th, 2009

I’ve been using Openfiler 2.2 iSCSI in the lab for a few years with great success as a means for shared storage. Shared storage with VMware ESX/ESXi (along with the necessary licensing) allows us great things like VMotion, DRS, HA, etc. I’ve recently been kicking the tires of Openfiler 2.3 and have been anxious to implement partly due to the ease in its menu driven NIC bonding feature which I wanted to leverage for maximum disk I/O throughput.

Coincidentally, just yesterday a few of the big brains in the storage industry got together and published what I consider one of the best blog entries in the known universe. Chad Sakac and David Black (EMC), Andy Banta (VMware), Vaughn Stewart (NetApp), Eric Schott (Dell/EqualLogic), Adam Carter (HP/Lefthand) all conspired.

One of the iSCSI topics they cover is link aggregation over Ethernet. I read and re-read this section with great interest. My current swiSCSI configuration in the lab consists of a single 1Gb VMKernel NIC (along with a redundant failover NIC) connected to a single 1Gb NIC in the Openfiler storage box having a single iSCSI target with two LUNs. I’ve got more 1Gb NICs that I can add to the Openfiler storage box, so my million dollar question was “will this increase performance?” The short answer is NO with my current configuration. Although the additional NIC in the Openfiler box will provide a level of hardware redundancy, due to the way ESX 3.x iSCSI communicates with the iSCSI target, only a single Ethernet path will be used by ESX to communicate with the single target backed by both LUNs.

However, what I can do to add more iSCSI bandwidth is to add the 2nd Gb NIC in the Openfiler box along with an additional IP address, and then configure an additional iSCSI target so that each LUN is mapped to a separate iSCSI target.  Adding the additional NIC in the Openfiler box for hardware redundancy is a no brainer and I probably could have done that long ago, but as far as squeezing more performance out of my modest iSCSI hardware, I’m going to perform some disk I/O testing to see if the single Gb NIC is a disk I/O bottleneck.  I may not have enough horsepower under the hood of the Openfiler box to warrant going through the steps of adding additional iSCSI targets and IP addressing.

A few of the keys I extracted from the blog post are as follows:

“The core thing to understand (and the bulk of our conversation – thank you Eric and David) is that 802.3ad/LACP surely aggregates physical links, but the mechanisms used to determine whether a given flow of information follows one link or another are critical.

Personally, I found this doc very clarifying.: http://www.ieee802.org/3/hssg/public/apr07/frazier_01_0407.pdf

You’ll note several key things in this doc:

* All frames associated with a given “conversation” are transmitted on the same link to prevent mis-ordering of frames. So what is a “conversation”? A “conversation” is the TCP connection.
* The link selection for a conversation is usually done by doing a hash on the MAC addresses or IP address.
* There is a mechanism to “move a conversation” from one link to another (for loadbalancing), but the conversation stops on the first link before moving to the second.
* Link Aggregation achieves high utilization across multiple links when carrying multiple conversations, and is less efficient with a small number of conversations (and has no improved bandwidth with just one). While Link Aggregation is good, it’s not as efficient as a single faster link.”

the ESX 3.x software initiator really only works on a single TCP connection for each target – so all traffic to a single iSCSI Target will use a single logical interface. Without extra design measures, it does limit the amount of IO available to each iSCSI target to roughly 120 – 160 MBs of read and write access.

“This design does not limit the total amount of I/O bandwidth available to an ESX host configured with multiple GbE links for iSCSI traffic (or more generally VMKernel traffic) connecting to multiple datastores across multiple iSCSI targets, but does for a single iSCSI target without taking extra steps.

Question 1: How do I configure MPIO (in this case, VMware NMP) and my iSCSI targets and LUNs to get the most optimal use of my network infrastructure? How do I scale that up?

Answer 1: Keep it simple. Use the ESX iSCSI software initiator. Use multiple iSCSI targets. Use MPIO at the ESX layer. Add Ethernet links and iSCSI targets to increase overall throughput. Set your expectation for no more than ~160MBps for a single iSCSI target.

Remember an iSCSI session is from initiator to target. If you use multiple iSCSI targets, with multiple IP addresses, you will use all the available links in aggregate, and the storage traffic in total will load balance relatively well. But any individual target will be limited to a maximum of a single GbE connection’s worth of bandwidth.

Remember that this also applies to all the LUNs behind that target. So, consider that as you distribute the LUNs appropriately among those targets.

The ESX initiator uses the same core method to get a list of targets from any iSCSI array (static configuration or dynamic discovery using the iSCSI SendTargets request) and then a list of LUNs behind that target (SCSI REPORT LUNS command).”
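As a back-of-the-envelope model of the guidance quoted above, aggregate throughput scales with whichever is smaller, the number of targets or the number of links.  This sketch assumes each target gets its own IP address and that targets spread evenly across links; the 120 MB/s per-link default is taken from the low end of the range quoted above and is a rough figure, not a measurement.

```python
def aggregate_iscsi_mbps(targets: int, links: int, per_link_mbps: float = 120.0) -> float:
    """Rough upper bound on aggregate iSCSI throughput in MB/s for ESX 3.x.

    Assumptions (mine, for illustration): one TCP connection per target,
    each connection pinned to one GbE link, targets spread evenly over links.
    """
    if targets < 0 or links < 0:
        raise ValueError("targets and links must be non-negative")
    active_paths = min(targets, links)
    return active_paths * per_link_mbps


if __name__ == "__main__":
    for targets, links in ((1, 1), (1, 2), (2, 2), (4, 2)):
        total = aggregate_iscsi_mbps(targets, links)
        print(f"{targets} target(s), {links} link(s): ~{total:.0f} MB/s")
```

The case that matters for my lab is one target with two links: the second NIC adds redundancy but, under this model, no bandwidth.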

Question 4: Do I use Link Aggregation and if so, how?

Answer 4: There are some reasons to use Link Aggregation, but increasing a throughput to a single iSCSI target isn’t one of them in ESX 3.x.

What about Link Aggregation – shouldn’t that resolve the issue of not being able to drive more than a single GbE for each iSCSI target? In a word – NO. A TCP connection will have the same IP addresses and MAC addresses for the duration of the connection, and therefore the same hash result. This means that regardless of your link aggregation setup, in ESX 3.x, the network traffic from an ESX host for a single iSCSI target will always follow a single link.
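The "same addresses, same hash, same link" point can be sketched in a few lines.  Real switches and NIC teaming policies use their own hash functions (often an XOR over address bits); `zlib.crc32` here is only a stand-in, and the IP addresses are hypothetical.

```python
import zlib


def select_link(src_ip: str, dst_ip: str, link_count: int) -> int:
    """Pick a link index for a conversation by hashing its endpoints.

    crc32 is a stand-in for whatever hash the switch or team actually uses.
    """
    if link_count < 1:
        raise ValueError("link_count must be at least 1")
    key = f"{src_ip}->{dst_ip}".encode("ascii")
    return zlib.crc32(key) % link_count


if __name__ == "__main__":
    host = "10.0.0.10"                       # hypothetical VMkernel IP
    targets = ["10.0.0.50", "10.0.0.51"]     # hypothetical iSCSI target IPs

    for target in targets:
        # The inputs never change for the life of the connection,
        # so every frame of that conversation lands on the same link.
        links_used = {select_link(host, target, link_count=2) for _ in range(1000)}
        print(f"{host} -> {target}: link(s) used = {sorted(links_used)}")
```

Each source and destination pair only ever uses one link.  Adding target IP addresses gives the hash different inputs, which is what gives traffic a chance to spread across links.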

For swiSCSI users, they also mention some cool details about what’s coming in the next release of ESX/ESXi. Those looking for more iSCSI performance will want to pay attention. 10Gb Ethernet is also going to be a game changer, further threatening fibre channel SAN technologies.

I can’t stress enough how neat and informative this article is. To boot, technology experts from competing storage vendors pooled their knowledge for the greater good. That’s just awesome!

Uptime lost during VMotion

January 11th, 2009

That’s right. We lose uptime during every VMotion. Relax just a bit – I’m not talking about actual uptime/downtime availability of the guest VM in the datacenter. I’m speaking to the uptime performance metric tracked in the VirtualCenter database. It’s a bug that was introduced in VirtualCenter 2.0 and has remained in the code to this day in VirtualCenter 2.5.0 Update 3. Here’s how it works:

We’ve got a VM named Exchange1. VirtualCenter displays its uptime as 28 days, as indicated by the screenshots below:

1-11-2009 12-04-06 AM

1-11-2009 12-10-14 AM

Now the VM has been VMotioned from host SOLO to host LANDO. Notice what has happened to uptime. It has disappeared from the console:

1-11-2009 12-07-37 AM

The real proof of what has ultimately happened is in the performance chart, where the latest uptime metric has been reset from 27.99 days as shown above to 0.0026620 days:

1-11-2009 12-10-54 AM
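Fractional days are hard to read at a glance, so here is a small conversion sketch using only the Python standard library.  The helper name is mine.

```python
from datetime import timedelta


def uptime_days_to_timedelta(days: float) -> timedelta:
    """Convert an uptime value expressed in days into a timedelta."""
    return timedelta(days=days)


if __name__ == "__main__":
    before = uptime_days_to_timedelta(27.99)
    after = uptime_days_to_timedelta(0.0026620)
    print("before VMotion:", before)
    print("after VMotion: ", after, f"({after.total_seconds():.0f} seconds)")
```

The post-VMotion value works out to about 230 seconds, or just under four minutes, which lines up with the counter restarting at the time of the migration.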

Sometimes the VIC console will show the uptime counter start over at 0 days, then on to 1 days, etc. Other times the uptime counter will remain blank for days or weeks as you can see from my three other VMs in the first screenshot which show no uptime.

This brings us to an interesting discussion. What would you like uptime in the VIC to mean exactly? Following are my observations and thoughts on VMware’s implementation of the uptime metric in VirtualCenter.

In previous versions of VirtualCenter, a soft reboot of the VM inside of the OS would reset the uptime statistic in VirtualCenter. I believe this was a function of VMware Tools that triggered this.

Today in VirtualCenter 2.5.0 Update 3, a soft reboot inside the guest VM does not reset the uptime statistic back to zero.

A VM which has no VMware Tools installed that is soft rebooted inside of the OS (i.e., we’re not talking about any VMware console power operation here) does not reset the uptime statistic.

I could see the community take a few different sides on this as there are two variations of the definition of uptime we’re dealing with here. Uptime of the guest VM OS and uptime of the VM’s virtual hardware.

  1. Should uptime translate into how long the VM virtual hardware has been powered on from a virtual infrastructure standpoint?
  2. Or should uptime translate into how long the OS inside the VM has been up, tracked by VMware Tools?

The VMware administrator cares about the length of time a VM has been powered on. It is the powered on VM that consumes resources from the four resource food groups and impacts capacity.

The guest VM OS administrator, whether it be Windows or Linux, cares about uptime of the guest OS. The owner of the OS is held to SLAs by the business lines.

My personal opinion is that the intended use of the Virtual Infrastructure Client is for the VMware administrator and thus should reflect virtual infrastructure information. My preference is that the uptime statistic in VirtualCenter tracks power operations of the VM regardless of any reboot sequences of the OS inside the VM. In other words, uptime is not impacted by VMware Tools heartbeats or reboots inside the guest VM. The uptime statistic should only be reset when the VM is powered off or power reset (including instances where HA has recovered a VM).
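To make that preference concrete, here is a minimal sketch of the behavior I’d like to see.  This is not how VirtualCenter is implemented; the event names and the `UptimeTracker` class are hypothetical and exist only to illustrate which events should and should not reset the counter.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class VmEvent(Enum):
    POWER_ON = auto()
    POWER_OFF = auto()
    POWER_RESET = auto()
    HA_RESTART = auto()
    GUEST_REBOOT = auto()
    VMOTION = auto()


# Events that restart the uptime counter under the proposed policy.
RESTART_EVENTS = {VmEvent.POWER_ON, VmEvent.POWER_RESET, VmEvent.HA_RESTART}


@dataclass
class UptimeTracker:
    """Tracks uptime of the VM's virtual hardware, not of the guest OS."""
    powered_on_at: Optional[float] = None  # timestamp in seconds, None when powered off

    def handle(self, event: VmEvent, now: float) -> None:
        if event in RESTART_EVENTS:
            self.powered_on_at = now
        elif event is VmEvent.POWER_OFF:
            self.powered_on_at = None
        # GUEST_REBOOT and VMOTION deliberately leave the counter untouched.

    def uptime_seconds(self, now: float) -> float:
        if self.powered_on_at is None:
            return 0.0
        return now - self.powered_on_at


if __name__ == "__main__":
    vm = UptimeTracker()
    vm.handle(VmEvent.POWER_ON, now=0.0)
    vm.handle(VmEvent.VMOTION, now=100.0)
    vm.handle(VmEvent.GUEST_REBOOT, now=150.0)
    print("uptime after VMotion and guest reboot:", vm.uptime_seconds(now=200.0))
    vm.handle(VmEvent.POWER_RESET, now=300.0)
    print("uptime after power reset:", vm.uptime_seconds(now=350.0))
```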

At any rate, due to the bug that uptime has in VirtualCenter 2.0 and above, it’s a fairly unreliable performance metric for any virtual infrastructure using VMotion and DRS. Furthermore, the term itself can be misleading depending on the administrator’s interpretation of uptime versus what’s written in the VirtualCenter code.

I submitted a post in VMware’s Product and Feature Suggestions forum in January of 2007 recording the uptime reset on VMotion issue. As this problem periodically bugs me, I followed up a few times. Once in a follow up post in the thread above, and at least one time externally requesting someone from VMware take a look at it. Admittedly I do not have an SR open.

VMware, can we get this bug fixed? After all, if the hypervisor has become an everyday commodity item leaving the management tools as the real meat and potatoes, you should make sure our management tools work properly.

Thank you,

Jas