VMware Tools causes virtual machine snapshot with quiesce error

July 30th, 2016

Last week I was made aware of an issue a customer in the field was having with a data protection strategy using array-based snapshots, which in turn leveraged VMware vSphere snapshots with VSS quiescing of Windows VMs. The problem began after installing VMware Tools version 10.0.0 build-3000743 (reported as version 10240 in the vSphere Web Client), which I believe is the version shipped in ESXi 6.0 Update 1b (reported as version 6.0.0, build 3380124 in the vSphere Web Client).

The issue is that creating a VMware virtual machine snapshot with VSS integration fails. The virtual machine disk configuration is simply two .vmdks on a VMFS-5 datastore but I doubt the symptoms are limited only to that configuration.

The failure message shown in the vSphere Web Client is “Cannot quiesce this virtual machine because VMware Tools is not currently available.”  The vmware.log file for the virtual machine also shows the following:

2016-07-29T19:26:47.378Z| vmx| I120: SnapshotVMX_TakeSnapshot start: 'jgb', deviceState=0, lazy=0, logging=0, quiesced=1, forceNative=0, tryNative=1, saveAllocMaps=0 cb=1DE2F730, cbData=32603710
2016-07-29T19:26:47.407Z| vmx| I120: DISKLIB-LIB_CREATE : DiskLibCreateCreateParam: vmfsSparse grain size is set to 1 for '/vmfs/volumes/51af837d-784bc8bc-0f43-e0db550a0c26/rmvm02/rmvm02-000001.
2016-07-29T19:26:47.408Z| vmx| I120: DISKLIB-LIB_CREATE : DiskLibCreateCreateParam: vmfsSparse grain size is set to 1 for '/vmfs/volumes/51af837d-784bc8bc-0f43-e0db550a0c26/rmvm02/rmvm02_1-00000
2016-07-29T19:26:47.408Z| vmx| I120: SNAPSHOT: SnapshotPrepareTakeDoneCB: Prepare phase complete (The operation completed successfully).
2016-07-29T19:26:56.292Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox timed out.
2016-07-29T19:27:07.790Z| vcpu-0| I120: Tools: Tools heartbeat timeout.
2016-07-29T19:27:11.294Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox timed out.
2016-07-29T19:27:17.417Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox timed out.
2016-07-29T19:27:17.417Z| vmx| I120: Msg_Post: Warning
2016-07-29T19:27:17.417Z| vmx| I120: [msg.snapshot.quiesce.rpc_timeout] A timeout occurred while communicating with VMware Tools in the virtual machine.
2016-07-29T19:27:17.417Z| vmx| I120: ----------------------------------------
2016-07-29T19:27:17.420Z| vmx| I120: Vigor_MessageRevoke: message 'msg.snapshot.quiesce.rpc_timeout' (seq 10949920) is revoked
2016-07-29T19:27:17.420Z| vmx| I120: ToolsBackup: changing quiesce state: IDLE -> DONE
2016-07-29T19:27:17.420Z| vmx| I120: SnapshotVMXTakeSnapshotComplete: Done with snapshot 'jgb': 0
2016-07-29T19:27:17.420Z| vmx| I120: SnapshotVMXTakeSnapshotComplete: Snapshot 0 failed: Failed to quiesce the virtual machine (31).
2016-07-29T19:27:17.420Z| vmx| I120: VigorTransport_ServerSendResponse opID=ffd663ae-5b7b-49f5-9f1c-f2135ced62c0-95-ngc-ea-d6-adfa seq=12848: Completed Snapshot request.
2016-07-29T19:27:26.297Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox timed out.
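
If you need to check many VMs for the same symptom, the log signature above is easy to scan for. The following is a minimal sketch in Python, assuming the marker strings seen in this one excerpt are representative; the function name and the marker list are my own and are not part of any VMware tooling.

```python
import re

# Substrings taken from the log excerpt above that point at a quiesce failure.
# This list is an assumption based on one log, not an exhaustive signature.
QUIESCE_MARKERS = (
    "GuestRpcSendTimedOut",
    "Tools heartbeat timeout",
    "msg.snapshot.quiesce.rpc_timeout",
    "Failed to quiesce the virtual machine",
)

# vmware.log lines start with an ISO-8601 UTC timestamp followed by "|".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z)\|")


def find_quiesce_failures(lines):
    """Return (timestamp, marker) pairs for lines containing a known marker."""
    hits = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip lines that do not look like log entries
        for marker in QUIESCE_MARKERS:
            if marker in line:
                hits.append((match.group(1), marker))
                break  # report each line once
    return hits


sample = [
    "2016-07-29T19:26:47.408Z| vmx| I120: SNAPSHOT: SnapshotPrepareTakeDoneCB: Prepare phase complete (The operation completed successfully).",
    "2016-07-29T19:26:56.292Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox timed out.",
    "2016-07-29T19:27:17.420Z| vmx| I120: SnapshotVMXTakeSnapshotComplete: Snapshot 0 failed: Failed to quiesce the virtual machine (31).",
]
for timestamp, marker in find_quiesce_failures(sample):
    print(timestamp, marker)
```

To use it against a real file, pass an open file handle (for example `open("vmware.log")`) in place of `sample`.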

After some digging, I found that VMware had released VMware Tools version 10.0.9 on June 6, 2016. The release notes indicate that the root cause has been identified and resolved.

Resolved Issues

Attempts to take a quiesced snapshot in a Windows Guest OS fails
Attempts to take a quiesced snapshot after booting a Windows Guest OS fails

After downloading VMware Tools version 10.0.9 build-3917699 (reported as version 10249 in the vSphere Web Client) and upgrading to it, the customer’s problem was resolved. Since the faulty version of VMware Tools was embedded in the templates the customer used to deploy virtual machines throughout the datacenter, a number of VMs needed their VMware Tools upgraded, as did the templates themselves.
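
With the faulty version baked into templates, the first job is finding everything that carries it. Here is a minimal sketch in Python, assuming you have already exported VM and template names with the Tools version number the vSphere Web Client reports; the inventory below is hypothetical, and the check only flags the one version this post ties to the problem.

```python
# Version number as reported by the vSphere Web Client, taken from this post:
# 10240 corresponds to VMware Tools 10.0.0 build-3000743 (10249 is the fixed 10.0.9).
AFFECTED_TOOLS_VERSION = 10240


def needs_tools_upgrade(inventory):
    """Return the sorted names whose reported Tools version is the affected one."""
    return sorted(
        name
        for name, version in inventory.items()
        if version == AFFECTED_TOOLS_VERSION
    )


# Hypothetical inventory: names and versions are made up for illustration.
inventory = {"template-win2012": 10240, "rmvm02": 10240, "rmvm03": 10249}
print(needs_tools_upgrade(inventory))
```

Whether other Tools versions are affected is not something this post establishes, so the sketch deliberately matches only 10240.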

vSphere 5.5 UNMAP Deep Dive

September 13th, 2013

One of the features that has been updated in vSphere 5.5 is UNMAP which is one of two sub-components of what I’ll call the fourth block storage based thin provisioning VAAI primitive (the other sub-component is thin provisioning stun).  I’ve already written about UNMAP a few times in the past.  It was first introduced in vSphere 5.0 two years ago.  A few months later the feature was essentially recalled by VMware.  After it was re-released by VMware in 5.0 Update 1, I wrote about its use here and followed up with a short piece about the .vmfsBalloon file here.

For those unfamiliar, UNMAP is a space reclamation mechanism used to return blocks of storage back to the array after data which was once occupying those blocks has been moved or deleted.  The common use cases are deleting a VM from a datastore, Storage vMotion of a VM from a datastore, or consolidating/closing vSphere snapshots on a datastore.  All of these operations, in the end, involve deleting data from pinned blocks/pages on a volume.  Without UNMAP, these pages, albeit empty and available for future use by vSphere and its guests only, remain pinned to the volume/LUN backing the vSphere datastore.  The pages are never returned to the array for use with another LUN or another storage host.  Notice I did not mention shrinking a virtual disk or a datastore – neither of those operations is supported by VMware.  I also did not mention the use case of deleting data from inside a virtual machine – while that is not supported, I believe there is a VMware fling for experimental use.  In summary, UNMAP extends the usefulness of thin provisioning at the array level by maintaining storage efficiency throughout the life cycle of the vSphere environment and the array which supports the UNMAP VAAI primitive.

On the Tuesday during VMworld, Cormac Hogan launched his blog post introducing new and updated storage related features in vSphere 5.5.  One of those features he summarized was UNMAP.  If you haven’t read his blog, I’d definitely recommend taking a look – particularly if you’re involved with vSphere storage.  I’m going to explore UNMAP in a little more detail.

The most obvious change to point out is the command line itself used to initiate the UNMAP process.  In previous versions of vSphere, the command issued on the vSphere host was:

vmkfstools -y x (where x represents the % of storage to unmap)

As Cormac points out, UNMAP has been moved to the esxcli namespace in vSphere 5.5 (think remote scripting opportunities after XYZ process) where the basic command syntax is now:

esxcli storage vmfs unmap

In addition to the above, there are three switches available for use.  Of the first two listed below, one is required; the third is optional.

-l|--volume-label= The label of the VMFS volume to unmap the free blocks.

-u|--volume-uuid= The uuid of the VMFS volume to unmap the free blocks.

-n|--reclaim-unit= Number of VMFS blocks that should be unmapped per iteration.

Previously with vmkfstools, we’d change to the VMFS folder from which we were going to UNMAP blocks.  In vSphere 5.5, the esxcli command can be run from anywhere, so specifying the datastore name or the uuid is a required parameter for obvious reasons.  Using the datastore name, the new UNMAP command in vSphere 5.5 looks like this:

esxcli storage vmfs unmap -l 1tb_55ds

As for the optional parameter, the UNMAP command is an iterative process which continues through numerous cycles until complete.  The reclaim unit parameter specifies the quantity of blocks to unmap per iteration of the UNMAP process.  In previous versions of vSphere, VMFS-3 datastores could have block sizes of 1, 2, 4, or 8MB.  While upgrading a VMFS-3 datastore to VMFS-5 will maintain these block sizes, executing an UNMAP operation on a native net-new VMFS-5 datastore results in working with a 1MB block size only.  Therefore, if a reclaim unit value of 100 is specified on a VMFS-5 datastore with a 1MB block size, then 100MB of data will be returned to the available raw storage pool per iteration until all blocks marked available for UNMAP are returned.  Using a value of 100, the UNMAP command looks like this:

esxcli storage vmfs unmap -l 1tb_55ds -n 100
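
Since the command now lives in esxcli, it lends itself to scripting. Below is a minimal sketch in Python of assembling the argument list from the three switches described above. The function name is my own, and insisting on exactly one of label or uuid is an assumption of this sketch; the post only establishes that one of the two is required.

```python
def build_unmap_command(label=None, uuid=None, reclaim_unit=None):
    """Build the argument list for 'esxcli storage vmfs unmap'."""
    # Assumption of this sketch: accept a label or a uuid, but not both.
    if (label is None) == (uuid is None):
        raise ValueError("specify exactly one of label or uuid")
    command = ["esxcli", "storage", "vmfs", "unmap"]
    if label is not None:
        command += ["-l", label]
    else:
        command += ["-u", uuid]
    if reclaim_unit is not None:
        if reclaim_unit < 1:
            raise ValueError("reclaim_unit must be a positive block count")
        command += ["-n", str(reclaim_unit)]
    return command


# Reproduces the example command shown above.
print(" ".join(build_unmap_command(label="1tb_55ds", reclaim_unit=100)))
```

The list form can be handed to whatever remote execution mechanism you already use to reach the host; nothing here runs the command.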

If the reclaim unit value is unspecified when issuing the UNMAP command, the default reclaim unit value is 200, resulting in 200MB of data returned to the available raw storage pool per iteration assuming a 1MB block size datastore.
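
The arithmetic is simple enough to sketch. The Python below assumes each iteration reclaims exactly one reclaim unit until the free blocks run out, which is my simplification of the iterative process; the function name and the free block counts are illustrative only.

```python
import math


def unmap_plan(free_blocks, reclaim_unit=200, block_size_mb=1):
    """Return (MB returned per iteration, number of iterations)."""
    if free_blocks < 0 or reclaim_unit < 1 or block_size_mb < 1:
        raise ValueError("arguments must be positive (free_blocks may be 0)")
    mb_per_iteration = reclaim_unit * block_size_mb
    # Assumption: the last iteration may be partial, so round up.
    iterations = math.ceil(free_blocks / reclaim_unit)
    return mb_per_iteration, iterations


print(unmap_plan(1000, reclaim_unit=100))      # 1MB blocks, -n 100
print(unmap_plan(1000))                        # default reclaim unit of 200
print(unmap_plan(1000, 100, block_size_mb=8))  # upgraded VMFS-3 with 8MB blocks
```

Note how the same reclaim unit moves eight times as much data per iteration on an upgraded datastore that kept an 8MB block size.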

One additional piece to note on the CLI topic is that in a release candidate build I was working with, while the old vmkfstools -y command is deprecated, it appears to still exist but with newer vSphere 5.5 functionality published in the --help section:

vmkfstools vmfsPath -y --reclaimBlocks vmfsPath [--reclaimBlocksUnit #blocks]

The next change involves the hidden temporary balloon file (refer to my link at the top if you’d like more information about the balloon file but basically it’s a mechanism used to guarantee blocks targeted for UNMAP are not in the interim written to by an outside I/O request until the UNMAP process is complete).  It is no longer named .vmfsBalloon.  The new name is .asyncUnmapFile as shown below.

/vmfs/volumes/5232dd00-0882a1e4-e918-0025b3abd8e0 # ls -l -h -A
total 998408
-r--------    1 root     root      200.0M Sep 13 10:48 .asyncUnmapFile
-r--------    1 root     root        5.2M Sep 13 09:38 .fbb.sf
-r--------    1 root     root      254.7M Sep 13 09:38 .fdc.sf
-r--------    1 root     root        1.1M Sep 13 09:38 .pb2.sf
-r--------    1 root     root      256.0M Sep 13 09:38 .pbc.sf
-r--------    1 root     root      250.6M Sep 13 09:38 .sbc.sf
drwx------    1 root     root         280 Sep 13 09:38 .sdd.sf
drwx------    1 root     root         420 Sep 13 09:42 .vSphere-HA
-r--------    1 root     root        4.0M Sep 13 09:38 .vh.sf
/vmfs/volumes/5232dd00-0882a1e4-e918-0025b3abd8e0 #

As discussed in the previous section, use of the UNMAP command now specifies the actual size of the temporary file instead of the temporary file size being determined by a percentage of space to return to the raw storage pool.  This is an improvement in part because it helps avoid the catastrophe that could occur if UNMAP tried to remove 2TB+ in a single operation (discussed here).

VMware has also enhanced the functionality of the temporary file.  A new kernel interface in ESXi 5.5 allows the user to ask for blocks beyond a specified block address in the VMFS file system.  This ensures that the blocks allocated to the temporary file were never allocated to the temporary file previously.  The benefit realized in the end is that a temporary file of any size can be created, and with UNMAP issued to the blocks allocated to the temporary file, we can rest assured that we can issue UNMAP on all free blocks on the datastore.

Going a bit deeper and adding to the efficiency, VMware has also enhanced UNMAP to support multiple block descriptors.  Compared to vSphere 5.1 which issued just one block descriptor per UNMAP command, vSphere 5.5 now issues up to 100 block descriptors depending on the storage array (these identifying capabilities are specified internally in the Block Limits VPD (B0) page).
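
To get a feel for what multiple block descriptors buy, here is a back-of-the-envelope sketch in Python. It assumes each contiguous extent to reclaim needs one descriptor and that commands are packed full, which is a simplification on my part; the real per-array limit comes from the Block Limits VPD page and is treated here as a plain input.

```python
import math


def unmap_commands_needed(extent_count, max_descriptors_per_command):
    """Return how many UNMAP commands are needed to cover extent_count extents."""
    if extent_count < 0 or max_descriptors_per_command < 1:
        raise ValueError("invalid extent count or descriptor limit")
    return math.ceil(extent_count / max_descriptors_per_command)


extents = 1000  # hypothetical number of extents to reclaim
print(unmap_commands_needed(extents, 1))    # one descriptor per command (vSphere 5.1)
print(unmap_commands_needed(extents, 100))  # up to 100 descriptors (vSphere 5.5)
```

Under these assumptions, the same reclaim takes two orders of magnitude fewer commands when the array accepts the full 100 descriptors.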

A look at the asynchronous and iterative vSphere 5.5 UNMAP logical process:

  1. User or script issues esxcli UNMAP command
  2. Does the array support VAAI UNMAP?  yes=3, no=end
  3. Create .asyncUnmapFile on root of datastore
  4. .asyncUnmapFile created and locked? yes=5, no=end
  5. Issue IOCTL to allocate reclaim-unit blocks of storage on the volume past the previously allocated block offset
  6. Did the previous block allocation succeed? yes=7, no=remove lock file and retry step 6
  7. Issue UNMAP on all blocks allocated above in step 5
  8. Remove the lock file
  9. Did we reach the end of the datastore? yes=end, no=3
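
The loop above can be sketched as a toy simulation in Python. This is not VMware's implementation: the datastore is modelled as a range of block numbers, the temporary file as a list, and the allocation failure and retry path in step 6 is left out. All names are my own.

```python
def simulate_unmap(total_blocks, used_blocks, reclaim_unit=200,
                   array_supports_unmap=True):
    """Return the free blocks unmapped in each iteration of the toy loop."""
    if not array_supports_unmap:          # step 2: no VAAI UNMAP, nothing to do
        return []
    iterations = []
    offset = 0                            # previously allocated block offset
    while offset < total_blocks:          # step 9: stop at the end of the datastore
        temp_file = []                    # steps 3-4: create and lock the temp file
        cursor = offset
        # Step 5: allocate up to reclaim_unit free blocks past the previous offset.
        while cursor < total_blocks and len(temp_file) < reclaim_unit:
            if cursor not in used_blocks:
                temp_file.append(cursor)
            cursor += 1
        if temp_file:                     # step 7: UNMAP the blocks just allocated
            iterations.append(temp_file)
        offset = cursor                   # step 8: release the file, remember offset
    return iterations


# A 10-block datastore with blocks 2 and 3 in use, reclaiming 3 blocks at a time.
for blocks in simulate_unmap(10, {2, 3}, reclaim_unit=3):
    print(blocks)
```

Because each pass starts past the previous offset, no block is handed to the temporary file twice, and every free block is visited exactly once by the time the loop ends.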

From a performance perspective, executing the UNMAP command in my vSphere 5.5 RC lab showed peak write I/O of around 1,200MB/s with an average of around 200 IOPS comprised of a 50/50 mix of read/write.  The UNMAP I/O pattern is a bit hard to gauge because with the asynchronous iterative process, it seemed to do a bunch of work, rest, do more work, rest, and so on.  Sorry, no screenshots at the moment.  Perhaps the most notable takeaway from the performance section is that as of vSphere 5.5, VMware is lifting the recommendation of only running UNMAP during a maintenance window.  Keep in mind this is just a recommendation.

I encourage vSphere 5.5 customers to test UNMAP in their lab first using various reclaim unit sizes.  While doing this, examine performance impacts to the storage fabric, the storage array (look at both front end and back end), as well as other applications sharing the array.  Remember that fundamentally the UNMAP command is only going to provide a benefit AFTER its associated use cases have occurred (mentioned at the top of the article).  Running UNMAP on a volume which has no pages to be returned will be a waste of effort.  Once you’ve become comfortable with using UNMAP and understanding its impacts in your environment, consider running it on a recurring schedule – perhaps weekly.  It really depends on how much the use cases apply to your environment.  Many vSphere backup solutions leverage vSphere snapshots, which is one of the use cases.  Although it could be said there are large gains to be made with UNMAP in this case, keep in mind that backups run regularly, and space that is returned to raw storage with UNMAP will likely be consumed again in the following backup cycle when vSphere snapshots are created once again.

To wrap this up, customers who have block arrays supporting the thin provision VAAI primitive will be able to use UNMAP in vSphere 5.5 environments (for storage vendors, both sub-components are required to certify for the primitive as a whole on the HCL).  This includes Dell Compellent customers with a current version of Storage Center firmware.  Customers who use array based snapshots with extended retention periods should keep in mind that while UNMAP will work against active blocks, it may not work with blocks maintained in a snapshot.  This is to honor the snapshot based data protection retention.

Veeam Launches Backup & Replication v7

August 22nd, 2013

Data protection, data replication, and data recovery are challenging.  Consolidation through virtualization has forced customers to retool automated protection and recovery methodologies in the datacenter and at remote DR sites.

For VMware environments, Veeam has been with customers helping them every step of the way with their flagship Backup & Replication suite.  Once just a simple backup tool, it has evolved into an end to end solution for local agentless backup and restore with application item intelligence as well as a robust architecture to fulfill the requirements of replicating data offsite and providing business continuation while meeting aggressive RPO and RTO metrics.  Recent updates have also bridged the gap for Hyper-V customers, rounding out the majority of x86 virtualized datacenters.

But don’t take their word for it.  Talk to one of their 200,000+ customers – for instance myself.  I’ve been using Veeam in the lab for well over five years to achieve nightly backups of not only my ongoing virtualization projects, but my growing family’s photos, videos, and sensitive data as well.  I also tested, purchased, and implemented it in a previous position to facilitate the migration of virtual machines from one large datacenter to another via replication.  In December of 2009, I was also successful in submitting a VCDX design to VMware incorporating Veeam Backup & Replication, and followed up in February 2010 by successfully defending that design.

Veeam is proud to announce another major milestone bolstering their new Modern Data Protection campaign – version 7 of Veeam Backup & Replication.  In this new release, extensive R&D yields 10x faster performance as well as many new features such as built-in WAN acceleration, backup from storage snapshots, long requested support for tape, and a solid data protection solution for vCloud Director.  Value was added for Hyper-V environments as well – SureBackup automated verification support, Universal Application Item Recovery, as well as the on-demand Sandbox.  Aside from the vCD support, one of the new features I’m interested in looking at is parallel processing of virtual machine backups.  It’s a fact that with globalized business, backup windows have shrunk while data footprints have grown exponentially.  Parallel VM and virtual disk backup, refined compression algorithms, and 64-bit backup repository architecture will go a long way to meet global business challenges.

v7 available now.  Check it out!

This will likely be my last post until VMworld.  I’m looking forward to seeing everyone there!

The .vmfsBalloon File

July 1st, 2013

One year ago, I wrote a piece about thin provisioning and the role that the UNMAP VAAI primitive plays in thin provisioned storage environments.  Here’s an excerpt from that article:

When the manual UNMAP process is run, it balloons up a temporary hidden file at the root of the datastore which the UNMAP is being run against.  You won’t see this balloon file with the vSphere Client’s Datastore Browser as it is hidden.  You can catch it quickly while UNMAP is running by issuing the ls -l -a command against the datastore directory.  The file will be named .vmfsBalloon along with a generated suffix.  This file will quickly grow to the size of data being unmapped (this is actually noted when the UNMAP command is run and evident in the screenshot above).  Once the UNMAP is completed, the .vmfsBalloon file is removed.

Has your curiosity ever got you wondering about the technical purpose of the .vmfsBalloon file?  It boils down to data integrity and timing.  At the time the UNMAP command is run, the balloon file is immediately instantiated and grows to occupy (read: hog) all of the blocks that are about to be unmapped.  It does this so that during the unmap process, none of the blocks are allocated during the process of new file creation elsewhere.  If you think about it, it makes sense – we just told vSphere to give these blocks back to the array.  If during the interim one or more of these blocks were suddenly allocated for a new file or file growth purposes and we then purged the block, we would have a data integrity issue.  More accurately, newly created data would be missing as its block or blocks were just flushed back to the storage pool on the array.
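
The reservation idea can be sketched with a toy block allocator in Python. This is an illustration of the principle only, not a model of VMFS internals; the class, the function, and the `during` hook (which stands in for an outside write arriving mid-UNMAP) are all hypothetical.

```python
class ToyDatastore:
    """A toy block allocator; not a model of real VMFS internals."""

    def __init__(self, total_blocks):
        self.free = set(range(total_blocks))
        self.files = {}  # file name -> set of block numbers

    def allocate(self, name, count):
        if count > len(self.free):
            raise RuntimeError("not enough free blocks")
        blocks = {self.free.pop() for _ in range(count)}
        self.files.setdefault(name, set()).update(blocks)
        return blocks

    def delete(self, name):
        self.free |= self.files.pop(name)


def unmap_with_balloon(datastore, count, during=None):
    """Reserve count blocks in a balloon file, 'unmap' them, then release them."""
    balloon = set(datastore.allocate(".vmfsBalloon", count))
    if during is not None:
        during()  # any allocation made here cannot land on balloon blocks
    # The blocks in 'balloon' are the ones that would be returned to the array.
    datastore.delete(".vmfsBalloon")
    return balloon


ds = ToyDatastore(8)
ds.allocate("vm.vmdk", 3)
late_write = {}
reclaimed = unmap_with_balloon(
    ds, 3, during=lambda: late_write.update(blocks=ds.allocate("new.vmdk", 2))
)
print(reclaimed.isdisjoint(late_write["blocks"]))
```

Because the balloon owns its blocks for the duration of the operation, the file created mid-UNMAP is forced onto other blocks, which is the data integrity guarantee described above.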

vSphere 5.1 Update 1 Update Sequence

May 6th, 2013

Not so long ago, VMware product releases were staggered.  Major versions of vSphere would launch at or shortly after VMworld in the fall, and all other products such as SRM, View, vCloud Director, etc. would rev on some other random schedule.  This was extremely frustrating for a vEvangelist because we wanted to be on the latest and greatest platform but lack of compatibility with the remaining bolt-on products held us back.

While this was a wet blanket for eager lab rats, it was a major complexity for production environments.  VMware understood this issue and at or around the vSphere 5.0 launch (someone correct me if I’m wrong here), all the development teams in Palo Alto synchronized their watches & revd product in essence at the same time.  This was great and it added the much needed flexibility for production environment migrations.  However, in a way it masked an issue which didn’t really exist before by virtue of product release staggering – a clear and understandable order of product upgrades.  That is why in March of 2012, I looked at all the product compatibility matrices and sort of came up with my own “cheat sheet” of product compatibility which would lend itself to an easy to follow upgrade path, at least for the components I had in my lab environment.

vSphere 5.1 Update 1 launched on 4/25/13 and along with it a number of other products were revd for compatibility.  To guide us on the strategic planning and tactical deployment of the new software bundles, VMware issued KB Article 2037630 Update sequence for vSphere 5.1 Update 1 and its compatible VMware products.

Not only does VMware provide the update sequencing information, but there also exists a complete set of links to specific product upgrade procedures and release notes which can be extremely useful for planning and troubleshooting.

The vCloud Suite continues to evolve providing agile and elastic infrastructure services for businesses around the globe in a way which makes IT easier and more practical for consumers but quite a bit more complex on the back end for those who must design, implement, and support it.  Visit the KB Article and give it 5 stars.  Let VMware know this is an extremely helpful type of collateral for those in the trenches.

vMA 5.1 Patch 1 Released

April 5th, 2013

Expendable news item here only worthy of a Friday post.  For those who may have missed it, VMware has released an update to the vSphere Management Assistant (vMA) 5.1 appliance formally referred to as Patch 1.  This release is documented in VMware KB 2044135 and the updated appliance bits can be downloaded here.  Log in, choose the VMware vSphere link, then the Drivers & Tools tab.

Patch 1 bundles with it the following enhancements:

  • The base operating system is updated to SUSE Linux Enterprise Server 11 SP2 (12-Jan-2013).
  • JRE is updated to JRE 1.6.0_41, which includes several critical fixes.
  • VMware Tools is updated to 8.3.17 (build 870839).
  • A resxtop connection failure issue has been fixed.
    In vMA 5.1, resxtop SSL verification checks have been enabled. This might cause resxtop to fail when connecting to hosts and display an exception message similar to the following:
    HTTPS_CA_FILE or HTTPS_CA_DIR not set.
    This issue is fixed through this patch.