Posts Tagged ‘Storage’

Yet another blog post about vSphere HA and PDL

July 14th, 2014

If you ended up here searching for information on PDL or APD, your evening or weekend plans may be cancelled at this point and I’m sorry for you if that is the case. There are probably 101 or more online resources which discuss the interrelated vSphere storage topics of All Paths Down (known as APD), Permanent Device Loss (known as PDL), and vSphere High Availability (known as HA, and before dinosaurs roamed the Earth – DAS). To put it in perspective, I’ve quickly pulled together a short list of resources below using Google. I’ve read most of them:

VMware KB: Permanent Device Loss (PDL) and All-Paths

VMware KB: PDL AutoRemove feature in vSphere 5.5

Handling the All Paths Down (APD) condition – VMware Blogs

vSphere 5.5. Storage Enhancements Part 9 – PDL

Permanent Device Loss (PDL) enhancements in vSphere 5.0

APD (All Paths Down) and PDL (Permanent Device Loss

vSphere Metro Storage Cluster solutions and PDL’s

vSphere Metro Stretched Cluster with vSphere 5.5 and PDL

Change in Permanent Device Loss (PDL) behavior for 5.1

PDL AutoRemove – CormacHogan.com

How handle the APD issue in vSphere – vInfrastructure Blog

Interpreting SCSI sense codes in VMware ESXi and ESX

What’s New in VMware vSphere® 5.1 – Storage

vSphere configuration for handling APD/PDL – CloudXC

vSphere 5.1 Storage Enhancements – Part 4: All Paths Down

vSphere 5.5 nuggets: changes to disk – Yellow Bricks

ESXi host disk.terminateVMOnPDLDefault configuration

ESXi host VMkernel.Boot.terminateVMOnPDL configuration

vSphere HA in my opinion is a great feature. It has saved my back side more than once both in the office and at home. Several books have been more or less dedicated to the topic and yet it is so easy to use that an entire cluster and all of its running virtual machines can be protected with default parameters (common garden variety) with just two mouse clicks.

VMware’s roots began with compute virtualization so when HA was originally released in VMware Virtual Infrastructure 3 (one major revision before it became the vSphere platform known today), the bits licensed and borrowed from Legato Automated Availability Manager (AAM) were designed to protect against marginal but historically documented amounts of x86 hardware failure thereby reducing unplanned downtime and loss of virtualization capacity to a minimum. Basically if an ESX host yields to issues relating to CPU, memory, or network, VMs restart somewhere else in the cluster.

It wasn’t really until vSphere 5.0 that VMware began building in high availability for storage aside from legacy design components such as redundant fabrics, host bus adapters (HBAs), multipath I/O (MPIO), failback policies, and with vSphere 4.0 the pluggable storage architecture (PSA) although this is not to say that any of these design items are irrelevant today – quite the opposite.  vSphere 5.0 introduced Permanent Device Loss (PDL) which does a better job of handling unexpected loss of individual storage devices than APD solely did.  Subsequent vSphere 5.x revisions made further PDL improvements such as improving support for single LUN:single target arrays in 5.1. In short, the new vSphere HA re-write (Legato served its purpose and is gone now) covers much of the storage gap such that in the event of certain storage related failures, HA will restart virtual machines, vApps, services, and applications somewhere else – again to minimize unplanned downtime. Fundamentally, this works just like HA when a vSphere host tips over, but instead the storage tips over and HA is called to action. Note that HA can’t do much about an entire unfederated array failing – this is more about individual storage/host connectivity. Aside from gross negligence on the part of administrators, I believe the failure scenarios are more likely to resonate with non-uniform stretched or metro cluster designs. However, PDL can also occur in small intra datacenter designs as well.

I won’t go into much more detail about the story that has unfolded with APD and the new features in vSphere 5.x because it has already been documented many times over in some of the links above.  Let’s just say the folks starting out new with vSphere 5.1 and 5.5 had it better than myself and many others did dealing with APD and hostd going dark. However, the trade off for them is they are going to have to deal with Software Defined * a lot longer than I will.

Although I mentioned earlier that vSphere HA is extremely simple to configure, I did also mention that was with default options which cover a large majority of the host related failures.  Configuring HA to restart VMs automatically and with no user intervention in the event of a PDL condition in theory is just one configuration change for each host in the cluster. Where to configure depends on the version of vSphere host.

vSphere 5.0u1+/5.1: Disk.terminateVMOnPDLDefault = True (/etc/vmware/settings file on each host)

or

vSphere 5.5+: VMkernel.Boot.terminateVMOnPDL = yes (advanced setting on each host, check the box)
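As a rough way to audit the pre-5.5 setting across hosts, here is a minimal Python sketch that scans the text of a settings file for the key. It assumes the file holds simple `key = value` lines, optionally with quotes around the value; that layout is my assumption, and the function name is hypothetical.

```python
import re

# Assumed layout: one "key = value" pair per line, value optionally quoted.
_SETTING_RE = re.compile(
    r'^\s*disk\.terminateVMOnPDLDefault\s*=\s*"?(\w+)"?\s*$',
    re.IGNORECASE,
)


def terminate_vm_on_pdl_enabled(settings_text: str) -> bool:
    """Return True if the setting is present and set to a true-like value."""
    for line in settings_text.splitlines():
        match = _SETTING_RE.match(line)
        if match:
            return match.group(1).lower() in ("true", "yes", "1")
    # Key not found: treat as not configured.
    return False
```

A check like this only tells you what is written in the file. It says nothing about whether the host has picked the value up.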

One thing about this configuration that had me chasing sense codes in vmkernel logs recently was lack of clarity on the required host reboot. That’s mainly what prompted this article – I normally don’t cover something that has already been covered well by other writers unless there is something I can add, something was missed, or it has caused me personal pain (my blog + SEO = helps ensure I don’t suffer from the same problems twice). In all of the online articles I had read about these configurations, none mentioned a host reboot requirement and it’s not apparent that a host reboot is required until PDL actually happens and automatic VM restart via HA actually does not. The vSphere 5.5 documentation calls it out. Go figure. I’ll admit that sometimes I will refer to a reputable vMcBlog before the product documentation. So let the search engine results show: when configuring  VMkernel.Boot.terminateVMOnPDL a host reboot or restart is required. VMware KB 1038578 also calls out that as of vSphere 5.5 you must reboot the host for VMkernel.boot configuration changes to take effect. I’m not a big fan of HA or any configuration being written into VMkernel.boot requiring host or VSAN node performance/capacity outages when a change is made but that is VMware Engineering’s decision and I’m sure there is a relevant reason for it aside from wanting more operational parity with the Windows operating system.

I’ll also reiterate Duncan Epping’s recommendation that if you’re already licensed for HA and have made the design and operational decision to allow HA to restart VMs in the event of a host failure, then the above configuration should be made on all vSphere clustered hosts, whether they are part of a stretched cluster or not to protect against storage related failures. A PDL can be broken down to one host losing all available paths to a LUN. By not making the HA configuration change above, a storage related failure results in user intervention required to recover all of the virtual machines on the host tied to the failed device.

Lastly, it is mentioned in some of the links above but if this is your first reading on the subject, please allow me to point out that the configuration setting above is for Permanent Device Loss (PDL) conditions only. It is not meant to handle an APD event. The reason behind this is that the storage array is required to send a proper sense code to the vSphere host indicating a PDL condition.  If the entire array fails or is powered off ungracefully taking down all available paths to storage, it has no chance to send PDL sense codes to vSphere.  This would constitute an indefinite All Paths Down or APD condition where vSphere knows storage is unavailable, but is unsure about its return. PDL was designed to answer that question for vSphere, rather than let vSphere go on wondering about it for a long period of time, thus squandering any opportunities to proactively do something about it.

In reality there are a few other configuration settings (again documented well in the links above) which fine-tune HA more precisely. You’ll almost always want to add these as well.

vSphere 5.0u1+: das.maskCleanShutdownEnabled = True (Cluster advanced options) – this is an accompanying configuration that helps vSphere HA distinguish between VMs that were powered on when a PDL occurred and should be restarted, and VMs that were already powered off, which don’t need to be and, more importantly, probably should not be restarted.

vSphere 5.5+: Disk.AutoremoveOnPDL = 0 (advanced setting on each host) – This is a configuration I first read about on Duncan’s blog where he recommends that the value be changed from the default of enabled to disabled so that a device is not automatically removed if it enters a PDL state. Aside from LUN number limits a vSphere host can handle (255), VMware refers to a few cases where the stock configuration of automatically removing a PDL device may be desired although VMware doesn’t really specifically call out each circumstance aside from problems arising from hosts attempting to send I/O to a dead device. There may be more to come on this in the future but for now preventing the removal may save in fabric rescan time down the road if you can afford the LUN number expended. It will also serve as a good visual indicator in the vSphere Client that there is a problematic datastore that needs to be dealt with in case the PDL automation restarts VMs with nobody noticing the event has occurred. If there are templates or powered off VMs that were not evacuated by HA, the broken datastore will visually persist anyway.

That’s the short list of configuration changes to make for HA VM restart.  There are actually a few more here. For instance, fine grained HA handling can be coordinated on a per-VM basis by modifying the advanced virtual machine option disk.terminateVMOnPDLDefault for each VM, or scsi#:#.terminateVMOnPDL to fine tune HA on a per virtual disk basis for each VM. I’m definitely not recommending touching these if the situation does not call for it.

In a stock vSphere configuration with VMkernel.Boot.terminateVMOnPDL = no configured (or unintentionally misconfigured I suppose), the following events occur for an impacted virtual machine:

  1. PDL event occurs, sense codes are received and vSphere correctly identifies the PDL condition on the supporting datastore. A question is raised by vSphere for each impacted virtual machine to Retry I/O or Cancel I/O.
  2. Stop. Nothing else happens until each of the questions above is answered with administrator intervention. Answering Retry without the PDL datastore coming back online or without hot removing the impacted virtual disk (in most cases the .vmx will be impacted anyway and hot removing disks is next to pointless) sends the VM to hell pretty much. Answering Cancel allows HA to proceed with powering off the VM and restarting it on another host with access to the device which went PDL on the original host.

In a modified vSphere configuration with VMkernel.Boot.terminateVMOnPDL = yes configured, the following events occur for an impacted virtual machine:

  1. PDL event occurs, sense codes are received and vSphere correctly identifies the PDL condition on the supporting datastore. A question is raised by vSphere for each impacted virtual machine to Retry I/O or Cancel I/O.
  2. Due to VMkernel.Boot.terminateVMOnPDL = yes vSphere HA automatically and effectively answers Cancel for each impacted VM with a pending question. Again, if the hosts aren’t rebooted after the VMkernel.Boot.terminateVMOnPDL = yes configuration change, this step will mimic the previous scenario essentially resulting in failure to automatically carry out the desired tasks.
  3. Each VM is powered off.
  4. Each VM is powered on.
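
The two sequences above can be summarized as a tiny decision model. This is only an illustrative Python sketch of the behavior described in this post, not anything vSphere exposes; the function name and event strings are made up for the example.

```python
def pdl_vm_events(terminate_on_pdl: bool, rebooted_since_change: bool = True) -> list:
    """Return the sequence of events for one impacted VM after a PDL."""
    events = ["PDL detected, Retry/Cancel question raised"]

    # The setting only takes effect once the host has been rebooted.
    effective = terminate_on_pdl and rebooted_since_change
    if not effective:
        events.append("waiting for administrator to answer the question")
        return events

    events.append("Cancel answered automatically")
    events.append("VM powered off")
    events.append("VM powered on")
    return events
```

Calling `pdl_vm_events(True, rebooted_since_change=False)` gives the same result as `pdl_vm_events(False)`, which is exactly the trap described above.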

I’ll note in the VM Event examples above, leveraging the power of Snagit I’ve cut out some of the noise about alarms triggering gray and green, resource allocations changing, etc.

For completeness, following is a list of the PDL sense codes vSphere is looking for from the supported storage array:

SCSI sense code                                   Description
H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0  LOGICAL UNIT NOT SUPPORTED
H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x4c 0x0  LOGICAL UNIT FAILED SELF-CONFIGURATION
H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x3e 0x3  LOGICAL UNIT FAILED SELF-TEST
H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x3e 0x1  LOGICAL UNIT FAILURE

Two isolated examples of PDL taking place seen in /var/log/vmkernel.log:

Example 1:

2014-07-13T20:47:03.398Z cpu13:33486)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x2a (0x4136803b8b80, 32789) to dev "naa.6000d31000ebf600000000000000006c" on path "vmhba2:C0:T0:L30" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe. Act:EVAL
2014-07-13T20:47:03.398Z cpu13:33486)ScsiDeviceIO: 2324: Cmd(0x4136803b8b80) 0x2a, CmdSN 0xe1 from world 32789 to dev "naa.6000d31000ebf600000000000000006c" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe.
2014-07-13T20:47:03.398Z cpu13:33486)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x2a (0x413682595b80, 32789) to dev "naa.6000d31000ebf600000000000000007c" on path "vmhba2:C0:T0:L2" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0. Act:FAILOVER

Example 2:

2014-07-14T00:43:49.720Z cpu4:32994)ScsiDeviceIO: 2337: Cmd(0x412e82f11380) 0x85, CmdSN 0x33 from world 34316 to dev "naa.600508b1001c6e17d603184d3555bf8d" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2014-07-14T00:43:49.731Z cpu4:32994)ScsiDeviceIO: 2337: Cmd(0x412e82f11380) 0x4d, CmdSN 0x34 from world 34316 to dev "naa.600508b1001c6e17d603184d3555bf8d" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2014-07-14T00:43:49.732Z cpu4:32994)ScsiDeviceIO: 2337: Cmd(0x412e82f11380) 0x1a, CmdSN 0x35 from world 34316 to dev "naa.600508b1001c6e17d603184d3555bf8d" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2014-07-14T00:48:03.398Z cpu10:33484)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x2a (0x4136823b2dc0, 32789) to dev "naa.60060160f824270012f6aa422e0ae411" on path "vmhba1:C0:T2:L40" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0. Act:FAILOVER
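
If you would rather not eyeball sense codes at 2 AM, a few lines of Python can pick out the PDL entries. This is a minimal sketch that matches only the four sense codes listed above and assumes log lines shaped like the examples in this post; the names `PDL_SENSE_CODES` and `classify_line` are hypothetical.

```python
import re

# The four PDL sense codes from the table above: (sense key, ASC, ASCQ).
PDL_SENSE_CODES = {
    (0x5, 0x25, 0x0): "LOGICAL UNIT NOT SUPPORTED",
    (0x4, 0x4C, 0x0): "LOGICAL UNIT FAILED SELF-CONFIGURATION",
    (0x4, 0x3E, 0x3): "LOGICAL UNIT FAILED SELF-TEST",
    (0x4, 0x3E, 0x1): "LOGICAL UNIT FAILURE",
}

SENSE_RE = re.compile(
    r"Valid sense data: (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+)"
)
# Accept straight or curly quotes around the device name.
DEVICE_RE = re.compile(r"to dev [\"“]([^\"”]+)[\"”]")


def classify_line(line):
    """Return (device, description) for a PDL sense code, else None."""
    sense = SENSE_RE.search(line)
    if not sense:
        return None
    code = tuple(int(part, 16) for part in sense.groups())
    description = PDL_SENSE_CODES.get(code)
    if description is None:
        return None
    device = DEVICE_RE.search(line)
    return (device.group(1) if device else None, description)
```

Run against the two examples above, only the lines carrying 0x5 0x25 0x0 are flagged. The 0x6 0x3f 0xe, 0x5 0x20 0x0, and 0x5 0x24 0x0 entries are not in the PDL list and return None.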

In no particular order, I want to thank Duncan, Paudie, Cormac, Mohammed, Josh, Adam, Niran, and MAN1$H for providing some help on this last week.

By the way, don’t name your virtual machines or datastores PDL. It’s bad karma.

Registered Storage Providers Missing After vCenter 5.5 Update 1 Upgrade

March 17th, 2014

Taking a look at my VM Storage Policies compliance in the vSphere Client, I was alerted to a situation that none of my configured virtual machines were compliant with their assigned VM Storage Policy named “Five Nines Compellent Storage”.  Oddly enough, the virtual machine home directories and virtual disks were in fact on the correct datastores and showed as compliant a few days earlier. None had been migrated via Storage vMotion or SDRS.

Now you see it, now you don’t

I then verified my VASA configuration by looking at the status of my registered storage provider.  The issue was not so much that the provider was malfunctioning, but rather it was missing completely from the registered storage providers list.  This indeed explains the resulting Not Compliant status of my virtual machines.

I checked another upgraded environment where I know I had a registered VASA storage provider.  It reflected the same symptom and confirmed my suspicion that the recent process of upgrading the vCenter Server 5.5 appliance to Update 1 (via the web repository method) may have unregistered the storage provider once the reboot of the appliance was complete.

I had one more similar environment remaining which I had not upgraded yet. I verified the storage provider was registered and functioning prior to the Update 1 upgrade. I proceeded with the upgrade and after the reboot completed the storage provider was no longer registered.

What remains a mystery at this point is the root cause of the unregistered storage provider.  I was unable to find any VMware KB articles related to this issue.

Not the end of the world

The workaround is straightforward: re-register each of the missing storage providers.  For Dell Compellent customers, the storage provider points to the CITV (Compellent Integration Tools for VMware) appliance and the URL follows the format:

https://fqdn:8443/vasa/services/vasaService

Dell Compellent customers should also keep the following in mind for VASA integration:

  • the integration requires the CITV appliance and Enterprise Manager 6.1 and above.
  • the out-of-box Windows Firewall configuration on the server hosting Enterprise Manager will block the initial VASA configuration in the CITV appliance. Inbound TCP 3033 must be allowed, or alternatively the Windows Firewall can be disabled (not highly recommended).
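
Before re-registering a provider, it can save time to confirm the basics. Here is a small Python sketch that builds the provider URL in the format shown above and checks whether a TCP port answers. The host names in the usage note are placeholders, and both function names are hypothetical.

```python
import socket


def vasa_provider_url(fqdn: str) -> str:
    """Build the CITV storage provider URL in the format shown above."""
    return f"https://{fqdn}:8443/vasa/services/vasaService"


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, filtered, timed out, or name resolution failed.
        return False
```

For example, `tcp_port_open("em.example.com", 3033)` run from the CITV side of the network would show whether the Enterprise Manager server is reachable on TCP 3033. A False result is consistent with the firewall behavior described above, though it could equally be a routing or DNS problem.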

Once the applicable storage provider(s) are added back, no additional VM Storage Policy reconfiguration is required other than to check for compliance.  All VMs should fall back into compliance.

Once again, I am unsure at this point as to why applying vCenter 5.5 Update 1 to the appliance caused the registered storage providers to go missing or what that connection is.  I will also add that I deployed additional vCenter 5.5 appliances under vCloud Director with a default configuration, no registered vSphere hosts, registered a VASA storage provider, upgraded to Update 1, rebooted, and the storage provider remained. I’m not sure what element in these subsequent tests caused the outcome to change but the problem now presents as inconsistent.  If I do see it again and find a root cause, as per usual I will be sure to update this article. To reiterate, Update 1 was applied in this case via the web repository method.  There are a few other methods available to apply Update 1 to the vCenter Server appliance and of course there is also the Windows version of vCenter Server – I don’t know whether these other methods and versions are impacted the same way.

Looks like someone has a case of the Mondays

On a somewhat related note, during lab testing I did find that VM Storage Profiles configured via the legacy vSphere Client do not show up as configured VM Storage Policies in the next gen vSphere Web Client.  Likewise, VM Storage Policies created in the next gen vSphere Web Client are missing in the legacy vSphere Client.  However, registered storage providers themselves carry over from one client to the other – no issue there.  I guess the lesson here is to stick with a consistent method of creating, applying, and monitoring Profile-Driven Storage in your vSphere environment from a vSphere Client perspective.  As of the release of vSphere 5.5 going forward, that should be the next gen vSphere Web Client.  However, this client still seems to lack the ability to identify VASA provided storage capabilities on any given datastore although the entire list of possible capability strings is available by diving into VM Storage Policy configuration.

Last but not least, VMware KB 2004098 vSphere Storage APIs – Storage Awareness FAQ provides useful bits of information about the VASA side of vSphere storage APIs.  One item in that FAQ that I’ve always felt was worded a bit ambiguously in the context of vSphere consolidation is:

The Vendor Provider cannot run on the same host as the vCenter Server.

In most cases, the vCenter Server as well as the VASA integration component(s) will run as virtual machines.  Worded above as is, it would seem the vCenter Server (whether that be Windows or appliance based) cannot reside on the same vSphere host as the VASA integration VM(s).  That’s not at all what that statement implies and moreover it wouldn’t make much sense.  What it’s talking about is the use case of a Windows based vCenter Server.  In this case, Windows based VASA integration components must not be installed on the same Windows server being used to host vCenter Server.  For Dell Compellent customers, the VASA integration comes by way of the CITV appliance which runs atop a Linux platform. However, the CITV appliance does communicate with the Windows based Enterprise Manager Data Collector for VASA integration.  Technically, EM isn’t the provider, the CITV appliance is.  Personally I’d keep the EM and vCenter Server installations separate.  Both appreciate larger amounts of CPU and memory in larger environments and for the sake of performance, we don’t want these two fighting for resources during times of contention.

vSphere Consulting Opportunity in Twin Cities

December 14th, 2013

If you know me well, you know the area I call home.  If you’re a local friend, acquaintance, or member of any of the three Minnesota VMware User Groups, then I have an opportunity that has crossed my desk which you or someone you know may be interested in.

A local business here in the Twin Cities has purchased vSphere and EMC VNXe storage infrastructure and is looking for a Consulting Engineer to deploy the infrastructure per an existing design.

Details:

  • Install and configure VMware vSphere 5.1 on two hosts
  • Install and configure VMware vCenter
  • Install and configure VMware Update Manager
  • Configure vSphere networking
  • Configure EMC VNXe storage per final design.

It’s a great opportunity to help a locally owned business deploy a vSphere infrastructure and I would think this would be in the wheelhouse of 2,000+ people I’ve met while running the Minneapolis VMware User Group.  As much as I’d love to knock this out myself, I’m a Dell Storage employee and as such I’m removing myself as a candidate for the role.  The best way I can help is to get the word out into the community.

If you’re interested, email me with your contact information and I’ll get you connected to the Director.

Happy Holidays!

Storage Center 5.6 Released

November 25th, 2013

I don’t have the latest and greatest Dell Compellent SC8000 controllers or SC220 2.5″ drive enclosures in my home lab although I dream nightly about Santa unloading some on me this Christmas.  What I do have is an older Series 20 and I am thankful for that.  But having an older storage array doesn’t mean I cannot leverage some of the latest and greatest features and operating systems available for datacenters.

Storage Center 5.6 was released just a short time ago and it ushers in some feature and platform support currently built into Storage Center 6.x as well as a large number of bug fixes.  This is a big win for me and anyone with a 32-bit system (Series 30 or below) needing these features because SCOS 6.x is 64-bit only for Series 40 and newer which today includes the SC8000.

So what are these new features in 5.6 and why am I so excited?  I’m glad you asked.  For this guy, and on top of the list, it’s full support of all VAAI primitives.  Storage Center 5.5 and older boasted support of the block zeroing primitive.  Space Reclamation was there as well although that primitive alone did not satisfy the other component of the thin provisioning primitive which was STUN.

Shown below is a Storage Center 5.5 datastore where I lack Atomic Test and Set (aka Hardware Assisted Locking) and XCOPY.  I have block zeroing and Space Reclamation using the Free Space Recovery agent for vSphere guest VMs using physical RDMs. VAAI support status can be obtained in full using esxcli:

Snagit Capture

Or in part using the vSphere Client GUI:

Snagit Capture

After the Storage Center 5.6 upgrade, I’ve got additional VAAI primitive support where Clone in most cases is going to be the biggest one in terms of fabric and host efficiency and performance. Not shown is support for Thin Provisioning Stun but that has been added as well:

Snagit Capture

The vSphere Client GUI now reflects full VAAI support after the 5.6 upgrade:

Snagit Capture

What else? Added support for vSphere 5.5 as an operating system type:

Snagit Capture

Last but not least, added support for Windows 2012 and some of its features including Offloaded Data Transfer, Thin Provisioning, Space Reclamation, and Server Objects:

Snagit Capture

Storage Center 5.6 also adds new storage features which are storage host agnostic such as Background Media Scans (BMS) as well as improved disk and HBA management for server objects.  And the bug fixes I mentioned earlier – refer to the SCOS 5.6 Release Notes for details.

To wrap this up, if you’ve got an older Storage Center model and you want support for these new features while avoiding a forklift upgrade, Storage Center Operating System 5.6 is the way to go.

vSphere 5.5 UNMAP Deep Dive

September 13th, 2013

One of the features that has been updated in vSphere 5.5 is UNMAP which is one of two sub-components of what I’ll call the fourth block storage based thin provisioning VAAI primitive (the other sub-component is thin provisioning stun).  I’ve already written about UNMAP a few times in the past.  It was first introduced in vSphere 5.0 two years ago.  A few months later the feature was essentially recalled by VMware.  After it was re-released by VMware in 5.0 Update 1, I wrote about its use here and followed up with a short piece about the .vmfsBalloon file here.

For those unfamiliar, UNMAP is a space reclamation mechanism used to return blocks of storage back to the array after data which was once occupying those blocks has been moved or deleted.  The common use cases are deleting a VM from a datastore, Storage vMotion of a VM from a datastore, or consolidating/closing vSphere snapshots on a datastore.  All of these operations, in the end, involve deleting data from pinned blocks/pages on a volume.  Without UNMAP, these pages, albeit empty and available for future use by vSphere and its guests only, remain pinned to the volume/LUN backing the vSphere datastore.  The pages are never returned back to the array for use with another LUN or another storage host.  Notice I did not mention shrinking a virtual disk or a datastore – neither of those operations are supported by VMware.  I also did not mention the use case of deleting data from inside a virtual machine – while that is not supported, I believe there is a VMware fling for experimental use.  In summary, UNMAP extends the usefulness of thin provisioning at the array level by maintaining storage efficiency throughout the life cycle of the vSphere environment and the array which supports the UNMAP VAAI primitive.

On the Tuesday during VMworld, Cormac Hogan launched his blog post introducing new and updated storage related features in vSphere 5.5.  One of those features he summarized was UNMAP.  If you haven’t read his blog, I’d definitely recommend taking a look – particularly if you’re involved with vSphere storage.  I’m going to explore UNMAP in a little more detail.

The most obvious change to point out is the command line itself used to initiate the UNMAP process.  In previous versions of vSphere, the command issued on the vSphere host was:

vmkfstools -y x (where x represents the % of storage to unmap)

As Cormac points out, UNMAP has been moved to esxcli namespace in vSphere 5.5 (think remote scripting opportunities after XYZ process) where the basic command syntax is now:

esxcli storage vmfs unmap

In addition to the above, there are three switches available for use; of the first two listed below, one is required, and the third is optional.

-l|--volume-label=<str> The label of the VMFS volume to unmap the free blocks.

-u|--volume-uuid=<str> The uuid of the VMFS volume to unmap the free blocks.

-n|--reclaim-unit=<long> Number of VMFS blocks that should be unmapped per iteration.

Previously with vmkfstools, we’d change to the VMFS folder from which we were going to UNMAP blocks.  In vSphere 5.5, the esxcli command can be run from anywhere, so specifying the datastore name or the uuid is one of the required parameters for obvious reasons.  So using the datastore name, the new UNMAP command in vSphere 5.5 is going to look like this:

esxcli storage vmfs unmap -l 1tb_55ds

As for the optional parameter, the UNMAP command is an iterative process which continues through numerous cycles until complete.  The reclaim unit parameter specifies the quantity of blocks to unmap per iteration of the UNMAP process.  In previous versions of vSphere, VMFS-3 datastores could have block sizes of 1, 2, 4, or 8MB.  While upgrading a VMFS-3 datastore to VMFS-5 will maintain these block sizes, executing an UNMAP operation on a native net-new VMFS-5 datastore results in working with a 1MB block size only.  Therefore, if a reclaim unit value of 100 is specified on a VMFS-5 datastore with a 1MB block size, then 100MB of data will be returned to the available raw storage pool per iteration until all blocks marked available for UNMAP are returned.  Using a value of 100, the UNMAP command looks like this:

esxcli storage vmfs unmap -l 1tb_55ds -n 100
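
Since the move to esxcli invites scripting, here is a minimal Python sketch that assembles the command line from the switches described above. It only builds the argument list and does not run anything. Insisting on exactly one of label or uuid is a simplifying assumption of mine, and the function name is hypothetical.

```python
import shlex


def build_unmap_command(label=None, uuid=None, reclaim_unit=None):
    """Return the esxcli UNMAP command as an argument list."""
    # Assumption for this sketch: identify the datastore one way only.
    if (label is None) == (uuid is None):
        raise ValueError("specify exactly one of label or uuid")

    argv = ["esxcli", "storage", "vmfs", "unmap"]
    if label is not None:
        argv += ["-l", label]
    else:
        argv += ["-u", uuid]

    # The reclaim unit is optional; omit it to use the default.
    if reclaim_unit is not None:
        if reclaim_unit <= 0:
            raise ValueError("reclaim_unit must be a positive block count")
        argv += ["-n", str(reclaim_unit)]
    return argv


if __name__ == "__main__":
    print(shlex.join(build_unmap_command(label="1tb_55ds", reclaim_unit=100)))
```

Returning a list rather than a string keeps datastore labels with spaces intact when the list is handed to something like `subprocess.run` or an SSH wrapper.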

If the reclaim unit value is unspecified when issuing the UNMAP command, the default reclaim unit value is 200, resulting in 200MB of data returned to the available raw storage pool per iteration assuming a 1MB block size datastore.
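
To get a feel for how the reclaim unit translates into work, here is a back-of-the-envelope Python sketch. It assumes every iteration reclaims a full reclaim unit and that the process walks all of the free space once, which is a simplification; the function name is hypothetical.

```python
import math


def unmap_iterations(free_mb, reclaim_unit=200, block_size_mb=1):
    """Return (MB reclaimed per iteration, number of iterations)."""
    per_iteration_mb = reclaim_unit * block_size_mb
    iterations = math.ceil(free_mb / per_iteration_mb)
    return per_iteration_mb, iterations


if __name__ == "__main__":
    # 500GB of free space (512,000MB) on a 1MB block size datastore.
    print(unmap_iterations(512000))                    # default unit of 200
    print(unmap_iterations(512000, reclaim_unit=100))  # unit of 100
```

With 512,000MB free, the default reclaim unit of 200 works out to 2,560 iterations of 200MB each, while a reclaim unit of 100 doubles that to 5,120 iterations of 100MB.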

One additional piece to note on the CLI topic is that in a release candidate build I was working with, while the old vmkfstools -y command is deprecated, it appears to still exist but with newer vSphere 5.5 functionality published in the --help section:

vmkfstools vmfsPath -y --reclaimBlocks vmfsPath [--reclaimBlocksUnit #blocks]

The next change involves the hidden temporary balloon file (refer to my link at the top if you’d like more information about the balloon file but basically it’s a mechanism used to guarantee blocks targeted for UNMAP are not in the interim written to by an outside I/O request until the UNMAP process is complete).  It is no longer named .vmfsBalloon.  The new name is .asyncUnmapFile as shown below.

/vmfs/volumes/5232dd00-0882a1e4-e918-0025b3abd8e0 # ls -l -h -A
total 998408
-r--------    1 root     root      200.0M Sep 13 10:48 .asyncUnmapFile
-r--------    1 root     root        5.2M Sep 13 09:38 .fbb.sf
-r--------    1 root     root      254.7M Sep 13 09:38 .fdc.sf
-r--------    1 root     root        1.1M Sep 13 09:38 .pb2.sf
-r--------    1 root     root      256.0M Sep 13 09:38 .pbc.sf
-r--------    1 root     root      250.6M Sep 13 09:38 .sbc.sf
drwx------    1 root     root         280 Sep 13 09:38 .sdd.sf
drwx------    1 root     root         420 Sep 13 09:42 .vSphere-HA
-r--------    1 root     root        4.0M Sep 13 09:38 .vh.sf
/vmfs/volumes/5232dd00-0882a1e4-e918-0025b3abd8e0 #

As discussed in the previous section, use of the UNMAP command now specifies the actual size of the temporary file instead of the temporary file size being determined by a percentage of space to return to the raw storage pool.  This is an improvement in part because it helps avoid the catastrophe that could result if UNMAP tried to remove 2TB+ in a single operation (discussed here).

VMware has also enhanced the functionality of the temporary file.  A new kernel interface in ESXi 5.5 allows the user to ask for blocks beyond a specified block address in the VMFS file system.  This ensures that the blocks allocated to the temporary file were never allocated to the temporary file previously.  The benefit realized in the end is that any size temporary file can be created and with UNMAP issued to the blocks allocated to the temporary file, we can rest assured that we can issue UNMAP on all free blocks on the datastore.

Going a bit deeper and adding to the efficiency, VMware has also enhanced UNMAP to support multiple block descriptors.  Compared to vSphere 5.1 which issued just one block descriptor per UNMAP command, vSphere 5.5 now issues up to 100 block descriptors depending on the storage array (these identifying capabilities are specified internally in the Block Limits VPD (B0) page).
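
The effect of multiple block descriptors is easy to see with a little arithmetic. This Python sketch counts how many UNMAP commands would be needed to cover a given number of extents. The figure of 100 is the upper bound mentioned above and the real limit depends on what the array advertises, so treat the numbers as illustrative; the function name is hypothetical.

```python
import math


def unmap_commands_needed(extent_count, max_descriptors=100):
    """Return the number of UNMAP commands needed to cover the extents."""
    if max_descriptors < 1:
        raise ValueError("max_descriptors must be at least 1")
    return math.ceil(extent_count / max_descriptors)


if __name__ == "__main__":
    extents = 2500
    print(unmap_commands_needed(extents, max_descriptors=1))    # 5.1 style
    print(unmap_commands_needed(extents, max_descriptors=100))  # 5.5, best case
```

For 2,500 extents that is 2,500 commands with one descriptor each versus 25 commands when the array accepts 100 descriptors per command.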

A look at the asynchronous and iterative vSphere 5.5 UNMAP logical process:

  1. User or script issues esxcli UNMAP command
  2. Does the array support VAAI UNMAP?  yes=3, no=end
  3. Create .asyncUnmapFile on root of datastore
  4. .asyncUnmapFile created and locked? yes=5, no=end
  5. Issue IOCTL to allocate reclaim-unit blocks of storage on the volume past the previously allocated block offset
  6. Did the previous block allocation succeed? yes=7, no=remove lock file and retry step 6
  7. Issue UNMAP on all blocks allocated above in step 5
  8. Remove the lock file
  9. Did we reach the end of the datastore? yes=end, no=3
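
The steps above can be sketched as a toy simulation in Python.  This is only a model under stated assumptions: the datastore is a flat range of block numbers, the lock and allocation in steps 3 through 6 always succeed, and the function and parameter names are invented for illustration.  It is not VMware's implementation.

```python
def reclaim_free_blocks(total_blocks, used_blocks, reclaim_unit=200):
    """Simulate the iterative UNMAP pass over a datastore.

    Returns (unmapped, passes): the free blocks that had UNMAP issued,
    in order, and how many temp-file passes that took.
    """
    used = set(used_blocks)
    offset = 0        # next allocation must start at or past this block
    unmapped = []
    passes = 0
    while offset < total_blocks:          # step 9: end of datastore?
        temp_file = []                    # steps 3-4: create and lock temp file
        # step 5: allocate up to reclaim_unit free blocks past the offset
        while offset < total_blocks and len(temp_file) < reclaim_unit:
            if offset not in used:
                temp_file.append(offset)
            offset += 1
        if temp_file:
            unmapped.extend(temp_file)    # step 7: UNMAP the allocated blocks
            passes += 1
        # step 8: temp file removed, its blocks are free for normal use again
    return unmapped, passes


if __name__ == "__main__":
    # 1,000-block datastore with every even block in use
    done, passes = reclaim_free_blocks(1000, range(0, 1000, 2), reclaim_unit=200)
    print(len(done), passes)
```

Because each pass starts past the previous offset, no block is handed to the temporary file twice, and every free block is visited by the time the loop reaches the end of the datastore.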

From a performance perspective, executing the UNMAP command in my vSphere 5.5 RC lab showed peak write I/O of around 1,200MB/s with an average of around 200 IOPS made up of a 50/50 mix of read/write.  The UNMAP I/O pattern is a bit hard to gauge because with the asynchronous iterative process, it seemed to do a bunch of work, rest, do more work, rest, and so on.  Sorry no screenshots because flickr.com is currently down.  Perhaps the most notable takeaway from the performance section is that as of vSphere 5.5, VMware is lifting the recommendation of only running UNMAP during a maintenance window.  Keep in mind this is just a recommendation.  I encourage vSphere 5.5 customers to test UNMAP in their lab first using various reclaim unit sizes.  While doing this, examine performance impacts to the storage fabric, the storage array (look at both front end and back end), as well as other applications sharing the array.

Remember that fundamentally the UNMAP command is only going to provide a benefit AFTER its associated use cases have occurred (mentioned at the top of the article).  Running UNMAP on a volume which has no pages to be returned will be a waste of effort.  Once you’ve become comfortable with using UNMAP and understand its impacts in your environment, consider running it on a recurring schedule, perhaps weekly.  It really depends on how much the use cases apply to your environment.  Many vSphere backup solutions leverage vSphere snapshots, which is one of the use cases.  Although it could be said there are large gains to be made with UNMAP in this case, keep in mind that backups run regularly, and any space that is returned to raw storage with UNMAP will likely be consumed again in the following backup cycle when vSphere snapshots are created once again.

To wrap this up, customers who have block arrays supporting the thin provisioning VAAI primitive will be able to use UNMAP in vSphere 5.5 environments (for storage vendors, both sub-components are required to certify for the primitive as a whole on the HCL).  This includes Dell Compellent customers with a current version of Storage Center firmware.  Customers who use array based snapshots with extended retention periods should keep in mind that while UNMAP will work against active blocks, it may not work with blocks maintained in a snapshot.  This is to honor the snapshot based data protection retention.

The .vmfsBalloon File

July 1st, 2013

One year ago, I wrote a piece about thin provisioning and the role that the UNMAP VAAI primitive plays in thin provisioned storage environments.  Here’s an excerpt from that article:

When the manual UNMAP process is run, it balloons up a temporary hidden file at the root of the datastore which the UNMAP is being run against.  You won’t see this balloon file with the vSphere Client’s Datastore Browser as it is hidden.  You can catch it quickly while UNMAP is running by issuing the ls -l -a command against the datastore directory.  The file will be named .vmfsBalloon along with a generated suffix.  This file will quickly grow to the size of data being unmapped (this is actually noted when the UNMAP command is run and evident in the screenshot above).  Once the UNMAP is completed, the .vmfsBalloon file is removed.

Has your curiosity ever got you wondering about the technical purpose of the .vmfsBalloon file?  It boils down to data integrity and timing.  At the time the UNMAP command is run, the balloon file is immediately instantiated and grows to occupy (read: hog) all of the blocks that are about to be unmapped.  It does this so that during the unmap process, none of those blocks can be allocated to a new file being created elsewhere.  If you think about it, it makes sense – we just told vSphere to give these blocks back to the array.  If during the interim one or more of these blocks were suddenly allocated for a new file or file growth, and we then purged the block, we would have a data integrity issue.  More accurately, newly created data would be missing because its block or blocks were just flushed back to the storage pool on the array.
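
The reservation idea can be shown with a toy allocator in Python.  The class and method names below are hypothetical and the block handling is deliberately naive; this is only a sketch of why hogging the blocks first keeps a concurrent file creation away from them, not a description of how VMFS allocates space.

```python
class Datastore:
    """Toy block allocator: every block is either free or owned by one file."""

    def __init__(self, total_blocks):
        self.free = set(range(total_blocks))
        self.owner = {}  # block number -> file name

    def allocate(self, name, count):
        """Give `count` free blocks to file `name`; fail if not enough are free."""
        if count > len(self.free):
            raise RuntimeError("not enough free blocks")
        blocks = sorted(self.free)[:count]
        for block in blocks:
            self.free.discard(block)
            self.owner[block] = name
        return blocks

    def release(self, name):
        """Delete file `name` and return its blocks to the free pool."""
        blocks = sorted(b for b, n in self.owner.items() if n == name)
        for block in blocks:
            del self.owner[block]
            self.free.add(block)
        return blocks


if __name__ == "__main__":
    ds = Datastore(10)
    ds.allocate("vm.vmdk", 4)
    balloon = ds.allocate(".vmfsBalloon", 4)   # hog the blocks to be unmapped
    new_file = ds.allocate("new.vmdk", 2)      # concurrent file creation
    # The new file cannot land on a block that is about to be unmapped.
    print(set(balloon) & set(new_file))
    ds.release(".vmfsBalloon")                 # UNMAP done, balloon removed
```

While the balloon owns the blocks, the only way a new file can get space is from blocks outside the set being returned to the array, which is exactly the guarantee described above.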

Redefining Disk.MaxLUN

March 27th, 2013

Regardless of what the vSphere host Advanced Setting Disk.MaxLUN has stated as its definition for years, “Maximum number of LUNs per target scanned for” is technically not correct.  In fact, it’s quite misleading.

Snagit Capture

The true definition reads similarly in English but carries quite a different meaning.  It can be found in my SnagIt hack above or within VMware KB 1998, Definition of Disk.MaxLUN on ESX Server Systems and Clarification of 128 Limit.

The Disk.MaxLUN attribute specifies the maximum LUN number up to which the ESX Server system scans on each SCSI target as it is discovering LUNs. If you have a LUN 131 on a disk that you want to access, for example, then Disk.MaxLUN must be at least 132. Don’t make this value higher than you need to, though, because higher values can significantly slow VMkernel bootup.

The 128 LUN limit refers only to the total number of LUNs that the ESX Server system is able to discover. The system intentionally stops discovering LUNs after it finds 128 because of various service console and management interface limits. Depending on your setup, you can easily have a situation in which Disk.MaxLUN is high (255) but you see few LUNs, or a situation in which Disk.MaxLUN is low (16) but you reach the 128 LUN limit because you have many targets.

For more information about limiting the number of LUNs visible to the server, see http://kb.vmware.com/kb/1467.

Note the last sentence in the first paragraph above in the KB article.  Keep the value as small as possible for your environment when using block storage.  vSphere ships with this value configured for maximum compatibility out of the box which is the max value of 256.  Assuming you don’t assign LUN numbers that high in your environment, this value can be immediately ratcheted down in your build documentation or automated deployment scripts.  Doing so will decrease the elapsed time spent rescanning the fabric for block devices/VMFS datastores.  This tweak may be of particular interest at DR sites when using Site Recovery Manager to carry out a Recovery Plan test, a Planned Migration, or an actual DR execution.  It will allow for more efficient use of the RTO (Recovery Time Objective), especially where multiple recovery plans are run consecutively.
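
The difference between the two definitions is easier to see in a small Python model of the scan described in the KB text quoted above.  The function, the target names, and the dictionary layout are all invented for illustration, and the 128 discovery limit is taken from that ESX-era KB, so treat it as an assumption rather than a current vSphere maximum.

```python
DISCOVERY_LIMIT = 128  # assumption: total LUN limit from the quoted KB


def discover_luns(targets, max_lun, limit=DISCOVERY_LIMIT):
    """Model of the scan: on each target, only LUN numbers below max_lun
    are probed, and discovery stops once `limit` LUNs have been found.

    `targets` maps a target name to the LUN numbers presented on it.
    Returns a list of (target, lun) pairs.
    """
    found = []
    for target, luns in targets.items():
        for lun in sorted(luns):
            if lun >= max_lun:
                continue            # LUN number is beyond the scan range
            if len(found) >= limit:
                return found        # total discovery limit reached
            found.append((target, lun))
    return found


if __name__ == "__main__":
    # LUN 131 is only seen once Disk.MaxLUN is at least 132
    print(discover_luns({"t0": [131]}, max_lun=131))
    print(discover_luns({"t0": [131]}, max_lun=132))
    # Low Disk.MaxLUN but many targets still reaches the total limit
    many = {f"t{i}": range(16) for i in range(10)}
    print(len(discover_luns(many, max_lun=16)))
```

In this model Disk.MaxLUN bounds the highest LUN number probed on each target, not how many LUNs are found, which is why a low value can still hit the overall limit when there are many targets.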