vSphere 6.7 Storage and action_OnRetryErrors=on

February 8th, 2019 by jason 2 comments »

VMware introduced a new storage feature in vSphere 6.0 which was designed as a flexible option to better handle certain storage problems. Cormac Hogan did a fine job introducing the feature here. Starting with vSphere 6.0 and continuing on in vSphere 6.5, each block storage device (VMFS or RDM) is configured with an option called action_OnRetryErrors. Note that in vSphere 6.0 and 6.5, the default value is off meaning the new feature is effectively disabled and there is no new storage error handling behavior observed.

This value can be seen with the esxcli storage nmp device list command.

vSphere 6.0/6.5:
esxcli storage nmp device list | grep -A9 naa.6000d3100002b90000000000000ec1e1
Device Display Name: sqldemo1vmfs
Storage Array Type: VMW_SATP_ALUA
Storage Array Type Device Config: {implicit_support=on; explicit_support=off; explicit_allow=on; alua_followover=on; action_OnRetryErrors=off; {TPG_id=61459,TPG_state=AO}{TPG_id=61460,TPG_state=AO}{TPG_id=61462,TPG_state=AO}{TPG_id=61461,TPG_state=AO}}
Path Selection Policy: VMW_PSP_RR
Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=0; lastPathIndex=0: NumIOsPending=0,numBytesPending=0}
Path Selection Policy Device Custom Config:
Working Paths: vmhba1:C0:T2:L141, vmhba1:C0:T3:L141, vmhba2:C0:T3:L141, vmhba2:C0:T2:L141
Is USB: false

If vSphere loses access to a device on a given path, the host will send a Test Unit Ready (TUR) command down the given path to check path state. When action_OnRetryErrors=off, vSphere will continue to retry for an amount of time because it expects the path to recover. It is important to note here that a path is not immediately marked dead when the first Test Unit Ready command is unsuccessful and results in a retry. It would seem many retries in fact and you’ll be able to see them in /var/log/vmkernel.log. Also note that a device typically has multiple paths and the process will be repeated for each additional path tried assuming the first path is eventually marked as dead.

Starting with vSphere 6.7, action_OnRetryErrors is enabled by default.

vSphere 6.7:
esxcli storage nmp device list | grep -A9 naa.6000d3100002b90000000000000ec1e1
Device Display Name: sqldemo1vmfs
Storage Array Type: VMW_SATP_ALUA
Storage Array Type Device Config: {implicit_support=on; explicit_support=off; explicit_allow=on; alua_followover=on; action_OnRetryErrors=on; {TPG_id=61459,TPG_state=AO}{TPG_id=61460,TPG_state=AO}{TPG_id=61462,TPG_state=AO}{TPG_id=61461,TPG_state=AO}}
Path Selection Policy: VMW_PSP_RR
Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=0; lastPathIndex=2: NumIOsPending=0,numBytesPending=0}
Path Selection Policy Device Custom Config:
Working Paths: vmhba1:C0:T2:L141, vmhba1:C0:T3:L141, vmhba2:C0:T3:L141, vmhba2:C0:T2:L141
Is USB: false

If vSphere loses access to a device on a given path, the host will send a Test Unit Ready (TUR) command down the given path to check path state. When action_OnRetryErrors=on, vSphere will immediately mark the path dead when the first retry is returned. vSphere will not continue the retry TUR commands on a dead path.

This is the part where VMware thinks it’s doing the right thing by immediately fast failing a misbehaving/dodgy/flaky path. The assumption here is that other good paths to the device are available and instead of delaying an application while waiting for path failover during the intensive TUR retry process, let’s fail this one bad path right away so that the application doesn’t have to spin its wheels.

However, if all other paths to the device are impacted by the same underlying (and let’s call it transient) condition, what happens is that each additional path iteratively goes through the process of TUR, no retry, immediately mark path as dead, move on to the next path. When all available paths have been exhausted, All Paths Down (APD) for the device kicks in. If and when paths to an APD device become available again, they will be picked back up upon the next storage fabric rescan, whether that’s done manually by an administrator, or automatically by default every 300 seconds for each vSphere host (Disk.PathEvalTime). From an application/end user standpoint, I/O delay for up to 5 minutes can be a painfully long time to wait. The irony here is that VMware can potentially turn a transient condition lasting only a few seconds into a more of a Permanent Device Loss like condition.

All of the above leads me to a support escalation I got involved in with a customer having an Active/Passive block storage array. Active/Passive is a type of array which has multiple storage processors/controllers (usually two) and LUNs are distributed across the controllers in an ownership model whereby each controller owns both the LUNs and the available paths to those LUNs. When an active controller fails or is taken offline proactively (think storage processor reboot due to a firmware upgrade), the paths to the active controller go dark, the passive controller takes ownership of the LUNs and lights up the paths to them – a process which can be measured in seconds, typically more than 2 or 3, often much more than that (this dovetails into the discussion of virtual machine disk timeout best practices). With action_OnRetryErrors=off, vSphere tolerates the transient path outage during the controller failover. With action_OnRetryErrors=on, it doesn’t – each path that goes dark is immediately failed and we have APD for all the volumes on that controller in a fraction of a second.

The problem which was occurring in this customer escalation was a convergence of circumstances:

  • The customer was using vSphere 6.7 and its defaults
  • The customer was using an Active/Passive storage array
  • The customer virtualized Microsoft Windows SQL cluster servers (cluster disk resources are extremely sensitive to APDs in the hypervisor and immediately fail when it detects a dependent cluster disk has been removed – a symptom introduced by APD)
  • The customer was testing controller failovers
Windows failover clusters have zero tolerance for APD disk

To resolve the problem in vSphere 6.7, action_OnRetryErrors needs to be disabled for each device backed by the Active/Passive storage array. This must be performed on every host in the cluster having access to the given devices (again, these can be VMFS volumes and/or RDMs). There are a few ways to go about this.

To modify the configuration without a host reboot, take a look at the following example. A command such as this would need to be run on every host in the cluster, and for each device (ie. in an 8 host cluster with 8 VMFS/RDMs, we need to identify the applicable naa.xxx IDs and run 64 commands. Yes, this could be scripted. Be my guest.):

esxcli storage nmp satp generic deviceconfig set -c disable_action_OnRetryErrors -d naa.6000d3100002b90000000000000ec1e1

I don’t prefer that method a whole lot. It’s tedious and error prone. It could result in cluster inconsistencies. But on the plus side, a host reboot isn’t required, and this setting will persist across reboots. That also means a configuration set at this device level will override any claim rules that could also apply to this device. Keep this in mind if a claim rule is configured but you’re not seeing the desired configuration on any specific device.

The above could also be scripted for a number of devices on a host. Here’s one example. Be very careful that the base naa.xxx string matches all of the devices from one array that should be configured, and does not modify devices from other array types that should not be configured. Also note that this script is a one liner but for blog formatting purposes I manually added a line break starting with esxcli.:

for i in `ls /vmfs/devices/disks | grep -v ":" | grep -i naa.6000D31`; do echo $i; 
esxcli storage nmp satp generic deviceconfig set -c disable_action_OnRetryErrors -d $i; done

Now to verify:

for i in `ls /vmfs/devices/disks | grep -v ":" | grep -i naa.6000D31`; do echo $i; 
esxcli storage nmp device list | grep -A2 $i | egrep -io action_OnRetryErrors=\\w+; done

I like adding a SATP claim rule using a vendor device string a lot better, although changes to claim rules for existing devices generally requires a reboot of the host to reclaim existing devices with the new configuration. Here’s an example:

esxcli storage nmp satp rule add -s VMW_SATP_ALUA -V COMPELNT -P VMW_PSP_RR -o disable_action_OnRetryErrors

Here’s another example using quotes which is also acceptable and necessary when setting multiple option string parameters (refer to this):

esxcli storage nmp satp rule add -s “VMW_SATP_ALUA” -V “COMPELNT” -P “VMW_PSP_RR” -o “disable_action_OnRetryErrors”

When a new claim rule is added, claim rules can be reloaded with the following command.

esxcli storage core claimrule load

Keep in mind the new claim rule will only apply to unclaimed devices. Newly presented devices will inherit the new claim rule. Existing devices which are already claimed will not until the next vSphere host reboot. Devices can be unclaimed without a host reboot but all I/O to the device must be halted – somewhat of a conundrum if we’re dealing with production volumes, datastores being used for heartbeating, etc. Assuming we’re dealing with multiple devices, a reboot is just going to be easier and cleaner.

I like claim rules here better because of the global nature. It’s one command line per host in the cluster and it’ll take care of all devices from the Active/Passive storage array vendor. No need to worry about coming up with and testing a script. No need to worry about spending hours identifying the naa.xxx IDs and making all of the changes across hosts. No need to worry about tagging other storage vendor devices with an improper configuration. Lastly, the claim rule in effect is visible in a SATP claim rule list (sincere apologies for the formatting – it’s bad I know):

esxcli storage nmp satp rule list

Name Device Vendor Model Driver Transport Options Rule Group Claim Options Default PSP PSP Options Description
——————- —— ——– —————- —— ——— —————————- ———- ———————————– ———– ———– ———————————————–
VMW_SATP_ALUA COMPELNT disable_action_OnRetryErrors user VMW_PSP_RR

By the way… to remove the SATP claim rules above respectively:

esxcli storage nmp satp rule remove -s VMW_SATP_ALUA -V COMPELNT -P VMW_PSP_RR -o disable_action_OnRetryErrors

esxcli storage nmp satp rule remove -s “VMW_SATP_ALUA” -V “COMPELNT” -P “VMW_PSP_RR” -o “disable_action_OnRetryErrors”

The bottom line here is there may be a number of VMware customers with Active/Passive storage arrays, running vSphere 6.7. If and when planned or unplanned controller/storage processor failover occurs, APDs may unexpectedly occur, impacting virtual machines and their applications, whereas this was not the case with previous versions of vSphere.

In closing, I want to thank VMware Staff Technical Support Engineering for their work on this case and ultimately exposing “what changed in vSphere 6.7” because we had spent some time trying to reproduce this problem on vSphere 6.5 where we had an environment similar to what the customer had and we just weren’t seeing any problems.


Managing SATPs

No Failover for Storage Path When TUR Command Is Unsuccessful

Storage path does not fail over when TUR command repeatedly returns retry requests (2106770)

Handling Transient APD Conditions


Updated 2-20-19: VMware published a KB article on this issue today:
ESXi 6.7 hosts with active/passive or ALUA based storage devices may see premature APD events during storage controller fail-over scenarios (67006)

MegaSceneryEarth automated HTTP download PowerShell script

July 30th, 2018 by jason No comments »

Regulars: Thank you for stopping by but just as a heads up, this post is not VMware virtualization related.

After a bit of hardware trouble from the local store, my replacement flight sim rig is up and running Lockheed Martin Prepar3D 4.3. I’m trying to shake the rust off after not having flown a leg for quite some time but everything sim and add-on related is looking good and I was able to fly KMSP-KLAX with many more FPS that I’ve been used to the past many years on FSX.

Now, out of sheer coincidence late last week I received a promotion from MegaSceneryEarth. Normally I ignore these but since this was a 50% off deal and I’m essentially starting from ground zero with no scenery for Prepar3D, I took a look. After some chin scratching, I checked out with Minnesota, Arizona, Northern California, and Southern California. I felt like quite the opportunist until the email arrived with file download instructions. Each scenery is broken up into 2GB download chunks. Between the four sceneries, I was presented with 123 download links. With no FTP option that I’m aware of, I’m going to have to babysit this process. With each 2GB download taking anywhere from 5-10 minutes, this is going to feel a lot like installing Microsoft Office 2016 via floppy disk. Except longer. MegaSceneryEarth is available on shipped DVDs or USB sticks they cost a small fortune and of course the shipping and handling time. This is going to be painful.

But… there’s a cheaper, faster, and fully automated way. PowerShell scripting.

Now admittedly for my first scenery, I threw the v1.0 PowerShell script version together quickly using the Invoke-WebRequest cmdlet with my own static scenery-specific lines of code to serve my own needs. And it worked. I launched the script, walked away for a few hours, and when I came back I had a directory full of MegaSceneryEarth .zip files ready for installation. It worked beautifully and it didn’t consume bunches of my time.

I still had three sceneries left to download. While I could have made new static scripts for the remaining three sceneries, my mind was already going down the road that this script would be more powerful if one script worked on a variety of MegaSceneryEarth downloads. It becomes flexible to use for my future MegaSceneryEarth sceneries as well as extensible if shared with the community.

So on Sunday, version 2.0 is what I came up with. It takes three inputs in the form of modifying variables at the beginning of the script before running it:

  1. The download URL for the first file of the scenery
  2. The path on the local hard drive where all of the .zip files should be downloaded to
  3. The starting point or first file to download. Usually this will be 1 but if for some reason a previous download was stopped or interrupted at say 6, we’d want to continue from there assuming we don’t want to re-download the first 5 files all over again.

The script does the rest of the heavy lifting. All three of the above are required for functionality but by far the first variable is the most important and where I spent almost all of my time making the script flexible so that it works across MegaSceneryEarth sceneries. When provided with the download URL of the first file of the scenery, the script does a bunch of parsing, splitting, and concatenation of stuff to automatically determine some things:

  1. It figures out what the base path and file name URL is for the scenery. For any given scenery, these parts never change across all of the download links within a scenery but they will change across sceneries.
  2. It figures out how many .zip files there are to download in total for the entire scenery.
  3. Because MegaSceneryEarth likes to use leading zeroes as part of their naming convention, and leading zero strings aren’t the same thing as leading zero integers in terms of PowerShell constructs, it figures out how to deal with those for file names and looping iterations. Basically when you go from “009” files in a scenery download to “010”, without dealing with that, stuff breaks. I think most MegaSceneryEarth sceneries have more than 9 files, but some have less (Deleware has 1). The script deals with both kinds. That I know of, MegaSceneryEarth doesn’t have a scenery with more than 99 .zip files but version 2.1 of the script will accommodate that.
  4. It figures out the correct file name to save on the local hard drive for each scenery file downloaded. This may sound stupid but that I know of, the Invoke-WebRequest cmdlet requires the output file name and doesn’t automatically assume the file name should be the same as the file name being downloaded via HTTP such as a copy command to a folder would.

Generally speaking, for automation to work properly, it relies on input consistency so that all assumptions it makes are 100% correct. Translated, this script relies on consistent file/path naming conventions used by MegaSceneryEarth. Not only within sceneries, but across sceneries as well.

And so here it is. Copy the PowerShell script below and execute in your favorite PowerShell environment. Most of the time I simply use Windows PowerShell ISE. The code isn’t signed so you may need to temporarily relax your PowerShell ExecutionPolicy to Bypass . It’s there to protect you by default.

# MegaSceneryEarth.ps1 version 2.1 7/31/18
# This PowerShell script automates unattended HTTP download of up to 999 MegaSceneryEarth scenery files
# Relies on MegaSceneryEarth naming conventions current as of version date above
# Tested successfully with four MegaSceneryEarth downloads
# As with anything on the internet, use at your own risk, no warranty provided by the author
# Update the three variables below as necessary for your specific scenery and download parameters
# Jason Boche
# http://boche.net
# jason@boche.net

# Provide the first file download link copied from the MegaSceneryEarth .htm link provided with license key email
# Internet Explorer: Right click the download link and choose "Copy shortcut"
# Google Chrome: Right click the download link and choose "Copy link address"
# Then paste the link between the quotation marks

$firstfile1 = "paste first file download link here.zip"

# Provide the path on your hard drive where the files will temporarily be saved.
# This will usually be where you have the MegaSceneryEarthInstallManager.exe file for subsequent installation
# Syntax requires a trailing backslash (\)

$dlpath = "D:\junk\"

# Provide the starting file count. Usually we start with 1 unless you're starting the download somewhere after 1

$startcount = 1

# Do not edit anything below this line


# Get the HTTP path
$firstfile2 = split-path $firstfile1
$firstfile3 = $firstfile2.Replace("\","/")
$httppath = $firstfile3 + "/"

# Get the .zip file name
$firstfile4 = split-path $firstfile1 -leaf

# Get the static part of the .zip file name
$firstfile5 = $firstfile4 -split "001_of_"
$firstfile6 = $firstfile5[0]
$fileprefix = $firstfile6

# Get the total number of scenery files in string format with and without .zip extension
$firstfile7 = $firstfile5[1] -split ".zip"
$firstfile8 = $firstfile5[1]
# Convert total number of scenery files in string format to number format
$total = [decimal]::Parse($firstfile7)

if ($startcount -gt $total)
Write-Host Error: Starting file specified is greater than total number of files. Please check the `$startcount variable

# If there are 1-9 scenery files, download those
if ($total -ge 1)
$leadingzero = "00"
while ($startcount -le $total)
if ($startcount -gt 9){break}
Write-Host Downloading $httppath$fileprefix$leadingzero$startcount"_of_"$firstfile8 to $dlpath$fileprefix$leadingzero$startcount"_of_"$firstfile8

Invoke-WebRequest -Uri $httppath$fileprefix$leadingzero$startcount"_of_"$firstfile8 -OutFile $dlpath$fileprefix$leadingzero$startcount"_of_"$firstfile8

$startcount ++


# If there are 10-99 scenery files, download those also
if ($total -ge 10)
$leadingzero = "0"
while ($startcount -le $total)
if ($startcount -gt 99){break}
Write-Host Downloading $httppath$fileprefix$leadingzero$startcount"_of_"$firstfile8 to $dlpath$fileprefix$leadingzero$startcount"_of_"$firstfile8

Invoke-WebRequest -Uri $httppath$fileprefix$leadingzero$startcount"_of_"$firstfile8 -OutFile $dlpath$fileprefix$leadingzero$startcount"_of_"$firstfile8

$startcount ++


# If there are 100-999 scenery files, download those also
if ($total -ge 100)
$leadingzero = ""
while ($startcount -le $total)
if ($startcount -gt 999){break}
Write-Host Downloading $httppath$fileprefix$leadingzero$startcount"_of_"$firstfile8 to $dlpath$fileprefix$leadingzero$startcount"_of_"$firstfile8

Invoke-WebRequest -Uri $httppath$fileprefix$leadingzero$startcount"_of_"$firstfile8 -OutFile $dlpath$fileprefix$leadingzero$startcount"_of_"$firstfile8

$startcount ++


VMware Horizon Share Folders Issue with Windows 10

June 12th, 2017 by jason No comments »

I spent some time the last few weekends making various updates and changes to the lab. Too numerous and not all that paramount to go into detail here, with the exception of one issue I did run into. I created a new VMware Horizon pool consisting of Windows 10 Enterprise, Version 1703 (Creators Update). The VM has 4GB RAM and VMware Horizon Agent is installed. This is all key information contributing to my new problem which is the Shared Folders feature seems to have stopped functioning.

That is to say, when launching my virtual desktop from the Horizon Client, there are no shared folders or drives being passed through from where I launched the Horizon Client. Furthermore, the Share Folders menu item is completely missing from the blue Horizon Client pulldown menu.

I threw something out on Twitter and received a quick response from a very helpful VMware Developer by the name of Adam Gross (@grossag).

Adam went on to explain that the issue stems from a registry value defining an amount of memory which is less that the amount of RAM configured in the VM.

The registry key is HKLM\SYSTEM\CurrentControlSet\Control\ and the value configured for SvcHostSplitThresholdInKB is 3670016 (380000 Hex). The 3670016 is expressed in KB which comes out to be 3.5GB. The default Windows 10 VM configuration is deployed with 4GB of RAM which is what I did this past weekend. Since 3.5GB is less than 4GB, the bug rears its head.

Adam mentioned the upcoming 7.2 agent will configure this value at 32GB on Windows 10 virtual machines (that’s 33554432 or 2000000 in Hex) and perhaps even larger in the 7.2 version or some future release of the agent because the reality some day is that 32GB won’t be large enough. Adam went on to explain the maximum amount of RAM supported by Windows 10 x64 is 2TB which comes out to be 2147483648 expressed in KB or 80000000 in Hex. Therefore, it is guaranteed safe (at least to avoid this issue) to set the registry value to 80000001 (in Hex) or higher for any vRAM configuration.

To move on, the value needs to be tweaked manually in the registry. I’ll set mine to 32GB as I’ll likely never have a VDI desktop deployed between now and when the 7.2 agent ships and is installed in my lab.

And the result for posterity.

I found a reboot of the Windows 10 VM was required before the registry change made the positive impact I was looking for. After all was said and done, my shared folders came back as did the menu item from the pulldown on the blue Horizon Client pulldown menu. Easy fix for a rather obscure issue. Once again my thanks to Adam Gross for providing the solution.

VMware Tools causes virtual machine snapshot with quiesce error

July 30th, 2016 by jason No comments »

Last week I was made aware of an issue a customer in the field was having with a data protection strategy using array-based snapshots which were in turn leveraging VMware vSphere snapshots with VSS quiesce of Windows VMs. The problem began after installing VMware Tools version 10.0.0 build-3000743 (reported as version 10240 in the vSphere Web Client) which I believe is the version shipped in ESXI 6.0 Update 1b (reported as version 6.0.0, build 3380124 in the vSphere Web Client).

The issue is that creating a VMware virtual machine snapshot with VSS integration fails. The virtual machine disk configuration is simply two .vmdks on a VMFS-5 datastore but I doubt the symptoms are limited only to that configuration.

The failure message shown in the vSphere Web Client is “Cannot quiesce this virtual machine because VMware Tools is not currently available.”  The vmware.log file for the virtual machine also shows the following:

2016-07-29T19:26:47.378Z| vmx| I120: SnapshotVMX_TakeSnapshot start: ‘jgb’, deviceState=0, lazy=0, logging=0, quiesced=1, forceNative=0, tryNative=1, saveAllocMaps=0 cb=1DE2F730, cbData=32603710
2016-07-29T19:26:47.407Z| vmx| I120: DISKLIB-LIB_CREATE : DiskLibCreateCreateParam: vmfsSparse grain size is set to 1 for ‘/vmfs/volumes/51af837d-784bc8bc-0f43-e0db550a0c26/rmvm02/rmvm02-000001.
2016-07-29T19:26:47.408Z| vmx| I120: DISKLIB-LIB_CREATE : DiskLibCreateCreateParam: vmfsSparse grain size is set to 1 for ‘/vmfs/volumes/51af837d-784bc8bc-0f43-e0db550a0c26/rmvm02/rmvm02_1-00000
2016-07-29T19:26:47.408Z| vmx| I120: SNAPSHOT: SnapshotPrepareTakeDoneCB: Prepare phase complete (The operation completed successfully).
2016-07-29T19:26:56.292Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox timed out.
2016-07-29T19:27:07.790Z| vcpu-0| I120: Tools: Tools heartbeat timeout.
2016-07-29T19:27:11.294Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox timed out.
2016-07-29T19:27:17.417Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox timed out.
2016-07-29T19:27:17.417Z| vmx| I120: Msg_Post: Warning
2016-07-29T19:27:17.417Z| vmx| I120: [msg.snapshot.quiesce.rpc_timeout] A timeout occurred while communicating with VMware Tools in the virtual machine.
2016-07-29T19:27:17.417Z| vmx| I120: —————————————-
2016-07-29T19:27:17.420Z| vmx| I120: Vigor_MessageRevoke: message ‘msg.snapshot.quiesce.rpc_timeout’ (seq 10949920) is revoked
2016-07-29T19:27:17.420Z| vmx| I120: ToolsBackup: changing quiesce state: IDLE -> DONE
2016-07-29T19:27:17.420Z| vmx| I120: SnapshotVMXTakeSnapshotComplete: Done with snapshot ‘jgb’: 0
2016-07-29T19:27:17.420Z| vmx| I120: SnapshotVMXTakeSnapshotComplete: Snapshot 0 failed: Failed to quiesce the virtual machine (31).
2016-07-29T19:27:17.420Z| vmx| I120: VigorTransport_ServerSendResponse opID=ffd663ae-5b7b-49f5-9f1c-f2135ced62c0-95-ngc-ea-d6-adfa seq=12848: Completed Snapshot request.
2016-07-29T19:27:26.297Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox timed out.

After performing some digging, I found VMware had released VMware Tools version 10.0.9 on June 6, 2016. The release notes identify the root cause has been identified and resolved.

Resolved Issues

Attempts to take a quiesced snapshot in a Windows Guest OS fails
Attempts to take a quiesced snapshot after booting a Windows Guest OS fails

After downloading and upgrading VMware Tools version 10.0.9 build-3917699 (reported as version 10249 in the vSphere Web Client), the customer’s problem was resolved. Since the faulty version of VMware Tools was embedded in the customer’s templates used to deploy virtual machines throughout the datacenter, there were a number of VMs needing their VMware Tools upgraded, as well as the templates themselves.

vCenter Server 6 Appliance fsck failed

April 4th, 2016 by jason No comments »

A vCenter Server Appliance (vSphere 6.0 Update 1b) belonging to me was bounced and for some reason was unbootable. The trouble during the boot process begins with /dev/sda3 contains a file system with errors, check forced. At approximately 27% of the way through, the process terminates with fsck failed. Please repair manually and reboot.

Unable to access a bash# prompt from the current state of the appliance, I followed VMware KB 2069041 VMware vCenter Server Appliance 5.5 and 6.0 root account locked out after password expiration, particularly the latter portion of it which provides the steps to modify a kernel option in the GRUB bootloader to obtain a root shell (and subsequently run the e2fsck -y /dev/sda3 repair command.

The steps are outlined in VMware KB 2069041 and are simple to follow.

  1. Reboot the VCSA
  2. Be quick about highlighting the VMware vCenter Server appliance menu option (the KB article recommends hitting the space bar to stop the default countdown)
  3. p (to enter a root password and continue with additional commands the next step)
  4. e (to edit the boot command)
  5. Append init=/bin/bash (followed by Enter to return to the GRUB menu
  6. b (to start the boot process)

This is where e2fsck -y /dev/sda3 is executed to repair file system errors on /dev/sda3 and allow the VCSA to boot successfully.

When the process above completes, reboot the VCSA and that should be all there is to it.

Update 10/9/17: I ran into a similar issue with VCSA 6.5 Update 1 where the appliance wouldn’t boot and I was left at an emergency mode prompt. In this situation, following the steps above isn’t so straight forward in part due to the Photon OS splash screen and no visibility to the GRUB bootloader (following VMware KB 2081464). In this situation, I executed fsck /dev/sda3 at the emergency mode prompt answering yes to all prompts. After reboot, I found this did not resolve all of the issues. I was able to log in by providing the root password twice. The journalctl command revealed a problem with /dev/mapper/log_vg-log. Next I ran fsck /dev/mapper/log_vg-log again answering yes to all prompts to repair. When that was finished, the appliance was rebooted and came up operational.

Error 1603: Java Update did not complete

February 9th, 2016 by jason No comments »

3 Billion Devices Run Java.

Unfortunately (or fortunately depending on how you look at it) my workstation isn’t one of them.

I’m no stranger to Java and I’m more than willing to share my bitter experiences with it but rarely does my dissatisfaction stem from the installation process itself (that is if you don’t count the update frequency). This blog post is for my future self as I’m bound to run into it again.

I encountered a never before seen by me problem installing the Windows Offline (64-bit) version of Java Version 8 Update 73 on Windows 7.

Java Update did not complete

Error Code: 1603

The usual tricks did not work. Uninstalling the old version first. Run as administrator. Directory and registry scrubbing. System reboot. Trying an older version. In fact all version 8 builds exhibited this behavior. It wasn’t until I rolled all the way back to a version 7 build that the installation was successful.

For posterity, this spiceworks thread covers a lot of ground; there seems to be a variety of fixes for an equal number of root causes which all yield the same Java installation failure. The Java knowledgebase covering this issue offers some basic workarounds under Option 1 such as a system reboot, offline installer, uninstalling old versions, all of which I had already tried without success.

The fix in my particular instance was Option 2: Disable Java content (in the web browser) through the Java Control Panel. You’ll find the Java applet in the Windows Control Panel when a 32-bit version of Java is installed.

On the Security tab, temporarily deselect the “Enable Java content in the browser” checkbox.

Complete the 64-bit version of Java installation (launch the installer using “Run as Administrator“) and don’t forget to “Enable Java content in the browser” when finished.

Update 5-12-16: If you’re receiving this error and “Enable Java content in the browser” is already deselected, check the box, Apply, OK, uncheck the box, Apply, OK.


vCloud Director vdnscope-1 could not be found

August 15th, 2015 by jason 3 comments »

For whatever reason, I’ve been spending a pretty fair amount of time lately with vCloud Director both at home as well as at the office. It’s a great product. It always has been, beginning with its Lab Manager roots. Like my last blog post, this writing will exhibit another vCloud Director database editing exercise which stemmed from a problem I encountered in the lab.

I was attempting to get away from my VLAN-backed Network Pool by configuring vCloud Director’s Provider vDC-VXLAN-NP Network Pool which is much more dynamic and powerful in nature. The Provider vDC-VXLAN-NP Network Pool is installed by default in vCloud Director but to configure and use it for Organization and vApp networks, one must follow a set of instructions which basically involves configuring upstream physical switch(es) with jumbo frames, a transport VLAN, and multicast settings, preparing the hosts by installing an agent on each of them using vShield Manager, adding VMkernel ports, Network Scopes, Virtual Wires, and so on (Mike Laverick and Rawlinson Rivera both have easy to follow tutorials. The VMware VXLAN Deployment Guide is also a great read). Once it’s all set up and working, VXLAN is pretty effing cool. Anyway, it sounds like a lot of steps and admittedly it requires some reading and attention to detail, but much of it is automated by vCloud Director, with some bumps along the way.

I did run into a few snags which ultimately lead me to going through the configuration process start to finish a few times. In the end I had to configure the Network Scope in vShield Manager manually when normally this step is performed automatically by vCloud Director via the Enable VXLAN Provider VDC right-click menu item.

Once I got beyond the installation hurdles, there was some residual impact left in the vCloud Director database and vShield Manager such that it all looked to be working properly, except that at the very end I could not power on a vApp with an isolated vApp network which relied on the use of the VXLAN-backed Network Pool. The error message was:

Cannot deploy organization VDC network  (uuid for that network)
com.vmware.vcloud.fabric.nsm.error.VsmException: VSM response error (202): The requested object : vdnscope-1 could not be found. Object identifiers are case sensitive.

[ bb505f5e-27f1-419e-9b05-da0d38a7788f ] Unable to deploy network “vApp net1(urn:uuid:7d813867-d3f1-420d-a0a8-a65263369327)”.

com.vmware.vcloud.fabric.nsm.error.VsmException: VSM response error (202): The requested object : vdnscope-1 could not be found. Object identifiers are case sensitive.

– com.vmware.vcloud.fabric.nsm.error.VsmException: VSM response error (202): The requested object : vdnscope-1 could

not be found. Object identifiers are case sensitive.

– VSM response error (202): The requested object : vdnscope-1 could not be found. Object identifiers are case sensitive.

An object named vdnscope-1 seems to be the obvious problem.

I was not able to make use of the Network Pool Repair function as it was unavailable:

Fortunately I was able to locate a related thread in the VMware Communities which more or less explained what might have happened and what I could try to fix the problem (credit to IamTHEvilONE). This is my interpretation.

Each time a Network Scope is created in the vShield Manager, an underlying object reference is tied to the Network Scope with a naming convention of vdnscope-x where x begins at 1 and is incremented at each create iteration. So the first Network Scope created in vShield Manager by vCloud Director is going to be called vdnscope-1. This object is stored in the vCloud Director database and is referenced each time an Org or vApp network is spun up which leans on the VXLAN-backed Network Pool. This is formally handled at vApp power on. The object is also stored somewhere in the vShield Manager although I was never able to locate it. What happened here is that Network Scope object known by both vCloud Director and vShield Manager were not sync and didn’t match. vCloud Direct dials up vShield Manager and says “I need that vdnscope-1 you have” and vShield Manager responds with “I have no idea what that object is”. Obvious problem.

The solution is fairly simple: Edit the vCloud Director database with the correct Network Scope object reference. But a small problem still remains: I was never able to locate the correct object name in vShield Manager. However, going back to the VMware Communities discussion, I’ll eventually be able to find the correct object name by incrementing the vdnscope-x object reference in the vCloud Director database by 1 until the two sides agree and the vApp powers on successfully.

I’ll borrow the same disclaimer from the previous blog post: An obligatory warning on vCloud database editing. Do as I say, not as I do. Editing the vCloud database should be performed only with the guidance of VMware support. Above all, create a point in time backup of the vCloud database with all vCloud Director cell servers stopped (service vmware-vcd stop). There are a variety of methods in which you can perform this database backup. Use the method that is most familiar to and works for you.

So after stopping the vCloud Director services and getting a vcloud database backup…

Step 1: Open Microsoft SQL Server Management Studio and navigate to the [vcloud].[dbo].[network_pool] table. Under the vdn_scope_id column, increment the vdnscope-1 value from 1 to 2.

Step 2: Start the vCloud Director service in all cell servers (service vmware-vcd start) and verify in vShield Manager the Virtual Wire has been created and the vApp can power on successfully. If it fails, stop vCloud services and repeat Step 1 above while incrementing the vdnscope value to 3, then 4, and so on. In my case, vdnscope-5 did the trick.

vCloud Director is awesome. VXLAN with 16 million networks capability kicks it up a notch.

Updated 8/22/15: I received a tip from Jon Hemming in the form of a blog comment. Jon states he has written a VMware KB article titled Creating an isolated network in VMware vCloud Director reports the error: vdnscope-x does not exist (2065485) which documents a process to get the correct VDN Scope ID via the REST API of vShield as well as update the vCloud Director database. Thank you Jon! I did find the syntax for the curl statement to be slightly off. The KB article calls for the following syntax:

curl -k -u admin:default -X GET https://vshield.boche.lab/api/2.0/vdn/scopes/

The result is HTTP Status 404 The requested resource is not available.

What did work was:

curl -k -u admin:default -X GET https://vshield.boche.lab/api/2.0/vdn/scopes

The only change was removing the trailing forward slash on the URL.