Posts Tagged ‘Cluster’

vSphere 6.7 Storage and action_OnRetryErrors=on

February 8th, 2019

VMware introduced a new storage feature in vSphere 6.0 designed as a flexible option to better handle certain storage problems. Cormac Hogan did a fine job introducing the feature here. Starting with vSphere 6.0 and continuing in vSphere 6.5, each block storage device (VMFS or RDM) is configured with an option called action_OnRetryErrors. Note that in vSphere 6.0 and 6.5 the default value is off, meaning the feature is effectively disabled and no new storage error handling behavior is observed.

This value can be seen with the esxcli storage nmp device list command.

vSphere 6.0/6.5:
esxcli storage nmp device list | grep -A9 naa.6000d3100002b90000000000000ec1e1
naa.6000d3100002b90000000000000ec1e1
Device Display Name: sqldemo1vmfs
Storage Array Type: VMW_SATP_ALUA
Storage Array Type Device Config: {implicit_support=on; explicit_support=off; explicit_allow=on; alua_followover=on; action_OnRetryErrors=off; {TPG_id=61459,TPG_state=AO}{TPG_id=61460,TPG_state=AO}{TPG_id=61462,TPG_state=AO}{TPG_id=61461,TPG_state=AO}}
Path Selection Policy: VMW_PSP_RR
Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=0; lastPathIndex=0: NumIOsPending=0,numBytesPending=0}
Path Selection Policy Device Custom Config:
Working Paths: vmhba1:C0:T2:L141, vmhba1:C0:T3:L141, vmhba2:C0:T3:L141, vmhba2:C0:T2:L141
Is USB: false

If vSphere loses access to a device on a given path, the host will send a Test Unit Ready (TUR) command down that path to check path state. When action_OnRetryErrors=off, vSphere will continue to retry for a period of time because it expects the path to recover. It is important to note here that a path is not immediately marked dead when the first Test Unit Ready command is unsuccessful and results in a retry. There are many retries, in fact, and you'll be able to see them in /var/log/vmkernel.log. Also note that a device typically has multiple paths, and the process is repeated for each additional path tried, assuming the first path is eventually marked dead.
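
If you want to watch these retries during a failover test, tailing the vmkernel log and filtering on the device ID (the example naa ID used throughout this post) is a simple way to do it. The exact log messages vary by build, so treat this as a rough filter rather than a definitive signature:

# watch path events and TUR retries for one device in real time
tail -f /var/log/vmkernel.log | grep -i naa.6000d3100002b90000000000000ec1e1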

Starting with vSphere 6.7, action_OnRetryErrors is enabled by default.

vSphere 6.7:
esxcli storage nmp device list | grep -A9 naa.6000d3100002b90000000000000ec1e1
naa.6000d3100002b90000000000000ec1e1
Device Display Name: sqldemo1vmfs
Storage Array Type: VMW_SATP_ALUA
Storage Array Type Device Config: {implicit_support=on; explicit_support=off; explicit_allow=on; alua_followover=on; action_OnRetryErrors=on; {TPG_id=61459,TPG_state=AO}{TPG_id=61460,TPG_state=AO}{TPG_id=61462,TPG_state=AO}{TPG_id=61461,TPG_state=AO}}
Path Selection Policy: VMW_PSP_RR
Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=0; lastPathIndex=2: NumIOsPending=0,numBytesPending=0}
Path Selection Policy Device Custom Config:
Working Paths: vmhba1:C0:T2:L141, vmhba1:C0:T3:L141, vmhba2:C0:T3:L141, vmhba2:C0:T2:L141
Is USB: false

If vSphere loses access to a device on a given path, the host will send a Test Unit Ready (TUR) command down that path to check path state. When action_OnRetryErrors=on, vSphere will immediately mark the path dead when the first retry is returned. vSphere will not continue to retry TUR commands on a dead path.

This is the part where VMware thinks it’s doing the right thing by immediately fast failing a misbehaving/dodgy/flaky path. The assumption here is that other good paths to the device are available and instead of delaying an application while waiting for path failover during the intensive TUR retry process, let’s fail this one bad path right away so that the application doesn’t have to spin its wheels.

However, if all other paths to the device are impacted by the same underlying (and let's call it transient) condition, each additional path iteratively goes through the same process: TUR, no retry, immediately mark the path as dead, move on to the next path. When all available paths have been exhausted, All Paths Down (APD) kicks in for the device. If and when paths to an APD device become available again, they will be picked back up upon the next storage fabric rescan, whether that's done manually by an administrator or automatically by default every 300 seconds for each vSphere host (Disk.PathEvalTime). From an application/end user standpoint, an I/O delay of up to 5 minutes can be a painfully long time to wait. The irony here is that VMware can potentially turn a transient condition lasting only a few seconds into more of a Permanent Device Loss (PDL)-like condition.
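
As an aside, the 300 second interval mentioned above comes from the Disk.PathEvalTime advanced setting, which can be checked (or, cautiously, changed) from the ESXi shell. This is shown for illustration only and is not the fix discussed later in this post:

# view the current path evaluation interval (default 300 seconds)
esxcli system settings advanced list -o /Disk/PathEvalTime

# example only: shorten the interval to 90 seconds; test carefully before touching production
esxcli system settings advanced set -o /Disk/PathEvalTime -i 90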

All of the above leads me to a support escalation I got involved in with a customer having an Active/Passive block storage array. Active/Passive is a type of array which has multiple storage processors/controllers (usually two) and LUNs are distributed across the controllers in an ownership model whereby each controller owns both the LUNs and the available paths to those LUNs. When an active controller fails or is taken offline proactively (think storage processor reboot due to a firmware upgrade), the paths to the active controller go dark, the passive controller takes ownership of the LUNs and lights up the paths to them – a process which can be measured in seconds, typically more than 2 or 3, often much more than that (this dovetails into the discussion of virtual machine disk timeout best practices). With action_OnRetryErrors=off, vSphere tolerates the transient path outage during the controller failover. With action_OnRetryErrors=on, it doesn’t – each path that goes dark is immediately failed and we have APD for all the volumes on that controller in a fraction of a second.

The problem which was occurring in this customer escalation was a convergence of circumstances:

  • The customer was using vSphere 6.7 and its default of action_OnRetryErrors=on
  • The customer was using an Active/Passive storage array
  • The customer had virtualized Microsoft Windows SQL cluster servers (cluster disk resources are extremely sensitive to APDs in the hypervisor and fail immediately when the cluster detects that a dependent cluster disk has been removed – a symptom introduced by APD)
  • The customer was testing controller failovers

In short, Windows failover clusters have zero tolerance for APD disks.

To resolve the problem in vSphere 6.7, action_OnRetryErrors needs to be disabled for each device backed by the Active/Passive storage array. This must be performed on every host in the cluster having access to the given devices (again, these can be VMFS volumes and/or RDMs). There are a few ways to go about this.

To modify the configuration without a host reboot, take a look at the following example. A command such as this would need to be run on every host in the cluster, and for each device (i.e. in an 8-host cluster with 8 VMFS volumes/RDMs, we need to identify the applicable naa.xxx IDs and run 64 commands. Yes, this could be scripted. Be my guest.):

esxcli storage nmp satp generic deviceconfig set -c disable_action_OnRetryErrors -d naa.6000d3100002b90000000000000ec1e1

I don’t prefer that method a whole lot. It’s tedious and error prone. It could result in cluster inconsistencies. But on the plus side, a host reboot isn’t required, and this setting will persist across reboots. That also means a configuration set at this device level will override any claim rules that could also apply to this device. Keep this in mind if a claim rule is configured but you’re not seeing the desired configuration on any specific device.
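
A quick way to confirm what a specific device actually ended up with, whether the value came from a device-level change or a claim rule, is to query that one device directly (the naa ID below is the example device from earlier):

esxcli storage nmp device list -d naa.6000d3100002b90000000000000ec1e1 | egrep -io action_OnRetryErrors=\\w+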

The deviceconfig set command above could also be scripted for a number of devices on a host. Here's one example. Be very careful that the base naa.xxx string matches all of the devices from the one array that should be configured, and does not modify devices from other array types that should not be configured. Also note that this script is a one-liner, but for blog formatting purposes I manually added a line break before esxcli:

for i in `ls /vmfs/devices/disks | grep -v ":" | grep -i naa.6000D31`; do echo $i; 
esxcli storage nmp satp generic deviceconfig set -c disable_action_OnRetryErrors -d $i; done

Now to verify:

for i in `ls /vmfs/devices/disks | grep -v ":" | grep -i naa.6000D31`; do echo $i; 
esxcli storage nmp device list | grep -A2 $i | egrep -io action_OnRetryErrors=\\w+; done

I like adding a SATP claim rule using a vendor device string a lot better, although changes to claim rules generally require a host reboot to reclaim existing devices with the new configuration. Here's an example:

esxcli storage nmp satp rule add -s VMW_SATP_ALUA -V COMPELNT -P VMW_PSP_RR -o disable_action_OnRetryErrors

Here’s another example using quotes which is also acceptable and necessary when setting multiple option string parameters (refer to this):

esxcli storage nmp satp rule add -s "VMW_SATP_ALUA" -V "COMPELNT" -P "VMW_PSP_RR" -o "disable_action_OnRetryErrors"

When a new claim rule is added, claim rules can be reloaded with the following command.

esxcli storage core claimrule load

Keep in mind the new claim rule will only apply to unclaimed devices. Newly presented devices will inherit the new claim rule; existing devices which are already claimed will not pick it up until the next vSphere host reboot. Devices can be unclaimed without a host reboot, but all I/O to the device must be halted – somewhat of a conundrum if we're dealing with production volumes, datastores being used for heartbeating, etc. Assuming we're dealing with multiple devices, a reboot is just going to be easier and cleaner.
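
For completeness, a single device can be unclaimed and reclaimed without a reboot, but only if absolutely nothing is using it. Treat the following as a sketch for a lab or an idle device rather than a production shortcut (the naa ID is the example device from earlier):

# all I/O to the device must be stopped first; this will fail if the device is in use
esxcli storage core claiming reclaim -d naa.6000d3100002b90000000000000ec1e1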

I like claim rules here better because of the global nature. It's one command line per host in the cluster and it'll take care of all devices from the Active/Passive storage array vendor. No need to worry about coming up with and testing a script. No need to spend hours identifying the naa.xxx IDs and making all of the changes across hosts. No need to worry about tagging other storage vendors' devices with an improper configuration. Lastly, the claim rule in effect is visible in a SATP claim rule list (output trimmed below to the relevant columns):

esxcli storage nmp satp rule list

Name           Vendor    Options                       Rule Group  Default PSP
VMW_SATP_ALUA  COMPELNT  disable_action_OnRetryErrors  user        VMW_PSP_RR

By the way… to remove the SATP claim rules above respectively:

esxcli storage nmp satp rule remove -s VMW_SATP_ALUA -V COMPELNT -P VMW_PSP_RR -o disable_action_OnRetryErrors

esxcli storage nmp satp rule remove -s "VMW_SATP_ALUA" -V "COMPELNT" -P "VMW_PSP_RR" -o "disable_action_OnRetryErrors"

The bottom line here is that there may be a number of VMware customers with Active/Passive storage arrays running vSphere 6.7. If and when a planned or unplanned controller/storage processor failover occurs, APDs may unexpectedly occur, impacting virtual machines and their applications, whereas this was not the case with previous versions of vSphere.

In closing, I want to thank VMware Staff Technical Support Engineering for their work on this case and for ultimately exposing "what changed in vSphere 6.7." We had spent some time trying to reproduce this problem on vSphere 6.5 in an environment similar to the customer's, and we just weren't seeing any problems.

References:

Managing SATPs

No Failover for Storage Path When TUR Command Is Unsuccessful

Storage path does not fail over when TUR command repeatedly returns retry requests (2106770)

Handling Transient APD Conditions

vSphere 6.0 Storage Features Part 6: Action_OnRetryErrors

Setup for Microsoft cluster service

April 1st, 2009

Setting up a Microsoft cluster on VMware used to be a fairly straightforward task with a very minimal set of considerations. Over time, the support documentation has evolved into something that looks like it was written by the U.S. Internal Revenue Service. I was an accountant in my previous life and I remember Alternative Minimum Tax code that was easier to follow than what we have today: a 50-page PDF representing VMware's requirements for MSCS support. Even with that, I'm not sure Microsoft supports MSCS on VMware. The Microsoft SVVP program supports explicit versions and configurations of Windows 2000/2003/2008 on ESX 3.5 update 2 and 3, and ESXi 3.5 update 3, but no mention is made regarding clustering. I could not find a definitive answer on the Microsoft SVVP program site other than the following disclaimer:

For more information about Microsoft’s policies for supporting software running in non-Microsoft hardware virtualization software please refer to http://support.microsoft.com/?kbid=897615. In addition, refer to http://support.microsoft.com/kb/957006/ to find more information about Microsoft’s support policies for its applications running in virtual environments.

At any rate, here are some highlights of MSCS setup on VMware Virtual Infrastructure, and by the way, all of this information is fair game for the VMware VCP exam.

Prerequisites for Cluster in a Box

To set up a cluster in a box, you must have:

* ESX Server host, one of the following:
  * ESX Server 3 – An ESX Server host with a physical network adapter for the service console. If the clustered virtual machines need to connect with external hosts, then an additional network adapter is highly recommended.
  * ESX Server 3i – An ESX Server host with a physical network adapter for the VMkernel. If the clustered virtual machines need to connect with external hosts, a separate network adapter is recommended.
* A local SCSI controller. If you plan to use a VMFS volume that exists on a SAN, you need an FC HBA (QLogic or Emulex).

You can set up shared storage for a cluster in a box either by using a virtual disk or by using a remote raw device mapping (RDM) LUN in virtual compatibility mode (non-pass-through RDM).

When you set up the virtual machine, you need to configure:

* Two virtual network adapters.
* A hard disk that is shared between the two virtual machines (quorum disk).
* Optionally, additional hard disks for data that are shared between the two virtual machines if your setup requires it. When you create hard disks, as described in this document, the system creates the associated virtual SCSI controllers.

Prerequisites for Clustering Across Boxes

The prerequisites for clustering across boxes are similar to those for cluster in a box. You must have:

* ESX Server host. VMware recommends three network adapters per host for public network connections. The minimum configuration is:
  * ESX Server 3 – An ESX Server host configured with at least two physical network adapters dedicated to the cluster, one for the public and one for the private network, and one network adapter dedicated to the service console.
  * ESX Server 3i – An ESX Server host configured with at least two physical network adapters dedicated to the cluster, one for the public and one for the private network, and one network adapter dedicated to the VMkernel.
* Shared storage must be on an FC SAN.
* You must use an RDM in physical or virtual compatibility mode (pass-through RDM or non-pass-through RDM). You cannot use virtual disks for shared storage.

Prerequisites for Standby Host Clustering

The prerequisites for standby host clustering are similar to those for clustering across boxes. You must have:

* ESX Server host. VMware recommends three network adapters per host for public network connections. The minimum configuration is:
  * ESX Server 3 – An ESX Server host configured with at least two physical network adapters dedicated to the cluster, one for the public and one for the private network, and one network adapter dedicated to the service console.
  * ESX Server 3i – An ESX Server host configured with at least two physical network adapters dedicated to the cluster, one for the public and one for the private network, and one network adapter dedicated to the VMkernel.
* You must use RDMs in physical compatibility mode (pass-through RDM). You cannot use virtual disks or RDMs in virtual compatibility mode (non-pass-through RDM) for shared storage.
* You cannot have multiple paths from the ESX Server host to the storage.
* Running third-party multipathing software is not supported. Because of this limitation, VMware strongly recommends that there be only a single physical path from the native Windows host to the storage array in a configuration of standby-host clustering with a native Windows host. The ESX Server host automatically uses native ESX Server multipathing, which can result in multiple paths to shared storage.
* Use the STORport Miniport driver for the FC HBA (QLogic or Emulex) in the physical Windows machine.

Shared storage options:

                                                    Cluster in a Box   Cluster Across Boxes   Standby Host Clustering
Virtual disks                                       Yes                No                     No
Pass-through RDM (physical compatibility mode)      No                 Yes                    Yes
Non-pass-through RDM (virtual compatibility mode)   Yes                Yes                    No

Caveats, Restrictions, and Recommendations

This section summarizes caveats, restrictions, and recommendations for using MSCS in a VMware Infrastructure environment.

* VMware only supports third-party cluster software that is specifically listed as supported in the hardware compatibility guides. For the latest updates to VMware support for Microsoft operating system versions for MSCS, or for any other hardware-specific support information, see the Storage/SAN Compatibility Guide for ESX Server 3.5 and ESX Server 3i.
* Each virtual machine has five PCI slots available by default. A cluster uses four of these slots (two network adapters and two SCSI host bus adapters), leaving one PCI slot for a third network adapter (or other device), if needed.
* VMware virtual machines currently emulate only SCSI-2 reservations and do not support applications using SCSI-3 persistent reservations.
* Use the LSILogic virtual SCSI adapter.
* Use Windows Server 2003 SP2 (32 bit or 64 bit) or Windows 2000 Server SP4. VMware recommends Windows Server 2003.
* Use two-node clustering.
* Clustering is not supported on iSCSI or NFS disks.
* NIC teaming is not supported with clustering.
* The boot disk of the ESX Server host should be on local storage.
* Mixed HBA environments (QLogic and Emulex) on the same host are not supported.
* Mixed environments using both ESX Server 2.5 and ESX Server 3.x are not supported.
* Clustered virtual machines cannot be part of VMware clusters (DRS or HA).
* You cannot use migration with VMotion on virtual machines that run cluster software.
* Set the I/O time-out to 60 seconds or more by modifying HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Disk\TimeOutValue. The system might reset this I/O time-out value if you recreate a cluster. You must reset the value in that case.
* Use the eagerzeroedthick format when you create disks for clustered virtual machines (see the vmkfstools sketch after this list). By default, the VI Client or vmkfstools create disks in zeroedthick format. You can convert a disk to eagerzeroedthick format by importing, cloning, or inflating the disk. Disks deployed from a template are also in eagerzeroedthick format.
* Add disks before networking, as explained in the VMware Knowledge Base article at http://kb.vmware.com/kb/1513.
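
To illustrate the eagerzeroedthick point above, here is a minimal vmkfstools sketch. The size and datastore path are hypothetical, so adjust them to your environment:

# create a 10 GB eagerzeroedthick virtual disk suitable for a shared cluster disk
vmkfstools -c 10G -d eagerzeroedthick /vmfs/volumes/datastore1/node1/quorum.vmdk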

phew!

Make VirtualCenter highly available with VMware Virtual Infrastructure

November 17th, 2008

A few days ago I posted some information on how to make VirtualCenter highly available with Microsoft Cluster Services.

Monday Night Football kickoff is coming up, but I wanted to follow up quickly with another option (as suggested by Lane Leverett): deploy the VirtualCenter Management Server (VCMS) on a Windows VM hosted on a VMware Virtual Infrastructure cluster. Why is this a good option? Here are a few reasons:

  1. It’s fully supported by VMware.
  2. You probably already have a VI cluster in your environment you can leverage. Hit the ground running without spending the time to set up MSCS.
  3. Removing MSCS removes a third-party infrastructure complexity and dependency that requires an advanced skill set to support.
  4. Removing MSCS removes at least one Windows Server license cost, along with the need for the more expensive Windows Enterprise Server licensing and the special hardware requirements of MSCS.
  5. Green factor: Let VCMS leverage the use of VMware Distributed Power Management (DPM).

How does it work? It’s pretty simple. A virtualized VCMS shares the same advantages any other VM inherently has when running on a VMware cluster:

  1. Resource balancing of the four food groups (vProcessor, vRAM, vDisk, and vNIC) through VMware Distributed Resource Scheduler (DRS) technology
  2. Maximum uptime and quick recovery via VMware High Availability (HA) in the event of a VI host failure or isolation condition (yes, HA will still work if the VCMS is down. HA is a VI host agent)
  3. Maximum uptime and quick recovery via VMware High Availability (HA) in the event of a VMware Tools heartbeat failure (ie. the guest OS croaks)
  4. Ability to perform host maintenance without downtime of the VCMS

A few things to watch out for (I’ve been there and done that, more than once):

  1. If you're going to virtualize the VCMS, be sure you do so on a cluster with the necessary licensed options to support the benefits I outlined above (DRS, HA, etc.). This means VI Enterprise licensing is required (see the licensing/pricing chart on page 4 of this document). I don't want to hide the fact that a premium is paid for VI Enterprise licensing, but as I pointed out above, if you've already paid for it, the bolt-ons are unlimited use, so get more value out of them.
  2. If your VCMS (and Update Manager) database is located on the VCMS, be sure to size your virtual hardware appropriately. Don't go overboard though. From a guest OS perspective, it's easier to grant additional virtual resources from the four food groups than it is to retract them.
  3. If you have a power outage and your entire cluster goes down (and your VCMS along with it), it can be difficult to get things back on their feet while you don't have the use of the VCMS, particularly if you've lost the use of other virtualized infrastructure components such as Microsoft Active Directory. Initially it's going to be command line city, so brush up on your CLI. It really all depends on how bad the situation is once you get the VI hosts back up. One example I ran into: host A wouldn't come back up, and host B wasn't the registered owner of the VM I needed to bring up. That required running the vmware-cmd command to register the VM and bring it up on host B (see the sketch below).
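
For reference, here is a minimal sketch of that recovery from the ESX service console. The .vmx path and names are hypothetical:

# list the VMs already registered on this host
vmware-cmd -l
# register the VCMS virtual machine on the surviving host, then power it on
vmware-cmd -s register /vmfs/volumes/datastore1/vcms/vcms.vmx
vmware-cmd /vmfs/volumes/datastore1/vcms/vcms.vmx start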

Well, I missed the first few minutes of Monday Night Football, but everyone who reads (tolerates) my ramblings is totally worth it.

Go forth and virtualize!

Make VirtualCenter highly available with Microsoft Cluster Services

November 12th, 2008

When VirtualCenter was first introduced, many could make the argument that VC was simply a utility class service that provided centralized management for a virtual infrastructure. If the VirtualCenter Management Server (VCMS) was rebooted in the middle of the day or if the VC services were stopped for some reason, it wasn’t too big of a deal providing the outage didn’t interrupt a key task such as a VMotion migration or a cloning process.

Times are changing. VirtualCenter is becoming a fairly critical component in the VI and high availability of VC and the VCMS is becoming increasingly important. Several factors have contributed to this evolution. To identify just a few:

  • Virtual infrastructures are growing rapidly in the datacenter. The need for a functioning centralized management platform increases exponentially.
  • Increased and more granular VC alerting capabilities are relied upon to keep administrators updated with timely information about the load and health of the VI.
  • The introduction of more granular role-based security extended Virtual Infrastructure Client or Web Access deployment to more users and groups in the organization, increasing dependency on VC and the visibility of its downtime.
  • The exposure of the VC API/SDK encouraged many new applications and tools to be written against VC. I’m talking about tools that provide important functions such as backup, reporting, automation, replication, capacity analysis, sVMotion, etc. Without VC running, these tools won’t work.
  • The introduction of plugins. Plugins are going to be the preferred bolt-on for most administrators because they snap into a unified management interface. Obvious dependency on VC.
  • The introduction of new features native to VC functionality. DRS, HA, DPM, VCB, Update Manager, Consolidation, snapshot manager, FT, SRM, etc. Like the bullet above, all of these features require a healthy functioning VCMS.
  • The Virtual Datacenter OS was announced at VMworld 2008 and is comprised of the following essential components: Application vServices, Infrastructure vServices, Cloud vServices, and Management vServices. I don’t know about you, but to me those all sound like services that would need to be highly available. While it is not yet known exactly how existing VI components transform into the VDC-OS, we know the components are going to be integral to VMware’s vision and commitment to cloud computing which needs to be highly available, if not continuously available.

VirtualCenter has evolved from a cornerstone of ESX host management into the entire foundation on which the VI will be built. Try to imagine what the impacts will be in your environment if and when VirtualCenter is down, now and in the future. Dependencies you didn't realize may have waltzed in.

A single VCMS design may be what you’re used to, but fortunately there exists a method by which VC may be made highly available on a multi-node Microsoft Cluster. This document, written by none other than my VI classroom training instructor Chris Skinner, explains how to cluster VirtualCenter 2.5.

If you’re on VirtualCenter 2.0.x, follow this version of the document instead.

Update:  Follow up post here.