Posts Tagged ‘vSphere’

Dell Enterprise Manager Client Gets Linux Makeover

April 24th, 2015

Dell storage customers who have been watching the evolution of Enterprise Manager may be interested in the latest release, which was just made available.  Aside from adding support for the brand new SCv2000 Series Storage Centers and bundling Java Platform SE 7 Update 67 with the installation of both the Data Collector on Windows and the Client on Windows or Linux (a prerequisite Java installation is no longer required), this release introduces a Linux client for the first time, which runs on several Linux operating systems.  The Linux client is Java based and has the same look and feel as the Windows based client.  Some of the details about this release are below.

Enterprise Manager 2015 R1 Data Collector and Client management compatibility:

  • Dell Storage Center OS versions 5.5-6.6
  • Dell FS8600 versions 3.0-4.0
  • Dell Fluid Cache for SAN version 2.0.0
  • Microsoft System Center Virtual Machine Manager (SCVMM) versions 2012, 2012 SP1, and 2012 R2
  • VMware vSphere Site Recovery Manager versions 5.x (HCL), 6.0 (compatible)

Enterprise Manager 2015 R1 Client for Linux operating system requirements:

  • RHEL 6
  • RHEL 7
  • SUSE Linux Enterprise 12
  • Oracle Linux 6.5
  • Oracle Linux 7.0
  • 32-bit (x86) or 64-bit (x64) CPU
  • No support for RHEL 5 but I’ve tried it and it seems to work

Although the Enterprise Manager Client for Linux can be installed without a graphical environment, launching and using the client requires one.  As an example, neither RHEL 6 nor RHEL 7 installs a graphical environment by default.  Installing a graphical environment is similar for RHEL 6 and RHEL 7 in that both require a yum repository, but the procedure is slightly different for each version.  There are several resources available on the internet which walk through the process.  I’ll highlight a few below.

Log in with root access.

To install a graphical environment for RHEL 6, create a yum repository and install GNOME or KDE by following the procedure here.

To install a graphical environment for RHEL 7, create a yum repository by following this procedure and install GNOME by following the procedure here.
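
For reference, here is a minimal sketch of what the desktop install looks like on each release, assuming a working yum repository (local DVD/ISO media or a subscription channel) is already in place:

# RHEL 6: install the X Window System and a GNOME desktop
yum groupinstall "X Window System" "Desktop" "Fonts"

# RHEL 7: install the GNOME based graphical environment and boot to it by default
yum groupinstall "Server with GUI"
systemctl set-default graphical.target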

Installing the Enterprise Manager Client is pretty straightforward.  Copy the RPM to a temporary directory on the Linux host and use rpm -U to install:

rpm -U dell-emclient-15.1.2-45.x86_64.rpm

Alternatively, download the client from the Enterprise Manager Data Collector using the following syntax as an example:

wget --no-check-certificate https://em1.boche.lab:3033/em/EnterpriseManager/web/apps/client/EmClient.rpm

rpm -U EmClient.rpm
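
To confirm the package installed cleanly, query the RPM database and list a few of the installed files.  This is a quick sketch; the package name dell-emclient is inferred from the RPM file name above and may differ by release:

rpm -qi dell-emclient
rpm -ql dell-emclient | head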

Once installed, launch the Enterprise Manager Client from the /var/lib/dell/bin/ directory:

cd /var/lib/dell/bin/

./Client

or

/var/lib/dell/bin/Client
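
If you are connecting to the Linux host remotely, another option is to tunnel the client’s display back to your workstation.  This is a sketch which assumes X11 forwarding is permitted by the host’s sshd configuration, an X server is running locally, and rhel7host is a placeholder for your host name:

ssh -X root@rhel7host
# then, inside the SSH session:
/var/lib/dell/bin/Client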

We’re rewarded with the Enterprise Manager 2015 R1 Client splash screen.  New features are found here to immediately manage SCv2000 Series Storage Centers (the SCv2000 Series is the first Storage Center for which the web based management console has been retired).

Once logged in, it’s business as usual in a familiar UI.

Dell, and Compellent before it, has long offered a variety of options and integrations to manage Storage Center as well as popular platforms and applications.  The new Enterprise Manager Client for Linux extends that list of available management methods.

VMware Horizon View Agent 6.1.0 Installation Rollback

March 16th, 2015

With the release of vSphere 6 last week, I decided it was time to update some of the infrastructure in the home lab over the weekend. I got an early start Friday as I had my three remaining wisdom teeth pulled in the AM and took the rest of the day off work.  Now I’m not talking about jumping straight to vSphere 6, not just yet.  I’ve got some constraints that prevent me from going to vSphere 6 at the current time, but I expect I’ll be ready within a month or two.  For the time being, the agenda involved migrating some guest operating systems from Windows Server 2008 R2 to Windows Server 2012 R2, migrating MS SQL Server 2008 R2 to MS SQL Server 2012, updating templates with current VMware Tools, and tackling VMware Horizon View: migrating Composer and the Connection Server from version 5.3 to 6.1.0, including the pool guests and related tools and agents.

I won’t bore anyone with the details on the OS and SQL migrations; that all went as planned. Rather, this writing focuses on an issue I encountered while upgrading VMware Horizon View Agents in Windows 7 guest virtual machines. For the most part, the upgrades went fine as they always have in the past. However, I did run into one annoying Windows 7 guest VM which I could not upgrade from View agent 5.1 to View agent 6.1.0. About two thirds of the way through the 6.1.0 agent upgrade/installation, while the installation wizard was installing services, a ‘Rolling back action‘ process would occur and the upgrade/installation failed.

The View agent installation generates two fairly large log files located in C:\Users\<username>\AppData\Local\Temp\.  I narrowed down the point in time the problem was occurring in the smaller of the two log files.

svm: 03/16/15 10:54:52 — CA exec: VMEditServiceDependencies
svm: 03/16/15 10:54:52 Getting Property CustomActionData = +;vmware-viewcomposer-ga;BFE;Tcpip;Netlogon
svm: 03/16/15 10:54:52 INFO: about to copy final string
svm: 03/16/15 10:54:52 INFO: *copyIter = RpcSs
svm: 03/16/15 10:54:52 INFO: newDependencyString = RpcSs
svm: 03/16/15 10:54:52 INFO: *copyIter = vmware-viewcomposer-ga
svm: 03/16/15 10:54:52 INFO: newDependencyString = RpcSs vmware-viewcomposer-ga
svm: 03/16/15 10:54:52 ERROR: ChangeServiceConfig failed with error: 5
svm: 03/16/15 10:54:52 End Logging
svm: 03/16/15 10:54:53 Begin Logging
svm: 03/16/15 10:54:53 — CA exec: VMEditServiceDependencies
svm: 03/16/15 10:54:53 Getting Property CustomActionData = -;vmware-viewcomposer-ga;BFE;Tcpip;Netlogon
svm: 03/16/15 10:54:53 Cannot query key value HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\DependOnService for size: 2
svm: 03/16/15 10:54:53 Cannot query key value HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\DependOnService for size: 2
svm: 03/16/15 10:54:53 End Logging

In addition, the Windows event log reflected Event ID: 7006 “The ScRegSetValueExW call failed for DependOnService with the following error: Access is denied.”

I had made a few different attempts to install the 6.1.0 agent, each time trying a different approach: checking registry permissions and dependencies, relaxing registry permissions, enabling auditing, temporarily disabling Avast Antivirus, and so on.  The VMware Horizon View Agent installs a handful of components. Although I didn’t yet know what the issue was on the OS, I had the problem narrowed down to the VMware Horizon View Composer Agent portion of the installation, which installs the VMware Horizon View Composer Guest Agent Server service (vmware-viewcomposer-ga is the name of the service if you’re looking in the registry).

After doing some more digging, I found out that some antivirus applications like Panda have a self-preservation mechanism built in which can cause unexpected application problems. Avast has one as well and it’s called the avast! self-defense module. This defense mechanism works independently of the normal real time antivirus scans which I had disabled previously.  I had never run into a problem with Avast in the past, but in this particular instance, Avast was blocking the modification of Windows services and dependencies. The easy solution, and I wish I had known this from the start but I don’t invest much time in antivirus or malware unless I absolutely have to, was to disable the avast! self-defense module, which can be found in the Troubleshooting area of the Avast settings.

Once the avast! self-defense module was disabled, the installation of the VMware Horizon View Agent 6.1.0 agent, including the VMware Horizon View Composer Agent portion, completed successfully. After the agent installation completed, a reboot was performed and I re-enabled the avast! self-defense module.

Thus far I’m impressed with VMware Horizon 6.1. Not much has changed from a UI/management perspective, but stability and cleanup within Composer operations have improved. I built up and tore down a 28 guest Windows 7 VDI pool, and whereas this has led to precarious pool states and manual cleanup steps in the past, it has worked flawlessly so far.  I’m definitely looking forward to the jump to vSphere 6 infrastructure in the coming weeks. All but one of the other lab infrastructure components have been upgraded and are ready at this point, so it shouldn’t be much longer until I have vSphere 5.x in my rear view mirror.

VMware vRealize Operations Manager 6.0.1 & Dell Storage Speed Run

March 11th, 2015

For the most part – 12:38 was my time.

There are a few spots where I could improve but what you see here is what you get – a quick video I threw together outlining a simple VMware vRealize Operations Manager 6.0.1 appliance deployment, including:

  • vCenter adapter configuration
  • Active Directory role integration
  • Dell Storage Solutions Pack installation and configuration
  • Dashboard sharing

Obviously I trimmed some of the “wait” intervals, but the goal here was to cover the quick and easy steps to get vR Ops 6.x up and running, from OVF download to collecting data, in a very short amount of time.

In case you are unaware, VMware vRealize Operations Manager 6.0.1 was released a little under two weeks ago and it includes some improvements over the December 6.0.0 release:

Updates cover all major areas of the product including installation, migration, configuration, licensing, alerting, dashboards, reports, and policies. To take advantage of the following significant enhancements, upgrade to version 6.0.1.

Improved scaling numbers

  • The number of objects that a single large node supports has been increased to 12,000. Also, in multi-node configurations, a four large-node configuration can manage up to 40,000 objects and an eight large-node configuration can manage up to 75,000 objects. For details on scaling numbers and a link to a Sizing Guideline Worksheet, see KB 2093783.

vSphere v6.0 interoperability support

  • With this release, vSphere v6.0 can function both as a platform for vRealize Operations Manager installation, and as an environment to which vRealize Operations Manager can connect for operational assurance.

User interface improvements

  • Corrections in the Views and Reports content for vSphere Hosts and Clusters.
  • Addition of Hierarchical View in the Topology widget.
  • Enhancement to the Geo widget displays objects on a world map.

Licensing improvements

  • New functionality provides a way to use the REST API to add a license key.

Metrics switched to Collection OFF to improve performance

  • Extraneous metrics are switched to Collection OFF in the default Policy. An option to enable Collection is available. However, maintaining metrics in the OFF state saves disk space, improves CPU performance, and has no negative impact on the vRealize Operations Manager functionality to collect and analyze data. For a list of metrics with Collection switched to OFF, see KB 2109869.

Alert Definition Updates

  • Improved alert definitions for vSphere clusters, hosts, and virtual machines, to better detect CPU and memory problems.
  • Improved alert definitions for hosts and virtual machines in the vSphere 5.5 Hardening Guide, to identify and report more non-compliance issues.
  • Additional alert definitions to detect duplicate object names in vCenter and vSphere Storage Management Service errors. Note: To identify duplicate object names in the vCenter Server system, the name-based identification feature must be enabled for the vSphere adapter.

 

I spent a fair amount of time with vC Ops 5.x and I’ll be the first in line to say vR Ops 6.x has a much more polished look and feel, which makes this datacenter management tool much more of a pleasure to work with in terms of installation, configuration, and daily use. But don’t take my word for it, see for yourself:

A Common NPIV Problem with a Solution

December 29th, 2014

Several years ago, one of the first blog posts that I tackled was working in the lab with N_Port ID Virtualization, often referred to as NPIV for short. The blog post was titled N_Port ID Virtualization (NPIV) and VMware Virtual Infrastructure. At the time it was one of the few blog posts available on the subject because it was a relatively new feature offered by VMware. Over the years that followed, I haven’t heard much in terms of trending adoption rates by customers. Likewise, VMware hasn’t put much effort into improving NPIV support in vSphere or promoting its use. One might contemplate which is the cause and which is the effect. I feel it’s a mutual agreement between both parties that NPIV in its current state isn’t exciting enough to deploy and the benefits fall into a very narrow band of interest (VMware: give us in-guest virtual Fibre Channel; that would be interesting).

Despite its market penetration challenges, from time to time I do receive an email from someone referring to my original NPIV blog post looking for some help in deploying or troubleshooting NPIV. The nature of the request is common and it typically falls into one of two categories:

  1. How can I set up NPIV with a fibre channel tape library?
  2. Help – I can’t get NPIV working.

I received such a request a few weeks ago from the field asking for general assistance in setting up NPIV with Dell Compellent storage. The correct steps were followed to the best of their knowledge but the virtual WWPNs that were initialized at VM power on would not stay lit after the VM began to POST. In Dell Enterprise Manager, the path to the virtual machine’s assigned WWPN was down. Although the RDM storage presentation was functioning, it was only working through the vSphere host HBAs and not the NPIV WWPN. This effectively means that NPIV is not working:

In addition, the NPIV initialization failure is reflected in the vmkernel.log:

2014-12-15T16:32:28.694Z cpu25:33505)qlnativefc: vmhba64(41:0.0): vlan_id: 0x0
2014-12-15T16:32:28.694Z cpu25:33505)qlnativefc: vmhba64(41:0.0): vn_port_mac_address: 00:00:00:00:00:00
2014-12-15T16:32:28.793Z cpu25:33505)qlnativefc: vmhba64(41:0.0): Assigning new target ID 0 to fcport 0x410a524d89a0
2014-12-15T16:32:28.793Z cpu25:33505)qlnativefc: vmhba64(41:0.0): fcport 5000d3100002b916 (targetId = 0) ONLINE
2014-12-15T16:32:28.809Z cpu27:33505)qlnativefc: vmhba64(41:0.0): Assigning new target ID 1 to fcport 0x410a524d9260
2014-12-15T16:32:28.809Z cpu27:33505)qlnativefc: vmhba64(41:0.0): fcport 5000d3100002b90c (targetId = 1) ONLINE
2014-12-15T16:32:28.825Z cpu27:33505)qlnativefc: vmhba64(41:0.0): Assigning new target ID 2 to fcport 0x410a524d93e0
2014-12-15T16:32:28.825Z cpu27:33505)qlnativefc: vmhba64(41:0.0): fcport 5000d3100002b915 (targetId = 2) ONLINE
2014-12-15T16:32:28.841Z cpu27:33505)qlnativefc: vmhba64(41:0.0): Assigning new target ID 3 to fcport 0x410a524d9560
2014-12-15T16:32:28.841Z cpu27:33505)qlnativefc: vmhba64(41:0.0): fcport 5000d3100002b90b (targetId = 3) ONLINE
2014-12-15T16:32:30.477Z cpu22:19117991)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T16:32:32.477Z cpu22:19117991)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T16:32:34.480Z cpu22:19117991)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T16:32:36.480Z cpu22:19117991)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T16:32:38.482Z cpu22:19117991)ScsiNpiv: 1152: NPIV vport rescan complete, [5:24] (0x410943893dc0) [0x410943680ec0] status=0xbad0040
2014-12-15T16:32:38.503Z cpu22:19117991)ScsiScan: 140: Path ‘vmhba2:C0:T3:L24’: Peripheral qualifier 0x1 not supported
2014-12-15T16:32:38.503Z cpu22:19117991)WARNING: ScsiNpiv: 1141: Physical uid does not match VPORT uid, NPIV Disabled for this VM
2014-12-15T16:32:38.503Z cpu22:19117991)ScsiNpiv: 1152: NPIV vport rescan complete, [3:24] (0x410943856e80) [0x410943680ec0] status=0xbad0132
2014-12-15T16:32:38.503Z cpu22:19117991)WARNING: ScsiNpiv: 1788: Failed to Create vport for world 19117994, vmhba2, rescan failed, status=bad0001
2014-12-15T16:32:38.504Z cpu14:33509)ScsiAdapter: 2806: Unregistering adapter vmhba64

To review, the requirements for implementing NPIV with vSphere are documented by VMware and I outlined the key ones in my original blog post:

  • NPIV support on the fabric switches (typically found in 4Gbps or higher fabric switches but I’ve seen firmware support in 2Gbps switches also)
  • NPIV support on the vSphere host HBAs (this typically means 4Gbps or higher port speeds)
  • NPIV support from the storage vendor
  • NPIV support from a supported vSphere version
  • vSphere Raw Device Mapping
  • Correct fabric zoning configured between host HBAs, the virtual machine’s assigned WWPN(s), and the storage front end ports
  • Storage presentation to the vSphere host HBAs as well as the virtual machine’s assigned NPIV WWPN(s)

If any of the above requirements are not met (plus a handful of others and we’ll get to one of them shortly), vSphere’s NPIV feature will likely not function.

In this particular case, the general NPIV requirements were met. However, it was discovered that a best practice had been missed in configuring the QLogic HBA BIOS (the QLogic BIOS is accessed at host reboot by pressing CTRL + Q or ALT + Q when prompted). The Connection Options setting remained at its factory default value of 2, or Loop preferred, otherwise point to point.

Dell Compellent best practices for vSphere call for this value to be hard coded to 1, or Point to point only. When the HBA has multiple ports, this configuration needs to be made on every port used for Dell Compellent storage connectivity. It goes without saying this also applies across all of the fabric attached hosts in the vSphere cluster.

Once configured for Point to point connectivity on the fabric, the problem is resolved.
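
As a quick sanity check from the host side after making the change, esxcli can report each Fibre Channel adapter’s attributes, including its port state and negotiated port type.  Treat this as a sketch; the exact output fields vary somewhat by ESXi build and driver:

# List FC adapter attributes for the host HBAs
esxcli storage san fc list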

Despite the various error messages returned as vSphere probes for possible combinations between the vSphere assigned virtual WWPN and the host WWPNs, NPIV success looks something like this in the vmkernel.log (you’ll notice subtle differences showing success compared to the failure log messages above):

2014-12-15T18:43:52.270Z cpu29:33505)qlnativefc: vmhba64(41:0.0): vlan_id: 0x0
2014-12-15T18:43:52.270Z cpu29:33505)qlnativefc: vmhba64(41:0.0): vn_port_mac_address: 00:00:00:00:00:00
2014-12-15T18:43:52.436Z cpu29:33505)qlnativefc: vmhba64(41:0.0): Assigning new target ID 0 to fcport 0x410a4a569960
2014-12-15T18:43:52.436Z cpu29:33505)qlnativefc: vmhba64(41:0.0): fcport 5000d3100002b916 (targetId = 0) ONLINE
2014-12-15T18:43:52.451Z cpu29:33505)qlnativefc: vmhba64(41:0.0): Assigning new target ID 1 to fcport 0x410a4a569ae0
2014-12-15T18:43:52.451Z cpu29:33505)qlnativefc: vmhba64(41:0.0): fcport 5000d3100002b90c (targetId = 1) ONLINE
2014-12-15T18:43:52.466Z cpu29:33505)qlnativefc: vmhba64(41:0.0): Assigning new target ID 2 to fcport 0x410a4a569c60
2014-12-15T18:43:52.466Z cpu29:33505)qlnativefc: vmhba64(41:0.0): fcport 5000d3100002b915 (targetId = 2) ONLINE
2014-12-15T18:43:52.481Z cpu29:33505)qlnativefc: vmhba64(41:0.0): Assigning new target ID 3 to fcport 0x410a4a569de0
2014-12-15T18:43:52.481Z cpu29:33505)qlnativefc: vmhba64(41:0.0): fcport 5000d3100002b90b (targetId = 3) ONLINE
2014-12-15T18:43:54.017Z cpu0:36379)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T18:43:56.018Z cpu0:36379)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T18:43:58.020Z cpu0:36379)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T18:44:00.022Z cpu0:36379)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T18:44:02.024Z cpu0:36379)ScsiNpiv: 1152: NPIV vport rescan complete, [4:24] (0x4109436ce9c0) [0x410943684040] status=0xbad0040
2014-12-15T18:44:02.026Z cpu2:36379)ScsiNpiv: 1152: NPIV vport rescan complete, [2:24] (0x41094369ca40) [0x410943684040] status=0x0
2014-12-15T18:44:02.026Z cpu2:36379)ScsiNpiv: 1701: Physical Path : adapter=vmhba3, channel=0, target=5, lun=24
2014-12-15T18:44:02.026Z cpu2:36379)ScsiNpiv: 1701: Physical Path : adapter=vmhba2, channel=0, target=2, lun=24
2014-12-15T18:44:02.026Z cpu2:36379)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T18:44:04.028Z cpu2:36379)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T18:44:06.030Z cpu2:36379)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T18:44:08.033Z cpu2:36379)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T18:44:10.035Z cpu2:36379)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T18:44:12.037Z cpu2:36379)ScsiNpiv: 1152: NPIV vport rescan complete, [4:24] (0x4109436ce9c0) [0x410943684040] status=0xbad0040
2014-12-15T18:44:12.037Z cpu2:36379)ScsiNpiv: 1160: NPIV vport rescan complete, [2:24] (0x41094369ca40) [0x410943684040] vport exists
2014-12-15T18:44:12.037Z cpu2:36379)ScsiNpiv: 1701: Physical Path : adapter=vmhba3, channel=0, target=2, lun=24
2014-12-15T18:44:12.037Z cpu2:36379)ScsiNpiv: 1848: Vport Create status for world:36380 num_wwpn=1, num_vports=1, paths=4, errors=3

One last item I’ll note here for posterity is that in this particular case, the problem does not present itself uniformly across all storage platforms. This was an element that prolonged troubleshooting to a degree, because the vSphere cluster was successful in establishing NPIV fabric connectivity to two other types of storage using the same vSphere hosts, hardware, and fabric switches. Because of this, in the beginning it seemed logical to rule out any configuration issues within the vSphere hosts.

To summarize, there are many technical requirements outlined in VMware documentation to correctly configure NPIV. If you’ve followed VMware’s steps correctly but problems with NPIV remain, refer to storage, fabric, and hardware documentation and verify best practices are being met in the deployment.
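
When retesting after a change like this, grepping the host’s vmkernel.log for the NPIV and vport messages shown above is a quick way to confirm whether the virtual port came up, for example:

# Watch for NPIV/vport activity while powering on the NPIV-enabled VM
tail -f /var/log/vmkernel.log | grep -iE 'npiv|vport'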

 

VMware vCenter Site Recovery Manager 5.8 First Look

August 26th, 2014

VMware vCenter Site Recovery Manager 5.8 made its debut this week at VMworld 2014 in San Francisco.  Over the past few weeks I’ve had my hands on a release candidate version and I’ve put together a short series of videos highlighting what’s new and providing a first look at SRM management through the new web client plug-in.  I hope you enjoy.

I’ll be at VMworld through the end of the week.  Stop and say Hi – I’d love to meet you.

 

VMware vCenter Site Recovery Manager 5.8 Part 1

VMware vCenter Site Recovery Manager 5.8 Part 2

VMware vCenter Site Recovery Manager 5.8 Part 3

Legacy vSphere Client Plug-in 1.7 Released for Storage Center

July 23rd, 2014

Dell Compellent Storage Center customers who use the legacy vSphere Client plug-in to manage their storage may have noticed that the upgrade to PowerCLI 5.5 R2, which was released with vSphere 5.5 Update 1, essentially “broke” the plug-in. This forced customers to choose between staying on the previous PowerCLI release in order to keep using the legacy vSphere Client plug-in, or reaping the benefits of the PowerCLI 5.5 R2 upgrade with the downside of abandoning the legacy vSphere Client plug-in.

For those who are unaware, there is a third option: leverage vSphere’s next generation web client along with the web client plug-in released by Dell Compellent last year (I talked about it at VMworld 2013, which you can take a quick look at below).

Although VMware strongly encourages customers to migrate to the next generation web client long term, I’m here to tell you that in the interim Dell has revved the legacy client plug-in to version 1.7, which is now compatible with PowerCLI 5.5 R2.  Both the legacy and web client plug-ins are free and quite beneficial from an operations standpoint, so I encourage customers to get familiar with the tools and use them.

Other bug fixes in this 1.7 release include:

  • Datastore name validation not handled properly
  • Create Datastore, map existing volume – Server Mapping will be removed from SC whether or not it was created by VSP
  • Add Raw Device wizard is not allowing to uncheck a host once selected
  • Remove Raw Device wizard shows wrong volume size
  • Update to use new code signing certificate
  • Prevent Datastores & RDMs with underlying Live Volumes from being expanded or deleted
  • Add support for additional Flash Optimized Storage Profiles that were added in SC 6.4.2
  • Block size not offered when creating VMFS-3 Datastore from Datacenter menu item
  • Add Raw Device wizard is not allowing a host within the same cluster as the select host to be unchecked once it has been selected
  • Add RDM wizard – properties screen showing wrong or missing values
  • Expire Replay wizard – no error reported if no replays selected
  • Storage Consumption stats are wrong if a Disk folder has more than one Storage Type

Yet another blog post about vSphere HA and PDL

July 14th, 2014

If you ended up here searching for information on PDL or APD, your evening or weekend plans may be cancelled at this point and I’m sorry for you if that is the case. There are probably 101 or more online resources which discuss the interrelated vSphere storage topics of All Paths Down (known as APD), Permanent Device Loss (known as PDL), and vSphere High Availability (known as HA, and before dinosaurs roamed the Earth, DAS). To put it in perspective, I’ve quickly pulled together a short list of resources below using Google. I’ve read most of them:

VMware KB: Permanent Device Loss (PDL) and All-Paths

VMware KB: PDL AutoRemove feature in vSphere 5.5

Handling the All Paths Down (APD) condition – VMware Blogs

vSphere 5.5. Storage Enhancements Part 9 – PDL

Permanent Device Loss (PDL) enhancements in vSphere 5.0

APD (All Paths Down) and PDL (Permanent Device Loss

vSphere Metro Storage Cluster solutions and PDL’s

vSphere Metro Stretched Cluster with vSphere 5.5 and PDL

Change in Permanent Device Loss (PDL) behavior for 5.1

PDL AutoRemove – CormacHogan.com

How handle the APD issue in vSphere – vInfrastructure Blog

Interpreting SCSI sense codes in VMware ESXi and ESX

What’s New in VMware vSphere® 5.1 – Storage

vSphere configuration for handling APD/PDL – CloudXC

vSphere 5.1 Storage Enhancements – Part 4: All Paths Down

vSphere 5.5 nuggets: changes to disk – Yellow Bricks

ESXi host disk.terminateVMOnPDLDefault configuration

ESXi host VMkernel.Boot.terminateVMOnPDL configuration

vSphere HA in my opinion is a great feature. It has saved my back side more than once both in the office and at home. Several books have been more or less dedicated to the topic and yet it is so easy to use that an entire cluster and all of its running virtual machines can be protected with default parameters (common garden variety) with just two mouse clicks.

VMware’s roots began with compute virtualization so when HA was originally released in VMware Virtual Infrastructure 3 (one major revision before it became the vSphere platform known today), the bits licensed and borrowed from Legato Automated Availability Manager (AAM) were designed to protect against marginal but historically documented amounts of x86 hardware failure thereby reducing unplanned downtime and loss of virtualization capacity to a minimum. Basically if an ESX host yields to issues relating to CPU, memory, or network, VMs restart somewhere else in the cluster.

It wasn’t really until vSphere 5.0 that VMware began building high availability for storage into the platform, aside from legacy design components such as redundant fabrics, host bus adapters (HBAs), multipath I/O (MPIO), failback policies, and, with vSphere 4.0, the pluggable storage architecture (PSA), although this is not to say that any of these design items are irrelevant today; quite the opposite.  vSphere 5.0 introduced Permanent Device Loss (PDL), which handles the unexpected loss of individual storage devices better than APD handling alone did.  Subsequent vSphere 5.x revisions made further PDL improvements, such as improved support for single LUN:single target arrays in 5.1. In short, the new vSphere HA re-write (Legato served its purpose and is gone now) covers much of the storage gap such that in the event of certain storage related failures, HA will restart virtual machines, vApps, services, and applications somewhere else, again to minimize unplanned downtime. Fundamentally, this works just like HA when a vSphere host tips over, but instead the storage tips over and HA is called to action. Note that HA can’t do much about an entire unfederated array failing; this is more about individual storage/host connectivity. Aside from gross negligence on the part of administrators, I believe the failure scenarios are more likely to resonate with non-uniform stretched or metro cluster designs. However, PDL can also occur in small intra datacenter designs.

I won’t go into much more detail about the story that has unfolded with APD and the new features in vSphere 5.x because it has already been documented many times over in some of the links above.  Let’s just say the folks starting out new with vSphere 5.1 and 5.5 had it better than myself and many others did dealing with APD and hostd going dark. However, the trade off for them is they are going to have to deal with Software Defined * a lot longer than I will.

Although I mentioned earlier that vSphere HA is extremely simple to configure, I did also mention that was with default options, which cover a large majority of the host related failures.  Configuring HA to restart VMs automatically, with no user intervention, in the event of a PDL condition is in theory just one configuration change for each host in the cluster. Where to configure it depends on the version of the vSphere host.

vSphere 5.0u1+/5.1: Disk.terminateVMOnPDLDefault = True (/etc/vmware/settings file on each host)

or

vSphere 5.5+: VMkernel.Boot.terminateVMOnPDL = yes (advanced setting on each host, check the box)

One thing about this configuration that had me chasing sense codes in vmkernel logs recently was a lack of clarity on the required host reboot. That’s mainly what prompted this article: I normally don’t cover something that has already been covered well by other writers unless there is something I can add, something was missed, or it has caused me personal pain (my blog plus SEO helps ensure I don’t suffer from the same problems twice). In all of the online articles I had read about these configurations, none mentioned a host reboot requirement, and it’s not apparent that a host reboot is required until PDL actually happens and automatic VM restart via HA does not. The vSphere 5.5 documentation calls it out. Go figure. I’ll admit that sometimes I will refer to a reputable vMcBlog before the product documentation. So let the search engine results show: when configuring VMkernel.Boot.terminateVMOnPDL, a host reboot or restart is required. VMware KB 1038578 also calls out that as of vSphere 5.5 you must reboot the host for VMkernel.boot configuration changes to take effect. I’m not a big fan of HA, or any configuration, being written into VMkernel.boot and requiring a host or VSAN node performance/capacity outage when a change is made, but that is VMware Engineering’s decision and I’m sure there is a relevant reason for it aside from wanting more operational parity with the Windows operating system.
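
For those who prefer the command line, here is a rough sketch of how these settings might be applied from an SSH session on each host.  The esxcli syntax for the 5.5 option is my assumption based on how VMkernel.Boot settings are exposed under esxcli system settings kernel, so verify it against your build (and plan for the reboot) before relying on it:

# vSphere 5.0u1/5.1: append the setting to /etc/vmware/settings
echo 'Disk.terminateVMOnPDLDefault = True' >> /etc/vmware/settings

# vSphere 5.5: assumed esxcli equivalent of VMkernel.Boot.terminateVMOnPDL; a host reboot is still required
esxcli system settings kernel list -o terminateVMOnPDL
esxcli system settings kernel set -s terminateVMOnPDL -v TRUE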

I’ll also reiterate Duncan Epping’s recommendation that if you’re already licensed for HA and have made the design and operational decision to allow HA to restart VMs in the event of a host failure, then the above configuration should be made on all vSphere clustered hosts, whether they are part of a stretched cluster or not, to protect against storage related failures. A PDL can be as simple as one host losing all available paths to a LUN. Without the HA configuration change above, a storage related failure results in user intervention being required to recover all of the virtual machines on the host tied to the failed device.

Lastly, it is mentioned in some of the links above but if this is your first reading on the subject, please allow me to point out that the configuration setting above is for Permanent Device Loss (PDL) conditions only. It is not meant to handle an APD event. The reason behind this is that the storage array is required to send a proper sense code to the vSphere host indicating a PDL condition.  If the entire array fails or is powered off ungracefully taking down all available paths to storage, it has no chance to send PDL sense codes to vSphere.  This would constitute an indefinite All Paths Down or APD condition where vSphere knows storage is unavailable, but is unsure about its return. PDL was designed to answer that question for vSphere, rather than let vSphere go on wondering about it for a long period of time, thus squandering any opportunities to proactively do something about it.

In reality there are a few other configuration settings (again, documented well in the links above) which fine-tune HA more precisely. You’ll almost always want to add these as well.

vSphere 5.0u1+: das.maskCleanShutdownEnabled = True (Cluster advanced options) – this is an accompanying configuration that helps vSphere HA distinguish between VMs that were once powered on and should be restarted versus VMs that were already powered off when a PDL occurred; the latter don’t need to be, and more importantly probably should not be, restarted.

vSphere 5.5+: Disk.AutoremoveOnPDL = 0 (advanced setting on each host) – this is a configuration I first read about on Duncan’s blog, where he recommends that the value be changed from the default of enabled to disabled so that a device is not automatically removed when it enters a PDL state. Aside from the limit on LUN numbers a vSphere host can handle (255), VMware refers to a few cases where the stock behavior of automatically removing a PDL device may be desired, although it doesn’t specifically call out each circumstance beyond problems arising from hosts attempting to send I/O to a dead device. There may be more to come on this in the future, but for now preventing the removal may save fabric rescan time down the road if you can afford the LUN number it consumes. It will also serve as a good visual indicator in the vSphere Client that there is a problematic datastore that needs to be dealt with, in case the PDL automation restarts VMs without anybody noticing the event has occurred. If there are templates or powered off VMs that were not evacuated by HA, the broken datastore will visually persist anyway.
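
Since Disk.AutoremoveOnPDL is a standard host advanced setting, it can also be inspected and changed with esxcli rather than clicking through the vSphere Client.  A minimal sketch:

# Show the current value, then disable automatic removal of PDL devices
esxcli system settings advanced list -o /Disk/AutoremoveOnPDL
esxcli system settings advanced set -o /Disk/AutoremoveOnPDL -i 0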

That’s the short list of configuration changes to make for HA VM restart.  There are actually a few more here. For instance, fine grained HA handling can be coordinated on a per-VM basis by modifying the advanced virtual machine option disk.terminateVMOnPDLDefault for each VM, or scsi#:#.terminateVMOnPDL to fine tune HA on a per virtual disk basis for each VM. I’m definitely not recommending touching these if the situation does not call for it.
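
As a sketch of what those per-VM overrides look like as configuration parameters (added through the virtual machine’s advanced configuration parameters or directly in the .vmx while the VM is powered off; scsi0:1 is just an example virtual disk):

disk.terminateVMOnPDLDefault = "TRUE"
scsi0:1.terminateVMOnPDL = "TRUE"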

In a stock vSphere configuration with VMkernel.Boot.terminateVMOnPDL = no configured (or unintentionally misconfigured I suppose), the following events occur for an impacted virtual machine:

  1. PDL event occurs, sense codes are received and vSphere correctly identifies the PDL condition on the supporting datastore. A question is raised by vSphere for each impacted virtual machine to Retry I/O or Cancel I/O.
  2. Stop. Nothing else happens until each of the questions above is answered with administrator intervention. Answering Retry without the PDL datastore coming back online, or without hot removing the impacted virtual disk (in most cases the .vmx will be impacted anyway and hot removing disks is next to pointless), pretty much sends the VM to hell. Answering Cancel allows HA to proceed with powering off the VM and restarting it on another host with access to the device which went PDL on the original host.

In a modified vSphere configuration with VMkernel.Boot.terminateVMOnPDL = yes configured, the following events occur for an impacted virtual machine:

  1. PDL event occurs, sense codes are received and vSphere correctly identifies the PDL condition on the supporting datastore. A question is raised by vSphere for each impacted virtual machine to Retry I/O or Cancel I/O.
  2. Due to VMkernel.Boot.terminateVMOnPDL = yes vSphere HA automatically and effectively answers Cancel for each impacted VM with a pending question. Again, if the hosts aren’t rebooted after the VMkernel.Boot.terminateVMOnPDL = yes configuration change, this step will mimic the previous scenario essentially resulting in failure to automatically carry out the desired tasks.
  3. Each VM is powered off.
  4. Each VM is powered on.

I’ll note in the VM Event examples above, leveraging the power of Snagit I’ve cut out some of the noise about alarms triggering gray and green, resource allocations changing, etc.

For completeness, following is a list of the PDL sense codes vSphere is looking for from the supported storage array:

  • H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0 (LOGICAL UNIT NOT SUPPORTED)
  • H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x4c 0x0 (LOGICAL UNIT FAILED SELF-CONFIGURATION)
  • H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x3e 0x3 (LOGICAL UNIT FAILED SELF-TEST)
  • H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x3e 0x1 (LOGICAL UNIT FAILURE)

Two isolated examples of PDL taking place seen in /var/log/vmkernel.log:

Example 1:

2014-07-13T20:47:03.398Z cpu13:33486)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x2a (0x4136803b8b80, 32789) to dev “naa.6000d31000ebf600000000000000006c” on path “vmhba2:C0:T0:L30” Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe. Act:EVAL
2014-07-13T20:47:03.398Z cpu13:33486)ScsiDeviceIO: 2324: Cmd(0x4136803b8b80) 0x2a, CmdSN 0xe1 from world 32789 to dev “naa.6000d31000ebf600000000000000006c” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe.
2014-07-13T20:47:03.398Z cpu13:33486)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x2a (0x413682595b80, 32789) to dev “naa.6000d31000ebf600000000000000007c” on path “vmhba2:C0:T0:L2” Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0. Act:FAILOVER

Example 2:

2014-07-14T00:43:49.720Z cpu4:32994)ScsiDeviceIO: 2337: Cmd(0x412e82f11380) 0x85, CmdSN 0x33 from world 34316 to dev “naa.600508b1001c6e17d603184d3555bf8d” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2014-07-14T00:43:49.731Z cpu4:32994)ScsiDeviceIO: 2337: Cmd(0x412e82f11380) 0x4d, CmdSN 0x34 from world 34316 to dev “naa.600508b1001c6e17d603184d3555bf8d” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2014-07-14T00:43:49.732Z cpu4:32994)ScsiDeviceIO: 2337: Cmd(0x412e82f11380) 0x1a, CmdSN 0x35 from world 34316 to dev “naa.600508b1001c6e17d603184d3555bf8d” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2014-07-14T00:48:03.398Z cpu10:33484)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x2a (0x4136823b2dc0, 32789) to dev “naa.60060160f824270012f6aa422e0ae411” on path “vmhba1:C0:T2:L40” Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0. Act:FAILOVER

In no particular order, I want to thank Duncan, Paudie, Cormac, Mohammed, Josh, Adam, Niran, and MAN1$H for providing some help on this last week.

By the way, don’t name your virtual machines or datastores PDL. It’s bad karma.