Posts Tagged ‘Storage’

vSphere 6.7 Storage and action_OnRetryErrors=on

February 8th, 2019

VMware introduced a new storage feature in vSphere 6.0 which was designed as a flexible option to better handle certain storage problems. Cormac Hogan did a fine job introducing the feature here. Starting with vSphere 6.0 and continuing in vSphere 6.5, each block storage device (VMFS or RDM) is configured with an option called action_OnRetryErrors. Note that in vSphere 6.0 and 6.5, the default value is off, meaning the new feature is effectively disabled and no new storage error handling behavior is observed.

This value can be seen with the esxcli storage nmp device list command.

vSphere 6.0/6.5:
esxcli storage nmp device list | grep -A9 naa.6000d3100002b90000000000000ec1e1
naa.6000d3100002b90000000000000ec1e1
Device Display Name: sqldemo1vmfs
Storage Array Type: VMW_SATP_ALUA
Storage Array Type Device Config: {implicit_support=on; explicit_support=off; explicit_allow=on; alua_followover=on; action_OnRetryErrors=off; {TPG_id=61459,TPG_state=AO}{TPG_id=61460,TPG_state=AO}{TPG_id=61462,TPG_state=AO}{TPG_id=61461,TPG_state=AO}}
Path Selection Policy: VMW_PSP_RR
Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=0; lastPathIndex=0: NumIOsPending=0,numBytesPending=0}
Path Selection Policy Device Custom Config:
Working Paths: vmhba1:C0:T2:L141, vmhba1:C0:T3:L141, vmhba2:C0:T3:L141, vmhba2:C0:T2:L141
Is USB: false

If vSphere loses access to a device on a given path, the host will send a Test Unit Ready (TUR) command down that path to check its state. When action_OnRetryErrors=off, vSphere will continue to retry for a period of time because it expects the path to recover. It is important to note here that a path is not immediately marked dead when the first Test Unit Ready command is unsuccessful and results in a retry. In fact, there are many retries, and you'll be able to see them in /var/log/vmkernel.log. Also note that a device typically has multiple paths, and the process is repeated for each additional path tried, assuming the first path is eventually marked dead.

Starting with vSphere 6.7, action_OnRetryErrors is enabled by default.

vSphere 6.7:
esxcli storage nmp device list | grep -A9 naa.6000d3100002b90000000000000ec1e1
naa.6000d3100002b90000000000000ec1e1
Device Display Name: sqldemo1vmfs
Storage Array Type: VMW_SATP_ALUA
Storage Array Type Device Config: {implicit_support=on; explicit_support=off; explicit_allow=on; alua_followover=on; action_OnRetryErrors=on; {TPG_id=61459,TPG_state=AO}{TPG_id=61460,TPG_state=AO}{TPG_id=61462,TPG_state=AO}{TPG_id=61461,TPG_state=AO}}
Path Selection Policy: VMW_PSP_RR
Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=0; lastPathIndex=2: NumIOsPending=0,numBytesPending=0}
Path Selection Policy Device Custom Config:
Working Paths: vmhba1:C0:T2:L141, vmhba1:C0:T3:L141, vmhba2:C0:T3:L141, vmhba2:C0:T2:L141
Is USB: false

If vSphere loses access to a device on a given path, the host will send a Test Unit Ready (TUR) command down that path to check its state. When action_OnRetryErrors=on, vSphere will immediately mark the path dead when the first TUR command is unsuccessful and returns a retry. vSphere will not continue to retry TUR commands on a dead path.

This is the part where VMware thinks it’s doing the right thing by immediately fast failing a misbehaving/dodgy/flaky path. The assumption here is that other good paths to the device are available and instead of delaying an application while waiting for path failover during the intensive TUR retry process, let’s fail this one bad path right away so that the application doesn’t have to spin its wheels.

However, if all other paths to the device are impacted by the same underlying (and let's call it transient) condition, each additional path iteratively goes through the same process: TUR, no retry, immediately mark the path as dead, move on to the next path. When all available paths have been exhausted, All Paths Down (APD) kicks in for the device. If and when paths to an APD device become available again, they will be picked back up upon the next storage fabric rescan, whether that's done manually by an administrator or automatically by each vSphere host every 300 seconds by default (Disk.PathEvalTime). From an application/end user standpoint, an I/O delay of up to 5 minutes can be a painfully long time to wait. The irony here is that VMware can potentially turn a transient condition lasting only a few seconds into more of a Permanent Device Loss like condition.
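
As an aside, that 300 second default is exposed per host as the Disk.PathEvalTime advanced setting, and it can be inspected (or, with caution, adjusted) from the ESXi shell. A quick illustrative sketch, not a recommendation to change the value:

esxcli system settings advanced list -o /Disk/PathEvalTime

# Example only: lower the path evaluation interval to 90 seconds
esxcli system settings advanced set -o /Disk/PathEvalTime -i 90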

All of the above leads me to a support escalation I got involved in with a customer having an Active/Passive block storage array. Active/Passive is a type of array which has multiple storage processors/controllers (usually two) and LUNs are distributed across the controllers in an ownership model whereby each controller owns both the LUNs and the available paths to those LUNs. When an active controller fails or is taken offline proactively (think storage processor reboot due to a firmware upgrade), the paths to the active controller go dark, the passive controller takes ownership of the LUNs and lights up the paths to them – a process which can be measured in seconds, typically more than 2 or 3, often much more than that (this dovetails into the discussion of virtual machine disk timeout best practices). With action_OnRetryErrors=off, vSphere tolerates the transient path outage during the controller failover. With action_OnRetryErrors=on, it doesn’t – each path that goes dark is immediately failed and we have APD for all the volumes on that controller in a fraction of a second.

The problem which was occurring in this customer escalation was a convergence of circumstances:

  • The customer was using vSphere 6.7 and its default of action_OnRetryErrors=on
  • The customer was using an Active/Passive storage array
  • The customer had virtualized Microsoft Windows SQL Server failover clusters (cluster disk resources are extremely sensitive to APDs in the hypervisor and fail immediately when the cluster detects that a dependent cluster disk has been removed, which is exactly the symptom APD introduces)
  • The customer was testing controller failovers

Windows failover clusters have zero tolerance for APD disks.

To resolve the problem in vSphere 6.7, action_OnRetryErrors needs to be disabled for each device backed by the Active/Passive storage array. This must be performed on every host in the cluster having access to the given devices (again, these can be VMFS volumes and/or RDMs). There are a few ways to go about this.

To modify the configuration without a host reboot, take a look at the following example. A command such as this would need to be run on every host in the cluster, and for each device (i.e., in an 8-host cluster with 8 VMFS volumes/RDMs, we need to identify the applicable naa.xxx IDs and run 64 commands. Yes, this could be scripted. Be my guest.):

esxcli storage nmp satp generic deviceconfig set -c disable_action_OnRetryErrors -d naa.6000d3100002b90000000000000ec1e1

I don’t prefer that method a whole lot. It’s tedious and error prone, and it could result in cluster inconsistencies. On the plus side, a host reboot isn’t required, and the setting persists across reboots. Also note that a configuration set at this device level will override any claim rules that would otherwise apply to the device. Keep this in mind if a claim rule is configured but you’re not seeing the desired configuration on a specific device.
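
To spot check the effective setting on a specific device, something along these lines works (reusing the example naa ID from earlier in this post):

esxcli storage nmp device list -d naa.6000d3100002b90000000000000ec1e1 | grep action_OnRetryErrors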

The above could also be scripted for a number of devices on a host. Here’s one example. Be very careful that the base naa.xxx string matches all of the devices from the one array that should be configured, and does not modify devices from other array types that should not be configured. Also note that this script is a one-liner, but for blog formatting purposes I manually added a line break before esxcli:

for i in `ls /vmfs/devices/disks | grep -v ":" | grep -i naa.6000D31`; do echo $i; 
esxcli storage nmp satp generic deviceconfig set -c disable_action_OnRetryErrors -d $i; done

Now to verify:

for i in `ls /vmfs/devices/disks | grep -v ":" | grep -i naa.6000D31`; do echo $i; 
esxcli storage nmp device list | grep -A2 $i | egrep -io action_OnRetryErrors=\\w+; done

I like adding a SATP claim rule using a vendor device string a lot better, although changes to claim rules for existing devices generally require a reboot of the host to reclaim existing devices with the new configuration. Here’s an example:

esxcli storage nmp satp rule add -s VMW_SATP_ALUA -V COMPELNT -P VMW_PSP_RR -o disable_action_OnRetryErrors

Here’s another example using quotes which is also acceptable and necessary when setting multiple option string parameters (refer to this):

esxcli storage nmp satp rule add -s "VMW_SATP_ALUA" -V "COMPELNT" -P "VMW_PSP_RR" -o "disable_action_OnRetryErrors"

When a new claim rule is added, claim rules can be reloaded with the following command.

esxcli storage core claimrule load

Keep in mind the new claim rule will only apply to unclaimed devices. Newly presented devices will inherit the new claim rule; existing devices which are already claimed will not pick it up until the next vSphere host reboot. Devices can be unclaimed without a host reboot, but all I/O to the device must be halted first, which is somewhat of a conundrum if we’re dealing with production volumes, datastores being used for heartbeating, etc. Assuming we’re dealing with multiple devices, a reboot is just going to be easier and cleaner.
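
For completeness, here’s roughly what the per-device route looks like if a reboot truly isn’t an option. I/O to the device must be quiesced first, and the naa ID is just the example device used throughout this post:

# Unclaim and reclaim the paths to a single quiesced device so the new rule applies
esxcli storage core claiming reclaim -d naa.6000d3100002b90000000000000ec1e1

# Confirm the resulting SATP device config
esxcli storage nmp device list -d naa.6000d3100002b90000000000000ec1e1 | grep action_OnRetryErrors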

I like claim rules here better because of their global nature. It’s one command line per host in the cluster and it’ll take care of all devices from the Active/Passive storage array vendor. No need to worry about coming up with and testing a script. No need to worry about spending hours identifying the naa.xxx IDs and making all of the changes across hosts. No need to worry about tagging other storage vendors’ devices with an improper configuration. Lastly, the claim rule in effect is visible in a SATP claim rule list (output trimmed to the relevant columns):

esxcli storage nmp satp rule list

Name           Vendor    Options                       Rule Group  Default PSP
-------------  --------  ----------------------------  ----------  -----------
VMW_SATP_ALUA  COMPELNT  disable_action_OnRetryErrors  user        VMW_PSP_RR

By the way… to remove the SATP claim rules above respectively:

esxcli storage nmp satp rule remove -s VMW_SATP_ALUA -V COMPELNT -P VMW_PSP_RR -o disable_action_OnRetryErrors

esxcli storage nmp satp rule remove -s "VMW_SATP_ALUA" -V "COMPELNT" -P "VMW_PSP_RR" -o "disable_action_OnRetryErrors"

The bottom line here is that there may be a number of VMware customers with Active/Passive storage arrays running vSphere 6.7. If and when a planned or unplanned controller/storage processor failover occurs, APDs may unexpectedly occur, impacting virtual machines and their applications, whereas this was not the case with previous versions of vSphere.

In closing, I want to thank VMware Staff Technical Support Engineering for their work on this case and for ultimately exposing “what changed in vSphere 6.7”. We had spent some time trying to reproduce this problem on vSphere 6.5, where we had an environment similar to the customer’s, and we just weren’t seeing any problems.

References:

Managing SATPs

No Failover for Storage Path When TUR Command Is Unsuccessful

Storage path does not fail over when TUR command repeatedly returns retry requests (2106770)

Handling Transient APD Conditions

VSPHERE 6.0 STORAGE FEATURES PART 6: ACTION_ONRETRYERRORS

VMware Tools causes virtual machine snapshot with quiesce error

July 30th, 2016

Last week I was made aware of an issue a customer in the field was having with a data protection strategy using array-based snapshots which were in turn leveraging VMware vSphere snapshots with VSS quiesce of Windows VMs. The problem began after installing VMware Tools version 10.0.0 build-3000743 (reported as version 10240 in the vSphere Web Client), which I believe is the version shipped in ESXi 6.0 Update 1b (reported as version 6.0.0, build 3380124 in the vSphere Web Client).

The issue is that creating a VMware virtual machine snapshot with VSS integration fails. The virtual machine disk configuration is simply two .vmdks on a VMFS-5 datastore but I doubt the symptoms are limited only to that configuration.

The failure message shown in the vSphere Web Client is “Cannot quiesce this virtual machine because VMware Tools is not currently available.”  The vmware.log file for the virtual machine also shows the following:

2016-07-29T19:26:47.378Z| vmx| I120: SnapshotVMX_TakeSnapshot start: ‘jgb’, deviceState=0, lazy=0, logging=0, quiesced=1, forceNative=0, tryNative=1, saveAllocMaps=0 cb=1DE2F730, cbData=32603710
2016-07-29T19:26:47.407Z| vmx| I120: DISKLIB-LIB_CREATE : DiskLibCreateCreateParam: vmfsSparse grain size is set to 1 for ‘/vmfs/volumes/51af837d-784bc8bc-0f43-e0db550a0c26/rmvm02/rmvm02-000001.
2016-07-29T19:26:47.408Z| vmx| I120: DISKLIB-LIB_CREATE : DiskLibCreateCreateParam: vmfsSparse grain size is set to 1 for ‘/vmfs/volumes/51af837d-784bc8bc-0f43-e0db550a0c26/rmvm02/rmvm02_1-00000
2016-07-29T19:26:47.408Z| vmx| I120: SNAPSHOT: SnapshotPrepareTakeDoneCB: Prepare phase complete (The operation completed successfully).
2016-07-29T19:26:56.292Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox timed out.
2016-07-29T19:27:07.790Z| vcpu-0| I120: Tools: Tools heartbeat timeout.
2016-07-29T19:27:11.294Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox timed out.
2016-07-29T19:27:17.417Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox timed out.
2016-07-29T19:27:17.417Z| vmx| I120: Msg_Post: Warning
2016-07-29T19:27:17.417Z| vmx| I120: [msg.snapshot.quiesce.rpc_timeout] A timeout occurred while communicating with VMware Tools in the virtual machine.
2016-07-29T19:27:17.417Z| vmx| I120: —————————————-
2016-07-29T19:27:17.420Z| vmx| I120: Vigor_MessageRevoke: message ‘msg.snapshot.quiesce.rpc_timeout’ (seq 10949920) is revoked
2016-07-29T19:27:17.420Z| vmx| I120: ToolsBackup: changing quiesce state: IDLE -> DONE
2016-07-29T19:27:17.420Z| vmx| I120: SnapshotVMXTakeSnapshotComplete: Done with snapshot ‘jgb’: 0
2016-07-29T19:27:17.420Z| vmx| I120: SnapshotVMXTakeSnapshotComplete: Snapshot 0 failed: Failed to quiesce the virtual machine (31).
2016-07-29T19:27:17.420Z| vmx| I120: VigorTransport_ServerSendResponse opID=ffd663ae-5b7b-49f5-9f1c-f2135ced62c0-95-ngc-ea-d6-adfa seq=12848: Completed Snapshot request.
2016-07-29T19:27:26.297Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox timed out.

After performing some digging, I found VMware had released VMware Tools version 10.0.9 on June 6, 2016. The release notes indicate the root cause was identified and resolved.

Resolved Issues

Attempts to take a quiesced snapshot in a Windows Guest OS fails
Attempts to take a quiesced snapshot after booting a Windows Guest OS fails

After downloading and upgrading VMware Tools version 10.0.9 build-3917699 (reported as version 10249 in the vSphere Web Client), the customer’s problem was resolved. Since the faulty version of VMware Tools was embedded in the customer’s templates used to deploy virtual machines throughout the datacenter, there were a number of VMs needing their VMware Tools upgraded, as well as the templates themselves.
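
As a side note, if you need to quickly spot check which Tools version a given VM is reporting without clicking through the Web Client, one rough way to do it from the ESXi shell is with vim-cmd (the VM ID of 42 below is purely an example; pull the real ID from the getallvms output first):

vim-cmd vmsvc/getallvms

vim-cmd vmsvc/get.guest 42 | grep -i toolsversion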

Dell Enterprise Manager Client Gets Linux Makeover

April 24th, 2015

Dell storage customers who have been watching the evolution of Enterprise Manager may be interested in the latest release, which was just made available. Aside from adding support for the brand new SCv2000 Series Storage Centers and bundling Java Platform SE 7 Update 67 with the installation of both the Data Collector on Windows and the Client on Windows or Linux (a prerequisite Java installation is no longer required), a Linux client has been introduced for the first time and runs on several Linux operating systems. The Linux client is Java based and has the same look and feel as the Windows based client. Some of the details about this release are below.

Enterprise Manager 2015 R1 Data Collector and Client management compatibility:

  • Dell Storage Center OS versions 5.5-6.6
  • Dell FS8600 versions 3.0-4.0
  • Dell Fluid Cache for SAN version 2.0.0
  • Microsoft System Center Virtual Machine Manager (SCVMM) versions 2012, 2012 SP1, and 2012 R2
  • VMware vSphere Site Recovery Manager versions 5.x (HCL), 6.0 (compatible)

Enterprise Manager 2015 R1 Client for Linux operating system requirements:

  • RHEL 6
  • RHEL 7
  • SUSE Linux Enterprise 12
  • Oracle Linux 6.5
  • Oracle Linux 7.0
  • 32-bit (x86) or 64-bit (x64) CPU
  • No support for RHEL 5 but I’ve tried it and it seems to work

Although the Enterprise Manager Client for Linux can be installed without a graphical environment, launching and using the client requires one. As an example, neither RHEL 6 nor RHEL 7 installs a graphical environment by default. Installing a graphical environment is similar for RHEL 6 and RHEL 7 in that both require a yum repository, but the procedure is slightly different for each version. There are several resources available on the internet which walk through the process. I’ll highlight a few below.

Log in with root access.

To install a graphical environment for RHEL 6, create a yum repository and install GNOME or KDE by following the procedure here.

To install a graphical environment for RHEL 7, create a yum repository by following this procedure and install GNOME by following the procedure here.
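
For reference, the linked procedures boil down to something along these lines. Treat this as a sketch; package group names can vary by RHEL release and subscription channel:

# RHEL 6: install X and the GNOME desktop group
yum groupinstall "X Window System" "Desktop"

# RHEL 7: install the GUI server environment and boot to it by default
yum groupinstall "Server with GUI"
systemctl set-default graphical.target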

Installing the Enterprise Manager Client is pretty straightforward.  Copy the RPM to a temporary directory on the Linux host and use rpm -U to install:

rpm -U dell-emclient-15.1.2-45.x86_64.rpm

Alternatively, download the client from the Enterprise Manager Data Collector using the following syntax as an example:

wget --no-check-certificate https://em1.boche.lab:3033/em/EnterpriseManager/web/apps/client/EmClient.rpm

rpm -U EmClient.rpm

Once installed, launch the Enterprise Manager Client from the /var/lib/dell/bin/ directory:

cd /var/lib/dell/bin/

./Client

or

/var/lib/dell/bin/Client

We’re rewarded with the Enterprise Manager 2015 R1 Client splash screen. Among the new features found here is the ability to immediately manage SCv2000 Series Storage Centers (the SCv2000 Series is the first Storage Center for which the web based management console has been retired).

Once logged in, it’s business as usual in a familiar UI.

Dell, and Compellent before it, has long offered a variety of options and integrations to manage Storage Center as well as popular platforms and applications. The new Enterprise Manager Client for Linux extends the list of available management methods.

VMware vRealize Operations Manager 6.0.1 & Dell Storage Speed Run

March 11th, 2015

For the most part – 12:38 was my time.

There are a few spots where I could improve but what you see here is what you get – a quick video I threw together outlining a simple VMware vRealize Operations Manager 6.0.1 appliance deployment, including:

  • vCenter adapter configuration
  • Active Directory role integration
  • Dell Storage Solutions Pack installation and configuration
  • Dashboard sharing

Obviously I trimmed some of the “wait” intervals, but the goal here was to cover the quick and easy steps to get vR Ops 6.x up and running, from OVF download to collecting data, in a very short amount of time.

In case you are unaware, VMware vRealize Operations Manager 6.0.1 was released a little under two weeks ago and it includes some improvements over the December 6.0.0 release:

Updates cover all major areas of the product including installation, migration, configuration, licensing, alerting, dashboards, reports, and policies. To take advantage of the following significant enhancements, upgrade to version 6.0.1.

Improved scaling numbers

  • The number of objects that a single large node supports has been increased to 12,000. Also, in multi-node configurations, a four large-node configuration can manage up to 40,000 objects and an eight large-node configuration can manage up to 75,000 objects. For details on scaling numbers and a link to a Sizing Guideline Worksheet, see KB 2093783.

vSphere v6.0 interoperability support

  • With this release, vSphere v6.0 can function both as a platform for vRealize Operations Manager installation, and as an environment to which vRealize Operations Manager can connect for operational assurance.

User interface improvements

  • Corrections in the Views and Reports content for vSphere Hosts and Clusters.
  • Addition of Hierarchical View in the Topology widget.
  • Enhancement to the Geo widget to display objects on a world map.

Licensing improvements

  • New functionality provides a way to use the REST API to add a license key.

Metrics switched to Collection OFF to improve performance

  • Extraneous metrics are switched to Collection OFF in the default Policy. An option to enable Collection is available. However, maintaining metrics in the OFF state saves disk space, improves CPU performance, and has no negative impact on the vRealize Operations Manager functionality to collect and analyze data. For a list of metrics with Collection switched to OFF, see KB 2109869.

Alert Definition Updates

  • Improved alert definitions for vSphere clusters, hosts, and virtual machines, to better detect CPU and memory problems.
  • Improved alert definitions for hosts and virtual machines in the vSphere 5.5 Hardening Guide, to identify and report more non-compliance issues.
  • Additional alert definitions to detect duplicate object names in vCenter and vSphere Storage Management Service errors. Note: To identify duplicate object names in the vCenter Server system, the name-based identification feature must be enabled for the vSphere adapter.

 

I spent a fair amount of time with vC Ops 5.x and I’ll be the first in line to say vR Ops 6.x has a much more polished look and feel, which makes this datacenter management tool much more of a pleasure to work with in terms of installation, configuration, and daily use. But don’t take my word for it, see for yourself:

Dell Compellent Storage Center Command Set Shell cmdlets

January 9th, 2015

If you manage Dell Compellent storage, you may or may not be aware that Windows PowerShell cmdlets are available to ease management pain by way of automation and consistency. While I am able to recognize when scripting is the right tool for the job, I do not author PowerShell scripts on a regular basis. For that reason, I’m not as deeply familiar with all of the cmdlets available within the Dell Compellent Storage Center Command Set Shell as I would like to be.

So how do I get started – what are the cmdlets? There are a few different ways to retrieve a list of cmdlets made available by a PowerShell snapin or module.

VMware vSphere PowerCLI simplifies the process by providing a cmdlet called Get-VICommand. When executed, it returns a list of all the cmdlets provided by the VMware.VimAutomation.Core snapin used to manage a vSphere environment via PowerShell. As of this writing in the 5.5.x generation of vSphere, there are a few other vSphere specific snapins installed with PowerCLI but the cmdlets provided by those aren’t returned by Get-VICommand. Those snapins are:

  • VMware.VimAutomation.Vds – This Windows PowerShell snap-in contains cmdlets that let you manage vSphere Distributed Switches.
  • VMware.VimAutomation.License – This Windows Powershell snap-in contains cmdlets for managing License components.
  • VMware.DeployAutomation – Cmdlets for Rule-Based-Deployment
  • VMware.ImageBuilder – This Windows PowerShell snap-in contains VMware ESXi Image Builder cmdlets used to generate custom images.
  • VMware.VimAutomation.Cloud – This Windows Powershell snap-in contains cmdlets used to manage VMware vCloud Director.

However, not all PowerShell snapins ship with a native shortcut to retrieve a list of their respective cmdlets. In these cases, use Get-Command. Now Get-Command by itself returns cmdlets for all snapins. For a snapin specific list, either of the following will work:

Get-Command -Module "snapin_name"
Get-Command | Where-Object{$_.PSSnapin.Name -eq "snapin_name"}

In the case of Dell Compellent Storage Center Command Set Shell, the snapin is named Compellent.StorageCenter.PSSnapin. To retrieve a list of Dell Compellent cmdlets, use one of the following:

Get-Command -Module "Compellent.StorageCenter.PSSnapin"
Get-Command | Where-Object{$_.PSSnapin.Name -eq "Compellent.StorageCenter.PSSnapin"}

At the time of this writing, there are 105 cmdlets:

Get-Command -Module Compellent.StorageCenter.PSSnapin | Measure-Object

Count    : 105
Average  :
Sum      :
Maximum  :
Minimum  :
Property :

Those who don’t use PowerShell on a regular basis may find the above difficult to easily recall from memory. I had a discussion with Justin Braun (author of The Braun Blog; check out his Dell Compellent articles here) and Mike Matthews (a peer in my office who specializes in Microsoft SQL Server and PowerShell, and is an all-around good guy). Is there an easier and persistent method to retrieve cmdlets from a given snapin? What resulted was a function that can be added to a PowerShell profile which performs just like VMware’s Get-VICommand (I’ll be original and call this one Get-SCCommand to get the list of Storage Center cmdlets).

Edit the PowerShell profile ($profile). Its default location is:

%USERPROFILE%\Documents\WindowsPowerShell\Microsoft.PowerShell_profile.ps1

If the path and profile don’t already exist, they can be created in PowerShell using the following cmdlet:

new-item -itemtype file -path $profile -force

If using PowerShell ISE, the default profile location is:

%USERPROFILE%\Documents\WindowsPowerShell\Microsoft.PowerShellISE_profile.ps1

Add the following to verify the Dell Compellent snapin is loaded and, if not, load it:

If ( !( Get-PSSnapin | Where-Object { $_.Name -eq "Compellent.StorageCenter.PSSnapin" } ) )
{
    Add-PSSnapin Compellent.StorageCenter.PSSnapin | Out-Null
}

Add the Get-SCCommand shortcut function:

Function Get-SCCommand { Get-Command -Module "Compellent.StorageCenter.PSSnapin" }

Save the profile.

Now open any PowerShell environment and use Get-SCCommand which shows a list of 105 Dell Compellent cmdlets (There are 49 additional cmdlets in the compellent.replaymanager.scripting snapin for Replay Manager):

It works with PowerShell ISE as well when the Microsoft.PowerShellISE_profile.ps1 profile is modified:

How about PowerGUI? Yes…

Of course the shortcut function provided in the example above is specific to the Dell Compellent snapin, but it should work for any PowerShell snapin, including the list of VMware snapins not covered by Get-VICommand discussed at the top of the article.

For more on scripting Storage Center, visit the Dell Storage PowerShell Community. Rick Gouin also has a nice collection of scripts that he has authored.

Have a great weekend!

A Common NPIV Problem with a Solution

December 29th, 2014

Several years ago, one of the first blog posts that I tackled was working in the lab with N_Port ID Virtualization often referred to as NPIV for short. The blog post was titled N_Port ID Virtualization (NPIV) and VMware Virtual Infrastructure. At the time it was one of the few blog posts available on the subject because it was a relatively new feature offered by VMware. Over the years that followed, I haven’t heard much in terms of trending adoption rates by customers. Likewise, VMware hasn’t put much effort into improving NPIV support in vSphere or promoting its use. One might contemplate, which is the cause and which is the effect. I feel it’s a mutual agreement between both parties that NPIV in its current state isn’t exciting enough to deploy and the benefits fall into a very narrow band of interest (VMware: Give us in guest virtual Fibre Channel – that would be interesting).

Despite its market penetration challenges, from time to time I do receive an email from someone referring to my original NPIV blog post looking for some help in deploying or troubleshooting NPIV. The nature of the request is common and it typically falls into one of two categories:

  1. How can I set up NPIV with a fibre channel tape library?
  2. Help – I can’t get NPIV working.

I received such a request a few weeks ago from the field asking for general assistance in setting up NPIV with Dell Compellent storage. The correct steps were followed to the best of their knowledge but the virtual WWPNs that were initialized at VM power on would not stay lit after the VM began to POST. In Dell Enterprise Manager, the path to the virtual machine’s assigned WWPN was down. Although the RDM storage presentation was functioning, it was only working through the vSphere host HBAs and not the NPIV WWPN. This effectively means that NPIV is not working:

In addition, the NPIV initialization failure is reflected in the vmkernel.log:

2014-12-15T16:32:28.694Z cpu25:33505)qlnativefc: vmhba64(41:0.0): vlan_id: 0x0
2014-12-15T16:32:28.694Z cpu25:33505)qlnativefc: vmhba64(41:0.0): vn_port_mac_address: 00:00:00:00:00:00
2014-12-15T16:32:28.793Z cpu25:33505)qlnativefc: vmhba64(41:0.0): Assigning new target ID 0 to fcport 0x410a524d89a0
2014-12-15T16:32:28.793Z cpu25:33505)qlnativefc: vmhba64(41:0.0): fcport 5000d3100002b916 (targetId = 0) ONLINE
2014-12-15T16:32:28.809Z cpu27:33505)qlnativefc: vmhba64(41:0.0): Assigning new target ID 1 to fcport 0x410a524d9260
2014-12-15T16:32:28.809Z cpu27:33505)qlnativefc: vmhba64(41:0.0): fcport 5000d3100002b90c (targetId = 1) ONLINE
2014-12-15T16:32:28.825Z cpu27:33505)qlnativefc: vmhba64(41:0.0): Assigning new target ID 2 to fcport 0x410a524d93e0
2014-12-15T16:32:28.825Z cpu27:33505)qlnativefc: vmhba64(41:0.0): fcport 5000d3100002b915 (targetId = 2) ONLINE
2014-12-15T16:32:28.841Z cpu27:33505)qlnativefc: vmhba64(41:0.0): Assigning new target ID 3 to fcport 0x410a524d9560
2014-12-15T16:32:28.841Z cpu27:33505)qlnativefc: vmhba64(41:0.0): fcport 5000d3100002b90b (targetId = 3) ONLINE
2014-12-15T16:32:30.477Z cpu22:19117991)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T16:32:32.477Z cpu22:19117991)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T16:32:34.480Z cpu22:19117991)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T16:32:36.480Z cpu22:19117991)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T16:32:38.482Z cpu22:19117991)ScsiNpiv: 1152: NPIV vport rescan complete, [5:24] (0x410943893dc0) [0x410943680ec0] status=0xbad0040
2014-12-15T16:32:38.503Z cpu22:19117991)ScsiScan: 140: Path ‘vmhba2:C0:T3:L24’: Peripheral qualifier 0x1 not supported
2014-12-15T16:32:38.503Z cpu22:19117991)WARNING: ScsiNpiv: 1141: Physical uid does not match VPORT uid, NPIV Disabled for this VM
2014-12-15T16:32:38.503Z cpu22:19117991)ScsiNpiv: 1152: NPIV vport rescan complete, [3:24] (0x410943856e80) [0x410943680ec0] status=0xbad0132
2014-12-15T16:32:38.503Z cpu22:19117991)WARNING: ScsiNpiv: 1788: Failed to Create vport for world 19117994, vmhba2, rescan failed, status=bad0001
2014-12-15T16:32:38.504Z cpu14:33509)ScsiAdapter: 2806: Unregistering adapter vmhba64

To review, the requirements for implementing NPIV with vSphere are documented by VMware and I outlined the key ones in my original blog post:

  • NPIV support on the fabric switches (typically found in 4Gbps or higher fabric switches but I’ve seen firmware support in 2Gbps switches also)
  • NPIV support on the vSphere host HBAs (this typically means 4Gbps or higher port speeds)
  • NPIV support from the storage vendor
  • NPIV support from a supported vSphere version
  • vSphere Raw Device Mapping
  • Correct fabric zoning configured between host HBAs, the virtual machine’s assigned WWPN(s), and the storage front end ports
  • Storage presentation to the vSphere host HBAs as well as the virtual machine’s assigned NPIV WWPN(s)

If any of the above requirements are not met (plus a handful of others, one of which we’ll get to shortly), vSphere’s NPIV feature will likely not function.
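
As a quick sanity check on the host HBA side of that list, the Fibre Channel adapters, their WWNs, and negotiated link speeds can be listed directly from the ESXi shell (the exact fields shown vary a bit by driver, so consider this illustrative):

esxcli storage san fc list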

In this particular case, the general NPIV requirements were met. However, it was discovered that a best practice had been missed in configuring the QLogic HBA BIOS (the QLogic BIOS is accessed at host reboot by pressing CTRL + Q or ALT + Q when prompted). The Connection Options setting remained at its factory default value of 2, or Loop preferred, otherwise point to point.

Dell Compellent best practices for vSphere call for this value to be hard coded to 1, or Point to point only. When the HBA has multiple ports, this configuration needs to be made across all ports that are used for Dell Compellent storage connectivity. It goes without saying this also applies across all of the fabric attached hosts in the vSphere cluster.

Once the HBA ports were configured for Point to point connectivity on the fabric, the problem was resolved.

Despite the various error messages returned as vSphere probes for possible combinations between the vSphere assigned virtual WWPN and the host WWPNs, NPIV success looks something like this in the vmkernel.log (you’ll notice subtle differences showing success compared to the failure log messages above):

2014-12-15T18:43:52.270Z cpu29:33505)qlnativefc: vmhba64(41:0.0): vlan_id: 0x0
2014-12-15T18:43:52.270Z cpu29:33505)qlnativefc: vmhba64(41:0.0): vn_port_mac_address: 00:00:00:00:00:00
2014-12-15T18:43:52.436Z cpu29:33505)qlnativefc: vmhba64(41:0.0): Assigning new target ID 0 to fcport 0x410a4a569960
2014-12-15T18:43:52.436Z cpu29:33505)qlnativefc: vmhba64(41:0.0): fcport 5000d3100002b916 (targetId = 0) ONLINE
2014-12-15T18:43:52.451Z cpu29:33505)qlnativefc: vmhba64(41:0.0): Assigning new target ID 1 to fcport 0x410a4a569ae0
2014-12-15T18:43:52.451Z cpu29:33505)qlnativefc: vmhba64(41:0.0): fcport 5000d3100002b90c (targetId = 1) ONLINE
2014-12-15T18:43:52.466Z cpu29:33505)qlnativefc: vmhba64(41:0.0): Assigning new target ID 2 to fcport 0x410a4a569c60
2014-12-15T18:43:52.466Z cpu29:33505)qlnativefc: vmhba64(41:0.0): fcport 5000d3100002b915 (targetId = 2) ONLINE
2014-12-15T18:43:52.481Z cpu29:33505)qlnativefc: vmhba64(41:0.0): Assigning new target ID 3 to fcport 0x410a4a569de0
2014-12-15T18:43:52.481Z cpu29:33505)qlnativefc: vmhba64(41:0.0): fcport 5000d3100002b90b (targetId = 3) ONLINE
2014-12-15T18:43:54.017Z cpu0:36379)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T18:43:56.018Z cpu0:36379)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T18:43:58.020Z cpu0:36379)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T18:44:00.022Z cpu0:36379)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T18:44:02.024Z cpu0:36379)ScsiNpiv: 1152: NPIV vport rescan complete, [4:24] (0x4109436ce9c0) [0x410943684040] status=0xbad0040
2014-12-15T18:44:02.026Z cpu2:36379)ScsiNpiv: 1152: NPIV vport rescan complete, [2:24] (0x41094369ca40) [0x410943684040] status=0x0
2014-12-15T18:44:02.026Z cpu2:36379)ScsiNpiv: 1701: Physical Path : adapter=vmhba3, channel=0, target=5, lun=24
2014-12-15T18:44:02.026Z cpu2:36379)ScsiNpiv: 1701: Physical Path : adapter=vmhba2, channel=0, target=2, lun=24
2014-12-15T18:44:02.026Z cpu2:36379)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T18:44:04.028Z cpu2:36379)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T18:44:06.030Z cpu2:36379)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T18:44:08.033Z cpu2:36379)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T18:44:10.035Z cpu2:36379)WARNING: ScsiPsaDriver: 1272: Failed adapter create path; vport:vmhba64 with error: bad0040
2014-12-15T18:44:12.037Z cpu2:36379)ScsiNpiv: 1152: NPIV vport rescan complete, [4:24] (0x4109436ce9c0) [0x410943684040] status=0xbad0040
2014-12-15T18:44:12.037Z cpu2:36379)ScsiNpiv: 1160: NPIV vport rescan complete, [2:24] (0x41094369ca40) [0x410943684040] vport exists
2014-12-15T18:44:12.037Z cpu2:36379)ScsiNpiv: 1701: Physical Path : adapter=vmhba3, channel=0, target=2, lun=24
2014-12-15T18:44:12.037Z cpu2:36379)ScsiNpiv: 1848: Vport Create status for world:36380 num_wwpn=1, num_vports=1, paths=4, errors=3

One last item I’ll note here for posterity is that in this particular case, the problem does not present itself uniformly across all storage platforms. This was an element that prolonged troubleshooting to a degree, because the vSphere cluster was successful in establishing NPIV fabric connectivity to two other types of storage using the same vSphere hosts, hardware, and fabric switches. Because of this, it seemed logical in the beginning to rule out any configuration issues within the vSphere hosts.

To summarize, there are many technical requirements outlined in VMware documentation to correctly configure NPIV. If you’ve followed VMware’s steps correctly but problems with NPIV remain, refer to storage, fabric, and hardware documentation and verify best practices are being met in the deployment.

 

Legacy vSphere Client Plug-in 1.7 Released for Storage Center

July 23rd, 2014

Dell Compellent Storage Center customers who use the legacy vSphere Client plug-in to manage their storage may have noticed that the upgrade to PowerCLI 5.5 R2, which was released with vSphere 5.5 Update 1, essentially “broke” the plug-in. This forced customers to choose between staying on PowerCLI 5.5 in order to keep using the legacy vSphere Client plug-in, or reaping the benefits of the PowerCLI 5.5 R2 upgrade at the cost of abandoning the legacy plug-in.

For those that are unaware, there is a third option, and that is to leverage vSphere’s next generation web client along with the web client plug-in released by Dell Compellent last year (I talked about it at VMworld 2013, which you can take a quick look at below).

Although VMware strongly encourages customers to migrate to the next generation web client long term, I’m here to tell you that in the interim Dell has revved the legacy client plug-in to version 1.7, which is now compatible with PowerCLI 5.5 R2. Both the legacy and web client plug-ins are free and quite beneficial from an operations standpoint, so I encourage customers to get familiar with the tools and use them.

Other bug fixes in this 1.7 release include:

  • Datastore name validation not handled properly
  • Create Datastore, map existing volume – Server Mapping will be removed from SC whether or not it was created by VSP
  • Add Raw Device wizard is not allowing to uncheck a host once selected
  • Remove Raw Device wizard shows wrong volume size
  • Update to use new code signing certificate
  • Prevent Datastores & RDMs with underlying Live Volumes from being expanded or deleted
  • Add support for additional Flash Optimized Storage Profiles that were added in SC 6.4.2
  • Block size not offered when creating VMFS-3 Datastore from Datacenter menu item
  • Add Raw Device wizard is not allowing a host within the same cluster as the select host to be unchecked once it has been selected
  • Add RDM wizard – properties screen showing wrong or missing values
  • Expire Replay wizard – no error reported if no replays selected
  • Storage Consumption stats are wrong if a Disk folder has more than one Storage Type