Posts Tagged ‘SAN’

KB1008130: VMware ESX and ESXi 3.5 U3 I/O failure on SAN LUN(s) and LUN queue is blocked indefinitely

January 19th, 2009

I became aware of this issue last week by word of mouth and received the official Email blast from VMware this morning.

The vulnerability lies in a convergence of circumstances:

1. Fibre channel SAN storage with multipathing
2. A fibre channel SAN path failure or planned path transition
3. Metadata update occurring during the fibre channel SAN path failure where metadata updates include but are not limited to:

a. Power operations of a VM
b. Snapshot operations of a VM (think backups)
c. Storage VMotion (sVMotion)
d. Changing a file’s attributes
e. Creating a VMFS volume
f. Creating, modifying, deleting, growing, or locking of a file on a VMFS volume

The chance of a fibre channel path failure is slim; however, metadata updates happen more often than you might think. Therefore, if a fibre channel path failure occurs, chances are good that a metadata update is in flight, which is precisely when disaster will strike. Moreover, the vulnerability diminishes the safety benefit of multipathing and any reliance on it.

Please be aware of this.

Dear ESX 3.5 Customer,

Our records indicate you recently downloaded VMware® ESX Version 3.5 U3 from our product download site. This email is to alert you that an issue with that product version could adversely affect your environment. This email provides a detailed description of the issue so that you can evaluate whether it affects you, and the next steps you can take to get resolution or avoid encountering the issue.

VMware ESX and ESXi 3.5 U3 I/O failure on SAN LUN(s) and LUN queue is blocked indefinitely. This occurs when VMFS3 metadata updates are being done at the same time that failover to an alternate path occurs for the LUN on which the VMFS3 volume resides. The affected releases are ESX 3.5 Update 3 and ESXi 3.5 U3 Embedded and Installable with both Active/Active or Active/Passive SAN arrays (Fibre Channel and iSCSI).

ESX or ESXi Host may get disconnected from Virtual Center
All paths to the LUNs are in standby state
esxcfg-rescan might take a long time to complete or never complete (hung)
VMKernel logs show entries similar to the following:

Queue for device vml.02001600006006016086741d00c6a0bc934902dd115241 49442035 has been blocked for 6399 seconds.

Please refer to KB 1008130.

A reboot is required to clear this condition.

VMware is working on a patch to address this issue. The knowledge base article for this issue will be updated after the patch is available.

If you encounter this condition, please collect the following information and open an SR with VMware Support:

1. Collect a vsi dump before reboot using /usr/lib/vmware/bin/vsi_traverse.

2. Reboot the server and collect the vm-support dump.

3. Note the activities around the time where a first “blocked for xxxx seconds” message is shown in the VMkernel.

Please consult your local support center if you require further information or assistance. We apologize in advance for any inconvenience this issue may cause you. Your satisfaction is our number one goal.
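
If you want to find the first "blocked for xxxx seconds" entry without scrolling through the log by hand, a few lines of Python will do it. This is only a minimal sketch: the pattern is modeled on the single sample line quoted in the email, so the exact wording in other builds is an assumption, and `find_blocked_queues` is a name I made up for illustration.

```python
import re

# Pattern modeled on the one sample VMkernel line quoted in the email.
# Whether every build words the message exactly this way is an assumption.
BLOCKED_RE = re.compile(
    r"Queue for device (?P<device>.+?) has been blocked for (?P<seconds>\d+) seconds"
)


def find_blocked_queues(lines):
    """Return (device, seconds) for each matching log entry, in log order."""
    hits = []
    for line in lines:
        match = BLOCKED_RE.search(line)
        if match:
            hits.append((match.group("device"), int(match.group("seconds"))))
    return hits


sample = [
    "an unrelated vmkernel entry",
    "Queue for device vml.02001600006006016086741d00c6a0bc934902dd115241 49442035 "
    "has been blocked for 6399 seconds.",
]

for device, seconds in find_blocked_queues(sample):
    print(f"{device}: blocked for {seconds} seconds")
```

The first hit in the list is the one to line up against whatever was going on in the environment at the time.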

Update:  The patch has been released that resolves this

Datacenters need shutdown/startup order

January 1st, 2009

Today I learned of a new blog called Virtual RJ which is owned by Robbert Jan van de Velde (yet another Dutch VMware virtualization enthusiast!).  I was reading an article he had recently written called Making inactive storage active in VirtualCenter.  What hits close to home for me about this article is the need for datacenter playbooks which outline a shutdown/startup order of infrastructure and servers.  Once upon a time, our environment was fairly simple and staff was small.  Although our environment was documented, the need for a formal shutdown/startup order was not so prevalent.  Over the years, staff has grown, new applications have been introduced to the environment, and the number of servers grew into the hundreds.  Not to mention, storage got out of control and with that we brought in SAN infrastructures.

Unless your datacenter is the size of a broom closet, chances are you cannot easily get away with throwing the master power switch to bring up infrastructure and servers in the right order.  Obviously you’re not going to use a power switch to shut everything down ungracefully either, but what may not be so obvious is that a graceful shutdown or startup of servers and infrastructure in random order may not be the best choice considering the health of the environment.

In order to understand the correct shutdown/startup order for your environment, you need to fully understand the web of datacenter dependencies which can range from simple to highly complex.  Knowing your datacenter dependencies means having good documentation of its components:  servers (including clusters), applications, storage, authentication, network, power, cooling, etc.  Virtualization adds a layer as well as I will show in a moment.  Let’s look at a few high level examples of dependencies:

  • Users depend on applications, workstations, network, VDI, etc.
  • Applications depend on databases, network, authentication, storage, other applications, etc.
  • Highly available databases depend on shared storage, clustered servers, etc.
  • Clustered servers depend on shared storage, authentication, network, quorum, etc.
  • Shared storage and network depends on power and cooling.
  • Consolidated virtual infrastructures (including VDI) depend on everything.

The list above may not completely fit your environment, but it should start to get you thinking about what and where the dependencies are in your environment.  Let me re-emphasize that without knowledge of how data flows in your environment, you won’t be able to come up with an accurate dependency tree.  Shutdown and startup orders aside, you’re in a scary position.  Start documenting quickly.  Talk to your peers, developers, managers, etc. to tie your datacenter components together.

So what does the dependency list above mean and how does it translate into a shutdown/startup order?  Well, workstations and VDIs typically have no dependencies and can be shut down first.  Application servers (including VMs) can be shut down next (except for the vCenter server – we’ll need that to shut down VMs and hosts).  Database cluster shutdown follows with the caveat that not all cluster nodes should be shut down at the same time – stagger the shutdown so as not to hang quorum arbitration risking potential corruption of data.  At this point, if all VMs are shut down, we can use vCenter to shut down all ESX/ESXi hosts and then the vCenter server.  At this point, authentication should no longer be needed so let’s shut down the domain controllers.  Getting to the end of the list, we can shut down shared storage, SAN switches, and networking equipment (in that order).  Lastly, we pull the plug on phone systems, Twitter, cooling, and then sever the link to street power.  No really, just kidding – Twitter is not that much of a dependency.  I can quit Twitter any time I want.

Now that we know shutdown order, startup order is typically simple – startup order is the reverse or inverse of the shutdown order.  Example:  Throw the switch for street power.  Engage cooling.  Turn on the PBX.  Fire up the network switches and routers.  SAN switches (go grab a coffee) then shared storage.  Domain controllers, ESX hosts, vCenter, app servers, blah blah blah.  You get the idea.
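
The "startup is the reverse of shutdown" idea maps neatly onto a topological sort of the dependency tree. Here is a minimal sketch using Python's standard `graphlib` module (Python 3.9 or later). The dependency map is a made-up, heavily simplified example rather than my real environment, and the component names are placeholders.

```python
from graphlib import TopologicalSorter

# Hypothetical, simplified dependency map: each key depends on the items
# in its set. Real environments have far more rows than this.
depends_on = {
    "cooling": {"street_power"},
    "network": {"cooling"},
    "san_switches": {"network"},
    "shared_storage": {"san_switches"},
    "domain_controllers": {"network"},
    "esx_hosts": {"shared_storage", "domain_controllers"},
    "vcenter": {"esx_hosts"},
    "app_servers": {"vcenter"},
    "workstations": {"app_servers"},
}


def startup_order(graph):
    """Dependencies come up first, so everything a component needs is ready."""
    return list(TopologicalSorter(graph).static_order())


def shutdown_order(graph):
    """Shutdown is simply the startup order reversed."""
    return list(reversed(startup_order(graph)))


print("Startup: ", " -> ".join(startup_order(depends_on)))
print("Shutdown:", " -> ".join(shutdown_order(depends_on)))
```

Components with no dependency between them can legitimately come out in more than one order, and a circular dependency raises `graphlib.CycleError`, which is worth knowing about before the day you actually need the playbook.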

Everyone on your staff has both lists above memorized right?  If not, you need to get it documented in a shutdown/startup playbook.  I don’t feel one needs complex software or hired technical writers to put this together.  If you understand the dependencies, 85% of the work is already done.  My solution for what I put together was embarrassingly simple:  Microsoft Excel.

The tool itself doesn’t need to be incredibly complex, however, that doesn’t mean your shutdown/startup order will be as simple.  In the spreadsheet I maintain for my environment, I have a few hundred rows of information and many columns representing branch dependencies.  I also have a few different tabs in the spreadsheet with slightly different orders.  This is because we have multiple SANs and if we’re only shutting down one of the SANs for planned maintenance, we only need to shut down its dependencies and not the entire datacenter including the other SANs.
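
The separate tabs for a single SAN amount to asking "what depends, directly or indirectly, on this one component?" A rough sketch of that lookup follows. The component names are invented for illustration, and the spreadsheet itself is not reproduced here.

```python
from collections import deque


def dependents_of(target, depends_on):
    """Return every component that directly or indirectly depends on target."""
    # Invert the map so we can walk from a component to whatever needs it.
    required_by = {}
    for node, deps in depends_on.items():
        for dep in deps:
            required_by.setdefault(dep, set()).add(node)

    seen = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for child in required_by.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical two-SAN environment.
depends_on = {
    "san_a": set(),
    "san_b": set(),
    "db_cluster": {"san_a"},
    "file_server": {"san_b"},
    "crm_app": {"db_cluster"},
    "intranet": {"file_server"},
}

print(sorted(dependents_of("san_a", depends_on)))  # ['crm_app', 'db_cluster']
```

Only the components in that result need to come down for maintenance on `san_a`; everything hanging off `san_b` stays up.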

Like many other types of documentation, the shutdown/startup order should be considered a living/breathing document that needs periodic care and feeding.  When new servers, infrastructure, or applications are brought into the environment, this document needs to be updated to remain current.  When datacenter components are removed, again, a document update is needed.  We’ve got a formal server turnover checklist which catches loose ends like this.  Any server that goes into production must have all the items on its checklist completed first (i.e., all documentation complete, added to backup schedule, added to server security plan, etc.).  Likewise, we also maintain a formal server retirement checklist to make sure we’re not trying to back up retired servers or consume static IP addresses of retired servers.

As our team becomes more distributed and expertise is honed to specific areas of the organization, it is important that all staff members responsible for the environment understand the requirements to shut it down quickly or in a planned fashion.  That means good documentation.  Better documentation also means your peers have the tools needed to do your job while you’re gone and less chance you’ll be called in the middle of the night or while on vacation.

N_Port ID Virtualization (NPIV) and VMware Virtual Infrastructure

October 28th, 2008

A few weeks ago, an associate got me curious about N_Port ID Virtualization (NPIV for short) and what could be done with it in VMware’s current Virtual Infrastructure offerings (VC 2.5u3, ESX 3.5u2).  Most of my SAN equipment is a little on the older side so I haven’t had much chance to play with NPIV or investigate its benefits.  I decided to head into the lab and kick the tires.

As far as I knew, NPIV was a somewhat newer technology, so the first step was to inventory my hardware for NPIV capability.

  • VMware Virtual Infrastructure 3.5 – check!
  • Compaq StorageWorks 4/8 SAN switch – bzzz!
  • Preferably 4Gb SFPs but 2Gb should work also – check!
  • QLogic 2Gb HBAs – bzzz!

Right off the bat, I’ve got some obstacles to overcome.  My SAN switch doesn’t support NPIV in the current firmware version but the fact that it’s a 4Gb switch leads me to believe there may be hope in a newer firmware version.  The SAN switch needs to support NPIV in any NPIV implementation, VMware or otherwise.  The good news is that there’s newer firmware available for the SAN switch.  I upgraded the SAN switch firmware and see that I now have NPIV configuration options on my SAN switch.  One issue resolved.

To validate whether or not a Brocade switch port supports NPIV, check the Port Admin in the GUI console or run the following command from the switch CLI via telnet:

portcfgshow 1  (where 1 is the switch port number)

If NPIV is disabled, it can easily be enabled via the Port Admin GUI or by using the following command from the switch CLI via telnet:

portCfgNPIVPort 5 1  (where 5 is the port number and 1 is the mode 1=enable, 0=disable)

I don’t have a compatible HBA.  That’s a tough one.  VMware’s documentation explains “Currently, the following vendors and types of HBA provide [NPIV] support”

  • QLogic – any 4GB HBA
  • Emulex – 4GB HBAs that have NPIV-compatible firmware

A quick look online at eBay reveals that 4Gb HBAs are outside of my lab’s budget range (most of the lab budget this year was reallocated for a new deck and sprinkler system for the house – funny how things at home tend to mimic the politics in the office).  Fortunately, there’s more than one way to skin a cat.  A few emails later and I have a 60 day demo HBA coming from Hewlett Packard (HP’s OEM part number: FC1243 4GB PCI-X 2.0 DC, QLogic’s part number QLA2462).

To validate whether or not your current HBA supports NPIV, open up the ESX console and run the following command:

cat /proc/scsi/qla2300/1 |grep NPIV  (where qla2300 is the HBA type and 1 is the HBA number)

For Emulex, it’s going to be something like cat /proc/scsi/lpfc/1 |grep NPIV

Obviously, browse your /proc/scsi/ directory to see what HBAs are in use by ESX.
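
If you would rather not guess the driver name and HBA number, the same check can be run across everything under /proc/scsi/ at once. This is a hedged sketch: it only looks for the string "NPIV" exactly as the grep above does, it assumes nothing about the rest of the file layout, and it needs a Python 3 interpreter, which may mean running it against a copy of the files rather than on the ESX console itself. `find_npiv_lines` is a hypothetical helper name.

```python
from pathlib import Path


def find_npiv_lines(proc_scsi_root="/proc/scsi"):
    """Map each HBA file under the root to its lines that mention NPIV."""
    results = {}
    root = Path(proc_scsi_root)
    if not root.is_dir():
        return results
    # Layout assumed from the commands above: <root>/<driver>/<hba number>
    for hba_file in sorted(root.glob("*/*")):
        if not hba_file.is_file():
            continue
        try:
            text = hba_file.read_text(errors="replace")
        except OSError:
            continue  # unreadable entries are skipped, not treated as errors
        matches = [line.strip() for line in text.splitlines() if "NPIV" in line]
        if matches:
            results[str(hba_file)] = matches
    return results


for path, lines in find_npiv_lines().items():
    print(path)
    for line in lines:
        print("   ", line)
```

An empty result means no file mentioned NPIV, which on its own doesn't tell you whether the HBA lacks support or the driver simply doesn't report it.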


In addition to the hardware issues, VMware sparsely distributes key NPIV information across several different documents in their library.   This is a pet peeve of mine.  Nonetheless, these are the VMware documents you need to pay attention to (but like me, you can choose to save the reading until AFTER you run into issues):

After a few days, the demo HBA from HP arrives.  I notice the firmware is from 2005 so I upgrade the firmware to current.  I then begin my testing.  I connected the fibre between the HBA and the SAN switch and powered on the ESX host.  Before allowing the ESX host to boot up, I entered the <CTRL + Q> BIOS configuration of the HBA to see if any new NPIV options had been added with the firmware upgrade.  None.  No mention of NPIV anywhere in the BIOS.  I proceeded to allow ESX to boot up.  Now that the fibre port is hot, I opened the management interface of the Brocade SAN switch and configured the port for the correct speed and NPIV support (this is configured on a port by port basis).  Unfortunately, I’m not seeing that NPIV is in use from the SAN switch point of view.  I decide to create a VM and see if I need to enable NPIV inside the VM first.  Another roadblock as shown below – the NPIV configuration is essentially all grayed out and I see a hint at the bottom saying I need RDM storage.  I’m not sure why I need RDM.  Seems like an odd requirement, but I’ll find out why a little later.


In the lab I have swiSCSI shared storage suitable enough for testing with RDMs.  A few mouse clicks later and I have myself a VM with an RDM.  I head back to the VM configuration and I’m greeted with the success of being able to add WWNs.  Although I could create the WWNs myself by editing the .vmx file by hand, it’s much easier to let ESX assign them for me.  ESX generates exactly five WWNs:  1x Node WWN and 4x Port WWNs (the Port WWNs are what you should zone to).  It goes without saying that once these WWNs are generated, they should remain static in zoned fabrics (you do zone your fabric don’t you?!).


The entries in the .vmx file look like this (really, that’s it):

wwn.node = "25bb000c29000ba5"
wwn.port = "25bb000c29000da5,25bb000c29000ca5,25bb000c29000ea5,25bb000c29000fa5"
wwn.type = "vc"
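
Since the Port WWNs are what you zone to, it can be handy to pull them out of the .vmx file and print them in the colon-separated form commonly used for WWNs. A small sketch using the three entries from my VM; `parse_vmx_wwns` and `colon_format` are names I made up, and the parser only understands simple `key = "value"` lines.

```python
def parse_vmx_wwns(vmx_text):
    """Return (node_wwn, [port_wwns]) from the wwn.* entries in .vmx text."""
    settings = {}
    for raw in vmx_text.splitlines():
        if "=" not in raw:
            continue
        key, value = raw.split("=", 1)
        # Strip whitespace, then straight or curly double quotes.
        settings[key.strip()] = value.strip().strip('"\u201c\u201d')
    node = settings.get("wwn.node")
    ports = [p for p in settings.get("wwn.port", "").split(",") if p]
    return node, ports


def colon_format(wwn):
    """Turn 25bb000c29000da5 into 25:bb:00:0c:29:00:0d:a5."""
    return ":".join(wwn[i:i + 2] for i in range(0, len(wwn), 2))


vmx_text = '''
wwn.node = "25bb000c29000ba5"
wwn.port = "25bb000c29000da5,25bb000c29000ca5,25bb000c29000ea5,25bb000c29000fa5"
wwn.type = "vc"
'''

node, ports = parse_vmx_wwns(vmx_text)
print("Node WWN:", colon_format(node))
for port in ports:
    print("Port WWN:", colon_format(port))
```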

Two steps forward, one step back.  I power cycled the VM a few times and I’m still not seeing any sign of NPIV kicking in on the SAN switch.  I should be seeing the virtual WWNs coming online so that I can zone them to something.  Referring to the sparse VMware documentation on NPIV, I discovered how VMware’s implementation of NPIV (version 1.0) works and I also learned I was missing a critical hardware component:  a SAN.  This ties back to my previous questioning of why an RDM is required for NPIV.  So quickly, here’s how NPIV works on VMware Virtual Infrastructure when NPIV is enabled (I obtained access to a SAN to work all of this out):
  1. When the VM is powered on, before the virtual hardware POSTs, it scans the physical HBAs of the ESX host for the RDM mapping to SAN storage.  SAN storage connected to HBAs is a hard requirement.  If an HBA doesn’t support NPIV, it is skipped in the detection process.  If ESX cannot see the zoned RDM LUN through an NPIV aware HBA, the HBA is skipped in the detection process.
  2. If and when an RDM SAN LUN is discovered through the detection process via an NPIV aware HBA through an NPIV capable SAN switch, fireworks go off and magic happens.  One of the four virtual Port WWNs (in the order they appear in the .vmx file) is assigned to the physical HBA and the NPIV virtual Port WWN is activated on the SAN switch.
  3. ESX will assign a maximum of four NPIV Port WWNs during the detection process.  What this means is that if you have four NPIV HBAs connected to four NPIV aware SAN switch ports which are in turn zoned to four SAN LUNs, all four will be NPIV activated.  If you have only one NPIV HBA, you’ll only use one of the virtual Port WWNs.  If you have six NPIV HBAs, only the first four will be activated with NPIV Port WWNs in the discovery process.
  4. Zoning and storage presentation.  Here’s the catch 22 in this contraption and it’s a big one.
    1. I can’t get the ESX generated NPIV Port WWNs to activate on the switch until the VM can see RDM SAN LUN storage targets!
    2. I can’t easily zone RDM SAN storage processors to NPIV Port WWNs until the SAN switch can see the NPIV Port WWNs come online (I use soft zoning by WWN, not hard zoning by physical switch port)!!
    3. I can’t configure selective storage presentation (easily) on the SAN until the SAN can see the NPIV Port WWNs!!!
    4. The detection process at VM POST literally takes less than five seconds total to be successful or to fail and one second or less per HBA scan so to coordinate the correct GUI screens in the SAN switch management console, the selective storage presentation SAN console, and the VM console to toggle power state, takes incredible hand/eye coordination and timing.  It’s literally lining up all the screens, powering on the VM and hitting the refresh button in each of the SAN management consoles to capture the NPIV Port WWN that briefly comes online during the detection process, then goes away after failing to find an RDM SAN LUN.
    5. The only way to make this all work easily in my favor is to disable zoning on the SAN switch and disable selective storage presentation on the SAN.
  5. At any time during the initial detection process or while the VM is already online in operation, should an NPIV hardware or zoning requirement fail to be met for the RDM raw storage on SAN, the VM will fall back to using the Port WWN of the physical HBA that its NPIV Port WWN assignment was traversing.
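
The detection rules in steps 1 through 3 can be modeled in a few lines. To be clear, this is my own toy model of the behavior I observed, not VMware's code. It assumes an HBA that gets skipped does not use up one of the four Port WWNs, which I did not verify, and the `Hba` class and `assign_port_wwns` function are invented for illustration.

```python
from dataclasses import dataclass

MAX_NPIV_PORTS = 4  # at most four Port WWNs are assigned during detection


@dataclass
class Hba:
    name: str
    npiv_capable: bool
    sees_rdm_lun: bool


def assign_port_wwns(hbas, port_wwns):
    """Pair virtual Port WWNs, in .vmx order, with HBAs that pass detection."""
    assignments = {}
    available = iter(port_wwns[:MAX_NPIV_PORTS])
    for hba in hbas:
        if not (hba.npiv_capable and hba.sees_rdm_lun):
            continue  # skipped in the detection process
        wwn = next(available, None)
        if wwn is None:
            break  # all virtual Port WWNs are already in use
        assignments[hba.name] = wwn
    return assignments


port_wwns = [
    "25bb000c29000da5",
    "25bb000c29000ca5",
    "25bb000c29000ea5",
    "25bb000c29000fa5",
]
hbas = [
    Hba("vmhba1", npiv_capable=True, sees_rdm_lun=True),
    Hba("vmhba2", npiv_capable=False, sees_rdm_lun=True),
    Hba("vmhba3", npiv_capable=True, sees_rdm_lun=True),
]

print(assign_port_wwns(hbas, port_wwns))
```

An empty result corresponds to step 5: no NPIV Port WWN comes online and the VM's traffic rides on the physical HBA's own Port WWN.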

Once I met all of the requirements above and got NPIV working, the result was rather anticlimactic for the amount of effort that was involved.  Here’s what NPIV looks like from Port Admin on a Brocade switch (blue is the physical HBA, green is the NPIV Port WWN that VMware generated):

[Screenshot npiv5: Brocade Port Admin showing the physical HBA and the NPIV Port WWN]

I asked myself the questions “Why would anyone even do this?  What are the benefits?”  There aren’t many, at least not right now with this implementation.  By far, I think the largest benefit is going to be for the SAN administrator.  Maybe a SAN switch port or storage controller is running hot.  Without NPIV, we have many VMs communicating with back end SAN storage over a shared HBA which to the SAN administrator appears as a single Port WWN in his/her SAN admin tools.  However, with NPIV, the SAN admin tools can now monitor the individual virtualized streams of I/O traffic that tie back to individual VMs.  I liken it to the unique channels in the Citrix ICA protocol that is carried over TCP/IP.  Each of those channels can be monitored and in some cases be throttled or given priority.  The same concept applies to virtualized channels of VM disk I/O traffic through a physical HBA.  Another analogy would be VLANs for disk I/O traffic, but in a very primitive stage.

Another thought for this is to provide a layer of security if we could zone a SAN storage controller solely to an NPIV Port WWN.  Right now, however, this is impossible because, as was explained in #5 above, any time the physical HBA is removed from the NPIV visibility chain, NPIV shuts down and falls back to the physical HBA for traffic.  At that point you’ve zoned out your physical HBA and disk I/O traffic would quickly queue and then halt, sending your VM into obvious distress.

A few tips that I’ve personally come up with in this exploration process:

  1. Don’t remove and then re-add NPIV WWNs in the VM once it has all initially been zoned because ESX will assign a completely new set of WWNs.
  2. If you’ve done the above, you can modify the WWNs by hand in the .vmx file.  Remove the VM from inventory first, then modify the .vmx, then re-add the VM to inventory because VirtualCenter (or the VIC) likes to hold on to the generated WWNs if you don’t.
  3. Adding or removing physical HBAs on the host or RDMs on the VM causes the discovery process to mismatch different NPIV Port WWNs with physical HBAs thus throwing off the zoning and causing the whole thing to bomb to the point all NPIV discovery fails.
  4. If the above happens, you can change the order of the NPIV Port WWN assignment discovery in the .vmx file.
  5. You can VMotion with NPIV, however, make sure the RDM file is located on the same datastore where the VM configuration file resides.  Storage VMotion or VMotion between datastores isn’t allowed with NPIV enabled.
  6. The location of the RDM metadata (pointer) file can be on SAN or local VMFS storage
  7. On an HP MSA SAN, the hosts and corresponding Port WWNs can manually be created in the CLI (or temporarily disable SSP to ease the zoning process)
  8. Removing/adding RDMs can throw off the NPIV Port WWN assignments which in turn throws off zoning
  9. Discovery order of NPIV Port WWNs is tied to physical HBAs.  Adding or removing HBAs throws off the NPIV Port WWN assignments which in turn throws off zoning

Conclusion:  This is version 1.0 of VMware NPIV and it functions as such.  We need much more flexibility in future versions from all facets:  discovery process, better interface for management, editing of the WWNs in the VIC, pinning of WWNs to physical HBAs, monitoring of NPIV Port WWN disk I/O traffic in VIC performance graphs, guaranteed isolation for security, etc.

Connect a fibre attached tape device to a VM on ESX

October 27th, 2008

Have you ever considered virtualizing your tape backup server? Maybe you’ve thought about it in the past, but the drawbacks were too compelling to go ahead and virtualize. For instance, you would have to pin a VM to a clustered ESX host that has a SCSI attached tape device hanging off of it. Pinning a VM to a clustered host means you lose the benefit of VM portability, you lose the flexibility of host maintenance during production cycles, and you lose the use of valuable dollars spent on VMotion, DRS, HA, shared storage, and FT (future).

What if you had the hardware to make it possible? Would you do it? If I had to purchase hardware to specifically make this happen, cost effectiveness would need to be researched. Everything else being equal, if I had the hardware infrastructure in place already, yes I would. I had access to the hardware, so I headed into my lab to give it a shot.

What’s required?

  • Hardware
    • One or more ESX hosts
    • At least one fibre HBA in each ESX host that supports fibre tape devices (enabled in the HBA BIOS typically at POST)
    • A fibre attached tape device (the fibre HBA in a tape device is called an NSR or Network Storage Router)
    • At least one fibre SAN switch
      • If using more than one in a single fabric, be sure they are ISL’d together
      • If multipathing in separate fabrics, at least two HBAs per host will be required and at least two NSRs in the tape device (although this is really going overboard and would be quite expensive)
    • Fibre cable between ESX host and SAN switch
    • Fibre cable between NSR and SAN switch
    • Optional components: shared storage
  • Software
    • VMware ESX or ESXi
    • Virtual Infrastructure Client
    • Latest firmware on HBA(s), NSR(s), and SAN switch(es)
    • Appropriate zoning created and enabled on SAN switch for all ESX host HBAs and NSR
    • Optional components: VirtualCenter, VMotion, DRS, HA

The steps to setting this up aren’t incredibly difficult.

  1. Attach fibre cables between HBAs and SAN switch
  2. Attach fibre cable between NSR and SAN switch
  3. On the fibre SAN switch, zone the NSR to all HBAs in all ESX hosts that will participate. Be sure to enable the active zone configuration. On Brocade SAN switches this is a two step process.
  4. Perform a scan on the fibre HBA cards (on all ESX hosts) to discover the fibre tape device. In this case, I’ve got an HP MSL5026 autoloader containing a robotic library and two tape drives:
  5. Once each ESX host can “see” the tape device, add the tape device to the VM as a SCSI passthru device. In the drop down selection box, the two tape drives are seen as “tape” and the robotic library is seen as “media”. Take a look at the .vmx file and see how the SCSI passthru device maps back to VMHBA1:1:2 and ultimately the tape drive as a symbolic link.
  6. The VM can now see the tape device. Notice it is SCSI and not fibre. At this time, VMs only see SCSI devices. Fibre is not virtualized within the VMware virtual machines to the extent that a VM can see virtual fibre or a virtual HBA. The current implementation of NPIV support in VMware is something different and will be explored in an upcoming blog post.

Good news! The fibre attached tape drive works perfectly using Windows ntbackup.exe. Effective throughput rate of many smaller files to tape is 389MB/minute or 6.5MB/second. As expected, running a second backup job with fewer but larger files, I saw an increased throughput rate of 590MB/minute or nearly 10MB/second. These speeds are not bad.
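
For anyone checking my math, the per-second figures are just the per-minute rates divided by 60. A throwaway sketch, where the only inputs are the two rates I measured above:

```python
def mb_per_second(mb_per_minute):
    """Convert a throughput figure from MB/minute to MB/second."""
    return mb_per_minute / 60


for label, rate in (("many small files", 389), ("fewer, larger files", 590)):
    print(f"{label}: {rate} MB/min = {mb_per_second(rate):.2f} MB/s")
```

That works out to 6.48 and 9.83 MB/second respectively, which I rounded in the text.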

Now for the bad news. When trying to migrate the VM while it was powered on (VMotion) or powered off (cold migration), I ran into a snag. VMware sees the fibre tape device as a raw disk with an LSI Logic SCSI controller which is not supported for migration (I tried changing the LSI Logic bus to use Physical bus sharing, but that did not work).

The VM migration component of my test was a failure, but the fibre connectivity was a success. Perhaps we’ll have SCSI passthru migration ability in a future version of VMware Virtual Infrastructure. Maybe v-SCSI passthru is the answer (v-* seems to be the next generation answer to many datacenter needs). What this experiment all boils down to is that I can’t do much more with a fibre attached tape device than I can with a SCSI attached tape device. In addition, a VM with an attached SCSI passthru device remains pinned to an ESX host and therefore doesn’t belong on a clustered host.  However, I can think of a few potential advantages of a fibre attached tape device which may still be of interest:

  1. Fibre cabling offers better throughput speed and more bandwidth than SCSI.
  2. Fibre cabling offers much longer cable run distances in the datacenter.
  3. A failed SCSI card on the host often means a motherboard replacement. A failed HBA on the host means replacing an HBA.
  4. Fibre cabling allows multipathing while SCSI cabling along with the required bus termination does not.
  5. Fibre cabling leverages a SAN fabric infrastructure which can be used to gather detailed reports using native and robust SAN fabric tools such as SAN switch performance monitor, SANsurfer, HBAnywhere, etc.
  6. VMs with fibre attached tape can still be migrated to other zoned hosts by simply removing the tape device in virtual hardware, performing the migration, then re-adding the tape device, all without leaving my chair. A SCSI attached tape device would actually need to be re-cabled behind the rack.