No Failback Bug in ESX(i)

April 7th, 2010 by jason

A few weeks ago, I came across a networking issue with VMware ESX(i) 4.0 Update 1.  The issue is that configuring a vSwitch or Portgroup for Failback: No doesn't work as expected in conjunction with a Network Failover Detection type of Link Status Only.

For the simplest of examples:

  1. Configure a vSwitch with 2 VMNIC uplinks with both NICs in an Active configuration.  I’ll refer to the uplinks as VMNIC0 and VMNIC1.
  2. Configure the vSwitch and/or a Portgroup on the vSwitch for Failback: No.
  3. Create a few test VMs with outbound TCP/IP through the vSwitch. 
  4. Power on the VMs and begin a constant ping to each of the VMs from the network on the far side of the vSwitch.
  5. Pull the network cable from VMNIC0.  You should see little to no network connectivity loss on the constant pings.
  6. With VMNIC0 failed, all network traffic is now riding over VMNIC1.  When VMNIC0 is recovered, the expected behavior with No Failback is that all traffic will continue to traverse VMNIC1.
  7. Now provide VMNIC0 with a link signal by connecting it to a switch port which has no route to the physical network.  For example, simply connect VMNIC0 to a portable Netgear or Linksys switch.
  8. What you should see now is that at least one of the VMs is unpingable.  It has lost network connectivity because ESX has actually failed its network path back to VMNIC0.  In the failback mechanism, VMware appears to balance the traffic evenly.  In a 2 VM test, 1 VM will fail back to the recovered VMNIC.  In a 4 VM test, 2 VMs will fail back to the recovered VMNIC.  Etc.
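The steps above can be sketched as a toy model. This is purely illustrative, not VMware code: it models "Route based on originating virtual port ID" placement across active uplinks, with the documented Failback semantics, so you can see what Failback: No is supposed to do.

```python
# Toy model of vSwitch NIC teaming with virtual port ID based placement.
# Illustrative sketch of the documented policy, not VMware's implementation.

class ToyVSwitch:
    def __init__(self, uplinks, failback=False):
        self.uplinks = list(uplinks)              # active uplinks, e.g. ["vmnic0", "vmnic1"]
        self.link_up = {u: True for u in self.uplinks}
        self.failback = failback                  # Failback: No -> False
        self.pinned = {}                          # virtual port id -> uplink

    def healthy(self):
        return [u for u in self.uplinks if self.link_up[u]]

    def attach(self, port_id):
        # place a new virtual port on a healthy uplink by port id
        self.pinned[port_id] = self.healthy()[port_id % len(self.healthy())]

    def link_down(self, uplink):
        self.link_up[uplink] = False
        for p, u in self.pinned.items():          # fail affected ports over
            if u == uplink:
                self.pinned[p] = self.healthy()[p % len(self.healthy())]

    def link_restored(self, uplink):
        self.link_up[uplink] = True
        if self.failback:
            for p in self.pinned:                 # Failback: Yes moves ports back
                self.attach(p)
        # Failback: No -> existing ports stay put.  The bug described above is
        # that ESX(i) behaves here as if failback were enabled.
```

With `failback=False`, after `link_down("vmnic0")` and `link_restored("vmnic0")` every port should still be pinned to vmnic1; the observed ESX(i) behavior instead moved roughly half of the ports back to vmnic0, exactly as if failback were enabled.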

The impact spreads to any traffic carried by the vSwitch, not just VM networking.  Thus, the impact includes Service Console, Management Network, IP-based storage, VMotion, and FT.  The scope of the bug includes both the standard vSwitch and the vSphere Distributed Switch (vDS).

Based on the following VMTN forum thread, it would appear this bug has existed since August of 2008.  Unfortunately, documentation of the bug never made it to VMware support:

You may be asking yourself at this point: who really cares?  At the very least, we have on our hands a feature which does not work as documented on page 41 of the following manual:  Organizations which have made a design decision for no failback have done so for a reason and rely on the feature to work as it is documented.

Why would my VMNIC ever come up with no routing capabilities to the physical network?  Granted, my test was simplistic and doesn't likely represent an actual datacenter design, but the purpose was merely to point out that the problem does exist.  The issue actually does present a real-world problem for at least one environment I've seen.  Consider an HP C-Class blade chassis fitted with redundant 10GbE Flex10 Ethernet modules.  Upstream of the Flex10 modules are Cisco Nexus 5k switches and then Nexus 7k switches.

When a Flex10 module fails and is recovered (say, for instance, it was rebooted, which you can test yourself if you have one), it has an unfortunate habit of bringing up the blade-facing network ports (in this case, VMNIC0, labeled 1 in the diagram) up to 20 seconds before a link is established with the upstream Cisco Nexus 5k (labeled 2 in the diagram) which grants network routing to other infrastructure components.  So what happens here?  VMNIC0 shows a link and ESX fails traffic back to it up to 20 seconds before the link to the Nexus 5k is established.  There is a network outage for Service Console, Management Network, IP-based storage, VMotion, and FT.

Perhaps some may say they can tolerate this much of an outage for their VM traffic, but most people I have talked to say even an outage of 2 or more seconds is unacceptable.  And what about IP-based storage?  Can you afford the 20-second latency?  What about Management Network and Service Console?  Think HA and isolation response impact: VMs shutting down as a result.  It's a nasty chain of events.  In such a case, a decision can be made to enable no failback as a policy on the vSwitch and Portgroups.  However, due to the VMware bug, this doesn't work, and some day you may experience an outage which you did not expect.
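The race described above can be reduced to a one-line calculation. The 20-second figure comes from the Flex10 observation in the text; the point is that Link Status Only detection cannot see whether the upstream path actually forwards traffic yet.

```python
# Toy timeline of the Flex10 recovery race (numbers are illustrative,
# taken from the observation above; not a measurement tool).
LINK_UP_AT = 0          # blade-facing port shows link status "up"
UPSTREAM_READY_AT = 20  # Nexus 5k uplink actually forwards traffic (seconds later)

def blackhole_window(failback):
    """Seconds of blackholed traffic if ports move back at link-up time."""
    if failback:
        # With Link Status Only detection, ESX fails back the moment the
        # local link comes up, before the upstream path can carry traffic.
        return UPSTREAM_READY_AT - LINK_UP_AT
    return 0  # a working Failback: No keeps traffic on the good uplink

print(blackhole_window(failback=True))   # the outage the post describes
print(blackhole_window(failback=False))  # the outage Failback: No should give
```

This is why the bug matters: the environment was configured for the `failback=False` row, but the observed behavior matched the `failback=True` row.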

As pointed out in the VMTN forum thread above, there is a workaround which I have tested and which does work: force at least one VMNIC to act as Standby.  This is not by VMware design; it just happens to make the no failback behavior work correctly.  The impact of this design decision is of course that one VMNIC now stands idle and there are no load balancing opportunities over it.  In addition, with no failback enabled, network traffic will tend to become polarized to one side, again impacting network load balancing.
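The workaround amounts to an explicit active/standby failover order, where Failback: No is honored. A minimal sketch of that decision logic, assuming a simple "actives first, then standbys" ordering (again illustrative, not VMware code):

```python
# Sketch of the workaround: one active NIC, one standby NIC, Failback: No.
# Illustrative decision logic only; not VMware's implementation.

def active_uplink(order, link_up, current, failback=False):
    """Pick the uplink that should carry traffic.
    order:    NICs in explicit failover order (actives first, then standbys)
    link_up:  dict of NIC -> bool link status
    current:  uplink currently carrying traffic (None at startup)
    """
    if current and link_up[current] and not failback:
        return current                 # Failback: No -> stay on the standby
    for nic in order:                  # otherwise take the first healthy NIC
        if link_up[nic]:
            return nic
    return None

order = ["vmnic0", "vmnic1"]           # vmnic1 configured as Standby
up = {"vmnic0": True, "vmnic1": True}
cur = active_uplink(order, up, None)   # traffic starts on vmnic0
up["vmnic0"] = False
cur = active_uplink(order, up, cur)    # failover: traffic moves to vmnic1
up["vmnic0"] = True
cur = active_uplink(order, up, cur)    # recovery: traffic stays on vmnic1
```

The cost is visible in the model: while both links are healthy, only one NIC ever carries traffic, which is the loss of load balancing described above.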

An SR has been opened with VMware on this issue.  They have confirmed it is a bug and will be working to resolve the issue in a future patch or release.

Update 4/27/10:  The no failback issue has been root-caused by VMware and a fix has been constructed and is being tested. It will be triaged for a future release.


Comments

  1. Duncan says:

    I am a bit surprised this is labelled as a bug, as I always thought it was working as designed because of the following bit in our documentation: "If failback is set to Yes (default), the adapter is returned to active duty immediately upon recovery, displacing the standby adapter that took over its slot, if any."

    The keyword here is the word “standby”. I assumed it only applied in an active/standby scenario.

    Keep us up to date,

  2. NiTRo says:

    Jason, indeed it's a shame, but in your case it is possible to set a standby adapter and no failback on the vSwitch and only override the network policies for the VM port groups. I know it's not enough and I really can't understand why VMware isn't responding to this bug.

  3. jason says:

    Duncan, I worked quite a bit with VMware support on this to be sure the behavior we were seeing was contra what the documentation described and thus the feature is not working as designed. Having a clear understanding of the feature was the reason I delayed this post for a few weeks. I wanted to be absolutely sure I was not falsely reporting a VMware bug.

  4. jason says:

    Nitro, overriding the vswitch policy at the portgroup level does not work in this case. If you see differently, please let me know.

  5. NiTRo says:

    Jason, can you explain "does not work in this case"? I usually do this on my enclosures to be sure my VMotion port group will always be on the same NIC, so the VMotion traffic won't go out of the enclosure switch.

  6. jason says:

    This blog post is about the failback: no option not working correctly. Whether set at the vSwitch or Portgroup level, configuring failback to no still allows a recovered VMNIC to take on traffic. The only way to make this work correctly is to configure at least 1 VMNIC as a Standby adapter in conjunction with configuring failback to no.

  7. Packet Racer says:

    I agree with Duncan. I don't consider it a bug because, in my opinion, the "No failback" option only counts when your load balancing is set to "Explicit failover order". Unless you specify "Explicit failover order", the vSwitch will do load balancing. The behavior you observed, with half of the VMs becoming unpingable, is consistent with port ID based load balancing. You should retry your experiment with load balancing turned off, if you haven't done so already.

  8. jason says:

    We definitely examined the terminology. Like I said, I waited several weeks on this before I posted to be sure everyone was on the same page. VMware has confirmed it is a bug in that "no failback" should function whether or not there are standby adapters.

    At the very least, if what you are saying is true, then there would be a UI issue, because without a standby adapter the Failback option should be grayed out. I also submitted this piece of information to VMware. They acknowledged it but said that the Failback mechanism is supposed to function as selected, with or without standby adapters.

  9. jason says:

    Update 4/27/10: The no failback issue has been root-caused by VMware and a fix has been constructed and is being tested. It will be triaged for a future release.

  10. michael nguyen says:

    We are still seeing this issue in 4.1 U2. Do you know when the bug was fixed? Do you have the patch number or official VMware release notes stating this particular bug?