Flow Control

November 29th, 2010 by jason

Thanks to the help of blog sponsorship, I’m able to maintain a higher performing lab environment than I ever have before.  One area I hadn’t invested much in, at least from a lab standpoint, is networking.  In the past, I’ve always had some sort of small to mid density unmanaged Ethernet switch.  And this was fine.  Household name brand switches like Netgear and SMC from Best Buy and NewEgg performed well enough and survived for years in the higher temperature lab environment.  Add to that, by virtue of being unmanaged, they were plug and play.  No time wasted fighting a misconfigured network.

I recently picked up a 3Com SuperStack 3 Switch 3870 (48 1GbE ports).  It’s not 10GbE but it does fit my budget along with a few other networking nice-to-haves like VLANs and Layer 3 routing.  Because this switch is managed, I can now apply some best practices from the IP based storage realm.  One of those best practices is configuring Flow Control for VMware vSphere with network storage.  This blog post is mainly to record some pieces of information I’ve picked up along the way and to open a dialog with network minded readers who may have some input.

So what is network Flow Control? 

NetApp defines Flow Control in TR-3749 as “the process of managing the rate of data transmission between two nodes to prevent a fast sender from overrunning a slow receiver.”  NetApp goes on to advise that Flow Control can be set at the two endpoints (the ESX(i) host level and the storage array level) and at the Ethernet switch(es) in between.

Wikipedia is in agreement with the above and adds more meat to the discussion, including the following: “The overwhelmed network element will send a PAUSE frame, which halts the transmission of the sender for a specified period of time. PAUSE is a flow control mechanism on full duplex Ethernet link segments defined by IEEE 802.3x and uses MAC Control frames to carry the PAUSE commands. The MAC Control opcode for PAUSE is 0x0001 (hexadecimal). Only stations configured for full-duplex operation may send PAUSE frames.”
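
Out of curiosity, PAUSE frames can in principle be observed on the wire.  Below is a minimal capture sketch, assuming a Linux host with tcpdump and an interface named eth0 (both assumptions on my part); note that PAUSE frames are link-local and typically consumed by the receiving MAC, so a mirrored/SPAN port on the switch may be needed to actually see any.

    # MAC Control frames use EtherType 0x8808; the PAUSE opcode is 0x0001
    tcpdump -i eth0 -e -XX 'ether proto 0x8808'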

What are network Flow Control best practices as they apply to VMware virtual infrastructure with NFS or iSCSI network storage?

Both NetApp and EMC agree that Flow Control should be enabled in a specific way at the endpoints as well as at the Ethernet switches which support the flow of traffic:

  • Endpoints (that’s the ESX(i) hosts and the storage arrays) should be configured with Flow Control send/tx on, and receive/rx off (a quick console sketch for the ESX(i) side follows this list).
  • Supporting Ethernet switches should be configured with Flow Control “Desired” or send/tx off and receive/rx on.
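
For reference, here is a minimal sketch of what checking and applying the endpoint recommendation looks like from the ESX service console with ethtool (vmnic0 is simply an example uplink; as discussed further down, a setting applied this way does not survive a reboot).

    # display the current Flow Control (pause) settings for an uplink
    ethtool -a vmnic0

    # apply the storage vendors' endpoint recommendation: transmit PAUSE on, receive off
    ethtool -A vmnic0 tx on rx off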

One item to point out here is that although both mainstream storage vendors recommend these settings for VMware infrastructures as a best practice, neither of their multiprotocol arrays ships configured this way.  At least not the units I’ve had my hands on, which include the EMC Celerra NS-120 and the NetApp FAS3050c.  The Celerra is configured out of the box with Flow Control fully disabled, and I found the NetApp configured with Flow Control set to full (duplex?).
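
On the NetApp side, if memory serves, the per-interface Flow Control setting in Data ONTAP 7-mode can be inspected and changed with ifconfig.  The sketch below is an assumption on my part (the interface name e0a and the exact option values included); check TR-3749 or the ONTAP documentation for the authoritative syntax.

    # show current interface settings, including flowcontrol   (assumed ONTAP 7-mode syntax)
    ifconfig e0a

    # transmit PAUSE only, per the endpoint best practice above   (assumed ONTAP 7-mode syntax)
    ifconfig e0a flowcontrol send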

Here’s another item of interest.  VMware vSphere hosts are configured out of the box to auto-negotiate Flow Control settings.  What does this mean?  Network interfaces are able to advertise certain features and protocols which they were purpose built to understand (following the OSI model and RFCs, of course).  One of these features is Flow Control.  VMware ESX ships with a Flow Control setting which adapts to its environment.  If you plug an ESX host into an unmanaged switch which doesn’t advertise Flow Control capabilities, ESX sets its tx and rx flags to off.  These flags tie specifically to the PAUSE frames mentioned above.  When I plugged my ESX host into the new 3Com managed switch and configured the ports for Flow Control to be enabled, I subsequently found out using the ethtool -a vmnic0 command that both tx and rx were enabled on the host (the 3Com switch has just one Flow Control toggle: enabled or disabled).  NetApp provides a hint to this behavior in their best practice statement which says “Once these [Flow Control] settings have been configured on the storage controller and network switch ports, it will result in the desired configuration without modifying the flow control settings in ESX/ESXi.”

Jase McCarty pointed out back in January a “feature” of ethtool in ESX.  Basically, ethtool can be used to display current Ethernet adapter settings (including Flow Control as mentioned above) and it can also be used to configure them.  Unfortunately, when ethtool is used to hard-code a vmnic for a specific Flow Control configuration, that config lasts only until the next time ESX is rebooted.  After a reboot, the modified configuration does not persist and it reverts to auto/auto/auto.  I tested with ESX 4.1 and the latest patches and the same holds true.  Jase offers a workaround in his blog post which allows the change to persist by embedding it in /etc/rc.local.
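
A minimal sketch of that workaround, assuming an ESX 4.x Service Console, vmnic0 as the storage uplink, and the endpoint flags recommended above (see Jase’s post for the original):

    # appended to /etc/rc.local so the Flow Control setting is re-applied at every boot
    ethtool -A vmnic0 tx on rx off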

Third item of interest.  VMware KB 1013413 talks about disabling Flow Control using esxcfg-module for Intel NICs and ethtool for Broadcom NICs.  This article specifically talks about disabling Flow Control when PAUSE frames are identified on the network.  If PAUSE frames are indicative of a large amount of traffic which a receiver isn’t able to handle, it would seem to me we’d want to leave Flow Control enabled (by design, to mediate the congestion) and perform root cause analysis on exactly why we’ve hit a sustained scaling limit (and what to do about it long term).

Fourth.  Flow Control seems to be a simple mechanism which hinges on PAUSE frames to work properly.  If the Wikipedia article is correct in that only stations configured for full-duplex operation may send PAUSE frames, then it would seem to me that both network endpoints (in this case ESX(i) and the IP based storage array) should be configured with Flow Control set to full duplex, meaning both tx and rx ON.  This conflicts with the best practice messages from EMC and NetApp although it does align with the FAS3050 out of box configuration.  The only reasonable explanation is that I’m misinterpreting the meaning of full-duplex here.

Lastly, I’ve got myself all worked up into a frenzy over the proper configuration of Flow Control because I want to be sure I’m doing the right thing from both a lab and infrastructure design standpoint, but in the end Flow Control is like the Shares mechanism in VMware ESX(i):  The values or configurations invoked apply only during periods of contention.  In the case of Flow Control, this means that although it may be enabled, it serves no useful purpose until a receiver on the network says “I can’t take it any more” and sends the PAUSE frames to temporarily suspend traffic.  I may never reach this tipping point in the lab but I know I’ll sleep better at night knowing the lab is configured according to VMware storage vendor best practices.

Comments

  1. Wade Holmes says:

    Hi Jason,

    Flow Control requires 1Gbps endpoints to be configured for auto-negotiate per VMware recommendations. This will allow ESX to negotiate the proper pause frame sequence. The statements from the storage vendors are specifically about the sending and receiving of pause frames, and not tx/rx (full-duplex) of all network traffic. The storage ports should still be set to auto-negotiate, as flow control requires auto-negotiation. See the articles below:

    http://kb.vmware.com/kb/1004089

    http://www.cisco.com/en/US/tech/tk389/tk214/technologies_tech_note09186a0080094781.shtml

  2. David Davis says:

    This is a fascinating post, Jason!

    I must admit, when I think about “flow control” I immediately think “RTS/CTS or XON/XOFF” (maybe I shouldn’t say that as it shows my age).

    I am always impressed when I learn about some vSphere config that can only be done at the CLI as it shows that there is so much more to vSphere than just what we see in the GUI.

    Personally, I think it would be cool to use a tool to monitor a 1Gb NIC at the packet layer and see one of these PAUSE frames actually go across, see the timeout indicated, and see the drop in traffic during that time.

    Likely, I don’t have time to do that today 🙂 but I did learn a lot from this post and am interested to see what others think.

    How many out there are using Gig-E Flow Control in their infrastructure?

    Comments, anyone else??

  3. Jase McCarty says:

    Jason,

    Specifically to NetApp, I blogged about some of this back in January 2010.

    Here’s the link to this post: http://www.jasemccarty.com/blog/?p=515

    Good work going in depth on this.

    Jase

  4. jason says:

    Thanks Jase. I had linked to your blog post and credited you with the solution for persistently configuring Flow Control on ESX hosts using ethtool.

  5. Manish Patel says:

    What exactly is the point here with using FC on VS4? Maybe I am missing the point, so I want to make myself clear. Is it a combination of using FC with other technologies from different vendors, or is performance the concern?

  6. Kyle McMaster says:

    Jason, do you have any information as to why EMC and NetApp specify Flow Control as a best practice? I’d be interested to see their reasoning.

  7. Great post and question.

    I remember my first run in with Flow Control when I worked at a broadcast media company. I found out there (and still believe) that flow control strongly depends on the environment.

    I don’t claim to be an expert at anything. But my belief is that in a primarily TCP protocol shop, Ethernet Flow Control is not only not useful but may be harmful.

    TCP has its own ability to manage flow and is completely unaware of any layer 2 mechanisms. I have been told that TCP is basically designed to assume that there is no other flow control in place other than itself. This means that PAUSE frames at layer 2 could cause problems with TCP. TCP is designed to only slow down when *it* detects an overloaded situation. Since the layer 2 PAUSE frames are invisible to TCP, it will continue to send more and more data assuming that the path is clear.

    Also, Flow Control at Layer 2 doesn’t have built-in priority control either; you are allowing it to pause ALL traffic instead of just the specific traffic that is flooding an overloaded endpoint (the next paragraph explains why).

    Another bad/good thing (depending on your environment) is that Ethernet Flow Control is switch to switch only. Even though the mechanism uses multicast, the frame is not forwarded on from the receiving switch. This means that logically TCP is an end-to-end (server to client) flow control rather than switch to switch (virtual or not). TCP is more granular and scalable, which is a prereq for cloud environments.

    Of course in an environment with a ton of UDP traffic Flow Control at Layer 2 may be useful. Usually UDP-based (VoIP for one) applications rely on QoS and application layer session management and flow control.

    So if you asked me, I would say don’t enable flow control on any ports used by VMs. If using NFS for vSphere I wouldn’t enable it there either, since NFS is using TCP anyway.

    In a lab environment I would definitely never turn it on.

    One other note: The new DCB stuff is going to change the game a bit. Still see people up in the air on how DCB Flow Control/Queuing will affect TCP. Good question for Brad Hedlund.

    .nick

  8. jason says:

    Thanks a lot for your feedback Nick. I wasn’t expecting something that long & formal but it’s certainly appreciated!!

  9. Jason Nash says:

    Great post Jason and great comment Nick. I’m with Nick. I’d really like to see some tests showing the effect that flow control has on higher level protocols such as NFS. Seems to me that TCP would think everything is great even when it isn’t and keep increasing the sliding window up to max which may in fact cause further underlying problems. Sort of a cascading effect.

  10. Jason Nash says:

    Here is another post that follows with what Nick and I think can happen:

    http://virtualthreads.blogspot.com/2006/02/beware-ethernet-flow-control.html

    Might be worth doing some lab testing with Flow Control on and off against something like an NFS datastore. Is it better to use FC or let TCP handle it, should you start hammering the fabric so much that frames get dropped or the host has to discard them?

  11. David Davis says:

    Am I right in thinking that, rather than using Ethernet flow control, you should just monitor your uplink bandwidth utilization and add more uplinks when needed, AS WELL AS use something more intelligent about your traffic and more tweakable, like vSphere Network I/O Control?

  12. Omar Sultan says:

    Jason:

    You bring up an interesting topic. In theory, some sort of flow control is a good idea to handle traffic issues and to generally deliver more reliable transport. The practicality of how and where to implement flow control is what makes it interesting and usually leads to different responses based on different circumstances. You can certainly implement flow control at layer 2, but there is also a valid design principle to run wide open at layer 2 and let the upper layer protocols figure things out. The one thing you want to avoid is stacking flow control schemes (for example Ethernet PAUSE and TCP flow control) because they will start stepping on each other.

    The other question is where in the stack you insert flow control. You can certainly let Ethernet handle the task, let TCP do it, or let the apps handle it. Again, you need to look at the specific circumstances. Since you mention the PAUSE frame, the one problem it has is that it’s a bit ham-fisted. In a typical network environment where one node is chatting with multiple other nodes, the PAUSE frame halts all traffic on the link, not just the conversation with the receiver that is getting overwhelmed. This is where DCB comes in handy. 802.1Qbb will allow pausing of individual flows, making it a much more calibrated and useful tool.

    Hope that helps,

    Omar

  13. Wade Holmes says:

    I agree with the perils of enabling flow control, and have never seen any solid data around performance increases due to its use. I have, however, seen issues with flow control and NFS that caused extremely poor write performance for vmdk’s in NFS datastores. Will be interesting to see how extensively priority flow control (PFC) gets adopted. Currently it is used to provide lossless FCoE by Cisco.

    The following article is interesting, as it discusses some of the limitations of flow control and how PFC addresses the limitation of using flow control for specific traffic through class of service (CoS).

    http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/white_paper_c11-542809.html

  14. Andy Daniel says:

    Jason,

    Just wanted to note that in my testing the flow control mechanism is not controlled by ESX itself but by the module/driver for the particular card. I believe the KBs you reference in regards to disabling for specific NICs are the result of very specific hardware issues with specific firmware and driver releases. There are known issues in the 10Gb realm with NetXen cards for example too.

    Andy

  15. Hajo Ehlers says:

    I have a 10GbE client node using multiple 1GbE I/O nodes. In the beginning we had terrible performance issues. The network data rate was sometimes as low as 5 MB/s, and the speed between 10GbE nodes was also affected.

    Disabling Flow control on the nodes solved the performance issue.

    Just think about connecting a 100Mb node in full duplex mode with flow control enabled, talking to a 10GbE node with flow control enabled too…..
    You will stop the network on the 10GbE side right away.

    All from memory

  16. JasonG says:

    Nice to read about this here. I remember hearing for ages from people that “flow control is evil” etc… and that you’d better disable it. So here’s my take on it.

    By full-duplex, I believe the articles are talking about the physical Ethernet link. I.e., for a shared-bus Ethernet attached host, sending a pause frame would just cause collisions if the link was saturated with frames it was trying to receive. I find the manufacturer’s choice of describing flow-control policy with duplex-related words to be misleading at best and disingenuous at worst. Clearly, the settings have to do with send (or not) and receive (or not); it has nothing much to do with those events occurring concurrently.

    TCP has its own flow control mechanism (and it is good in general), but in this context of isolated storage traffic, TCP re-xmits have much more latency than Ethernet flow control, and even a small amount of latency is a killer when it comes to storage.

    If pause frames are causing all flows to be stopped on a link it is an indication of head-of-line blocking, therefore the switch(es) don’t support virtual output queues (and it’s time to upgrade to a more expensive switch if you care).

    I think the bottom line is there are no hard and fast rules for these kind of things. It pays to understand exactly what is happening in each state as it relates to all the parts and configure optimally via that understanding.

  17. Steve Fuller says:

    I know this post is a little old now, but as it came fairly high in my search results when I was researching flow control I thought it still worth adding a comment.

    I believe the statements “Endpoints… should be configured with Flow Control send/tx on, and receive/rx off.” and “Supporting Ethernet switches should be configured with Flow Control Desired or send/tx off and receive/rx on.” are made based on the recommendation in the NetApp TR-3749 v2.0 document from January 2010.

    The v3.0 version of TR-3749, which the link in the post points to, states the following:

    For modern network equipment, especially 10GbE equipment, NetApp recommends turning off flow control and allowing congestion management to be performed higher in the network stack.

    Regards

  18. Alvian says:

    Here is a November 2007 article on Ethernet flow control and its adverse effect on a single-switch network with mixed-speed devices. Benchmarks with and without FC are provided in the article.

    http://www.smallnetbuilder.com/content/view/30212/54/

    In summary, Ethernet FC and TCP FC interfere with each other, causing the switch to PAUSE unnecessarily and so making all attached devices work at the speed of the slowest device.

    The solution is to either turn off Ethernet FC (if your switch permits), or segregate high- and low-speed network devices to separate high-speed switches.

  19. Dustin says:

    I realize this is an old post, but for any new readers: I just configured 2x DL380p Gen8 servers as SQL 2012 servers (in a cluster) using iSCSI to an EqualLogic PS6100XV SAN with HP V1910-24G switches.

    I was having the strangest behavior when simulating a fail over where larger queries would take much longer than necessary, and complex views would never finish until after leaving the box alone for a while and then trying again.

    Upon testing further, I noticed that my ping times to the iSCSI ports from the SQL box were extremely high but would never time out completely. That led me to research flow control, which of course led me here and to a few other posts. I promptly disabled flow control on the switch for my iSCSI VLAN, and everything is working great.

  20. jason says:

    It’s an older post Dustin but nonetheless still relevant. Thank you for sharing your experience Dustin!

  21. RDS says:

    It is entertaining to read these posts 🙂
    1) Was Ethernet ever designed from the ground up to be lossless? (Rhetorical Question) Of course anyone selling product in the “Ethernet Storage Space” needs to belong to this faith.

    2) Protocols that run above Ethernet are completely blind to any Ethernet Flow Control mechanisms, e.g. TCP’s flow control mechanisms.
    The bottom line is network devices need to be able to transmit at line rate and have sufficient buffers to mitigate momentary outages.
    Please read the Cisco Section of the following url:
    http://www.networkworld.com/netresources/0913flow2.html
    It is one of the more sensible responses to the topic because it does not mention or introduce things from areas outside the scope of the subject, “Ethernet” flow control.

    If one’s budget forces one to consider only Ethernet based storage, then use separate Ethernet infrastructure to host it. This will work very reliably.
    CoS and QoS are just about what goes into the shitter first; they do NOT prevent packet loss.
    You must monitor your network for packet loss % on links and add capacity when you need it.

  22. SysAdmin says:

    I’d like to add some experimentally obtained data in support of the “TCP protocol shop Ethernet Flow Control is not only not useful but it may be harmful” statement from the comments earlier:

    iperf3 on a 10G link between servers on the same switch shows 4.26 Gbits/sec with pause frames on.

    And after ethtool -A p4p2 rx off tx off on the receiver, with all other conditions the same, I got 6.23 Gbits/sec.

    Rather impressive difference, I think.
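
    For anyone who wants to repeat a comparison like this, a rough sketch (assuming two Linux hosts with iperf3 installed; p4p2 is the interface name from the comment above, so substitute your own):

        # on the receiving server: start an iperf3 server
        iperf3 -s

        # on the sending server: run the throughput test against the receiver and note the result
        iperf3 -c <receiver-ip>

        # on the receiver: turn PAUSE handling off, then re-run the test and compare
        ethtool -A p4p2 rx off tx off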