Posts Tagged ‘Hardware’

Configure VMware ESX(i) Round Robin on EMC Storage

February 4th, 2010

I recently set out to enable VMware ESX(i) 4 Round Robin load balancing with EMC Celerra (CLARiiON) fibre channel storage.  Before I get to the details of how I did it, let me preface this discussion with a bit about how I interpret Celerra storage architecture. 

The Celerra is built on CLARiiON fibre channel storage and as such, it leverages the benefits and successes CLARiiON has built over the years.  I believe most CLARiiONs are, by default, active/passive arrays from VMware’s perspective.  Maybe more accurately stated, all controllers are active; however, each controller has sole ownership of a LUN or set of LUNs.  If a host wants access to a LUN, it is preferable to go through the owning controller (the preferred path).  Attempts to access a LUN through any controller other than the owning controller will result in a “Trespass” in EMC speak.  A Trespass is a shift in LUN ownership from one controller to another in order to service an I/O request from a fabric host.  When I first saw Trespasses in Navisphere, I was alarmed.  I soon learned that they aren’t all that bad in moderation.  EMC reports that a Trespass occurs EXTREMELY quickly and in almost all cases will not cause problems.  However, as with any array which adopts the LUN ownership model, stacking up enough I/O requests to force a race between controllers for LUN access will cause a condition known as thrashing.  Thrashing causes storage latency and queuing as controllers play tug of war for LUN access.  This is why it is important for ESX hosts which share LUN access to consistently access LUNs via the same controller path.

As I said, the LUN ownership model above is the “out-of-box” configuration for the Celerra, also known as Failover Mode 1 in EMC Navisphere.  The LUN path going through the owning controller will be the Active path from a VMware perspective.  Other paths will be Standby.  This is true for both MRU and Fixed path selection policies.  What I needed to know was how to enable Round Robin path selection in VMware.  Choosing Round Robin in the vSphere Client is easy enough, however, there’s more to it than that because the Celerra is still operating in Failover Mode 1 where I/O can only go through the owning controller. 

So the first step in this process is to read the CLARiiON/VMware Applied Technology Guide which says I need to change the Failover Mode of the Celerra from 1 to 4 using Navisphere (FLARE release 28 version 04.28.000.5.704 or later may be required).  A value of 4 tells the CLARiiON to switch to the ALUA (Asymmetric Logical Unit Access or Active/Active) mode.  In this mode, the controller/LUN ownership model still exists, however, instead of transferring ownership of the LUN to the other controller with a Trespass, LUN access is allowed through the non-owning controller.  The I/O is passed by the non-owning controller to the owning controller via the backplane and then to the LUN.  In this configuration, both controllers are Active and can be used to access a LUN without causing ownership contention or thrashing.  It’s worth mentioning right now that although both controllers are active, the Celerra will report to ESX the owning controller as the optimal path, and the non-owning controller as the non-optimal path.  This information will be key a little later on.  Each ESX host needs to be configured for Failover Mode 4 in Navisphere.  The easiest way to do this is to run the Failover Setup Wizard.  Repeat the process for each ESX host.  One problem I ran into here is that after making the configuration change, each host and HBA still showed a Failover Mode of 1 in the Navisphere GUI.  It was as if the Failover Setup Wizard steps were not persisting.  I failed to accept this so I installed the Navisphere CLI and verified each host with the following command: 

naviseccli -h <SPA_IP_ADDRESS> port -list -all

Output showed that Failover Mode 4 was configured:

Information about each HBA:
HBA UID:                 20:00:00:00:C9:8F:C8:C4:10:00:00:00:C9:8F:C8:C4
Server Name:             lando.boche.mcse
Server IP Address:       192.168.110.5
HBA Model Description:
HBA Vendor Description:  VMware ESX 4.0.0
HBA Device Driver Name:
Information about each port of this HBA:
    SP Name:               SP A
    SP Port ID:            2
    HBA Devicename:        naa.50060160c4602f4a50060160c4602f4a
    Trusted:               NO
    Logged In:             YES
    Source ID:             66560
    Defined:               YES
    Initiator Type:           3
    StorageGroup Name:     DL385_G2
    ArrayCommPath:         1
    Failover mode:         4
    Unit serial number:    Array
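
With more than a couple of hosts, reading that output by eye gets tedious. The following is a minimal sketch, not something from the EMC documentation, that scans saved `port -list -all` output and prints any port whose failover mode is not 4. The function name `check_failover_mode` is hypothetical, and it assumes the field labels look exactly like the sample output above.

```shell
# Hypothetical helper: read naviseccli "port -list -all" output on stdin and
# print one line for every port whose failover mode is not 4.
# Assumes the "Server Name:", "SP Name:", "SP Port ID:" and "Failover mode:"
# labels appear as in the sample output.
check_failover_mode() {
    awk -F': *' '
        { sub(/[ \t\r]+$/, "") }                 # drop trailing whitespace
        /^Server Name:/         { server = $2 }
        /^[ \t]*SP Name:/       { sp = $2 }
        /^[ \t]*SP Port ID:/    { port = $2 }
        /^[ \t]*Failover mode:/ {
            if ($2 != "4")
                printf "%s %s port %s: failover mode %s\n", server, sp, port, $2
        }
    '
}

# Usage (on a machine with the Navisphere CLI installed):
#   naviseccli -h <SPA_IP_ADDRESS> port -list -all | check_failover_mode
```

No output means every listed port reports failover mode 4.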

Unfortunately, the CLARiiON/VMware Applied Technology Guide didn’t give me the remaining information I needed to actually get ALUA and Round Robin working.  So I turned to social networking and my circle of VMware and EMC storage experts on Twitter.  They put me on to the fact that I needed to configure SATP for VMW_SATP_ALUA_CX, something I wasn’t familiar with yet. 

So the next step is a multistep procedure to configure the Pluggable Storage Architecture on the ESX hosts.  More specifically, SATP (Storage Array Type Plugin) and the PSP (Path Selection Plugin), in that order. Duncan Epping provides a good foundation for PSA which can be learned here.

Configuring the SATP tells the PSA what type of array we’re using, and more accurately, what failover mode the array is running.  In this case, I needed to configure the SATP for each LUN to VMW_SATP_ALUA_CX which is the EMC CLARiiON (CX series) running in ALUA mode (active/active failover mode 4).  The command to do this must be issued on each ESX host in the cluster for each active/active LUN and is as follows: 

#set SATP
esxcli nmp satp setconfig --config VMW_SATP_ALUA_CX --device naa.50060160c4602f4a50060160c4602f4a
esxcli nmp satp setconfig --config VMW_SATP_ALUA_CX --device naa.60060160ec242700be1a7ec7a208df11
esxcli nmp satp setconfig --config VMW_SATP_ALUA_CX --device naa.60060160ec242700bf1a7ec7a208df11
esxcli nmp satp setconfig --config VMW_SATP_ALUA_CX --device naa.60060160ec2427001cac9740a308df11
esxcli nmp satp setconfig --config VMW_SATP_ALUA_CX --device naa.60060160ec2427001dac9740a308df11
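
Typing one line per LUN per host invites copy/paste mistakes. A minimal sketch of generating the same commands from a device list follows. The function name `print_satp_cmds` is hypothetical, the device IDs are the ones from my lab, and the script only prints the commands so they can be reviewed before being run on a host. Note the options are written with two plain hyphens.

```shell
# Device IDs copied from the commands above; replace with your own LUNs.
DEVICES="
naa.50060160c4602f4a50060160c4602f4a
naa.60060160ec242700be1a7ec7a208df11
naa.60060160ec242700bf1a7ec7a208df11
naa.60060160ec2427001cac9740a308df11
naa.60060160ec2427001dac9740a308df11
"

# Hypothetical helper: print (do not run) one SATP command per device ID
# passed as an argument.
print_satp_cmds() {
    for dev in "$@"; do
        printf 'esxcli nmp satp setconfig --config VMW_SATP_ALUA_CX --device %s\n' "$dev"
    done
}

# $DEVICES is left unquoted on purpose so the list splits on whitespace.
print_satp_cmds $DEVICES
```

Once the printed commands look right, the output can be piped to `sh` on the Service Console.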

The devices you see above can be found in the vSphere Client when looking at the HBA devices discovered.  You can also find devices with the following command on the ESX Service Console: 

esxcli nmp device list 

I found that changing the SATP requires a host reboot for the change to take effect (thank you Scott Lowe).  After the host is rebooted, the same command used above should reflect that the SATP has been set correctly: 

esxcli nmp device list 

Results in: 

naa.60060160ec2427001dac9740a308df11
    Device Display Name: DGC Fibre Channel Disk (naa.60060160ec2427001dac9740a308df11)
    Storage Array Type: VMW_SATP_ALUA_CX
    Storage Array Type Device Config: {navireg=on, ipfilter=on}{implicit_support=on;explicit_ow=on;alua_followover=on;{TPG_id=1,TPG_state=ANO}{TPG_id=2,TPG_state=AO}}
    Path Selection Policy: VMW_PSP_FIXED
    Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=0;lastPat=0,numBytesPending=0}
    Working Paths: vmhba1:C0:T0:L61 

Once the SATP is set, it is time to configure the PSP for each LUN to Round Robin.  You can do this via the vSphere Client, or you can issue the commands at the Service Console: 

#set PSP per device
esxcli nmp psp setconfig --config VMW_PSP_RR --device naa.60060160ec242700be1a7ec7a208df11
esxcli nmp psp setconfig --config VMW_PSP_RR --device naa.60060160ec242700bf1a7ec7a208df11
esxcli nmp psp setconfig --config VMW_PSP_RR --device naa.60060160ec2427001cac9740a308df11
esxcli nmp psp setconfig --config VMW_PSP_RR --device naa.60060160ec2427001dac9740a308df11

#set PSP for device
esxcli nmp device setpolicy --psp VMW_PSP_RR --device naa.50060160c4602f4a50060160c4602f4a
esxcli nmp device setpolicy --psp VMW_PSP_RR --device naa.60060160ec242700be1a7ec7a208df11
esxcli nmp device setpolicy --psp VMW_PSP_RR --device naa.60060160ec242700bf1a7ec7a208df11
esxcli nmp device setpolicy --psp VMW_PSP_RR --device naa.60060160ec2427001cac9740a308df11
esxcli nmp device setpolicy --psp VMW_PSP_RR --device naa.60060160ec2427001dac9740a308df11

Once again, running the command: 

esxcli nmp device list 

Now results in: 

naa.60060160ec2427001dac9740a308df11
    Device Display Name: DGC Fibre Channel Disk (naa.60060160ec2427001dac9740a308df11)
    Storage Array Type: VMW_SATP_ALUA_CX
    Storage Array Type Device Config: {navireg=on, ipfilter=on}{implicit_support=on;explicit_ow=on;alua_followover=on;{TPG_id=1,TPG_state=ANO}{TPG_id=2,TPG_state=AO}}
    Path Selection Policy: VMW_PSP_RR
    Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=0;lastPat=0,numBytesPending=0}
    Working Paths: vmhba1:C0:T0:L61 

Notice the Path Selection Policy has now changed to Round Robin. 

I’m good to go, right?  Wrong.  I struggled with this last bit for a while.  Using ESXTOP and IOMETER, I could see that I/O was still only going down one path instead of two.  Then I remembered something Duncan Epping had said to me in an earlier conversation a few days ago.  He mentioned something about the array reporting optimal and non-optimal paths to the PSA.  I printed out a copy of the Storage Path and Storage Plugin Management with esxcli document from VMware and took it to lunch with me.  The answer was buried on page 88.  The nmp roundrobin setting useANO is configured by default to 0 which means unoptimized paths reported by the array will not be included in Round Robin path selection unless optimized paths become unavailable.  Remember I said early on that unoptimized and optimized paths reported by the array would be a key piece of information.  We can see this in action by looking at the device list above.  The very last line shows working paths, and only one path is listed for Round Robin use – the optimized path reported by the array.  The fix here is to issue the following command, again on each host for all LUNs in the configuration: 

#use non-optimal paths for Round Robin
esxcli nmp roundrobin setconfig --useANO 1 --device naa.50060160c4602f4a50060160c4602f4a
esxcli nmp roundrobin setconfig --useANO 1 --device naa.60060160ec242700be1a7ec7a208df11
esxcli nmp roundrobin setconfig --useANO 1 --device naa.60060160ec242700bf1a7ec7a208df11
esxcli nmp roundrobin setconfig --useANO 1 --device naa.60060160ec2427001cac9740a308df11
esxcli nmp roundrobin setconfig --useANO 1 --device naa.60060160ec2427001dac9740a308df11
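
Rather than maintaining the device list by hand, the commands can be derived from the `esxcli nmp device list` output itself. This is a minimal sketch under the assumption that each device block starts with the `naa.` identifier in the first column and contains a `Storage Array Type:` line, as in the listings in this post. The function name `print_useano_cmds` is hypothetical and, again, it only prints commands.

```shell
# Hypothetical helper: read "esxcli nmp device list" output on stdin and print
# a useANO command for every device claimed by VMW_SATP_ALUA_CX.
print_useano_cmds() {
    awk '
        /^naa\./ { dev = $1 }                      # remember the current device
        /Storage Array Type: VMW_SATP_ALUA_CX/ {
            printf "esxcli nmp roundrobin setconfig --useANO 1 --device %s\n", dev
        }
    '
}

# Usage (on the ESX Service Console):
#   esxcli nmp device list | print_useano_cmds
```

Filtering on the SATP keeps local disks and other arrays out of the generated list.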

Once again, running the command: 

esxcli nmp device list 

Now results in: 

naa.60060160ec2427001dac9740a308df11
    Device Display Name: DGC Fibre Channel Disk (naa.60060160ec2427001dac9740a308df11)
    Storage Array Type: VMW_SATP_ALUA_CX
    Storage Array Type Device Config: {navireg=on, ipfilter=on}{implicit_support=on;explicit_support=on;explicit_allow=on;alua_followover=on;{TPG_id=1,TPG_state=ANO}{TPG_id=2,TPG_state=AO}}
    Path Selection Policy: VMW_PSP_RR
    Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=1;lastPathIndex=1: NumIOsPending=0,numBytesPending=0}
    Working Paths: vmhba0:C0:T0:L61, vmhba1:C0:T0:L61 

Notice the change in useANO which now reflects a value of 1.  In addition, I now have two Working Paths – an optimized path and an unoptimized path. 
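
Counting Working Paths per device is the quickest way to confirm the change on every LUN. A minimal sketch follows, assuming the `Working Paths:` line is a comma-separated list as shown in the output above. The function name `count_working_paths` is hypothetical.

```shell
# Hypothetical helper: read "esxcli nmp device list" output on stdin and print
# "<device> <number of working paths>" for each device.
count_working_paths() {
    awk '
        /^naa\./ { dev = $1 }
        /Working Paths:/ {
            line = $0
            sub(/^.*Working Paths:[ \t]*/, "", line)   # keep only the path list
            sub(/[ \t\r]+$/, "", line)                 # drop trailing whitespace
            n = (line == "") ? 0 : split(line, paths, /,[ \t]*/)
            printf "%s %d\n", dev, n
        }
    '
}

# Usage (on the ESX Service Console):
#   esxcli nmp device list | count_working_paths
```

Any device still reporting a single path is worth a second look at its useANO setting.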

I fired up ESXTOP and IOMETER which now showed a flurry of I/O traversing both paths.  I kid you not, it was a Clark Griswold moment when all the Christmas lights on the house finally worked.

So it took a while to figure this out but with some reading and the help of experts, I finally got it, and I was extremely jazzed.  What would have helped is if VMware’s PSA were more plug and play with various array types.  For instance, why can’t PSA recognize ALUA on the CLARiiON and automatically configure SATP for VMW_SATP_ALUA_CX?  Why is a reboot required for an SATP change?  PSA configuration in the vSphere Client might also have been convenient, but I recognize it has diminishing practical use with a large number of hosts and/or LUNs to configure.  Scripting and the CLI are the way to go for consistency and automation.  Or how about PSA configuration via Host Profiles?

I felt a little betrayed and confused by the Navisphere GUI reflecting Failover Mode 1 after several attempts to change it to 4.  I was looking at host connectivity status. Was I looking in the wrong place? 

Lastly, end to end documentation on how to configure Round Robin would have helped a lot.  EMC got me part of the way there with the CLARiiON/VMware Applied Technology Guide document, but left me hanging, making no mention of the PSA configuration needed.  I’m getting that the end game for EMC multipathing today is PowerPath, which is fine – I’ll get to that, but I really wanted to do some testing with native Round Robin first, if for no other reason than to establish a baseline to compare PowerPath to once I get there.

Thanks again to the people I leaned on to help me through this.  It was the usual crew who can always be counted on.

VMTN Storage Performance Thread and the EMC Celerra NS-120

January 23rd, 2010

The VMTN Storage Performance Thread is a collaboration of storage performance results on VMware virtual infrastructure provided by VMTN Community members around the world.  The thread starts here, was locked due to length, and continues on in a new thread here.  There’s even a Google Spreadsheet version; however, activity in that data repository appears to have diminished long ago.  The spirit of the testing is outlined by thread creator and VMTN Virtuoso christianZ:

“My idea is to create an open thread with uniform tests whereby the results will be all inofficial and w/o any warranty. If anybody shouldn’t be agreed with some results then he can make own tests and presents his/her results too. I hope this way to classify the different systems and give a “neutral” performance comparison. Additionally I will mention that the performance [and cost] is one of many aspects to choose the right system.” 

Testing standards are defined by christianZ so that results from each submission are consistent and comparable.  A pre-defined template is used in conjunction with IOMETER to generate the disk I/O and capture the performance metrics.  The test lab environment and the results are then appended to the thread discussion linked above.  The performance metrics measured are:

  1. Average Response Time (in milliseconds, lower is better) – also known as latency, for which VMware cites a potential problem threshold of 50ms in their Scalable Storage Performance whitepaper
  2. Average I/O per Second (number of I/Os, higher is better)
  3. Average MB per Second (in MB, higher is better)

Following are my results with the EMC Celerra NS-120 Unified Storage array

SERVER TYPE: Windows Server 2003 R2 VM ON ESXi 4.0 U1
CPU TYPE / NUMBER: VCPU / 1 / 1GB Ram (thin provisioned)
HOST TYPE: HP DL385 G2, 16GB RAM; 2x QC AMD Opteron 2356 Barcelona
STORAGE TYPE / DISK NUMBER / RAID LEVEL: EMC Celerra NS-120 / 15x 146GB 15K 4Gb FC / RAID 5
SAN TYPE / HBAs: Emulex dual port 4Gb Fiber Channel, HP StorageWorks 2Gb SAN switch
OTHER: Disk.SchedNumReqOutstanding and HBA queue depth set to 64 

Fibre Channel SAN Fabric Test

Test Name                         Avg. Response Time (ms)   Avg. I/O per Second   Avg. MB per Second
Max Throughput – 100% Read        1.62                      35,261.29             1,101.92
Real Life – 60% Rand / 65% Read   16.71                     2,805.43              21.92
Max Throughput – 50% Read         5.93                      10,028.25             313.38
Random 8K – 70% Read              11.08                     3,700.69              28.91
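
The I/O per second and MB per second columns are tied together by the block size of each test. A minimal sketch of that arithmetic follows, assuming 32 KB blocks for the Max Throughput tests and 8 KB blocks for the Real Life and Random 8K tests. The 32 KB figure is my inference from the numbers, not something stated in the template, and the function name `mbps` is hypothetical.

```shell
# mbps IOPS BLOCK_KB -> MB per second, taking 1 MB = 1024 KB.
mbps() {
    awk -v iops="$1" -v kb="$2" 'BEGIN { printf "%.2f\n", iops * kb / 1024 }'
}

mbps 35261.29 32   # Max Throughput, 100% Read: prints 1101.92
mbps 2805.43 8     # Real Life: prints 21.92
mbps 3700.69 8     # Random 8K: prints 28.91
```

The results line up with the fibre channel table, which makes this a handy sanity check on any submitted numbers.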
  
 
SERVER TYPE: Windows Server 2003 R2 VM ON ESXi 4.0 U1
CPU TYPE / NUMBER: VCPU / 1 / 1GB Ram (thin provisioned)
HOST TYPE: HP DL385 G2, 16GB RAM; 2x QC AMD Opteron 2356 Barcelona
STORAGE TYPE / DISK NUMBER / RAID LEVEL: EMC Celerra NS-120 / 15x 146GB 15K 4Gb FC / 3x RAID 5 5×146
SAN TYPE / HBAs: swISCSI
OTHER: Shared NetGear 1Gb SoHo Ethernet switch

swISCSI Test

Test Name                         Avg. Response Time (ms)   Avg. I/O per Second   Avg. MB per Second
Max Throughput – 100% Read        17.52                     3,426.00              107.06
Real Life – 60% Rand / 65% Read   14.33                     3,584.53              28.00
Max Throughput – 50% Read         11.33                     5,236.50              163.64
Random 8K – 70% Read              15.25                     3,335.68              22.06
  
 
SERVER TYPE: Windows Server 2003 R2 VM ON ESXi 4.0 U1
CPU TYPE / NUMBER: VCPU / 1 / 1GB Ram (thin provisioned)
HOST TYPE: HP DL385 G2, 16GB RAM; 2x QC AMD Opteron 2356 Barcelona
STORAGE TYPE / DISK NUMBER / RAID LEVEL: EMC Celerra NS-120 / 15x 146GB 15K 4Gb FC / 3x RAID 5 5×146
SAN TYPE / HBAs: NFS
OTHER: Shared NetGear 1Gb SoHo Ethernet switch

NFS Test

Test Name                         Avg. Response Time (ms)   Avg. I/O per Second   Avg. MB per Second
Max Throughput – 100% Read        17.18                     3,494.48              109.20
Real Life – 60% Rand / 65% Read   121.85                    480.81                3.76
Max Throughput – 50% Read         12.77                     4,718.29              147.45
Random 8K – 70% Read              123.41                    478.17                3.74

Please read further below for additional NFS testing results after applying EMC Celerra best practices.

Fibre Channel Summary

Not surprisingly, Celerra over SAN fabric beats the pants off of the shared storage solutions I’ve had in the lab previously, HP MSA1000 and Openfiler 2.2 swISCSI before that, in all four IOMETER categories.  I was, however, pleasantly surprised to find that Celerra over fibre channel was one of the top performing configurations among a sea of HP EVA, Hitachi, NetApp, and EMC CX series frames.

swISCSI Summary

Celerra over swISCSI was only slightly faster than the Openfiler 2.2 swISCSI on HP Proliant ML570 G2 hardware I had in the past on the Max Throughput-100%Read test. In the other three test categories, however, the Celerra left the Openfiler array in the dust.

NFS Summary

Moving on to Celerra over NFS, performance results were consistent with swISCSI in two test categories (Max Throughput-100%Read and Max Throughput-50%Read), but NFS performance numbers really dropped in the remaining two categories as compared to swISCSI (RealLife-60%Rand-65%Read and Random-8k-70%Read). 

What’s worth noting is that both the iSCSI and NFS datastores are backed by the same logical Disk Group and physical disks on the Celerra.  I did this purposely to compare the iSCSI and NFS protocols, with everything else being equal.  The differences in two out of the four categories are obvious.  The question came to mind:  Does the performance difference come from the Celerra, the VMkernel, or a combination of both?  Both iSCSI and NFS have evolved into viable protocols for production use in enterprise datacenters, therefore, I’m leaning AWAY from the theory that the performance degradation over NFS stems from the VMkernel. My initial conclusion here is that Celerra over NFS doesn’t perform as well with Random Read disk I/O patterns.  I welcome your comments and experience here.

Please read further below for additional NFS testing results after applying EMC Celerra best practices.

CIFS

Although I did not test CIFS, I would like to take a look at its performance.  CIFS isn’t used directly by VMware virtual infrastructure, but it can be a handy protocol to leverage with NFS storage.  File management (i.e. ISOs, templates, etc.) on ESX NFS volumes becomes easier and more mobile, and fewer tools are required, when the NFS volumes are presented as CIFS shares on a predominantly Windows client network.  Providing adequate security through CIFS will be a must to protect the ESX datastore on NFS.

If you’re curious about storage array configuration and its impact on performance, cost, and availability, take a look at this RAID triangle which VMTN Master meistermn posted in one of the performance threads:

The Celerra storage is currently carved out in the following way:

        0    1    2    3    4    5    6    7    8    9    10   11   12   13   14
DAE 2   FC   FC   FC   FC   FC   FC   FC   FC   FC   FC   FC   FC   FC   FC   FC
DAE 1   NAS  NAS  NAS  NAS  NAS  Spr  Spr
DAE 0   Vlt  Vlt  Vlt  Vlt  Vlt  NAS  NAS  NAS  NAS  NAS  NAS  NAS  NAS  NAS  NAS

FC = fibre channel Disk Group

NAS = iSCSI/NFS Disk Groups

Spr = Hot Spare

Vlt = Celerra Vault drives

I’m very pleased with the Celerra NS-120.  With the first batch of tests complete, I’m starting to formulate ideas on when, where, and how to use the various storage protocols with the Celerra.  My goal is not to eliminate use of the slowest performing protocol in the lab.  I want to work with each of them on a continual basis to test future design and integration with VMware virtual infrastructure.

Update 1/30/10: New NFS performance numbers.  I’ve begun working with EMC vSpecialists to troubleshoot the performance discrepancies between swISCSI and NFS protocols.  A few key things have been identified and a new set of performance metrics has been posted below after making some changes:

  1. The first thing that the EMC vSpecialists (and others in the blog post comments) asked about was whether or not the file system uncached write mechanism was enabled. The uncached write mechanism is designed to improve performance for applications with many connections to a large file, such as a virtual disk file of a virtual machine.  This mechanism can enhance access to such large files through the NFS protocol.  Out of the box, the uncached write mechanism is disabled on the Celerra. EMC recommends this feature be enabled with ESX(i).  The beauty here is that the feature can be toggled while the NFS file system is mounted on cluster hosts with VMs running on it.  VMware ESX Using EMC Celerra Storage Systems pages 99-101 outlines this recommendation.
  2. Per VMware ESX Using EMC Celerra Storage Systems pages 73-74, NFS send and receive buffers should be divisible by 32k on the ESX(i) hosts.  Again, these advanced settings can be adjusted on the hosts while VMs are running and the settings do not require a reboot.  EMC recommended a value of 64 (presumably for both).
  3. Use the maximum amount of write cache possible for Storage Processors (SPs). Factory defaults here:  598MB total read cache size, 32MB read cache size, 598MB total write cache size, 566MB write cache size.
  4. Specific to this test – verify that the ramp up time is 120 seconds.  Without the ramp up the results can be skewed. The tests I originally performed were with a 0 second ramp up time.
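
The divisible-by-32 rule in item 2 is easy to check before touching a host. A minimal sketch follows, using the buffer values that appear in the test notes further down (264, 256 and 128). The function name `is_multiple_of_32` is hypothetical, and I am assuming the advanced settings are expressed in the same units as the 32k rule.

```shell
# is_multiple_of_32 VALUE -> exit status 0 if VALUE is a whole multiple of 32.
is_multiple_of_32() {
    [ $(( $1 % 32 )) -eq 0 ]
}

for size in 264 256 128; do
    if is_multiple_of_32 "$size"; then
        echo "$size: ok"
    else
        echo "$size: not a multiple of 32 (remainder $(( size % 32 )))"
    fi
done
```

Only 264 fails the check, leaving a remainder of 8.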

The new NFS performance tests are below, using some of the recommendations above: 

SERVER TYPE: Windows Server 2003 R2 VM ON ESXi 4.0 U1
CPU TYPE / NUMBER: VCPU / 1 / 1GB Ram (thin provisioned)
HOST TYPE: HP DL385 G2, 16GB RAM; 2x QC AMD Opteron 2356 Barcelona
STORAGE TYPE / DISK NUMBER / RAID LEVEL: EMC Celerra NS-120 / 15x 146GB 15K 4Gb FC / 3x RAID 5 5×146
SAN TYPE / HBAs: NFS
OTHER: Shared NetGear 1Gb SoHo Ethernet switch

New NFS Test After Enabling the NFS file system Uncached Write Mechanism

VMware ESX Using EMC Celerra Storage Systems pages 99-101

Test Name                         Avg. Response Time (ms)   Avg. I/O per Second   Avg. MB per Second
Max Throughput – 100% Read        17.39                     3,452.30              107.88
Real Life – 60% Rand / 65% Read   20.28                     2,816.13              22.00
Max Throughput – 50% Read         19.43                     3,051.72              95.37
Random 8K – 70% Read              19.21                     2,878.05              22.48
Significant improvement here!  
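
To put a number on it, here is a minimal sketch that divides the new I/O per second figures by the ones from the original NFS test. The function name `speedup` is hypothetical; the inputs are taken straight from the two NFS tables.

```shell
# speedup BEFORE AFTER -> how many times larger AFTER is than BEFORE.
speedup() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.2f\n", after / before }'
}

speedup 480.81 2816.13   # Real Life: prints 5.86
speedup 478.17 2878.05   # Random 8K: prints 6.02
```

Both random workloads land at roughly six times their original I/O rate.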
 
 
SERVER TYPE: Windows Server 2003 R2 VM ON ESXi 4.0 U1
CPU TYPE / NUMBER: VCPU / 1 / 1GB Ram (thin provisioned)
HOST TYPE: HP DL385 G2, 16GB RAM; 2x QC AMD Opteron 2356 Barcelona
STORAGE TYPE / DISK NUMBER / RAID LEVEL: EMC Celerra NS-120 / 15x 146GB 15K 4Gb FC / 3x RAID 5 5×146
SAN TYPE / HBAs: NFS
OTHER: Shared NetGear 1Gb SoHo Ethernet switch

New NFS Test After Configuring
NFS.SendBufferSize = 256 (this was set at the default of 264 which is not divisible by 32k)
NFS.ReceiveBufferSize = 128 (this was already at the default of 128)

VMware ESX Using EMC Celerra Storage Systems pages 73-74

Test Name                         Avg. Response Time (ms)   Avg. I/O per Second   Avg. MB per Second
Max Throughput – 100% Read        17.41                     3,449.05              107.78
Real Life – 60% Rand / 65% Read   20.41                     2,807.66              21.93
Max Throughput – 50% Read         18.25                     3,247.21              101.48
Random 8K – 70% Read              18.55                     2,996.54              23.41
Slight change  
 
 
SERVER TYPE: Windows Server 2003 R2 VM ON ESXi 4.0 U1
CPU TYPE / NUMBER: VCPU / 1 / 1GB Ram (thin provisioned)
HOST TYPE: HP DL385 G2, 16GB RAM; 2x QC AMD Opteron 2356 Barcelona
STORAGE TYPE / DISK NUMBER / RAID LEVEL: EMC Celerra NS-120 / 15x 146GB 15K 4Gb FC / 3x RAID 5 5×146
SAN TYPE / HBAs: NFS
OTHER: Shared NetGear 1Gb SoHo Ethernet switch

New NFS Test After Configuring IOMETER for 120 second Ramp Up Time

Test Name                         Avg. Response Time (ms)   Avg. I/O per Second   Avg. MB per Second
Max Throughput – 100% Read        17.28                     3,472.43              108.51
Real Life – 60% Rand / 65% Read   21.05                     2,726.38              21.30
Max Throughput – 50% Read         17.73                     3,338.72              104.34
Random 8K – 70% Read              17.70                     3,091.17              24.15

Slight change

Due to the commentary received on the 120 second ramp up, I re-ran the swISCSI test to see if that changed things much.  To fairly compare protocol performance, the same parameters must be used across the board in the tests.

SERVER TYPE: Windows Server 2003 R2 VM ON ESXi 4.0 U1
CPU TYPE / NUMBER: VCPU / 1 / 1GB Ram (thin provisioned)
HOST TYPE: HP DL385 G2, 16GB RAM; 2x QC AMD Opteron 2356 Barcelona
STORAGE TYPE / DISK NUMBER / RAID LEVEL: EMC Celerra NS-120 / 15x 146GB 15K 4Gb FC / 3x RAID 5 5×146
SAN TYPE / HBAs: swISCSI
OTHER: Shared NetGear 1Gb SoHo Ethernet switch

New swISCSI Test After Configuring IOMETER for 120 second Ramp Up Time

Test Name                         Avg. Response Time (ms)   Avg. I/O per Second   Avg. MB per Second
Max Throughput – 100% Read        17.79                     3,351.07              104.72
Real Life – 60% Rand / 65% Read   14.74                     3,481.25              27.20
Max Throughput – 50% Read         12.17                     4,707.39              147.11
Random 8K – 70% Read              15.02                     3,403.39              26.59

swISCSI is still performing slightly better than NFS on the Random Reads; however, the margin is much smaller.

At this point I am content, stroke, happy, (borrowing UK terminology there) with NFS performance.  I am now moving on to ALUA, Round Robin, and PowerPath/VE testing.  I set up NPIV over the weekend with the Celerra as well – look for a blog post coming up on that.

Thank you to EMC and to the folks who replied in the comments below with your help tackling best practices and NFS optimization/tuning!

Lab Update

January 19th, 2010

I thought I’d post a lab update since John Troyer nudged me, letting me know this week’s podcast was focusing on home labs for VCP and VCDX studies.

Read more here.  Scroll down to the Lab Update section.

Unboxing the EMC Celerra NS-120 Unified Storage

January 8th, 2010

Wednesday was a very exciting day! A delivery truck dropped off an EMC Celerra NS-120 SAN. The Celerra NS-120 is a modern entry-level unified storage solution which supports multiple storage protocols such as Fibre Channel (block), iSCSI (block), and NAS (file). This hardware is ideal for the lab, development, QA, small and medium sized businesses, or as a scalable building block for large production datacenters. The NS-120 is a supported storage platform for VMware Virtual Infrastructure which will utilize all three storage protocols mentioned above. It is also supported with other VMware products such as Site Recovery Manager, Lab Manager, and View.

The Celerra arrived loaded in an EMC CX4 40u rack, nicely cabled and ready to go. Storage is comprised of three (3) 4Gb DAE shelves and 45 146GB 15K RPM 4Gb drives for about 6.5TB RAW. The list of bundled software includes:
Navisphere Manager
Navisphere QoS Manager
SnapView
MirrorView/A
PowerPath
Many others

There is so much more I could write about the Celerra, but the truth of the matter is that I don’t know a lot about it yet.  The goal is to get it set up and explore its features.  I’m very interested in comparing FC vs. iSCSI vs. NFS.  DeDupe, Backup, and Replication are also areas I wish to explore.  Getting it online will take some time.  The Electrician is scheduled to complete his work on Saturday; two 220 Volt 30 Amp Single Phase circuits are being installed.  When it will actually be powered on and configured is up in the air right now.  Its final destination is the basement and it will take a few people to help get it down there.  Whether or not it can be done while the equipment is racked or during the winter is another question.  I really don’t want to unrack/uncable it but I may have to just to move it.  Another option would be to hire a moving company.  They have strong people and are very creative; they move big awkward things on a daily basis for a living.

Unloading it from the truck was a bit of a scare but it was successfully lowered to the street.  Weighing in at over 1,000 lbs., it was a challenge getting it up the incline of the driveway with all the snow and ice. We got it placed in its temporary resting place where the unboxing could begin.

The video

I’ve seen a few unboxing videos and although I’ve never created one, I thought this would be a good opportunity as this is one of the larger items I’ve unboxed. It’s not that I wanted to, but I get the sense that unboxing ceremonies are somewhat of a cult fascination in some circles and this video might make a good addition to someone’s collection.  There’s no music, just me. If you get bored or tired of my voice, turn on your MP3 player.  Enjoy!

EMC Celerra NS-120 Unboxing Video pt1 from Jason Boche on Vimeo.

EMC Celerra NS-120 Unboxing Video pt2 from Jason Boche on Vimeo.

EMC Celerra NS-120 Unboxing Video pt3 from Jason Boche on Vimeo.

Tame Electrical and Heating Costs with CPU Power Management

November 11th, 2009

A casual Twitter tweet about my power savings through the use of VMware Distributed Power Management (DPM) found its way to VMware Senior Product Manager for DPM, Ulana Legedza, and Andrei Dorofeev. Ulana was interested in learning more about my situation. I explained how VMware DPM had evaluated workloads between two clustered vSphere hosts in my home lab, and proceeded to shut down one of the hosts for most of the month of October, saving me more than $50 on my energy bill.

Ulana and Andrei took the conversation to the next level and asked me if I was using vSphere’s Advanced CPU Power Management feature (See vSphere Resource Management Guide page 22). I was not; in fact, I was unaware of its existence. Power Management is a new feature in ESX(i) 4 available to processors supporting Enhanced Intel SpeedStep or Enhanced AMD PowerNow! power management technologies. To quote the .PDF article:

“To improve CPU power efficiency, you can configure your ESX/ESXi hosts to dynamically switch CPU frequencies based on workload demands. This type of power management is called Dynamic Voltage and Frequency Scaling (DVFS). It uses processor performance states (P-states) made available to the VMkernel through an ACPI interface.”

A quick look at the Quad Core AMD Opteron 2356 processors in my HP DL385 G2 showed they support Enhanced AMD PowerNow! Power Management Technology:

There are two steps to enabling this power management feature. The first step is to ensure it is enabled in the server BIOS. On an HP DL385 G2, CPU power management is enabled by default. In this particular server model, it is configured via the BIOS by hitting <F9> at the end of the POST (which obviously requires a reboot).

A slightly easier method might be to verify and/or configure the policy through HP’s out of band (OOB) iLO 2, however, a reboot will be requested by the iLO 2 for a policy change to take effect. On an HP server, configure for OS Control mode, but again, this appears to be the default for the HP DL385 G2 so hopefully no reboot is required for you to implement this power saving measure in your environment:

After enabling power management in the BIOS, the second step is to modify the Power Management Policy on each ESX(i) host from the default of static to dynamic. The definitions of these two settings can be found in the .PDF linked above and are as follows:

static – The default. The VMkernel can detect power management features available on the host but does not actively use them unless requested by the BIOS for power capping or thermal events.

dynamic – The VMkernel optimizes each CPU’s frequency to match demand in order to improve power efficiency but not affect performance. When CPU demand increases, this policy setting ensures that CPU frequencies also increase.

You might be asking yourself by this point “Ok, this is nice, but what’s the trade off?” Note the wording in the dynamic definition above: “improve power efficiency but not affect performance”. This is a win/win configuration change!

This step can be performed one of a few ways on each host (again, no reboot required for this change):

  1. Using the vSphere Client, change the Advanced host setting Power.CpuPolicy from static to dynamic
  2. Scriptable: Via the ESX service console, PuTTY, or script, issue the command esxcfg-advcfg -s dynamic /Power/CpuPolicy
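For anyone with more than a couple of hosts, the scriptable option above can be wrapped in a small helper. The following is only a sketch under a few assumptions: SSH access to the service console is enabled, the host names are hypothetical placeholders, and the helper merely builds the command lines (a dry run) rather than executing anything. The only ESX command used is the esxcfg-advcfg invocation quoted above.

```python
import shlex

# The two policy values described in the definitions above.
VALID_POLICIES = {"static", "dynamic"}


def build_policy_commands(hosts, policy="dynamic", user="root"):
    """Return one ssh command line per host that sets /Power/CpuPolicy.

    Nothing is executed here; the caller decides how (or whether) to run
    the returned strings.
    """
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")
    remote = f"esxcfg-advcfg -s {policy} /Power/CpuPolicy"
    return [f"ssh {user}@{host} {shlex.quote(remote)}" for host in hosts]


if __name__ == "__main__":
    # Hypothetical host names; replace with your own.
    for line in build_policy_commands(["esx01.lab.local", "esx02.lab.local"]):
        print(line)
```

Printing the commands first makes it easy to review them before pasting them into a shell or feeding them to a scheduler.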

The impact on my home lab was quite visible. After 12 hours, the blue area in the following 24 hour graph shows that average electrical consumption dropped from 337 Watts to 292 Watts. All things being equal and CPU loads balanced by DRS, that’s a reduction in energy consumption of over 13% per host:

An alternate graph shows Btu output dropped from 1,135 Btu to about 1,000 Btu. All things being equal, a reduction of about 135 Btu per host:
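The arithmetic behind these figures is easy to reproduce. The sketch below uses the two wattage readings quoted above and the standard conversion of roughly 3.412 Btu per hour per Watt; it assumes the graph reports Btu per hour, and the yearly figure assumes the host runs around the clock at the measured averages. The function names are mine, purely for illustration.

```python
# 1 Watt is approximately 3.412 Btu per hour.
WATTS_TO_BTU_PER_HOUR = 3.412142


def percent_reduction(before, after):
    """Percentage drop going from `before` to `after`."""
    return (before - after) / before * 100.0


def watts_to_btu_per_hour(watts):
    """Convert electrical draw in Watts to heat output in Btu/h."""
    return watts * WATTS_TO_BTU_PER_HOUR


def kwh_saved_per_year(before_watts, after_watts, hours=24 * 365):
    """Energy saved over `hours` of continuous operation, in kWh."""
    return (before_watts - after_watts) * hours / 1000.0


if __name__ == "__main__":
    before, after = 337.0, 292.0  # Watts, from the graph described above
    print(f"reduction:    {percent_reduction(before, after):.1f}%")
    print(f"heat before:  {watts_to_btu_per_hour(before):.0f} Btu/h")
    print(f"heat after:   {watts_to_btu_per_hour(after):.0f} Btu/h")
    print(f"saved / year: {kwh_saved_per_year(before, after):.1f} kWh per host")
```

The converted heat values land close to, though not exactly on, the Btu numbers read from the second graph, which is about what I would expect from eyeballing averages off two different charts.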

A Btu is heat – explained more at wiseGEEK’s What is a Btu? Heat is a byproduct of technology in the datacenter and in most cases is viewed as overhead expense because it requires cooling (additional costs) to maintain optimal operating conditions for the equipment running in the environment. If we can eliminate heat, we eliminate the associated cost of removing the heat. This is known as cost avoidance.

Eliminating heat is as much of an interest to me as reducing my energy bill. The excessive heat generated in the basement eventually finds its way upstairs causing the rest of the house to be a little uncomfortable. The air conditioner in my home wasn’t manufactured to handle the excessive heat. Now, I live in the midwest where we have some frigid winters. Heat in the home is welcomed during the winter months. I could turn off CPU Power Management raising the Btu index as well as my energy bill, in favor of reducing my natural gas heating bill. I don’t know which is more expensive. This could be a great experiment for the January/February time frame.

In summary, we can attack operating costs from two sides by using VMware CPU Power Management:

  1. Reduction in excess electricity used by idle CPU cycles
  2. Reduction in cooling costs by reducing Btu output

I’m excited to see what next month’s energy bill looks like.

Update 11-17-09:  I was just made aware that Simon Seagrave wrote an earlier article on CPU power management here.  Sorry Simon, I was unaware of your article and I did not intentionally copy your topic.  Your article covered the topic well.  I hope we’re still friends :)

VMworld 2009 Wall of Datacenter Video

September 4th, 2009

I’m hanging out in my hotel room on Friday night with my family at Fisherman’s Wharf in San Francisco. I’ve fired up my laptop and I’m just starting to sift through the initial pile of great information published thus far on VMworld 2009.

Without a doubt, there will be many pictures, videos, blogs, and tweets to come out of the show. Here’s one of my favorites so far. Richard Garsthagen, Senior Evangelist at VMware EMEA, interviews “Dan”, who I believe is an Architect on the VMware Lab Staff team. Dan talks about the $35 million wall of datacenter on display at the bottom of the Moscone Center escalator. As many of the attendees learned, this datacenter was used to power the nearly 40,000 VMs at VMworld 2009. Great video gentlemen!

Closely related, here’s a time lapse photography video of the datacenter build:

Virtualizing the grid

July 8th, 2009

I picked up this interesting map off Christopher Crowhurst’s blog. It’s a visualization of the United States power grid. The source comes from NPR’s article “Visualizing The Grid”. Follow the link to NPR and click on the various tabs at the top to see power plant, solar power, and wind power sources across the United States.

How much power are you saving due to virtualization?  Don’t forget virtualization cuts power consumption in more ways than just one.  The most obvious would be the reduction in server hardware count in the datacenter.  There are other indirect power savings vectors such as reduction in cooling, reduction in network and SAN switches due to server consolidation, less UPS utilization, and maybe even a reduction in datacenter size which in and of itself presents more indirect savings:  security, plumbing, utility lighting, cleaning, maintenance, real estate, etc.


vSphere Has Arrived

May 21st, 2009

It has been a long wait but last night (and to my surprise) vSphere was finally released and from what I’ve seen so far, it was well worth the wait. Not that VI3 isn’t a great product, but the new features vSphere boasts are absolutely amazing. Whereas with VI3 VMware put any semblance of competition to shame, vSphere totally and completely annihilates it.

With the vSphere NDA embargo lifted a while back for bloggers, there has already been plenty of coverage on most of the new features so I’m not going to go into each of them in great detail here. I’ll just touch on a few things that have caught my attention. There is plenty more to digest on other blogs and of course VMware’s site.

First of all, let me get this out of the way: By far the best and most complete collection of vSphere resources on the internet can be found at Eric Siebert’s vSphere-land site. If you can’t find what you’re looking for there, it doesn’t exist.

Now, a few of my favorite and notable observations thus far:

  • The What’s New in vSphere 4.0 page – This is the list of new major features in vSphere. Note there are approximately 150 new features in vSphere in all; this is the list of the major notable ones worth highlighting:
    • One feature which was news to me and I hadn’t seen during the private beta was Virtual Machine Performance Counters Integration into Perfmon which seems to have replaced the short-lived and ‘never made it out of experimental support’ VMware Tools Descheduler Service. “vSphere 4.0 introduces the integration of virtual machine performance counters such as CPU and memory into Perfmon for Microsoft Windows guest operating systems when VMware Tools is installed. With this feature, virtual machine owners can do accurate performance analysis within the guest operating system. See the vSphere Client Online Help.”
    • New CLI commands: vicfg-dns, vicfg-ntp, vicfg-user, vmware-cmd, and vicfg-iscsi
    • There appears to be no end in sight for product name changes. VIMA has become vMA. It’s still 64-bit only as far as I know.
    • It’s official, and Rick Vanover reported it first in Virtualization Review magazine: Storage VMotion has been renamed to Enhanced Storage VMotion, particularly when changing disk formats hot on the fly (i.e. full to thin provisioned). Not to be confused with Enhanced VMotion Compatibility (EVC) which is a completely different feature – I predict a lot of people will confuse these two technologies, interchanging one for the other.
  • The Upgrade Guide – Easy but critically important reading. A few things that I quickly pulled out of this document that are worth noting:
    • SQL2000 is not a supported database platform for vCenter. SQL2008 is on the supported list. Good job VMware. Some folks may remember it taking an inconveniently long time to get SQL2005 on the supported database list when VI3 was released.
    • Another vCenter database detail I caught: During an upgrade, DBO must be granted to both MSDB and the vCenter database whereas with VI3 DBO was only needed on MSDB and you didn’t dare grant DBO to the vCenter database or you ended up with new database tables and an empty datacenter.
    • Quickly summarized, the VM upgrade path is: VMware Tools, shut down VM, upgrade VM hardware to version 7, power on. No VMFS datastore upgrades to worry about.
    • Both the 2.5 VIC and vSphere client can be installed simultaneously on the same machine, and this is supported. This will be very helpful for customers straddling both VI environments during their transition. I’ve got a blog entry coming up on ThinApp’ing the client soon which will provide yet another client installation option.
  • Configuration Maximums for VMware vSphere 4.0 – Ahh once again my most favorite VMware document of them all. Look at some of these insanely scalable supported configurations:
    • 8 vCPUs in a VM
    • 255GB RAM in a VM
    • IDE drive support in a VM
    • 10 vNICs in a VM
    • 512 vCPUs per host
    • 320 running VMs on a host
    • 64 lCPUs in a host
    • 20 vCPUs per core
    • 1TB RAM in a host
    • 4,096 virtual switch ports in a host
    • These are just a few that I hand picked. We’re looking at serious consolidation ratio possibilities here!
  • Systems Compatibility Guide – This is the offline version of the vSphere HCL. Ok, in case you have been living under a rock, vSphere is 64-bit only. You’ll want to make sure your hardware is compatible with vSphere. I won’t beat around the bush here – A lot of hardware that was supported by VI3 has dropped off the list (even much of the 64-bit hardware). If you don’t have the required hardware now, plan your 2010 budget accordingly. As a point of interest, I found it odd that the HP DL385 G2 and G5 were on the HCL, but the G3 and G4 are missing. Pay close attention, particularly if you plan to utilize FT as that feature carries with it its own set of strict requirements.

There are boatloads of new goodies in vSphere. It’s going to be around for a long time so take your time to learn it. No need to rush or be the first datacenter to run vSphere for bragging rights. Watch the blogs and the bookstores. There will be new vSphere content gushing from all angles for many months and even years to come. Be sure to share your findings with the VMware virtual community. Collaboration and networking make us strong and successful.

vSphere Memory Hot Add/CPU Hot Plug

May 10th, 2009

I’ve been experimenting with vSphere’s memory hot add and CPU hot plug features to determine their usefulness with Windows Server operating systems. I came up with mixed results depending on the version and architecture of the OS.

A few notes about the results:

  1. Memory hot remove is not supported at all by vSphere. It’s not an option no matter what the guest OS.
  2. Although virtual hardware can be hot added depending on the OS, there are caveats in certain cases
    1. A guest reboot may be required (this is outlined in the table below).
    2. Memory that is hot added to guests that support the hot add without a reboot will result in 100% sustained CPU utilization in the guest OS for a variable period of time that is dependent on the amount of memory that is added. In my testing (and keep in mind your mileage may vary on different hardware):
      1. 1GB of RAM hot added resulted in 100% CPU for 1-3 seconds.
      2. 3GB of RAM hot added resulted in 100% CPU for about 10 seconds.
  3. CPU hot unplug is supported by vSphere but was not supported by any of the Windows operating systems that I tested.
  4. Going from 1vCPU to 2vCPUs in Windows 2008 guest operating systems did not result in a HAL change. From what I can tell, Windows 2008 uses the same HAL for uniprocessor and SMP. When a vCPU is hot added, it does show up right away in the Device Manager; however, it’s not seen in Task Manager or Computer Properties. My assumption, therefore, is that processes are not being scheduled on the added vCPU until after the reboot, at which time the additional vCPU shows up in all places that it should (i.e. Task Manager, Computer Properties, etc.).
  5. I certainly like the innovation and flexibility here but I’m not sure hot add technology is going to mesh well with planned change management systems. The most important thing to recognize though is that VMware offers this technology to us as our choice to use or not use. It’s not a feature VMware held back drawing their own conclusion that nobody on the planet could ever use it. Microsoft does this today with Hyper-V memory over commit. Or rather they don’t offer memory over commit in Hyper-V because they made the decision on behalf of all their customers that nobody could or should use memory over commit. Instead you should pad your hosts with more physical memory at additional cost to you.

Here is the table of results I came up with:

Legend: 8-) = supported, :-( = not supported

| Guest OS | Memory hot add | Memory hot remove | CPU hot plug | CPU hot unplug |
|---|---|---|---|---|
| Windows Server 2003 STD x86 | :-( | :-( | :-( | :-( |
| Windows Server 2003 STD x64 | :-( | :-( | :-( | :-( |
| Windows Server 2003 ENT x86 | 8-) | :-( | :-( | :-( |
| Windows Server 2003 ENT x64 | 8-) | :-( | :-( | :-( |
| Windows Server 2008 STD x86 | 8-) * | :-( | :-( | :-( |
| Windows Server 2008 STD x64 | 8-) * | :-( | 8-) * | :-( |
| Windows Server 2008 ENT x86 | 8-) | :-( | :-( | :-( |
| Windows Server 2008 ENT x64 | 8-) | :-( | 8-) * | :-( |
| Windows Server 2008 DC x86 | 8-) | :-( | :-( | :-( |
| Windows Server 2008 DC x64 | 8-) | :-( | 8-) | :-( |
| Windows Server 2008 R2 DC x64 (experimental support only) | 8-) | :-( | 8-) | :-( |

* Reboot of guest OS required to recognize added hardware

DPM best practices. Look before you leap.

March 16th, 2009

It has previously been announced that VMware’s Distributed Power Management (DPM) technology will be fully supported in vSphere. Although today DPM is for experimental purposes only, virtual infrastructure users with VI Enterprise licensing can nonetheless use it to power down ESX infrastructure during non-peak periods where they see fit.

Before enabling DPM, there are a few precautionary steps I would go through first to test each ESX host in the cluster for DPM compatibility which will help mitigate risk and ensure success. Assuming most, if not all, hosts in the cluster will be identical in hardware make and model, you may choose to perform these tests on only one of the hosts in the cluster. More on testing scope a little further down.

This first step is optional but personally I’d go through the motions anyway. Remove the hosts to be tested individually from the cluster. If the hosts have running VMs, place the host in maintenance mode first to displace the running VMs onto other hosts in the cluster:


If the step above was skipped or if the host wasn’t in a cluster to begin with, then the first step is to place the host into maintenance mode. The following step would be to manually place the host in Standby Mode. This is going to validate whether or not vCenter can successfully place a host into Standby Mode automatically when DPM is enabled. One problem I’ve run into is the inability to place a host into Standby Mode because the NIC doesn’t support Wake On LAN (WOL) or WOL isn’t enabled on the NIC:


Assuming the host has successfully been placed into Standby Mode, use the host command menu (similar in look to the menu above) to take the host out of Standby Mode. I don’t have the screen shot for that because the particular hosts I’m working with right now aren’t supporting the WOL type that VMware needs.
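Since WOL is the piece that tends to fail, it can be handy to test a NIC's wake behavior independently of vCenter. A standard WOL "magic packet" is six 0xFF bytes followed by the target MAC address repeated sixteen times, usually sent as a UDP broadcast. The sketch below builds such a packet; the MAC address, broadcast address, and port are placeholder assumptions, and a host waking up from this packet only tells you that the NIC and BIOS honor WOL, not that DPM itself will be happy with it.

```python
import socket


def build_magic_packet(mac):
    """Return the 102-byte Wake-on-LAN magic packet for a MAC address.

    Accepts MACs written with ':' or '-' separators, or with none.
    """
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected a 6-byte MAC address, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the magic packet over UDP (port 9 is a common choice)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))


if __name__ == "__main__":
    # Hypothetical MAC address; this only builds and inspects the packet.
    # Call send_magic_packet(...) yourself from a machine on the same
    # layer 2 segment as the sleeping host.
    packet = build_magic_packet("00:11:22:33:44:55")
    print(len(packet), packet[:12].hex())
```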

Once the host has successfully entered and left Standby Mode, it can be removed from maintenance mode and added back into the cluster. Now would not be a bad time to take a look around some of the key areas such as networking and storage to make sure those subsystems are functioning properly and they are able to “see” their respective switches, VLANs, LUNs, etc. Add some VMs to the host and power them on. Again, perform some cursory validation to ensure the VMs have network connectivity, storage, and the correct consumption of CPU and memory.

My point in all of this is that ESX has been brought back from a deep slumber. A twelve point health inspection is the least amount of effort we can put forth on the front side to assure ourselves that, once automated, DPM will not bite us down the road. The steps I’m recommending have more to do with DPM compatibility with the different types of server and NIC hardware, than they have to do with VMware’s DPM technology in and of itself. That said, at a minimum I’d recommend these preliminary checks on each of the different hardware types in the datacenter. On the other end of the spectrum if you are very cautious, you may choose to run through these steps for each and every host that will participate in a DPM enabled cluster.

After all the ESX hosts have been “Standby Mode verified”, the cluster settings can be configured to enable DPM. Similar to DRS, DPM can be enabled in a manual mode where it will make suggestions but it won’t act on them without your approval, or it can be set for fully automatic, dynamically making and acting on its own decisions:


DPM is an interesting technology but I’ve always felt in the back of my mind it conflicts with capacity planning (including the accounting for N+1 or N+2, etc.) and the ubiquitous virtualization goal of maximizing the use of server infrastructure. In a perfect world, we’ll always be teetering on our own perfect threshold of “just enough infrastructure” and “not too much infrastructure”. Having infrastructure in excess of what would violate availability constraints and admission control is where DPM fits in. That said, if you have a use for DPM, in theory, you have excess infrastructure. Why? I can think of several compelling reasons why this might happen, but again in that perfect world, none could excuse the capital virtualization sin of excess hardware not being utilized to its fullest potential (let alone, powered off and doing nothing). In a perfect world, we always have just enough hardware to meet cyclical workload peaks but not too much during the valleys. In a perfect world, virtual server requests come planned so well in advance that any new infrastructure needed is added the day the VM is spun up to maintain that perfect balance. In a perfect world, we don’t purchase larger blocks or cells of infrastructure than what we actually need because there are no such things as lead times for channel delivery, change management, and installation that we need to account for.

If you don’t live in a perfect world (like me), DPM offers those of us with an excess of infrastructure and excuses an environmentally friendly and responsible alternative to at least cut the consumption of electricity and cooling while maintaining capacity on demand if and when needed. Options and flexibility through innovation are good. That is why I choose VMware.

Rapid Virtualization Indexing (RVI)

March 8th, 2009

I’m mildly excited for the upcoming week. If all goes well, I’ll be upgrading to AMD Opteron processors which support a virtualization assist technology called Rapid Virtualization Indexing (or RVI for short).

There is overhead introduced in VMware virtualization via the virtual machine monitor (VMM) and it comes in three forms:

  1. Virtualization of the CPU (using software based binary translation or BT for short)
  2. Virtualization of the MMU (using software based shadow paging)
  3. Virtualization of the I/O devices (using software based device emulation)

RVI is found in AMD’s second generation of virtualization hardware support and it incorporates MMU (Memory Management Unit) virtualization. This new technology is designed to eliminate traditional software based shadow paging methods for MMU virtualization thereby reducing the overhead in bullet #2 above. VMware lab tests show that RVI provides performance gains of up to 42% for MMU-intensive benchmarks and up to 500% for MMU-intensive microbenchmarks.

How it works:

Software based shadow page tables store information about the guest VM’s physical memory location on the host. The VMM had to intercept guest VM page table updates to keep guest page tables and shadow page tables in sync. By now you can probably see where this is going: applications and VMs which had frequent guest page table updates were not as efficient as those with less frequent guest page table updates.

The above is similar to guest VM kernel mode calls/context switching to access CPU ring 0. Previously, the architecture wouldn’t allow it directly via the hardware so the VMKernel had to intercept these calls and hand-hold each and every ring 0 transaction. Throw 10,000+ ring 0 system calls at the VMKernel per second and the experience starts to become noticeably slower. Both Intel and AMD resolved this issue specifically for virtualized platforms by introducing a new, more privileged mode for the hypervisor (often called ring -1), which lets guest VMs run in ring 0 directly.

VMware introduced support for RVI in ESX 3.5.0. RVI eliminates MMU related overhead in the VMM by relying on the technology built into the newer RVI capable processors to determine the physical location of guest memory by walking an extra level of page tables maintained by the VMM. RVI is AMD’s nested page table technology. The Intel version of the technology is called Extended Page Tables (EPT) and is expected sometime this year.
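To make the difference between the two approaches concrete, here is a deliberately tiny model. It is a sketch under heavy assumptions: page tables are flat dictionaries of page numbers rather than multi-level structures, there is no TLB, and the class and method names are hypothetical and not anything from VMware or AMD. It only illustrates the bookkeeping: with shadow paging the VMM must intercept every guest page table update to keep a merged guest-virtual to host-physical table in sync, while with nested paging the guest updates its own table freely and the two tables are walked at translation time.

```python
class ShadowPagingVMM:
    """Software MMU virtualization: the VMM maintains a shadow table."""

    def __init__(self, gpa_to_hpa):
        self.gpa_to_hpa = dict(gpa_to_hpa)  # guest-physical -> host-physical
        self.guest_table = {}               # guest-virtual -> guest-physical
        self.shadow_table = {}              # guest-virtual -> host-physical
        self.intercepts = 0

    def guest_map(self, gva, gpa):
        # Every guest page table update traps into the VMM so the
        # shadow table can be brought back in sync.
        self.intercepts += 1
        self.guest_table[gva] = gpa
        self.shadow_table[gva] = self.gpa_to_hpa[gpa]

    def translate(self, gva):
        # A single lookup in the VMM-maintained shadow table.
        return self.shadow_table[gva]


class NestedPagingVMM:
    """Hardware-assisted MMU virtualization in the style of RVI."""

    def __init__(self, gpa_to_hpa):
        self.nested_table = dict(gpa_to_hpa)  # maintained by the VMM
        self.guest_table = {}                 # maintained by the guest
        self.intercepts = 0

    def guest_map(self, gva, gpa):
        # The guest updates its own table; no VMM involvement.
        self.guest_table[gva] = gpa

    def translate(self, gva):
        # Two lookups: the guest's table, then the extra nested level.
        gpa = self.guest_table[gva]
        return self.nested_table[gpa]


if __name__ == "__main__":
    backing = {0: 100, 1: 101, 2: 102}
    shadow, nested = ShadowPagingVMM(backing), NestedPagingVMM(backing)
    for gva, gpa in [(10, 0), (11, 1), (10, 2)]:
        shadow.guest_map(gva, gpa)
        nested.guest_map(gva, gpa)
    print("shadow intercepts:", shadow.intercepts)
    print("nested intercepts:", nested.intercepts)
    print("same answer:", shadow.translate(10) == nested.translate(10))
```

The extra lookup in the nested case is also a rough stand-in for why a TLB miss costs more with this kind of hardware assist than with shadow paging.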

One of the applications of RVI that interests me directly is Citrix XenApp (Presentation Server). XenApp receives a direct performance benefit from RVI because it is an MMU-intensive workload. VMware’s conclusion in lab testing was that XenApp performance increased by approximately 29% using RVI. By way of the performance increase, we can increase the number of concurrent users on each virtualized XenApp box. There are two wins here: We increase our consolidation ratios on XenApp and we reduce the aggregate number of XenApp boxes we have to manage due to more densely populated XenApp servers. This is great stuff!

There is a caveat. VMware observed some memory access latency increases for a few workloads, however, they tell us there is a workaround. Use large pages in the guest and the hypervisor to reduce the stress on the Translation Lookaside Buffer (TLB). VMware recommends that TLB-intensive workloads make extensive use of large pages to mitigate the higher cost of a TLB miss. For optimal performance, the ESX VMM and VMKernel aggressively try to use large pages for their own memory when RVI is used.

For more information and deeper technical jibber jabber, please see VMware’s white paper Performance of Rapid Virtualization Indexing (RVI). Something to note is that all testing was performed on ESX 3.5.0 Update 2 with 64 bit guest VMs. I give credit to this document for the information provided in this blog post, including two directly quoted sentences.

For some more good reading, take a look at Duncan Epping’s experience with a customer last week involving MMU, RVI, and memory over commit.

Putting some money where my VMware mouth is

February 15th, 2009

I came home this afternoon from a Valentine’s Day wedding in North Dakota to find that my one and only workstation in the house (other than the work laptop) had a belated Valentine’s Day present for me:  It would no longer boot up.  No Windows.  No POST.  No video signal.  No beep codes.


I was feeling adventurous and I needed a relatively quick and inexpensive fix.  I decided to take one of the thin clients I received from Chip PC via VMworld 2008 plus a freshly deployed Windows XP template on the Virtual Infrastructure and promote this VDI solution to main household workstation status for the next few weeks.  The timing on this could not have been better.  The upcoming Minnesota VMUG on Wednesday March 11th is going to be VDI focused.  I guess I’ll have more to contribute at that meeting than I had originally planned on.  With any luck, Chip PC will be in attendance and we can discuss some things.

The thin client:  Chip PC Xtreme PC NG-6600 (model: EX6600N, part number: CPN04209).

Specs:

  • RMI – Alchemy Au 1550, 500MHz RISC processor (equivalent to 1.2GHz x86 TC processors)
  • 128MB DDR RAM
  • 64MB Disk-On-Chip with TFS
  • 128-bit 3D graphics acceleration engine with separate 2×8MB display memory SDRAM
  • Dual DVI ports each supporting 1920×1200 16-bit color.  Supports quad displays up to 1024×768
  • Audio I/O
  • 4 USB 2.0 ports
  • 10/100 Ethernet NIC
  • Power draw:  3.5W work mode, .35W sleep mode
  • OS:  Enhanced Microsoft Windows CE (6.00 R2 Professional)
  • Integrated applications (Plugins – note plugins are downloaded at no charge from the Chip PC website and are not, by default, embedded or included with the thin client – just enough OS concept)
    • Citrix ICA
    • RDP 5.2 and 6
    • Internet Explorer 6.0
    • VDM Client
    • VDI Client
    • Media Player
    • VPN Client
    • Ultra VNC
    • Pericom (Team Talk) Terminal Emulation
    • LPD Printer
    • ELO Touch Screen
  • Compatibility
    • Citrix WinFrame, MetaFrame, and Presentation Server 4.5
    • MS Windows Server 2000/2003
    • MS Windows NT 4.0 – TS Edition
    • VMware Virtual Desktop Interface using RDP
  • Full support of both local and network printers:  LPD, LPR, SMB, LPT, USB, COM
  • Support for USB mass storage (thumb drives – deal breaker for me)
  • Support for wireless USB NIC (not included)
  • etc. etc. etc.


Truth be told, this isn’t really a promotion in the sense that I had already performed extensive testing on it.  I hadn’t even taken the thing out of the box yet other than to register it for the extended warranty.  I’ve had only a little experience on these devices as I have an identical unit in the lab at work which I’ve spent a total of 30 minutes on.  To the best of my knowledge, this is the Cadillac unit from Chip PC.

I don’t have any fancy VDI brokering solutions here in the home lab and I’m not up to speed on VMware View so the plan is to leverage Thin Client -> RDP -> Windows XP desktop on VMware Virtual Infrastructure 3.5.

I think this is going to be a good test.  A trial by fire of VDI (granted, a fairly simple variation).  I spout a lot about the goodness that is VMware and now I’ll be eating some of my own dog food from the desktop workspace.  I’m a power user.  I’ve got my standard set of applications that I use on a regular basis and I’ve got a few hardware devices such as a flatbed scanner, iPod Shuffle, USB thumb drives, digital cameras, etc.  I should know within a short period of time whether or not this will be a viable solution for the short term.  Also add to the mix my wife’s career.  She uses our home computer to access her servers at work on a fairly regular basis.  Lastly, my wife sometimes works from home while I’m away at the office or traveling.  It’s going to be critical that this solution stays up and running and continues to be viable for my wife while I’m remote and not able to provide computer support.

So where am I at now?  I’ve got the VDI session patched along with my most critical applications installed to get me by in the short term:  Quicken, SnagIt, network printer, and Citrix clients.  I’ll install MS Office later but for now I can use the published application version of Office on my virtualized Citrix servers.  I’ve been listening to some Electro House on www.di.fm on the VDI and music quality is as good as it was on my PC before it died, although it doesn’t completely drive my 5.1 surround in the den.  Pretty sure I’m getting 2.1 right now.  Oh well, at least the sub is thumpin.  Shhhh… the thin client is sleeping:


So what else?  As long as I’m throwing caution to the wind, I think it’s time to take the training wheels off VMware DPM (Distributed Power Management) and see what happens in a two node cluster.


Based on the environment below, what do you think will happen?  CPU load is very low, however, memory utilization is close to being over committed in a one host scenario. Will DPM kick in?


Most of my infrastructure at home is virtual including all components involving internet access both incoming and outgoing.  If the blog becomes unavailable for a while in the near future, I’ll give you one guess as to what happened.  :)

No matter what the outcome, vmwarenews.de aka Roman Haug – you are no longer welcome to republish my blog articles.  Albeit flattering, the fact that you have not even so much as asked in the first place has officially pissed me off.  You publish my content as if it were your own, written by you as indicated by the “by Roman” header preceding each duplicated post.  Please remove my content from your site and refrain from syndicating my content going forward.  Thank you in advance.

Update: Roman Haug has offered an apology and I believe we have reached an understanding.  Thank you Roman!

NFL’s Super Bowl IT team gets ready for game day

January 31st, 2009

 I think this would be a neat gig, and probably somewhat stressful.  All infrastructure components from simple to the most advanced must be monitored thoroughly and must not be overlooked.  And hey, virtualization is involved which is a plus.  It’s too bad they don’t specify what flavor of virtualization.  Inquiring minds would like to know.  How about it Computerworld?

January 30, 2009 (Computerworld) The National Football League is fielding three teams for Sunday’s Super Bowl. The first two are well known: the Pittsburgh Steelers and Arizona Cardinals. The third, more anonymous one is the 17-member IT staff that the NFL has assigned to work in Tampa, Fla., the site of this year’s game.

That team was tasked with creating a complete IT operation for Super Bowl XLIII in a matter of weeks. Its coaches are Joe Manto, the NFL’s vice president of IT, and Jon Kelly, the league’s director of infrastructure computing. Their opponent is the same one that IT managers face everywhere: anything that can threaten system availability and uptime.

It doesn’t help matters that one of the four IBM BladeCenter S systems being used in Tampa is located on a wood floor in a tent that lacks any climate control capabilities. But so far, so good – and with the four BladeCenter boxes at different locations, and virtualization software ready to provide redundancy, neither Manto nor Kelly seems all that worried.

“It’s very exciting for IT guys,” Manto said of the experience of setting up a systems infrastructure for the Super Bowl. It’s unlike most IT projects, which involve creating systems that will provide ongoing support to users. Instead, the seven-day-a-week effort in Tampa has a short life span and a clear and unmovable deadline.

“That game is going to kick off on Sunday no matter what happens,” Manto said. And by Tuesday, the IT equipment will be disassembled, packed and shipped out of Tampa. “It’s really an open-and-closed operation, which is sort of unique in the IT world,” he said.

The IT staff has set up systems in a hotel to support business operations for about 200 NFL employees who are on-site in Tampa. It has also built a tech operation at the convention center in Tampa to support 3,500 media representatives who are covering the event; that setup includes wireless networking and automated access to NFL data.

Another system will manage the credentialing of up to 25,000 people – everyone from construction workers to halftime performers. In addition, about 300 PCs have been networked together.

This is the first year that the NFL has completely turned over its server processing workload for the Super Bowl to blade systems. Each BladeCenter chassis includes two blade servers, each with a pair of sockets for quad-core chips. In the past, the league would bring “tens of servers” to the game to provide IT support, Kelly said.

Manto said he will be able to watch parts of the game, primarily on TV monitors, as he moves around Raymond James Stadium in Tampa checking on system operations. But for the most part, Sunday will be a 14-hour workday for the IT staff. “Our main goal,” he said, “is to make sure that everything about this event is accomplished professionally and in a way that gives the fans the best possible experience.”

 Article above originally posted here.

Great iSCSI info!

January 27th, 2009

I’ve been using Openfiler 2.2 iSCSI in the lab for a few years with great success as a means for shared storage. Shared storage with VMware ESX/ESXi (along with the necessary licensing) allows us great things like VMotion, DRS, HA, etc. I’ve recently been kicking the tires of Openfiler 2.3 and have been anxious to implement it, partly because of its easy, menu-driven NIC bonding feature, which I wanted to leverage for maximum disk I/O throughput.

Coincidentally, just yesterday a few of the big brains in the storage industry got together and published what I consider one of the best blog entries in the known universe. Chad Sakac and David Black (EMC), Andy Banta (VMware), Vaughn Stewart (NetApp), Eric Schott (Dell/EqualLogic), Adam Carter (HP/Lefthand) all conspired.

One of the iSCSI topics they cover is link aggregation over Ethernet. I read and re-read this section with great interest. My current swiSCSI configuration in the lab consists of a single 1Gb VMKernel NIC (along with a redundant failover NIC) connected to a single 1Gb NIC in the Openfiler storage box having a single iSCSI target with two LUNs. I’ve got more 1Gb NICs that I can add to the Openfiler storage box, so my million dollar question was “will this increase performance?” The short answer is NO with my current configuration. Although the additional NIC in the Openfiler box will provide a level of hardware redundancy, due to the way ESX 3.x iSCSI communicates with the iSCSI target, only a single Ethernet path will be used by ESX to communicate with the single target backed by both LUNs.

However, what I can do to add more iSCSI bandwidth is to add the 2nd Gb NIC in the Openfiler box along with an additional IP address, and then configure an additional iSCSI target so that each LUN is mapped to a separate iSCSI target.  Adding the additional NIC in the Openfiler box for hardware redundancy is a no brainer and I probably could have done that long ago, but as far as squeezing more performance out of my modest iSCSI hardware, I’m going to perform some disk I/O testing to see if the single Gb NIC is a disk I/O bottleneck.  I may not have enough horsepower under the hood of the Openfiler box to warrant going through the steps of adding additional iSCSI targets and IP addressing.

A few of the keys I extracted from the blog post are as follows:

“The core thing to understand (and the bulk of our conversation – thank you Eric and David) is that 802.3ad/LACP surely aggregates physical links, but the mechanisms used to determine whether a given flow of information follows one link or another are critical.

Personally, I found this doc very clarifying: http://www.ieee802.org/3/hssg/public/apr07/frazier_01_0407.pdf

You’ll note several key things in this doc:

* All frames associated with a given “conversation” are transmitted on the same link to prevent mis-ordering of frames. So what is a “conversation”? A “conversation” is the TCP connection.
* The link selection for a conversation is usually done by doing a hash on the MAC addresses or IP address.
* There is a mechanism to “move a conversation” from one link to another (for loadbalancing), but the conversation stops on the first link before moving to the second.
* Link Aggregation achieves high utilization across multiple links when carrying multiple conversations, and is less efficient with a small number of conversations (and has no improved bandwidth with just one). While Link Aggregation is good, it’s not as efficient as a single faster link.”
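
To make the hashing point concrete, here is a minimal Python sketch. It is not how any particular switch or ESX build computes its hash; the `pick_link` function, the use of CRC32, and the IP addresses are all assumptions picked purely for illustration. The only idea carried over from the quoted text is that the link is chosen from a hash of the conversation's addresses, so one conversation always lands on one link.

```python
import zlib


def pick_link(src_ip: str, dst_ip: str, num_links: int) -> int:
    """Map a conversation to a link index by hashing its addresses.

    Hypothetical helper: real devices use their own hash (often on MAC
    or IP bits). CRC32 is used here only because it is deterministic.
    """
    key = f"{src_ip}->{dst_ip}".encode()
    return zlib.crc32(key) % num_links


# One initiator talking to one target IP: the inputs never change,
# so the hash (and therefore the chosen link) never changes either.
links_for_one_target = {pick_link("10.0.0.11", "10.0.0.50", 2) for _ in range(1000)}
print(len(links_for_one_target))  # 1

# Several target IPs give the hash different inputs, so the
# conversations have a chance to spread across both links.
targets = [f"10.0.0.{n}" for n in range(50, 58)]
print(sorted({pick_link("10.0.0.11", t, 2) for t in targets}))
```

The first result is always a single link no matter how many links are bonded, which is the behavior the bullets above describe for a single conversation.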

The ESX 3.x software initiator really only works on a single TCP connection for each target – so all traffic to a single iSCSI Target will use a single logical interface. Without extra design measures, it does limit the amount of IO available to each iSCSI target to roughly 120 – 160 MBps of read and write access.

“This design does not limit the total amount of I/O bandwidth available to an ESX host configured with multiple GbE links for iSCSI traffic (or more generally VMKernel traffic) connecting to multiple datastores across multiple iSCSI targets, but does for a single iSCSI target without taking extra steps.

Question 1: How do I configure MPIO (in this case, VMware NMP) and my iSCSI targets and LUNs to get the most optimal use of my network infrastructure? How do I scale that up?

Answer 1: Keep it simple. Use the ESX iSCSI software initiator. Use multiple iSCSI targets. Use MPIO at the ESX layer. Add Ethernet links and iSCSI targets to increase overall throughput. Set your expectation for no more than ~160MBps for a single iSCSI target.

Remember an iSCSI session is from initiator to target. If you use multiple iSCSI targets, with multiple IP addresses, you will use all the available links in aggregate, and the storage traffic in total will load balance relatively well. But any one individual target will be limited to a maximum of a single GbE connection’s worth of bandwidth.

Remember that this also applies to all the LUNs behind that target. So, consider that as you distribute the LUNs appropriately among those targets.

The ESX initiator uses the same core method to get a list of targets from any iSCSI array (static configuration or dynamic discovery using the iSCSI SendTargets request) and then a list of LUNs behind that target (SCSI REPORT LUNS command).”
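
The arithmetic behind that advice can be sketched in a few lines of Python. This is a back-of-the-envelope model, not a benchmark: `PER_LINK_MBPS` takes the low end of the 120 to 160 range quoted above, both function names are hypothetical, and the model assumes the targets spread perfectly across the available links, which real traffic may not do.

```python
PER_LINK_MBPS = 120.0  # assumption: low end of the quoted 120-160 range


def ceiling_mbps(num_targets: int, num_links: int,
                 per_link: float = PER_LINK_MBPS) -> float:
    """Best-case aggregate throughput for an ESX 3.x style initiator.

    Each target rides one TCP connection, so it can fill at most one
    link; extra links beyond the number of targets sit idle.
    """
    return min(num_targets, num_links) * per_link


def per_lun_share(luns_behind_target: int,
                  per_link: float = PER_LINK_MBPS) -> float:
    """Even split of one target's ceiling among the LUNs behind it."""
    return per_link / luns_behind_target


print(ceiling_mbps(num_targets=1, num_links=2))  # 120.0
print(ceiling_mbps(num_targets=2, num_links=2))  # 240.0
print(per_lun_share(luns_behind_target=2))       # 60.0
```

With one target and two LUNs (my current lab layout), a second NIC adds nothing to the ceiling; splitting the LUNs across two targets is what doubles it in this model.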

Question 4: Do I use Link Aggregation and if so, how?

Answer 4: There are some reasons to use Link Aggregation, but increasing throughput to a single iSCSI target isn’t one of them in ESX 3.x.

What about Link Aggregation – shouldn’t that resolve the issue of not being able to drive more than a single GbE for each iSCSI target? In a word – NO. A TCP connection will have the same IP addresses and MAC addresses for the duration of the connection, and therefore the same hash result. This means that regardless of your link aggregation setup, in ESX 3.x, the network traffic from an ESX host for a single iSCSI target will always follow a single link.

For swiSCSI users, they also mention some cool details about what’s coming in the next release of ESX/ESXi. Those looking for more iSCSI performance will want to pay attention. 10Gb Ethernet is also going to be a game changer, further threatening fibre channel SAN technologies.

I can’t stress enough how neat and informative this article is. To boot, technology experts from competing storage vendors pooled their knowledge for the greater good. That’s just awesome!

KB1008130: VMware ESX and ESXi 3.5 U3 I/O failure on SAN LUN(s) and LUN queue is blocked indefinitely

January 19th, 2009

I became aware of this issue last week by word of mouth and received the official Email blast from VMware this morning.

The vulnerability lies in a convergence of circumstances:

1. Fibre channel SAN storage with multipathing
2. A fibre channel SAN path failure or planned path transition
3. Metadata update occurring during the fibre channel SAN path failure where metadata updates include but are not limited to:

a. Power operations of a VM
b. Snapshot operations of a VM (think backups)
c. Storage VMotion (sVMotion)
d. Changing a file’s attributes
e. Creating a VMFS volume
f. Creating, modifying, deleting, growing, or locking of a file on a VMFS volume

The chance of a fibre channel path failure can be rated as slim; however, metadata updates happen quite frequently, more often than you might think. Therefore, if a fibre channel path failure occurs, chances are good that a metadata update will be in flight, which is precisely when disaster will strike. Moreover, the safety benefit of multipathing, and our reliance on it, is diminished by the vulnerability.

Please be aware of this.

Dear ESX 3.5 Customer,

Our records indicate you recently downloaded VMware® ESX Version 3.5 U3 from our product download site. This email is to alert you that an issue with that product version could adversely affect your environment. This email provides a detailed description of the issue so that you can evaluate whether it affects you, and the next steps you can take to get resolution or avoid encountering the issue.

ISSUE DETAILS:
VMware ESX and ESXi 3.5 U3 I/O failure on SAN LUN(s) and LUN queue is blocked indefinitely. This occurs when VMFS3 metadata updates are being done at the same time failover to an alternate path occurs for the LUN on which the VMFS3 volume resides. The affected releases are ESX 3.5 Update 3 and ESXi 3.5 U3 Embedded and Installable with both Active/Active or Active/Passive SAN arrays (Fibre Channel and iSCSI).

PROBLEM STATEMENT AND SYMPTOMS:
ESX or ESXi Host may get disconnected from Virtual Center
All paths to the LUNs are in standby state
Esxcfg-rescan might take a long time to complete or never complete (hung)
VMKernel logs show entries similar to the following:

Queue for device vml.02001600006006016086741d00c6a0bc934902dd115241 49442035 has been blocked for 6399 seconds.

Please refer to KB 1008130.

SOLUTION:
A reboot is required to clear this condition.

VMware is working on a patch to address this issue. The knowledge base article for this issue will be updated after the patch is available.

NEXT STEPS:
If you encounter this condition, please collect the following information and open an SR with VMware Support:

1. Collect a vsi dump before reboot using /usr/lib/vmware/bin/vsi_traverse.

2. Reboot the server and collect the vm-support dump.

3. Note the activities around the time where a first “blocked for xxxx seconds” message is shown in the VMkernel.

Please consult your local support center if you require further information or assistance. We apologize in advance for any inconvenience this issue may cause you. Your satisfaction is our number one goal.
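
For step 3 in the email above, it can help to pull the “blocked for xxxx seconds” entries out of a saved copy of the VMkernel log. The following is a minimal Python sketch of that idea, assuming the log lines look like the sample VMware quoted; the `find_blocked_queues` function and the regular expression are my own illustration, not a VMware tool, and the pattern may need adjusting for real log output.

```python
import re

# Pattern modeled on the sample line in the email; an assumption, since
# real VMkernel lines also carry timestamps and other prefixes.
BLOCKED_RE = re.compile(
    r"Queue for device (.+?) has been blocked for (\d+) seconds"
)


def find_blocked_queues(lines):
    """Return (device, seconds) for every blocked-queue message found."""
    hits = []
    for line in lines:
        match = BLOCKED_RE.search(line)
        if match:
            hits.append((match.group(1), int(match.group(2))))
    return hits


# The sample entry from the email, used here as stand-in log content.
sample = [
    "Queue for device vml.02001600006006016086741d00c6a0bc934902dd115241 "
    "49442035 has been blocked for 6399 seconds.",
]

for device, seconds in find_blocked_queues(sample):
    print(f"{device}: blocked for {seconds} seconds")
```

To scan a real file, pass an open file object in place of `sample`; the first hit marks the point in time to line up against whatever power, snapshot, or sVMotion activity was running.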

Update:  The patch has been released that resolves this