Posts Tagged ‘VMotion’

New ESX(i) 3.5 security patch released; scenarios and installation notes

April 11th, 2009

On Friday, April 10th, VMware released two patches, one for ESX and one for ESXi, each documented in its own KB article. Both address the same issue:

A critical vulnerability in the virtual machine display function might allow a guest operating system to run code on the host. The Common Vulnerabilities and Exposures Project (cve.mitre.org) has assigned the name CVE-2009-1244 to this issue.

Hackers must love vulnerabilities like this because they can get a lot of mileage out of essentially a single attack. The ability to execute code on an ESX host can impact all running VMs on that host.

Although proper virtualization promises isolation, the reality is that no hardware or software vendor is perfect and from time to time we’re going to see issues like this. Products are under constant attack from hackers (both good and bad) to find exploits. In virtualized environments, it’s important to remember that guest VMs and guest operating systems are no different than their physical counterparts in that they need to be properly protected from the network. That means adequate virus protection, spyware protection, firewalls, encryption, packet filtering, etc.

This vulnerability in VMware ESX and ESXi is really a two-stage attack. To compromise the ESX or ESXi host, an attacker must first compromise a guest VM over the network, which provides the entry point to the host. Once the guest VM is compromised, the next step is to get from the guest VM to the ESX(i) host. Hosts without the patch are vulnerable to this second stage which, as described above, could allow arbitrary code to be executed on the host. If the host is patched, we maintain our guest isolation and the attack stops at the VM level. Unfortunately, the OS running in the guest VM is still compromised, again highlighting the need for adequate protection of the operating system and applications running in each VM.

The bottom line is this is an important update for your infrastructure. If your ESX or ESXi hosts are vulnerable, you’ll want to get this one tested and implemented as soon as possible.

I installed the updates today in the lab and discovered something interesting that is actually outlined in both of the KB articles above:

  • The ESXi version of the update requires a reboot. Using Update Manager, the patch process goes like this: Remediate -> Maintenance Mode -> VMotion VMs off -> Patch -> Reboot -> Exit Maintenance Mode. From the start of patch installation until exiting maintenance mode (including the reboot in between), the process took 12 minutes.
  • The ESX version of the update does not require a reboot. Using Update Manager, the patch process goes like this: Remediate -> Maintenance Mode -> VMotion VMs off -> Patch -> Exit Maintenance Mode. From the start of patch installation until exiting maintenance mode (with no reboot in between), the process took 1.5 minutes.

Given reboot times of the host, patching ESX hosts goes much quicker than patching ESXi hosts. Reboot times on HP Proliant servers aren’t too bad but I’ve been working with some powerful IBM servers lately and the reboot times on those are significantly longer than HP. Hopefully we’re not rebooting ESX hosts on a regular basis so with that in mind, reboot times aren’t a huge concern, but if you’ve got a large environment with a lot of hosts requiring reboots, the reboot times are going to be cumulative in most cases. Consider my environment above. A 6 node ESXi cluster is going to take 72 minutes to patch, not including VMotions. A 6 node ESX cluster is going to take 9 minutes to patch, not including VMotions. This may be something to really think about when weighing the decision of ESX versus ESXi for your environment.
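To make the arithmetic above explicit, here is a minimal Python sketch. It assumes the two lab timings (12 minutes per ESXi host, 1.5 minutes per ESX host) hold for every host in the cluster and that VMotion time is excluded, as in the figures above. The function name and the `parallel` parameter are illustrative only and are not something Update Manager exposes.

```python
import math

# Per-host remediation times measured in the lab run described above (minutes).
PATCH_MINUTES = {"ESXi": 12.0, "ESX": 1.5}


def cluster_patch_minutes(host_type: str, hosts: int, parallel: int = 1) -> float:
    """Estimate total patch time for a cluster, excluding VMotion time.

    `parallel` is a hypothetical knob: how many hosts are remediated at once.
    With parallel=1 the per-host times are simply cumulative.
    """
    batches = math.ceil(hosts / parallel)
    return batches * PATCH_MINUTES[host_type]


for host_type in ("ESXi", "ESX"):
    print(host_type, cluster_patch_minutes(host_type, hosts=6), "minutes")
# ESXi 72.0 minutes
# ESX 9.0 minutes
```

Real numbers will vary with hardware, since reboot time dominates the ESXi figure.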

Update: One more item critical to note is that although the ESX version of the patch requires no reboot, the patch does require three other patches to be installed, at least one of which requires a reboot.  If you already meet the requirements, no reboot will be required for ESX to install the new patch.

In closing, while we are on the subject of performing a lot of VMotions, take a look at a guest blog post from Simon Long called VMotion Performance. Simon shows us how to modify VirtualCenter (vCenter Server) to allow more simultaneous VMotions which will significantly cut down the amount of time spent patching ESX hosts in a cluster.

Andrew Kutz joins Hyper9

February 28th, 2009

This news is a little over a week old but I just found out two nights ago while reading vExpert profiles and it’s definitely worth repeating.

Andrew Kutz was recently named a vExpert by VMware, Inc. and is a well known developer in the VMware community. Andrew has authored a number of VirtualCenter plugins, of which the most famous might be his free Storage VMotion (sVMotion) plugin, which gives VMware administrators a GUI to hot migrate VM storage from one LUN to another. Andrew has received well deserved praise for his work because he makes the lives of VI administrators easier.

Hyper9 is a startup company in Austin, TX that works in the virtualization infrastructure management space, developing tools that automate the management of virtualization in the datacenter. Hyper9 recently secured an additional round of investment funding, and the hiring of Andrew Kutz suggests they are serious about delivering quality products to the virtualization community. What can we expect out of this? Given what I’ve seen from Andrew in the past, I’ll guess the future will be a plugin-based architecture, which I think makes a lot of sense and is probably what the majority of the community wants.

Congratulations to both Andrew Kutz and Hyper9. I look forward to your accomplishments with great anticipation!

Read the official announcement from Hyper9 here.

Tripwire Announcement

February 19th, 2009

Press release from Tripwire.  I haven’t had time to take a look at the product yet but the announcement comes from a trustworthy and reputable source whom I respect.  I look forward to seeing some commentary either on the blog here or over at vwire.com.

TRIPWIRE ANNOUNCES FREE UTILITY TO HELP MANAGE VMWARE VMOTION, LAUNCHES NEW VIRTUALIZATION COMMUNITY
Tripwire OpsCheck addresses key virtual infrastructure operational issues; vWire.com offers an opportunity for virtual infrastructure professionals to share ideas and best practices

Portland, OR – Feb. 17, 2009 – Tripwire, Inc. today announced a major new initiative for virtual infrastructure (VI) professionals, which includes Tripwire OpsCheck™, a free tool to manage VMware VMotion, and an online community for VI administrators. Tripwire OpsCheck assesses common configuration problems that may prevent VMotion from operating properly, and provides troubleshooting tips for configuring VMotion based on Tripwire OpsCheck test results. To download Tripwire OpsCheck, go to www.vwire.com.

To further support the needs of VI professionals, Tripwire has unveiled vWire.com, an online community built around the concerns of VI professionals. Virtualization administrators, engineers and architects are invited to join the community and conversation to share best practices, network, and gain new resources and tools. For more information about the forum, visit www.vwire.com.

“Virtualization professionals are faced with unknown territory, requiring new tools to manage the complexities and risks of virtual environments,” said Dan Schoenbaum, chief operating officer of products, Tripwire. “That’s why Tripwire is committed to developing utilities specifically for virtualization, such as OpsCheck and ConfigCheck, and to creating a forum where VI professionals can share their experiences and knowledge.”

Tripwire ConfigCheck, released in 2008, provides an immediate assessment of the configurations of a VMware ESX hypervisor, comparing them against VMware hardening security guidelines, and then providing remediation instructions if any are needed. ConfigCheck is also available for free and can be downloaded at www.vwire.com.

About Tripwire, Inc.
Tripwire helps over 6,500 enterprises worldwide reduce security risk, attain compliance and increase operational efficiency across virtual and physical environments. With its industry leading configuration assessment and change auditing software solutions, IT organizations achieve and maintain configuration control. Tripwire is headquartered in Portland, Ore. with offices worldwide. http://www.tripwire.com/.

Great iSCSI info!

January 27th, 2009

I’ve been using Openfiler 2.2 iSCSI in the lab for a few years with great success as a means for shared storage. Shared storage with VMware ESX/ESXi (along with the necessary licensing) allows us great things like VMotion, DRS, HA, etc. I’ve recently been kicking the tires of Openfiler 2.3 and have been anxious to implement it, partly due to the ease of its menu-driven NIC bonding feature, which I wanted to leverage for maximum disk I/O throughput.

Coincidentally, just yesterday a few of the big brains in the storage industry got together and published what I consider one of the best blog entries in the known universe. Chad Sakac and David Black (EMC), Andy Banta (VMware), Vaughn Stewart (NetApp), Eric Schott (Dell/EqualLogic), Adam Carter (HP/Lefthand) all conspired.

One of the iSCSI topics they cover is link aggregation over Ethernet. I read and re-read this section with great interest. My current swiSCSI configuration in the lab consists of a single 1Gb VMKernel NIC (along with a redundant failover NIC) connected to a single 1Gb NIC in the Openfiler storage box having a single iSCSI target with two LUNs. I’ve got more 1Gb NICs that I can add to the Openfiler storage box, so my million dollar question was “will this increase performance?” The short answer is NO with my current configuration. Although the additional NIC in the Openfiler box will provide a level of hardware redundancy, due to the way ESX 3.x iSCSI communicates with the iSCSI target, only a single Ethernet path will be used by ESX to communicate with the single target backed by both LUNs.

However, what I can do to add more iSCSI bandwidth is to add the 2nd Gb NIC in the Openfiler box along with an additional IP address, and then configure an additional iSCSI target so that each LUN is mapped to a separate iSCSI target.  Adding the additional NIC in the Openfiler box for hardware redundancy is a no brainer and I probably could have done that long ago, but as far as squeezing more performance out of my modest iSCSI hardware, I’m going to perform some disk I/O testing to see if the single Gb NIC is a disk I/O bottleneck.  I may not have enough horsepower under the hood of the Openfiler box to warrant going through the steps of adding additional iSCSI targets and IP addressing.

A few of the keys I extracted from the blog post are as follows:

“The core thing to understand (and the bulk of our conversation – thank you Eric and David) is that 802.3ad/LACP surely aggregates physical links, but the mechanisms used to determine whether a given flow of information follows one link or another are critical.

Personally, I found this doc very clarifying.: http://www.ieee802.org/3/hssg/public/apr07/frazier_01_0407.pdf

You’ll note several key things in this doc:

* All frames associated with a given “conversation” are transmitted on the same link to prevent mis-ordering of frames. So what is a “conversation”? A “conversation” is the TCP connection.
* The link selection for a conversation is usually done by doing a hash on the MAC addresses or IP address.
* There is a mechanism to “move a conversation” from one link to another (for loadbalancing), but the conversation stops on the first link before moving to the second.
* Link Aggregation achieves high utilization across multiple links when carrying multiple conversations, and is less efficient with a small number of conversations (and has no improved bandwidth with just one). While Link Aggregation is good, it’s not as efficient as a single faster link.”
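The link selection behavior described in those bullets can be sketched in a few lines of Python. This is only an illustration: the CRC32 hash, the address-pair key, and the IP addresses are assumptions made for the example, and real switches and NIC teaming drivers use their own vendor-specific hash functions.

```python
import zlib


def select_link(src_ip: str, dst_ip: str, num_links: int) -> int:
    """Pick an aggregated link index from a hash of the address pair.

    CRC32 is an illustrative stand-in for whatever hash the hardware uses.
    """
    key = f"{src_ip}->{dst_ip}".encode()
    return zlib.crc32(key) % num_links


# Hypothetical addresses: one ESX VMkernel port and one iSCSI target.
ESX_HOST = "10.0.0.10"
TARGET = "10.0.0.20"

# One conversation: the addresses never change, so the chosen link never changes.
links_used = {select_link(ESX_HOST, TARGET, num_links=2) for _ in range(1000)}
print("links used by a single conversation:", links_used)

# Several conversations (one per target IP) may be spread over the links.
for last_octet in range(20, 24):
    target = f"10.0.0.{last_octet}"
    print(target, "-> link", select_link(ESX_HOST, target, num_links=2))
```

Because the hash input is constant for the life of a connection, a single conversation can never use more than one link, no matter how many links are in the aggregate.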

“The ESX 3.x software initiator really only works on a single TCP connection for each target – so all traffic to a single iSCSI Target will use a single logical interface. Without extra design measures, it does limit the amount of IO available to each iSCSI target to roughly 120 – 160 MB/s of read and write access.”

“This design does not limit the total amount of I/O bandwidth available to an ESX host configured with multiple GbE links for iSCSI traffic (or more generally VMKernel traffic) connecting to multiple datastores across multiple iSCSI targets, but does for a single iSCSI target without taking extra steps.

Question 1: How do I configure MPIO (in this case, VMware NMP) and my iSCSI targets and LUNs to get the most optimal use of my network infrastructure? How do I scale that up?

Answer 1: Keep it simple. Use the ESX iSCSI software initiator. Use multiple iSCSI targets. Use MPIO at the ESX layer. Add Ethernet links and iSCSI targets to increase overall throughput. Set your expectation for no more than ~160MBps for a single iSCSI target.

Remember an iSCSI session is from initiator to target. If you use multiple iSCSI targets, with multiple IP addresses, you will use all the available links in aggregate, and the storage traffic in total will load balance relatively well. But any individual target will be limited to a maximum of a single GbE connection’s worth of bandwidth.

Remember that this also applies to all the LUNs behind that target. So, consider that as you distribute the LUNs appropriately among those targets.

The ESX initiator uses the same core method to get a list of targets from any iSCSI array (static configuration or dynamic discovery using the iSCSI SendTargets request) and then a list of LUNs behind that target (SCSI REPORT LUNS command).”
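As a rough way to reason about the advice in that excerpt, the sketch below computes a best-case ceiling on aggregate throughput. It assumes the ~160 MB/s per-target figure quoted above, that each target lands on its own link when enough links exist, and that the storage box itself can keep up. The function and constant names are hypothetical.

```python
# Upper end of the per-target figure quoted above, in MB/s.
PER_TARGET_CAP_MBPS = 160.0


def max_iscsi_throughput(targets: int, links: int,
                         per_link_cap: float = PER_TARGET_CAP_MBPS) -> float:
    """Best-case aggregate throughput in MB/s for ESX 3.x software iSCSI.

    Each target rides one link at most, so the ceiling is set by whichever
    is smaller: the number of targets or the number of links.
    """
    if targets < 1 or links < 1:
        return 0.0
    return min(targets, links) * per_link_cap


print(max_iscsi_throughput(targets=1, links=2))  # 160.0: the second link adds redundancy only
print(max_iscsi_throughput(targets=2, links=2))  # 320.0: one target per link
```

This is a ceiling, not a prediction. Disk and CPU limits in the storage box can keep actual throughput well below it.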

Question 4: Do I use Link Aggregation and if so, how?

Answer 4: There are some reasons to use Link Aggregation, but increasing throughput to a single iSCSI target isn’t one of them in ESX 3.x.

What about Link Aggregation – shouldn’t that resolve the issue of not being able to drive more than a single GbE for each iSCSI target? In a word – NO. A TCP connection will have the same IP addresses and MAC addresses for the duration of the connection, and therefore the same hash result. This means that regardless of your link aggregation setup, in ESX 3.x, the network traffic from an ESX host for a single iSCSI target will always follow a single link.

For swiSCSI users, they also mention some cool details about what’s coming in the next release of ESX/ESXi. Those looking for more iSCSI performance will want to pay attention. 10Gb Ethernet is also going to be a game changer, further threatening fibre channel SAN technologies.

I can’t stress enough how neat and informative this article is. To boot, technology experts from competing storage vendors pooled their knowledge for the greater good. That’s just awesome!

Uptime lost during VMotion

January 11th, 2009

That’s right. We lose uptime during every VMotion. Relax just a bit – I’m not talking about actual uptime/downtime availability of the guest VM in the datacenter. I’m speaking to the uptime performance metric tracked in the VirtualCenter database. It’s a bug that was introduced in VirtualCenter 2.0 and has remained in the code to this day in VirtualCenter 2.5.0 Update 3. Here’s how it works:

We’ve got a VM named Exchange1. VirtualCenter displays its uptime as 28 days, as indicated by the screenshots below:

[Screenshot: 1-11-2009 12-04-06 AM]

[Screenshot: 1-11-2009 12-10-14 AM]

Now the VM has been VMotioned from host SOLO to host LANDO. Notice what has happened to uptime. It has disappeared from the console:

[Screenshot: 1-11-2009 12-07-37 AM]

The real proof of what has happened is in the performance chart, where the latest uptime metric has been reset from 27.99 days as shown above to 0.0026620 days:

[Screenshot: 1-11-2009 12-10-54 AM]

Sometimes the VIC console will show the uptime counter start over at 0 days, then on to 1 days, etc. Other times the uptime counter will remain blank for days or weeks as you can see from my three other VMs in the first screenshot which show no uptime.

This brings us to an interesting discussion. What would you like uptime in the VIC to mean exactly? Following are my observations and thoughts on VMware’s implementation of the uptime metric in VirtualCenter.

In previous versions of VirtualCenter, a soft reboot of the VM inside of the OS would reset the uptime statistic in VirtualCenter. I believe this was a function of VMware Tools that triggered this.

Today in VirtualCenter 2.5.0 Update 3, a soft reboot inside the guest VM does not reset the uptime statistic back to zero.

A VM which has no VMware Tools installed that is soft rebooted inside of the OS (i.e., we’re not talking about any VMware console power operation here) does not reset the uptime statistic.

I could see the community take a few different sides on this as there are two variations of the definition of uptime we’re dealing with here. Uptime of the guest VM OS and uptime of the VM’s virtual hardware.

  1. Should uptime translate into how long the VM virtual hardware has been powered on from a virtual infrastructure standpoint?
  2. Or should uptime translate into how long the OS inside the VM has been up, tracked by VMware Tools?

The VMware administrator cares about the length of time a VM has been powered on. It is the powered on VM that consumes resources from the four resource food groups and impacts capacity.

The guest VM OS administrator, whether it be Windows or Linux, cares about uptime of the guest OS. The owner of the OS is held to SLAs by the business lines.

My personal opinion is that the intended use of the Virtual Infrastructure Client is for the VMware administrator and thus should reflect virtual infrastructure information. My preference is that the uptime statistic in VirtualCenter tracks power operations of the VM regardless of any reboot sequences of the OS inside the VM. In other words, uptime is not impacted by VMware Tools heartbeats or reboots inside the guest VM. The uptime statistic should only be reset when the VM is powered off or power reset (including instances where HA has recovered a VM).
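The behavior I’m describing can be modeled with a small Python sketch. To be clear, this is not how VirtualCenter is implemented. The event names and the `UptimeCounter` class are made up purely to illustrate the semantics I would prefer: only power operations touch the counter.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

SECONDS_PER_DAY = 86400


class Event(Enum):
    POWER_ON = auto()
    POWER_OFF = auto()
    POWER_RESET = auto()
    HA_RESTART = auto()
    VMOTION = auto()
    GUEST_REBOOT = auto()


@dataclass
class UptimeCounter:
    """Tracks uptime of the VM's virtual hardware, not of the guest OS."""
    powered_on_at: Optional[float] = None  # seconds; None while powered off

    def handle(self, event: Event, now: float) -> None:
        if event is Event.POWER_OFF:
            self.powered_on_at = None
        elif event in (Event.POWER_ON, Event.POWER_RESET, Event.HA_RESTART):
            self.powered_on_at = now
        # VMOTION and GUEST_REBOOT deliberately leave the counter untouched.

    def uptime_days(self, now: float) -> float:
        if self.powered_on_at is None:
            return 0.0
        return (now - self.powered_on_at) / SECONDS_PER_DAY


vm = UptimeCounter()
vm.handle(Event.POWER_ON, now=0)
vm.handle(Event.VMOTION, now=27 * SECONDS_PER_DAY)
vm.handle(Event.GUEST_REBOOT, now=27.5 * SECONDS_PER_DAY)
print(vm.uptime_days(now=28 * SECONDS_PER_DAY))  # 28.0
```

Under this model the Exchange1 VM above would still show 28 days after its VMotion from SOLO to LANDO.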

At any rate, due to the bug that uptime has in VirtualCenter 2.0 and above, it’s a fairly unreliable performance metric for any virtual infrastructure using VMotion and DRS. Furthermore, the term itself can be misleading depending on the administrator’s interpretation of uptime versus what’s written in the VirtualCenter code.

I submitted a post in VMware’s Product and Feature Suggestions forum in January of 2007 recording the uptime reset on VMotion issue. As this problem periodically bugs me, I followed up a few times: once in a follow-up post in the thread above, and at least once externally, requesting that someone from VMware take a look at it. Admittedly, I do not have an SR open.

VMware, can we get this bug fixed? After all, if the hypervisor has become an everyday commodity item leaving the management tools as the real meat and potatoes, you should make sure our management tools work properly.

Thank you,

Jas

A great disturbance in the Force

December 15th, 2008

Today I felt a great disturbance in the Force, as if millions of voices cried out in terror.  Mohamed Fawzi of the blog Zeros & Ones posted a VMware vs Hyper-V comparison that I felt was neither fair nor truthful.  In fact, I think it is the worst bit of journalism I’ve witnessed in quite a while and even in the face of the VMworld 2008/Microsoft Hyper-V poker chip fiasco, I don’t know if Microsoft would even endorse this tripe.

I didn’t have a lot of time today for rebuttal and thus following are my brief responses:

Cost: It is impossible to summarize cost of a product (and TCO) in one short sentence as you have done.

Support: VMware was the first virtualization company to be listed on the Microsoft SVVP program.  Enough said about that.  If you want to talk about Linux, VMware supports many distros.  Hyper-V last time I checked supports one.

Hardware Requirements: No comparison.  Microsoft does not have VMotion/hot migration or similar.  New server “farms” are not necessarily needed, although a rolling upgrade can be performed using Enhanced VMotion Compatibility where the majority of the technology that will allow this comes from the processor hardware vendors.

Advanced Memory Management: Content based page sharing is a proven technology that I use in a production environment with no performance impacts.  Microsoft does not have this technology and therefore forces their customers to achieve higher consolidation ratios by spending more money on RAM than what would be needed in a VMware datacenter.  Other memory overcommit technologies such as ballooning and swapping come with varying levels of penalty and VMware offers the flexibility to the customer as to what they would like to do in these areas.  Microsoft offers no flexibility or choices.

Hypervisor: ESXi embedded is 32MB.  ESXi installable is about 1GB.  Hyper-V’s comparable products once installed are 1GB and in the 4-10GB neighborhood.  Your point of the Hyper-V hypervisor being 872KB, whether truth or not, bears no relevance for comparison purposes.

Drivers Support: VMware maintains tight control which fosters platform stability.  Installation of XYZ drivers and software adds to instability, support costs, and down time.

Processor Support: False.  ESX/ESXi operates on x86 32-bit and x64 64-bit processors.  Current 3rd party vendor neutral performance benchmarking between ESX and Hyper-V shows no performance degradation in ESX compared to Hyper-V as a result of address translation or otherwise.  A more truthful headline to be exposed here is Hyper-V isn’t compatible with 32-bit hardware.  Why didn’t you mention this in your Hardware Requirements section?

Application Support: I don’t see any Windows support issues.  Again I remind you, VMware is certified on the Microsoft SVVP program.  Another comparison is made with a particular VMotion restriction.  I’ll grant you that if you admit Microsoft has no VMotion or hot migration at all.

Product Hypervisor Technology: We already covered this in the Drivers Support section.

Epic virtualization and storage blogger Scott Lowe provides his responses here.

Mohamed Fawzi, while it is nice to meet you, it is unfortunate that we met under these terms.  Having just discovered your blog today, I hope you don’t mind if I take a look at some of your other material as it looks like you’ve been at the blogging for a while (much longer than I).  I hope to find some good and interesting reads.

Symantec declares VMware VMotion unsupported

November 18th, 2008

Bad news for VMware VI Enterprise customers everywhere. I just found out I have 110 unsupported production and development VMs in my datacenter. Symantec published Document ID 2008101607465248 on 10/15/08 removing VMware VMotion support from its Symantec Antivirus (SAV) and Symantec Endpoint Protection (SEP) products.

Operating systems impacted are: All Windows operating systems.

Reported issues include but are not limited to:

  • Client communication problems
  • Symantec Endpoint Protection Manager (SEPM) communication issues
  • Content update failures
  • Policy update failures
  • Client data does not get entered into the database
  • Replication failures

This is of grave concern as many enterprise datacenters and VDI deployments are going to be impacted. My personal take is that someone jumped the gun in publishing a document with mysteriously vague detail, but we’ll have to wait and see what shakes out.

I hope that VMware can approach Symantec to get this resolved ASAP. It’s in everyone’s best interest.

Thank you vinternals for the heads up on this.

Update: Symantec has updated their support document stating that the problems a few customers have seen may or may not be related to VMware and VMotion. Until further notice, Symantec is supporting their products on VMware with VMotion. If you experience an issue with Symantec products, please contact Symantec technical support. This confirms my opinion that someone at Symantec jumped the gun by issuing the 10/15/08 support document stating VMware and VMotion are unsupported. Everyone can breathe a sigh of relief now. Or at least I can.

Live migration between CPU vendors demonstrated by AMD and Red Hat

November 11th, 2008

Live migration (VMotion in VMware speak) across AMD and Intel processors is a feature we don’t have today and a technology that many would describe as nearly impossible.

The capability could be in your datacenter sooner than you think. Last Thursday, the Inquirer published an article along with a video where Red Hat and AMD demonstrate the process (of course using streaming video and sound to drive home the point of no interruption), proving that it is possible and that the technology to do so may not be so far off. The article goes on to explain that not only can live migration occur between CPU vendors, the same or similar technology can be used to live migrate between CPU architectures from the same vendor (i.e., AMD Barcelona Opteron <–> AMD Shanghai Opteron).

Take a look at the video: