VMworld 2014 U.S. Top Ten Sessions

August 28th, 2014 by jason

Following is the tabulated list of VMworld 2014 U.S. top ten sessions as of noon PST 8/28/14. If you plan on catching up on recorded sessions later, this top ten list deserves strong consideration. Nice job to all of the presenters on this list, as well as to all presenters at VMworld.

Tuesday – STO1965.1 – Virtual Volumes Technical Deep Dive
Rawlinson Rivera, VMware
Suzy Visvanathan, VMware

Tuesday – NET1674 – Advanced Topics & Future Directions in Network Virtualization with NSX
Bruce Davie, VMware

Tuesday – BCO1916.1 – Site Recovery Manager and Stretched Storage: Tech Preview of a New Approach to Active-Active Data Centers
Shobhan Lakkapragada, VMware
Aleksey Pershin, VMware

Tuesday – INF1522 – vSphere With Operations Management: Monitoring the Health, Performance and Efficiency of vSphere with vCenter Operations Manager
Kyle Gleed, VMware
Ryan Johnson, VMware

Tuesday – SDDC3327 – The Software-defined Datacenter, VMs, and Containers: A “Better Together” Story
Kit Colbert, VMware

Tuesday – SDDC1600 – Art of IT Infrastructure Design: The Way of the VCDX – Panel
Mark Gabryjelski, Worldcom Exchange, Inc.
Mostafa Khalil, VMware
Chris McCain, VMware
Michael Webster, Nutanix, Inc.

Tuesday – VAPP1318.1 – Virtualizing Databases Doing IT Right – The Sequel
Michael Corey, Ntirety – A Division of Hosting
Jeff Szastak, VMware

Tuesday – SEC1959-S – The “Goldilocks Zone” for Security
Martin Casado, VMware
Tom Corn, VMware

Monday – HBC1533.1 – How to Build a Hybrid Cloud – Steps to Extend Your Datacenter
Chris Colotti, VMware
David Hill, VMware

Monday – INF1503 – Virtualization 101
Michael Adams, VMware

VMware vCenter Site Recovery Manager 5.8 First Look

August 26th, 2014 by jason

VMware vCenter Site Recovery Manager 5.8 made its debut this week at VMworld 2014 in San Francisco. Over the past few weeks I’ve had my hands on a release candidate version, and I’ve put together a short series of videos highlighting what’s new and providing a first look at SRM management through the new web client plug-in. I hope you enjoy.

I’ll be at VMworld through the end of the week.  Stop and say Hi – I’d love to meet you.


VMware vCenter Site Recovery Manager 5.8 Part 1

VMware vCenter Site Recovery Manager 5.8 Part 2

VMware vCenter Site Recovery Manager 5.8 Part 3

Legacy vSphere Client Plug-in 1.7 Released for Storage Center

July 23rd, 2014 by jason

Dell Compellent Storage Center customers who use the legacy vSphere Client plug-in to manage their storage may have noticed that the upgrade to PowerCLI 5.5 R2, which was released with vSphere 5.5 Update 1, essentially “broke” the plug-in. This forced customers to choose between staying on PowerCLI 5.5 in order to keep using the legacy vSphere Client plug-in, or reaping the benefits of the PowerCLI 5.5 R2 upgrade and abandoning the legacy plug-in.

For those who are unaware, there is a third option: leverage vSphere’s next generation web client along with the web client plug-in released by Dell Compellent last year (I talked about it at VMworld 2013, which you can take a quick look at below).

Although VMware strongly encourages customers to migrate to the next generation web client long term, I’m here to tell you that in the interim Dell has rev’d the legacy client plug-in to version 1.7, which is now compatible with PowerCLI 5.5 R2. Both the legacy and web client plug-ins are free and quite beneficial from an operations standpoint, so I encourage customers to get familiar with the tools and use them.
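
If you’re not sure which PowerCLI release a given workstation is running before deciding which plug-in to use, a quick check from a PowerCLI prompt settles it (a minimal sketch; the version and build strings will vary by installation):

  # From a PowerCLI prompt, report the installed PowerCLI release and build number
  Get-PowerCLIVersion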

Other bug fixes in this 1.7 release include:

  • Datastore name validation not handled properly
  • Create Datastore, map existing volume – Server Mapping will be removed from SC whether or not it was created by VSP
  • Add Raw Device wizard does not allow a host to be unchecked once selected
  • Remove Raw Device wizard shows wrong volume size
  • Update to use new code signing certificate
  • Prevent Datastores & RDMs with underlying Live Volumes from being expanded or deleted
  • Add support for additional Flash Optimized Storage Profiles that were added in SC 6.4.2
  • Block size not offered when creating VMFS-3 Datastore from Datacenter menu item
  • Add Raw Device wizard does not allow a host within the same cluster as the selected host to be unchecked once it has been selected
  • Add RDM wizard – properties screen showing wrong or missing values
  • Expire Replay wizard – no error reported if no replays selected
  • Storage Consumption stats are wrong if a Disk folder has more than one Storage Type

The VMworld US Session Builder Is Now Open

July 14th, 2014 by jason

For those who didn’t hear the news on Twitter, the notice from VMware was email blasted this morning. I received mine at 9:03am CST.

Of the 455 sessions available, over 14% cover NSX and VSAN, which were the two major themes at last year’s show. This is almost equal to the total number of vSphere sessions available this year.

Go go go!

Yet another blog post about vSphere HA and PDL

July 14th, 2014 by jason

If you ended up here searching for information on PDL or APD, your evening or weekend plans may be cancelled at this point, and I’m sorry for you if that is the case. There are probably 101 or more online resources which discuss the interrelated vSphere storage topics of All Paths Down (known as APD), Permanent Device Loss (known as PDL), and vSphere High Availability (known as HA, and before dinosaurs roamed the Earth – DAS). To put it in perspective, I’ve quickly pulled together a short list of resources below using Google. I’ve read most of them:

VMware KB: Permanent Device Loss (PDL) and All-Paths

VMware KB: PDL AutoRemove feature in vSphere 5.5

Handling the All Paths Down (APD) condition – VMware Blogs

vSphere 5.5. Storage Enhancements Part 9 – PDL

Permanent Device Loss (PDL) enhancements in vSphere 5.0

APD (All Paths Down) and PDL (Permanent Device Loss

vSphere Metro Storage Cluster solutions and PDL’s

vSphere Metro Stretched Cluster with vSphere 5.5 and PDL

Change in Permanent Device Loss (PDL) behavior for 5.1

PDL AutoRemove – CormacHogan.com

How handle the APD issue in vSphere – vInfrastructure Blog

Interpreting SCSI sense codes in VMware ESXi and ESX

What’s New in VMware vSphere® 5.1 – Storage

vSphere configuration for handling APD/PDL – CloudXC

vSphere 5.1 Storage Enhancements – Part 4: All Paths Down

vSphere 5.5 nuggets: changes to disk – Yellow Bricks

ESXi host disk.terminateVMOnPDLDefault configuration

ESXi host VMkernel.Boot.terminateVMOnPDL configuration

vSphere HA in my opinion is a great feature. It has saved my backside more than once, both in the office and at home. Several books have been more or less dedicated to the topic, and yet it is so easy to use that an entire cluster and all of its running virtual machines can be protected with default parameters (common garden variety) with just two mouse clicks.

VMware’s roots began with compute virtualization, so when HA was originally released in VMware Virtual Infrastructure 3 (one major revision before it became the vSphere platform known today), the bits licensed and borrowed from Legato Automated Availability Manager (AAM) were designed to protect against marginal but historically documented amounts of x86 hardware failure, thereby reducing unplanned downtime and loss of virtualization capacity to a minimum. Basically, if an ESX host yields to issues relating to CPU, memory, or network, its VMs restart somewhere else in the cluster.

It wasn’t really until vSphere 5.0 that VMware began building high availability for storage into the platform, aside from legacy design components such as redundant fabrics, host bus adapters (HBAs), multipath I/O (MPIO), failback policies, and, with vSphere 4.0, the pluggable storage architecture (PSA) – not that any of these design items are irrelevant today; quite the opposite. vSphere 5.0 introduced Permanent Device Loss (PDL), which handles the unexpected loss of individual storage devices better than APD alone did. Subsequent vSphere 5.x revisions made further PDL improvements, such as better support for single LUN:single target arrays in 5.1. In short, the vSphere HA re-write (Legato served its purpose and is gone now) covers much of the storage gap such that in the event of certain storage related failures, HA will restart virtual machines, vApps, services, and applications somewhere else – again to minimize unplanned downtime. Fundamentally, this works just like HA when a vSphere host tips over, except here the storage tips over and HA is called to action. Note that HA can’t do much about an entire unfederated array failing – this is more about individual storage/host connectivity. Aside from gross negligence on the part of administrators, I believe the failure scenarios are most likely to resonate with non-uniform stretched or metro cluster designs. However, PDL can also occur in small intra-datacenter designs.

I won’t go into much more detail about the story that has unfolded with APD and the new features in vSphere 5.x because it has already been documented many times over in some of the links above. Let’s just say the folks starting out new with vSphere 5.1 and 5.5 had it better than I and many others did dealing with APD and hostd going dark. However, the trade-off for them is that they are going to have to deal with Software Defined * a lot longer than I will.

Although I mentioned earlier that vSphere HA is extremely simple to configure, I did also mention that was with default options, which cover a large majority of host-related failures. Configuring HA to restart VMs automatically, with no user intervention, in the event of a PDL condition is in theory just one configuration change for each host in the cluster. Where to configure it depends on the version of the vSphere host.

vSphere 5.0u1+/5.1: Disk.terminateVMOnPDLDefault = True (/etc/vmware/settings file on each host)

or

vSphere 5.5+: VMkernel.Boot.terminateVMOnPDL = yes (advanced setting on each host, check the box)

One thing about this configuration that had me chasing sense codes in vmkernel logs recently was the lack of clarity on the required host reboot. That’s mainly what prompted this article – I normally don’t cover something that has already been covered well by other writers unless there is something I can add, something was missed, or it has caused me personal pain (my blog + SEO = helps ensure I don’t suffer from the same problems twice). None of the online articles I had read about these configurations mentioned a host reboot requirement, and it’s not apparent that a host reboot is required until PDL actually happens and automatic VM restart via HA does not. The vSphere 5.5 documentation calls it out. Go figure. I’ll admit that sometimes I will refer to a reputable vMcBlog before the product documentation. So let the search engine results show: when configuring VMkernel.Boot.terminateVMOnPDL, a host reboot or restart is required. VMware KB 1038578 also calls out that as of vSphere 5.5 you must reboot the host for VMkernel.boot configuration changes to take effect. I’m not a big fan of HA or any other configuration being written into VMkernel.boot, requiring host or VSAN node performance/capacity outages when a change is made, but that is VMware Engineering’s decision and I’m sure there is a relevant reason for it aside from wanting more operational parity with the Windows operating system.
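
For those who would rather push the vSphere 5.5+ setting to every host in a cluster at once instead of clicking through each host, here is a minimal PowerCLI sketch. The vCenter and cluster names are hypothetical, the value may need to be expressed as $true, 1, or “TRUE” depending on the build, and the reboot requirement above still applies:

  # Enable VMkernel.Boot.terminateVMOnPDL on every host in a cluster (vSphere 5.5+)
  # Hypothetical vCenter and cluster names - substitute your own
  Connect-VIServer vcenter.lab.local
  foreach ($esx in Get-Cluster "Lab" | Get-VMHost) {
      Get-AdvancedSetting -Entity $esx -Name "VMkernel.Boot.terminateVMOnPDL" |
          Set-AdvancedSetting -Value $true -Confirm:$false
      # The change lands in VMkernel.boot and does not take effect until the host is rebooted
  }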

I’ll also reiterate Duncan Epping’s recommendation: if you’re already licensed for HA and have made the design and operational decision to allow HA to restart VMs in the event of a host failure, then the above configuration should be made on all vSphere clustered hosts, whether they are part of a stretched cluster or not, to protect against storage-related failures. A PDL boils down to one host losing all available paths to a LUN. Without the HA configuration change above, a storage-related failure results in user intervention being required to recover all of the virtual machines on the host tied to the failed device.

Lastly, it is mentioned in some of the links above but if this is your first reading on the subject, please allow me to point out that the configuration setting above is for Permanent Device Loss (PDL) conditions only. It is not meant to handle an APD event. The reason behind this is that the storage array is required to send a proper sense code to the vSphere host indicating a PDL condition.  If the entire array fails or is powered off ungracefully taking down all available paths to storage, it has no chance to send PDL sense codes to vSphere.  This would constitute an indefinite All Paths Down or APD condition where vSphere knows storage is unavailable, but is unsure about its return. PDL was designed to answer that question for vSphere, rather than let vSphere go on wondering about it for a long period of time, thus squandering any opportunities to proactively do something about it.

In reality there are a few other configuration settings (again, documented well in the links above) which fine-tune HA more precisely. You’ll almost always want to add these as well.

vSphere 5.0u1+: das.maskCleanShutdownEnabled = True (Cluster advanced options) – this is an accompanying configuration that helps vSphere HA distinguish between VMs that were powered on and should be restarted, and VMs that were already powered off when the PDL occurred and therefore don’t need to be – and, more importantly, probably should not be – restarted.
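
The cluster-level option can also be added from PowerCLI rather than the Cluster advanced options dialog. A sketch only, assuming a hypothetical cluster named “Lab”:

  # Add das.maskCleanShutdownEnabled = True as an HA advanced option on the cluster
  # Hypothetical cluster name "Lab" - substitute your own
  New-AdvancedSetting -Entity (Get-Cluster "Lab") -Type ClusterHA -Name "das.maskCleanShutdownEnabled" -Value "True" -Force -Confirm:$false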

vSphere 5.5+: Disk.AutoremoveOnPDL = 0 (advanced setting on each host) – this is a configuration I first read about on Duncan’s blog, where he recommends changing the value from the default of enabled to disabled so that a device is not automatically removed if it enters a PDL state. Aside from the LUN count a vSphere host can handle (255), VMware refers to a few cases where the stock configuration of automatically removing a PDL device may be desired, although VMware doesn’t specifically call out each circumstance beyond problems arising from hosts attempting to send I/O to a dead device. There may be more to come on this in the future, but for now preventing the removal may save fabric rescan time down the road if you can afford the LUN number expended. It will also serve as a good visual indicator in the vSphere Client that there is a problematic datastore that needs to be dealt with, in case the PDL automation restarts VMs without anybody noticing the event has occurred. If there are templates or powered-off VMs that were not evacuated by HA, the broken datastore will visually persist anyway.
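
If you decide to follow that recommendation, the same PowerCLI approach works here as well – a sketch, using the same hypothetical cluster name as above:

  # Disable automatic removal of PDL devices on each host in the cluster (vSphere 5.5+)
  Get-AdvancedSetting -Entity (Get-Cluster "Lab" | Get-VMHost) -Name "Disk.AutoremoveOnPDL" |
      Set-AdvancedSetting -Value 0 -Confirm:$false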

That’s the short list of configuration changes to make for HA VM restart. There are actually a few more here. For instance, fine-grained HA handling can be coordinated on a per-VM basis by modifying the advanced virtual machine option disk.terminateVMOnPDLDefault for each VM, or scsi#:#.terminateVMOnPDL to fine-tune HA on a per-virtual-disk basis for each VM (see the sketch below). I’m definitely not recommending touching these if the situation does not call for it.
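
If a corner case ever does call for it, the per-VM override is just an additional VMX advanced configuration parameter, which PowerCLI can add without hand-editing the file. A sketch only, with a hypothetical VM name, and again not a recommendation:

  # Per-VM override: add disk.terminateVMOnPDLDefault to the VM's advanced configuration
  # Hypothetical VM name "sql01" - only if the situation truly calls for it
  New-AdvancedSetting -Entity (Get-VM "sql01") -Name "disk.terminateVMOnPDLDefault" -Value "TRUE" -Confirm:$false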

In a stock vSphere configuration with VMkernel.Boot.terminateVMOnPDL = no configured (or unintentionally misconfigured I suppose), the following events occur for an impacted virtual machine:

  1. PDL event occurs, sense codes are received and vSphere correctly identifies the PDL condition on the supporting datastore. A question is raised by vSphere for each impacted virtual machine to Retry I/O or Cancel I/O.
  2. Stop. Nothing else happens until each of the questions above is answered with administrator intervention. Answering Retry without the PDL datastore coming back online, or without hot removing the impacted virtual disk (in most cases the .vmx will be impacted anyway and hot removing disks is next to pointless), pretty much sends the VM to hell. Answering Cancel allows HA to proceed with powering off the VM and restarting it on another host with access to the device which went PDL on the original host.

In a modified vSphere configuration with VMkernel.Boot.terminateVMOnPDL = yes configured, the following events occur for an impacted virtual machine:

  1. PDL event occurs, sense codes are received and vSphere correctly identifies the PDL condition on the supporting datastore. A question is raised by vSphere for each impacted virtual machine to Retry I/O or Cancel I/O.
  2. Due to VMkernel.Boot.terminateVMOnPDL = yes, vSphere HA automatically and effectively answers Cancel for each impacted VM with a pending question. Again, if the hosts aren’t rebooted after the VMkernel.Boot.terminateVMOnPDL = yes configuration change, this step will mimic the previous scenario, essentially resulting in a failure to automatically carry out the desired tasks.
  3. Each VM is powered off.
  4. Each VM is powered on.

I’ll note that in the VM Event examples above, leveraging the power of Snagit, I’ve cut out some of the noise about alarms triggering gray and green, resource allocations changing, etc.

For completeness, following is a list of the PDL sense codes vSphere is looking for from the supported storage array:

SCSI sense code – Description
H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0 – LOGICAL UNIT NOT SUPPORTED
H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x4c 0x0 – LOGICAL UNIT FAILED SELF-CONFIGURATION
H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x3e 0x3 – LOGICAL UNIT FAILED SELF-TEST
H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x3e 0x1 – LOGICAL UNIT FAILURE

Two isolated examples of PDL taking place, as seen in /var/log/vmkernel.log:

Example 1:

2014-07-13T20:47:03.398Z cpu13:33486)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x2a (0x4136803b8b80, 32789) to dev “naa.6000d31000ebf600000000000000006c” on path “vmhba2:C0:T0:L30” Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe. Act:EVAL
2014-07-13T20:47:03.398Z cpu13:33486)ScsiDeviceIO: 2324: Cmd(0x4136803b8b80) 0x2a, CmdSN 0xe1 from world 32789 to dev “naa.6000d31000ebf600000000000000006c” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe.
2014-07-13T20:47:03.398Z cpu13:33486)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x2a (0x413682595b80, 32789) to dev “naa.6000d31000ebf600000000000000007c” on path “vmhba2:C0:T0:L2” Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0. Act:FAILOVER

Example 2:

2014-07-14T00:43:49.720Z cpu4:32994)ScsiDeviceIO: 2337: Cmd(0x412e82f11380) 0x85, CmdSN 0x33 from world 34316 to dev “naa.600508b1001c6e17d603184d3555bf8d” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2014-07-14T00:43:49.731Z cpu4:32994)ScsiDeviceIO: 2337: Cmd(0x412e82f11380) 0x4d, CmdSN 0x34 from world 34316 to dev “naa.600508b1001c6e17d603184d3555bf8d” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2014-07-14T00:43:49.732Z cpu4:32994)ScsiDeviceIO: 2337: Cmd(0x412e82f11380) 0x1a, CmdSN 0x35 from world 34316 to dev “naa.600508b1001c6e17d603184d3555bf8d” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2014-07-14T00:48:03.398Z cpu10:33484)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x2a (0x4136823b2dc0, 32789) to dev “naa.60060160f824270012f6aa422e0ae411” on path “vmhba1:C0:T2:L40” Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0. Act:FAILOVER

In no particular order, I want to thank Duncan, Paudie, Cormac, Mohammed, Josh, Adam, Niran, and MAN1$H for providing some help on this last week.

By the way, don’t name your virtual machines or datastores PDL. It’s bad karma.

VMware vCenter Operations Manager Essentials

July 10th, 2014 by jason

A new vSphere book has just arrived and has been added to my library. The book’s title is VMware vCenter Operations Manager Essentials, authored by Technical Virtualization Architect and vExpert Lauren Malhoit (@malhoit), with reviews from Michael Poore, Mike Preston, and Chris Wahl.

I ordered this book while attending Dell User Forum a few weeks ago, where I spoke in breakout sessions on vC Ops and the new Dell Storage adapters for vC Ops.

“This book is written for administrators, engineers, and architects of VMware vSphere as well as those who have or are interested in purchasing the vCenter Operations Manager Suite. It will particularly help administrators who are hoping to use vCenter Operations Manager to optimize their VMware environments as well as quickly troubleshoot both long-term and short-term issues.”

Skimming through the chapter list covering 236 pages, it looks like it’s going to be a pretty good read.

Chapter 1: Introduction to vCenter Operations Manager

Chapter 2: Installing vCenter Operations Manager

Chapter 3: Dashboards and Badges (badges?…. had to be said)

Chapter 4: Troubleshooting Our Virtual Environment with vCenter Operations Manager

Chapter 5: Capacity Planning with vCenter Operations Manager

Chapter 6: Reports

Chapter 7: vCenter Configuration Manager

Chapter 8: Log Insight

Chapter 9: VMware Horizon View Integration with vCenter Operations Manager

Chapter 10: vCenter Infrastructure Navigator

Chapter 11: EMC Storage Analytics

Why did I pick up this book? vC Ops is extremely powerful and has a bit of a learning curve to it. That is what resonated with me the most when I first began using the product. Over time, vCenter has become an integral component in VMware vSphere virtualized datacenters, and it will continue to be as more and more applications and services are integrated with and become dependent on it. vC Ops ties together many datacenter infrastructure pieces and allows virtualization, IaaS, cloud computing, and VDI to be delivered more intelligently. I would like to learn more about vC Ops and hopefully pick up some helpful tips on building custom dashboards with stock and add-on adapters/collectors as well as custom widgets.

Drive-through Automation with PowerGUI

July 9th, 2014 by jason

One of the interesting aspects of shared infrastructure is stumbling across configuration changes made by others who share responsibility in managing the shared environment. This is often the case in the lab, but I’ve also seen it in every production environment I’ve supported to date. I’m not pointing any fingers – my back yard is by no means immaculate. Moreover, this bit is about automation, not placing blame (note that the former is productive while the latter is not).

Case in point this evening, when I was attempting to perform a simple remediation of a vSphere 5.1 four-host cluster via Update Manager. I verified the patches and cluster configuration, hit the remediate button in VUM, and left the office. VUM, DRS, and vMotion do the heavy lifting. I’ve done it a thousand times or more in the past, in environments 100x this size.

I wrap up my 5pm appointment on the way home from the office, have dinner with the family, and VPN into the network to verify all the work was done. Except nothing had been accomplished. Remediation on the cluster was a failure. Looking at the VUM logs reveals that 75% of the hosts being remediated contain virtual machines with attached devices, preventing VUM, DRS, and vMotion from carrying out the remediation.

Obviously I know how to solve this problem, but manually checking and stripping every VM of its offending device is going to take way too long. I know what I’m supposed to do here. I can hear the voices in my head of PowerShell gurus Alan, Luc, etc. saying over and over the well-known automation battle cry: “anything repeated more than once should be scripted!”

That’s all well and good, I completely get it, but I’m in that all too familiar place of:

  1. Carrying out the manual tasks will take 30 minutes.
  2. Authoring, finding, testing a suitable PowerShell/PowerCLI script to automate will also take 30 minutes, probably more.
  3. FML, I didn’t budget time for either of the above.

There is a middle ground. I view it as drive-through efficiency automation. It’s called PowerGUI and it has been around almost forever. In fact, it comes from Quest, which my employer now owns. And I’ve already got it, along with the PowerPacks and plug-ins, installed on my new Dell Precision M4800 laptop. Establishing a PowerGUI session and authenticating with my current infrastructure couldn’t be easier. From the legacy vSphere Client, choose the Plug-ins pull-down, then PowerGUI Administrative Console.

The VMware vSphere Management PowerPack ships stock with not only the VM query to find all VMs with offending devices attached, but also a method to highlight all the VMs and Disconnect.
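
For the PowerCLI-inclined, the equivalent of that PowerPack query and disconnect is only a few lines. A sketch that finds connected CD-ROM and floppy devices across the inventory and disconnects them (scope the Get-VM call to your own cluster or folder as needed):

  # Find VMs with connected CD-ROM devices and disconnect them (also clears any mapped ISO)
  Get-VM | Get-CDDrive | Where-Object { $_.ConnectionState.Connected } |
      Set-CDDrive -NoMedia -Confirm:$false

  # Same idea for connected floppy devices
  Get-VM | Get-FloppyDrive | Where-Object { $_.ConnectionState.Connected } |
      Set-FloppyDrive -NoMedia -Confirm:$false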

Depending on the type of device connected to the virtual machines, VUM may also be able to handle the issue, as it has the native ability to disable any removable media devices connected to the virtual machines on the host. In this case, the problem is solved with automation (I won’t get beat up on Twitter) and free community (now Dell) automation tools. Remediation completed.

RVTools (current version 3.6) also has identical functionality to quickly locate and disconnect various devices across a virtual datacenter.  Click on the image below to read more about RVTools.

Click on the image below to read more about PowerGUI.