Archive for the ‘Virtualization’ category

VMware Update Manager Becomes Self-Aware

March 4th, 2010

@Mikemohr on Twitter tonight said it best:

“Haven’t we learned from Hollywood what happens when the machines become self-aware?”

I got a good chuckle.  He took my comment about VMware becoming “self-aware” exactly where I wanted it to go: a reference to The Terminator series of films, in which a sophisticated computer defense system called Skynet becomes self-aware and things go downhill for mankind from there.

Metaphorically speaking in today’s case, Skynet is VMware vSphere and mankind is represented by VMware vSphere Administrators.

During an attempt to patch my ESX(i) 4 hosts, I received an error message.

At that point, the remediation task fails and the host is not patched.  The VUM log file reflects the same error in a little more detail:

[2010-03-04 14:58:04:690 ‘JobDispatcher’ 3020 INFO] [JobDispatcher, 1616] Scheduling task VciHostRemediateTask{675}
[2010-03-04 14:58:04:690 ‘JobDispatcher’ 3020 INFO] [JobDispatcher, 354] Starting task VciHostRemediateTask{675}
[2010-03-04 14:58:04:690 ‘VciHostRemediateTask.VciHostRemediateTask{675}’ 2676 INFO] [vciTaskBase, 534] Task started…
[2010-03-04 14:58:04:908 ‘VciHostRemediateTask.VciHostRemediateTask{675}’ 2676 INFO] [vciHostRemediateTask, 680] Host host-112 scheduled for patching.
[2010-03-04 14:58:05:127 ‘VciHostRemediateTask.VciHostRemediateTask{675}’ 2676 INFO] [vciHostRemediateTask, 691] Add remediate host: vim.HostSystem:host-112
[2010-03-04 14:58:13:987 ‘InventoryMonitor’ 2180 INFO] [InventoryMonitor, 427] ProcessUpdate, Enter, Update version := 15936
[2010-03-04 14:58:13:987 ‘InventoryMonitor’ 2180 INFO] [InventoryMonitor, 460] ProcessUpdate: object = vm-2642; type: vim.VirtualMachine; kind: 0
[2010-03-04 14:58:17:533 ‘VciHostRemediateTask.VciHostRemediateTask{675}’ 2676 WARN] [vciHostRemediateTask, 717] Skipping host solo.boche.mcse as it contains VM that is running VUM or VC inside it.
[2010-03-04 14:58:17:533 ‘VciHostRemediateTask.VciHostRemediateTask{675}’ 2676 INFO] [vciHostRemediateTask, 786] Skipping host 0BC5A140, none of upgrade and patching is supported.
[2010-03-04 14:58:17:533 ‘VciHostRemediateTask.VciHostRemediateTask{675}’ 2676 ERROR] [vciHostRemediateTask, 230] No supported Hosts found for Remediate.
[2010-03-04 14:58:17:737 ‘VciRemediateTask.RemediateTask{674}’ 2676 INFO] [vciTaskBase, 583] A subTask finished: VciHostRemediateTask{675}
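If you want to pull just the warnings and errors out of a log like this, here is a minimal Python sketch. The line pattern is inferred purely from the excerpt above and is an assumption, not a documented VUM log grammar; the names `LINE_RE`, `parse_line`, and `problems` are my own.

```python
import re

# Pattern inferred from the excerpt above (an assumption, not an official format).
# Accepts straight or curly quotes around the thread name.
LINE_RE = re.compile(
    r"^\[(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}:\d{3}) "
    r"['\u2018](?P<thread>[^'\u2019]+)['\u2019] "
    r"(?P<tid>\d+) (?P<level>[A-Z]+)\] "
    r"\[(?P<source>[^,\]]+), (?P<line>\d+)\] "
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def problems(lines, levels=("WARN", "ERROR")):
    """Return the parsed records whose level is in `levels`."""
    records = (parse_line(line) for line in lines)
    return [rec for rec in records if rec and rec["level"] in levels]


SAMPLE = [
    "[2010-03-04 14:58:05:127 ‘VciHostRemediateTask.VciHostRemediateTask{675}’ 2676 INFO] [vciHostRemediateTask, 691] Add remediate host: vim.HostSystem:host-112",
    "[2010-03-04 14:58:17:533 ‘VciHostRemediateTask.VciHostRemediateTask{675}’ 2676 WARN] [vciHostRemediateTask, 717] Skipping host solo.boche.mcse as it contains VM that is running VUM or VC inside it.",
    "[2010-03-04 14:58:17:533 ‘VciHostRemediateTask.VciHostRemediateTask{675}’ 2676 INFO] [vciHostRemediateTask, 786] Skipping host 0BC5A140, none of upgrade and patching is supported.",
    "[2010-03-04 14:58:17:533 ‘VciHostRemediateTask.VciHostRemediateTask{675}’ 2676 ERROR] [vciHostRemediateTask, 230] No supported Hosts found for Remediate.",
]

if __name__ == "__main__":
    for rec in problems(SAMPLE):
        print(rec["timestamp"], rec["level"], rec["message"])
```

In this excerpt the WARN line is the one that actually explains the failure; the ERROR line only reports the end result.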

Further testing in the lab revealed that this condition is triggered when the host is running a vCenter VM and/or a VMware Update Manager (VUM) VM. I understand from colleagues in the Twitterverse that they’ve seen the same symptoms occur with patch staging.

The workaround is to manually place the host in maintenance mode, at which time it has no problem whatsoever evacuating all VMs, including infrastructure VMs.  At that point, the host in maintenance mode can be remediated.

VMware Update Manager has apparently become self-aware in that it detects when its infrastructure VMs are running on the same host hardware which is to be remediated.  Self-awareness in and of itself isn’t bad; its feature integration is.  Unfortunately for the humans, this is a step backwards in functionality and a reduction in efficiency for a task which was once automated.  Previously, a remediation task had no problem evacuating all VMs from a host, infrastructure or not. What we have now is… well… consider the following pre and post “self-awareness” remediation steps:

Pre “self-awareness” remediation for a 6 host cluster containing infrastructure VMs:

  1. Right click the cluster object and choose Remediate
  2. Hosts are automatically and sequentially placed in maintenance mode, evacuated, patched, rebooted, and brought out of maintenance mode

Post “self-awareness” remediation for a 6 host cluster containing infrastructure VMs:

  1. Right click Host1 object and choose Enter Maintenance Mode
  2. Wait for evacuation to complete
  3. Right click Host1 object and choose Remediate
  4. Wait for remediation to complete
  5. Right click Host1 object and choose Exit Maintenance Mode
  6. Right click Host2 object and choose Enter Maintenance Mode
  7. Wait for evacuation to complete
  8. Right click Host2 object and choose Remediate
  9. Wait for remediation to complete
  10. Right click Host2 object and choose Exit Maintenance Mode
  11. Right click Host3 object and choose Enter Maintenance Mode
  12. Wait for evacuation to complete
  13. Right click Host3 object and choose Remediate
  14. Wait for remediation to complete
  15. Right click Host3 object and choose Exit Maintenance Mode
  16. Right click Host4 object and choose Enter Maintenance Mode
  17. Wait for evacuation to complete
  18. Right click Host4 object and choose Remediate
  19. Wait for remediation to complete
  20. Right click Host4 object and choose Exit Maintenance Mode
  21. Right click Host5 object and choose Enter Maintenance Mode
  22. Wait for evacuation to complete
  23. Right click Host5 object and choose Remediate
  24. Wait for remediation to complete
  25. Right click Host5 object and choose Exit Maintenance Mode
  26. Right click Host6 object and choose Enter Maintenance Mode
  27. Wait for evacuation to complete
  28. Right click Host6 object and choose Remediate
  29. Wait for remediation to complete
  30. Right click Host6 object and choose Exit Maintenance Mode

It’s Saturday and your kids want to go to the park. Do the math.
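For anyone who does want to do the math, here is a small Python sketch of the arithmetic. It only generates the manual checklist from the post “self-awareness” steps listed above; it does not talk to vCenter, and the function name `manual_steps` is my own.

```python
def manual_steps(hosts):
    """Return the click-by-click workaround steps for the given host names."""
    steps = []
    for host in hosts:
        steps += [
            f"Right click {host} object and choose Enter Maintenance Mode",
            "Wait for evacuation to complete",
            f"Right click {host} object and choose Remediate",
            "Wait for remediation to complete",
            f"Right click {host} object and choose Exit Maintenance Mode",
        ]
    return steps


if __name__ == "__main__":
    hosts = [f"Host{n}" for n in range(1, 7)]  # the 6 host cluster from the example
    steps = manual_steps(hosts)
    for number, step in enumerate(steps, start=1):
        print(f"{number}. {step}")
    # 5 manual steps per host: 30 for 6 hosts, versus 2 steps before the change
    print(len(steps))
```

The manual effort grows by five steps with every host added to the cluster, where the old cluster-level remediation stayed at two steps regardless of cluster size.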

Update 5/5/10: I received this response back on 3/5/10 from VMware but failed to follow up with finding out if it was ok to share with the public.  I’ve received the blessing now so here it is:

[It] seems pretty tactical to me. We’re still trying to determine if this was documented publicly, and if not, correct the documentation and our processes.

We introduced this behavior in vSphere 4.0 U1 as a partial fix for a particular class of problem. The original problem is in the behavior of the remediation wizard if the user has chosen to power off or suspend virtual machines in the Failure response option.

If a stand-alone host is running a VM with VC or VUM in it and the user has selected those options, the consequences can be drastic – you usually don’t want to shut down your VC or VUM server when the remediation is in progress. The same applies to a DRS disabled cluster.

In DRS enabled cluster, it is also possible that VMs could not be migrated to other hosts for configuration or other reasons, such as a VM with Fault Tolerance enabled. In all these scenarios, it was possible that we could power off or suspend running VMs based on the user selected option in the remediation wizard.

To avoid this scenario, we decided to skip those hosts totally in first place in U1 time frame. In a future version of VUM, it will try to evacuate the VMs first, and only in cases where it can’t migrate them will the host enter a failed remediation state.

One work around would be to remove such a host from its cluster, patch the cluster, move the host back into the cluster, manually migrate the VMs to an already patched host, and then patch the original host.

It would appear VMware intends to grant us back some flexibility in future versions of vCenter/VUM.  Let’s hope so. This implementation leaves much to be desired.

Update 5/6/10: LucD created a blog post titled Counter the self-aware VUM. In this blog post you’ll find a script which finds the ESX host(s) that is/are running the VUM guest and/or the vCenter guest and will vMotion the guest(s) to another ESX host when needed.

11 New ESX(i) 4.0 Patch Definitions Released; 6 Critical

March 3rd, 2010

Eleven new patch definitions have been released for ESX(i) 4.0 (7 for ESX, 2 for ESXi, 2 for the Cisco Nexus 1000V).  Previous versions of ESX(i) are not impacted.

6 of the 11 patch definitions are rated critical and should be evaluated quickly for application in your virtual infrastructure.

ID: ESX400-201002401-BG Impact: Critical Release date: 2010-03-03 Products: esx 4.0.0 Updates vmkernel64,vmx,hostd etc

This patch provides support and fixes the following issues:

  • On some systems under heavy networking and processor load (large number of virtual machines), some NIC drivers might randomly attempt to reset the device and fail.
    The VMkernel logs generate the following messages every second:
    Oct 13 05:19:19 vmkernel: 0:09:22:33.216 cpu2:4390)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic1: transmit timed out
    Oct 13 05:19:20 vmkernel: 0:09:22:34.218 cpu8:4395)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic1: transmit timed out
  • ESX hosts do not display the proper status of the NFS datastore after recovering from a connectivity loss.
    Symptom: In vCenter Server, the NFS datastore is displayed as inactive.
  • When using NPIV, if the LUN on the physical HBA path is not same as the LUN on the virtual port (VPORT) path, though the LUNID:TARGETID pairs are same, then I/O might be directed to the wrong LUN causing a possible data corruption. Refer KB 1015290 for more information.
    Symptom: If NPIV is not configured properly, I/O might be directed to the wrong disk.
  • On Fujitsu systems, the OEM-IPMI-Command-Handler that lists the available OEM IPMI commands do not work as intended. No custom OEM IPMI commands are listed, though they were initialized correctly by the OEM. After applying this fix, running the VMware_IPMIOEMExtensionService and VMware_IPMIOEMExtensionServiceImpl objects displays the supported commands as listed in the command files.
  • Provides prebuilt kernel module drivers for Ubuntu 9.10 guest operating systems.
  • Adds support for upstreamed kernel PVSCSI and vmxnet3 modules.
  • Provides a change to the maintenance mode requirement during Cisco Nexus 1000V software upgrade. After installing this patch if you perform Cisco Nexus 1000V software upgrade, the ESX host goes into maintenance mode during the VEM upgrade.
  • In certain race conditions, freeing journal blocks from VMFS filesystems might fail. The WARNING: J3: 1625: Error freeing journal block (returned 0) <FB 428659> for 497dd872-042e6e6b-942e-00215a4f87bb: Lock was not free error is written to the VMware logs.
  • Changing the resolution of the guest operating system over a PCoIP connection (desktops managed by View 4.0) might cause the virtual machine to stop responding.
    Symptoms: The following symptoms might be visible:

    • When you try to connect to the virtual machine through a vCenter Server console, a black screen appears with the Unable to connect to MKS: vmx connection handshake failed for vmfs {VM Path} message.
    • Performance graphs for CPU and memory usage in vCenter Server drop to 0.
    • Virtual machines cannot be powered off or restarted.

ID: ESX400-201002402-BG Impact: Critical Release date: 2010-03-03 Products: esx 4.0.0 Updates initscripts

This patch fixes an issue where pressing Ctrl+Alt+Delete on service console causes ESX 4.0 hosts to reboot.

ID: ESX400-201002404-SG Impact: HostSecurity Release date: 2010-03-03 Products: esx 4.0.0 Updates glib2

The service console package for GLib2 is updated to version glib2-2.12.3-4.el5_3.1. This GLib update fixes an issue where the functions inside GLib incorrectly allows multiple integer overflows leading to heap-based buffer overflows in GLib’s Base64 encoding and decoding functions. This might allow an attacker to possibly execute arbitrary code while a user is running the application. The Common Vulnerabilities and Exposures Project (cve.mitre.org) has assigned the name CVE-2008-4316 to this issue.

ID: ESX400-201002405-BG Impact: Critical Release date: 2010-03-03 Products: esx 4.0.0 Updates megaraid-sas

This patch fixes an issue where some applications do not receive events even after registering for Asynchronous Event Notifications (AEN). This issue occurs when multiple applications register for AENs.

ID: ESX400-201002406-SG Impact: HostSecurity Release date: 2010-03-03 Products: esx 4.0.0 Updates newt

The service console package for Newt library is updated to version newt-0.52.2-12.el5_4.1. This security update of Newt library fixes an issue where an attacker might cause a denial of service or possibly execute arbitrary code with the privileges of a user who is running applications using the Newt library. The Common Vulnerabilities and Exposures Project (cve.mitre.org) has assigned the name CVE-2009-2905 to this issue.

ID: ESX400-201002407-SG Impact: HostSecurity Release date: 2010-03-03 Products: esx 4.0.0 Updates nfs-utils

The service console package for nfs-utils is updated to version nfs-utils-1.0.9-42.el5. This security update of nfs-utils fixes an issue that might permit a remote attacker to bypass an intended access restriction. The Common Vulnerabilities and Exposures Project (cve.mitre.org) has assigned the name CVE-2008-4552 to this issue.

ID: ESX400-201002408-BG Impact: Critical Release date: 2010-03-03 Products: esx 4.0.0 Updates Enic driver

In scenarios where Pass Thru Switching (PTS) is in effect, if virtual machines are powered on, the network interface might not come up. In PTS mode, when the network interface is brought up, PTS figures the MTU from the network. There is a race in this scenario, where the enic driver might incorrectly indicate that the driver fails. This issue might occur frequently on a CISCO UCS system. This patch fixes the issue.

ID: ESXi400-201002401-BG Impact: Critical Release date: 2010-03-03 Products: embeddedEsx 4.0.0 Updates Firmware

This patch provides support and fixes the following issues:

  • On some systems under heavy networking and processor load (large number of virtual machines), some NIC drivers might randomly attempt to reset the device and fail.
    The VMkernel logs generate the following messages every second:
    Oct 13 05:19:19 vmkernel: 0:09:22:33.216 cpu2:4390)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic1: transmit timed out
    Oct 13 05:19:20 vmkernel: 0:09:22:34.218 cpu8:4395)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic1: transmit timed out
  • ESX hosts do not display the proper status of the NFS datastore after recovering from a connectivity loss.
    Symptom: In vCenter Server, the NFS datastore is displayed as inactive.
  • When using NPIV, if the LUN on the physical HBA path is not same as the LUN on the virtual port (VPORT) path, though the LUNID:TARGETID pairs are same, then I/O might be directed to the wrong LUN causing a possible data corruption. Refer KB 1015290 for more information.
    Symptom: If NPIV is not configured properly, I/O might be directed to the wrong disk.
  • On Fujitsu systems, the OEM-IPMI-Command-Handler that lists the available OEM IPMI commands do not work as intended. No custom OEM IPMI commands are listed, though they were initialized correctly by the OEM. After applying this fix, running the VMware_IPMIOEMExtensionService and VMware_IPMIOEMExtensionServiceImpl objects displays the supported commands as listed in the command files.
  • Provides prebuilt kernel module drivers for Ubuntu 9.10 guest operating systems.
  • Adds support for upstreamed kernel PVSCSI and vmxnet3 modules.
  • Provides a change to the maintenance mode requirement during Cisco Nexus 1000V software upgrade. After installing this patch if you perform Cisco Nexus 1000V software upgrade, the ESX host goes into maintenance mode during the VEM upgrade.
  • In certain race conditions, freeing journal blocks from VMFS filesystems might fail. The WARNING: J3: 1625: Error freeing journal block (returned 0) <FB 428659> for 497dd872-042e6e6b-942e-00215a4f87bb: Lock was not free error is written to the VMware logs.
  • Changing the resolution of the guest operating system over a PCoIP connection (desktops managed by View 4.0) might cause the virtual machine to stop responding.
    Symptoms: The following symptoms might be visible:

    • When you try to connect to the virtual machine through a vCenter Server console, a black screen appears with the Unable to connect to MKS: vmx connection handshake failed for vmfs {VM Path} message.
    • Performance graphs for CPU and memory usage in vCenter Server drop to 0.
    • Virtual machines cannot be powered off or restarted.

ID: ESXi400-201002402-BG Impact: Critical Release date: 2010-03-03 Products: embeddedEsx 4.0.0 Updates VMware Tools

This patch fixes an issue where pressing Ctrl+Alt+Delete on service console causes ESX 4.0 hosts to reboot.

ID: VEM400-201002001-BG Impact: HostGeneral Release date: 2010-03-03 Products: embeddedEsx 4.0.0, esx 4.0.0 Cisco Nexus 1000V VEM

ID: VEM400-201002011-BG Impact: HostGeneral Release date: 2010-03-03 Products: embeddedEsx 4.0.0, esx 4.0.0 Cisco Nexus 1000V VEM
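As a quick sanity check on the counts quoted at the top of this post, the patch IDs and impact ratings listed above can be tallied in a few lines of Python. The data is copied from the list above; grouping by the text before `400-` in each ID is my own convention for the sketch, not a VMware naming rule.

```python
from collections import Counter

# (patch ID, impact) pairs copied from the definitions listed above.
PATCHES = [
    ("ESX400-201002401-BG", "Critical"),
    ("ESX400-201002402-BG", "Critical"),
    ("ESX400-201002404-SG", "HostSecurity"),
    ("ESX400-201002405-BG", "Critical"),
    ("ESX400-201002406-SG", "HostSecurity"),
    ("ESX400-201002407-SG", "HostSecurity"),
    ("ESX400-201002408-BG", "Critical"),
    ("ESXi400-201002401-BG", "Critical"),
    ("ESXi400-201002402-BG", "Critical"),
    ("VEM400-201002001-BG", "HostGeneral"),
    ("VEM400-201002011-BG", "HostGeneral"),
]

# Count patches per impact rating and per product family (ESX, ESXi, VEM).
by_impact = Counter(impact for _, impact in PATCHES)
by_family = Counter(patch_id.split("400-")[0] for patch_id, _ in PATCHES)

if __name__ == "__main__":
    print(len(PATCHES), dict(by_family), dict(by_impact))
```

The tally agrees with the summary: 11 definitions in total (7 ESX, 2 ESXi, 2 Nexus 1000V VEM), 6 of them rated critical.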

Thank You

February 27th, 2010

Once in a while, I’m a witness to acts of extraordinary kindness from a person or group of persons.  It may not occur on a regular basis, but when it does, it is something special to behold.  It happened this afternoon at the Minneapolis Area VMware Users Group (VMUG) meeting.

It started out as a fairly typical event.  I called the meeting to order, briefly went through some general business and current events in the VMware virtualization community, and then turned the meeting over to our first speakers Craig Drugge and Pavan Jhamnani of Syncsort.  I took a seat, prepared to learn about Syncsort’s data protection and rapid recovery technologies.  However that was not to be, at least not right away.  Instead, Pavan invited Michael Cardinal of ThinLaunch up on stage.  I was curious about what was transpiring since this was Syncsort’s hour and I wasn’t aware that ThinLaunch had any ties to Syncsort’s technology.

Michael took the stage with a white paper bag in hand and began speaking to the audience about a person he has known for a few years.  A person who digs virtualization.  A person whom he’d bumped into at VMware Partner Exchange early Wednesday morning at Starbucks Mandalay Bay.  I caught on pretty quickly that he was referring to me.  Michael proceeded to announce my recent VCDX certification accomplishment.  I thought that was extremely generous of him, but there was more.  Michael asked me to come up on stage where he presented me with a gift.  This was something that he, his wife, and Bill Hinkens (Territory Manager, VMware) collaborated on.  Michael turned the bag around to reveal the VMware diamond plate artwork along with my name and VCDX #34 on it.  Inside the bag was a black VMware fleece sweater, again with my name, VCDX, and #34 on it.  I was at a loss for words.  I accepted the gift, thanked Michael, and we took our seats. The meeting continued from its brief diversion.

The sweater, the bag, the presentation, the planning, the thought: these were all wonderful gifts from a group of people who went out of their way, and I will remember them for a long time.  Virtualization, for me, has built a great community of people and in many cases has yielded friendships at a professional as well as a personal level.  For that I am very thankful and each day I look forward to what the future brings.

Thank you.

RVTools 2.8.1 Released

February 21st, 2010

Rob de Veij has released version 2.8.1 of his stellar virtualization utility RVTools.  I love this free tool as it provides valuable information about my infrastructure in a fast and easy format.

New in this version:
– On vHost tab new field: number of running vCPUs
– On vSphere, VMs in a vApp were not displayed.
– Filter not working correctly when annotations or custom fields contain a null value.
– When NTP server(s) = null the time info fields are not displayed on the vHost tab page.
– When a datastore name or virtual machine name contains spaces the inconsistent folder name check was not working correctly.
– Tools health check now only executed for running VMs.

Go download this tool today and be sure to tell Rob how much you appreciate his development efforts!

VMware, much of this information is vital as it pertains to configuration maximums and should be available in the VMware vSphere Client for capacity planning purposes.

VCDX #34 – The Conclusion of a Journey

February 19th, 2010

Last Sunday I wrote about my VCDX Defense experience. This evening I am fortunate enough to share the news that I have passed the final board review and have achieved VCDX certification. I was awarded VCDX #34.  For the others who defended last week in Las Vegas, I offer my congratulations to you all on a job which I’m sure was well done.  Without a doubt, it was a journey which I’m sure will benefit me for many years to come.  I’m proud to have walked down a path paved by so much collective brilliance before me. I am inspired and driven by the knowledge shared in the virtualization community. I hope that I can continue to provide the best I have to offer in return.

It is not my intent to turn this into the Academy Awards, but I would be extremely negligent if I didn’t thank the key people who devoted their time to ensure my success by reviewing my design and challenging me with questions, as well as those who provided tips and encouragement for the defense.  I had several weaknesses exposed and with your help I was able to strengthen those areas prior to my defense.

Amy (I didn’t receive your note until after the defense, but I was really touched. Your support, patience, and understanding is nothing short of amazing)
Gary Bowman (old guy… mock defense was very helpful!)
Gabrie Van Zanten (seriously, with the questions, you had too much fun…)
Roger Lund (great questions from you, thank you for taking the time)
David Davis (tremendous help from a CCIE… I’m not even worthy)
Scott Lowe (thank you for the offer and last minute design tips)
Michael Cardinal (Wednesday morning shot of confidence at Starbucks)
Rick Scherer (tips on calming nerves were great – I followed to a T)
John Arrasjid (so many great VCDX tips, invaluable!)
Duncan Epping (I got a lot more than breakfast out of you Tuesday morning, you don’t even know)
Frank Denneman (thank you for the help, confidence, & for not making faces at me)
Rich Brambley (UGG who told me Tuesday evening I can do this)
Andrew Hald (Tuesday dinner.. thank you for letting me join you)
Spencer Critchlow (your tips were invaluable!)
Doug Hazelman (Veeam played a helpful role in my design)
Dawn Theirl (thank you for the encouragement)

Tips for the Defense:
1) Know your design, I mean really know it.
2) Refer to tip #1

Good luck.

My VCDX Defense Experience

February 14th, 2010

Last Wednesday morning in Las Vegas, I participated in my VMware Certified Design Expert (VCDX) Defense.  A successful Defense is the last in a series of required steps to obtain VCDX certification.  Defense experiences have been shared by others such as Rick Scherer, Dave Convery, and Duncan Epping.  I found my own Defense experience to be similar to theirs.

Prior to the Defense, I submitted an application and a design for the panelists to review.  As Dave Convery pointed out, this may be the hardest part of the entire process as far as the volume of work goes.  The design is a complete set of documentation that must meet key requirements outlined in the application.  There is not a lot of time to complete the application and design once you are invited for that step.  My best advice would be to clear your schedule as much as possible to crank out quality documentation.  Also, be sure the application is filled out completely and the design covers all requirements.  Missing information risks outright rejection and you’ll likely miss the opportunity for the upcoming defense.  It is absolutely critical that all fields in the application are completed.  This cannot be stressed enough. The panelists will spend up to 8 hours reviewing the design.  The submitted documentation is more about quality than quantity. Be sure the documentation submitted is relevant to the design.  Any information the panelists cannot pull from the submitted design will need to be clarified during the defense, which is then a pressure situation for the candidate.

Once the application and design are accepted, the defense date is scheduled around a major VMware event, typically VMworld or Partner Exchange (PEX).  My defense was scheduled at PEX in Las Vegas.  As I am not a partner nor do I work for a partner, I did not attend PEX or any of its sessions.  I flew in on a Tuesday morning and left a day later, merely for the defense. This strategy is fine with me as I would rather stay focused on the defense and my design, and not face daily distractions and new information released at a conference.

During the days leading up to my defense, I felt very confident.  I had been studying my design and going over all the Enterprise Admin and Design exam study material on a daily basis.  I had been brushing up on white papers and blog articles for areas which I felt I was weak on or had forgotten details of.  I brought a 3 ring binder filled with about 400 pages of documentation as well as every VI3 published .pdf known to mankind on my thumb drive.  While I didn’t read all the .pdf files, they were with me if I needed them for reference.  As it turned out, a few of the documents I crammed on the night before my panel would play a nice role during part of my defense.

After arriving in Las Vegas Tuesday morning, my confidence level remained as high as ever.  I had spent the entire 3 hours on the plane reading out of my 3 ring binder.  Outside of having breakfast with a friend, I spent a good portion of Tuesday studying which was my intent in booking a Tuesday morning arrival.  Early Tuesday afternoon, exhaustion hit me like a ton of bricks. I decided to try to take a nap. I lay in bed for close to an hour and couldn’t fall asleep. I decided to try a long bath in my swanky bathroom with a TV in it (my favorite part of the trip I think). I got my second wind and attended a meeting Tuesday evening for about an hour where I met up with fellow vExperts.  Asked how I felt about the following morning’s defense, again my answer was mostly confident, cool, and collected.  I just wanted to get it over with.  The anxiety of the approaching defense date was starting to mount.  I found myself calculating the hours remaining in my head. “In 15 hours I will have started my defense.  In 17 hours I will have finished the first defense section.  In 18 hours it will all be over with.”  After the meeting, some of the guys were going out on the strip for a nice dinner.  I really wanted to go but knew I had no time for this social event.  I hung back and had a quick buffet dinner with a guy who I would find out was a VCDX himself and a panelist from VMware.  I was back to my room by 8:30pm and studied until about 10:15pm.  At that point, I was getting tired and decided to take the wise advice of Rick Scherer and John Arrasjid by getting a good night’s sleep.

I was getting good sleep until… I woke up at 3:30am and couldn’t fall back asleep.  I lay in bed for a full 2.5 hours thinking about my upcoming defense, points I wanted to make, design choices, etc. It’s a long time to dwell on these items but it was quiet and peaceful and I was well rested. I shot out of bed at my 6am wake up call, got ready, packed, and headed out.  I stopped by the hotel business center to print 4 copies of a presentation slide update I had made the night before. I forgot to print current slide only and instead printed 4 copies of the entire deck. Expensive lesson printing 60 pages which couldn’t be cancelled (how convenient for the hotel). At least they were in B&W and not color. The plan was to get a good breakfast to calm any nerves that may develop (advice from Rick Scherer).  Unfortunately, there was no breakfast open at 6:30am. The restaurants didn’t open until 7am.  I headed to Starbucks to start getting caffeinated. While having coffee and going through my slides, I decided to create 3 new slides right then and there.  I felt they would be beneficial for the executive presentation but a small part of me challenged “is this really wise throwing these in at the last second?”  Why not.  SEs do it all the time prior to arriving at customer sites.  At this point I still felt pretty confident and didn’t really have any nerves.

At 7:30 I finished my coffee and headed to breakfast. Last minute cramming at the breakfast buffet table downing coffee and some food. As the clock passed 8am, I had less than an hour left to head upstairs for my defense panel.  I could start to feel the nervousness set in. I continued to study until I realized it was 8:50am and I had less than 10 minutes to get through the casino over to 2nd level of the convention center. Whoops.  I arrived at Breakers L with maybe 2 minutes to spare and Melissa greeted me.  The panelists were waiting inside and not quite ready for me yet. In the meantime, I walked across the hall and poked my head in a large auditorium to see who was speaking. It was Steve Herrod talking about a technology which I cannot repeat at this point in time. I told myself repeatedly that I am not nervous but I was only lying to myself. It’s inevitable. When the exam room doors open and you see the panel of experts in there, you feel it. I surmise it may be a bit like meeting the Father, the Son, and the Holy Ghost for the first time. People who spend a lot of time in front of customers are still nervous for these defense panels. It’s unavoidable. One candidate who finished his defense Tuesday evening likened the experience to having “a proctology exam”.

The first 75 minutes is spent “defending” my design.  I’ve got about a 15 slide deck to get through and to use as reference throughout the design defense.  I recommend putting as much reference material as you can in the slide deck, which you can refer to yourself during the defense.  It will help illustrate design choices and jog your memory for design elements which you’ve forgotten due to nervousness. The first 5-10 minutes I was pretty nervous and stuttered once or twice during my presentation. After that, I warmed up and it felt more like a good technical discussion with co-workers which I enjoyed. As the questions started coming in, I made good use of some of the slides to help explain decisions.  Good slides to have here are architecture diagrams, network, storage, etc.  I felt my performance during this section of the defense was passable based on the questioning I received, but the honest truth is it’s too hard to tell with the scoring method that is used.  It’s about accumulating points.  What’s unknown is how many points were left to accumulate, and how many areas we never got to talk about before the 75 minutes expired. Afterwards, I can’t help but think about 1 technical question I knew I jumped the gun on and answered incorrectly, failing to correct myself. I’m told by a current VCDX to not worry about it, nobody is perfect in the defense – that is to say, the scoring of the defense will allow for X number of mistakes. I’ve also spent time playing back other areas of the defense, wondering if I made my points clearly enough. Trying to remember if the panelists understood that one of the points I was making was in the context of a specific circumstance, which they would need to understand for it to be technically correct.  Did they understand the physical network topology well enough between sites or draw a harmful conclusion that I was contradicting myself during explanation?  I can’t stress enough how fast the time elapses in front of the panel.  At least it did for me.

After the 75 minute defense, we took a short break and proceeded with the 30 minute mock design.  In retrospect, the scenario which was thrown at me wasn’t too bad.  Unfortunately I didn’t get through nearly as much of it as I wanted to.  I spent a lot of time digging in areas where there were probably no more points to be had, when I should have just moved on.  I wish I had another shot at it; I would move faster.  The idea in this section is to ask a lot of intelligent questions to frame out a design in 30 minutes.  But don’t spend too much time in one area.  This section is more about “the journey” than the final design.  Questions need to be asked of the “customers” during the design process so they can see how you think on your feet.  They may also not provide all of the needed information for the design which is, again, where asking questions comes in.  Once again, time flies.  Be quick but be as thorough as possible.  Think out loud.

After completing the 30 minute mock design section, we moved right into the last section, which is a 15 minute troubleshooting scenario.  The three panelists are once again the customers in this scenario and they came to me with a VMware Infrastructure 3 problem they were experiencing.  Once again, this process is more about “the journey” than the final result.  It’s about thinking out loud, asking questions of the customer, and showing them the thought process used to isolate the root cause of a problem. I feel I did well in this section and will go so far as to say that I found the root cause. Before I could get acknowledgement, however, the 15 minute timer expired.  I do not know how each section is weighted, if it is, but hopefully I did well enough on the last section to help carry me through the two previous sections.  A common occurrence through the Enterprise Admin and Design written exams was that I felt I did poorly in one section, but stellar in another, which carried me through to a passing score on each written exam.

The panelists and observers were a good group of people and I can honestly say that once I got beyond that first 5-10 minutes of nerves, the pressure wasn’t nearly as bad as I thought it would be.  I think it all depends on how prepared one is for the experience.  You may have heard other people say “Know your design inside and out”. This could not be closer to the truth. Know it up, down, sideways, back, and front. Be prepared for any question relating to your design, including upstream and downstream impacts. Know the infrastructure components well, such as storage and hardware platforms. Anything you list in your design you need to be able to speak to. If you cannot speak to everything in your design, then how do you know it is appropriate for your design? “Because” and “Best Practice” are not complete answers.  I’ve collected a ton of tips along the way (like these) and each of them contributed to getting me as far as I’ve gotten at this point.  Social networking tools have helped immensely.  I can’t imagine going this alone in a vacuum.  I would have been totally unprepared for the design defense, if I even made it that far.

So after my defense, I was told “7 days” in regard to getting my results. I was hoping to be pleasantly surprised with results late Friday after the defenses at Mandalay Bay wrapped up.  However, since I have not received them yet and tomorrow is a holiday, it looks like it will take the full week (and hopefully not longer) to get the results.  It has been difficult waiting this long.  Anxiety is building and I’ve been watching email like a man possessed.  I’ve been replaying the scenarios in my head, both good and bad.  It’s unhealthy for sure. Although no formal statistics have been released by VMware, I gather through word of mouth that about 50% of the candidates pass their defense attempt, while the other 50% do not.  With two individuals from this past week already pronounced as having passed and becoming VCDX certified, the odds are starting to stack up against those like me who still wait for their results.  I’m trying to keep my mind occupied with other things but it is difficult.  I periodically take comfort in thinking about things far more important, like smiles on my children’s faces. For those who pass, I’m sure they look back upon the effort as well spent and the reward of passing as well deserved.  I know that I have already benefited from what I have learned through the process. It has taught me to be more of a thinker, which maps directly to my Design and Engineering role at work. I would love nothing more at this point than to have the VCDX certificate to go along with it.  I look at the VCDX as a highly coveted certification with a lot of integrity built into the program and process, which is sure to last a long time. There is no possibility of a “paper VCDX” as far as I’m concerned. That means value for cert holders and businesses for many years to come.

Oh, I almost forgot: I brought my own whiteboard dry erase marker on the trip and used it during my defense. I had been using it for practice on my whiteboard at home and thought it might bring me good luck in the defense. Shabby dry erase markers can be a distraction.  In addition, it has a fine eraser on the opposite end which comes in handy and can save time when wiping away small details, compared with using the huge brick eraser.  The panel didn’t seem to have any reservations about me using it.

VMkernel Networks, Jumbo Frames, and ESXi 4

February 12th, 2010

Question:  Can I implement jumbo frames on ESXi 4 Update 1 VMkernel networks?

Answer:  Who in the hell knows?

You see, the ESXi 4.0 Update 1 Configuration Guide states on page 54:

“Jumbo frames are not supported for VMkernel networking interfaces in ESXi.”

Duncan Epping of Yellow Bricks also reports:

“Jumbo frames are not supported for VMkernel networking interfaces in ESXi. (page 54)”

One month after the release of ESXi 4 Update 1, Charu Chaubal of VMware posted on the ESXi Chronicles blog:

“I am happy to say that this is merely an error in the documentation. In fact, ESXi 4.0 DOES support Jumbo Frames on VMkernel networking interfaces. The correction will hopefully appear in a new release of the documentation, but in the meantime, go ahead and configure Jumbo frames for your ESXi 4.0 hosts.”

Shortly after, Duncan Epping of Yellow Bricks confirms Charu Chaubal’s report that jumbo frames are supported on ESXi VMkernel networks.

Now, nearly two months after Charu’s clarification and three months after the release of ESXi 4 Update 1, the documentation remains dubious, still stating on page 54 that jumbo frames are not supported on ESXi 4 VMkernel networks, which is a direct contradiction of VMware’s own ESXi blog.

I opened a Business Critical Support SR with VMware on the question.  I was told by VMware BCS that jumbo frames are NOT supported on ESXi 4 Update 1 VMkernel networks, and a reference was made to page 54 of the documentation.

Our dedicated VMware onsite Engineer escalated and I was then told ESXi 4 Update 1 DOES support jumbo frames on VMkernel networks, making reference to Charu’s article.

Hey VMware, which is it?  If this is a documentation mistake, why are you dragging your feet in getting the documentation updated two months after a VMware employee discovered the error and blogged about it?  Waiting for the next release of ESXi?  Unacceptable!  You update the public documentation as soon as you discover the error and be damned sure your BCS support Engineers know the right answer!  Do you know how much companies pay for BCS?  You owe your customers the correct answer.  If misinformation comes as a result of a known documentation error, SHAME ON YOU!  Architecture and design decisions are being made daily on this information or misinformation, whichever it may be.

Update 2/23/10:  Toby Kraft (@vmwarewriter on Twitter) will be updating the documentation by next week.  Thank you Toby!

Update 3/1/10:  VMware has updated their documentation to reflect currently supported configurations.  Thank you VMware (and Toby)!