VKernel Capacity Analyzer

May 6th, 2010 by jason

Last month, I attended Gestalt IT Tech Field Day in Boston.  This is an independent conference made up of hand-selected delegates and sponsored by the technology vendors we visited.  All of the vendors offer products that tie into the virtualized datacenter, which made the event particularly exciting for me!

One of the vendors we met with is VKernel.  If you’re a long-time follower of my blog, you may recall a few of my prior VKernel posts, including VKernel CompareMyVM.  Our VKernel briefing covered Capacity Analyzer.  This is a product I actually looked at in the lab well over a year ago, but it was time to take another peek to see what improvements had been made.

Before I get into the review, some background information on VKernel:

VKernel helps systems administrators manage server and storage capacity utilization in their virtualized datacenters so they can:

  • Get better utilization from existing virtualization resources
  • Avoid up to 1/2 the cost of expanding their virtualized datacenter
  • Find and fix or avoid capacity related performance problems

VKernel provides easy to use, highly affordable software for systems managers that:

  • Integrates with their existing VMware systems
  • Discovers their virtualized infrastructure and
  • Determines actual utilization vs. provisioned storage, memory, and CPU resources

And the VKernel Capacity Analyzer value proposition:

Capacity Analyzer proactively monitors shared CPU, memory, network, and disk (storage and disk I/O) utilization trends in VMware and Hyper-V environments across hosts, clusters, and resource pools enabling you to:

  • Find and fix current and future capacity bottlenecks
  • Safely place new VMs based on available capacity
  • Easily generate capacity utilization alerts

Capacity Analyzer lists for $299 per socket; however, VKernel was nice enough to provide each of the delegates with a 10 socket/2 year license, which was more than adequate for evaluation in the lab.  From this point forward, I will refer to Capacity Analyzer as CA.

One of the things another delegate and I noticed right away was the quick integration and immediate results.  CA 4.2 Standard Edition ships as a virtual appliance in OVF or Converter format.  The 32-bit SLES VM is pre-built, pre-configured, and pre-optimized for its role in the virtual infrastructure.  The 600MB appliance deploys in just minutes.  The minimum deployment tasks consist of network configuration (including DHCP support), licensing, and pointing at a VI3 or vSphere virtual infrastructure.

CA is managed through an HTTP web interface which has seen noticeable improvement and polish since the last time I reviewed the product.  The management and reporting interface is presented in a dashboard layout which makes use of the familiar stoplight colors.  Shortly after deployment, I was already seeing data being collected.  I should note that the product supports management of multiple infrastructures; I pointed CA at VI3 and vSphere vCenters simultaneously.

One of the dashboard views in CA is the “Top VM Consumers” view for metrics such as CPU, Memory, Storage, CPU Ready, Disk Bus Resets, Disk Commands Aborted, Disk Read, and Disk Write.  The dashboard view shows the top 5; however, a detailed drilldown is available which lists all the VMs in my inventory.
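Capacity Analyzer presumably pulls these metrics through the vSphere APIs.  For readers curious what the raw data behind a “top consumers” view looks like, here is a minimal pyVmomi sketch of my own (not VKernel’s code; the vCenter hostname and credentials are placeholders) that produces a crude top 5 list by active guest memory from the quickStats counters vCenter exposes:

# Minimal sketch: top 5 VM memory consumers via pyVmomi quickStats.
# Not VKernel code; hostname and credentials below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()   # lab convenience only
si = SmartConnect(host='vcenter.lab.local', user='administrator',
                  pwd='password', sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vms = [vm for vm in view.view if vm.runtime.powerState == 'poweredOn']
    top = sorted(vms, key=lambda vm: vm.summary.quickStats.guestMemoryUsage or 0,
                 reverse=True)[:5]
    for vm in top:
        qs = vm.summary.quickStats
        print(f"{vm.name:30}  active mem {qs.guestMemoryUsage} MB   "
              f"cpu {qs.overallCpuUsage} MHz")
finally:
    Disconnect(si)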

Prior to deploying CA, I thought I had a pretty good feel for the capacity and utilization in the lab.  After letting CA digest the information available, I thought it would be interesting to compare the results provided by CA with my own perception and experience.  I was puzzled by the initial findings.  Consider the following physical two node cluster information from vCenter: each node is configured identically with 2xQC AMD Opteron processors and 16GB RAM, and each host is running about 18 powered on VMs.  Host memory is and always has been my limiting resource, and that’s evident here; however, with HA admission control disabled, there is still capacity to register and power on several more “like” VMs.

So here’s where things get puzzling for me.  Looking at the Capacity Availability Map, CA is stating:
1) Memory is my limiting resource – correct
2) There is no VM capacity left on the DL385 G2 Cluster – that’s not right

After further review, the discrepancy is revealed.  The Calculated VM Size (slot size, if you will) for memory is 3.5GB.  I’m not sure where CA is coming up with this number.  It’s not the HA calculated slot size; I checked.  3.5GB is nowhere near the average VM memory allocation in the lab.  Most of my lab VMs are thinly provisioned from a memory standpoint since host memory is my limiting resource.  I’ll need to see if this can be adjusted, because these numbers are not accurate and thus not reliable.  I wouldn’t want to base a purchasing decision on this information.
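To illustrate how sensitive the “VMs remaining” figure is to the assumed slot size, here is some quick back-of-the-napkin math in Python.  The slot model and the average VM memory value are my own guesses for illustration, not CA’s actual algorithm:

# Back-of-the-napkin capacity math for the two node DL385 G2 cluster.
# The slot model and the average VM memory value are illustrative guesses,
# not VKernel's actual algorithm.
hosts = 2
host_memory_gb = 16                # per host
powered_on_vms = 36                # roughly 18 per host
ca_calculated_vm_size_gb = 3.5     # the figure CA reported
assumed_avg_vm_memory_gb = 0.75    # hypothetical thin average for comparison

def vms_remaining(slot_gb, overhead_fraction=0.10):
    """Whole slots left after reserving a little memory for the hypervisor."""
    usable_gb = hosts * host_memory_gb * (1 - overhead_fraction)
    total_slots = int(usable_gb // slot_gb)
    return max(total_slots - powered_on_vms, 0)

print("Remaining VMs at CA's 3.5GB slot size:", vms_remaining(ca_calculated_vm_size_gb))
print("Remaining VMs at a ~0.75GB average   :", vms_remaining(assumed_avg_vm_memory_gb))

At a 3.5GB slot size the cluster is already “full” many times over, which is exactly what the Capacity Availability Map reports; at a realistic thin average there is still headroom.  The inputs drive the conclusion.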

Here’s an example of a drilldown.  Again, I like the presentation, although this screen seems to have some justification inconsistencies (right vs. center).  Reports in CA can be saved in .PDF or .CSV format, making them ideal for sharing, collaboration, or archiving.  Another value add is a recommendation section stated in plain English in the event the reader is unable to interpret the numbers.  What I’m somewhat confused about is the fact that the information provided in different areas is contradictory.  In this case, the summary reports that VM backupexec “is not experiencing problems with memory usage… the VM is getting all required memory resources”.  However, it goes on to say there is a problem in that a memory usage bottleneck exists and the VM may experience performance degradation if memory usage increases.  Finally, it recommends increasing the VM memory size to almost double the currently assigned value – and this Priority is ranked as High.

It’s not clear to me from the drilldown report whether there is a required action here or not.  With the high priority status, there is a sense of urgency, but to do what?  The analysis states performance could suffer if memory usage increases.  That will typically be the case for virtual and physical machines alike.  The problem as I see it is that the analysis is concerned with a future event which may or may not occur.  If the VM has shown no prior history of higher memory consumption and there is no change to the application running in the VM, I would expect the memory utilization to remain constant.  VKernel is on the right track, but I think the out-of-box logic needs tuning so that it is more intuitive.  Otherwise this is a false alarm which would either cause me to overutilize host capacity or teach me to ignore the alerts, which is dangerous and provides no return on investment in a management tool.
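The behavior I would rather see is trend-based: only raise the alarm when the usage history is actually heading toward the configured limit.  Here is a toy sketch of that idea in Python, purely my own illustration of the logic I’m describing, not how Capacity Analyzer works:

# Toy illustration of trend-based alerting: only flag a VM when a simple
# linear projection of its memory usage crosses the configured limit.
# My own sketch of the idea, not VKernel's logic.
def memory_alert(samples_mb, configured_mb, horizon=12, threshold=0.9):
    """samples_mb: recent usage samples, oldest first. Returns True when a
    least-squares trend crosses threshold*configured_mb within `horizon`
    future sample intervals."""
    n = len(samples_mb)
    if n < 2:
        return False
    xbar = (n - 1) / 2
    ybar = sum(samples_mb) / n
    slope = sum((i - xbar) * (y - ybar) for i, y in enumerate(samples_mb)) / \
            sum((i - xbar) ** 2 for i in range(n))
    projected = samples_mb[-1] + slope * horizon
    return projected >= threshold * configured_mb

steady = [512, 520, 508, 515, 511, 518]    # flat history: no alarm
rising = [512, 600, 690, 775, 860, 950]    # climbing toward a 1024MB limit
print(memory_alert(steady, 1024))    # False -> leave backupexec alone
print(memory_alert(rising, 1024))    # True  -> a High priority flag is warranted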

I’ve got more areas to explore with VKernel Capacity Analyzer, and I welcome input, clarification, and corrections from VKernel.  Overall I like the direction of the product, and I think VKernel has the potential to service capacity planning needs for virtual infrastructures of all sizes.  The ease of deployment provides a rapid return.  As configuration maximums and VM densities increase, capacity planning becomes more challenging.  When larger VMs are deployed, they make significant dents in the virtual infrastructure, causing shared resources to deplete more rapidly per instance than in years past.  Additional capacity takes time to procure.  We need to be able to lean on tools like these to provide the automated analysis and alarms to stay ahead of capacity requests and not be caught short on infrastructure resources.

Flickr Manager Plugin Fix

April 27th, 2010 by jason

I’m a visual and hands-on kind of person and as such, I tend to make use of images in my blog posts. Flickr is an online provider that hosts images free of charge which saves me bandwidth costs and delivers content to blog readers quickly. In a sense, they are a cloud provider. Flickr Manager is a WordPress plugin that allows me to efficiently browse and insert Flickr images from the comfort of my WordPress blog editor, among other things.

Several months ago, the Flickr Manager overlay stopped working correctly.  The overlay was no longer inserting images into my blog posts as I had been instructing it to.  I filed a bug (#144) with the author as follows:

What steps will reproduce the problem?

1. Create a new blog post or page

2. Click on the “Add Flickr Photo” icon.

3. In the overlay under “My Photos” tab, click on a photo to insert.

4. In the summary overlay page, once the photo is selected in the overlay, click the “Insert into Post” button.

5. The summary overlay page for the photo returns and no photo is inserted into the blog post.

What is the expected output? What do you see instead?

I expect the photo to be inserted into the blog post and the Flickr overlay should close. Instead, the overlay stays open as if nothing has happened. The same thing happens if I check the box “Close on insert” on the overlay page.

What version of the plugin are you using? Which version of WordPress? Flickr Manager version 2.3. WordPress 2.9.2

Please provide a link to your photo gallery, or the page that has the bug: My Flickr Photostream is at http://www.flickr.com/photos/31838982@N08/

Which hosting provider are you on? What version of Apache or IIS are you using? Self hosted out of my home. Windows Server 2003, IIS 6

Please provide any additional information below.

This plugin was working fine for the first several months but after a while it stopped inserting photos. I can’t associate the breakage with any sort of upgrade such as a WordPress upgrade, plugin upgrade, or theme change. Any help would be appreciated.

Browsing my Flickr album, grabbing URLs for images, and inserting them into my blog posts manually is a painful process involving multiple browser windows.  I was really missing the functionality of Flickr Manager.  It was deterring me from writing blog posts into which I knew I wanted to incorporate images.  Using Google, I was able to locate a few others who had stumbled onto this problem, but I was unable to find any solutions.

I turned to Twitter, a universe of technical expertise, among many other things I’m sure.  Kelly Culwell and Grant Bivens, Solution Architect and Web Developer respectively of Interworks, Inc., answered the call.  I had spoken with Kelly off and on over the past few months regarding VMware topics.  They quickly turned me on to this page which described a fix.  All I had to do was modify three of the plugin files, removing any occurrence of the @ symbol (a small script that automates this follows the file list below).  Grant described the problem as a JavaScript selector the author used which has since been deprecated.

wordpress-flickr-manager/js/wfm-lightbox.php
wordpress-flickr-manager/js/media-panel.php
wordpress-flickr-manager/js/wfm-hs.php
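For anyone who would rather script the edit than open each file by hand, here is a quick sketch of the same fix in Python.  The plugin directory path is an assumption based on a standard WordPress layout, and the script keeps a .bak copy of each file before stripping the @ characters:

# Quick sketch of the fix described above: strip every "@" from the three
# plugin files. Keeps a .bak backup of each file first. The plugin path is
# an assumption based on a standard WordPress install.
import shutil

PLUGIN_JS_DIR = "wp-content/plugins/wordpress-flickr-manager/js"
FILES = ("wfm-lightbox.php", "media-panel.php", "wfm-hs.php")

for name in FILES:
    path = f"{PLUGIN_JS_DIR}/{name}"
    shutil.copy2(path, path + ".bak")        # safety copy
    with open(path, "r", encoding="utf-8") as fh:
        contents = fh.read()
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(contents.replace("@", ""))  # remove the deprecated selector prefix
    print(f"cleaned {path}")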

Happy days once again: the solution worked!  These guys wanted nothing in return, but their kind offer to help and quick solution definitely deserve mention.  My faith in humanity has been partially restored thanks to these gentlemen.  Kudos and great job!

Two Quick Announcements

April 19th, 2010 by jason

I’ve got several blog posts in the queue but unfortunately I haven’t had time to get them cranked out yet.  In the interim, here are a couple of hot items I wanted to help spread the word on.

There’s a 45 minute VMware certification podcast coming up early this Wednesday morning: the APAC Virtualization Roundtable (Certifications & the VCDX path).  Podcast guests are Andrew Mitchell, Duncan Epping, and Alastair Cooke.  The Talkshoe podcast is scheduled for the following times:
EDT (USA) – 7AM
PDT (USA) – 4AM
Perth (Australia) – 7PM
Hong Kong (Hong Kong) – 7PM
Kuala Lumpur (Malaysia) – 7PM
Tokyo (Japan) – 8PM
Auckland (New Zealand) – 11PM
London (UK) – 12 Noon

The vSphere 4.0 Hardening Guide, a collaborative effort which has been under development for several months with a few interim beta distributions, has been finalized and released.  The document is presented in a flexible format: several hardening levels are defined, spanning low to high risk, and the reader ultimately decides on the appropriate hardening level depending on the amount of risk they are comfortable assuming.  The guide covers several key VMware Virtual Infrastructure areas.  To quote VMware’s announcement:

Overall, there are more than 100 guidelines, with the following major sections: 

  • Introduction
  • Virtual Machines
  • Host (both ESXi and ESX)
  • vNetwork
  • vCenter
  • Console OS (for ESX only)

If you’ve recently deployed vSphere or if you are about to, check out this hardening guide to help secure your datacenter.  You can’t beat the price – free!

No Failback Bug in ESX(i)

April 7th, 2010 by jason

A few weeks ago, I came across a networking issue with VMware ESX(i) 4.0 Update 1.  The issue is that configuring a vSwitch or Portgroup for Failback: No doesn’t work as expected in conjunction with a Network Failover Detection type of Link Status Only.

For the simplest of examples:

  1. Configure a vSwitch with 2 VMNIC uplinks with both NICs in an Active configuration.  I’ll refer to the uplinks as VMNIC0 and VMNIC1.
  2. Configure the vSwitch and/or a Portgroup on the vSwitch for Failback: No.
  3. Create a few test VMs with outbound TCP/IP through the vSwitch. 
  4. Power on the VMs and begin a constant ping to each of the VMs from the network on the far side of the vSwitch (a small ping monitor sketch follows this list).
  5. Pull the network cable from VMNIC0.  You should see little to no network connectivity loss on the constant pings.
  6. With VMNIC0 failed, all network traffic is now riding over VMNIC1.  When VMNIC0 is recovered, the expected behavior with No Failback is that all traffic will continue to traverse VMNIC1.
  7. Now provide VMNIC0 with a link signal by connecting it to a switch port which has no route to the physical network.  For example, simply connect VMNIC0 to a portable Netgear or Linksys switch.
  8. What you should see now is that at least one of the VMs is unpingable.  It has lost network connectivity because ESX has actually failed its network path back to VMNIC0.  In the failback mechanism, VMware appears to balance the traffic evenly.  In a 2 VM test, 1 VM will fail back to the recovered VMNIC.  In a 4 VM test, 2 VMs will fail back to the recovered VMNIC.  Etc.
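If you would rather not juggle a handful of ping windows for steps 4 and 8, a small monitor like the following will report the moment any test VM drops off (or comes back on) the network.  The IP addresses are placeholders and the ping flags assume a Linux workstation:

# Simple reachability monitor for the failback test: pings each test VM once
# per second and prints a line whenever a VM changes state.
# Placeholder IPs; "-c 1 -W 1" are Linux ping flags (adjust on Windows).
import subprocess, time

TEST_VMS = {"vm1": "192.168.1.101", "vm2": "192.168.1.102",
            "vm3": "192.168.1.103", "vm4": "192.168.1.104"}

state = {name: True for name in TEST_VMS}        # assume reachable at start
while True:
    for name, ip in TEST_VMS.items():
        alive = subprocess.call(["ping", "-c", "1", "-W", "1", ip],
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL) == 0
        if alive != state[name]:
            stamp = time.strftime("%H:%M:%S")
            print(f"{stamp}  {name} ({ip}) is now "
                  f"{'reachable' if alive else 'UNREACHABLE'}")
            state[name] = alive
    time.sleep(1)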

The impact spreads to any traffic being supported by the vSwitch, not just VM networking.  Thus, the impact includes Service Console, Management Network, IP based storage, VMotion, and FT.  The scope of the bug includes both the standard vSwitch as well as the vSphere Distributed Switch (vDS).

Based on the following VMTN forum thread, it would appear this bug has existed since August of 2008.  Unfortunately, documentation of the bug never made it to VMware support:

http://communities.vmware.com/thread/165302

You may be asking yourself at this point, well who really cares?  At the very least, we have on our hands a feature which does not work as documented on page 41 of the following manual: http://www.vmware.com/pdf/vsphere4/r40_u1/vsp_40_u1_esxi_server_config.pdf.  Organizations which have made a design decision for no failback have done so for a reason and rely on the feature to work as it is documented.

Why would my VMNIC ever come up with no routing capabilities to the physical network?  Granted, my test was simplistic in nature and doesn’t likely represent an actual datacenter design but the purpose was merely to point out to folks that the problem does exist.  The issue actually does present a real world problem for at least one environment I’ve seen.  Consider an HP C-class blade chassis fitted with redundant 10GbE Flex10 Ethernet modules.  Upstream to the Flex10 modules are Cisco Nexus 5k switches and then Nexus 7k switches.

When a Flex10 module fails and is recovered (say, for instance, it was rebooted – which you can test yourself if you have one), it has an unfortunate habit of bringing up the blade-facing network ports (in this case, VMNIC0, labeled 1 in the diagram) up to 20 seconds before a link is established with the upstream Cisco Nexus 5k (labeled 2 in the diagram) which grants network routing to other infrastructure components.  So what happens here?  VMNIC0 shows a link, and ESX fails traffic back to it up to 20 seconds before the link to the Nexus 5k is established.  There is a network outage for Service Console, Management Network, IP based storage, VMotion, and FT.  Perhaps some may say they can tolerate this much of an outage for their VM traffic, but most people I have talked to say even an outage of 2 or more seconds is unacceptable.  And what about IP based storage?  Can you afford the 20 second latency?  What about Management Network and Service Console?  Think HA and isolation response impact, with VMs shutting down as a result.  It’s a nasty chain of events.  In such a case, a decision can be made to enable no failback as a policy on the vSwitch and Portgroups.  However, due to the VMware bug, this doesn’t work, and some day you may experience an outage which you did not expect.

As pointed out in the VMTN forum thread above, there is a workaround which I have tested and which does work: force at least one VMNIC to act as Standby.  This is not by VMware design; it just happens to make the no failback behavior work correctly.  The impact of this design decision is of course that one VMNIC now stands idle and there are no load balancing opportunities over it.  In addition, with no failback enabled, network traffic will tend to become polarized to one side, again impacting network load balancing.
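For anyone applying the workaround across a number of hosts, here is a rough pyVmomi sketch.  The vCenter, host, vSwitch, and NIC names are placeholders, and to the best of my knowledge the API expresses “Failback: No” as rollingOrder=True, so verify the result in the vSphere Client before relying on it:

# Rough sketch of the workaround: vmnic0 active, vmnic1 standby on vSwitch0
# with failback disabled. All names are placeholders. To the best of my
# knowledge "Failback: No" maps to rollingOrder=True; verify before trusting.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host='vcenter.lab.local', user='administrator',
                  pwd='password', sslContext=ctx)
try:
    content = si.RetrieveContent()
    host = content.searchIndex.FindByDnsName(None, 'esx1.lab.local', False)
    net_sys = host.configManager.networkSystem
    vswitch = next(v for v in net_sys.networkInfo.vswitch if v.name == 'vSwitch0')

    spec = vswitch.spec                      # start from the current settings
    teaming = spec.policy.nicTeaming
    teaming.rollingOrder = True              # i.e. Failback: No
    teaming.nicOrder = vim.host.NetworkPolicy.NicOrderPolicy(
        activeNic=['vmnic0'], standbyNic=['vmnic1'])

    net_sys.UpdateVirtualSwitch(vswitchName='vSwitch0', spec=spec)
    print('vSwitch0: vmnic0 active, vmnic1 standby, failback disabled')
finally:
    Disconnect(si)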

An SR has been opened with VMware on this issue.  They have confirmed it is a bug and will be working to resolve the issue in a future patch or release.

Update 4/27/10:  The no failback issue has been root-caused by VMware and a fix has been constructed and is being tested. It will be triaged for a future release.

Announcing the Drobo FS Storage Appliance

April 6th, 2010 by jason

SANTA CLARA, CA – April 6, 2010 – Data Robotics, Inc., the company that delivers the best data storage experience, today introduced Drobo FS, a breakthrough Drobo designed for simple, expandable file sharing.  By providing network file sharing capabilities along with automated data protection, the Drobo FS greatly simplifies shared storage for connected home, home office and small office users. Based on the revolutionary BeyondRAID technology and a flexible platform for adding features and capacity as needed, the Drobo FS can quickly and easily be customized and scaled to meet current or future storage requirements.

“Adding to Data Robotics’ offering of self-managing storage solutions, the Drobo FS offers users the ability to share data between computers quickly and easily,” said Liz Conner, research analyst for IDC’s storage systems and personal storage device & systems.  “More than ever before, home users and small offices want to access and share a growing amount of data, but they don’t need a large, expensive system that requires specific expertise or extensive management.  With the Drobo FS, Data Robotics addresses a need for which many Drobo users have been looking forward.”

Drobo FS extends automated data protection across connected systems using Data Robotics’ award-winning BeyondRAID virtualized storage technology. Drobo FS features a one-click toggle between single- and dual-drive redundancy and provides protection against up to two simultaneous drive failures. In addition, the Drobo FS enables users to add storage on-the-fly without ever losing access to data.

“We have been using Drobo solutions to store our data for almost three years and continue to be happy with their performance and simplicity.  With four computers on our network, we also needed a solution that would allow us to share data between users. Typical shared storage devices were complicated or too expensive, so we’re thrilled that Data Robotics has created the Drobo FS,” said Seth Resnick, Co-Founder at D65. “Drobo FS comes with just the functionality we need, so it is easy to use while still providing the automated data protection that comes with BeyondRAID.  The additional DroboApps allow us to add new features as we need them, which is really unique.”

Drobo FS Features and Benefits

  • Plug In and Share – The Drobo FS connects directly to any Gigabit Ethernet network for a true plug in and share set-up experience. Supports standard data transfer protocols including Apple File Protocol (AFP) and Microsoft Common Internet File System (CIFS).
  • 5-Drive Capacity and Instant Expansion to 10TB and Beyond – Customers with growing storage requirements can easily add data capacity by simply inserting a new hard drive or replacing the smallest drive with a larger one, even when all five drive bays are full. With Drobo FS, expansion is automatic, instantaneous, and access to data is always maintained.
  • Single- and Dual-Drive Redundancy – The Drobo FS dual drive redundancy option protects against the simultaneous failure of up to two hard drives. Customers can engage this option with a single click without ever losing access to their data.
  • Self-Healing Technology – With BeyondRAID, the Drobo FS continually examines data blocks and sectors on each drive to flag potential issues. The preemptive “scrubbing” helps ensure data is being written only to healthy drive areas and automatically keeps data in the safest state possible even when a drive fails.
  • Customizable Storage – Utilizing the growing library of DroboApps, including media and web applications, users can customize the Drobo FS to further enhance their sharing experience.

“This is the decade of being connected, no matter where you are or what kind of data you are storing. The Drobo FS was designed to best serve the needs of our customers with file sharing needs – from small offices and home offices to connected homes.  By reducing the complexity of data sharing and providing a truly flexible platform for adding capacity and features, the Drobo FS is ideal for users that have the need to share ever increasing amounts of data,” said Tom Buiocchi, CEO of Data Robotics.

Pricing and Availability
Drobo FS is currently available at a starting price of $699 MSRP, with multiple configurations up to $1,449 for a 10TB (5 x 2TB drives) bundle. Drobo FS is available now from authorized partners worldwide and on www.drobostore.com. To learn more about Drobo FS, please visit www.drobo.com/drobo-fs.

About Data Robotics
Data Robotics, Inc., the company that delivers the best storage experience ever, develops automated storage products designed to ensure data is always protected, accessible, and simple to manage. The award-winning Drobo storage arrays are the first to provide the protection of traditional RAID without the complexity. The revolutionary BeyondRAID technology frees users from making the difficult and confining choice of “Which RAID level to deploy?” by providing an unprecedented combination of advanced features and automation, including single- and dual-drive redundancy, instant expansion, self-monitoring, data awareness, self-healing, and an easy-to-understand visual status and alert panel. For more information, visit Data Robotics at www.datarobotics.com.

I spent some time with Data Robotics hearing about the Drobo FS in addition to their existing storage offerings.  Here are some of the things I learned about the FS:

  • FS = File Sharing
  • True plug and play setup
    • Drobo FS is automatically discovered on PC and Mac
  • Gigabit Ethernet
  • Up to 5 hot-swappable drives
  • Single or dual drive redundancy
  • All the magic and ease found in the Drobo S (such as BeyondRAID) is now implemented in a network storage device, with additions such as DroboApps
  • Adds network based storage while removing the complexity of managing RAID
  • Target uses:
    • Shared file storage
    • Network backup
    • Private cloud (IP storage accessible via the internet)
  • Add additional functionality to the Drobo FS via DroboApps
    • Bolt on applications are accessible from Drobo Share page
    • Backed by a developer community
    • DroboApps are free; no plans to charge fees
    • Apps are small, 1MB or less in size, and stored on reserved space of the Drobo FS
    • Apps available at launch: iTunes media server, UPnP/DLNA media server, BitTorrent client, Web server, FTP server
  • CIFS and AFP (Apple File Protocol) are natively embedded protocols
  • NFS is a bolt on DroboApp
    • Will likely work with VMware virtual infrastructure, however, is not currently on the VMware HCL
  • iSCSI protocol not available
    • this is a file oriented device, not block oriented
    • iSCSI is available in the DroboPro and DroboElite models
  • Self healing technology: Drobo FS examines blocks and sectors during idle periods to ensure data is written only to healthy areas of the disk
  • Viable competitor to the Lacie Big5 and iOMEGA with the following throughput rates:
    • 40-50 MB/s Read
    • 30-40 MB/s Write
  • Mix and match drive densities
  • Pay as you grow (Buy what you need today, expand with additional drives later)
  • Available via Amazon, CDW, NewEgg, MacMall, eXpansys, B&H Photo, etc.

MSRP Pricing:

Product          Configuration               Pricing USD   Pricing GBP   Pricing EUR
Drobo FS Base    Base appliance, no drives   $699          £469          €519
Drobo FS 4.5TB   4.5TB (3 x 1.5TB)           $999          £669          €749
Drobo FS 7.5TB   7.5TB (5 x 1.5TB)           $1,149        £769          €859
Drobo FS 10TB    10TB (5 x 2TB)              $1,449        £969          €1,079

The Drobo FS looks and sounds like a nice product.  Watch for competitive pricing.  If I’m able to get my hands on a unit, I’ll perform some lab testing and, I’m sure, some further writing.

Happy Easter!

April 4th, 2010 by jason

Allison vEGG from Jason Boche on Vimeo.

VMware Disks Moving Back To .DSK File Name Format

April 1st, 2010 by jason

VMware administrators who use VMware Update Manager (and I highly recommend its use) will have noticed that 10 new patches were released for VMware vSphere on 4/1/2010.  A few are for the Cisco Nexus 1000V VEM, and the remainder are for ESX(i) 4.0.

Two of the patches which I’d like to highlight, ESX400-201003401-BG and ESXi400-201003401-BG, make some required changes to the virtual infrastructure which you should be aware of.  The .VMDK file naming convention, introduced originally in ESX 2.x, is being retired in favor of the original naming convention .DSK which existed prior to ESX 2.x.  The reasoning for this is not yet known but as the saying goes, “You can’t make an omelette without breaking a few eggs.”  So suck it up; it’s not all about you.

How does this affect you?  The impacts will vary depending on your role with VMware Virtual Infrastructure:

  • VI Administrators will need to update any scripts which may be impacted.
  • 3rd party tool vendors will need to update any code which references .VMDK files.
  • Book authors would be advised to perform a mass “find and replace” before their next printing.  For tactical advice on how to do this, speak to Scott Herold or Ron Oglesby as they have experience in this.  As for books and articles already published on VMware disk technologies using the .VMDK naming scheme, refer to the omelette statement above.
  • Lab Manager and View environments are not yet compatible with the file naming convention change.  An updated release for each of these products should be available by Q4 this year, Q1 2011 at the latest.

I am completely in favor of this change.  I never did adapt fully to the migration from .DSK to .VMDK, so this will be just like old times for me.  We need more radical ideas like this to break the chains of complacency.  From a sales perspective, VMware can totally pitch changes like these as innovation which keeps them several steps ahead of their competitors.  So what’s next?  Inside sources tell me HA naming is going back to its roots:  Hello Dynamic Availability Service!

For more information, please follow this link. 😀