Sometimes I take things for granted: for instance, the health and integrity of the lab environment. Although it is a “lab”, I do run some workloads which need to stay online on a regular basis. Primarily these are the web server which this blog is served from, the email server where I do a lot of collaboration, and the Active Directory Domain Controllers/DNS servers which provide the authentication mechanisms, mailbox access, external host name resolution to fetch resources on the internet, and internal host name resolution.
The workloads and infrastructure in my lab are 100% virtualized. The only “physical” items I have are type 1 hypervisor hosts, storage, and network. By this point I’ll assume most are familiar with the benefits of consolidation. The downside is that when the wheels come off in a highly consolidated environment, the impacts can be severe as they fan out and tip over downstream dependencies like dominoes.
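To make the domino effect concrete, here is a minimal sketch that walks a dependency map and lists everything that falls over when one component fails. The component names and the edges between them are my own illustrative assumptions loosely modeled on this lab, not an actual inventory.

```python
from collections import deque

# Hypothetical dependency map: each key depends on the components listed.
# Names and edges are assumptions for illustration only.
DEPENDS_ON = {
    "storage": ["network_switch"],
    "dns_vm": ["hypervisor_host", "storage"],
    "web_vm": ["hypervisor_host", "storage", "dns_vm"],
    "mail_vm": ["hypervisor_host", "storage", "dns_vm"],
}


def impacted_by(failed, depends_on):
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a component to its dependents.
    dependents = {}
    for component, requirements in depends_on.items():
        for requirement in requirements:
            dependents.setdefault(requirement, set()).add(component)

    # Breadth-first walk outward from the failed component.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted


if __name__ == "__main__":
    for component in ("network_switch", "dns_vm"):
        print(component, "->", sorted(impacted_by(component, DEPENDS_ON)))
```

In this toy map a single switch failure reaches every workload, which is the fan-out a highly consolidated environment has to plan for.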
A few weeks ago I decided to recarve the EMC Celerra fibre channel SAN storage. The VMs which were running on the EMC fibre channel block storage were all moved to NFS on the NetApp filer. Then last week, the Gb switch which supports all the infrastructure died. Yes, it was a single point of failure; it’s a lab. The timing couldn’t have been worse since all lab workloads were running on NFS storage. All VMs lost their virtual storage and the NFS connections on the ESX(i) hosts eventually timed out.
The network switch was replaced later that day and since all VMs were down and NFS storage had disconnected, I took the opportunity to gracefully reboot the ESX(i) hosts; good time for a fresh start. Not surprisingly, I had to use the vSphere Client to connect to each host by IP address since at that point I had no functional DNS name resolution in the lab whatsoever. When the hosts came back online, I was about to begin powering up VMs, but instead I encountered a situation which I hadn’t planned for: all the VMs were grayed out, essentially disconnected. I discovered the cause was that after the host reboot, the NFS storage hadn’t come back online on either host, and that was true of both the NetApp and the EMC Celerra datastores.
There’s no way both storage cabinets and/or both hosts were having a problem at the same time, so I assumed it was a network or cabling problem. With the NFS mounts in the vSphere Client staring back at me in their disconnected state, it dawned on me: lack of DNS name resolution was preventing the hosts from connecting to the storage. The NFS datastores were mounted by FQDN, and with the DNS servers themselves down, the hosts could not resolve the FQDN of the EMC Celerra or the NetApp filer. I modified /etc/hosts on each ESX(i) host, adding the IP address and FQDN for the NetApp filer and Celerra Data Movers. Shortly after, I was back in business.
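For anyone wanting to script the /etc/hosts fix rather than hand-edit each host, the following is a rough sketch of the idea: append an IP/FQDN pair only when the name is not already present, so the step can be repeated safely. The function name, the host names and the 192.0.2.x addresses are placeholders I made up for illustration, and the function only transforms text; copying the result onto a host is left out.

```python
def add_hosts_entries(hosts_text, entries):
    """Return hosts_text with any missing (ip, fqdn) pairs appended.

    hosts_text is the current content of a hosts file; entries is a list
    of (ip, fqdn) tuples. Names already present are left untouched.
    """
    # Collect every name already defined, ignoring comments.
    known_names = set()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()
        known_names.update(name.lower() for name in fields[1:])

    new_lines = []
    for ip, fqdn in entries:
        if fqdn.lower() not in known_names:
            new_lines.append(f"{ip}\t{fqdn}")
            known_names.add(fqdn.lower())

    if not new_lines:
        return hosts_text

    # Make sure the existing content ends with a newline before appending.
    if hosts_text and not hosts_text.endswith("\n"):
        hosts_text += "\n"
    return hosts_text + "\n".join(new_lines) + "\n"


if __name__ == "__main__":
    # Placeholder names and documentation-range addresses, not real lab values.
    storage = [
        ("192.0.2.10", "netapp.lab.example.com"),
        ("192.0.2.20", "celerra-dm2.lab.example.com"),
    ]
    print(add_hosts_entries("127.0.0.1\tlocalhost\n", storage), end="")
```

Running it a second time against its own output changes nothing, so it could be reapplied after a host rebuild without creating duplicate lines.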
What did I learn? Not much. It was more a reiteration of important design considerations which I was already aware of:
- 100% virtualization/consolidation is great – when it works. The web of upstream/downstream dependencies makes it a pain when something breaks. Consolidated dependencies which you might consider leaving physical or placing in a separate failure domain:
  - vCenter Management
  - Update Manager
  - SQL/Oracle back ends
  - Name Resolution (DNS/WINS)
  - DHCP
  - Routing
  - Active Directory/FSMO Roles/LDAP/Authentication/Certification Authorities
  - Internet connectivity
- Hardware redundancy is always key but expensive. Perform a risk assessment and make a decision based on the cost effectiveness.
- When available, diversify virtualized workload locations to reduce the size of each failure domain, particularly to split workloads which provide redundant infrastructure support such as Active Directory Domain Controllers and DNS servers. This can mean placing workloads on separate hosts, separate clusters, separate datastores, separate storage units, maybe even separate networks depending on the environment.
- Static entries in /etc/hosts aren’t a bad idea as a fallback if you plan on using NFS in an environment with unreliable DNS, but I think the better point to discuss is the risk and pain which will be realized in deploying virtual infrastructure in an unreliable environment. Garbage In, Garbage Out. I’m not a big fan of using IP addresses to mount NFS storage unless the environment is small enough.
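The point about diversifying redundant workloads can be checked mechanically. Below is a small sketch, under the assumption that you can export where each VM lives; the VM names, hosts and datastores are hypothetical. It flags any host, datastore or other dimension that holds more than one member of a group which is supposed to be redundant.

```python
from collections import defaultdict

# Hypothetical placement inventory; names are placeholders for illustration.
PLACEMENT = {
    "dc01": {"host": "esx1", "datastore": "netapp_nfs"},
    "dc02": {"host": "esx2", "datastore": "netapp_nfs"},
}


def shared_failure_domains(group, placement):
    """Return {(dimension, value): [vms]} for each domain holding 2+ group members."""
    members = defaultdict(list)
    for vm in group:
        for dimension, value in placement[vm].items():
            members[(dimension, value)].append(vm)
    # Only domains shared by more than one redundant VM are a concern.
    return {domain: vms for domain, vms in members.items() if len(vms) > 1}


if __name__ == "__main__":
    overlaps = shared_failure_domains(["dc01", "dc02"], PLACEMENT)
    for (dimension, value), vms in overlaps.items():
        print(f"{', '.join(vms)} share {dimension} {value}")
```

In this made-up inventory the two Domain Controllers sit on separate hosts but share one datastore, so losing that datastore takes out both, which is the kind of overlap worth catching before an outage does it for you.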























