Today I learned of a new blog called Virtual RJ which is owned by Robbert Jan van de Velde (yet another Dutch VMware virtualization enthusiast!). I was reading an article he had recently written called Making inactive storage active in VirtualCenter. What hits close to home for me about this article is the need for datacenter playbooks which outline a shutdown/startup order of infrastructure and servers. Once upon a time, our environment was fairly simple and staff was small. Although our environment was documented, the need for a formal shutdown/startup order was not so prevalent. Over the years, staff has grown, new applications have been introduced to the environment, and the number of servers grew into the hundreds. Not to mention, storage got out of control and with that we brought in SAN infrastructures.
Unless your datacenter is the size of a broom closet, chances are you cannot easily get away with throwing the master power switch to bring up infrastructure and servers in the right order. Obviously you’re not going to use a power switch to shut everything down ungracefully either, but what may not be so obvious is that a graceful shutdown or startup of servers and infrastructure in random order may not be the best choice considering the health of the environment.
In order to understand the correct shutdown/startup order for your environment, you need to fully understand the web of datacenter dependencies, which can range from simple to highly complex. Knowing your datacenter dependencies means having good documentation of its components: servers (including clusters), applications, storage, authentication, network, power, cooling, etc. Virtualization adds a layer as well, as I will show in a moment. Let’s look at a few high level examples of dependencies:
- Users depend on applications, workstations, network, VDI, etc.
- Applications depend on databases, network, authentication, storage, other applications, etc.
- Highly available databases depend on shared storage, clustered servers, etc.
- Clustered servers depend on shared storage, authentication, network, quorum, etc.
- Shared storage and network depend on power and cooling.
- Consolidated virtual infrastructures (including VDI) depend on everything.
The list above may not completely fit your environment, but it should start to get you thinking about what and where the dependencies are in your environment. Let me re-emphasize that without knowledge of how data flows in your environment, you won’t be able to come up with an accurate dependency tree. Shutdown and startup orders aside, you’re in a scary position. Start documenting quickly. Talk to your peers, developers, managers, etc. to tie your datacenter components together.
So what does the dependency list above mean and how does it translate into a shutdown/startup order? Well, workstations and VDIs typically have nothing depending on them and can be shut down first. Application servers (including VMs) can be shut down next (except for the vCenter server – we’ll need that to shut down VMs and hosts). Database cluster shutdown follows with the caveat that not all cluster nodes should be shut down at the same time – stagger the shutdown so as not to hang quorum arbitration and risk corrupting data. At this point, if all VMs are shut down, we can use vCenter to shut down all ESX/ESXi hosts and then shut down the vCenter server itself. Authentication should no longer be needed, so let’s shut down the domain controllers. Getting to the end of the list, we can shut down shared storage, SAN switches, and networking equipment (in that order). Lastly, we pull the plug on phone systems, Twitter, cooling, and then sever the link to street power. No really, just kidding – Twitter is not that much of a dependency. I can quit Twitter any time I want.
Now that we know the shutdown order, the startup order is typically simple – it is the reverse of the shutdown order. Example: Throw the switch for street power. Engage cooling. Turn on the PBX. Fire up the network switches and routers. SAN switches (go grab a coffee), then shared storage. Domain controllers, ESX hosts, vCenter, app servers, blah blah blah. You get the idea.
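For readers who would rather let a script do the sorting, here is a minimal sketch of the idea in Python. It assumes you have already captured your dependencies as a simple map; the component names and the `DEPENDS_ON` map below are hypothetical and far smaller than any real environment. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) to derive a startup order in which nothing comes up before the things it depends on, and it treats the shutdown order as the reverse of that.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each component lists what it depends on.
# Replace these names with the components of your own environment.
DEPENDS_ON = {
    "street_power": [],
    "cooling": ["street_power"],
    "network": ["street_power", "cooling"],
    "san_switches": ["street_power", "cooling"],
    "shared_storage": ["san_switches"],
    "domain_controllers": ["network"],
    "database_cluster": ["shared_storage", "domain_controllers", "network"],
    "app_servers": ["database_cluster", "domain_controllers", "network"],
    "workstations": ["app_servers", "network"],
}


def startup_order(depends_on):
    """Return components so that every dependency precedes its dependents."""
    # static_order() raises graphlib.CycleError if the map contains a loop,
    # which is itself a useful sign that the documentation needs another look.
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Shutdown is taken to be the reverse of the startup order."""
    return list(reversed(startup_order(depends_on)))


if __name__ == "__main__":
    print("Startup: ", " -> ".join(startup_order(DEPENDS_ON)))
    print("Shutdown:", " -> ".join(shutdown_order(DEPENDS_ON)))
```

A map like this usually allows more than one valid order, so the script's output is a starting point for the playbook rather than the playbook itself. Details such as staggering cluster nodes still need a human to write them down.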
Everyone on your staff has both lists above memorized, right? If not, you need to get them documented in a shutdown/startup playbook. I don’t feel one needs complex software or hired technical writers to put this together. If you understand the dependencies, 85% of the work is already done. My solution for what I put together was embarrassingly simple: Microsoft Excel.

The tool itself doesn’t need to be incredibly complex, but that doesn’t mean your shutdown/startup order will be as simple. In the spreadsheet I maintain for my environment, I have a few hundred rows of information and many columns representing branch dependencies. I also have a few different tabs in the spreadsheet with slightly different orders. This is because we have multiple SANs and if we’re only shutting down one of the SANs for planned maintenance, we only need to shut down its dependencies and not the entire datacenter including the other SANs.
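The "one SAN only" case can be sketched the same way. The example below is an illustration in Python, not how my spreadsheet works: the `DEPENDS_ON` map, the component names, and both helper functions are hypothetical. It walks the map to find everything that depends on a given component, directly or indirectly, and then orders just that subset for shutdown, leaving the other SAN and its dependents alone.

```python
from graphlib import TopologicalSorter

# Hypothetical map with two SANs; each component lists what it depends on.
DEPENDS_ON = {
    "san_a": [],
    "san_b": [],
    "db_cluster_1": ["san_a"],
    "esx_host_1": ["san_a"],
    "esx_host_2": ["san_b"],
    "app_1": ["db_cluster_1"],
    "app_2": ["esx_host_2"],
}


def affected_by(target, depends_on):
    """Return target plus everything that depends on it, directly or not."""
    affected = {target}
    changed = True
    while changed:  # keep sweeping until no new dependents are found
        changed = False
        for node, deps in depends_on.items():
            if node not in affected and affected.intersection(deps):
                affected.add(node)
                changed = True
    return affected


def partial_shutdown_order(target, depends_on):
    """Shutdown order for target and its dependents only; target goes last."""
    affected = affected_by(target, depends_on)
    # Keep only the edges that stay inside the affected subset.
    subgraph = {n: [d for d in depends_on[n] if d in affected] for n in affected}
    startup = list(TopologicalSorter(subgraph).static_order())
    return list(reversed(startup))


if __name__ == "__main__":
    print(partial_shutdown_order("san_a", DEPENDS_ON))
```

Each tab in the spreadsheet corresponds roughly to one call like `partial_shutdown_order("san_a", DEPENDS_ON)`: same dependency data, different starting point.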
Like many other types of documentation, the shutdown/startup order should be considered a living, breathing document that needs periodic care and feeding. When new servers, infrastructure, or applications are brought into the environment, this document needs to be updated to remain current. When datacenter components are removed, again, a document update is needed. We’ve got a formal server turnover checklist which catches loose ends like this. Any server that goes into production must have all the items on its checklist completed first (i.e., all documentation complete, added to backup schedule, added to server security plan, etc.). Likewise, we also maintain a formal server retirement checklist to make sure we’re not trying to back up retired servers or consume static IP addresses of retired servers.
As our team becomes more distributed and expertise is honed to specific areas of the organization, it is important that all staff members responsible for the environment understand what is required to shut it down, whether quickly or in a planned fashion. That means good documentation. Better documentation also means your peers have the tools needed to do your job while you’re gone, and there is less chance you’ll be called in the middle of the night or while on vacation.




Hi Jason
Great article about a subject nobody thinks about too often. Our data centre provider had a major change to their power delivery setup last year, which resulted in the first full shutdown of our datacentre since we outsourced. We had to pull a run order for shutdown together very quickly and it all went very smoothly in the end. One thing you didn’t mention was application dependencies. Working in the finance sector, for example, tends to produce systems that consist of multiple servers and numerous dependencies. I wonder what other industries have these complex application environments to add into the mix.
Regards
C Stewart (VirtualPro)
@C stewart,
Thanks for the comment. In my article, I implied sorting out application dependencies on other applications where I should have been more explicit about it. In my environment, we have plenty of application dependencies, which makes the startup order for application servers particularly interesting. Depending on how the applications have been architected and the platforms they use, databases, middleware, and applications need to come up in the correct order. Add domain controllers to that mix as well. It’s a safe assumption that domain controllers (referred to as DCs in my screenshot) always need to come up first and shut down last, again, in a staggered fashion like cluster nodes.
Hi Jason,
Found your article again using Google, and again I can’t agree more. Have been evangelizing this for quite a while.
One problem is how to get control of the dependencies. Use makefiles, someone said, but that’s complex and it doesn’t give you an overview.
Thus I am still looking for a graphical tool that can manage my components (servers, installation, etc.), services (dhcp, AD, etc.) and their respective relationships.
Not only that – the tool should help me with the start-up/shut-down plan, including attributes of my components and services like their respective IP addresses, locations, accounts, passwords (?), contact information (just in case) etc.
This would be my ideal tool for well-prepared, controlled DC operations – and it would give good insight when shit hits the fan!!
Just haven’t found this tool yet… Visio won’t cut it. All tips will be appreciated.