Datacenters need shutdown/startup order

January 1st, 2009 by jason

Today I learned of a new blog called Virtual RJ which is owned by Robbert Jan van de Velde (yet another Dutch VMware virtualization enthusiast!).  I was reading an article he had recently written called Making inactive storage active in VirtualCenter.  What hits close to home for me about this article is the need for datacenter playbooks which outline a shutdown/startup order of infrastructure and servers.  Once upon a time, our environment was fairly simple and staff was small.  Although our environment was documented, the need for a formal shutdown/startup order was not so prevalent.  Over the years, staff has grown, new applications have been introduced to the environment, and the number of servers grew into the hundreds.  Not to mention, storage got out of control and with that we brought in SAN infrastructures.

Unless your datacenter is the size of a broom closet, chances are you cannot easily get away with throwing the master power switch to bring up infrastructure and servers in the right order.  Obviously you’re not going to use a power switch to shut everything down ungracefully either, but what may not be so obvious is that a graceful shutdown or startup of servers and infrastructure in random order may not be the best choice considering the health of the environment.

In order to understand the correct shutdown/startup order for your environment, you need to fully understand the web of datacenter dependencies, which can range from simple to highly complex.  Knowing your datacenter dependencies means having good documentation of its components:  servers (including clusters), applications, storage, authentication, network, power, cooling, etc.  Virtualization adds a layer as well, as I will show in a moment.  Let’s look at a few high level examples of dependencies:

  • Users depend on applications, workstations, network, VDI, etc.
  • Applications depend on databases, network, authentication, storage, other applications, etc.
  • Highly available databases depend on shared storage, clustered servers, etc.
  • Clustered servers depend on shared storage, authentication, network, quorum, etc.
  • Shared storage and network depend on power and cooling.
  • Consolidated virtual infrastructures (including VDI) depend on everything.

The list above may not completely fit your environment, but it should start to get you thinking about what and where the dependencies are in your environment.  Let me re-emphasize that without knowledge of how data flows in your environment, you won’t be able to come up with an accurate dependency tree.  Shutdown and startup orders aside, you’re in a scary position.  Start documenting quickly.  Talk to your peers, developers, managers, etc. to tie your datacenter components together.

So what does the dependency list above mean and how does it translate into a shutdown/startup order?  Well, workstations and VDIs typically have nothing else depending on them and can be shut down first.  Application servers (including VMs) can be shut down next (except for the vCenter server – we’ll need that to shut down VMs and hosts).  Database cluster shutdown follows with the caveat that not all cluster nodes should be shut down at the same time – stagger the shutdown so as not to hang quorum arbitration and risk corrupting data.  At this point, if all VMs are shut down, we can use vCenter to shut down all ESX/ESXi hosts and then shut down the vCenter server itself.  Authentication should no longer be needed by now, so let’s shut down the domain controllers.  Getting to the end of the list, we can shut down shared storage, SAN switches, and networking equipment (in that order).  Lastly, we pull the plug on phone systems, Twitter, cooling, and then sever the link to street power.  No really, just kidding – Twitter is not that much of a dependency.  I can quit Twitter any time I want.
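If the dependencies are written down as data, a shutdown order falls out of a topological sort.  What follows is only a minimal sketch in Python (3.9 or later, for the standard library's graphlib module).  The component names and the dependency map are simplified assumptions made up for illustration, not a real environment, and details such as keeping vCenter running until the hosts are down are left out.

```python
from graphlib import TopologicalSorter

# Hypothetical, simplified map: component -> the components it depends on.
DEPENDS_ON = {
    "workstations and VDI": {"app servers", "network"},
    "app servers": {"database cluster", "domain controllers", "network"},
    "database cluster": {"shared storage", "domain controllers", "network"},
    "domain controllers": {"network"},
    "shared storage": {"SAN switches"},
    "SAN switches": {"power and cooling"},
    "network": {"power and cooling"},
    "power and cooling": set(),
}


def shutdown_order(depends_on):
    """Return components so each one is shut down before anything it depends on."""
    # static_order() lists dependencies first, which is a valid startup order.
    # Reversing it puts every dependent ahead of its dependencies.
    startup = list(TopologicalSorter(depends_on).static_order())
    return startup[::-1]


if __name__ == "__main__":
    for step, component in enumerate(shutdown_order(DEPENDS_ON), start=1):
        print(f"{step}. {component}")
```

Components that do not depend on each other (network and SAN switches in this map) may come out in either order.  The sort only guarantees that nothing is shut down while something that needs it is still running.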

Now that we know shutdown order, startup order is typically simple – startup order is the reverse or inverse of the shutdown order.  Example:  Throw the switch for street power.  Engage cooling.  Turn on the PBX.  Fire up the network switches and routers.  SAN switches (go grab a coffee) then shared storage.  Domain controllers, ESX hosts, vCenter, app servers, blah blah blah.  You get the idea.
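A hand-written startup list can also be checked mechanically against the dependencies.  The sketch below takes the startup steps from the example above and tests them against a dependency map.  The map and the function name first_violation are assumptions for illustration only; your own dependencies will differ.

```python
# Startup steps, in the order given in the example above.
STARTUP_ORDER = [
    "street power", "cooling", "PBX", "network", "SAN switches",
    "shared storage", "domain controllers", "ESX hosts", "vCenter",
    "app servers",
]

# Hypothetical dependency map: component -> what must already be running.
DEPENDS_ON = {
    "cooling": {"street power"},
    "PBX": {"street power"},
    "network": {"street power", "cooling"},
    "SAN switches": {"street power", "cooling"},
    "shared storage": {"SAN switches"},
    "domain controllers": {"network"},
    "ESX hosts": {"shared storage", "network"},
    "vCenter": {"ESX hosts", "domain controllers"},
    "app servers": {"ESX hosts", "domain controllers"},
}


def first_violation(order, depends_on):
    """Return (component, missing_dependency) for the first step that would
    start before something it needs, or None if the order is consistent."""
    started = set()
    for component in order:
        for dependency in sorted(depends_on.get(component, ())):
            if dependency not in started:
                return component, dependency
        started.add(component)
    return None


if __name__ == "__main__":
    print(first_violation(STARTUP_ORDER, DEPENDS_ON))        # consistent order
    print(first_violation(STARTUP_ORDER[::-1], DEPENDS_ON))  # shutdown order used as startup
```

Running the shutdown list through the same check as if it were a startup list fails on its very first step, which is a quick sanity test that the two lists really are mirror images.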

Everyone on your staff has both lists above memorized, right?  If not, you need to get it documented in a shutdown/startup playbook.  I don’t feel one needs complex software or hired technical writers to put this together.  If you understand the dependencies, 85% of the work is already done.  My own solution was embarrassingly simple:  Microsoft Excel.

The tool itself doesn’t need to be incredibly complex; however, that doesn’t mean your shutdown/startup order will be as simple.  In the spreadsheet I maintain for my environment, I have a few hundred rows of information and many columns representing branch dependencies.  I also have a few different tabs in the spreadsheet with slightly different orders.  This is because we have multiple SANs and if we’re only shutting down one of the SANs for planned maintenance, we only need to shut down its dependencies and not the entire datacenter including the other SANs.
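The "one SAN only" case amounts to finding everything that directly or indirectly depends on that SAN.  Here is a small sketch of that lookup, assuming the spreadsheet rows have been exported into a dependency map.  The server names, the map, and the function affected_by are hypothetical.

```python
from collections import deque

# Hypothetical map: component -> the components it depends on.
DEPENDS_ON = {
    "san-a": set(),
    "san-b": set(),
    "db-cluster-1": {"san-a"},
    "esx-host-1": {"san-a"},
    "esx-host-2": {"san-b"},
    "app-1": {"db-cluster-1"},
    "app-2": {"esx-host-2"},
}


def affected_by(target, depends_on):
    """Return everything that directly or indirectly depends on target."""
    # Invert the map once: component -> the components that depend on it.
    dependents = {}
    for component, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, set()).add(component)

    # Breadth-first walk outward from the component being taken down.
    seen = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


if __name__ == "__main__":
    print(sorted(affected_by("san-a", DEPENDS_ON)))
```

In this made-up map, maintenance on san-a touches db-cluster-1, esx-host-1, and app-1, while everything hanging off san-b stays up.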

Like many other types of documentation, the shutdown/startup order should be considered a living, breathing document that needs periodic care and feeding.  When new servers, infrastructure, or applications are brought into the environment, this document needs to be updated to remain current.  When datacenter components are removed, again, a document update is needed.  We’ve got a formal server turnover checklist which catches loose ends like this.  Any server that goes into production must have all the items on its checklist completed first (i.e. all documentation complete, added to backup schedule, added to server security plan, etc.).  Likewise, we also maintain a formal server retirement checklist to make sure we’re not trying to back up retired servers or consume static IP addresses of retired servers.
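One way to notice a playbook that has fallen behind is to diff it against whatever inventory you trust.  This is a minimal sketch, assuming both can be exported as plain lists of server names.  The function playbook_drift and the sample names are hypothetical.

```python
def playbook_drift(inventory, playbook):
    """Compare a server inventory with the names listed in the playbook.

    Returns (missing, stale): servers in the inventory that the playbook
    does not mention, and playbook entries that are no longer in the inventory.
    """
    inventory, playbook = set(inventory), set(playbook)
    return sorted(inventory - playbook), sorted(playbook - inventory)


if __name__ == "__main__":
    missing, stale = playbook_drift(
        inventory=["app-01", "db-01", "db-02"],
        playbook=["app-01", "db-01", "web-old"],
    )
    print("add to playbook:", missing)
    print("remove from playbook:", stale)
```

A check like this fits naturally next to the turnover and retirement checklists, since those are the two events that make the lists diverge.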

As our team becomes more distributed and expertise is honed to specific areas of the organization, it is important that all staff members responsible for the environment understand the requirements to shut it down quickly or in a planned fashion.  That means good documentation.  Better documentation also means your peers have the tools needed to do your job while you’re gone and less chance you’ll be called in the middle of the night or while on vacation.

  1. C stewart says:

    Hi Jason

    Great article about a subject nobody thinks about too often. Our data centre provider had a major change to their power delivery setup last year which resulted in the first full shutdown of our datacentre since we outsourced. We had to pull a run order for shutdown together very quickly and it all went very smoothly in the end. One thing you didn’t mention was application dependencies. Working in the finance sector, for example, we tend to have systems that consist of multiple servers and numerous dependencies. I wonder what other industries have these complex application environments to add into the mix.

    Regards

    C Stewart ( VirtualPro)

  2. jason says:

    @C stewart,

    Thanks for the comment. In my article, I implied sorting out application dependencies on other applications where I should have been more explicit about it. In my environment, we have plenty of application dependencies, which makes the startup order for application servers particularly interesting. Depending on how the applications have been architected and the platforms they use, databases, middleware, and applications need to come up in the correct order. Add domain controllers to that mix as well. It’s a safe assumption that domain controllers (referred to as DCs in my screenshot) always need to come up first and shut down last, again, in a staggered fashion like cluster nodes.

  3. Mark Bouman says:

    Hi Jason,

    Found your article again using Google, and again I can’t agree more. Have been evangelizing this for quite a while.

    One problem is how to get control of the dependencies. Use makefiles, someone said, but that’s complex and it doesn’t leave you with an overview.

    Thus I am still looking for a graphical tool that can manage my components (servers, installation, etc.), services (dhcp, AD, etc.) and their respective relationships.
    Not only that – the tool should help me with start-up shut-down plan, including attributes of my components and services like their respective IP addresses, locations, accounts, passwords (?), contact information (just in case) etc.

    This would be my ideal tool for preparing controlled DC operations – and it will give good insight when shit hits the fan!!

    Just haven’t found this tool yet… Visio won’t cut it. All tips will be appreciated.

  4. Felipe Pereira says:

    I thought about something related to Ubuntu upstart.
    There are cases when you need to specify what I’d call internal order, I mean, a way to specify dependency using services inside a VM:
    vm1.service2 depends vm2.service1
    vm2.service1 depends vm1.service1

    It would be nice to have this granularity. Do not consider only virtual. A VM may depend on a physical host. So let’s say:
    host1.service2 depends host2.service1
    etc.
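The "depends" notation above is simple enough to parse directly.  The sketch below reads such rules into a graph and derives a start order.  It also shows why service-level granularity matters: collapsed to whole VMs, the same two rules become a cycle (vm1 needs vm2, vm2 needs vm1) that cannot be ordered.  The parser, the function names, and the exact rule syntax are assumptions for illustration, and Python 3.9 or later is assumed for graphlib.

```python
from graphlib import CycleError, TopologicalSorter

RULES = """
vm1.service2 depends vm2.service1
vm2.service1 depends vm1.service1
"""


def parse_rules(text):
    """Parse 'a depends b' lines into a map: service -> services it needs."""
    graph = {}
    for line in text.strip().splitlines():
        dependent, keyword, dependency = line.split()
        if keyword != "depends":
            raise ValueError(f"unexpected rule: {line!r}")
        graph.setdefault(dependent, set()).add(dependency)
        graph.setdefault(dependency, set())
    return graph


def host_level(graph):
    """Collapse 'host.service' names to hosts, ignoring same-host edges."""
    hosts = {}
    for dependent, dependencies in graph.items():
        host = dependent.split(".")[0]
        hosts.setdefault(host, set())
        for dependency in dependencies:
            other = dependency.split(".")[0]
            hosts.setdefault(other, set())
            if other != host:
                hosts[host].add(other)
    return hosts


if __name__ == "__main__":
    services = parse_rules(RULES)
    print(list(TopologicalSorter(services).static_order()))
    try:
        list(TopologicalSorter(host_level(services)).static_order())
    except CycleError:
        print("no valid order at whole-VM granularity")
```

At service level the start order is vm1.service1, then vm2.service1, then vm1.service2, so both VMs have to be powered on before the services are started in sequence.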

    So this datacenter-start-stop system doesn’t need to be tied to VMware. It should at least have a module that talks with vCenter/hosts. There could be another module for each operating system, and maybe even for management cards (RSA/DRAC/iLO) and IPMI.

    Those modules should help write the dependency logic. You should have ways to:
    – change/read host on/off status
    – change/read service on/off status
    – service could be:
        – storage (including SAN, iSCSI, NFS)
        – network service
        – application running
        – hardware (network card up?)

    We have some hosts that use each other’s NFS servers and it’s a little bit messy. I was looking for a solution and hit this page. I mean:
    vm1 needs vm2:/share, vm3:/share2
    vm2 needs vm4:/share

    I think one could use VMware Tools custom scripts to say “allow shutdown”/“deny shutdown” for a vm. From the example above, if I try to shut down vm4 and vm2 is still up and mounting its vm4:/share, vm4’s custom script vm-poweroff should return “deny”, possibly with an error message saying why.
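The allow/deny decision itself is easy to express.  The sketch below covers only the decision logic for the NFS example above and is not wired to VMware Tools or vCenter in any way.  The NEEDS map, the function name, and the message format are assumptions for illustration.

```python
# From the NFS example above: which VMs each VM mounts shares from.
NEEDS = {
    "vm1": {"vm2", "vm3"},  # vm1 needs vm2:/share, vm3:/share2
    "vm2": {"vm4"},         # vm2 needs vm4:/share
}


def shutdown_decision(vm, running, needs):
    """Return ("allow", "") or ("deny", reason) for shutting down vm.

    The shutdown is denied while any other running VM still needs this one.
    """
    blockers = sorted(
        other for other in running
        if other != vm and vm in needs.get(other, ())
    )
    if blockers:
        return "deny", f"{vm} is still needed by: {', '.join(blockers)}"
    return "allow", ""


if __name__ == "__main__":
    print(shutdown_decision("vm4", {"vm1", "vm2", "vm3", "vm4"}, NEEDS))
    print(shutdown_decision("vm4", {"vm3", "vm4"}, NEEDS))
```

With everything running, vm4 is denied because vm2 still mounts its share.  Once vm1 and vm2 are down, the same request is allowed.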

    That could be extended. vCenter could ask via custom script (say “vm-depends”) to a vm which vm’s are its dependencies. It’s getting complicated. We would need a way for a vm to say that needs a service and not a vm (think of a DNS service that could be provided by more than one vm and other clustered services).

  5. jaime says:

    Hi, I have a question regarding an application maintenance status checklist for server power shutdown. I’m looking for a simple one.

  6. Rubin Simons says:

    I stumbled upon this thread because my software company, RAAF Technology is creating software specifically for the purpose of doing state control of services spread out over multiple systems. One of the components of this software is called Session and is used in various large scale environments to statefully start and stop large application chains on many systems (350+ systems, many more services on those systems).

    It’s open-source and interested parties can download Session from its homepage here: http://www.openpcf.org/projects/session/

  7. Steven Spray says:

    Great guide but I just have one question.

    How do you turn off the DCs when you’ve already shut down the host(s) on which the DCs live?

    Unless, of course, the DCs are the only physical server(s) :/