You may already be aware that installing VMware Tools in a Windows VM configures a registry value which controls the I/O timeout for all Windows disks, helping the guest operating system survive high latency or a short storage outage such as a SAN path failover or a network failure in Ethernet-based storage. VMware Tools changes the Windows default value of 10 seconds for non-cluster nodes (20 seconds for cluster nodes) to 60 seconds (0x3C hex).
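For reference, the value in question is `TimeoutValue` under the Disk service key. A minimal .reg fragment setting the 60-second value that VMware Tools applies would look like this (verify the key on your own build before importing anything):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk]
"TimeoutValue"=dword:0000003c
```

The DWORD is in seconds, so 0x3C equals the 60 seconds described above; a reboot is required for the new value to take effect.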
Did you know that disk I/O timeout is a configurable parameter in other guest operating systems as well? And why not? It makes sense that we'd want every guest OS to be able to outlast a storage deficiency.
NetApp offers a document titled VMware ESX Guest OS I/O Timeout Settings for NetApp Storage Systems. It’s published as kb41511 and you’ll need a free NetApp NOW account to access the document. This white paper serves a few useful purposes:
- Defines recommended disk I/O timeout settings for various guest operating systems on NetApp storage systems
- Defines benchmark disk I/O timeout settings for various guest operating systems which could be used on any storage system, including local SCSI
- In some cases provides scripts to make the necessary changes
- Explains the methods to make the disk I/O timeout changes on the following guest operating systems:
  - Solaris 10
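As one illustration of the kind of change the paper walks through: on Solaris 10, the SCSI disk I/O timeout is governed by the sd driver's sd_io_time tunable, commonly set in /etc/system. A hedged sketch follows; the 190-second value (0xBE) is shown purely as an example of a custom value, so check the NetApp paper for the value it actually recommends for your configuration:

```
* /etc/system fragment -- raises the sd driver disk I/O timeout
* from the 60-second default to 190 seconds (0xBE hex).
set sd:sd_io_time=0xbe
```

As with the Windows registry change, the setting takes effect at the next reboot.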
Now, on the subject of disk I/O timeouts, understand that the above should be treated as a chance to extend the uptime of a VM during adverse storage conditions. As in life, there are no guarantees. A guest OS with high disk I/O activity may not be able to tolerate read and/or write requests stalling for the duration of the timeout value. Windows guests may freeze or BSOD. Linux guests may remount their root volumes read-only, which requires a reboot to recover. Which brings me to the next point…
A larger timeout value isn't necessarily better. By extending disk I/O timeout values, we're applying virtual duct tape over an underlying storage issue that needs further investigation. Given the complex and wide variety of shared storage systems available to the datacenter today, storage issues can be caused by many variables including but not limited to disks (spindles), target controllers, fabric components such as fibre cables, SFPs/GBICs, HBAs, fabric switches, and zoning, and network components such as copper cabling, NICs, network switches, routers, and firewalls. Also keep in mind that while the OS may survive the disk I/O interruption, the application(s) running on it may not. Applications implement their own response timeouts, which are likely hard coded and not configurable by a platform or virtualization administrator.
Lastly, remember that if you go to the effort of increasing disk I/O timeout values on Windows guests beyond 60 seconds, a future installation of VMware Tools or other applications/updates may reset the timeout back to 60 seconds. What this means is that in medium to large environments, you're going to need an automated method to deploy custom disk I/O timeout values, at least for Windows guests. For those with NetApp storage, NetApp pushes these standards firmly, along with other VMware best practices which I'll save for a future blog article.
Update 4/28/10: The VMware Tools for vSphere installer doesn't change the disk timeout setting if a custom value (e.g., 190 seconds) was previously set.
Update 9/12/11: See also VMware KB article 1009465 Increasing the disk timeout values for a Linux 2.6 virtual machine
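For Linux guests, the KB above adjusts the per-device SCSI timeout exposed through sysfs. Here's a minimal sketch; the device name sda and the 180-second target are assumptions, so substitute the devices and value appropriate to your environment. Note the sysfs change is non-persistent and is typically made permanent with a udev rule:

```shell
#!/bin/sh
# Assumed device and timeout value -- adjust for your environment.
DISK=sda
TIMEOUT=180

# The sysfs timeout attribute only exists for SCSI-class disks,
# so guard the write in case the path isn't present.
SYSFS="/sys/block/$DISK/device/timeout"
if [ -w "$SYSFS" ]; then
    echo "current timeout: $(cat "$SYSFS")s"
    echo "$TIMEOUT" > "$SYSFS"   # takes effect immediately, lost on reboot
fi

# To persist across reboots, a udev rule along these lines can be used
# (the file name and path are conventional, not mandated):
#   /etc/udev/rules.d/99-disk-timeout.rules
#   ACTION=="add", SUBSYSTEMS=="scsi", ATTR{timeout}="180"
```

The same write could be placed in a boot script instead of a udev rule, but the rule has the advantage of also covering disks hot-added after boot.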