VMware woes with Disk IO & Possible Solution

Posted by Benjamin Close on August 12, 2009 under ClearChain, Computers

Recently the server hosting clearchain.com (aka Leo) has been having disk IO errors. This has had me quite perplexed. You see, Leo is a virtual machine running on redundant hardware as part of a VMware ESX cluster. Hence, whilst I can understand slow performance and delayed access at times, disk IO errors don’t make sense.

You see, Leo has a virtual disk. This virtual disk is provided by software running on the ESX host. The only way disk IO errors should occur is if the physical media it runs on has errors, or if something in between the physical media and the machine providing the virtual disk has errors. I’ve been assured by our hosting provider (Hmon, a great little company) that neither of these situations has happened.

Despite that, disk errors (virtual in this case) have been occurring:

Aug 11 02:00:49 leo root: ZFS: vdev I/O failure, zpool=tank path=/dev/da0s1e offset=103662968832 size=16384 error=5
Aug 11 02:00:49 leo root: ZFS: vdev I/O failure, zpool=tank path=/dev/da0s1e offset=103662968832 size=8192 error=5
Aug 11 02:00:49 leo root: ZFS: vdev I/O failure, zpool=tank path=/dev/da0s1e offset=103662977024 size=8192 error=5
Aug 11 02:00:49 leo root: ZFS: vdev I/O failure, zpool=tank path=/dev/da0s1e offset=169141542912 size=98304 error=5
Aug 11 02:00:49 leo root: ZFS: vdev I/O failure, zpool=tank path=/dev/da0s1e offset=169141542912 size=1024 error=5
Aug 11 02:00:49 leo root: ZFS: vdev I/O failure, zpool=tank path=/dev/da0s1e offset=169141543936 size=1024 error=5
Aug 11 02:00:49 leo root: ZFS: vdev I/O failure, zpool=tank path=/dev/da0s1e offset=169141544960 size=1024 error=5
Aug 11 02:00:49 leo root: ZFS: vdev I/O failure, zpool=tank path=/dev/da0s1e offset=169141545984 size=1024 error=5
Aug 11 02:00:49 leo root: ZFS: vdev I/O failure, zpool=tank path=/dev/da0s1e offset=169141547008 size=1024 error=5
Aug 11 02:00:49 leo root: ZFS: vdev I/O failure, zpool=tank path=/dev/da0s1e offset=169141548032 size=1024 error=5
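
To see at a glance which device and error code are involved, log lines like these can be summarised with a short script. The following is only a minimal sketch: the regular expression is assumed from the format of the lines above, and the function names (parse_failures, summarise) are made up for illustration. Note that error=5 is the errno value EIO (input/output error).

```python
import re
from collections import Counter

# Pattern assumed from the log lines shown above; other ZFS versions
# may format this message differently.
LINE_RE = re.compile(
    r"ZFS: vdev I/O failure, zpool=(?P<pool>\S+) path=(?P<path>\S+) "
    r"offset=(?P<offset>\d+) size=(?P<size>\d+) error=(?P<error>\d+)"
)


def parse_failures(lines):
    """Return one dict per matching line, with numeric fields as int."""
    failures = []
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a vdev I/O failure line
        record = match.groupdict()
        for key in ("offset", "size", "error"):
            record[key] = int(record[key])
        failures.append(record)
    return failures


def summarise(failures):
    """Count failures per device and per error code, and total the sizes."""
    per_device = Counter(f["path"] for f in failures)
    per_error = Counter(f["error"] for f in failures)
    total_bytes = sum(f["size"] for f in failures)
    return per_device, per_error, total_bytes


# Two of the lines from the log above, used as sample input.
sample = [
    "Aug 11 02:00:49 leo root: ZFS: vdev I/O failure, zpool=tank "
    "path=/dev/da0s1e offset=103662968832 size=16384 error=5",
    "Aug 11 02:00:49 leo root: ZFS: vdev I/O failure, zpool=tank "
    "path=/dev/da0s1e offset=103662968832 size=8192 error=5",
]

if __name__ == "__main__":
    print(summarise(parse_failures(sample)))
```

In practice the lines would be read from the system log file rather than from a hard-coded list.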

A little perplexed, I started trying to work out why. Then I hit the article: http://virtualgeek.typepad.com/virtual_geek/2009/06/vmware-io-queues-micro-bursting-and-multipathing.html
and it all made sense. The problem is that Unix machines are much harder on disks, sending many more requests to them than a standard Windows machine. Looking at the logs, it’s clear there are lots of requests happening very rapidly. Whilst the disk IO (iSCSI backplanes in this case) can keep up, the burst rate of requests (IOs per second) that the virtualisation layer can absorb appears to be too low. Hence VMware panics as it can’t service the requests via the vSCSI queues quickly enough, and the virtual machine (Leo in this case) gets told there is a disk error. All the while the physical disks in the virtual cluster haven’t even been stressed.
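
The idea that a burst can overflow a queue even when average throughput is fine can be illustrated with a toy model. This is a sketch of the theory only, not of how ESX actually behaves: the queue depth, service rate and request counts are invented numbers, the simulate function is hypothetical, and treating an overflowing request as an immediate failure is a simplifying assumption.

```python
def simulate(arrivals, queue_depth, served_per_tick):
    """Feed per-tick arrival counts into a bounded queue.

    Returns (rejected, peak_queue). A rejected request stands in for an
    I/O the guest would see fail; that mapping is an assumption of this
    toy model, not documented hypervisor behaviour.
    """
    queued = 0
    rejected = 0
    peak = 0
    for arriving in arrivals:
        # Accept as many requests as fit in the queue, reject the rest.
        space = queue_depth - queued
        accepted = min(arriving, space)
        rejected += arriving - accepted
        queued += accepted
        peak = max(peak, queued)
        # The backend services up to served_per_tick requests per tick.
        queued -= min(queued, served_per_tick)
    return rejected, peak


# Same total number of requests (400 over 10 ticks), different shape.
steady = [40] * 10
bursty = [400] + [0] * 9

if __name__ == "__main__":
    print("steady:", simulate(steady, queue_depth=64, served_per_tick=50))
    print("bursty:", simulate(bursty, queue_depth=64, served_per_tick=50))
```

With these made-up numbers the steady load never fills the queue, while the bursty load overflows it on the first tick even though the backend could comfortably handle the same total spread over time. A deeper queue lets more of the burst be absorbed, which is the reasoning behind asking for larger queues.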

Whilst this is currently only a theory of mine, I’ve asked Hmon to look into increasing the vSCSI queue sizes to help deal with bursts of IO better.




  • Gustaf said,

    Hi!

    I’m getting similar issues! Have you found a solution?

    regards,

    Gustaf
