Gradwell January Stability Improvements
Dear Customers,
Starting from Monday 19th January, core Gradwell services (email, web hosting and VoIP) suffered from repeated outages. Each outage lasted several hours at a time, and all customers were affected.
The team have completed all planned work to improve the stability of our services, and are now monitoring the situation to determine whether any further emergency work is required. I apologise for the disruption caused by these outages, and want to reassure our customers that the stability of our services continues to be my top priority.
In this blog post, my Technical Manager, Stuart, and I want to explain to our customers what went wrong, and the changes we have made to put it right.
Ultimately, the root cause is technical equipment failure, but our design decisions have exacerbated the situation. We have received, and are grateful for, the large amount of customer feedback and have identified opportunities to improve our communication, operational process and the management of both our operations and customer support teams.
Some of these improvements have been implemented immediately; others will take a little longer. However, Gradwell is a well-supported business, backed by our customers, our staff (and even our bankers!), and there is no question that we can move forward, work with customers to put things right where required, and reconfirm our position as a leading provider of Internet Services to UK Small Business.
Stuart explains below in more detail what went wrong and the action we’ve taken to prevent its recurrence. We caused severe disruption to everyone, customers and staff alike, and we sincerely regret our failure to deliver excellent service in the last few weeks.
Yours sincerely
Peter Gradwell
e: peter@gradwell.com, t: 01225 800 810
Technical Overview
Gradwell relies heavily on server virtualisation, using the industry-leading VMWare ESX platform. All of our services run on the ESX platform, and rely on virtualised storage running on dedicated networked storage servers.
On Monday 19th January, we began to experience multiple issues with two of our networked storage servers - a combination of faulty hardware and trying to work the networked storage servers too hard. These simultaneous issues caused the networked storage servers to hang, which in turn caused all of the ESX servers to hang. During each individual outage, we recovered services by resetting the networked storage servers and then resetting all of the ESX servers. Regrettably, the process of rebooting all of these servers in the correct sequence took several hours to complete from start to finish, causing our services to be unavailable for extended periods of time.
Gradwell has addressed these issues as follows:
- The faulty hardware has been either replaced or removed from service.
- We have purchased and installed five additional HP storage servers: partly to replace the failed hardware, partly to reduce the demands on the remaining storage servers, and partly to ensure better redundancy against future failures.
We apologise for the length of time it has taken to complete the work to address these issues. This was due to the lead times for purchasing, testing and installing the new storage servers.
We apologise for the inconvenience caused by these outages, and want to reassure our customers that we will continue to improve our infrastructure to ensure we do not have future outages like this.
If you’d like the full technical details of the faults and how they have been addressed, please read on.
Gradwell Architecture prior to January 09
In December 2007, Gradwell made the strategic decision to migrate all products and services to run in a virtualised environment. We chose VMWare ESX as the virtualisation platform, as it is the clear market leader and is backed by a multi-billion dollar US corporation. Until December 2008, this migration had proceeded successfully and given customers improved stability, performance and protection against hardware failure.
In a virtualised environment, physical servers run VMWare’s ESX software, and are known as hosts. Each host in turn runs multiple virtual servers, known as guests or virtual machines. The guests aren’t stored on the disk drives inside the host servers; instead they are stored on dedicated networked storage servers. This is best practice, and is intended to make it easier to get services back up and running when an individual host suffers a hardware failure.
Gradwell’s environment in January relied on four networked storage servers. Every ESX host had access to all of the networked storage servers. iSCSI-1, iSCSI-2 and th-vfile-1 are smaller units, each connected to our core network via a single 1 Gigabit ethernet link. iSCSI-3 is a much more substantial unit, and is connected to our core network via four 1 Gigabit ethernet links. All four networked storage servers share their hard drives with the ESX hosts using the iSCSI network protocol.
As events have proven, this architecture was vulnerable to a number of issues. Here’s a list of the issues that occurred, with details of how they have been addressed.
Issue 1: Network Storage Server “iSCSI-3” Network Unbalancing
On Sunday 17th January, we performed substantial maintenance on our ESX platform, upgrading the majority of the HP DL380s (the host servers) with four additional CPU cores and the latest firmware. As each ESX host was powered back on, the majority of the ESX hosts started communicating with the “iSCSI-3” network storage server down the same ethernet link, instead of traffic being balanced over all four network links.
The unbalancing was caused by a bug in VMWare that failed to persist the network multiple-path configuration.
This unbalanced network traffic eventually caused the disk controller in iSCSI-3 to time out, causing VMWare ESX to crash. This unbalancing was corrected once it was uncovered.
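To illustrate the unbalancing, here is a minimal Python sketch. It is not our actual configuration or tooling: the host names, the host count, and the "fall back to the first link" behaviour are assumptions made for illustration, and it models the worst case where every host (rather than the majority) picks the same link. Only the figure of four links on iSCSI-3 comes from the description above.

```python
from collections import Counter

LINKS = 4  # iSCSI-3 is connected via four 1 Gigabit ethernet links


def assign_paths(hosts, balanced):
    """Return a mapping of host name -> link index (0 to LINKS - 1)."""
    if balanced:
        # Intended behaviour: spread hosts evenly across the four links.
        return {host: i % LINKS for i, host in enumerate(hosts)}
    # Assumed failure mode: the multipath settings are lost on reboot,
    # so every host falls back to the same (first) link.
    return {host: 0 for host in hosts}


def hosts_per_link(assignment):
    """Count how many hosts are using each link."""
    counts = Counter(assignment.values())
    return [counts.get(link, 0) for link in range(LINKS)]


# Hypothetical host names and count, for illustration only.
hosts = [f"esx-{n}" for n in range(1, 9)]

print(hosts_per_link(assign_paths(hosts, balanced=True)))   # [2, 2, 2, 2]
print(hosts_per_link(assign_paths(hosts, balanced=False)))  # [8, 0, 0, 0]
```

In the unbalanced case a single 1 Gigabit link carries the traffic that was meant to be shared across four, which is the kind of load that led to the disk controller timeouts described above.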
Issue 2: Network Storage Server “iSCSI-2” Failures
During the same maintenance on Sunday 17th January, a number of ESX hosts were unable to reconnect to iSCSI-2 after being rebooted. We made the decision to transfer a number of guests onto iSCSI-3 to avoid further disruption to our services. This left iSCSI-2 still in service, but running with reduced demand.
Over the following week, we moved all guests off of iSCSI-2, so that we could reset iSCSI-2 to clear the glitch and update its software to the same version used on iSCSI-3. iSCSI-2 was then brought back into service by Saturday 24th Jan, and guests were moved back onto iSCSI-2 to relieve the load on iSCSI-3.
That proved to be a mistake, as iSCSI-2 then hung on several occasions, causing VMWare ESX to crash. The decision was taken to move all guests off of iSCSI-2 once more, but because of its unreliability that move took over a week to achieve.
Once empty, iSCSI-2 was taken out of service, and disconnected from our VMWare ESX cluster.
Issue 3: Network Storage Server “iSCSI-3” Failures
iSCSI-2 was brought back into service because we believed at the time that iSCSI-3 was still being overloaded, which was causing it to hang and therefore causing VMWare ESX to crash. This turned out to be an incorrect deduction on our part. The real culprit turned out to be a faulty disk drive in iSCSI-3.
Early on, we had noticed that one of the disk drives in iSCSI-3 was reporting errors, and had started the process to remove the faulty disk drive from the RAID array so that it could be replaced. Unfortunately, this process never completed, and it was iSCSI-3’s attempts to read data from the faulty drive that repeatedly caused iSCSI-3 to hang, which also made it difficult for us to diagnose the fault correctly.
The failed hard drive was physically removed from iSCSI-3, and the RAID array was then successfully rebuilt.
Issue 4: ESX Cluster Vulnerability
Although individual failures of iSCSI-2 and iSCSI-3 caused problems in their own right, we discovered the hard way that the entire ESX cluster was vulnerable to secondary problems. Whenever iSCSI-2 or iSCSI-3 hung, all of the VMWare ESX hosts that they were connected to became unstable and had to be rebooted. Once the host was recovered, we then had the laborious task of starting up all of the virtual machines that had been running on the host. This recovery process took several hours each time, greatly prolonging the outages for our customers.
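The recovery order described here (storage servers first, then ESX hosts, then the virtual machines on each host) is a dependency ordering. The sketch below shows one way such an order could be derived, using Python's standard `graphlib` module. It is an illustration only, not the procedure we followed: the host and guest names and the connections between them are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each component to the components it depends on.
# All host and guest names below are hypothetical examples.
depends_on = {
    "esx-1": {"iSCSI-2", "iSCSI-3"},  # hosts depend on their storage
    "esx-2": {"iSCSI-3"},
    "mail-vm": {"esx-1"},             # guests depend on their host
    "web-vm": {"esx-2"},
}

# static_order() yields every component after the things it depends on,
# so storage servers come before hosts, and hosts before guests.
recovery_order = list(TopologicalSorter(depends_on).static_order())
print(recovery_order)
```

Every step in that list has to finish before anything depending on it can start, which is why a hang on one storage server translated into hours of recovery work.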
We addressed this in four ways:
- Instead of every ESX host being connected to all of our storage, we reconfigured the hosts so that (with a couple of exceptions) they only connected to some of our storage. This way, a failure in either iSCSI-2 or iSCSI-3 would no longer crash all of our ESX hosts, and we’d be able to restore service quicker.
- We purchased a number of HP servers and turned them into additional storage servers, so that we could take the faulty iSCSI-2 out of service and reduce the demand on iSCSI-3.
- Instead of using iSCSI on the new storage servers, we switched to using NFS instead. Our research indicated that NFS would prove more reliable than iSCSI, and our testing proved that the new storage servers would perform better using NFS.
- In the past, we’ve split the guests for our services (broadband, email, web hosting, VoIP) evenly across the four storage units. The problems in January proved that this approach magnifies the impact of any one issue, which was not what we intended at the time. We’ve taken the new storage servers, and used them as dedicated storage for specific services - services that are clustered to be more resilient. If any one of these storage servers fails, the impact will be isolated to a subset of our services, and should not be customer-visible.
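The effect of the first change in the list above (connecting each host to only some of our storage) can be sketched in a few lines of Python. The storage server names are real, but the host names and the particular host-to-storage assignments are hypothetical; the sketch only shows why limiting connections shrinks the number of hosts a single storage failure can take down.

```python
def affected_hosts(connections, failed_storage):
    """Return the hosts that are connected to the failed storage server."""
    return sorted(
        host for host, stores in connections.items() if failed_storage in stores
    )


storage = ["iSCSI-1", "iSCSI-2", "iSCSI-3", "th-vfile-1"]
hosts = ["esx-1", "esx-2", "esx-3", "esx-4"]  # hypothetical host names

# Before: every host is connected to every storage server.
full_mesh = {host: set(storage) for host in hosts}

# After: each host is connected to a subset (hypothetical assignment).
partitioned = {
    "esx-1": {"iSCSI-1", "iSCSI-3"},
    "esx-2": {"iSCSI-3", "th-vfile-1"},
    "esx-3": {"iSCSI-1", "th-vfile-1"},
    "esx-4": {"iSCSI-1", "iSCSI-2"},
}

print(affected_hosts(full_mesh, "iSCSI-3"))    # all four hosts
print(affected_hosts(partitioned, "iSCSI-3"))  # ['esx-1', 'esx-2']
```

With the full mesh, a hang on iSCSI-3 reaches every host; with the partitioned layout in this example it reaches only the two hosts that mount it.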
Gradwell Architecture mid-February 2009
With all of these changes made, Gradwell’s architecture has evolved to look like this:
iSCSI-1, iSCSI-3 and th-vfile-1 remain as general-purpose storage for our ESX cluster. Later in the year, we will introduce more dedicated storage to ensure improved resilience against future hardware failure.
th-vfile-2 and th-vfile-3 have been dedicated to providing storage for our email queues. Our email queues create a lot of disk i/o; moving them onto dedicated storage will ensure that they cannot overload any of our other storage devices.
th-vfile-5, th-vfile-6 and th-vfile-7 have been dedicated to providing storage for our web hosting service. (Customer home directories remain on lon-file-3 and lon-file-4). We have six PHP4 servers (two on each storage server) as well as six PHP5 servers (again, two on each storage server), sat behind six Squid proxies (two on each storage server). If any one storage server fails, we should be able to maintain service using the other two.
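As a rough check of the web hosting layout, the following Python sketch counts how many servers of each type survive the loss of one storage server. The storage server names and the two-per-storage-server counts come from the description above; the role labels and the data structure are simply our shorthand for this example, and the sketch says nothing about whether four servers per role can carry the full load.

```python
STORAGE = ["th-vfile-5", "th-vfile-6", "th-vfile-7"]
ROLES = ["php4", "php5", "squid"]  # shorthand labels for this sketch
PER_STORAGE = 2                    # two servers of each role per storage server


def build_layout():
    """Return a list of (role, storage server, instance number) tuples."""
    return [
        (role, store, n)
        for store in STORAGE
        for role in ROLES
        for n in range(1, PER_STORAGE + 1)
    ]


def surviving(layout, failed_storage):
    """Count the servers of each role that are not on the failed storage."""
    counts = {role: 0 for role in ROLES}
    for role, store, _ in layout:
        if store != failed_storage:
            counts[role] += 1
    return counts


layout = build_layout()
print(len(layout))                      # 18 guests in total
print(surviving(layout, "th-vfile-6"))  # {'php4': 4, 'php5': 4, 'squid': 4}
```

Whichever single storage server fails, four of the six servers in each role remain available on the other two.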
Current Status
At the time of writing, we have completed all the work we believe is necessary to stabilise our services. We are monitoring the situation to determine whether or not any additional emergency work is required. We won’t be giving the ‘all-clear’ until we have enjoyed at least two weeks of stability.
Future Work
We are determined to ensure that our customers do not suffer this level of instability in our services in future. From a strategic point of view, we will move away from large lumps of general-purpose storage and continue to introduce smaller groups of storage servers that are dedicated to specific Gradwell services. We’ll also continue to evolve our services to be more resilient to the failure of individual storage servers.


