Gradwell January Stability Improvements
Dear Customers,
Starting from Monday 19th January, core Gradwell services (email, web hosting and VoIP) suffered from repeated outages. Each outage lasted several hours at a time, and all customers were affected.
The team have now completed all planned work to improve the stability of our services, and are monitoring the situation to determine whether any further emergency work is required. I apologise for the disruption caused by these outages, and want to reassure our customers that the stability of our services continues to be my top priority.
In this blog post, my Technical Manager, Stuart, and I want to explain to our customers what went wrong and the changes we have made to put it right.
Ultimately, the root cause was technical equipment failure, but our design decisions exacerbated the situation. We have received, and are grateful for, a large amount of customer feedback, and have identified opportunities to improve our communication, our operational processes and the management of both our operations and customer support teams.
Some of these improvements have been implemented immediately; others will take a little longer. However, Gradwell is a business well supported by our customers, our staff (and even our bankers!), and there is no question that we can move forward, work with customers to put things right where required, and reconfirm our position as a leading provider of Internet Services to UK Small Business.
Stuart explains below in more detail what went wrong and what action we have taken to prevent its recurrence. We caused severe disruption to everyone, both customers and staff, and we sincerely regret our failure to deliver excellent service in the last few weeks.
Yours sincerely
Peter Gradwell
e: peter@gradwell.com, t: 01225 800 810
Technical Overview
Gradwell relies heavily on server virtualisation, using the industry-leading VMware ESX platform. All of our services run on the ESX platform and rely on virtualised storage running on dedicated networked storage servers.
On Monday 19th January, we began to experience multiple issues with two of our networked storage servers, caused by a combination of faulty hardware and our placing too much load on those servers. These simultaneous issues caused the networked storage servers to hang, which in turn caused all of the ESX servers to hang. During each outage, we recovered services by resetting the networked storage servers and then resetting all of the ESX servers. Regrettably, rebooting all of these servers in the correct sequence took several hours from start to finish, leaving our services unavailable for extended periods of time.
Gradwell has addressed these issues as follows:
- The faulty hardware has been either replaced or removed from service.
- We have purchased and installed five additional HP storage servers, partly to replace the failed hardware, partly to reduce the demands on the remaining storage servers, and partly to ensure better redundancy against future failures.
We apologise for the length of time it has taken to complete the work to address these issues. This was due to the lead times for purchasing, testing and installing the new storage servers.
We apologise for the inconvenience caused by these outages, and want to reassure our customers that we will continue to improve our infrastructure to ensure we do not have future outages like this.
If you’d like the full technical details of the faults and how they have been addressed, please read on.
