Gradwell Blog

Service Announcements

Gradwell January Stability Improvements

Dear Customers,

Starting from Monday 19th January, core Gradwell services (email, web hosting and VoIP) suffered from repeated outages.  Each outage lasted several hours at a time, and all customers were affected.

The team have now completed all planned work to improve the stability of our services, and are monitoring the situation to determine whether any further emergency work is required.  I apologise for the disruption caused by these outages, and want to reassure our customers that service stability continues to be my top priority.

In this blog post, my Technical Manager, Stuart, and I want to explain to our customers what went wrong and the changes we have made to put it right.

Ultimately, the root cause was equipment failure, but our design decisions exacerbated the situation. We have received, and are grateful for, a large amount of customer feedback, and have identified opportunities to improve our communication, operational processes and the management of both our operations and customer support teams.

Some of these improvements have been implemented immediately; others will take a little longer. However, Gradwell is a business well supported by its customers, our staff (and even our bankers!), and there is no question that we can move forward, work with customers to put things right where required, and reconfirm our position as a leading provider of Internet Services to UK Small Business.

Stuart explains below in more detail what went wrong and the action we have taken to prevent its recurrence. We caused severe disruption to everyone, both customers and staff, and we sincerely regret our failure to deliver excellent service in the last few weeks.

Yours sincerely

Peter Gradwell

e: peter@gradwell.com, t: 01225 800 810

Technical Overview

Gradwell relies heavily on server virtualisation, using the industry-leading VMware ESX platform.  All of our services run on the ESX platform, and rely on virtualised storage running on dedicated networked storage servers.

On Monday 19th January, we began to experience multiple issues with two of our networked storage servers, caused by a combination of faulty hardware and excessive load on those servers.  These simultaneous issues caused the networked storage servers to hang, which in turn caused all of the ESX servers to hang.  During each individual outage, we recovered services by resetting the networked storage servers and then resetting all of the ESX servers.  Regrettably, rebooting all of these servers in the correct sequence took several hours from start to finish, leaving our services unavailable for extended periods of time.

Gradwell has addressed these issues as follows:

  1. The faulty hardware has been either replaced or removed from service.
  2. We have purchased and installed five additional HP storage servers, partly to replace the failed hardware, partly to reduce the demands on the remaining storage servers, and partly to ensure better redundancy against future failures.

We apologise for the length of time it has taken to complete the work to address these issues.  This was due to the lead times for purchasing, testing and installing the new storage servers.

We apologise for the inconvenience caused by these outages, and want to reassure our customers that we will continue to improve our infrastructure to ensure we do not have future outages like this.

If you’d like the full technical details of the faults and how they have been addressed, please read on.


RE: Important: Upcoming Web Hosting Changes

We recently posted a blog entry informing you about our plans to replace the servers that host your website. Our original plan was to replace the PHP 4 web hosting cluster tonight, 25th November 2008. Unfortunately, due to emergency maintenance on our email service, we have had to reschedule this work for one week later, on the evening of Tuesday 2nd December 2008.

We apologise for any inconvenience caused by this delay.

As always, you can contact the support team for any help that you may need.

Important: Upcoming Web Hosting Changes

We are going to replace the servers that host your website as part of our PHP 4 web hosting cluster on the evening of 25th November 2008. As well as replacing the physical hardware, we are also changing the software we use, moving from FreeBSD to CentOS Linux.

These changes will mean your website runs on top of our “cloud computing” platform, built using VMware, and as such we will be able to maintain our web servers, add new features and add capacity more quickly in the future.

You should be aware that the PHP community has now officially ceased development of PHP 4, and that we also now offer customers a new PHP 5 web hosting cluster. Please contact the support team if you’re interested in migrating.

The new server cluster has been built with the following characteristics:

  • CentOS 5.2
  • Apache 2.2.9 + mpm-itk
  • mod_php 4.4.9
  • Perl 5.8.8 via CGI
  • mod_perl 2.0.4
  • Python 2.4.3 via CGI

WHAT YOU NEED TO DO

Whilst we expect the majority of websites to continue working without any issues, it is important, because of the software changes, that you test your website as soon as possible to see if there are any problems that need to be addressed. To test your website on our new PHP 4 servers, append .v-web-php4.gradwell.net to your site's hostname and point your browser at the result, for example http://www.yourdomain.com.v-web-php4.gradwell.net (with your own domain in place of www.yourdomain.com).
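If you look after several sites, it may be easier to generate the test addresses than to type them by hand. The short Python sketch below is illustrative only: the function name build_test_url and the example.com hostnames are our own placeholders, and it simply applies the naming pattern described above. It does not contact any server.

```python
from urllib.parse import urlsplit

# Suffix of the PHP 4 test system, as given in the announcement above.
TEST_SUFFIX = "v-web-php4.gradwell.net"


def build_test_url(site: str, path: str = "/") -> str:
    """Return the test-system URL for a site (hypothetical helper).

    `site` may be a bare hostname ("www.example.com") or a full URL.
    """
    # urlsplit only recognises the hostname when the input contains "//".
    host = urlsplit(site if "//" in site else "//" + site).hostname
    if not host:
        raise ValueError(f"could not find a hostname in {site!r}")
    if not path.startswith("/"):
        path = "/" + path
    return f"http://{host}.{TEST_SUFFIX}{path}"


if __name__ == "__main__":
    # Placeholder hostnames: replace these with your own sites.
    for site in ["www.example.com", "shop.example.com"]:
        print(build_test_url(site, "/index.php"))
```

Open each printed address in your browser and check that the pages behave as they do on your live site.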

We will be switching off our current FreeBSD PHP 4 servers late on Tuesday 25th November 2008. Apart from checking that your site is compatible, you do not need to make any changes to your site or DNS, and there will only be a brief period of downtime for your website.