AAISP.net Broadband - Broadband you can work with


6th Jan 2009 a very long day with very little internet

On 6th Jan 2009, from shortly before 8am until nearly 6pm, all of our 20CN broadband customers (about 90% of customers) had no broadband internet connection. This is, we believe, the longest major outage our service has ever had, and we sincerely apologise for any inconvenience this has caused customers.

What happened

Firstly, the power tripped out in a rack in Telecity. This was somewhat unexpected, and we quickly arranged for power to be restored. However, it was then apparent that a number of systems were still not working. These included a customer server, one of our backup routers and, crucially, an ATM switch. We cannot say why three bits of kit would all die at the same time, and there is no proof of any sort of power surge.

We have a lot of redundant equipment, spares and hot standby equipment, but the crucial failure in this case was the ATM switch, which is used to connect to all of our broadband customers on 20CN. This is a known single point of failure, which is why we also purchased ATM cards when we first got it and set about building a backup ATM switch based on a Linux router. Sadly, we found that the cards were not in a router in the data centre but actually in storage in our other office. As such, we quickly put together a Linux router with the cards and sent an engineer into London with it.

At the same time, we reasoned that the ATM switch may simply have blown a power supply. Our engineer had spares, but to save time we enlisted the help of a volunteer (rauxon) who, as a trained hardware engineer, was happy to try a different power supply and was close to the data centre. Sadly, this did not help, but we do thank him for volunteering.

At this stage, the fact that we had to put together a new ATM router and send an engineer meant several hours' delay already. The router was installed and connected, and we set about what should have been a simple task of configuring it. The ATM cards were supported in a standard kernel (or so we thought). However, this turned out not to be the case. Two of our engineers worked all day with some help from additional volunteers who had great experience in working on kernels (MikeB and Oryn). They eventually got the necessary module working shortly after 6pm, by which time we had already restored service.

It was apparent during the day that things were not going well. We had, all day, also been trying to source a replacement for the original ATM switch, or any other equipment we could buy or hire at short notice, but got nowhere! However, to our pleasant surprise, offers of help started coming in from customers and other ISPs. We would like to thank drPoggs (Peter Hicks) and his employer for quickly finding a suitable Cisco box and ATM cards, and AlistairC for running a suitable GBIC over to the data centre. With this kind support from some of our more technical customers, we were then able to restore service.

So I wish to thank all of those who offered help and support today, and my staff for their work in solving the problem as well.

Who is to blame

OK, so someone has to be to blame. We can find a number of excuses: "there must have been a power spike", "it is stupid that BT 20CN lines give us just one connection and so make a single point of failure", or even "the stupid kernel should have worked". However, at the end of the day, we have to say that our contingency plans were simply not up to scratch.

It is not that we did not have plans - we had purchased the (expensive) ATM cards to make the backup ATM router and tasked a couple of people with the job - but when things are working well such things go to the bottom of the pile. At the end of the day, as company director, I should have made sure it happened. We had good quality equipment that was working well. As you know, we don't guarantee that we will never have outages, and fixing a fault within 10 hours is well within our service level targets. But I personally think this type of delay is not really acceptable for a quality service.

As a goodwill gesture, we have arranged a free daytime usage allowance top-up for all 20CN customers that will carry over until next month. The top-up is the same as the allowance on your current tariff, so it is a month's extra usage allowance.

How is this fixed so it cannot happen again

One of the problems is that we never expected to be using the 20CN BT Centrals at all by now. We expected to have switched our end over to 21CN links early last year. The long-term plan is that we will be doing that some time in the next few months, and in the meantime an increasing number of lines are being moved to 21CN. The 21CN centrals present two physical links to us, which we terminate on two separate bits of equipment, so there is hardware and link redundancy in the first place.
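To illustrate in the abstract why terminating two links on two devices removes the single point of failure, the following Python sketch checks which devices would cut the connection if they failed on their own. The node names (`bt`, `atm_switch`, `router_a`, `router_b`, `core`) and both layouts are hypothetical simplifications made up for the example, not the actual network.

```python
from collections import deque


def reachable(links, start, goal, removed=None):
    """Breadth-first search: can `start` still reach `goal` with `removed` down?"""
    if removed in (start, goal):
        return False
    # Build an undirected adjacency map from the list of links.
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in neighbours.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(links, start, goal):
    """Return every intermediate node whose failure alone disconnects start from goal."""
    nodes = {n for link in links for n in link} - {start, goal}
    return sorted(n for n in nodes if not reachable(links, start, goal, removed=n))


# Hypothetical 20CN-style layout: one link, terminated on one ATM switch.
single_path = [("bt", "atm_switch"), ("atm_switch", "core")]

# Hypothetical 21CN-style layout: two links, terminated on two devices.
dual_path = [
    ("bt", "router_a"), ("bt", "router_b"),
    ("router_a", "core"), ("router_b", "core"),
]

print(single_points_of_failure(single_path, "bt", "core"))
print(single_points_of_failure(dual_path, "bt", "core"))
```

In the single-link layout the only intermediate device is reported, while the dual-link layout reports none. The check only considers one device failing at a time; it says nothing about shared power or simultaneous failures.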

In the shorter term, we aim to purchase the Cisco we have been loaned and continue using that for now, assuming it continues to work well. We will also be getting the Linux-based ATM router configured and ready. This is expected to be done tomorrow. We may schedule some maintenance late at night or on a Sunday to test the Linux ATM router on the BT links to be sure we have a backup in place, but this will be disruptive. We may be able to confirm the router is ready without doing this.

Single points of failure are always a bad idea, and we have always aimed to both eliminate them using redundant links, and have spares in place ready to switch in. We continue that aim and continue to improve our network all of the time so as to avoid such issues. This type of outage brings home the importance of this policy even for something as simple as an ATM switch. None of us want another day like today.

If you have any questions, please do ask me on IRC or news and I will be happy to explain any further details.
