
About Me

I’m a 30-something Irish guy who works in the IT business. Inside the trade I’m interested in Linux, Internet technologies and mobile hardware and services. Outside, I enjoy a good book, a nice beer and a decent game of rugby...

P.S. This is a personal blog, and while I do have a professional involvement in a lot of the technical topics I mention in some of my posts, they do not reflect company policy or ethos.


Major San Francisco websites knocked offline due to a power outage

One of the big San Francisco colos, 365main.com, went tits up earlier, and major customers such as Craigslist (free classified ads) and Six Apart (the TypePad and LiveJournal blog people) are offline. Power to the neighbourhood went down, and the colo’s backup power either didn’t kick in or couldn’t cope with the load. I know our primary colo (InterXion in Dublin) doesn’t sell rack space unless it has the generator capacity to carry it, and it runs regular tests to verify that; I get the test reports after the fact. Now, I know high availability and redundancy are part of my business so I’m a bit biased, but how do websites get this big and not consider what an outage like this will do to their business?

But why rely on the colo to do all the hard work? Some people put their own UPS equipment in their racks, but if the colo’s power dies then the aircon goes too, so personally I’d rather have the servers drop suddenly than slowly cook themselves. Besides, the power to the meet-me rooms will go as well, so in many cases you’re going to be without connectivity anyway. And fires, floods and earthquakes happen; all the power, server and storage resilience in the world won’t help when the whole site is gone.

What we’ve had to do is set up a remote warm site. If we lose Dublin totally, we use BGP and a few cooperative bandwidth suppliers to route the traffic to our London site, where we’ve got replicated data and servers powered on and waiting in the racks. In theory the changeover is instant. In practice most customers are back connecting again within 90 seconds, though some could see up to 5 minutes of an outage. It’s not cheap to operate this way, but surely Craigslist & Co. have the cash.
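For anyone wondering what the decision side of a warm-site failover might look like, here is a minimal sketch in Python. It is not our actual setup: the `FailoverMonitor` class, the three-probe threshold and the 30 second probe interval are all made up for illustration, and the real work of moving traffic (the BGP announcements) is left out entirely. It only shows the idea of waiting for several consecutive failed health checks before declaring the primary dead, so a single dropped probe doesn't trigger a changeover.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Decides which site should take traffic, based on primary health probes.

    Hypothetical illustration only: site names, threshold and behaviour
    are assumptions, not a description of any real deployment.
    """
    failure_threshold: int = 3      # assumed: consecutive failures before failover
    consecutive_failures: int = 0
    active_site: str = "dublin"     # primary site takes traffic to begin with

    def record_probe(self, primary_ok: bool) -> str:
        """Record one health probe result and return the site that should be active."""
        if primary_ok:
            # Any successful probe resets the failure streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

        # Fail over once the streak reaches the threshold. There is
        # deliberately no automatic failback: moving traffic back to
        # the primary is left as a manual decision in this sketch.
        if (self.active_site == "dublin"
                and self.consecutive_failures >= self.failure_threshold):
            self.active_site = "london"
        return self.active_site


def worst_case_detection_seconds(probe_interval_s: int, failure_threshold: int) -> int:
    """Upper bound on the time between the primary failing and failover being triggered."""
    return probe_interval_s * failure_threshold


if __name__ == "__main__":
    monitor = FailoverMonitor()
    for ok in [True, False, False, False]:
        print(monitor.record_probe(ok))
    print(worst_case_detection_seconds(30, 3))
```

The threshold is the trade-off to tune: a lower value detects a dead site faster but is more likely to fail over on a brief network blip. Detection time is also only part of what a customer sees, since routing changes still have to propagate after the decision is made.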

Edit: This would have to happen on the very day that they issued a press release boasting about 2 years of 100% uptime for one of their major customers who closed their redundant site because they didn’t need it :-)

By gary | 25. Jul 2007 | Hardware, Internet, Technology
