A Long Day: Status Update on Yesterday’s Server Issues

Yesterday was admittedly a long day for me. Jim, Martha, and I were all traveling down to Atlanta to prepare for the Domain Incubator conference that Emory University is putting on Friday and Saturday. It’s an exciting opportunity to brainstorm how schools can use services like this (or DIY) to build a Domain of One’s Own type of program, and I particularly like that Atlanta is thinking of it as a regional hub rather than simply attempting to shoehorn their institution into buying into it wholesale. But I found myself scrambling yesterday to fix issues on our servers that I had never seen before, and since the issues persisted for much of the day I wanted to provide a short recap of what happened here.

Wednesday night I received a report from a customer that her site was running really slowly and was sometimes unavailable. I checked on it and I could load her website, but I did notice that occasionally it wouldn’t connect. I would refresh and it would come right back. I checked the server she was on to see if there were any load issues or high-traffic sites causing problems. I didn’t see anything abnormal, but I made a few tweaks to some settings, asked her to let me know if it continued, and went to bed. In the morning I woke up to messages not just from that one customer but from a few others reporting spotty availability. What immediately struck me was that the customers were spread across multiple servers. Either there was a coordinated attack on all our servers at the same time, or there was a larger issue, perhaps with our data center. As I mentioned, yesterday we were getting ready to head out of town, so I had to get my daughter to daycare and get packed up, but as soon as I got in the office I put in a support request to our server provider explaining the issues we were having and asking if there were any network issues we should know about. They replied that they were unaware of any issues at that time.

So I spent the better part of the morning trying every trick in the book I could think of to resolve the issue: rebuild Apache, check MySQL, temporarily suspend a few accounts with high traffic, reboot, reboot again, check the firewall, disable the firewall. Nothing was working and the issue persisted on both servers we own. I asked the company if we could pay them to look at this issue, as I had no other ideas left to try and I had never seen a spotty connection like this before (many websites were actually fully functional, but most customers could not access their dashboard, edit pages, or do any administrative tasks without lots of refreshing and frustration). By this point I had to get on a plane to Atlanta and figured I’d have to continue working on this once I got to the hotel.

Luckily our airplane had wifi so I was able to keep working on things. I finally received an email during the flight (this was about 5:30pm) from the company that said “We have narrowed down the cause of the issue and We will be performing emergency maintenance tonight at 10pm.” Hallelujah! It turned out to be a bad network switch that carries traffic to and from both of our servers (along with many others, I’m sure). I updated everyone on Twitter and sure enough, when the switch was replaced at 10pm the problem was resolved.

Obviously no time is a good time for downtime, but I’m especially aware of the fact that many final projects are due around this time as schools begin to finish up their Spring semester. Ironically I was using my own domain for a final project in a graduate program I’m in, and I host it here (eating my own dogfood), so I too was feeling that pain. This was an especially difficult issue because it’s one of the first times that we’ve had to look above our own servers to our network provider as the cause of the issue (and the company initially did not indicate there was an issue at all). In the future, as Reclaim Hosting grows, we will likely be able to find ways to mitigate this by diversifying the companies we use for servers, the location of those servers, etc. I’ve never compared us to enterprise hosting like GoDaddy, Bluehost, Dreamhost, MediaTemple, etc., because I believe what we’re doing here is and can be fundamentally different. With that come some growing pains of course, but I’m glad to have you all be a part of what we’re doing here and I look forward to continuing to build Reclaim right alongside you all.

Thanks for your support,

Tim Owens