Whenever I talk with somebody at a company that needs dedicated servers, I jump on the opportunity to sell them on Rackspace. No, I don’t get any commission or anything from them. When it is so clear to me that Rackspace is exactly what a company needs, I feel compelled to share so that they don’t go needlessly down a wrong path.
On Friday, I was having lunch with two guys from one such company in Blacksburg, and they asked me “What is the biggest thing you’ve learned about hosting a system as large as Webmail.us’ at Rackspace?” Man, where do I begin? The biggest thing. Hmmm…
I told them the story of how when we first moved our email hosting system to Rackspace, we were running it on just 5 servers. These were powerful dual-Xeon boxes, lots of RAM, fast expensive SCSI drives, the works. Not cheap boxes. This was 2003, and our business was starting to boom. Soon 5 servers turned into 7 servers. Then 9. Our application started becoming more complex too… adding DNS caching, multiple replicated databases, load-balanced spam filtering servers, etc. We had each of our servers running several of these applications so that we could get the most bang-for-the-buck out of the machines. This started to get complex fast, and was about to become a nightmare to manage.
With multiple applications per server, it became increasingly difficult to troubleshoot problems. For example, when a disk starts running slow or a server starts going wacky (technical jargon), how do you determine which of the 4 applications running on that server is the culprit? Lots of stopping and starting services, and watching /proc/* values. But with just 9 servers, you don’t have an excessive amount of redundancy and don’t want to have to do this all that often. Or worse, when an application crashes a box, it takes down all of the apps that were running on that box. If there was a better way to scale, we needed to find it.
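To give a feel for what "watching /proc/* values" can look like, here is a minimal Python sketch. It is not what we actually ran back then. It assumes a Linux box that exposes per-process I/O counters in /proc/<pid>/io, and the function names and the service-to-PID mapping are made up for illustration.

```python
from pathlib import Path


def parse_proc_io(text):
    """Parse the 'key: value' lines of a /proc/<pid>/io file into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters


def disk_bytes(counters):
    """Bytes this process actually caused to be read from or written to storage."""
    return counters.get("read_bytes", 0) + counters.get("write_bytes", 0)


def rank_services_by_disk_io(pids_by_service, read_io=None):
    """Return (service, bytes) pairs, heaviest disk user first.

    pids_by_service is a hypothetical mapping such as {"spamfilter": [812, 813]}.
    read_io lets a caller supply the file contents; by default it reads /proc.
    """
    if read_io is None:
        read_io = lambda pid: Path(f"/proc/{pid}/io").read_text()
    totals = {}
    for service, pids in pids_by_service.items():
        totals[service] = sum(disk_bytes(parse_proc_io(read_io(pid))) for pid in pids)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

The counters are cumulative since each process started, so to find out who is hammering the disk right now you would sample twice and compare the difference. Even then, it only narrows things down, which is part of why sharing boxes between applications got painful.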
We started Webmail.us while still in college, and while we had interned at some pretty neat companies, we didn’t have a whole lot of experience to lean on in order to figure things like this out. In computer engineering / computer science they teach you how to code, but they don’t teach you how to manage clusters of servers. We were learning how to run a technology company by making decisions through gut instinct and trial-and-error – not by doing what has been done in the past at other companies. And even after we had hired a decent number of employees, very smart employees and some with lots of experience, there were still many areas that our team was lacking expertise in. So what did our gut tell us to do in order to learn how to scale things the right way?… Get help from an expert.
Having a company like Rackspace on our side has been a huge asset. With a collection of talented engineers the size of theirs, they seem to always have at least one person who is an expert on just about anything that we have needed help with.
In 2005, by working with people at Rackspace like Paul, Eric, Alex, Antony and others, we decided to re-architect our system to give each of our internal applications and databases their own independent server clusters. The idea was to use smaller servers, and more of them; with smart software to manage it all so that hardware failures can be tolerated (hmm… have you ever heard of a system like this before?). With this approach, each application is completely isolated from the next. When a server starts acting wacky, we can just take it down to replace hardware, re-install the system image, or whatever… and the load balancers and data replication software know how to deal with that.
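The core idea, a pool of cheap interchangeable servers where traffic simply skips any box that has been taken out, can be sketched in a few lines of Python. This is only an illustration of the principle, not our actual load balancer code; the class name, method names and server names are all hypothetical.

```python
class ServerPool:
    """Round-robin over one cluster's servers, skipping any marked as down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.down = set()
        self._next = 0  # index of the next server to try

    def mark_down(self, server):
        """Take a server out of rotation (bad disk, re-imaging, whatever)."""
        self.down.add(server)

    def mark_up(self, server):
        """Put a repaired server back into rotation."""
        self.down.discard(server)

    def pick(self):
        """Return the next healthy server, trying each server at most once."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server not in self.down:
                return server
        raise RuntimeError("no healthy servers in pool")


# One independent pool per application keeps the clusters isolated.
spam_pool = ServerPool(["spam1", "spam2", "spam3"])
spam_pool.mark_down("spam2")  # pull the wacky box for repairs
print([spam_pool.pick() for _ in range(4)])
```

With one pool per application, pulling a misbehaving box only shrinks that one cluster. A real setup would also need automatic health checks to decide when to mark a server down, which this sketch leaves to the caller.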
We ended up completely ditching the beastly dual-Xeon servers in favor of 54 shiny new single-CPU AMD Athlon boxes, each with 1 GB of RAM and SATA hard drives. Basically equivalent to the hardware you could get at the time in a $1000 Dell desktop. We’ve grown this system over 3x since we first launched it with 54 servers. We still mostly use Athlon CPUs, but have some Opteron and even some dual-Opteron boxes now in clusters that require a lot of CPU such as spam filtering.
Today it is just as easy to manage 180 servers as it was with 54 servers, because we’ve built things the right way.
Rackspace’s expertise was invaluable in creating this new system. However, we are not the type of company that likes to be completely dependent on another company, even if that other company is Rackspace. So, we didn’t just let them build this new system for us. We had them show us how to build it. They may have built the first pair of load balancer servers out of basic Linux boxes; but then we ripped them up, and built them again from scratch. Then we did it again. We did this until we understood how each component worked and we didn’t need Eric’s or Alex’s help anymore. We did this with everything that we built in 2005, and we continue to do this whenever we lean on Rackspace for help.
So my advice for these two guys who had been selling their software for almost 10 years and were about to move it towards a hosted web service model, was this… As much as you think you know about hosting your software, you are going to run into things that nobody at your company will have done before. Things that you guys are not experts at. If you stick your servers in a colo cabinet somewhere, you are going to have to figure those things out on your own. This will be slow and will probably not result in the best solution every time. I highly recommend that you consider hosting your app at a company like Rackspace that can help you when you need it. You are going to pay more per server going this route if you simply look at the raw hosting cost. However, you will be able to get things online faster, work through problems effectively, and you will learn how to host your system from the best.