Why Amazon EC2 Outage Won’t Change Anything in Cloud Computing

Warning: strong cloud advocacy ahead.

With the recent Amazon EC2 outage, many Internet companies such as the walled garden Quora and even one of our competitors remained offline for 24 hours or more. Amazon does not seem to have fully recovered yet, after almost 48 hours, which is a bit disappointing. I have been barely affected because, of all the cloud-based services I use, only Springpad is hosted in Amazon’s Northern Virginia datacenter.

As expected, Amazon turned into an evil corporation in no time. Everyone likes to point fingers because it’s easy and makes you feel important (“See? I told you the cloud is bad!”), but it’s useless. Sure it’s Amazon’s fault, but what then? It was not the first and won’t be the last outage.

The problem here is not the reliability and availability of cloud platforms. The problem is the cost of not building on top of a cloud platform, and here is why.

Without getting too technical, let’s start with the basics. If you want to build a business offering a service to a certain number of users, you have to guarantee a minimum level of availability and reliability. So you’ll need at least two servers, because if one decides to take a nap, you’ll need another one to keep the service operational. That means you’ll need either a fault-tolerant load balancing system, which uses both servers in parallel to balance the workload, or “simply” a way to transparently fall back to the other server when the first goes offline (plus you’d have to guarantee data redundancy). It turns out both ways are neither simple nor cheap. Then you also have to keep the operating systems up to date and correctly configured. When you want to deploy a new version of your software, you’ll need a way to do it transparently on both servers and without service disruption. If you’re like me, you’ll want to deploy a new version at least every week.
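To give an idea of what “transparently falling back” means in practice, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the two “servers” are hypothetical stand-in functions, and a real setup would also need health checks, timeouts and the data redundancy mentioned above.

```python
from typing import Callable, Sequence


class AllServersDown(Exception):
    """Raised when no server in the pool could handle the request."""


def call_with_failover(servers: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each server in order and return the first successful response."""
    failures = 0
    for server in servers:
        try:
            return server(request)
        except ConnectionError:
            # This server is unreachable: count it and try the next one.
            failures += 1
    raise AllServersDown(f"{failures} server(s) failed to answer")


# Hypothetical stand-ins for real machines (assumptions for this sketch).
def napping_server(request: str) -> str:
    raise ConnectionError("primary is offline")


def healthy_server(request: str) -> str:
    return f"handled: {request}"


if __name__ == "__main__":
    # The caller never notices that the first server was down.
    print(call_with_failover([napping_server, healthy_server], "GET /"))
```

Even this toy version hints at the hidden work: someone has to decide what counts as a failure, in which order to try the servers, and what to do when all of them are down.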

So one day your business catches on and you need to add another server. And another. And another. Because at peak hours you have lots of concurrent users. You see, every time you add a server, you have to maintain it, replace faulty components, and remember and manage the differences between the machines, all while accumulating a growing pile of obsolete hardware. You’ll begin asking yourself questions like “We’re running out of disk space, should we replace all those 3-year-old hard drives, or should we simply throw away the servers and replace them with faster, bigger, better new machines?” or “Should we upgrade the operating system to get this new feature, with the risk of breaking something in the process?”. You can see a pattern emerging: you resist change, and that’s particularly bad in IT.

Now, if you think outside the box, you can see that you spend tons of money on:

hardware that is unused most of the time (because you have to design capacity for peak times) and that you’ll have to throw away in a couple of years anyway
time you don’t really have when you’re small and that is very expensive even when you can afford it
know-how that is really boilerplate, because your customers can’t even tell the difference between their DSL modem and your load balancer, so why should you waste brain cells for that?
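The first point, hardware sized for peak times, can be put into numbers with a rough back-of-the-envelope sketch. All figures below (the hourly load and the users a single server can handle) are made up for illustration, and the sketch ignores the spare server you would keep for redundancy.

```python
import math
from typing import Sequence


def servers_needed(concurrent_users: int, users_per_server: int) -> int:
    """Servers required for a given load (always at least one)."""
    return max(1, math.ceil(concurrent_users / users_per_server))


def utilization(hourly_users: Sequence[int], users_per_server: int) -> float:
    """Share of owned server-hours that the load actually requires."""
    per_hour = [servers_needed(u, users_per_server) for u in hourly_users]
    # Owned hardware must cover the busiest hour, all day long.
    owned_server_hours = max(per_hour) * len(per_hour)
    # With metered capacity you would pay only for what each hour needs.
    required_server_hours = sum(per_hour)
    return required_server_hours / owned_server_hours


if __name__ == "__main__":
    # Hypothetical day: 18 quiet hours and 6 peak hours.
    day = [100] * 18 + [1000] * 6
    print(f"{utilization(day, users_per_server=250):.0%} of owned capacity is needed")
```

With these invented numbers, fewer than half of the server-hours you own are ever needed; the rest is hardware waiting for the next peak.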

The solution is offered by cloud computing platforms, like Amazon EC2, Windows Azure and a few others. You purchase computing power, storage space and bandwidth and let the provider figure out all the rest. You get to focus on building your service and providing value to your customers. Look at how things have changed over the past years:

at first you could shell out largish amounts of money to purchase or rent a server and place it in a real datacenter
then the first virtual server providers started to pop up, allowing you to rent a virtual machine for a fraction of the cost of a physical one, effectively lowering the barrier to start a new Internet-based service (yet the process was similar to renting a physical server, with long setup time and non-zero setup fees)
lately you can sign up, pay with a credit card and have a virtual server up and running in minutes, with the possibility to create and destroy virtual servers at will without additional costs (this opened up lots of new possibilities, lowering the barrier even more)
today you can build an application and deploy it in “the cloud”, almost without having to worry about servers, either physical or virtual, and just knowing that you can tap into virtually unlimited computing power and storage, with transparent failover and load balancing goodness, all of that with metered pricing and no long-term contracts

I consider only the fourth step in this evolution to be a real cloud computing platform. If you still have to manage individual machines, even if they’re virtual, there’s not much improvement other than some economic advantage (which is still not irrelevant, of course).

Getting back to the original topic, Amazon’s outage: even without considering all the enormous advantages cloud platforms have to offer, we must ask ourselves if we’re better at managing servers than a full-time, dedicated team of professionals like Amazon’s. If even they fail sometimes, how can we be better? This reasoning is especially important for small companies – or startups – where the CEO writes code, does customer support and pays phone bills. We already have enough overhead that we can’t avoid, so trying to move the burden of managing servers and infrastructure onto someone else is a wise move that allows us to focus on real work.

To conclude this post, I think that Amazon was a pioneer in cloud platforms and still has a dominant position in the market. Their outage felt outrageous and huge, but don’t forget that there are hundreds of outages every day in small datacenters all over the world, so we’re better off having our applications hosted on Amazon EC2 or Windows Azure. It’s cheaper, easier, safer and even greener.
