Many people outside the knitting world probably don’t think about the fact that knitters have conferences too, where they register for classes taught by famous people at some venue. Recently a famous knitter (Stephanie Pearl-McPhee, aka Yarnharlot) organised such an event. I think she got some bad advice from her IT people, whoever they were, about what would be required to run the online registration system.
To be fair, the IT people thought the organisers were being optimistic about how many people would show up. I’m going to summarise the salient numbers; if you want more details, read the blog post. With 12,000 on the mailing list, they figured 5,000 people was the number to expect, competing for about 4,000 spots. The organisers “built a huge server and a pretty good system” for those expected 5,000 people. In the event, they had over 30,000 simultaneous connections, and the server couldn’t handle it.
It seems to me that these requirements are precisely what cloud computing should be able to handle. For this particular event, it was possible that only 1000 people would try to register at once, or that lots more would. The load could have been spread over a couple of months if the conference seats sold slowly, or over an hour if they sold fast. In this actual case, buying a server sized for the expected maximum resulted in a server and system that were too small; it could equally have turned out that money was wasted on something far too powerful for what was needed.
What I’d like to know is how, in general terms, should such a system be architected? If you were using this as a case study on how to do cloud computing, what would you propose? Some more requirements: People can register for more than one class. Class sizes are limited, and the size depends on the class. The system has to include an online payment system.
I’m not looking for lots of details, just a broad-brush outline of a paragraph or two, like “put X on one virtual server that can scale up, and Y on another”. My personal experience so far of “the cloud” has been for storage rather than these sorts of systems, and this use case has intrigued me.
We actually face a similar problem year on year with the Melbourne Cup, because we have a very poor idea of how many people are likely to visit the website on the day of the race itself. Not least because the horses are finalised only a couple of days prior, and if there are international riders or horses we get a lot of interest from those nations, not just Australia.
Last year we created a scalable cloud solution. Initially we were going to go for storage only, as the classic model is to take just enough pressure off the web servers that they serve only pages, not heavy assets. In the end, though, we took a guess at the number of front-line servers, then kept more in reserve that could be spun up from a standard image and dropped into the array very quickly. It worked really well, and there was no performance degradation on the array for the entire period.
After doing this now my approach is very much the following:
Stick all your “heavy” assets on a storage cloud that self-scales, e.g. S3, and use CloudFront if you need to get them closer to your end users.
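To make that split concrete, here’s a minimal sketch of the routing decision: heavy assets get their links rewritten to a CDN domain, while light pages stay on the origin. The extension list, size threshold, and CloudFront domain are all hypothetical placeholders, not anything the original setup actually used.

```python
# Sketch of the "heavy assets on a storage cloud" split.
# The threshold, extension list, and CDN domain are illustrative only.

HEAVY_EXTENSIONS = {".jpg", ".png", ".pdf", ".mp4", ".zip"}
HEAVY_THRESHOLD = 256 * 1024  # bytes; anything bigger goes to the CDN

def is_heavy_asset(path: str, size: int) -> bool:
    """Decide whether an asset belongs on S3/CloudFront or the web tier."""
    ext = path[path.rfind("."):].lower() if "." in path else ""
    return size >= HEAVY_THRESHOLD or ext in HEAVY_EXTENSIONS

def asset_url(path: str, size: int,
              cdn_domain: str = "d1234.cloudfront.net") -> str:
    """Rewrite links so heavy assets are served from the edge."""
    if is_heavy_asset(path, size):
        return f"https://{cdn_domain}/{path.lstrip('/')}"
    return path  # light assets stay on the origin web servers
```

The point is that the web tier never serves a multi-megabyte file itself; it only emits URLs pointing at storage that scales on its own.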
Create as few connections into your database as possible, as this is an obvious bottleneck. Create a cluster if you need to. Generally reads are going to be orders of magnitude more frequent than writes, so you can scale out here.
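One common way to get that read scale-out is to route reads to replicas and keep writes on the primary. This is a sketch of the idea, not any particular driver’s API; the hostnames are made up.

```python
# Sketch: send SELECTs to read replicas (round-robin), everything else
# to the primary. Hostnames are hypothetical placeholders.
import itertools

class ConnectionRouter:
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # simple round-robin

    def route(self, sql: str) -> str:
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb == "SELECT":
            return next(self._replicas)   # reads scale out across replicas
        return self.primary               # writes serialize on the primary

router = ConnectionRouter("db-primary", ["db-replica-1", "db-replica-2"])
```

Because reads dominate, adding a replica is how you add capacity; the single write path stays small and protected.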
If your pages are mostly static (i.e. managed through a CMS but not real-time data), then publish them as flat files and serve them from commodity hardware using something like EC2. Servers in this instance can simply be thrown at the load, so this tier is horizontally scalable.
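The publish step can be as simple as rendering each CMS page to an HTML file under a docroot that every front-end serves. A minimal sketch, with the page data standing in for whatever the CMS actually exposes:

```python
# Sketch: flatten CMS pages to static HTML files that any number of
# commodity web servers can serve without touching the CMS or the DB.
import os
import tempfile

def publish(pages, docroot):
    """Render each (slug, body) page to a static file under the docroot."""
    written = []
    for slug, body in pages.items():
        path = os.path.join(docroot, f"{slug}.html")
        with open(path, "w", encoding="utf-8") as f:
            f.write(f"<html><body>{body}</body></html>")
        written.append(path)
    return written

docroot = tempfile.mkdtemp()
publish({"schedule": "Class schedule goes here"}, docroot)
```

Re-run the publish whenever an editor changes something; between publishes, serving the site is just shipping bytes off disk.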
All that’s left then is your applications. These need to be optimised, using things like data caching to enhance performance. In the example above, you only need to update the session availability data when someone commits to a purchase, not when they view the available seats; you rebuild the availability cache only when the underlying data has changed, leading to fewer DB requests and better application performance. Your app servers then become horizontally scalable as well, since they are much more selective about when they hit your DB.
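The caching pattern just described can be sketched in a few lines: viewers share one cached snapshot, and only a committed purchase touches the source of truth and invalidates the cache. The in-memory dict stands in for the real database.

```python
# Sketch of cache-on-read, invalidate-on-purchase. The dict is a
# stand-in for the database holding per-class seat counts.

class Availability:
    def __init__(self, class_sizes):
        self._seats = dict(class_sizes)   # "database"
        self._cache = None                # rebuilt lazily after a purchase

    def view(self):
        """Cheap read path: thousands of viewers share one snapshot."""
        if self._cache is None:
            self._cache = dict(self._seats)   # rebuild from the DB
        return self._cache

    def commit_purchase(self, class_id):
        """Expensive write path: hit the DB, then invalidate the cache."""
        if self._seats.get(class_id, 0) <= 0:
            return False
        self._seats[class_id] -= 1
        self._cache = None
        return True
```

Thirty thousand people refreshing the seat list costs you one cache rebuild per sale, not thirty thousand queries.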
Taking a model like this keeps people away from the most fragile part of your service: the transaction. If you can stop other parts of your service causing collateral damage in this one area, chances are you’ll have a successful outcome. A booking gateway should easily be able to book hundreds of people all at once, but only if it isn’t being affected by the other thousands of people on the site at the same time…
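Protecting the transaction also means it can’t oversell a class under concurrency. One standard way is a conditional update that only decrements a seat if one is still free; this runnable SQLite sketch shows the shape (table and class names are invented for illustration):

```python
# Sketch: a booking that cannot oversell. The conditional UPDATE only
# succeeds while seats remain, so concurrent bookings serialize on the
# row rather than on the whole site.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE classes (id TEXT PRIMARY KEY, seats INTEGER)")
conn.execute("INSERT INTO classes VALUES ('sock-knitting', 40)")
conn.commit()

def book(conn, class_id):
    cur = conn.execute(
        "UPDATE classes SET seats = seats - 1 "
        "WHERE id = ? AND seats > 0", (class_id,))
    conn.commit()
    return cur.rowcount == 1   # 0 rows touched means the class was full
```

The payment step would hang off a successful `book()`; a failed one tells the user the class filled up, cleanly, instead of collapsing the server.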
My storage preference over S3 would be OpenSolaris on EC2. You can set up a ZFS boot mirror (with Solaris, Amazon gives you your own box with two HDs), and as many ZFS pools as needed — adding or removing EBS units to/from ZFS pools is a cinch.
I have my own little OpenSolaris distro based on b107, but you can’t run your own kernel on most cloud offerings. I’m trying RedPlaid, which is VMware-based and allows any kernel that runs on VMware. I have an httpd zone and a mysql zone on the first instance. To scale, the mysql zone would be moved to its own instance, so the apps can then scale independently.
Better yet, if a system is being built from scratch, would be to use Google Base instead of MySQL. That’s a service, not a cloud… So I guess my answer is “put A on cloud X, put B on cloud Y, and put C on service Z”. Amazing, all the possibilities these days!
Having just gotten started with “the cloud”, I’d like to point out another gotcha besides the fixed-kernel issue: scaling is mostly something you do manually via the control panel, or through an API with your own algorithm. EC2 just added “autoscaling”, which finally automates the process. Hopefully EBS units will be able to autoscale as well in the future. But I’d rather see a proper REST API standardized, so I can write a curl script to handle autoscaling to my specs and have it work cross-cloud.
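The “your own algorithm” part is mostly just deciding a desired fleet size from a load metric; the provider call that applies it would be whatever API your cloud exposes. A hedged sketch of the decision logic only, with made-up target and bounds:

```python
# Sketch of a roll-your-own autoscaling decision: size the fleet so
# per-instance load settles near a target. Target and bounds are
# arbitrary illustrative values; applying the result is provider-specific.

def desired_instances(current, avg_load,
                      target=0.6, minimum=2, maximum=20):
    """avg_load is mean utilisation per instance (0.0-1.0+)."""
    if avg_load <= 0:
        return minimum
    wanted = round(current * avg_load / target)
    return max(minimum, min(maximum, wanted))
```

A cron job or curl loop would poll the load metric, call this, and reconcile the fleet; with a standardized REST API that loop would be portable across clouds.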
The point is, even if you’re on the cloud, gross underestimation of traffic will lead to the same problems the knitting conference had; it’s just easier to react to that sort of disaster. I think that’s the most perniciously overhyped aspect of cloud computing. Scaling up and down is easier, but the automation needs to be configured properly, if the cloud provider even offers it (or makes it possible with their API).
I’d think that for a registration system, at least during the initial bursty load, the amount of database writing is going to be much greater than typical. So I’d use the biggest database I could afford. (The database folks have years of experience building scalable products.)
That brings out a concern I have about clouds: where do you keep the data? For many organizations, putting it in the cloud is fine. For many more, it’s not. I’m sure that Expedia, etc., would like to use “pay as you need it” storage from some third party, but I’m also sure they want to keep their data to themselves. Same for Priceline. And that’s not even getting into regulated industries like financial, health, some manufacturing (airplanes), and so on.
The place where the cloud can really help is by allowing you to bring up additional front-ends. I don’t know of any cloud services that offer this right now, but certainly there are commercial products, often integrated with a company’s J2EE server (e.g., IBM’s WebSphere Virtual Enterprise), that do this kind of thing. Generalizing that, and hooking it into a load balancer/gateway, seems like a good thing for a cloud provider to offer.
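The generalization suggested here boils down to front-ends registering with the balancer as the cloud brings them up. A tiny round-robin sketch of that handshake, with hypothetical instance names:

```python
# Sketch: a minimal load balancer that newly provisioned front-ends
# register with, so capacity added by the cloud is used immediately.
import itertools

class Balancer:
    def __init__(self):
        self._backends = []
        self._rr = None

    def register(self, host):
        """Called when the cloud brings up a new front-end."""
        self._backends.append(host)
        self._rr = itertools.cycle(self._backends)

    def pick(self):
        """Route the next request round-robin across live front-ends."""
        return next(self._rr)

lb = Balancer()
lb.register("frontend-1")
lb.register("frontend-2")
```

A real offering would add health checks and deregistration, but the contract is the same: the provider spins instances up, the gateway starts sending them traffic.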