AlBlue’s Blog

Macs, Modularity and More

Amazon web services -- S3 and EC2


Amazon have recently launched a couple of web services that they're marketing to the public, and which are worth keeping an eye on: S3 and EC2.

S3, the Simple Storage Service, is a mechanism that allows you to store data on Amazon's storage systems. Unlike other storage systems, it's not accessed via a filing system, but rather via web services (REST or SOAP). You can upload objects -- anywhere from 1 byte to 5GB each -- and each one will be permanently associated with a URL, which can then be accessed from any computer in the world over HTTP. You can choose the URL format, but you can't change the URL once the object has been uploaded (though I'm not quite sure how that differs from deleting it and then re-uploading it to a different URL afterwards). This is already being used in some backup programs, where the data can be stored and retrieved transparently by the application through the web service interface.
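To make the URL point a bit more concrete, here's a minimal sketch of how an object's address might be put together. It assumes the object is publicly readable, that the address is formed from a bucket name plus a key, and that the endpoint host is s3.amazonaws.com; the function name, bucket and key are all hypothetical, so treat it as an illustration rather than the definitive scheme.

```python
from urllib.parse import quote

# Assumed public endpoint host; check the S3 documentation for the real one.
S3_ENDPOINT = "s3.amazonaws.com"


def object_url(bucket: str, key: str, virtual_host: bool = False) -> str:
    """Build the HTTP URL at which a stored object would be readable.

    `bucket` and `key` are chosen by the uploader, which is what fixes
    the URL once the object has been stored.
    """
    # Percent-encode the key but keep '/' so that keys can look like paths.
    encoded_key = quote(key, safe="/")
    if virtual_host:
        # Bucket name as part of the host name.
        return f"http://{bucket}.{S3_ENDPOINT}/{encoded_key}"
    # Bucket name as the first path segment.
    return f"http://{S3_ENDPOINT}/{bucket}/{encoded_key}"


if __name__ == "__main__":
    # Hypothetical bucket and key, purely for illustration.
    print(object_url("my-backups", "2006/photos archive.zip"))
    print(object_url("my-backups", "2006/photos archive.zip", virtual_host=True))
```

Since the URL is derived entirely from the bucket and key, "changing" it really does amount to storing the same bytes again under a different key and deleting the old one.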

The only reason that this is worth mentioning is that the pricing structure for storage is very simple: you pay for the amount of data that's stored, and the bandwidth used. Not only that, but the pricing is fairly competitive when compared with other storage systems (the price list is obviously subject to change, but is shown at the S3 homepage). And, given that the data can be used on demand, you could use it to host large downloads for very short periods of time (say, when a new version of your popular IDE is released). You're not just taking advantage of the storage but also of the bandwidth; Amazon is probably one of the better companies from an internet connectivity perspective. Once you've finished distributing it, you can tear the storage down, and you only pay for the time that it was hosted and the bandwidth used to download it.
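To put rough numbers on the short-lived download scenario, here's a back-of-the-envelope sketch. The rates (0.15 dollars per GB per month of storage and 0.20 dollars per GB transferred) are assumptions for illustration only, as is pro-rating storage over a 30-day month; the S3 homepage has the real price list. The function name and the example figures (a 100MB installer, hosted for a week, downloaded 10,000 times) are hypothetical too.

```python
def s3_hosting_cost(size_gb, days_hosted, downloads,
                    storage_rate=0.15, transfer_rate=0.20):
    """Estimate the cost of hosting one file for a short period.

    storage_rate  -- assumed dollars per GB per month of storage
    transfer_rate -- assumed dollars per GB downloaded
    Returns a (storage_cost, transfer_cost) tuple in dollars.
    """
    # Storage is pro-rated over an assumed 30-day month.
    storage_cost = size_gb * storage_rate * (days_hosted / 30)
    # Every download transfers the whole file once.
    transfer_cost = size_gb * downloads * transfer_rate
    return storage_cost, transfer_cost


if __name__ == "__main__":
    storage, transfer = s3_hosting_cost(size_gb=0.1, days_hosted=7,
                                        downloads=10_000)
    print(f"storage:  ${storage:.4f}")
    print(f"transfer: ${transfer:.2f}")
    print(f"total:    ${storage + transfer:.2f}")
```

Under those assumed rates, the week of storage costs a fraction of a cent and the bill is almost entirely bandwidth, which is why tearing the storage down afterwards matters less than how many people download the file.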

EC2 is the Elastic Compute Cloud, which is in essence a remote virtualisation system. You provide an image of an operating system, and then bring up that system on virtualised hardware to do whatever processing you want to (although I'm sure there are probably some restrictions in the terms of service). Like S3, the payment structure is simple enough; you pay for the time that your virtualised hardware is running, and the bandwidth consumed by the software. The base virtual hardware spec seems reasonable enough; a fairly sizeable hard drive, memory and processor are yours, and if there are any faults with the (real) hardware, your OS can be brought up again on another instance within minutes. So it gives you the benefit of a managed server, but at somewhat more competitive rates; it works out at around $70 per month.
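The monthly figure is just the hourly charge multiplied up for a machine that never gets switched off. As a rough sketch, assuming a rate of 0.10 dollars per instance-hour and 0.20 dollars per GB of bandwidth (both are assumptions here, so check the current price list), and with a hypothetical function name:

```python
def ec2_monthly_cost(hourly_rate=0.10, days=30,
                     transfer_gb=0.0, transfer_rate=0.20):
    """Estimate a month's bill for one always-on instance, in dollars.

    hourly_rate   -- assumed dollars per instance-hour
    transfer_rate -- assumed dollars per GB of bandwidth
    """
    hours = 24 * days                      # instance runs around the clock
    compute_cost = hours * hourly_rate
    bandwidth_cost = transfer_gb * transfer_rate
    return compute_cost + bandwidth_cost


if __name__ == "__main__":
    # One instance, running for a 30-day month, before any bandwidth.
    print(f"${ec2_monthly_cost():.2f}")
    # The same instance serving a hypothetical 50GB of traffic.
    print(f"${ec2_monthly_cost(transfer_gb=50):.2f}")
```

The flip side of paying by the hour is that an instance you only need for a few hours of number-crunching costs a few tens of cents under the same assumptions, rather than a month's rental.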

The image for the OS must be hosted on S3, but you aren't charged bandwidth for S3--EC2 communication. So, whilst paring down the operating system image will make it cheaper to store, you can bounce it as many times as you like and not have to pay for downloading the image each time.

Unfortunately, at the moment, EC2 is only useful for a subset of tasks. Grid computing, where you need processor power to do the work, is the obvious one. But as an 'unreliable' service, it can't guarantee that whatever you store on the OS hard drive will still be there. If your system crashes and it's restarted on a different box, you just won't see whatever was there. It's somewhat like installing an operating system into /tmp on Linux. (By 'unreliable', I don't mean to suggest that it will fall over a lot of the time; rather, it isn't 'reliable' in the sense that you can't rely on the data being there next time.) The problem is that some systems (notably transactional ones, but also others like mail servers) require a persistent store to record their data on, and almost always the interface to this is a filing system. If this new system effectively doesn't have persistent storage, then how are you going to deal with that problem? The same is true for any other system that accepts data (e.g. databases) and where you assume that the system is in a consistent state at all times.
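The persistence problem can be shown with a toy model. This is a simulation only, not anything from the EC2 API: the class, the function and the paths are all hypothetical, and a plain dictionary stands in for whatever external store (S3, say) survives a restart.

```python
class Instance:
    """Toy virtual machine whose local disk does not survive a restart."""

    def __init__(self, external_store):
        self.local_disk = {}                  # lost when the box goes away
        self.external_store = external_store  # assumed to outlive the box

    def write_local(self, path, data):
        """Write to the instance's own (non-persistent) disk."""
        self.local_disk[path] = data

    def checkpoint(self, path):
        """Copy one local file out to the external store."""
        self.external_store[path] = self.local_disk[path]


def restart_on_new_box(instance):
    """Simulate a crash: a fresh instance with an empty local disk."""
    return Instance(instance.external_store)


if __name__ == "__main__":
    store = {}
    vm = Instance(store)
    vm.write_local("/var/mail/inbox", b"message 1")
    vm.checkpoint("/var/mail/inbox")
    vm.write_local("/var/mail/inbox", b"message 1 + message 2")  # not yet copied

    vm = restart_on_new_box(vm)
    print("/var/mail/inbox" in vm.local_disk)    # False: local copy is gone
    print(vm.external_store["/var/mail/inbox"])  # b'message 1'
```

Even with checkpoints, anything written since the last copy disappears with the box, which is exactly the window that a mail server or transactional system can't afford.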

So whilst EC2 is good for computational grids and for systems serving static files (like web servers, on the assumption that your logs of who's visiting aren't important), it's going to be useless to anyone who wants to deploy application-level servers such as J2EE or its ilk. Which is a shame, really, because this is exactly the kind of platform that you'd want to be able to use for them, and I can't see that S3 is a particularly good fit. Now, if only they'd make an S3 filing system (e.g. one you could mount remotely like NFS), then they'd have a winner on their hands.