In honor of Earth Day, I'm going to talk a little bit about why these new utility computing services (or "cloud computing") are good for the environment (and business too).
One of the tenets of cloud computing is that you use what you need and you pay for what you use. Amazon S3 and SimpleDB, and Google App Engine, all charge based on storage, bandwidth, and CPU time. Additionally, these services run on shared infrastructure, so you don't have separate physical boxes serving your traffic. This allows Amazon and Google to run at high utilization knowing that, statistically, not all of their users will be hammering the service at once. For a detailed discussion of these statistics, read Power Provisioning for a Warehouse-sized Computer [PDF]. I guess you could say the same thing about Dreamhost (or any other shared server provider), since they cram many people on a box and run at high utilization, but they don't provide the same level of scalability as Amazon or Google (you are limited to a single box).
Why does utilization matter? If you can do the same amount of work on fewer computers by running them at higher utilization, you save the environmental cost of building those computers. Additionally, computers use lots of energy idling. Even though consolidating increases the power usage of each remaining computer, net power decreases as you take other systems offline. For a concrete example, imagine you have two computers that each use 100W idling and 200W at full load, and assume power scales linearly with load in between. If each machine is one-third utilized, it takes about 267W to perform your work (133W for each computer). However, consolidating this onto a single machine running at two-thirds utilization results in a total draw of about 167W, a savings of roughly 37%. Now… imagine running a datacenter at 80% or more load.
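The arithmetic in that example can be checked with a few lines of Python. This is only a sketch: the function name `power_draw` is made up for illustration, and it assumes power rises linearly from the idle figure to the full-load figure, which is a simplification of how real machines behave.

```python
def power_draw(utilization, idle_watts=100.0, full_watts=200.0):
    """Estimated draw in watts, ASSUMING a linear ramp from idle to full load."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return idle_watts + (full_watts - idle_watts) * utilization


# Two machines, each one-third utilized.
separate = 2 * power_draw(1 / 3)

# The same total work on one machine at two-thirds utilization.
consolidated = power_draw(2 / 3)

# Fraction of power saved by consolidating.
savings = 1 - consolidated / separate

print(f"separate:     {separate:.1f} W")
print(f"consolidated: {consolidated:.1f} W")
print(f"savings:      {savings:.1%}")
```

Because the idle draw is such a large share of the total, most of the savings comes from switching off the second machine's 100W baseline rather than from anything clever in the scheduling.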
In the second paragraph I didn't mention Amazon's EC2. EC2 differs from S3 or GAE because you do pay for what you don't use. EC2 charges based on how long you have a computer under your control, not by the utilization of that computer. However, this is still a much more granular level of control than colocation, since you can scale your fleet up or down as needed to meet load. When you return a machine, someone else is free to grab it. The end result is high utilization, because people won't pay to hold on to a machine that is sitting idle, and will instead shrink their fleet to match their load.
In my opinion this makes Hadoop the killer app for EC2. Users can spin up a cluster, run their job at full bore, and then return the computers to "the cloud". Companies such as Powerset and the NY Times have used EC2 for this very reason. This is a triumphant example of the market's ability to reduce energy consumption and resource usage (in the form of unneeded computers), because it directly translates into monetary savings, with no heavy-handed government mandates required.
Let me use an example of my own usage of S3 for backups. All the numbers will be based on those in the power provisioning paper mentioned above. I have 20 GB of data backed up in S3 and I want it to be available for instant access. This necessitates a constantly spinning hard disk. If I were to do it myself, I could just add another hard disk in my computer and stick a copy on that. Power usage of hard disk: 12W. This is cheap and easy, but doesn't give me offsite backup in case of a disaster.
For the storage server, using the numbers from the paper, we have 200W base power plus 12W for each disk. Giving the server eight disks (1 TB each) results in a total draw of 296W and 8192 GB of disk space. Let's double that to take into account battery backup and air conditioning in the datacenter, so 592W. Since the server is shared with others (we are aiming for high utilization, remember?), I'll have to figure out my usage as a proportion of the total. Before I do that, I should take durability into account. Distributed file systems store multiple copies of each file because hard disks and computers constantly fail in large systems. The industry standard seems to be three copies, judging by GFS and HDFS. Three copies results in 60 GB dedicated to my backups.
Adding this all up results in the following equation: 592W * (60 GB / 8192 GB) = 4.34W. As you can see, even with all the additional overhead and infrastructure, using S3 saves roughly two-thirds of the power versus having an extra spinning hard disk in my own machine (about 4.3W instead of 12W). Plus, my apartment is no longer a single point of failure.
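For anyone who wants to plug in their own numbers, the same estimate can be written as a small Python function. Treat it as a back-of-the-envelope sketch: `shared_storage_watts` and its parameter names are invented here, the defaults are simply the figures used in this post, and it assumes the server's disks are fully shared so that power can be split in proportion to space used.

```python
def shared_storage_watts(
    my_data_gb,
    replicas=3,            # copies kept for durability (assumed, per GFS/HDFS)
    base_watts=200.0,      # server power before any disks
    disk_watts=12.0,       # power per spinning disk
    disks=8,
    capacity_gb=8192.0,    # total disk space on the server
    overhead_factor=2.0,   # doubling for battery backup and air conditioning
):
    """Estimate my share of a shared storage server's power draw, in watts."""
    server_watts = base_watts + disk_watts * disks
    datacenter_watts = server_watts * overhead_factor
    stored_gb = my_data_gb * replicas
    # ASSUMPTION: power is attributed in proportion to the space I occupy.
    return datacenter_watts * stored_gb / capacity_gb


local_disk_watts = 12.0                 # one extra disk in my own machine
s3_watts = shared_storage_watts(20)     # my 20 GB of backups
savings = 1 - s3_watts / local_disk_watts

print(f"shared server share: {s3_watts:.2f} W")
print(f"local disk:          {local_disk_watts:.2f} W")
print(f"savings:             {savings:.0%}")
```

The result is most sensitive to the overhead factor and the replica count, so those are the two defaults worth changing first if you think my guesses are too generous or too harsh.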
Hopefully this article has given you a new way of looking at utility computing. Next time the media tries to make a fuss over the growing power usage of datacenters, think back to this and realize they might actually be saving energy.
Happy Earth Day,
Andrew