Today Jeff Dean gave a talk at the University of Washington about BigTable - their system for storing large amounts of data in a semi-structured manner. I was unable to find much info about BigTable on the internet, so I decided to take notes and write about it myself.
First an overview. BigTable has been in development since early 2004 and has been in active use for about eight months (about February 2005). There are currently around 100 cells for services such as Print, Search History, Maps, and Orkut. Following Google's philosophy, BigTable was an in-house development designed to run on commodity hardware. BigTable allows Google to have a very small incremental cost for new services and expanded computing power (they don't have to buy a license for every machine, for example). BigTable is built atop their other services, specifically GFS, Scheduler, Lock Service, and MapReduce.
Each table is a multi-dimensional sparse map. The table consists of rows and columns, and each cell has a time version. There can be multiple copies of each cell with different times, so they can keep track of changes over time. In his examples, the rows were URLs and the columns had names such as "contents:" (which would store the file data) or "language:" (which would contain a string such as "EN").
In order to make each manage the huge tables, the tables are split at row boundaries and saved as tablets. Tablets are each around 100-200 MB and each machine stores about 100 of them (they are stored in GFS). This setup allows fine grain load balancing (if one tablet is receiving lots of queries, it can shed other tablets or move the busy tablet to another machine) and fast rebuilding (when a machine goes down, other machines take one tablet from the downed machine, so 100 machines get new tablet, but the load on each machine to pick up the new tablet is fairly small).
Tablets are stored on systems as immutable SSTables and a tail of logs (one log per machine). When system memory is filled, it compacts some tablets. He went kind of fast through this, so I didn't have time to write everything down, but here is the overview: There are minor and major compactions. Minor compactions involve only a few tablets, while major ones involve the whole system. Major compactions can reclaim hard disk space. The location of the tablets are actually stored in special BigTable cells. The lookup is a three-level system. The clients get a pointer to the META0 tablet (there is only one). This tablet is heavily used, and so one machine usually ends up shedding all its other tablets to support the load. The META0 tablet keeps track of all the META1 tablets. These tables contain the location of the actual tablet being looked up. There is no big bottleneck in the system, because they make heavy use of pre-fetching and caching.
Back to columns. Columns are in the form of "family:optional_qualifier". In his example, the row "www.cnn.com" might have the columns "contents:" with the HTML of the page, "anchor:cnn.com/news" with the anchor text of that link ("CNN Homepage"), and "anchor:stanford.edu/" with that anchor text ("CNN"). Columns have type information. Columns families can have attributes/rules that apply to their cells, such as "keep n time entries" or "keep entries less than n days old". When tablets are rebuilt, these rules are applied to get rid of any expired entries. Because of the design of the system, columns are easy to create (and are created implicitly), while column families are heavy to create (since you specify things like type and attributes). In order to optimize access, column families can be split into locality groups. Locality groups cause the columns to be split into different SSTables (or tablets?). This increases performance because small, frequently accessed columns can be stored in a different spot than the large, infrequent columns.
All the tablets on one machine share a log; otherwise, one million tablets in a cluster would result in way too many files opened for writing (there seems to be a discrepancy here, he said 100 tablets per machine and 1000 machines, but that doesn't equal one million tablets). New log chunks are created every so often (like 64 MB, which would correspond with the size of GFS chunks). When a machine goes down, the master redistributes its log chunks to other machines to process (and these machines store the processed results locally). The machines that pick up the tablets then query the master for the location of the processed results (to update their recently acquired tablet) and then go directly to the machine for their data.
There is a lot of redundant data in their system (especially through time), so they make heavy use of compression. He went kind of fast and I only followed part of it, so I'm just going to give an overview. Their compression looks for similar values along the rows, columns, and times. They use variations of BMDiff and Zippy. BMDiff gives them high write speeds (~100MB/s) and even faster read speeds (~1000MB/s). Zippy is similar to LZW. It doesn't compresses as highly as LZW or gzip, but it is much faster. He gave an example of a web crawl they compressed with the system. The crawl contained 2.1B pages and the rows were named in the following form: "com.cnn.www/index.html:http". The size of the uncompressed web pages was 45.1 TB and the compressed size was 4.2 TB, yielding a compressed size of only 9.2%. The links data compressed to 13.9% and the anchors data compressed to 12.7% the original size.
They have their eye on the future with some features under consideration. 1. Expressive data manipulation, including having scripts sent to clients to modify data. 2. Multi-row transaction support. 3. General performance for larger cells. 4. BigTable as a service. It sounds like each service (such as Maps or Search History) have their own cluster running BigTable. They are considering running a Google-wide BigTable system, but that would require fairly splitting resources and compute time, etc.
The talk was very interesting and it contained lots new information I had never heard before (I haven't seen a BigTable paper yet). This information is accurate to the best of my knowledge, but if you see a mistake, please e-mail me. I wonder what the next Google talk will be about... I can't wait.
Google has finally released the official BigTable paper.