New Features in Hadoop 0.20
2009-04-28 01:41:33 by Andrew Hitchcock

Hadoop 0.20 was released on April 22nd. I was curious about some of the changes in 0.20 so I did a little research and decided to blog about what I found. These are the changes, features, and improvements that were interesting to me. For a full list of changes, see the Hadoop 0.20 changelog.

My first impression was that 0.20 is less ambitious feature-wise than 0.19. That is not a denigration of Hadoop's committers, but an artifact of Hadoop's quarterly release cycle. You can read about Hadoop 0.19's features in Cloudera's excellent write-up.

Context Object For Mapper and Reducer
The biggest change in 0.20 is a large refactoring of the core MapReduce classes (HADOOP-1230). The commit message describes the changes best:

  1. All of the methods take Context objects that allow us to add new methods without breaking compatibility.
  2. Mapper and Reducer now have a "run" method that is called once and contains the control loop for the task, which lets applications replace it.
  3. Mapper and Reducer by default are Identity Mapper and Reducer.
  4. The FileOutputFormats use part-r-00000 for the output of reduce 0 and part-m-00000 for the output of map 0.
  5. The reduce grouping comparator now uses the raw compare instead of object compare.
  6. The number of maps in FileInputFormat is controlled by min and max split size rather than min size and the desired number of maps.

An example of the changes can be seen in map's method signature (before and after):

void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
protected void map(KEYIN key, VALUEIN value, Context context)

OutputCollector and Reporter can both be accessed through the new Context object.

This is a large change, so Hadoop added the classes in a new package at org.apache.hadoop.mapreduce. The old classes have been deprecated but can still be found under org.apache.hadoop.mapred.

This feature will allow Hadoop Core to quickly iterate and add new features without breaking end user code every release. This is an important milestone on the path to stability as Hadoop approaches 1.0.
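The compatibility benefit can be sketched in plain Java. The classes below are illustrative stand-ins, not the real Hadoop API: user code overrides map() and touches the Context only through its methods, so a later release can add methods to Context (like the hypothetical setStatus below) without breaking existing Mapper subclasses. The run() method shows the overridable control loop from point 2.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the Context pattern (illustrative only; not the
// actual Hadoop classes).
class ContextSketch {

    static class Context {
        final List<String> output = new ArrayList<>();
        private final Iterator<Map.Entry<String, String>> input;
        private Map.Entry<String, String> current;

        Context(Map<String, String> records) {
            this.input = records.entrySet().iterator();
        }

        boolean nextKeyValue() {
            if (!input.hasNext()) return false;
            current = input.next();
            return true;
        }

        String getCurrentKey()   { return current.getKey(); }
        String getCurrentValue() { return current.getValue(); }

        // Single collection point, replacing OutputCollector and Reporter.
        void write(String key, String value) { output.add(key + "\t" + value); }

        // Hypothetical method added in a later release; because callers only
        // see Context, old Mapper subclasses keep compiling unchanged.
        void setStatus(String msg) { /* would report progress */ }
    }

    // New-style Mapper: identity by default, with an overridable run() loop.
    static class Mapper {
        protected void map(String key, String value, Context context) {
            context.write(key, value);
        }

        public void run(Context context) {
            while (context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> records = new LinkedHashMap<>();
        records.put("k1", "v1");
        records.put("k2", "v2");
        Context ctx = new Context(records);
        new Mapper().run(ctx);
        System.out.println(ctx.output);
    }
}
```

Because run() is an ordinary overridable method, an application that wants multi-threaded maps or custom record batching can replace the loop without Hadoop needing to anticipate it.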

Vaidya
Hadoop 0.20 also includes a new performance tool called Vaidya (HADOOP-4179). Vaidya scans your job logs against a collection of rules to identify potential performance problems. For example, if it notices that your Mapper is writing a lot of data to disk, it will suggest adding a Combiner. It reminds me a bit of Fortify or FindBugs (or, if you're feeling less generous, Clippy) in that it suggests improvements when it detects bad practices (although it doesn't use powerful static analysis like the aforementioned tools).

BloomMapFile
HADOOP-3063 gives us BloomMapFile, a fast-failing implementation of MapFile. MapFiles don't seem to get much attention, but having used them for a project in my CSE 490h class with Hadoop 0.7.2, I appreciate the addition.
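The appeal is that a Bloom filter can answer "definitely absent" without touching disk, so lookups for missing keys fail fast; only "possibly present" answers fall through to the real MapFile index. A minimal sketch of that fast-fail idea in plain Java (illustrative only, not Hadoop's actual BloomMapFile implementation; sizes and hash scheme are arbitrary choices):

```java
import java.util.BitSet;

// Minimal Bloom filter: no false negatives, tunable false-positive rate.
class BloomSketch {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    BloomSketch(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the i-th bit position from the key via double hashing.
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1;  // force an odd second hash
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < hashes; i++) bits.set(position(key, i));
    }

    // false => key is definitely absent (fast fail, skip the disk seek);
    // true  => key may be present, fall through to the real index lookup.
    boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++)
            if (!bits.get(position(key, i))) return false;
        return true;
    }

    public static void main(String[] args) {
        BloomSketch bloom = new BloomSketch(1024, 3);
        bloom.add("hadoop");
        System.out.println(bloom.mightContain("hadoop"));  // always true once added
        // Most absent keys return false here without any disk read.
        System.out.println(bloom.mightContain("missing"));
    }
}
```

The trade-off is a small false-positive rate (a "maybe" for an absent key), which costs only a wasted index lookup, never a wrong answer.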

Removal of LZO
Finally, the last change that caught my attention was the removal of the LZO compression codec in HADOOP-4874. The LZO code was infected with GPL (I hear that's been going around recently) and thus had to be removed. However, the LZO code lives on as a side project on Google Code.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.