
Definitive Overview of the Hadoop Ecosystem

June 02, 2015

The Hadoop ecosystem has grown significantly over time. “Hadoop: The Definitive Guide” provides an overview of the framework’s most important topics and projects.

In addition to explaining the foundations of Hadoop, specifically HDFS, YARN, and MapReduce, the book covers extensions such as HBase, Spark, ZooKeeper, Avro, and many more in detail.

To get the most out of this book, however, you should already be familiar with the MapReduce paradigm and have some initial experience with Hadoop. It is not a book for absolute beginners. At the same time, it is not an ultimate expert guide either, as it often doesn’t delve deep enough.

What I personally miss is a “Hadoop in Practice” section that shares real-world experience, with tips and best practices for operation and development. What do typical use cases look like? What problems arise? How do uncompressed files typically affect performance, and what changes when compression is used?

In some places the book reads like very long documentation. There are also many superfluous tables, such as the table of all primitive data types in Hive. As a developer, I don’t use the book as a reference; I can find that kind of detail faster with Google than by looking it up in the book.

I find it a pity that machine learning, for example with Mahout, is not covered in the book at all. Monitoring Hadoop is also only touched upon; it would have been interesting to see how a cluster can be monitored and managed with tools like Nagios, Ganglia, Puppet, Chef, or Ambari. And nowadays, containerization with Docker should certainly not be missing either.

But all in all, it is a very comprehensive book, and I liked most parts, such as the handling of bad data, debugging, and profiling in Chapter 6, or the detailed description of how MapReduce jobs are invoked in Chapter 7.

I can recommend the book to anyone with the basic knowledge mentioned above.

  • Tom White
  • Hadoop: The Definitive Guide, 4th ed.
  • O’Reilly
  • 2015

See also the review on Amazon.

Category: Big data & data science