Good Hands-on Introduction to Apache Spark
July 06, 2015
Anyone who wants to learn the basics of Spark would do well to read “Learning Spark.” I particularly liked how practice-oriented the book is: you can try out the examples in the Spark shell as you go.
I also found the description of setting up a cluster in Chapter 7 very helpful.
To benefit from this book, you should already have advanced programming skills in Python, Scala, or Java. You don’t need to know MapReduce.
Unfortunately, GraphX and SparkR are not covered.
Most examples are provided in three languages: Python, Scala, and Java 7. Unfortunately, Java 8 was not used, which makes the examples very lopsided: without lambdas, the Java versions are far more verbose than their Python and Scala counterparts. The Java 7 code almost ruins the book. In the authors’ defense, the Hadoop ecosystem had not yet made the transition to Java 8 at the time.
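To make the difference concrete, here is a minimal sketch in plain Java that is not taken from the book. The `Mapper` interface and the `map` helper are hypothetical stand-ins for the single-method function interfaces that Spark's Java API expects, so the snippet runs without Spark on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a single-method function interface;
// not part of Spark, defined here only for illustration.
interface Mapper<T, R> {
    R call(T value);
}

class LambdaComparison {
    // Illustrative helper: applies a Mapper to every element of a list.
    static <T, R> List<R> map(List<T> input, Mapper<T, R> f) {
        List<R> out = new ArrayList<>();
        for (T item : input) {
            out.add(f.call(item));
        }
        return out;
    }

    // Java 7 style: the function is passed as an anonymous inner class.
    static List<Integer> lengthsJava7(List<String> lines) {
        return map(lines, new Mapper<String, Integer>() {
            @Override
            public Integer call(String line) {
                return line.length();
            }
        });
    }

    // Java 8 style: the same function written as a lambda.
    static List<Integer> lengthsJava8(List<String> lines) {
        return map(lines, line -> line.length());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("spark", "rdd");
        System.out.println(lengthsJava7(lines)); // prints [5, 3]
        System.out.println(lengthsJava8(lines)); // prints [5, 3]
    }
}
```

Both methods do the same thing, but the lambda version is about as short as the equivalent Python or Scala one-liner, which is why the three-language examples would have looked far more balanced with Java 8.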
Errors have also crept in when the examples were ported between languages, and the versions are not always equivalent (e.g., the VerifyCallLogs() in the Java version is missing from the other versions, Example 6-18).
Example 4-25, the “Scala Page Rank,” taught me that explicit type information can be helpful. Although the Python and Scala examples are short and concise, I find that the loop in this example is harder to follow because the data types are missing. It is not code that you understand immediately. I will therefore annotate my own Scala transformations with types, even where they are optional, so that they are more readable.
In the “Spark SQL Performance” section, it sounds a bit as if the authors are also the developers of Spark SQL, along the lines of “Look at our great system.” They are somewhat uncritical here: SQL optimization is a very broad and complicated field, and Spark SQL is probably not yet at the level of traditional SQL databases.
The example in “Machine Learning Basics” is, in my opinion, too complicated for an introduction. And in Example 6-12, the authors use an anti-pattern: they program against the concrete implementation ArrayList rather than the List interface.
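The point about ArrayList can be shown with a short sketch. This is not the code from Example 6-12; the class and method names below are hypothetical and chosen only for illustration. The idea is to name the concrete class once, where the object is created, and to use the List interface everywhere else.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class InterfaceExample {
    // The parameter is declared as List, so callers may pass an ArrayList,
    // a LinkedList, or any other implementation.
    static int totalLength(List<String> words) {
        int total = 0;
        for (String word : words) {
            total += word.length();
        }
        return total;
    }

    // The variable and return type are List; ArrayList appears only in
    // the constructor call and can be swapped without touching callers.
    static List<String> sampleWords() {
        List<String> words = new ArrayList<>();
        words.add("learning");
        words.add("spark");
        return words;
    }

    public static void main(String[] args) {
        System.out.println(totalLength(sampleWords())); // prints 13

        // A different implementation works with the same method.
        List<String> linked = new LinkedList<>(sampleWords());
        System.out.println(totalLength(linked)); // prints 13
    }
}
```

Had `totalLength` been declared with an `ArrayList<String>` parameter, the second call would not compile, which is the kind of unnecessary coupling the anti-pattern creates.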
And, last but not least, the book has “Learning” in its title. So why are there no exercises, perhaps even with solutions in the appendix?
Overall, though, it is a good and successful introduction.
- Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia
- Learning Spark
- O’Reilly
- 2015
See also the review on Amazon.