Best CUDA Book for Advanced Users

April 16, 2014

With CUDA, it’s easy to speed up a calculation by a factor of 2-5. How to achieve higher speedups is explained in “CUDA Programming” by Shane Cook.

I recommend this book as a “second book” on CUDA. Anyone who already has basic CUDA knowledge, e.g., from the book “CUDA by Example” by Sanders and Kandrot, can take the next steps here.

Initial success comes relatively easily with CUDA. In many cases, however, a factor of 9-10 would also be possible (*). Reaching it takes background knowledge and some optimization techniques. The goal of this book is to turn a CUDA beginner with a small speedup into a CUDA professional who can achieve this factor of 10 (p. xiii).

And that is exactly what the author has achieved. This book contains a lot of useful information, and Chapter 9 in particular should be mandatory reading for every CUDA developer.

Unfortunately, in my opinion, the order of the chapters is not ideal. Very specific optimizations are demonstrated in the opening chapters, before Chapter 9 gives a proper overview of the basics and the various optimization techniques. In particular, the AES case study in Chapter 7 felt disruptive; it is best skipped on a first reading and read last.

Then there are a few minor shortcomings. For example, some terms I looked up were missing from the index. It is also bad practice to print information in a book that can quickly become outdated, such as installation routines, command-line arguments, and IDEs. Space is used quite generously in some places, with entire code listings printed in full; the book could be 50 pages shorter without losing any information. It is also a bit distracting that benchmark timings were always copied directly from the command-line output into the book rather than being formatted as a table. In some places the book is already outdated, so some information is no longer correct, such as the requirement of having two GPUs for debugging (p. 63).

(*) Note: A speedup factor of 10 is, of course, not always possible, because it depends on the sequential dependencies of the algorithm. There are also problems that are believed to be inherently sequential and therefore cannot be parallelized efficiently, the so-called P-complete problems.
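
One common way to make the dependence on sequential parts concrete is Amdahl's law. The book review itself does not mention it; it is used here only as an illustration. The symbols are assumptions: p is the fraction of the runtime that can be parallelized and N is the number of parallel processing units.

```latex
% Amdahl's law: speedup with N parallel units when a fraction p of the
% runtime is parallelizable and the remaining (1 - p) stays sequential.
S(N) = \frac{1}{(1 - p) + \frac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}

% A factor of 10 is only reachable if the upper bound allows it:
\frac{1}{1 - p} \ge 10 \iff p \ge 0.9
```

Under this simplified model, a speedup of 10 requires that at least 90% of the original runtime can be parallelized, no matter how many cores the GPU has. The model ignores memory transfers and other overheads, so real speedups may be lower.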

  • Shane Cook
  • CUDA Programming
  • Morgan Kaufmann
  • 2012

See also the review on Amazon.
