Examples for parallelism: ray tracer

I wanted to learn some frameworks for parallel computing, so I decided to implement a simple ray tracer as a "visual benchmark".

A ray tracer is a good choice, because

  • it is easy to parallelize, because each pixel can be processed independently of the other pixels. It is a data-parallel problem; some even call it "embarrassingly parallel".
  • on the other hand, ray tracing is not regular. The work per pixel varies, because it depends on how many times the ray is reflected. So ray tracing is not too simple a problem either.
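Both points can be seen in a tiny sequential baseline. The following is a minimal sketch, not the ray tracer from the videos: `shadePixel` and `render` are hypothetical names, and the scene (one unit sphere, a camera at z = -3, a light behind the camera) is an assumption made only for illustration. Each pixel reads nothing but its own coordinates, so the order of evaluation does not matter, and pixels that miss the sphere return early and are cheaper than pixels that hit it.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-pixel work: one primary ray from a camera
// at (0, 0, -3) towards a unit sphere at the origin. Returns a grey value.
inline std::uint8_t shadePixel(int x, int y, int width, int height) {
    // Ray through the pixel centre, direction (u, v, 1) before normalization.
    double u = 2.0 * (x + 0.5) / width - 1.0;
    double v = 1.0 - 2.0 * (y + 0.5) / height;
    double dz = 1.0 / std::sqrt(u * u + v * v + 1.0);  // z of the unit direction

    // Solve |o + t*d|^2 = 1 with o = (0, 0, -3): t^2 + 2*b*t + c = 0.
    double b = -3.0 * dz;
    double c = 9.0 - 1.0;
    double disc = b * b - c;
    if (disc < 0.0) return 0;                 // ray misses the sphere: cheap pixel

    double t = -b - std::sqrt(disc);          // nearest intersection
    double hitZ = -3.0 + t * dz;              // z of the hit point (= z of the normal)
    double diffuse = std::fmax(0.0, -hitZ);   // direction to the light is (0, 0, -1)
    return static_cast<std::uint8_t>(255.0 * diffuse);
}

// Sequential baseline. reverseOrder only exists to show that the pixel order
// is irrelevant, which is what makes the problem data-parallel.
std::vector<std::uint8_t> render(int width, int height, bool reverseOrder = false) {
    assert(width > 0 && height > 0);
    const int count = width * height;
    std::vector<std::uint8_t> image(static_cast<std::size_t>(count));
    for (int i = 0; i < count; ++i) {
        int p = reverseOrder ? count - 1 - i : i;
        image[p] = shadePixel(p % width, p / width, width, height);
    }
    return image;
}
```

Calling `render(64, 48)` and `render(64, 48, true)` gives the same image, which is exactly the property a parallel version relies on.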

The following table shows the implemented variants (as of 2011):

Language  Framework                 CPU/GPU  Cores/SMs/Threads
C++       sequential                CPU      1 core = 1 thread
C++       POSIX threads (pthreads)  CPU      4 cores * 2 hyperthread = 8 threads
C++       OpenMP                    CPU      4 cores * 2 hyperthread = 8 threads
C++       NVIDIA CUDA               GPU      16 SM = 512 cores = 512 threads
C++       Cell processor            CPU      6 SPUs
Java      sequential                CPU      1
Java 7    ForkJoinPool              CPU      4 cores * 2 hyperthread = 8 threads

Cell Broadband Engine / PLAYSTATION 3 (2009)

I started in 2009 with the PS3. The Cell Broadband Engine is a multi-core processor. One of the cores, the so-called PPE, is a general-purpose processor that can handle I/O, memory, etc. In addition there are 6 so-called SPEs that are specialized for number crunching. All the cores are 128-bit SIMD vector processors.

There are two ways to parallelize.

  1. Run the ray tracer on the six SPEs and merge the results.
  2. Rewrite the ray tracer to process 4 rays simultaneously using the SIMD vectors.
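The first option boils down to cutting the image into one band of rows per worker. A minimal sketch of that split in plain C++, assuming a hypothetical helper `rowBand`; it does not use the Cell SDK. Because the bands do not overlap, "merging" only means that every worker's rows end up in the same frame buffer. On the real hardware the data would also have to be moved between main memory and each SPE's local store, which is left out here.

```cpp
#include <cassert>
#include <utility>

// Half-open row range [first, second) rendered by worker `index` when
// `height` rows are shared among `workers` workers (hypothetical helper).
inline std::pair<int, int> rowBand(int height, int workers, int index) {
    assert(height >= 0 && workers > 0 && index >= 0 && index < workers);
    // Integer arithmetic keeps the bands contiguous even when height is not
    // a multiple of workers; band sizes then differ by at most one row.
    int begin = height * index / workers;
    int end = height * (index + 1) / workers;
    return std::make_pair(begin, end);
}
```

With 480 rows and 6 workers, each worker gets 80 rows: worker 0 renders rows 0 to 79, worker 5 rows 400 to 479.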

I implemented the first option. The following video shows the ray tracer in action. I started with 1 SPE and incrementally increased the number up to the maximum of 6 available SPEs.

C++ and POSIX Threads (2009)

As a second example I ported the ray tracer to the Mac and the PC, using the pthreads library. The following video shows it on an Intel Core 2 Duo at 2.2 GHz (a laptop processor from 2009).

And here on an Intel Core i7 920 at 2.67 GHz.
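A minimal sketch of how a pthreads variant can be structured, not the code from the videos. `Band`, `renderBand`, `renderWithPthreads` and the gradient in `shadePixel` are hypothetical placeholders; only `pthread_create` and `pthread_join` are real library calls. Each thread fills its own band of rows in a shared image, so no locking is needed.

```cpp
#include <pthread.h>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical placeholder for tracing one pixel (a simple gradient here).
inline std::uint8_t shadePixel(int x, int y) {
    return static_cast<std::uint8_t>((x + 2 * y) & 0xFF);
}

// Work description handed to each thread: a band of rows in a shared image.
struct Band {
    std::uint8_t* image;
    int width;
    int rowBegin;
    int rowEnd;
};

// Thread entry point with the signature pthread_create expects.
void* renderBand(void* arg) {
    const Band* band = static_cast<const Band*>(arg);
    for (int y = band->rowBegin; y < band->rowEnd; ++y)
        for (int x = 0; x < band->width; ++x)
            band->image[static_cast<std::size_t>(y) * band->width + x] = shadePixel(x, y);
    return nullptr;
}

std::vector<std::uint8_t> renderWithPthreads(int width, int height, int numThreads) {
    assert(width > 0 && height > 0 && numThreads > 0);
    std::vector<std::uint8_t> image(static_cast<std::size_t>(width) * height);
    std::vector<Band> bands(numThreads);       // sized up front, so addresses stay valid
    std::vector<pthread_t> threads(numThreads);
    std::vector<bool> started(numThreads, false);

    for (int i = 0; i < numThreads; ++i) {
        bands[i] = Band{image.data(), width,
                        height * i / numThreads, height * (i + 1) / numThreads};
        // pthread_create returns 0 on success; otherwise render on this thread.
        started[i] = pthread_create(&threads[i], nullptr, renderBand, &bands[i]) == 0;
        if (!started[i]) renderBand(&bands[i]);
    }
    // Wait for all workers before the image is handed back.
    for (int i = 0; i < numThreads; ++i)
        if (started[i]) pthread_join(threads[i], nullptr);
    return image;
}
```

`renderWithPthreads(640, 480, 8)` would match the 8 hardware threads of the Core i7 in the table. With GCC or Clang the program is typically built with `-pthread`.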

NVIDIA CUDA (2010 - 2011)

The port to the CUDA framework was relatively easy, and it was worth it, because GPUs are much faster. An NVIDIA 580 achieves 650 FPS at 640x480 pixels and 98 FPS at 1920x1148 pixels (in the video the frame rates are lower, because the camera app used up some of the processing power).

An NVIDIA 285 achieved 250 FPS at 640x480 pixels and 57 FPS at 1920x1030 pixels.
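Since the two cards were measured at slightly different large resolutions, the frame rates are easier to compare when converted to pixels per second. A small sketch of that arithmetic; `pixelsPerSecond` is a hypothetical helper and the inputs are only the figures quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Pixels rendered per second for a given resolution and frame rate.
inline std::int64_t pixelsPerSecond(int width, int height, int fps) {
    assert(width > 0 && height > 0 && fps >= 0);
    return static_cast<std::int64_t>(width) * height * fps;
}

// NVIDIA 580:  640x480   at 650 FPS -> 199,680,000 pixels/s
//              1920x1148 at  98 FPS -> 216,007,680 pixels/s
// NVIDIA 285:  640x480   at 250 FPS ->  76,800,000 pixels/s
//              1920x1030 at  57 FPS -> 112,723,200 pixels/s
```

By this measure both cards push more pixels per second at the larger resolution than at 640x480.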

Since then I have run further tests with newer GPUs.

C++ and OpenMP

The same ray tracer, parallelized with OpenMP.
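A minimal sketch of what an OpenMP variant can look like, not the code behind the video. `renderWithOpenMP` and `shadePixel` are hypothetical names, and the per-pixel loop with a varying number of "bounces" is a synthetic stand-in for rays that are reflected a different number of times. `schedule(dynamic)` is my assumption for this sketch: it hands out rows one at a time, so threads that got cheap rows pick up more work.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for tracing one pixel. The loop length varies per
// pixel to mimic rays that are reflected a different number of times.
inline std::uint8_t shadePixel(int x, int y) {
    int bounces = 1 + (x * 7 + y * 13) % 8;    // 1..8 synthetic "reflections"
    double intensity = 255.0;
    for (int i = 0; i < bounces; ++i)
        intensity *= 0.8;                      // each bounce loses 20 %
    return static_cast<std::uint8_t>(intensity);
}

std::vector<std::uint8_t> renderWithOpenMP(int width, int height) {
    assert(width > 0 && height > 0);
    std::vector<std::uint8_t> image(static_cast<std::size_t>(width) * height);
    // Rows are independent, so the outer loop can be shared among the threads.
    #pragma omp parallel for schedule(dynamic)
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            image[static_cast<std::size_t>(y) * width + x] = shadePixel(x, y);
        }
    }
    return image;
}
```

With GCC or Clang, OpenMP is enabled with `-fopenmp`. Without that flag the pragma is ignored and the same loop simply runs sequentially, which makes it easy to compare both versions.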


Java 7 introduced the new class ForkJoinPool, which is used in the following example.

Playlist of the videos

I created a playlist on YouTube that contains all the videos above.