k-d-trees with Apache Spark and Scala
July 02, 2015
This article shows how to use k-d-trees with Apache Spark.
Sometimes it is useful for companies to know where exactly their customers are: are they distributed evenly across the country, or do they form clusters? A company can use this information to improve services, acquire new customers, open new stores, etc. This calls for geometric data structures that support orthogonal range queries, such as k-d-trees.
In this example, data from the Stanford Network Analysis Project (SNAP) is used. The data set, called Gowalla, with the file name loc-gowalla_totalCheckins.txt.gz, has the following structure:
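Each line contains, separated by tabs: the user id, the check-in time, the latitude, the longitude, and the location id. An illustrative row:

```
[user]    [check-in time]         [latitude]       [longitude]       [location id]
0         2010-10-19T23:55:27Z    30.2359091167    -97.7951395833    22847
```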
For our example application we need the latitude and the longitude of each check-in.
The goal was to calculate, for every user, the number of other users in their neighborhood, for example within an area of 5 km². The process should be able to run on a distributed cluster. The data are stored in HDFS. The application was tested in a virtual cluster with 3 nodes / Apache Spark workers. The rest of this article covers three steps:
- k-d-trees in Scala
- Processing of the data with Apache Spark
- Data analysis and reporting with R
k-d-tree in Scala
A k-d-tree is a generalization of the one-dimensional search tree known from Computer Science 101. Our KdTree supports a rangeQuery, i.e. an orthogonal range query.
A k-d-tree can either be empty, a leaf, or an inner node. These three cases can be specified inductively in Scala with a sealed trait and case classes. This kind of inductive definition of algebraic data types is very similar to Haskell. In my diploma thesis you'll find a thorough description and explanation of the implementation (in German only).
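A minimal sketch of such a definition, together with a rangeQuery, might look as follows. The concrete names (Point2, Rect, KdEmpty, …) and the median-split construction are assumptions for illustration, not the original implementation from the thesis:

```scala
// Points and axis-aligned rectangles in the plane.
final case class Point2(x: Double, y: Double) {
  def apply(axis: Int): Double = if (axis == 0) x else y
}
final case class Rect(min: Point2, max: Point2) {
  def contains(p: Point2): Boolean =
    min.x <= p.x && p.x <= max.x && min.y <= p.y && p.y <= max.y
}

// A k-d-tree is either empty, a leaf, or an inner node -- an algebraic
// data type, expressed with a sealed trait and case classes.
sealed trait KdTree
case object KdEmpty extends KdTree
final case class KdLeaf(p: Point2) extends KdTree
final case class KdNode(p: Point2, axis: Int, left: KdTree, right: KdTree) extends KdTree

object KdTree {
  // Build a balanced tree by splitting at the median, alternating the axis.
  def build(ps: Seq[Point2], axis: Int = 0): KdTree = ps match {
    case Seq()  => KdEmpty
    case Seq(p) => KdLeaf(p)
    case _ =>
      val sorted = ps.sortBy(_(axis))
      val mid    = sorted.length / 2
      KdNode(sorted(mid), axis,
        build(sorted.take(mid), 1 - axis),
        build(sorted.drop(mid + 1), 1 - axis))
  }

  // Orthogonal range query: all points of t that lie inside the rectangle r.
  def rangeQuery(t: KdTree, r: Rect): List[Point2] = t match {
    case KdEmpty   => Nil
    case KdLeaf(p) => if (r.contains(p)) List(p) else Nil
    case KdNode(p, axis, left, right) =>
      val here = if (r.contains(p)) List(p) else Nil
      val ls   = if (r.min(axis) <= p(axis)) rangeQuery(left, r)  else Nil
      val rs   = if (r.max(axis) >= p(axis)) rangeQuery(right, r) else Nil
      here ++ ls ++ rs
  }
}
```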
Processing with Apache Spark
The data in fileSortedByUser are filtered so that only the rows valid at the point in time dt are kept; this is implemented in a dedicated function. The k-d-tree kdt is then created with the help of methods defined on the resilient distributed dataset (RDD). The neighbors ns are computed by the range query; the Haversine.neighborhood() method returns the rectangular neighborhood around a point.
The neighbors are saved in a CSV file. Remark: type information like RDD[(Point2, CustomerId)] is not necessary in Scala, but it helps readability and maintainability a lot.
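A condensed sketch of this pipeline, reusing the KdTree from the sketch above. The input and output paths, the broadcast of the tree, and the signature of Haversine.neighborhood() are assumptions for illustration; the filtering on dt is omitted:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object Haversine {
  // Assumption: approximate the neighborhood of areaKm2 around p (lat/lon in
  // degrees) by an axis-aligned rectangle; the original method may differ.
  def neighborhood(p: Point2, areaKm2: Double): Rect = {
    val half = math.sqrt(areaKm2) / 2                        // half side length in km
    val dLat = half / 110.574                                // km per degree of latitude
    val dLon = half / (111.320 * math.cos(math.toRadians(p.x)))
    Rect(Point2(p.x - dLat, p.y - dLon), Point2(p.x + dLat, p.y + dLon))
  }
}

object Neighbors {
  type CustomerId = Long

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kd-neighbors"))

    // user \t check-in time \t latitude \t longitude \t location id
    val points: RDD[(Point2, CustomerId)] =
      sc.textFile("hdfs:///data/loc-gowalla_totalCheckins.txt")
        .map(_.split('\t'))
        .filter(_.length == 5)
        .map(f => (Point2(f(2).toDouble, f(3).toDouble), f(0).toLong))

    // Build the k-d-tree once on the driver and broadcast it to the workers.
    val kdt = sc.broadcast(KdTree.build(points.map(_._1).collect()))

    // For every user: how many other users fall into a ~5 km^2 rectangle?
    val ns: RDD[(CustomerId, Int)] = points.map { case (p, id) =>
      (id, KdTree.rangeQuery(kdt.value, Haversine.neighborhood(p, 5.0)).size - 1)
    }

    ns.map { case (id, n) => s"$id,$n" }
      .saveAsTextFile("hdfs:///out/neighbors")
  }
}
```

Broadcasting the tree keeps each range query local to its worker; the original implementation may distribute the tree differently.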
Analysis with R
With Apache Spark one can easily compute sums, aggregations, and reductions. These can be plotted in R, e.g. with ggplot2. The plot on the left shows the number of check-ins per day and the simple moving average over 7 days. The graph in the middle shows the check-ins per month, and the graph on the right the check-ins per hour of the day.
The data for the first diagram is calculated with the following code (slightly simplified):
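A minimal sketch of such an aggregation, assuming the tab-separated format shown above (object name, method name, and the use of the date prefix are illustrative):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object CheckinStats {
  // Check-ins per day: key each record by the date prefix of its ISO
  // timestamp ("2010-10-19T23:55:27Z" -> "2010-10-19") and count per key.
  def checkinsPerDay(sc: SparkContext, path: String): RDD[(String, Int)] =
    sc.textFile(path)
      .map(_.split('\t'))
      .filter(_.length == 5)
      .map(f => (f(1).take(10), 1))
      .reduceByKey(_ + _)
      .sortByKey()
}
```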
The following histogram shows the number of users that have a specific number of neighbours in a 5 km² area.
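The histogram input can be derived from the neighbor counts ns of the Spark sketch above; a fragment:

```scala
// For each neighbour count n: how many users have exactly n neighbours?
val hist: RDD[(Int, Long)] =
  ns.map { case (_, n) => (n, 1L) }.reduceByKey(_ + _)
```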
With ggmap you can also create maps based on e.g. Google Maps or OpenStreetMap. The following map shows the "hot spots" of users with many neighbours in red: Houston and San Francisco.
The source code is available on GitHub.
Remark: This post was adapted to the new blog format in November 2016.