k-d-trees with Apache Spark and Scala
July 02, 2015
This article shows how to use k-d-trees with Apache Spark.
Sometimes it is useful for companies to know where exactly their customers are: are they distributed evenly across the country, or do they form clusters? A company can use this information to improve services, acquire new customers, open new stores, etc. This calls for geometric data structures that support orthogonal range queries, such as k-d-trees.
In this example, data from the Stanford Network Analysis Project (SNAP) is used. The data set, called Gowalla, with the file name loc-gowalla_totalCheckins.txt.gz, has the following structure:
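Each line contains, separated by tabs: the user id, the check-in time, the latitude, the longitude, and the location id. An illustrative row:

```
[user]    [check-in time]         [latitude]       [longitude]       [location id]
0         2010-10-19T23:55:27Z    30.2359091167    -97.7951395833    22847
```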
For our example application we need the latitude and the longitude of each check-in.
The goal was to calculate, for every user, the number of other users in their neighborhood, for example within an area of 5 km². The process should be able to run on a distributed cluster. The data are stored in HDFS. The application was tested in a virtual cluster with 3 nodes / Apache Spark workers. The rest of this article covers three steps:
- k-d-trees in Scala
- Processing of the data with Apache Spark
- Data analysis and reporting with R
k-d-tree in Scala
A k-d-tree is a generalization of the one-dimensional search tree known from Computer Science 101. Our KdTree supports a rangeQuery, i.e. an orthogonal range query.
A k-d-tree can either be empty, a leaf, or an inner node. These three cases can be specified inductively in Scala with a sealed trait and case classes. This kind of inductive definition of algebraic data types is very similar to Haskell. In my diploma thesis you'll find a thorough description and explanation of the implementation (in German only).
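A minimal sketch of such a definition, together with a rangeQuery, might look as follows. The concrete names (Point2, Rect, KdEmpty, …) and the median-split construction are assumptions for illustration, not the original implementation from the thesis:

```scala
// Points and axis-aligned rectangles in the plane.
final case class Point2(x: Double, y: Double) {
  def apply(axis: Int): Double = if (axis == 0) x else y
}
final case class Rect(min: Point2, max: Point2) {
  def contains(p: Point2): Boolean =
    min.x <= p.x && p.x <= max.x && min.y <= p.y && p.y <= max.y
}

// A k-d-tree is either empty, a leaf, or an inner node -- an algebraic
// data type, expressed with a sealed trait and case classes.
sealed trait KdTree
case object KdEmpty extends KdTree
final case class KdLeaf(p: Point2) extends KdTree
final case class KdNode(p: Point2, axis: Int, left: KdTree, right: KdTree) extends KdTree

object KdTree {
  // Build a balanced tree by splitting at the median, alternating the axis.
  def build(ps: Seq[Point2], axis: Int = 0): KdTree = ps match {
    case Seq()  => KdEmpty
    case Seq(p) => KdLeaf(p)
    case _ =>
      val sorted = ps.sortBy(_(axis))
      val mid    = sorted.length / 2
      KdNode(sorted(mid), axis,
        build(sorted.take(mid), 1 - axis),
        build(sorted.drop(mid + 1), 1 - axis))
  }

  // Orthogonal range query: all points of t that lie inside the rectangle r.
  def rangeQuery(t: KdTree, r: Rect): List[Point2] = t match {
    case KdEmpty   => Nil
    case KdLeaf(p) => if (r.contains(p)) List(p) else Nil
    case KdNode(p, axis, left, right) =>
      val here = if (r.contains(p)) List(p) else Nil
      val ls   = if (r.min(axis) <= p(axis)) rangeQuery(left, r)  else Nil
      val rs   = if (r.max(axis) >= p(axis)) rangeQuery(right, r) else Nil
      here ++ ls ++ rs
  }
}
```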
Processing with Apache Spark
The data in fileSortedByUser are filtered so that only the rows valid at the point in time dt are kept; this is implemented in a dedicated function. The k-d-tree kdt is then created with the help of methods defined on the resilient distributed dataset (RDD). The neighbors ns are computed by the range query; the Haversine.neighborhood() method returns the rectangular neighborhood around a point.
The neighbors are saved in a CSV file. Remark: type information like RDD[(Point2, CustomerId)] is not necessary in Scala, but it helps readability and maintainability a lot.
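A condensed sketch of this pipeline, reusing the KdTree from the sketch above. The input and output paths, the broadcast of the tree, and the signature of Haversine.neighborhood() are assumptions for illustration; the filtering on dt is omitted:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object Haversine {
  // Assumption: approximate the neighborhood of areaKm2 around p (lat/lon in
  // degrees) by an axis-aligned rectangle; the original method may differ.
  def neighborhood(p: Point2, areaKm2: Double): Rect = {
    val half = math.sqrt(areaKm2) / 2                        // half side length in km
    val dLat = half / 110.574                                // km per degree of latitude
    val dLon = half / (111.320 * math.cos(math.toRadians(p.x)))
    Rect(Point2(p.x - dLat, p.y - dLon), Point2(p.x + dLat, p.y + dLon))
  }
}

object Neighbors {
  type CustomerId = Long

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kd-neighbors"))

    // user \t check-in time \t latitude \t longitude \t location id
    val points: RDD[(Point2, CustomerId)] =
      sc.textFile("hdfs:///data/loc-gowalla_totalCheckins.txt")
        .map(_.split('\t'))
        .filter(_.length == 5)
        .map(f => (Point2(f(2).toDouble, f(3).toDouble), f(0).toLong))

    // Build the k-d-tree once on the driver and broadcast it to the workers.
    val kdt = sc.broadcast(KdTree.build(points.map(_._1).collect()))

    // For every user: how many other users fall into a ~5 km^2 rectangle?
    val ns: RDD[(CustomerId, Int)] = points.map { case (p, id) =>
      (id, KdTree.rangeQuery(kdt.value, Haversine.neighborhood(p, 5.0)).size - 1)
    }

    ns.map { case (id, n) => s"$id,$n" }
      .saveAsTextFile("hdfs:///out/neighbors")
  }
}
```

Broadcasting the tree keeps each range query local to its worker; the original implementation may distribute the tree differently.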
Analysis with R
With Apache Spark one can easily compute sums, aggregations, and reductions. These can be plotted in R, e.g. with ggplot2. The plot on the left shows the number of check-ins per day and the simple moving average over 7 days. The graph in the middle shows the check-ins per month, and the graph on the right the check-ins per hour of the day.
The data for the first diagram is calculated with the following code (slightly simplified):
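A minimal sketch of such an aggregation, assuming the tab-separated format shown above (object name, method name, and the use of the date prefix are illustrative):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object CheckinStats {
  // Check-ins per day: key each record by the date prefix of its ISO
  // timestamp ("2010-10-19T23:55:27Z" -> "2010-10-19") and count per key.
  def checkinsPerDay(sc: SparkContext, path: String): RDD[(String, Int)] =
    sc.textFile(path)
      .map(_.split('\t'))
      .filter(_.length == 5)
      .map(f => (f(1).take(10), 1))
      .reduceByKey(_ + _)
      .sortByKey()
}
```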
The following histogram shows the number of users that have a specific number of neighbours in a 5 km² area.
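The histogram input can be derived from the neighbor counts ns of the Spark sketch above; a fragment:

```scala
// For each neighbour count n: how many users have exactly n neighbours?
val hist: RDD[(Int, Long)] =
  ns.map { case (_, n) => (n, 1L) }.reduceByKey(_ + _)
```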
With ggmap you can also create maps based on e.g. Google Maps or OpenStreetMap. The following map shows the "hot spots" of users with many neighbours in red: Houston and San Francisco.
The source code is available on GitHub.
Remark: This post was adapted to the new blog format in November 2016.