Sometimes it is useful for companies to know where exactly there customers are: are they distributed equally
across the country or are there clusters. The company can use this information to improve services,
acquire new customers, open new stores, etc.
For this geometrical data structures are needed, that support
orthogonal range queries like e. g. k-d-trees.
For our example application we need the latitude and the longitude.
The goal was, to calculate for every user the number of other users in his neighborhood, for
in a 5km^2 large area. The process should be able to run in a distributed cluster. The data are saved in
HDFS. The application was testet in a virtual cluster
with 3 nodes / Apache Spark workers.
A k-d-tree can either be a empty, a leaf or an inner node. These three definitions can be specified
in Scala inductively with case-classes.
This kind of inductive definition of algebraic datatypes is very similiar to Haskell.
In my diploma thesis you'll find a
thorough description and explanation of the implementation (but only in German).
Processing with Apache Spark
The data in fileSortedByUser is filtered and only the valid rows
at the time point dt are taken.
This is implemented in the function filterToLatest.
The k-d-tree kdt is created with the help of methods defined for the
resilient distributed dataset (RDD):
groupByKey() and mapValues.
The neighbors are saved in a CSV file. Remark: type information like
RDD[(Point2, CustomerId)] is not necessary in Scala, but
it helps readability and maintainability a lot.
Analysis with R
With Apache Spark one can easily create sums, aggregations and reductions. These can be plotted
in R with ggplot2.
The plot on the left shows the number of check-ins per day and the simple moving average for 7 days.
The graph in the middle shows the check-ins per month and
the graph to thew right the check-ins per hour of the day.
The first diagram is calculated with the folloding code (slightly simplified):
The following histogram shows the number users that have a specific number of neighbours in a 5km^2 area.
With ggmap you also can create maps with Google or
The following map shows the "hot spots" of people with many neighbours in red: Houston and San Francisco.