Using Elasticsearch with Spring Boot - Technical background

This is the third part in a series of four. It explains the technical background.

With Spring Boot it is easy to glue together different components into a complex application. The following is the list of dependencies for this project, as declared for the build tool Gradle:

dependencies {
    compile 'com.fasterxml.jackson.core:jackson-core:2.6.4'
    compile 'com.fasterxml.jackson.core:jackson-databind:2.6.4'
    compile 'org.springframework.boot:spring-boot-starter-data-elasticsearch'
    compile 'org.springframework.boot:spring-boot-starter-mail'
    compile 'org.springframework.boot:spring-boot-starter-web'
    compile 'org.codehaus.groovy:groovy'

    providedRuntime 'org.springframework.boot:spring-boot-starter-tomcat'
    providedRuntime 'org.apache.tomcat.embed:tomcat-embed-jasper'
    providedRuntime 'javax.servlet:jstl'

    testCompile 'org.springframework.boot:spring-boot-starter-test'
    testCompile 'org.spockframework:spock-core:1.0-groovy-2.4'
}

The spring-boot-starter projects make it possible to wire the components together in Spring's IoC container.

Building and running this project is a one-liner on the command line: $ gradle bootRun

Importing the emails into Elasticsearch

A class that should be persisted in Elasticsearch has to be marked with the @Document annotation from the Spring Data Elasticsearch project. The index and the type that Elasticsearch should use are specified as parameters. In this case an index named "email" and a type named "email" are used.

In this example an Email has a list of recipients, a list of senders, a subject, a sentDate, a receivedDate and a list of texts (an email is usually sent as a multipart message, so we have to use a list of texts). We ignore attached documents and images:

@Document(indexName = "email", type = "email")
class Email {

    @Id
    Long id         // Spring Data needs an @Id, so we use a surrogate one

    @Field( type = FieldType.Object )
    List<EmailAddress> recipients

    @Field( type = FieldType.Object )
    List<EmailAddress> froms    // the senders; this name matches the constructor and the stored JSON

    String subject

    @Field( type = FieldType.Date, format = DateFormat.custom, pattern = Constants.DATE_FORMAT)
    @JsonFormat(shape = JsonFormat.Shape.STRING, pattern = "yyyy-MM-dd HH:mm:ss")
    Date sentDate

    @Field( type = FieldType.Date, format = DateFormat.custom, pattern = Constants.DATE_FORMAT)
    @JsonFormat(shape = JsonFormat.Shape.STRING, pattern = "yyyy-MM-dd HH:mm:ss")
    Date receivedDate

    List<String> texts

    Email() {
        recipients = new LinkedList<EmailAddress>()
        froms = new LinkedList<EmailAddress>()
    }
    ...

For simple datatypes nothing has to be specified; see for example the subject, which is a plain string. For complex datatypes like the lists of EmailAddresses, Elasticsearch has to know that the data should be stored as an embedded object inside the document. This is done by using the @Field annotation and setting the type to FieldType.Object.

An email looks like this to Elasticsearch:

{
    _index: "email",
    _type: "email",
    _id: "-9223372036854775763",
    _source: {
        id: -9223372036854776000,
        recipients: [
            {
                orig: "joern@dinkla.com",
                name: "",
                email: "joern@dinkla.com"
            }
        ],
        froms: [
            {
                orig: "Some company <no_reply@somecompany.de>",
                name: "Some company",
                email: "no_reply@somecompany.de"
            }
        ],
        subject: "Very important message ...",
        sentDate: "2015-08-07 00:38:11",
        receivedDate: "2015-08-07 00:38:12",
        texts: [
            "Sehr geehrter Herr Dinkla, anbei erhalten Sie ...",
            "<html><head>..."
        ]
    }
}
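Each address entry in the document above has three fields: the original header value (orig), a display name (name) and the bare address (email). The following is a minimal sketch of how such a split could be done in plain Java. The class AddressParts and its regular expression are hypothetical and are not the EmailAddress class of this project; a real mail library handles many more header formats.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits a header value like
// "Some company <no_reply@somecompany.de>" into orig, name and email.
final class AddressParts {

    // Optional (possibly quoted) display name followed by <address>.
    private static final Pattern NAME_AND_ADDRESS =
            Pattern.compile("^\\s*\"?([^\"<]*?)\"?\\s*<([^<>]+)>\\s*$");

    final String orig;
    final String name;
    final String email;

    private AddressParts(String orig, String name, String email) {
        this.orig = orig;
        this.name = name;
        this.email = email;
    }

    static AddressParts parse(String orig) {
        Matcher m = NAME_AND_ADDRESS.matcher(orig);
        if (m.matches()) {
            return new AddressParts(orig, m.group(1).trim(), m.group(2).trim());
        }
        // No angle brackets: treat the whole value as the address, with an empty name.
        return new AddressParts(orig, "", orig.trim());
    }
}
```

Calling AddressParts.parse("joern@dinkla.com") gives an empty name and the address itself as email, which corresponds to the recipients entry shown above.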

The conversion from and to JSON is done by the Jackson JSON library. The @JsonFormat annotations in the class definition specify the date format.
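To see what the pattern "yyyy-MM-dd HH:mm:ss" produces, here is a small sketch using only the JDK. It assumes that the pattern letters are interpreted as in java.text.SimpleDateFormat and that dates are written in UTC; the class name EmailDateFormat is made up for this illustration and is not part of the project.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper that formats and parses dates with the pattern
// used in the @JsonFormat annotations.
final class EmailDateFormat {

    static final String PATTERN = "yyyy-MM-dd HH:mm:ss";

    // SimpleDateFormat is not thread-safe, so a fresh instance is created per call.
    private static SimpleDateFormat newFormat() {
        SimpleDateFormat format = new SimpleDateFormat(PATTERN);
        format.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: UTC
        format.setLenient(false);                        // reject dates like 2015-13-45
        return format;
    }

    static String format(Date date) {
        return newFormat().format(date);
    }

    static Date parse(String text) {
        try {
            return newFormat().parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not in format " + PATTERN + ": " + text, e);
        }
    }
}
```

A value such as "2015-08-07 00:38:11" from the document above can be parsed with EmailDateFormat.parse and is formatted back to the same string.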

Querying the emails

In the application we want to count the number of emails that contain a specific text in the subject or in the body. This is an aggregation. In SQL you would write something like:

SELECT dt, topic, COUNT(*) as num
FROM table
WHERE topic IN topiclist
GROUP BY dt, topic

In Spring Boot and Spring Data the class that communicates with the database is called a repository. Spring Data has a powerful mechanism that automatically creates a repository implementation with many query methods. If you just want the vanilla functionality, it is sufficient to create an interface that extends one of the repository interfaces. In this app we need:

import org.springframework.data.elasticsearch.repository.ElasticsearchRepository

interface EmailRepository extends ElasticsearchRepository<Email, Long>, EmailRepositoryCustom  {
}

The user-defined methods are declared in the interface EmailRepositoryCustom:

interface EmailRepositoryCustom {

    Long findMaximalId()

    Histogram<String, Integer> getWeeklyHistogram(String topic)
}

These two methods are implemented in the class EmailRepositoryImpl:

@Repository
class EmailRepositoryImpl implements EmailRepositoryCustom {

    @Autowired
    ElasticsearchTemplate elasticsearchTemplate;

    Long findMaximalId() { ...

    Histogram<String, Integer> getWeeklyHistogram(String topic) {
        SearchQuery searchQuery = new NativeSearchQueryBuilder()
                .withQuery(matchQuery("texts", topic))
                .withSearchType(SearchType.COUNT)
                .withIndices("email")
                .withTypes("email")
                .addAggregation(
                    AggregationBuilders.dateHistogram(topic)
                        .field("sentDate")
                        .interval(DateHistogram.Interval.WEEK)
                        .format("yyyy-MM-dd"))
                .build();
        Aggregations aggregations = elasticsearchTemplate.query(searchQuery, new ResultsExtractor<Aggregations>() {
            @Override
            Aggregations extract(SearchResponse response) {
                return response.getAggregations()
            }
        });
        Map a = aggregations.asMap()
        InternalDateHistogram tmpHist = a[topic]
        return new Histogram<String, Integer>(topic, tmpHist)
    }
}

The @Repository annotation marks the class as a repository implementation, so that Spring registers it as a bean. The @Autowired annotation tells Spring to inject a "bean" of type ElasticsearchTemplate into the field.

The ElasticsearchTemplate is used in the method getWeeklyHistogram to execute a query built with the NativeSearchQueryBuilder.
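To make clear what a weekly date histogram computes, here is a rough in-memory sketch in plain Java. It is not how Elasticsearch implements the aggregation. It assumes that a week starts on Monday and that each bucket is labelled with its first day in the format yyyy-MM-dd, as requested in the query. The names WeeklyCounts and countPerWeek are invented for this example.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.TemporalAdjusters;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a weekly histogram over sent dates.
final class WeeklyCounts {

    // Returns a map from the Monday of each week (yyyy-MM-dd) to the
    // number of dates that fall into that week, sorted by week.
    static Map<String, Integer> countPerWeek(List<LocalDate> sentDates) {
        Map<String, Integer> buckets = new TreeMap<>();
        for (LocalDate date : sentDates) {
            // Round the date down to the Monday of its week (assumption: weeks start on Monday).
            LocalDate weekStart = date.with(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY));
            // LocalDate.toString() yields the ISO form yyyy-MM-dd.
            buckets.merge(weekStart.toString(), 1, Integer::sum);
        }
        return buckets;
    }
}
```

In the real query the emails are first filtered with the match query on "texts"; only the matching emails would be passed to a function like countPerWeek.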

Remark: This post was adapted to the new blog format in November 2016.