Introduction to ElasticSearch

This article is a conference-talk-turned-blog-post. I’ve given talks on this subject at both the phpBenelux user group (March 2013) and during the DamnData event (October 2013). Head over here if you prefer slides or video.

Scope

This blog post is an introduction to some of the basic concepts of using ElasticSearch in a web application. It uses examples and lessons learned from working on the social media tool Engagor over the course of the last two years.
ElasticSearch is marketed as a flexible and powerful open source, distributed real-time search and analytics engine for the cloud.
The application we’ve developed is mainly built in PHP, but since this post details the HTTP REST interface of ElasticSearch, the implementation language is irrelevant. We’ll also briefly touch on a few other related technologies.
I’m presenting a real use case that showcases most of the main features of ElasticSearch, so please bear with me while I explain a few things about our product and our company.

About Engagor

Screenshots used throughout this blog post are from a tool called Engagor, which is a social media monitoring and management application for real time customer analysis and engagement on Twitter, Facebook, Instagram and other social platforms.
We’re based in Gent (Belgium) and have recently opened an office in San Francisco. Our team consists of 30 people, with a technical staff of 10.
From a technical point of view you can picture Engagor as a huge database of social messages (tweets, Facebook comments, blog posts, Instagram photos, Foursquare check-ins, etc.; whatever keyword combination or social profiles our users set up to monitor and/or manage). We show several types of graphs (from volume per day, over geographic distributions, to author demographics) in a section called Insights and offer our clients several advanced search and workflow tools (assignments, tagging, real-time collaboration tools) in a section called Inbox to work with the monitored data.
All of this is powered by … wait for it … ElasticSearch.

What can I do with ElasticSearch?

Instead of immediately diving into the technical details, I’ll start by showing you one of the coolest things we’ve built with ElasticSearch so far.

One of the things our clients use Engagor for is social customer service. Often our users are dedicated social media teams managing the Twitter, Facebook and other accounts of bigger consumer-facing brands. Webcare consists of assigning messages to team members with the right skills, adding metadata like tags for easier classification, and replying to the customer.
Below is a screenshot from Engagor showing the number of incoming messages per day for a certain client, and how often and how fast they have been replied to.



  • The dataset for these graphs is about 40k social messages, spanning a date range of 28 days.
  • The purple bars in the top graph are “average response times per day”, showing how fast social mentions in the dataset are being replied to.
  • The graphs at the bottom show similar data (response times), but grouped per day-of-the-week and hour-of-the-day, both during and outside business hours.

Our clients use this dashboard to evaluate the performance of the customer support they deliver, e.g. through Twitter and Facebook.

The screenshot below shows a similar graph, but now with data from 3 months (adding up to 140k messages), some KPIs thrown in for good measure (above the graph), plus breakdowns per team member (the tables at the bottom).
In this graph you can see that the response time for this account has improved considerably over time.



The last – promised! – feature I want to brag about is the advanced filtering that comes with these graphs. Let’s say our user is interested in the response times, but only for mentions in a certain language, from a certain region, with a certain sentiment, with a certain meta-tag added to them, or any combination of these, e.g. because the company’s business goals say messages matching this filter need to be prioritized. The interface used to set up a filter like this is shown below. It supports field restrictions, AND & OR combinations and nesting.



This filter can be applied to all pages within the tool, resulting in the graphs from screenshots 1 and 2, but with statistics about the exact subset of the data you want.

The cool thing about all this is that these graphs and search capabilities are all real-time. As soon as a Tweet is posted on Twitter, as soon as a social message is assigned in Engagor, it’s included in these graphs.
The graphs above were generated with a single call to our ElasticSearch cluster that took 32 milliseconds.
A single call might not be a very scientific benchmark, I know, but I can assure you we get impressive speeds from the ElasticSearch cluster, even for larger datasets.

In the next sections, I will highlight the different features of ElasticSearch that we used to build this.

ElasticSearch’s History and Features in a Nutshell

If we talk about ElasticSearch we have to begin at its core: Lucene. Lucene is the search engine under the hood of ElasticSearch and its proven technology dates back to 1999, when it was built to offer features like text analysis (tokenizing, stemming, filtering, etc.), free text search (wildcard, range, proximity & fielded searches) and match scoring (relevance and field scoring). Lucene is an Apache Foundation project still under active development and used in various other projects, including Solr and ElasticSearch.

ElasticSearch first saw the light of day with a public release (v0.4) in February 2010. It was a rewrite by Shay Banon of his earlier Compass project, but this time with scalability features built into the very core of the product. ElasticSearch is a Java product, making it inherently cross-platform (yes, it runs on Windows), and the current release is v0.90.5. If you want to use it, need we say it’s free & open source? The tagline of the product is “You know, for search”. Well, at least for a while it was …

ElasticSearch transforms Lucene, a Java search engine library, into a full featured service, adding an easy API plus features for scaling, performance and high availability.

  • ElasticSearch can run as a cluster of several machines, called nodes, that act as a single entity. (The nodes do internal routing to pass operations to the node responsible for that part of the data.)
  • Your data is split into several blocks, called shards (the number of which is configurable), that are automatically balanced over the nodes in the cluster. Each shard can have zero or more replicas, which increase both read performance and availability. Replicas are distributed over the cluster so that in the event of a node shutdown, they transparently take over from primary shards. (A configuration sketch follows after this list.)
  • The cluster is self-managing with automatic master detection and failover.
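
The number of shards and replicas is set per index at creation time. A minimal sketch (the index name ‘myindex’ and the numbers are purely illustrative; if you omit the settings ElasticSearch falls back to its defaults):

$ curl -XPUT http://localhost:9200/myindex -d '
{"settings": {"number_of_shards": 5, "number_of_replicas": 1}}'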

Installation

If you’re on a Mac, with Java installed, testing ElasticSearch is as easy as downloading, unzipping and starting the service.

$ cd ~/Downloads
$ wget https://download [...] /elasticsearch-0.90.5.tar.gz
$ tar -xzf elasticsearch-0.90.5.tar.gz
$ cd elasticsearch-0.90.5/
$ ./bin/elasticsearch

(For detailed instructions on how to install and configure ElasticSearch on your system, head over to the official installation documentation.)

Once up and running, you can access ElasticSearch through its HTTP interface, meaning you can simply point your browser at it. Go to http://localhost:9200 to see a general status page, like the one below.


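The response looks roughly like this (a sketch; the exact fields, node name and version numbers depend on your installation):

{
  "ok" : true,
  "status" : 200,
  "name" : "Some Node Name",
  "version" : {
    "number" : "0.90.5"
  },
  "tagline" : "You know, for Search"
}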

As you can see the response from ElasticSearch is JSON. This endpoint of the API shows some basic version and status information. Other management and administration endpoints are available for more detailed information on cluster, node and index health.

Adding data to ElasticSearch

If you want to start adding data to your ElasticSearch cluster, you can do so right away with HTTP PUT requests. Let’s say we want to build an index of the matches the Belgian Red Devils played in the qualification round for the World Cup in Brazil. (Yes, we’re going!)

From the command line we fire up a curl request like this:

$ curl -XPUT http://localhost:9200/reddevils/matches/1 -d '
{"date": "2013-10-15T19:00:00Z", "opponent": "Wales", "result": "1-1"}'

This will add a record with id ‘1’ of type ‘matches’ to the index ‘reddevils’.

(There’s no need to define indices, types or schemas upfront. ElasticSearch tries to be smart about the data you’re giving it and does automatic type detection. It’s not always wise to trust this, as we’ll discuss later.)

The HTTP response from ElasticSearch will be:

{"ok":true,"_index":"reddevils","_type":"matches","_id":"1","_version":1}

If we add a second record to the index we fire up a second request, which might look like this for the match played against Croatia:

$ curl -XPUT http://localhost:9200/reddevils/matches/2 -d '
{"date": "2013-10-11T15:00:00Z", "opponent": "Croatia", "result": "1-2"}'

{"ok":true,"_index":"reddevils","_type":"matches","_id":"2","_version":1}

If we later want to add more relevant and interesting information to a record, we can do so by sending another HTTP PUT request for that record.

$ curl -XPUT http://localhost:9200/reddevils/matches/2 -d '
{"date": "2013-10-11T15:00:00Z", "opponent": "Croatia", "result": "1-2",
"girlfriend_attention_span": 30}'

{"ok":true,"_index":"reddevils","_type":"matches","_id":"2","_version":2}

You can see ElasticSearch has bumped its internal version number for this record to 2. Fetching the details of this record can be done by issuing an HTTP GET request, resulting in the following response.

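A sketch of that GET request and its response (the exact envelope fields may differ slightly per version):

$ curl -XGET 'http://localhost:9200/reddevils/matches/2?pretty=true'

{
  "_index" : "reddevils",
  "_type" : "matches",
  "_id" : "2",
  "_version" : 2,
  "exists" : true,
  "_source" : {"date": "2013-10-11T15:00:00Z", "opponent": "Croatia", "result": "1-2",
    "girlfriend_attention_span": 30}
}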


Searching your Indices

What we have shown so far are the features of a NoSQL store (and you can use ElasticSearch as your NoSQL store), but that’s obviously not where the product shines, so let’s move on to the interesting stuff …

In the following example we do an HTTP GET request to the /_search endpoint of our index and add the query string via the ‘q’ URL parameter.


http://localhost:9200/reddevils/matches/_search/?q=croatia

The result is a list of scored matching documents;



Passing search queries via the q parameter is a shorthand method which doesn’t really expose the full functionality of ElasticSearch. The query language ElasticSearch uses for defining search requests is called the Query DSL. There are three different types of search queries.

Full Text Search (query string)

In this case you will be searching in bits of natural language for (partially) matching query strings. The Query DSL alternative for searching for “Croatia” in all documents would look like:

curl -XGET 'http://localhost:9200/reddevils/matches/_search?pretty=true' -d '{
    "query": {
        "query_string": {
            "query": "croatia"
        }
    }
}'

And the result could look like:

{
  "took" : 18,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.40240064,
    "hits" : [ {
      "_index" : "reddevils",
      "_type" : "matches",
      "_id" : "2",
      "_score" : 0.40240064, "_source" : {"date": "2013-10-11T15:00:00Z",
        "opponent": "Croatia", "result": "1-2"}
    }, {
      "_index" : "reddevils",
      "_type" : "matches",
      "_id" : "4",
      "_score" : 0.3125, "_source" : {"date": "2012-09-11T15:00:00Z",
        "opponent": "Croatia", "result": "1-1"}
    } ]
  }
}

ElasticSearch exposes the advanced free text search features of Lucene, so you’re free to experiment with:

  • proximity searches ("word1 word2"~4 finds documents where word1 and word2 occur within a distance of 4 words of each other)
  • wildcard, range and fielded searches

Structured Search (filter)

In this type of search request we’re only looking for exact matches. The example below returns all matches of the Belgian Red Devils with the exact result “1-1”.

curl -XGET 'http://localhost:9200/reddevils/matches/_search?pretty=true' -d '{
    "query": {
        "constant_score": {
            "filter": {
                "term": {
                    "result": "1-1"
                }
            }
        }
    }
}'

When possible, it’s faster to use ElasticSearch filters than full text search with field restrictions (e.g. a query_string of result:"1-1"), because filter results can be cached efficiently by ElasticSearch.
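
For comparison, the query_string equivalent with a field restriction would look roughly like this (functionally similar, but without the caching benefit of a filter):

curl -XGET 'http://localhost:9200/reddevils/matches/_search?pretty=true' -d '{
    "query": {
        "query_string": {
            "query": "result:\"1-1\""
        }
    }
}'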

Analytics (facets)

Requests of this type will not return a list of matching documents, but a statistical breakdown of them. So instead of a list of documents you get e.g. “the number of matches played against each opponent” or “the average attention span of my girlfriend”.

curl -XGET 'http://localhost:9200/reddevils/matches/_search?pretty=true' -d '{
    "size": 0,
    "facets": {
        "opponent": {
            "terms": {
                "field": "opponent"
            }
        }
    }
}'

The result of a terms facet on the “opponent” field in our matches index, after the qualification round, looks like this:

{
  ...
  "facets" : {
    "opponent" : {
      "_type" : "terms",
      "missing" : 0,
      "total" : 10,
      "other" : 0,
      "terms" : [ {
        "term" : "wales", "count" : 2
      }, {
        "term" : "serbia", "count" : 2
      }, ...  {
        "term" : "croatia", "count" : 2
      } ]
    }
  }
}

It shows we played each of our 5 opponents twice, which makes perfect sense.

Of course the more interesting stuff happens when you start combining these types of searches. What would the average attention span (facet) of my girlfriend be in matches we won (filter search) that had a review containing the words “Vincent” OR “Kompany” (query string)?
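
As a sketch of what such a combination could look like (assuming we had also indexed a boolean ‘won’ field and a free text ‘review’ field, which our toy index doesn’t actually have), the request might be something like:

curl -XGET 'http://localhost:9200/reddevils/matches/_search?pretty=true' -d '{
    "size": 0,
    "query": {
        "filtered": {
            "query": {
                "query_string": {
                    "query": "review:(Vincent OR Kompany)"
                }
            },
            "filter": {
                "term": {
                    "won": true
                }
            }
        }
    },
    "facets": {
        "attention_span": {
            "statistical": {
                "field": "girlfriend_attention_span"
            }
        }
    }
}'

The “mean” in the statistical facet output would then be the average attention span over the matching matches.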

One More Thing … Document Relations

Document relations are what ElasticSearch uses to couple different kinds of objects to each other, much like you would use a table JOIN in SQL. ElasticSearch provides two mechanisms for this …

Parent/Child Documents

When using parent & child relations you have two different types of objects in an index, and you define the relation by supplying the id of the “parent” document when indexing a “child” document. This can be used for any type of relation (one to one, one to many, many to many) and is recommended when you e.g. update parents or children independently.
A simple use case could be “products” and “offers” for an e-commerce site, where an “offer” has certain properties but also consists of one or more products. Independent but linked parent/child documents are perfectly suited for this, since products might be updated separately or can exist outside the context of an offer.

When searching through an index it’s possible to find parent documents based on restrictions on child documents, or vice versa, making all kinds of JOINs possible. This is basically a “query-time” join; depending on the type of query and where the different documents live in the cluster, it might be necessary to access data on other nodes, which obviously can have a certain performance impact.
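
A minimal sketch of how this looks in practice, using a hypothetical ‘reports’ child type in our red devils index (the _parent link has to be declared when the child type is created):

$ curl -XPUT http://localhost:9200/reddevils/reports/_mapping -d '
{"reports": {"_parent": {"type": "matches"}}}'

$ curl -XPUT 'http://localhost:9200/reddevils/reports/1?parent=2' -d '
{"author": "a journalist", "text": "Strong defensive display against Croatia"}'

$ curl -XGET 'http://localhost:9200/reddevils/matches/_search?pretty=true' -d '
{"query": {"has_child": {"type": "reports", "query": {"query_string": {"query": "croatia"}}}}}'

The last request returns the parent ‘matches’ documents that have at least one child ‘reports’ document matching the query.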

Nested Documents

A more specialised mechanism is nested documents, which allow you to define a one-to-many relation between objects. This is what Engagor uses to store social messages and the actions (assignments, replies, …) on them. In our use case an action is always associated with a single social message.

ElasticSearch takes care of denormalizing this structure and makes sure nested documents are stored close to their parent documents, thus assuring fast lookups. When we insert or update a social message we also supply its associated nested action documents, hence the term “index-time” join.
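
A rough sketch of how a message with nested actions might be modelled (index, type and field names are simplified and purely illustrative; the ‘nested’ type has to be declared in the mapping):

$ curl -XPUT http://localhost:9200/engagor_demo -d '
{"mappings": {"messages": {"properties": {"actions": {"type": "nested",
  "properties": {"type": {"type": "string"}, "delay": {"type": "integer"}}}}}}}'

$ curl -XPUT http://localhost:9200/engagor_demo/messages/1 -d '
{"text": "@brand my order never arrived", "publish_date": "2013-10-15T19:00:00Z",
 "actions": [{"type": "reply", "delay": 900}]}'

$ curl -XGET 'http://localhost:9200/engagor_demo/messages/_search?pretty=true' -d '
{"query": {"nested": {"path": "actions", "query": {"term": {"actions.type": "reply"}}}}}'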

Putting it all together …

Now that we’ve got all the pieces of the puzzle, we can put them together to arrive at the example we started off with:



A breakdown of the parts of the Query DSL object used to retrieve the graph above (a simplified sketch of the combined request follows after this list):

  1. A “range” filter on the message’s ‘publish_date’ field;
  2. A “query_string” with the filter defined by the user in the advanced search bar (actually a sanitized, reworked version of it);
  3. A “date_histogram” facet, with interval month, on the message’s ‘publish_date’ field;
  4. A “terms_stats” facet per type of action (defined via a “facet_filter”) on the ‘delay’ field of the nested action documents (an integer holding the difference between the action’s ‘date’ and the message’s ‘publish_date’), grouped by month.
    The results of these facets include:

    • The number of message documents with matching nested action documents;
    • The total number of action documents in the set of matching message documents;
    • The total amount of ‘delay’, a.k.a. the “response time”, for reply actions.

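A heavily simplified sketch of how points 1 to 3 combine into a single request (index and field names and the query string are purely illustrative, and the nested terms_stats facets of point 4 are left out for brevity):

curl -XGET 'http://localhost:9200/engagor_demo/messages/_search?pretty=true' -d '{
    "size": 0,
    "query": {
        "filtered": {
            "query": {
                "query_string": {
                    "query": "language:nl AND tag:complaint"
                }
            },
            "filter": {
                "range": {
                    "publish_date": {
                        "gte": "2013-07-01",
                        "lt": "2013-10-01"
                    }
                }
            }
        }
    },
    "facets": {
        "volume_per_month": {
            "date_histogram": {
                "field": "publish_date",
                "interval": "month"
            }
        }
    }
}'
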
The multiple filters and facets are grouped into a single Query DSL JSON Request object, of which the results are then processed to give the example above. Pretty neat, huh?

The Engagor Cluster

  • We’ve been running ElasticSearch for over two years now. (We used Solr previously, and switched mainly because of the better scaling features in ElasticSearch. We haven’t really looked back since, but we notice that the attention ElasticSearch receives sparks competition, which is certainly good for the end user.)
  • Right now there are about 1 billion social messages in our cluster, and our current ingestion rate is anything in between 20 to 100 new messages per second.
  • Our current cluster consists of 20 nodes; each node has about 24 GB of RAM, of which 14 GB is reserved for ElasticSearch. The same machines also run application scripts, a MySQL server and Memcache instances.
  • We use ElasticSearch as our main, but not only, data source, meaning almost all of the data needed for the application interface comes directly from ElasticSearch, and that the other data storage systems are mainly there for backup and consistency.
  • Our usage is very write-heavy. The rate at which we index new documents is high (several million new messages per day) compared to the number of reads needed from the system (reads are user-triggered). The number of changes to existing documents depends on the use case of the client (social webcare versus pure analytics).

All of this typically results in a dashboard that looks like this, where each line indicates a node and each green column a successful connection to other nodes, with blue indicating the server currently elected as master;



But there have been days where it didn’t look that good, and we’ve seen servers nearly drowning in load (because of CPU, memory as well as disk bottlenecks).
So, time for some lessons learned.

3 Lessons Learned

1/3: Indexing Speed

  • Bulk Indexing is faster, obviously, since there’s much less network overhead.
  • The application processes that add data to our cluster distribute the work via a message queue (RabbitMQ), to prevent data loss when there are ElasticSearch problems, but also to spread load (over time and over nodes). This way we’re also better equipped to handle peaks in data coming in from the 3rd party services we consume.

Bulk indexing is combined with a maximum delay (if “more than x items in queue” or “oldest message in queue older than x seconds”, do a bulk index operation), so Engagor users see incoming messages (and updated statistics) in near real time.
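
For reference, a bulk request sends multiple operations to the _bulk endpoint as newline delimited JSON (note the use of --data-binary rather than plain -d, so the newlines are preserved); a small sketch using our red devils documents:

$ curl -XPOST http://localhost:9200/_bulk --data-binary '{"index": {"_index": "reddevils", "_type": "matches", "_id": "1"}}
{"date": "2013-10-15T19:00:00Z", "opponent": "Wales", "result": "1-1"}
{"index": {"_index": "reddevils", "_type": "matches", "_id": "2"}}
{"date": "2013-10-11T15:00:00Z", "opponent": "Croatia", "result": "1-2"}
'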

You can also experiment with skipping the HTTP interface and talking to ElasticSearch directly at the Java level, if indexing speed remains a bottleneck.

2/3: Choose Sharding Strategy Wisely

The general advice on choosing the number of shards in your set-up is to plan for your expected growth, not for your current situation.

But I have to add that we’ve experienced quite a few issues related to the large number of shards we chose for our setup. Each customer has several topics, each topic is split into several shards depending on the volume of the data, and the data is also sliced into time windows. This results in 10k+ shards, and we’ve seen problems arise where the master node had trouble managing the cluster state for that number of shards. Recent versions of ElasticSearch contain improvements for managing big cluster states.

You can use “aliases” to create “virtual shards” or “windows on shards”. The overhead of an alias is much smaller than that of a real shard, but from the application’s point of view they act the same.
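
For example, a filtered alias can act as a window on a bigger index (a sketch; index, alias and field names are illustrative):

$ curl -XPOST http://localhost:9200/_aliases -d '
{"actions": [{"add": {"index": "engagor_demo", "alias": "engagor_demo_2013",
  "filter": {"range": {"publish_date": {"gte": "2013-01-01", "lt": "2014-01-01"}}}}}]}'

Searching through ‘engagor_demo_2013’ then behaves like searching a separate index that only contains the 2013 data.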

3/3: TRY to keep up with releases

ElasticSearch is a very young product, certainly when we started using it in 2011. In recent months, with the funding and the start of the ElasticSearch.com company (offering training, consultancy and support), things have really picked up speed for both the company and the open source product.

The latest release branch of the product, 0.90, saw five releases (0.90 to 0.90.5) in six months (April – September 2013). The 1.0 release is currently planned for early 2014 and will bring some very interesting updates, including improved backup support, an advanced percolator feature and more awesome possibilities for facets.

So it’s hard to keep up with the number of updates, especially since rolling upgrades are not yet part of the product (also planned for the 1.0 release), but it’s worth staying up to date since every version has brought important (and sometimes much needed) bugfixes and improvements.

Bonus tip: keep your JVM up to date; JVMs are still under active development too.

“Filtering, free text search & analytics all in the same box”

“You know, for Search”, ElasticSearch’s hello world once you’ve installed it, is maybe a narrow definition of what this product can do. I’ve actually shown that it also makes perfect sense to use ElasticSearch as an analytics engine without ever using any of its free text search features.
Because it offers filtering, search and analytics in the same box, we at Engagor don’t have to move big data between different systems, which is certainly a plus.

Apart from that we can offer our clients advanced features that put the power of search and data digging into the hands of our users.
It’s definitely been interesting for us to see which queries our clients come up with to fuel their day to day business.

Presentation Slides


View on Slideshare

More information?

If you’re looking for more information to get started or to dive deeper into ElasticSearch I recommend the following resources;

  • Official Documentation
    While it lacks clarity and completeness in several places, the official documentation is always the best place to start looking for a solution to your problem. At least, if you know what you’re searching for. I hope this article gives you some clues about what terms are used.
  • Video Tutorials & Presentations
  • IRC Channel
    irc.freenode.net has an active #elasticsearch channel where some helpful and smart people hang around.
  • ElasticSearch BE User Group
    If you live in or near Belgium, we have an active user group, kickstarted earlier this year, led by the people from data.be. There have already been several meet-ups with interesting introductory talks, but also more advanced topics (a look at the new percolator from one of the core contributors of ElasticSearch) as well as great tips and use cases from fellow Belgians.
    Join the ElasticSearch BE Meetup group to be notified of future events.
