Splunk Infrastructure Monitoring is known for monitoring modern infrastructure: consuming metrics from sources like AWS, Docker, and Kafka, applying analytics to that data in real time, and enabling alerting that cuts down the noise. Core to how we do that is search. It's not enough to process, store, and retrieve data faster than anything else out there if it takes users a long time to find the data they care about. So to match the speed of the SignalFlow analytics engine that sits at the core of Splunk Infrastructure Monitoring, we've been using Elasticsearch for our search needs from day one.
In this post, we’ll go over some lessons learned from monitoring and alerting on Elasticsearch in production, at scale, in a demanding environment with very high performance expectations.
We've found Elasticsearch to be highly scalable, with a great API, and very easy for all our engineers to work with. Its ease of setup also makes it accessible to developers without operational experience. Furthermore, it's built on Lucene, which we've found to be solid.
Since the launch of Splunk Infrastructure Monitoring, our Elasticsearch deployment has grown from 6 shards to 24 (plus replicas) spread over 72 machines holding many hundreds of millions of documents. And it’ll soon double to keep up with our continued growth.
At Splunk Infrastructure Monitoring, every engineer or team that writes a service also operates that service — running upgrades, doing instrumentation, monitoring and alerting, establishing SLOs, performing maintenance, being on-call, maintaining runbooks, etc. Some of the challenges we face might be unique to our scale and how we use Elasticsearch, but some of them are universal to everyone who uses Elasticsearch.
Elasticsearch provides a fairly complete set of metrics for indexes, clusters, and nodes. We collect those metrics using collectd and the collectd-elasticsearch plugin. An important aspect of how Splunk Infrastructure Monitoring works and why people use it is what we call “dimensions”. These are basically metadata shipped in with metrics that you can aggregate, filter, or group by—for example: get the 90th percentile of latency for an API call response grouped by service and client type dimensions.
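To make the idea concrete, here is a minimal Python sketch of what grouping by dimensions means. The datapoints, dimension names, and values are made up for illustration; this shows the concept, not how SignalFlow actually computes it.

```python
# Illustrative sketch (not Splunk code): each datapoint carries dimension
# metadata alongside its value, and an aggregation like p90 is computed per
# unique (service, client_type) pair.
from collections import defaultdict
from statistics import quantiles

# Hypothetical datapoints: (latency_ms, dimensions)
datapoints = [
    (120, {"service": "api", "client_type": "mobile"}),
    (250, {"service": "api", "client_type": "mobile"}),
    (80,  {"service": "api", "client_type": "web"}),
    (300, {"service": "billing", "client_type": "web"}),
    (95,  {"service": "billing", "client_type": "web"}),
]

groups = defaultdict(list)
for value, dims in datapoints:
    # Group by the two dimensions we care about.
    groups[(dims["service"], dims["client_type"])].append(value)

for key, values in groups.items():
    # quantiles(n=10) yields the 10th..90th percentiles; the last one is the 90th.
    p90 = quantiles(values, n=10)[-1] if len(values) > 1 else values[0]
    print(key, "p90 latency:", p90, "ms")
```

The same group-by works for any dimension shipped with a metric, which is what makes dimension-aware dashboards and alerts possible.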
There are a few things we've added to the original collectd-elasticsearch plugin to take advantage of dimensions, which we'll be submitting as PRs soon. You can now track metrics per index as well as cluster-wide metrics; both are enabled by default in the plugin and can be switched on or off in the config.
If you use collectd and the collectd-elasticsearch plugin, Splunk Infrastructure Monitoring provides built-in dashboards displaying the metrics that we’ve found most useful when running Elasticsearch in production at the node, cluster, and cross-cluster levels. We’re looking at adding index level dashboards in the future.
| | | | |
|---|---|---|---|
| CPU Load | Disk IOPs | Deleted Docs % | Top Indexes by Search Requests |
| Memory Utilization | Search requests / sec | Active Merges | Top Indexes by Indexing Requests |
| Heap Utilization | Indexing requests / sec | # Clusters | Top Indexes by Query Latency |
| GC Time % | Merges / sec | # Nodes | Top Indexes by Index Growth |
| Avg Query Latency | File Descriptors | Nodes / Cluster | Top Clusters by Search Requests |
| Requests / sec | Segments | Filter Cache Size | Top Clusters by Indexing Requests |
| Doc Growth Rate % | Thread Pool Rejections | Field Data Cache Size | Top Clusters by Query Latency |
| | | | Top Clusters by Index Growth |
As the infrastructure changes and nodes come into or out of service, all charts, dashboards, and alerts automatically account for the changes and don't have to be modified in any way.
With a large number of nodes, you have to figure out whether problems are cluster-wide or machine-specific. We used to run into full threadpools frequently, sometimes caused by a large number of pending requests and sometimes by a single slow node dragging down the performance of a whole batch of requests.
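As a toy illustration of the second case, here is a sketch of the kind of comparison that separates a single slow node from a cluster-wide slowdown. The node names, latencies, and outlier threshold are all made up.

```python
# Illustrative sketch (assumed metric shape, not Splunk's implementation):
# flag any node whose query latency is far above the cluster median, which
# usually points at a node-specific problem rather than a cluster-wide one.
from statistics import median

# Hypothetical per-node average query latencies in ms.
node_latency_ms = {
    "es-node-01": 35, "es-node-02": 40, "es-node-03": 38,
    "es-node-04": 420,  # one slow node dragging down whole batches of requests
}

cluster_median = median(node_latency_ms.values())
OUTLIER_FACTOR = 3  # assumed threshold; tune for your workload

for node, latency in node_latency_ms.items():
    if latency > OUTLIER_FACTOR * cluster_median:
        print(f"{node}: {latency} ms vs cluster median {cluster_median} ms "
              "-> likely a node-specific problem")
```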
There are basically three scales of problems to contend with – cluster, shard, and node – and typically you have to look at all three. Some examples from the trenches:
At our scale, the volume of metrics emitted by Elasticsearch is huge. It's impossible to look at the raw metrics and alert on them in any useful manner. So we've had to figure out derived metrics that are actually useful, with alert conditions that don't inundate the on-call engineer, applying Splunk Infrastructure Monitoring's powerful analytics engine to do the math in real time so we don't miss any anomalies.
In one example, we used to have checks on the state of every node, but because of the way Elasticsearch works (if the cluster becomes yellow or red, every machine in the cluster reports yellow or red), that meant 72 alerts, one per node. We've since switched to taking the cluster status reported by each host, assigning it a numerical value (0 for green, 1 for yellow, 2 for red), and alerting on the maximum. Now, even when a single problem node turns the cluster yellow and all 72 instances report yellow, only one alert fires, which keeps the noise down.
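Conceptually, the derived metric looks like the following Python sketch. The node names are made up, and in production the computation runs in SignalFlow rather than in application code.

```python
# Conceptual sketch: every node reports the cluster status it sees; we map
# the status to a number and alert on the maximum across all reporters, so
# 72 yellow reports collapse into a single alert.
STATUS_VALUE = {"green": 0, "yellow": 1, "red": 2}

def cluster_alert_level(per_node_status):
    """per_node_status: dict of node name -> status string reported by that node."""
    return max(STATUS_VALUE[s] for s in per_node_status.values())

# Hypothetical scenario: all 72 nodes see the cluster as yellow.
reports = {f"es-node-{i:02d}": "yellow" for i in range(1, 73)}
level = cluster_alert_level(reports)
if level >= 1:
    print("single alert fired: cluster status is", ["green", "yellow", "red"][level])
```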
In another example, we know that Elasticsearch can recover from a failed machine by rebuilding its replicas on other nodes. We also know, based on shard size and from timing it in practice, that recovery can take up to an hour and a half. We use this to decide whether it makes sense to wake somebody up: by applying duration thresholds to the alert conditions, an alert is triggered only if one of three conditions persists beyond the expected recovery window.
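The duration threshold itself is simple. Here is an illustrative, simplified sketch of the idea, assuming a 90-minute recovery window; the actual conditions and windows depend on your shard sizes.

```python
# Illustrative sketch: only page someone if the unhealthy condition has held
# continuously for longer than the expected recovery window, so a routine
# replica recovery doesn't wake anyone up.
import time

RECOVERY_WINDOW_SEC = 90 * 60  # recovery of a failed node can take ~1.5 hours
_unhealthy_since = None

def should_page(cluster_is_unhealthy, now=None):
    """Return True only once the condition has persisted past the recovery window."""
    global _unhealthy_since
    now = now if now is not None else time.time()
    if not cluster_is_unhealthy:
        _unhealthy_since = None          # condition cleared; reset the timer
        return False
    if _unhealthy_since is None:
        _unhealthy_since = now           # condition just started
    return now - _unhealthy_since >= RECOVERY_WINDOW_SEC
```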
Putting all our experience over the last few years together, here’s what we’ve found most useful to alert on:
Because of the way sharding works in Elasticsearch, we've found that scaling and capacity management have to be thought through clearly and treated as a proactive process. There are basically two ways to scale: add disk capacity to existing nodes or reshard onto more nodes. The first is low-risk and non-disruptive. Resharding is a complex process, and doing it while the old index is still being written to makes it even more complex. We've had to develop some methods of our own to make it work at Splunk Infrastructure Monitoring, where we can't afford to lose updates to metadata or to stop serving queries while resharding is in progress. In addition, at our scale it is not a fast process: it can take many days. There's no getting around the physics of moving bits. You can read about how we do resharding and not only guarantee that no updates are lost, but also provide ways to pause and roll back the process.
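As a simplified illustration of why updates don't get lost during a reshard, here is a sketch of a dual-write pattern: every update goes to both the old and the new index while documents are backfilled in the background, and reads cut over only once the new index has caught up. The index names and client setup are hypothetical, and the full process is covered in that separate post.

```python
# Illustrative dual-write sketch using the elasticsearch Python client
# (index names and connection details are hypothetical).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

OLD_INDEX = "metadata_v1"   # existing index still serving reads
NEW_INDEX = "metadata_v2"   # larger index being backfilled

def write_document(doc_id, doc):
    """Apply every update to both indexes so nothing is lost mid-migration."""
    es.index(index=OLD_INDEX, id=doc_id, document=doc)
    es.index(index=NEW_INDEX, id=doc_id, document=doc)

def read_document(doc_id, cutover_done=False):
    """Keep serving reads from the old index until the backfill has caught up."""
    return es.get(index=NEW_INDEX if cutover_done else OLD_INDEX, id=doc_id)
```

Because the old index keeps receiving every write until the cutover, pausing or rolling back amounts to simply continuing to read from it.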
The key metrics we track for capacity are document growth rate and storage usage. We track the percentage of growth in documents, percentage growth in storage consumption, absolute storage consumption, and top indexes by growth. We’ve found that storage consumption has to stay below around 50% in general and below 70% at all times. Going above 50% on a regular basis, or 70% at all, means that large merges can bring everything to a crawl and it’s time to scale.
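As a sketch, that rule of thumb boils down to a simple check on each data node's storage utilization; the byte counts below are made up.

```python
# Minimal sketch of the capacity rule of thumb: flag a data node once its
# storage utilization creeps past 50%, and treat 70% as the hard ceiling
# where large merges can bring everything to a crawl.
def storage_status(used_bytes, total_bytes):
    pct = 100.0 * used_bytes / total_bytes
    if pct >= 70:
        return pct, "critical: above 70%, scale now"
    if pct >= 50:
        return pct, "warning: regularly above 50%, plan to scale"
    return pct, "ok"

# Hypothetical node: 1.1 TB used of a 2 TB data volume.
pct, verdict = storage_status(1.1e12, 2.0e12)
print(f"{pct:.0f}% used -> {verdict}")
```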
Comparing document growth rates to storage growth rates and absolute storage consumption gives us an idea of when we’re going to have to reshard in the future, so we have enough runway to reshard before suffering performance problems.
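A back-of-the-envelope version of that runway calculation, with illustrative numbers and a linear-growth assumption, looks like this:

```python
# Rough runway estimate (illustrative only): given current storage utilization
# and a steady daily growth rate, how many days until we cross the 50% line?
# Resharding can take days, so the alert needs to fire well before that point.
def days_until_threshold(current_pct, daily_growth_pct, threshold_pct=50.0):
    """daily_growth_pct is growth in utilization percentage points per day."""
    if current_pct >= threshold_pct:
        return 0.0
    if daily_growth_pct <= 0:
        return float("inf")
    return (threshold_pct - current_pct) / daily_growth_pct

print(days_until_threshold(current_pct=42.0, daily_growth_pct=0.4))  # ~20 days of runway
```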
We hope everyone who runs Elasticsearch and is trying to figure out what to monitor, what to alert on, and how to scale their infrastructure will find this useful.
----------------------------------------------------
Thanks!
Mahdi Ben Hamida