Splunk Infrastructure Monitoring is known for monitoring modern infrastructure: consuming metrics from sources like AWS, Docker, and Kafka, applying analytics to that data in real time, and enabling alerting that cuts down the noise. Core to how we do that is search. It's not enough to process, store, and retrieve data faster than anything else out there if it takes users a long time to find the data they care about. So to match the speed of the SignalFlow analytics engine that sits at the core of Splunk Infrastructure Monitoring, we've been using Elasticsearch for our search needs from day one.
In this post, we’ll go over some lessons learned from monitoring and alerting on Elasticsearch in production, at scale, in a demanding environment with very high performance expectations.
We've found Elasticsearch to be highly scalable, with a great API, and very easy for all our engineers to work with. Its ease of setup also makes it accessible to developers without operational experience. Furthermore, it's built on Lucene, which we've found to be solid.
Since the launch of Splunk Infrastructure Monitoring, our Elasticsearch deployment has grown from 6 shards to 24 (plus replicas) spread over 72 machines holding many hundreds of millions of documents. And it’ll soon double to keep up with our continued growth.
At Splunk Infrastructure Monitoring, every engineer or team that writes a service also operates that service — running upgrades, doing instrumentation, monitoring and alerting, establishing SLOs, performing maintenance, being on-call, maintaining runbooks, etc. Some of the challenges we face might be unique to our scale and how we use Elasticsearch, but some of them are universal to everyone who uses Elasticsearch.
Elasticsearch provides a fairly complete set of metrics for indexes, clusters, and nodes. We collect those metrics using collectd and the collectd-elasticsearch plugin. An important aspect of how Splunk Infrastructure Monitoring works and why people use it is what we call “dimensions”. These are basically metadata shipped in with metrics that you can aggregate, filter, or group by—for example: get the 90th percentile of latency for an API call response grouped by service and client type dimensions.
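To make the idea concrete, here is a minimal Python sketch of what grouping by dimensions means. The datapoints, dimension names, and values are made up for illustration; this shows the concept, not how SignalFlow actually computes it.

```python
# Illustrative sketch (not Splunk code): each datapoint carries dimension
# metadata alongside its value, and an aggregation like p90 is computed per
# unique (service, client_type) pair.
from collections import defaultdict
from statistics import quantiles

# Hypothetical datapoints: (latency_ms, dimensions)
datapoints = [
    (120, {"service": "api", "client_type": "mobile"}),
    (250, {"service": "api", "client_type": "mobile"}),
    (80,  {"service": "api", "client_type": "web"}),
    (300, {"service": "billing", "client_type": "web"}),
    (95,  {"service": "billing", "client_type": "web"}),
]

groups = defaultdict(list)
for value, dims in datapoints:
    # Group by the two dimensions we care about.
    groups[(dims["service"], dims["client_type"])].append(value)

for key, values in groups.items():
    # quantiles(n=10) yields the 10th..90th percentiles; the last one is the 90th.
    p90 = quantiles(values, n=10)[-1] if len(values) > 1 else values[0]
    print(key, "p90 latency:", p90, "ms")
```

The same group-by works for any dimension shipped with a metric, which is what makes dimension-aware dashboards and alerts possible.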
There are a few things we've added to the original collectd-elasticsearch plugin to take advantage of dimensions, which we'll be submitting as PRs soon. You can now track metrics per index as well as cluster-wide metrics; both are enabled by default in the plugin and can be switched on or off in the config.
If you use collectd and the collectd-elasticsearch plugin, Splunk Infrastructure Monitoring provides built-in dashboards displaying the metrics that we’ve found most useful when running Elasticsearch in production at the node, cluster, and cross-cluster levels. We’re looking at adding index level dashboards in the future.
| | | | |
|---|---|---|---|
| CPU Load | Disk IOPs | Deleted Docs % | Top Indexes by Search Requests |
| Memory Utilization | Search requests / sec | Active Merges | Top Indexes by Indexing Requests |
| Heap Utilization | Indexing requests / sec | # Clusters | Top Indexes by Query Latency |
| GC Time % | Merges / sec | # Nodes | Top Indexes by Index Growth |
| Avg Query Latency | File Descriptors | Nodes / Cluster | Top Clusters by Search Requests |
| Requests / sec | Segments | Filter Cache Size | Top Clusters by Indexing Requests |
| Doc Growth Rate % | Thread Pool Rejections | Field Data Cache Size | Top Clusters by Query Latency |
| | | | Top Clusters by Index Growth |
As the infrastructure changes and nodes come into or out of service, all charts, dashboards, and alerts automatically account for the changes and don't have to be modified in any way.
With a large number of nodes, you have to figure out whether problems are cluster-wide or machine-specific. We used to run into full threadpools frequently, sometimes caused by a large number of pending requests and sometimes by a single slow node dragging down the performance of a whole batch of requests.
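As a toy illustration of the second case, here is a sketch of the kind of comparison that separates a single slow node from a cluster-wide slowdown. The node names, latencies, and outlier threshold are all made up.

```python
# Illustrative sketch (assumed metric shape, not Splunk's implementation):
# flag any node whose query latency is far above the cluster median, which
# usually points at a node-specific problem rather than a cluster-wide one.
from statistics import median

# Hypothetical per-node average query latencies in ms.
node_latency_ms = {
    "es-node-01": 35, "es-node-02": 40, "es-node-03": 38,
    "es-node-04": 420,  # one slow node dragging down whole batches of requests
}

cluster_median = median(node_latency_ms.values())
OUTLIER_FACTOR = 3  # assumed threshold; tune for your workload

for node, latency in node_latency_ms.items():
    if latency > OUTLIER_FACTOR * cluster_median:
        print(f"{node}: {latency} ms vs cluster median {cluster_median} ms "
              "-> likely a node-specific problem")
```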
There are basically three scales of problems to contend with – cluster, shard, and node – and typically you have to look at all three. Some examples from the trenches:
At our scale, the volume of metrics emitted by Elasticsearch is huge. It's impossible to look at the raw metrics and alert on them in any useful manner. So we've had to figure out derived metrics that are actually useful, with alert conditions that don't inundate the on-call engineer, applying Splunk Infrastructure Monitoring's powerful analytics engine to do the math in real time so we don't miss any anomalies.
In one example, we used to have checks on the state of every node, but because of the way Elasticsearch works (if the cluster becomes yellow or red, every machine in the cluster reports yellow or red), that meant 72 alerts, one per node. We've since switched to taking the cluster status reported by each host, assigning it a numerical value (0 for green, 1 for yellow, 2 for red), and alerting on the maximum. Now, even when a single problem node turns the cluster yellow and all 72 instances report yellow, only one alert fires, which keeps the noise down.
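Conceptually, the derived metric looks like the following Python sketch. The node names are made up, and in production the computation runs in SignalFlow rather than in application code.

```python
# Conceptual sketch: every node reports the cluster status it sees; we map
# the status to a number and alert on the maximum across all reporters, so
# 72 yellow reports collapse into a single alert.
STATUS_VALUE = {"green": 0, "yellow": 1, "red": 2}

def cluster_alert_level(per_node_status):
    """per_node_status: dict of node name -> status string reported by that node."""
    return max(STATUS_VALUE[s] for s in per_node_status.values())

# Hypothetical scenario: all 72 nodes see the cluster as yellow.
reports = {f"es-node-{i:02d}": "yellow" for i in range(1, 73)}
level = cluster_alert_level(reports)
if level >= 1:
    print("single alert fired: cluster status is", ["green", "yellow", "red"][level])
```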
In another example, we know that Elasticsearch can recover from a failed machine by rebuilding its replicas on other nodes. We also know, based on shard size and from timing it in practice, that recovery can take up to an hour and a half. We use this to decide whether it makes sense to wake somebody up: by applying duration thresholds to the alert conditions, an alert is triggered only if one of three conditions persists beyond the expected recovery window.
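The duration threshold itself is simple. Here is an illustrative, simplified sketch of the idea, assuming a 90-minute recovery window; the actual conditions and windows depend on your shard sizes.

```python
# Illustrative sketch: only page someone if the unhealthy condition has held
# continuously for longer than the expected recovery window, so a routine
# replica recovery doesn't wake anyone up.
import time

RECOVERY_WINDOW_SEC = 90 * 60  # recovery of a failed node can take ~1.5 hours
_unhealthy_since = None

def should_page(cluster_is_unhealthy, now=None):
    """Return True only once the condition has persisted past the recovery window."""
    global _unhealthy_since
    now = now if now is not None else time.time()
    if not cluster_is_unhealthy:
        _unhealthy_since = None          # condition cleared; reset the timer
        return False
    if _unhealthy_since is None:
        _unhealthy_since = now           # condition just started
    return now - _unhealthy_since >= RECOVERY_WINDOW_SEC
```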
Putting all our experience over the last few years together, here’s what we’ve found most useful to alert on:
Because of the way sharding works in Elasticsearch, we've found that scaling and capacity management have to be thought through clearly and treated as a proactive process. There are basically two ways to scale: add disk capacity to existing nodes or reshard onto more nodes. The first is low-risk and non-disruptive. Resharding is a complex process, and doing it while the old index is still being written to makes it even more complex. We've had to develop some methods of our own to make it work at Splunk Infrastructure Monitoring, where we can't afford to lose updates to metadata or to stop serving queries while resharding is in progress. In addition, at our scale it is not a fast process: it can take many days. There's no getting around the physics of moving bits. You can read about how we do resharding and not only guarantee that no updates are lost, but also provide ways to pause and roll back the process.
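As a simplified illustration of why updates don't get lost during a reshard, here is a sketch of a dual-write pattern: every update goes to both the old and the new index while documents are backfilled in the background, and reads cut over only once the new index has caught up. The index names and client setup are hypothetical, and the full process is covered in that separate post.

```python
# Illustrative dual-write sketch using the elasticsearch Python client
# (index names and connection details are hypothetical).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

OLD_INDEX = "metadata_v1"   # existing index still serving reads
NEW_INDEX = "metadata_v2"   # larger index being backfilled

def write_document(doc_id, doc):
    """Apply every update to both indexes so nothing is lost mid-migration."""
    es.index(index=OLD_INDEX, id=doc_id, document=doc)
    es.index(index=NEW_INDEX, id=doc_id, document=doc)

def read_document(doc_id, cutover_done=False):
    """Keep serving reads from the old index until the backfill has caught up."""
    return es.get(index=NEW_INDEX if cutover_done else OLD_INDEX, id=doc_id)
```

Because the old index keeps receiving every write until the cutover, pausing or rolling back amounts to simply continuing to read from it.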
The key metrics we track for capacity are document growth rate and storage usage. We track the percentage of growth in documents, percentage growth in storage consumption, absolute storage consumption, and top indexes by growth. We’ve found that storage consumption has to stay below around 50% in general and below 70% at all times. Going above 50% on a regular basis, or 70% at all, means that large merges can bring everything to a crawl and it’s time to scale.
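As a sketch, that rule of thumb boils down to a simple check on each data node's storage utilization; the byte counts below are made up.

```python
# Minimal sketch of the capacity rule of thumb: flag a data node once its
# storage utilization creeps past 50%, and treat 70% as the hard ceiling
# where large merges can bring everything to a crawl.
def storage_status(used_bytes, total_bytes):
    pct = 100.0 * used_bytes / total_bytes
    if pct >= 70:
        return pct, "critical: above 70%, scale now"
    if pct >= 50:
        return pct, "warning: regularly above 50%, plan to scale"
    return pct, "ok"

# Hypothetical node: 1.1 TB used of a 2 TB data volume.
pct, verdict = storage_status(1.1e12, 2.0e12)
print(f"{pct:.0f}% used -> {verdict}")
```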
Comparing document growth rates to storage growth rates and absolute storage consumption gives us an idea of when we’re going to have to reshard in the future, so we have enough runway to reshard before suffering performance problems.
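A back-of-the-envelope version of that runway calculation, with illustrative numbers and a linear-growth assumption, looks like this:

```python
# Rough runway estimate (illustrative only): given current storage utilization
# and a steady daily growth rate, how many days until we cross the 50% line?
# Resharding can take days, so the alert needs to fire well before that point.
def days_until_threshold(current_pct, daily_growth_pct, threshold_pct=50.0):
    """daily_growth_pct is growth in utilization percentage points per day."""
    if current_pct >= threshold_pct:
        return 0.0
    if daily_growth_pct <= 0:
        return float("inf")
    return (threshold_pct - current_pct) / daily_growth_pct

print(days_until_threshold(current_pct=42.0, daily_growth_pct=0.4))  # ~20 days of runway
```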
We hope everyone who runs Elasticsearch and is trying to figure out what to monitor, what to alert on, and how to scale their infrastructure will find this useful.
----------------------------------------------------
Thanks!
Mahdi Ben Hamida