Modern infrastructure and DevOps practices are evolving rapidly. Over the last few years, we’ve consistently seen organizations large and small dealing with this reality as it pertains to monitoring. While monitoring systems have been around for a long time, they are being greatly challenged by these changes. At SignalFx we deal with these challenges every day. In this blog post, we present some of the lessons learned and best practices we have found useful.
What are some of these changes? Businesses want to deliver new customer applications faster, more efficiently, and more cost-effectively. In pursuit of that goal, infrastructure and software development practices have changed rapidly in just the last few years.
Many organizations jump at the opportunity to evolve their infrastructure to take advantage of the newest innovations. Yet many do not plan for the impact on their monitoring systems.
Determining the baseline of how many time-series one’s monitoring system should be capable of handling tends to be fairly arbitrary and purely based on experience. A common strategy is to multiply the number of servers by some factor (e.g. 100 metrics per server) and perhaps allow for some future growth. Yet we’re seeing (and experiencing first-hand) that these simplistic assumptions are inadequate. When we look into why, we uncover a variety of reasons.
Firstly, there are many more monitored resources. Our environments are far more dynamic than the static environments of yesterday. While microservices and right-sized instances promote component isolation, they also increase the number of servers and instances. The democratization of scale-out architectures means that organizations have more difficulty anticipating the number of instances in their environment at any given time. Add multiple containers on top of each of those instances, and the number of time-series being reported increases dramatically.
Secondly, we’re observing a trend where business and product teams are recognizing the benefits of more fine-grained monitoring. They want to measure things that have nothing to do with the number of servers they have, but that are critical to their business or service. Per-customer metrics, or metrics tied to business KPIs such as jobs, orders, or queries, can greatly amplify the number of time-series.
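To see how quickly the simple servers-times-factor rule breaks down, consider a back-of-the-envelope sketch in Python. All of the counts below are illustrative assumptions, not measurements from any real environment:

```python
# Back-of-the-envelope estimate of time-series count: the naive
# "servers x metrics" rule versus one that accounts for containers
# and per-customer metrics. All numbers here are illustrative.

servers = 200
metrics_per_host = 100          # the classic rule of thumb
naive_estimate = servers * metrics_per_host

containers_per_host = 10        # microservices multiply instances
container_metrics = 50          # metrics emitted by each container

customers = 1_000               # per-customer business metrics
metrics_per_customer = 5        # e.g. orders, jobs, queries

realistic_estimate = (
    servers * metrics_per_host
    + servers * containers_per_host * container_metrics
    + customers * metrics_per_customer
)

print(f"naive:     {naive_estimate:>9,} time-series")      # 20,000
print(f"realistic: {realistic_estimate:>9,} time-series")  # 125,000
```

Even with these modest assumptions, the realistic count is more than six times the naive one, and it grows with customers and containers rather than with server count.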
Conventionally, we’ve measured monitoring system capacity solely in terms of data-point volume. In this new world, however, the number of time-series becomes important too, especially since there are many more of them. Querying and analyzing data spread across a large number of time-series presents its own computational challenge and directly affects the responsiveness and efficiency of your monitoring system.
The volume of data and metrics that your metrics system has to deal with is greatly affected by your environment and use cases. Select and build your metrics system carefully so that it can handle that volume.
There are many use cases that require historical reporting: you might want to know the number of API calls per second received over the last week, or the trend of active user sessions over the last quarter in order to plan your capacity. Perhaps you want to compare week-over-week or year-over-year KPI growth to determine whether you’re on track, and be alerted when trends are off so you can rectify issues immediately.
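As an illustration of this kind of check, here is a minimal Python sketch that compares this week’s average of a KPI against last week’s and flags the trend when it falls below target. The data and the growth target are made up for the example:

```python
# Minimal sketch of a week-over-week KPI check: compare this week's
# average against last week's and flag the trend if it is off target.
# The sample values and the growth target are hypothetical.

def weekly_average(samples):
    """Mean of a week's worth of datapoints."""
    return sum(samples) / len(samples)

def week_over_week_growth(this_week, last_week):
    base = weekly_average(last_week)
    return (weekly_average(this_week) - base) / base

# Illustrative data: API calls per second, sampled once per day.
last_week = [510, 525, 498, 540, 552, 530, 515]
this_week = [505, 498, 480, 470, 465, 455, 450]

growth = week_over_week_growth(this_week, last_week)
if growth < 0.0:  # expected growth target; alert when we fall below it
    print(f"week-over-week change {growth:.1%} is below target")
```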
History makes the problem of scale significantly worse. Firstly, there is a time-based multiplier on the number of datapoints: querying longer periods of time means querying and processing proportionately more data, which leads to longer query durations. Secondly, there is a churn-based multiplier, which increases the number of time-series that need to be processed and leads to more complex queries and analytics.
In this context, we define churn as a time-series being replaced by a different but equivalent one. This happens often in next-gen infrastructure. Here are some examples:

- An autoscaled instance is terminated and replaced, and the same metrics are reported under a new instance ID.
- A container is rescheduled or restarted and picks up a new container ID.
- A new software version is deployed, and every metric dimensioned by version changes identity with the release.
Churn is insidious: things may work great at first, then get gradually worse over time as dead time-series accumulate. In our experience, preventing churn is not a practical solution, for two reasons: a) users do care about dimensions like software version, and b) even if they didn’t, enforcing churn-free data reporting practices across a whole organization is hard.
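The arithmetic behind this is simple but sobering. The following sketch assumes an illustrative churn rate of 5% of time-series replaced per day and shows how many distinct series a query over a longer window has to touch:

```python
# Sketch of why churn inflates historical queries: if churn replaces
# a fraction of your time-series every day, a query over a longer
# window touches proportionately more distinct series. The active
# series count and churn rate are illustrative assumptions.

active_series = 100_000
daily_churn_rate = 0.05  # 5% of series replaced per day

for window_days in (1, 7, 30, 90):
    distinct = active_series * (1 + daily_churn_rate * window_days)
    print(f"{window_days:>3}-day query: {int(distinct):>9,} distinct series")
```

A 90-day query in this model touches more than five times as many distinct time-series as the environment actively reports at any moment.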
A better strategy is to optimize around churn rather than try to prevent it. Effective approaches we have seen include querying and alerting on stable dimensions (such as service or role) rather than ephemeral ones (such as instance or container ID), and pre-aggregating across those ephemeral dimensions so that historical queries touch far fewer distinct time-series.
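For instance, pre-aggregation can be sketched in a few lines. The datapoint layout below is hypothetical rather than any particular system’s API; the point is that rolling per-instance values up to a per-service series makes the result immune to instance churn:

```python
# Minimal sketch of one optimize-around-churn tactic: aggregate away
# ephemeral identities (instance ID, container ID) so that queries and
# alerts run against stable dimensions such as the service name.
# The datapoint structure here is hypothetical.

from collections import defaultdict

datapoints = [
    # (metric, service, instance_id, value) -- instance IDs churn freely
    ("cpu.utilization", "checkout", "i-0a12", 62.0),
    ("cpu.utilization", "checkout", "i-9f33", 71.0),
    ("cpu.utilization", "search",   "i-77b1", 45.0),
]

per_service = defaultdict(list)
for metric, service, _instance, value in datapoints:
    per_service[(metric, service)].append(value)

# One stable series per (metric, service), immune to instance churn.
for (metric, service), values in per_service.items():
    print(f"{metric}[service={service}] avg={sum(values)/len(values):.1f}")
```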
Infrastructure and cloud environments are becoming increasingly diverse, forcing metrics systems to deal with more and more types of data-sets and sources. On top of this, there is a diversity of use cases from different levels of your organization, all of which need to be satisfied by your monitoring system.
Getting a holistic view that can span these different types of data is increasingly hard. It’s inefficient, and frankly disruptive, to have different point monitoring solutions to address each of the views. Instead, what you need in your monitoring solution is scalable analytics.
The proliferation of SaaS means that you are both providing SLA guarantees and relying on SLAs from third-party APIs. As many organizations completely refresh their environments through the adoption of next-gen infrastructure, maintaining those SLAs becomes harder because of the added complexity: more moving parts, more external dependencies, and less direct control over each layer of the stack.
Meeting an SLA in today’s world is more than an IT issue; it’s business-critical. For a SaaS solution, guaranteeing 99.9% or 99.99% uptime means being allowed roughly 43 minutes or 4 minutes, respectively, of total downtime over a whole month. That’s barely enough time for a human to find and fix an issue, so automated triage and remediation is a must.
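Those downtime budgets are easy to verify, assuming a 30-day month:

```python
# Worked check of the downtime math: the minutes of downtime a monthly
# uptime guarantee leaves you, assuming a 30-day month.

minutes_per_month = 30 * 24 * 60  # 43,200

for sla in (0.999, 0.9999):
    budget = minutes_per_month * (1 - sla)
    print(f"{sla:.2%} uptime -> {budget:.1f} minutes of downtime per month")
```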
So what contributes to time-to-resolution? Roughly three things: the time to detect that something is wrong, the time to triage and identify the root cause, and the time to remediate.
Add those up against a four-minute budget, and our chances are looking pretty slim. What we’ve learned at SignalFx is the need to produce fast results. This means ingesting data quickly, keeping analytics responsive even across very large numbers of time-series, and firing alerts within seconds rather than minutes.
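One way to picture the difference is evaluating an alert condition as datapoints stream in, instead of re-querying history in batches. This is only a sketch; the window size and threshold are arbitrary assumptions:

```python
# Sketch of evaluating an alert condition on each incoming datapoint,
# rather than re-querying history in batches -- one way to keep
# detection latency low. Window size and threshold are assumptions.

from collections import deque

class StreamingAlert:
    """Fires when the rolling mean over the last N points exceeds a threshold."""

    def __init__(self, window=5, threshold=90.0):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        self.window.append(value)
        mean = sum(self.window) / len(self.window)
        if len(self.window) == self.window.maxlen and mean > self.threshold:
            print(f"ALERT: rolling mean {mean:.1f} over threshold {self.threshold}")

alert = StreamingAlert()
for v in (70, 80, 85, 92, 95, 97, 99):
    alert.observe(v)  # fires as soon as the rolling mean crosses 90
```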
We are just at the beginning of the transition to next-gen infrastructure, and of addressing the key challenges and trends it creates. Based on the experience of the SignalFx engineering team and many of our forward-thinking customers, we will continue sharing our lessons learned and best practices in monitoring.
What key trends are you seeing and experiencing? Share your thoughts and keep the conversation going!