Modern infrastructure and DevOps practices are evolving rapidly. Over the last few years, we’ve consistently seen organizations large and small dealing with this reality as it pertains to monitoring. While monitoring systems have been around for a long time, they are being greatly challenged by these changes. At SignalFx we deal with these challenges every day. In this blog post, we present some of the lessons learned and best practices we have found useful.
What are some of these changes? Businesses want to deliver new customer applications faster, more efficiently, and more cost-effectively. In pursuit of that goal, infrastructure and software development practices have changed rapidly in just the last few years.
Many organizations jump at the opportunity to evolve their infrastructure to take advantage of the newest innovations. Yet many do not plan for the impact on their monitoring systems.
Determining the baseline of how many time-series one’s monitoring system should be capable of handling tends to be fairly arbitrary and purely based on experience. A common strategy is to multiply the number of servers by some factor (e.g. 100 metrics per server) and perhaps allow for some future growth. Yet we’re seeing (and experiencing first-hand) that these simplistic assumptions are inadequate. When we look into why, we uncover a variety of reasons.
Firstly, there are many more monitored resources. Our environments are far more dynamic than the static environments of yesterday. While microservices and right-sized instances promote component isolation, they also increase the number of servers and instances. The democratization of scale-out architectures means that organizations have more difficulty anticipating the number of instances in their environment at any given time. Add multiple containers on top of each of those instances, and the number of time-series being reported increases dramatically.
Secondly, we’re observing a trend where business and product teams are recognizing the benefits of more fine-grained monitoring. They want to measure things that have nothing to do with the number of servers they have, but that are critical to their business or service. Per-customer metrics, or metrics tied to business KPIs such as jobs, orders, or queries, can greatly amplify the number of time-series.
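To see how quickly the simple servers-times-factor rule breaks down, consider a back-of-the-envelope sketch in Python. All of the counts below are illustrative assumptions, not measurements from any real environment:

```python
# Back-of-the-envelope estimate of time-series count: the naive
# "servers x metrics" rule versus one that accounts for containers
# and per-customer metrics. All numbers here are illustrative.

servers = 200
metrics_per_host = 100          # the classic rule of thumb
naive_estimate = servers * metrics_per_host

containers_per_host = 10        # microservices multiply instances
container_metrics = 50          # metrics emitted by each container

customers = 1_000               # per-customer business metrics
metrics_per_customer = 5        # e.g. orders, jobs, queries

realistic_estimate = (
    servers * metrics_per_host
    + servers * containers_per_host * container_metrics
    + customers * metrics_per_customer
)

print(f"naive:     {naive_estimate:>9,} time-series")      # 20,000
print(f"realistic: {realistic_estimate:>9,} time-series")  # 125,000
```

Even with these modest assumptions, the realistic count is more than six times the naive one, and it grows with customers and containers rather than with server count.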
Conventionally, we’ve measured monitoring system capacity solely in terms of data-point volume. In this new world, however, the number of time-series becomes important too, especially since there are many more of them. Querying and analyzing data spread across a large number of time-series presents its own computational challenge and directly affects the responsiveness and efficiency of your monitoring system.
The volume of data and metrics that your metrics system has to deal with is greatly affected by your environment and use cases. Select and build your metrics system carefully so that it can handle that volume.
There are many use cases that require historical reporting: you might want to know the number of API calls per second received over the last week, or the trend of active user sessions over the last quarter in order to plan your capacity. Perhaps you want to compare week-over-week or year-over-year KPI growth to determine whether you’re on track, and be alerted when trends are off so you can rectify issues immediately.
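As an illustration of this kind of check, here is a minimal Python sketch that compares this week’s average of a KPI against last week’s and flags the trend when it falls below target. The data and the growth target are made up for the example:

```python
# Minimal sketch of a week-over-week KPI check: compare this week's
# average against last week's and flag the trend if it is off target.
# The sample values and the growth target are hypothetical.

def weekly_average(samples):
    """Mean of a week's worth of datapoints."""
    return sum(samples) / len(samples)

def week_over_week_growth(this_week, last_week):
    base = weekly_average(last_week)
    return (weekly_average(this_week) - base) / base

# Illustrative data: API calls per second, sampled once per day.
last_week = [510, 525, 498, 540, 552, 530, 515]
this_week = [505, 498, 480, 470, 465, 455, 450]

growth = week_over_week_growth(this_week, last_week)
if growth < 0.0:  # expected growth target; alert when we fall below it
    print(f"week-over-week change {growth:.1%} is below target")
```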
History makes the problem of scale significantly worse. Firstly, there is a time-based multiplier on the number of datapoints: querying longer periods of time means querying and processing proportionately more data, which leads to longer query durations. Secondly, there is a churn-based multiplier, which increases the number of time-series that need to be processed and leads to more complex queries and analytics.
In this context, we define churn as a time-series being replaced by a different but equivalent one. This happens often in next-gen infrastructure. Here are some examples:

- An autoscaled instance is terminated and replaced, and the same metrics are reported under a new instance ID.
- A container is rescheduled or restarted and picks up a new container ID.
- A new software version is deployed, and every metric dimensioned by version changes identity with the release.
Churn is insidious: things may work great at first, then get gradually worse over time as dead time-series accumulate. In our experience, preventing churn is not a practical solution, for two reasons: a) users do care about dimensions like software version, and b) even if they didn’t, enforcing churn-free data reporting practices across a whole organization is hard.
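The arithmetic behind this is simple but sobering. The following sketch assumes an illustrative churn rate of 5% of time-series replaced per day and shows how many distinct series a query over a longer window has to touch:

```python
# Sketch of why churn inflates historical queries: if churn replaces
# a fraction of your time-series every day, a query over a longer
# window touches proportionately more distinct series. The active
# series count and churn rate are illustrative assumptions.

active_series = 100_000
daily_churn_rate = 0.05  # 5% of series replaced per day

for window_days in (1, 7, 30, 90):
    distinct = active_series * (1 + daily_churn_rate * window_days)
    print(f"{window_days:>3}-day query: {int(distinct):>9,} distinct series")
```

A 90-day query in this model touches more than five times as many distinct time-series as the environment actively reports at any moment.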
A better strategy is to optimize around churn rather than try to prevent it. Effective approaches we have seen include querying and alerting on stable dimensions (such as service or role) rather than ephemeral ones (such as instance or container ID), and pre-aggregating across those ephemeral dimensions so that historical queries touch far fewer distinct time-series.
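For instance, pre-aggregation can be sketched in a few lines. The datapoint layout below is hypothetical rather than any particular system’s API; the point is that rolling per-instance values up to a per-service series makes the result immune to instance churn:

```python
# Minimal sketch of one optimize-around-churn tactic: aggregate away
# ephemeral identities (instance ID, container ID) so that queries and
# alerts run against stable dimensions such as the service name.
# The datapoint structure here is hypothetical.

from collections import defaultdict

datapoints = [
    # (metric, service, instance_id, value) -- instance IDs churn freely
    ("cpu.utilization", "checkout", "i-0a12", 62.0),
    ("cpu.utilization", "checkout", "i-9f33", 71.0),
    ("cpu.utilization", "search",   "i-77b1", 45.0),
]

per_service = defaultdict(list)
for metric, service, _instance, value in datapoints:
    per_service[(metric, service)].append(value)

# One stable series per (metric, service), immune to instance churn.
for (metric, service), values in per_service.items():
    print(f"{metric}[service={service}] avg={sum(values)/len(values):.1f}")
```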
Infrastructure and cloud environments are becoming increasingly diverse, forcing metrics systems to deal with more and more types of data-sets and sources. On top of this, there is a diversity of use cases from different levels of your organization, all of which need to be satisfied by your monitoring system.
Getting a holistic view that can span these different types of data is increasingly hard. It’s inefficient, and frankly disruptive, to have different point monitoring solutions to address each of the views. Instead, what you need in your monitoring solution is scalable analytics.
The proliferation of SaaS means that you are both providing SLA guarantees and relying on SLAs from third-party APIs. As many organizations completely refresh their environments through the adoption of next-gen infrastructure, maintaining those SLAs becomes harder because of the added complexity: more moving parts, more external dependencies, and less direct control over each layer of the stack.
Meeting an SLA in today’s world is more than an IT issue; it’s business-critical. For a SaaS solution, guaranteeing 99.9% or 99.99% uptime means being allowed roughly 43 minutes or 4 minutes, respectively, of total downtime over a whole month. That’s barely enough time for a human to find and fix an issue, so automated triage and remediation is a must.
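Those downtime budgets are easy to verify, assuming a 30-day month:

```python
# Worked check of the downtime math: the minutes of downtime a monthly
# uptime guarantee leaves you, assuming a 30-day month.

minutes_per_month = 30 * 24 * 60  # 43,200

for sla in (0.999, 0.9999):
    budget = minutes_per_month * (1 - sla)
    print(f"{sla:.2%} uptime -> {budget:.1f} minutes of downtime per month")
```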
So what contributes to time-to-resolution? Roughly three things: the time to detect that something is wrong, the time to triage and identify the root cause, and the time to remediate.
Add those up against a four-minute budget, and our chances are looking pretty slim. What we’ve learned at SignalFx is the need to produce fast results. This means ingesting data quickly, keeping analytics responsive even across very large numbers of time-series, and firing alerts within seconds rather than minutes.
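One way to picture the difference is evaluating an alert condition as datapoints stream in, instead of re-querying history in batches. This is only a sketch; the window size and threshold are arbitrary assumptions:

```python
# Sketch of evaluating an alert condition on each incoming datapoint,
# rather than re-querying history in batches -- one way to keep
# detection latency low. Window size and threshold are assumptions.

from collections import deque

class StreamingAlert:
    """Fires when the rolling mean over the last N points exceeds a threshold."""

    def __init__(self, window=5, threshold=90.0):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        self.window.append(value)
        mean = sum(self.window) / len(self.window)
        if len(self.window) == self.window.maxlen and mean > self.threshold:
            print(f"ALERT: rolling mean {mean:.1f} over threshold {self.threshold}")

alert = StreamingAlert()
for v in (70, 80, 85, 92, 95, 97, 99):
    alert.observe(v)  # fires as soon as the rolling mean crosses 90
```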
We are just at the beginning of the transition to next-gen infrastructure, and of addressing the key challenges and trends it creates. Based on the experience of the SignalFx engineering team and many of our forward-thinking customers, we will continue sharing our lessons learned and best practices in monitoring.
What key trends are you seeing and experiencing? Share your thoughts and keep the conversation going!