Monitoring modern infrastructure poses fundamentally new challenges in terms of data volume and velocity. Collecting the metrics emitted by machines is only the first step. To extract value from that data, we need a method of expressing service, team, or business goals against that stream of data. That method is analytics.
This is the second in a series of posts on how to use analytics both to compose the most salient signals to monitor from raw metrics and to configure useful alerts.
As discussed in the first post – Static Thresholds, Durations, and Transformations – we’ve found that duration conditions and simple transformations greatly reduce the number of false alerts we receive from using static thresholds. In this post we will discuss characteristics of a signal which may be considered suspicious but are missed by duration conditions (“false negatives”). Here is an example.
The metric represents the mean time a job spends on a queue across a number of machines running the same service. Analysis revealed that this spiky behavior was due to the machine with the longest queue wait time failing to report its data in time for the job to proceed. (Deciding when to proceed with analysis before all machines have reported results, and how to handle missing data, are separate topics.) In real time, we want to be alerted to this kind of spikiness so we can react: in this case, by identifying the troublesome machine and determining whether it needs to be restarted. In our example, the threshold and duration condition fail to fire, yet the queue wait time reaches levels that are undesirable from the standpoint of the end-user experience. The SignalFx analytics engine supports transformations that allow us to be alerted to the spiky behavior even when duration conditions are not met.
The first transformation that captures the essential upward trend in this example is the rolling maximum. Whereas the rolling mean replaces a signal with its average value over a window, the rolling maximum replaces it with the largest value in the window. The rolling maximum does not attempt to approximate the true signal; instead, it captures trends in the worst-case scenario (assuming high levels are bad). On this chart, the rolling maximum is shown along with the original signal.
Since the rolling maximum is a “pessimistic summary” of the window, thresholds on a rolling-maximum-transformed signal should be set higher than thresholds on the original signal, and the required duration must be longer than the window size (since a single elevated value determines the rolling maximum for a period equal to the window size).
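To make the mechanics concrete, here is a minimal sketch in Python with pandas (not the SignalFx product itself) contrasting the rolling mean and the rolling maximum on a synthetic spiky series. The series, window size, and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic, spiky "queue wait time" series sampled every 10 seconds (illustrative values only).
idx = pd.date_range("2024-01-01", periods=360, freq="10s")
rng = np.random.default_rng(0)
queue_wait = pd.Series(rng.gamma(shape=2.0, scale=5.0, size=len(idx)), index=idx)
queue_wait.iloc[200:220:4] += 120  # brief spikes too short to satisfy a duration condition

window = "5min"

# The rolling mean approximates the signal; the rolling maximum tracks the worst case in the window.
rolling_mean = queue_wait.rolling(window).mean()
rolling_max = queue_wait.rolling(window).max()

# A single elevated sample keeps the rolling maximum elevated for the whole window,
# which is why a detector on the rolling maximum needs a duration longer than `window`.
print(rolling_max[rolling_max > 100].head())
```

In this sketch, the rolling mean barely registers the brief spikes, while the rolling maximum stays elevated for the full window after each one.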
There is a second transformation that captures the spiky behavior: the absolute value of the rate of change. We have two approaches for building a detector based on the absolute value of the rate of change of a signal (the “transformed signal”). The first is to manually inspect historical values of the transformed signal and decide on a threshold and duration; the other is to use further analytics to compare the very recent history of the transformed signal to its semi-recent past. This comparison is achieved by summarizing the semi-recent past via a statistical function. For example, we could require that the absolute value of the rate of change be above the 99th percentile (calculated over the last day) for 5 minutes. This is one way of capturing that the last 5 minutes were very different from the preceding 24 hours.
In SignalFx, we can construct the signal as follows:
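The construction itself happens in the SignalFx plot builder; as a rough stand-in, the pandas sketch below computes the equivalent derived series, using the names J and K referenced in the detector description that follows. Interpreting the text, J is the transformed signal and K is the trailing 99th-percentile threshold; the helper name and everything not stated above are assumptions.

```python
import pandas as pd

def build_signals(signal: pd.Series):
    """Sketch of the two derived series described above (names chosen to match the text).

    J: absolute value of the rate of change of `signal`.
    K: 99th percentile of J over a trailing 24-hour window -- the dynamic threshold.
    """
    # Sample-to-sample difference, made sign-agnostic; for evenly spaced samples
    # this is proportional to the rate of change.
    J = signal.diff().abs()

    # Summarize the semi-recent past: trailing 24h 99th percentile of the transformed signal.
    K = J.rolling("24h").quantile(0.99)

    return J, K
```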
Then the detector can be set to fire when J is above K for a duration of 5 minutes. The absolute value of the rate of change is large when the signal oscillates wildly between small and large values, but note it is also large when the signal experiences a steep ascent or descent. This is unlikely to be a problem from the standpoint of monitoring the signal: sustained steep descent is rare for metrics bounded from below, and sustained steep ascent is also worth detecting.
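Continuing the pandas illustration, a hedged sketch of that firing condition might look like the following (the helper name and duration handling are assumptions, not SignalFx’s detector implementation):

```python
def detector_fires(J: pd.Series, K: pd.Series, duration: str = "5min") -> pd.Series:
    """True wherever J has been above K for the entire trailing `duration`."""
    above = (J > K).astype(int)
    # The condition must hold at every sample in the window, not just on average.
    # (At the very start of the series the trailing window is not yet full.)
    return above.rolling(duration).min() == 1
```

An alert would correspond to the first timestamp at which this series becomes True.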
Whereas in the last post we focused on improving detectors based on static thresholds, in this example we’re employing a dynamic threshold (namely, the signal K) — one that changes with time.
We may expect a service owner to be familiar with the basic performance profile of a service (e.g., a known good envelope for utilization of CPU, memory, and disk), so for these metrics we can construct high-quality detectors based on static thresholds. On the other hand, the distribution of the absolute value of the rate of change of some metric is more obscure. How can we establish a baseline (a sense of the range of typical values) for a complicated signal? Only by observing those values over time do we gain more information about the distribution. A value above the last day’s 99th percentile is extreme; if we observe values above the last day’s 99th percentile for a duration of 5 minutes, we have an indication that the signal is changing state. This logic applies to any signal, but it is particularly valuable for complicated derived signals for which we have no direct sense of what constitutes a typical value.
Moreover, the baseline is continuously updated with new data. The performance profile of a service may change with regularly deployed code changes, or when an upstream service is altered. A dynamic threshold does not need to be manually reset in response to such events. It is designed precisely to continuously capture the new normal.