Monitoring modern infrastructure poses fundamentally new challenges in terms of data volume and velocity. Collecting the metrics emitted by machines is only the first step. To extract value from that data, we need a method of expressing service, team, or business goals against that stream of data. That method is analytics.
This is the third in a series of posts about how to use analytics both to compose the most salient signals to monitor out of raw metrics and to configure useful alerts.
In the first two posts, we discussed alert detectors built from threshold and duration conditions, which trigger on persistently bad states, and detectors that use transformations such as rolling means and rolling maximums to capture behavior more subtle than a simple persistent state. In this post we discuss two common patterns that cause threshold-and-duration alert detectors to exhibit the same kind of flappy behavior you might see when using static thresholds, and we show how to configure alerts so that they do not flap when these patterns occur.
Let’s start with the first pattern. The signal in this chart represents the maximum (taken over 15 machines) of the average time it takes for a job to start. The detector is set to fire when the signal is above 1000 ms (1 second) for one minute. Consider the following firing and clearing pattern on the signal (indicated by the orange diamonds on the horizontal axis):
Here the problem is that the signal drops below the threshold of 1000 only very briefly and then quickly returns to its elevated state. This kind of variation is normal for the metric and does not justify firing several separate alerts. This suggests giving the detector a distinct clearing condition, e.g., clear when the metric is below 900 for one minute.
The metric being above 1000 for one minute is equivalent to its one-minute rolling minimum (RMin) being above 1000. The metric being below 900 for one minute is equivalent to its one-minute rolling maximum (RMax) being below 900. Thus the behavior we want is:
FIRE when RMin >= 1000; CLEAR when RMax < 900
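To make the intended behavior concrete, here is a minimal sketch of such a detector in Python with pandas. It is not SignalFx’s detector syntax; it simply assumes, for illustration, that the metric arrives as a pandas Series sampled every 5 seconds, so a one-minute window is 12 points.

```python
import pandas as pd

FIRE_THRESHOLD = 1000    # fire when the 1-minute rolling minimum is >= this
CLEAR_THRESHOLD = 900    # clear when the 1-minute rolling maximum is < this
POINTS_PER_MINUTE = 12   # assumed 5-second sample interval

def detector_states(metric: pd.Series) -> pd.Series:
    """Return 'firing' or 'clear' at each timestamp of the metric."""
    rmin = metric.rolling(POINTS_PER_MINUTE).min()   # RMin
    rmax = metric.rolling(POINTS_PER_MINUTE).max()   # RMax

    state = "clear"
    states = []
    for lo, hi in zip(rmin, rmax):
        if lo >= FIRE_THRESHOLD:
            state = "firing"   # a full minute spent at or above 1000
        elif hi < CLEAR_THRESHOLD:
            state = "clear"    # a full minute spent below 900
        # otherwise: neither condition holds, so the state is unchanged --
        # this is what prevents the flapping seen with a single threshold
        states.append(state)
    return pd.Series(states, index=metric.index)
```

With this behavior, a brief dip below 1000 no longer clears the alert, because clearing requires a full minute spent below 900.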
Graphically, we want to fire when a minute is spent above the red line and clear once a minute is spent below the green line as shown here:
Note that the RMax is always greater than or equal to the RMin. Therefore RMin >= 1000 implies RMax >= 1000 >= 900, and RMax < 900 implies RMin < 900 < 1000. So we can re-characterize the desired behavior as:
FIRE when RMin >= 1000 AND RMax >= 900
CLEAR when RMax < 900 AND RMin < 1000
Now, suppose for the moment we have signals B and C with the following behavior: B is 1 when RMin >= 1000 and 0 otherwise, and C is 1 when RMax >= 900 and 0 otherwise.
By the implications above, B = 1 implies C = 1, and C = 0 implies B = 0. Therefore the sum B + C can take on only three values. The possible states of the signals B and C, and the desired behavior of the detector, are summarized in the following table.
| B (RMin >= 1000) | C (RMax >= 900) | B + C | Desired behavior |
|------------------|-----------------|-------|------------------|
| 1                | 1               | 2     | Fire             |
| 0                | 1               | 1     | No change        |
| 0                | 0               | 0     | Clear            |
“No change” means neither fire a new alert nor clear an existing one.
To B + C we apply an exclude rule that removes values in the range [0.5, 1.5]. This modified form of B + C can only take on the values 0 and 2. Now we set the detector to fire when B + C is above 0: this happens only when the value is 2, namely when RMin >= 1000, and it clears only when the value is 0, namely when RMax < 900. The exclusion has the effect of ignoring the state B + C = 1, which is precisely when the naive detector (above 1000 for one minute) would turn off and then back on shortly thereafter in our initial example.
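Here is an illustrative sketch of that mechanism, again in Python with pandas rather than SignalFx’s own analytics language. It assumes B and C are already available as 0/1 Series and models an excluded datapoint as NaN, i.e., nothing published; when nothing is published, neither the fire nor the clear condition evaluates, so the detector’s state is unchanged.

```python
import numpy as np
import pandas as pd

def excluded_sum(b: pd.Series, c: pd.Series) -> pd.Series:
    """Sum B and C, then exclude values in [0.5, 1.5]; only 0 and 2 survive."""
    s = b + c
    return s.mask((s >= 0.5) & (s <= 1.5), np.nan)

def alert_states(signal: pd.Series) -> pd.Series:
    """Fire when the signal is above 0, clear when it is 0, hold state otherwise."""
    state = "clear"
    states = []
    for value in signal:
        if pd.notna(value):
            state = "firing" if value > 0 else "clear"
        # NaN means the datapoint was excluded: neither fire nor clear
        states.append(state)
    return pd.Series(states, index=signal.index)
```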
To construct the signal B, we first apply a one-minute rolling minimum transformation to the original signal. Then we exclude values less than 1000, so that we “publish” the rolling minimum when it is at least 1000 and we “publish” nothing when it is less than 1000. Finally we take the count, so we get 1 when there is a “published” time series and 0 if there is no such series. The resulting signal is exactly the desired signal B described above.
The signal C is constructed in a very similar way: we apply a one-minute rolling maximum, exclude values less than 900, and then take the count.
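Translating that recipe into the same illustrative pandas terms (again assuming a 5-second sample interval, so a one-minute window is 12 points), B and C might be built like this:

```python
import pandas as pd

POINTS_PER_MINUTE = 12   # assumed 5-second sample interval

def build_b(metric: pd.Series) -> pd.Series:
    """B: 1 when the one-minute rolling minimum is at least 1000, else 0."""
    rmin = metric.rolling(POINTS_PER_MINUTE).min()   # one-minute rolling minimum
    published = rmin.where(rmin >= 1000)             # exclude values below 1000
    return published.notna().astype(int)             # count: 1 if published, else 0

def build_c(metric: pd.Series) -> pd.Series:
    """C: 1 when the one-minute rolling maximum is at least 900, else 0."""
    rmax = metric.rolling(POINTS_PER_MINUTE).max()   # one-minute rolling maximum
    published = rmax.where(rmax >= 900)              # exclude values below 900
    return published.notna().astype(int)             # count: 1 if published, else 0
```

The alert signal is then the excluded sum of B and C from the previous sketch: the detector fires when it is above 0 (i.e., equal to 2) and clears when it is 0.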
The necessary analytics are shown below in SignalFx. Line A contains the original metric (the summary of the job start time across a cluster), and line D corresponds to the signal on which we should alert.
Although reasoning through and configuring well-constructed alert detectors like this can be complicated, SignalFx provides starting points with detector templates that are surfaced in context and fully customizable.