Monitoring modern infrastructure poses fundamentally new challenges in terms of data volume and velocity. Collecting the metrics emitted by machines is only the first step. To extract value from that data, we need a method of expressing service, team, or business goals against that stream of data. That method is analytics.
This is the fourth post in a series about how to use analytics both to compose the most salient signals to monitor out of raw metrics and to configure useful alerts.
In the previous posts, we learned how to: use threshold and duration conditions to trigger on persistently bad states; use transformations like rolling means and rolling maximums to catch issues more subtle than a persistently bad state; and use ranges to construct new signals that don't flap in cases where duration-thresholds don't work. In this post we'll look at another, more complex pattern that can cause all of those methods to misfire, and learn how to deal with it.
Let’s start with the pattern in the chart below. The metric "Job start time mean" is the primary signal in orange, its rate of change is blue, and the diamonds indicate firing and clearing of alerts.
By the time the duration condition is satisfied, the problem is already on its way to resolution. This suggests a further refinement to the detector where we fire only when both these conditions are met:
- the signal has been above the threshold for the required duration; and
- the signal is currently increasing (or stationary), i.e., its value is at least as large as the last observed value.
And clear when:
- the signal has been below the clear threshold for the required duration.
We will not involve the rate of change in the clearing condition because the metric is typically already decreasing by the time it has been under the threshold for the required duration, as shown in the chart.
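As a rough illustration (not how SignalFx evaluates detectors internally), these refined fire and clear conditions could be expressed over a regularly sampled series with pandas. The function name, the sampling assumptions, and the window length below are ours; the 1000/900 default thresholds match the fire and clear lines in the example chart discussed later.

```python
import pandas as pd

def refined_conditions(signal: pd.Series,
                       fire_threshold: float = 1000.0,
                       clear_threshold: float = 900.0,
                       duration_samples: int = 6) -> pd.DataFrame:
    """Evaluate the refined fire/clear conditions pointwise.

    Assumes `signal` is sampled at a fixed interval, so `duration_samples`
    consecutive samples stand in for the one-minute duration condition.
    """
    above = (signal > fire_threshold).astype(int)
    below = (signal < clear_threshold).astype(int)

    # Duration conditions: true only if every sample in the window qualifies.
    above_for_duration = above.rolling(duration_samples).min() == 1
    below_for_duration = below.rolling(duration_samples).min() == 1

    # Second fire condition: the signal is increasing (or stationary),
    # i.e. its change since the last observation is non-negative.
    non_decreasing = signal.diff() >= 0

    return pd.DataFrame({
        "fire": above_for_duration & non_decreasing,
        "clear": below_for_duration,  # the slope is not part of the clear condition
    })
```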
In addition to the signals for "Job start time mean" (B) and its "rate of change" (C), suppose we had a new signal E which is 1 when the original signal is increasing or stationary, and 0 when it is decreasing. We would like the detector to fire when B=1, C=1, and E=1; and we would like it to clear when B=0 and C=0 (E is free to be 0 or 1), as shown in the following table. If you've been following along, you'll notice that for every line in the table we used in the previous post, we have two lines in the following table (corresponding to the two cases E=1 and E=0).
| B | C | E | Desired behavior |
| --- | --- | --- | --- |
| 1 | 1 | 1 | Fire |
| 1 | 1 | 0 | No change |
| 0 | 1 | 1 | No change |
| 0 | 1 | 0 | No change |
| 0 | 0 | 1 | Clear |
| 0 | 0 | 0 | Clear |
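For reference, the table can be transcribed directly into a small decision function. This is just a restatement of the desired behavior, assuming B, C, and E are the 0/1 indicator values described in the text; the function name is ours.

```python
def desired_behavior(b: int, c: int, e: int) -> str:
    """Return the desired detector action for one row of the table."""
    if b == 1 and c == 1 and e == 1:
        return "Fire"
    if b == 0 and c == 0:
        return "Clear"      # E is free to be 0 or 1
    return "No change"      # all remaining combinations in the table
```

Combinations with B=1 and C=0 do not appear in the table, so this sketch lumps them in with "No change" as well.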
Using the method from the previous post and taking the sum B+C+E will not work in this case, since that sum does not distinguish between the cases B=0, C=1, E=0 (which should cause no change) and B=0, C=0, E=1 (which should clear). This can be remedied by putting a weight in front of E before forming the sum, which results in a clear ordering of all the combinations, as shown in the table below. The requirement we'll use is that the "Clear" values are the smallest, the "No change" values are intermediate, and the "Fire" value is the largest, with no overlap among the three groups. One combination of B, C, and E with this behavior is B+C+0.1*E.
| B | C | E | B+C+0.1*E | Desired behavior |
| --- | --- | --- | --- | --- |
| 1 | 1 | 1 | 2.1 | Fire |
| 1 | 1 | 0 | 2 | No change |
| 0 | 1 | 1 | 1.1 | No change |
| 0 | 1 | 0 | 1 | No change |
| 0 | 0 | 1 | 0.1 | Clear |
| 0 | 0 | 0 | 0 | Clear |
So now we can construct B+C+0.1*E and then exclude the values 1, 1.1, and 2 so that the detector does not change state when it passes through these values. One way of doing this is to exclude values greater than 0.5 and less than 2.05, and then set the detector to fire when B+C+0.1*E is greater than 2.
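A minimal sketch of this step, again with pandas and a helper name of our choosing, assuming b, c, and e are 0/1 series on a common index:

```python
import pandas as pd

def composite_score(b: pd.Series, c: pd.Series, e: pd.Series) -> pd.Series:
    """Form B + C + 0.1*E and exclude the "No change" values."""
    score = b + c + 0.1 * e

    # Excluding values greater than 0.5 and less than 2.05 drops exactly
    # the "No change" values 1, 1.1, and 2; only 0, 0.1, and 2.1 survive.
    return score.mask((score > 0.5) & (score < 2.05))
```

A detector on the surviving values then fires when the score is greater than 2 (only 2.1 remains above that line) and clears on 0 or 0.1.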
There are several ways to make precise the idea that the signal is increasing. The rate of change is positive when the signal has increased since the last observation. Therefore if we apply the rate of change transformation and then exclude values less than zero, we obtain a signal which “publishes” only when the signal is stationary or increasing. Applying the count produces a signal which is 1 when the original signal is stationary or increasing, and 0 otherwise. This is captured in line E.
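One possible realization of this recipe in pandas (the helper name is ours): `diff()` plays the role of the rate of change, `mask` the role of the exclusion, and `notna()` the role of the count.

```python
import pandas as pd

def indicator_non_decreasing(signal: pd.Series) -> pd.Series:
    """Signal E: 1 when the input is stationary or increasing, 0 when decreasing."""
    rate_of_change = signal.diff()                        # change since the last observation
    published = rate_of_change.mask(rate_of_change < 0)   # "exclude" negative values
    # Count of surviving values at each timestamp: 1 or 0.
    # (The very first sample has no previous observation, so it comes out as 0.)
    return published.notna().astype(int)
```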
Finally we construct the sum B+C+0.1*E and exclude values corresponding to the desired behavior “No change.” This is the final signal on which we should alert, shown as plot F. In terms of the red “fire” zone above 1000 and the green “clear” zone under 900, this detector will fire when we are above the red line for one minute and when the current value is larger than the last observed value; and clear once we are under the green line for one minute. This detector would not have fired in our example scenario since the slope of the curve is negative when the duration condition is satisfied.
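To see that last claim numerically, here is a toy check with made-up values and a three-sample window standing in for the one-minute duration; the shape mimics the chart above, where the spike is already declining by the time the duration condition is met.

```python
import pandas as pd

values = pd.Series([800, 950, 1200, 1150, 1100, 1050, 1020, 990, 940, 880, 850])
duration = 3  # consecutive samples standing in for the one-minute duration

above_for_duration = (values > 1000).astype(int).rolling(duration).min() == 1
non_decreasing = values.diff() >= 0            # signal E as a boolean

naive_fires = above_for_duration               # threshold + duration only
refined_fires = above_for_duration & non_decreasing

print(naive_fires.any(), refined_fires.any())  # True False: only the naive detector fires
```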
Although reasoning through and configuring well-constructed alert detectors like this can be complicated, we’ve done the work for you in pre-built templates for alert detectors. These are surfaced via the Recommended Detectors feature on every chart and Host Navigator view in SignalFx.