Monitoring modern infrastructure poses fundamentally new challenges in terms of data volume and velocity. Collecting the metrics emitted by machines is only the first step. To extract value from that data, we need a method of expressing service, team, or business goals against that stream of data. That method is analytics.
This is the fourth post in a series about how to use analytics both to compose the most salient signals to monitor out of raw metrics and to configure useful alerts.
In the previous posts, we learned how to: use threshold and duration conditions to trigger on persistently bad states; use transformations like rolling means and rolling maximums to catch issues more subtle than a persistently bad state; and use ranges to construct new signals that don't flap in cases where duration-thresholds don't work. In this post we'll look at another, more complex pattern that can cause all of those methods to misfire, and learn how to deal with it.
Let’s start with the pattern in the chart below. The metric "Job start time mean" is the primary signal in orange, its rate of change is blue, and the diamonds indicate firing and clearing of alerts.
By the time the duration condition is satisfied, the problem is already on its way to resolution. This suggests a further refinement to the detector where we fire only when both these conditions are met:
- the signal has been above the threshold for the required duration; and
- the signal is currently increasing (or stationary), i.e., its value is at least as large as the last observed value.
And clear when:
- the signal has been below the clear threshold for the required duration.
We will not involve the rate of change in the clearing condition because the metric is typically already decreasing by the time it has been under the threshold for the required duration, as shown in the chart.
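As a rough illustration (not how SignalFx evaluates detectors internally), these refined fire and clear conditions could be expressed over a regularly sampled series with pandas. The function name, the sampling assumptions, and the window length below are ours; the 1000/900 default thresholds match the fire and clear lines in the example chart discussed later.

```python
import pandas as pd

def refined_conditions(signal: pd.Series,
                       fire_threshold: float = 1000.0,
                       clear_threshold: float = 900.0,
                       duration_samples: int = 6) -> pd.DataFrame:
    """Evaluate the refined fire/clear conditions pointwise.

    Assumes `signal` is sampled at a fixed interval, so `duration_samples`
    consecutive samples stand in for the one-minute duration condition.
    """
    above = (signal > fire_threshold).astype(int)
    below = (signal < clear_threshold).astype(int)

    # Duration conditions: true only if every sample in the window qualifies.
    above_for_duration = above.rolling(duration_samples).min() == 1
    below_for_duration = below.rolling(duration_samples).min() == 1

    # Second fire condition: the signal is increasing (or stationary),
    # i.e. its change since the last observation is non-negative.
    non_decreasing = signal.diff() >= 0

    return pd.DataFrame({
        "fire": above_for_duration & non_decreasing,
        "clear": below_for_duration,  # the slope is not part of the clear condition
    })
```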
In addition to the signals for "Job start time mean" (B) and its "rate of change" (C), suppose we had a new signal E which is 1 when the original signal is increasing or stationary, and 0 when it is decreasing. We would like the detector to fire when B=1, C=1, and E=1; and we would like it to clear when B=0 and C=0 (E is free to be 0 or 1), as shown in the following table. If you've been following along, you'll notice that for every line in the table we used in the previous post, we have two lines in the following table (corresponding to the two cases E=1 and E=0).
| B | C | E | Desired behavior |
| --- | --- | --- | --- |
| 1 | 1 | 1 | Fire |
| 1 | 1 | 0 | No change |
| 0 | 1 | 1 | No change |
| 0 | 1 | 0 | No change |
| 0 | 0 | 1 | Clear |
| 0 | 0 | 0 | Clear |
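For reference, the table can be transcribed directly into a small decision function. This is just a restatement of the desired behavior, assuming B, C, and E are the 0/1 indicator values described in the text; the function name is ours.

```python
def desired_behavior(b: int, c: int, e: int) -> str:
    """Return the desired detector action for one row of the table."""
    if b == 1 and c == 1 and e == 1:
        return "Fire"
    if b == 0 and c == 0:
        return "Clear"      # E is free to be 0 or 1
    return "No change"      # all remaining combinations in the table
```

Combinations with B=1 and C=0 do not appear in the table, so this sketch lumps them in with "No change" as well.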
Using the method from the previous post and taking the sum B+C+E will not work in this case, since that sum does not distinguish between the cases B=0, C=1, E=0 (which should cause no change) and B=0, C=0, E=1 (which should clear). This can be remedied by putting a weight in front of E before forming the sum, which results in a clear ordering of all the combinations, as shown in the table below. The requirement we'll use is that the "Clear" values are the smallest, the "No change" values are intermediate, and the "Fire" value is the largest, with no overlap among the three groups. One combination of B, C, and E with this behavior is B+C+0.1*E.
| B | C | E | B+C+0.1*E | Desired behavior |
| --- | --- | --- | --- | --- |
| 1 | 1 | 1 | 2.1 | Fire |
| 1 | 1 | 0 | 2 | No change |
| 0 | 1 | 1 | 1.1 | No change |
| 0 | 1 | 0 | 1 | No change |
| 0 | 0 | 1 | 0.1 | Clear |
| 0 | 0 | 0 | 0 | Clear |
So now we can construct B+C+0.1*E and then exclude the values 1, 1.1, and 2 so that the detector does not change state when it passes through these values. One way of doing this is to exclude values greater than 0.5 and less than 2.05, and then set the detector to fire when B+C+0.1*E is greater than 2.
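A minimal sketch of this step, again with pandas and a helper name of our choosing, assuming b, c, and e are 0/1 series on a common index:

```python
import pandas as pd

def composite_score(b: pd.Series, c: pd.Series, e: pd.Series) -> pd.Series:
    """Form B + C + 0.1*E and exclude the "No change" values."""
    score = b + c + 0.1 * e

    # Excluding values greater than 0.5 and less than 2.05 drops exactly
    # the "No change" values 1, 1.1, and 2; only 0, 0.1, and 2.1 survive.
    return score.mask((score > 0.5) & (score < 2.05))
```

A detector on the surviving values then fires when the score is greater than 2 (only 2.1 remains above that line) and clears on 0 or 0.1.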
There are several ways to make precise the idea that the signal is increasing. The rate of change is positive when the signal has increased since the last observation. Therefore if we apply the rate of change transformation and then exclude values less than zero, we obtain a signal which “publishes” only when the signal is stationary or increasing. Applying the count produces a signal which is 1 when the original signal is stationary or increasing, and 0 otherwise. This is captured in line E.
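One possible realization of this recipe in pandas (the helper name is ours): `diff()` plays the role of the rate of change, `mask` the role of the exclusion, and `notna()` the role of the count.

```python
import pandas as pd

def indicator_non_decreasing(signal: pd.Series) -> pd.Series:
    """Signal E: 1 when the input is stationary or increasing, 0 when decreasing."""
    rate_of_change = signal.diff()                        # change since the last observation
    published = rate_of_change.mask(rate_of_change < 0)   # "exclude" negative values
    # Count of surviving values at each timestamp: 1 or 0.
    # (The very first sample has no previous observation, so it comes out as 0.)
    return published.notna().astype(int)
```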
Finally we construct the sum B+C+0.1*E and exclude values corresponding to the desired behavior “No change.” This is the final signal on which we should alert, shown as plot F. In terms of the red “fire” zone above 1000 and the green “clear” zone under 900, this detector will fire when we are above the red line for one minute and when the current value is larger than the last observed value; and clear once we are under the green line for one minute. This detector would not have fired in our example scenario since the slope of the curve is negative when the duration condition is satisfied.
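To see that last claim numerically, here is a toy check with made-up values and a three-sample window standing in for the one-minute duration; the shape mimics the chart above, where the spike is already declining by the time the duration condition is met.

```python
import pandas as pd

values = pd.Series([800, 950, 1200, 1150, 1100, 1050, 1020, 990, 940, 880, 850])
duration = 3  # consecutive samples standing in for the one-minute duration

above_for_duration = (values > 1000).astype(int).rolling(duration).min() == 1
non_decreasing = values.diff() >= 0            # signal E as a boolean

naive_fires = above_for_duration               # threshold + duration only
refined_fires = above_for_duration & non_decreasing

print(naive_fires.any(), refined_fires.any())  # True False: only the naive detector fires
```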
Although reasoning through and configuring well-constructed alert detectors like this can be complicated, we’ve done the work for you in pre-built templates for alert detectors. These are surfaced via the Recommended Detectors feature on every chart and Host Navigator view in SignalFx.