At SignalFx we’re always looking for ways to make our alerts both smarter and easier to configure. This post will explain one of our latest efforts in this direction: the addition of compound conditions to the SignalFx detector creation UI.
Compound condition alerts allow you to combine simple “predicate for percent of duration” conditions using the Boolean operators "AND" and "OR". For example, you can alert if (CPU utilization is above 90% for 5 minutes) "AND" (latency is above 3s for 80% of 3 minutes). Alerting on compound conditions has been supported for some time in the SignalFlow language (the API for specifying analytics computations in SignalFx, which also supports the "NOT" operator), and that core functionality is now even more accessible. This post explains some methods for improving and refining alerts by using compound conditions. The first part discusses four example use cases. The second part explains the role of metadata in compound conditions, and also how to access compound conditions in the UI.
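For readers who prefer the programmatic view, here is a minimal SignalFlow-style sketch of a compound condition; the metric names and thresholds are illustrative rather than taken from a real environment:

A = data('cpu.utilization').mean()     # CPU utilization, averaged across emitters
B = data('latency.ms').percentile(90)  # 90th-percentile latency (percentile aggregation assumed available)
# Fire when both simple conditions hold; 'or' and 'not' combine conditions similarly.
# (Percent-of-duration variants are also supported; they are omitted here for brevity.)
detect(when(A > 90, '5m') and when(B > 3000, '3m')).publish('CPU high and latency high')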
A common use of SignalFx is to collect and monitor latencies for various operations, and alert when they cross a worrisome threshold or experience a sudden increase. In this example we’ll explain how alerting on a compound condition (involving more than one type of latency measurement) helps to ensure notifications go to the correct audience, thereby reducing alert fatigue. As we’ll see, the general use case is to suppress alerts for a service when an unhealthy dependency is the culprit.
Monitoring user-facing latency serves as a backstop, ensuring performance issues are detected before customers notice them. Loading a web application typically involves a query to some database (containing messages sent or financial transactions, for example), so latency in serving those queries is a lower bound on latency in serving the UI request. Assuming a robust collection of metrics concerning the performance and health of the database is in place, we need not alert on a sluggish UI if the database is taking 5 seconds to return query results. Suppressing some important details on aggregating these measurements across time and emitters (see the section on Metadata in part 2), the compound alert condition looks schematically as follows:
ui_load_time_ms > 5000 AND database_latency_ms < 5000
This is a fairly general pattern: user-perceived latency can often be neatly expressed as a sum of database latency and latencies in other parts of a system. Using a compound condition enables us to attribute the latency increase to the offending dependency. More generally, if service "A" depends on service "B", we can employ a compound alert condition to suppress alerts for service "A" when service "B" is unhealthy.
There are some organizational benefits to this improvement: if there is a single on-call engineer, conditioning the “UI load time alert” on database performance reduces the number of alerts associated with a single underlying incident and focuses attention on the root cause rather than a symptom. If each team has its own on-call rotation, conditioning the alert helps ensure notifications go to a targeted recipient responsible for the underlying service. By contrast, broadcasting even legitimate alerts to a large group can lead to alert fatigue.
Just as end-to-end latency can be decomposed as a sum of latencies over individual services, in the next example we express the sum of a metric across emitters (i.e., overall throughput) as a sum across different values of a dimension (i.e., per-Kafka topic throughput).
In the previous example, we assumed a suite of alerts on a dependency was in place, and conditioned an alert on the health of that dependency. In this example, we assume the overall health of a population of emitters is already being monitored, and want to suppress alerts for individual emitters when the population itself is unhealthy. This example comes directly from one of our customers, and was part of the motivation for adding compound conditions to the SignalFx UI.
The specific customer example involved monitoring Kafka producers. In the past, they had seen problems with producers for individual topics, and found it useful to set an alert on throughput by topic. After creating a signal "A" representing producer throughput (summed by topic), we can alert when "A" is too low (relative to a static threshold or a historically defined baseline). This will let us know when producers for a given topic are not sending enough records. Somewhat unfortunately, we are likely to receive a large number of alerts across all topics if there are problems with the Kafka cluster as a whole! These are legitimate alerts in the sense that they require some action, but they do not really pertain to the topics. We can improve the alert by incorporating a condition regarding cluster health.
To do this, we create a signal "B" representing overall producer throughput (a simple sum), and condition the alert on "A" on the behavior of "B": instead of alerting when "A < 100", for example, we alert when "A < 100 AND B > 500". In the detector creation UI, this looks as follows:
This configuration uses different trigger sensitivities for each condition. We use a 1-minute duration for the topic-wise condition ("A < 100") to be resilient against a few low readings. We can be confident the cluster is not to blame by inspecting the overall throughput (signal "B") at one timestamp; adding a duration to the condition "B > 500" makes the alert harder to trigger, but might lead to some false negatives, depending on how much variation signal "B" experiences.
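In SignalFlow terms, a rough sketch of this detector might look like the following; the metric name, the topic dimension, and the thresholds are placeholders for whatever your environment actually reports:

A = data('kafka.producer.records').sum(by=['kafka_topic'])  # per-topic producer throughput (placeholder names)
B = data('kafka.producer.records').sum()                    # overall producer throughput
# Alert only when a topic looks slow for a minute and the cluster as a whole is still healthy.
detect(when(A < 100, '1m') and when(B > 500)).publish('Topic throughput low, cluster healthy')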
Using this compound condition alert (together with an alert regarding the overall health of the cluster) enables the alert responder to know immediately which scenario they’re in (whether the problem is topic-specific or cluster-wide), allowing them to more quickly focus on triage and remediation instead of diagnosis.
In Examples 1 and 2, compound conditions allow us to encode relationships among metrics, or across different dimensions of a single metric, and the end result is a more specific and actionable alert. In the following examples, we use compound conditions to refine an alert on a single metric and eliminate low-quality alerts.
As explained in this post, one problem with simple “signal above threshold for duration” alerts is that they may fire while the signal is heading back towards the healthy state and intersects the threshold, so the alert rule does not perfectly align with the problem scenario one intends to capture. A solution proposed and implemented in that post (by a somewhat roundabout method in the plot builder) is to add a condition on the rate of change of the signal. For example, we can simply require the signal’s rate of change to be positive, in addition to the threshold being exceeded for some duration. It is now straightforward to accomplish this in the SignalFx UI: in addition to the original signal one intends to monitor (say it’s plot "B"), in the Alert signal tab we also create the rate of change of this signal by adding analytics.
We can then, for example, alert when "B > 1000" for 1 minute "AND C > 0", where "C" is the rate of change. This requires "B" to be above 1000 for 1 minute, and to be increasing.
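As a SignalFlow-style sketch (the metric name is a placeholder), the pattern looks like this:

B = data('api.latency.ms').mean()  # the original signal to monitor (placeholder metric)
C = B.rateofchange()               # its rate of change
# Require the signal to be high for a sustained period and still increasing.
detect(when(B > 1000, '1m') and when(C > 0)).publish('Latency high and rising')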
Note that double exponential smoothing (Double EWMA) combines information about the values of a signal and its rate of change. Transforming the signal via Double EWMA before setting an alert is another approach to reducing alerts that trigger “on the way down.” For more information, see the Analytics Reference Guide.
A common alerting pattern is to trigger when a certain “failure” rate exceeds a threshold. “Failure” is intended loosely here – we have in mind the situation where you are reporting metrics (often counters) of the type “action/process completed successfully” and “action/process not completed.” You want to monitor the ratio “Failures / Attempts” and alert when it is too high.
The number of “attempts” often varies with the time of day and day of week, so the same failure rate might be interpreted quite differently depending on overall attempts. A failure rate that would be worrisome during a high traffic period (750 of 1000 user attempts to open a message failed!) will often happen by chance during a low traffic period (3 of 4 user attempts failed). A compound condition allows us to distinguish between these cases: instead of alerting on “Failures / Attempts” in isolation, we can also require the number of attempts to be large enough to worry about. So, defining the quantities "num_attempts" and "failure_rate" as follows:
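(As a SignalFlow-style sketch, with placeholder counter names, the definitions might look like this; metadata correlation, discussed in part 2, takes care of matching each success time series with its failure counterpart.)

failures  = data('requests.failed')     # failure counter (placeholder metric name)
successes = data('requests.succeeded')  # success counter (placeholder metric name)
num_attempts = failures + successes     # attempts per emitter
failure_rate = failures / num_attempts  # failure rate per emitter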
Our alert would look as follows:
failure_rate > 0.2 AND num_attempts > 50
Another approach is to explicitly adjust the failure rate by some function of the number of attempts. To do this, choose a non-negative function that is decreasing on the positive numbers, for example a scaled exponential of -1 times its input. Then subtract this function, evaluated at the number of attempts, from the failure rate, and alert on the result. Here is an example:
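(This is a sketch only: the constants are chosen purely for illustration, and the availability of an exp() helper in your environment is an assumption.)

# Relax the effective threshold when traffic is low; constants and exp() are illustrative assumptions.
adjusted_rate = failure_rate - exp(-1 * num_attempts / 20)
detect(when(adjusted_rate > 0.2)).publish('Failure rate high (traffic-adjusted)')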
This has the effect of relaxing the threshold on the failure rate when the number of attempts is too low, essentially capturing a family of compound conditions.
Triggering alerts correctly relies not just on compound conditions, but also on getting metadata correlation right.
Metadata in SignalFx refers to a collection of key-value pairs associated to a time series that, taken together, identify that time series. In SignalFx, you can combine metrics using mathematical operations. Metadata correlation is the mechanism that enables us to evaluate expressions in the way you would expect. For example, if "A" represents the count of failures for a transaction performed on a set of nodes and "B" the count of successes for the same transaction on the same nodes, then "A / (A+B)" will yield the failure rate per node. SignalFx’s metadata correlation makes this easy to express – see this post for a deep dive on the inner workings of metadata correlation, in particular how we produce the key-value pairs associated to an expression such as "A + B" from the pairs of "A" and those of "B".
The same metadata correlation algorithm applies to compound conditions. The individual conditions "A < 100" and "B > 500", for example, possess the same metadata as "A" and "B" respectively, and the compound condition "A < 100 AND B > 500" possesses the same metadata as would other combinations involving "A" and "B" ("A + B", "A * B", etc.).
How does this apply to the above uses of compound conditions?
In Examples 3 and 4, the metrics for each condition had exactly the same metadata, so correlation was fairly straightforward. A signal and its rate of change possess the same metadata, hence a compound alert incorporating rate of change presents no additional metadata complications (the “correlation” is the identity mapping). Similarly, in the failure rate scenario, the overwhelmingly common case is that every “success” time series has a counterpart “failure” time series (with the same metadata), hence number of attempts and failure rate have this metadata as well, and the compound condition alert behaves as the simple one would (as far as metadata is concerned).
Metadata correlation in Example 2 (combining population and individual health) is slightly more interesting. Here "A" is a metric summed by topic, and "B" is a simple sum. As the section “How Aggregations Define Metadata” of the deep dive post explains, the members of signal "A" will have only a topic dimension. As “Working with Complete Aggregations” explains, the complete aggregation "B" loses all metadata (except the metric itself), and so each time series of "A" correlates with "B", and an expression involving "A" and "B" will have one output for each member of "A". Since compound conditions leverage the same metadata correlation algorithm, alerts for "A < 100 AND B > 500" will have an associated value of the topic dimension: they will specify which topic triggered the alert!
The reader familiar with SignalFx will notice metadata issues are gnarliest in Example 1 (dependency logic): there is no natural way to correlate the client(s) reporting "ui_load_time_ms" with the host(s) (say) reporting the metric "database_latency_ms". In general there is no way to correlate the hosts comprising service "A" with those comprising service "B", so an alert of the form "ui_load_time_ms > 5000 AND database_latency_ms < 5000" (with no aggregations) would never trigger, as the metadata of the conditions do not correlate. To remedy this, we need to aggregate the metrics along a common dimension (which probably does not exist in examples like this), or apply a simple aggregation to one (or both) metrics.
In this example, it makes sense to summarize "ui_load_time_ms" by some percentile, say the 90th, since the condition "ui_load_time_ms (90th percentile) > 5000" means “at least 10% of users are experiencing unacceptable latency.” Now the alert "ui_load_time_ms (90th percentile) > 5000 AND database_latency_ms < 5000" will trigger, but it will trigger for each emitter (e.g., host) of "database_latency_ms", provided that host’s latency is low and the 90th percentile of "ui_load_time_ms" is high. In our scenario, we do not care about the host-level behavior of the database. Therefore we also summarize "database_latency_ms" by some percentile, say the 95th, since "database_latency_ms (95th percentile) < 5000" means “at least 95% of the hosts are returning results within 5 seconds.” Now the alert:
ui_load_time_ms (90th percentile) > 5000 AND database_latency_ms (95th percentile) < 5000
triggers when 10% of users are experiencing long waits but at least 95% of the hosts are serving query results within 5 seconds. One should attach percent of duration conditions as well, but we omit these details to keep things simpler.
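Putting this together as a SignalFlow-style sketch (metric names and thresholds are illustrative, the percentile aggregation method is assumed as named, and the percent-of-duration refinements are omitted as noted above):

ui_p90 = data('ui_load_time_ms').percentile(90)      # 90th percentile across users
db_p95 = data('database_latency_ms').percentile(95)  # 95th percentile across database hosts
# Alert on a slow UI only while the database itself looks healthy.
detect(when(ui_p90 > 5000) and when(db_p95 < 5000)).publish('UI slow, database healthy')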
After creating signals and selecting one in the Alert signal tab, choose Custom Threshold (the final option) in the Alert condition tab. In the Alert settings tab, after setting the condition on the selected signal, click “Add another condition.”
Alternatively, after creating more than one signal in the Alert signal tab, click the double bell icon in the Alert on column and proceed to the Alert settings tab, and create multiple conditions. For more information, see the documentation.
In this post we have explained how to use compound alert conditions to reduce the quantity of alerts associated with a common underlying incident (and surface only the most specific and relevant alerts), and to eliminate some low-quality alerts altogether. If you have an interesting use case for compound conditions, I’d love to hear about it. You can reach me here.