In the previous blog, we introduced what makes a bad alert and why being able to easily customise and fine-tune your detectors is critical to creating great alerts. The first category of detectors in Splunk Observability Cloud that we dived into was the out-of-the-box offering, AutoDetect. Customising and subscribing to these detectors is a great way to get up and running straight away with industry best-practice alerts and bring down MTTx.
Every business has a different tech stack and its own priorities and objectives, and every environment is varied and complex. When you want to precisely monitor the things that matter to your business, you’ll need to customise. Custom detectors in Splunk Observability Cloud let you specify the exact dataset and dimensions, and pair a powerful real-time streaming metrics engine with comprehensive customisation options, so you can create meaningful, actionable alerts that are relevant to your business and continually drive technical and business improvements.
In this blog, we’ll cover how to create a Custom Detector and explore first-hand how the right customisations can turn an alert storm into meaningful insights. If you’ve started a free trial, you’ll have access to the same metric that we’ll be using in this example. Feel free to follow along!
With Splunk Observability Cloud, you can create custom detectors for any service or metric(s) with full control of the alert conditions. You can apply filters and functions both to and between datasets, defining anything from simple criteria on a base metric through to comprehensive criteria involving multiple metrics and mathematical computations. A simple static threshold on disk utilisation for a database server? Forecasting expected service saturation for proactive scaling? Complex formulas that determine overarching service health scores across multiple systems? Custom detectors are the perfect fit.
To get started creating a Custom detector, we’d click the plus ‘+’ icon in the top right corner of the Alerts & Detectors page and select Custom Detector.
Once we’re on the New Alert Rule screen, the first thing we’ll need to do is identify the metric(s) we want to monitor. When it comes to working with metrics and signals in Splunk Observability Cloud, we use ‘plot lines’ to define and correlate between different datasets. Each plot line can act as an independent stream of a metric and its associated dimensions, or those plot lines can be evaluated against one another to compute a new output altogether. For this example, we’ll be looking at service latency for our customers, comparing it to the previous week, and setting it to alert us if we see trends that indicate regression in our user experience. Ideally, we should be improving latency for end users, not going the other way! We’d start by taking a sample service latency metric, aggregating it down to each of the customers, and performing basic functions to determine the weekly variance for latency.
Say we have a metric called demo.trans.latency that simulates the latency of a service used by three customers globally and hosted across multiple data centers. With all the metadata we have for the metric, splitting it by those dimensions gives us 18 Metric Time Series (MTS), as pictured in the Data Table. Think of MTS as the underlying rows that make up a given metric across all of its unique dimension combinations. For example, we could summarise the max CPU utilisation of our entire tech stack, and it would be a single MTS. That wouldn’t go far in helping us identify or resolve a particular infrastructure issue. Instead, we could aggregate by availability zone; if we were hosted across three zones, our MTS count would be three. But what we really need to know is which specific host is experiencing issues, so we’d aggregate down to the host level. Our MTS count then matches the number of hosts, which, hypothetically, is 100.
(Input the demo.trans.latency metric as the signal for plot line A)
(Data Table for demo.trans.latency shows the unique values across the 18 MTS)
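To make the MTS idea concrete, here’s a minimal SignalFlow sketch of those three aggregation levels. The cpu.utilization metric and the availability_zone and host dimension names are illustrative assumptions, not something configured in this walkthrough:

```
# One MTS: a single maximum across the entire environment
data('cpu.utilization').max().publish(label='whole_stack')

# Three MTS: one per availability zone (assuming three zones)
data('cpu.utilization').max(by=['availability_zone']).publish(label='per_zone')

# ~100 MTS: one per host, granular enough to point at the machine with the problem
data('cpu.utilization').max(by=['host']).publish(label='per_host')
```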
We want to look at latency for each customer, not at the various hosts/regions/data centers the service is hosted from, so we add an analytic function under the F(x) column to apply a max latency aggregation by the demo_customer dimension. The timechart and the number of data rows immediately update to reflect the aggregation we’ve just applied.
(Applying a Max latency by demo_customer analytic function to our demo.trans.latency metric)
Because there are three unique customers in this environment, the updated data table now shows only three rows/MTS being evaluated by the detector.
(Data table shows 3 MTS / rows with Max latency by demo_customer aggregation)
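For those who prefer to work programmatically, plot line A corresponds roughly to the following SignalFlow (a hedged sketch rather than an exact export of the UI configuration):

```
# Plot line A: current max service latency, aggregated down to one MTS per customer
A = data('demo.trans.latency').max(by=['demo_customer']).publish(label='A')
```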
Next, we need the max latency for each of our customers at this time last week as a baseline we can use to determine our variance. We want to see an improvement in service latency for our customers as we release new features and improvements. We’d clone the existing plot line as the next step.
Cloning a plot line to duplicate the signal and its analytic function configuration to a new plot line
The cloned plot line (plot line B) will carry over the same metric and analytic configurations. Then we can shift the date and time for this plot line to the same time in the prior week by adding a new analytic function called Timeshift and setting it to one week.
Adding the Timeshift analytical function and setting it to 1 week to plot line B
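In SignalFlow terms, the cloned and timeshifted plot line B would look something like this (again a sketch, not a verbatim export):

```
# Plot line B: the same per-customer aggregation, shifted back one week to act as the baseline
B = data('demo.trans.latency').max(by=['demo_customer']).timeshift('1w').publish(label='B', enable=False)
```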
Our detector now has both the real-time and historical service latencies for each of our customers. The metric we actually want, however, is the variance between the two so we can make sure we’re not regressing. In the next plot line, we’d add a formula that operates across the existing datasets.
Selecting Enter Formula in the next Plot Line
Now we can calculate the difference between the first two plot lines by adding the formula A-B (the value of plot line A (current) minus the value of plot line B (last week)). Ideally, we want to see a negative value, meaning we’re faster than last week; the larger the positive number, the worse the latency compared to last week. Once the formula is in place, we can make the new variance calculation both the dataset we see in the timechart and the value to alert on by toggling the eye and bell icons. We can also verify that the new metric has gone from showing actual latency numbers to just the difference, since the Y-axis should now be in small decimal increments rather than absolute latency values.
Adding a formula to calculate the variance between our two plot lines
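The equivalent formula in SignalFlow is simple arithmetic between the two published streams (a sketch under the same assumptions as above):

```
# Plot line C: weekly variance. Negative is good (faster than last week);
# the larger the positive value, the worse the regression.
C = (A - B).publish(label='C')
```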
With the right metric calculation in place, the next thing to set is the Alert Condition. This is where we define the criteria and thresholds to evaluate against the metrics we’ve gathered in the Alert Signal component. Several of these conditions greatly simplify applying Machine Learning (ML) to accurately identify anomalies and deviations. In this example, we’ll use a simple Static Threshold, but there are numerous Alert Conditions to choose from within the platform.
Selecting from the list of Alert Conditions for our detector
Once we select Static Threshold, we specify what we want that threshold to be. Setting the threshold to three, we’ll be alerted whenever the service latency experienced by any of our customers is three milliseconds (ms) slower than it was the previous week. You’ll notice in the timechart that a 3ms threshold would have resulted in more than 61,000 alerts on the previous day had this alert been active. That’s beyond mere ‘alert noise’! We’d customise this even further to keep our team happy and our email service functioning.
Setting a custom threshold with a large number of simulated alerts due to a spiky dataset
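Expressed in SignalFlow, that naive condition is a single detect() statement that fires on every individual datapoint above the threshold, which is exactly what produces the alert storm (a sketch only; the rule name is made up):

```
# Fires the moment any customer's variance exceeds 3ms, on every breaching datapoint
detect(when(C > 3)).publish('Latency regression vs last week')
```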
In this example, the dataset is quite ‘spiky’ - constantly peaking above the threshold before dropping back below the 3ms mark. Since we’re not concerned with behaviour at such a fine-grained resolution and are more interested in how the latency is trending, we can customise this even further to make it much more relevant and meaningful for our team. A great way to do that is to change the Trigger Sensitivity from alerting on any immediate breach of the threshold to alerting only when there is a sustained period where the latency is slower than the previous week.
Customising the Trigger Sensitivity to accommodate spiky data and look at the trend instead
Now, we’d specify that we only want to fire an alert if our latency exceeds the threshold for at least half (50%) of a five-minute window. Accounting for a spiky dataset and customising the threshold to look for a trend brings the number of simulated alerts over the previous day down from over 61,000 to just one. Our customisation has turned what would have been an unusable detector into something valuable and meaningful for the team!
Setting the sensitivity to a duration-based time window and reducing the number of simulated alerts
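The duration-based sensitivity maps to the lasting and at_least arguments of when() in SignalFlow, roughly like this (again a sketch, with an assumed rule name):

```
# Fire only when the variance stays above 3ms for at least 50% of a 5-minute window
detect(when(C > 3, lasting='5m', at_least=0.5)).publish('Customer latency trending slower than last week')
```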
At this stage, we’ve catered for false positives, false negatives, and alert noise. In Observability Cloud, you can also ensure the alerts generated are actionable and context-rich by customising the Alert Message. Starting with the default alert message, there are a couple of things we can update to make it more useful. Firstly, we’d change the severity from ‘Critical’ to ‘Warning’, since we’re not looking for a full service outage but are more concerned about incremental trends. The next component we can customise is the body of the alert message. In the preview, we get a few important details like the time, the condition for the alert, and the value that caused the breach, but we’re missing a VERY key detail our team would need to start troubleshooting effectively and immediately. As we’re looking at latencies for each customer, the name of the customer that was impacted is critical information we’d want in the alert itself rather than left for further investigation. The good news is that this is easily added within Splunk Observability Cloud. We’d click ‘Customise’ to add more detail.
Setting the severity and opening up the alert message customisation pane
From the alert message customisation pane, we can pull any of the dimensions from the signals we’ve configured through into the alert message. The customisation options are quite comprehensive and even allow conditional statements and URLs/images to be embedded in the message. We’d simply add the variable we want to include in the message body and subject - {{dimensions.demo_customer}} - and with that one change we’ve made the alert significantly more actionable.
Customising the Alert Message and Subject using variables to pull through key metadata
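As a purely illustrative example (the wording below is ours; only the {{dimensions.demo_customer}} variable comes from the walkthrough above), the customised subject and body might end up looking something like this:

```
Subject: Latency for {{dimensions.demo_customer}} is trending slower than last week
Body:    Max service latency for {{dimensions.demo_customer}} has been more than 3ms
         slower than the same time last week for a sustained period. Check recent
         releases and downstream dependencies for this customer.
```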
The final step! We’d make sure we don’t send any generated alerts to the entire company. Let’s be honest, does the HR team need a notification that our service latency is trending in the wrong direction? Probably not. In the Recipients pane, we can specify who or where any generated alerts should be sent. The recipient options in this view will reflect the notification services you have set up. We could add “my Team” as the only group to be notified, and we’re done!
Customising the Alert Recipients for any generated alerts, limiting the notifications to just a single team
We’ve only scratched the surface of how you can create powerful detectors in Splunk Observability Cloud. Check out future blogs, the Alerts and Detectors Documentation, and the Splunk Training Portal for more detailed walkthroughs. Start a trial today to test this out, and, when you’re ready to take it to the next level, you can move to a more programmatic, automated, and templated approach to creating and managing your detectors using SignalFlow, Terraform, and the platform APIs available.
Previously: How to Create Great Alerts | Coming Soon: How to Investigate Kubernetes Failures with Logs in Context