We’ve all been guilty of it: creating rules and filters to hide the alerts that, for the most part, are just noise, only to have notifications about a legitimate issue get swept up by those same filters. There are only so many times we can break concentration and disrupt productivity before getting fed up with false positives and ignoring everything completely.
With our digital systems getting bigger and more complex, being able to customise and fine-tune alerts so they’re relevant, accurate, and actionable is more critical than ever. Between alert fatigue, missing issues altogether, and spending hours just to gather the right context and detail to start troubleshooting, the experience for our users, our KPIs and SLAs, and staff morale all suffer.
In this blog, we’ll touch on alerting scenarios we want to avoid, and explore some of the ways we can create and customise great alerts in Splunk Observability Cloud to go from noise to action and improve your MTTx.
There are four main categories that bad alerting can be broken into. Here’s a quick summary to set the scene:
False positives: an alert condition is met but isn’t representative of the actual severity or state of the problem. An example is a critical alert firing to say that a service is degraded when the service is, in fact, functioning fine with no measurable impact on users. This typically happens when thresholds are set too aggressively, or when they don’t account for spiky or flappy datasets.
False negatives: in this scenario, you’re missing incidents altogether. User-impacting issues occur, no alerts are triggered, and complaints from end users are typically the first an organisation hears about the impact. This is indicative of thresholds set so conservatively that they’re never breached, and of alerting conditions coupled to infrastructure rather than to the signals and telemetry coming from our services.
Alert noise: an easy way to have alerts filtered into a black hole is to send every alert to everyone, every time. Even accurate, actionable alerts may as well be false positives, as far as alert noise goes, if they’re sent to people who have no scope to action them.
Lack of context: so, we’ve done well in ensuring our Mean-Time-to-Detect (MTTD) is in order, and we’ve fired off an alert as soon as there were signs of service degradation. But the key metric as far as our users are concerned is Mean-Time-to-Recover (MTTR). No matter how great our MTTD is, users will have a poor experience and our MTTR will suffer if all an alert says is “Host X has high memory utilisation”. That leaves a lot to uncover before the issue can be resolved. Which services were actually impacted? Which users? Which demographic? Were there any code changes associated with the high memory utilisation? If we increase the memory on the host, will that resolve the problem, or will it cap out again because of a memory leak in a bug we pushed? To reduce MTTR, we need as much context and relevant information in the alert as possible so we can get straight into actually resolving the issue.
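To make that last point concrete, here’s a small, purely illustrative sketch of the kind of detail worth bundling into an alert payload (for example, via a webhook) before it ever reaches a responder. None of these field names or helpers come from Splunk Observability Cloud; they’re hypothetical stand-ins for the context a responder would otherwise have to dig up themselves.

```python
# Hypothetical example only: illustrates the kind of context that turns
# "Host X has high memory utilisation" into something actionable.
import json
from dataclasses import dataclass, asdict


@dataclass
class AlertContext:
    service: str                   # which service is actually degraded
    environment: str               # prod vs. staging changes the urgency
    affected_endpoints: list[str]  # narrows troubleshooting to specific routes
    recent_deploy: str | None      # code change correlated with the symptom
    runbook_url: str               # where to start, instead of searching from scratch


def build_alert_payload(summary: str, ctx: AlertContext) -> str:
    """Bundle the symptom with the context a responder needs to act on it."""
    return json.dumps({"summary": summary, **asdict(ctx)}, indent=2)


print(build_alert_payload(
    "p90 latency for checkout exceeded its recent baseline",
    AlertContext(
        service="checkout",
        environment="prod",
        affected_endpoints=["/cart/checkout"],
        recent_deploy="checkout v2.14.1, deployed 30 minutes ago",
        runbook_url="https://runbooks.example.com/checkout-latency",
    ),
))
```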
Splunk Observability Cloud simplifies the process of creating meaningful alerts. AutoDetect provides out-of-the-box (OoTB) detectors for common alerting use cases to simplify proactive troubleshooting without needing to create or maintain queries or conditions. Users can minimise the risk of false positives and negatives, reduce tech debt and time-to-value (TTV), and quickly onboard new systems and services so they can get back to innovating.
Depending on the datasets you’ve integrated, a list of detectors will automatically be available once you start sending data into Observability Cloud. While you’ll only see AutoDetect detectors for the supported integrations that are active in your environment, the full list of available detectors is extensive. You can always check which AutoDetect detectors are available to you by navigating to the Detectors page, opening the Detectors list, and using the filter to show only AutoDetect.
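If you’d rather check from a script, the detector list is also exposed through the Observability Cloud REST API. The sketch below uses Python and requests; the realm, the token environment variable, and the timeout are assumptions you’d adapt to your own setup.

```python
# Minimal sketch: list detectors via the Splunk Observability Cloud REST API.
# Assumptions: realm "us1" and an org access token in the SFX_TOKEN env var.
import os

import requests

REALM = "us1"                    # replace with your organisation's realm
TOKEN = os.environ["SFX_TOKEN"]  # an access token with API read permissions

response = requests.get(
    f"https://api.{REALM}.signalfx.com/v2/detector",
    headers={"X-SF-TOKEN": TOKEN},
    timeout=30,
)
response.raise_for_status()

# Print the name of each detector returned in the first page of results.
for detector in response.json().get("results", []):
    print(detector["name"])
```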
Some AutoDetect detectors use dynamic, Machine Learning (ML) based thresholds to help keep alerting conditions accurate and relevant across a variety of systems and services. Taking a per-entity approach to historical anomalies, sudden deviations, and capacity limits goes a long way toward reducing the complexity and sprawl of the detectors users need. Every service, for example, has a different baseline of normal operating conditions, which would otherwise mean creating a bespoke detector for every service even when we’re looking for the same symptoms. Splunk Observability Cloud lets you create a single detector that alerts when services are experiencing abnormal latencies or error rates across all the services you want to keep tabs on.
Filtering to see AutoDetect detectors available in your environment
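As a rough mental model of what a per-entity “sudden change” condition is doing, here’s a deliberately simplified sketch: a single program, grouped by a service dimension, that compares each service’s recent latency to its own trailing baseline. The metric and dimension names are hypothetical, the fixed 1.5x ratio is a stand-in for the managed, ML-derived thresholds AutoDetect actually uses, and the sustained-duration qualifier is the same trick that keeps spiky data from producing the false positives described earlier.

```python
# Simplified stand-in for a "sudden change in latency" detector; not the
# actual AutoDetect program. Metric and dimension names are hypothetical.
sudden_latency_change = """
latency = data('service.latency.ms').mean(by=['sf_service'])

recent   = latency.mean(over='10m')                  # what each service looks like now
baseline = latency.mean(over='1h').timeshift('10m')  # each service's own recent history

# One detector covers every service: each sf_service group is evaluated
# against its own baseline, so fast and slow services both alert sensibly.
detect(when(recent > baseline * 1.5, lasting='10m')).publish('Sudden change in service latency')
"""
```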
While the AutoDetect detectors are automatically published, alerts will only be sent once a subscription is created for that particular detector. The last thing you want is alerts sneaking up on you! Subscribing to a detector is simple: navigate to the one you want to receive alerts from, review the OoTB conditions, and set the threshold and recipients as desired. Say you wanted to subscribe to the Application Performance Monitoring (APM) Sudden Change in Service Latency detector: clicking on the detector in the list takes you to the review pane.
Subscribing to APM Sudden Change in Service Latency AutoDetect Detector
This pane details the logic and conditions, the entities in scope (services to be considered for latency changes in this case), and a preview of historical thresholds along with a simulation of how many times an alert would have been triggered.
In this specific scenario, though, you can see that we have quite a large number of services, and that without any filtering the detector will send alerts to the listed recipients for every service in our environment. To avoid having all alerts ignored, and to reduce MTTR, we’d need to create a customised version that limits the scope to the entities (in this instance, services) that each particular person or team is responsible for.
Thankfully, AutoDetect is designed for this exact situation. Users can create a customised version of an AutoDetect detector without modifying the base detector itself. This allows the alert conditions, as well as the severity and recipients of any triggered alerts, to be customised and tuned on top of the query that is provided and managed by the platform. Simply put, every person or team can tune and filter it to what they need without ever worrying about the underlying query. To start customising an AutoDetect detector, click ‘Create a Customised Version’ from the same review pane we navigated to above.
Navigating to the Create a Customised Version of an AutoDetect detector configuration pane
From here, we can give it a new name so it’s easily identifiable, filter the scope down to any tags/dimensions we like, fine-tune the alert conditions, specify an appropriate severity, and add the right recipients to action alerts as per the filtering. Once the tuning and filtering have been set, a new preview is generated from historical data to simulate how many times an alert would have been triggered based on the updated conditions.
A customised AutoDetect detector with narrowed and fine-tuned configuration
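For teams that prefer to manage this kind of scoped alerting as code, the same idea can be expressed through the detector API. The sketch below is illustrative rather than a recreation of the customised AutoDetect detector above (which Observability Cloud manages for you): the realm, token variable, metric and dimension names, service names, and email address are all assumptions.

```python
# Sketch: create a stand-alone detector scoped to one team's services, with a
# severity and recipient that match that team's on-call setup. Assumptions:
# realm, token env var, metric/dimension names, service names, and email.
import os

import requests

REALM = "us1"
TOKEN = os.environ["SFX_TOKEN"]

# Program text filtered to the two services this team owns.
program_text = """
latency = data(
    'service.latency.ms',
    filter=filter('sf_service', 'checkout', 'payments')
).mean(by=['sf_service'])

recent   = latency.mean(over='10m')
baseline = latency.mean(over='1h').timeshift('10m')
detect(when(recent > baseline * 1.5)).publish('Team latency change')
"""

detector = {
    "name": "Payments team - sudden change in service latency",
    "programText": program_text,
    "rules": [
        {
            # The label must match the label published by the program text.
            "detectLabel": "Team latency change",
            "severity": "Major",
            "notifications": [
                {"type": "Email", "email": "payments-oncall@example.com"}
            ],
        }
    ],
}

response = requests.post(
    f"https://api.{REALM}.signalfx.com/v2/detector",
    headers={"X-SF-TOKEN": TOKEN, "Content-Type": "application/json"},
    json=detector,
    timeout=30,
)
response.raise_for_status()
print("Created detector:", response.json().get("id"))
```

Note that the detectLabel in each rule has to match the label published by the program text; that link is what ties a rule’s severity and recipients to a specific condition.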
Now alerts will only trigger when we expect them to. We can visualise which entities are included in the dataset, and alerts will only go to the recipients who need to receive them. This is just one example of how users can reduce false positives and negatives and remove unnecessary alert noise.
False negatives and positives, alert noise, and a lack of context can have drastic consequences for your teams, your customers, and your KPIs. Taking the time to create great alerts is critical to ensuring they’re accurate, meaningful, actionable, and targeted. Splunk Observability Cloud AutoDetect detectors and alerts make this a painless process by providing meaningful alerts out of the box and simplifying the customisation experience so the right person is alerted at the right time.
You can test this experience today by starting a free trial, or follow along in the series to learn how to create fully customised detectors and alerts for even greater control and less toil as you keep track of what’s going on across your environment.
Previously: Set Up Monitoring for Your Hybrid Environment | Next: How to Customise Detectors for Even Better Alerting