Important Update: While the concepts covered in this blog post still apply, I'm pleased to announce that we've released an ITSI content pack for monitoring and alerting that provides all of the functionality described in this blog in a pre-packaged format. For more information on how to get started, review the content pack documentation or check out the blog covering the release of this content pack.
I’ve previously authored several blog posts covering thresholding basics and alerting best practices in Splunk IT Service Intelligence (ITSI). Those posts focused on foundational concepts and left many of the implementation details open to interpretation; moreover, as my experience and methodologies evolve, so does my guidance.
In this blog post, I intend to get a lot more prescriptive and lay out a blueprint for enterprise-wide alerting across all your services. We’ll zoom out from single-service or single-KPI based alerts and generate a design that is uniform and applicable to all services and KPIs in your ITSI environment. I believe that you’ll quickly see the benefits of this design, ranging from performance to maintainability to flexibility.
Interestingly enough, this design happens to mirror a popular risk-based security design strategy discussed at .conf18 called “Say Goodbye to Your Big Alert Pipeline, and Say Hello to Your New Risk-Based Approach.” If you buy into the design laid out in this blog, I encourage you to watch the replay of that talk. You'll likely draw several parallels between their approach and mine, and you may uncover even more alerting ideas.
To that end, I foresee the guidance in this blog evolving further toward that risk-based approach, and it’s possible that the technical details of my design will change slightly, or perhaps even dramatically, over time as the product and methodologies evolve. Nonetheless, if you’re actively increasing the number of services, KPIs, or alerts in your environment, this strategy will likely feel like a step in the right direction, and it’s time to consider changing your approach.
The alerting design involves two major concepts. So before we dive deep, an overview of the design and those concepts is warranted:
Concept 1: Create, in fact proliferate, notable events for any noteworthy changes to services, KPIs, and entities. We’ll depend heavily on custom correlation rules to achieve this. Additionally, we’ll build each correlation rule to evaluate across all services, KPIs, and entities leading to a performant, maintainable, and uniform implementation across our environment.
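As a rough sketch of what one of these correlation rules could look like (not the exact search from the later steps), a single search can run against the itsi_summary index and watch every service-aggregate KPI for a degradation to high or critical severity. The field names below, such as alert_severity, is_service_aggregate, itsi_service_id, and itsi_kpi_id, are the ones I understand the summary index to carry, but verify them in your own environment:

    index=itsi_summary is_service_aggregate=1 (alert_severity="high" OR alert_severity="critical")
    | stats latest(alert_value) AS alert_value, latest(alert_severity) AS alert_severity
        BY itsi_service_id, itsi_kpi_id, kpi

Each row this search returns becomes a notable event through the correlation search's notable event settings in ITSI, so one saved search covers every service and KPI rather than one search per KPI.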
Concept 2: Apply attributes to notable events to facilitate grouping and alerting logic. Attributes are nothing more than field/value pairs present in the itsi_tracked_alerts index. We’ll depend on typical core Splunk concepts to achieve this, such as lookups, calculated fields, and eval statements in our correlation searches. Once present, these attributes can be leveraged in notable event aggregation policies, alert action rules, and the episode review.
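As a hedged illustration of Concept 2, the tail of a correlation search might stamp those attributes on with a lookup and a couple of eval statements; service_ownership.csv, owning_team, escalation_tier, and urgency are hypothetical names used here for the example:

    | lookup service_ownership.csv itsi_service_id OUTPUT owning_team escalation_tier
    | eval urgency=case(alert_severity="critical", "high",
                        alert_severity="high", "medium",
                        true(), "low")
    | eval alert_group="kpi_degradation"

Because these fields land in itsi_tracked_alerts alongside the notable event, an aggregation policy can later split episodes by owning_team, and an alert action rule can fire only when urgency reaches a given value.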
Putting it all together, it looks like this… We’ll build multiple correlation searches looking for bad stuff happening in our services, KPIs, and entities. When our rules detect bad stuff, notable events will be created. We’ll apply various attributes to these notable events, allowing us to group related notables using aggregation policy logic to cut down on the noise. And lastly, we’ll configure alert actions in our aggregation policies to produce alerts to the NOC based on our desired alerting rules.
Like all things Splunk, ITSI stores much of its data in several key indexes, and our configurations and correlation rules will reference them. Here’s a quick overview of the key indexes used by ITSI and the data stored in each:
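- itsi_summary – the results of every KPI search, including alert values, severity levels, and service health scores
- itsi_tracked_alerts – the notable events created by correlation searches (this is where the attributes from Concept 2 land)
- itsi_grouped_alerts – episode (grouped notable event) data written by notable event aggregation policies

(These are the default index names; the exact set may vary by ITSI version.)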
Because our marketing team likes bite-sized blogs, and because you don’t need to eat this elephant all at once, I’ve broken the design out into five steps. Each step will be its own blog, and once you’ve completed the fifth step, you’ll have effectively implemented the approach and will be free to alter and augment it as you see fit. The five steps are:
As you try this out and make changes to your environment, you’ll want to test early and often. The customer I was working with had a simple and effective method for testing that I’ll share: create a test service with one or more test KPIs. When you need to break a service for testing purposes, use your test service and modify its threshold values to simulate failure. Similarly, as we start building up our notable event aggregation policies (NEAPs), you can build a test NEAP that includes only notables from your test service. This gives you a simple, isolated environment in which to test your changes.
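One small, hedged convenience while testing: before you start tuning NEAPs, a quick ad-hoc search against itsi_tracked_alerts confirms that your test service is actually producing notables (substitute whatever you named your test service; title and severity are standard notable event fields):

    index=itsi_tracked_alerts "Test Service"
    | table _time, title, severity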
Ready? Go on to Step 1...