As we continue this series on creating "A Blueprint for Splunk ITSI Alerting," our next step is to build a blanket correlation rule that generates notable events whenever the health score of any service degrades. To meet this need, most people would head down the route of creating a multi-KPI alert (mKPI) for each service.
But this strategy has two not-so-obvious drawbacks. First, it increases effort, both initially and over time, to maintain an mKPI per service; as new services are onboarded, the effort keeps growing. Second, it creates undue search load in Splunk: if I have 50 services in my environment, I'm now running 50 additional searches every 5 minutes or so just to drive alerting. Not ideal.
Creating a single correlation search that scans all services in one pass addresses both of these drawbacks. It also begins to normalize the alerting strategy across every service built in Splunk ITSI, which I personally view as another huge benefit.
Did you know that Splunk ITSI ships with a pre-built correlation rule that does exactly this? Pretty cool, eh? Just head over to Configure -> Correlation Searches and look for the search called "Monitor Critical Service Based on HealthScore." We could simply enable this correlation search and be done, but since we're going to modify it slightly, we'll duplicate the existing rule and make our changes to the copy.
For now, we'll make just one small adjustment to the out-of-the-box rule: we're going to create notable events if the service health score is anything other than normal. A point of note here: alert_level in the itsi_summary index is the field that represents KPI or service severity (1=info, 2=normal, 3=low, 4=medium, 5=high, 6=critical). By searching for alert_level>2, we're basically creating notable events any time a service health score is detected as low, medium, high, or critical. The correlation search is below.
`service_health_data` alert_level>2
| rename serviceid as itsi_service_id
| rename kpiid as itsi_kpi_id
| rename kpi as kpi_name
| lookup service_kpi_lookup _key as itsi_service_id OUTPUT title
| rename title as service_name
| eval actual_time=_time
| convert ctime(actual_time) as actual_time
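Before saving, you can sanity-check what the rule will match by running the base search ad hoc and bucketing the results by severity. Here's a quick sketch using the severity mapping above (the severity field name is just my own label for readability, not an ITSI field):

`service_health_data` alert_level>2
| eval severity=case(alert_level=3, "low", alert_level=4, "medium", alert_level=5, "high", alert_level=6, "critical")
| stats count by serviceid, severity

If that returns the services you expect at the severities you expect, the correlation search will generate notable events for the same results.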
Save and enable the new correlation rule and you're done. Use your test service to validate: once the service degrades, you should see one or more notable events created in Episode Review. Remember, we've decoupled the criteria that drive the creation of notable events from the creation of alerts, so we're not necessarily going to produce alerts when a service's health changes to low, but we now have more information and more context about that service that we can use to our advantage in the future.
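If you'd rather verify from the search bar than from Episode Review, the notable events land in the itsi_tracked_alerts index, with source set to the correlation search name. A minimal check, assuming you named your duplicated rule "Monitor Critical Service Based on HealthScore - Copy" (substitute whatever name you actually gave it):

index=itsi_tracked_alerts source="Monitor Critical Service Based on HealthScore - Copy"
| table _time, service_name, alert_level
| sort - _time

Each row here is one notable event, carrying the fields our correlation search produced, such as service_name and alert_level.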
Easy enough? Go on to Step 2...