In this blog we are going to describe how you can create a notable event policy in IT Service Intelligence (ITSI) that groups your events using labels generated by unsupervised machine learning in the Smart ITSI Insights App for Splunk – and don't worry, you don't have to be a data scientist to read this blog!
You may have read in previous blogs how you can use graph analytics to understand connected systems and may even have had a chance to try out the 3D Graph Network Topology Visualization App for Splunk yourself. There are a number of techniques in the app that allow you to apply machine learning to connected systems, but the one we are going to focus on in this blog is called label propagation.
Label propagation looks at the degrees of connectedness between the nodes in a system and sorts nodes by how similar their connections are. You have probably been the victim of this type of analytic on a daily basis: the technique is often used by social media platforms and online shops to recommend content to you based on what similar people are interested in.
Here we're going to examine how it can be applied to a service model in ITSI to group our services into sub-components, and then use the sub-component labels to inform the notable event aggregation policies. These policies are designed to reduce the noise generated by monitoring tools by sorting your data into more manageable groups.
Our first step in this process is to use our service models to detect the communities of interest. All of the analysis here is contained in the ITSI Service Tree Analysis dashboard in the Smart ITSI Insights app for Splunk, but the details of the searches used are contained here for reference if you are interested.
ITSI contains a number of commands and lookups that can help us build up a table that represents how each service is connected to another. In graph analytics we need data that is structured so that each record describes two connected entities. In wider applications there will often be additional data that describes the nature of the connection, for example, a record of “person, phone, owns” would mean a person (entity) owns (relationship) a phone (entity).
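As a toy illustration (the values here are entirely made up), the following search produces a single record in that entity-relationship-entity shape:

| makeresults
| eval src="person", relationship="owns", dest="phone"
| table src relationship dest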
For our purposes, we need a table that describes each connection in the service model in a source and destination type structure. For this we are going to use the getservice command and some of the KPI attribute lookups. The search below will return, for every service in our instance, all of the services that depend on it:
| getservice
| table serviceid services_depending_on_me
| eval dest=mvindex(split(mvindex(split(services_depending_on_me,"~"),0),"="),1)
| rename serviceid as src
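The nested split and mvindex calls are just string surgery: services_depending_on_me holds delimited key=value pairs, and we extract the first value as the dependent service ID. Here is the parsing in isolation – note the sample value below is hypothetical, as the exact field format may vary between ITSI versions:

| makeresults
| eval services_depending_on_me="serviceid=4f3a2b1c~kpiid=9d8e7f6a"
| eval dest=mvindex(split(mvindex(split(services_depending_on_me,"~"),0),"="),1)
| table services_depending_on_me dest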
We are then going to append to the results of this search a table that describes, for each service, all of the services it relies on – finally removing any records that don't have a dependency defined:
| getservice
| table serviceid services_depending_on_me
| eval dest=mvindex(split(mvindex(split(services_depending_on_me,"~"),0),"="),1)
| rename serviceid as src
| append [| getservice
| table serviceid services_depends_on
| mvexpand services_depends_on
| eval src=mvindex(split(mvindex(split(services_depends_on,"~"),0),"="),1)
| rename serviceid as dest]
| table src dest
| where isnotnull(src) AND isnotnull(dest)
This search gives us a table that shows the KPI and service IDs rather than something a user could easily interpret, however, so we're going to enrich the results with service titles from the getservice command:
| getservice
| table serviceid services_depending_on_me
| eval dest=mvindex(split(mvindex(split(services_depending_on_me,"~"),0),"="),1)
| rename serviceid as src
| append [| getservice
| table serviceid services_depends_on
| mvexpand services_depends_on
| eval src=mvindex(split(mvindex(split(services_depends_on,"~"),0),"="),1)
| rename serviceid as dest]
| table src dest
| where isnotnull(src) AND isnotnull(dest)
| join src [| getservice | table title serviceid | rename title as src_name serviceid as src]
| join dest [| getservice | table title serviceid | rename title as dest_name serviceid as dest]
| table src_name dest_name
| dedup src_name dest_name
This presents us with a table like the one below.
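(The service names below are purely illustrative – yours will come from your own service model.)

src_name                     dest_name
Shared IT Infrastructure     Buttercup Stores
Database Services            Buttercup Stores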
If we visualise this table using the 3D Graph Network Topology Visualization App for Splunk, we see something like the structure below, with the Shared IT Infrastructure service sitting at the top of this particular tree:
This diagram, however, doesn’t tell us much about the different communities in our data. We need to do a tiny bit more analysis to identify the community labels, but the search itself is actually quite trivial.
If we add the following to our initial search we will enrich our data with community labels:
…
| fit GraphLabelPropagation src_name dest_name
| eval color_src="#".upper(substr(md5(labeled_community),0,6))
| eval color_dest=color_src
| table src_name dest_name color_src color_dest
The md5 and substr trick above simply derives a consistent hex colour code from each community label. Visualising this with the nodes coloured by their community label will give us a chart that looks a bit like the one below:
We can now see much more clearly the different sub-communities within each tree in our service model.
The community labels generated by the algorithm are numeric, and won't capture much of the context around the service tree each group was found in. In practice you may want to enrich these labels to be more descriptive – in our example, we could add the name of the service at the top of each tree to the group label (for example "Buttercup Stores Group 1" instead of just 1). Adding this context helps make sense of the episodes generated by the notable event aggregation policies, and it is something we have seen work well with some of our customers.
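A minimal sketch of that enrichment, assuming you maintain a small lookup (community_names.csv is hypothetical here) that maps each numeric label to a friendly prefix:

…
| fit GraphLabelPropagation src_name dest_name
| lookup community_names.csv labeled_community OUTPUTNEW community_prefix
| eval labeled_community=community_prefix." Group ".labeled_community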
Now that we have detected the different communities in our service trees, we're going to see how the labels can be used to generate episodes using two approaches: automatically, using the Content Pack for ITSI Monitoring and Alerting, or manually, by saving the labels into a lookup and applying them in your correlation searches.
These options are handled by the app itself, which will detect whether or not you have the content pack installed.
This awesome content pack contains a load of useful frameworks for getting more out of your ITSI instance. Here we are going to use the itsi_kpi_attributes.csv lookup that ships with the content pack to automatically enrich our episodes with a community label.
Within this lookup there is a field called alert_group that is used by a pre-defined notable event aggregation policy to automatically create episodes. If we use the results of the graph label propagation algorithm to replace the alert_group values with the community label, we can create episodes based on the community label immediately! You can do this in the Smart ITSI Insights app for Splunk by clicking the 'Save Labels Using the Monitoring & Alerting Content Pack' button once the ITSI Service Tree Analysis has loaded. Note that you will need to have the Monitoring and Alerting content pack installed for this to work!
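Conceptually, that update amounts to something like the search below. This is only a sketch: it assumes the community labels have already been saved to a lookup (service_community_labels.csv, created in the manual steps that follow) and that itsi_kpi_attributes.csv carries a service name field to join on – the button takes care of the real details for you.

| inputlookup itsi_kpi_attributes.csv
| lookup service_community_labels.csv src as service_name OUTPUTNEW labeled_community
| eval alert_group=coalesce(labeled_community, alert_group)
| fields - labeled_community
| outputlookup itsi_kpi_attributes.csv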
The following sections describe how the community labels can be applied manually to generate episodes.
To begin with we need to take the results of the label propagation algorithm and apply the outputlookup command to create a CSV that we are going to use to enrich our correlation searches. This can be done by entering a model name and clicking the ‘Save Community Labels into a Lookup’ button on the save community labels panel on the ITSI Service Tree Analysis dashboard.
I’ve called my lookup service_community_labels.csv, which will be referenced as we go through the remainder of this blog.
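If you would rather do this by hand, the save is essentially the community label search from earlier piped into outputlookup – a sketch, assuming the field names from the searches above:

…
| fit GraphLabelPropagation src_name dest_name
| table src_name labeled_community
| rename src_name as src
| dedup src
| outputlookup service_community_labels.csv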
We’re now going to use the lookup we just created to enrich correlation searches. I have included an example correlation search below for reference, but I don’t expect much of this to match a production search given it is running against one of our demo environments!
A few elements in the search are worth highlighting, however: the apply_entity_lookup macro maps our host to an entity, and the get_service_name macro then maps the entity to a service. Once we have the service name we can use our lookup to enrich the data with a community label.
index=itsidemo sourcetype=nagios perfdata=SERVICEPERFDATA
| rex field=reason "(?<value>\d+\.?\d*)"
| rex field=name "check_(?<metricname>.+)"
| eval norm_severity=case(severity=="CRITICAL",6, severity=="WARNING",4, severity=="OK",2)
| dedup consecutive=true src_host severity name
| eval tmp_entity=host
| eval host=src_host
| `apply_entity_lookup(host)`
| eval host=tmp_entity
| fields - tmp_entity
| `get_service_name(serviceid,service_name)`
| lookup service_community_labels.csv src as service_name OUTPUTNEW labeled_community
Creating this as a correlation search will look something like the screenshot below, where you can see that we have also added the community label to the notable event identifier fields.
Once you have enabled this correlation search you will find new alerts being generated in your instance that include a community label. It’s worth noting that you will want to apply the community label lookup to all of your correlation searches in ITSI if you want to use the label effectively in your notable event aggregation policies.
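In most cases that just means appending the same two enrichment steps to the end of each correlation search (assuming each search resolves a service name in the same way as the example above):

| `get_service_name(serviceid,service_name)`
| lookup service_community_labels.csv src as service_name OUTPUTNEW labeled_community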
Now that we have our correlation searches generating alerts with community labels, we can use that data to create aggregation policies.
Creating the policies is easy. As you can see below, we are looking for results that contain a community label, filtering on each community, and giving the episode the name of the community. You may also want to configure additional parameters here, such as the minimum time between events or the maximum length of the episode.
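As a rough guide, the policy configuration amounts to something like the following (the labels here paraphrase the ITSI policy UI rather than quoting it exactly):

Filtering criteria:    labeled_community is not null
Split events by field: labeled_community
Episode title:         %labeled_community%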
Once the policy is enabled you should start to see episodes generated by this policy on the episode review dashboard.
Drilling down into this episode will show you all of the context you need to start triaging what has happened.
In this blog we have shown you how to identify different communities from your ITSI service trees, and then how to use these labels to group your alerts into episodes. Clearly you could also apply the community labels in different ways in the notable event aggregation policies, or even use them to generate additional KPIs, but hopefully this has helped inspire you to go and apply some machine learning to your ITSI data!
Happy Splunking!