Maybe you have used the previous blog post about generating smarter episodes in ITSI using graph analytics and want to know what else you can apply ML to. Maybe you’re still swamped in alerts even after using the awesome content pack for monitoring and alerting. Maybe your boss has told you to go read up on AIOps… Whatever the reason for finding yourself here, this blog is intended to help you identify the “unknown unknowns” in your alert storms.
Even if you are using event aggregation policies in ITSI and are able to group alerts by well understood factors, you may still need help being drawn toward the alerts or groups of alerts that appear to be unusual compared to what you normally see – and that is what we’re hoping to help with here!
There are two main exam questions to address here when looking for “unknown unknowns”:
- Are the volumes of alerts we are seeing unusual?
- Are the combinations of alert types and descriptions that make up those alerts unusual?
We will deal with each of these separately before bringing the results together to see if there have been truly abnormal event ‘storms’ in the data.
Health warning: there are some big searches on display in this blog!
There are a couple of factors that can be examined when thinking about alert volumes being high, namely:
- whether the number of alerts is unusual for the time of day
- whether the number of alerts is unusual for a particular service or community of services
Here we’re going to walk through how you can determine these types of insights using the Probability Density Function in the Machine Learning Toolkit.
This search will generate an anomaly score for each service by modelling the expected number of alerts at any given time of day across all services, for each individual service, and for each community label:
index=itsi_tracked_alerts
| bin _time span=5m
| stats count as alerts by _time service_name
| join type=outer service_name [|inputlookup service_community_labels.csv | table src labeled_community | dedup src labeled_community | rename src as service_name]
| eval hour=strftime(_time,"%H")
| fit DensityFunction alerts by "hour" into df_itsi_tracked_alerts_volume as alert_volume_outlier
| fit DensityFunction alerts by "service_name" into df_itsi_tracked_alerts_service as service_alert_outlier
| fit DensityFunction alerts by "labeled_community" into df_itsi_tracked_alerts_community as community_alert_outlier
| eval anomaly_score=0
| foreach *_outlier [| eval anomaly_score=anomaly_score+<<FIELD>>]
| table _time alerts service_name anomaly_score *_outlier
| xyseries _time service_name anomaly_score
| fillnull value=0
This search will allow you to identify the services that have the most unusual volumes of alerts, as shown in the table below.
We could even plot this over time to see what the anomaly scores are – perhaps comparing this to a service health score or something that provides more context about the health of your environment.
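Since each model above is saved to a named model with the into clause, one way to plot the scores over time is to apply those saved models and pipe the result to timechart. The search below is a minimal sketch of that idea (the time range is just an example) and reuses the model and outlier field names trained above:

index=itsi_tracked_alerts earliest=-24h
| bin _time span=5m
| stats count as alerts by _time service_name
| join type=outer service_name [|inputlookup service_community_labels.csv | table src labeled_community | dedup src labeled_community | rename src as service_name]
| eval hour=strftime(_time,"%H")
| apply df_itsi_tracked_alerts_volume
| apply df_itsi_tracked_alerts_service
| apply df_itsi_tracked_alerts_community
| eval anomaly_score=0
| foreach *_outlier [| eval anomaly_score=anomaly_score+<<FIELD>>]
| timechart span=5m max(anomaly_score) by service_name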
There are a couple of cautions with the search that trains these models, however. If you have over 1,000 services or community labels in your data, you may want to train separate density function models (perhaps one per service or community) to manage the load on your Splunk instance.
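As a sketch of that idea, you could train one volume model per community; the community value and model name below are placeholders, so adjust them to match your own community labels:

index=itsi_tracked_alerts
| bin _time span=5m
| stats count as alerts by _time service_name
| join type=outer service_name [|inputlookup service_community_labels.csv | table src labeled_community | dedup src labeled_community | rename src as service_name]
| search labeled_community=1
| eval hour=strftime(_time,"%H")
| fit DensityFunction alerts by "hour" into df_itsi_tracked_alerts_volume_community_1 as alert_volume_outlier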
The other consideration is that these results only tell us about unusual volumes of alerts; they don’t carry much context about what those alerts actually were. Next up, we’re going to look at the different combinations of alert types that make up this information and see if we can use that data to further refine our results.
Here we are going to take two approaches:
- counting the combinations of sourcetypes that generate alerts for each service
- grouping the alert descriptions with the Smart Ticket Insights app for Splunk and counting the combinations of groups for each service
Here we’re looking at the combinations of sourcetypes seen in the tracked alerts index in each five-minute window for each service, and how often each combination occurs. There’s nothing but some simple statistics on display in this search!
index=itsi_tracked_alerts
| bin _time span=5m
| stats values(orig_sourcetype) as sourcetypes by _time service_name
| eval sourcetypes=mvjoin(mvsort(sourcetypes),"|")
| stats count by sourcetypes service_name
| eventstats sum(count) as total by service_name
The results of this search will look something like this:
The ratio of the count to the total tells us how likely each sourcetype combination is for a given service, but for now we’re going to save this as a lookup called expected_itsi_alert_sourcetypes.csv and refer back to it later on.
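One simple way to save it is to append an outputlookup command to the search above (a sketch; any method of creating the lookup will do):

index=itsi_tracked_alerts
| bin _time span=5m
| stats values(orig_sourcetype) as sourcetypes by _time service_name
| eval sourcetypes=mvjoin(mvsort(sourcetypes),"|")
| stats count by sourcetypes service_name
| eventstats sum(count) as total by service_name
| outputlookup expected_itsi_alert_sourcetypes.csv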
Here we’re going to use the Smart Ticket Insights app for Splunk to determine the likely alert descriptions in our data. If you want more details on the app, feel free to read up on it here. We’re going to start with a simple search to return the alert descriptions, IDs and services from our correlation search index.
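Something along these lines will do as the starting search (a minimal sketch based on the fields used in the generated searches later in this post):

index=itsi_tracked_alerts
| table event_id service_name description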
If you then select the relevant fields from the dropdown as per the image below you should get a report on the data. Once the panels have populated, select the single threshold from the dropdown and click on the identify frequently occurring types of tickets button.
On the next dashboard, you may want to modify some of the selections depending on the groups that are identified. I have gone with the defaults, except for choosing not to use the description statistics for the clustering. Once you have checked over the groups and made sure you are happy that the alerts are being grouped sensibly, save the model and move on to the manage smart groups dashboard.
Picking your everything category will present some similar reports back to you, and if you then select a group, an open in search button will appear. Clicking on this will open a search similar to the one below in a new window. We have made a few changes to this search, however: adding the _time field throughout the search, and replacing the last few lines to calculate a few statistics:
index="itsi_tracked_alerts" | table _time event_id service_name orig_sourcetype description | eval type="everything"
| table _time "event_id" "type" "service_name" "description"
| rename "event_id" as documentkey "type" as category "service_name" as subcategory "description" as description
| eval category=if(category="","No Category",trim(category)), subcategory=if(subcategory="","No Subcategory",trim(subcategory))
| where NOT ( subcategory="$result.exclude$")
| eval category=replace(replace(category,"[^A-Za-z0-9| ]","")," ","_")
| search category="everything"
| eval descriptionmv=description, description_fragment=description
| makemv delim=" " descriptionmv
| eval PC_description_length=len(description), PC_words=mvcount(descriptionmv)
| fields - descriptionmv
| rex field=description_fragment mode=sed "s/([\r\n]+)/|/g"
| makemv delim="|" description_fragment
| eval PC_lines=mvcount(description_fragment)
| mvexpand description_fragment
| eval description_fragment_len=len(description_fragment)
| makemv delim=" " description_fragment
| eval words_line=mvcount(description_fragment)
| fields - descriptionmv
| stats max(PC_description_length) as PC_description_length max(PC_words) as PC_words max(PC_lines) as PC_lines avg(description_fragment_len) as PC_avg_line_length avg(words_line) as PC_avg_words_per_line by _time documentkey category subcategory description
| apply tfidf_ticket_categorisation_everything_1605112121
| apply pca_ticket_categorisation_everything_1605112121
| apply gmeans_ticket_categorisation_everything_1605112121
| rename cluster as gmeans_cluster
| join type=outer category gmeans_cluster [| inputlookup ticket_cluster_map.csv]
| table _time documentkey category subcategory description filter_cluster
| eval filter_cluster=if(len(filter_cluster)>0,filter_cluster,"No cluster")
| bin _time span=5m
| stats values(filter_cluster) as groups by _time subcategory
| eval groups=mvjoin(mvsort(groups),"|")
| stats count by groups subcategory
| eventstats sum(count) as total by subcategory
This will produce a very similar table to the one we saw for the sourcetypes above:
What we have now, however, is something that tells us about the expected alert descriptions for each service that we have some correlation searches against, and we’re going to persist this as a lookup called expected_itsi_alert_groups.csv.
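As before, one way to persist it is simply to append an outputlookup command to the end of the search above (a sketch; create the lookup however you prefer):

| outputlookup expected_itsi_alert_groups.csv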
We’re now going to combine all the different approaches we used above to see if we can find the least likely (or the most unusual) alert storms in our data.
Although the search below seems daunting (and rather large!), most of it has been generated for you by the Smart Ticket Insights app for Splunk; we’re just modifying a few lines and calculating some statistics at the end to get our results. Essentially, we need to add the orig_sourcetype field throughout the search. Once that is in place, we also replace everything after the eval statement that fills in alerts without a group, so that we can calculate our statistics, apply the density function models, and enrich the results with the lookups we generated in section 2 of this blog post.
index="itsi_tracked_alerts" | table _time event_id service_name orig_sourcetype description | eval type="everything"
| table _time "event_id" "type" "service_name" "description" orig_sourcetype
| rename "event_id" as documentkey "type" as category "service_name" as subcategory "description" as description
| eval category=if(category="","No Category",trim(category)), subcategory=if(subcategory="","No Subcategory",trim(subcategory))
| eval category=replace(replace(category,"[^A-Za-z0-9| ]","")," ","_")
| search category="everything"
| eval descriptionmv=description, description_fragment=description
| makemv delim=" " descriptionmv
| eval PC_description_length=len(description), PC_words=mvcount(descriptionmv)
| fields - descriptionmv
| rex field=description_fragment mode=sed "s/([\r\n]+)/|/g"
| makemv delim="|" description_fragment
| eval PC_lines=mvcount(description_fragment)
| mvexpand description_fragment
| eval description_fragment_len=len(description_fragment)
| makemv delim=" " description_fragment
| eval words_line=mvcount(description_fragment)
| fields - descriptionmv
| stats max(PC_description_length) as PC_description_length max(PC_words) as PC_words max(PC_lines) as PC_lines avg(description_fragment_len) as PC_avg_line_length avg(words_line) as PC_avg_words_per_line by _time documentkey category subcategory description orig_sourcetype
| apply tfidf_ticket_categorisation_everything_1605112121
| apply pca_ticket_categorisation_everything_1605112121
| apply gmeans_ticket_categorisation_everything_1605112121
| rename cluster as gmeans_cluster
| join type=outer category gmeans_cluster [| inputlookup ticket_cluster_map.csv]
| table _time documentkey category subcategory description filter_cluster orig_sourcetype
| eval filter_cluster=if(len(filter_cluster)>0,filter_cluster,"No cluster")
| bin _time span=5m
| stats count as alerts values(filter_cluster) as groups values(orig_sourcetype) as sourcetypes by _time subcategory
| eval sourcetypes=mvjoin(mvsort(sourcetypes),"|"), groups=mvjoin(mvsort(groups),"|"), hour=strftime(_time,"%H")
| join type=outer subcategory [|inputlookup service_community_labels.csv | table src labeled_community | dedup src labeled_community | rename src as subcategory]
| rename subcategory as service_name
| apply df_itsi_tracked_alerts_volume
| apply df_itsi_tracked_alerts_service
| apply df_itsi_tracked_alerts_community
| eval anomaly_score=0
| foreach *_outlier [| eval anomaly_score=anomaly_score+<<FIELD>>]
| lookup expected_itsi_alert_sourcetypes.csv sourcetypes as sourcetypes service_name as service_name OUTPUTNEW count total
| lookup expected_itsi_alert_groups.csv groups as groups subcategory as service_name OUTPUTNEW count as group_count total as group_total
| eval sourcetype_likelihood=1-(count/total), group_likelihood=1-(group_count/group_total)
| eval anomaly_score=anomaly_score*(sourcetype_likelihood+group_likelihood)
| table _time service_name alerts anomaly_score *_outlier *_likelihood
This search will present you with a result set that has an anomaly score calculated as the sum of our outliers multiplied by the likelihood of the sourcetype and alert group combinations in the data.
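For example, if all three density functions flag a five-minute window for a service as an outlier (an outlier sum of 3), and the sourcetype and group combinations in that window have only been seen 10% and 20% of the time for that service (likelihoods of 0.9 and 0.8), the anomaly score works out as 3 * (0.9 + 0.8) = 5.1. A window with no volume outliers scores 0, no matter how rare its combinations are.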
The table itself isn’t that fascinating, but if we overlay the anomaly scores against one of the service health scores we can see some clear correlations: unusual patterns seem to occur around the times when our service health score is degrading, which could help us understand the root cause if we can trace these alerts back to something more meaningful going on in the environment.
Here we have taken you through how you can use statistical analysis to identify whether you have an unusual number of events, and how similar techniques can be applied to non-numeric data to see if descriptions and sourcetype combinations also appear unusual. Through combining a few different techniques, we have been able to find event storms that appear to correlate with service degradation, hopefully guiding you toward the alerts that really matter!
Hopefully, this has inspired you to apply some unsupervised machine learning to your event data to see what hidden patterns exist in your correlation search results.
Happy Splunking!