At this point we are well past the third installment of the first trilogy, and at the end of our second trilogy. You might be wondering if the second set of trilogies was strictly necessary (we’re looking at you, Star Wars) or a great idea (well done, Lord of the Rings, nice complement to the books). Needless to say, detecting anomalies in data remains as important to our customers as it was back at the start of 2018 when the first installment of this series was released.
This time round we will be looking at an amazing use case from one of our customers, who have taken the best bits of the original blog and combined them with a probability density function approach to help detect outliers in very high cardinality datasets. Unfortunately we can’t show you their exact use case, but we will be replicating their technique using one of my favorite datasets: BOTSv2.
That’s right: unfortunately not all data is created equal. Different profiles, periodicity and underlying characteristics mean that anomaly detection can be a varied field for us here at Splunk. Even recently you may have seen some methods for generating baselines, statistics and likelihoods on big data, alongside a decision tree for selecting the right way to detect anomalies.
Despite this variety of approaches, we often have situations where “some of the alerts generated by the existing configuration are false positives, leading to alert fatigue, false interpretation, and improper resource utilization” according to Manish Mittal, a Senior Data Scientist with GovCIO.
It is Manish’s fine work that has inspired this blog post, where he has been looking to detect anomalies in counts of events that occur independently of each other. Now there is a lot to unpack in terms of what that means, but we are talking about situations like:
These are very different from continuous variables, such as CPU utilization or session durations, for which the technique we’re about to take you through would be entirely unsuitable.
Although we’d often suggest using the Machine Learning Toolkit for capturing baselines using the fit command, here we’re going to take a slightly different approach using core SPL and lookups.
Like almost all machine learning methods, the first step for us is to train a model or capture some descriptive statistics from historic data. The statistics that we are interested in are the average number of events for each sourcetype, hour of day and day of week combination, and the number of historic records used to calculate each average (the cardinality).
In this example we will be using all 31 days’ worth of data from the BOTSv2 dataset to collect information about the average number of events generated by each sourcetype for a given hour of day and day of the week. The search can be seen below, where we are counting the number of events over each 10 minute interval for each sourcetype. We then create a key for each sourcetype, hour and day combination, and remove any counts at or above the 95th percentile for that sourcetype to strip some outliers from our training dataset. Once we’ve done this basic outlier removal we collect our baseline statistics into a lookup: the average count and the number of records used to calculate that average.
| tstats count WHERE index=botsv2 BY _time span=10m sourcetype
| eval HourOfDay=strftime(_time,"%H"), DayOfWeek=strftime(_time,"%a")
| eval key=sourcetype."_".HourOfDay."_".DayOfWeek
| eventstats perc95(count) as upper by sourcetype
| where count<upper
| stats count as cardinality avg(count) as average by key
| outputlookup botsv2_event_count_averages.csv
In our example there are just over 12k keys in our lookup, which works just fine as a CSV lookup. In Manish’s real world example he had over 100k keys, which are much better stored in the KV Store to avoid the performance issues associated with huge knowledge objects such as large CSV lookups. If you want to know more about the benefits of using the KV Store for situations like these, Josh Cowling takes you through some of the ins and outs in his blog here.
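If you want to check how many keys your own baseline produces before deciding between a CSV lookup and the KV Store, a quick inspection search will tell you. This is a minimal sketch that assumes the lookup created above; the KV Store route additionally requires a collection and lookup definition that you create yourself (the name event_count_averages_kv below is purely illustrative).

| inputlookup botsv2_event_count_averages.csv
| stats dc(key) as key_count

With a KV Store-backed lookup definition in place, the training search stays exactly the same apart from its final line, which becomes | outputlookup event_count_averages_kv.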
Now that we have our lookup we’re going to use it to enrich new data with the average and cardinality that we have seen in the historic data. Applying a filter based on the cardinality, we are going to ignore data where the average is based on 10 or fewer historic records. Note that this is done behind the scenes with the DensityFunction algorithm in MLTK, where we recommend having a cardinality of 50 or more and actually don’t detect outliers in data that has a cardinality of less than 15. Removing these low cardinality data points means that we are now looking only at records we have reasonable confidence in, as we have seen plenty of historic data for them in the past.
With our current data we are going to use the count of events and the average count of events to calculate the probability of the current count occurring. To do this we are modeling the data as having a Poisson distribution, and have some SPL to determine the probability based on this distribution. That requires calculating the factorial of the current event count (using Stirling’s approximation) and the exponential of the negative mean, and while this may seem like a lot of complicated mathematics, don’t worry: we’ve provided the SPL below. From these two values we can determine the probability, and we have chosen to flag outliers as event counts that have a probability of less than 0.3%.
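For reference, the two standard formulas the SPL below implements are the Poisson probability mass function and Stirling’s approximation to the factorial, where k is the current 10 minute event count and lambda is the historic average from our lookup:

P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k! \approx \sqrt{2\pi k}\left(\frac{k}{e}\right)^{k}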
| tstats count WHERE index=botsv2 BY _time span=10m sourcetype
| eval HourOfDay=strftime(_time,"%H"), DayOfWeek=strftime(_time,"%a")
| eval key=sourcetype."_".HourOfDay."_".DayOfWeek
| lookup botsv2_event_count_averages.csv key as key OUTPUTNEW average cardinality
| where cardinality>10
| eval factorial=sqrt(2*pi()*count)*pow(count/exp(1), count), exponential=exp(-average)
| eval probability=(pow(average,count)*exponential)/factorial
| eval outlier=if(probability<0.003,1,0)
| timechart sum(outlier) as outliers span=1h by sourcetype
As can be seen in the chart below, the number of outliers by sourcetype is a very close match to the overall event count over time in the BOTSv2 data.
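If you wanted to run this on a schedule rather than over the whole dataset, one option (a sketch only, assuming a search scheduled every 10 minutes that scores the most recent complete interval, and that on live data you would point at your own index rather than botsv2) is to restrict the tstats time range and keep just the outlying rows:

| tstats count WHERE index=botsv2 earliest=-20m@m latest=-10m@m BY _time span=10m sourcetype
| eval HourOfDay=strftime(_time,"%H"), DayOfWeek=strftime(_time,"%a")
| eval key=sourcetype."_".HourOfDay."_".DayOfWeek
| lookup botsv2_event_count_averages.csv key as key OUTPUTNEW average cardinality
| where cardinality>10
| eval factorial=sqrt(2*pi()*count)*pow(count/exp(1), count), exponential=exp(-average)
| eval probability=(pow(average,count)*exponential)/factorial
| where probability<0.003
| table _time sourcetype count average probability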
This part is really simple: just go and try out the SPL above on your own indexed data! You may need to select an index that has data actively coming into it to base your searches on, and make sure to select a long time period (30 days or more) for the initial search that generates the lookup.
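As a starting point, the baseline search only needs two small changes to run against your own data: point it at your index and give it an explicit training window. Here is a minimal sketch, where your_index and your_event_count_averages.csv are placeholders for an index and lookup name of your choosing, and which assumes at least 30 days of history are available:

| tstats count WHERE index=your_index earliest=-30d@d BY _time span=10m sourcetype
| eval HourOfDay=strftime(_time,"%H"), DayOfWeek=strftime(_time,"%a")
| eval key=sourcetype."_".HourOfDay."_".DayOfWeek
| eventstats perc95(count) as upper by sourcetype
| where count<upper
| stats count as cardinality avg(count) as average by key
| outputlookup your_event_count_averages.csv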
To learn more about the technique discussed in this blog, please come along to our talk at .conf22, where Manish, Pinar and I will be talking through how you can use ML in Splunk to discover new insights like this. While you’re at it, why not come along to .conf to find out how other customers are using ML to get more out of their data?
If you are interested in alternative methods of detecting outliers in data streams coming into Splunk, check out our webinar on preventing data downtime with machine learning, or the Machine Learning Toolkit (MLTK) tutorial on the very same topic.
Alternatively, if anomaly detection is your thing, please read the previous five entries in this series, or watch back our tech talk on anomaly detection with Splunk machine learning.
Happy Splunking!
------
Very special thanks to Manish Mittal for presenting us with this incredible technique for detecting anomalies, providing all the smarts behind the analytics!
Senior Data Scientist Manish Mittal has cross-functional expertise in medical informatics, pharmacology, biostatistics and biomedical sciences. Manish served on multiple research projects, presented at major conferences and provided statistical, scientific and technical support in the development of novel technologies in healthcare IT and biomedical sciences.
GovCIO is a rapidly growing provider of advanced technology solutions and digital services to the federal government. In the 11 years since its founding, GovCIO has become a leading prime contractor supporting the mission of federal agencies, including the Departments of Defense, Health and Human Services, Homeland Security, Justice, Veterans Affairs, the Pension Benefit Guaranty Corporation, United States Patent and Trademark Office, Government Accountability Office and the Internal Revenue Service. GovCIO’s core capabilities are in cybersecurity, digital services, data analytics, digitization, DevSecOps, and IT modernization. Visit www.govcio.com for more information.
The Splunk platform removes the barriers between data and action, empowering observability, IT and security teams to ensure their organizations are secure, resilient and innovative.
Founded in 2003, Splunk is a global company — with over 7,500 employees, Splunkers have received over 1,020 patents to date and availability in 21 regions around the world — and offers an open, extensible data platform that supports shared data across any environment so that all teams in an organization can get end-to-end visibility, with context, for every interaction and business process. Build a strong data foundation with Splunk.