So you want brilliant alerts?
Well, yeah, of course you do. And you’re a sophisticated Splunker—you know your SPL and you’ve been messing around with the Splunk Machine Learning Toolkit enough to start understanding what all this “science” is about. (You don’t? You haven’t? You should probably start here then…). You’re at the keyboard, ready to gather up key measurements for every entity in a critical system, apply your business rules and operations policies into the mix, and build behavior curves for those metrics that can be used to identify anomalies and escalate brilliantly useful alerts above the noise, highlighting events you actually care about. You already know that:
You are going to need a good sample of observable behavior from the past to develop a model,
You will expect to forecast the perceived boundaries of “normal” behavior out into the future, and
You will score new incoming data against those boundaries to determine what severity of alert is appropriate: 1% outside of the boundary might be worth investigating at your next opportunity; 3x the normal boundary probably means somebody has some ‘splainin’ to do!
You know your goal and you know what you’re going to do with this model once it’s working, so let’s get started creating one!
Hold up though—we’re barreling into this without looking closely at our data! That’s a great way to build a spiraling model of doom, so let’s stop for a minute and explore this data and ask a few questions:
Are all of your measurements comparable? That is, does a higher value consistently mean good (or bad) news across them all? You might need to change some of them around, e.g. convert “CPU is 30% free” to “CPU is 70% utilized,” as in the sketch after this list. For discussion purposes, we’re going to assume high numbers mean bad news, but make sure you pick one convention and stick with it.
Are there nulls in your data? How are you going to treat them? You might want a separate alert on the presence and frequency of nulls, since keeping an eye on the cleanliness of the data your model is digesting helps you monitor the model’s health. Don’t feed your model garbage!
We’re going to be using historical data to forecast forward, so make sure you’ve had a look into your data’s history! Using a time period that is typical for your systems will create a model that forecasts equally-typical behavior; a time period where systems were behaving oddly will expect that oddness to continue. Make sure you are using a sufficient period of history to smooth out these kinds of exceptions.
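Here’s a minimal sketch of the first two points above, with placeholder field names (pct_free, my_metric) standing in for whatever your own measurements are called:
...
| eval pct_utilized = 100 - pct_free   ``` flip "30% free" into "70% utilized" so higher always means worse ```
| eval is_null = if(isnull(my_metric), 1, 0)   ``` flag missing measurements ```
| stats sum(is_null) as null_events count as total_events by source
...
An alert on null_events creeping upward tells you the model’s inputs are getting dirty before the model itself starts misbehaving.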
Okay, that was a valuable stop along the journey, but now let’s get on this train and go for a ride. As it’s always wise to make sure you’re on the right train before it departs, let’s confirm: We’re going to forecast a future set of boundaries based on measurements we have from the past, giving us boundaries to alert against for new events as they arrive, and we’re going to do so with the simplest (yet still useful) statistical approach we can. There will be a chance later to add layers of analysis and make everything more robust, but this first stop on the train is just about getting a usable, useful solution quickly. On the right train? Good.
We’ll start in the Splunk MLTK Outlier Detection Assistant. We’ll use this workflow to create a base search that scores our events based on the statistical behaviors found during that search. For discussion purposes, we’re using some call center data linked here and we’ll leverage the Standard Deviation/Averages method; but it’s easy enough to swap that out (once you have it working) for one of the other statistical methods available in the assistant’s dropdown menu.
We will identify anomalies by analyzing the behavior of all the events in a given search; to compare a new event to that behavior baseline, we need to store the behaviors in some kind of saved state. As this assistant doesn’t use the fit and apply commands (which would persist our results for us), we’ll have to force that persistence by hand.
Pro tip as you're preparing your data: “by” clauses added to the streamstats, eventstats, or stats operations in the preparatory search might be a great way to split events by entity types, time spans, etc., that make these behavior predictions more useful. You will see we’ve used those in the search below, and that’s why.
We are using “count” as the metric of interest, with source, HourOfDay, BucketMinuteOfHour, and DayOfWeek as entity types:
| inputlookup CallCenter.csv
| eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S")
| bin _time span=15m
| eval HourOfDay=strftime(_time, "%H")
| eval BucketMinuteOfHour=strftime(_time, "%M")
| eval DayOfWeek=strftime(_time, "%A")
| stats avg(count) as avg stdev(count) as stdev by HourOfDay,BucketMinuteOfHour,DayOfWeek,source
| eval lowerBound=(avg-stdev*exact(2)), upperBound=(avg+stdev*exact(2))
| fields lowerBound,upperBound,HourOfDay,BucketMinuteOfHour,DayOfWeek,source
| outputlookup state.csv
Breaking this down, you can see we’ve split our stats command by HourOfDay, BucketMinuteOfHour, DayOfWeek, and source, and created an upperBound and lowerBound; those will all become the column headings in the resulting data table.
This application warranted 15-minute buckets, but you can use any bucket size that makes sense for your data frequency and business needs. Wider time ranges will tend to smooth out peaks and troughs more, narrower ones will be more sensitive to variation.
We might schedule this search to run every night at midnight with a data window going back five or more weeks; the result will contain an upperBound and lowerBound value for every source at every combination of hour, minute, and weekday based on that history. Because we break the bounding calculations out by day of week, the thresholds for Mondays are separate from those for Saturdays, and because we split the calculation by hour of day, the thresholds for 11:00am are calculated independently from those for 3:00am.
You can see now why it was important to choose the time period of this data carefully: we don’t want a holiday long weekend driving the forecast for a normal Monday, so make sure enough normal Mondays are included in the historical sample.
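If you want the nightly run to consider a fixed window of history rather than everything in the sample file, a single where clause near the top of the preparatory search does the trick; the five-week offset here is just an example:
...
| eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S")
| where _time >= relative_time(now(), "-5w@d")   ``` keep only the last five weeks of history ```
| bin _time span=15m
...
Against indexed data, you would get the same effect by setting the scheduled search’s time range to something like earliest=-5w@d.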
The results of our nifty threshold creation are stored via outputlookup in a local file. Congratulations, your data behaviors are now persistent!
Now that we have our thresholds persisted, we can recall them at any time and evaluate whether our thresholds have been exceeded.
| inputlookup CallCenter.csv
| eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S")
| bin _time span=15m
| eval HourOfDay=strftime(_time, "%H")
| eval BucketMinuteOfHour=strftime(_time, "%M")
| eval DayOfWeek=strftime(_time, "%A")
| stats max(count) as Actual by HourOfDay,BucketMinuteOfHour,DayOfWeek,source,_time
| lookup state.csv HourOfDay as HourOfDay BucketMinuteOfHour as BucketMinuteOfHour DayOfWeek as DayOfWeek source as source OUTPUT upperBound lowerBound
| search Actual="*"
| eval isOutlier=if(Actual < lowerBound OR Actual > upperBound, 1, 0)
| table _time,isOutlier,Actual,upperBound,lowerBound,source
| where source="si_call_volume"
Depending on your use case (the number of entities, the length of time required to establish a useful “normal,” the awe-inspiring beefiness of your hardware), the answer to “does it scale?” is, of course, “it depends.” As entity counts grow, gathering and writing this lookup might start to consume a few cycles, but on the other hand we’re only running it once per day right now. We could further mitigate that impact by slicing the work apart by day of week, turning one big problem into seven problems each one-seventh the size. If the search above is scheduled to run each day at midnight, only the previous day’s data is new at each run; when calculating new thresholds for a Monday we only care about data from previous Mondays, so exclude the other days up front with a filter on date_wday (or its equivalent), as sketched below.
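Against indexed data you could filter on date_wday directly in the base search; with our lookup-based sample, the equivalent is a where clause on the computed DayOfWeek field. Here’s a rough sketch of the Monday-only run (the per-weekday lookup file name is just an illustration):
| inputlookup CallCenter.csv
| eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S")
| bin _time span=15m
| eval HourOfDay=strftime(_time, "%H")
| eval BucketMinuteOfHour=strftime(_time, "%M")
| eval DayOfWeek=strftime(_time, "%A")
| where DayOfWeek="Monday"   ``` recalculate Mondays only; schedule one copy of this search per weekday ```
| stats avg(count) as avg stdev(count) as stdev by HourOfDay,BucketMinuteOfHour,DayOfWeek,source
| eval lowerBound=(avg-stdev*exact(2)), upperBound=(avg+stdev*exact(2))
| fields lowerBound,upperBound,HourOfDay,BucketMinuteOfHour,DayOfWeek,source
| outputlookup state_monday.csv   ``` hypothetical per-weekday lookup; the scoring search would read the matching file ```
Seven smaller searches, each run on its own day, each touching only a seventh of the data.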
There are, of course, more sophisticated ways to conduct this analysis at even greater scale. But that isn’t the train we’re on today, is it? Now that you understand the workflow and can do it the “quick and dirty” way, the next article will elevate your approach and get you a ticket on a faster, fancier train.
While we are here though—on this train—let’s ride it all the way to our destination. The next step is to create alerts using these new thresholds.
We’ve actually already completed the critical step toward a working alert, with this bit of SPL from the code block above:
...
| eval isOutlier=if(Actual < lowerBound OR Actual > upperBound, 1, 0)
...
With that neato eval command we have identified any events that exceed either our upper or lower bound thresholds and marked those events with a flag of isOutlier=1. If we set isOutlier as the chart overlay data series in a line chart, we can see the outlier events in time.
That’s displaying all outlier events, whether above the upper threshold or below the lower one. That’s easy to implement, but in the real world exceeding an upper threshold probably means something very different than falling below a lower one (the server room is on fire vs. the ice machine is overflowing). So we can split that apart into separate values for the upper and lower threshold by replacing that statement with the following:
...
| eval isOutlierLow=if(Actual < lowerBound, abs(Actual-lowerBound)/lowerBound, 0)
| eval isOutlierHigh=if(Actual > upperBound, abs(Actual-upperBound)/upperBound, 0)
...
For an even more interesting alert, we’ve also added a bonus calculation for how significant the outlier is relative to the threshold, which will help us separate just-over-the-line outliers from catastrophic events. It might also be interesting to append the source name, or any other value that adds a little context, to the alert:
...
| eval isOutlierHigh=if(Actual > upperBound, abs(Actual-upperBound)/upperBound, 0)
| eval isOutlierHigh=isOutlierHigh."-".source
...
Here’s the complete example as one block of SPL, using our Call Center data set, where we flag both high and low anomalies and score each one by how far it falls outside its threshold:
| inputlookup CallCenter.csv
| eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S")
| where source="si_call_volume"
| bin _time span=15m
| eval HourOfDay=strftime(_time, "%H")
| eval BucketMinuteOfHour=strftime(_time, "%M")
| eval DayOfWeek=strftime(_time, "%A")
| stats max(count) as Actual by HourOfDay,BucketMinuteOfHour,DayOfWeek,source,_time
| lookup state.csv HourOfDay as HourOfDay BucketMinuteOfHour as BucketMinuteOfHour DayOfWeek as DayOfWeek source as source OUTPUT upperBound lowerBound
| search Actual="*"
| eval isOutlierLow=if(Actual < lowerBound , abs(Actual-lowerBound)/lowerBound, 0)
| eval isOutlierHigh=if(Actual > upperBound, abs(Actual-upperBound)/upperBound, 0)
| eval isOutlier=if(Actual < lowerBound OR Actual > upperBound, abs(Actual)/abs(upperBound-lowerBound), 0)
| fields _time, Actual, lowerBound, upperBound, isOutlier, isOutlierLow, isOutlierHigh, source
To make sure that chart impresses people in the board room, be sure to set isOutlier, isOutlierLow, and isOutlierHigh as Chart Overlay series in the chart format options. Look how clever that looks!
What we have so far is a useful and (hopefully!) not overly complicated method for identifying and alerting on outliers in a data set of existing events. The last step in converting this from an analysis of the past to an analysis of the “now,” and truly generating useful monitoring alerts, is to extend these upper and lower bound thresholds a bit into the future, far enough to compare them to new events as they arrive. The makeresults command is going to help us here by literally creating events in the event stream, carrying our upper and lower bound thresholds forward so we have boundaries to compare new events against.
Here’s the SPL to make that happen, using the output file from our previous step and a look-forward window of 2 days:
| makeresults count=2
| streamstats count as count
| eval time=case(count=2,relative_time(now(),"+2d"),count=1,now())
| makecontinuous time span=15m
| eval _time=time
| eval HourOfDay=strftime(_time, "%H")
| eval BucketMinuteOfHour=strftime(_time, "%M")
| eval DayOfWeek=strftime(_time, "%A")
| eval source = "si_active_agents,si_call_volume,si_groups_mapping,si_kpi_elements_cti_asgi,si_kpi_elements_sgi"
| makemv delim="," source
| mvexpand source
| lookup state.csv HourOfDay as HourOfDay BucketMinuteOfHour as BucketMinuteOfHour DayOfWeek as DayOfWeek source as source OUTPUT upperBound lowerBound
| fields _time, source,upperBound,lowerBound
| where source="si_call_volume"
In this query block we have a few key steps. First, we have used the now() function with a relative_time offset of “+2d” to specify that we would like results generated from the present moment out to two days into the future. Because we’ve anchored on now(), this search can be run at any time and will always project forward a full two days of thresholds.
Then we leveraged makecontinuous to create events in the time stream every 15 minutes, because that is the bin size we chose earlier and we want the two to match.
What’s most important here is that we’ve used our lookup file created in the first step and all its upper and lower bound threshold values for every 15-minute increment of every hour of every day to draw out these new boundaries. So if we are running this query on a Monday, the threshold values for every 15-minute block for the rest of today (Monday) as well as Tuesday and part of Wednesday (48 full hours from now) will be populated into our “future” time stream.
Based on these slick new forecasted upper and lower bounds, we now not only have a sense of what this metric is likely to do over the next 2 days (capacity forecasting, anyone?) but we also have a boundary with which to analyze a new event that arrives—say, right now—to determine if it’s an outlier against our upper or lower boundaries. Yeah, we’re fancy!
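To turn those boundaries into an actual alert, you could schedule a search like the sketch below every 15 minutes over the most recent bucket and trigger whenever any row comes back. The index and source names here are placeholders for wherever your live call-volume events land; the scoring logic is exactly what we built above:
index=call_center source="si_call_volume" earliest=-15m@m latest=@m   ``` hypothetical index; swap in your own data source ```
| bin _time span=15m
| stats count as Actual by _time,source
| eval HourOfDay=strftime(_time, "%H")
| eval BucketMinuteOfHour=strftime(_time, "%M")
| eval DayOfWeek=strftime(_time, "%A")
| lookup state.csv HourOfDay as HourOfDay BucketMinuteOfHour as BucketMinuteOfHour DayOfWeek as DayOfWeek source as source OUTPUT upperBound lowerBound
| eval isOutlierLow=if(Actual < lowerBound, abs(Actual-lowerBound)/lowerBound, 0)
| eval isOutlierHigh=if(Actual > upperBound, abs(Actual-upperBound)/upperBound, 0)
| where isOutlierLow > 0 OR isOutlierHigh > 0   ``` any surviving row should trigger the alert ```
Set the alert to trigger when the number of results is greater than zero; the isOutlierLow and isOutlierHigh values give you ready-made severity scores for the alert message.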
Victory! Using the MLTK’s Numeric Outlier Assistant to guide us, we have built a basic forecasting, thresholding, and alerting mechanism that can be applied to pretty much any type of time series metric. We know it has some scale limitations, so in the next post in this series we’ll explore a way to mitigate that, along with some more advanced statistical methods for determining an outlier that will be sure to impress your friends at dinner parties.
Until then, happy Splunking!
Special thanks to the Splunk ML Customer Advisory team including Andrew Stein, Brian Nash and Iman Makaremi.
----------------------------------------------------
Thanks!
Manish Sainani