So you want brilliant alerts over big data?
Well, yeah, of course you do! In the previous post, "Cyclical Statistical Forecasts and Anomalies - Part 1," we discussed how to gather key measurements for every entity in a critical system, mix in your business rules and operations policies, and build behavior curves for those metrics that identify anomalies and generate useful alerts, filtering out the noise so you can focus on the events you care about most. We created some interesting alerts based on cyclical anomalies and built a basic-but-working forecast using static lookup files to persist and project past behaviors.
That works great for CSV files and a modest number of entities, from a handful up to a few hundred, but it requires a different approach when you have 15,000 servers and billions and billions of events to process.
So now we'll adapt the workflow and use some Splunk goodness such as summary indexes (or data model accelerations if you have those handy) to operate our forecasts at greater scale.
We’ll use the same CallCenter.csv sample data from the previous post in this series to illustrate the example, although if you have live data you can just replace that part of the search. You can even use index=_internal which should show the cyclic nature of your Splunk instance if it’s been running for a few months or more, but for discussion purposes the examples will use that CSV. Just make the following adaptations:
Since we’re using the Call Center data CSV for the examples, you’ll see the index=callcenter used to search the data streaming into Splunk. If you aren’t using that example data, you’ll replace ‘callcenter’ with whatever index you're using. If you have your data in datamodels or summary indexes already, that's great—just replace the data references below.
We are going to assume a summary index has been created (you can see how to do that on Splunk Docs) and that it’s called “callcentersummary.” We are going to point our searches there and publish results there in this example, but again, that would be your summary index once you have it created. Learn more about summary indexing here.
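If you are setting the summary index up by hand rather than through Splunk Web, a minimal indexes.conf stanza could look something like the sketch below; the paths shown are just the defaults, and the stanza name is whatever you choose to call your summary index.

# indexes.conf - minimal stanza for the summary index used in these examples
[callcentersummary]
homePath   = $SPLUNK_DB/callcentersummary/db
coldPath   = $SPLUNK_DB/callcentersummary/colddb
thawedPath = $SPLUNK_DB/callcentersummary/thaweddb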
Last time, we saved the results from the Splunk Machine Learning Toolkit (MLTK) Numeric Outlier Detection Assistant to a lookup to operationalize the insights. This time, we are going to save the results to the summary index and start with the forecasting technique instead of persisting the statistical behaviors of the past.
Let's begin by making the forecast for tomorrow using the last three weeks of data just for kicks.
Just as before, we are going to take the Numeric Outlier search created by the Assistant and split it into two parts: the upperBound and lowerBound part, and the isOutlier part. This time we need to filter for just the days of the week matching tomorrow (cloning just the data we need) and create the time values for the future (introducing time travel without a DeLorean), too!
index=callcenter
| bin _time span=15m
| stats count by _time,source
| eval this = relative_time(now(),"+1d")
| eval filterday=strftime(this, "%A")
| eval DayOfWeek=strftime(_time, "%A")
| where filterday=DayOfWeek
| eval HourOfDay=strftime(_time, "%H")
| eval BucketMinuteOfHour=strftime(_time, "%M")
| stats avg(count) as avg stdev(count) as stdev max(_time) as time by HourOfDay,BucketMinuteOfHour,DayOfWeek,source
| eval lowerBound=(avg-stdev*exact(2)), upperBound=(avg+stdev*exact(2))
| eval that = relative_time(time,"+7d")
| eval _time=strftime(that, "%m/%d/%Y %H:%M:%S")
| fields lowerBound,upperBound,source,_time
| collect index=callcentersummary
If we add a
| where source="si_call_volume"
to the end, we can see the chart for one source for tomorrow:
This search should be saved as a scheduled search (say CallCenterForecastTomorrow) to trigger at 11:55pm each night, creating the forecast for tomorrow. Alternatively, you can forecast multiple days out, but remember to change the MAX_DAYS_HENCE in props if you go beyond 2 days into the future.
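For reference, that props.conf change could look something like the sketch below. The stanza name is an assumption on my part: collect writes summary events with the stash sourcetype unless you tell it otherwise, so check which sourcetype your summary events actually land with before copying this.

# props.conf - allow timestamps further in the future than the 2-day default
[stash]
MAX_DAYS_HENCE = 7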
To backfill the summary index with past forecasts, use the fill_summary_index.py script that ships with Splunk:
./splunk cmd python fill_summary_index.py -app search -name CallCenterForecastTomorrow -et -1month -lt now -j 1 -dedup true
Note that you can raise the -j flag to run multiple backfill searches at once, depending on your hardware provisioning.
Next, we make a search that adds new values to the summary index as Actual as they occur. Save that search as ActualCallCenter, set the time range to Relative, last 15 minutes, and schedule it to run every 15 minutes (a savedsearches.conf sketch of that schedule follows the search below).
index=callcenter
| bin _time span=15m
| stats count as Actual by _time,source
| collect index=callcentersummary
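If you prefer configuration files to the UI for scheduling, the equivalent savedsearches.conf entry would look roughly like this; treat it as an illustrative sketch rather than a drop-in config.

# savedsearches.conf - sketch of the ActualCallCenter schedule
[ActualCallCenter]
enableSched = 1
cron_schedule = */15 * * * *
dispatch.earliest_time = -15m
dispatch.latest_time = now
search = index=callcenter | bin _time span=15m | stats count as Actual by _time,source | collect index=callcentersummary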
And the matching backfill command for ActualCallCenter:
./splunk cmd python fill_summary_index.py -app search -name ActualCallCenter -et -1month -lt now -j 1 -dedup true
Great. We now have two scheduled searches—one creating the forecast of tomorrow every night at close to midnight, and another creating the actual values to compare our forecast to as the future becomes now. Thanks to backfilling, we can simulate what the last month would have looked like as we roll into the future. We will use the same techniques as we leave statistical forecasts and enter into machine learning projects, so learn to love these commands!
Now, time to get back to our alerts...
Let’s look at just one source from our sample data set so we can make an easy graph to illustrate, and see what our alerts and values would have looked like over the last week.
Pro Tip: I used the source field from an index and fed that into a summary index, where the original source field is renamed to orig_source.
index=callcentersummary
| where orig_source="si_call_volume"
| bin _time span=15m
| eval HourOfDay=strftime(_time, "%H")
| eval BucketMinuteOfHour=strftime(_time, "%M")
| eval DayOfWeek=strftime(_time, "%A")
| stats max(lowerBound) as lowerBound max(upperBound) as upperBound max(Actual) as Actual by _time
| eval isOutlierLow=if(Actual < lowerBound , abs(Actual-lowerBound)/lowerBound, 0)
| eval isOutlierHigh=if(Actual > upperBound, abs(Actual-upperBound)/upperBound, 0)
| eval isOutlier=if(Actual < lowerBound OR Actual > upperBound, abs(Actual-upperBound)/abs(upperBound-lowerBound), 0)
In the graphic above, we can see the Actual events stopped at 30 minutes past midnight on Thursday morning when I took this snapshot, and we have outliers when call volume was abnormally high given our statistical forecast—from data that was just pushed into the summary index!
Awesome.
If you have datamodels, convert the searches to tstats and away you go. If you want to collect the alerts into a summary index or another persistence layer, you can do that too!
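To give a flavor of both ideas, here is a hedged sketch. The datamodel name, object name, and alert index below are invented for illustration, so swap in your own. A tstats version of the actuals search could start with something like:

| tstats count from datamodel=CallCenter.Calls by _time span=15m, Calls.source

And to persist just the alert-worthy rows, you could append something like this to the thresholding search above and save it as its own scheduled search:

| where isOutlier > 0
| collect index=callcenteralerts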
Let’s make a quick debugging dashboard to show where the statistical forecast is coming from: the past data in Splunk! This step will be very useful as we move into more complicated descriptive statistics and into machine learning algorithms, so getting into the habit of building a debugging workflow now will really help later on in our journey. Note that I am using the non-summarized data here; I'm looking at the raw data and checking whether the forecast in my summary index makes sense.
index=callcenter source="si_call_volume"
| timechart span=15m count
| timewrap 1week
So with an easy search looking over a few weeks of data, and the line chart in Splunk with multi-series mode turned on like so:
You can visually see each week that is contributing to your forecast.
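If you want to drop that debugging view onto a dashboard instead of toggling the option in the Format menu, the multi-series setting maps (to the best of my knowledge) to the charting.layout.splitSeries option in Simple XML. A rough panel sketch:

<chart>
  <search>
    <query>index=callcenter source="si_call_volume" | timechart span=15m count | timewrap 1week</query>
    <earliest>-28d</earliest>
    <latest>now</latest>
  </search>
  <option name="charting.chart">line</option>
  <option name="charting.layout.splitSeries">1</option>
</chart>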
In Part 1 of this series, I wanted to get into custom holidays or special cyclical treatment based on business rules, but we ran out of room... :(
I’m going to make up a completely fictitious holiday from the days in my data set, but I want to show the steps you need to take to make a real list. Just as I'm making a special case for holidays, you can make special cases for entities like Server10001, which hosts your CEO’s email; if your CEO has the same volume of email as Doug Merritt, maybe this is as critical to your business as it is to ours. We are going to create a CSV file or lookup via the Splunkbase app Lookup File Editor and maintain a list of holidays and associated values.
Create a CSV with the columns:
Time,isHoliday,isHolidayDefaultValue,isHolidayGroup,isHolidayName
11/25/2017,1,2,Splunk,SplunkDay
For example, with SPL:
| makeresults count=1
| eval Time="11/25/2017"
| eval isHoliday=1
| eval isHolidayDefaultValue=2
| eval isHolidayGroup="Splunk"
| eval isHolidayName="SplunkDay"
| fields - _time
| outputlookup isHoliday.csv
Then, back in the forecast search, add:
| eval time_key = strftime(that, "%m/%d/%Y")
| lookup isHoliday.csv Time as time_key
Pro Tip: Use a time_key field instead of joining on _time for easy control. Splunk does have time-based lookups, but those require a different workflow.
You have a choice: either use hard-coded values based on your knowledge as an SME, or learn different upperBound and lowerBound values for the lookup file from your data! You can use isHolidayDefaultValue as an intelligent replacement for the hard-coded multiplier of 2, so the bounds become avg +/- stdev*exact(isHolidayDefaultValue). Or you can enrich your alerts directly, like |eval isOutlierDougMerritt=if(isHolidayName="SplunkDay", "Danger Will Robinson", ""), and put that field into your alert for added value during your event analytics step. (Don't have an Event Analytics policy? How are you managing 100,000 alerts with your resources? Go check out Splunk IT Service Intelligence.)
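As a concrete sketch of that first option (the multiplier field name is mine, and coalesce simply falls back to the usual 2 on non-holiday days), the bound calculation in the forecast search becomes:

| eval multiplier=coalesce(isHolidayDefaultValue, 2)
| eval lowerBound=(avg-stdev*exact(multiplier)), upperBound=(avg+stdev*exact(multiplier))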
Alternatively, you can use the Holiday names as keys to find new behaviors for holiday groups or specific holidays through time.
| stats avg(count) as avgHoliday stdev(count) as stdevHoliday by isHolidayName,isHolidayGroup
Compare those values to the normal “day of week” traffic that you have already calculated.
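One sketch of that comparison (the bucket field name is mine) is to fold holidays and ordinary weekdays into a single split-by field, so both baselines show up side by side in one table:

index=callcenter
| bin _time span=15m
| stats count by _time,source
| eval time_key=strftime(_time, "%m/%d/%Y")
| lookup isHoliday.csv Time as time_key
| eval DayOfWeek=strftime(_time, "%A")
| eval bucket=coalesce(isHolidayName, DayOfWeek)
| stats avg(count) as avg stdev(count) as stdev by bucket,source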
Victory! Using the Splunk MLTK’s Numeric Outlier Assistant to guide us, we have built a scalable forecasting, thresholding, and alerting mechanism that can be applied to pretty much any type of time series metric. In our next post, we'll use a useful Splunk workflow abstraction, a customer-created macro, and some more advanced statistical methods for determining an outlier, which will be sure to impress your friends at dinner parties.
Until then, happy Splunking!
Special thanks to the Splunk ML Customer Advisory team including Andrew Stein, Brian Nash and Iman Makaremi.
----------------------------------------------------
Thanks!
Manish Sainani