Splunk customers are awesome and often come up with interesting new methods for building analytical workflows in Splunk.
Splunk customer Michael Fisher presented a fantastic approach in his .conf2016 session, "Building a Crystal Ball: Forecasting Future Values for Multi-Cyclic Time Series Metrics in Splunk," using techniques he dubbed "Cloning" and "Time Travel." It's pretty compelling stuff! I ran across Michael's work months after he presented, when my own attempts to build the same workflow were running into scale problems; I've used his techniques ever since.
In this example, I want to forecast the future and create interesting anomalies for alerts as the future becomes the present, but I also want to smooth my data and add business rules. I want to use data from the past exactly as we have before, but this time I also want data from around my keys in the past; for example, the 30 minutes before and after 12:15pm on past Thursdays, so I can smooth those behaviors. We are going to use this same technique in later Machine Learning Toolkit (MLTK) examples for creating interesting features, a key requirement for any machine learning solution.
To make this even more complicated, I don’t want to assume the behavior of my data is normal; that is, I don’t want to assume a bell curve of behavior. Toufic Boubez, our VP of Machine Learning and Incubation Engineering, also presented on this topic at .conf2016 in his session, "A (VERY) Brief Introduction to Machine Learning for ITOA," explaining why that’s important.
In addition, I want a repeatable workflow made of macros so I can reuse the whole workflow again and again for different forecast periods and confidence levels or threshold multipliers.
That’s a big list of requirements, but it can be done! Let’s break it down.
Pro Tip: When using the timechart command before this macro, remember cont=f if you want to preserve events with nulls; if you want to run multiple entities through the same macro, use bin and stats instead, like:
..
| bin _time span=10m
| stats count by EntityField, _time
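For the single-series cont=f variant mentioned in the tip above, that might look like this (the span is just an example):
..
| timechart span=10m cont=f count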
Snippet #1: I am going to clone data so I can smooth/average out around a specific time slot. In our previous forecasting examples we only used values from the past that were directly linked to our target time period—for 1pm on a Monday we only used the values from 1pm on all past Mondays. What if I want to use values around 1pm on Monday in the past to smooth my forecasting? I am going to need to clone the data from around the target time in the past.
Snippet #2: Create any custom fields, like upper and lower, and add any individual outlier removal rules, like |eval count=if(count>100,100,count). Do not use |outlier, as it does not support a by clause.
Snippet #3: Multiple events now share the same time field (not _time; this is the new field called time), and every value in that field points to the future (via relative_time, just like in our last examples), so we can aggregate any custom field to that future time point.
Snippet #4: Business rules and, most importantly, Chebyshev's inequality, which assumes nothing about the future distribution. Explicitly, this is where we create a statistical forecast without specifying whether the future will follow a normal curve or not. Note the example uses 90% confidence; if you decrease the confidence level, the bands become tighter around the “average."
You can also replace this step with any of the examples from the MLTK outlier detection assistant, for instance if you want median absolute deviation. This is also where you would clamp the lowerBound field to a minimum value, say, 1 if lowerBound<1.
Pro Tip: Make sure to persist your NULLS if that is part of your workflow.
Snippet #5: We are replacing _time with time, making our time travel complete, exactly like we did in the last post when forecasting a single day into the future.
Snippet #6: We just want to clean up with a timechart in case we want a new aggregation, and finally we remove all the empty events that have no results. We are now ready to save this search as a macro and then call the macro with a collect command, maybe even with a map command if we have to.
Macro version:
….
|timechart count span=10m
Snippet #1:
eval w=case(
(_time>relative_time(now(), "$reltime$@d-5w-30m") AND _time<=relative_time(now(), "$reltime$@d-5w+$days$d+30m")), 5,
(_time>relative_time(now(), "$reltime$@d-4w-30m") AND _time<=relative_time(now(), "$reltime$@d-4w+$days$d+30m")), 4,
(_time>relative_time(now(), "$reltime$@d-3w-30m") AND _time<=relative_time(now(), "$reltime$@d-3w+$days$d+30m")), 3,
(_time>relative_time(now(), "$reltime$@d-2w-30m") AND _time<=relative_time(now(), "$reltime$@d-2w+$days$d+30m")), 2,
(_time>relative_time(now(), "$reltime$@d-1w-30m") AND _time<=relative_time(now(), "$reltime$@d-1w+$days$d+30m")), 1)
| eval shift=case(isnotnull(w),"+"+w+"w-30m,+"+w+"w-20m,+"+w+"w-10m,+"+w+"w-0m,+"+w+"w+10m,+"+w+"w+20m,+"+w+"w+30m,")
| where isnotnull(shift)
| makemv delim="," shift
| mvexpand shift
| eval time=relative_time(_time,shift)
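If you want to see the cloning mechanics in isolation, here is a minimal sketch that runs against no index at all: makeresults builds one synthetic event, and the shift list is hard-coded to the w=5 branch of the case above.
| makeresults
| eval _time=relative_time(now(), "+1d@d-5w")
| eval shift="+5w-30m,+5w-20m,+5w-10m,+5w-0m,+5w+10m,+5w+20m,+5w+30m"
| makemv delim="," shift
| mvexpand shift
| eval time=relative_time(_time, shift)
| eval readable=strftime(time, "%A %Y-%m-%d %H:%M")
One event from five weeks back turns into seven clones whose time values land within 30 minutes of the target slot one day in the future, which is exactly the smoothing window described above.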
Snippet #2:
| eventstats avg($val$) AS pred by time
| eval upper=if($val$>pred,$val$,pred)
| eval lower=if($val$<pred,$val$,pred)
Snippet #3:
| stats avg($val$) AS pred, stdev(upper) AS ustdev, stdev(lower) AS lstdev by time
Snippet #4:
| eval lowerBound=pred-lstdev*(sqrt(1/(1-$confidence$/100)))
| eval upperBound=pred+ustdev*(sqrt(1/(1-$confidence$/100)))
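As a quick sanity check on that multiplier: Chebyshev's inequality says that, for any distribution with mean μ and standard deviation σ,
P(|X − μ| ≥ kσ) ≤ 1/k²
Setting 1/k² equal to the allowed exceedance, 1 − confidence/100, gives k = sqrt(1/(1 − confidence/100)), which is exactly the sqrt() term in the two evals above. At 90% confidence k ≈ 3.16, and at 80% confidence k ≈ 2.24, which is why lowering the confidence tightens the bands.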
Snippet #5:
| eval _time=time
Snippet #6:
| timechart span=10m min(pred) as pred , min(lowerBound) as lowerBound, min(upperBound) as upperBound
| search pred=*
Pro Tip: Remember your Python backfill command. We can backfill searches that use this macro to simulate what values we would have seen, just like we have done repeatedly in this blog series.
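If you use Splunk's fill_summary_index.py script for that backfill, a minimal invocation might look like the following; the app, saved-search name, time range, and credentials are all placeholders for your own environment:
cd $SPLUNK_HOME/bin
./splunk cmd python fill_summary_index.py -app search -name "callcenter_forecast_collect" -et -30d@d -lt now -j 4 -dedup true -auth admin:changeme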
Here’s Michael’s output again, from his presentation:
This is complicated stuff, so let's do another example using the CallCenter.csv sample data. Remember, we should have it in an index from the last blog post, "Cyclical Statistical Forecasts and Anomalies - Part 2," or you can use the |inputlookup bit of SPL to create the upperBound, lowerBound, and pred, again from the previous blog entry.
Here is what that might look like:
index=callcenter source="si_call_volume"
| bin _time span=15m
| stats count by source, _time
| eval w=case(
(_time>relative_time(now(), "+1d@d-5w-30m") AND _time<=relative_time(now(), "+1d@d-5w+3d+30m")), 5,
(_time>relative_time(now(), "+1d@d-4w-30m") AND _time<=relative_time(now(), "+1d@d-4w+3d+30m")), 4,
(_time>relative_time(now(), "+1d@d-3w-30m") AND _time<=relative_time(now(), "+1d@d-3w+3d+30m")), 3,
(_time>relative_time(now(), "+1d@d-2w-30m") AND _time<=relative_time(now(), "+1d@d-2w+3d+30m")), 2,
(_time>relative_time(now(), "+1d@d-1w-30m") AND _time<=relative_time(now(), "+1d@d-1w+3d+30m")), 1)
| eval shift=case(isnotnull(w),"+"+w+"w-30m,+"+w+"w-15m,+"+w+"w-0m,+"+w+"w+15m,+"+w+"w+30m,")
| where isnotnull(shift)
| makemv delim="," shift
| mvexpand shift
| eval time=relative_time(_time,shift)
| eventstats avg(count) AS pred by time, source
| eval upper=if(count>pred,count,pred)
| eval lower=if(count<pred,count,pred)
| stats avg(count) AS pred, stdev(upper) AS ustdev, stdev(lower) AS lstdev by time, source
| eval lowerBound=pred-lstdev*(sqrt(1/(1-80/100)))
| eval lowerBound=if(lowerBound<0, 0, lowerBound)
| eval upperBound=pred+ustdev*(sqrt(1/(1-80/100)))
| eval _time=time
| timechart span=15m useother=false limit=0 cont=false min(pred) as pred, min(upperBound) as high, min(lowerBound) as low by source
| makecontinuous _time
Note the changes to the last two lines—the timechart flags and makecontinuous. In this case, I want the nulls to continue forward as gaps where my forecast couldn’t get data for whatever reason.
What if the customer doesn’t want smoothing using the future data in the Snippet #1 section above? Simply remove the positive time shifts on the line:
| eval shift=case(isnotnull(w),"+"+w+"w-30m,+"+w+"w-15m,+"+w+"w-0m,")
To view the back weeks over time, so you can see the values being smoothed, use:
|timechart ….
| timewrap 1w series=relative
Open a line chart and set the multi-series mode option under Format so you can visually inspect the data being used.
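For the CallCenter data, that inspection search might look like this (index, source, and span match the earlier example):
index=callcenter source="si_call_volume"
| timechart span=15m count
| timewrap 1w series=relative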
Alerting hasn’t changed much from the previous section. We have a summary index, and we want to push the real values into it as they occur.
index=….
| bin _time span=15m
| stats count as Actual by host,_time | collect index=...
Then the alerting search looks at the summary index and fires an alert when Actual is above or below our thresholds, or even when compared to our new prediction, pred. Yes, we will be changing pred to multivariate predictions in future posts.
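Here is a minimal sketch of such an alerting search for a single series, assuming the actuals and the forecast fields (pred, high, low from the timechart above) were both collected into a summary index named summary_forecast (the index name and span are placeholders):
index=summary_forecast
| bin _time span=15m
| stats latest(Actual) as Actual, latest(pred) as pred, latest(high) as high, latest(low) as low by _time
| where isnotnull(Actual) AND (Actual>high OR Actual<low)
Save that as an alert running on a short lookback window and it will fire whenever the observed value escapes the forecast band.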
Just as Michael suggests in his workflow, we can save these searches to macros and arrange the scheduled searches via those macros quickly.
index=… time range
… timechart, or bin and stats by _time
| `ML_Forecast_Macro(count,90,+2d,2)`
| collect index=blah
Every copy of Splunk has an index=_internal. Run the above macro looking for the count of each host, or each sourcetype, or each source and look at the created curves. Do we have a cyclical forecast that makes sense? How would you run your Splunk deployment differently if you had a forecast of each source/host/sourcetype’s usage? What if one of those sources is under the forecasted threshold? What if it is above?
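For example, a single-series run against _internal might look like this; it is only a sketch, assuming the macro was saved as ML_Forecast_Macro with the four arguments shown above and that a summary index named summary_forecast exists:
index=_internal sourcetype=splunkd earliest=-6w
| timechart span=10m cont=f count
| `ML_Forecast_Macro(count,90,+1d,1)`
| collect index=summary_forecast
To run it per host or per sourcetype, extend the macro's by time clauses with the entity field, exactly as the CallCenter example does with source.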
Exactly like in our last blog post in this series, you can customize for holidays with a learned or SME-set threshold value.
That ends the basics of statistical anomalies and forecasting in Splunk; you should have plenty of brilliant alerts for single values moving through time. There are a thousand uses for this technique, so it’s a good thing you created macros to reuse it! You’ll find this approach tackles a lot of use cases but there might be some that it doesn’t.
Don’t worry, this is only the beginning of the clever and sophisticated predictions and alerts you can craft using Splunk and the Machine Learning Toolkit. We have many more stories from our customers and Splunkers who are solving real-world problems every day using the power of this solution. I will share more of those stories and solutions in the next series of blog posts.
Until then, happy Splunking!
Special thanks to the Splunk ML Customer Advisory team including Andrew Stein, Brian Nash and Iman Makaremi.
----------------------------------------------------
Thanks!
Manish Sainani