A couple of months ago, a Splunk admin told us about a painful experience with data downtime. Every morning, the first thing she would do was check that her company’s data pipelines hadn’t broken overnight: she would log into her Splunk dashboard and run an SPL query to get the previous night’s ingest volume for their main Splunk index, making sure nothing looked out of the ordinary. But then the team went on vacation for two weeks, and no one checked the pipeline’s health during that time. When everyone returned, they discovered something terrible: the index had stopped receiving data for the entire two weeks. The team held emergency meetings and scrambled to figure out how to recover the data, but to no avail.
Although that story is an extreme case of data downtime, teams struggle with less severe versions of the same problem every time a data pipeline breaks because of a bug, a misconfiguration, or network issues. Beyond downtime, data spikes are also of interest because they can overstress downstream systems. That’s why, for my summer internship project, I tried to solve this problem with an automatic alerting system that sends alerts whenever there is unexpected downtime or an unexpected spike in ingestion volume. This lets Splunk admins know in real time when something goes wrong, and gives them another 10 minutes in the morning for a second coffee.
There are two common scenarios where admins want to be alerted:
1. A sudden, unexpected spike or outage in ingestion volume.
2. Volume that deviates from its expected pattern given the date and time (a contextual anomaly), such as unusually high weekend volume.
To show what both of these anomalies look like in the real world, rather than in synthetic data, we looked through Splunk’s internal usage. It didn’t take long to find scenarios where ingestion errors had occurred, and we labeled the corresponding parts of the ingest function produced by an internal application.
The graph above shows data ingestion volume for an application that runs internally at Splunk. Volume rises and falls every day: up during the day, down overnight, and down over the weekend. The first red box (1) highlights a weekend with abnormally high volume (it should have been lower), and the second box (2) shows an unexpected, sustained data spike that overwhelmed the system and caused a subsequent outage. In both cases the Splunk admin would have liked to be notified.
In this blog I’ll walk you through the steps I took to complete my summer project with Splunk’s Applied Research team. In the end, I built an ML-powered Splunk dashboard that Splunk chose to demo for customers at .conf21.
Below you can see the dashboard we built for Splunk admins:
At the top of the dashboard (1) we can see information about the latest ingestion volume time point.
The current anomaly score tells us how anomalous the most recent data point is. Below that we can see a time series chart (2) showing the confidence intervals predicted by our ML model.
All points that lie outside the expected volume interval are flagged as either warnings or alerts (3). Here we can also see a higher-resolution breakdown of the past 24 hours.
Warnings correspond to points with moderate prediction errors, while alerts are reserved for points with substantial errors.
After we go through how this works, I’ll share instructions so that you can set it up in your own Splunk instance within about 15 minutes.
Before going into any specific models and methods for this problem, it is important to touch on the overarching structure of the ML system we want to build.
The inputs for this model are (timestamp, event_count) pairs. Each pair tells us the number of events ingested in the hour preceding the given timestamp. Given a list of (timestamp, event_count) pairs, we train a model to predict whether the last pair (the most recent hour) is anomalous.
The graph below shows how this works:
First, the model is trained on past data (in orange) and learns the pattern and trend. Then the model makes a prediction on the most recent time point (in yellow). The model provides live predictions every hour. By continuously retraining on new incoming data, the model adapts to changes in patterns and trends.
An overarching challenge in this project is the subjectivity of anomalies. Different users will define anomalies differently. Therefore, our solution needs to be able to adapt to this subjectivity.
First, there is a high diversity in ingest functions, as seen by comparing the two graphs below:
Second, the detection model needs to contextualize each point within the entirety of the function and, based on that, determine its degree of ‘anomalousness.’ As a result, a static threshold, where everything above or below fixed bounds is flagged, will not work. The contextual anomalies mentioned above are an example of where thresholds fail: if a point is only anomalous given its date-time context, it will most likely lie within any static threshold.
Lastly, seasonality and trend change over time, which makes it difficult to build a robust solution. The model therefore needs to adapt to changes in the function’s behavior. Methods that decompose the function using a predefined seasonality fall short here, because they cannot adapt to shifts in periodicity. A successful model must learn the function’s current periodicity and adapt to new features of the function as time passes.
Techniques Used
One way to measure the degree of anomalousness of a point is to compare it to the mean of the points around it. A straightforward way of doing this is through z-scores: a point X(t) is flagged as anomalous when |X(t) − μ(t)| > k·σ(t), that is, when the observed value deviates from the expected value by more than k standard deviations. Different error metrics can be used here: absolute error, squared error, rolling root mean squared error, exponential error, and so on.
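As a rough illustration, here is a minimal Python sketch of this rule applied to raw volume (the window size and the threshold k are arbitrary choices for the example, not values from the final model):

import numpy as np

def k_sigma_flags(volumes, window=168, k=3.0):
    # Flag points whose distance from the rolling mean exceeds k standard deviations.
    volumes = np.asarray(volumes, dtype=float)
    flags = []
    for t in range(len(volumes)):
        history = volumes[max(0, t - window):t]
        if len(history) < 2:
            flags.append(False)  # not enough history to estimate a mean and deviation
            continue
        mean, std = history.mean(), history.std()
        flags.append(bool(std > 0 and abs(volumes[t] - mean) > k * std))
    return flags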
In order to build a regression model to solve our anomaly detection problem, let’s take a look at our textbook pattern anomaly example from earlier:
The low volume on the last Monday is anomalous given the context that on every other Monday volume rose substantially during the work day. So the model needs “it’s a Monday” as an input to predict that volume will increase. Similarly, it must know “it’s a Saturday” to predict low volume on a weekend.
The model also needs to learn a sinusoidal pattern on weekdays where volume peaks during the day and dips at night. By telling the model it’s 12:00 PM on a weekday, and given that volume has peaked at 12:00 PM on a weekday in every previous week, it will predict that same behavior.
A final and important feature that needs to be accounted for is holidays. Because typical work days differ greatly from off-days, a holiday on a weekday would trigger an alert if the model isn’t told that it’s a special day. By adding a feature that tells the regressor “today is a corporate/national holiday,” the model learns to treat these days differently. Failing to distinguish holidays from workdays is a common complaint about anomaly detection systems.
Our ultimate input for the regression model will appear as follows:
x(t)=[isMonday(t), isTuesday(t), ... , isSunday(t), hourOfDay(t), isHoliday(t)]
y(t)=volume(t)
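For concreteness, here is a minimal Python sketch of how one such feature vector could be built from a timestamp (the holiday list is a placeholder for illustration, not the list used in the project):

from datetime import datetime

HOLIDAYS = {"01-01", "07-04", "12-25"}  # placeholder month-day strings

def featurize(ts: datetime):
    # One-hot encode the day of week (Monday..Sunday), then append hour of day and a holiday flag.
    day_one_hot = [1 if ts.weekday() == d else 0 for d in range(7)]
    is_holiday = 1 if ts.strftime("%m-%d") in HOLIDAYS else 0
    return day_one_hot + [ts.hour, is_holiday]

featurize(datetime(2021, 7, 5, 12))  # a Monday at noon -> [1, 0, 0, 0, 0, 0, 0, 12, 0]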
Then the regressor models a function f : x(t) → ŷ(t), and the prediction error Error(t) = ŷ(t) − y(t) is what we want to minimize.
Then K-sigma can be used for outlier detection on Error(t): if the error deviates greatly from the mean error, the point can be considered an anomaly. K-sigma also does a better job of detecting sustained changes in this setting, because outliers in the error are rare outside of genuinely abnormal predictions. On its own, however, K-sigma cannot account for date-time context and cannot detect pattern anomalies, so our main use for this method was error thresholding.
The means of arriving at f, and the functional form of f, vary across regressors, so we tested multiple models. We first attempted linear regression, which failed due to the complexity of ingest functions. Our two successful approaches were an LSTM regressor and a RandomForest regressor, both of which we tested thoroughly as described later.
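As a rough sketch of the RandomForest approach (using scikit-learn here for illustration rather than the MLTK, and assuming features built as above), the regressor can be combined with K-sigma thresholding of its errors like this:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_and_flag(X, y, k=3.0):
    # X: feature matrix of date-time features, y: hourly event counts (both numpy arrays).
    regressor = RandomForestRegressor(n_estimators=100, random_state=0)
    regressor.fit(X, y)
    errors = regressor.predict(X) - y              # Error(t) = y_hat(t) - y(t)
    z_scores = (errors - errors.mean()) / errors.std()
    return np.abs(z_scores) > k                    # True where the prediction error is anomalous

In the live setting the model is of course fit on past data only and just the newest point is scored; fitting and scoring on the same data here simply keeps the sketch short.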
A similar technique that also looks for deviations in seasonal patterns is STL (Seasonal-Trend decomposition using Loess). The idea behind STL is to decompose our volume function into seasonal, trend, and residual components.
volume(t) = seasonal(t) + trend(t) + residual(t)
Large residuals indicate (volume, timestamp) pairs that do not fit the seasonality and trend of the data. Similarly to error thresholding in the regressors, we can draw a threshold on the residual, |residual(t)| > threshold, and sound an alert whenever it is breached. The model learns the weekly pattern of daily peaks on weekdays and low, flat volume on weekends and accounts for it in the seasonal component, while long-term increases and decreases in the mean are modeled by the trend component.
This approach requires the period length to be known so that the seasonal component can be fit. As a result, STL has trouble adapting to changes in seasonality, since the periodicity is a predefined input, and not every ingestion function has the same periodicity.
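A minimal sketch of the STL approach (using statsmodels, and assuming hourly data with a weekly period of 168 hours; the threshold k is arbitrary):

import pandas as pd
from statsmodels.tsa.seasonal import STL

def stl_residual_flags(volume: pd.Series, period=168, k=3.0):
    # Decompose hourly volume into seasonal + trend + residual, then flag large residuals.
    result = STL(volume, period=period).fit()
    z_scores = (result.resid - result.resid.mean()) / result.resid.std()
    return z_scores.abs() > k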
To test potential models, we developed a simulation of live anomaly detection on synthetic and real ingestion data. We built a system that simulates the live hourly training and prediction that would occur with real-time ingestion data, and we can feed it either synthetically generated or real data.
Our code allows us to create and modify synthetic functions in three ways: first, by adding a spike or outage as described in the introduction; second, by altering a pattern to create a pattern anomaly; and lastly, by adding various trends to the data to simulate real-world long-term behavior.
Using the following test system, we can evaluate any model that fits the live training and prediction structure detailed earlier. For every dataset, whether synthetic or real, we gather performance data for each model and then create composite scores across multiple datasets. The structure of these tests can be seen below:
At the center of the evaluation is our model class, which runs a ‘predict-train-live’ loop over each data point; this loop simulates hourly predictions on real-time data.
class Model:
    def __init__(self, model):
        # wrap an underlying anomaly detection model (e.g. a regressor plus K-sigma thresholding)
        self.model = model

    def train(self, time, volume):
        # train the underlying model on historical (time, volume) pairs
        self.model.train(time, volume)

    def predict(self, time, volume):
        # use the model to predict whether the most recent (time, volume) pair is anomalous
        return self.model.predict(time, volume)

    def predict_train_live(self, obs):
        # run the two methods above iteratively through the data and return a list of predictions
        predictions = []
        for time, volume in obs:
            self.train(time, volume)
            predictions.append(self.predict(time, volume))
        return predictions
Every prediction is either correct or incorrect based on the labels we generated. This yields accuracy data for each individual model corresponding to each data set we test on, allowing us to optimize error thresholding and choose a model.
The metric we optimized for was the F1 score, the harmonic mean of precision and recall. The model trains and predicts, simulating a live feed as described earlier, and its predictions are compared against the ground truth to compute F1 scores.
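Computing the score itself is straightforward; a small sketch with scikit-learn (the labels and predictions below are purely illustrative):

from sklearn.metrics import f1_score

labels      = [0, 0, 1, 0, 1, 0, 0, 1]   # ground-truth anomaly labels, one per hour
predictions = [0, 0, 1, 0, 0, 0, 1, 1]   # what a model flagged for the same hours
print(f1_score(labels, predictions))     # harmonic mean of precision and recall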
We focused on testing two potential regressors: LSTM and RandomForest. For testing, we labeled ingest volume data from eight internal Splunk sources by looking for disruptions in patterns, outages, and spikes. These datasets are independent of customer usage, and Splunk’s corporate wellness days made for helpful controls: Splunk gives a wellness day once a month, so the volume resembles a weekend while the date-time features are those of a weekday.
Test Results Across All Data (F1):
| Random Forest | LSTM  | STL   |
| 0.842         | 0.851 | 0.824 |
All three methods performed similarly on the test data, making each a candidate for the final model. However, we chose the RandomForest model to carry the project forward because it (a) performed well, (b) already exists in the MLTK, and (c) doesn’t require hyperparameter tuning.
All we need to do is featurize the ingestion data and fit a model using a few simple lines of SPL. Then, to detect abnormally large errors, we tack the K-sigma model on.
Note: this requires the MLTK app, which is free and available to install on any Splunk instance. For more information on setting up the MLTK app, go here.
For instructions on setting up the dashboard, head over here. It should take about 15 minutes to set up your own anomaly detection dashboard following the provided steps. Below is a brief explanation of the SPL we used to train our model and make predictions.
Get ingestion volume by hour:
| tstats count where index=[insert your index here] sourcetype=[insert your sourcetype here] groupby _time span=1h
Now we can turn our timestamps into hour, day-of-week, and month-day fields (the numeric weekday and month-day fields are used by the model fit and the holiday flag below):
| eval hour=strftime(_time,"%H")
| eval weekday=strftime(_time,"%a")
| eval weekday_num=strftime(_time,"%w")
| eval month_day=strftime(_time,"%m-%d")
As a side note: because days of the week are strings, not integers, they are treated as categorical variables by default. The search will use one-hot encoding to translate ["Monday"] into [isMonday(t), isTuesday(t), ..., isSunday(t)] = [1, 0, ..., 0].
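Roughly what that encoding looks like, illustrated here with pandas rather than the MLTK's actual internals:

import pandas as pd

df = pd.DataFrame({"weekday": ["Mon", "Sat", "Mon"]})
print(pd.get_dummies(df["weekday"], dtype=int))
# One indicator column per observed day: the first row gets 1 in the Mon column and 0 elsewhere.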
Finally, we add a column that tells us whether it’s a holiday:
| eval is_holiday=case(month_day=="01-01",1,month_day=="01-18",1,month_day=="05-31",1,month_day=="07-05",1,month_day=="09-06",1,month_day=="11-11",1,month_day=="11-25",1,month_day=="12-24",1,month_day=="12-31",1,1=1,0)
Keep in mind that these are just US holidays; admins can also add, for example, global rest days and international holidays. (Additionally, a lookup table can be maintained with holiday data.)
Now it is time to fit the model:
| fit RandomForestRegressor count from hour is_holiday weekday weekday_num into regr as predicted
This creates a model called ‘regr’ that takes inputs of the form hour, is_holiday, weekday, weekday_num and predicts the corresponding volume. We can now apply this model to incoming batches of ingestion data from the same sourcetype and make predictions.
Now in a separate SPL query we can gather the newest data point by running this over the past hour:
| tstats count where index=[insert your index here] sourcetype=[insert your sourcetype here] groupby _time span=1h
And then apply the same featurization process to the newest point:
| eval hour=strftime(_time,"%H")
| eval minute=strftime(_time,"%M")
| eval weekday=strftime(_time,"%a")
| eval weekday_num=strftime(_time,"%w")
| eval month_day=strftime(_time,"%m-%d")
| eval is_holiday=case(month_day=="01-01",1,month_day=="01-18",1,month_day=="05-31",1,month_day=="07-05",1,month_day=="09-06",1,month_day=="11-11",1,month_day=="11-25",1,month_day=="12-24",1,month_day=="12-31",1,1=1,0)
Then the model is applied to the latest date-time data point:
| apply regr as predicted
Then we compare the latest volume data point to our prediction and write the error to a lookup:
| eval error=count-predicted
| outputlookup anomaly_detection_[insert your index here]_[insert your sourcetype here].csv append=true
Both of these searches are scheduled to run every hour, following the batch fit-predict method outlined earlier. Then, to perform K-sigma anomaly detection, we run some SPL to calculate the z-scores of the errors. This also happens hourly, once fitting and prediction have concluded.
Grab the data from the lookup we wrote it to:
| inputlookup anomaly_detection_[insert your index here]_[insert your sourcetype here].csv
Now we calculate the squared error, the percent error, the mean and standard deviation of the squared error, and a z-score:
| eval error_sq=error*error
| eval pct_error=abs(error)/count
| eventstats avg(error_sq) as mean_error
| eventstats stdev(error_sq) as sd_error
| eval z_score=(error_sq-mean_error)/sd_error
Based on the mean and standard deviation of the squared error, we can set upper and lower bounds. These are equivalent to a threshold of k in a K-sigma model (k=3 in this case):
| eval upper_error_bound=predicted+sqrt(3*sd_error+mean_error)
| eval lower_error_bound=predicted-sqrt(3*sd_error+mean_error)
| outputlookup anomaly_detection_[insert your index here]_[insert your sourcetype here].csv
Ultimately, in this project we created a versatile solution for monitoring ingest volume. Using SPL and the MLTK, we assembled and tested an ML model for anomaly detection. The first version of this dashboard was put together during the Splunk Hackathon, where our team won 3rd place in the “Most Valuable” category. We received a lot of positive feedback from the judges, and with a bit more fleshing out we had a comprehensive demo.
Next on the agenda is to find more users, internal or external, to try it out and give feedback. Ultimately, we hope this becomes a built-in feature for all Splunk users, and that the dashboard prevents weeks of data from going missing in the future by providing real-time insight into ingestion.
Check out the Machine Learning Toolkit today to get started using Machine Learning to prevent downtime and ingestion spikes and explore more in our Machine Learning blog series.
This blog was primarily authored by our intern, Francis Beckert. As a Stanford freshman-to-be, she joined our Applied Research team in summer ’21 and hit the ground running. She impressed us with her skills, hard work, and positive attitude. Her work led to this blog, a patent application, and 3rd place in our company-wide hackathon. Thank you Francis!