This is Part 2 of a two-part series on custom anomaly detection with Splunk IT Service Intelligence and the Splunk Machine Learning Toolkit v3.2.
You can view Part 1 here.
Let's see if we can improve the model. We left out one field that carries valuable information: _time.
Simply adding _time as a predictor would not do the job: the assistant treats it as a categorical field with too many unique values and discards it as not useful. Instead, we can extract features from _time and use them as additional predictors. The following query extracts day of week, hour, and minute as numeric features. (Check out "Cyclical Statistical Forecasts and Anomalies - Part 1" for more on this.)
| eval date_wday = strftime(_time, "%w"),
date_hour = strftime(_time, "%H"),
date_minute = strftime(_time, "%M")
Using the time features extracted above comes with a caveat. Let's pick date_hour as an example. If we build a model with date_hour as a number, we are telling the model that there is a significant difference between hour 1 and hour 23, while from a human's point of view they can be fairly similar. There are two ways to remove this artificially imposed difference. The first is to make the time features categorical. In that case, hour 1, hour 10, and hour 23 are all equally distinct before fitting a model, and it is up to our data and algorithm to find the hours that behave similarly. The following query appends a non-numeric character to each value, which makes Splunk treat the time features as categorical (string) fields.
| eval date_wday = strftime(_time, "%w")."_",
date_hour = strftime(_time, "%H")."_",
date_minute = strftime(_time, "%M")."_"
Another approach is transforming the time features so that hour 1 is more similar to hour 23, and both of them are equally less similar to hour 12. To represent the cyclical characteristic, we express date_hour with two features (flexing our trigonometry muscles): sin(2π·date_hour/24) and cos(2π·date_hour/24). Let's take a look at the following chart. The cyclical characteristic is obvious now. Looking at the blue line, we can tell that after the transformation hour 1 and hour 23 are similar, and both are different from hour 12.
However, the blue line also says that hour 6 and hour 18 are similar, which can be wrong in the context of a website's traffic. That's why we need the second transformation, the sine component, to introduce the dissimilarity between hour 6 and hour 18.
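A quick sanity check on the numbers makes both points concrete: cos(2π·1/24) ≈ 0.97 and cos(2π·23/24) ≈ 0.97, while cos(2π·12/24) = -1, so hours 1 and 23 land close together and far from hour 12. But cos(2π·6/24) and cos(2π·18/24) are both 0, which is exactly where the sine component earns its keep: sin(2π·6/24) = 1 while sin(2π·18/24) = -1.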
We perform the same transformation on the other two time features we extracted. The following query does it for us.
| eval date_wday = strftime(_time, "%w"),
date_hour = strftime(_time, "%H"),
date_minute = strftime(_time, "%M")
| eval _pi = 3.141592
| eval date_wday_sin = sin(2*_pi*date_wday/7),
date_wday_cos = cos(2*_pi*date_wday/7),
date_hour_sin = sin(2*_pi*date_hour/24),
date_hour_cos = cos(2*_pi*date_hour/24),
date_minute_sin = sin(2*_pi*date_minute/60),
date_minute_cos = cos(2*_pi*date_minute/60)
By adding these time features, our model will be able to capture seasonality at three temporal levels. Depending on the application, you might want to remove some of these time features or add other ones, e.g., month or day of month, as sketched below.
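A monthly cycle, for instance, can be encoded the same way; here is a minimal sketch (date_month is an illustrative field name, and the period is 12):
| eval date_month = strftime(_time, "%m"), _pi = 3.141592
| eval date_month_sin = sin(2*_pi*date_month/12),
date_month_cos = cos(2*_pi*date_month/12)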
Let’s see how these new time features help our model.
Trying it again:
Much better now! The scatter plot is tighter both around zero and at higher values, a problem we had with the previous model. The R^2 of CHRFModel is greater than 0.91, and the residual histogram has a dominant peak at 0 and fairly short tails, if we ignore the few instances that are more like outliers.
But where are the anomalies?
Now that we have a good model that predicts the behavior of the number of handled calls, we can monitor the difference between the model's output and the actual number of handled calls to find anomalies. There are several ways to do this:
Using the assistant for detecting numeric outliers in the Splunk Machine Learning Toolkit (MLTK) to set either a static or dynamic threshold
Using Splunk IT Service Intelligence's (ITSI) Adaptive Thresholding or Trending Anomaly Detection
We expect the residual to behave like Gaussian noise. WHAT? It means that its average stays very close to zero and it deviates around zero randomly. Therefore, setting static thresholds should work fine: for Gaussian noise, roughly 99.7% of values fall within three standard deviations of the mean, so a generous multiple of the standard deviation flags only the truly extreme points.
But how do we set the static thresholds? This is the part that needs some investigation. Let's take the search that calculates the residual to the Detect Numeric Outliers assistant.
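The exact base search was set up in Part 1, but the residual calculation itself is short; here is a minimal sketch, assuming CHRFModel predicts CH. Note the single quotes around 'predicted(CH)' in the eval: they are required because the field name contains parentheses.
<base search with the cyclical time features from above>
| apply CHRFModel
| eval residual = CH - 'predicted(CH)'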
After running the search in this assistant, we choose residual from the Field to Analyze dropdown menu. This menu is populated from the fields in the search results. We keep the threshold method set to Standard Deviation and set the Threshold Multiplier to 6 (for really extreme anomalies). We also keep the Sliding Window box unchecked, because we are looking for static thresholds.
After clicking on Detect Outliers, the assistant generates multiple charts showing the outliers in various ways.
We are happy with the thresholds and the anomalies that are detected. But how can you be sure your thresholds are good enough? There's no easy answer to this question; you need to investigate. If you detect too many anomalies, your false positive rate is probably high, and you should choose a larger threshold multiplier. But choosing a very large threshold multiplier can cause high false negatives instead, because real anomalies stay inside the thresholds. So the path to the right threshold is checking that the anomalies you detect make sense to you, and that the anomalies you already know about also get detected.
Alright, it looks like an upper threshold of 4.5 and a lower threshold of -4.5 work for us. One last note: we could have made a similar guess about the thresholds just by looking at the residual histogram.
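If you'd rather bake these static thresholds directly into a scheduled search instead of rerunning the assistant, a minimal sketch could look like the following (isOutlier is just an illustrative field name):
<the residual search from above>
| eval isOutlier = if(residual > 4.5 OR residual < -4.5, 1, 0)
| where isOutlier > 0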
Now that we have a decent model and know how to use it to find anomalies, we need to make the model available in other apps. By default, the models are only accessible within the scope of the app they are built in. For example, CHRFModel is only available in MLTK. We can easily make this model available globally from MLTK.
Click on Models in the toolbar inside MLTK, find your model, click on Edit, and change Display For from App to All Apps. Once saved, you will see the value under Sharing change from App to Global.
We are good to go back to ITSI.
In ITSI, we go to the Call Center service to add two new KPIs: one for the predicted values of CH, and another for the difference between the predicted and actual values, the residual. Let's start with the first one.
Under the KPIs tab, we choose Generic KPI from the 'new' dropdown menu. Let's call this KPI PredictedCH. The following search recreates the same cyclical time features the model was trained on, applies the CHRFModel model to the data, and produces a new field called 'predicted(CH)'.
index="itsi_summary" serviceid="93a765c0-d2bf-4914-b44b-5a223594f6c5"
| timechart span=15min avg(alert_value) by kpi limit=0
| fields - Call* ServiceHealthScore
| fillnull
| fields _time, CH, CR, *Time, ALO, ALI
| eval date_wday = strftime(_time, "%w"),
date_hour = strftime(_time, "%H"),
date_minute = strftime(_time, "%M")
| eval _pi = 3.141592
| eval date_wday_sin = sin(2*_pi*date_wday/7),
date_wday_cos = cos(2*_pi*date_wday/7),
date_hour_sin = sin(2*_pi*date_hour/24),
date_hour_cos = cos(2*_pi*date_hour/24),
date_minute_sin = sin(2*_pi*date_minute/60),
date_minute_cos = cos(2*_pi*date_minute/60)
| apply CHRFModel
The other KPI that we are interested in is the residual.
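Its search is the same as the PredictedCH search above, with one extra eval at the end to compute the difference; a sketch of the tail:
| apply CHRFModel
| eval residual = CH - 'predicted(CH)'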
Next, under the Thresholding section of the KPI definition, we set the thresholds we calculated earlier.
That's it. We're set. Let's take a look at Deep Dive.
Looks like some anomalies on the Residual KPI were detected in the past 24 hours. Yep, that's right: we made a custom advanced anomaly detection system for some ad hoc KPIs and fed them back into ITSI for actionable intelligence.
You should be ready now to catch an entirely new breed of fishy anomaly, in addition to the ones ITSI is already able to catch for you with its zonkers and woolly buggers. You’ve built composite KPIs made from non-linear machine learning models, able to highlight deep corner-case anomalies relevant to your specific data and business. That’s something you can wear on your silly fishing hat with pride! You also have a deeper understanding of the workflow of anomaly chasing and a blueprint for building additional custom flies to seek out fish both big and small in the rivers of your data.
Note: You need to be on at least Splunk Machine Learning Toolkit v3.2 for the above workflow to work, since earlier versions of MLTK have a bug that was only resolved in v3.2.
----------------------------------------------------
Thanks!
msainani