I’m sure many of you will have tried out the predictive features in ITSI, and you may even have a model or two running in production to predict potential outages before they occur. While we present a lot of useful metrics about a model’s performance at the time of training, how can you make sure that it is still generating accurate predictions once it is live?
It is natural for models to lose accuracy as the underlying data or systems change over time, which is why we usually recommend retraining the predictive models in ITSI on a regular basis to make sure they stay current.
In this blog we will talk about some strategies for monitoring your models in ITSI for model drift. This is the idea that the predictive models will become less accurate over time as the rules that were generated originally no longer match the data they are applied to.
ITSI has a number of commands and macros to help you keep on top of your predictive models. The first port of call for identifying the services that have predictive analytics enabled is the | getservice command as shown below:
| getservice
| search algorithms=*itsi_predict_*
| table serviceid identifying_name algorithms
Here we are filtering the results to return only the services that have active predictive models. Note that the predictive models are stored in the lookups folder of ITSI, so you may also need to check that they are still there…
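If you want to confirm that the underlying model files still exist, the sketch below uses the REST API to list lookup table files. It assumes the MLTK convention of prefixing saved models with __mlspl_, so check the results and adjust the filter if your model files are named differently:

| rest /servicesNS/-/-/data/lookup-table-files
| search title=__mlspl_itsi_predict_*
| table title eai:acl.app updated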
Now because the algorithms field is pretty dense I thought I would break out some of the key bits of information below:
Note that kpiModelsCreatedAt is in epoch time, including milliseconds.
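If you want that timestamp in a readable format, here is a quick sketch. It assumes you have already pulled kpiModelsCreatedAt out into its own field (for example with spath against the algorithms field), and it simply drops the milliseconds before formatting:

| eval model_trained_at=strftime(kpiModelsCreatedAt/1000, "%Y-%m-%d %H:%M:%S")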
Using the model name from the JSON array will help us find the actual predictive models that ITSI is using under the hood! ITSI trains a host of models when you save a predictive model for any given service:
All three model names are extensions of the model ID from the JSON: for example, the average health score predictor is the model ID appended with _avg. Each model can be examined using the | summary command from the MLTK as below.
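For instance, using the model ID from the JSON above, the average health score model could be inspected with something like the following (the exact model name will of course differ for your own services):

| summary itsi_predict_4bf1f146_3b89_4ae7_b8f3_32f536357bc4_RandomForestRegressor_e1201d046501187fa988d848_1588186089405_avg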
Another important point, especially because we have scaled the input KPIs with a standard scaler, is that for the gradient boosting, random forest and linear regression algorithms, the higher a feature's coefficient or importance in the model, the greater its impact on the model's prediction. In other words, if a KPI has a large value in the model summary then it has a large impact on the predicted health score.
Now that we have found our ITSI models it’s time to check if they are still operating well. Thankfully this is all pretty easy using the apply_model macro that ships with ITSI. Using the service ID and the model ID we can run this macro against our data in ITSI to generate some predictions as below:
`apply_model(4bf1f146-3b89-4ae7-b8f3-32f536357bc4,itsi_predict_4bf1f146_3b89_4ae7_b8f3_32f536357bc4_RandomForestRegressor_e1201d046501187fa988d848_1588186089405)`
| table _time next30m_avg_hs predicted(next30m_avg_hs)
This returns our actual values and the predictions to make sure everything is working as hoped:
Next up, we could either eyeball the data that gets returned to figure out if the model is still accurate, or we can calculate some statistics that quantify the accuracy for us.
First up we’ll have a look at the R squared statistic, which is essentially a measure of accuracy: you can loosely read it as a percentage, where 1 means perfect predictions and negative values mean truly awful predictions! Calculating this value can be done with the score command from the MLTK as below:
`apply_model(4bf1f146-3b89-4ae7-b8f3-32f536357bc4,itsi_predict_4bf1f146_3b89_4ae7_b8f3_32f536357bc4_RandomForestRegressor_e1201d046501187fa988d848_1588186089405)`
| table _time next30m_avg_hs predicted(next30m_avg_hs)
| score r2_score next30m_avg_hs against predicted(next30m_avg_hs)
In our case here we are still hitting pretty good accuracy, which is probably down to my test instance being populated by cyclical dummy data feeds… You could at this point set up an alert against this statistic to notify someone if the accuracy drops below a certain value, maybe suggesting that they go and re-train the predictive model. For me a good rule of thumb is that an accuracy above 0.7 is good enough to run in production, but this very much depends on the importance of the service and the risk appetite for poor predictions.
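As a rough sketch of what that alert search could look like, the example below uses foreach to pick up the R squared value without hard-coding the exact name of the field that the score command produces (it assumes the output field name starts with r2_score, which you can confirm from the results of the previous search), and only returns rows when the value drops below the 0.7 threshold:

`apply_model(4bf1f146-3b89-4ae7-b8f3-32f536357bc4,itsi_predict_4bf1f146_3b89_4ae7_b8f3_32f536357bc4_RandomForestRegressor_e1201d046501187fa988d848_1588186089405)`
| table _time next30m_avg_hs predicted(next30m_avg_hs)
| score r2_score next30m_avg_hs against predicted(next30m_avg_hs)
| foreach r2_score* [ eval accuracy='<<FIELD>>' ]
| where accuracy<0.7

Saved as an alert that triggers when results are returned, this would let the right people know when it might be time to retrain the model.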
Another useful view of the predictions is to see how far out they are compared to the actual values we get in the data. Borrowing from the cyclical statistical forecasts and anomalies blog series, we can use some simple statistics to flag any predictions that are unusual compared to what we have seen historically:
`apply_model(4bf1f146-3b89-4ae7-b8f3-32f536357bc4,itsi_predict_4bf1f146_3b89_4ae7_b8f3_32f536357bc4_RandomForestRegressor_e1201d046501187fa988d848_1588186089405)`
| table _time next30m_avg_hs predicted(next30m_avg_hs)
| eval residual=next30m_avg_hs-'predicted(next30m_avg_hs)'
| table _time residual
| eventstats avg(residual) as avg stdev(residual) as stdev
| eval lower_bound=avg-3*stdev, upper_bound=avg+3*stdev
| table _time residual lower_bound upper_bound
This will help you identify the points where your predictive model was less accurate than expected, either missing a service degradation or predicting degradation when in fact none occurred.
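To surface just those points, for example to drive an alert or a dashboard panel, you could append a simple filter to the end of the search above, something like the line below:

| where residual>upper_bound OR residual<lower_bound

Any rows returned are predictions whose error falls outside three standard deviations of the historical residuals.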
Now that we’ve explored some of the ways you can check on the accuracy of your production models in ITSI it should be fairly straightforward to put your searches into a dashboard to display some current metrics about the models’ accuracy.
Although we have focussed entirely on ITSI in this blog, these techniques could easily be applied to any model in Splunk that uses supervised learning. I’d encourage you to read more here about some other approaches to monitoring model drift in Splunk or to check out some of our other content about using machine learning to augment your ITSI instance.
Happy Splunking!