August 09, 2023

ML-Powered Assistance for Adaptive Thresholding in ITSI

By Om Rajyaguru

Adaptive thresholding in Splunk IT Service Intelligence (ITSI) is a useful capability for key performance indicator (KPI) monitoring. It allows thresholds to be updated at a regular interval depending on how the values of KPIs change over time. Adaptive thresholding has many parameters through which users can customize its behavior, including time policies, algorithms and thresholds.

A time policy is a breakdown of a KPI’s time range into blocks, where each block represents a period in which the KPI is expected to behave consistently from week to week. For example, imagine a KPI that tracks the number of users on an ecommerce store every minute. This user-activity-driven KPI may have one level of values during weekday mornings, lower values during weekday afternoons (when people are likely at work), and yet another level during evenings and weekends (when people have more time). In this scenario, an ITSI user would specify a time policy with three time blocks and configure appropriate thresholds separately for each block.
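To make the idea of time blocks concrete, here is a minimal Python sketch of the ecommerce example. The block names and hour boundaries are assumptions chosen for illustration; they are not ITSI identifiers or defaults.

```python
from datetime import datetime


def time_block(ts: datetime) -> str:
    """Map a timestamp to one of three illustrative time blocks."""
    is_weekday = ts.weekday() < 5  # Monday is 0, Friday is 4
    # Hour boundaries below are hypothetical, picked only for this example.
    if is_weekday and 6 <= ts.hour < 12:
        return "weekday_morning"
    if is_weekday and 12 <= ts.hour < 18:
        return "weekday_afternoon"
    # Everything else (evenings, overnight hours, weekends) shares one block.
    return "evening_or_weekend"


print(time_block(datetime(2022, 10, 17, 9, 0)))  # a Monday morning
```

Each block would then get its own thresholds, so the same KPI value can be normal in one block and abnormal in another.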

The other parameters for adaptive thresholding are policy type and thresholds: ITSI users can specify whether to use standard deviation, range, percentage, or quantile to compute KPI thresholds; and for a given method, what threshold should be used (e.g., if “percentage,” trigger when 20% above normal).

While adaptive thresholding is very flexible and can be tailored to monitor a vast range of KPIs, the available degrees of freedom can make configuration of thresholds difficult. It can take users several hours or more to research, experiment, and choose parameters to configure adaptive thresholding on their KPI. Therefore, we developed a feature that recommends these parameters to make it easier for users to set up adaptive thresholding. As part of Splunk AI for .conf23, ITSI 4.17 includes a preview of ML-Assisted Thresholding (shown below). In the adaptive thresholding configuration UI, this feature recommends time policies and thresholds suited to a user’s particular KPI, using the standard deviation method.

Below, we’ll walk through the algorithm that we developed to provide these recommendations. Additionally, we’ll share information about how our algorithm is implemented within ITSI and how we evaluated our approach. We’ll also highlight our plans to improve the feature for a future ITSI release.

Algorithm

The algorithm of ML-Assisted Thresholding for ITSI Adaptive Thresholding has three main steps:

Detection of potential seasonality patterns in the input time series.
Establishment of the normal behavior of the time series based on the detected pattern.
Calculation of the threshold using the normal behavior of the time series.

The seasonality pattern detection problem is solved using clustering analysis. The likelihood that a time series shows a seasonality pattern, such as a daily or weekly pattern, is quantified by first partitioning the time series into corresponding subsequences and then numerically measuring the quality of the clustering on the collection of subsequences using the silhouette score. Going back to the ecommerce store example, where we derived a time policy with three time blocks, we would expect one cluster for weekday mornings, another for weekday afternoons, and a third for evenings and weekends.

By definition, the silhouette score of a data point in a dataset lies in the range of -1.0 to 1.0. Because we use the median of the per-point silhouette scores to measure the overall clustering quality on a dataset, the value of this overall silhouette score is in the range of 0 to 1.0 most of the time.
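The scoring idea can be sketched in a few lines of Python. This is a simplified illustration, not the ITSI implementation: the source does not say which clustering algorithm is used, so the tiny two-cluster k-means here is a stand-in, and the function names (`partition`, `two_means`, `pattern_score`) are hypothetical.

```python
import math
from statistics import median


def partition(series, length):
    """Split a series into consecutive, equal-length subsequences."""
    return [series[i:i + length] for i in range(0, len(series) - length + 1, length)]


def two_means(points, iterations=20):
    """Tiny deterministic 2-means, standing in for the unspecified clustering step."""
    first = points[0]
    second = max(points, key=lambda p: math.dist(p, first))  # farthest from first
    centers = [list(first), list(second)]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            for p in points
        ]
        for k in (0, 1):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old center if a cluster empties out
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def mean_distance(p, others):
    return sum(math.dist(p, q) for q in others) / len(others)


def silhouette_values(points, labels):
    """Per-point silhouette (b - a) / max(a, b); needs at least two clusters."""
    values = []
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:  # singleton cluster: silhouette is conventionally 0
            values.append(0.0)
            continue
        a = mean_distance(p, own)  # cohesion: mean distance within own cluster
        b = min(                   # separation: nearest other cluster
            mean_distance(p, [q for j, q in enumerate(points) if labels[j] == k])
            for k in set(labels) if k != labels[i]
        )
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values


def pattern_score(series, length):
    """Median silhouette of the clustered subsequences; 0.0 if only one cluster."""
    subsequences = partition(series, length)
    labels = two_means(subsequences)
    if len(set(labels)) < 2:
        return 0.0
    return median(silhouette_values(subsequences, labels))


# Two noise-free weeks: five busy days followed by two quiet days.
workday, off_day = [10, 50, 10], [1, 1, 1]
week = workday * 5 + off_day * 2
print(pattern_score(week * 2, length=3))  # 1.0
```

In this toy input every subsequence is identical to the other members of its cluster, so each per-point silhouette is 1.0 and so is the median. Real KPI data is noisy, which pulls the score below 1.0.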

The figure below illustrates the extreme case of a perfect daily pattern without noise. In this case, the pattern detection will identify the pattern with (maximal) silhouette score 1.0.

The figure below shows another extreme case of a totally random time series. In this case, the pattern detection will not report any pattern, and the silhouette score will be very close to 0.

The time series below exhibits a weekly pattern, with two anomalies: one on 2022-10-17, characterized by lower-than-normal activities, and another on 2022-10-19, featuring unusual activities on a day that is typically inactive.

ML-Assisted Thresholding can detect this pattern and divide the time series into one-day subsequences:

With clustering analysis, the collection of subsequences is divided into two clusters:

The anomaly on 2022-10-19 is clearly visible in the cluster representing off-days. The anomaly on 2022-10-17 can be identified by isolating the 5th workday from the cluster of workdays.
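The source does not describe how an anomalous subsequence is isolated within its cluster. One simple possibility, shown here purely as an assumption, is to compare each subsequence with the cluster's element-wise median profile and flag the ones that sit unusually far from it. The function name and the `factor` cutoff are hypothetical.

```python
import math
from statistics import median


def outlier_indices(cluster, factor=3.0):
    """Return indices of subsequences unusually far from the cluster's median profile."""
    profile = [median(col) for col in zip(*cluster)]  # element-wise median
    distances = [math.dist(seq, profile) for seq in cluster]
    typical = median(distances)
    # factor is an arbitrary illustrative cutoff, not an ITSI setting.
    return [i for i, d in enumerate(distances) if d > 0 and d > factor * typical]


# Four ordinary workdays and one day with lower-than-normal activity.
workdays = [
    [10, 50, 10],
    [11, 49, 10],
    [10, 51, 9],
    [9, 50, 11],
    [2, 5, 2],
]
print(outlier_indices(workdays))  # [4]
```

Using medians rather than means keeps the profile from being dragged toward the anomaly it is meant to expose.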

The threshold that the ML module generates to speed up the configuration of the ITSI Adaptive Thresholding (ITSI AT) is a selected statistic that summarizes the normal behavior as represented by the collection of subsequences without anomalies.

For this release, we focus on the standard deviation method supported by ITSI AT. With the standard deviation method, we calculate the mean and standard deviation of a subsequence and take the maximum of the z-values as the threshold, or boundary, between normal behavior and abnormal behavior. The figure below illustrates the calculated boundary for this time series with a weekly pattern.
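As a rough illustration of the standard deviation method, the sketch below computes a multiplier from values treated as normal. It makes several assumptions that the text above does not confirm: it uses the population standard deviation, takes the maximum of the absolute z-values, and applies the resulting multiplier symmetrically around the mean. The function names are hypothetical.

```python
from statistics import fmean, pstdev


def stdev_multiplier(values):
    """Largest absolute z-value among points treated as normal (assumed definition)."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation, an assumption
    if sigma == 0:
        return 0.0  # a constant series has no spread to scale by
    return max(abs(v - mu) / sigma for v in values)


def boundaries(values):
    """Symmetric lower and upper bounds: mean -/+ multiplier * standard deviation."""
    mu, sigma = fmean(values), pstdev(values)
    k = stdev_multiplier(values)
    return mu - k * sigma, mu + k * sigma


normal_values = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, standard deviation 2
print(stdev_multiplier(normal_values))     # 2.0
print(boundaries(normal_values))           # (1.0, 9.0)
```

With this definition, the boundary is just wide enough to contain every point in the anomaly-free data, so any later value outside it deviates more than anything seen during normal behavior.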

In addition to the aforementioned steps, the Assisted Thresholding algorithm also splits relatively long subsequences into smaller segments to achieve more precise boundaries. The zoomed-in view below demonstrates the division of a one-day subsequence into multiple variable-length segments.

Finally, it's worth noting that the silhouette score returned from the clustering analysis serves as a heuristic measure of confidence for a detected pattern. When the silhouette score is below a predetermined value, the Assisted Thresholding algorithm does not detect any pattern. In such cases, instead of providing threshold recommendations based on patterns, Assisted Thresholding gracefully opts to provide no recommendations, allowing the user to decide how to configure adaptive thresholding.
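The gating logic can be pictured with a short sketch. The cutoff value, the function name, and the idea of picking the best-scoring candidate pattern are all assumptions made for illustration; the actual predetermined value used in ITSI is not stated here.

```python
from typing import Dict, Optional

MIN_CONFIDENCE = 0.5  # hypothetical cutoff, not the value used in ITSI


def choose_pattern(
    scores: Dict[str, float], min_confidence: float = MIN_CONFIDENCE
) -> Optional[str]:
    """Return the best-scoring candidate pattern, or None if none is convincing."""
    if not scores:
        return None
    best = max(scores, key=scores.get)
    # Below the cutoff, report no pattern so that no recommendation is made.
    return best if scores[best] >= min_confidence else None


print(choose_pattern({"daily": 0.2, "weekly": 0.9}))   # weekly
print(choose_pattern({"daily": 0.1, "weekly": 0.05}))  # None
```

Returning `None` instead of a low-confidence guess mirrors the behavior described above: the user stays in control when the data shows no clear pattern.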

Integration with ITSI

The code is delivered through an installable Splunk app without a UI (i.e., an add-on) that is bundled with ITSI. It is invoked through a reporting search command called ‘recommendthresholdtemplate’. This way, the ITSI UI only needs to use the search command and process the results. An example of the search command’s usage and the corresponding results are shown below for the “sample_cyclical_AT_condensed” time series shown earlier.

The ITSI UI processes the output of our custom search command and uses the results to display recommended time policies and thresholds to the user.

Evaluation

To verify that our time series clustering-based approach works, we were given six sample time series with expected time policies, and we made sure that our recommendations met the expected time policies exactly. For example, the sample_cyclical_AT_condensed time series above is expected to have a time policy of multiple blocks of 1-2 hours, with a weekly seasonality, with Tuesday and Wednesday as off-days, and with a daily offset of 7 hours. Even with such a complex weekly and daily seasonality with off-days and off-hours, the silhouette score-based approach was able to capture the pattern. The standard deviation multipliers are numeric values, so we can’t expect to match an exact number. Instead, we were given an expected value and a tolerance range.

User experience was also an important consideration; currently, we are able to process 100,000 events within 10 seconds. The above time series has about 9,000 events, and results were returned in less than 3.5 seconds.

Future Work

The recommendation feature we are releasing for .conf23 is just the start, and we are working to expand upon what we've built for a future release of ITSI.

Currently, we recommend the standard deviation algorithm for use with adaptive thresholding, as this is the most popular method, chosen by over 50% of ITSI customers using adaptive thresholding today. ITSI supports other algorithms as well, including range, percentage, and quantile, and we plan to engage in further discussions with customers to understand when these other algorithms are preferred over standard deviation. We’ll incorporate that feedback to augment our recommendation algorithm.

For some KPIs, such as those that don’t change substantially over time, adaptive thresholding is not needed. For these KPIs, static thresholding is sufficient. We want to further understand when static thresholding is preferred and augment our recommender with this option.

When a long time series is sent to the ML module, the time series may contain pattern switches in the middle. We plan to add the capability to detect pattern switches and notify users accordingly.

Finally, we plan to investigate recommendations for multiple KPIs at one time. ITSI customers have shared that they sometimes want to configure AT for multiple KPIs in one batch rather than for each KPI individually. We will augment our recommender so that it can take in a whole batch of a customer’s KPIs and return recommendations for each KPI. This will facilitate customers’ configuration of AT for complex systems with many KPIs that need to be monitored.

Summary

The Assisted Thresholding feature makes recommendations for time policies, algorithms, and thresholds for ITSI adaptive thresholding. We have presented the algorithm that is used to make these recommendations, as well as explained how we evaluated its accuracy and latency. We have also shared our plans for improvement for a future release of ITSI. Please try out the feature in ITSI 4.17 and offer any feedback at mlsupport@splunk.com.

Co-Authors

This blog was co-authored by Houwu Bai, Kristal Curtis, and Om Rajyaguru.

Houwu Bai is a Senior Data Scientist at Splunk. His work is focused on time series analysis and anomaly detection. Before joining Splunk, he worked for various companies in the San Francisco Bay Area, including SignalFx, Apple, and Arena Solutions. He received his Master's degree in Machine Learning and Pattern Recognition from the Institute of Automation at the Chinese Academy of Sciences.

Kristal Curtis is a Senior Engineering Manager at Splunk. She leads a team that is responsible for delivering machine learning solutions for anomaly detection throughout the Splunk portfolio. Prior to becoming a manager, she worked as an engineer and researcher on various problems related to integrating machine learning into Splunk’s products. Before joining Splunk, Kristal earned her PhD in Computer Science at UC Berkeley, where she was advised by David Patterson and Armando Fox and belonged to the RAD and AMP Labs.

Om Rajyaguru is an Applied Scientist at Splunk working primarily on time series clustering problems, along with methods to fine-tune and evaluate large language models for code generation tasks. He received his B.S. in Applied Mathematics and Statistics in June 2022, where his research focused on multimodal learning and low-rank approximation methods for deep neural networks.
