Splunk customers process massive amounts of data every day. Most of this data is indexed as events or metrics and contains valuable information about how organizations run their cybersecurity, IT or business operations. Some data streams already constitute implicit time series, while others can easily be aggregated and transformed with SPL commands like tstats. Those results are often used to derive key performance indicators (KPIs) that inform business-critical decisions or simply help “keep the lights on” in operations.
Applying machine learning techniques to such time series data to predict the future is an obvious next step. This is where forecasting methods can work their magic, and you will find a few standard methods ready to use in Splunk’s Machine Learning Toolkit (MLTK), such as ARIMA and StateSpaceForecast. This works well for many cases and is quick and easy to do with MLTK. Just have a look at some of the forecasting examples in MLTK’s showcase area.
However, when you have to use different forecasting algorithms or libraries, or need to scale your forecasts to, say, millions of individual time series (one per entity), you are likely to hit some limits. The good news is that there is a solution pattern that has already been implemented successfully for customers using the Splunk App for Data Science and Deep Learning (DSDL), formerly known as the Deep Learning Toolkit (DLTK), which is freely available on Splunkbase. It helps to overcome the limitations mentioned above.
Typically, when you need to solve a larger or more complex problem, you break it down into smaller parts that fit within your constraints. If you need to compute KPI forecasts for millions of entities on a daily basis, you quickly run into a computational challenge. The good news is that this type of challenge can be solved with parallelization to get the job done in less than 24 hours. By splitting the time series by entity key, you can distribute the data and the forecast calculations across multiple cores, for example using a framework like DASK. Finally, you can send the results back to Splunk via the HTTP Event Collector (HEC). The following diagram shows the core building blocks and the main DASK Python command:
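To make that pattern a bit more tangible, here is a minimal Python sketch of the fan-out step, assuming per-entity KPI data in a pandas DataFrame, one Prophet model per entity and a DASK client connected to a cluster. The file name, column names and forecast horizon are illustrative placeholders rather than the exact production setup:

```python
# A minimal sketch of the fan-out pattern, assuming a pandas DataFrame with the
# columns "entity", "ds" (timestamp) and "y" (KPI value). File name, column
# names and forecast horizon are illustrative placeholders.
import pandas as pd
from dask.distributed import Client
from prophet import Prophet


def forecast_entity(entity_df: pd.DataFrame, periods: int = 30) -> pd.DataFrame:
    """Fit one Prophet model for a single entity and return its forecast."""
    model = Prophet()
    model.fit(entity_df[["ds", "y"]])
    future = model.make_future_dataframe(periods=periods)
    forecast = model.predict(future)[["ds", "yhat", "yhat_lower", "yhat_upper"]]
    forecast["entity"] = entity_df["entity"].iloc[0]
    return forecast


if __name__ == "__main__":
    client = Client()  # connects to a local or remote DASK cluster

    # Split the data by entity key ...
    df = pd.read_csv("kpi_timeseries.csv", parse_dates=["ds"])
    groups = [group for _, group in df.groupby("entity")]

    # ... distribute one forecast task per entity across the workers ...
    futures = client.map(forecast_entity, groups)
    results = client.gather(futures)

    # ... and collect everything into one DataFrame, ready to be sent to HEC.
    all_forecasts = pd.concat(results, ignore_index=True)
```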
With this approach, you can also run different tests to investigate how the computations scale. Building and running a million predictive models on a single CPU core typically takes over 23 days. By adding more CPU cores to the DASK cluster, you can bring this down to a 15-hour runtime on a 36-core cluster, which fits comfortably within a daily 24-hour window. The chart also shows that the approach scales roughly linearly, so if you need to compute more forecasts in the future, you know approximately how much additional compute that will cost.
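As a quick sanity check of that linear behaviour, the reported numbers line up with simply dividing the single-core runtime by the number of cores; the snippet below is a back-of-the-envelope estimate, not an additional measurement:

```python
# Back-of-the-envelope check of the linear scaling described above: roughly
# 23 days of single-core compute for one million models, divided by the number
# of cores, should approximate the observed cluster runtime.
single_core_hours = 23 * 24  # ~552 core-hours in total

for n_cores in (1, 8, 36, 72):
    print(f"{n_cores:>3} cores -> ~{single_core_hours / n_cores:.1f} hours")

# 36 cores -> ~15.3 hours, close to the measured 15-hour runtime, as expected
# for an embarrassingly parallel, per-entity workload.
```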
This blog post described how to scale out a specific forecasting use case to millions of entities with the help of DASK and Prophet. Both libraries ship with examples in the Splunk App for Data Science and Deep Learning, so you should have all the building blocks on hand should you need to build something similar. With the latest version 3.9, there is an easy-to-use HEC class available as well. Needless to say, many other modeling tasks can benefit equally from such a distribution strategy.
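If you want to see what such a hand-off to Splunk looks like under the hood, the generic sketch below posts forecast rows to an HEC endpoint with the requests library. It is not the DSDL HEC class itself, and the endpoint URL, token and index name are placeholder assumptions:

```python
# A generic sketch of pushing forecast rows to Splunk HEC with the standard
# requests library. This is not the DSDL 3.9 HEC class itself; the URL, token
# and index name below are placeholder assumptions.
import json
import requests

HEC_URL = "https://splunk.example.com:8088/services/collector/event"
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"  # placeholder token


def send_to_hec(records, index="forecasts", sourcetype="_json"):
    """Send a batch of forecast rows (list of dicts) to Splunk HEC."""
    headers = {"Authorization": f"Splunk {HEC_TOKEN}"}
    # HEC accepts multiple JSON event objects concatenated in a single POST body.
    payload = "".join(
        json.dumps({"event": record, "index": index, "sourcetype": sourcetype})
        for record in records
    )
    # verify=False only makes sense for lab setups with self-signed certificates.
    response = requests.post(HEC_URL, headers=headers, data=payload, verify=False)
    response.raise_for_status()


# Example usage with the forecasts gathered from the DASK workers:
# send_to_hec(all_forecasts.to_dict(orient="records"))
```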
If you want to learn more about the Splunk App for Data Science and Deep Learning, you can watch this .conf session to explore how BMW Group uses DSDL for a predictive testing strategy in automotive manufacturing.
Happy parallelizing,
Philipp
Credits
A big and bold THANK YOU goes to my former ML Architect colleague Pierre Brunel for pioneering and proving this solution in a production setup. Many thanks to Judith Silverberg-Rajna, Katia Arteaga and Mina Wu for their support in editing and publishing this blog post.