A Smarter Way to Preprocess Your Data

By Splunk

In May we released the Splunk Machine Learning Toolkit (MLTK) version 5.2. We’ve loved telling you about some of the great new features, including the most recent blog on DensityFunction. However, we know that before you can start experimenting with model-building algorithms such as DensityFunction, your data needs to be prepared for machine learning. Machine learning operates best when you provide clean data as the foundation for building your models. This is a common pain point for customers, which is why the Machine Learning Toolkit enables both self-guided and assisted data preprocessing options. In this blog, we’ll highlight the data preprocessing options available within the guided workflows of the MLTK Smart Assistants (and a selection of Experiment Assistants).

Preprocessing can transform your data into fields that yield better experimentation results, higher-quality models, and more usable visualizations. The Smart Assistants in the MLTK offer preprocessing options within the Learn stage of their step-by-step workflows. Preprocessing steps include algorithms that reduce the number of fields, produce numeric fields from unstructured text, join or extract fields, or rescale numeric fields. As with other parts of the MLTK guided modeling Assistants, each preprocessing step you apply also generates Splunk Search Processing Language (SPL), which you can view using the SPL button within each Assistant.

The Smart Clustering Assistant offers the option to use the StandardScaler algorithm to standardize the data fields, centering each field to a mean of 0 and scaling it to a standard deviation of 1. This standardization keeps one or more fields from dominating the others in subsequent machine learning algorithms and is useful when the fields have very different scales.
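To make the effect of standardization concrete, here is a minimal sketch in plain Python. It is an illustration of the arithmetic only, not the MLTK implementation; the field names and values are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    mean = fmean(values)
    std = pstdev(values)  # assumption: population standard deviation
    if std == 0:
        # A constant field carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical fields with very different scales.
response_time_ms = [120.0, 450.0, 300.0, 980.0, 150.0]
error_rate = [0.01, 0.04, 0.02, 0.09, 0.01]

scaled = {
    "response_time_ms": standardize(response_time_ms),
    "error_rate": standardize(error_rate),
}

for name, column in scaled.items():
    print(name, [round(v, 3) for v in column])
```

After scaling, both fields are on the same footing, so a distance-based clustering algorithm would no longer be driven mostly by the field with the largest raw values.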

The Smart Prediction Assistant offers the option to use FieldSelector to select the best predictor fields based on univariate statistical tests. Available modes are Percentile, K-best, False positive rate, False discovery rate, and Family-wise error rate.
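The K-best idea can be sketched in a few lines of plain Python: score each candidate field against the target with a univariate test, then keep the k highest-scoring fields. The sketch below assumes a numeric target and uses the F statistic of a one-variable linear fit as the test, which is one common choice and may differ from what FieldSelector applies. All field names and values are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant field has no linear relationship to score
    return cov / sqrt(var_x * var_y)


def f_score(xs, ys):
    """F statistic for a one-variable linear fit of ys on xs."""
    r2 = pearson_r(xs, ys) ** 2
    if r2 >= 1.0:
        return float("inf")
    return r2 / (1.0 - r2) * (len(xs) - 2)


def k_best(fields, target, k):
    """Return the names of the k fields with the highest F score."""
    ranked = sorted(fields, key=lambda name: f_score(fields[name], target),
                    reverse=True)
    return ranked[:k]


# Hypothetical candidate predictor fields and target field.
fields = {
    "cpu_load": [0.2, 0.4, 0.5, 0.7, 0.9, 1.0],
    "disk_free": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
    "queue_depth": [3.0, 5.0, 4.0, 8.0, 9.0, 12.0],
}
latency = [110.0, 190.0, 260.0, 340.0, 450.0, 500.0]

selected = k_best(fields, latency, k=2)
print(selected)
```

The other modes follow the same pattern but change the cutoff rule: Percentile keeps a top fraction of fields, while the rate-based modes keep fields whose test p-values pass a threshold.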

Both the Smart Clustering Assistant and the Smart Prediction Assistant offer PCA and KernelPCA as preprocessing algorithm options. Use these algorithms to reduce the number of fields by extracting new, uncorrelated features from the data. PCA and KernelPCA can also reduce the number of dimensions for visualization purposes, for example, to display the data in a scatterplot chart.
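As a rough illustration of how PCA extracts a new feature, the sketch below centers three hypothetical fields, builds their covariance matrix, and approximates the first principal component by power iteration. This is a teaching sketch of linear PCA only: it does not cover KernelPCA, it is not how the MLTK computes its components, and the data is invented for the example.

```python
from math import sqrt


def center(rows):
    """Subtract each column's mean so every field has mean 0."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [[r[j] - means[j] for j in range(dims)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, dims = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1)
             for j in range(dims)] for i in range(dims)]


def first_component(cov, iterations=200):
    """Approximate the top eigenvector of cov by power iteration."""
    dims = len(cov)
    vec = [1.0 / sqrt(dims)] * dims
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims))
               for i in range(dims)]
        norm = sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    return vec


# Hypothetical events with three fields; the second is roughly twice the first.
rows = [
    [1.0, 2.1, 0.5],
    [2.0, 3.9, 0.4],
    [3.0, 6.2, 0.6],
    [4.0, 7.8, 0.5],
    [5.0, 10.1, 0.4],
]

centered = center(rows)
component = first_component(covariance(centered))

# Project each event onto the component: three fields become one new field.
projected = [sum(r[j] * component[j] for j in range(len(component)))
             for r in centered]
print([round(c, 3) for c in component])
print([round(p, 3) for p in projected])
```

Because the first two fields move together, a single new field captures almost all of the variation, which is the sense in which PCA reduces the number of fields. Keeping two components instead of one would give the x and y values for a scatterplot.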

To learn more about these algorithms and other preprocessing options with the MLTK Assistants, check out our User Guide.

Ready to get started? Download the free Machine Learning Toolkit app today and see how you can leverage machine learning with your Splunk data!

This blog was co-authored by Kristal Curtis, Senior Software Engineer

----------------------------------------------------
Thanks!
Mohan Rajagopalan
