As described in Splunk Vulnerability Disclosure SVD-2022-0624, a set of SPL (Search Processing Language) commands is classified as risky, because incorrect use of these commands may lead to a security breach or data loss.
As a precaution, the Splunk Search app displays a warning dialog before it executes any of these commands. However, in some scenarios this safeguard can be bypassed. That leaves a vulnerability that malicious users can exploit to escalate privileges, collect security data, or delete the data being queried.
Rules can be defined to find searches that use these risky commands. Even so, it is difficult, if not impossible, to identify the searches that maliciously exploit these vulnerabilities without producing a large number of false positives. It is therefore desirable to detect risky command misuse or abuse with machine learning (ML) algorithms, in addition to rule-based detections, to pinpoint true threats more precisely.
One goal of malicious exploits of these risky commands is to exfiltrate data. We expect such exploits to run unusually long compared with benign searches that contain risky commands. We therefore assume that an anomalous search run time indicates that an attacker is exploiting a risky command vulnerability. Based on this assumption, we developed a machine learning approach that models each user's search run time with risky commands and detects anomalies, so that analysts are alerted to possible threats.
The approach uses the time spent executing these risky commands as a proxy for misuse or abuse during an investigation or hunt. The detection builds a model with the Machine Learning Toolkit (MLTK) DensityFunction algorithm on Splunk audit log data. The model is trained on each user's history of running the risky commands. The total search run time of these commands is aggregated per hour and serves as the user-behavior indicator for anomaly detection.
We build our detection on Splunk audit data, specifically the search activities in the audit data model. The related data fields used in this detection include ‘search’, ‘search_type’, ‘user’ and ‘total_run_time’.
The ‘search’, ‘search_type’ and ‘user’ fields serve as filters that narrow the detection scope to searches that could be correlated with risky command exploits. The ‘total_run_time’ field is used to model user behavior. We use the ‘bin’ command to aggregate the events into hourly intervals, which suppresses noise. The aggregation also yields a second numerical field, ‘count’, which is the number of runs in each hour.
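The hourly aggregation can be reproduced outside Splunk with a minimal Python sketch. It assumes each audit event has already been reduced to three values: an epoch timestamp, a user name, and its ‘total_run_time’ (assumed to be in seconds). The function name bin_hourly and this event layout are hypothetical. Each event is snapped to the start of its hour, as a one-hour ‘bin’ does, and run times are then summed and counted per user and hour.

```python
from collections import defaultdict

# Hypothetical event layout: (epoch seconds, user, total_run_time in seconds).
Event = tuple[int, str, float]


def bin_hourly(events: list[Event]) -> dict[tuple[str, int], tuple[float, int]]:
    """Mimic a one-hour bin followed by a per-user sum and count.

    Returns {(user, hour_start_epoch): (summed run time, number of runs)}.
    """
    buckets: dict[tuple[str, int], list[float]] = defaultdict(list)
    for ts, user, run_time in events:
        hour_start = ts - ts % 3600  # snap the timestamp down to the start of its hour
        buckets[(user, hour_start)].append(run_time)
    # Sum gives the hourly run-time indicator; len gives the extra 'count' field.
    return {key: (sum(times), len(times)) for key, times in buckets.items()}
```

Only hours that contain at least one matching event produce a row, just as in a stats-style aggregation.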
On closer inspection, the two numerical variables are strongly correlated, as shown in the figure below, because both are derived from the same log events. We therefore choose ‘total_run_time’ as the single indicator. This also lets us use MLTK DensityFunction as the underlying algorithm, since it currently supports only univariate data.
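To check this redundancy on your own data, a minimal Python sketch can compute the Pearson correlation between the hourly ‘count’ and ‘total_run_time’ series. Pearson correlation is our choice of measure; the post does not say which statistic underlies the figure. A value close to 1 means the two series carry largely the same information, which supports keeping a single univariate indicator.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

For example, hourly counts of 1 to 5 paired with run-time totals that grow roughly twice as fast give a coefficient above 0.99.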
In our detection, ‘total_run_time’ from the past 7 days is fed to the MLTK ‘fit’ command, with ‘user’ in a ‘by’ clause, to train a per-user baseline model of behavior. The model can then be used to monitor new search activity continuously.
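Per-user training can be approximated with a simplified stand-in written in Python. It keeps only the hourly totals from the last seven days and fits one normal distribution per user with statistics.NormalDist.from_samples. The normal family is an assumption made for brevity: the detection itself uses dist=auto and MLTK's own fitting code. The function name fit_per_user is hypothetical.

```python
from collections import defaultdict
from statistics import NormalDist

SECONDS_PER_DAY = 86_400


def fit_per_user(hourly: dict[tuple[str, int], float], now: int,
                 window_days: int = 7) -> dict[str, NormalDist]:
    """Fit one normal distribution per user from hourly run-time totals.

    `hourly` maps (user, hour_start_epoch) -> summed run time for that hour.
    Only hours inside the last `window_days` days are used (a 7-day baseline).
    Simplification: a normal fit stands in for DensityFunction's dist=auto.
    """
    earliest = now - window_days * SECONDS_PER_DAY
    per_user: dict[str, list[float]] = defaultdict(list)
    for (user, hour_start), run_time in hourly.items():
        if earliest <= hour_start < now:
            per_user[user].append(run_time)
    # from_samples needs at least two points; users with less history are skipped.
    return {user: NormalDist.from_samples(values)
            for user, values in per_user.items() if len(values) >= 2}
```

Grouping by user before fitting plays the role of the ‘by’ clause. Each user gets an independent model, so a heavy user's long searches do not inflate a light user's baseline.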
The ‘apply’ command then uses the model to infer whether a user's ‘total_run_time’ in the last hour is anomalous, and raises an alert for a potential exploit of risky commands if it is. The overall detection flow is presented in the diagram below. With the upper threshold set to 0.1%, the model flags hourly run times that fall in the top 0.1% of a user's learned distribution as potential exploits of these risky commands. Users can raise or lower this 0.1% threshold to balance the true positive and false positive rates that are acceptable in their environment.
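This sketch continues the same normal-distribution simplification with a hypothetical is_anomalous helper. The helper turns the upper tail area into a run-time cutoff with NormalDist.inv_cdf and compares the last hour against that cutoff. It returns a plain boolean rather than reproducing the output fields that MLTK's ‘apply’ adds.

```python
from statistics import NormalDist


def is_anomalous(model: NormalDist, last_hour_run_time: float,
                 upper_threshold: float = 0.001) -> bool:
    """Flag the last hour if it lies in the upper tail of the user's baseline.

    upper_threshold=0.001 corresponds to the top 0.1%; raising it flags more
    hours (more alerts, more false positives), lowering it flags fewer.
    """
    cutoff = model.inv_cdf(1.0 - upper_threshold)  # run time at the tail boundary
    return last_hour_run_time > cutoff
```

Suppose a user's baseline is modeled as NormalDist(12, 2). An hour totaling 16 is not flagged at the default 0.1% threshold (cutoff about 18.2), but it is flagged when the threshold is raised to 5% (cutoff about 15.3).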
Users can also change the baseline window, currently 7 days, depending on how frequently searches run in their environment. The principle is that the baseline should contain enough data points for the model to represent each user's normal behavior.
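One way to check the "enough data points" principle before training is a minimal Python sketch. The hourly aggregation produces a row only for hours in which a user actually ran a matching search. An occasional user may therefore have far fewer than the 168 hourly points a 7-day window can hold. MIN_HOURLY_POINTS = 24 is a hypothetical placeholder, not a Splunk recommendation.

```python
from collections import Counter

# Hypothetical minimum; 7 days contain at most 7 * 24 = 168 hourly points per user.
MIN_HOURLY_POINTS = 24


def users_with_thin_baselines(hourly_keys: list[tuple[str, int]],
                              min_points: int = MIN_HOURLY_POINTS) -> list[str]:
    """Return users whose baseline window has fewer than `min_points` active hours."""
    counts = Counter(user for user, _hour_start in hourly_keys)
    return sorted(user for user, n in counts.items() if n < min_points)
```

Users reported by this check are candidates for a longer baseline window rather than for per-user alerting.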
As shown in the above flow chart, our detection:
Run time of search activities varies dramatically among users, as shown in the data exploration sample below, where the standard deviation is as large as 779. The data distribution defines normal behavior and affects model performance, but it cannot be determined before actual user data is collected. We therefore set the DensityFunction ‘dist’ parameter to ‘auto’ in baseline training, so that the algorithm learns the best-fitting distribution from each user's behavior. The trade-off is a much longer training time, because the algorithm must fit a model for each candidate distribution and select the one that performs best.
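The "fit every candidate, keep the best" idea behind dist=auto can be illustrated with a minimal Python sketch. This is a hedged illustration, not MLTK's implementation:

- It compares only two families, normal and exponential, each fitted by maximum likelihood in closed form.
- It selects the lowest Akaike information criterion (AIC) as an assumed criterion. MLTK considers its own set of distributions and uses its own selection criterion, which this sketch does not reproduce.

```python
import math
from statistics import NormalDist, fmean, pstdev


def normal_aic(data: list[float]) -> tuple[float, NormalDist]:
    """AIC of a maximum-likelihood normal fit (mean, population std dev)."""
    n = len(data)
    sigma = pstdev(data)
    log_lik = -0.5 * n * (math.log(2 * math.pi * sigma ** 2) + 1)
    return 2 * 2 - 2 * log_lik, NormalDist(fmean(data), sigma)  # 2 parameters


def exponential_aic(data: list[float]) -> tuple[float, float]:
    """AIC of a maximum-likelihood exponential fit (rate = 1 / mean)."""
    n = len(data)
    mean = fmean(data)
    log_lik = -n * (math.log(mean) + 1)
    return 2 * 1 - 2 * log_lik, 1.0 / mean  # 1 parameter


def select_distribution(data: list[float]) -> tuple[str, object]:
    """Fit each candidate family and keep the one with the lowest AIC."""
    if len(data) < 2 or pstdev(data) == 0:
        raise ValueError("need at least two distinct values")
    candidates = {"norm": normal_aic(data)}
    if min(data) >= 0 and fmean(data) > 0:  # exponential needs non-negative data
        candidates["expon"] = exponential_aic(data)
    best = min(candidates, key=lambda name: candidates[name][0])
    return best, candidates[best][1]
```

A skewed user history, with many short hours and a few long ones, tends to select the exponential family, while a history clustered around a typical value selects the normal one. Training cost grows with the number of candidate families, which matches the longer training time noted above.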
Users can change earliest=-7d@d in the search to another value so that the search collects enough data points to build a good baseline model. Users can also edit the list of risky commands in the "Search_Activity.search IN" clause to suit their own policies and usage environment.
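Both settings in this paragraph can be mimicked in Python for offline experiments:

- earliest=-7d@d means "go back seven days, then snap to the start of that day". The sketch applies that rule to a naive local datetime.
- The wildcard patterns imitate entries in a "Search_Activity.search IN" list. RISKY_PATTERNS is a hypothetical, illustrative list, not the one shipped with the detection, and the case-insensitive matching is an assumption.

```python
from datetime import datetime, timedelta
from fnmatch import fnmatchcase

# Hypothetical, illustrative patterns; replace with the commands your policy covers.
RISKY_PATTERNS = ["*| delete *", "*| sendemail *", "*| outputlookup *"]


def earliest_minus_7d_at_d(now: datetime) -> datetime:
    """Mimic earliest=-7d@d: subtract 7 days, then snap down to midnight."""
    return (now - timedelta(days=7)).replace(hour=0, minute=0, second=0, microsecond=0)


def is_risky(search: str, patterns: list[str] = RISKY_PATTERNS) -> bool:
    """Wildcard match of a search string against the risky-command patterns.

    Matching is case-insensitive here by assumption.
    """
    text = search.lower()
    return any(fnmatchcase(text, pattern.lower()) for pattern in patterns)
```

With these patterns, "index=main error | delete | stats count" matches. A search that ends exactly at "| delete" does not match, because the pattern requires a trailing space. This is worth keeping in mind when editing such a list.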
We also set ‘lower_threshold’ to a tiny value (0.000001) so that the lower anomaly bound is pushed close to zero, and no search activity with a short run time is wrongly flagged as anomalous.
| fit DensityFunction "run_time" dist=auto lower_threshold=0.000001 upper_threshold=0.001
To test the detection, we manually implanted two anomalies into a synthesized dataset of two users (one anomaly per user). As shown in the figure below, both anomalies were reported at the inference stage, as expected.
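A comparable check can be scripted outside Splunk. The sketch below is not the authors' dataset or test harness. It synthesizes uniform hourly run times for two hypothetical users, implants one large value per user, and uses a simplified normal-distribution model in place of DensityFunction. The user names, value ranges and seeds are all invented.

```python
import random
from statistics import NormalDist

HOURS_PER_WEEK = 7 * 24


def synthesize(rng: random.Random, low: float, high: float, hours: int) -> list[float]:
    """Uniform hourly run-time totals: a stand-in for one user's benign activity."""
    return [rng.uniform(low, high) for _ in range(hours)]


def detect(history: dict[str, list[float]], new_hours: dict[str, list[float]],
           upper_threshold: float = 0.001) -> list[tuple[str, int]]:
    """Return (user, hour_index) pairs whose run time exceeds the user's cutoff."""
    flagged = []
    for user, values in history.items():
        # Simplified per-user model: a normal fit instead of DensityFunction.
        cutoff = NormalDist.from_samples(values).inv_cdf(1.0 - upper_threshold)
        flagged.extend((user, i) for i, v in enumerate(new_hours[user]) if v > cutoff)
    return flagged


def run_experiment() -> list[tuple[str, int]]:
    rng = random.Random(7)  # fixed seed so the run is reproducible
    ranges = {"user_a": (10.0, 30.0), "user_b": (100.0, 200.0)}
    history = {u: synthesize(rng, lo, hi, HOURS_PER_WEEK) for u, (lo, hi) in ranges.items()}
    new_hours = {u: synthesize(rng, lo, hi, 24) for u, (lo, hi) in ranges.items()}
    new_hours["user_a"][5] = 500.0    # implanted anomaly for user_a
    new_hours["user_b"][17] = 2000.0  # implanted anomaly for user_b
    return detect(history, new_hours)


if __name__ == "__main__":
    print(run_experiment())
```

Benign values never exceed each user's uniform range, which sits well below the fitted 0.1% cutoff. As a result, only the two implanted hours are expected to be reported.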
The corresponding detection in ESCU (Enterprise Security Content Update) is "Splunk Command and Scripting Interpreter Risky SPL MLTK". To use this detection, the accelerated Splunk audit data model must be available. The detection should be scheduled to run hourly. Each run determines whether a user has, in the past hour, run searches containing risky SPL with an abnormally long run time compared with that user's search history over the past seven days. The detection depends on the MLTK app, which must be installed before the detection runs. The list of apps this detection depends on:
The baseline training generates a machine learning model named "risky_command_abuse". Configure this model to be shared globally (not private) in the MLTK app, as described in the MLTK documentation, unless the same account that trains the model will also run inference with it for anomaly detection.
In large enterprises, for example where more than 1,000 users actively run Splunk searches, training the baseline model can consume significant computing resources and may require a dedicated search head. The default settings of the underlying DensityFunction algorithm in the MLTK app may need to be increased for optimal performance, as described in the "Configuring DensityFunction parameters" section of the MLTK manual, especially for these parameters:
If you would like to adopt this detection, you can get the corresponding baseline and detection YAML files from the Splunk Security Content GitHub repository.
Type | Name | Technique ID | Tactic | Description
Baseline | Splunk Command and Scripting Interpreter Risky SPL MLTK Baseline | | Execution | This YML builds baseline models for risky command exploit detection from each user’s past 7 days of search activity, using total search run time as the user behavior indicator.
Anomaly | Splunk Command and Scripting Interpreter Risky SPL MLTK | | | This YML uses the baseline models to infer whether a search in the last hour is possibly an exploit of risky commands.
More Related Detections
Hunting | | | Execution | This YML hunts for ad hoc searches containing risky commands from non-administrative users.
Anomaly | | | | This YML identifies use of the risky command ‘DELETE’, which can be used in Splunk to delete some or all of the data being queried.
Anomaly | | | | This YML uses a pre-trained machine learning text classifier to detect potentially risky commands.
Any feedback or requests? Feel free to open an issue on GitHub and we’ll follow up. Alternatively, join us on the Slack channel #security-research. Follow these instructions if you need an invitation to our Splunk user groups on Slack.
We would like to thank the following for their contribution to this post and corresponding detections: