As the demands of our customers continue to rise, Splunk User Behavior Analytics (UBA) 5.3 now supports an increased ingestion rate of up to 160K EPS from Splunk Enterprise on a 20-node large deployment. This scalability improvement enables support for 750K user accounts, 1 million devices, and 64 data sources[1]. Detecting anomalous user behaviors within a configurable 30-day timeframe, at such a volume of cybersecurity data, poses significant performance challenges when building AI/ML-driven detection models.
One of our guiding principles for evolving Splunk UBA is to enhance the scalability of detection models and help customers address the operational issues that come with large-scale deployments. We will be presenting a series of blogs on this topic to help more UBA customers scale up their Splunk user behavior analytics. In this first blog of the series, we introduce some fundamental techniques for validating data volume and monitoring models so you can understand the size of your own UBA clusters.
Through years of global customer engagements, we have found that most of the scalability issues encountered by our customers could have been mitigated at an early stage, simply by validating data volume and monitoring model operations before cluster performance is affected. For example, regularly monitoring the volume of stored historical anomalies can prevent the database from occupying all drive space, which would otherwise terminate all scheduled model executions. Another common scenario involves onboarding invalid HR/device data or incorrectly mapping fields, which consumes extremely large amounts of computing and memory resources.
To tackle these challenges, UBA offers various logs accessible through the UI, as shown in Figure 1, which serve as crucial tools for diagnosis and monitoring. These logs provide comprehensive insight into the operations of the UBA system. However, searching through large logs can be time-consuming and is often unnecessary for a quick check. In this blog, we introduce an additional tool - Apache Zeppelin notebooks - to help you quickly validate the scale of data for UBA models in real time, without searching through logs.
Figure 1: Check all logs for UBA diagnostics
Since UBA 5.2, a Zeppelin installation package has been available to customers in the system directory /opt/caspida/conf/zeppelin/. (Please note: Zeppelin packages are not part of UBA deliverables, so installation is not required for customers. Zeppelin security and authentication details can be found on its official website.) As Zeppelin installation is not part of official UBA customer support, we published this installation guide for your reference only. Following the guide, you may download our UBA sample notebook from this open-source repository and import it into your own Zeppelin UI. Once the notebook is opened in Zeppelin, you should see an interface similar to the screenshot below.
Figure 2: Sample notebook for real-time data validation and model monitoring
We created this sample notebook to group together a selected set of commands and queries that can help you quickly learn how to check the scale of your UBA cluster. With the notebook's security configured correctly, you should be able to check your scalability settings without manually typing these commands or interacting with the UBA console. We welcome contributions of more useful queries to this open-source notebook, and you can bring us new UBA ideas via this link[2].
Below, we explain a selection of useful commands from the notebook.
Through the sample notebook, you can first track the event processing count and the maximum, minimum, and average EPS values during the data onboarding process. Currently, UBA supports EPS values of up to 160K on a 20-node deployment. To see real-time EPS values during onboarding, run the following three commands in the notebook:
%sh /opt/caspida/bin/status/eps_ds
%sh /opt/caspida/bin/status/eps_etl
%sh /opt/caspida/bin/status/eps_ir
The following figure shows a query of all your existing data sources, which is a good starting point for assessing significant increases in skipped or failed events during onboarding. You can explore more statistics in the datasources and connectorstats tables.
Figure 3: Retrieve EPS Statistics from data sources
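To adapt this check outside the sample notebook, the following is a minimal sketch of such a query, run from a %sh paragraph; it assumes the UBA Postgres database is named caspidadb and is readable by the local user, so adjust the connection details for your environment.

%sh
# Sketch: list registered data sources from the UBA Postgres database.
# The database name caspidadb is an assumption; adjust for your environment.
psql -d caspidadb -c "SELECT * FROM datasources;"
# Connector-level event statistics (including skipped/failed counts) can be
# explored in the connectorstats table.
psql -d caspidadb -c "SELECT * FROM connectorstats LIMIT 20;"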
The scale of users, entities (e.g., devices, applications), and their associated profiles can determine the performance of some models. Validating user and entity profiles is a prerequisite for reliable detections. The notebook shows how to retrieve these numbers from the hrdatausers, hrdataaccounts, systems, and usystems tables. The following two examples illustrate how to review your currently registered users (organized by OU group) and devices (categorized by device type), which are used in UBA peer grouping baseline analysis.
Figure 4: Check the number of OU groups for registered users.
Figure 5: Check the number of device types from registered devices.
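If you want to reproduce these checks by hand, the sketch below shows the general shape of such queries; the column names ou and devicetype are illustrative assumptions (the real names may differ by UBA version), so verify them against your schema first, for example with \d hrdatausers in psql.

%sh
# Sketch: count registered users per OU group and devices per device type.
# Column names ou and devicetype are hypothetical; verify against your schema.
psql -d caspidadb -c "SELECT ou, COUNT(*) AS users FROM hrdatausers GROUP BY ou ORDER BY users DESC;"
psql -d caspidadb -c "SELECT devicetype, COUNT(*) AS devices FROM systems GROUP BY devicetype ORDER BY devices DESC;"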
In one previous customer issue, a misconfigured LDAP system caused an extremely large number of OU groups, which unnecessarily scaled up all models that compute peer groupings over OU groups. One user may have multiple types of accounts, including service accounts. An extremely large number of service accounts may indicate misconfiguration or security vulnerabilities, and may also unnecessarily increase the scale of computation and system resource usage.
Conversely, an extremely large number of unresolved devices that cannot be identified as valid devices for UBA may indicate issues with data onboarding, which reduces the power of behavioral analysis.
The notebook includes examples, as shown in Figure 6, for checking in real time for failures of repeated model executions and for identifying the models with the longest durations. Model durations differ from customer to customer, depending on the scale of their data sources. Monitor long-running models for significant changes in execution times, and engage UBA Customer Support if they start to degrade system performance.
Figure 6: Check long-duration models from your UBA cluster
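If you would like to experiment with a similar check yourself, the pattern looks roughly like the sketch below; note that the table and column names here (modelexecutions, modelname, status, starttime, endtime) are purely hypothetical placeholders, so consult the sample notebook for the actual schema.

%sh
# Sketch only: all table and column names below are hypothetical placeholders;
# use the queries in the sample notebook for the actual schema.
psql -d caspidadb -c "SELECT modelname, status, endtime - starttime AS duration FROM modelexecutions ORDER BY duration DESC LIMIT 10;"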
Some issues cause performance degradation but are not related to scalability. For example, some customers neglected to clear historical anomalies from the Postgres database, resulting in insufficient space and model failures. The notebook provides query examples, as shown in Figure 7, to check the number of legacy threats and anomalies in the anomalies and threats tables.
We recommend consistently cleaning up anomalies and closing unwanted threats, targeting fewer than 1,000 active threats and fewer than 1 million active anomalies for deployments with fewer than 10 nodes. For deployments with more than 10 nodes, the best practice is to keep fewer than 2,000 threats and fewer than 1.5 million anomalies[3].
Figure 7: Show details of anomalies from history.
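As a quick gauge of cleanup needs, a minimal sketch can simply count the rows in those two tables (again assuming the caspidadb database name):

%sh
# Sketch: count stored anomalies and threats against the recommended limits.
psql -d caspidadb -c "SELECT COUNT(*) FROM anomalies;"
psql -d caspidadb -c "SELECT COUNT(*) FROM threats;"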
UBA provides two sets of model registration files. If the number of deployment nodes exceeds 10, ensure that you have configured the large deployment using this file: /opt/caspida/content/Splunk-Standard-Security/modelregistry/offlineworkflow/ModelRegistry.json.large_deployment.
When dealing with significantly increased data volume, UBA provides default configurations for performance tuning, including the following files:
/etc/caspida/conf/uba-default.properties
/etc/caspida/conf/uba-env.properties
/etc/caspida/conf/deployment/uba-tuning.properties
As a best practice, any customization or changes to these properties should be made in your local /etc/caspida/local/conf/uba-site.properties configuration file. Ensure that your recent changes take effect by syncing the UBA cluster and restarting services with the following commands:
/opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf/
/opt/caspida/bin/Caspida stop
/opt/caspida/bin/Caspida start
Please note: Some customers chose to modify the out-of-the-box model registry file (/opt/caspida/content/Splunk-Standard-Security/modelregistry/offlineworkflow/ModelRegistry.json) directly, but later realized that those changes are overwritten during upgrades, since all files under /opt/caspida are updated during the upgrade procedure.
To preserve your changes, the best practice is to update the custom registry at /etc/caspida/local/conf/modelregistry/offlineworkflow/ModelRegistry.json, since this file takes precedence over the default registry.
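As a rough sketch of that workflow, assuming your customization starts from the default registry and the local override directory does not exist yet:

# Sketch: seed the local override from the default registry; the override
# survives upgrades and takes precedence over the default registry.
mkdir -p /etc/caspida/local/conf/modelregistry/offlineworkflow
cp /opt/caspida/content/Splunk-Standard-Security/modelregistry/offlineworkflow/ModelRegistry.json \
   /etc/caspida/local/conf/modelregistry/offlineworkflow/ModelRegistry.json
# Edit the copy, then sync the cluster so all nodes pick up the change.
/opt/caspida/bin/Caspida sync-cluster /etc/caspida/local/conf/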
For UBA batch models, the Spark configurations are located in /var/vcap/packages/spark/conf. For example, to address Out-Of-Memory (OOM) issues in UBA models, you may:
1. Update key parameters in /opt/caspida/bin/uba-spark/trigger-models.sh when triggering a particular model:
--driver-memory (spark.driver.memory)
--conf spark.driver.maxResultSize
2. Fine-tune key parameters in /var/vcap/packages/spark/conf/spark-defaults.conf (see the sketch after this list).
Example:
spark.executor.memory
spark.driver.memory
spark.default.parallelism
3. Copy the edited file /var/vcap/packages/spark/conf/spark-defaults.conf via scp to all the UBA nodes, since the sync-cluster command does not sync /var/vcap/.
4. On the UBA management node[4], restart the Spark services with the following commands:
/opt/caspida/bin/Caspida stop-spark
/opt/caspida/bin/Caspida start-spark
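To make steps 2 and 3 concrete, below is a minimal sketch; the memory sizes and parallelism value are illustrative assumptions rather than tuning recommendations, and the host names uba-node-2 and uba-node-3 are placeholders for your own nodes.

# Sketch for step 2: example spark-defaults.conf entries (the values are
# illustrative assumptions, not recommendations):
#   spark.executor.memory      8g
#   spark.driver.memory        8g
#   spark.default.parallelism  48
# Sketch for step 3: copy the edited file to every other UBA node, since
# sync-cluster does not cover /var/vcap/ (host names are placeholders).
for node in uba-node-2 uba-node-3; do
  scp /var/vcap/packages/spark/conf/spark-defaults.conf \
      ${node}:/var/vcap/packages/spark/conf/spark-defaults.conf
done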
In this blog, we have shown how to set up a Zeppelin notebook that connects to your UBA environment; we have provided a sample UBA notebook that can validate your data and monitor your model operations without searching through large Health Check logs; and we have explained some frequently used queries and commands that help you validate the scale of your data and models before performance issues grow in your cluster. In the next blog, we will share the model performance improvements we have achieved for a future release and explain some strategies for scaling up models in your own clusters.
Special thanks to William Lac and Maria Sanchez from the UBA Customer Support Team for their unwavering dedication and valuable contributions to this blog and to our customers.
References:
[1] https://docs.splunk.com/Documentation/UBA/5.3.0/Sizing/Scale
[2] https://github.com/splunk/uba-content-security/tree/main
[3] https://docs.splunk.com/Documentation/UBA/5.3.0/User/AnomalyThreatLimits
[4] https://docs.splunk.com/Documentation/UBA/5.3.0/Sizing/Architecture