Splunk administrators increasingly face the challenge of providing Splunk service to many users and applications while balancing the quality of that service against overall business needs. In most cases they run a shared Splunk infrastructure, which makes it complicated to allocate Splunk resources based on workload priority and to ensure proper resource isolation that prevents noisy-neighbor problems.
In this series of three Splunk blogs, I will share how Splunk Workload Management may be used to solve these challenges. In the first installment below, I will describe how to configure the feature and use it to reserve resources for ingestion and search workloads.
Splunk Workload Management was first released in Splunk Enterprise 7.2 and later improved in Splunk Enterprise 7.3. This blog includes capabilities that are generally available to Splunk Enterprise customers.
Workload management is a rule-based framework for allocating system resources (compute and memory) to various workloads and managing those workloads through their lifecycle. For resource allocation, workload management uses the cgroups functionality of the Linux operating system. Hence, you need to set up the cgroup hierarchy on all servers (search heads and indexers) before using workload management for resource allocation. There are various ways to do so, depending on whether or not you are using systemd.
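For example, on a systemd-based host you can let Splunk generate a unit file that creates and delegates the CPU and memory cgroups for you. The commands below are a minimal sketch; the install path, user name, and availability of the -systemd-managed flag depend on your environment and Splunk version.

    # Run as root on every search head and indexer (path and user are
    # illustrative; -systemd-managed requires a recent 7.2.x+ release).
    /opt/splunk/bin/splunk stop
    /opt/splunk/bin/splunk enable boot-start -systemd-managed 1 -user splunk
    systemctl start Splunkd

On hosts without systemd, you instead create the cgroup directories for the cpu and memory controllers manually and change their ownership to the user that runs splunkd; see the workload management prerequisites in the Splunk documentation for your version.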
The overall system resources allocated to Splunk are divided into three predefined categories: Ingest, Search, and Miscellaneous. For each category, you may allocate resources and create resource pools. For the Ingest and Miscellaneous categories you can only create a default pool. For the Search category, you may create several pools and divide the category's resources among them. If the CPU resources allocated to a pool are not fully utilized, they are automatically shared with workloads running in other pools. Memory allocation, however, is a hard limit: it sets the maximum memory that workloads in a pool can use.
All key Splunk processes run in the Ingest pool, so protection against out-of-memory (OOM) conditions can be provided by setting appropriate memory limits for each pool. In the example below, even in the worst-case scenario when several searches are running, the Ingest pool gets a minimum of 30% of memory because the Search and Miscellaneous pools have a combined memory limit of 70% (65% + 5%). This provides memory protection in heavily loaded deployments. Modular and scripted inputs run in the Miscellaneous pool.
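As a concrete sketch of such an allocation, the workload_pools.conf stanzas below cap Search and Miscellaneous memory at 65% and 5%, leaving at least 30% for Ingest. The pool names are hypothetical, and the stanza and setting names should be verified against the workload_pools.conf spec for your Splunk Enterprise version (on search heads, the UI writes this file for you).

    # workload_pools.conf -- illustrative values only
    [general]
    enabled = true

    # Category-level allocation: Search + Misc memory is capped at 70%,
    # so Ingest always keeps at least 30% of memory.
    [workload_category:ingest]
    cpu_weight = 30
    mem_weight = 30

    [workload_category:search]
    cpu_weight = 65
    mem_weight = 65

    [workload_category:misc]
    cpu_weight = 5
    mem_weight = 5

    # Two search pools dividing the Search category's share; Ingest and
    # Misc each have only a single default pool.
    [workload_pool:standard_perf]
    category = search
    cpu_weight = 70
    mem_weight = 70
    default_category_pool = true

    [workload_pool:high_perf]
    category = search
    cpu_weight = 30
    mem_weight = 30
    default_category_pool = false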
For search heads, the resource settings can be defined in the graphical user interface. Once defined, the settings are saved in the workload_pools.conf file. For indexer clusters, this file needs to be pushed through the cluster master. In most scenarios, workload_pools.conf should be the same across search heads and indexers, but this is not a requirement. If you are using multiple standalone search heads or search head clusters with the same indexer cluster, you may need a different configuration file for each search head (cluster) and indexer cluster. More on this in the second blog!
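For example, on an indexer cluster the file can be staged in the configuration bundle on the cluster master and then pushed to all peers. The paths below assume a default Linux install and the 7.x directory layout.

    # On the cluster master
    cp workload_pools.conf /opt/splunk/etc/master-apps/_cluster/local/
    /opt/splunk/bin/splunk apply cluster-bundle --answer-yes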
Note: This is just an example. The allocation will depend on your specific workload and deployment.
Once the workload pools are defined, you may create workload rules to place searches in different search pools. The rules are logical combinations of predicates such as app, user, role, and index (e.g., if index=security AND role=security_team, then place the search in the Priority pool). The rules are evaluated in order, and once a rule is matched, the corresponding action is taken. Rules listed below the matched rule are not evaluated, so the ordering of rules is very important. For example, in the scenario below, Rule #2 will never be matched. Generally, more restrictive rules should be placed at the top.
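Here is a hypothetical workload_rules.conf illustrating that scenario; the rule names are made up, the pools match the earlier sketch, and the setting names should be checked against the workload_rules.conf spec for your version.

    # workload_rules.conf -- illustrative only
    # Rule #1 matches every search against index=security ...
    [workload_rule:security_searches]
    order = 1
    predicate = index=security
    workload_pool = standard_perf

    # ... so this more restrictive rule is never evaluated. Move it to
    # the top (order = 1) so security_team searches land in the
    # high_perf pool.
    [workload_rule:security_team_priority]
    order = 2
    predicate = index=security AND role=security_team
    workload_pool = high_perf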
Often you want to ensure that a heavy search workload does not cause data ingestion to lag or drop, and, conversely, that search execution is not impacted by a heavy ingestion load. This can be achieved by allocating CPU and memory resources across the Ingest and Search categories. The example below illustrates CPU utilization by ingest and search workloads at four points in time with workload management enabled.
CPU Utilization Monitoring from Splunk Monitoring Console
Workload management enables you to assign Splunk resources to meet your business and operational requirements. You can use a set of rules to automatically place searches in different pools, and you can grant certain users granular access to choose their own workload pools. In addition, rich monitoring capabilities allow you to track utilization and fine-tune the resource allocations.
Overall, Workload Management puts you in control, letting you allocate Splunk resources more precisely to serve your internal business requirements without over-provisioning capacity. In the next blog, I will describe how to configure workload pools in complex deployment scenarios, and how to address use cases like internal chargeback and high-priority search execution. In the meantime, check out the demo video to get started with Splunk Workload Management.