Splunk gives you amazing tools to analyze system data and make business-critical decisions, react to issues, and even proactively address future problems. And since it is so good at those things, it’s not surprising that there are myriad tools to help you reflexively analyze your own Splunk deployment. For example, the Cloud Monitoring Console (CMC), available on Splunk Cloud, allows you to check the health of your deployment, observe ingest and search trends, and dive into details of the usage patterns.
But what does it mean if the CMC is telling you your ingest is steady and your system is healthy? How much more load can you put on your system before performance degrades below acceptable levels? What are the hourly patterns in your ingest or search load? What about daily, or monthly, or yearly patterns? What is happening on the system when you see degraded performance? These questions are difficult, if not impossible, to answer with the CMC alone. In many cases, users may resort to opening a tech support case with Splunk to answer them. What if there were a tool that could help you understand the performance characteristics of your particular deployment and tune your system to run optimally?
Well, there is! Performance Insights for Splunk (PI) was originally developed for internal use by Performance Engineers at Splunk, who needed a deeper understanding of how usage patterns are tied to system and resource usage. Recognizing the value that this tool provides for both Cloud and on-premises customers, Splunk added it to Splunkbase for anyone to use.
This article is an introduction to the tool, giving a high-level overview of what it can do and how you can use it to ensure your system is running not only smoothly, but optimally as well. We’re often tempted, when we see performance issues, to assume that more hardware will solve the problem. While this is often true, it might not be the most cost-effective solution. PI is meant to help you identify realized or potential performance issues, understand what parts of the system are involved, and monitor your environment after taking corrective steps.
As examples, some of the issues we might see are:
If you were working inside the CMC alone and saw any one of these issues, you would find it difficult to relate it to the others, or to any other events that might have been happening at the same time. The CMC is great for health checks, but not as good for diagnostics; each page is laid out to provide health information for a particular area, with its own groupings and time scales. If you know what you’re looking for, you might be able to find the information in there, but it’s not likely to jump out at you.
Figures 1 and 2: CMC pages use very different time periods, which are often not customizable
PI helps by allowing you to set the same time scale and granularity on every page, for every chart. For example, when trying to see why the CPUs are spiking every 5 minutes, you’ll be able to see that scheduled search counts are also higher at those times, as is search concurrency. You’ll be able to see which searches were triggered 12 times in that hour and how much CPU those searches used.
Figures 3, 4, and 5: No matter which page, Performance Insights for Splunk can display the same time range and granularity
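To make the idea of a shared time scale concrete, here is a minimal Python sketch of what aligning two metrics on the same granularity involves. It is not how PI is implemented; the function names `bucket` and `align` and all of the sample data are hypothetical, chosen only to mirror the CPU spike example above.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def bucket(ts, granularity):
    """Floor a naive timestamp to the start of its time bucket."""
    size = int(granularity.total_seconds())
    elapsed = int((ts - EPOCH).total_seconds())
    return EPOCH + timedelta(seconds=elapsed - elapsed % size)

def align(cpu_samples, search_starts, granularity):
    """Put both metrics on one time axis.

    Returns {bucket_start: (peak_cpu_pct, searches_started)}.
    """
    peak_cpu = {}
    for ts, pct in cpu_samples:
        b = bucket(ts, granularity)
        peak_cpu[b] = max(pct, peak_cpu.get(b, 0.0))
    searches = Counter(bucket(ts, granularity) for ts in search_starts)
    buckets = sorted(set(peak_cpu) | set(searches))
    return {b: (peak_cpu.get(b, 0.0), searches[b]) for b in buckets}

# Hypothetical sample data: one CPU reading per minute that spikes every
# 5 minutes, and three scheduled searches that start at each spike.
start = datetime(2024, 1, 1, 9, 0)
cpu_samples = [
    (start + timedelta(minutes=m), 95.0 if m % 5 == 0 else 30.0)
    for m in range(15)
]
search_starts = [
    start + timedelta(minutes=m, seconds=s)
    for m in (0, 5, 10)
    for s in (1, 2, 3)
]

aligned = align(cpu_samples, search_starts, timedelta(minutes=1))
# Buckets where CPU peaked above 90%; compare them with the search counts.
busy = [b for b, (cpu, _) in aligned.items() if cpu > 90.0]
```

With both series in the same one-minute buckets, every bucket in `busy` also carries a nonzero search count, which is the kind of coincidence that is hard to spot when each chart uses its own time scale.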
At the top of every page you’ll find the time range and granularity pickers, for zooming in or out on the data. You’ll also find the cluster selector that allows you to filter relevant charts to metrics from the current search head cluster (or search head, if not clustered), or all search head clusters.
Figure 6: Time pickers are specific to a page but can be set to the same values on every page.
The Performance Trend page gives an overview of ingest rates and search load, as well as indexer and search head CPU and memory metrics. This is where you can get a high-level view of general system performance over time.
Figure 7: Performance Trends
The System Environment and Data page provides details of the deployment landscape and installed applications, along with statistics about the distribution of ingest on indexers, indexes, and source types. Knowing how your data is distributed will help you build more efficient searches. You might also choose to filter or redistribute the data to reduce index sizes.
Figure 8: System Environment and Data
The Search Metrics page is broken into 6 parts: an overview page and 5 search-type-specific pages. The overview page shows the collective view of all search activity, including concurrency, runtimes, and counts. Here you can check for seasonality in your searches, which shows you where the load could be flattened. The search-type-specific pages break down the details of each search type, including detailed runtime statistics, resource usage, skipped search details, and long-running search details. These pages can show you which searches to tune first to get better performance from your system.
Figures 9 and 10: Search Overview and Details
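As a rough illustration of what checking for seasonality means, the following Python sketch builds an hour-of-day profile from search start times and flags hours that carry far more than the average load. It is only a sketch under stated assumptions: the helper names `hourly_profile` and `peak_hours`, the threshold factor, and the sample data are all hypothetical and are not taken from PI.

```python
from collections import Counter
from datetime import datetime, timedelta

def hourly_profile(search_starts):
    """Average number of searches started in each hour of the day."""
    days = {ts.date() for ts in search_starts}
    if not days:
        return {hour: 0.0 for hour in range(24)}
    counts = Counter(ts.hour for ts in search_starts)
    return {hour: counts[hour] / len(days) for hour in range(24)}

def peak_hours(profile, factor=2.0):
    """Hours whose average load exceeds `factor` times the mean hourly load."""
    mean = sum(profile.values()) / len(profile)
    if mean == 0:
        return []
    return [hour for hour, load in profile.items() if load > factor * mean]

# Hypothetical sample data: three days with 2 searches in every hour,
# except 02:00, where 42 searches start each day.
base = datetime(2024, 1, 1)
starts = []
for day in range(3):
    for hour in range(24):
        n = 42 if hour == 2 else 2
        starts += [
            base + timedelta(days=day, hours=hour, minutes=i) for i in range(n)
        ]

profile = hourly_profile(starts)
crowded = peak_hours(profile)
```

In this made-up data set the only hour in `crowded` is 02:00, so moving some of those scheduled searches to quieter hours would be the obvious way to flatten the load.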
The Resource Monitoring page outlines CPU and memory statistics for the search heads and indexers, along with search head restarts. This is likely the starting point whenever you are experiencing sub-optimal performance. Resource contention, especially for CPU, is very often at the heart of performance-related issues. If you see exhausted resources here, use the other pages to help you tune and optimize to lower the burden on that resource.
Figure 11: Resource Monitoring
The Splunk Features page gives insight into cache, SmartStore, data model, and bucket statistics, along with Enterprise Security details such as assets and identities and notable events.
Figure 12: Splunk Features
And finally, the Environment Diagnose page exposes the error rates and error details for indexers and search heads. Seeing what errors were happening at the time of an issue is often key to solving the issue. Even if you're not experiencing any performance problems, minimizing warnings and errors on your system is generally a good practice, and can save you some system resources.
Figure 13: Error Trends and Reporting
Armed with these views into your Splunk deployment, you can easily correlate an observed behavior with other events that were happening at the same time. PI will not only help you diagnose issues with your system more quickly, but also allow you to find the best adjustments to make to reduce resource contention, allowing you to do more with the same amount of hardware. This can lead to significant savings in the long run. This brief overview of Performance Insights for Splunk just scratches the surface of what you can do with the tool. In future posts, I will walk through case studies showing how particular problems are solved with this tool, and how you can unlock its potential to get the most out of your Splunk system. So let's get started! Visit Splunkbase and install Performance Insights for Splunk today!