We’re excited to share new Splunk capabilities to help measure how code performance impacts your services. Splunk APM’s AlwaysOn Profiling now supports .NET and Node.js applications for CPU profiling, and Java applications for memory profiling. AlwaysOn Profiling gives app developers and service owners code-level visibility into resource bottlenecks by continuously profiling service performance with minimal overhead. Whether refactoring monolithic applications or adopting microservices and cloud-native technologies, engineers can easily identify CPU and memory bottlenecks in .NET, Node.js, and Java applications, linked in context with their trace data. Along with Splunk Synthetic Monitoring, Splunk RUM, Splunk Infrastructure Monitoring, Splunk Log Observer, and Splunk On-Call, AlwaysOn Profiling helps detect, troubleshoot, and isolate problems faster.
For background, our initial GA blog post provided a thorough walkthrough of CPU troubleshooting. For memory profiling, here’s an overview of how to identify problems and troubleshoot bottlenecks. For more detail, see our docs on memory profiling or view this detailed video walkthrough.
Service owners notified about slowness log in to Splunk APM and see that service latency metrics indicate a spike in response time after a new deployment. We know that potential bottlenecks can take many forms.
Because code issues are a common cause of slowness, we analyze code performance first. We look at a few example slow threads and their CPU profiling data within our traces, and examine bottlenecks across our services (as detailed in our initial GA blog post).
If we can’t find any code bottlenecks, our next step is to examine the runtime metrics. Splunk APM’s runtime dashboard displays a wealth of runtime-specific metrics, all gathered automatically by our runtime instrumentation agent.
Looking at these charts, we see our JVM metrics and notice abnormalities in garbage collection.
When we examine the “GC activity” chart, we see that in every one-minute period the garbage collection process takes upwards of 20 seconds. This could indicate that our JVM is spending too much time doing garbage collection instead of servicing incoming requests.
Looking at CPU usage confirms our suspicion. The JVM incurs significant overhead from garbage collection (20 to 40% of CPU resources), leaving only 60 to 80% of CPU for actual work (i.e., serving incoming requests).
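If we want to sanity-check that ratio outside of the dashboard, the JVM exposes the same accumulated collection times through its standard management beans. The sketch below is a minimal, standalone example (not part of AlwaysOn Profiling) that samples the GarbageCollectorMXBeans over a one-minute window and prints the approximate share of time spent in GC:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcOverheadCheck {
    public static void main(String[] args) throws InterruptedException {
        final long windowMillis = 60_000; // observe one minute, like the "GC activity" chart
        long before = totalGcTimeMillis();
        Thread.sleep(windowMillis);
        long after = totalGcTimeMillis();

        double overheadPercent = 100.0 * (after - before) / windowMillis;
        System.out.printf("Approximate GC overhead over the last minute: %.1f%%%n", overheadPercent);
    }

    // Sum of accumulated collection time (in milliseconds) across all collectors,
    // e.g. the young- and old-generation collectors.
    private static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime(); // -1 if the collector does not report time
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }
}
```

In practice the runtime dashboard already charts these metrics continuously, so a snippet like this is only a quick cross-check on a single JVM.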
The most obvious cause of excessive garbage collection activity is code that allocates too much memory or creates too many objects for the garbage collector to track. To visualize code bottlenecks we use flame graphs (for more details, see using flame graphs). The flame graph below visually aggregates the 457k+ call stacks captured from our JVM while our code was allocating memory. The width of each stack frame, represented as a bar on the chart, tells us proportionally how much memory that frame allocated.
The lower part of the flame graph points to our first-party code, meaning we will likely have the option to re-engineer our code to allocate less memory.
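As a concrete illustration (a hypothetical service, not the actual code behind this walkthrough), the kind of first-party frame that tends to dominate an allocation flame graph often looks like a request handler concatenating strings in a loop:

```java
package com.example;

import java.util.List;

// Hypothetical first-party code of the sort that shows up as a wide allocation frame:
// each += on a String allocates a fresh String (plus an intermediate builder),
// so a single large report can generate tens of thousands of short-lived objects.
public class ReportService {
    public String renderReport(List<String> rows) {
        String report = "";
        for (String row : rows) {
            report += row + "\n"; // new intermediate objects on every iteration
        }
        return report;
    }
}
```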
From here we can either switch to our IDE and manually navigate to the class and method indicated by the stack frame, or click a specific frame to display its details under the flame graph.
Clicking “Copy Stack Trace” places the entire stack trace containing the frame onto our clipboard. We can then switch to our IDE, paste it into the “Analyze Stack Trace” / “Java Stack Trace Console” (or similar) dialog, and the IDE will point us to the exact line in the right file.
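For the hypothetical ReportService above, the copied stack trace might look something like the abbreviated snippet below (class names and line numbers are illustrative); pasting it into the IDE’s stack trace analyzer resolves each frame to a file and line:

```
at java.lang.StringBuilder.append(StringBuilder.java)
at com.example.ReportService.renderReport(ReportService.java:12)
at com.example.ReportController.getReport(ReportController.java:27)
```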
After fixing and redeploying the source code, we check the flame graph to confirm that those stack traces no longer dominate memory allocation, and verify that both garbage collection overhead and service response time have decreased.
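Continuing the hypothetical example, the fix can be as simple as building the report with a single StringBuilder per request instead of repeated string concatenation:

```java
package com.example;

import java.util.List;

// Reworked version of the hypothetical ReportService: one StringBuilder is allocated
// per request rather than new intermediate Strings per iteration, which removes the
// wide allocation frame from the flame graph and reduces pressure on the collector.
public class ReportService {
    public String renderReport(List<String> rows) {
        StringBuilder report = new StringBuilder(rows.size() * 64); // rough pre-sizing
        for (String row : rows) {
            report.append(row).append('\n');
        }
        return report.toString();
    }
}
```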
Unlike dedicated code profiling solutions, Splunk’s AlwaysOn Profiler links collected call stacks to the spans being executed at the time of collection. This separates data about background threads from the active threads servicing incoming requests, greatly reducing the time engineers spend analyzing profiling data.
Additionally, with Splunk’s AlwaysOn Profiler, all data collection is automatic and low-overhead. Instead of having to switch a profiler on during a production incident, users only need to deploy the Splunk distribution of the OpenTelemetry agent, and it continuously collects data in the background.
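For a Java service, enabling this typically amounts to attaching the agent at startup. The launch command below is a sketch, assuming the splunk-otel-javaagent JAR and the splunk.profiler.* system properties described in the AlwaysOn Profiling documentation (the service and JAR names are placeholders; check the docs for the current property names and defaults):

```
java -javaagent:./splunk-otel-javaagent.jar \
     -Dsplunk.profiler.enabled=true \
     -Dsplunk.profiler.memory.enabled=true \
     -Dotel.service.name=checkout-service \
     -jar checkout-service.jar
```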
To learn more, find more detailed instructions in our AlwaysOn Profiling documentation.
Not an APM user? Sign up for a trial today.