For application developers and service owners who build and troubleshoot modern enterprise software, resolving production issues means tracking down poor performance across multiple networks, operating systems, servers, configurations, and third-party dependencies. When the problem is the code itself, code profiling helps identify service bottlenecks by periodically taking CPU snapshots, or call stacks, from a runtime environment. Call stacks add context to the slow spans in transaction traces and, when aggregated into flame graphs, help visualize bottlenecks and show service performance over time. These benefits speak for themselves, but most other code profiling products incur notable performance overhead, so engineers have to switch them on and off manually, which creates a tradeoff between application performance and available data.
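To make "periodically taking call stacks from a runtime environment" concrete, here is a minimal Java sketch of the general sampling idea. It is not how the AlwaysOn Profiler is implemented; the class name StackSampler, the Sample type, and the interval are hypothetical, and the only real API relied on is the JDK's Thread.getAllStackTraces().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; this is not how the AlwaysOn Profiler is implemented.
class StackSampler {

    /** One snapshot of one thread: when it was taken and its frames, innermost first. */
    static final class Sample {
        final long timestampMillis;
        final String threadName;
        final StackTraceElement[] frames;

        Sample(long timestampMillis, String threadName, StackTraceElement[] frames) {
            this.timestampMillis = timestampMillis;
            this.threadName = threadName;
            this.frames = frames;
        }
    }

    /** Takes up to `rounds` snapshots of every live thread, pausing `intervalMillis` between rounds. */
    static List<Sample> sample(int rounds, long intervalMillis) {
        List<Sample> samples = new ArrayList<>();
        for (int i = 0; i < rounds; i++) {
            long now = System.currentTimeMillis();
            for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                if (e.getValue().length > 0) { // skip threads with no frames to report
                    samples.add(new Sample(now, e.getKey().getName(), e.getValue()));
                }
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop sampling
                break;
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        // Print the innermost frame of each sampled thread.
        for (Sample s : sample(3, 100)) {
            System.out.println(s.threadName + " @ " + s.frames[0]);
        }
    }
}
```

A naive sampler like this pauses threads to walk their stacks, which is one source of the overhead mentioned above; production profilers typically use cheaper mechanisms.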
We’re proud to announce the Beta of AlwaysOn Profiling, part of Splunk APM. Available initially for Java-based applications, AlwaysOn provides continuous visibility of code-level performance, linked with unsampled trace data, with minimal overhead. Along with Splunk Synthetic Monitoring, Splunk RUM, Splunk Infrastructure Monitoring, Splunk Log Observer, and Splunk On-Call, AlwaysOn Profiling gives engineers more context to identify performance issues and troubleshoot faster across production environments.
Splunk APM’s AlwaysOn Profiler is constantly monitoring code performance to give you immediate context of where performance bottlenecks exist. Here are two examples of how AlwaysOn can help identify production issues:
Workflow One: Viewing Common Code in Your Slowest Traces
Engineers troubleshooting production issues often sort through example traces looking for common attributes in their slowest spans. AlwaysOn’s call stacks are linked to trace data, providing context into which code is executed during each trace.
Within APM you can easily view latency within your production environment.
By clicking into any service you’re taken to the service maps, which provide additional context on bottlenecks within that service and its dependencies.
From here, we can explore example traces.
Note: We set the “min” filter to 10,000 milliseconds, or ten seconds, to focus specifically on the slowest traces. We see that requests to /stats/races/fastest repeatedly take 40 seconds or more to respond.
By clicking into one of these long traces, the following screen opens:
We see that while the StatsController.fastestRace operation was being executed, we collected 36 call stacks. Because the Java agent collects call stacks continuously, the longer a span runs, the more call stacks it will have. When we open this span, we see the metadata on the left, and the call stacks that the agent collected on the right. We can use the “Previous” and “Next” buttons to flip through all call stacks:
If you see several consecutive call stacks pointing to the same line of code, it indicates that this line takes a long time to execute, or executes many times in a row. This is often a solid hint at a performance bottleneck.
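The "several consecutive call stacks pointing to the same line" heuristic can be sketched in a few lines of Java. This is only an illustration of the reasoning, not a Splunk API: the class HotFrameRun is hypothetical, and each call stack is reduced to a string naming its innermost frame, in collection order.

```java
import java.util.AbstractMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; frame strings such as "Foo.bar:10" are made up.
class HotFrameRun {

    /**
     * Returns the frame that appears in the longest unbroken run of consecutive
     * samples, together with the length of that run. For an empty list the key
     * is null and the length is 0.
     */
    static Map.Entry<String, Integer> longestRun(List<String> topFrames) {
        String best = null;
        int bestLen = 0;
        String current = null;
        int currentLen = 0;
        for (String frame : topFrames) {
            if (frame.equals(current)) {
                currentLen++;            // same line as the previous sample
            } else {
                current = frame;         // a new run starts here
                currentLen = 1;
            }
            if (currentLen > bestLen) {
                best = current;
                bestLen = currentLen;
            }
        }
        return new AbstractMap.SimpleImmutableEntry<>(best, bestLen);
    }
}
```

For example, given the top frames A, B, B, B, A, the method reports B with a run length of 3, which is the line worth inspecting first.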
Workflow Two: Viewing Aggregate Performance of Services Over Time
Before you begin optimizing code, it’s always helpful to understand which part of your source code impacts performance the most. How do you know which part is the biggest bottleneck? This is where aggregating the collected call stacks into flame graphs helps.
When viewing your service map, notice the code profiling section in the right side panel. It automatically shows the top five frames from the call stacks we’ve collected for your selected time range, which already point to likely bottlenecks in code.
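As a rough illustration of how a "top frames" list could be derived, the sketch below counts how often each innermost frame appears across a set of call stacks and returns the most frequent ones. How Splunk APM actually ranks frames is not described here, so treat the ranking rule, the class name TopFrames, and the string representation of frames as assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch only; the ranking rule is an assumption, not Splunk's algorithm.
class TopFrames {

    /**
     * Counts the innermost frame (index 0) of every non-empty stack and returns
     * the `n` most frequent frames, ties broken alphabetically.
     */
    static List<Map.Entry<String, Long>> topFrames(List<List<String>> stacks, int n) {
        Map<String, Long> counts = stacks.stream()
                .filter(stack -> !stack.isEmpty())
                .collect(Collectors.groupingBy(stack -> stack.get(0), Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(n)
                .collect(Collectors.toList());
    }
}
```

Calling topFrames(stacks, 5) on the stacks collected for a time range would give a list comparable in spirit to the five frames shown in the side panel.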
Clicking into the feature takes you to a flame graph, which is a visual aggregation of the call stacks collected over the time range you’ve specified. The wider a horizontal bar, the more frequently that line of code is found in the collected call stacks.
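The aggregation behind a flame graph can be sketched as follows: every call stack contributes one unit of width to each frame along its path, so frames that appear in many stacks end up with wide bars. This is a generic illustration, not Splunk's rendering code; the class FlameGraphFold and the semicolon-joined path keys are choices made for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; shows the aggregation idea, not Splunk's renderer.
class FlameGraphFold {

    /**
     * Takes call stacks ordered from the outermost frame to the innermost and
     * returns, for every call path prefix (joined with ';'), the number of
     * stacks that pass through it. That count is the width of the bar.
     */
    static Map<String, Integer> barWidths(List<List<String>> stacksRootFirst) {
        Map<String, Integer> widths = new TreeMap<>();
        for (List<String> stack : stacksRootFirst) {
            StringBuilder path = new StringBuilder();
            for (String frame : stack) {
                if (path.length() > 0) {
                    path.append(';');
                }
                path.append(frame);
                widths.merge(path.toString(), 1, Integer::sum); // one more stack through this path
            }
        }
        return widths;
    }
}
```

With three stacks that all pass through main and handle, two ending in query and one in parse, the bar for main;handle has width 3, main;handle;query has width 2, and main;handle;parse has width 1.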
When reading the flame graph, focus on the larger top-down “pillars”, which indicate the lines of code that use the most CPU. If you want to highlight your own code classes in the flame graph, use the filter in the top left.
Each horizontal bar of the flame graph shows the class name and line number for your code. Flame graphs point you to the bottleneck causing the slowness, and the final step in troubleshooting is returning to your source code itself to fix the problem.
Unlike dedicated code profiling solutions, Splunk’s AlwaysOn Profiler links collected call stacks to the spans that are being executed at the time of call stack collection. This helps separate data about background threads from data about the active threads that service incoming requests, greatly reducing the amount of time engineers need to analyze profiling data.
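One way to picture this linking is sketched below. The mechanism the agent actually uses is not described in this post, so this is purely an assumption for illustration: a sample is attached to a span when it was taken on the span's thread while the span was open, and samples from other threads are left out. The SpanLinker, Span, and Sample types are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; the matching rule is an assumption, not the agent's mechanism.
class SpanLinker {

    static final class Span {
        final String spanId;
        final long threadId, startMs, endMs;
        Span(String spanId, long threadId, long startMs, long endMs) {
            this.spanId = spanId; this.threadId = threadId; this.startMs = startMs; this.endMs = endMs;
        }
    }

    static final class Sample {
        final long threadId, timestampMs;
        final String topFrame;
        Sample(long threadId, long timestampMs, String topFrame) {
            this.threadId = threadId; this.timestampMs = timestampMs; this.topFrame = topFrame;
        }
    }

    /** Groups samples by the span that was open on the same thread at sampling time. */
    static Map<String, List<Sample>> link(List<Span> spans, List<Sample> samples) {
        Map<String, List<Sample>> bySpan = new LinkedHashMap<>();
        for (Sample sample : samples) {
            for (Span span : spans) {
                boolean sameThread = span.threadId == sample.threadId;
                boolean inWindow = sample.timestampMs >= span.startMs && sample.timestampMs <= span.endMs;
                if (sameThread && inWindow) {
                    bySpan.computeIfAbsent(span.spanId, k -> new ArrayList<>()).add(sample);
                    break; // attach each sample to at most one span
                }
            }
        }
        return bySpan; // samples from background threads never appear here
    }
}
```

Under this rule, a sample taken on a background thread, or on a request thread after its span has ended, simply has no span to attach to, which is what keeps it out of the per-trace view.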
Additionally, with Splunk’s AlwaysOn Profiler, all of the data collection is automatic and low-overhead. Instead of having to switch the profiler on during production incidents, users only need to deploy the Splunk-flavored OpenTelemetry agent, and it begins to continuously collect data in the background.
With AlwaysOn Profiling, teams using Splunk APM can now analyze and improve both the intra-service performance of code-heavy monoliths and the inter-service performance of microservice-based architectures, troubleshooting bottlenecks and optimizing service performance at any stage of cloud migration.
Sign up for the preview to get started today.
Follow all the conversations coming out of #splunkconf21!