November 14, 2020

Reimagining APM for the Cloud-Native World: Introducing Splunk APM

By Splunk

Today we are very excited to introduce Splunk APM, the newest component of Splunk Infrastructure Monitoring, built for monitoring microservices-based applications. At Splunk, we strive to deliver the industry’s most powerful cloud monitoring solution to accelerate our customers’ journey to cloud-native. With monoliths and legacy applications being re-engineered as service-oriented architectures, enterprises need a new class of application monitoring tools to get visibility into transactions that now travel through complex paths across distributed services.

Traditional APM is a misfit for monitoring microservices. In a multi-part blog series, we outlined the shortcomings of the existing APM solutions and the need to fundamentally transform application monitoring to support cloud-native architectures.

Industry analysts report that their clients face similar challenges. In a recent research report, Gartner said:

"Most APM solutions were designed for a prior generation of applications that were monolithic and long-lived. These approaches are ill-suited to the dynamism, modularity and scale of today’s emerging microservice-based applications."

Practitioners also see the complexity of microservices. This tweet, which went viral because it rings true, is a particularly humorous take on the challenge:

No doubt. Troubleshooting microservices in the middle of the night with traditional APM solutions seems like solving a murder mystery because:

- There is no guarantee that you will see anomalous traces as you begin troubleshooting, because APM solutions sample at random.
- There is no end-to-end distributed system view showing all the services and their interdependencies, or any correlation with highly ephemeral and dynamic infrastructure environments.
- There is no guided troubleshooting, which forces you to manually examine individual traces and find common patterns that may be causing the system-wide performance issue.
- There is no way to know what constitutes normal performance behavior based on historical trends.

In this blog, we will lay out how Splunk APM addresses these gaps with a unique set of capabilities and features:

NoSample™ Architecture

Traditional APM vendors and open-source solutions only capture a small and arbitrary portion of your transactions via probabilistic sampling, leaving you blind to actual issues. To quote one of our customers: “Sampling is the elephant in the war room!”

We tackled one of the biggest shortcomings of traditional APM by rethinking how trace data is sampled. Splunk APM is built on an architecture that we refer to as NoSample™. Unlike other APM tools, Splunk analyzes 100% of transactions throughout your distributed services and intelligently captures errors, anomalies, and outliers. This approach – also known as ‘tail-based sampling’ – is implemented via the Smart Gateway, a highly scalable and intelligent relay that lives in the customer’s environment.

The chart below shows the stark difference between the head-based sampling strategy used by most APM vendors and the Splunk Infrastructure Monitoring NoSample Architecture approach. The NoSample approach captures all outliers so that you don’t miss crucial trace data when you are troubleshooting a performance issue. Our early testing with customers shows that visibility into anomalous and long-tail traces increases by 10x using a tail-based sampling approach.

^{Fig 1: Head-based sampling vs Splunk Infrastructure Monitoring NoSample Architecture approach}
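To make the difference concrete, here is a minimal sketch that contrasts the two strategies on synthetic traces. It is an illustration only and not how the Smart Gateway is implemented: the `Trace` type, the function names, the 500 ms threshold, and the 5% rates are all assumptions made for this example.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: int
    duration_ms: float
    has_error: bool


def head_based_sample(traces, rate, rng):
    """Keep each trace with probability `rate`, without looking at its outcome."""
    return [t for t in traces if rng.random() < rate]


def tail_based_sample(traces, latency_threshold_ms, baseline_rate, rng):
    """Inspect every completed trace: always keep errors and slow outliers,
    plus a small random share of normal traces for context."""
    kept = []
    for t in traces:
        if t.has_error or t.duration_ms > latency_threshold_ms:
            kept.append(t)
        elif rng.random() < baseline_rate:
            kept.append(t)
    return kept


def count_anomalous(traces, latency_threshold_ms):
    """Count traces that are errors or slower than the threshold."""
    return sum(1 for t in traces if t.has_error or t.duration_ms > latency_threshold_ms)


def demo(seed=7, n=10_000):
    """Return (anomalous in all traffic, anomalous kept by head, anomalous kept by tail)."""
    rng = random.Random(seed)
    traces = []
    for i in range(n):
        slow = rng.random() < 0.01  # hypothetical: about 1% of requests are slow
        duration = rng.uniform(800, 2000) if slow else rng.uniform(20, 200)
        traces.append(Trace(i, duration, has_error=rng.random() < 0.005))
    threshold_ms = 500.0  # hypothetical outlier threshold
    total = count_anomalous(traces, threshold_ms)
    head = head_based_sample(traces, 0.05, rng)
    tail = tail_based_sample(traces, threshold_ms, 0.05, rng)
    return total, count_anomalous(head, threshold_ms), count_anomalous(tail, threshold_ms)


if __name__ == "__main__":
    total, head_kept, tail_kept = demo()
    print(f"anomalous traces: {total}, kept by head: {head_kept}, kept by tail: {tail_kept}")
```

The key design point is where the keep-or-drop decision happens. Head-based sampling decides before the outcome of a request is known, so slow and failed traces are dropped at the same rate as healthy ones. A tail-based decision is made after the trace completes, so every error and outlier can be retained.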

The Splunk Infrastructure Monitoring NoSample Architecture ensures that you will have the trace data when you need it the most to troubleshoot end-user issues. Next, you need to narrow down to the right traces quickly to begin incident resolution.

End-to-End Observability in a Single Pane of Glass

Narrowing down whether a performance issue is caused by infrastructure or application code can be like looking for needles in a haystack. Traditional APM tools require you to manually correlate performance issues across different layers of the application stack – resulting in higher MTTR, siloed troubleshooting, war room scenarios, and finger-pointing.

^{Fig 2: Service and Endpoint dashboard with infrastructure correlation}

Splunk APM provides an intuitive, end-to-end service map to quickly isolate the service which is causing the latency spike. You get pre-built dashboards for every service and all endpoints. Built-in infrastructure correlation helps immediately identify the root cause of a performance issue and engage the right team for resolution.

"Splunk Infrastructure Monitoring acts as a single source of truth for our teams. Service dashboards reduce our mean time to engage as they quickly narrow down the performance issue to code or infrastructure and help us engage the relevant team quickly."
- Senior DevOps Manager, Manufacturing Design SaaS Firm

Directed Troubleshooting with Splunk Infrastructure Monitoring Outlier Analyzer™

Tagging metrics and traces with dimensional key-value pairs and labels is a common practice in modern monitoring systems. However, as the number of dimensions grows, traditional APM solutions struggle to search and filter data without incurring performance penalties.

Splunk Infrastructure Monitoring provides a multi-dimensional data model and the industry’s best high-cardinality analytics capabilities, giving you the infinite flexibility to slice and dice trace data and quickly isolate relevant traces and spans.

Cloud-native deployments can be extremely complex to debug and troubleshoot because of the increased number of individual components backing an application. There can be many factors causing the latency of a transaction to go up. Where do you start your troubleshooting efforts? When using existing APM solutions, our customers told us they needed to examine each and every outlier trace and manually correlate among the traces to determine a pattern before starting troubleshooting.

Splunk Infrastructure Monitoring solves this challenge for our customers by using the latest innovations in data science. Outlier Analyzer uncovers patterns relating trace tags to trace durations, highlighting possible explanations for degraded system performance (or slowness in steady state). It can automatically answer questions such as:

- Are the long-tail traces coming from a particular customer segment (whose requests might be large or somehow malformed)?
- Do the slow traces tend to pass through the same (possibly overloaded or misconfigured) load balancer?

^{Fig 3: Outlier Analyzer surfacing most commonly represented patterns in the long tail transactions}
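One simple way to think about relating trace tags to trace durations is to ask which tag values show up more often among slow traces than in traffic overall. The sketch below does exactly that with a plain ratio ("lift"). It is a toy illustration under our own assumptions, not the algorithm behind Outlier Analyzer: the function name, the `(duration_ms, tags)` input shape, and the fixed slowness threshold are all hypothetical.

```python
from collections import Counter


def rank_tags_by_lift(traces, slow_threshold_ms):
    """Rank (key, value) tags by how over-represented they are in slow traces.

    `traces` is a list of (duration_ms, tags) pairs, where `tags` is a dict.
    Lift = share of slow traces carrying the tag / share of all traces carrying it.
    Returns rows of (key, value, lift, slow_trace_count), highest lift first.
    """
    slow = [tags for duration, tags in traces if duration > slow_threshold_ms]
    if not slow:
        return []

    all_counts = Counter(item for _, tags in traces for item in tags.items())
    slow_counts = Counter(item for tags in slow for item in tags.items())

    ranked = []
    for (key, value), slow_n in slow_counts.items():
        share_slow = slow_n / len(slow)
        share_all = all_counts[(key, value)] / len(traces)
        ranked.append((key, value, share_slow / share_all, slow_n))
    ranked.sort(key=lambda row: row[2], reverse=True)
    return ranked


if __name__ == "__main__":
    # Hypothetical traffic: every slow trace passed through load balancer "lb-3".
    traces = (
        [(50, {"lb": "lb-1", "segment": "smb"})] * 4
        + [(60, {"lb": "lb-2", "segment": "smb"})] * 4
        + [(900, {"lb": "lb-3", "segment": "smb"}), (950, {"lb": "lb-3", "segment": "smb"})]
    )
    for key, value, lift, slow_n in rank_tags_by_lift(traces, slow_threshold_ms=500):
        print(f"{key}={value}: lift {lift:.1f} across {slow_n} slow traces")
```

A lift near 1 means the tag is no more common in slow traces than anywhere else, while a high lift points at a candidate explanation worth investigating first. A production system would also need to account for statistical significance and tag combinations, which this sketch ignores.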

Outlier Analyzer offers prescriptive insights to significantly reduce MTTR. One of our customers put it simply: “Before Outlier Analyzer we used to open 50 tabs and try to understand patterns manually.”

Know the Normal: Validate Code Releases with Span and Trace Metricization

When a particular span contributes most to the latency of a trace, how do you determine whether this is normal behavior or a bug introduced in a canary version of your code?

Other APM solutions capture RED metrics at the service level, or at best provide metrics at the root, originating span, giving you a very partial view of your environment.

The Smart Gateway observes every single transaction across distributed services, assembles the traces, and automatically converts all of your traces and spans into metrics. Additionally, it keeps the distribution of performance for each trace execution path, as well as at the span level.
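As a rough illustration of what metricizing spans can mean, the sketch below folds individual spans into request, error, and duration metrics keyed by service and operation. It is a simplified model under our own assumptions rather than the Smart Gateway's implementation: the `Span` and `SpanMetrics` types are hypothetical, and raw durations are kept in a list where a real system would likely use a histogram or sketch.

```python
import math
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Span:
    service: str
    operation: str
    duration_ms: float
    error: bool = False


@dataclass
class SpanMetrics:
    requests: int = 0
    errors: int = 0
    durations_ms: list = field(default_factory=list)

    def percentile(self, p):
        """Nearest-rank percentile of the recorded durations (p from 0 to 100)."""
        if not self.durations_ms:
            raise ValueError("no durations recorded")
        ordered = sorted(self.durations_ms)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]


def metricize(spans):
    """Aggregate spans into per-(service, operation) request, error, and duration metrics."""
    metrics = defaultdict(SpanMetrics)
    for span in spans:
        m = metrics[(span.service, span.operation)]
        m.requests += 1
        m.errors += int(span.error)
        m.durations_ms.append(span.duration_ms)
    return dict(metrics)


if __name__ == "__main__":
    spans = [
        Span("checkout", "charge", 100.0),
        Span("checkout", "charge", 300.0, error=True),
        Span("checkout", "charge", 200.0),
        Span("cart", "add", 20.0),
    ]
    for (service, operation), m in metricize(spans).items():
        print(service, operation, m.requests, m.errors, m.percentile(50))
```

Because the metrics are keyed per span rather than per service, the duration distribution of a single operation can be tracked over time instead of being averaged away at the service level.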

^{Fig 4: Span performance details with historical comparison alongside infrastructure correlation – all within the trace context}

Metricization gives you out-of-the-box, real-time visibility into the health of deployed microservices, as well as the historical performance trends at the span level. You can quickly determine how a new code release performs compared to historical baselines and automatically identify what is contributing the most to the latency of your transactions – down to the specific line of code.

In short, span-level metricization enables you to understand what constitutes normal performance behavior for any span or trace.
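Once span durations are available as metrics, checking a canary against its history becomes a comparison of two distributions. The sketch below flags a regression when the canary's 90th-percentile latency exceeds the historical one by a chosen factor. The function names and the 1.25x tolerance are assumptions made for this illustration, not product behavior.

```python
import statistics


def p90(durations_ms):
    """Estimate the 90th percentile using the standard library's quantiles()."""
    return statistics.quantiles(durations_ms, n=10)[-1]


def regressed(baseline_ms, canary_ms, tolerance=1.25):
    """Return True when the canary p90 is more than `tolerance` times the baseline p90."""
    return p90(canary_ms) > tolerance * p90(baseline_ms)


if __name__ == "__main__":
    # Hypothetical span durations for one operation, in milliseconds.
    baseline = [100 + i for i in range(50)]
    healthy_canary = [102 + i for i in range(50)]
    slow_canary = [100 + i for i in range(40)] + [400] * 10
    print("healthy canary regressed:", regressed(baseline, healthy_canary))
    print("slow canary regressed:", regressed(baseline, slow_canary))
```

Comparing a high percentile rather than the mean keeps the check sensitive to long-tail slowdowns that affect only a fraction of requests.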

Splunk Infrastructure Monitoring does all of these things to expedite the incident response process and significantly reduce MTTR, while giving our customers complete flexibility in instrumentation so they can remain vendor-neutral. You can choose any one or a combination of the following instrumentation methods:

- Open instrumentation standards: OpenTracing, Zipkin, OpenCensus
- Service meshes such as Istio, Envoy, or Linkerd
- Function wrappers to monitor serverless functions such as AWS Lambda

Additionally, Splunk Infrastructure Monitoring Auto Instrumentation agents and libraries, built upon open standards, provide automatic instrumentation for the most commonly used open source packages and frameworks.

It’s no secret that every company is a software company today, and software is driving new digital business initiatives. It is also true that distributed systems are much more complex compared to monolithic environments. Today, we have taken a huge step toward helping our customers successfully adopt cloud-native architectures by cutting through microservices complexity with Splunk APM.

Learn More

Splunk APM is feature-packed. We’ve just scratched the surface here and can’t wait to show you all the features we’ve built that will accelerate your journey to cloud-native.

- Learn how our customers are already leveraging Microservices APM to diagnose the root cause of their issues and drive down MTTR

Check out the key features of Splunk APM in the video below:

This post contains contributions from Maxime Petazzoni and Amit Sharma.

