When all your IT systems, your apps and software, and your people are spread out, you need a way to see what’s happening in all these minute and separate interactions. That’s exactly what distributed tracing does.
Distributed tracing is a way of tracking requests in applications and how those requests move from users and frontend devices through to backend services and databases.
Distributed tracing enables you to track requests or transactions through any application you monitor — giving you vital information that supports uptime, issue and incident resolution, ongoing optimization and, ultimately, a pristine user and customer experience.
In this article, let’s take a long look at distributed tracing and the technologies used to make it possible in your enterprise.
Managing modern software environments hinges on the three “pillars of observability”: logs, metrics and traces. (Sometimes events are included as a fourth, giving the acronym MELT.) Each of these is a data source that provides crucial visibility into applications and the infrastructure hosting them. Collecting and analyzing this data is often known as application performance monitoring (APM), one type of IT monitoring.
For many IT operations and site reliability engineering (SRE) teams, two of these pillars, logs and metrics, are familiar enough. For decades, teams have analyzed logs and metrics in order to:
It’s the third pillar, traces, that may be less familiar.
Tracing is not a new concept. A trace is a collection of transactions (spans) that represent a unique user or API transaction that is handled by an application and its constituent services.
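As a rough illustration of that definition, a trace can be modeled as a set of spans that all carry the same trace ID. The sketch below is not taken from any particular tracing library: the class names, field names and the example operation names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional
import uuid


@dataclass
class Span:
    """One operation performed while handling a request."""
    trace_id: str                      # shared by every span in the same trace
    name: str                          # hypothetical operation name
    parent_id: Optional[str] = None    # None marks the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])


@dataclass
class Trace:
    """All spans recorded for a single user or API transaction."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    def start_span(self, name, parent=None):
        # Every new span inherits the trace ID and points at its parent.
        span = Span(self.trace_id, name, parent.span_id if parent else None)
        self.spans.append(span)
        return span


trace = Trace()
root = trace.start_span("frontend: POST /cart")
child = trace.start_span("cart-service: add_item", parent=root)
print(len(trace.spans), child.parent_id == root.span_id)
```

The parent reference is what later lets a tool rebuild the order in which services handled the request.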
Tracing starts the moment a user interacts with an application. You send an initial request (adding an item to your cart, for example) and that request is assigned a unique trace ID. One trace represents one user interaction. As the request moves through the host system, every operation performed on it (a span) is tagged with a few items:
The trace is made up of a collection of spans — each span a single operation, which contains:
Each span represents one segment of the request’s path. So, each span includes important information related to the service performing the operation, such as:
Teams who develop and manage monolithic applications have long used traces to understand the performance of applications:
The fundamental goal behind tracing, understanding transactions, is always the same. But the way we do it changes. Traditional tracing doesn’t work, though, when used with applications built on a distributed software architecture, such as microservices.
Let’s first look at traditional tracing, how it used to work, and then we can start to understand why this isn’t a great solution today.
In old-school applications that ran as monoliths, tracing was possible, but the need to understand what was happening was less important: There were fewer moving parts through which requests had to flow as the application processed them. So, less to trace.
The tracing tools that did exist performed probabilistic sampling. This captures only a small — and arbitrary — portion of all transactions. Probabilistic sampling provides a little insight into what is happening. But because it’s only taking samples of transactions, not looking at all of them, you don’t have full visibility. For example, tracing with sampling would, at best, allow IT and SRE teams to:
This approach, however, would not yield more nuanced performance trends. It certainly cannot scale enough to measure the thousands of distributed services in a transient containerized environment. For example:
(You can quickly see that this won’t work with distributed systems, whose ephemeral nature triggers unrelated alerts that complicate troubleshooting. For example, suppose an EC2 node fails and another replaces it. If the failure affects only one user request, is that worth alerting about? As a result, the team would not identify these issues until they grew into major disruptions.)
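To make the limitation of probabilistic sampling more concrete, here is a minimal simulation. The traffic volume, the failure pattern and the 1% sampling rate are all made-up assumptions. It is only meant to show that keeping a random fraction of traces can easily miss rare failures.

```python
import random


def sample_traces(traces, rate, seed=0):
    """Keep each trace independently with probability `rate`."""
    rng = random.Random(seed)
    return [t for t in traces if rng.random() < rate]


# Simulated traffic: 10,000 requests, 5 of which fail (hypothetical numbers).
traces = [{"id": i, "error": i % 2000 == 0} for i in range(10_000)]

sampled = sample_traces(traces, rate=0.01)
errors_total = sum(t["error"] for t in traces)
errors_seen = sum(t["error"] for t in sampled)
print(f"kept {len(sampled)} of {len(traces)} traces; "
      f"saw {errors_seen} of {errors_total} errors")
```

With a low sampling rate, most or all of the failing requests are usually absent from the sample, which is the visibility gap described above.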
Now, this might be all well and good for you if you’re working in a static, monolithic environment. But chances are, your environment is significantly more complex and constantly changing.
A single user transaction fans out to interact with tens or hundreds of microservices, each one requesting data from backend data stores, interacting with each other via APIs, or interacting with other parts of your infrastructure. (Think of it as software using software.)
By design, microservices scale independently from one another. It’s perfectly normal to have multiple instances of a single service running at the same time on different servers, in different locations and different environments. And each microservice can run on multi-layered stacks made up of:
Traditional monitoring — looking at samples or a single body of code — just won’t cut it. To understand request flows within a web of microservices, you must collect additional data, such as:
Because traces are distributed across different services, this process of tagging spans and correlating them is called “distributed tracing”.
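One common way to carry that tagging across service boundaries is to pass the trace ID along in a request header. The sketch below hand-builds a header in the W3C Trace Context `traceparent` format (version, trace ID, parent span ID, flags). The helper names are hypothetical, and a real application would normally rely on its tracing library to do this rather than on hand-written code.

```python
import secrets


def new_traceparent():
    """Start a new trace: 32 hex chars of trace ID, 16 of span ID."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-01"   # "01" = sampled flag


def child_traceparent(incoming):
    """Keep the trace ID, mint a new span ID, remember the caller's span."""
    version, trace_id, parent_span_id, flags = incoming.split("-")
    outgoing = f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"
    return outgoing, parent_span_id


# The frontend starts the trace and sends the header with its request.
headers = {"traceparent": new_traceparent()}

# A backend service receives the header and makes its own downstream call.
downstream, parent_id = child_traceparent(headers["traceparent"])
print(headers["traceparent"])
print(downstream)
```

Because the trace ID stays the same on every hop, spans recorded by different services can later be correlated into one trace.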
Imagine a popular online video game with millions of users. This game must keep track of each player’s location, every interaction they have with each other, the items they pick up in the game and a variety of other data generated during play. Keeping the game running smoothly would be unthinkable with traditional tracing methods — distributed tracing makes it possible.
Distributed tracing follows a request (transaction) as it moves between multiple services within a microservices architecture, so you can identify where the service request originates (the user-facing frontend application) and follow it throughout its journey with other services.
As an example of distributed tracing, imagine a collection of microservices in a standard modern application:
In this environment, a distributed trace of the user’s request would start by recording information about the request’s status on the first frontend service — which data the user inputs and how long it takes the service to forward that data to other services.
The next touchpoint in the trace would involve the backend services, which accept the input and perform any necessary data processing, like ETL or, increasingly, ELT. Then, the backend services transfer the processed data to the database service, which stores it.
Let’s consider a simple client-server application:
Within the context of the client, a single action has occurred. The client sent a request and got a response — but we observe each server request generated as a result of this client request in a span. As the client performs different transactions with the server in the context of the application, more spans are generated. We can correlate them together within a trace context.
The trace context is the glue that holds the spans together. Take a look at this breakdown:
Note that the trace context remains the same, tying each span together and letting the infrastructure know that each span belongs to the same transaction.
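The correlation step can be sketched in a few lines. Assuming each span arrives as a record with a trace ID, a span ID and a parent ID (the field names and sample spans here are invented for illustration), grouping by trace ID recovers each transaction, and the parent links recover its shape.

```python
from collections import defaultdict

# Spans as they might arrive from different services, interleaved.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "client request"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "server: auth"},
    {"trace_id": "t2", "span_id": "x", "parent_id": None, "name": "client request"},
    {"trace_id": "t1", "span_id": "c", "parent_id": "a", "name": "server: query"},
]


def group_by_trace(spans):
    """Bucket spans by the trace context they share."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    return dict(traces)


def render(trace_spans, parent_id=None, depth=0):
    """Return the spans of one trace as an indented parent/child outline."""
    lines = []
    for span in trace_spans:
        if span["parent_id"] == parent_id:
            lines.append("  " * depth + span["name"])
            lines.extend(render(trace_spans, span["span_id"], depth + 1))
    return lines


traces = group_by_trace(spans)
print("\n".join(render(traces["t1"])))
```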
A distributed tracing tool first correlates the data from all the spans and then formats them into visualizations that are available either:
To gather traces, your applications must first be instrumented. Instrumenting an application requires using a framework, like OpenTelemetry, to generate traces and measure application performance. This information is super helpful: now you can easily locate bottlenecks and quit spending time seeking them out.
As long as you’re using a vendor-agnostic framework, like OTel, you’ll only have to do instrumentation work once.
Once your application has been instrumented, you’ll want to begin collecting this telemetry using a collector.
The Splunk OpenTelemetry collector is a great example. The collector provides a unified way to receive, process, and export application telemetry to an analysis tool like Splunk APM, where you can:
Can you guess which microservice is ultimately responsible for the errors in the application? With Splunk, you can go even deeper than this to determine which version of paymentService is responsible.
The main benefit of distributed tracing is visibility into real user transactions in one place. Complexity doesn’t matter. Some benefits that come from a more holistic approach include:
By monitoring the request’s status and performance characteristics on all these services, SREs and IT teams can pinpoint the source of performance issues. You can track specific, minute details to know the responsiveness of each individual service in order to determine, for example, that:
The disjointed nature of microservice architectures makes application performance monitoring — including functions such as tracking down and fixing performance issues — time consuming and expensive compared to monolithic applications.
Additionally, the way failure data is delivered in microservices isn’t always clear and often requires developers to decipher issues from error messages and arcane status codes. Distributed tracing provides a more holistic view of distributed systems that:
All this ultimately improves your mean time to recovery/repair (MTTR).
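As a simplified picture of how trace data shortens troubleshooting, the sketch below takes the spans of one failing request and asks two questions: where was the time actually spent, and which errored span has no errored children (the deepest point of failure)? The span records, durations and most service names are invented. The self-time calculation also assumes child operations run one after another rather than in parallel.

```python
# Spans from one hypothetical failing request.
spans = [
    {"id": "1", "parent": None, "service": "frontend", "duration_ms": 480, "error": True},
    {"id": "2", "parent": "1", "service": "checkoutService", "duration_ms": 450, "error": True},
    {"id": "3", "parent": "2", "service": "paymentService", "duration_ms": 400, "error": True},
    {"id": "4", "parent": "2", "service": "inventoryService", "duration_ms": 30, "error": False},
]


def self_time_by_service(spans):
    """Time spent in each service, excluding time spent waiting on children."""
    child_total = {}
    for s in spans:
        if s["parent"] is not None:
            child_total[s["parent"]] = child_total.get(s["parent"], 0) + s["duration_ms"]
    totals = {}
    for s in spans:
        own = s["duration_ms"] - child_total.get(s["id"], 0)
        totals[s["service"]] = totals.get(s["service"], 0) + own
    return totals


def root_cause_errors(spans):
    """Services whose span errored without any errored child span."""
    errored_parents = {s["parent"] for s in spans if s["error"]}
    return [s["service"] for s in spans
            if s["error"] and s["id"] not in errored_parents]


times = self_time_by_service(spans)
print(max(times, key=times.get), root_cause_errors(spans))
```

Errors tend to propagate upward to every caller, so filtering for the deepest errored span is one simple heuristic for separating the source of a failure from the services that merely reported it.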
In a microservice environment, each process is generally the responsibility of a particular team. This can cause problems when it becomes necessary to identify errors and determine who is responsible for fixing them. In this context, distributed tracing helps to:
Distributed tracing tools work with a wide variety of applications and programming languages, so developers can incorporate them into virtually any system and view data through one tracing application.
Code tracing refers to a programmer interpreting the results of each line of code in an application and recording its effect by hand, instead of using a debugger (which automates the process), to trace a program’s execution.
Manually tracing small blocks of code can be more efficient because the programmer doesn’t need to run the entire program to identify the effects of small edits.
Data tracing helps check the accuracy and data quality of critical data elements (CDEs), trace them back to their source systems, and monitor and manage them using statistical methods. Typically, the best way to perform accuracy checks is to trace operations to their origins and validate them with source data, although historically this hasn’t been cost-effective in large operational processes.
A program or stack trace is an index of the instructions executed and data referenced during the running of an application. The information displayed in a program trace — the program name, language, and the source statement that was executed, among other data — is used in the debugging process.
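For a concrete feel of what a stack trace records, Python's standard `traceback` module can list the chain of function calls that led to an exception. The function names below are made up for the example.

```python
import traceback


def load_cart():
    raise ValueError("cart not found")


def handle_request():
    load_cart()


try:
    handle_request()
except ValueError as exc:
    # Each frame records the file, line number, function name and source line.
    frames = traceback.extract_tb(exc.__traceback__)
    call_path = [frame.name for frame in frames]
    print(" -> ".join(call_path))
```

The last entries in `call_path` are the functions closest to the failure, which is usually where debugging starts.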
With applications written in different programming languages, built from distributed microservices, and developed by people all over the globe, it helps to have an open, vendor-agnostic framework you can use to instrument your applications.
For many languages, OpenTelemetry provides automatic instrumentation of your application, whereas others must be manually instrumented.
Jaeger and Zipkin are two popular open-source request tracing tools, each with similar components: a collector, datastore, query API, and web user interface.
Both work like this: Outgoing requests are traced along with the application. The collector then records and correlates the data between different traces and sends it to a database where users can query and analyze it via the UI.
Jaeger and Zipkin are differentiated by their architecture and programming language support: Jaeger is implemented in Go, Zipkin in Java. Zipkin supports virtually every programming language with dedicated libraries for Java, JavaScript, C, C++, C#, Python, Go, Scala, and others. Jaeger’s supported-language list is shorter: C#, Java, Node.js, Python, and Go.
Kafka is a distributed event streaming platform, providing a high-throughput, low-latency platform for handling real-time data feeds, often used in microservice architectures. It’s used to:
Kafka uses “topics” — a category or feed name to which records are published — to abstract streams of records. For each topic, Kafka maintains a partitioned log, an ordered, continually appended sequence of records that can serve as an external commit log for a distributed system.
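The idea of a partitioned, append-only log can be sketched in memory as follows. This is a toy model, not the Kafka client API: the `Topic` class, its methods and the topic name are hypothetical, and the CRC32-based key hashing is an assumption chosen for simplicity (Kafka's own partitioner uses a different hash).

```python
import zlib


class Topic:
    """A named feed of records, split into append-only partitions."""

    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Records with the same key always land in the same partition,
        # so their relative order is preserved.
        index = zlib.crc32(key.encode()) % len(self.partitions)
        log = self.partitions[index]
        log.append((key, value))
        return index, len(log) - 1   # (partition, offset)

    def read(self, partition, offset=0):
        # Consumers read forward from an offset; records are never modified.
        return self.partitions[partition][offset:]


topic = Topic("span-events")
p1, o1 = topic.publish("trace-42", "span started")
p2, o2 = topic.publish("trace-42", "span finished")
print(topic.read(p1))
```

Because each partition is only ever appended to, a consumer that remembers its last offset can resume reading exactly where it left off.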
Using microservices for cloud-based apps is the norm today. Distributed request tracing offers a huge advantage over the older, needle-in-a-haystack approach to tracking down the problems that could disrupt your services.
If you’re responsible for a microservice-based system, equipping your enterprise with this powerful technology will optimize how you do your job. Try distributed tracing with Splunk Observability. Try it for free to get a real-time view of your tracing telemetry and start solving problems faster.