Gaining control over complex distributed systems means understanding many different indicators of performance. One key to understanding these systems is adding cardinality to our metrics, which provides further information about our distributed systems’ overall health and performance. Developers rely on the telemetry captured from these distributed workloads to determine what really went wrong and to place it in context.
OpenTelemetry allows us to easily capture metrics from our applications and add custom dimensions for later analysis. In this post, I will explain how to use annotations to add contextual information about your distributed workloads to your captured measurements. For example, you can add a version annotation to a metric to trivially find all requests made by one particular version anywhere in your application.
OpenTelemetry data pipelines are built with the OpenTelemetry Collector, which is responsible for aggregating workload telemetry and exporting this data to an analysis system such as Splunk or an open-source one like Prometheus. I’ll provide a brief introduction to annotations and the configuration of the OpenTelemetry Collector below.
Annotations, also known as tags, are key-value pairs of data associated with recorded measurements; they provide contextual information and let you distinguish and group metrics during analysis and inspection. When measurements are aggregated into metrics, annotations are used as labels to break the metrics down. Let’s take a look at real examples of adding annotations using the Splunk distribution of the OpenTelemetry Collector.
The OpenTelemetry Collector configuration file is written in YAML, and a full pipeline contains the following components: receivers, processors, and exporters.
Each of these components is defined within its respective section and must also be enabled within the service (pipelines) section.
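For orientation, here is a minimal sketch of that overall structure. The otlp receiver, batch processor, and sapm exporter shown here are illustrative placeholders rather than part of the examples that follow; your distribution’s defaults may differ.

receivers:
  otlp:
    protocols:
      grpc:

processors:
  batch:

exporters:
  sapm:
    access_token: "${SPLUNK_ACCESS_TOKEN}"
    endpoint: "https://ingest.us0.signalfx.com/v2/trace"   # us0 realm used only as an example

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [sapm]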
Adding a deployment environment to our workloads is simple: it only requires adding the resource/add_environment processor to the Splunk OpenTelemetry Collector’s configuration file. The resource/add_environment processor adds the deployment.environment annotation to all spans to help you quickly identify your workloads within your analysis system, like Splunk APM.
The addition to the processors section of the configuration file below sets the deployment.environment annotation to CloudProduction, our specific deployment environment.
processors:
  resourcedetection:
    detectors: [system,env,gce,ec2]
    override: true
  resource/add_environment:
    attributes:
      - action: insert
        value: CloudProduction
        key: deployment.environment
We then enable the resource/add_environment processor in the pipelines section of the configuration file for our traces and logs.
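As a rough sketch, the relevant part of the service section might look like the following. The otlp receiver and the sapm and splunk_hec exporters are assumptions standing in for whatever receivers and exporters your configuration already defines.

service:
  pipelines:
    traces:
      receivers: [otlp]
      # resourcedetection runs first, then resource/add_environment stamps deployment.environment
      processors: [resourcedetection, resource/add_environment, batch]
      exporters: [sapm]        # assumed trace exporter
    logs:
      receivers: [otlp]
      processors: [resourcedetection, resource/add_environment, batch]
      exporters: [splunk_hec]  # assumed log exporter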
With this configuration in place, the Splunk APM console now shows the CloudProduction annotation and lets you filter throughout the backend by which environment handled each request. This annotation is part of the default Troubleshooting MetricSets, which Splunk APM indexes automatically.
In addition to the deployment environment, any other annotation can be added to help identify application performance bottlenecks. This can be done using the attributes/newenvironment processor, which adds an annotation to any span that doesn’t already have it. This is particularly useful for adding metadata to your spans, like version numbers or the deployment color when using blue/green deployments. Implementing the attributes/newenvironment processor works the same way as the resource/add_environment processor or any other OpenTelemetry Collector processor. Let’s illustrate this with another example showing what the attributes/newenvironment processor and the resource/add_environment processor look like as part of the same configuration.
In the configuration file below, you can see the attributes/newenvironment processor added to the previous configuration to include both the version of our microservice application and deployment color.
processors:
  resourcedetection:
    detectors: [system,env,gce,ec2]
    override: true
  resource/add_environment:
    attributes:
      - action: insert
        value: CloudProduction
        key: deployment.environment
  attributes/newenvironment:
    actions:
      - key: version
        value: "v1.0.1"
        action: insert
      - key: deploymentcolor
        value: "green"
        action: insert
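As before, the new processor takes effect only after it is referenced in the pipelines section. A sketch of the traces pipeline with both processors enabled might look like this, with the receiver and exporter names again being assumed placeholders:

service:
  pipelines:
    traces:
      receivers: [otlp]
      # annotations are added in list order: the environment first, then version and deployment color
      processors: [resourcedetection, resource/add_environment, attributes/newenvironment, batch]
      exporters: [sapm]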
When we look at the trace in Splunk APM, we see that the version and deployment color are now included as part of each span collected for our microservice application.
Adding annotations to our spans adds cardinality to our telemetry, ultimately allowing us to better understand our application and get answers to what went wrong and why. For example, with Splunk APM, we can create MetricSets, which are categories of metrics about traces and spans that you can use for real-time monitoring and troubleshooting. MetricSets are specific to Splunk but are effectively aggregates of metrics and metric time series, enabling you to populate charts and generate alerts. Creating custom MetricSets from the annotations in our examples allows us to use specific filters to narrow down any bottleneck affecting application performance. For example, with Splunk Infrastructure Monitoring, we can narrow down all hosts belonging to a given application environment, such as a region or datacenter. The screenshot below shows how we used the annotation for our deployment environment, CloudProduction, as a filter to create a custom dashboard showing all hosts within the CloudProduction environment.
Since all of our data is tagged with these annotations and created as MetricSets, we can also use them within Splunk APM. You can see from the example screenshots below that the annotations are now available as part of Splunk APM’s Tag Spotlight and Dynamic Service Map.
This lets you filter your application telemetry by these annotations, get a clear map of service dependencies, and find the granular trends contributing to possible application performance issues. Overall, adding custom annotations to your traces helps you narrow your data down to what best fits your application's development and deployment, ultimately reducing your MTTR.
With the ability to annotate metrics in the way that best fits your organization, you can locate what you're looking for within your cloud-native deployments far more quickly, and you no longer need to worry about limitations in identifying just where application bottlenecks may be.
Want to try working with OpenTelemetry yourself? You can sign up to start a free trial of the suite of products – from Infrastructure Monitoring and APM to Real User Monitoring and Log Observer. Get a real-time view of your infrastructure and start solving problems with your microservices faster today.
----------------------------------------------------
Thanks!
Johnathan Campos