Docker shook the DevOps world a couple of years ago. Cloud-ready containers brought production operations closer to development and helped make microservices the backbone of a more flexible, agile approach to building software. The Docker movement gives product teams more freedom in their technology choices, since they’re empowered to deploy and manage their applications in production themselves. However, operationalizing Docker can also mean more complexity, an abundance of infrastructure and application data, and a greater need for monitoring and alerting on the production environment.
Splunk Infrastructure Monitoring has been running Docker containers in production since 2013. Every single application we manage executes within a Docker container. Along the way, we’ve learned how to monitor our Docker-based infrastructure and how to get maximum visibility into our applications, wherever and however they run.
This is the first in a series of blogs on monitoring Docker containers. In this post, I’ll discuss what’s important to monitoring Dockerized environments, how to collect container metrics you care about, and your options for collecting application metrics.
Even as IT, operations, and engineering orgs come together around the value of and objectives for containers, one question endures: “How do I monitor Docker in my production environment?” The confusion comes from asking the wrong question. Monitoring the Docker daemon, the Kubernetes master, or even the Mesos scheduler isn’t complicated; it needs to be done, and there are solutions for each of these.
Running your applications in Docker containers only really changes how they are packaged, scheduled, and orchestrated—not how they run. The question we should be asking then becomes: “How does Docker change how I monitor my applications?”
The answer, as is so often the case, is “it depends.” It depends on your environment and is shaped by your use case and objectives.
To understand what a microservices regime and a Dockerized environment might mean for your monitoring strategy, you should first answer a few simple questions about each application. Your answers may differ from one application to the next, and your approach to monitoring should reflect those differences.
If you need system-level metrics from your containers, Docker has you covered. The Docker daemon exposes very detailed metrics about CPU, memory, network, and I/O usage that are available for each running container via the /stats endpoint of Docker’s remote API. Whether or not you plan on collecting application-level metrics, you should definitely get your containers’ metrics first.
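You can see this raw data for yourself by querying the Docker daemon directly. A quick sketch (the container name is a placeholder, and the --unix-socket option requires a reasonably recent curl):

```
# Fetch a single point-in-time snapshot of a container's CPU, memory,
# network, and block I/O statistics from the Docker remote API.
# "my-container" is a placeholder for one of your container names or IDs.
curl --unix-socket /var/run/docker.sock \
  "http://localhost/containers/my-container/stats?stream=false"
```

Without stream=false, the /stats endpoint keeps the connection open and streams a new sample every second.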
The best way to collect those metrics and send them to your monitoring system is to use collectd and the docker-collectd-plugin. For more information, check out our introductory blog post on Monitoring Docker at Scale with Splunk Infrastructure Monitoring.
The simplest and most reliable way of getting metrics from all your containers is to run collectd on each host that has a Docker daemon, with the docker-collectd-plugin configured to talk to the local daemon.
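A minimal configuration sketch, following the docker-collectd-plugin README (the install path below is an assumption; adjust it to wherever the plugin lives on your hosts):

```
# Types definitions shipped with the docker-collectd-plugin
TypesDB "/usr/share/collectd/docker-collectd-plugin/dockerplugin.db"
LoadPlugin python

<Plugin python>
  ModulePath "/usr/share/collectd/docker-collectd-plugin"
  Import "dockerplugin"

  <Module dockerplugin>
    # Talk to the local Docker daemon over its unix socket
    BaseURL "unix://var/run/docker.sock"
    Timeout 3
  </Module>
</Plugin>
```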
If you’re using Docker Swarm, the Swarm API endpoint exposes the full Docker remote API and reports data for all the containers running in the swarm. This means you only need a single collectd instance with the docker-collectd-plugin pointed at the Swarm manager’s API endpoint to collect container metrics from every container started on your Swarm nodes.
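The configuration is the same as above; only the BaseURL changes, pointing at the Swarm manager rather than a local daemon (the hostname and port below are placeholders for your environment):

```
<Module dockerplugin>
  # Point at the Swarm manager's API endpoint instead of a local daemon.
  # "swarm-manager.example.com:2375" is a placeholder for your environment.
  BaseURL "tcp://swarm-manager.example.com:2375"
  Timeout 3
</Module>
```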
Once you have your container metrics flowing to your monitoring system, you can build charts and dashboards to visualize the performance of your containers and your infrastructure. Learn about the metrics collected by the docker-collectd-plugin here.
If your monitoring system is Splunk Infrastructure Monitoring, we automatically discover these metrics and provide curated, built-in dashboards to show your Docker infrastructure from cluster to host to container.
A key challenge with collecting application metrics from Dockerized applications is locating the source of the data. If your applications don’t automatically push metrics to a remote endpoint, you need to know what runs where, what metrics to poll, and how to poll those metrics from your applications.
For first-party software, I strongly recommend that you make your application report its own metrics. Most code instrumentation libraries already work this way, and if yours doesn’t, you should be able to add this functionality to your codebase easily. Just make sure the remote endpoint is easily and, if possible, dynamically configurable.
In Java, for example, Codahale/Dropwizard Metrics is a popular library that we recommend for instrumenting Java programs. To set it up to report metrics to Splunk Infrastructure Monitoring, include our signalfx-java client library and add a few lines of setup to your application.
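Here’s a minimal sketch following the signalfx-java client’s README; the auth token, reporting interval, and metric name are placeholders:

```java
import java.util.concurrent.TimeUnit;

import com.codahale.metrics.MetricRegistry;
import com.signalfx.codahale.reporter.SignalFxReporter;

public class MetricsSetup {
    public static void main(String[] args) {
        // Registry that your application's counters, timers, etc. attach to
        MetricRegistry registry = new MetricRegistry();

        // Report everything in the registry to Splunk Infrastructure Monitoring.
        // "YOUR_AUTH_TOKEN" is a placeholder for your organization's access token.
        SignalFxReporter reporter =
                new SignalFxReporter.Builder(registry, "YOUR_AUTH_TOKEN").build();
        reporter.start(1, TimeUnit.SECONDS);

        // Instrument as usual; for example, count handled requests
        registry.counter("requests.handled").inc();
    }
}
```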
Similar solutions exist for Python, Go, Ruby, and more.
Third-party software is where collecting metrics becomes much trickier. Most of the time, the application you want to monitor is not capable of pushing metrics data to an external endpoint. You have to poll those metrics directly from the application, from JMX, or even from logs. In Dockerized environments, this makes configuring your monitoring system quite challenging, depending on whether you have a static container placement or use some form of dynamic container scheduling.
Knowing the placement of your application containers, whether by configuration or by convention, makes it easier to collect metrics from those applications: simply configure collectd, on each host or from a central location, to start polling them.
Depending on the application, you may have to expose additional TCP ports to reach whichever endpoint the application exposes its metrics through. In some cases, such as Kafka, you’ll need to enable and expose JMX. For others, like Elasticsearch and ZooKeeper, metrics are available directly from a specific API endpoint.
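For example, with Elasticsearch you might publish the container’s HTTP port when you start it and let your collector poll the node stats API. A sketch (the image, container name, and port mapping are assumptions for illustration):

```
# Publish Elasticsearch's HTTP port so a collector outside the container can reach it
docker run -d --name es -p 9200:9200 elasticsearch

# Node-level stats (JVM, indices, thread pools, ...) from the REST API
curl "http://localhost:9200/_nodes/stats"
```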
If you use a dynamic container scheduler such as Kubernetes or Mesos + Marathon, it’s very likely that you don’t entirely control where your applications execute. Even if your applications leverage service discovery, it can be very difficult to bridge the gap between your metrics collection and monitoring systems. The same problem arises when using serverless infrastructures or pure container hosting providers.
We see three possible solutions to this problem. None is perfect if you want to stay close to the doctrine of lightweight Docker images that execute a single application binary inside the running container, but all three provide a starting point for bridging the gap between metrics collection and monitoring systems.
Monitoring Docker itself and getting system-level metrics from your containers is easy with the docker-collectd-plugin. Monitoring the applications that you run inside your Docker containers is where it gets more complex and where the confusion around monitoring Docker comes from.
In the second part of the Monitoring Docker Containers series, we’ll discuss how Splunk Infrastructure Monitoring monitors its containerized infrastructure, the tools used to orchestrate across our various environments, and how we get visibility across all layers of the infrastructure.
To learn more, check out our webinar with Zenefits on operationalizing Docker and orchestrating microservices. I shared lessons from running Docker at scale during the past three years, including what metrics matter for monitoring, how to assign data dimensions for troubleshooting, and strategies for alerting on microservices running in Docker containers.
Learn more about Splunk Infrastructure Monitoring and get a 14-day free trial!