Today’s business is powered by data. Success in the digital world depends on how quickly data can be collected, analyzed and acted upon. The faster the speed of data-driven insights, the more agile and responsive a business can become. Apache Kafka has emerged as a popular open-source stream-processing solution for collecting, storing, processing and analyzing data at scale.
Kafka is a distributed event streaming platform that nowadays is typically deployed on distributed, dynamic and ephemeral infrastructure such as Kubernetes. These distributed, cloud-native systems, while boosting agility and enabling efficient scalability, also introduce operational complexities. Decoupled or loosely coupled components often make it challenging to make sense of complex interdependencies, detect the source of performance bottlenecks, and correlate insights to understand the why behind performance anomalies.
In this blog series, we take a deep dive into Kafka architecture, the key performance characteristics that you should monitor and how to collect telemetry data to gain real-time observability into the health and performance of your Kafka cluster using Splunk.
Kafka leverages two key capabilities to implement event processing in real-time:
Publish/subscribe messaging is a pattern where the sender of data messages is decoupled from, and agnostic of, the receiver of the data. Instead, the publisher characterizes the message with metadata and the subscriber “picks up” messages of interest.
Applications that publish data messages are called Producers, while applications that subscribe (receive and process messages) are called Consumers. Kafka brokers act as intermediaries between producer and consumer applications. Brokers are designed to operate as part of a cluster. Within the cluster, one broker also functions as the cluster controller for administrative purposes such as monitoring broker failures.
Related messages are organized and stored in Topics. Producers publish messages to one or more relevant topics, and consumers subscribe to those topics and read the messages. Topics themselves are divided into one or more partitions, which form the unit of parallelism. Each partition can be placed on a separate machine and assigned to a broker, allowing multiple consumers to read in parallel. Multiple consumers can also read from multiple partitions in a topic, resulting in high message-processing throughput. Although a partition may be replicated across multiple brokers for redundancy and high availability, each partition is “owned” by a single broker in the cluster, known as the leader of the partition.
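To make the pattern concrete, here is a minimal sketch of a producer and a consumer, assuming the confluent-kafka Python client; the broker address, topic name and consumer group below are placeholders for illustration:

```python
# Minimal publish/subscribe sketch with the confluent-kafka Python client.
# Broker address, topic and group id are illustrative placeholders.
from confluent_kafka import Producer, Consumer

BOOTSTRAP = "localhost:9092"   # assumed broker address
TOPIC = "orders"               # hypothetical topic

# Producer: publishes messages to a topic without knowing who will read them.
producer = Producer({"bootstrap.servers": BOOTSTRAP})
producer.produce(TOPIC, key="order-1", value="created")
producer.flush()  # block until the message is acknowledged by the partition leader

# Consumer: subscribes to the topic and "picks up" messages of interest.
consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": "billing-service",      # hypothetical consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe([TOPIC])
msg = consumer.poll(5.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value(), "from partition", msg.partition())
consumer.close()
```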
Kafka writes messages to only one replica: the partition leader. Follower replicas obtain copies of the messages from the leader. Consumers may read either from the partition leader or from a follower. This architecture distributes the read request load across the fleet of replicas.
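As an illustration of partitions and replicas, the sketch below creates a topic with six partitions and a replication factor of three, assuming the confluent-kafka AdminClient; the topic name and counts are placeholders, not a recommendation:

```python
# Sketch: create a topic with multiple partitions and replicas using
# confluent-kafka's AdminClient. Values are illustrative only.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# 6 partitions allow parallel consumption; replication_factor=3 keeps a
# leader plus two follower replicas for each partition.
futures = admin.create_topics([NewTopic("orders", num_partitions=6, replication_factor=3)])
for topic, future in futures.items():
    try:
        future.result()  # raises an exception if topic creation failed
        print(f"created topic {topic}")
    except Exception as exc:
        print(f"failed to create {topic}: {exc}")
```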
There is one additional component, ZooKeeper, which keeps track of the status of the Kafka cluster. To reduce this complexity, the community is moving to replace ZooKeeper with a metadata quorum. The Kafka 2.8 release introduced an early-access look at Kafka without ZooKeeper; however, it is not considered feature complete, and running Kafka without ZooKeeper in production is not yet recommended.
Kafka reads metadata from ZooKeeper and performs the following tasks:
To comprehensively monitor the performance of a Kafka cluster, we need to monitor key metrics of each component that the cluster comprises:
Kafka acts as a central nervous system in the enterprise data flow, and brokers play that part within Kafka. Every message passes through a broker before it is consumed. It is critical to monitor broker performance characteristics and get alerted on anomalies so that you can take remedial action. To get full-stack insights, we monitor:
Metric name | Monitor | Act |
---|---|---|
gauge.kafka-active-controllers | Specifies whether the broker is an active controller. The sum of this metric should always be 1, as only one broker acts as the controller at any given time. You should alert on any other value for this metric. | Certain issues, such as errors in creating topics, require you to check the controller logs. The active controller metric shows which broker was the active controller at any given time. |
counter.kafka-messages-in | Number of messages received per second across all topics. This is an indicator of the overall workload based on the quantity of messages. | A positive correlation between messages in and total time in the system (produce, fetch, queue, etc.) suggests adding resources as Kafka approaches maximum throughput. |
gauge.kafka-underreplicated-partitions | Number of underreplicated partitions across all topics on the broker. Underreplicated partitions mean that one or more replicas are not available. | Replication is key to delivering high availability in a Kafka cluster. The underreplicated partitions metric is a leading indicator of the unavailability of one or more brokers. Check the error logs and restart the broker. |
counter.kafka-isr-expands and counter.kafka-isr-shrinks | If a broker goes down, in-sync replicas (ISRs) for some of the partitions shrink. When that broker is up again, ISRs expand once the replicas are fully caught up. A healthy cluster needs replicas for high throughput as well as failover. You should alert if ISRs do not expand shortly after a corresponding shrink. | If ISRs expand and shrink frequently, adjust the allowed replica lag setting. |
gauge.kafka-offline-partitions-count | Number of partitions that don’t have an active leader and are hence not writable or readable. You should alert on a non-zero value for this metric as it indicates that brokers are not available. | Check the error logs and restart the broker if needed. |
counter.kafka-leader-election-rate | A partition leader election happens when ZooKeeper is not able to connect with the leader. Any other partition replica can then be elected as the leader. Monitor this metric as it can indicate unavailability of brokers. | Frequent leader elections can indicate a system wide issue. Check the broker status and error logs. |
counter.kafka-unclean-elections-rate | A leader may be chosen from out-of-sync replicas if the broker that is the leader of the partition is unavailable and a new leader needs to be elected. Alert on this metric as it means potential loss of messages. | As of version 0.11.0.0, unclean leader election is disabled by default. Check the broker config to see whether unclean.leader.election.enable is set to true (a configuration-check sketch follows this table). By default, Kafka prioritizes durability over availability. In any case, you want to be alerted about potential data loss if unclean election is enabled. |
gauge.kafka.produce.total-time gauge.kafka.fetch-consumer.total-time gauge.kafka.fetch-follower.total-time | These metrics measure the total time taken to process a message: produce, fetch-consumer or fetch-follower. You can monitor not only the median but also the 99th percentile of latency. Choose algorithms such as Historical Anomaly to detect whether latency has gone up compared to historical performance. | There may be several contributors to the overall latency of processing messages. Check the request and response queues to pinpoint exactly where the anomaly is. |
counter.kafka-bytes-in counter.kafka-bytes-out | These metrics represent the amount of data brokers receive from producers and the amount that consumers read from brokers. They are indicators of the overall throughput or workload on the Kafka cluster. | Correlate these metrics with network-related metrics to ascertain whether throughput is approaching maximum capacity. Consider adding resources to increase Kafka throughput. |
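For the unclean leader election row above, here is a hedged sketch of checking whether unclean.leader.election.enable is turned on for a broker, assuming the confluent-kafka AdminClient and a placeholder broker id of "1":

```python
# Sketch: read a broker's unclean.leader.election.enable setting with
# confluent-kafka's AdminClient. Broker id and address are placeholders.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
resource = ConfigResource(ConfigResource.Type.BROKER, "1")  # broker id as a string

for res, future in admin.describe_configs([resource]).items():
    configs = future.result()  # dict of config name -> ConfigEntry
    entry = configs.get("unclean.leader.election.enable")
    if entry is not None and entry.value == "true":
        print("WARNING: unclean leader election is enabled; messages may be lost")
    else:
        print("unclean leader election is disabled (Kafka's durability-first default)")
```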
Visualizations of selected performance metrics across all the brokers are displayed below:
When producers can no longer push messages to brokers, consumers will not get new messages. Some of the key producer metrics are discussed below:
Metric name | Monitor |
---|---|
gauge.kafka.producer.response-rate | Average number of responses received per producer rolled up per minute |
gauge.kafka.producer.request-rate | Average number of requests sent per producer rolled up per minute |
gauge.kafka.producer.request-latency-avg | Average requests latency in milliseconds |
gauge.kafka.producer.record-send-rate | Number of records sent rolled up per minute |
gauge.kafka.producer.compression-rate | Average compression rate of sent batches. You can monitor the top 5 or top 10 entries |
gauge.kafka.producer.io-wait-time-ns-avg | Average length of time the I/O thread spent waiting for a socket (in ns) |
gauge.kafka.producer.outgoing-byte-rate | Average number of outgoing bytes per second |
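Many of these producer metrics can also be observed from the client side. The sketch below assumes the confluent-kafka Python client, which can emit a statistics JSON blob on a configurable interval; the broker address and topic are placeholders, and the field names follow librdkafka's statistics schema:

```python
# Sketch: surface producer-side metrics from the client itself via
# librdkafka statistics. Broker address and topic are placeholders.
import json
from confluent_kafka import Producer

def on_stats(stats_json):
    stats = json.loads(stats_json)
    # e.g. bytes and messages transmitted since the client started
    # (field names assumed from librdkafka's statistics schema)
    print("tx_bytes:", stats.get("tx_bytes"), "txmsgs:", stats.get("txmsgs"))

producer = Producer({
    "bootstrap.servers": "localhost:9092",   # assumed broker address
    "statistics.interval.ms": 10000,          # emit stats every 10 seconds
    "stats_cb": on_stats,                     # called while poll()/flush() runs
})

producer.produce("orders", value="payload")   # hypothetical topic
producer.flush()
producer.poll(0)  # serve queued callbacks, including stats_cb
```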
A visualization of selected producer metrics is shown below:
Monitoring consumer metrics may reveal systemic performance issues in how effectively data is being fetched by consumers. High lag values could indicate overloaded consumers, prompting you to add more consumers or add partitions to the topics to reduce lag and increase throughput. Similarly, a downward-trending fetch rate may indicate consumer failures and is an important metric to alert on. (A lag-calculation sketch follows the table below.)
Metric name | Monitor |
---|---|
gauge.kafka.consumer.records-lag-max | Max lag in terms of number of records. An increasing value means the consumer is not keeping up with producers. |
gauge.kafka.consumer.fetch-rate | Alert on consumers with low fetch rate |
gauge.kafka.consumer.bytes-consumed-rate | Total bytes consumed per second for each consumer for a specific topic or across all topics. |
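As a hedged illustration of what the records-lag metric captures, the sketch below computes per-partition lag directly from a consumer, assuming the confluent-kafka client; the topic, partition and group id are placeholders:

```python
# Sketch: compute consumer lag for one partition (committed offset vs. the
# broker's latest offset). Topic, partition and group id are placeholders.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "billing-service",
    "enable.auto.commit": False,
})

partition = TopicPartition("orders", 0)                  # hypothetical topic/partition
low, high = consumer.get_watermark_offsets(partition)    # earliest and latest offsets on the broker

committed = consumer.committed([partition])[0]
consumed_up_to = committed.offset if committed.offset >= 0 else low
lag = high - consumed_up_to
print(f"partition 0 lag: {lag} records")
consumer.close()
```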
A visualization of key consumer metrics is shown below:
ZooKeeper maintains information about Kafka’s brokers and topics, as well as the quotas that control the rate of traffic moving through the cluster.
Metric name | Monitor |
---|---|
counter.total.requests | Indicates total number of requests received or processed |
gauge.zk_avg_latency | Average request latency per host; monitor the hosts with the highest values and alert on upward trends |
gauge.zk_followers | Number of active followers of the leader in the cluster. Create an alert for any change in this value |
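These counters can also be pulled directly from ZooKeeper with its "mntr" four-letter-word command. The sketch below assumes a local ZooKeeper on port 2181 with mntr allowed via 4lw.commands.whitelist:

```python
# Sketch: poll ZooKeeper's "mntr" command over the client port and parse the
# key/value response (includes zk_avg_latency, zk_followers on the leader, ...).
import socket

def zk_mntr(host="localhost", port=2181):
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b"mntr")
        data = b""
        while chunk := sock.recv(4096):
            data += chunk
    # The response is one "key\tvalue" pair per line.
    return dict(line.split("\t", 1) for line in data.decode().splitlines() if "\t" in line)

stats = zk_mntr()
print("avg latency:", stats.get("zk_avg_latency"))
print("followers:", stats.get("zk_followers"))  # reported only by the leader
```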
A visualization of key ZooKeeper performance metrics is shown below:
In this blog, we looked at the key performance metrics across all the components of your Kafka deployment. In the next part of the series, "Collecting Kafka Performance Metrics with OpenTelemetry," we will discuss how to use Splunk Infrastructure Monitoring for real-time visibility into the health of your Kafka cluster. In the final part, we will cover how to enable distributed tracing for your Kafka clients using OpenTelemetry and Splunk APM.
You can get started by signing up for a free 14-day trial of Splunk Infrastructure Monitoring, and check out our documentation for details about additional Kafka performance metrics.
----------------------------------------------------
Thanks!
Amit Sharma