Today’s business is powered by data. Success in the digital world depends on how quickly data can be collected, analyzed and acted upon. The faster the speed of data-driven insights, the more agile and responsive a business can become. Apache Kafka has emerged as a popular open-source stream-processing solution for collecting, storing, processing and analyzing data at scale.
Kafka is a distributed event streaming platform that nowadays is typically deployed on distributed, dynamic and ephemeral infrastructure such as Kubernetes. These distributed, cloud-native systems, while boosting agility and enabling efficient scalability, also introduce operational complexities. Decoupled or loosely coupled components often make it challenging to make sense of complex interdependencies, detect the source of performance bottlenecks, and correlate insights to understand the why behind performance anomalies.
In this blog series, we take a deep dive into Kafka architecture, the key performance characteristics that you should monitor and how to collect telemetry data to gain real-time observability into the health and performance of your Kafka cluster using Splunk.
Kafka leverages two key capabilities to implement event processing in real-time:
Publish/subscribe messaging is a pattern where the sender of data messages is decoupled from, and agnostic of, the receiver of the data. Instead, the publisher characterizes the message with metadata and the subscriber “picks up” messages of interest.
Applications that publish data messages are called Producers, while applications that subscribe (receive and process messages) are called Consumers. Kafka brokers act as intermediaries between producer and consumer applications. Brokers are designed to operate as part of a cluster. Within the cluster, one broker also functions as the cluster controller for administrative purposes such as monitoring broker failures.
Related messages are organized and stored in Topics. Producers publish messages to one or more relevant topics, and consumers subscribe to those topics and read the messages. Topics themselves are divided into one or more partitions, which form the unit of parallelism. Each partition can be placed on a separate machine and assigned to a broker, allowing multiple consumers to read in parallel. Multiple consumers can also read from multiple partitions in a topic, resulting in high message-processing throughput. Although a partition may be replicated across multiple brokers for redundancy and high availability, each partition is “owned” by a single broker in the cluster, known as the leader of the partition.
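To make the pattern concrete, here is a minimal sketch of a producer and a consumer, assuming the confluent-kafka Python client; the broker address, topic name and consumer group below are placeholders for illustration:

```python
# Minimal publish/subscribe sketch with the confluent-kafka Python client.
# Broker address, topic and group id are illustrative placeholders.
from confluent_kafka import Producer, Consumer

BOOTSTRAP = "localhost:9092"   # assumed broker address
TOPIC = "orders"               # hypothetical topic

# Producer: publishes messages to a topic without knowing who will read them.
producer = Producer({"bootstrap.servers": BOOTSTRAP})
producer.produce(TOPIC, key="order-1", value="created")
producer.flush()  # block until the message is acknowledged by the partition leader

# Consumer: subscribes to the topic and "picks up" messages of interest.
consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": "billing-service",      # hypothetical consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe([TOPIC])
msg = consumer.poll(5.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value(), "from partition", msg.partition())
consumer.close()
```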
Kafka writes messages to only one replica: the partition leader. Follower replicas obtain copies of the messages from the leader. Consumers may read either from the partition leader or from a follower. This architecture distributes the read request load across the fleet of replicas.
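As an illustration of partitions and replicas, the sketch below creates a topic with six partitions and a replication factor of three, assuming the confluent-kafka AdminClient; the topic name and counts are placeholders, not a recommendation:

```python
# Sketch: create a topic with multiple partitions and replicas using
# confluent-kafka's AdminClient. Values are illustrative only.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# 6 partitions allow parallel consumption; replication_factor=3 keeps a
# leader plus two follower replicas for each partition.
futures = admin.create_topics([NewTopic("orders", num_partitions=6, replication_factor=3)])
for topic, future in futures.items():
    try:
        future.result()  # raises an exception if topic creation failed
        print(f"created topic {topic}")
    except Exception as exc:
        print(f"failed to create {topic}: {exc}")
```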
There is one additional component, ZooKeeper, which keeps track of the status of the Kafka cluster. To reduce this complexity, the community is moving to replace ZooKeeper with a metadata quorum. The Kafka 2.8 release introduced an early-access look at Kafka without ZooKeeper; however, it is not considered feature complete, and running Kafka without ZooKeeper in production is not yet recommended.
Kafka reads metadata from ZooKeeper and performs the following tasks:
To comprehensively monitor the performance of a Kafka cluster, we need to monitor key metrics of each component that the cluster comprises:
Kafka acts as a central nervous system in the enterprise data flow, and brokers play that part within Kafka. Every message passes through a broker before it is consumed. It is critical to monitor broker performance characteristics and get alerted on anomalies so that you can take remedial action. To get full-stack insights, we monitor:
Metric name | Monitor | Act |
---|---|---|
gauge.kafka-active-controllers | Specifies whether the broker is an active controller. The sum of this metric should always be 1, as only one broker acts as the controller at any given time. You should alert on any other value for this metric. | Certain issues, such as errors in creating topics, require you to check the controller logs. The active controller metric shows which broker was the active controller at any given time. |
counter.kafka-messages-in | Number of messages received per second across all topics. This is an indicator of the overall workload based on the quantity of messages. | A positive correlation between messages in and total time in the system (produce, fetch, queue, etc.) suggests adding resources as Kafka approaches maximum throughput. |
gauge.kafka-underreplicated-partitions | Number of underreplicated partitions across all topics on the broker. Underreplicated partitions mean that one or more replicas are not available. | Replication is key to delivering high availability in a Kafka cluster. The underreplicated partitions metric is a leading indicator of the unavailability of one or more brokers. Check the error logs and restart the broker. |
counter.kafka-isr-expands and counter.kafka-isr-shrinks | If a broker goes down, in-sync replicas (ISRs) for some of the partitions shrink. When that broker is up again, ISRs expand once the replicas are fully caught up. A healthy cluster needs replicas for high throughput as well as failover. You should alert if ISRs do not expand shortly after a corresponding shrink. | If ISRs expand and shrink frequently, adjust the allowed replica lag setting. |
gauge.kafka-offline-partitions-count | Number of partitions that don’t have an active leader and are hence not writable or readable. You should alert on a non-zero value for this metric as it indicates that brokers are not available. | Check the error logs and restart the broker if needed. |
counter.kafka-leader-election-rate | A partition leader election happens when ZooKeeper is not able to connect with the leader. Any other partition replica can then be elected as the leader. Monitor this metric as it can indicate unavailability of brokers. | Frequent leader elections can indicate a system wide issue. Check the broker status and error logs. |
counter.kafka-unclean-elections-rate | A leader may be chosen from out-of-sync replicas if the broker that is the leader of the partition is unavailable and a new leader needs to be elected. Alert on this metric as it means potential loss of messages. | As of version 0.11.0.0, unclean leader election is disabled by default. Check the broker config to see whether unclean.leader.election.enable is set to true (a configuration-check sketch follows this table). By default, Kafka prioritizes durability over availability. In any case, you want to be alerted about potential data loss if unclean election is enabled. |
gauge.kafka.produce.total-time gauge.kafka.fetch-consumer.total-time gauge.kafka.fetch-follower.total-time | These metrics measure the total time taken to process a message: produce, fetch-consumer or fetch-follower. You can monitor not only the median but also the 99th percentile of latency. Choose algorithms such as Historical Anomaly to detect whether latency has gone up compared to historical performance. | There may be several contributors to the overall latency of processing messages. Check the request and response queues to pinpoint exactly where the anomaly is. |
counter.kafka-bytes-in counter.kafka-bytes-out | These metrics represent the amount of data brokers receive from producers and the amount that consumers read from brokers. They are indicators of the overall throughput or workload on the Kafka cluster. | Correlate these metrics with network-related metrics to ascertain whether throughput is approaching maximum capacity. Consider adding resources to increase Kafka throughput. |
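For the unclean leader election row above, here is a hedged sketch of checking whether unclean.leader.election.enable is turned on for a broker, assuming the confluent-kafka AdminClient and a placeholder broker id of "1":

```python
# Sketch: read a broker's unclean.leader.election.enable setting with
# confluent-kafka's AdminClient. Broker id and address are placeholders.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
resource = ConfigResource(ConfigResource.Type.BROKER, "1")  # broker id as a string

for res, future in admin.describe_configs([resource]).items():
    configs = future.result()  # dict of config name -> ConfigEntry
    entry = configs.get("unclean.leader.election.enable")
    if entry is not None and entry.value == "true":
        print("WARNING: unclean leader election is enabled; messages may be lost")
    else:
        print("unclean leader election is disabled (Kafka's durability-first default)")
```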
Visualizations of selected performance metrics across all the brokers are displayed below:
When producers can no longer push messages to brokers, consumers will not get new messages. Some of the key producer metrics are discussed below:
Metric name | Monitor |
---|---|
gauge.kafka.producer.response-rate | Average number of responses received per producer rolled up per minute |
gauge.kafka.producer.request-rate | Average number of requests sent per producer rolled up per minute |
gauge.kafka.producer.request-latency-avg | Average requests latency in milliseconds |
gauge.kafka.producer.record-send-rate | Number of records sent rolled up per minute |
gauge.kafka.producer.compression-rate | Average compression rate of sent batches. You can monitor the top 5 or top 10 entries |
gauge.kafka.producer.io-wait-time-ns-avg | Average length of time the I/O thread spent waiting for a socket (in ns) |
gauge.kafka.producer.outgoing-byte-rate | Average number of outgoing bytes per second |
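Many of these producer metrics can also be observed from the client side. The sketch below assumes the confluent-kafka Python client, which can emit a statistics JSON blob on a configurable interval; the broker address and topic are placeholders, and the field names follow librdkafka's statistics schema:

```python
# Sketch: surface producer-side metrics from the client itself via
# librdkafka statistics. Broker address and topic are placeholders.
import json
from confluent_kafka import Producer

def on_stats(stats_json):
    stats = json.loads(stats_json)
    # e.g. bytes and messages transmitted since the client started
    # (field names assumed from librdkafka's statistics schema)
    print("tx_bytes:", stats.get("tx_bytes"), "txmsgs:", stats.get("txmsgs"))

producer = Producer({
    "bootstrap.servers": "localhost:9092",   # assumed broker address
    "statistics.interval.ms": 10000,          # emit stats every 10 seconds
    "stats_cb": on_stats,                     # called while poll()/flush() runs
})

producer.produce("orders", value="payload")   # hypothetical topic
producer.flush()
producer.poll(0)  # serve queued callbacks, including stats_cb
```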
A visualization of selected producer metrics is shown below:
Monitoring consumer metrics may reveal systemic performance issues in how effectively data is being fetched by consumers. High lag values could indicate overloaded consumers, prompting you to add more consumers or add partitions to the topics to reduce lag and increase throughput. Similarly, a downward-trending fetch rate may indicate consumer failures and is an important metric to alert on. (A lag-calculation sketch follows the table below.)
Metric name | Monitor |
---|---|
gauge.kafka.consumer.records-lag-max | Max lag in terms of number of records. An increasing value means the consumer is not keeping up with producers. |
gauge.kafka.consumer.fetch-rate | Alert on consumers with low fetch rate |
gauge.kafka.consumer.bytes-consumed-rate | Total bytes consumed per second for each consumer for a specific topic or across all topics. |
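As a hedged illustration of what the records-lag metric captures, the sketch below computes per-partition lag directly from a consumer, assuming the confluent-kafka client; the topic, partition and group id are placeholders:

```python
# Sketch: compute consumer lag for one partition (committed offset vs. the
# broker's latest offset). Topic, partition and group id are placeholders.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "billing-service",
    "enable.auto.commit": False,
})

partition = TopicPartition("orders", 0)                  # hypothetical topic/partition
low, high = consumer.get_watermark_offsets(partition)    # earliest and latest offsets on the broker

committed = consumer.committed([partition])[0]
consumed_up_to = committed.offset if committed.offset >= 0 else low
lag = high - consumed_up_to
print(f"partition 0 lag: {lag} records")
consumer.close()
```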
A visualization of key consumer metrics is shown below:
ZooKeeper maintains information about Kafka’s brokers and topics, as well as the quotas that control the rate of traffic moving through the cluster.
Metric name | Monitor |
---|---|
counter.total.requests | Indicates total number of requests received or processed |
gauge.zk_avg_latency | Average request latency per host; monitor the hosts with the highest values and alert on upward trends |
gauge.zk_followers | Number of active followers of the leader in the cluster. Create an alert for any change in this value |
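These counters can also be pulled directly from ZooKeeper with its "mntr" four-letter-word command. The sketch below assumes a local ZooKeeper on port 2181 with mntr allowed via 4lw.commands.whitelist:

```python
# Sketch: poll ZooKeeper's "mntr" command over the client port and parse the
# key/value response (includes zk_avg_latency, zk_followers on the leader, ...).
import socket

def zk_mntr(host="localhost", port=2181):
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b"mntr")
        data = b""
        while chunk := sock.recv(4096):
            data += chunk
    # The response is one "key\tvalue" pair per line.
    return dict(line.split("\t", 1) for line in data.decode().splitlines() if "\t" in line)

stats = zk_mntr()
print("avg latency:", stats.get("zk_avg_latency"))
print("followers:", stats.get("zk_followers"))  # reported only by the leader
```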
A visualization of key ZooKeeper performance metrics is shown below:
In this blog, we looked at the key performance metrics across all the components of your Kafka deployment. In the next part of the series, "Collecting Kafka Performance Metrics with OpenTelemetry," we will discuss how to use Splunk Infrastructure Monitoring for real-time visibility into the health of your Kafka cluster. In the final part, we will cover how to enable distributed tracing for your Kafka clients using OpenTelemetry and Splunk APM.
You can get started by signing up for a free 14-day trial of Splunk Infrastructure Monitoring, and check out our documentation for details about additional Kafka performance metrics.
----------------------------------------------------
Thanks!
Amit Sharma