This guest blog post is courtesy of Alvaro Santos Andres, Cloud Solution Architect at Bluetab Solutions, and was first published on DZone.
During all my years as a Solution Architect, I have built many streaming architectures, such as real-time data ETL, reactive microservices, log collection, and even AI-driven services, all using Kafka as a core part of their architecture. Kafka is a proven stream-processing platform that has been used for years at companies like LinkedIn, Microsoft, and Netflix. In many cases it works very well, handles large volumes of data, and has a strong community, which is why it is chosen for so many streaming scenarios.
However, due to the design of Kafka, all of my Kafka-based projects have suffered from a similar set of problems:
Latency, or the delay before a transfer of data begins, can be a nightmare for anyone working with data-intensive applications. As IoT-enabled applications such as autonomous vehicles and industrial inspection become commonplace, the data generated by their sensors will become too demanding for existing architectures.
Maintaining low latency while keeping up with ever-growing throughput requirements is a big challenge. When the platform falls behind, data takes longer to move from devices to data centers, and the user experience degrades rapidly.
Apache Pulsar shows notable improvements in both latency and throughput compared to Kafka: Pulsar is approximately 2.5 times faster than Kafka and has roughly 40% lower latency (*). Those differences are huge, and in critical systems they can mean the difference between success and failure.
There are many techniques that Pulsar uses to improve performance, and the most important is how it handles tailing reads. When readers are only interested in the most recent data, they are served from an in-memory cache in the serving layer (the Pulsar brokers); only catch-up readers end up being served from the storage layer (Apache BookKeeper). This approach is key to improving latency and throughput compared to systems such as Kafka.
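To make the distinction concrete, here is a minimal sketch using Pulsar's Java Reader API. The service URL and topic name are placeholders for illustration; the only difference between the two readers is their start position, which is what separates a tailing read (served from the brokers' in-memory cache) from a catch-up read (served from BookKeeper).

```java
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Reader;

public class TailingVsCatchUp {
    public static void main(String[] args) throws Exception {
        // Placeholder broker URL; adjust for your cluster.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Tailing read: start at the latest message, so the broker can
        // serve new messages straight from its in-memory cache.
        Reader<byte[]> tailingReader = client.newReader()
                .topic("persistent://public/default/sensor-events")
                .startMessageId(MessageId.latest)
                .create();

        // Catch-up read: start from the earliest retained message,
        // which must be fetched from the storage layer (BookKeeper).
        Reader<byte[]> catchUpReader = client.newReader()
                .topic("persistent://public/default/sensor-events")
                .startMessageId(MessageId.earliest)
                .create();

        // Drain the backlog through the catch-up reader.
        while (catchUpReader.hasMessageAvailable()) {
            System.out.println(new String(catchUpReader.readNext().getData()));
        }

        catchUpReader.close();
        tailingReader.close();
        client.close();
    }
}
```

The client code is identical in both cases; Pulsar's layered design decides which read path serves each reader.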
If you want to dig deeper, Chris Bartholomew recently wrote a very good article benchmarking the latency of Apache Pulsar and Kafka.
Imagine you have thousands or even millions of devices sending data to your data lake. This data must be handled with speed, security, and reliability, and for legal reasons you must partition it by country, device, and city. These requirements are reasonable, and in 2019, any stream-processing platform should be able to meet them.
But how well do they? Kafka is not known to work well when there are thousands of topics and partitions, even if the data volume is not massive, and solving the performance challenges that arise in these scenarios can get very complicated.
Fortunately, Pulsar is designed to serve over a million topics in a single cluster. The key to scaling the number of topics is how data is stored. In Kafka, each topic's data is stored in dedicated files and directories; as a result, Kafka has trouble scaling because I/O is scattered across the disk as these files are periodically flushed from the page cache. In contrast, Pulsar stores data in bookies (BookKeeper servers), where messages from different topics are aggregated, sorted, stored in large files, and then indexed. This is how Pulsar scales to millions of topics, and it makes fine-grained topic layouts like the one sketched below practical.
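As a rough illustration, here is a sketch of such a fine-grained layout using the Pulsar Java client. The tenant, namespace, and device names are hypothetical, and the `iot` tenant and per-country namespaces are assumed to have been created beforehand; the point is how naturally country, city, and device map onto Pulsar's persistent://tenant/namespace/topic hierarchy.

```java
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class PerDeviceTopics {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Hypothetical hierarchy: tenant "iot", one namespace per
        // country, one topic per city/device pair. Because bookies
        // aggregate messages from many topics into shared files, the
        // topic count can grow very large without scattering disk I/O.
        String country = "es";
        String city = "madrid";
        String deviceId = "sensor-42";
        String topic = String.format("persistent://iot/%s/%s-%s",
                country, city, deviceId);

        Producer<byte[]> producer = client.newProducer()
                .topic(topic)
                .create();

        producer.send("temperature=21.5".getBytes());

        producer.close();
        client.close();
    }
}
```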
Another common error in many projects I have participated in is the limited scope of the initial design. When you begin designing an architecture, you are often focused on first-year ROI and local impact. Then, when expansion to new countries becomes mandatory, you are forced to stretch that same infrastructure across new regions without a global architecture design.
Kafka brokers are designed to work together within a single region, or even a single availability zone, so there is no easy way to build a multi-datacenter architecture with them. In contrast, geo-replication is an out-of-the-box feature in Pulsar: replication can be configured at the namespace level to copy data among any number of clusters. Additionally, Pulsar's multi-tenancy feature makes it possible to stand up one cluster for an entire enterprise while still isolating each tenant's data, as the sketch below shows.
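Here is a sketch of how this can look with Pulsar's Java admin API: it creates a tenant allowed to use two clusters and enables replication for one of its namespaces. The cluster, tenant, and namespace names are hypothetical, the two clusters are assumed to already be registered with each other, and `TenantInfo.builder()` assumes a reasonably recent Pulsar release.

```java
import java.util.HashSet;
import java.util.Set;

import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.TenantInfo;

public class ConfigureGeoReplication {
    public static void main(String[] args) throws Exception {
        // Placeholder admin endpoint; adjust for your deployment.
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();

        // A tenant allowed to use both regional clusters; multi-tenancy
        // keeps its data isolated from other tenants on the same cluster.
        Set<String> allowedClusters = new HashSet<>();
        allowedClusters.add("us-west");
        allowedClusters.add("eu-central");
        admin.tenants().createTenant("acme",
                TenantInfo.builder().allowedClusters(allowedClusters).build());

        // Geo-replication is configured at the namespace level: every
        // topic under this namespace is replicated to both clusters.
        admin.namespaces().createNamespace("acme/orders");
        admin.namespaces().setNamespaceReplicationClusters("acme/orders",
                allowedClusters);

        admin.close();
    }
}
```

The same configuration can also be applied with the pulsar-admin CLI; because it lives at the namespace level, every topic later created under acme/orders is replicated automatically.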
In Agile projects, it is desirable to begin with a few features and add new ones incrementally, so that the team is not overwhelmed by services that must be coded, tested, and maintained. Infrastructure follows a similar pattern. At first, a small Kafka cluster is enough for the current volume of data. In the following months, more and more customers arrive, and the cluster copes by adding new partitions.
However, there comes a point when a new server must be added to the cluster, and then you not only have to change the configuration but also re-balance the existing topics. These are examples of how operational expenditure increases rapidly with a Kafka-based architecture.
Happily for us, Pulsar's layered architecture and stateless brokers make zero downtime possible in these cases. When a new broker is added to the cluster, it is immediately available for writes and reads and does not spend any time re-balancing data across the cluster.
On the storage side (the bookies), when a new bookie is added to the cluster, data is re-balanced behind the scenes according to the replication configuration, without any impact on the cluster. Finally, Pulsar is easy to deploy on Kubernetes, whether in managed clusters on Google Kubernetes Engine or Amazon Web Services or in custom clusters. Easy to install and easy to maintain is exactly what we are looking for, and that is what Pulsar delivers.
Apache Pulsar is a powerful stream-processing platform that has learned from the weaknesses of earlier systems. Its layered architecture is complemented by a number of great out-of-the-box features, including geo-replication, multi-tenancy, zero rebalancing downtime, unified queuing and streaming (illustrated below), TLS-based authentication and authorization, a proxy, and durability. Compared to other platforms, Pulsar gives you the tools to deliver successful projects.
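As a closing illustration of the unified queuing and streaming model, the sketch below (topic and subscription names are hypothetical) shows how the same Java consumer API covers both: an Exclusive subscription behaves like an ordered stream, while a Shared subscription behaves like a work queue.

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class QueuingAndStreaming {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Streaming semantics: an Exclusive subscription gives a single
        // consumer the full, ordered stream (comparable to a Kafka
        // consumer reading a partition).
        Consumer<byte[]> streamConsumer = client.newConsumer()
                .topic("persistent://public/default/events")
                .subscriptionName("stream-sub")
                .subscriptionType(SubscriptionType.Exclusive)
                .subscribe();

        // Queuing semantics: a Shared subscription spreads messages
        // across many consumers (comparable to a traditional work queue).
        Consumer<byte[]> queueWorker = client.newConsumer()
                .topic("persistent://public/default/events")
                .subscriptionName("queue-sub")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();

        streamConsumer.close();
        queueWorker.close();
        client.close();
    }
}
```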