The New Look and Feel of Apache Kafka 4.0

The recently introduced Apache Kafka 4.0 comes with a number of upgrades across nearly every aspect of the open source distributed event streaming platform. The recent release has numerous Kafka Improvement Proposals (KIPs) — which provide new functionality from the open source community — across Kafka Streams, Kafka Connect, and Kafka brokers, consumers, producers, and more.
Nevertheless, the most significant part of the release likely doesn’t involve any KIPs. Kafka 4.0 is the first version to run Apache Kafka Raft (KRaft), an implementation of the Raft consensus protocol in Kafka, by default. As such, the platform no longer depends on Apache ZooKeeper, a centralized service that maintains configuration information, naming, and synchronization for distributed applications.
According to Sandon Jacobs, Senior Developer Advocate at Confluent, the phase-out of ZooKeeper was anything but abrupt. “The migration plan has been there for a while,” Jacobs remarked. “The community hasn’t been secretive about the fact that ZooKeeper is going the way of the dodo, so to speak. So, for the past few releases, this has been in the works.”
The 4.0 edition also contains an early access version of Queues for Kafka (KIP-932), which allows users to scale Kafka consumers beyond the number of partitions for a given topic. Other notable capabilities pertain to substantial improvements for rebalancing consumer groups without downtime and mechanisms for expediting the input of logic into Kafka Streams applications.
KRaft Over ZooKeeper
There are numerous reasons for Kafka’s replacement of ZooKeeper with KRaft. Some pertain to cost; others pertain to uptime and application stability. “You don’t have to worry about running another ZooKeeper cluster now,” Jacobs explained. “One less moving part is always better. There’s one less point of failure.” As the metadata coordination layer for Kafka deployments, ZooKeeper introduced several nontrivial concerns for Kafka users. Most of these concerns involved operations and were exacerbated by the fact that, although ZooKeeper is an open source project from the Apache Software Foundation, it was external to Kafka deployments.
Consequently, it added to the complexity of streaming data topologies. “Now, you don’t have your operators saying I need to create a ZooKeeper cluster; I need to create a Kafka cluster,” Jacobs said. “I need to manage things like TLS between ZooKeeper and Kafka. I’m not managing compute for ZooKeeper anymore.” ZooKeeper was widely used as a means of managing metadata for replicated topic information and partitions to determine factors like which partitions could serve as leaders based on the synchronization of replicas. “At some point, all that got migrated into KRaft,” Jacobs said.
Queues for Kafka
Queues for Kafka overcomes certain limitations that consuming applications encounter because Kafka is a topic-based, rather than queue-based, system. Topics are divided into partitions for parallel processing and replicated accordingly. According to Jacobs, “Your partitioning really dictates, or is dictated by, either way, your consumption model.”
Before the 4.0 release, it was largely impractical to scale the number of members of a consumer group in Kafka beyond the number of partitions for a topic. In traditional queue-based messaging systems like Amazon Simple Queue Service (SQS), consumer applications can create any number of threads “for consumption and these threads will just pick up the next available event in the queue,” Jacobs said. But before Kafka 4.0, users were effectively limited to scaling the members of consumer groups to the number of partitions for a topic, because if they tried to do more “you end up with idle members of the consumer group,” Jacobs said. “They’re not going to get a partition assignment; they’re not going to be consuming any data.”
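The idle-member effect Jacobs describes can be illustrated with a toy model. The sketch below is not Kafka's actual assignor; it assumes a simple round-robin rule in which each partition goes to exactly one group member, and the names `assign_partitions`, `c1` through `c5`, and the three-partition topic are all hypothetical.

```python
def assign_partitions(partitions, members):
    """Toy round-robin assignment: each partition goes to exactly one member."""
    assignment = {m: [] for m in members}
    for i, p in enumerate(partitions):
        assignment[members[i % len(members)]].append(p)
    return assignment


# Hypothetical topic with 3 partitions and a consumer group of 5 members.
partitions = [0, 1, 2]
members = ["c1", "c2", "c3", "c4", "c5"]

assignment = assign_partitions(partitions, members)
# Members that received no partition have nothing to consume.
idle = [m for m, ps in assignment.items() if not ps]

print(assignment)
print("idle members:", idle)
```

With more members than partitions, the two extra members in this model receive no assignment, which is the situation that made scaling past the partition count impractical.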
Queues for Kafka rectifies this situation with support similar to that of queue-based messaging systems. It effectively consumes from each partition via the First In, First Out paradigm “as best it can,” Jacobs added. Thus, consumers can scale beyond the number of partitions for a particular topic. However, the partition semantics in Kafka topics all but guarantee consumption from partitions in the order in which events were written. Today, such sequential order of consumption isn’t guaranteed in Queues for Kafka. “In cases where order of processing matters, you still want to stick with traditional consumer groups,” Jacobs commented. “In cases where you want to scale out and just get the work done and the order doesn’t matter as much, Queues for Kafka is where you would look.”
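The trade-off between scale and ordering can be sketched with another toy model. This is not the KIP-932 implementation; it simply assumes that the next available record from any partition is handed to whichever worker is next in line, and `share_consume`, the worker names, and the record values are hypothetical.

```python
from collections import deque


def share_consume(partitions, workers):
    """Toy queue-style delivery: hand the next available record to the next worker."""
    queues = {p: deque(records) for p, records in partitions.items()}
    delivered = {w: [] for w in workers}
    turn = 0
    # Keep going while any partition still has undelivered records.
    while any(queues.values()):
        for p, q in queues.items():
            if q:
                worker = workers[turn % len(workers)]
                delivered[worker].append((p, q.popleft()))
                turn += 1
    return delivered


# Hypothetical topic with 2 partitions consumed by 4 workers.
partitions = {0: ["a0", "a1", "a2"], 1: ["b0", "b1", "b2"]}
workers = ["w1", "w2", "w3", "w4"]

delivered = share_consume(partitions, workers)
for worker, records in delivered.items():
    print(worker, records)
```

In this model all four workers receive records even though there are only two partitions, but the records of partition 0 end up split across `w1` and `w3`, so no single worker sees that partition in full order.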
Consumer Group Rebalancing
KIP-848 expedites and simplifies consumer group rebalancing. Historically, consumption could pause while partitions were reassigned, for example, when machines were added during a spike in traffic. It wasn’t uncommon for some of the consuming applications to stop during this process. KIP-848 ameliorates this scenario so that, ideally, consumption doesn’t stop when partitions are rebalanced across consumers.
“It’s a performance thing,” Jacobs said. “You don’t need to necessarily stop everybody just because a new member of the group showed up.” These capabilities can enhance auto-scaling deployments via frameworks like Kubernetes. When scaling up to meet the needs of pre-defined thresholds, the system automatically “adds the pods that you want to speed up your consumption, and the other pods carry on as they normally would,” Jacobs said.
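The idea that existing members carry on while a newcomer joins can be sketched as follows. This is a toy model of an incremental rebalance, not the KIP-848 protocol itself; it assumes the only goal is an even partition count, and `rebalance_incremental` and the pod names are hypothetical.

```python
def rebalance_incremental(current, new_member):
    """Move only as many partitions as the new member needs for a fair share."""
    total = sum(len(ps) for ps in current.values())
    target = total // (len(current) + 1)
    updated = {m: list(ps) for m, ps in current.items()}
    updated[new_member] = []
    while len(updated[new_member]) < target:
        # Take one partition from whichever existing member holds the most.
        donor = max(current, key=lambda m: len(updated[m]))
        updated[new_member].append(updated[donor].pop())
    return updated


# Hypothetical group of two pods that scales out to three.
current = {"pod-1": [0, 1, 2], "pod-2": [3, 4, 5]}
updated = rebalance_incremental(current, "pod-3")

# Partitions that changed owner; every other partition keeps flowing.
moved = [p for m in current for p in current[m] if p not in updated[m]]

print(updated)
print("moved partitions:", moved)
```

Only two of the six partitions change hands in this model, so the other four never need to pause. In an actual Kafka 4.0 consumer, the new protocol is understood to be selected through the `group.protocol=consumer` client setting; that detail comes from the KIP rather than from Jacobs's remarks.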
Code Injections and Observability
KIP-1112 lets developers inject code into Processor API and DSL processors in Kafka Streams through customized processor wrapping. Without it, uniformly adding logic to processors (for auditing, for example) requires manually cutting and pasting code into each one, which can be time-consuming for developers.
Instead of the traditional manual approach, “Now you can actually define a processor wrapper class, and you can add that to your Streams config and your topology config, so that it will apply to all the processors in your topology,” Jacobs said.
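The wrapper pattern Jacobs describes can be sketched in a few lines. The example below is a language-neutral illustration in Python, not the Kafka Streams API; `audit_wrapper`, `build_topology`, and the two sample processors are hypothetical names chosen for illustration.

```python
audit_log = []


def audit_wrapper(name, processor):
    """Wrap a processor so that every record it handles is logged first."""
    def wrapped(record):
        audit_log.append((name, record))
        return processor(record)
    return wrapped


def build_topology(processors, wrapper=None):
    """Apply the wrapper (if any) to every processor, then chain them in order."""
    steps = [wrapper(name, p) if wrapper else p for name, p in processors]

    def run(record):
        for step in steps:
            record = step(record)
        return record
    return run


# Hypothetical two-step topology; neither processor contains any audit code.
processors = [("uppercase", str.upper), ("exclaim", lambda s: s + "!")]
topology = build_topology(processors, wrapper=audit_wrapper)

result = topology("order created")
print(result)
print(audit_log)
```

The audit logic is written once and configured in one place, yet it runs in front of every processor, which is the effect the manual copy-and-paste approach was trying to achieve.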
KIP-1076 and KIP-1091 improve Kafka’s observability by making available consolidated client metrics and new detailed state metrics on the system’s brokers. They improve the platform’s capacity to “report client metrics back to Kafka so that you can set up your OpenTelemetry collector of choice and leverage the tools that you already have an investment in, like Datadog, to get those metrics about your producers and consumers, your admin clients, into that platform,” Jacobs said.
Better With Time
Many of the KIPs providing the new features found in Kafka 4.0 are in general availability. Most of them are designed to reduce the time and effort required for developers to maximize the value from one of the most ubiquitous platforms underlying contemporary streaming data applications. To that end, any of the KIPs found in the 4.0 release may be improved further, much as the community continues to improve the platform itself.