What is Streaming Data?
Streaming data, or events, are small pieces of data (typically kilobytes in size) generated continuously. As the data flows into the system, we process it immediately so the information can be used in real time or near real time.
By information we mean:
• Financial information in the stock market
• Sensor data from a transport vehicle, sent to a streaming application that reports where the vehicle currently is.
• Ride-hailing apps such as Lyft and Grab, which work out where you are and how heavy the traffic is, then calculate prices in real time.
• Social media feeds.
• Player activity in games, analyzed in real time.
What is Kafka?
Apache Kafka is an open-source distributed event-streaming platform. Today, Kafka is used to build real-time data pipelines and streaming apps.
Kafka was originally created at LinkedIn and became an open-source project in 2011, initially as an Apache Incubator project (in other words, in the incubation stage before becoming a full Apache Software Foundation project). One of its developers, Jay Kreps, named Kafka after the author Franz Kafka.
Since then it has become a full Apache Software Foundation project, written in Java and Scala: a real-time data platform with high throughput and minimal latency.
Kafka Core Concepts:
Streaming data, or events, are generated by producers: the producer writes the event data and sends it to Kafka.
Producers are typically external systems such as web servers, application components, IoT devices, and monitoring agents.
For example, a user registering on a website counts as an event, and a weather sensor reading containing temperature, humidity, and wind data counts as an event. In other words, almost anything can create events.
Consumers are responsible for fetching data from Kafka and directing it to other destinations such as databases, data lakes, or data analytics applications.
Kafka, our main hero, acts as the middleman between producers and consumers. This Kafka system is called a Kafka cluster.
A cluster consists of several nodes; a node in Kafka is called a broker, which is why Kafka is classified as a distributed system.
Producers publish events to Kafka topics, and consumers subscribe to the topics they want; a topic can have multiple consumers.
Therefore, producers are often referred to as publishers, and consumers as subscribers, which is why Kafka is often classified as a publish/subscribe messaging system.
What is Publish/subscribe messaging?
Publish/subscribe (pub/sub) messaging is a pattern in which a publisher sends information to subscribers as pieces of data (messages).
Kafka Topics:
The producer writes data to topics (imagine a topic like a folder in a file system).
• Topics are identified by name.
• Topics accept any kind of message format.
• The sequence of messages is called a data stream.
• You cannot query a topic's data directly; instead, producers write the data and consumers read it.
A topic is divided into partitions, and within each partition every message is assigned an incremental id called an offset. (A topic-creation sketch follows the list below.)
Things to be careful about:
• The same offset in different partitions does not point to the same data; the message at offset 1 in partition 0 is not the message at offset 1 in partition 1.
• Data written to a partition cannot be changed (it is immutable).
• Ordering is only guaranteed within a partition.
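To make this concrete, here is a minimal sketch that creates a topic with several partitions using the Java AdminClient. The broker address and topic name are illustrative assumptions, not values from this article:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Properties;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker address for a local development cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical topic "user-registrations" with 3 partitions and
            // replication factor 1 (fine for a single-broker dev setup).
            NewTopic topic = new NewTopic("user-registrations", 3, (short) 1);
            admin.createTopics(List.of(topic)).all().get(); // block until created
        }
    }
}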
Kafka Producer:
The producer sends messages to topics, and the messages are distributed across partitions.
This distribution can happen in two ways: messages can be spread round robin (partition 0, partition 1, ...), or, when a message has a key, every message with that key must go to the same partition.
As part of sending, Kafka uses a message serializer, which converts the data from objects into bytes, because brokers accept and deliver only bytes.
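A minimal Java producer sketch, assuming the same local broker and hypothetical topic as above; here StringSerializer does the object-to-bytes conversion:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        // Serializers convert the key and value objects into bytes before sending.
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No key: messages are spread across partitions.
            producer.send(new ProducerRecord<>("user-registrations", "user42 signed up"));
            // With a key: every message with the same key lands in the same partition.
            producer.send(new ProducerRecord<>("user-registrations", "user42", "user42 updated profile"));
        } // closing the producer flushes any buffered messages
    }
}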
Kafka Consumer:
The consumer reads data from a topic by pulling messages.
• Consumers are smart enough to know which broker to read from.
• Consumers read each partition from the lowest offset to the highest.
• If a broker fails, the consumer knows where to resume reading by looking at the offset.
• If a consumer reads more than one partition, there is no guarantee that message order is aligned across partitions.
• Just as the producer has a serializer, the consumer has a deserializer, which converts the bytes back into objects, as in the sketch below.
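A matching Java consumer sketch under the same assumptions; StringDeserializer turns the bytes back into objects, and the group id is a hypothetical name:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "registration-readers");      // hypothetical consumer group
        // Deserializers convert the received bytes back into objects.
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("user-registrations"));
            while (true) {
                // The consumer pulls messages; within each partition they arrive in offset order.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}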
Consumer group:
Consumers can be combined into a consumer group, in which multiple consumers help each other read the messages in each partition.
• Kafka keeps track of how far each consumer group has read, storing these offsets in an internal topic named __consumer_offsets.
• Because of this, if a consumer dies, it can resume reading from the last committed offset, as the sketch below shows.
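Committed offsets can also be controlled explicitly. A sketch of the relevant additions to the consumer above (disabling auto-commit is an illustrative choice, not something this article prescribes):

// Added to the consumer configuration from the previous sketch:
props.put("enable.auto.commit", "false"); // commit offsets manually instead of on a timer

// Added inside the poll loop, after the fetched records have been processed:
consumer.commitSync(); // writes the group's position to the __consumer_offsets topic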
Replication:
Kafka has some other interesting features, such as:
• Topics have a replication factor: partitions are also stored on other brokers, so if the broker holding a partition dies, a spare copy exists on another broker.
• There is a leader concept: for each partition, exactly one broker is the leader of that partition.
• When the producer sends data, it sends it only to the leader, and the follower brokers replicate the data from the leader (see the sketch after this list).
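A sketch that uses the Java AdminClient to look at which broker leads each partition and where the replicas live, reusing the assumed broker and topic from earlier:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;
import java.util.List;
import java.util.Properties;

public class DescribeTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("user-registrations"))
                    .all().get().get("user-registrations");
            for (TopicPartitionInfo p : desc.partitions()) {
                // Exactly one leader per partition; the other replicas are followers.
                System.out.printf("partition=%d leader=%s replicas=%s%n",
                        p.partition(), p.leader(), p.replicas());
            }
        }
    }
}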
Kafka vs RabbitMQ:
Kafka is often compared to traditional message queues (message brokers) such as RabbitMQ, IBM MQ, and Microsoft Message Queue.
A message broker lets the applications and services within a system talk to each other and exchange information. It acts as a separator between sender and receiver: the sender doesn't have to know which receiver the message goes to.
Well, that sounds like Kafka.
One difference is that Kafka retains messages for a period of time (7 days by default) even after they have been consumed, while RabbitMQ deletes a message once the consumer (subscriber) has received it; a RabbitMQ consumer can never read the same message again because it is gone.
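Retention is a per-topic setting (retention.ms). A sketch of setting it at creation time with the Java AdminClient, again with assumed names and values:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class RetentionSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("audit-log", 3, (short) 1)   // hypothetical topic
                    .configs(Map.of("retention.ms", "604800000"));     // 7 days in milliseconds
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}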
RabbitMQ pushes messages to consumers and works out how each consumer should receive them (a smart broker with consumers that don't have to think).
Kafka, on the other hand, leaves more to the consumer: consumers pull messages themselves and know which partition and offset to pull from.
Kafka scales horizontally (by adding more nodes) while RabbitMQ scales vertically (by adding more power to the same node).
Kafka and Zookeeper:
ZooKeeper is the component that manages the brokers for Kafka. It is responsible for:
• Choosing which broker will be the leader of a partition.
• Keeping topic configuration and permissions.
• Sending notifications to tell Kafka that a new topic has been created, a broker has died, a broker has come back, and so on.
ZooKeeper has been with Kafka for a long time, but it won't stay, because Kafka has been gradually developing itself to remove the dependency on ZooKeeper:
• Kafka 2.x cannot work without ZooKeeper.
• Kafka 3.x can run without ZooKeeper, using KRaft mode.
• Kafka 4.x has no more ZooKeeper.
Why was ZooKeeper removed?
Because ZooKeeper became a scaling bottleneck, with several pain points:
• A Kafka cluster can only create a limited number of partitions (only about 200,000).
• Security in ZooKeeper is weaker than in Kafka.
• Metadata in Kafka and ZooKeeper can get out of sync.
Should we use Zookeeper?
Kafka ver. 4 is not yet production-ready, so the recommendation is to keep ZooKeeper for now; but as a best practice during development, we should avoid building on ZooKeeper.
Kafka in KRaft mode:
In 2019, the Kafka project planned to migrate away from ZooKeeper because of scaling issues once Kafka clusters have more than 100,000 partitions; with ZooKeeper removed, Kafka can scale to millions of partitions. So Kafka 3.x introduced a Raft-based protocol (KRaft) to replace ZooKeeper.
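As a sketch of what KRaft mode looks like in practice, a combined broker-and-controller node can be configured roughly like this (values are illustrative; Kafka ships a reference file at config/kraft/server.properties):

# server.properties for a single-node KRaft cluster (no ZooKeeper)
process.roles=broker,controller                # this node is both broker and controller
node.id=1
controller.quorum.voters=1@localhost:9093      # the Raft quorum of controller nodes
listeners=PLAINTEXT://localhost:9092,CONTROLLER://localhost:9093
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
log.dirs=/tmp/kraft-combined-logs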