Apache Kafka is a distributed, real-time data streaming platform. Processing payments, real-time tracking and monitoring, and continuous analytics of sensor data are some examples of event streaming.
Kafka consists of servers and clients communicating over a high-performance TCP network protocol. Data is organized into topics, making retrieval efficient and easy, and it is replicated across servers for fault tolerance.
Kafka was first developed at LinkedIn to solve a very specific data problem. They had multiple applications feeding data to other applications for processing, analyzing, monitoring, or rendering. As the applications scaled up, dealing with the huge data flow became highly complex and unmaintainable. The need for one platform that could talk to all of these applications, rather than the conventional approach of connecting applications directly to one another, led to the creation of Kafka, and it has evolved tremendously since then.
What is Data Streaming?
A data stream is a continuous flow of data generated in real time from various sources. Streams are usually stored, processed, or analyzed in real time to produce actionable results.
Message Queues vs Data Streaming
Message queues deliver each message to a single consumer, whereas streaming brokers deliver to multiple subscribers. In a message queue, a message is lost once delivered; stream brokers keep a fault-tolerant, distributed log, so reprocessing is possible.
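The contrast can be sketched with a toy model in plain Python (illustrative names only, nothing from a real client library): a queue hands each message to one consumer and then drops it, while a log retains messages and lets each consumer track its own offset.

```python
from collections import deque

# Toy message queue: a message, once delivered, is gone.
queue = deque(["a", "b", "c"])
first_read = [queue.popleft() for _ in range(len(queue))]
# The queue is now empty; a second consumer would get nothing.

# Toy stream log: messages are retained; each consumer keeps its own offset.
log = ["a", "b", "c"]
offsets = {"consumer1": 0, "consumer2": 0}

def poll(consumer):
    """Return the next message for this consumer and advance its offset."""
    pos = offsets[consumer]
    if pos >= len(log):
        return None
    offsets[consumer] = pos + 1
    return log[pos]

c1 = [poll("consumer1") for _ in range(3)]
offsets["consumer1"] = 0           # rewind -> reprocessing is possible
c1_again = [poll("consumer1") for _ in range(3)]
```

Note that rewinding "consumer1" has no effect on "consumer2": each subscriber reads the shared log independently.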
A Kafka broker receives messages from producers and makes them available to consumers. A consumer can access data by specifying the topic, partition, and offset. Multiple brokers can form a Kafka cluster to achieve fault tolerance. The data is distributed across the brokers in various partitions, stored in such a way that no data is lost if a broker fails.
A topic is a name given to a specific dataset, analogous to a table name in a relational database. A producer's data is stored in specific topics, and consumers request a specific topic to fetch data.
Topic Partitions: Data related to a particular topic is stored in a distributed fashion across various partitions. The number of partitions is decided by the architect during the design phase.
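A minimal in-memory sketch of this addressing model (hypothetical names; a real broker persists and replicates the log across machines):

```python
# Toy broker state: topic -> list of partitions, each an append-only list.
logs = {"demo1": [[], []]}   # topic "demo1" with 2 partitions

def produce(topic, partition, message):
    """Append a message and return its offset within the partition."""
    partition_log = logs[topic][partition]
    partition_log.append(message)
    return len(partition_log) - 1

def fetch(topic, partition, offset):
    """A consumer addresses a record by (topic, partition, offset)."""
    return logs[topic][partition][offset]

off0 = produce("demo1", 0, "payment-1")
off1 = produce("demo1", 0, "payment-2")
```

Offsets are per-partition, which is why a consumer must name both the partition and the offset to locate a record.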
Kafka consumers belonging to a group share the same group ID, and the work is shared among them. Kafka assigns each topic partition to one consumer in the group, such that each partition is read by exactly one member.
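A rough sketch of such an assignment (simple round-robin in plain Python; Kafka's actual partition assignors are more sophisticated):

```python
def assign(partitions, consumers):
    """Assign each partition to exactly one consumer, round-robin."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

assignment = assign(["demo1-0", "demo1-1", "demo1-2"], ["c1", "c2"])
```

Every partition lands with exactly one consumer, so records within a partition are processed by a single group member at a time.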
There are two kinds of connectors: source and sink. A source connector consumes data from a source system and sends it to the Kafka cluster; a sink connector delivers data from the Kafka cluster to a destination system. Connectors can move large datasets in and out of Kafka using APIs compatible with various source and destination data stores.
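In spirit, a connector is just a loop copying records between an external system and Kafka. A hand-wavy sketch (all names hypothetical, not the real Kafka Connect API):

```python
# Toy "cluster": topic -> list of records.
cluster = {"demo1": []}

def source_connector(external_rows, topic):
    """Copy rows from a source system (here, a plain list) into the cluster."""
    cluster[topic].extend(external_rows)

def sink_connector(topic, destination):
    """Copy records from the cluster into a destination store."""
    destination.extend(cluster[topic])

db_rows = ["row1", "row2"]     # pretend these came from a database
warehouse = []                 # pretend this is a data warehouse
source_connector(db_rows, "demo1")
sink_connector("demo1", warehouse)
```

The real Connect framework adds offset tracking, scaling across workers, and pluggable converters, but the source-in / sink-out shape is the same.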
KSQL: A SQL interface to Kafka streams
Kafka is a master-less cluster. ZooKeeper keeps track of Kafka brokers, topics, partitions, etc., and monitors the status of the Kafka cluster nodes. ZooKeeper maintains ephemeral nodes (znodes), which exist as long as the session with the client (in this case, the Kafka broker) is active; when the session ends, the znode is deleted. ZooKeeper also keeps a record of the partition leader.
Kafka Controller: One of the Kafka brokers in the cluster is elected as the controller. It monitors the Kafka brokers and reassigns work when one of the active brokers fails (fault tolerance). Which node is acting as the controller can be queried from ZooKeeper.
Assignment of the controller: Every Kafka broker, on startup, tries to create the controller node in ZooKeeper. The first to succeed is elected as the controller; the other brokers receive a flag indicating that a controller has already been elected. If the controller fails or is taken down, the node disappears, the flag no longer exists, and the remaining brokers again race to create the controller node; the first to succeed becomes the new controller.
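This "first to create the znode wins" race can be simulated with a dictionary standing in for the ZooKeeper tree (purely illustrative):

```python
zookeeper = {}  # stands in for the ZooKeeper node tree

def try_become_controller(broker_id):
    """Try to create the ephemeral /controller node; only the first succeeds."""
    if "controller" in zookeeper:
        return False              # flag: controller already elected
    zookeeper["controller"] = broker_id
    return True

def controller_session_ends():
    """Session ends -> the ephemeral znode is deleted."""
    zookeeper.pop("controller", None)

results = [try_become_controller(b) for b in (1, 2, 3)]
first_controller = zookeeper["controller"]
controller_session_ends()                 # controller fails / is taken down
re_elected = try_become_controller(2)     # remaining brokers race again
```

The ephemeral nature of the znode is the key: no explicit failure detection is needed, because the node vanishing is itself the signal to re-elect.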
Where can I use them?
- Big Data Integration/ Data Lakes
- Data Warehouse
- Application Logs gathering
- Activity Tracking
- Stream Processing
Note: Data Lakes vs Data Warehouses – data lakes store raw or semi-structured data, with the schema defined after the data is stored, whereas data warehouses hold processed data with a well-defined schema applied before the data is stored.
A Small Demo
I am using the Confluent Kafka Community version, but the same steps apply to Apache Kafka as well.
What will we do:
- View Ephemeral Nodes
- Start Zookeeper
- Start Kafka Server
- Create a Kafka Topic
- Create a Producer (from console)
- Create a Consumer (from console)
To see the ephemeral nodes:
>bin\windows\zookeeper-shell.bat localhost:2181
Connecting to localhost:2181
Welcome to ZooKeeper!
JLine support is disabled

WATCHER::

WatchedEvent state:SyncConnected type:None path:null

ls /
[cluster, controller_epoch, controller, brokers, zookeeper, admin, isr_change_notification, consumers, log_dir_event_notification, latest_producer_id_block, config]
ls /brokers
[ids, topics, seqid]
ls /brokers/ids

ls /brokers/topics
[demo1, __consumer_offsets]
Start Kafka Server (ZooKeeper must be running first; the config paths below are from the Apache Kafka layout, and Confluent's layout differs slightly):
>bin\windows\zookeeper-server-start.bat config\zookeeper.properties
>bin\windows\kafka-server-start.bat config\server.properties
Creating a Kafka Topic:
>bin\windows\kafka-topics.bat --create --topic demo1 --partitions 1 --replication-factor 1 --bootstrap-server localhost:9092
Here, partitions lets you specify the number of partitions required.
Replication-factor -> A fault-tolerance feature that creates multiple copies of the data on different brokers. Because we are using a single broker, we use a replication factor of 1.
bootstrap-server -> The cluster coordinates are passed to this command through the bootstrap-server argument, which takes an IP/hostname and port, i.e. the default Kafka broker hostname (localhost) and the default Kafka broker listener port (9092).
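To make the replication-factor idea concrete, here is a simplified sketch of spreading each partition's replicas across distinct brokers (round-robin; Kafka's real placement also considers leaders and racks):

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """For each partition, pick `replication_factor` distinct brokers."""
    assert replication_factor <= len(brokers), "not enough brokers"
    placement = {}
    for p in range(num_partitions):
        # Shift the starting broker per partition to spread load.
        placement[p] = [brokers[(p + r) % len(brokers)]
                        for r in range(replication_factor)]
    return placement

# With a single broker, only replication factor 1 is possible:
single = assign_replicas(1, ["broker0"], 1)
multi = assign_replicas(3, ["b0", "b1", "b2"], 2)
```

The `replication_factor <= len(brokers)` check is exactly why our single-broker demo must use a replication factor of 1.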
>bin\windows\kafka-console-producer.bat --topic demo1 --broker-list localhost:9092 < data\sample1.csv
broker-list: serves the same purpose as bootstrap-server. In this case it is the same local machine, but in a real scenario the producer typically connects to a Kafka broker on a remote machine over a TCP/IP network.
>bin\windows\kafka-console-consumer.bat --topic demo1 --bootstrap-server localhost:9092 --from-beginning
Let me leave you with some interesting applications of Kafka at these well-known companies.
- LinkedIn: This is where Kafka was created. It is used for activity streaming and metrics, powering the LinkedIn news feed and various other products.
- Twitter: Used as a storage system to receive and process requests
- Netflix: Event-Processing Pipeline and Real-time monitoring
- Coursera: Data-pipeline for real-time learning analytics dashboard
- Pinterest: Log collection Pipeline