Friday, 25 April 2014

[Big Data] Apache Kafka - Part I

Huge amount of real-time data is continuously getting generated these days by various sources. We’ve so many examples: Facebook-  Your feed will continuously be populated with newer and newer items, you also have recent activities of your friends listed down which keeps on updating as and when any activity happens. Similarly, the question answer site Quora shows your notifications, answers, upvotes, new questions asked etc. You do not have to click the refresh button to get them. Twitter is another such very good example. On the other hand there are many such applications which want to consume this data. In most of the cases these data consuming apps are not connected to the data producing apps. Since we do not have data producers and consumers under the same umbrella, we need a mechanism which will seamlessly integrate these ends. So producers need not even know who the consumers are. They just have to bother about their work of pushing messages to a system as and when generated. The generated data is Big Data in present time. We have already got familiar with Big Data in our previous post, we also know its characteristics- volume, velocity and variety, as well as its importance. Size of the data generated poses a big challenge in this integration system. In most of the cases it is not just about consuming the data, but also performing analytics on it. Real-time analytics on huge amount of data to produce real-time outputs is something that has to be catered to. Yes, there are some systems which do not need the real-time data. When they want to consume data, they get connected, get the data generated till that time, go back offline again and then perform analytics on the data.

Kafka is the intermediate system between the producers and consumers which seamlessly allows different kinds of applications to consume messages. It is a publish-subscribe commit log system. It is designed to process real-time data stream activity like news feed and logs.

It was developed at LinkedIn and later open-sourced. Need- since LinkedIn had to deal with so many events e.g. Updates, user activity. With low latency.

Kafka is a distributed, partitioned system. Logs are saved under various 'topics'. For each topic, Kafka saves messages in partitions with the intention of scaling, fault-tolerance and parallel consumption. Each partition is ordered, immutable sequence of messages which keep on adding to the log. The log is saved for a predefined amount of time. Consumers can subscribe to multiple topics. Messages are stored in order and each message has got a sequential id. There is 'offset' of each consumer. Offset shifts with consumption of messages. Usually a consumer will consume messages in order.


A server in Kafka cluster is called as Broker. Kafka cluster saves the messages for predefined period. So even if a consumer is not continuously connected with the cluster, it can keep connecting at specified periods and consume the messages published by that time.

It is upto the producers in which topic and which partition should the message get published. Consumers can be grouped together into consumer groups. When a messages is published, it is delivered to one consumer within the consumer group.

A single Kafka broker can handle terabytes of reads/writes per second from multiple clients.
Messages persist on disk and replicated within the cluster. We can consume a message multiple times since there is no data loss. Kafka is cluster-centric which allows fault-tolerance and durability.

No comments:

Post a Comment