Friday, 25 April 2014

[Big Data] Apache Kafka - Part I

A huge amount of real-time data is continuously generated these days by various sources. There are many examples. On Facebook, your feed is continuously populated with newer items, and the list of your friends' recent activities updates as and when any activity happens. Similarly, the question-and-answer site Quora shows your notifications, answers, upvotes, newly asked questions and so on, without your having to click the refresh button. Twitter is another very good example.

On the other hand, there are many applications which want to consume this data. In most cases these data-consuming apps are not connected to the data-producing apps. Since data producers and consumers do not live under the same umbrella, we need a mechanism which seamlessly integrates the two ends, so that producers need not even know who the consumers are. They only have to push messages to a system as and when they are generated.

The data generated today is Big Data. We got familiar with Big Data in our previous post, and we know its characteristics (volume, velocity and variety) as well as its importance. The size of the generated data poses a big challenge for this integration system. In most cases it is not just about consuming the data, but also about performing analytics on it. Real-time analytics on a huge amount of data, producing real-time outputs, has to be catered for. There are also systems which do not need the data in real time: when they want to consume data, they connect, fetch the data generated up to that time, go offline again and then perform analytics on it.

Kafka is the intermediate system between producers and consumers, and it seamlessly allows different kinds of applications to consume messages. It is a publish-subscribe messaging system built around a commit log. It is designed to process real-time streams of activity data such as news feeds and logs.

It was developed at LinkedIn and later open-sourced. The need arose because LinkedIn had to deal with a very large number of events, such as updates and user activity, with low latency.

Kafka is a distributed, partitioned system. Logs are saved under various 'topics'. For each topic, Kafka saves messages in partitions, with the aim of scaling, fault tolerance and parallel consumption. Each partition is an ordered, immutable sequence of messages to which new messages are continually appended. The log is kept for a predefined amount of time. Consumers can subscribe to multiple topics. Within a partition, messages are stored in order and each message has a sequential id, called its 'offset'. Each consumer keeps track of its own position in the log as an offset, which advances as it consumes messages. Usually a consumer will consume messages in order.
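
The partition and offset idea above can be sketched in a few lines of Python. This is only a toy in-memory model, not Kafka's client API: the class names `Partition` and `Consumer` and the method names `append`, `read` and `poll` are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Append-only, ordered sequence of messages (illustrative, not Kafka's API)."""
    messages: list = field(default_factory=list)

    def append(self, message):
        """Append a message and return its offset (its sequential id)."""
        self.messages.append(message)
        return len(self.messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at the given offset."""
        return self.messages[offset:offset + max_messages]


@dataclass
class Consumer:
    """Tracks its own position (offset) in one partition."""
    partition: Partition
    offset: int = 0

    def poll(self, max_messages=10):
        batch = self.partition.read(self.offset, max_messages)
        self.offset += len(batch)  # the offset advances as messages are consumed
        return batch


partition = Partition()
for event in ["login", "post", "like"]:
    partition.append(event)

# Two consumers read the same partition at their own pace.
fast = Consumer(partition)
slow = Consumer(partition)
first_batch = fast.poll()
slow_batch = slow.poll(max_messages=1)
```

Because reading never removes anything from the partition, the two consumers do not interfere with each other; each one only moves its own offset.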


A server in a Kafka cluster is called a broker. The Kafka cluster retains messages for a predefined period. So even if a consumer is not continuously connected to the cluster, it can connect at intervals and consume the messages published up to that time, as long as they are still within the retention period.
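
Time-based retention and a consumer that connects only occasionally can be sketched as follows. This is a simplified model under stated assumptions: the class `RetainedLog`, its methods and the 60-second window are hypothetical, and time is passed in explicitly as `now` to keep the example deterministic.

```python
class RetainedLog:
    """Keeps messages for retention_seconds, consumed or not (illustrative only)."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.entries = []      # list of (offset, timestamp, message)
        self.next_offset = 0

    def append(self, message, now):
        self.entries.append((self.next_offset, now, message))
        self.next_offset += 1

    def expire(self, now):
        """Drop entries older than the retention window."""
        cutoff = now - self.retention_seconds
        self.entries = [e for e in self.entries if e[1] >= cutoff]

    def read_from(self, offset):
        """Return (offset, message) pairs at or after the given offset."""
        return [(o, m) for o, _, m in self.entries if o >= offset]


log = RetainedLog(retention_seconds=60)
log.append("a", now=0)
log.append("b", now=30)
log.append("c", now=50)

# A consumer that was offline connects at t=55 and catches up from offset 0.
log.expire(now=55)
caught_up = log.read_from(0)

# By t=100, messages older than 60 seconds are gone whether or not they were read.
log.expire(now=100)
remaining = log.read_from(0)
```

The point of the sketch is that expiry depends only on message age, not on whether any consumer has read the message.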

It is up to the producer to decide which topic, and which partition within it, a message is published to. Consumers can be grouped together into consumer groups. When a message is published, it is delivered to one consumer within each subscribing consumer group.
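
A rough sketch of these two ideas, the producer picking a partition and each consumer group receiving a message exactly once, is given below. The hash-based partition choice and the round-robin assignment are assumptions made for illustration; real Kafka clients use their own partitioners and assignment strategies. The function names and the group and consumer names are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # assumed partition count for the illustration


def choose_partition(key, num_partitions=NUM_PARTITIONS):
    """Producer-side choice: hash the key so equal keys share a partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(consumers, num_partitions=NUM_PARTITIONS):
    """Give each partition to exactly one consumer in a group (round-robin)."""
    assignment = defaultdict(list)
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return dict(assignment)


def deliver(key, groups):
    """Return, per group, the single consumer that receives the message."""
    p = choose_partition(key)
    receivers = {}
    for group_name, consumers in groups.items():
        assignment = assign_partitions(consumers)
        receivers[group_name] = next(
            c for c, parts in assignment.items() if p in parts
        )
    return receivers


groups = {"analytics": ["a1", "a2"], "search": ["s1"]}
receivers = deliver("user-42", groups)
```

Every group appears once in `receivers`, so the message reaches both the analytics and the search applications, but only one consumer inside each group handles it.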

A single Kafka broker can handle hundreds of megabytes of reads and writes per second from a large number of clients.
Messages are persisted on disk and replicated within the cluster. Because messages are retained rather than deleted on consumption, a message can be consumed multiple times. Kafka is cluster-centric, which provides fault tolerance and durability.

Sunday, 20 April 2014

AnswerReader : An Awesome App in the Making!

AnswerReader is a powerful app to organize and customize Quora, providing a friendly experience.
AnswerReader acts as a single interface for performing various activities. Using Quora from a browser will most likely result in multiple browser tabs being opened. AnswerReader instead provides a multi-column view of Quora. Users can choose which topics appear in those columns. The saved topics appear in their saved order until the user removes or changes them, so there is no need to set them up every time AnswerReader is used. Any column can be scrolled into the main readable area by simply clicking on the topic name in the left navigation panel. AnswerReader also provides quick access to most profile-related info, such as stats and credits.
  • Create a customized Quora view: Manage columns, shortcuts and much more- all in one app.
  • Boost Productivity: No need to choose what you want to see in the AnswerReader columns every time.
  • All in one interface: Manage answers, comments, replies, upvotes, drafts, posts.
  • Stay Focused: Never miss anything related to the topic you are most interested in.
  • Multiple Shortcuts: Use shortcuts without leaving the main page. Get rid of opening multiple browser tabs.
  • Follow without actually following: Keep track of activities of a topic or question even without following it in Quora.
  • Manage What or Whom to Follow: Follow or unfollow a question, topic or person.
  • Become a Power User: Everything that you can do on Quora plus ease and multi-column view.
Download from Chrome Web Store: 
AnswerReader is an open source project. Fork it on GitHub:
Note: This tool is under active development. A lot of new features are coming soon.

Thursday, 10 April 2014

Big Data : An Introduction

What is Big Data?

For years, companies have been making decisions based on analytics performed on huge amounts of data stored in relational databases. Saving data in a structured manner in relational databases used to be very costly. Storing and processing huge amounts of unstructured data (big data) is much cheaper and faster. This is the main reason why big data has attracted so much attention. Big data typically comes from the following sources:
  • Data from social media sites.
  • Enterprise data, including customer-related information from CRM applications.
  • Logs of any system, be it related to software or manufacturing.
Big data and its processing are characterised by 3 qualities:
  • Volume : Normally we speak of gigabytes (GB). Here we speak of terabytes, petabytes, exabytes and beyond, and the amount keeps on increasing.
  • Velocity : Data arrives quickly and has to be processed quickly. Relational databases do not scale up linearly, yet we expect the same performance even when the data is huge.
  • Variety : Most of the data is unstructured, with a small amount of structured data too.
Why is big data important?

The economic value of big data varies a lot. Sometimes its advantages are indirect, as in decision making. Typically a good amount of information is hidden within this big chunk of unstructured data, and we should be able to figure out precisely which part of the data is valuable and can be used further. This leaves us wondering what we ultimately do with such a huge amount of data, and with analytics that should be produced as fast as possible. There are many use cases: Twitter wants to find the most retweeted or trending tweets, or tweets containing a particular hashtag; Google has to return results for a huge number of queries; ad publishers need to know how many new ads have been posted; Quora has to publish newly posted questions and generate a news feed according to the topics and people each user follows; websites send millions of notification emails; app stores track the number of applications downloaded; news sites display newly published articles and posts; and a whole lot more. Especially with the increasing use of smartphones and GPS-enabled devices, ad publishers want to display location-specific ads, or ads for stores located near the user's current location. This helps in targeting the right set of customers.


To get the maximum benefit from big data, we should be able to process all of the data together instead of processing it in distinct small sets. Since this data is much larger than our traditional data, the real challenge is to handle it in a way that overcomes the computational and mathematical difficulties. The heterogeneous and incomplete nature of the data has to be tackled during processing. When we do computations manually, we can handle such heterogeneity. However, when the processing has to be done by a machine, it expects the data to be complete and homogeneous. Another challenge with big data is its volume, which keeps increasing rapidly. On the hardware front, this is addressed by increasing the number of cores rather than simply increasing the clock speed, and by replacing traditional hard disk drives with storage that offers better I/O performance. Cloud computing also helps handle the volume challenge by being able to process varying workloads. The large volume of data also makes timeliness difficult to achieve: the more data there is to process, the more time it takes. However, getting the analysis quickly is crucial in most cases.