Thursday, 10 April 2014

Big Data : An Introduction

What is Big Data?

For years, companies have made decisions based on analytics performed over huge volumes of data stored in relational databases. Storing data in that structured form used to be very costly, whereas storing and processing huge amounts of unstructured data (big data) is much cheaper and faster. This is the main reason big data has attracted so much attention. Big data typically comes from sources such as:
  • Data from social media sites.
  • Enterprise data including customer related information of CRM applications.
  • Logs of any system, be it related to software or manufacturing.
Big data and its processing are characterised by three qualities (the "3 Vs"):
  • Volume : We normally speak of gigabytes; here the scale runs to terabytes, petabytes, exabytes and beyond, and the amount keeps increasing.
  • Velocity : Data arrives and must be processed quickly. Relational databases do not scale linearly, yet we expect the same performance even when the data is huge.
  • Variety : Most of the data is unstructured, with only a small amount of structured data.
Why is big data important?

The economic value of big data varies a lot. Sometimes its advantages are indirect, as in decision making. Typically a good amount of information is hidden within this big chunk of unstructured data, and we should be able to figure out precisely which part of the data is valuable and can be used further. The analytics over such a huge amount of data should also be produced as fast as possible. There are many use cases:
  • Twitter wants to find the most retweeted or trending tweets, or tweets containing a particular hashtag.
  • Google has to serve the results of an enormous number of search queries.
  • Ad publishers need to know how many new ads have been posted.
  • Quora has to publish newly posted questions and generate a news feed tailored to each user's topics and followed people.
  • Millions of notification emails are sent from a great many websites.
  • App stores track the number of applications downloaded; news sites must surface newly published articles and posts.
Especially with the increasing use of smartphones and GPS-enabled devices, ad publishers want to display location-specific ads, or ads for stores near the user's current location. This helps in targeting the right set of customers.
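As a toy illustration of one such use case, counting hashtags over a batch of tweets can be sketched in a few lines of Python. The tweet texts below are made up for illustration; a real system would stream them from an API or a log store.

```python
from collections import Counter

# Hypothetical batch of tweet texts (illustrative only).
tweets = [
    "Loving the new phone #tech #gadgets",
    "Big data is everywhere #tech #bigdata",
    "Morning run done #fitness",
]

def hashtag_counts(texts):
    """Count hashtag occurrences across a collection of tweets."""
    counter = Counter()
    for text in texts:
        counter.update(
            word.lower() for word in text.split() if word.startswith("#")
        )
    return counter

counts = hashtag_counts(tweets)
print(counts.most_common(1))  # the most frequent hashtag
```

At big-data scale the same counting logic would run in parallel across many machines, but the idea is identical.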


To get maximum benefit from big data, we should be able to process all of the data together instead of processing it in distinct small sets. Since this data is much larger than traditional data, the real challenge is to handle it in a way that overcomes the computational and mathematical hurdles. Its heterogeneous and incomplete nature must also be tackled in processing: when we do computations manually, we can handle such heterogeneity, but a machine expects the data to be complete and homogeneous. Another challenge is sheer volume, which keeps increasing rapidly. On the hardware front, this is taken care of by increasing the number of cores rather than simply raising the clock speed, and by replacing traditional hard disk drives with storage offering better I/O performance. Cloud computing also helps handle the volume challenge by absorbing varying workloads. Finally, the large volume makes timeliness difficult: the more data there is to process, the longer it takes, yet in most cases getting the analysis quickly is crucial.
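The way systems such as Hadoop MapReduce reconcile "process all the data together" with limited single-machine capacity is to split the data across workers, compute partial results independently, and then merge them into one answer over the whole dataset. A minimal sketch of that pattern in plain Python, with illustrative log lines standing in for a real distributed file system:

```python
from collections import Counter
from functools import reduce

# Illustrative log lines; a real cluster would read terabytes
# from a distributed store such as HDFS.
log_lines = [
    "ERROR disk full",
    "INFO job started",
    "ERROR network timeout",
    "INFO job finished",
]

def map_chunk(lines):
    """Map step: count log levels within one chunk (one worker's share)."""
    return Counter(line.split()[0] for line in lines)

def merge(c1, c2):
    """Reduce step: combine partial counts from two workers."""
    return c1 + c2

# Split the data into chunks, process each independently,
# then merge the partial results into a global answer.
chunks = [log_lines[:2], log_lines[2:]]
total = reduce(merge, (map_chunk(chunk) for chunk in chunks))
print(total)
```

Because each chunk is processed independently, the map step parallelises across as many machines as the data requires, which is what makes the approach scale.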
