Tuesday, 3 January 2012

NoSQL...Hang on a sec

There are tons of articles/tutorials/blogs/presentations available online, but one has to be very careful in deciding whether to go for NoSQL or not and, if you do decide to use NoSQL, which database/datastore is the right choice for you.

To answer the first question, whether to go with NoSQL or not, you should first understand your requirements in detail. There are many points to consider; here are a few important ones:

1. How big is your data? What happens if you store it in a relational database like MySQL or Oracle? How many tables do you need to create? How many columns will each table have on average and, most importantly, how many rows will each table have?

2. In a relational database, the schema needs to be created first. Do you need some flexibility there? For example, in one of my projects I was working on a logging module where the logs came in various structures, so I wanted a flexible schema for it.

3. Some applications require fast data retrieval, whereas some require fast write access and some require both. Think about Google Search, where fast data retrieval is very important, whereas in an application like Twitter people tweet a lot (i.e. lots of write operations).

4. It's not just the application speed that matters; concurrent read and write access should also be taken into consideration. Think about the number of Facebook users writing simultaneously on various sections of the website.

5. Are indexing, caching etc. not a solution to your problem? In some applications, just introducing caching solves the problem. (Actually I was a bit confused about whether to include lines like this in this post, because performance improvement/optimization of existing applications is an altogether different topic and beyond the scope of this post.)

6. Are you creating an application for analytics?

7. Does your application have social-network features? Facebook itself is a big inspiration to go for NoSQL if you are using similar features. Facebook Engineering Notes is a good place to read about the latest technologies used at Facebook (http://www.facebook.com/Engineering?sk=notes), and if your interest grows, consider reading Facebook Engineering Papers (http://www.facebook.com/Engineering?sk=app_190322544333196). I also like Facebook Tech-Talks LIVE.
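
To make point 2 a little more concrete, here is a minimal sketch of what "flexible schema" means for something like a logging module. It uses only plain Python dictionaries and JSON, not any particular NoSQL client, and the log records, field names and the helper functions fields_used and find are all made up for illustration:

```python
import json

# Hypothetical log records of different shapes, as a logging module might emit them.
log_records = [
    {"level": "INFO", "message": "user logged in", "user_id": 42},
    {"level": "ERROR", "message": "payment failed", "order_id": "A-17", "retries": 3},
    {"level": "DEBUG", "message": "cache miss", "key": "profile:42", "elapsed_ms": 1.8},
]


def fields_used(records):
    """Return the set of all field names that appear in any record."""
    names = set()
    for record in records:
        names.update(record.keys())
    return names


def find(records, **criteria):
    """Return the records whose fields match every given criterion."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]


all_fields = fields_used(log_records)
errors = find(log_records, level="ERROR")

# Each record serializes on its own; no shared column list is required.
serialized = [json.dumps(r, sort_keys=True) for r in log_records]
```

In a relational table, every field that appears in any record would need its own column, and many of those columns would be empty for any given row. A document-style store keeps each record in roughly the shape shown above, which is the kind of flexibility I was after.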

Before I proceed to which db is appropriate for your application, here is a brief intro to a few NoSQL databases.

Though there are 100+ NoSQL databases available (here is the list - http://nosql-database.org/), I am mentioning a few popular ones:

-MongoDB (http://www.mongodb.org/): Written in C++. Drivers are available for many languages, such as Java, C, C++ and PHP. The data interchange format is BSON (Binary JSON).

-CouchDB (http://couchdb.apache.org/): Written in Erlang, with JavaScript as the main query language. Uses JSON for data and REST as the protocol. Useful in cases where data changes very often and you want to run predefined queries.

-Cassandra (http://cassandra.apache.org/): Written in Java, with Thrift used for the external client-facing API. It supports a wide range of languages (Thrift languages). It brings together the best blend of features of Google's BigTable and Amazon's Dynamo. Cassandra was developed and later open-sourced by Facebook.

-HBase (http://hbase.apache.org/): Built on top of Hadoop and written in Java. It is used when realtime read/write access to big data is needed.

The first two of these databases are document stores, while the other two are column stores (aka ColumnFamilies).

-Neo4j (http://neo4j.org/): A NoSQL graph database.

-Redis (http://redis.io/): An advanced key-value store.

Now, once you are convinced that your application requires NoSQL, here are a few points which will help you decide which NoSQL database/datastore is suited to you:

My approach is to choose the one which suits your application's data model. If your application has something like a social graph (some of the social networking features), then use a graph database like Neo4j. If your application requires storing a very large amount of data and processing it, then use a column-oriented database like HBase (Google's BigTable belongs to ColumnFamilies).
If you want fast lookups, then you can go for something like Redis, which supports key/value pairs. When the data structure can vary and you need document-type storage, go for something like MongoDB. If your application requires high concurrency and low-latency data access, then go for Membase.
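
As a rough illustration of these data models, the sketch below writes the same kind of user data in each of the four shapes using plain Python structures. None of this is the actual API of Redis, MongoDB, HBase or Neo4j; the names, keys and the friends_of_friends helper are hypothetical and only meant to show which kind of question each shape answers naturally:

```python
# Key/value (Redis-style): one opaque value per key, fetched by exact key.
key_value = {"user:1:name": "Asha", "user:2:name": "Ben"}

# Document (MongoDB-style): self-contained records whose fields may differ.
documents = [
    {"_id": 1, "name": "Asha", "tags": ["admin"]},
    {"_id": 2, "name": "Ben", "city": "Pune"},
]

# Column family (HBase/Cassandra-style): row key -> column family -> column -> value.
column_family = {
    "user1": {"info": {"name": "Asha"}, "stats": {"logins": 12}},
    "user2": {"info": {"name": "Ben"}, "stats": {"logins": 3}},
}

# Graph (Neo4j-style): nodes plus explicit relationships between them.
follows = {"Asha": {"Ben"}, "Ben": {"Chen"}, "Chen": set()}


def friends_of_friends(graph, person):
    """People exactly two 'follows' hops away, excluding direct follows and the person."""
    direct = graph.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= graph.get(friend, set())
    return two_hops - direct - {person}


name = key_value["user:1:name"]                              # exact-key lookup
logins = column_family["user2"]["stats"]["logins"]           # one column of one row
with_city = [d["name"] for d in documents if "city" in d]    # query on an optional field
suggestions = friends_of_friends(follows, "Asha")            # relationship traversal
```

If most of your queries look like the last line, a graph database is the natural fit; if they look like the first, a key/value store is probably enough.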

Even if you have chosen the appropriate NoSQL database, there are still certain things which you should make a note of:

1. Whether the db you have chosen is easy to manage and administer.

2. Developer's angle - Do you have the right set of people who can get started quickly on this? Does it have enough forums and community support? Affirmative answers to these questions are a must, since NoSQL DBs are still emerging and have not matured yet.

3. How actively the open source community is building tools/frameworks/utilities around it to make the developer's life easy.

Finally, there is a nice diagram available on the web (http://www.nijee.com/2011/02/golden-key-to-distributed-computing-is.html) - "Visual Guide to NoSQL Systems" - which shows which NoSQL db fits where based on the CAP theorem. Typically, choosing one db may not solve all your problems. Select one keeping certain features of your application in mind, and feel free to go for another one for other modules. Also, this doesn't mean removing the relational database completely. Personally, I always prefer to keep a relational db in my applications and use it in the places where it really makes sense!

Recommended sites:

Since I am a regular reader of the High Scalability site, I would recommend going through this URL: http://highscalability.com/blog/category/nosql . It has 38 nice articles on NoSQL which will really make you happy :)

Apart from this, InfoQ also has good content on NoSQL: http://www.infoq.com/nosql/

Another hot place these days to get smart answers is Quora. Do read the various NoSQL-related questions and the nice answers written by top Developers/Engineers/Architects from top organizations like Facebook, Twitter, LinkedIn, Amazon and various other hot startups at the following URL: http://www.quora.com/NoSQL

The list is indeed long, but I am not going to publish too many URLs and divert your attention :)