Thursday, 17 May 2012

Search: An Introduction

Most of us are familiar with searching for results in an RDBMS: simply write a LIKE query and you get the results. Unfortunately, this simple method is useful for simple use cases only. With huge data, the query keeps the database busy while it fetches the results. Let's move to a much higher level of difficulty. A single Google search returns many kinds of results: images, text, maps, videos and much more. My understanding is that the type of data decides which table it resides in. In that case, our search would have to navigate through all those tables to get us the results. A difficult situation indeed!
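To make the LIKE approach concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table `docs`, the index `idx_docs_body` and the sample rows are made up for illustration, and other databases may plan the query differently. The sketch only shows that a pattern with a leading wildcard is answered by scanning, even when the column is indexed.

```python
import sqlite3

# In-memory database with a hypothetical "docs" table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("CREATE INDEX idx_docs_body ON docs (body)")
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [
        ("Lucene is a full-text search library",),
        ("Solr is a search server built on Lucene",),
        ("A relational database stores rows in tables",),
    ],
)


def like_search(term):
    """Return ids of rows whose body contains term, using a LIKE query."""
    rows = conn.execute(
        "SELECT id FROM docs WHERE body LIKE ? ORDER BY id", (f"%{term}%",)
    )
    return [row[0] for row in rows]


def query_plan(term):
    """Return the planner's description of how the LIKE query is executed."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM docs WHERE body LIKE ?",
        (f"%{term}%",),
    )
    # The human-readable detail is the last column of each plan row.
    return [row[-1] for row in rows]


print(like_search("Lucene"))
print(query_plan("Lucene"))
```

The plan text differs between SQLite versions, but it describes a scan rather than an index lookup, which is why this approach gets slower as the table grows.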

Lucene is a full-text indexing and searching library originally written in Java, and since ported to other languages such as C++, C# and PHP. Best of all, it is open source.

A Lucene-based application uses a parser for each type of document. Each parser extracts the text and passes it on to the next stage, the analyzer. Analyzers are components that pre-process input text, and this is where tokens are generated: the analyzer pulls the tokens and their related information out of the text content, and this information is then written to Lucene's index files. Since the search string has to be processed the same way as the indexed text, the same analyzer must be used for both indexing and searching; otherwise we will get invalid search results.
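The analyzer rule above is easier to see with a toy example. The sketch below is not Lucene code. It is a small inverted index in plain Python, and `simple_analyzer`, `whitespace_analyzer`, `build_index` and `search` are names invented for this illustration. It indexes two sentences with one analyzer and then searches once with the same analyzer and once with a different one.

```python
import re
from collections import defaultdict


def simple_analyzer(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def whitespace_analyzer(text):
    """Split on whitespace only, keeping case and punctuation."""
    return text.split()


def build_index(docs, analyzer):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyzer(text):
            index[token].add(doc_id)
    return index


def search(index, query, analyzer):
    """Return ids of documents containing every token of the query."""
    tokens = analyzer(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


docs = {1: "Lucene is a Search Library.", 2: "Solr is a search server."}
index = build_index(docs, simple_analyzer)

print(search(index, "Search Library", simple_analyzer))      # prints {1}
print(search(index, "Search Library", whitespace_analyzer))  # prints set()
```

The index only contains lowercase tokens, so the second search, which keeps the capital letters, finds nothing even though the document is there.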

Lucene offers indexing performance as high as 95 GB/hour. Its incremental indexing is as fast as batch indexing, and the index size is only about 20-30% of the size of the text content. In terms of searching, Lucene is also very powerful: ranked searching that returns the best results first, many powerful query types such as phrase queries and range queries, field-based search, sorting by fields, and merging of results when searching multiple indexes.
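To give a feel for what a phrase query and a range query mean, here is a rough sketch in plain Python. It does not use Lucene and says nothing about how Lucene implements these queries internally. The functions `build_positional_index`, `phrase_query` and `range_query`, and the sample documents and years, are all invented for this illustration.

```python
from collections import defaultdict


def tokenize(text):
    """Very small stand-in for an analyzer: lowercase and split on spaces."""
    return text.lower().split()


def build_positional_index(docs):
    """Map each token to {doc_id: [positions where the token occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(tokenize(text)):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, phrase):
    """Return sorted doc ids where the phrase tokens occur consecutively."""
    tokens = tokenize(phrase)
    if not tokens:
        return []
    hits = []
    for doc_id, starts in index.get(tokens[0], {}).items():
        for start in starts:
            # Token i of the phrase must sit at position start + i.
            if all(
                start + i in index.get(tok, {}).get(doc_id, [])
                for i, tok in enumerate(tokens)
            ):
                hits.append(doc_id)
                break
    return sorted(hits)


def range_query(field_values, low, high):
    """Return sorted doc ids whose field value lies in [low, high]."""
    return sorted(d for d, v in field_values.items() if low <= v <= high)


docs = {
    1: "open source search library",
    2: "search the open library",
    3: "an open source project",
}
years = {1: 2010, 2: 2012, 3: 2015}  # hypothetical "year" field per document

index = build_positional_index(docs)
print(phrase_query(index, "open source"))  # prints [1, 3]
print(range_query(years, 2011, 2015))      # prints [2, 3]
```

Document 2 contains the word "open" but not followed by "source", so the phrase query skips it, while a plain word search would have returned it.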

Lucene, along with the Solr search server, is used for real-time search at Twitter and on various other sites such as Wikipedia. Solr is an open source enterprise search platform from the Apache Lucene project. It is written in Java and runs as a standalone full-text search server within a servlet container. Its major features include powerful full-text search, faceted search, dynamic clustering, database integration, rich document format handling, and geospatial search. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from almost any programming language. Popular sites using Solr include Instagram, AOL, eBay, Cisco, Digg, Reddit, MTV Networks and Goldman Sachs.
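Because Solr is queried over HTTP, a search request is just a URL. The sketch below only builds such a URL and does not contact any server. The host `localhost:8983`, the core name `articles` and the field names `title` and `author` are assumptions for illustration. `q`, `rows`, `wt`, `facet` and `facet.field` are standard Solr request parameters, but the exact path depends on how a given Solr installation is set up.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, rows=10, facet_field=None):
    """Build a Solr select URL that asks for JSON results."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_field:
        # Ask Solr to also return value counts for one field (faceting).
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Hypothetical server, core and field names.
url = build_select_url(
    "http://localhost:8983", "articles", "title:lucene", facet_field="author"
)
print(url)
```

The resulting URL can be opened from any language with an HTTP client, which is what makes Solr easy to use outside Java.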

Another wonderful item in this set is Luke. Luke is a handy diagnostic and development tool that opens existing Lucene indexes and lets you display and modify their contents.
