Sunday, January 24, 2010

My recent reads in Information Retrieval - Indexing


Information Retrieval (IR) - better known as Search - is probably the most exciting research field I know of, the reasons that makes IR exciting are:
  • solvability - it can probably never be solved perfectly, but always be improved
  • coverage - it spans all areas of computer science and touches many other sciences (e.g. statistics)
  • importance - it is the most important research area related to supporting human decisions? (~AI)
  • difficulty - it is extremely hard to do well
  • applicability - it can be used practically anywhere (anytime).
Where to start learning about information retrieval?
Before jumping into research papers I suggest reading a book about IR, either:
Search Engines: Information Retrieval in Practice (2009) or
Introduction to Information Retrieval (2008)
They are both good and relatively similar books written by a mix of authors from search industry and academic IR research (note: I personally prefer the newest one).

My recent reads in Information Retrieval?


Indexing - algorithms and datastructures for self-indexing
Self-indexing is where (lossless) compression meets indexing, and is an alternative to the classic inverted index. Self-indices has some nice characteristics wrt compression, performance and query-flexibility. Indexing-research-rockstar Gonzalo Navarro even called it the Miracle of Self-indexing (2009).
2 key papers in the field are:
  1. Opportunistic Data Structures with Applications (2000)
    • Introduced the FM-index
  2. High-order entropy-compressed text indexes (2003)
    • Introduced the Wavelet Index Tree
Check out Navarro's survey paper Compressed Full-Text Indexes (2007) for a good overview of self-indexing.

Have a nice read :)