As previously mentioned, I currently write most of my blog postings over on the Atbrox blog (Atbrox is my startup company).
Here are the latest postings:
Thursday, August 26, 2010
Sunday, May 30, 2010
Evaluation of Search Predictions made in May 2000
In May 2000 I wrote "A few thoughts about the future of Internet Information Retrieval" (i.e. search), but how did it actually go? I've tried to evaluate the predictions in this posting, with each original prediction in italic font followed by its evaluation.
1a) Prediction - Specialized Services within Search
It seems likely that the specialization in the Internet Information Retrieval (IIR) business will continue. Internet information crawling, pre-processing, indexing, searching and presentation require different types of technologies and know-how; this might create opportunities for new companies specializing in only one step of the IIR "food chain". One possibility could be that companies doing crawling will offer extracts of relevant data on request, e.g. a search engine specializing in winter sports could get only relevant data extracted from several regional crawler companies. In other words, the IIR "food chain" might increase in length.
1b) Evaluation
Specialization of search services happened to some degree, but had relatively small impact. Examples of such services include fetching/crawl-related services (e.g. 80legs). But the services with the biggest impact are the free search APIs (e.g. Google Ajax Search API and Bing APIs) and the commercial ones (e.g. Yahoo Boss and Wolfram Alpha API). What they all have in common is that they offer the last step, i.e. search, and so implicitly cover all steps. Noteworthy developments in a related direction are cloud computing and the increasing number of large data sets (e.g. the infochimps collection, DBPedia and the Public Terabyte (crawl) dataset).
2a) Prediction about Potential New Search Players
As the importance of Internet Information Retrieval grows, players that have been concentrating on the lower end of the Internet "food chain", i.e. major bandwidth providers (e.g. MCI or British Telecom) and network software/hardware vendors (e.g. 3COM or Cisco) might want to enter the market as providers of partially indexed data to search engines and topic hierarchies.
2b) Evaluation
This didn't happen at all to my knowledge.
3a) Prediction about Potential New Search Technologies
With the increased growth of the amount of data on the Internet, new technologies for doing distributed indexing/search of data will probably occur. This is particularly interesting if processing and indexing of multimedia data (e.g. sound, pictures and video) becomes popular. Processing of multimedia data is considerably more CPU intensive than processing of textual data. Example of such processing could be automatic detection of objects (e.g. a car) in video frames.
3b) Evaluation
(Massively) distributed indexing in the "SETI@home style" didn't happen at large scale, though there are a few examples pursuing distributed indexing/search, e.g. the Majestic project. Processing of multimedia data, which looks obvious in retrospect, is happening (but the problems involved are not trivial to solve).
Conclusion
If I am kind - 0.5 on prediction 1, 0 on prediction 2 and 0.5 on prediction 3 ~ 33.33% correct?
Thursday, May 13, 2010
Overview of my postings on the Atbrox blog (Nov 2009-May 2010)
As mentioned in a previous posting, I mainly write on Atbrox's blog (and not here). In case you haven't seen them, here is an overview of my postings from November 2009 up to now in May 2010:
Search
Hadoop and Mapreduce
- Mapreduce & Hadoop Algorithms in Academic Papers (3rd Update)
- Parallel Machine Learning for Hadoop/Mapreduce - a Python Example
- So, what is Hadoop?
- Atbrox Customer Case Study - Scalable Language Processing with Elastic Mapreduce
(for even earlier postings on Atbrox check out this overview)
Sunday, January 24, 2010
My recent reads in Information Retrieval - Indexing
Information Retrieval (IR) - better known as Search - is probably the most exciting research field I know of. The reasons that make IR exciting are:
- solvability - it can probably never be solved perfectly, but can always be improved
- coverage - it spans all areas of computer science and touches many other sciences (e.g. statistics)
- importance - it is the most important research area related to supporting human decisions? (~AI)
- difficulty - it is extremely hard to do well
- applicability - it can be used practically anywhere (anytime).
Where to start learning about information retrieval?
Before jumping into research papers I suggest reading a book about IR, either:
Search Engines: Information Retrieval in Practice (2009) or
Introduction to Information Retrieval (2008)
They are both good and relatively similar books, written by a mix of authors from the search industry and academic IR research (note: I personally prefer the newer one).
My recent reads in Information Retrieval?
Indexing - algorithms and data structures for self-indexing
Self-indexing is where (lossless) compression meets indexing, and is an alternative to the classic inverted index. Self-indices have some nice characteristics with respect to compression, performance and query flexibility. Indexing-research rockstar Gonzalo Navarro even called it the Miracle of Self-indexing (2009).
2 key papers in the field are:
- Opportunistic Data Structures with Applications (2000)
  - Introduced the FM-index
- High-order entropy-compressed text indexes (2003)
  - Introduced the wavelet tree
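To give a feel for how an FM-index answers queries, here is a minimal Python sketch of backward search over the Burrows-Wheeler transform. It is an illustration under simplifying assumptions and not the construction from the paper: the suffix array is built naively, the occurrence table is stored uncompressed, and the full suffix array is kept for locating matches, so none of the space savings of a real self-index are achieved. The names `FMIndex`, `count` and `locate` are made up for this example.

```python
from collections import Counter


class FMIndex:
    """Toy FM-index: BWT plus C and Occ tables, queried by backward search."""

    def __init__(self, text, terminator="\0"):
        if terminator in text:
            raise ValueError("text must not contain the terminator")
        self.text = text + terminator
        n = len(self.text)
        # Naive suffix array construction; fine for a small illustration.
        self.sa = sorted(range(n), key=lambda i: self.text[i:])
        # BWT: the character preceding each sorted suffix (wraps around at 0).
        self.bwt = "".join(self.text[i - 1] for i in self.sa)
        # C[c] = number of characters in the text strictly smaller than c.
        counts = Counter(self.bwt)
        self.C = {}
        total = 0
        for c in sorted(counts):
            self.C[c] = total
            total += counts[c]
        # occ[c][i] = number of occurrences of c in bwt[:i] (uncompressed here).
        self.occ = {c: [0] * (n + 1) for c in counts}
        for i, ch in enumerate(self.bwt):
            for c, row in self.occ.items():
                row[i + 1] = row[i] + (1 if c == ch else 0)

    def _range(self, pattern):
        # Backward search: process the pattern from its last character to its
        # first, narrowing the suffix array interval [lo, hi) at each step.
        lo, hi = 0, len(self.bwt)
        for ch in reversed(pattern):
            if ch not in self.C:
                return 0, 0
            lo = self.C[ch] + self.occ[ch][lo]
            hi = self.C[ch] + self.occ[ch][hi]
            if lo >= hi:
                return 0, 0
        return lo, hi

    def count(self, pattern):
        lo, hi = self._range(pattern)
        return hi - lo

    def locate(self, pattern):
        lo, hi = self._range(pattern)
        return sorted(self.sa[lo:hi])


idx = FMIndex("abracadabra")
print(idx.count("abra"), idx.locate("abra"))
```

Counting a pattern takes one step per pattern character, independent of the text length. A real FM-index would store the BWT in compressed form (for example with a wavelet tree) and keep only a sample of the suffix array.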
Have a nice read :)