Sunday, January 24, 2010

My recent reads in Information Retrieval - Indexing


Information Retrieval (IR) - better known as Search - is probably the most exciting research field I know of, the reasons that makes IR exciting are:
  • solvability - it can probably never be solved perfectly, but always be improved
  • coverage - it spans all areas of computer science and touches many other sciences (e.g. statistics)
  • importance - it is the most important research area related to supporting human decisions? (~AI)
  • difficulty - it is extremely hard to do well
  • applicability - it can be used practically anywhere (anytime).
Where to start learning about information retrieval?
Before jumping into research papers I suggest reading a book about IR, either:
Search Engines: Information Retrieval in Practice (2009) or
Introduction to Information Retrieval (2008)
They are both good and relatively similar books written by a mix of authors from search industry and academic IR research (note: I personally prefer the newest one).

My recent reads in Information Retrieval?


Indexing - algorithms and datastructures for self-indexing
Self-indexing is where (lossless) compression meets indexing, and is an alternative to the classic inverted index. Self-indices has some nice characteristics wrt compression, performance and query-flexibility. Indexing-research-rockstar Gonzalo Navarro even called it the Miracle of Self-indexing (2009).
2 key papers in the field are:
  1. Opportunistic Data Structures with Applications (2000)
    • Introduced the FM-index
  2. High-order entropy-compressed text indexes (2003)
    • Introduced the Wavelet Index Tree
Check out Navarro's survey paper Compressed Full-Text Indexes (2007) for a good overview of self-indexing.

Have a nice read :)

Tuesday, November 24, 2009

My latest postings (on Atbrox) related to Hadoop/Mapreduce & Search

As mentioned in a previous posting I mainly write on Atbrox's blog, here are my postings in October and (so far in) November.

Wednesday, October 7, 2009

From now on mainly blogging on Atbrox.com's blog

Atbrox is my new startup company (I left Google 1 month ago after 4 exciting years).

Sunday, March 8, 2009

Snakes on a Cloud

.. or Mobile Agents with Python

What are mobile agents?
"A mobile agent is a process that can transport its state from one environment to another, with its data intact, and be capable of performing appropriately in the new environment", source: Wikipedia.

Why Mobile Agents .. or how to deal with Data Gravity
The phrase "with its data" most likely refers to the agent's state data (and not all types of data), since one of the nicest properties of mobile agents is that they can move to where potentially huge amounts of data is. By plotting the curve "data gravity" - i.e. a rough estimate of how long time it takes to empty/fill/process a hard drive over a network connection - hard disk size divided by (typical) ethernet network speed over the last 20 years - the motivation for moving code to data (and not vice versa) is clearly increasing, making mobile agents a potentially interesting approach.

Mobile Agent Runtime Environment
A basic requirement for a mobile agent runtime environment is the ability to receive and run the agent's code, e.g. typically support one/several of the following (with Python-related examples):
i) receive and run binary code
python example: receive python compiled to binary with Shedskin and g++
ii) receive and compile source code and run it
python example: receive c source code and compile/integrate it with Python using Cinpy, or use Shedskin/g++ on received Python code
iii) receive and run interpreter on source code
python example: receive python code and interpret using the eval() method.

Bandwidth
If the mobile agents move around a bit, you probably want their representation as compact as possible to reduce bandwidth requirements, i.e. prefer agents represented in (small amounts of) source code - alternative ii) or iii) - over (larger amounts of) binary code - alternative i).

Processing
compiled code (or just-in-time compiled code) is usually more efficient than interpreted code (interpreted code can perhaps be seen as analog to the "gas guzzling cars" of computing wrt resource utilization, fortunately there are tools to deal with that), so alternative ii) is probably preferred over iii), and with the cinpy case compilation overhead is negligible (a few milliseconds to compile and make a short C method ready to be called from Python), which matter if you have a large amount of distributed mobile agents. Pareto principle also matters for mobile agents, so a mix (80-99% of code) in interpreted Python and the things that really need to perform in C (quickly compiled with cinpy) might be a common mix.

Example of Cinpy-wrapped C function in python
fibc=cinpy.defc(
"fib",
ctypes.CFUNCTYPE(ctypes.c_int,ctypes.c_int),
"""
int fib(int x) {
if (x<=1) return 1; return fib(x-1)+fib(x-2); } """)


Security
In case mobile agents move around on the cloud it is nice to know that the agent you receive is from a known source (yourself), this can e.g. be done using ezPyCrypto's signString() to sign the agent source and then use verifyString() methods on the signed agent source code together with its signature to check the origin of the agent (assuming the signer's public key is available on the receiving end).

disclaimer: this posting (and all others on this blog) only represents my personal views.

Saturday, December 27, 2008

Ajax with Python - Combining PyJS and Appengine with json-rpc

I recently (re)discovered pyjs  - also called Pyjamas - which  is a tool to support development of (client-side) Ajax applications with Python, it does that by compiling Python code to Javascript (Pyjamas is inspired by GWT  - which supports writing Ajax applications in Java).

pyjs' kitchensink comes with a JSON-RPC example, this posting shows how to use Appengine to serve as a JSON-RPC server for the pyjs json-rpc example.

(Rough) Steps:
1) download appengine SDK and create an appengine application (in the dashboard)
2) download pyjs 
3) Replace pyjs example/jsonrpc/JSONRPCExample.py with this code

from ui import RootPanel, TextArea, Label, Button, HTML, VerticalPanel, HorizontalPanel, ListBox


from JSONService import JSONProxy


class JSONRPCExample:
    def onModuleLoad(self):
        self.TEXT_WAITING = "Waiting for response..."
        self.TEXT_ERROR = "Server Error"
        self.remote_py = UpperServicePython()
        self.status=Label()
        self.text_area = TextArea()
        self.text_area.setText(r"Please uppercase this string")
        self.text_area.setCharacterWidth(80)
        self.text_area.setVisibleLines(8)
        self.button_py = Button("Send to Python Service", self)
        buttons = HorizontalPanel()
        buttons.add(self.button_py)
        buttons.setSpacing(8)
        info = r'This example demonstrates the calling of appengine upper(case) method with JSON-RPC from javascript (i.e. Python code compiled with pyjs to javascript).'
        panel = VerticalPanel()
        panel.add(HTML(info))
        panel.add(self.text_area)
        panel.add(buttons)
        panel.add(self.status)
        RootPanel().add(panel)


    def onClick(self, sender):
        self.status.setText(self.TEXT_WAITING)
        text = self.text_area.getText()
        if self.remote_py.upper(self.text_area.getText(), self) < 0:
            self.status.setText(self.TEXT_ERROR)


    def onRemoteResponse(self, response, request_info):
        self.status.setText(response)


    def onRemoteError(self, code, message, request_info):
        self.status.setText("Server Error or Invalid Response: ERROR " + code + " - " + message)


class UpperServicePython(JSONProxy):
    def __init__(self):
        JSONProxy.__init__(self, "/json", ["upper"])



4) Use the following code for the appengine app
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app
import logging
from django.utils import simplejson

class JSONHandler(webapp.RequestHandler):
  def json_upper(self,args):
    return [args[0].upper()]

  def post(self):
    args = simplejson.loads(self.request.body)
    json_func = getattr(self, 'json_%s' % args[u"method"])
    json_params = args[u"params"]
    json_method_id = args[u"id"]
    result = json_func(json_params)
    # reuse args to send result back
    args.pop(u"method")
    args["result"] = result[0]
    args["error"] = None # IMPORTANT!!
    self.response.headers['Content-Type'] = 'application/json'
    self.response.set_status(200)
    self.response.out.write(simplejson.dumps(args))

application = webapp.WSGIApplication(
                                     [('/json', JSONHandler)],
                                     debug=True)

def main():
  run_wsgi_app(application)

if __name__ == "__main__":
  main()
5) compile pyjs code in 3) and create static dir in appengine app to store compiled code (i.e in app.yaml)
6) use dev_appserver.py to test locally or appcfg.py to deploy on appengine

Facebook application
By using this recipe it was easy to create a facebook app of the example in this posting - check it out at http://apps.facebook.com/testulf/


Alternative backend - using webpy


This shows how to use webpy as backend, just put the javascript/html resulting from pyjs compile into the static/ directory. Very similar as the previous approach, with the code in blue being the main diff (i.e. webpy specific)
import web
import json
urls = (
'/json', 'jsonhandler'
)
app = web.application(urls, globals())

class jsonhandler:
  def json_upper(self,args):
    return [args[0].upper()]

  def json_markdown(self,args):
    return [args[0].lower()]

  def POST(self):
    args = json.loads(web.data())
    json_func = getattr(self, 'json_%s' % args[u"method"])
    json_params = args[u"params"]
    json_method_id = args[u"id"]
    result = json_func(json_params)
    # reuse args to send result back
    args.pop(u"method")
    args["result"] = result[0]
    args["error"] = None # IMPORTANT!!
    web.header("Content-Type","text/html; charset=utf-8")
    return json.dumps(args)

if __name__ == "__main__":
  app.run()

Wednesday, December 17, 2008

Simulating Mamma Mia under the xmas tree (with Python)

Norway has a population of ~4.7 million, and "Mamma Mia!" has during a few weeks been sold in ~600 thousand DVD/Blueray copies (> 12% of the population, i.e. breaking every sales record, perhaps with the exception of Sissel Kyrkjebø's Christmas Carol album which has sold a total of ~900 thousand copies in the same population, but that was over a period of 21 years).  In the UK it has sold more than 5 million (in a population of 59 million, i.e. >8% of the population).


Well, to the point. Guesstimating that there will be ~2 million xmas trees in Norway, one can assume that many of the trees will have (much) more than one "Mamma Mia!" dvd/blueray underneath it*,  the question is how many?

Simulating Mamma Mia under the xmas tree
.. Before you start recapping probability theory, combinatorics, birthday paradox, multiplication of improbabilities and whatnot, how about finding an alternative solution using simulation

A simple simulation model could be to assume that for every Mamma Mia dvd/blueray copy there is a lottery where all trees participate,  and then finally (or incrementally) count how many copies each tree won.

A friend wrote a nice Python script to simulate this:

import random

NUM_XMAS_TREES = 2000000
NUM_MAMMA_MIAS = 600000

tree_supplies = {}
for mamma_mia in range(NUM_MAMMA_MIAS):
    winner_tree = random.randint(0, NUM_XMAS_TREES)
    tree_supplies[winner_tree] = tree_supplies.get(winner_tree,0) + 1

tree_stats = {}
for tree in tree_supplies:
    tree_stats[tree_supplies[tree]] = tree_stats.get(tree_supplies[tree], 0) + 1

print tree_stats

Results:

$ for k in `seq 1 10`; do echo -n "$k " ; python mammamia.py; done
1 {1: 443564, 2: 67181, 3: 6618, 4: 510, 5: 36}
2 {1: 444497, 2: 66811, 3: 6543, 4: 520, 5: 32, 6: 2}
3 {1: 444796, 2: 66376, 3: 6738, 4: 510, 5: 36, 6: 3}
4 {1: 444499, 2: 66750, 3: 6652, 4: 469, 5: 30, 6: 2, 7: 1}
5 {1: 444347, 2: 66717, 3: 6697, 4: 494, 5: 28, 6: 2}
6 {1: 444511, 2: 66389, 3: 6763, 4: 551, 5: 40, 6: 3}
7 {1: 443914, 2: 66755, 3: 6785, 4: 511, 5: 33, 6: 2}
8 {1: 444747, 2: 66558, 3: 6667, 4: 484, 5: 40}
9 {1: 444553, 2: 66703, 3: 6631, 4: 497, 5: 32}
10 {1: 443903, 2: 66853, 3: 6774, 4: 487, 5: 23, 6: 1}

Conclusion

So we see that in run 4 there was one xmas tree that according to the simulation model got 7(!) Mamma Mia DVD/Bluerays underneath it, but the overall simulation shows that 5 or 6 (at most) is probably more likely (assuming the model is right).

Regarding the simulation model, it is probably way too simplistic, i.e. not taking into account people buying mamma mia for themselves (or not as xmas gifts), a likely skewness in terms of number of gifts per christmas tree, interaction between buyers, etc. But it can with relatively simple manners be extended with code to make it more realistic. Check out Markov Chain Monte Carlo simulation for more info on how to create more realistic simulation models.

Wednesday, December 3, 2008

cinpy - or C in Python

cinpy is a tool where you can write C code in your Python code (with the help of ctypes - included in modern Python versions). When you execute your python program the C code is compiled on the fly using Tiny C Compiler (TCC).

In this posting I will describe:
1) installation and testing cinpy
2) a simple benchmark (c-in-py vs python)
3) compare performance with gcc (c-in-py vs gcc)
4) measure cinpy (on-the-fly) compilation time
5) how to dynamically change cinpy methods

1. How to install and try cinpy (note: also found in cinpy/tcc README files)
1) download, uncompress and compile TCC
./configure
make
make install
gcc -shared -Wl,-soname,libtcc.so -o libtcc.so libtcc.o

2) download, uncompress and try cinpy
cp ../tcc*/*.so .
python cinpy_test.py  # you may have to comment out or install psyco 

2. Sample performance results (on a x86 linux box):

python cinpy_test.py
Calculating fib(30)...
fibc : 1346269 time: 0.03495 s
fibpy: 1346269 time: 2.27871 s
Calculating for(1000000)...
forc : 1000000 time: 0.00342 s
forpy: 1000000 time: 0.32119 s

Using cinpy for fibc (Fibonacci) method was ~65 times faster than fibpy, and and cinpy for forc (loop) was ~93 times faster than forpy, not bad.

3. How does cinpy (compiled with tcc) compare to gcc performance?
Copying the C fib() method and calling it with main program
$ time fibgcc


fib(30) = 1346269

real    0m0.016s
user    0m0.020s
sys     0m0.000s


GCC gives roughly twice as fast code as cinpy/tcc (0.034/0.016). 




4. How long time does it take for tcc to on-the-fly compile cinpy methods?


#!/usr/bin/env python
# cinpy_compile_performance.py
import ctypes
import cinpy
import time
t=time.time
t0 = t()
fibc=cinpy.defc(
     "fib",
     ctypes.CFUNCTYPE(ctypes.c_int,ctypes.c_int),
     """
     int fib(int x) {
       if (x<=1) return 1;
       return fib(x-1)+fib(x-2);
     }
     """)
t1 = t()
print "Calculating fib(30)..."
sc,rv_fibc,ec=t(),fibc(30),t()
print "fibc :",rv_fibc,"time: %6.5f s" % (ec-sc)
print "compilation time = %6.5f s" % (t1-t0)

python cinpy_compile_performance.py

Calculating fib(30)...
fibc: 1346269 time: 0.03346 s
compilation time: 0.00333 s


So compilation (and linking) time is about 3.3ms, which is reasonably good, not a lot of overhead! (note: TCC is benchmarked to compile, assemble and link 67MB of code in 2.27s)


5. "Hot-swap" replacement of cinpy code?
Let us assume you have a system with a "pareto" situation, i.e. 80-99% of the code doesn't need be fast (written in Python), but 1-20% need to be really high performance (and written in C using cinpy), and that you need to frequently change the high performing code, can that be done? Examples of such a system could be mobile (software) agents.


Sure, all you need to do is wrap your cinpy definition as a string and run exec on it, like this:

origmethod="""
fibc=cinpy.defc("fib",ctypes.CFUNCTYPE(ctypes.c_int,ctypes.c_int),
          '''
          int fib(int x) {
              if (x<=1) return 1;
              return fib(x-1)+fib(x-2);
          }
          ''')
"""
# add an offset to the Fibonacci method
alternatemethod = origmethod.replace("+", "+998+")
print alternatemethod
# alternatemethod has replaces origmethod with exec(alternatemethod)
print fibc(2) # = 1000
Conclusion:
cinpy ain't bad.

Remark: depending on the problem you are solving (e.g. if it is primarily network IO bound and not CPU bound) , becoming 1-2 orders of magnitude faster (cinpy vs pure python) is probably fast enough (CPU wise), the doubling from GCC may not matter (since network IO wise Python performs quite alright, e.g. with Twisted ).