Tuesday, January 29, 2008

Stackless RPython - Recursive

Got input from pypy/rpython developers, and in order to get stackless rpython it was just to add parameter stackless=True and replace parameter gc='ref' with gc='generation' to Translation() method.
# The Computer Language Shootout
# http://shootout.alioth.debian.org/
# based on bearophile's psyco program
# slightly modified by Isaac Gouy
# And adapted to RPython by Amund

def Ack(x, y):
if x == 0: return y+1
if y == 0: return Ack(x-1, 1)
return Ack(x-1, Ack(x, y-1))

def Fib(n):
if n < 2: return 1
return Fib(n-2) + Fib(n-1)

def FibFP(n):
if n < 2.0: return 1.0
return FibFP(n-2.0) + FibFP(n-1.0)

def Tak(x, y, z):
if y < x:
return Tak(
Tak(z-1,x,y) )
return z

def TakFP(x, y, z):
if y < x:
return TakFP(
TakFP(z-1.0,x,y) )
return z

# RPython stuff starts here

from sys import argv #, setrecursionlimit

def main(argv):
n = int(argv[1]) - 1
print "Ack(3,%d):" % (n+1), Ack(3, n+1)
print "Fib(" + str(28.0+n) + "," + str(FibFP(28.0+n))
print "Tak(%d,%d,%d): %d" % (3*n, 2*n, n, Tak(3*n, 2*n, n))
print "Fib(3):", Fib(3)
print "Tak(3.0,2.0,1.0):", TakFP(3.0, 2.0, 1.0)
return 0

from pypy.translator.interactive import Translation
#t = Translation(main, standalone=True, gc='ref')
t = Translation(main, standalone=True,
stackless=True, gc='generation')
path = t.compile()
print path

Monday, January 28, 2008

RPython GCLB Benchmark - Recursive

Must admit that I am a big fan of python (the programming language), and when I saw the benchmark that RPython can be faster than C (on the binary tree benchmark from the Great Computer Language Shootout - GCLS), I just had to try RPython on another problem from GCLS, so I chose the one with worst performance compared to C/gcc (266 times slower) - recursive (with various recursive methods, e.g. Ackerman, Fibonacci and Tak). Results were roughly that rpython was 50-100 and gcc/c was 100-300 times faster than python (note: I only did one run, so numbers can be somewhat bogus, but not too bad I think).

Unfortunately for the run with n=11 Ackerman had consumed all the stack, but that can be solved with Stackless RPython.

Saturday, January 26, 2008

PhD Subject in Computer Science?

Having written about how to complete your PhD (and for the balance - how to not) I still haven't touched the core which is the research subject, but here is a try.

Computer Science Research
Computer Science has a large menu with exotic sub-disciplines where many are indistinguishable from magic for most people. Several of these disciplines are hybrid, i.e. crossover with other fields (e.g. bioinformatics and environmental informatics). It also has crossover with itself, e.g. the diff between Computer Engineering and Computer Science is rather blurry (and maybe culturally dependent?), so for the sake of simplicity I will treat them as synonyms.
Research is described as "a human activity based on intellectual investigation and aimed at discovering, interpreting, and revising human knowledge on different aspects of the world. Research can use the scientific method" (Source: Wikipedia). Should you use the scientific method when doing computer science research? That is left as deductive exercise for the reader, i.e. deduction of combining "Computer Science" and "Research" in the same phrase. (hint: the answer is yes).

Purpose(Computer Science) == Automation
In my opinion the sole purpose of Computer Science is efficient automation, please don't forget that (many tend to do, even many experienced people in the field).

Why you should go back to the basics
Even though there are plenty of exotic CS sub-disciplines, one stands out as the clearly most important one, and that is software engineering (SE), which is roughly about how to create software and tools/languages supporting it. And software can make users more productive, i.e. have a multiplier effect. My guess is that few world-scale problems (e.g. related to climate) can't benefit from software, and the same goes for the long tail of problems to person-level scale.

Multiplying the multipliers
.. Now you're getting overly vague.

Thanks for reminding me, the only point I want to make is that software is extremely important :). And since software has a multiplicative effect that few other technologies can beat (e.g. 1 persons code can effect a large amount of users in a positive way), making software engineers more productive can have a massive impact on society

What to do research on in Software Engineering?
I've added some personal hypothesis to get you started.
  • Hypothesis 2: Test-Driven Development (TDD), and its siblings {Behavior, Domain, API} Driven Development is a step in the right direction, but needs more automation.
  • Hypothesis 3: Software testing today is to a large degree a manual process (e.g. writing unit tests), that probably won't be the case in the future. Machine learning and statistics will be used ensure that.
  • Hypothesis 4: Refactoring of code is today a manual process, that probably won't be the case in the future. (e.g. extract method is basically about finding the start line and end line, extract method between those lines and rename it. Automatic method naming is probably the hardest problem to solve). Machine learning, heuristics and statistics will be used to ensure that.
  • Hypothesis 5: How to deal with large amounts of code is still unsolved.
  • Hypothesis 6: Code metrics has a future, but which one is yet to be discovered.
  • Hypothesis 7: Duplicate detection in code can be significantly improved. Current tools typically use simple syntactic approaches, they don't even do unification of names (variables and methods) or trying to rewrite the code. If you write a short-to-medium length method, chances are high that it has a clone somewhere else at least if you unify methods, and that clone might even have unit tests => productivity increase.
  • Hypothesis 8: Given a sufficiently good compiler or runtime environment technology any programming language can perform well. (If it doesn't perform well then the compiler or runtime environment isn't sufficiently good enough :)
  • Hypothesis 10: Current programming languages doesn't sufficiently deal with concurrent programming (e.g. to utilize multicore and grid/cloud/web service systems) from a developer point of view. Some claim that threads are a bad idea (for most purposes). Concurrency is crucial and beneficiary in general, but the annual cost of errors in concurrent software systems in the world is not a small amount.
  • Hypothesis 11: Abstractions currently used to represent entities in distributed programming can be improved, e.g. the master-slave/client abstraction implies that the master controls the slave, but in a distributed environment that isn't really the case. Alternatives are needed, and Agent-Oriented Software Engineering (AOSE) can be one of them, but it still needs more industry influence. A related abstraction - Mobile Agents - deals with the issue that code is less that the data, so that code should move to the data, not vice versa. The choice of abstraction also affects the complexity of reliability model.
  • Hypothesis 12: Legacy code is still written every day (and will be for a while), and dealing with it is still largely unsolved.

Monday, January 21, 2008

How to not (or barely) complete your PhD

For the few of you who would like to decrease your chances of completing your PhD in Computer Science, here are some non-recommended suggestions:

  1. Code too much
    • in particular extremely polished and nice-to-have-features, can be fun, but time (and funding) flies.
  2. Write too much
    • writing is in general good, but there got to be some substance behind it and in my opinion either empirical support or proofs.
  3. Teach and tutor too much
    • Pros: Your students will love you and you will learn a lot about teaching. Cons: Your advisor or PhD committee might not. Thought about doing teaching instead of research?
  4. Not write papers
    • Paper collections has become alternative to monograph thesis in several countries, but if you are defending not writing papers because you are aiming for a monograph chances are quite good you won't make it.
  5. Being too focused
    • Being too focused prunes away a lot (which can be good), but sometimes you get ideas of fields that are slightly different from what you are researching, and if you aren't exposed to such fields you may be loosing important input.
  6. Being too unfocused
    • Is never a great idea, but you might learn a lot of other stuff (e.g. getting involved in system administration at university level can be useful).
  7. Have an obsolete research subject
    • Several research fields in CS can be considered retro, this can cause problems when trying to publish papers, what are the primary events and research communities?
    • The field might be exhaustively studied, and can be extremely hard to contribute to.
  8. Have a boring research subject
    • If you think it is boring, chances are that most reviewers will think it at least as boring as you do, and reject all of your papers.
  9. Do a startup
    • I don't necessarily think that doing a startup is a bad idea (why? left as an exercise for the reader, you might actually end up with honorable doctorates instead, and that is probably at least as good, but different and less probable), but for PhD productivity it might not be the greatest idea.
  10. Never finish anything
    • If you mainly have started-on-but-never-at-submittable-quality papers chances are your thesis will be of similar quality.
  11. Have an advisor that doesn't publish actively or interact with research community
    • I personally believe if your advisor doesn't publish actively, chances are higher that you won't either. Choose your advisor with care.
  12. Read too much
    • I have seen people who have an exhaustive-read-approach to PhD, e.g. attitudes like "I have to read all the references of this seminal survey paper". It usually didn't serve them well (in particular if they use this rule recursively..). Reading is good, but it has to be targeted, and doing something is better than only reading.

Disclaimer: This approach isn't recommended, but if recognize some of the above approaches as yours, then you should perhaps consider reading this.

Saturday, January 19, 2008

Schwagile Methods

Agile methods are increasingly being used as the preferred approach for software development.
Scrum is probably the most commonly used agile method, it is briefly about having short timespan (few weeks) and atomic development sprints with tasks that supports a user story (user story is typically written by a product owner, e.g. a customer or a customer proxy). Each day of a sprint there is a daily scrum - short stand up meeting (not standup..) where each team member reports progress since last daily scrum, plan for the day (e.g. pick a new task) and if there are any impediments. The Scrummaster deals with the (relatively lightweight) reporting (e.g. burndown charts) and tries to fight impediments to make the team more productive.

Towards Schwagile Methods
Even though Agile methods are gaining ground in the software industry (and even in research), it still trails Schwag. Schwag was born about 90 years before the Waterfall model, but is fortunately still in very good shape.

Systematically combining Schwag and Agile methods is an obvious next step, e.g. by introducing a Schwagmaster to the daily Scrums. The Schwagmaster's role - in addition to maintain the Schwag budget and inventory - is to monitor and detect Schwagable happenings, e.g. provide:
- Comforting Schwag for overly sad impediments (e.g. development box with uptime X months just died),
- Celebrating Schwag for team members with stellar performance,
- Team-building Schwag for entire team since it was so long ago since last Schwag-round.

Since over time the amount of physical Schwag is piling up (with possible exception of cool t-shirts), the Schwagmaster is responsible for recycling it (i.e. collect used Schwag and provide it for another Scrum team), and gradually introduce (carbon neutral) Virtual Schwag. One issue is where Virtual Schwag should be placed, the Scrum Board is an obvious place, so it gets cleaned after each sprint.

Thursday, January 17, 2008

How to complete your PhD

Shane Lindsay writes about how to complete your PhD (or any large project) which is to use hard deadlines (externally determined), soft deadlines (your own) and the Martini method (write at least X words per day, e.g. between 500 and 1000)

My take on how to complete a PhD (in Computer Science):
  1. Learn and use a version control system for everything you do (code, datasets, text etc.)
    • Useful for backup
    • Useful for versioning
  2. Learn statistics and experimental method (e.g. factorial design)
  3. Learn scientific writing, take a course!
    • Always, and I do mean always support your claims when writing (e.g. with references proofs, empirical support or good rhetoric writing). Consider dangling claims in papers as paper-eating bugs, if you don't feed them with support they will eat your paper. And argumenting why the research you are presenting is important doesn't hurt.
  4. Submit papers early and often (to conferences, workshops and journals)
    • The likelihood that external reviewers provide complementary input on your work compared to your advisor and grad student colleagues is probably significantly close to 100%
    • Demand to be a "slave" co-author for your advisor on the first paper, e.g. do all ground work (experiments etc.), but you learn the skill of writing and review process.
    • As my father advised me half-seriously: "even write on the toilet"
    • Figure out the most important research events in your field
    • Do at least 1 unlikely-to-get-accepted submission in order to get a reject early just to get heat. It is way better to get a reject (with explanation why it was rejected) from a great conference or journal the first year, than getting it for the first time from a medium quality conference a few months before you are going to defend your thesis.
  5. Submit code early and often.
    • Writing code keeps you sharpened. Spending several years developing some kind of framework, model or algorithm sketch (without any implementation) and then try to implement it in order to evaluate is likely to cause trouble. By then your coding skills are about as sharp as a spoon, and the model or whatever you are trying to evaluate with implementation is probably way too abstract and requires a lot of massage in order to be implementable. And if that isn't enough, you probably are about to run out of funding.
    • You are way more marketable when you are finished than if you don't write code
    • Learn at least 1 new language that is not too close to those you know during your PhD
    • The fundamentals of Computer Science is still software (and hardware).
  6. Least Publishable Unit papers are not bad because:
    • It is like code, concise and and shorter methods that do one thing well are preferred over longer methods that try to a lot of stuff, don't get me started on testing.
  7. Try to get involved in the research community as a reviewer (probably not the first year)
    • You'll be surprised how unpolished submitted papers actually are, and how different the first submit and the final polished,.. eh published paper actually is.
  8. Learn at least 1 drawing, 1 presentation, 1 word processor and 1 statistics tool
    • Leslie Lamport knew what he was doing..
    • Google Docs&spreadsheets for notes and calculations
  9. Teach!
    • Being thrown in front of hundreds of students with the expectancy that you are going to teach them something useful and interesting is an extremely valuable lesson (and quite scary too I must admit). And this can save your career if you forgot to code (read 5.) ;-)
  10. Oh wait, create interesting MSc topics and get MSc students
    • The payoff can be great (e.g. get you more productive by helping you concretize your ideas into code and get them to do experiements).
    • There is usual an at least linear (probably exponential) relationship on the PhD relevant output you get and the input/support you provide the MSc students.
  11. My experience is that "doers" are more likely to finish their PhD than "smarties".
    • Thinking doesn't create your thesis, but writing might!
    • Hopefully you are both a "doer" and a "smartie" :)
Hm, can't think of anything else, and no guarantees about that you will manage to complete your PhD if you follow these advice :)

Note: I don't think these approaches will help you complete any large project (as Shane Lindsay's recipe claims to), but it probably won't hurt either.

Wednesday, January 16, 2008

Lessig's book "Future of Ideas" downloadable under Creative Commons licence

Increasing Transparency and Competition

Interesting phenomena that the Norwegian government is encouraging competition, customer choice and transparency by creating portals that provides pricing information. The first portal was telepriser.no that provides info about broadband, mobile and telephony services, and the most recent announcement was finansportalen.no that provides pricing info about various financial services and products, e.g. insurance, banking and mutual funds.

Monday, January 14, 2008

Stackless Python

Erlang is currently getting a lot of attention as a language that supports concurrent programming with a large number of simultaneous lightweight processes (or threads if you prefer). But there are few reasons its ~sibling - Stackless Python - should get less attention, some preliminary benchmarks support that.

During grad school some students and I used Stackless Python in combination with MPI to create a simple simulation of a large number of players and NPCs in multiplayer games on a cluster. The game world was divided onto machines in the cluster and on each machine a few thousand tasklets (also called microthreads) each representing individual players and NPCs ran. When they ran into each other they communicated using channels. When players or NPCs came to the boundary of the game world we sent the serialized state of the tasklet representing the player or NPC using MPI messages to the neighbor CPU world.

Stackless Python turned out to be a very nice language to code in, and in fact the students and I tried to start a company providing microsimulation consulting services using it. Unfortunately such services turned out to be of much less demand than anticipated so we had to stop our efforts in that direction. But that wasn't Stackless' fault, in fact I would recommend having a look at Stackless Python if you consider doing microsimulation.

Search Engine Industry History in Trondheim

A friend of mine wrote a nice entry on what turned out to be part of the search engine industry history in Trondheim, please have a look.

Sunday, January 13, 2008

Future number of programming languages - singularity or infinity?

The development of programming language technology is going at a relatively moderate pace. And it seems like every new programming language tries to become the (singular) language that can be used to solve all problems, I believe that is the wrong approach since it tends to make the languages big, complex and hard to learn. This complexity is reflected in the readability and maintainability of code in such languages. According to Bruce Eckel and Neal Gafter it seems like Java is close to reaching some of those pain points.

Domain-specific languages (DSLs)
I believe a better approach would be to create small and highly domain-specific programming languages that can be mastered in a few minutes and function as specialized tools.

If you've ever done house refurnishing you know the thrill of finding the exactly right and highly specialized (frequently electric) tool for the job, it may not always be more efficient to use than a more general tool, but surprisingly often it is, and often by orders of magnitude (in user time).

I believe there is potential for similar efficiency gains by increasingly using domain specific programming languages. The alternatives to create domain-specific languages are usually to add features to already large and complex languages (not an option) or add more libraries to the large language, the latter is ok, but it can be hard to make these library APIs as syntactically efficient, expressive and pleasant to use as a domain-specific language.

Domain-specific languages - where and how?
Functional programming has in the last few years had a revival, e.g. Google's mapreduce for large-scale parallel computing, Python and Ruby adding Lisp-like features, and in upcoming languages such as Haskell and Erlang. I believe future domain-specific languages will borrow some from functional programming, but probably even more from declarative programming (e.g. Goedel or languages with inductive logic programming support, e.g. Aleph).

A typical sign of where a domain specific language could fit is when you need to write hundreds of lines of boilerplate code to get something done, and that something can be described in one sentence. Boilerplate code is in my opinion about as useful as heat from a light bulb, or noise produced by a computer. And since syntax matters I believe the syntax of most domain specific languages should resemble Python since Python leads the expressivity axis (i.e. gzip of source) of the great computer language shootout (but I am open to even more pleasant alternatives, please enlighten me). One issue with using a Python-like syntax is the representation of types, but I don't think that will be big issue for tight and highly declarative domain-specific languages.

But how should all the domain specific and other languages communicate?
Great question. Not sure, but I believe mainly through some kind of network/RPC interface. It is of course tempting to make the domain specific languages borrow from their "carrier" (e.g. jython being python + relatively full java access), but I am not sure if that is the right direction. But for runtime environments they should be both interpretable and being able to run on common virtual machines with just in time compilers (e.g. jvm). Even though there are some nice parser and compiler tools, e.g. antlr out there, I still think there is plenty of opportunity to create tools to support (rapid) development of domain-specific languages.

So I believe the number of programming languages is likely to diverge from singularity.

Who is the next?

The last 10 or so years there has been several internationally well-known IT companies establishing themselves with engineering presence in Trondheim, either through acquisitions or starting a brand-new offices. This count (most likely) increased by one last week when Microsoft offered to acquire (enterprise search software company) Fast Search and Transfer.

The current list of well-known IT companies with engineering presence in Trondheim to my knowledge are:
  • Google
  • Yahoo
  • Microsoft
  • SAP
  • Sun
  • Atmel
  • ARM
The natural follow up question: who is the next well-known IT company that establish themselves in Trondheim?

Saturday, January 12, 2008

The delicious absence of computer noise

I am not a big fan of noise from computers (which is typically caused by other fans..). Even if the noise can be close to avoided with with certain head phones or drowned out by other sources of sound I still hope computer noise is something that quickly becomes an artifact of the past.

The noise is a symptom
Computer noise caused by fans is a symptom of power inefficiency by spending a lot of energy producing heat rather than doing computations (of course some physical reasons for this). This is as I see it analog to old-fashioned light bulbs that used more energy to produce heat than light, fortunately light bulbs can be replaced with LED lamps.

Fixing the symptom (sort of)
In order to get a less noisy home office I drilled a hole in the floor between the office and my (cold) basement, put up a small shelf near the basement roof and moved my ubuntu box to the shelf with usb and monitor cables through the hole, and it certainly made a difference sound-wise.

Fixing the problem?
But energy-wise it has gotten worse since I am less likely to (manually) turn the box off and don't have any fancy time or usage-based power adjustments for it. So I believe my current home office solution is a only temporary patch, but uncertain what the real solution should be. Suggestions?

Sunday, January 6, 2008

Increase Automation of Test-Driven Development?

I believe Test-Driven Development can be improved by increased automation, some of the suggested thoughts could perhaps also be adapted to Behavioral-Driven Development (BDD).

Test-Driven Development (TDD) is a manual repeated cycle of writing and running tests, writing the least amount of code to make tests pass, and finally refactoring code (and tests) to make it shine. Writing tests gives the important "side effect" of simultaneously designing the API.

Automation of coding part?
The word automated is sometimes mentioned together with TDD, but that usually refers to automated unit tests (that needs to be manually written) or automated refactoring (that needs to be manually performed by the user e.g. using a refactoring tool/IDE).

Writing the code is done with a greedy approach, i.e. writing just enough to make tests pass, and the coding per TDD cycle is usually only a one-to-at-most-few short methods that is called by the new test, i.e. small increments of code. These small code increments could potentially be automatically induced using machine learning, e.g. using inductive logic programming, program synthesis or genetic programming. There are at least 2 problems with this 1) readability of induced code, 2) scalability. The readability part can be manually handled in the refactoring step of the TDD cycle, and wrt scalability of induction of TDD code increments I believe one can perhaps do smart (automated) things with mocks to prune the search space for the chosen machine learning algorithm.

Towards API Driven Design (ADD?)
One of the things that is hot in software testing research is development of tools that generates unit tests automatically based on code input, e.g. if you have a large chunk of (legacy) code with zero or low test coverage you can use such tools to get high (or even full) test coverage. In their simplest forms such tools just generate tests calling methods to be tested with various permutations of input (e.g. corner/extreme value inputs), this typically leads to enormous amounts of variable quality tests, and there are fortunately tools that are smarter and seems to promise higher quality tests, e.g. DSD-Crasher. I believe such tools can perhaps be used to automatically fill in missing test-cases in TDD, e.g. if you add one test with a set of input values, such tools can generate the corner-variants of the same test. This leads to a more API driven development, since the developer doesn't have to write the API many times (with various inputs), but gets support by the tool to fill in test cases.

Remarks related to recent blogosphere postings about code size
I agree that size is code's worst enemy, but believe it would be slightly better to deal with size if much of the coding and writing tests for it is offloaded to the computer so the coder can focus more on API driven development. I don't believe everybody should be afraid of a little typing if it is to improve readability of code, but they should spare their fingers when the computer is eager to fill in.

Friday, January 4, 2008

Dining Philosophers and Quarreling Kids

Concurrency in general and in particular making concurrent processes smoothly share resources are hard programming problems. The classic problem illustrating concurrency is the dining philosophers problem where the shared resources are forks. The problems the philosophers can get include resource starvation, deadlock and livelock. (Note: dining philosophers has a slight resemblence of the game of musical chairs)

The quarreling kids problem
A much less classic but way more realistic concurrency problem is the quarreling kids problem, i.e. when providing N resources (e.g. food or toys) to M kids (who might be either spoiled, tired or hungry) several situations can occur:
  1. N < M
    • Typical result: Quarreling kids since some kids doesn't get any resources
  2. N ≥ M and N mod M != 0
    • Typical result: Quarreling kids since resources are unevenly distributed
  3. N &ge M and N mod M == 0
    • Can go smoothly, but can lead to quarreling kids if:
      • the resources differ, e.g. in type, size, shape or color
      • the order/speed resources are distributed in
      • the resources doesn't match the recipient's expectations (stereotypical: the girl doesn't get the pink colored and the boy doesn't get the blue colored)
Solutions to the quarreling kids problem?
An obvious solution approach could be to let the kids select resources themselves, but that will typically make the kids that selects first happy and the rest less happy and start quarreling, and if the kids select resources at the same time that also typically leads to quarreling during selection process.
A second approach is to provide non-discrete resources that are of infinite divisible nature, but that has unfortunately practical lower boundaries.
A third approach is to provide personalized (heterogeneous) resources, so every kids gets what they anticipate, but this requires both detailed knowledge of each kids preference and is impractical due to either limited resource budget, limited resource availability or possibly other reasons (e.g. nutrition?).

So what might work?

A) Provide homogeneous resources (where N ≥ M and N mod M == 0) could work, this way the kids are likely to feel that they are fairly treated.

But what if there are fewer resources than kids (N < M)

B) Provide resources that supports and encourages communication and interaction, i.e. where sharing pays off (e.g. in a more or less continuous sport like soccer or a turn-based board game)

This sounds awfully politically correct, what is the point?

The point is that solving the quarreling kids problem most likely requires solutions that:
  1. Avoids resource sharing by having dedicated resources to each kid (e.g. a pair of skates each), like in A above, or
  2. Provides resources that are being competed for under a set of communicated and agreed-upon cooperative rules as in soccer, or waited for under as set of cooperative rules in a turn-based board game, like in B above. (Without cooperation quarreling about resources will occur)
From a programming perspective, the quarreling kids (processes) problem with dedicated resources don't have any sharing problems, and can entirely avoid complex synchronization mechanisms (e.g. locks and mutexes), communication between the kids (processes) are done using speech (messages). This approach is similar to message-based parallelism such as using (asyncronous) messages in MPI and Erlang or channels in Stackless Python.

But what is the computational analogy for solution 2?