Friday, July 18, 2008

The Rebirth of Confounding and Theory in Science?



Wired recently published an interesting article titled The End of Theory: The Data Deluge Makes the Scientific Method Obsolete, which roughly concludes that with enough data and processing power, correlation is all you need. While interesting, I think it misses a few important points.

Confounding variables
At the end of the article they claimed: "Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all".

Correlation between two variables can give some information, and with a lot of data and computational horsepower you can find correlations relatively easily. The danger is when correlation is interpreted as causation (and that can happen).

Example: Correlation between GDP and Children's Weight
An example is the positive correlation between a country's Gross Domestic Product (GDP) and the weight of its children. If that correlation were interpreted as causation, the cheapest way for a country to increase its GDP would be to put kids on e.g. a Chankonabe diet (note: I am not saying that is a likely action). The GDP and weight correlation is an example where a third, confounding variable, time, was not accounted for: both GDP and children's weight have tended to increase over the years, without one causing the other.
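To make this concrete, here is a minimal Python sketch with made-up numbers (not real GDP or weight data) in which two series that are causally unrelated, but both grow over time, come out almost perfectly correlated:

```python
# Minimal sketch (hypothetical numbers): two causally unrelated series
# that both trend upward over time end up strongly correlated.
import numpy as np

rng = np.random.default_rng(42)
years = np.arange(1960, 2008, dtype=float)

# Each series is an independent trend over time plus independent noise;
# neither series causes the other.
gdp = 100 + 5.0 * (years - years[0]) + rng.normal(0, 10, len(years))
child_weight = 30 + 0.2 * (years - years[0]) + rng.normal(0, 1, len(years))

r = np.corrcoef(gdp, child_weight)[0, 1]
print(f"correlation(GDP, child weight) = {r:.2f}")  # close to 1.0
```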

Example: Correlation between substance X and your health
In the press you frequently see claims that eating or drinking substance X is good for your health and makes you live longer. In several of those cases I believe it is the correlation that is reported, not the confounding variables: many of these substances are luxury goods that are not affordable to most of the world's population, so the confounding variable might be personal economy, which is probably correlated with access to vaccines and medical care.

The value of theory
My personal hypothesis is that the likelihood of missing confounding variables increases exponentially with the complexity or level of abstraction of the domain investigated for correlation (e.g. in medicine, pharmacology, biology or physics). The useful information is more likely to be found in discovering the confounding variables; the correlation is primarily a signal that something exciting is underneath (though there can be cases where the correlation is sufficiently interesting in itself). In order to find or propose confounding variables, I believe models and theories are still very much needed, probably even more than before (the appropriate simplicity or complexity of the models is another discussion; I tend to lean towards KISS).
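If a theory suggests a candidate confounder, it can be checked against the data. A rough sketch, reusing the made-up series from above: regress both variables on the suspected confounder (time) and correlate the residuals, a simple form of partial correlation. A near-zero result is what you would expect if time is the common cause:

```python
# Sketch (same hypothetical data as above): control for a suspected
# confounder by regressing both variables on it, then correlating
# the residuals (a simple form of partial correlation).
import numpy as np

rng = np.random.default_rng(42)
years = np.arange(1960, 2008, dtype=float)
gdp = 100 + 5.0 * (years - years[0]) + rng.normal(0, 10, len(years))
child_weight = 30 + 0.2 * (years - years[0]) + rng.normal(0, 1, len(years))

def residuals(y, x):
    """Residuals of a least-squares line fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

raw_r = np.corrcoef(gdp, child_weight)[0, 1]
partial_r = np.corrcoef(residuals(gdp, years),
                        residuals(child_weight, years))[0, 1]
print(f"raw correlation:      {raw_r:.2f}")      # close to 1.0
print(f"controlling for time: {partial_r:.2f}")  # close to 0.0
```

The point is not the mechanics but that you need a theory to know which confounder to test in the first place.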

Disclaimer: This posting (and all my others) represent only my personal views.

Monday, July 14, 2008

Annual Echo Chamber

Having blogged for a bit more than a year (since April 2007), here are some rough stats and opinions about the blog entries so far. The most read entries:
  1. Pragmatic Classification with Python
  2. Greenlet Python is concurrently alive and kicking
  3. RPython GCLB Benchmark - Recursive
  4. How to complete your PhD
  5. Future number of programming languages - singularity or infinity?
In terms of self-evaluation (read: navel gazing..) of entries, the list would look like this:

As always, this blog only expresses my personal opinions.