Tuesday, April 22, 2008

Pragmatic Classification: The very basics

Classification is an everyday task, it is about selecting one out of several outcomes based on their features. An example could be recycling of garbage where you select the bin based on the characteristics of the garbage, e.g. paper, metal, plastic or organic.

Classification with computers
For classification with computers the focus is frequently on the classifier - the function/algorithm that selects the class based on features (note: classifiers usually has to be trained to get fit for fight). Classifiers can be found in many flavors, and quite a few of them have impressive names (phrases with rough, kernel, vector, machine and reasoning aren't uncommon when naming them).

note: Garbage in leads to Garbage out - as (almost always) - same goes for classification.

The numerical baseline
Let us assume you have a data set with 1000 documents that shows to have 4 equally different categories (e.g. math, physics, chemistry and medicine). A simple classifier for a believe-to-be-similar-dataset could be the rule: "the class is math", which is likely to give a classification accuracy of about 25%. (Another classifier could be to pick a random category for every document). This can be used as a numerical baseline for comparison with when bringing in heavier classification machinery, e.g if you get 19% accuracy with the heavier machinery it probably isn't very good (or your feature representation isn't very good) for that particular problem. (Note: heavy classification machinery frequently has plenty of degrees of freedom, so fine tuning them can be a challenge, same goes for feature extraction and representation).

Combining classifiers
On the other hand, if the heavy machinery classifier gave 0% accuracy you could combine it with a random classifier to only randomly select from the 3 classes the heavy machinery classifier didn't suggest.

Question 1: What is the accuracy with these combined classifiers?

Baseline for unbalanced data sets
Quite frequently classification problems have to deal with unbalanced data sets, e.g. let us say you were to classify documents about soccer and casting (fishing), and your training data set contained about 99.99% soccer and 0.01% about casting, a baseline classifier for a similar dataset could be to say - "the article is about soccer". This would most likely be a very strong baseline, and probably hard to beat for most heavy machinery classifiers.

Silver bullet classifier and feature extraction method?

Q: My friend says that classifier algorithm X and feature extraction method Y are the best for all problems, is that the case?
A: No, tell him/her to read about the ugly duckling and no free lunch theorems which clearly says that there is no universally best classifier or feature extraction approach.

note: Just some of the basics this time, something more concrete next time (I think).