Wednesday, June 11, 2008

Pragmatic Classification of Classifiers

recap: In my previous machine learning related postings I have written about the basics of classification and given an overview of Python tools for classification (and also a machine learning dream team and how to increase automation of test-driven development).

In this posting I will "go meta" and say something about classes and characteristics of classifiers.

Informative vs Discriminative Classifiers
Informative classifiers model the densities of classes and select the class that most likely produce the features, in the naive bayes case this modeling involves counting (see here for an example with these data).

Discriminative classifiers have a different approach - they try to model class boundary and membership directly, e.g. in a simple 2-feature dimension case this could mean trying to finding the line that best separates the classes (in >3 feature dimensions case it would be looking for the hyperplane that best separate classes). Examples of discriminative classifiers are support vector machines (SVM) and ridge regression.

Classifier training methods
Many classifiers are batch-based, that means that they need to have access to all training data at the same time (including historic data in a re-training case). Online classifiers don't need all data for every training round, they supporting updating the classifier data incrementally. A related training method is decremental training, which is about dealing with classifier problems where there is concept drift (i.e. forgetting out-of-date examples). Other training methods include stochastic training which is about training using random samples of data.

Linear vs Non-Linear Classifiers
If you have a situation where one class is inside a circle and the other class is outside the circle (and surrounding the circle), it will be impossible to linearly separate the two classes (with a discriminative classifier), fortunately there are non-linear classifiers that can solve this (typically by transforming the problem into a more computationally heavier problem using a kernel trick, but at least the new problem is possible to solve).

Sequential vs Parallel Classifiers
Sequential classifier algorithms can typically utilize one core, cpu or machine, and parallel classifier algorithms are able to utilize more cores, cpus or machines (e.g. in order to handle more data or get faster results).

Non-orthogonal Data
Non-orthogonality is handled by some classifiers, this can happen when there are repeated occurrences of training data.

Dependencies between features
Dealing with dependencies between features (e.g. correlations) is handled by some classifiers (this is sometimes a symptom of potential for improvement in feature representation).