
Thursday, May 13, 2010

Overview of my postings on the Atbrox blog (Nov 2009-May 2010)

As mentioned in a previous posting, I mainly write on Atbrox's blog (and not here). In case you haven't seen them, here is an overview of my postings from November 2009 until May 2010:



Search

Hadoop and Mapreduce


(for even earlier postings on Atbrox check out this overview)

Wednesday, June 11, 2008

Pragmatic Classification of Classifiers

Recap: In my previous machine learning-related postings I have written about the basics of classification and given an overview of Python tools for classification (and also a machine learning dream team and how to increase automation of test-driven development).

In this posting I will "go meta" and say something about classes and characteristics of classifiers.



Informative vs Discriminative Classifiers
Informative classifiers model the densities of the classes and select the class that most likely produced the features; in the naive bayes case this modeling involves counting (see here for an example with these data).

Discriminative classifiers take a different approach - they try to model the class boundary and membership directly, e.g. in a simple case with 2 feature dimensions this could mean trying to find the line that best separates the classes (with 3 or more feature dimensions it would be looking for the hyperplane that best separates the classes). Examples of discriminative classifiers are support vector machines (SVM) and ridge regression.
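
To make the informative case concrete, here is a minimal counting sketch in the naive bayes spirit (toy data and add-one smoothing, made up for illustration):

    from collections import defaultdict

    # toy training data (made up for illustration): (features, class)
    training = [(['ball', 'goal'], 'soccer'),
                (['goal', 'win'], 'soccer'),
                (['rod', 'reel'], 'casting')]

    # the "modeling by counting": class counts and feature-given-class counts
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: defaultdict(int))
    for features, klass in training:
        class_counts[klass] += 1
        for f in features:
            feature_counts[klass][f] += 1

    def classify(features):
        # select the class that most likely produced the features:
        # P(class) * product of P(feature|class), with add-one smoothing
        total = sum(class_counts.values())
        best_class, best_score = None, 0.0
        for klass, count in class_counts.items():
            score = float(count) / total
            for f in features:
                score *= (feature_counts[klass][f] + 1.0) / (count + 2.0)
            if score > best_score:
                best_class, best_score = klass, score
        return best_class

    print classify(['goal', 'win'])  # -> 'soccer'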

Classifier training methods
Many classifiers are batch-based, meaning that they need access to all training data at the same time (including historic data in a re-training case). Online classifiers don't need all data for every training round; they support updating the classifier incrementally. A related training method is decremental training, which deals with concept drift by forgetting out-of-date examples. Other training methods include stochastic training, which trains using random samples of the data.
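
As a minimal sketch of the online case (a hand-rolled perceptron-style update, not taken from any particular library): each example updates the model and can then be thrown away - no access to all historic training data is needed.

    # online (incremental) training: fold in one example at a time - no need
    # to keep all historic training data around for every training round
    def online_update(weights, features, label, learning_rate=0.1):
        # perceptron-style update: only adjust weights on a misclassification
        prediction = 1 if sum(w * f for w, f in zip(weights, features)) > 0 else -1
        if prediction != label:
            weights = [w + learning_rate * label * f
                       for w, f in zip(weights, features)]
        return weights

    weights = [0.0, 0.0]
    stream = [([1.0, 2.0], 1), ([2.0, 0.5], -1), ([0.5, 1.5], 1)]
    for features, label in stream:  # examples arrive one by one
        weights = online_update(weights, features, label)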

Linear vs Non-Linear Classifiers
If you have a situation where one class is inside a circle and the other class is outside and surrounding it, it will be impossible to linearly separate the two classes (with a linear discriminative classifier). Fortunately there are non-linear classifiers that can solve this, typically by transforming the problem into a computationally heavier one using the kernel trick - but at least the new problem is possible to solve.
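
To make the circle example concrete, here is a toy sketch (made-up data and a hand-picked transformation, not a full kernel method): adding the feature x^2 + y^2 turns the circular boundary into a simple linear threshold in the new feature space, which is the intuition behind the kernel trick.

    import math, random

    def transform(x, y):
        # add x^2 + y^2 (squared distance from origin) as a third feature;
        # the circular class boundary becomes a threshold on this feature
        return (x, y, x * x + y * y)

    inside = [(random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5))
              for _ in range(5)]                      # class inside the circle
    outside = [(2 * math.cos(t), 2 * math.sin(t))
               for t in [1.3 * i for i in range(5)]]  # class surrounding it

    for x, y in inside + outside:
        z = transform(x, y)[2]
        label = 'inside' if z < 1.0 else 'outside'    # linear decision in 3D
        print (x, y), label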

Sequential vs Parallel Classifiers
Sequential classifier algorithms can typically utilize only one core, CPU or machine, while parallel classifier algorithms are able to utilize several cores, CPUs or machines (e.g. in order to handle more data or get faster results).
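
A minimal sketch of one (embarrassingly parallel) flavor of this, using Python's multiprocessing to classify chunks of data on several cores (classify_chunk is a made-up stand-in for applying a real trained classifier):

    from multiprocessing import Pool

    def classify_chunk(documents):
        # stand-in for applying a real trained classifier to a chunk of data
        return [len(doc) % 4 for doc in documents]

    if __name__ == '__main__':
        documents = ['doc%d' % i for i in range(1000)]
        chunks = [documents[i:i + 250] for i in range(0, 1000, 250)]
        pool = Pool(4)                       # utilize 4 cores instead of 1
        results = pool.map(classify_chunk, chunks)
        print sum(len(r) for r in results), 'documents classified'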

Non-orthogonal Data
Non-orthogonality in the data, which can happen when there are repeated occurrences of training data, is handled by some classifiers but not all.

Dependencies between features
Dependencies between features (e.g. correlations) are handled by some classifiers (such dependencies are sometimes a symptom of potential for improvement in the feature representation).

Sunday, May 25, 2008

Pragmatic Classification with Python

In my previous posting I wrote about classification basics; this posting follows up with Python tools for classification and an example with one of the tools.

Open Source Python Tools for Classification
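  • Orange - comprehensive data mining and machine learning toolkit with a SWIGed C++ core and Python scripting on top, also comes with a visual programming GUI and a broad selection of classifier algorithms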
  • Monte - less comprehensive than Orange, written purely in Python (i.e. no SWIGed C++). Looks interesting (has several classifier algorithms), but the APIs seem to be in an early phase (relatively new tool, at version 0.1.0)
  • libsvm - Python API for the most popular open source implementation of SVM. Note: libsvm is also included with Orange and PyML. (I used this tool during my PhD a few years ago)
  • RPy - not exactly a classification tool, but it is quite useful to have a statistics tool around when you are doing classification (it has nice plotting capabilities, not unlike MATLAB's), check out the demo.
  • PyML - also less comprehensive than Orange (specialized towards classification and regression; it supports SVM/SMO, ANN and Ridge Regression), but it has a nice API. Example of use:

    from PyML import multi, svm, datafunc
    # read training data; the last column holds the class label
    mydataset = datafunc.SparseDataSet('iris.data', labelsColumn = -1)
    # one-against-rest multiclass classification with an SVM as base classifier
    myclassifier = multi.OneAgainstRest(svm.SVM())
    # train and evaluate with cross-validation, then print the results
    print "cross-validation results", myclassifier.cv(mydataset)
My recommendation is to either go with Orange or with PyML.


Tuesday, April 22, 2008

Pragmatic Classification: The very basics

Classification is an everyday task - it is about selecting one out of several outcomes based on features. An example could be recycling of garbage, where you select the bin based on the characteristics of the garbage, e.g. paper, metal, plastic or organic.

Classification with computers
For classification with computers the focus is frequently on the classifier - the function/algorithm that selects the class based on features (note: classifiers usually have to be trained to get fit for the fight). Classifiers come in many flavors, and quite a few of them have impressive names (phrases with rough, kernel, vector, machine and reasoning aren't uncommon when naming them).

note: as (almost) always, garbage in leads to garbage out - the same goes for classification.

The numerical baseline
Let us assume you have a data set with 1000 documents that turn out to fall into 4 equally frequent categories (e.g. math, physics, chemistry and medicine). A simple classifier for a believed-to-be-similar dataset could be the rule "the class is math", which is likely to give a classification accuracy of about 25%. (Another classifier could be to pick a random category for every document.) This can be used as a numerical baseline for comparison when bringing in heavier classification machinery, e.g. if you get 19% accuracy with the heavier machinery, it (or your feature representation) probably isn't very good for that particular problem. (Note: heavy classification machinery frequently has plenty of degrees of freedom, so fine-tuning it can be a challenge; the same goes for feature extraction and representation.)
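
A tiny sketch of computing these two baselines (toy stand-in data matching the numbers above):

    import random

    categories = ['math', 'physics', 'chemistry', 'medicine']
    # toy stand-in for the 1000-document dataset with equally frequent classes
    true_labels = [categories[i % 4] for i in range(1000)]

    # baseline 1: always answer "math"
    always_math = sum(1 for label in true_labels if label == 'math') / 1000.0

    # baseline 2: pick a random category for every document
    pick_random = sum(1 for label in true_labels
                      if random.choice(categories) == label) / 1000.0

    print "always-math baseline:", always_math    # 0.25
    print "random baseline:", pick_random         # ~0.25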

Combining classifiers
On the other hand, if the heavy machinery classifier gave 0% accuracy, you could combine it with a random classifier that randomly selects from only the 3 classes the heavy machinery classifier didn't suggest.

Question 1: What is the accuracy with these combined classifiers?
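
If you want to check your answer empirically, here is a quick toy simulation (assuming the 0%-accuracy classifier always suggests exactly one wrong class):

    import random

    categories = ['math', 'physics', 'chemistry', 'medicine']
    hits, trials = 0, 100000
    for _ in range(trials):
        true_class = random.choice(categories)
        # the 0%-accuracy classifier always suggests some wrong class
        suggested = random.choice([c for c in categories if c != true_class])
        # combined: pick randomly among the 3 classes that weren't suggested
        guess = random.choice([c for c in categories if c != suggested])
        if guess == true_class:
            hits += 1
    print "combined accuracy:", float(hits) / trials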

Baseline for unbalanced data sets
Quite frequently classification problems have to deal with unbalanced data sets. Let us say you were to classify documents about soccer and casting (fishing), and your training data set contained 99.99% documents about soccer and 0.01% about casting; a baseline classifier for a similar dataset could then be to always say "the article is about soccer". This would most likely be a very strong baseline, and probably hard to beat for most heavy machinery classifiers.

Silver bullet classifier and feature extraction method?

Q: My friend says that classifier algorithm X and feature extraction method Y are the best for all problems, is that the case?
A: No, tell him/her to read about the ugly duckling and no free lunch theorems, which clearly say that there is no universally best classifier or feature extraction approach.

note: Just some of the basics this time, something more concrete next time (I think).

Friday, April 4, 2008

A Machine Learning Theory Dream Team



Russia and Germany are the birthplaces of many great mathematicians, and the work of quite a few of them has had a significant impact on state-of-the-art computer science and machine learning theory, e.g.:
  • Ludwig O. Hesse (1811-1874) - Hessian Matrix
    • Used in calculations of Logistic Regression (which can be used for binary classification), and feed-forward Neural Networks
  • David Hilbert (1862-1943) - Hilbert Space
    • E.g. a feature space for Radial Basis Function kernels in Support Vector Machine classifiers can be described with a Hilbert Space
  • Andrey N. Kolmogorov (1903-1987) - Kolmogorov Complexity
    • Used in algorithmic information theory, and also in theory behind evolutionary and genetic programming
  • Andrei A. Markov (1856-1922) - Markov Models and Markov Chains
    • Can be used e.g. for simulation (in games).
    • Noteworthy later "spin-offs": Hidden Markov Models (HMM) and Markov Chain Monte Carlo (MCMC).
  • Andrei N. Tikhonov (1906-1993) - Tikhonov Regularization
    • Tikhonov Regularization is roughly a templating language for classification and regression, where the template variable is a loss function: a square loss function gives Ridge Regression (also known as Regularized Least Squares Regression or Shrinkage Regression), an epsilon-insensitive loss function gives Support Vector Machine Regression, and a hinge loss function gives Support Vector Machine Classification.
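
In equation form the template can be sketched roughly as follows (common textbook notation; V is the loss-function "template variable", H the hypothesis space - e.g. a Hilbert Space as above - and lambda controls the amount of regularization):

    \min_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} V\big(y_i, f(x_i)\big) \; + \; \lambda \, \|f\|_{\mathcal{H}}^{2}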
Kudos.

Sunday, January 6, 2008

Increase Automation of Test-Driven Development?

I believe Test-Driven Development can be improved by increased automation; some of the thoughts suggested below could perhaps also be adapted to Behavior-Driven Development (BDD).

Test-Driven Development (TDD) is a manually repeated cycle of writing and running tests, writing the least amount of code to make the tests pass, and finally refactoring the code (and tests) to make it shine. Writing tests first gives the important "side effect" of simultaneously designing the API.
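
As a tiny made-up illustration of one cycle: first the test (which also designs the API), then the least amount of code to make it pass.

    import unittest

    # step 1: write a failing test first - this also designs the API
    class TestFizz(unittest.TestCase):
        def test_three_gives_fizz(self):
            self.assertEqual(fizz(3), 'fizz')

    # step 2: write the least amount of code to make the test pass
    def fizz(n):
        return 'fizz'

    # step 3: refactor code and tests (nothing to clean up yet here)
    if __name__ == '__main__':
        unittest.main()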

Automation of coding part?
The word automated is sometimes mentioned together with TDD, but that usually refers to automated unit tests (which need to be manually written) or automated refactoring (which needs to be manually initiated by the user, e.g. using a refactoring tool/IDE).

Writing the code is done with a greedy approach, i.e. writing just enough to make the tests pass, and the code added per TDD cycle is usually only one (at most a few) short methods called by the new test, i.e. small increments of code. These small code increments could potentially be induced automatically using machine learning, e.g. inductive logic programming, program synthesis or genetic programming. There are at least 2 problems with this: 1) readability of the induced code, and 2) scalability. The readability part can be handled manually in the refactoring step of the TDD cycle, and with respect to scalability of inducing TDD code increments, I believe one can perhaps do smart (automated) things with mocks to prune the search space for the chosen machine learning algorithm.
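
As a toy sketch of what inducing a code increment could mean (naive enumeration over a hand-written candidate list, nothing like a scalable or real implementation): search a space of candidate code snippets for one that makes the new tests pass.

    # candidate code increments - in a real setting these would be proposed
    # by a machine learning algorithm, not taken from a hand-written list
    candidates = ['n + 1', 'n * 2', 'n * n']

    # the tests from the current TDD cycle: (input, expected output)
    tests = [(2, 4), (3, 6), (5, 10)]

    def induce(candidates, tests):
        for body in candidates:
            func = eval('lambda n: ' + body)
            if all(func(arg) == expected for arg, expected in tests):
                return body  # first candidate that makes all tests pass
        return None

    print "induced code increment:", induce(candidates, tests)  # n * 2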

Towards API Driven Design (ADD?)
One of the hot topics in software testing research is the development of tools that generate unit tests automatically based on existing code, e.g. if you have a large chunk of (legacy) code with zero or low test coverage, you can use such tools to get high (or even full) test coverage. In their simplest form such tools just generate tests calling the methods under test with various permutations of input (e.g. corner/extreme-value inputs); this typically leads to enormous amounts of tests of variable quality, but there are fortunately smarter tools that seem to promise higher quality tests, e.g. DSD-Crasher. I believe such tools can perhaps be used to automatically fill in missing test cases in TDD, e.g. if you add one test with a set of input values, such a tool can generate the corner-variants of the same test. This leads to a more API-driven development, since the developer doesn't have to write calls to the API many times (with various inputs), but gets support from the tool to fill in test cases.
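
A sketch of the simplest form of such a tool (a hypothetical corner_variants helper, not the API of DSD-Crasher or any real tool): take the input value from one manually written test and generate corner-value variants of it.

    # hypothetical generator: given the input from one manually written test,
    # produce corner/extreme-value variants to fill in missing test cases
    def corner_variants(value):
        if isinstance(value, int):
            return [0, 1, -1, -value, value + 1, 2 ** 31 - 1]
        if isinstance(value, str):
            return ['', ' ', value * 100]
        return [None]

    # the developer wrote one test with input 42; the tool fills in the rest
    for variant in corner_variants(42):
        print "generated test input:", variant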

Remarks related to recent blogosphere postings about code size
I agree that size is code's worst enemy, but believe it would be slightly easier to deal with size if much of the coding - and writing tests for it - were offloaded to the computer, so the coder could focus more on API-driven development. I don't believe anybody should be afraid of a little typing if it improves the readability of the code, but they should spare their fingers when the computer is eager to fill in.