Friday, February 17, 2012

What's the best machine learning algorithm?

Correct answer: “it depends”.

Next best answer: Random Forests.

A Random Forest is a machine learning procedure that trains and aggregates a large number of individual decision trees. It works for any generic classification or regression problem; is robust to different variable input types, missing data, and outliers; has been shown to perform extremely well across large classes of data; and scales reasonably well computationally (it’s also map-reducible). Perhaps best of all, it requires little tuning to get good results. Robustness and ease-of-use are not appreciated as much as they should be in machine learning (certainly not as much as buzzwordy names are, anyway), and it's hard to beat tree ensembles, and Random Forests in particular, on these dimensions.
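To see the "little tuning" claim in action, here is a sketch using scikit-learn's RandomForestClassifier with its defaults (the dataset and parameter choices are mine for illustration; the post itself only points at the R package):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# A stock classification dataset: no scaling, no imputation, no tuning.
X, y = load_breast_cancer(return_X_y=True)

# Defaults plus a fixed seed -- no hyperparameter search, just a solid baseline.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.3f}")
```

That's the whole pipeline: no feature scaling, no grid search, and you typically land within a few points of the best tuned model on a problem like this.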

Random forests work by generating many (typically hundreds of) decision trees in a specific random way such that each is de-correlated from the others. Since each decision tree is a low-bias, high-variance estimator, and each is relatively uncorrelated with the others, aggregating their predictions gives a final prediction with low bias AND low variance. Magic. The trick is getting trees trained on the same dataset to be uncorrelated. This is accomplished by evaluating only a randomly sampled subset of features at each node in each tree, and by training each tree on a randomly sampled (bootstrap) subset of the data points.
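The two sources of randomness can be sketched by hand in a few lines (a toy illustration built on scikit-learn's DecisionTreeClassifier; names like n_trees are mine, not from the post):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

n_trees = 100
trees = []
for i in range(n_trees):
    # Randomness #1: a bootstrap sample (sampling rows with replacement) per tree.
    idx = rng.integers(0, len(X), size=len(X))
    # Randomness #2: only sqrt(n_features) candidate features at each split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# Aggregate the de-correlated trees by majority vote.
votes = np.stack([t.predict(X) for t in trees])
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", (forest_pred == y).mean())
```

In practice you would just use a library implementation, which adds niceties like out-of-bag error estimates for free, but the whole algorithm really is this small.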

Put simply, if you have a machine learning problem and you don’t know what to use, you should use random forests. Here, in table form (courtesy of Hastie, Tibshirani and Friedman), is why:

Random forests inherit most of the good attributes of "Trees" in the above chart, but in addition have state-of-the-art predictive power. Their main drawbacks are a lack of interpretability (though most other highly predictive algorithms fare even worse here) and prediction cost -- if you need real-time predictions in production, it can be hard to justify evaluating hundreds or thousands of trees per query.
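On interpretability, it's not quite all-or-nothing: scikit-learn's implementation, for instance, exposes impurity-based feature importances averaged over the trees, which give a rough ranking of which inputs matter (a sketch; the dataset and top-5 cutoff are my choices for illustration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(data.data, data.target)

# Mean decrease in impurity, averaged across trees; normalized to sum to 1.
importances = clf.feature_importances_
for i in np.argsort(importances)[::-1][:5]:
    print(f"{data.feature_names[i]}: {importances[i]:.3f}")
```

This tells you which features the forest leans on, though not how -- you still don't get the readable if-then rules a single tree would give you.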

If you are interested in playing around, grab the randomForest R package.

I recently heard the president of Kaggle, Jeremy Howard, mention that Random Forests seem to show up in a disproportionate number of winning entries in their data mining competitions. Cross-validation, I call that.

More reading:
A Comparison of Decision Tree Ensemble Creation Techniques
An Empirical Comparison of Supervised Learning Algorithms
