Paper of the Day: “A Few Useful Things to Know about Machine Learning”
by Pedro Domingos


The Paper of the Day for today comes to us from the distant past: a time before venture capital firms would seemingly hurl trebuchet-loads of money at anyone with a “.ai” domain name and the ability to spell “deep learning”. (I’ve been working on this long enough that my folder of papers on machine learning and data science is labeled “AI”.)

Published in the October 2012 issue of Communications of the ACM, it’s a great survey of useful “folk knowledge” needed to successfully apply machine learning in research and practice. What I like best about this paper is that it mostly avoids formalism while zeroing in on the critical caveats; this makes it an excellent refresher for experienced machine learning practioners, a good primer for aspiring practitioners, and an solid “buyer’s guide” for anyone trying to assess commercial offerings of tools or services.

The first half or so of the paper discusses the problem of overfitting (or, conversely, how to achieve the goal of generalization). While much of this should be familiar to practitioners, I still see an alarming amount of overfitting in the E&P data science world—although it is becoming more subtle than in the bad old days where few even bothered with cross-validation or holdout.

The ninth point, “more data beats a cleverer algorithm”, perhaps presages the eventual takeoff of deep learning (which, while arguably a “clever algorithm”, mostly seems enabled by having very, very large data sets from which a layered architecture can “do its own feature extraction”) for tasks like image recognition:

Suppose … the classifiers you’re getting are still not accurate enough. What can you do now? There are two main choices: design a better learning algorithm, or gather more data (more examples, and possibly more raw features, subject to the curse of dimensionality). Machine learning researchers are mainly concerned with the former, but pragmatically the quickest path to success is often to just get more data.

What’s next on Paper of the Day? Come back tomorrow and find out!

This blog will never track you, collect your data, or serve ads. If you'd like to support the blog, consider our tip jars.


BTC: bc1qjtdl3zdjfelgpru2y3epa56v5l42c6g9qylmjx

ETH: 0xAf93BaFF53b58D2CA4D2b0F3F5ec79339f6b59F7