I've been making an effort to read more academic papers. This one came up for one of the many reading groups at the Goog, and I picked it out of my stack a few days ago. Here's the abstract.
In this paper we propose a novel method for learning a Mahalanobis distance measure to be used in the KNN classification algorithm. The algorithm directly maximizes a stochastic variant of the leave-one-out KNN score on the training set. It can also learn a low-dimensional linear embedding of labeled data that can be used for data visualization and fast classification. Unlike other methods, our classification model is non-parametric, making no assumptions about the shape of the class distributions or the boundaries between them. The performance of the method is demonstrated on several data sets, both for metric learning and linear dimensionality reduction.

This is a very common, general problem for machine learning -- you want to build a classifier, and you think you might like to use K Nearest Neighbors; it's dead simple, and a lot of the time, it gets the job done. No complicated models to train up -- if you want to classify an instance (ooh, here's an animal -- is it a kitty?), you just find the k most similar instances in your bank of examples and let them vote it out. Four of the five most-similar things in your example set are kitties? Alright, we'll call the new one a kitty. You have to tune k as a parameter to work well with your dataset, of course, and you can get slightly more sophisticated by introducing weighted voting -- things that are less similar to the instance you're trying to classify are considered less important.
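If you've never actually written it down, here's roughly what that boils down to -- a minimal sketch in Python/NumPy with a toy data set and function names of my own choosing, nothing from the paper. (Weighted voting would just scale each neighbor's vote by, say, the inverse of its distance.)

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=5):
    """Plain KNN: let the k nearest training examples vote on the label."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to each example
    nearest = np.argsort(dists)[:k]              # indices of the k closest examples
    votes = Counter(y_train[nearest].tolist())   # tally their labels
    return votes.most_common(1)[0][0]            # majority wins

# Toy example: four of the five nearest points are kitties, so it's a kitty.
X_train = np.array([[0.1, 0.0], [0.2, 0.1], [0.0, 0.2],
                    [0.3, 0.3], [5.0, 5.0], [0.2, 0.2]])
y_train = np.array(["kitty", "kitty", "kitty", "puppy", "puppy", "kitty"])
print(knn_classify(X_train, y_train, np.array([0.15, 0.15]), k=5))  # -> "kitty"
```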
The remaining question, though -- how do you decide what counts as "similar"? Typically, you use some sort of distance metric (Euclidean or Manhattan, for example) -- plot all of your instances in some high-dimensional space and see what's close. What if some of your features are noisy or irrelevant, though? Well, you could do some feature selection and prune those features out. Even worse: what if you have several informative features, but they happen to be on very different scales, such that distances in one overwhelm distances in another? Well, you could manually scale them until you get some good results...
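To see how badly mismatched scales can bite, here's a tiny made-up example (the features and numbers are purely illustrative):

```python
import numpy as np

# Two informative features on wildly different scales:
# feature 0 is in, say, metres (values around 0-1),
# feature 1 is in grams (values around 5000).
a = np.array([0.20, 5000.0])
b = np.array([0.90, 5010.0])   # very different in feature 0, close in feature 1
c = np.array([0.21, 5500.0])   # nearly identical in feature 0, far in feature 1

# Plain Euclidean distance is completely dominated by the gram axis:
print(np.linalg.norm(a - b))   # ~10   -> b looks "close"
print(np.linalg.norm(a - c))   # ~500  -> c looks "far"
```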
This is starting to sound like maybe K-Nearest Neighbors isn't all that easy to use, out of the box.
This paper actually solves both of those cases. Goldberger et al. harness the power of linear algebra and come up with a way to learn a projection -- just a matrix -- from the original feature space, where distance metrics might not be very useful for KNN, into a new feature space, where your distance metric does the right thing. In effect, this is the feature selection and the scaling all in one go, and they do it so as to maximize a smooth, stochastic stand-in for the leave-one-out KNN accuracy (equivalently, to minimize the leave-one-out error), all tuned for your particular data set.
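For the curious: the quantity they maximize is the expected number of points you'd classify correctly under a "soft" neighbor assignment, where each point picks each other point as its neighbor with probability given by a softmax over negative squared distances in the projected space. Here's a rough sketch of that objective in Python/NumPy -- my own variable names, a toy data set, and a generic off-the-shelf optimizer with numerical gradients, whereas the paper derives the analytic gradient and would use that instead:

```python
import numpy as np
from scipy.optimize import minimize

def nca_objective(A_flat, X, y, d_out):
    """Soft leave-one-out KNN score for a projection matrix A (d_out x d_in)."""
    n, d_in = X.shape
    A = A_flat.reshape(d_out, d_in)
    Z = X @ A.T                                  # project every point
    diff = Z[:, None, :] - Z[None, :, :]         # pairwise differences
    sq = np.sum(diff ** 2, axis=-1)              # pairwise squared distances
    np.fill_diagonal(sq, np.inf)                 # a point never picks itself
    logits = -sq
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)            # p[i, j]: prob. i picks j as neighbor
    same = (y[:, None] == y[None, :])
    return np.sum(p * same)                      # expected # of correctly classified points

# Toy usage: learn a 2 x d projection by maximizing the score above.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = np.repeat([0, 1, 2], 20)
d_out = 2
A0 = rng.normal(size=(d_out, X.shape[1])).ravel()
res = minimize(lambda a: -nca_objective(a, X, y, d_out), A0, method="L-BFGS-B")
A_learned = res.x.reshape(d_out, X.shape[1])
```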
That's kind of cool.
As an added bonus, if you restrict the transform to project into three or fewer dimensions (just make sure the matrix is the right shape, and you're good!), then the same algorithm produces what sounds like a very nice visualization of your data. Features that are informative in discriminating your classes will be stretched out, and instances in the same class will tend to clump -- otherwise, KNN wouldn't work very well.
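Concretely, once you have a 2 x d transform, the "visualization" is nothing more than projecting and scatter-plotting. A toy sketch (the random matrix here is just a self-contained stand-in for the matrix you'd actually learn):

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in data and a stand-in 2 x d transform; in practice A would be the
# matrix learned by maximizing the NCA objective, not a random one.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = np.repeat([0, 1, 2], 20)
A = rng.normal(size=(2, X.shape[1]))

Z = X @ A.T                                   # project into two dimensions
plt.scatter(Z[:, 0], Z[:, 1], c=y, cmap="tab10")
plt.title("Data in a 2-D learned projection (toy stand-in)")
plt.show()
```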
Lovely paper, guys! Very well-written, clearly explained, and it addresses a problem a lot of people have probably had. The only issue I can raise so far -- it's not clear how long it takes to learn the transformation. This might be a very slow process, and all we're given in the paper is something to the effect of "oh, just optimize this function and you're good." The machine learning pros among us can probably just look at the formula and knock it out in a few lines of Matlab (or supposedly Lisp, if you're Charles Isbell), but I would've liked a little more discussion on the implementation side... maybe they posted some code somewhere.
Happy hacking :) alexr out!