Wednesday, March 28, 2007

brains 'n' balancing training data

- Martin points us to a nice article over on Developing Intelligence: 10 Important Differences Between Brains and Computers. Your computational metaphor just breaks down eventually, y'know? The brain is not very much like a von Neumann computer. It's a lot squishier.

If building classifiers is your thing, you may want to take a look at these articles:

- Gustavo E. A. P. A. Batista, Ana L. C. Bazzan, and Maria Carolina Monard: Balancing Training Data for Automated Annotation of Keywords: a Case Study.
Three researchers, seven middle names, one novel technique for building balanced data sets out of unbalanced ones for training classifiers: generate new instances of your minority class by interpolating between examples actually in your dataset (sketched below). I'm still trying to decide whether this approach should work in the general case -- does it make too many assumptions about the shape of the space? In particular: can you arbitrarily draw line segments (in higher-dimensional space) between positive instances? What if negative instances lie between the two endpoints? Which dimensions do you look at first, and how is this better than just adding some noise or weighting positive examples higher? (Is that last option the same as simply counting them several times?)
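
That interpolation trick is essentially SMOTE (Chawla et al., 2002): synthesize new minority points along the segment between a real point and one of its nearest minority-class neighbors. A minimal sketch of the step in Python -- the function name, parameters, and toy data are mine, purely for illustration, not the paper's:

    import random

    def smote_like(minority, n_new, k=3, rng=random.Random(0)):
        """Make n_new synthetic minority points: pick a real point, pick
        one of its k nearest minority neighbors, and interpolate between
        them -- the core step of SMOTE (Chawla et al., 2002)."""
        def dist2(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        synthetic = []
        for _ in range(n_new):
            a = rng.choice(minority)
            # nearest minority-class neighbors of a (excluding a itself)
            neighbors = sorted((p for p in minority if p is not a),
                               key=lambda p: dist2(a, p))[:k]
            b = rng.choice(neighbors)
            t = rng.random()  # random position along the segment a -> b
            synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
        return synthetic

    # Toy example: three 2-D minority points, four synthetic ones.
    minority = [[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]]
    print(smote_like(minority, 4))

Note that the nearest-neighbor restriction is exactly what's supposed to guard against drawing segments through negative territory: you only interpolate between minority points that are already close together. It's a guard, though, not a guarantee.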

- Foster Provost: Machine Learning from Imbalanced Data Sets 101.
A basic overview of the problem, examining the motivation for building classifiers in the first place and surveying some different approaches to sampling. The award for Best Name Ever goes to Dr. Foster Provost.
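
On the weighting-versus-counting question above: for any learner that fits by minimizing a summed loss, giving an example an integer weight w is exactly equivalent to including it w times, since the loss terms just add up. A toy check of the equivalence -- the data, weights, and stub classifier here are mine, purely for illustration:

    def weighted_loss(examples, weights, predict):
        """Sum of per-example 0/1 losses, each scaled by its integer weight."""
        return sum(w * (predict(x) != y)
                   for (x, y), w in zip(examples, weights))

    def duplicated_loss(examples, weights, predict):
        """The same loss on a dataset where example i appears weights[i] times."""
        dup = [(x, y) for (x, y), w in zip(examples, weights)
                      for _ in range(w)]
        return sum(predict(x) != y for (x, y) in dup)

    examples = [(0.2, 0), (0.7, 1), (0.9, 1)]  # (feature, label) pairs
    weights = [1, 3, 3]                        # up-weight the positives
    predict = lambda x: int(x > 0.8)           # a fixed toy classifier

    assert weighted_loss(examples, weights, predict) == \
           duplicated_loss(examples, weights, predict)  # both come to 3

For fractional weights there's no exact duplication equivalent, which is one reason learners tend to expose weights directly rather than making you replicate rows.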
