Sunday, July 31, 2011

cross-lingual word sense disambiguation

Have I mentioned what I've been working on recently? Maybe I haven't.

In general, I'm working on cross-lingual word- and phrase-sense disambiguation. WSD/PSD is the problem of deciding, for a given word or phrase, which meaning was intended, out of some pre-defined set of meanings. You might get the possible senses out of a dictionary, where they're nicely enumerated, or perhaps from WordNet. The stock example is "bank" -- is it the side of a river, or is it a building where they do financial services? Or is it the abstract financial institution?
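If you want a concrete feel for what an enumerated sense inventory looks like, here's a quick sketch using NLTK's WordNet interface. This isn't part of any particular WSD system, just a way to peek at the senses; it assumes you have NLTK installed and the WordNet data downloaded.

```python
# Peek at WordNet's sense inventory for "bank" via NLTK.
# Assumes the WordNet corpus has been fetched, e.g. with nltk.download('wordnet').
from nltk.corpus import wordnet as wn

for synset in wn.synsets('bank'):
    # Each synset is one candidate sense; the definition is its gloss.
    # (On older NLTK versions, name/definition are attributes rather than methods.)
    print(synset.name(), '--', synset.definition())
```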

There's a brilliant bit from the prescient Warren Weaver, from 1955:

If one examines the words in a book, one at a time as through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of the words . . . . But if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then if N is large enough one can unambiguously decide the meaning of the central word . . . . The practical question is: "What minimum value of N will, at least in a tolerable fraction of cases, lead to the correct choice of meaning for the central word?"

The "cross-lingual" kind of WSD means that we care about exactly the distinctions that cause you to pick a different word in a given target language, typically because the CLWSD system is meant to be integrated into an MT system; that's becoming fashionable (Carpuat and Wu, 2007). So in this setting, say if you're translating "bank" from English into Spanish, your system doesn't have to decide if it's the building or the institution that owns it -- it's still "banco". Now a riverbank is an "orilla".

In the general case, your system might end up learning how to make distinctions that you as a human didn't know you had to make -- for example, I'm given to understand that Japanese doesn't have just one word for "brother", but "older brother" and "younger brother", which are different enough concepts that they get totally separate words.

Making these choices is typically treated as a classification problem: you extract some features from a bunch of instances of the source word in use, and do supervised learning to get a classifier with (hopefully) good accuracy at predicting whether a given instance is a "banco" usage or an "orilla" usage. The features are typically things like "which words are in the surrounding context?", or perhaps something fancier based on a parse of the sentence or knowledge about the document as a whole -- whatever you think will be predictive of the right target-language word. Hopefully your learning algorithm has some good way of filtering out irrelevant features.
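Here's a minimal sketch of that setup, using bag-of-words context features and an off-the-shelf classifier from scikit-learn. The tiny "training set" is invented for illustration -- a real system would train on word-aligned parallel text with far richer features -- but it shows the shape of the pipeline.

```python
# Minimal CLWSD-as-classification sketch: bag-of-words features over the
# surrounding context, plus a plain supervised classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each example: the context around an occurrence of "bank", labeled with the
# Spanish word that was chosen in the translation. (Invented examples.)
contexts = [
    "deposited my paycheck at the bank on main street",
    "the bank raised its interest rates again",
    "we had a picnic on the bank of the river",
    "fish were jumping near the muddy bank of the stream",
]
labels = ["banco", "banco", "orilla", "orilla"]

# Bag-of-words over the context, then logistic regression; the classifier's
# regularization does some of the work of down-weighting irrelevant features.
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(contexts, labels)

print(clf.predict(["the bank raised its rates"]))        # hopefully "banco"
print(clf.predict(["we sat on the bank of the river"]))  # hopefully "orilla"
```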

And then, once that's all put together, hopefully you have some extra signal to feed into your translation system, and it makes better word choices, and everybody's happy.

And that's cross-lingual word/phrase-sense disambiguation!