Tuesday, April 20, 2010

what I'm working on: L3 overview

I haven't been writing much about what I'm working on. So let me tell you! If you prefer code to text about code: here's our project.

I'm working with the HLTDI group at IU, with Prof. Mike Gasser. Our goal is to do machine translation for medium-sized languages (say, those with a few million speakers) that don't have a lot of money behind them, and thus don't have a lot of training data available. There are projects at other universities to do similar things, so we're definitely not alone. A lot of people would like to have good MT for "minority languages".

You might be familiar with statistical machine translation, which is all the rage right now, and with good reason! It works impressively well in cases where you have enough training data to sensibly cover the space of sentences that you'd like to translate. This is what Google is doing.

It's not what we're doing, for three reasons: (1) for the languages that we'd like to handle, Quechua and Amharic, there's some training data available, but not nearly enough to get a good SMT system going; (2) both of these languages happen to have really complex morphology, so the probability of seeing any given word form in the training text is extremely low, since so many different forms are possible; and (3) we're not doing translation at all, yet. (But we will.)
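To make reason (2) concrete, here's a rough sketch of how fast word forms multiply when suffixes stack. Everything in it is made up for illustration: the stems, the suffix slots, and the corpus size are hypothetical, and none of it is real Quechua or Amharic.

```python
from itertools import product

# Hypothetical toy morphology: invented stems and suffixes, not a real language.
STEMS = ["tak", "mel", "sor"]
SUFFIX_SLOTS = [
    ["", "li"],                # slot 1: one optional suffix
    ["", "na", "no", "ne"],    # slot 2: three optional suffixes
    ["", "ku"],                # slot 3: one optional suffix
    ["", "ta", "pi", "mo"],    # slot 4: three optional suffixes
]

def all_forms(stems, slots):
    """Enumerate every stem + suffix combination as a surface string."""
    return ["".join((stem,) + suffixes)
            for stem in stems
            for suffixes in product(*slots)]

forms = all_forms(STEMS, SUFFIX_SLOTS)
forms_per_stem = len(forms) // len(STEMS)

# Made-up corpus size, just to show how thin the evidence per form gets
# if tokens were spread evenly over the possible forms.
corpus_tokens = 1000
average_sightings = corpus_tokens / len(forms)

print(forms_per_stem, len(forms), round(average_sightings, 1))
```

With only four suffix slots, each stem already has 64 surface forms, so three stems give 192 distinct words, and a 1000-token text would show each one about 5 times on average. Real text is far less even than that, so many forms would never show up at all.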

Since we won't have enough training text to infer what we need from examples, we're going to have to make claims about what the grammars of our target languages are, and build parsers based on that. We'll see how much of the task we can off-load to machine learning over time, but we're prepared to do some grammar engineering if we need to. It turns out, thankfully, that when a language has complex morphology, it tends to have simpler syntax! You have to put the information somewhere.

So what we do have is a bunch of morphological analyzers that Mike made, and we're working on the parser. The parser is going to grow into the whole MT system once we work out how to make our grammar formalism generate the output language while it's parsing the input language: what you might call a "synchronous grammar".
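If "generate the output while parsing the input" sounds abstract, here's a minimal sketch of the idea using a toy synchronous context-free grammar. This is not our formalism or our code; the rule table, the lexicon, and the function names are all assumptions for illustration, and the "target language" is just uppercase glosses with the object moved before the verb.

```python
# Each rule pairs a source-side right-hand side with a target-side ordering.
# target_order lists indices into the source children, in output order.
RULES = {
    "S": [(["NP", "VP"], [0, 1])],
    "VP": [(["V", "NP"], [1, 0])],  # source: verb object; target: object verb
    "NP": [(["N"], [0])],
}

# Hypothetical lexicon: source word -> (category, target gloss).
LEXICON = {
    "dogs": ("N", "DOGS"),
    "cats": ("N", "CATS"),
    "chase": ("V", "CHASE"),
}

def parse(symbol, tokens, start):
    """Yield (end, target_words) for each way symbol covers tokens[start:end]."""
    if symbol in RULES:
        for source_rhs, target_order in RULES[symbol]:
            for end, pieces in expand(source_rhs, tokens, start):
                # Build the output side by reordering the children's outputs.
                yield end, [w for i in target_order for w in pieces[i]]
    elif start < len(tokens):
        category, target_word = LEXICON.get(tokens[start], (None, None))
        if category == symbol:
            yield start + 1, [target_word]

def expand(rhs, tokens, start):
    """Yield (end, per-child target words) for each way rhs matches from start."""
    if not rhs:
        yield start, []
        return
    for mid, first in parse(rhs[0], tokens, start):
        for end, rest in expand(rhs[1:], tokens, mid):
            yield end, [first] + rest

def translate(sentence):
    """Return the target string for the first full parse, or None if none exists."""
    tokens = sentence.split()
    for end, target in parse("S", tokens, 0):
        if end == len(tokens):
            return " ".join(target)
    return None

print(translate("dogs chase cats"))
```

This prints `DOGS CATS CHASE`: the parse of the input and the reordered output come out of the same rule applications. It's a naive backtracking parser that would loop forever on left-recursive rules, so take it as a picture of the idea rather than a design.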

More about the parser, and what I've been working on there, in the next post!

1 comment:

Brett W. Thompson said...

Wow, awesome!!!