Monday, April 26, 2010

what I'm working on: XDG and weighted constraint solving

Abstract: It turns out that coordination (the use of conjunctions) is really hard for most grammatical theories. We're working on ways to handle it with graceful failure by selectively relaxing constraints in our constraint-based parser -- ideally we can come up with an approach for handling unfamiliar grammatical structures just by tweaking the weights on our constraints.

Slides from Tuesday's talk:
Mike's slides: background for the project
My slides: recent progress with weighted constraint solving

So in the previous post...

In the previous post, I introduced dependency grammar and XDG and explained why coordination is hard for DG. The plan for our current work is to allow some of the constraints described by the grammar to be broken, so that we can get reasonable parses for input sentences that we know are valid.

This fairly simple case of English coordination is hard to describe with our formalism, and as we expand our system to handle new languages, we're certainly not going to have complete grammars. It would also be nice if we could handle sentences with actual mistakes, but that'll have to be future work. In general, we'd like to be able to tweak our system to (badly) support sentences that our grammar doesn't explicitly handle just by finding out which constraints to relax. There may be other cases where we need to make changes to the grammar, but we'd like to avoid that as much as possible.

Engineering issues and weighted constraint solving

Getting to the point where we can relax constraints and "fail gracefully" took some programming work. So far, XDG has used only hard constraints, which means that if one of the grammar's rules can't be satisfied, the parser just gives up and you don't get a parse. The original XDG implementation is done in Mozart/Oz, which has constraint programming built right into the language; Mike's version uses python-constraint.

So what I did is hook our parser up to a "weighted" constraint solver, toulbar2, which is actively developed by friendly researchers at INRA, in Toulouse and Barcelona.

Weighted constraint solving is pretty simple. We define a problem containing a bunch of constraints, each of which pertains to some of the variables. A solution to the problem is a total assignment to all of the variables, and each constraint has a cost for violating it, so every solution carries a penalty: the total cost of the constraints it violates. To solve a problem, you just say what the variables, constraints, and costs are, and what the maximum total cost you're willing to accept is.
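
Here's a toy sketch of what that looks like (the variables and costs are made up, not our actual parsing problem): a couple of variables, a couple of soft constraints with costs, and a maximum total cost we're willing to tolerate.

```python
from itertools import product

# Toy weighted constraint problem -- made-up variables and costs, just to
# illustrate the setup, not our real parsing problem.
domains = {"x": [0, 1, 2], "y": [0, 1, 2]}

# Each constraint is a predicate over a total assignment, plus the cost
# we pay if the predicate turns out false.
constraints = [
    (lambda a: a["x"] != a["y"], 5),   # we'd really like x and y to differ
    (lambda a: a["x"] < a["y"], 2),    # and we'd mildly like x to be smaller
]

MAX_COST = 6   # any assignment costing more than this is thrown out

def weighted_solutions():
    names = list(domains)
    for values in product(*(domains[n] for n in names)):
        assignment = dict(zip(names, values))
        cost = sum(c for pred, c in constraints if not pred(assignment))
        if cost <= MAX_COST:
            yield assignment, cost

for assignment, cost in sorted(weighted_solutions(), key=lambda s: s[1]):
    print(assignment, "costs", cost)
```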

toulbar2 is really fast. It's written in C++ and has some clever solving algorithms. Once it gets rolling, it can parse a sentence with a bunch of lexical ambiguity and an embedded clause, "the old man that argues eats", in about three seconds. For comparison, python-constraint takes 70 to 90 seconds on the same sentence.

The only problem with this is that to run toulbar2, we first have to translate the problem into the standard WCSP format, which takes several steps:
  • list each constraint and the variables involved (OK, easy)
  • write down the "default cost" for each constraint (OK, easy -- it's 0.)
  • store a mapping for all of our variables and their domains, since toulbar2's format wants everything to be an integer (OK.)
  • write down every single non-default assignment to the relevant variables (OH NOES COMBINATORIAL EXPLOSION.)
This is a problem because (a) we don't know ahead of time which assignments are going to violate the constraint, and (b) there may still be quite a few of them. Going through all possible assignments for each constraint is expensive both in I/O and in dumb combinatorial search. (NB: it's still much smaller than the search space for the problem as a whole, since each constraint only pertains to some of the variables in the problem!)
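
To make that last bullet concrete, here's roughly what the enumeration step looks like -- a sketch with made-up names, not our real WCSP emitter. For each constraint we walk every assignment to its variables and write out the ones that earn a non-default cost.

```python
from itertools import product

def violating_tuples(variables, domains, predicate, cost):
    """List every assignment to this constraint's variables that violates it.
    For k variables with domain size d, that's d**k assignments to check --
    the combinatorial explosion from the last bullet."""
    tuples = []
    for values in product(*(domains[v] for v in variables)):
        if not predicate(dict(zip(variables, values))):
            # One non-default-cost line in the WCSP file per violating tuple.
            tuples.append((values, cost))
    return tuples
```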

To get around the worst part of this problem, I came up with a dumb hack that I rather like: skip constraints that involve more than 15 variables and let toulbar2 over-generate solutions. After the solver returns, just prune away the solutions that went over-cost.
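
In code, the hack amounts to something like this (a sketch with hypothetical names): constraints over more than 15 variables never get encoded for toulbar2, and once the solver hands back its over-generated solutions, we re-check the skipped constraints ourselves and drop whatever goes over the limit.

```python
MAX_ENCODED_VARS = 15   # constraints over more variables than this are skipped

def prune_overgenerated(solutions, skipped_constraints, max_cost):
    """Re-check the constraints we never handed to toulbar2, and drop any
    solution whose true total cost goes over the limit."""
    kept = []
    for assignment, cost in solutions:   # (assignment dict, toulbar2's cost)
        extra = sum(c for pred, c in skipped_constraints if not pred(assignment))
        if cost + extra <= max_cost:
            kept.append((assignment, cost + extra))
    return kept
```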

Parsing a simple sentence with coordination

Getting a parse that I like for "Dolores and Jake eat yogurt" -- one with subject links from "eat" to both "Dolores" and "Jake", and with "and" left unconnected -- only required a few tweaks, done by hand:

  • Allow motherless nodes (code change, not good)
  • Add CONJ as a part of speech (grammar change)
  • Allow breaking these three principles: valency (to allow for more than one subject), tree (every node has exactly one mother), and agree (e.g., agreement between subjects and verbs)
The problem with this is that it's just one parse among the many that get generated, and the one I wanted came out 30th. The approach here, simply allowing a given principle to be broken, is way too coarse. We shouldn't just say "valency constraints can be broken"; we should have more fine-grained features that encode "valency between what"!
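
What I have in mind is something like this (the feature names below are hypothetical, just to show the shape): instead of one cost per principle, key the cost on the principle plus what it's being applied to.

```python
# Too coarse: one violation cost per principle.
coarse_costs = {"valency": 3, "tree": 5, "agree": 4}

# Finer-grained (hypothetical feature names): key the cost on the principle
# *and* what it applies to, so an extra subject on a verb can be cheap
# while, say, an extra determiner on a noun stays expensive.
fine_costs = {
    ("valency", "verb", "subj"): 1,    # a second subject: cheap to allow
    ("valency", "noun", "det"): 10,    # a second determiner: still bad
    ("tree", "extra_mother"): 5,
    ("agree", "subj_verb_number"): 4,
}
```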

Next steps: Optimization problem

Once we write down these features, we can treat tweaking the costs as an optimization problem, where the parameters are the costs for violating each kind of constraint, plus the maximum total cost. We'll consider the grammar and the test sentences as fixed.

We want to know: how can we set the parameters so that we can parse the good sentences, but not the bad ones? And can we get parses that we like? There are many different incomplete parses that we could assign to a sentence -- which one of them should we reward?

To get started on this problem, we just need to know which constraints have to be relaxed so that the desired parse shows up somewhere among the solutions; after that, we can (hopefully) optimize the parameters with something like simulated annealing or genetic algorithms.
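
For a rough sense of the shape of that search, here's a tiny simulated-annealing sketch over the costs. The scoring function is a stand-in we haven't written yet, and the hyperparameters are arbitrary.

```python
import math
import random

def anneal_costs(initial_costs, score, steps=1000, temp=1.0, cooling=0.995):
    """Tiny simulated-annealing sketch over constraint-violation costs.
    `score(costs)` is a stand-in for "do the good sentences get the parses
    we want, and do the bad ones fail?" -- higher is better."""
    costs = dict(initial_costs)
    best, best_score = dict(costs), score(costs)
    current_score = best_score
    for _ in range(steps):
        # Nudge one randomly chosen cost up or down (never below zero).
        candidate = dict(costs)
        feature = random.choice(list(candidate))
        candidate[feature] = max(0, candidate[feature] + random.choice([-1, 1]))
        cand_score = score(candidate)
        # Always take improvements; sometimes take a worse move while "hot".
        if cand_score >= current_score or \
           random.random() < math.exp((cand_score - current_score) / temp):
            costs, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = dict(costs), current_score
        temp *= cooling
    return best, best_score
```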

What I don't have an idea about yet is how to know which kinds of sentences will require tweaks to the grammar, or worse, to the parser code, in order to get sensible parses. For example, to get this coordination example to work, I had to make both kinds of changes, even if they were small: the grammar had no idea about conjunctions, even conjunctions with no arcs attached to them, and the parser threw an exception if a word had no arcs attached.

From an engineering perspective, there's a lot left to do to speed up the parser. This will probably involve finding a tighter way to integrate toulbar2 with our XDG system, and I'm imagining some clever ways to avoid having to list out all the possible assignments after the first time, so the optimization process won't take forever.

Questions and suggestions gratefully accepted :) Thanks for reading!

Sunday, April 25, 2010

what I'm working on: Dependency Grammar and XDG

Abstract: It turns out that coordination (the use of conjunctions) is really hard for most grammatical theories. We're working on ways to handle it with graceful failure by selectively relaxing constraints in our constraint-based parser -- ideally we can come up with an approach for handling unfamiliar grammatical structures just by tweaking the weights on our constraints.

This post and the next one (forthcoming!) are basically a talk that Mike and I gave on Tuesday.

If you like slides, here are the slides:
Mike's slides: background for the project
My slides: recent progress with weighted constraint solving

About Dependency Grammar and XDG

For our translation system, we're using a dependency grammar formalism called XDG, the Extensible Dependency Grammar. In dependency grammar, instead of representing your analysis of a sentence as a tree of nested phrases (as you might with phrase-structure grammars), you're interested in the individual relationships between words (e.g., "this noun is the subject of this verb"), so the analysis of a sentence ends up being a directed graph with labeled arcs. There are nice diagrams of this in Mike's slides. XDG was developed by Ralph Debusmann in his doctoral work; his implementation is done in Mozart/Oz, and we're working on a new one in Python.

There are several competing DGs, of course, but XDG has a few nice properties that set it apart from other approaches, and one of them is that it's constraint-based. The task of producing a parse is expressed as a constrained search: the grammar has rules associated with each word (or class of words) that say things like "I'm a verb, I need a subject, and it must come before me". So when we start to analyze a sentence, we list all of the possible links between words, find all of the constraints that apply, and then set a general-purpose constraint solver loose on the problem.
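
Here's a toy example of that "parsing as constrained search" idea, using python-constraint (the library our Python implementation builds on). The encoding is made up and vastly simpler than real XDG: each word just picks its head.

```python
from constraint import Problem   # pip install python-constraint

# Toy problem for "she jumps": each word chooses its head.
# 0 = "she", 1 = "jumps", -1 = the (invisible) root.
problem = Problem()
problem.addVariable("head_she", [1, -1])
problem.addVariable("head_jumps", [0, -1])

# Word-level rules of the "I'm a verb and I need a subject" flavor:
problem.addConstraint(lambda h: h == -1, ["head_jumps"])  # the verb is the root
problem.addConstraint(lambda h: h == 1, ["head_she"])     # the noun hangs off the verb

print(problem.getSolutions())   # [{'head_she': 1, 'head_jumps': -1}]
```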

Another neat thing about XDG is that it's multi-layered: an analysis is actually a multigraph. Some of the links in the analysis represent syntax (or even different aspects of syntax, as we're doing for English), some of them can represent your semantic model, and there are constraints that describe the relationships between the different layers. In the next few months, we want to replace "semantic model" with "syntax of the target language for the translation"!

So XDG has these functions called "principles", which, when run, add to the list of constraints on the analysis. They include the principle that the analysis should be a tree, that it should be a planar graph (called "projectivity" in Linguistics World), and that we get other desiderata like verbs agreeing with their subjects ("she jumps" is OK, but not "they jumps").
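
The agreement idea, for instance, boils down to constraints of roughly this shape (hypothetical feature dictionaries, not our actual lexicon):

```python
# Hypothetical number features for a few words.
features = {
    "she":   {"number": "sg"},
    "they":  {"number": "pl"},
    "jumps": {"number": "sg"},
    "jump":  {"number": "pl"},
}

def subject_agrees(noun, verb):
    """A subject arc is only allowed if the noun and verb agree in number."""
    return features[noun]["number"] == features[verb]["number"]

assert subject_agrees("she", "jumps")        # "she jumps" is fine
assert not subject_agrees("they", "jumps")   # "they jumps" is out
```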

This works out pretty well, actually. In his dissertation, Debusmann comes up with fairly broad-coverage grammars for English and German, both of which have some involved syntax. What his grammars don't cover, though, is coordination -- he doesn't have support for conjunctions at all. It turns out that conjunctions are really hard for most grammar formalisms.

It's especially bad for dependency grammar because it seems like an inherently phrasal thing. What's the subject of this sentence?
Jake and Dolores eat yogurt.
You really want to say that it's "Jake and Dolores", but DG can't do that! Each link in the parse goes from one word to one other word. And there's a constraint that says that each verb has only one subject. What to do?

Worse is the sort of sentence that syntacticians seem to love: "Dolores makes and Max delivers yogurt" -- the two verbs here seem to share a direct object. But our "tree principle" says that each word in the graph should have only one head.

Well, let's break the constraints!

... that's the next post, in preparation!

Tuesday, April 20, 2010

what I'm working on: L3 overview

I haven't been writing much about what I'm working on. So let me tell you! If you prefer code to text about code: here's our project.

I'm working with the HLTDI group at IU, with Prof. Mike Gasser. Our goal is to do machine translation for medium-sized languages (say, those with a few million speakers) that don't have a lot of money behind them, and thus, not a lot of training data available. There are projects at other universities to do similar things, so we're definitely not alone. A lot of people would like to have good MT for "minority languages".

You might be familiar with statistical machine translation, which is all the rage right now, and with good reason! It works impressively well in cases where you have enough training data to sensibly cover the space of sentences that you'd like to translate. This is what Google is doing.

It's not what we're doing, for three reasons: (1) for the languages that we'd like to handle, Quechua and Amharic, there's some training data available, but not nearly enough to get a good SMT system going; (2) both of these languages happen to have really complex morphology, so the probability of seeing any given word is extremely low, since so many different words are possible; and (3) we're not doing translation at all, yet (but we will).

Since we won't have enough training text to infer what we need from examples, we're going to have to make claims about what the grammars of our target languages are, and build parsers based on that. We'll see how much of the task we can off-load to machine learning over time, but we're prepared to do some grammar engineering if we need to. It turns out, thankfully, that when a language has complex morphology, it tends to have simpler syntax! You have to put the information somewhere.

So what we do have is a bunch of morphological analyzers that Mike made, and we're working on the parser, which is going to grow into the whole MT system once we work out how to make our grammar formalism generate the output language while it's parsing the input language -- what you might call a "synchronous grammar".

More about the parser and what I've been working on there, in the next post!

Tuesday, April 13, 2010

my dialect and subjunctive verbs

I got corrected on subjunctive verb usage today, which I found odd enough to blog about.

My utterance was something like: "If it was..." -- and I got interrupted with "if it were". My immediate response: "My dialect doesn't do that."

Rudeness aside, what would you say, and under what circumstances? What are the odds, do you think, that my interlocutor has the "if I be..." construction?