Monday, April 26, 2010

what I'm working on: XDG and weighted constraint solving

Abstract: It turns out that coordination (the use of conjunctions) is really hard for most grammatical theories. We're working on ways to handle it with graceful failure by selectively relaxing constraints in our constraint-based parser -- ideally we can come up with an approach for handling unfamiliar grammatical structures just by tweaking the weights on our constraints.

Slides from Tuesday's talk:
Mike's slides: background for the project
My slides: recent progress with weighted constraint solving

So in the previous post...

In the previous post, I discussed dependency grammar, XDG, and introduced why coordination is hard for DG. The plan for our current work is to allow some of the constraints described by the grammar to be broken, such that we can get reasonable parses from input sentences that we know are valid.

This fairly simple case of English coordination is hard to describe with our formalism, and as we expand our system to handle new languages, we're certainly not going to have complete grammars. It would also be nice if we could handle sentences with actual mistakes, but that'll have to be future work. In general, we'd like to be able to tweak our system to (badly) support sentences that our grammar doesn't explicitly handle just by finding out which constraints to relax. There may be other cases where we need to make changes to the grammar, but we'd like to avoid that as much as possible.

Engineering issues and weighted constraint solving

Getting to the point where we can relax constraints and "fail gracefully" took some programming work. XDG so far has only used hard constraints, which means that if one of the rules for the grammar can't be satisfied, then the parser just gives up and you don't get a parse. The original XDG implementation is done in Mozart/Oz, which has constraint programming built right into the language; Mike's version uses python-constraint.

So what I did, is I hooked our parser up to a "weighted" constraint solver, toulbar2. It's actively under development by friendly researchers at INRA, from Toulouse and Barcelona.

Weighted constraint solving is pretty simple. We define a problem, which contains a bunch of constraints, and each constraint pertains to some variables. A solution to the problem is a total assignment to all of the variables, and each solution has some penalty associated with it, due to violating constraints. There's a cost to violating each constraint. To solve a problem, you just have to say what the variables and constraints and costs are, and what the maximum cost you're willing to accept is.

toulbar2 is really fast. It's written in C++ and has some clever solving algorithms. Once it gets rolling, it can parse a sentence with a bunch of lexical ambiguity and an embedded clause, "the old man that argues eats", in about three seconds. For comparison, python-constraint takes 70 to 90.

The only problem with this is that to run toulbar2, we first have to translate the problem into the standard WCSP format, which takes several steps:
  • list each constraint and the variables involved (OK, easy)
  • write down the "default cost" for each constraint (OK, easy -- it's 0.)
  • store a mapping for all of our variables and their domains, since toulbar2's format wants everything to be an integer (OK.)
  • write down every single non-default assignment to the relevant variables (OH NOES COMBINATORIAL EXPLOSION.)
This is a problem because (a) we don't know ahead of time which assignments are going to violate the constraint, and (b) there still may be quite a few of them. Going through all possible assignments for each constraint is expensive both in I/O and in dumb combinatorial search. (NB: it's still much smaller than the search space for the problem as a whole, since each constraint only pertains to some of the variables in the problem!)

To get around the worst part of this problem, I came up with a dumb hack that I rather like: skip constraints where there are more than 15 variables, and let toulbar2 over-generate solutions. After the toulbar2 solver returns, then just prune down the solutions that went over-cost.

Parsing a simple sentence with coordination

Getting a parse that I like for "Dolores and Jake eat yogurt", where there are subject links from eat to both "Dolores" and "Jake", and "and" is unconnected, only required a few tweaks, done by hand:

  • Allow motherless nodes (code change, not good)
  • add CONJ as a part of speech (grammar change)
  • Allow breaking these three principles: valency (to allow for more than one subject), tree (every node has exactly 1 mother), and agree (eg, agreement between subjects and verbs)
The problem with this is that this is just one parse among many many that are generated, and the one that I wanted was the 30th generated. The approach here, simply allowing a given principle to be broken, is way too coarse. We shouldn't just say "valency constraints can be broken", we should have more fine-grained features that encode "valency between what"!

Next steps: Optimization problem

Once we write down these features, we can treat tweaking the costs as an optimization problem, where our parameters are the costs for violating each kind of constraint, and maximum cost. We'll consider the grammar and the test sentences as fixed.

We want to know: how can we set the parameters so that we can parse the good sentences, but not the bad ones? And can we get parses that we like? There are many different incomplete parses that we could assign to a sentence -- which one of them should we reward?

To get started on this problem, we just need to know which constraints need to be relaxed so that the desired parse is in there somewhere, and after that, we can (hopefully) optimize the parameters with something like simulated annealing or genetic algorithms.

What I don't have an idea about yet is how to know what kinds of sentences will require tweaks to the grammar, or worse, to the parser code, in order to get sensible parses. For example, to get this coordination example to work, I had to make both of these kinds of changes, even if they were small -- the grammar had no idea about conjunctions, even conjunctions with no arcs attached to them, and the parser threw an exception if a word had no arcs attached.

From an engineering perspective, there's a lot left to do to speed up the parser. This will probably involve finding a tighter way to integrate toulbar2 with our XDG system, and I'm imagining some clever ways to avoid having to list out all the possible assignments after the first time, so the optimization process won't take forever.

Questions and suggestions gratefully accepted :) Thanks for reading!

2 comments:

Brett W. Thompson said...

Wow, great work, Alex!!!

I like your writing style, too- even though I don't fully understand all of this, it's pleasant to read!! :)

Unknown said...

Brett! :D

Thanks so much for taking a look! I hope it gets the flavor and motivation across, at least...