Slides from Tuesday's talk:
Mike's slides: background for the project
My slides: recent progress with weighted constraint solving
So in the previous post...
In the previous post, I discussed dependency grammar, XDG, and introduced why coordination is hard for DG. The plan for our current work is to allow some of the constraints described by the grammar to be broken, such that we can get reasonable parses from input sentences that we know are valid.
This fairly simple case of English coordination is hard to describe with our formalism, and as we expand our system to handle new languages, we're certainly not going to have complete grammars. It would also be nice if we could handle sentences with actual mistakes, but that'll have to be future work. In general, we'd like to be able to tweak our system to (badly) support sentences that our grammar doesn't explicitly handle just by finding out which constraints to relax. There may be other cases where we need to make changes to the grammar, but we'd like to avoid that as much as possible.
Engineering issues and weighted constraint solving
Getting to the point where we can relax constraints and "fail gracefully" took some programming work. XDG so far has only used hard constraints, which means that if one of the rules for the grammar can't be satisfied, then the parser just gives up and you don't get a parse. The original XDG implementation is done in Mozart/Oz, which has constraint programming built right into the language; Mike's version uses python-constraint.
So what I did, is I hooked our parser up to a "weighted" constraint solver, toulbar2. It's actively under development by friendly researchers at INRA, from Toulouse and Barcelona.
Weighted constraint solving is pretty simple. We define a problem, which contains a bunch of constraints, and each constraint pertains to some variables. A solution to the problem is a total assignment to all of the variables, and each solution has some penalty associated with it, due to violating constraints. There's a cost to violating each constraint. To solve a problem, you just have to say what the variables and constraints and costs are, and what the maximum cost you're willing to accept is.
toulbar2 is really fast. It's written in C++ and has some clever solving algorithms. Once it gets rolling, it can parse a sentence with a bunch of lexical ambiguity and an embedded clause, "the old man that argues eats", in about three seconds. For comparison, python-constraint takes 70 to 90.
The only problem with this is that to run toulbar2, we first have to translate the problem into the standard WCSP format, which takes several steps:
- list each constraint and the variables involved (OK, easy)
- write down the "default cost" for each constraint (OK, easy -- it's 0.)
- store a mapping for all of our variables and their domains, since toulbar2's format wants everything to be an integer (OK.)
- write down every single non-default assignment to the relevant variables (OH NOES COMBINATORIAL EXPLOSION.)
To get around the worst part of this problem, I came up with a dumb hack that I rather like: skip constraints where there are more than 15 variables, and let toulbar2 over-generate solutions. After the toulbar2 solver returns, then just prune down the solutions that went over-cost.
Parsing a simple sentence with coordination
Getting a parse that I like for "Dolores and Jake eat yogurt", where there are subject links from eat to both "Dolores" and "Jake", and "and" is unconnected, only required a few tweaks, done by hand:
- Allow motherless nodes (code change, not good)
- add CONJ as a part of speech (grammar change)
- Allow breaking these three principles: valency (to allow for more than one subject), tree (every node has exactly 1 mother), and agree (eg, agreement between subjects and verbs)
Next steps: Optimization problem
Once we write down these features, we can treat tweaking the costs as an optimization problem, where our parameters are the costs for violating each kind of constraint, and maximum cost. We'll consider the grammar and the test sentences as fixed.
We want to know: how can we set the parameters so that we can parse the good sentences, but not the bad ones? And can we get parses that we like? There are many different incomplete parses that we could assign to a sentence -- which one of them should we reward?
To get started on this problem, we just need to know which constraints need to be relaxed so that the desired parse is in there somewhere, and after that, we can (hopefully) optimize the parameters with something like simulated annealing or genetic algorithms.
What I don't have an idea about yet is how to know what kinds of sentences will require tweaks to the grammar, or worse, to the parser code, in order to get sensible parses. For example, to get this coordination example to work, I had to make both of these kinds of changes, even if they were small -- the grammar had no idea about conjunctions, even conjunctions with no arcs attached to them, and the parser threw an exception if a word had no arcs attached.
From an engineering perspective, there's a lot left to do to speed up the parser. This will probably involve finding a tighter way to integrate toulbar2 with our XDG system, and I'm imagining some clever ways to avoid having to list out all the possible assignments after the first time, so the optimization process won't take forever.
Questions and suggestions gratefully accepted :) Thanks for reading!