Saturday, November 27, 2010
"What a good blog post this is."
(1) What a tall guy he is.
(2) I just realized what a tall guy he is.
That's definitely not the same thing as a question, where the wh-word is fronted and is, say, the object of the verb...
(3) What did the dog knock over?
You can't rearrange (1) such that the "what" goes away.
(4) * He is what a tall guy.
If you were going to draw a parse tree, what would it look like? Surely a lot of ink has been spilled about it? I don't even know what it's called for sure. Something like a "wh-exclamatory statement"?
Saturday, October 23, 2010
workshop on FOSS machine translation!
I should probably go to this.
extended abstract for a workshop, MTMRL
I'll let you know if my project gets accepted. The workshop is in Israel, which would be a really interesting place to visit!
Here's what I wrote:
Abstract
Here we describe a work-in-progress approach for learning valencies of verbs in a morphologically rich language using only a morphological analyzer and an unannotated corpus. We will compare the results from applying this approach to an unannotated Arabic corpus with those achieved by processing the same text in treebank form. The approach will then be applied to an unannotated corpus from Quechua, a morphologically rich but resource-scarce language.

See the rest here; it's short! (or as a pdf)
Just in case you're not familiar with the idea of valency for verbs: wikipedia!
Wednesday, September 22, 2010
apparently you're wrong about what's grammatical in your language
7e. Once that [that Bill had left] was clear, we gave up.

As a linguist, I'm going to tell you that your naïve intuition that this sentence is ungrammatical is just because you're not smart enough to process the grammatical rules that you know subconsciously -- rules that are in fact mostly encoded in your DNA. What?
The sentence is odd for most native speakers: it is not acceptable. However, this sentence is formed according to the same principle that we posited to account for the formation of (7b)-(7d), i.e., that one sentence may become part of another sentence. Hence (7e) would be grammatical, though it is not acceptable.
Faced with intuitions such as that for (7e) the linguist might decide to modify the grammar he has formulated in such a way that sentence (7e) is considered to be ungrammatical. He may also decide, however, that (7e) is grammatical, and that the unacceptability of the sentence is due to independent reasons. For instance, (7e) may be argued to be unacceptable because the sentence is hard to process. In the latter case the unacceptability is not strictly due to linguistic factors but is due to the more general mechanisms used for processing information.
The native speaker who judges a sentence cannot decide whether it is grammatical. He only has intuitions about acceptability. It is for the linguist to determine whether the unacceptability of a sentence is due to grammatical principles or whether it may be due to other factors. It is the linguist's task to determine what makes (7e) unacceptable.
Seriously: I'd posit that, if your theory about a language doesn't account for what actual native speakers count as a valid sentence, then your theory is wrong! Is Haegeman representing the general Chomskyan position correctly here?
Our goal as scientists is to account for what happens observably. In what way does a proposed grammar of a language count as a falsifiable scientific theory if you can just say "in reality, that sentence is grammatical -- there were just processing difficulties"?
Monday, August 23, 2010
Don't Sleep, There Are Snakes
Everett has great stories to tell, and his telling of them is exciting -- moving to a tiny riverside village in the middle of the jungles of Brazil, with your family, sounds crazy. I can't imagine bringing small children into the rain forest...
His descriptions of the Pirahã culture and language are also fantastic; the Pirahã people only really want to talk about things they've experienced directly, and are largely uninterested in ideas and techniques from the outside world. It's basically not done for them to talk about things that no living Pirahã has seen. They don't appear to have a creation myth of their own, and aren't very much into creation myths that missionaries might have to offer.
And the language has all of these wild properties: most notably, it doesn't seem to feature arbitrary-depth recursion, so its modifiers don't stack, there aren't dependent clauses, and apparently the language doesn't have conjunctions at all. Also it can apparently be whistled or hummed, and it has evidentiality.
The whistle-ability and evidentiality aren't unknown among the world's languages; they just seem strange relative to most of us. The lack of recursion, though, throws into doubt a lot of what we think is inherent and unique in human language. Everett's claims have apparently caused a nontrivial amount of ink and vitriol to be spilled; he's been personally called a charlatan by Chomsky (according to wikipedia).
So, altogether: I enjoyed reading it quite a lot. Everett writes compellingly about his (and his family's) adventures trying to live in the rain forest, about the Pirahã people, about language and cognition and culture, and about his losing his faith gradually, on being confronted with a people who aren't much into abstractions or things that happened centuries ago -- and in fact don't talk about them.
Sunday, July 04, 2010
disabling the popups on links on wordpress
Maybe you've thought about how to get rid of them, but haven't done it yet? (they're called "Snap Previews...")
There are a few ways to get rid of them. If you just add shots.snap.com to your AdBlock (or your /etc/hosts file), that'll solve the problem.
Another alternative is politely asking them to not pop up; you can set an option by clicking the little "gear" icon when the window does come up. That'll store your preference in a cookie for who-knows how long.
I just put this line in my /etc/hosts, though.
127.0.0.1 shots.snap.com
Tuesday, June 22, 2010
May I present: kompressr
Wednesday, May 26, 2010
quick commandline trick: find files that you can write to
find . -writable -type f
Tuesday, May 18, 2010
the literature: Realization with CCG
For my project for the class, I read a bunch of papers by the great Michael White about the work that he and his colleagues have been doing on realization (aka generating text) with OpenCCG.
If you're interested in CCG, text generation, or text generation with CCG...
- Here are my presentation slides
- and here's my writeup about it: html, pdf.
Saturday, May 01, 2010
nice wireless card for Ubuntu: TP-Link TL-WN651G
It uses the Atheros chipset (Atheros AR5001X+), and Ubuntu figures it out right away, with the ath5k wireless driver.
So if you're looking for an inexpensive PCI wireless card that you can use on Linux without using a binary blob and ndiswrapper, this is a good choice!
Hooray :)
Monday, April 26, 2010
what I'm working on: XDG and weighted constraint solving
Slides from Tuesday's talk:
So in the previous post...
In the previous post, I introduced dependency grammar and XDG, and explained why coordination is hard for DG. The plan for our current work is to allow some of the constraints described by the grammar to be broken, such that we can get reasonable parses from input sentences that we know are valid.
This fairly simple case of English coordination is hard to describe with our formalism, and as we expand our system to handle new languages, we're certainly not going to have complete grammars. It would also be nice if we could handle sentences with actual mistakes, but that'll have to be future work. In general, we'd like to be able to tweak our system to (badly) support sentences that our grammar doesn't explicitly handle just by finding out which constraints to relax. There may be other cases where we need to make changes to the grammar, but we'd like to avoid that as much as possible.
Engineering issues and weighted constraint solving
Getting to the point where we can relax constraints and "fail gracefully" took some programming work. XDG so far has only used hard constraints, which means that if one of the rules for the grammar can't be satisfied, then the parser just gives up and you don't get a parse. The original XDG implementation is done in Mozart/Oz, which has constraint programming built right into the language; Mike's version uses python-constraint.
So what I did was hook our parser up to a "weighted" constraint solver, toulbar2. It's actively under development by friendly researchers at INRA, from Toulouse and Barcelona.
Weighted constraint solving is pretty simple. We define a problem, which contains a bunch of constraints, and each constraint pertains to some variables. A solution to the problem is a total assignment to all of the variables, and each solution has some penalty associated with it, due to violating constraints. There's a cost to violating each constraint. To solve a problem, you just have to say what the variables and constraints and costs are, and what the maximum cost you're willing to accept is.
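To make that concrete, here's a tiny brute-force weighted-CSP solver in Python. This is a toy sketch of the idea only -- toulbar2 uses much cleverer algorithms, and none of these names come from its API or from our parser:

```python
from itertools import product

# Each constraint carries a cost that you pay whenever it's violated.
# A "solution" is any total assignment whose summed penalty stays at or
# under the maximum cost you're willing to accept.

variables = {"x": [0, 1, 2], "y": [0, 1, 2]}

# Each constraint: (variables it touches, predicate, cost of violating it)
constraints = [
    (("x", "y"), lambda x, y: x != y, 5),  # soft: prefer x != y
    (("x",),     lambda x: x > 0,     2),  # soft: prefer x > 0
]

def solve(variables, constraints, max_cost):
    """Enumerate every total assignment; keep those within max_cost."""
    names = list(variables)
    solutions = []
    for values in product(*(variables[n] for n in names)):
        assignment = dict(zip(names, values))
        penalty = sum(cost
                      for vars_, pred, cost in constraints
                      if not pred(*(assignment[v] for v in vars_)))
        if penalty <= max_cost:
            solutions.append((penalty, assignment))
    return sorted(solutions, key=lambda s: s[0])

# Cheapest solution first; here a zero-penalty assignment exists.
best_cost, best = solve(variables, constraints, max_cost=2)[0]
```

Enumerating every assignment like this is exactly what blows up combinatorially on real problems, which is why the clever search inside toulbar2 matters.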
toulbar2 is really fast. It's written in C++ and has some clever solving algorithms. Once it gets rolling, it can parse a sentence with a bunch of lexical ambiguity and an embedded clause, "the old man that argues eats", in about three seconds. For comparison, python-constraint takes 70 to 90 seconds.
The only problem with this is that to run toulbar2, we first have to translate the problem into the standard WCSP format, which takes several steps:
- list each constraint and the variables involved (OK, easy)
- write down the "default cost" for each constraint (OK, easy -- it's 0.)
- store a mapping for all of our variables and their domains, since toulbar2's format wants everything to be an integer (OK.)
- write down every single non-default assignment to the relevant variables (OH NOES COMBINATORIAL EXPLOSION.)
To get around the worst part of this problem, I came up with a dumb hack that I rather like: skip constraints where there are more than 15 variables, and let toulbar2 over-generate solutions. After the toulbar2 solver returns, then just prune down the solutions that went over-cost.
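In code, the prune step might look something like this (a sketch only; the function and data shapes are made up for illustration, not our actual parser code):

```python
def prune_overcost(solutions, skipped_constraints, max_cost):
    """Re-check the constraints that were skipped during WCSP translation
    (too many variables to enumerate) and drop any solution whose true
    total cost exceeds max_cost."""
    kept = []
    for base_cost, assignment in solutions:
        extra = sum(cost
                    for vars_, pred, cost in skipped_constraints
                    if not pred(*(assignment[v] for v in vars_)))
        if base_cost + extra <= max_cost:
            kept.append((base_cost + extra, assignment))
    return kept

# Example: one skipped constraint ("a" != "b", violation cost 8), max cost 5.
solutions = [(0, {"a": 1, "b": 1}), (3, {"a": 1, "b": 2})]
skipped = [(("a", "b"), lambda a, b: a != b, 8)]
surviving = prune_overcost(solutions, skipped, max_cost=5)  # drops the first
```

The trade-off is that toulbar2 does extra work enumerating solutions that this step throws away, but that beats writing out an exponential number of tuples in the WCSP file.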
Parsing a simple sentence with coordination
Getting a parse that I like for "Dolores and Jake eat yogurt", where there are subject links from eat to both "Dolores" and "Jake", and "and" is unconnected, only required a few tweaks, done by hand:
- Allow motherless nodes (code change, not good)
- Add CONJ as a part of speech (grammar change)
- Allow breaking these three principles: valency (to allow for more than one subject), tree (every node has exactly one mother), and agree (e.g., agreement between subjects and verbs)
Next steps: Optimization problem
Once we write down these features, we can treat tweaking the costs as an optimization problem, where our parameters are the costs for violating each kind of constraint, plus the maximum cost. We'll consider the grammar and the test sentences as fixed.
We want to know: how can we set the parameters so that we can parse the good sentences, but not the bad ones? And can we get parses that we like? There are many different incomplete parses that we could assign to a sentence -- which one of them should we reward?
To get started on this problem, we just need to know which constraints need to be relaxed so that the desired parse is in there somewhere, and after that, we can (hopefully) optimize the parameters with something like simulated annealing or genetic algorithms.
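As a sketch of what that search might look like, here's a simple hill climber over the cost parameters. Everything here is hypothetical: score() is a stand-in for actually running the parser over good and bad sentences, and the constraint names are just examples:

```python
import random

def score(params):
    # Stand-in objective. In the real system this would count how many
    # good sentences parse and how many bad sentences are rejected under
    # these violation costs; here it just has a known optimum.
    return -abs(params["valency"] - 4) - abs(params["agree"] - 2)

def hill_climb(params, steps=200, seed=0):
    """Randomly nudge one cost at a time, keeping changes that don't
    hurt the objective. Simulated annealing would also accept some
    worsening moves to escape local optima."""
    rng = random.Random(seed)
    best, best_score = dict(params), score(params)
    for _ in range(steps):
        candidate = dict(best)
        k = rng.choice(list(candidate))
        candidate[k] = max(0, candidate[k] + rng.choice([-1, 1]))
        if score(candidate) >= best_score:
            best, best_score = candidate, score(candidate)
    return best, best_score

best, s = hill_climb({"valency": 0, "agree": 0})
```

Since each evaluation of the real objective means re-parsing the whole test set, keeping individual parses fast matters a lot for making this search practical.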
What I don't know yet is how to predict which kinds of sentences will require tweaks to the grammar, or worse, to the parser code, in order to get sensible parses. For example, to get this coordination example to work, I had to make both kinds of changes, even if they were small -- the grammar had no idea about conjunctions, even conjunctions with no arcs attached to them, and the parser threw an exception if a word had no arcs attached.
From an engineering perspective, there's a lot left to do to speed up the parser. This will probably involve finding a tighter way to integrate toulbar2 with our XDG system, and I'm imagining some clever ways to avoid having to list out all the possible assignments after the first time, so the optimization process won't take forever.
Questions and suggestions gratefully accepted :) Thanks for reading!
Sunday, April 25, 2010
what I'm working on: Dependency Grammar and XDG
Jake and Dolores eat yogurt.
Tuesday, April 20, 2010
what I'm working on: L3 overview
Tuesday, April 13, 2010
my dialect and subjunctive verbs
Saturday, March 27, 2010
from Dan Dennett: "Preachers who are not Believers"
“OK, this God created me. It’s a perfect God that knows everything; can do anything. And somehow it got messed up, and it’s my fault. So he had to send his son to die for me to fix it. And he does. And now I’m supposed to beat myself to death the rest of my life over it. It makes no sense to me. Don’t you think a God could come up with a better plan than that?”
“What kind of personality; what kind of being is this that had to create these other beings to worship and tell him how wonderful he is? That makes no sense, if this God is all-knowing and all-wise and all-wonderful. I can’t comprehend that that’s what kind of person God is.”
“Every church I’ve been in preached that the Jonah in the Whale story is literally true. And I’ve never believed that. You mean to tell me a human was in the belly of that whale? For three days? And then the whale spit him out on the shoreline? And, of course, their convenient logic is, ‘Well, God can do anything.’”
“Well, I think most Christians have to be in a state of denial to read the Bible and believe it. Because there are so many contradicting stories. You’re encouraged to be violent on one page, and you’re encouraged to give sacrificial love on another page. You’re encouraged to bash a baby’s head on one page, and there’s other pages that say, you know, give your brother your fair share of everything you have if they ask for it.”
“But if God was going to reveal himself to us, don’t you think it would be in a way that we wouldn’t question? ...I mean, if I was wanting to have...people teach about the Bible...I would probably make sure they knew I existed. ...I mean, I wouldn’t send them mysterious notes, encrypted in a way that it took a linguist to figure out.”
Sunday, February 21, 2010
Morphology: derivation and inflection
Derivation is when new words, typically of a different part of speech, are produced from existing words. In English, we have quite a few affixes to change a word's category, and interestingly, they're not very regular. To make something black, you blacken it, but to make something hollow, you hollow it. In Calvin and Hobbes, Calvin uses verb as a verb that means "turn something into a verb". To make something into a product, you can (recently) productize it. What word can you use to make something blue? Have you ever bluified something?
Inflection is rather simpler. It takes a base form of a word and encodes some extra meaning in it -- what extra meaning varies by language, but it's often things like plurality or gender. Typically the language requires that words be inflected properly.
Languages differ pretty broadly in how much information a given inflected word carries. For example, a verb in Spanish carries more bits than one in English, so in Spanish ("Hablo castellano.") you often don't have to specify the subject of a verb, because its inflection makes it clear who you're talking about [0]. Some languages encode quite a lot of information in one verb: maybe its object, the whole tense (so no need for helper verbs like "would" or "haber"), the genders of all the participants, maybe even how the speaker came to know the information in question [1].
I have a whole lot of linguistics to learn. It's interesting being around a department where a lot of the people are linguists by background, when I've only put so much time and attention into it. So you'll get more posts like this, rest assured.
[0] This feature, of not having to specify pronouns, is called being a pro-drop language. Some languages can drop more pronouns than Spanish!
[1] This feature, grammatical evidentiality, is extremely awesome, and we need to adopt it in English.
References:
http://www.indiana.edu/~hlw/Derivation/intro.html
http://en.wikipedia.org/wiki/Inflectional_morphology
http://en.wikipedia.org/wiki/Evidentiality
http://en.wikipedia.org/wiki/Inflection#Inflection_vs._derivation
http://en.wikipedia.org/wiki/Pro-drop_language
Sunday, February 07, 2010
drawing trees with LaTeX
Installing packages from CTAN manually looks hard.
If you're on Ubuntu/Debian, though, you just need to install these two packages: texlive-humanities, texlive-pictures.
qtree itself is in texlive-humanities, but it depends on a package, pict2e, that's only in texlive-pictures, so you have to install them both or it won't work.
And then you can draw some trees just by specifying the bracketing of the phrases (see the qtree docs for exactly how).
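For instance, a minimal document (assuming the two packages above are installed; see the qtree docs for the full syntax) might look like this:

```latex
\documentclass{article}
\usepackage{qtree}
\begin{document}
% "the dog barks": the bracketing of the phrases becomes the tree
\Tree [.S [.NP [.Det the ] [.N dog ] ] [.VP [.V barks ] ] ]
\end{document}
```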
Friday, January 08, 2010
review: a new Model M from Unicomp!
Lindsey just gave me a new one! My 1988 version (IBM part #1391401) is still fine, of course. But now I can bring one to the lab.
The new keyboard is beautiful; they're making them with USB now, and they come in black! It's not quite as heavy as my 80's vintage keyboard (no big metal plate inside), and while the keys themselves are easily removable, this model doesn't have separate keycaps. But it's just as clicky as you remember, and the feel is perfect. This design is apparently the same as some of the latter-day IBM versions.
The company manufacturing the M now, Unicomp, is great, and they totally deserve your business.
The first keyboard they shipped us actually had some problems -- a few of the keys were sticking! So I called up the company and got Jim on the phone almost immediately. He suggested that I pull the offending keys off and then pop them back in place (usually good M maintenance advice). After some fidgeting, we determined that I wasn't going to be able to fix it myself, so he had a replacement sent out the very next day!
So fantastic. And now I have two Ms.
(here's another Unicomp review; the blogger and everybody in the comments over there seems to have had a great customer service experience too.)
Sunday, January 03, 2010
Guns, Germs, and Steel on invention
There's a lot about germs, too. The diseases that a society carries and develops resistances to are extremely important when running into another group. A people can be totally wiped out when faced with a disease it's not accustomed to.
But I wanted to share with you a bit that particularly resonated with me, as a technology-producing person.
Thus, the commonsense view of invention that served as our starting point reverses the usual roles of invention and need. It also overstates the importance of rare geniuses, such as Watt and Edison. That "heroic theory of invention," as it is termed, is encouraged by patent law, because an applicant for a patent must prove the novelty of the invention submitted. Inventors thereby have a financial incentive to denigrate or ignore previous work. From a patent lawyer's perspective, the ideal invention is one that arises without any precursors, like Athene springing fully formed from the forehead of Zeus.
In reality, even for the most famous and apparently decisive modern inventions, neglected precursors lurked behind the bald claim that "X invented Y." For instance, we are regularly told, "James Watt invented the steam engine in 1769," supposedly inspired by watching steam rise from a tea-kettle's spout. Unfortunately for this splendid fiction, Watt actually got the idea for his particular steam engine while repairing a model of Thomas Newcomen's steam engine, which Newcomen had invented 57 years earlier and of which over a hundred had been manufactured in England by the time of Watt's repair work. Newcomen's engine, in turn, followed the steam engine that the Englishman Thomas Savery patented in 1698, which followed the steam engine that the Frenchman Denis Papin designed (but did not build) around 1680, which in turn had precursors in the ideas of the Dutch scientist Christiaan Huygens and others. All this is not to deny that Watt greatly improved Newcomen's engine (by incorporating a separate steam condenser and a double-acting cylinder), just as Newcomen had greatly improved Savery's.
Saturday, January 02, 2010
Foundation Beyond Belief launches!
The new website for the Foundation Beyond Belief is up! The mission is: "To demonstrate humanism at its best by supporting efforts to improve this world and this life; to challenge humanists to embody the highest principles of humanism, including mutual care and responsibility; and to help and encourage humanist parents to raise confident children with open minds and compassionate hearts."
Foundation Beyond Belief is a non-profit, charitable foundation that wants to encourage compassion and charitable giving for [secular] humanists. It's also working on providing support and education for non-theistic parents.
However you might feel about churches, one thing that they're good at is charity and volunteer projects. You're big-hearted and well-meaning -- but do you have somebody reminding you to volunteer for Habitat For Humanity and donate to feed the homeless every week? Apparently in the US, religious people give more to nonprofits than non-religious people do (according to this guide from Mint, via FriendlyAtheist).
That's what FBB is for. With FBB, you can make one-time donations, or sign up for monthly giving, and you choose how your donation is distributed! Contributions are tax deductible, and go 100% to the organizations benefited! (you can also choose to donate to FBB itself, which of course has operating costs)
There's an online community, etc! Pretty exciting!