Thursday, January 31, 2013

stupid NLP tricks: haikubot!

At Indiana, we have an IRC server where many of the CS students hang out. I keep myself logged in all the time, in a screen session that I'm usually not looking at. But even while I'm not watching, the chat logs just pile up. So I have megabytes of text from the conversations with my friends and colleagues. The clear use for this resource is, of course, generating haiku [0].

So I built a bot that does this! Here is my friend Eric Holk requesting a poem generated from snippets of text that he actually said.

< eholk> haikubot: eholk
< haikubot> do you have enough / memory to cause that was / stored unless it knew

Alright, how does this work?

I use the IRC logs to train a language model for each user, on demand, from the things that user said [1]. Training a model can take a few seconds, so the models are cached in case somebody asks for another haiku from the same user. A language model like this is something one could easily implement (although smoothing can get a little tricky), but I just used the default bigram models from NLTK.
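To make the train-on-demand-and-cache idea concrete, here's a minimal sketch. It is not the bot's code: it swaps NLTK's bigram model for a hand-rolled, unsmoothed bigram counter, and the names (train_bigram_model, sample_next, model_for, the logs dictionary keyed by nick) are all made up for illustration.

```python
import random
from collections import Counter, defaultdict

START = "<s>"   # marks the beginning of a chat line
END = "</s>"    # marks the end of a chat line


def train_bigram_model(lines):
    """Count word bigrams over an iterable of chat lines (no smoothing)."""
    counts = defaultdict(Counter)
    for line in lines:
        tokens = [START] + line.lower().split() + [END]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts


def sample_next(model, prev, rng=random):
    """Sample a word that followed `prev`, in proportion to its count."""
    successors = model.get(prev)
    if not successors:
        return END  # unseen context: an unsmoothed model has nothing to offer
    words = list(successors)
    weights = [successors[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


# Hypothetical cache: nick -> trained model.
_model_cache = {}


def model_for(nick, logs):
    """Train the model for `nick` on first request, then reuse it."""
    if nick not in _model_cache:
        _model_cache[nick] = train_bigram_model(logs[nick])
    return _model_cache[nick]
```

The unseen-context case in sample_next is exactly where smoothing would come in; this sketch just gives up and ends the line.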

We also need to be able to count syllables. This is pretty straightforward using the CMU Pronouncing Dictionary, which also comes with NLTK. If we don't have a stored pronunciation for a word, we back off to some heuristics that estimate the syllable count from the vowels in the word (with some special rules for the quirks of English spelling); this comes from nltk_contrib.
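Here's roughly what that lookup-then-back-off looks like, as a sketch rather than the bot's actual code. It assumes the pronunciations argument has the shape of nltk.corpus.cmudict.dict(): each word maps to a list of pronunciations, and each pronunciation is a list of ARPAbet phonemes whose vowels end in a stress digit. The fallback here is a much cruder stand-in for the nltk_contrib heuristics, and both function names are hypothetical.

```python
import re


def guess_syllables(word):
    """Crude spelling-based estimate: count runs of vowel letters."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return 0
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a final "e" as silent ("make"), except in "-le" ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def count_syllables(word, pronunciations):
    """Use the dictionary when it knows the word, else fall back to guessing."""
    word = word.lower()
    if word in pronunciations:
        phonemes = pronunciations[word][0]  # take the first listed pronunciation
        # Vowel phonemes carry a stress digit, e.g. "EH1"; one per syllable.
        return sum(1 for p in phonemes if p[-1].isdigit())
    return guess_syllables(word)
```

With the real dictionary, the call would be something like count_syllables("memory", nltk.corpus.cmudict.dict()), after downloading the cmudict corpus.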

To actually generate a line of the haiku, we just sample words from the language model until we have the desired number of syllables. If that fails for some reason, we try again (up to 100 times), and eventually just back off to picking a word from a list of known 5-syllable or 7-syllable words.
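The sample-retry-fall-back loop can be sketched like this. Again, this is an illustration under assumptions, not the bot's code: next_word and syllables are hypothetical callables standing in for the language model and the syllable counter, treating an overshoot as a failed attempt is my guess at one way an attempt can fail, and the fallback word lists are made up.

```python
import random

MAX_ATTEMPTS = 100  # the retry limit mentioned above

# Hypothetical fallback lists, keyed by syllable count.
FALLBACK_WORDS = {
    5: ["refrigerator", "electricity"],
    7: ["telecommunications"],
}


def try_line(target, next_word, syllables):
    """Sample words until the total hits `target`; None if the attempt fails."""
    words, total, prev = [], 0, None
    while total < target:
        word = next_word(prev)
        if word is None:
            return None  # the model had nothing to say after `prev`
        n = syllables(word)
        if n == 0:
            return None  # reject uncountable tokens so the loop always ends
        total += n
        words.append(word)
        prev = word
    # An overshoot (total > target) counts as a failed attempt.
    return " ".join(words) if total == target else None


def generate_line(target, next_word, syllables, rng=random):
    """Try up to MAX_ATTEMPTS times, then fall back to a single known word."""
    for _ in range(MAX_ATTEMPTS):
        line = try_line(target, next_word, syllables)
        if line is not None:
            return line
    return rng.choice(FALLBACK_WORDS[target])
```

A full haiku would then be three calls, with targets 5, 7, and 5, joined with " / ".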

The IRC libraries and basic IRC bot structure are adapted from ircbot-collection by Ben Collins-Sussman et al. [2].

Here's the code for haikubot! Hopefully this will be fun for somebody else too. If nothing else, it's a pretty good demonstration of some NLTK features working just fine on Python 3. It requires, at the moment, the tip of the NLTK trunk.

[0] These are of course not proper haiku as such. Just 5-7-5 syllables.
[1] There is some filtering to try to ignore, for example, code samples, and usernames at the start of lines.
[2] Related! Ben Collins-Sussman's talk: "How to Replace Yourself with an IRC Bot"