Sunday, December 22, 2013

using your HitBox or other Toodles Cthulhu input device on Linux

I'm something of a fighting games enthusiast, and I've been getting a bit more into it in recent months.

For the discerning player, there are all kinds of interesting input devices ("fightsticks") on offer these days -- you can buy them from big companies like MadCatz or Hori or Capcom, or from a more boutique outfit, or you can do a custom one -- parts are available for building your own or modding existing sticks. There are robust online communities around the whole enterprise, and at least two online stores for buying parts. Fascinating!

I got really excited about, and went ahead and bought, the HitBox fight stick. It's unusual in that it has no joystick; you input directions with buttons. With some practice, you can get really crisp, precise input, which is pretty important in fighting games. It took some getting used to, but now I'm a big fan -- playing games feels like typing, and my thumbs don't get sore.

The HitBox I got is supposed to work with both PS3 and PC. It took a small amount of fiddling to make it go on Linux, but it works great now. This approach will probably also work for other devices based on the Toodles Cthulhu PCB.

the problem: the stick seems to immediately disappear...

So here's the problem: when I plugged in my HitBox on USB, it seemed as though it wasn't detected.
I checked dmesg, though, and it turned out that it was detected by the kernel, but then it immediately disconnected itself.

Mysterious shell script from the Internet to the rescue!

Then I found this thread and this shell script. The script apparently convinces the PCB not to disconnect by reading from it as soon as it's connected (the theory is that the PCB tries switching into Xbox 360 mode? Unclear...), and it detects the stick by watching /dev/input/by-id.

Unfortunately, the HitBox had a different device name from the one in that original shell script, so I had to figure out where exactly it was showing up in /dev/input/by-id.

Here's an updated version that works with my HitBox.

Finding the filename for the /dev entry for the HitBox was slightly tricky, because how do you discover the filename of a very fleeting file? It disappears as soon as the PCB decides it should disconnect! Here's the command I used:
$ until ls /dev/input/by-id | grep -m 1 Toodles; do sleep 0.2; done
And that helpfully output the device name, which I popped into that earlier script, and everything worked! And now: my (weird) fightstick works whether I'm playing on the PS3 or on a SNES emulator on my computer!
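The core trick can be sketched in a few lines of Python. This is a hypothetical re-implementation of the idea, not the actual script from that thread: poll /dev/input/by-id for a device whose name contains "Toodles", then hold it open and keep reading so the PCB (apparently) never decides to disconnect.

```python
import os
import time

BY_ID = "/dev/input/by-id"
NAME_FRAGMENT = "Toodles"  # substring of the device name; adjust for your PCB


def wait_for_device(poll_interval=0.2):
    """Poll /dev/input/by-id until a matching device node shows up."""
    while True:
        try:
            for name in os.listdir(BY_ID):
                if NAME_FRAGMENT in name:
                    return os.path.join(BY_ID, name)
        except FileNotFoundError:
            pass  # the directory only exists while some input device is plugged in
        time.sleep(poll_interval)


def hold_open(path):
    """Keep the device file open and read from it forever; reading from the
    PCB right away is what seems to stop it from disconnecting itself."""
    with open(path, "rb") as dev:
        while True:
            dev.read(8)  # block on input events and discard them

# Usage: path = wait_for_device(); hold_open(path)
```

You would run this in the background (or from a udev rule) before plugging the stick in.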

Watch for unnervingly accurate dragon punches and electric wind god fists.

Saturday, November 30, 2013

updates on language technology for Paraguay

As I wrote earlier, we've been working on language technology for Paraguay. There are a few of us on some related projects, with the goal of building both useful translation software for Spanish-Guarani and a nice website where folks can do collaborative translations, eventually with computer-assisted translation included! We're building these tools with reusability in mind too -- they should be applicable to other under-resourced language pairs in the near future.

The first tool is coming along: we've been building out Guampa, the collaborative translation website; we should be ready for the first beta users really soon. We would love some help on this system: if you're into software development and/or want to help build resources for the Guarani language, let's chat!

Coming next, watch for the Tereré translation system and the accompanying Chipa word-sense disambiguation module, completing our "Paraguayan afternoon snack" metaphor for translation tools...

In related news, the Mozilla Paraguay folks have been really busy, gearing up to translate Firefox into Guarani, in collaboration with FP-UNA. The Guarani Ñe'ẽ discussion group has been buzzing about this; from my vantage point in the frozen northern anglohablante climes, it looks like everybody is pumped about this. Pretty exciting times.

Monday, September 30, 2013

Thesis proposal!

Last week, I had my thesis proposal. I proposed, basically, that for doing machine translation into lower-resourced languages, we're going to want better cross-lingual word sense disambiguation to help our MT systems make better word choices. And I outlined some methods that we might use to reach that goal. I'm going to develop these approaches in the context of a few different kinds of MT systems, particularly focusing on translating from Spanish to Guarani. So I guess now all I have to do for the rest of the PhD is this project.

If you're curious, I'm writing my dissertation in public, on GitHub.

Let's do this.

Saturday, August 31, 2013

reading: You Are What You Speak

I just recently read You Are What You Speak by Robert Lane Greene. I can heartily recommend it as an enjoyable read, although it's aimed at a fairly general audience.

Greene covers, briefly, all kinds of things: the diversity of languages in the world, what it means to have a language, the identity politics of speaking a particular language, and attempts at regulating language and how they relate to nationalism. He spends a lot of time on the history of prescriptive rules for English -- think style books like Eats, Shoots & Leaves and The Elements of Style and their historical predecessors. There's also discussion of the associated hand-wringing, class issues, and emotional damage inflicted by telling people that their native dialect isn't the real way to speak a given language.

So You Are What You Speak would be a good introduction to the question of "what is a linguist? what is linguistics?" for your friend who internalized the watchful eye of your high school English teacher and yells at people about their grammar and diction on the Internet. If anything, I think Greene gives too much credit to language prescriptivists by suggesting that there is some kind of meaningful debate going on between sticklers and, y'know, scientists trying to describe language in the world.

I would have liked to see more examples from outside the Western-European world. Greene spends most of the book talking about English and French, with some bits about the Brazilian Portuguese language academy (which I didn't know was a thing). Come to think of it, more concrete examples about the socio-politics of different English dialects would have been good too. But it's not that long of a book.

So if you've been hanging out in a Linguistics department -- or just reading Language Log -- and laugh when people despair loudly that kids these days are destroying the English language, you may not need to read this book. But you might want to give it to your relatives.

Sunday, August 25, 2013

Computing Education and the ACM Paywall

Recently Mark Guzdial wrote a blog post in which he describes some of the particularities of research in computing education, and defends the continued paywalling of ACM articles in the Digital Library. Just to be clear, Mark is brilliant and friendly, and he does fantastic work. But I think he's mistaken on this particular issue.

Here is Mark's argument, to reduce it to bullet points:
  • CS Ed research is typically not funded by public funding agencies, but done on researchers' own time, so the argument that it should belong to the public does not hold.
  • Educators working in the developing world have different needs than those in the WEIRD world; we can't simply toss papers over the wall and let them figure it out.
  • ... and anyway, the ACM is basically good people, and doing good work with the money it collects, especially for the education community.
  • Ergo, the ACM should keep up its paywall.
Early in his post, Mark brings up the first sentence from the Tear Down This Paywall petition: "Computer science research is largely funded by the public, for the public good." He points out that lots of CS Ed research isn't supported by grants, and that people who are primarily educators do it on their own time, because it is important to them.

So firstly, Mark's own work is funded by the NSF (as he mentions), so the argument about funding would apply to his work, along with the bulk of CS research broadly. But even if we accept that the public can't demand access to the other CS Ed papers, we should consider: what's best for the careers and goals of the CS Ed researchers themselves?  What do they want?

Certainly CS Ed researchers trying to publicize their work -- people who care so much about it that they take it on as a labor of love -- would prefer to reach the broadest possible audience. They don't directly benefit from a paywall. They may like the ACM and want it to continue putting on events, but the paywall keeps them from readers.

But Mark takes a bizarre turn in framing the idea of dismantling the DL's paywall as forcing open access on unsuspecting researchers who didn't agree to it, "after the fact". OA wasn't part of the deal!  He says in the comments, "Certainly, volunteers can volunteer the fruits of their labors. They shouldn't be coerced.  It shouldn't be a requirement." It's hard to imagine a young researcher protesting a larger audience. People don't choose to publish with the ACM because of the paywall on the DL, but in spite of it. For many subfields, ACM conferences are simply where one must publish to be taken seriously, and dealing with the paywall is the cost of doing business.

As for the second point, about researchers and educators in the developing world -- while it is almost certainly not sufficient to release our papers if our goal is to help them develop their own curricula, it's verging on paternalistic to decide ahead of time what would and would not be helpful for them. Make the papers broadly available and let them decide what is relevant and useful. And by all means, we should develop other materials too, but this is a separate pursuit.

We find educators, working programmers, interested laypeople, and researchers from other disciplines in a similar boat -- they may not have the context to completely understand a paper intended for specialists, but they can still get something out of it. And to collaborate meaningfully with -- or join -- the specialist community, they're going to have to read lots of papers. We should reduce the barriers to entry for potentially-interested people, wherever they are. Working programmers and educators are empirically short on both time and ACM memberships.

So for most computing research, we are still seeing publicly funded work made harder to access than it should be. And for CS Ed research, we see work that researchers might want widely distributed made less available than it could and should be. Opening the DL would be an immense good for people around the world -- it's great that Mark and others put in the additional effort to make their personal papers available, but not everyone is so conscientious, or so web-savvy, or so still alive. And the current state of affairs still requires that people go hunt down each paper individually.

It would be silly to claim that the ACM doesn't need a revenue stream, and I think their continued existence is probably a good thing. But there are other funding models for scholarly societies. The current state of affairs is comfortable for Mark and other established researchers, but it could be much better for the up-and-coming looking for a broad audience, as well as for interested parties outside of well-funded academic institutions.

Sunday, July 28, 2013

ACM's optional Open Access is effectively a NOOP

Not all academics have the great moral luck to be working in NLP, where almost everything we publish is going to be Open Access whether we care about OA or not -- barring some out-of-the-way venues that really need to get their acts together.

For example, Lindsey Kuper (both my favorite programming languages researcher and my wife) just put in a paper at the Functional High-Performance Computing workshop at ICFP. And roughly five minutes after she got the acceptance notification, she got the form to sign over publishing rights to the ACM.

Now the ACM has recently made open-access publishing available through their Digital Library -- for $1100 to $1700, depending on the circumstances. I’m not opposed to APCs (“article processing charges”) as such; this seems like a step in the right direction. But I’ll argue that this particular approach is effectively a no-op.

It was unclear to Lindsey’s advisor whether they could pay the Open Access fee out of their grant money -- and while he’s a great, upstanding guy, he’s also a young pre-tenure professor, so he didn’t have a lot of spare time to look into this. He’s trying to do some science, not get bogged down in policy details. They went with the “retain copyright, but the DL copy won’t be OA” option. I imagine this scenario will be pretty typical.

So this new policy effectively won’t change anything for the ACM’s Digital Library: all old papers are still locked down, and for most of the new ones, the authors won’t fork over the money for the OA option.

It’s a giant missed opportunity; the Digital Library could be a phenomenally useful resource. But for people without ACM membership or institutional access -- e.g., almost every working programmer -- the situation is the same as before. If you accidentally click on a link to the DL, that’s just a momentary dead end. Hopefully you can find the paper somewhere else.

Sunday, June 23, 2013

NAACL 2013 review

Just recently, I was in Atlanta for NAACL. So much fun! The hallway track is always the best -- I saw a bunch of friends from the NLP world, and especially a lot of Googlers, and met a bunch of new people! Also I managed to be present for Ray Mooney and David Forsyth and some other professors disagreeing animatedly about internal representations of meaning and to what extent you need to take the intentional stance with respect to other people.

Lots of really interesting papers this time around. There is of course Hal Daumé's expert opinion about the interesting papers at the main conference -- I saw a lot of those same talks, having mostly been hanging out at the machine translation and syntax/parsing tracks. On a personal note, it's exciting to see people I know and have worked with getting mentions on Hal's blog! (so, congratulations Greg Durrett and John DeNero and Juri Ganitkevitch!)

Additionally, here's what I thought was cool:
  • Training Parsers on Incompatible Treebanks by Richard Johansson. You want to build a parser for your language. And you've got a treebank. No! You've got two treebanks. Even better, right? But what if those two treebanks use entirely different annotation schemes? ...
  • In the invited talk on Wednesday, Kathy McKeown talked about, among other things, the idea that as NLP people we can provide evidence for or against ideas in comparative literature or literary theory, in collaboration with literature folks -- "well, the theory is that narrative works like this -- let's check!"
  • At *Sem, but also in the main conference, people are talking about using richer, more structured semantic models in our applications again. The really major change in the field in the early 1990s was to not do this -- but now we've got bigger computers and more data, and as a community we know a lot more about stats! Kevin Knight and his group are launching their Abstract Meaning Representation project ("It's like a treebank, but for semantics.") -- maybe it'll work this time!
  • Also at *Sem, Yoav Goldberg talked about the unreasonably enormous Syntactic Ngrams dataset -- it's basically chunks of parse trees from the English part of the Google Books corpus, indexed by time. That's going to be super useful.
  • I popped in to some of the Computational Linguistics for Literature talks -- Mark Riedl's invited talk about programmatically generating stories for games (slides) was especially good!
  • SemEval! There were fourteen different tasks -- lots of different aspects of understanding text! And people are using all these wildly different techniques to do it. An introductory talk about a task and then a single presentation about a system for performing that task is not always enough to really understand the problem, though...
  • I think my presentation went pretty well! People I've been citing for a while were at my talk, and people seemed engaged and asked good questions! (slides, paper)
Alright! So now, full of encouragement and ideas -- back to work.

Friday, May 31, 2013

happy hardware review: Samsung Chromebook

For the past two months, my primary laptop has been a Chromebook! The little $249 Samsung ARM one.

I'm really enjoying it, for a number of reasons. First off, you can do quite a few things from ChromeOS -- it's really surprising how much time we spend in a browser these days. ChromeOS is lovely and simple and it pretty much Just Works.

But the thing that makes it really work for me is Crouton, the semi-official way to run Ubuntu alongside ChromeOS, from champion Googler David Schneider. It's really really easy to install. You just put your Chromebook in developer mode, which gives you a shell in ChromeOS, then you run the crouton shell script, and it sets up a really minimal Ubuntu, with XFCE by default. It is exactly what you (or at least I) want. Once that's set up, it's a quick key combination to switch between ChromeOS and Ubuntu. The Ubuntu runs in a chroot, but by default your user's Downloads directory is shared with ChromeOS, which is brilliant.

The hardware is lovely, especially considering the price. It's so slim, and it feels pretty well built. And it fits in my tiny running-style backpack. The keyboard is pleasantly clicky in the way a chiclet keyboard can be. My one complaint about the hardware is that the button built into the trackpad sometimes sticks, but it seems to be doing that less and less -- maybe it just has to get broken in.

But here's the biggest thing: the battery lasts roughly forever. I'll carry it around all day, work on it for hours in a coffee shop, and I've still got plenty of battery left. It gets at least six hours.

So if what you want is a little laptop with a nice keyboard and long battery life that makes it easy to run a proper Linux (in addition to the bells-and-whistles ChromeOS things)... you could do a lot worse than the Samsung Chromebook.

Sunday, March 31, 2013

language technology for Paraguay

Earlier this month, we went to Paraguay. Why'd we do that?

Paraguay is the only country in the Americas where people are bilingual in a European language and an indigenous language. Paraguayans -- the majority of them, anyway -- really do speak Guarani, or, depending on context, a mixture of Spanish and Guarani called Jopará.

While we were there, we talked with Guarani-language teachers at the institutes where they're training translators and linguists. They let us sit in on some of the classes and talk with the students. We're working on building them a computer-assisted translation webapp that will help us collect lots of bilingual text! This is going to be huge.

While we were in the area, we also talked with the local One Laptop Per Child folks; there's an OLPC installation in the little town of Caacupé, which is near Asunción. The OLPC folks said that we should probably go visit local grade schools. Which we did!

It hit me, after the first visit to the schools, how much we were making use of our foreign-white-scientist privilege. We didn't have anything to do with the OLPC project -- aside from a desire to collaborate -- but here we were, wandering into schools without so much as a release form, talking to the kids. I'm trying to imagine Paraguayan scientists coming to the US to observe technology use among los niños estadounidenses.

The really interesting thing here: the kids in Caacupé weren't so surprised to see foreign scientist-looking guys coming to talk with them. They were really friendly, and eager to show off what they could do with the laptops! I get the impression this happens fairly frequently.

So there's a lot of stuff that needs to get built, to make computer use in Guarani more pleasant.
  • At a very basic level, it's hard for people to type the diacritics that you need for Guarani, if your keyboard layout is set to Spanish. The diacritics for Guarani actually aren't that weird; they've got tildes on some vowels, but you see that in Portuguese too.
  • There's no good spellchecker. Guarani morphology is pretty complicated, so this is not an easy thing to build. But we know a guy who's working on it...
  • Text-to-speech. The kids in the schools have text-to-speech for Spanish, and they love it! There's a program on the OLPC where you can send messages to a friend's computer, and the receiving computer will speak your message. It's hilarious. But it doesn't work for Guarani. And as you get further out into the country, the kids are more likely to be monolingual Guarani speakers...
  • The computer-assisted translation website: we're working on it. I'll write more about this soon...
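As a concrete aside on that first point (my own illustration, not something from the talk): most of Guarani's tilde-marked letters have precomposed Unicode codepoints, but g̃ does not -- it can only be written as a plain g plus a combining tilde, which is one reason input and text handling get fiddly.

```python
import unicodedata

# Guarani's tilde-marked letters: the nasal vowels plus ñ and g̃.
letters = ["ã", "ẽ", "ĩ", "õ", "ũ", "ỹ", "ñ", "g\u0303"]

for ch in letters:
    nfc = unicodedata.normalize("NFC", ch)
    points = " + ".join(f"U+{ord(c):04X}" for c in nfc)
    kind = "precomposed" if len(nfc) == 1 else "combining sequence"
    print(f"{nfc}: {points} ({kind})")
```

Every letter but g̃ normalizes to a single codepoint; g̃ stays a two-codepoint sequence, so naive length counts, keyboard layouts, and spellcheckers all need to handle it specially.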
I gave a talk about all of this at the computational linguistics seminar: here are the slides!

Saturday, February 09, 2013

ACM Publications Board: "How can we minimally budge so you'll stop bothering us about open access?"

You may have seen the recent article from CACM, Positioning ACM for an Open Access Future. I found the article fairly upsetting. The first paragraph...
The age of open access is upon us. Increasingly, the consensus of authors of research articles and their funding institutions is that the fruits of taxpayer-supported research should be freely available to the public. This is a compelling argument and a noble goal.
However, we're not going to do that anytime soon! They then launch into a red-herring discussion of predatory OA publishers (which are a real thing! there do exist vanity presses that have sprung up to capitalize on the OA trend) -- but this ignores how such publishers come about. You don't accidentally become such a vanity press. Serious venues with good review boards won't have the problem of "a glut of third-rate publications that add noise rather than insight to the scientific enterprise".

Afterwards, they discuss four different approaches for the ACM to not go fully OA: the first is that, optionally, authors could pay an extra fee to have their articles available from the Digital Library. The other three are simply ways in which the paywall restrictions could be lifted under some circumstances.

None of these are acceptable. Not if we believe that "the fruits of taxpayer-supported research should be freely available to the public". Why is the current situation even sort of OK?

The ACM's resistance to OA so far, its claims that figuring out a way to do it is too hard, that it's too expensive or will lead to bad publications -- as far as I can tell, these mean at least one of two things:
Which one of these is true? Both?

We need to get the ACM to stop thinking like a for-profit publisher and start thinking like their goal is to move the field forward and educate people. The ACM needs to drop both the paywall and its membership in the AAP.

Either that, or we as computing professionals need to drop the ACM.

Thursday, January 31, 2013

stupid NLP tricks: haikubot!

At Indiana, we have an IRC server where many of the CS students hang out. I keep myself logged in all the time, in a screen session that I'm usually not looking at. But even while I'm not watching, the chat logs just pile up. So I have megabytes of text from the conversations with my friends and colleagues. The clear use for this resource is, of course, generating haiku [0].

So I built a bot that does this! Here is my friend Eric Holk requesting a poem generated from snippets of text that he actually said.
< eholk> haikubot: eholk
< haikubot> do you have enough / memory to cause that was / stored unless it knew
Alright, how does this work?

I use the IRC logs to train a language model from things that each user said [1], on demand. The models are cached, since training can take a few seconds, in case somebody asks for another haiku from the same user. The language model is something one could easily implement (although smoothing can get a little tricky), but I just used the default bigram models from NLTK.
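For the curious, here's roughly what a bare-bones bigram model looks like -- a toy sketch, not NLTK's actual implementation, and with no smoothing at all:

```python
import random
from collections import defaultdict


def train_bigrams(sentences):
    """Map each word to the list of words observed to follow it; sampling
    uniformly from that list reproduces the bigram probabilities."""
    table = defaultdict(list)
    for tokens in sentences:
        padded = ["<s>"] + tokens  # sentence-start symbol
        for prev, cur in zip(padded, padded[1:]):
            table[prev].append(cur)
    return table


def sample_word(table, prev):
    """Sample a next word given the previous one, restarting at the
    sentence-start symbol if we've never seen `prev` before."""
    choices = table.get(prev) or table["<s>"]
    return random.choice(choices)
```

Each call to sample_word is one step of a random walk through a user's chat history.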

We also need to be able to count syllables: this is pretty straightforward using the CMU Pronouncing Dictionary, which also comes with NLTK. If we don't have a stored pronunciation for a word, we back off to some heuristics about English that try to count the number of vowels (with some special rules for English spelling particularities); this comes from nltk_contrib.
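Counting syllables from a CMU-style pronunciation is easy because exactly the vowel phonemes carry a stress digit (AH0, EY1, ...). Here's a sketch, with a crude vowel-group heuristic as the fallback -- my own approximation, not the exact nltk_contrib rules:

```python
import re


def heuristic_syllables(word):
    """Rough fallback: count runs of vowel letters, with a special case
    for a (probably) silent final 'e'."""
    w = word.lower()
    n = len(re.findall(r"[aeiouy]+", w))
    if n > 1 and w.endswith("e") and not w.endswith(("le", "ee")):
        n -= 1
    return max(n, 1)


def count_syllables(word, prondict):
    """Count syllables via a CMU-style dictionary: in each pronunciation,
    the vowel phonemes are the ones that end with a stress digit."""
    prons = prondict.get(word.lower())
    if not prons:
        return heuristic_syllables(word)
    return sum(1 for phoneme in prons[0] if phoneme[-1].isdigit())
```

With NLTK, prondict would come from something like dict(nltk.corpus.cmudict.entries()); here it's just any word-to-pronunciations mapping.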

To actually generate a line of the haiku, we just sample words from the language model until we have the desired number of syllables. If that fails for some reason, we try again (up to 100 times), and eventually just back off to picking a word from a list of known 5-syllable or 7-syllable words.
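That retry-then-back-off logic might look like this (a sketch under assumed helper names: sample_fn draws one word from the user's model, count_fn counts its syllables):

```python
import random


def generate_line(sample_fn, count_fn, target, fallback_words, tries=100):
    """Sample words until the syllable counts sum to exactly `target`.
    Overshooting the target wastes the attempt, so retry up to `tries`
    times, then back off to a word known to have `target` syllables."""
    for _ in range(tries):
        line, total = [], 0
        while total < target:
            word = sample_fn()
            total += count_fn(word)
            line.append(word)
        if total == target:
            return " ".join(line)
    return random.choice(fallback_words)
```

Calling this with target 5, then 7, then 5 yields the three lines of the "haiku".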

The IRC libraries and basic IRC bot structure are adapted from ircbot-collection by Ben Collins-Sussman et al [2].

Here's the code for haikubot! Hopefully this will be fun for somebody else too. If nothing else, it's a pretty good demonstration of some NLTK features working just fine on Python 3. It requires, at the moment, the tip of the NLTK trunk.

[0] These are of course not proper haiku as such. Just 5-7-5 syllables.
[1] There is some filtering to try to ignore, for example, code samples, and usernames at the start of lines.
[2] Related! Ben Collins-Sussman's talk: "How to Replace Yourself with an IRC Bot"