Sunday, December 22, 2013

using your HitBox or other Toodles Cthulhu input device on Linux

I'm something of a fighting games enthusiast, and I've been getting a bit more into it in recent months.

For the discerning player, there are all kinds of interesting input devices ("fightsticks") on offer these days -- you can buy them from big companies like MadCatz or Hori or Capcom, or from a more boutique outfit, or you can do a custom one -- parts are available for building your own or modding existing sticks. There are robust online communities around the whole enterprise, and at least two online stores for buying parts. Fascinating!

I got really excited about, and went ahead and bought, the HitBox fight stick. It's unusual in that it has no joystick; you input directions with buttons. With some practice, you can get really crisp, precise input, which is pretty important in fighting games. It took some getting used to, but now I'm a big fan -- playing games feels like typing, and my thumbs don't get sore.

The HitBox I got is supposed to work with both PS3 and PC. It took a small amount of fiddling to make it go on Linux, but it works great now. This approach will probably also work for other devices based on the Toodles Cthulhu PCB.

the problem: the stick seems to immediately disappear...

So here's the problem: when I plugged in my HitBox over USB, it seemed as though it wasn't detected.
When I checked dmesg, though, it turned out that the kernel did detect it, but then it immediately disconnected itself.

Mysterious shell script from the Internet to the rescue!

Then I found this thread and this shell script. The script apparently convinces the PCB not to disconnect by reading from it as soon as it's connected (the theory is that otherwise the PCB tries switching into Xbox 360 mode? Unclear...), and it detects the stick by watching /dev/input/by-id.

Unfortunately, the HitBox had a different device name from the one in that original shell script, so I had to figure out where exactly it was showing up in /dev/input/by-id.

Here's an updated version that works with my HitBox.

Finding the filename for the /dev entry for the HitBox was slightly tricky, because how do you discover the filename of a very fleeting file? It disappears as soon as the PCB decides it should disconnect! Here's the command I used:
$ until ls /dev/input/by-id | grep -m 1 Toodles; do sleep 0.2; done
And that helpfully output:
/dev/input/by-id/usb-Toodles_2008_HitBox_Edition_Cthulhu+-event-joystick
which I popped into that earlier script, and everything worked! And now: my (weird) fightstick works whether I'm playing on the PS3 or on a SNES emulator on my computer!
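For the curious, the whole keep-alive trick boils down to something like this. This is a rough sketch of my understanding, not the original script verbatim -- the hold_open name and the give-up counter are my own additions, so the sketch terminates if the device never shows up:

```shell
#!/bin/sh
# Sketch of the keep-alive trick for Toodles Cthulhu boards: wait for
# the device node to show up, then hold it open with a read, which
# keeps the PCB from disconnecting itself.

hold_open() {
    dev="$1"
    tries=0
    # Poll until the device node appears...
    until [ -e "$dev" ]; do
        sleep 0.2
        tries=$((tries + 1))
        # ...but give up after ~10 seconds if it never does.
        [ "$tries" -ge 50 ] && return 1
    done
    # The read itself is what convinces the PCB to stay connected.
    cat "$dev" > /dev/null
}
```

Kick it off with `hold_open /dev/input/by-id/usb-Toodles_2008_HitBox_Edition_Cthulhu+-event-joystick &` when you plug the stick in (or wire it up to a udev rule) and the stick should stay put.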

Watch for unnervingly accurate dragon punches and electric wind god fists.

Saturday, November 30, 2013

updates on language technology for Paraguay

As I wrote earlier, we've been working on language technology for Paraguay. There are a few of us on some related projects, with the goal of building both useful translation software for Spanish-Guarani and a nice website where folks can do collaborative translations, eventually with computer-assisted translation included! We're building these tools with reusability in mind too -- they should be applicable to other under-resourced language pairs in the near future.

The first tool is coming along: we've been building out Guampa, the collaborative translation website; we should be ready for the first beta users really soon. We would love some help on this system: if you're into software development and/or want to help build resources for the Guarani language, let's chat!

Coming next, watch for the Tereré translation system and the accompanying Chipa word-sense disambiguation module, completing our "Paraguayan afternoon snack" metaphor for translation tools...

In related news, the Mozilla Paraguay folks have been really busy, gearing up to translate Firefox into Guarani, in collaboration with FP-UNA. The Guarani Ñe'ẽ discussion group has been buzzing about this; from my vantage point in the frozen northern anglohablante climes, it looks like everybody is pumped about this. Pretty exciting times.

Monday, September 30, 2013

Thesis proposal!

Last week, I had my thesis proposal. I proposed, basically, that for doing machine translation into lower-resourced languages, we're going to want better cross-lingual word sense disambiguation to help our MT systems make better word choices. And I outlined some methods that we might use to reach that goal. I'm going to develop these approaches in the context of a few different kinds of MT systems, particularly focusing on translating from Spanish to Guarani. So I guess now all I have to do for the rest of the PhD is this project.

If you're curious, I'm writing my dissertation in public, on github: http://github.com/alexrudnick/dissertation

Let's do this.

Saturday, August 31, 2013

reading: You Are What You Speak

I just recently read You Are What You Speak by Robert Lane Greene. I can heartily recommend it as an enjoyable read, although it's aimed at a fairly general audience.

Greene covers, briefly, all kinds of things: the diversity of languages in the world, what it means to have a language, the identity politics of speaking a particular language, and attempts at regulating language and how they relate to nationalism. He spends a lot of time on the history of prescriptive rules for English -- think style books like Eats, Shoots & Leaves and The Elements of Style and their historical predecessors. There's also discussion of the associated hand-wringing, class issues, and emotional damage inflicted by telling people that their native dialect isn't the real way to speak a given language.

So You Are What You Speak would be a good introduction to the question of "what is a linguist? what is linguistics?" for your friend who internalized the watchful eye of your high school English teacher and yells at people about their grammar and diction on the Internet. If anything, I think Greene gives too much credit to language prescriptivists by suggesting that there is some kind of meaningful debate going on between sticklers and, y'know, scientists trying to describe language in the world.

I would have liked to see more examples from outside the Western-European world. Greene spends most of the book talking about English and French, with some bits about the Brazilian Portuguese language academy (which I didn't know was a thing). Come to think of it, more concrete examples about the socio-politics of different English dialects would have been good too. But it's not that long of a book.

So if you've been hanging out in a Linguistics department -- or just reading Language Log -- and laugh when people despair loudly that kids these days are destroying the English language, you may not need to read this book. But you might want to give it to your relatives.

Sunday, August 25, 2013

Computing Education and the ACM Paywall

Recently Mark Guzdial wrote a blog post in which he describes some of the particularities of research in computing education, and defends the continued paywalling of ACM articles in the Digital Library. Just to be clear, Mark is brilliant and friendly, and he does fantastic work. But I think he's mistaken on this particular issue.

Here is Mark's argument, to reduce it to bullet points:
  • CS Ed research is typically not funded by public funding agencies, but done on researchers' own time, so the argument that it should belong to the public does not hold.
  • Educators working in the developing world have different needs than those in the WEIRD world; we can't simply toss papers over the wall and let them figure it out.
  • ... and anyway, the ACM is basically good people, and doing good work with the money it collects, especially for the education community.
  • Ergo, the ACM should keep up its paywall.
Early in his post, Mark brings up the first sentence from the Tear Down This Paywall petition: "Computer science research is largely funded by the public, for the public good." He points out that lots of CS Ed research isn't supported by grants, and that people who are primarily educators do it on their own time, because it is important to them.

So firstly, Mark's own work is funded by the NSF (as he mentions), so the argument about funding would apply to his work, along with the bulk of CS research broadly. But even if we accept that the public can't demand access to the other CS Ed papers, we should consider: what's best for the careers and goals of the CS Ed researchers themselves?  What do they want?

Certainly CS Ed researchers trying to publicize their work -- people who care so much about it that they take it on as a labor of love -- would prefer to reach the broadest possible audience. They don't directly benefit from a paywall. They may like the ACM and want it to continue putting on events, but the paywall keeps them from readers.

But Mark takes a bizarre turn in framing the idea of dismantling the DL's paywall as forcing open access on unsuspecting researchers who didn't agree to it, "after the fact". OA wasn't part of the deal!  He says in the comments, "Certainly, volunteers can volunteer the fruits of their labors. They shouldn't be coerced.  It shouldn't be a requirement." It's hard to imagine a young researcher protesting a larger audience. People don't choose to publish with the ACM because of the paywall on the DL, but in spite of it. For many subfields, ACM conferences are simply where one must publish to be taken seriously, and dealing with the paywall is the cost of doing business.

As for the second point, about researchers and educators in the developing world -- while it is almost certainly not sufficient to release our papers if our goal is to help them develop their own curricula, it's verging on paternalistic to decide ahead of time what would and would not be helpful for them. Make the papers broadly available and let them decide what is relevant and useful. And by all means, we should develop other materials too, but this is a separate pursuit.

We find educators, working programmers, interested laypeople, and researchers from other disciplines in a similar boat -- they may not have the context to completely understand a paper intended for specialists, but they can still get something out of it. And to collaborate meaningfully with -- or join -- the specialist community, they're going to have to read lots of papers. We should reduce the barriers to entry for potentially-interested people, wherever they are. Working programmers and educators are empirically short on both time and ACM memberships.

So for most computing research, we are still seeing publicly funded work made harder to access than it should be. And for CS Ed research, we see work that researchers might want widely distributed made less available than it could and should be. Opening the DL would be an immense good for people around the world -- it's great that Mark and others put in the additional effort to make their personal papers available, but not everyone is so conscientious, or so web-savvy, or so still alive. And the current state of affairs still requires that people go hunt down each paper individually.

It would be silly to claim that the ACM doesn't need a revenue stream, and I think their continued existence is probably a good thing. But there are other funding models for scholarly societies. The current state of affairs is comfortable for Mark and other established researchers, but it could be much better for the up-and-coming looking for a broad audience, as well as for interested parties outside of well-funded academic institutions.

Sunday, July 28, 2013

ACM's optional Open Access is effectively a NOOP

Not all academics have the great moral luck to be working in NLP, where almost everything we publish is going to be Open Access whether we care about OA or not -- barring some out-of-the-way venues that really need to get their acts together.

For example, Lindsey Kuper (both my favorite programming languages researcher and my wife) just put in a paper at the Functional High-Performance Computing workshop at ICFP. And roughly five minutes after she got the acceptance notification, she got the form to sign over publishing rights to the ACM.

Now the ACM has recently made open-access publishing available through their Digital Library -- for $1100 to $1700, depending on the circumstances. I’m not opposed to APCs (“article processing charges”) as such; this seems like a step in the right direction. But I’ll argue that this particular approach is effectively a no-op.

It was unclear to Lindsey’s advisor whether they could pay the Open Access fee out of their grant money -- and while he’s a great, upstanding guy, he’s also a young pre-tenure professor, so he didn’t have a lot of spare time to look into this. He’s trying to do some science, not get bogged down in policy details. They went with the “retain copyright, but the DL copy won’t be OA” option. I imagine this scenario will be pretty typical.

So this new policy effectively won’t change anything for the ACM’s Digital Library: all old papers are still locked down, and for most of the new ones, the authors won’t fork over the money for the OA option.

It’s a giant missed opportunity; the Digital Library could be a phenomenally useful resource. But for people without ACM membership or institutional access -- e.g., almost every working programmer -- the situation is the same as before. If you accidentally click on a link to the DL, that’s just a momentary dead end. Hopefully you can find the paper somewhere else.

Sunday, June 23, 2013

NAACL 2013 review

Just recently, I was in Atlanta for NAACL. So much fun! The hallway track is always the best -- I saw a bunch of friends from the NLP world, and especially a lot of Googlers, and met a bunch of new people! Also I managed to be present for Ray Mooney and David Forsyth and some other professors disagreeing animatedly about internal representations of meaning and to what extent you need to take the intentional stance with respect to other people.

Lots of really interesting papers this time around. There is of course Hal Daumé's expert opinion about the interesting papers at the main conference -- I saw a lot of those same talks, having mostly been hanging out at the machine translation and syntax/parsing tracks. On a personal note, it's exciting to see people I know and have worked with getting mentions on Hal's blog! (so, congratulations Greg Durrett and John DeNero and Juri Ganitkevitch!)

Additionally, here's what I thought was cool:
  • Training Parsers on Incompatible Treebanks by Richard Johansson. You want to build a parser for your language. And you've got a treebank. No! You've got two treebanks. Even better, right? But what if those two treebanks use entirely different annotation schemes? ...
  • In the invited talk on Wednesday, Kathy McKeown talked about, among other things, the idea that as NLP people we can provide evidence for or against ideas in comparative literature or literary theory, in collaboration with literature folks -- "well, the theory is that narrative works like this -- let's check!"
  • At *Sem, but also in the main conference, people are talking about using richer, more structured semantic models in our applications again. The really major change in the field in the early 1990s was to not do this -- but now we've got bigger computers and more data, and as a community we know a lot more about stats! Kevin Knight and his group are launching their Abstract Meaning Representation project ("It's like a treebank, but for semantics.") -- maybe it'll work this time!
  • Also at *Sem, Yoav Goldberg talked about the unreasonably enormous Syntactic Ngrams dataset -- it's basically chunks of parse trees from the English part of the Google Books corpus, indexed by time. That's going to be super useful.
  • I popped in to some of the Computational Linguistics for Literature talks -- Mark Riedl's invited talk about programmatically generating stories for games (slides) was especially good!
  • SemEval! There were fourteen different tasks -- lots of different aspects of understanding text! And people are using all these wildly different techniques to do it. An introductory talk about a task and then a single presentation about a system for performing that task is not always enough to really understand the problem, though...
  • I think my presentation went pretty well! People I've been citing for a while were at my talk, and people seemed engaged and asked good questions! (slides, paper)
Alright! So now, full of encouragement and ideas -- back to work.