Friday, May 09, 2014

writing a tiny register machine interpreter in Go

I was feeling kind of blue at the end of April; that's probably pretty normal at the tail end of a PhD. I thought a good thing to perk me up might be a little Matt Cutts-style month-long challenge!

I thought it would be nice to make myself do some side projects unrelated to my research, so I decided that for the month of May, every day I'd write a little bit of Go! I've been meaning to get good at Go anyway.

The first interesting bit of stuff to come out of this is go-rodrego, a reimplementation of the RodRego register machine, which is basically the tiniest thing that you could imagine being Turing Complete and easy to understand in terms of imperative programs. Dan Dennett uses it to teach philosophy students (and readers of his lovely Intuition Pumps and Other Tools for Thinking) the basics of what it means to do computation.

And the virtual machine they distribute for their class is in RealBASIC and a pain to run on Linux. But now, here's a Go version!

The instruction set is so tiny: it has just "increment register", "decrement register (or, if it's already at zero, branch somewhere else)", and "end program". And that's all you need for it to be Turing Complete.

It's not conceptually hard to implement this interpreter, of course, but it was a nice exercise for getting used to working with the Go standard library and Go ways of doing things.
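To give a flavor of how small the whole thing is, here's a minimal sketch of a RodRego-style interpreter in Go. (This is my own illustrative sketch, not the actual go-rodrego code -- the real thing parses a text program format, but the three operations and their semantics are the same.)

```go
package main

import "fmt"

type Op int

const (
	Inc Op = iota // increment register, then jump to Next
	Deb           // if register > 0: decrement, jump to Next; else jump to Branch
	End           // halt
)

type Instr struct {
	Op     Op
	Reg    int
	Next   int // step to jump to after Inc, or after a successful decrement
	Branch int // step to jump to when Deb finds the register already at zero
}

// run executes a program starting at step 0 and returns the final registers.
func run(prog []Instr, regs map[int]int) map[int]int {
	pc := 0
	for {
		in := prog[pc]
		switch in.Op {
		case Inc:
			regs[in.Reg]++
			pc = in.Next
		case Deb:
			if regs[in.Reg] > 0 {
				regs[in.Reg]--
				pc = in.Next
			} else {
				pc = in.Branch
			}
		case End:
			return regs
		}
	}
}

func main() {
	// Classic first exercise: move the contents of register 1 into register 2.
	// step 0: DEB reg 1 -> step 1 (took one out), or -> step 2 when empty
	// step 1: INC reg 2 -> back to step 0
	// step 2: END
	prog := []Instr{
		{Op: Deb, Reg: 1, Next: 1, Branch: 2},
		{Op: Inc, Reg: 2, Next: 0},
		{Op: End},
	}
	regs := run(prog, map[int]int{1: 3, 2: 0})
	fmt.Println(regs[1], regs[2]) // prints "0 3"
}
```

Everything else -- addition, copying, even arbitrary computation -- is built by chaining these loops together, which is exactly the point Dennett is making.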

I'll write more about what I'm learning as the month progresses; there should be a few more potentially-interesting packages. So far the other thing I've been working on has been homeworks from the Functional Programming Principles in Scala class, getting a sense about how it feels to do them in Scala vs Go.

So, for your consideration, amusement, and possibly, edification: go-rodrego.

Sunday, December 22, 2013

using your HitBox or other Toodles Cthulhu input device on Linux

I'm something of a fighting games enthusiast, and I've been getting a bit more into it in recent months.

For the discerning player, there are all kinds of interesting input devices ("fightsticks") on offer these days -- you can buy them from big companies like MadCatz or Hori or Capcom, or from a more boutique outfit, or you can do a custom one -- parts are available for building your own or modding existing sticks. There are robust online communities around the whole enterprise, and at least two online stores for buying parts. Fascinating!

I got really excited about, and went ahead and bought, the HitBox fight stick. It's unusual in that it has no joystick; you input directions with buttons. With some practice, you can get really crisp, precise input, which is pretty important in fighting games. It took some getting used to, but now I'm a big fan -- playing games feels like typing, and my thumbs don't get sore.

The HitBox I got is supposed to work with both PS3 and PC. It took a small amount of fiddling to make it go on Linux, but it works great now. This approach will probably also work for other devices based on the Toodles Cthulhu PCB.

the problem: the stick seems to immediately disappear...

So here's the problem: when I plugged in my HitBox on USB, it seemed as though it wasn't detected.
I checked dmesg, though, and it turned out that it was detected by the kernel, but then it immediately disconnected itself.

Mysterious shell script from the Internet to the rescue!

Then I found this thread and this shell script. The script apparently convinces the PCB not to disconnect by reading from it as soon as it's connected (the theory is that the PCB tries switching into XBox 360 mode? Unclear...), and it detects the stick by watching /dev/input/by-id.

Unfortunately, the HitBox had a different device name from the one in that original shell script, so I had to figure out where exactly it was showing up in /dev/input/by-id.

Here's an updated version that works with my HitBox.

Finding the filename for the /dev entry for the HitBox was slightly tricky, because how do you discover the filename of a very fleeting file? It disappears as soon as the PCB decides it should disconnect! Here's the command I used:
$ until ls /dev/input/by-id | grep -m 1 Toodles; do sleep 0.2; done
That loop helpfully printed out the device's filename, which I popped into that earlier script, and everything worked! And now: my (weird) fightstick works whether I'm playing on the PS3 or on a SNES emulator on my computer!

Watch for unnervingly accurate dragon punches and electric wind god fists.

Saturday, November 30, 2013

updates on language technology for Paraguay

As I wrote earlier, we've been working on language technology for Paraguay. There are a few of us on some related projects, with the goal of building both useful translation software for Spanish-Guarani and a nice website where folks can do collaborative translations, eventually with computer-assisted translation included! We're building these tools with reusability in mind too -- they should be applicable to other under-resourced language pairs in the near future.

The first tool is coming along: we've been building out Guampa, the collaborative translation website; we should be ready for the first beta users really soon. We would love some help on this system: if you're into software development and/or want to help build resources for the Guarani language, let's chat!

Coming next, watch for the Tereré translation system and the accompanying Chipa word-sense disambiguation module, completing our "Paraguayan afternoon snack" metaphor for translation tools...

In related news, the Mozilla Paraguay folks have been really busy, gearing up to translate Firefox into Guarani, in collaboration with FP-UNA. The Guarani Ñe'ẽ discussion group has been buzzing about this; from my vantage point in the frozen northern anglohablante climes, it looks like everybody is pumped about this. Pretty exciting times.

Monday, September 30, 2013

Thesis proposal!

Last week, I had my thesis proposal. I proposed, basically, that for doing machine translation into lower-resourced languages we're going to want better cross-lingual word sense disambiguation to help our MT systems make better word choices. And I outlined some methods that we might use to reach that goal. I'm going to develop these approaches in the context of a few different kinds of MT systems, particularly focusing on translating from Spanish to Guarani. So I guess now all I have to do for the rest of the PhD is this project.


If you're curious, I'm writing my dissertation in public, on GitHub.

Let's do this.

Saturday, August 31, 2013

reading: You Are What You Speak

I just recently read You Are What You Speak by Robert Lane Greene. I can heartily recommend it as an enjoyable read, although it's aimed at a fairly general audience.

Greene covers, briefly, all kinds of things: the diversity of languages in the world, what it means to have a language, the identity politics of speaking a particular language, attempts at regulating language and how they relate to nationalism. He spends a lot of time on the history of prescriptive rules for English -- think style books like Eats, Shoots & Leaves and The Elements of Style and their historical predecessors. There's also discussion of the associated hand-wringing, class issues, and emotional damage inflicted by telling people that their native dialect isn't the real way to speak a given language.

So You Are What You Speak would be a good introduction to the question of "what is a linguist? what is linguistics?" for your friend who internalized the watchful eye of your high school English teacher and yells at people about their grammar and diction on the Internet. If anything, I think Greene gives too much credit to language prescriptivists by suggesting that there is some kind of meaningful debate going on between sticklers and, y'know, scientists trying to describe language in the world.

I would have liked to see more examples from outside the Western-European world. Greene spends most of the book talking about English and French, with some bits about the Brazilian Portuguese language academy (which I didn't know was a thing). Come to think of it, more concrete examples about the socio-politics of different English dialects would have been good too. But it's not that long of a book.

So if you've been hanging out in a Linguistics department -- or just reading Language Log -- and laugh when people despair loudly that kids these days are destroying the English language, you may not need to read this book. But you might want to give it to your relatives.

Sunday, August 25, 2013

Computing Education and the ACM Paywall

Recently Mark Guzdial wrote a blog post in which he describes some of the particularities of research in computing education, and defends the continued paywalling of ACM articles in the Digital Library. Just to be clear, Mark is brilliant and friendly, and he does fantastic work. But I think he's mistaken on this particular issue.

Here is Mark's argument, to reduce it to bullet points:
  • CS Ed research is typically not funded by public funding agencies, but done on researchers' own time, so the argument that it should belong to the public does not hold.
  • Educators working in the developing world have different needs than those in the WEIRD world; we can't simply toss papers over the wall and let them figure it out.
  • ... and anyway, the ACM is basically good people, and doing good work with the money it collects, especially for the education community.
  • Ergo, the ACM should keep up its paywall.
Early in his post, Mark brings up the first sentence from the Tear Down This Paywall petition: "Computer science research is largely funded by the public, for the public good." He points out that lots of CS Ed research isn't supported by grants, and that people who are primarily educators do it on their own time, because it is important to them.

So firstly, Mark's own work is funded by the NSF (as he mentions), so the argument about funding would apply to his work, along with the bulk of CS research broadly. But even if we accept that the public can't demand access to the other CS Ed papers, we should consider: what's best for the careers and goals of the CS Ed researchers themselves?  What do they want?

Certainly CS Ed researchers trying to publicize their work -- people who care so much about it that they take it on as a labor of love -- would prefer to reach the broadest possible audience. They don't directly benefit from a paywall. They may like the ACM and want it to continue putting on events, but the paywall keeps them from readers.

But Mark takes a bizarre turn in framing the idea of dismantling the DL's paywall as forcing open access on unsuspecting researchers who didn't agree to it, "after the fact". OA wasn't part of the deal!  He says in the comments, "Certainly, volunteers can volunteer the fruits of their labors. They shouldn't be coerced.  It shouldn't be a requirement." It's hard to imagine a young researcher protesting a larger audience. People don't choose to publish with the ACM because of the paywall on the DL, but in spite of it. For many subfields, ACM conferences are simply where one must publish to be taken seriously, and dealing with the paywall is the cost of doing business.

As for the second point, about researchers and educators in the developing world -- while it is almost certainly not sufficient to release our papers if our goal is to help them develop their own curricula, it's verging on paternalistic to decide ahead of time what would and would not be helpful for them. Make the papers broadly available and let them decide what is relevant and useful. And by all means, we should develop other materials too, but this is a separate pursuit.

We find educators, working programmers, interested laypeople, and researchers from other disciplines in a similar boat -- they may not have the context to completely understand a paper intended for specialists, but they can still get something out of it. And to collaborate meaningfully with -- or join -- the specialist community, they're going to have to read lots of papers. We should reduce the barriers to entry for potentially-interested people, wherever they are. Working programmers and educators are empirically short on both time and ACM memberships.

So for most computing research, we are still seeing publicly funded work made harder to access than it should be. And for CS Ed research, we see work that researchers might want widely distributed made less available than it could and should be. Opening the DL would be an immense good for people around the world -- it's great that Mark and others put in the additional effort to make their personal papers available, but not everyone is so conscientious, or so web-savvy, or so still alive. And the current state of affairs still requires that people go hunt down each paper individually.

It would be silly to claim that the ACM doesn't need a revenue stream, and I think their continued existence is probably a good thing. But there are other funding models for scholarly societies. The current state of affairs is comfortable for Mark and other established researchers, but it could be much better for the up-and-coming looking for a broad audience, as well as for interested parties outside of well-funded academic institutions.

Sunday, July 28, 2013

ACM's optional Open Access is effectively a NOOP

Not all academics have the great moral luck to be working in NLP, where almost everything we publish is going to be Open Access whether we care about OA or not -- barring some out-of-the-way venues that really need to get their acts together.

For example, Lindsey Kuper (both my favorite programming languages researcher and my wife) just put in a paper at the Functional High-Performance Computing workshop at ICFP. And roughly five minutes after she got the acceptance notification, she got the form to sign over publishing rights to the ACM.

Now the ACM has recently made open-access publishing available through their Digital Library -- for $1100 to $1700, depending on the circumstances. I’m not opposed to APCs (“article processing charges”) as such; this seems like a step in the right direction. But I’ll argue that this particular approach is effectively a no-op.

It was unclear to Lindsey's advisor whether they could pay the Open Access fee out of their grant money -- and while he's a great, upstanding guy, he's also a young pre-tenure professor, so he didn't have a lot of spare time to look into this. He's trying to do some science, not get bogged down in policy details. They went with the "retain copyright, but the DL copy won't be OA" option. I imagine this scenario will be pretty typical.

So this new policy effectively won’t change anything for the ACM’s Digital Library: all old papers are still locked down, and for most of the new ones, the authors won’t fork over the money for the OA option.

It’s a giant missed opportunity; the Digital Library could be a phenomenally useful resource. But for people without ACM membership or institutional access -- e.g., almost every working programmer -- the situation is the same as before. If you accidentally click on a link to the DL, that’s just a momentary dead end. Hopefully you can find the paper somewhere else.