<h1>Penguin Parens</h1>
<a href="http://www.blogger.com/profile/12622189575214725040">Anonymous</a><br />
<br />
<h2>writing a tiny register machine interpreter in Go (2014-05-09)</h2>
I was feeling kind of blue at the end of April; it's probably pretty normal in the tail end of a PhD. I thought a good thing to perk me up might be a little <a href="http://www.ted.com/talks/matt_cutts_try_something_new_for_30_days" target="_blank">Matt Cutts-style month-long challenge</a>!<br />
<br />
I thought it would be nice to make myself do some side projects unrelated to my research, so I decided that for the month of May, every day I'd write a little bit of Go! I've been meaning to get good at Go anyway.<br />
<br />
The first interesting bit of stuff to come out of this is <a href="https://github.com/alexrudnick/go-rodrego" target="_blank">go-rodrego</a>, a reimplementation of the <a href="http://sites.tufts.edu/rodrego/" target="_blank">RodRego</a> register machine, which is basically the tiniest thing that you could imagine being Turing Complete and easy to understand in terms of imperative programs. Dan Dennett uses it to <a href="http://sites.tufts.edu/rodrego/files/2011/03/Secrets-of-Computer-Power-Revealed-2008.pdf" target="_blank">teach philosophy students</a> (and readers of his lovely <cite>Intuition Pumps and Other Tools for Thinking</cite>) the basics of what it means to do computation.<br />
<br />
And the virtual machine they distribute for their class is written in REALbasic and is a pain to run on Linux. But now, here's a Go version!<br />
<br />
The instruction set is so tiny: it just has "increment register", "decrement register -- or, if it's already zero, branch", and "end program". And that's all you need for it to be Turing Complete.<br />
<br />
It's not conceptually hard to implement this interpreter, of course, but it was a nice exercise for getting used to working with the Go standard library and Go ways of doing things.<br />
<br />
I'll write more about what I'm learning as the month progresses; there should be a few more potentially interesting packages. So far the other thing I've been working on has been the homework from the <a href="https://class.coursera.org/progfun-004/">Functional Programming Principles in Scala</a> class, getting a sense of how it feels to do the assignments in Scala vs. Go.<br />
<br />
So, for your consideration, amusement, and possibly, edification: go-rodrego.<br />
<a href="https://github.com/alexrudnick/go-rodrego">https://github.com/alexrudnick/go-rodrego</a>Anonymoushttp://www.blogger.com/profile/12622189575214725040noreply@blogger.com2tag:blogger.com,1999:blog-34619686.post-52798858963741084402013-12-22T02:29:00.001-05:002013-12-22T02:30:14.128-05:00using your HitBox or other Toodles Cthulhu input device on LinuxI'm something of a fighting games enthusiast, and I've been getting a bit more into it in recent months.<br />
<br />
For the discerning player, there are all kinds of interesting input devices ("fightsticks") on offer these days -- you can buy them from big companies like MadCatz or Hori or Capcom, or from a more boutique outfit, or you can do a <a href="http://www.joystiq.com/2012/02/21/michael-mcdonald-arcade-stick-for-takin-it-to-the-street-fight/">custom one</a> -- parts are available for building your own or modding existing sticks. There are robust online communities around the whole enterprise, and at <a href="http://www.focusattack.com/">least</a> <a href="http://godlikecontrols.com/">two</a> online stores for buying parts. Fascinating!<br />
<br />
I got really excited about, and went ahead and bought, the <a href="http://www.hitboxarcade.com/">HitBox fight stick</a>. It's unusual in that it has no joystick; you input directions with buttons. With some practice, you can get really crisp, precise input, which is pretty important in fighting games. It took some getting used to, but now I'm a big fan -- playing games feels like typing, and my thumbs don't get sore.<br />
<br />
The HitBox I got is supposed to work with both PS3 and PC. It took a small amount of fiddling to make it go on Linux, but it works great now. This approach will probably also work for other devices based on the Toodles Cthulhu PCB.<br />
<br />
<h3>
the problem: the stick seems to immediately disappear...</h3>
So here's the problem: when I plugged in my HitBox on USB, it seemed as though it wasn't detected.<br />
I checked <tt>dmesg</tt>, though, and it turned out that it <i>was</i> detected by the kernel, but then it immediately disconnected itself.<br />
<br />
<h3>
Mysterious shell script from the Internet to the rescue!</h3>
Then I found <a href="https://bbs.archlinux.org/viewtopic.php?id=126522">this thread</a> and <a href="http://pastebin.com/mkbuyYdR">this shell script</a>. The script apparently convinces the PCB not to disconnect by reading from it as soon as it's connected (the theory is that the PCB tries switching into XBox 360 mode? Unclear...), and it detects the stick by watching <tt>/dev/input/by-id</tt>.<br />
<br />
Unfortunately, the HitBox had a different device name from the one in that original shell script, so I had to figure out where exactly it was showing up in <tt>/dev/input/by-id</tt>.<br />
<br />
<a href="https://gist.github.com/alexrudnick/8079313">Here's an updated version that works with my HitBox</a>.
<br />
<br />
Finding the filename for the <tt>/dev</tt> entry for the HitBox was slightly tricky, because how do you discover the filename of a very fleeting file? It disappears as soon as the PCB decides it should disconnect! Here's the command I used:<br />
<blockquote class="tr_bq">
<tt>$ until ls /dev/input/by-id | grep -m 1 Toodles; do sleep 0.2; done</tt></blockquote>
And that helpfully output:<br />
<blockquote>
<tt>/dev/input/by-id/usb-Toodles_2008_HitBox_Edition_Cthulhu+-event-joystick</tt></blockquote>
which I popped into that earlier script, and everything worked! And now: my (weird) fightstick works whether I'm playing on the PS3 or on a SNES emulator on my computer!<br />
<br />
Watch for unnervingly accurate dragon punches and electric wind god fists.<br />
<br />
<h2>updates on language technology for Paraguay (2013-11-30)</h2>
<a href="http://penguinparens.blogspot.com/2013/03/language-technology-for-paraguay.html" target="_blank">As I wrote earlier</a>, we've been working on language technology for Paraguay. There are a few of us on some related projects, with the goal of building both useful translation software for Spanish-Guarani and a nice website where folks can do collaborative translations, eventually with computer-assisted translation included! We're building these tools with reusability in mind too -- they should be applicable to other under-resourced language pairs in the near future.<br />
<br />
The first tool is coming along: we've been building out <a href="https://github.com/hltdi/guampa" target="_blank">Guampa</a>, the collaborative translation website; we should be ready for the first beta users really soon. <b>We would love some help on this system: if you're into software development and/or want to help build resources for the Guarani language, let's chat!</b><br />
<br />
Coming next, watch for the <a href="http://github.com/alexrudnick/terere" target="_blank">Tereré</a> translation system and the accompanying <a href="http://github.com/alexrudnick/chipa" target="_blank">Chipa</a> word-sense disambiguation module, completing our "Paraguayan afternoon snack" metaphor for translation tools...<br /><br />In related news, the <a href="http://www.mozillaparaguay.org/" target="_blank">Mozilla Paraguay</a> folks have been really busy, gearing up to translate Firefox into Guarani, in collaboration with <a href="http://www.pol.una.py/" target="_blank">FP-UNA</a>. The <a href="https://groups.google.com/forum/#!forum/guarani-nee" target="_blank">Guarani Ñe'ẽ discussion group</a> has been buzzing about this; from my vantage point in the frozen northern anglohablante climes, it looks like everybody is pumped. Pretty exciting times.<br />
<br />
<h2>Thesis proposal! (2013-09-30)</h2>
Last week, I had my thesis proposal.
I proposed, basically, that for doing machine translation into lower-resourced languages, we're going to want better cross-lingual word sense disambiguation to help our MT systems make better word choices. And I outlined some methods that we might use to reach that goal. I'm going to develop these approaches in the context of a few different kinds of MT systems, particularly focusing on translating from Spanish to Guarani.
So I guess now all I have to do for the rest of the PhD is this project.<br />
<br />
<br />
<iframe allowfullscreen="true" frameborder="0" height="500" mozallowfullscreen="true" src="https://docs.google.com/presentation/d/1f4jqGJ2lYZCXxF3v6h8ZkonKH_2zDu3nwtK-3X81mk0/embed?start=false&loop=false&delayms=10000000" webkitallowfullscreen="true" width="500"></iframe><br />
<br />
If you're curious, I'm writing my dissertation in public, on github: <a href="http://github.com/alexrudnick/dissertation">http://github.com/alexrudnick/dissertation</a><br />
<br />
Let's do this.<br />
<br />
<h2>reading: You Are What You Speak (2013-08-31)</h2>
I just recently read <a href="http://www.robertlanegreene.com/?p=287" target="_blank">You Are What You Speak</a> by Robert Lane Greene. I can heartily recommend it as an enjoyable read, although it's aimed at a fairly general audience.<div>
<br /></div>
<div>
<div>
Greene covers, briefly, all kinds of things: the diversity of languages in the world, what it means to have a language, the identity politics of speaking a particular language, attempts at regulating language and how they relate to nationalism. He spends a lot of time on the history of prescriptive rules for English -- think style books like <i>Eats, Shoots and Leaves</i> and <i>The Elements of Style</i> and their historical predecessors. There's also discussion on the associated hand-wringing, class issues and emotional damage inflicted by telling people that their native dialect isn't the real way to speak a given language.</div>
<div>
<br /></div>
<div>
So <i>You Are What You Speak</i> would be a good introduction to the question of "what is a linguist? what is linguistics?" for your friend who <a href="http://en.wikipedia.org/wiki/Panopticism" target="_blank">internalized the watchful eye of your high school English teacher</a> and yells at people about their grammar and diction on the Internet. If anything, I think Greene gives too much credit to language prescriptivists by suggesting that there is some kind of meaningful debate going on between sticklers and, y'know, scientists trying to describe language in the world.</div>
<div>
<br /></div>
<div>
I would have liked to see more examples from outside the Western-European world. Greene spends most of the book talking about English and French, with some bits about the <a href="http://en.wikipedia.org/wiki/Academia_Brasileira_de_Letras" target="_blank">Brazilian Portuguese language academy</a> (which I didn't know was a thing). Come to think of it, more concrete examples about the socio-politics of different English dialects would have been good too. But it's not that long of a book.</div>
<div>
<br /></div>
<div>
So if you've been hanging out in a Linguistics department -- or just reading <a href="http://languagelog.ldc.upenn.edu/" target="_blank">Language Log</a> -- and laugh when people despair loudly that kids these days are destroying the English language, you may not need to read this book. But you might want to give it to your relatives.</div>
</div>
<h2>Computing Education and the ACM Paywall (2013-08-25)</h2>
Recently <a href="http://computinged.wordpress.com/2013/08/08/acm-paywall-and-education-research/" target="_blank">Mark Guzdial wrote a blog post</a> in which he describes some of the particularities of research in computing education, and defends the continued paywalling of ACM articles in the Digital Library. Just to be clear, Mark is brilliant and friendly, and he does fantastic work. But I think he's mistaken on this particular issue.<br />
<br />
Here is Mark's argument, to reduce it to bullet points:<br />
<ul>
<li>CS Ed research is typically not funded by public funding agencies, but done on researchers' own time, so the argument that it should belong to the public does not hold.</li>
<li>Educators working in the developing world have different needs than those in the <a href="http://en.wikipedia.org/wiki/Psychology#Systemic_bias" target="_blank">WEIRD</a> world; we can't simply toss papers over the wall and let them figure it out.</li>
<li>... and anyway, the ACM is basically good people, and doing good work with the money it collects, especially for the education community.</li>
<li>Ergo, the ACM should keep up its paywall.</li>
</ul>
Early in his post, Mark brings up the first sentence from the <a href="http://teardownthispaywall.appspot.com/" target="_blank">Tear Down This Paywall</a> petition: "Computer science research is largely funded by the public, for the public good." He points out that lots of CS Ed research isn't supported by grants, and that people who are primarily educators do it on their own time, because it is important to them.<br />
<br />
So firstly, Mark's own work <i>is</i> funded by the NSF (as he mentions), so the argument about funding would apply to his work, along with the bulk of CS research broadly. But even if we accept that the public can't <i>demand</i> access to the other CS Ed papers, we should consider: what's best for the careers and goals of the CS Ed researchers themselves? What do <i>they</i> want?<br />
<br />
Certainly CS Ed researchers trying to publicize their work -- people who care so much about it that they take it on as a labor of love -- would prefer to reach the broadest possible audience. They don't directly benefit from a paywall. They may like the ACM and want it to continue putting on events, but the paywall keeps them from readers.<br />
<br />
But Mark takes a bizarre turn in framing the idea of dismantling the DL's paywall as <i>forcing</i> open access on unsuspecting researchers who didn't agree to it, "after the fact". OA wasn't part of the deal! He says in the comments, "Certainly, volunteers can volunteer the fruits of their labors. They shouldn't be coerced. It shouldn't be a requirement." It's hard to imagine a young researcher protesting a larger audience. People don't choose to publish with the ACM because of the paywall on the DL, <a href="http://penguinparens.blogspot.com/2013/07/acms-optional-open-access-is.html" target="_blank">but in spite of it</a>. For many subfields, ACM conferences are simply where one must publish to be taken seriously, and dealing with the paywall is the cost of doing business.<br />
<br />
As for the second point, about researchers and educators in the developing world -- while it is almost certainly not sufficient to release our papers if our goal is to help them develop their own curricula, it's verging on paternalistic to decide ahead of time what would and would not be helpful for them. Make the papers broadly available and let them decide what is relevant and useful. And by all means, we should develop other materials too, but this is a separate pursuit.<br />
<br />
We find educators, working programmers, interested laypeople, and researchers from other disciplines in a similar boat -- they may not have the context to completely understand a paper intended for specialists, but they can still get something out of it. And to collaborate meaningfully with -- or join -- the specialist community, they're going to have to read lots of papers. We should reduce the barriers to entry for potentially-interested people, wherever they are. Working programmers and educators are empirically short on both time and ACM memberships.<br />
<br />
So for most computing research, we are still seeing publicly funded work made harder to access than it should be. And for CS Ed research, we see work that researchers might want widely distributed made less available than it could and should be. Opening the DL would be an immense good for people around the world -- it's great that Mark and others put in the additional effort to make their personal papers available, but not everyone is so conscientious, or so web-savvy, or so still alive. And the current state of affairs still requires that people go hunt down each paper individually.<br />
<div>
<br /></div>
It would be silly to claim that the ACM doesn't need a revenue stream, and I think their continued existence is probably a good thing. But there are other funding models for scholarly societies. The current state of affairs is comfortable for Mark and other established researchers, but it could be much better for the up-and-coming looking for a broad audience, as well as for interested parties outside of well-funded academic institutions.<br />
<br />
<h2>ACM's optional Open Access is effectively a NOOP (2013-07-28)</h2>
Not all academics have the great moral luck to be working in NLP, where <a href="http://aclweb.org/anthology/">almost everything we publish</a> is going to be Open Access whether we care about OA or not -- barring <a href="http://www.cicling.org/">some out-of-the-way venues</a> that really need to get their acts together.<br />
<br />
For example, <a href="http://www.cs.indiana.edu/~lkuper">Lindsey Kuper</a> (both my favorite programming languages researcher and my wife) just put in a paper at the <a href="http://hiperfit.dk/fhpc13.html">Functional High-Performance Computing</a> workshop at <a href="http://icfpconference.org/icfp2013/">ICFP</a>. And roughly five minutes after she got the acceptance notification, she got the form to sign over publishing rights to the ACM.
<br />
<br />
Now <a href="http://authors.acm.org/main.html">the ACM has recently made open-access publishing available</a> through their Digital Library -- for $1100 to $1700, depending on the circumstances. I’m not opposed to APCs (“article processing charges”) as such; this <i>seems</i> like a step in the right direction. But I’ll argue that this particular approach is effectively a no-op.
<br />
<br />
It was unclear to Lindsey’s advisor whether they could pay the Open Access fee out of their grant money -- and while he’s a great, upstanding guy, he’s also a young pre-tenure professor, so he didn’t have a lot of spare time to look into this. He’s trying to <i>do some science</i>, not get bogged down in policy details. They went with the “retain copyright, but the DL copy won’t be OA” option. I imagine this scenario will be pretty typical.<br />
<br />
So this new policy effectively won’t change anything for the ACM’s Digital Library: all old papers are still locked down, and for most of the new ones, the authors won’t fork over the money for the OA option.
<br />
<br />
It’s a giant missed opportunity; the Digital Library could be a phenomenally useful resource. But for people without ACM membership or institutional access -- e.g., almost every working programmer -- the situation is the same as before. If you accidentally click on a link to the DL, that’s just a momentary dead end. Hopefully you can find the paper somewhere else.
<h2>NAACL 2013 review (2013-06-23)</h2>
Just recently, I was in Atlanta for <a href="http://naacl2013.naacl.org/" target="_blank">NAACL</a>. So much fun! The hallway track is always the best -- I saw a bunch of friends from the NLP world, and especially a lot of Googlers, and met a bunch of new people! Also I managed to be present for <a href="http://www.cs.utexas.edu/~mooney/" target="_blank">Ray Mooney</a> and <a href="http://luthuli.cs.uiuc.edu/~daf/" target="_blank">David Forsyth</a> and some other professors disagreeing animatedly about internal representations of meaning and to what extent you need to take the <a href="http://en.wikipedia.org/wiki/Intentional_stance" target="_blank">intentional stance</a> with respect to other people.<br />
<br />
Lots of really interesting papers this time around. There is of course <a href="http://nlpers.blogspot.com/2013/06/my-naacl-2013-list.html" target="_blank">Hal Daumé's expert opinion about the interesting papers at the main conference</a> -- I saw a lot of those same talks, having mostly been hanging out at the machine translation and syntax/parsing tracks. On a personal note, it's exciting to see people I know and have worked with getting mentions on Hal's blog! (so, congratulations <a href="http://aclweb.org/anthology-new/N/N13/N13-1138.pdf" target="_blank">Greg Durrett and John DeNero</a> and <a href="http://aclweb.org/anthology-new/N/N13/N13-1092.pdf" target="_blank">Juri Ganitkevitch</a>!)<br />
<br />
Additionally, here's what I thought was cool:<br />
<ul>
<li><a href="http://aclweb.org/anthology/N/N13/N13-1013.pdf" target="_blank">Training Parsers on Incompatible Treebanks</a> by Richard Johansson. You want to build a parser for your language. And you've got a treebank. No! You've got <i>two</i> treebanks. Even better, right? But what if those two treebanks use entirely different annotation schemes? ...</li>
<li>In the invited talk on Wednesday, <a href="http://www.cs.columbia.edu/~kathy/" target="_blank">Kathy McKeown</a> talked about, among other things, the idea that as NLP people we can provide evidence for or against ideas in comparative literature or literary theory, in collaboration with literature folks -- "well, the theory is that narrative works like this -- let's check!"</li>
<li>At <a href="http://clic2.cimec.unitn.it/starsem2013/" target="_blank">*Sem</a>, but also in the main conference, people are talking about using richer, more structured semantic models in our applications again. The really major change in the field in the early 1990s was to <i>not</i> do this -- but now we've got bigger computers and more data, and as a community we know a lot more about stats! Kevin Knight and his group are launching their <a href="http://amr.isi.edu/" target="_blank">Abstract Meaning Representation project</a> ("It's like a treebank, but for <a href="http://www.isi.edu/natural-language/people/arresting.pdf">semantics</a>.") -- maybe it'll work this time!</li>
<li>Also at *Sem, Yoav Goldberg talked about the unreasonably enormous <a href="http://googleresearch.blogspot.com/2013/05/syntactic-ngrams-over-time.html" target="_blank">Syntactic Ngrams dataset</a> -- it's basically chunks of parse trees from the English part of the Google Books corpus, indexed by time. That's going to be super useful.</li>
<li>I popped in to some of the <a href="https://sites.google.com/site/clfl2013/" target="_blank">Computational Linguistics for Literature</a> talks -- <a href="https://research.cc.gatech.edu/inc/mark-riedl" target="_blank">Mark Riedl</a>'s invited talk about programmatically generating stories for games (<a href="http://www.cc.gatech.edu/~riedl/talks/naacl-cll-ws.pdf" target="_blank">slides</a>) was especially good!</li>
<li>SemEval! There were <a href="http://www.cs.york.ac.uk/semeval-2013/index.php?id=tasks" target="_blank">fourteen different tasks</a> -- lots of different aspects of understanding text! And people are using all these wildly different techniques to do it. An introductory talk about a task and then a single presentation about a system for performing that task is not always enough to really understand the problem, though...</li>
<li>I think my presentation went pretty well! People I've been citing for a while were at my talk, and people seemed engaged and asked good questions! (<a href="http://www.cs.indiana.edu/~alexr/pubs/hltdi-semeval-2013-slides.pdf" target="_blank">slides</a>, <a href="http://www.aclweb.org/anthology/S/S13/S13-2031.pdf" target="_blank">paper</a>)</li>
</ul>
<div>
Alright! So now, full of encouragement and ideas -- back to work.</div>
<h2>happy hardware review: Samsung Chromebook (2013-05-31)</h2>
For the past two months, my primary laptop has been a Chromebook! The little $249 Samsung ARM one.<br />
<br />
I'm really enjoying it, for a number of reasons. First off, you can do quite a few things from ChromeOS -- it's really surprising how much time we spend in a browser these days. ChromeOS is lovely and simple and it pretty much Just Works.<br />
<br />
But the thing that makes it really work for me is <a href="https://github.com/dnschneid/crouton/" target="_blank">Crouton</a>, the semi-official way to run Ubuntu alongside ChromeOS, from champion Googler David Schneider. It's really really easy to install. You just put your Chromebook in <a href="http://www.chromium.org/chromium-os/developer-information-for-chrome-os-devices/samsung-arm-chromebook" target="_blank">developer mode</a>, which gives you a shell in ChromeOS, then you run the crouton shell script, and it sets up a really minimal Ubuntu, with XFCE by default. It is exactly what you (or at least I) want. Once that's set up, it's a quick key combination to switch between ChromeOS and Ubuntu. The Ubuntu runs in a chroot, but by default your user's Downloads directory is shared with ChromeOS, which is brilliant.<br />
<br />
The hardware is lovely, especially considering the price. It's so slim, and it feels pretty well built. And it fits in my tiny running-style backpack. The keyboard is pleasantly clicky in the way a chiclet keyboard can be. My one complaint about the hardware is that the button built into the trackpad sometimes sticks, but it seems to be doing that less and less -- maybe it just has to get broken in.<br />
<br />
But here's the biggest thing: the battery lasts roughly forever. I'll carry it around all day, work on it for hours in a coffee shop, and I've still got plenty of battery left. It gets at least six hours.<br />
<br />
So if what you want is a little laptop with a nice keyboard and long battery life that makes it easy to run a proper Linux (in addition to the bells-and-whistles ChromeOS things)... you could do a lot worse than the Samsung Chromebook.<br />
<br />
<h2>language technology for Paraguay (2013-03-31)</h2>
Earlier this month, we went to Paraguay. Why'd we do that?<br />
<br />
Paraguay is the only country in the Americas where most of the population is bilingual in a European language and an indigenous one. Paraguayans, the majority of them anyway, really do speak <a href="http://en.wikipedia.org/wiki/Guaran%C3%AD_language">Guarani</a>, or depending on context, a mixture of Spanish and Guarani called <a href="http://en.wikipedia.org/wiki/Jopar%C3%A1">Jopará</a>.<br />
<br />
While we were there, we talked with Guarani-language teachers at the institutes where they're training translators and linguists. They let us sit in on some of the classes and talk with the students. We're working on building them a computer-assisted translation webapp that will help us collect lots of bilingual text! This is going to be huge.<br />
<div>
<br /></div>
<div>
While we were in the area, we also talked with the local One Laptop Per Child folks; there's an OLPC installation in the little town of Caacupé, which is near Asunción. The OLPC folks said that we should probably go visit local grade schools. Which we did!</div>
<div>
<br />
It hit me, after the first visit to the schools, how much we were making use of our foreign-white-scientist privilege. We didn't have anything to do with the OLPC project -- aside from a desire to collaborate -- but here we were, wandering into schools without so much as a release form, talking to the kids. I'm trying to imagine Paraguayan scientists coming to the US to observe technology use among los niños estadounidenses.</div>
<div>
<br /></div>
<div>
The really interesting thing here: the kids in Caacupé weren't so surprised to see foreign scientist-looking guys coming to talk with them. They were really friendly, and eager to show off what they could do with the laptops! I get the impression this happens fairly frequently.</div>
<div>
<br /></div>
<div>
So there's a lot of stuff that needs to get built, to make computer use in Guarani more pleasant.</div>
<div>
<ul>
<li>At a very basic level, it's hard for people to type the diacritics that you need for Guarani, if your keyboard layout is set to Spanish. The diacritics for Guarani actually aren't that weird; they've got tildes on some vowels, but you see that in Portuguese too.</li>
<li>There's no good spellchecker. Guarani morphology is pretty complicated, so this is not an easy thing to build. But we know a guy who's working on it...</li>
<li>Text-to-speech. The kids in the schools have text-to-speech for Spanish, and they <i>love</i> it! There's a program on the OLPC where you can send messages to a friend's computer, and the receiving computer will speak your message. It's hilarious. But it doesn't work for Guarani. And as you get further out into the country, the kids are more likely to be monolingual Guarani speakers...</li>
<li>The computer-assisted translation website: we're working on it. I'll write more about this soon...</li>
</ul>
I gave a talk about all of this at the computational linguistics seminar: <a href="http://tinyurl.com/alexr-clingding-guarani">here are the slides</a>!</div>
<h2>ACM Publications Board: "How can we minimally budge so you'll stop bothering us about open access?" (2013-02-09)</h2>
You may have seen the recent article from CACM, <a href="http://cacm.acm.org/magazines/2013/2/160170-positioning-acm-for-an-open-access-future/fulltext">Positioning ACM for an Open Access Future</a>. I found the article fairly upsetting. The first paragraph...<br />
<blockquote class="tr_bq">
The age of open access is upon us. Increasingly, the consensus of authors of research articles and their funding institutions is that the fruits of taxpayer-supported research should be freely available to the public. This is a compelling argument and a noble goal.</blockquote>
<i>However, we're not going to do that anytime soon!</i> They then launch into a red-herring discussion of predatory OA publishers (which are a real thing! there do exist vanity presses that have sprung up to capitalize on the OA trend) -- but this ignores how such publishers come about. You don't <i>accidentally</i> become such a vanity press. Serious venues with good review boards won't have the problem of "a glut of third-rate publications that add noise rather than insight to the scientific enterprise".<br />
<div>
<br />
Afterwards, they discuss four different approaches for the ACM to not go fully OA: the first is that, optionally, authors could pay an extra fee to have their articles available from the Digital Library. The other three are simply ways in which the paywall restrictions could be lifted under some circumstances.<br />
<br />
None of these are acceptable. Not if we believe that "the fruits of taxpayer-supported research should be freely available to the public". Why is the current situation even sort of OK?<br />
<br />
The ACM's resistance to OA so far, its claims that figuring out a way to do it is too hard, that it's too expensive or will lead to bad publications -- as far as I can tell, these mean at least one of two things:<br />
<ul>
<li>Perhaps the ACM is not as clever as <a href="https://www.usenix.org/">USENIX</a>, the <a href="http://aclweb.org/anthology-new/">Association for Computational Linguistics</a>, <a href="http://books.nips.cc/">NIPS</a> and <a href="http://en.wikipedia.org/wiki/Journal_of_Machine_Learning_Research">The Journal of Machine Learning Research</a> (ie, most of the machine learning community), and all of <a href="http://gowers.wordpress.com/">those math folks</a>...</li>
<li>Or alternatively, people within the ACM, despite its apparent status as a non-profit, <i>really</i> like all that money; they like it a lot more than the <a href="http://www.acm.org/about/history">purported mission</a> to foster "the open interchange of information..." </li>
</ul>
Which one of these is true? Both?<br />
<br />
<div>
We need to get the ACM to stop thinking like a for-profit publisher and start thinking like their goal is to move the field forward and educate people. The ACM needs to drop both the paywall and its <a href="http://blog.acm.org/president/?p=67">membership in the AAP</a>.</div>
<div>
<br /></div>
<div>
Either that, or we as computing professionals need to drop the ACM.</div>
</div>
Anonymoushttp://www.blogger.com/profile/12622189575214725040noreply@blogger.com1tag:blogger.com,1999:blog-34619686.post-17526829446032050102013-01-31T15:50:00.000-05:002013-01-31T15:50:27.304-05:00stupid NLP tricks: haikubot!At Indiana, we have an IRC server where many of the CS students hang out. I keep myself logged in all the time, in a <a href="http://www.gnu.org/software/screen/" target="_blank">screen</a> session that I'm usually not looking at. But even while I'm not watching, the chat logs just pile up. So I have megabytes of text from the conversations with my friends and colleagues. The clear use for this resource is, of course, generating haiku [0].<br />
<br />
So I built a bot that does this! Here is my friend <a href="http://blog.theincredibleholk.org/" target="_blank">Eric Holk</a> requesting a poem generated from snippets of text that he actually said.<br />
<blockquote class="tr_bq">
< eholk> haikubot: eholk<br />
< haikubot> do you have enough / memory to cause that was / stored unless it knew</blockquote>
<div>
Alright, how does this work?</div>
<div>
<br /></div>
<div>
I use the IRC logs to train a <a href="http://en.wikipedia.org/wiki/Language_model" target="_blank">language model</a> from things that each user said [1], on demand. They're cached in case somebody asks for another haiku from that same user's model again, because it can take a few seconds to train the model. The language model is something one could easily implement (although smoothing can get a little tricky), but I just used the default bigram models from <a href="http://nltk.org/" target="_blank">NLTK</a>.</div>
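The bot uses NLTK's bigram model, but the idea is small enough to sketch from scratch. This toy stand-in (not the NLTK API, and with no smoothing) just counts which words follow which, then samples a follower in proportion to those counts:

```python
import random
from collections import Counter, defaultdict

def train_bigrams(lines):
    """Count, for each word, the words observed to follow it."""
    following = defaultdict(Counter)
    for line in lines:
        words = line.split()
        for left, right in zip(words, words[1:]):
            following[left][right] += 1
    return following

def sample_next(following, word):
    """Pick a follower of `word`, weighted by bigram count."""
    counts = following[word]
    r = random.randrange(sum(counts.values()))
    for candidate, count in counts.items():
        r -= count
        if r < 0:
            return candidate

model = train_bigrams(["do you have enough memory", "do you know"])
print(sample_next(model, "do"))  # always "you" in this toy corpus
```

A real version would also need smoothing so that unseen pairs don't dead-end the sampler, which is exactly the part NLTK handles for you.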
<div>
<br /></div>
<div>
We also need to be able to count syllables: this is pretty straightforward using the <a href="http://www.speech.cs.cmu.edu/cgi-bin/cmudict" target="_blank">CMU Pronouncing Dictionary</a>, which also comes with NLTK. If we don't have a stored pronunciation for a word, we back off to some heuristics about English that try to count the number of vowels (with some special rules for English spelling particularities); this comes from <a href="https://github.com/nltk/nltk_contrib" target="_blank">nltk_contrib</a>.</div>
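The syllable counting is easy to illustrate: in CMUdict's ARPABET pronunciations, vowel phones carry a stress digit (0, 1, or 2), so counting phones that end in a digit counts syllables. Here's a sketch with a tiny hand-made dictionary standing in for NLTK's cmudict corpus, plus a vowel-cluster fallback in the spirit of the heuristics mentioned above (the real fallback in nltk_contrib has more special cases):

```python
PRONUNCIATIONS = {  # stand-in for nltk.corpus.cmudict.dict()
    "memory": ["M", "EH1", "M", "ER0", "IY0"],
    "cause": ["K", "AA1", "Z"],
}

VOWELS = "aeiouy"

def syllables(word):
    word = word.lower()
    phones = PRONUNCIATIONS.get(word)
    if phones is not None:
        # ARPABET vowel phones end with a stress digit.
        return sum(1 for p in phones if p[-1].isdigit())
    # Fallback heuristic: count maximal vowel clusters.
    count, prev_vowel = 0, False
    for ch in word:
        is_vowel = ch in VOWELS
        if is_vowel and not prev_vowel:
            count += 1
        prev_vowel = is_vowel
    return count

print(syllables("memory"))  # 3, from the dictionary
print(syllables("haiku"))   # 2, from the heuristic
```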
<br />
To actually generate a line of the haiku, we just sample words from the language model until we have the desired number of syllables. If that fails for some reason, we try again (up to 100 times), and eventually just back off to picking a word from a list of known 5-syllable or 7-syllable words.<br />
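That generate-and-retry loop looks roughly like this (a sketch, not the bot's actual code; the model and syllable counter are passed in as functions, and the back-off word list here is just an example):

```python
import random

def generate_line(sample_word, count_syllables, target, backoff, tries=100):
    """Sample words until they total exactly `target` syllables;
    on repeated overshoot, fall back to a single known-good word."""
    for _ in range(tries):
        words, total = [], 0
        while total < target:
            word = sample_word()
            total += count_syllables(word)
            words.append(word)
        if total == target:          # exact hit: keep the line
            return " ".join(words)
    return random.choice(backoff)    # e.g. known 5-syllable words

# With a stub "model" that only ever says "snow" (one syllable),
# the five-syllable line is five snows:
print(generate_line(lambda: "snow", lambda w: 1, 5, ["university"]))
```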
<br />
The IRC libraries and basic IRC bot structure are adapted from <a href="http://code.google.com/p/ircbot-collection/" target="_blank">ircbot-collection</a> by Ben Collins-Sussman et al [2].<br />
<br />
<a href="https://code.google.com/p/narorumo/source/browse/trunk/sleep-furiously/haikubot/" target="_blank">Here's the code for haikubot!</a> Hopefully this will be fun for somebody else too. If nothing else, it's a pretty good demonstration of some NLTK features working just fine on Python 3. It requires, at the moment, the <a href="https://github.com/nltk/nltk" target="_blank">tip of the NLTK trunk</a>.<br />
<hr />
[0] These are of course not proper <a href="http://en.wikipedia.org/wiki/Haiku" target="_blank">haiku</a> as such. Just 5-7-5 syllables.<br />
[1] There is some filtering to try to ignore, for example, code samples, and usernames at the start of lines.<br />
[2] Related! Ben Collins-Sussman's talk: "<a href="https://www.youtube.com/watch?v=QR7MTOSCgPs" target="_blank">How to Replace Yourself with an IRC Bot</a>"Anonymoushttp://www.blogger.com/profile/12622189575214725040noreply@blogger.com0tag:blogger.com,1999:blog-34619686.post-70737922281923806012012-12-27T00:42:00.000-05:002012-12-27T00:44:24.433-05:00reading: Computer Power and Human ReasonNot long ago, I got a copy of <a href="http://en.wikipedia.org/wiki/Joseph_Weizenbaum" target="_blank">Joseph Weizenbaum</a>'s <a href="http://en.wikipedia.org/wiki/Computer_Power_and_Human_Reason" target="_blank">Computer Power and Human Reason</a> from a pile of free books -- there's so much great stuff on the free books table when a professor retires!<br />
<div>
<br />
I’d recommend reading it if you’re into the ethical issues surrounding computing or the history of AI. The book was published in 1976; our relationship with computing has changed a lot since then, and that was probably the most striking thing about reading the book now.<br />
<br />
Weizenbaum was worried about the trust that society placed in the computer systems of the time. He describes situations in which people felt they were slaves to systems too complex to understand and too far removed from human judgement to be humane; examples include planning systems that told pilots where to bomb during the Vietnam War. But the systems had computers involved, and they were made by experts, so they must be right! "Garbage in; gospel out". And down came the bombs.</div>
<div>
<br />
I'd argue that we've become less dazzled by computers as such, that we no longer think of them as infallible. But perhaps we're less likely to think about the computers themselves at all. They've become ubiquitous, just the infrastructure that makes society work. My mother (a keen observer of technology) recently remarked that it's strange that we still call them "computers" when the point is to use them for communication. We may still have problems of blind obedience, but perhaps it's better understood as blind obedience to people.<br />
<br />
Similarly, Weizenbaum was concerned about the social power wielded by scientists, engineers, and other experts. To me, in the fair-and-balanced political climate, this sounds like a good problem to have: people used to listen to experts? Did they listen to experts when they said things that were politically inconvenient for those with money? Perhaps not...</div>
<div>
<br />
Computer Power and Human Reason also spends some time with the exuberant claims about AI from before the <a href="http://en.wikipedia.org/wiki/AI_Winter" target="_blank">AI Winter</a>. Herbert Simon said, "... in a visible future – the range of problems they [machines] can handle will be coextensive with the range to which the human mind has been applied", which was clearly somewhat premature. But we have made progress on a lot of fronts! Weizenbaum was quite skeptical that machine translation would be any good, despite claims (which he relates in the book) that MT really just needed more processing power and more data. A few decades later, MT is often pretty good! All it took was more processing power and more data.<br />
<br /></div>
<div>
There's also some beautifully strange writing. Towards the beginning, he spends a few chapters explaining how computers work, in a formal, abstract way. And then we get this:<br />
<blockquote>
Suppose there were an enormous telephone network in which each telephone is permanently connected to a number of other telephones; there are no sets with dials. All subscribers constantly watch the same channel on television, and whenever a commercial, i.e., an active interval, begins, they all rush to their telephones and shout either "one" or "zero," depending on what is written on a notepad attached to their apparatus. ...</blockquote>
I have trouble imagining that this metaphor has helped many people understand digital logic circuits; but I enjoyed reading the book! Perhaps you'd enjoy it as well.</div>
Anonymoushttp://www.blogger.com/profile/12622189575214725040noreply@blogger.com0tag:blogger.com,1999:blog-34619686.post-35403650784763766492012-07-07T02:37:00.000-04:002012-07-07T03:25:55.405-04:00this is something new and beautiful: Coursera and UdacityJust last week, I finished the coursework for Coursera's <a href="https://www.coursera.org/course/ml">machine learning class</a>. It was great! I had a really good time with it, and I'm fairly proud of the accomplishment.<br />
<br />
<div>
If you've been within earshot of me in the past few months, you probably know that I'm really excited about <a href="https://www.coursera.org/">Coursera</a> and <a href="http://www.udacity.com/">Udacity</a> and their ilk (including, but not limited to, <a href="http://www.edxonline.org/">edX</a>, <a href="http://www.khanacademy.org/">Khan Academy</a>, and <a href="http://duolingo.com/">Duolingo</a>). There are two experiences I'd like to contrast with taking a course on Coursera.<br />
<br />
Some years ago, I was living in Atlanta and <a href="https://developers.google.com/web-toolkit/">working a real job</a>. And I went over to the Georgia Tech math department to see about taking some masters-level statistics classes, imagining that they would let me pay them lots of money in exchange for taking classes at the university where I had just graduated months prior. But it turned out that they wouldn't let me do this without being admitted for a full-time degree program.<br />
<br />
Fewer years ago, I was starting my PhD at Indiana, and knowing exactly what I was there to learn, I picked out three classes: one on NLP, a computational linguistics class (from the Linguistics department), and one from Stats. I got a mild hassle from the department about my choices: these were all "fun" classes, and shouldn't I work on fulfilling my breadth requirements? I've since finished my IU coursework, and let me say: not all the classes I had to take as a result were very interesting, or even very well taught. Some were downright bad.<br />
<br />
But now there are free online courses that are meant to be good, and you take only the ones you're interested in taking -- as opposed to expensive in-person courses that may not be good but that you're obliged to take anyway. This is huge.<br />
<br />
Whether or not you think that teaching in person is going to stay relevant, not everybody has access to good teachers in person. This remains true even for people at universities.<br />
<br />
Moreover, online classes lower the barriers to entering or leaving a course to almost nothing. Want to sign up for a class just to try it out? Nothing could be easier! Don't enjoy it, or it's not what you thought it was, or find out you're busy with other stuff? Nothing lost, try a different one! But if you stick it out and put in the effort, then not only have you learned something, but also you get a certificate that says you finished! (maybe these could be <a href="http://openbadges.org/">OpenBadges</a> sooner or later...)<br />
<br />
There are going to be lots of bytes spilled about these things in the coming years, but just to make it clear: I'm jazzed about helping people who want to learn things get access to material about those things. And the <a href="https://www.coursera.org/course/worldmusic">World Music</a> class is starting up soon, which my mother and I are going to take! Because why not?</div>Anonymoushttp://www.blogger.com/profile/12622189575214725040noreply@blogger.com3tag:blogger.com,1999:blog-34619686.post-24886077592058542992012-06-30T23:30:00.000-04:002012-07-01T02:33:06.106-04:00happy hardware review: usb wireless adapter from ThinkPenguinI'm in Mountain View for the summer, working on <a href="http://translate.google.com/">Google Translate</a> for another internship with that company that I seem to work for fairly often. Hooray!<br />
<br />
Unfortunately, my laptop's built-in wireless card really doesn't agree with the apartment complex's wireless. So I ordered a little USB stick wireless adapter from <a href="https://www.thinkpenguin.com/">ThinkPenguin</a>, and it came pretty quickly, and I plugged it in (and told <a href="https://launchpad.net/wicd">wicd</a> to look at wlan1 instead of wlan0), and it just worked! Now my wireless connection is pretty fast, and doesn't drop every five minutes! (unlike before; it was a serious pain.)<br />
<br />
Particularly, I got <a href="https://www.thinkpenguin.com/gnu-linux/penguin-wireless-g-usb-adapter">this one</a>. Their other products may also be lovely. Thanks, ThinkPenguin!Anonymoushttp://www.blogger.com/profile/12622189575214725040noreply@blogger.com2Mountain View, CA, USA37.3860517 -122.083851137.335585200000004 -122.1628151 37.4365182 -122.0048871tag:blogger.com,1999:blog-34619686.post-78288465705570187092012-05-30T02:22:00.002-04:002012-05-31T00:01:15.977-04:00take five minutes: support open access<b>tl;dr: Sign this petition to support open access for publicly-funded research!! <a href="http://wh.gov/6TH">http://wh.gov/6TH</a></b><br />
<br />
Here's the situation: there's lots of scholarly work being done. And you, as a citizen of a country, are paying academics to do science (or whatever), write about it, and review the work of other scholars. The work that makes it through the reviewing process gets published, typically in a journal or at a conference. <br />
<br />
<div>
Here's the problem: a lot of that scholarly work is then inaccessible to you. You have to pay to read it, and often you have to pay a lot. If you're at a well-funded academic institution, your university library has to pay a lot. It's <a href="http://isites.harvard.edu/icb/icb.do?keyword=k77982&tabgroupid=icb.tabgroup143448">a serious problem for universities as wealthy as Harvard</a>. Where does this money go? It doesn't go to the academics who wrote the papers, or those who reviewed them: it goes to publishing companies with absurd profit margins who have trouble pointing at what value they add to the process, aside from happening to own prestigious journals.<br />
<br /></div>
<div>
Concretely, this is a problem for the independent researcher, for the small business developer-of-stuff who wants to get the latest developments, for the interested public who wants to read and learn and grow, for the precocious teenager. I've come to care kind of a lot about this issue, because I believe in science. Science is pretty important: it should get out to as many people as possible, not just because the citizens paid for it in the first place, but also so we can make progress faster.<br />
<br /></div>
<div>
The National Institutes of Health have famously set up an Open Access mandate: all the research that they fund must be available to the public pretty soon after it's published. Many universities are doing the same thing. The Association for Computational Linguistics (who run the conferences and journals where I'm personally likely to publish) <a href="http://aclweb.org/anthology-new/">do a bang-up job of making all of their articles publicly available</a>, and I'm really proud to be associated with them. But not every professional organization, and not every field's journal, is like this. Most are not!</div>
<div>
<br /></div>
<div>
How can you help? Right now, there's a petition on the White House website where you can ask the administration to expand the NIH-style mandate to other funding agencies: I'd really appreciate if you'd take a minute to make an account and sign the petition. Click here: <a href="http://wh.gov/6TH">http://wh.gov/6TH</a><br />
<br />
(<a href="http://penguinparens.blogspot.com/2007/08/this-blog-post-for-you-50.html">hrm, I seem to have written about this back in 2007 too</a>)</div>Anonymoushttp://www.blogger.com/profile/12622189575214725040noreply@blogger.com0tag:blogger.com,1999:blog-34619686.post-24593026826554222402012-05-10T16:24:00.000-04:002012-05-10T16:25:47.499-04:00command line tricks: ps, grep, awk, xargs, killI recently learned a little bit of awk; if it's not in your command line repertoire, it's worth looking into! awk lets you do things like this:<br /><br /><span style="font-family: 'Courier New', Courier, monospace;">$ ps auxww | grep weka | grep -v grep | awk '{print $2}' | xargs kill -9</span><br /><br />Let's unpack what's going on here. First, we list all processes, in wide format (that's the "auxww" options to ps), then we filter with grep to only include lines that include "weka". When I wrote this line, I was debugging a long-running machine learning task, so I would start it running, then if (when) I found a bug and wanted to restart, I used this command to kill it.<br /><br />Now we have all the lines of output from ps that include "weka". Unfortunately, this includes the grep process that's searching for "weka"! No problem, just use "grep -v" to filter out the lines that include "grep".<br /><br />This is where awk comes in. We want to get the process numbers out of the ps output. It seems like we could use cut to just get the second column, but we don't know how wide that column is going to be! Maybe there's a cut option for that, but I don't think there is. Instead, we just use a tiny awk script that prints the second whitespace-delimited thing on each line.<br /><br /><div>
Finally, we use xargs to take the process numbers and make them arguments to kill. xargs is great: it takes each line of its standard input and passes those lines as arguments to a program (ie, its first argument). Usually I use xargs in combination with find, "svn status", or "git status -s", to do the same thing to batches of files. Maybe delete them or add them to version control or whatever.<br /><br />Thoughts? Better ways to do this sort of thing?</div>Anonymoushttp://www.blogger.com/profile/12622189575214725040noreply@blogger.com5tag:blogger.com,1999:blog-34619686.post-57515246438430703252012-04-28T01:37:00.003-04:002012-04-28T02:47:51.669-04:00startlingly bad moments in API design<span style="background-color: white; font-family: arial, sans-serif; font-size: 13px; line-height: 18px;">Weka, the machine learning toolkit, has these nice filters that let you change what's in a data set, maybe the features on the instances, or the instances themselves. Pretty useful. One is called "Remove", and it removes features. </span><span style="background-color: white; font-family: arial, sans-serif; font-size: 13px; line-height: 18px;">Here's a case in Weka where order matters when you're setting up the parameters for an object.</span><br />
<div class="rXnUBd huFEdf" style="background-color: white; font-size: 13px; line-height: 18px;">
<br />
<span style="font-family: arial, sans-serif;">Like so: this does not remove any features.</span><br />
<blockquote class="tr_bq">
<span style="font-family: 'Courier New', Courier, monospace;">Remove remove = new Remove();<br />remove.setInputFormat(instances);<br />remove.setAttributeIndices("7,10,100");<br />remove.setInvertSelection(true); // delete the other ones.<br />Instances out = Filter.useFilter(instances, remove);</span></blockquote>
<span style="font-family: arial, sans-serif;">This works just fine, though:</span><br />
<blockquote class="tr_bq">
<span style="font-family: 'Courier New', Courier, monospace;">Remove remove = new Remove();<br />remove.setAttributeIndices("7,10,100");<br />remove.setInvertSelection(true); // delete the other ones.<br />remove.setInputFormat(instances);<br />Instances out = Filter.useFilter(instances, remove);</span></blockquote>
<span style="font-family: arial, sans-serif;">How are you supposed to find that out?</span></div>Anonymoushttp://www.blogger.com/profile/12622189575214725040noreply@blogger.com1tag:blogger.com,1999:blog-34619686.post-74086503939695423302012-03-29T00:28:00.000-04:002012-03-29T00:28:05.000-04:00quals writeup: Tree Transducers, Machine Translation, and Cross-Language DivergencesI hope it's not too pretentious to put things I'm writing for my PhD qualifiers on arXiv. I think arXiv is really exciting, by the way. Leak your preprints there! Also pretty exciting: tree transducers for machine translation.<div>
<br /></div>
<div>
Abstract:<br /><blockquote class="tr_bq">
Tree transducers are formal automata that transform trees into other trees. Many varieties of tree transducers have been explored in the automata theory literature, and more recently, in the machine translation literature. In this paper I review T and xT transducers, situate them among related formalisms, and show how they can be used to implement rules for machine translation systems that cover all of the cross-language structural divergences described in Bonnie Dorr's influential article on the topic. I also present an implementation of xT transduction, suitable and convenient for experimenting with translation rules.</blockquote>
Paper! <a href="http://arxiv.org/abs/1203.6136">http://arxiv.org/abs/1203.6136</a><br />
<br />
Software! <a href="https://github.com/alexrudnick/kurt">http://github.com/alexrudnick/kurt</a></div>Anonymoushttp://www.blogger.com/profile/12622189575214725040noreply@blogger.com0tag:blogger.com,1999:blog-34619686.post-66458976345344430652012-01-19T01:15:00.002-05:002012-01-19T01:17:30.726-05:00and we're back<a href="https://www.eff.org/deeplinks/2012/01/internet-its-best">Well done, Internets!</a><br />
<br />
In solidarity with everybody doing the #j18 protests against SOPA and PIPA, I blacked out this blog and my academic web page; I'd be really surprised if this directly caused anybody to call any legislators. My email to my family on the topic was probably more effective: one of my uncles wrote back, saying he'd signed a petition. So: rad!<br />
<br />
The protests seem to have been incredibly loud and fairly effective. At this point, a congressperson would have to be incredibly dense to not get the sense that the public outcry against censoring the Internet in the US is enormous. A number of Republicans, including some former co-sponsors, have taken the opportunity to switch to opposing the bills, which seems politically expedient. (<a href="http://www.dailykos.com/story/2012/01/18/1056100/-SOPA,-and-the-idiocy-of-Democrats">article on DailyKos</a> about this. Kos wonders why Democrats seem to be willing to be left holding the bag...)<br />
<br />
But even if we manage to get SOPA and PIPA scrapped, we're still left with two fundamental problems.<br />
<br />
(1) The MPAA and RIAA can try to break the Internet again later, because they'll still own a significant number of congresspeople. Say if <a href="http://keepthewebopen.com/">OPEN</a> picks up steam and gets passed, will that be enough for them? The music and movie industries have fought tooth-and-nail against new technologies for decades; what's to stop them from taking another run against the Internet, or against whatever we have in the future? How can we reduce their money and influence, over time? As an angry Internet activist, getting all of your family and friends to boycott all media produced by major labels and studios seems extraordinarily hard. I must admit: I totally bought an Andrew W.K. CD not too long ago, and we have Netflix at our house. Should we cancel it?<br />
<br />
(2) More fundamentally: <i>large companies can own congresspeople</i>. How can we take control of Congress, as citizens? Note that I didn't say "take <i>back</i> Congress" -- there's been a disconcerting connection between money and power for our entire history. If you haven't read Howard Zinn's <a href="http://www.historyisaweapon.com/zinnapeopleshistory.html">A People's History of the United States</a>, I highly recommend it.<br />
<br />
People much more insightful and more dedicated than me have written quite a lot about this, but I suspect the solution really is campaign finance reform. Simply taking away the incentives to make awful decisions, while encouraging good behavior that <i>at least some people</i> like would probably result in a Congress that... makes fewer awful decisions and has a double-digit approval rating. I think term limits would also be useful, so that congresspeople don't have to worry about re-election so often (although, how to incentivize good behavior in the last term? ...), and some sort of rules to keep former congresspersons from becoming lobbyists, so we can prevent the Chris Dodd situation from happening again. He still goes by "Senator Dodd", but this year he quit the senate to become the head of the MPAA.<br />
<br />
Oh, also: Super PACs, and the Citizens United decision.<br />
<br />
However we move forward: today, a large number of people who had never tried to contact their elected representatives, now have. The more often we do it, the lower the psychological barrier! There are even phone apps: (<a href="http://sunlightlabs.com/blog/2009/congress-theres-an-android-app-for-that/">android</a>, <a href="http://realtimecongress.org/">iphone</a>). I use the Android one all the time; it's extremely convenient.<br />
<br />
Right. Anyway. Stop reading this blog, and go read what Lawrence Lessig has to say. I'll get back to doing some computational linguistics. Maybe you should too!Anonymoushttp://www.blogger.com/profile/12622189575214725040noreply@blogger.com0tag:blogger.com,1999:blog-34619686.post-26710438478038410362011-12-31T21:10:00.001-05:002013-08-31T21:03:20.980-04:00reading: Readings in Machine TranslationPost before the end of the year!<br />
<br />
I'm really enjoying <a href="http://mitpress.mit.edu/catalog/item/default.asp?tid=8799&ttype=2">Readings in Machine Translation</a> -- it's got all of these great MT papers from past decades, going from the <a href="http://www.mt-archive.info/Weaver-1949.pdf">Warren Weaver memo from 1949</a> to the Brown et al. paper where they make stat-mt fashionable again in the early 90s. Apparently, a lot of the papers in the volume were somewhat hard to find online in 2003.<br />
<br />
Really interesting: so far, the early papers have had some very detailed descriptions of the low-level particulars. "Well, we're going to need this many memory drums...", "oh, and the words will be stored in memory in alphabetical order" (which seems very archaic), and a fixation on picking the right word in the target language, in sort of a word-sense disambiguation sense (which is slightly fashionable again!).<br />
<br />
So for people into MT who want a sense of history, these are papers that it seems like one should read -- I mean, Sergei Nirenburg and friends picked them out, so they've got to be good, right?<br />
<br />
If you haven't read the 1949 Warren Weaver memo, though -- even if you're not an NLP person -- do yourself a favor and go ahead and read it!Anonymoushttp://www.blogger.com/profile/12622189575214725040noreply@blogger.com0tag:blogger.com,1999:blog-34619686.post-15188734596765536762011-11-30T22:19:00.001-05:002011-11-30T23:06:45.817-05:00securing your MoinMoin wikiI really like <a href="http://moinmo.in/">MoinMoin</a>; it's straightforward, it's Pythonic, it's got WikiWords. Great! I've been keeping a <a href="http://hackmode.org/wiki">bunch of notes on one</a>.<br />
<br />
The IU Computational Linguistics group had an aging MoinMoin too, but most of the edits and new accounts were spam. I suspect the edits were being done by humans too, because they were pretty good at fitting into the markup of existing pages.<br />
<br />
We replaced it with a new install (<a href="http://cl.indiana.edu/wiki/FrontPage">here it is</a>) and made it a bit more secure. We <a href="http://moinmo.in/HelpOnAccessControlLists#Usage_Examples">disallowed anonymous edits</a>, and made it so that you can't just arbitrarily create accounts, which took a code change. I added the "if not request.user.isSuperUser() ..." block (<a href="http://moinmo.in/FeatureRequests/DisableUserCreation">suggested here</a>) to MoinMoin/action/newaccount.py. The rest of the changes described on that page aren't necessary -- just make it refuse new account requests.<br />
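For reference, the no-anonymous-edits half is just an ACL line in the wiki config. Something like this (a sketch -- the superclass import varies a bit across Moin versions, and the exact rights string is up to you):

```python
# wikiconfig.py (MoinMoin 1.x) -- site configuration fragment
from MoinMoin.config import multiconfig

class Config(multiconfig.DefaultConfig):
    # Logged-in ("Known") users may edit; everyone else may only read.
    acl_rights_default = u"Known:read,write,delete,revert All:read"
```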
<br />
Then it occurred to me: spammers have probably been trying to spam my personal wiki too! I checked: there were about a hundred spam accounts; a new one every day or two. My Moin was allowing arbitrary account creation, but the accounts were all useless because only I could edit pages!<br />
<br />
So in the interest of discouraging future webcrap, let me issue a warning: CyrilleVincent, SabaFaulkner, CasinoBonus, life insurance quotes, and ChickyBowen -- I'm coming for you. And you'd best sleep with one eye open, paydayloansuk214. <i>If that's your real name.</i>Anonymoushttp://www.blogger.com/profile/12622189575214725040noreply@blogger.com0tag:blogger.com,1999:blog-34619686.post-65439081529928778112011-10-09T00:04:00.001-04:002013-08-31T21:07:24.713-04:00reading: Religious Literacy by Stephen ProtheroNot too long ago, I read the very thought-provoking <i>Religious Literacy: What Every American Needs to Know and Doesn't</i>, by <a href="http://www.stephenprothero.com/">Stephen Prothero</a>, likely because it came <a href="http://parentingbeyondbelief.com/blog/?p=1842">recommended by Dale McGowan</a>. The discussion of the history of religious education in the US is fantastic.<br />
<br />
The main argument of the book is that, until recent decades, we knew quite a lot about Protestantism, through instruction at home, in churches, and even in the school system. People apparently used the verb "catechize", as something you did to children. But this knowledge of what's actually in the Bible, and of actual church doctrine, is these days largely lost on us in the US, even though we're very caught up in our religious identities and church attendance is huge.<br />
<br />
Political discourse is full of religious allusions, but we often don't get the references. I would have appreciated more examples of problems that this causes in practice -- is it really an issue for being a citizen in a democracy?<br />
<br />
Prothero notes, early on, that Europeans are much less religious than Americans but more familiar with religious content. Dale McGowan says "faith is most easily sustained in ignorance"; knowing a thing or three about a few different religions makes it easier to not get caught up in any of them. Is the major problem caused by religious ignorance that it makes it easier for preachers and politicians to jerk people around by telling them that God says thus-and-such?<br />
<br />
While Prothero doesn't address the question of why the more broadly-educated Europeans don't tend to be churchgoers, he does put forth a policy suggestion, that our curricula should have more information about the world's religions. And while I agree that it's probably a good idea, he doesn't say much about the sorts of changes we might see, with better religious studies education. I must admit, I have a hard time thinking of education-about-religion as anything except a strategy against the influence of seemingly-devout people.<br />
<br />
Perhaps I'll pick up his more recent book, <i>God Is Not One</i>, about the fundamental disagreements between different religions, contrasting with the framing you'd get from <a href="http://en.wikipedia.org/wiki/Huston_Smith">Huston Smith</a> or <a href="http://en.wikipedia.org/wiki/Karen_Armstrong">Karen Armstrong</a>, who argue that different religions are grasping towards the same fundamental truth. I'm really curious about his personal position, because Prothero identifies himself as an Episcopalian, but hasn't thus far talked about any particular benefits of people believing any particular thing.Anonymoushttp://www.blogger.com/profile/12622189575214725040noreply@blogger.com0tag:blogger.com,1999:blog-34619686.post-32032958307218454782011-07-31T23:15:00.000-04:002011-08-01T02:15:26.018-04:00cross-lingual word sense disambiguationHave I mentioned what I've been working on recently? Maybe I haven't.<br />
<br />
In general, I'm working on cross-lingual <a href="http://en.wikipedia.org/wiki/Word-sense_disambiguation">word- and phrase-sense disambiguation</a>. WSD/PSD is the problem of deciding, for a given word or phrase, which meaning was intended, from some pre-defined set of meanings. You might get the possible senses out of a dictionary, where they're nicely enumerated, or perhaps from WordNet. The stock example is "bank" -- is it the side of a river, or is it a building where they do financial services? Or is it the abstract financial institution?<br />
<br />
There's a brilliant bit from the prescient Warren Weaver, from his 1949 memorandum on translation (<a href="http://aclweb.org/anthology-new/J/J98/J98-1000.pdf">via</a>):<br />
<br />
<blockquote>If one examines the words in a book, one at a time as through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of the words . . . . But if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then if N is large enough one can unambiguously decide the meaning of the central word . . . . The practical question is: "What minimum value of N will, at least in a tolerable fraction of cases, lead to the correct choice of meaning for the central word?"</blockquote><br />
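As a toy illustration of Weaver's "slit in the opaque mask" (my own sketch, not code from any system discussed here), here's how you might pull out the N words on either side of a target word:

```python
def context_window(tokens, index, n):
    """Return the up-to-n tokens on each side of tokens[index]:
    Weaver's slit of width 2n+1, minus the central word itself."""
    left = tokens[max(0, index - n):index]
    right = tokens[index + 1:index + 1 + n]
    return left, right

sentence = "we walked along the bank of the river at dusk".split()
left, right = context_window(sentence, sentence.index("bank"), 3)
# left  -> ['walked', 'along', 'the']
# right -> ['of', 'the', 'river']
```

Weaver's "practical question" then becomes an empirical one: how big does n have to be before these windows reliably distinguish the senses?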
The "cross-lingual" kind of WSD means that we care about exactly the distinctions that cause you to pick a different word in a given target language, typically because the CLWSD system is meant to be integrated into an MT system; <a href="http://aclweb.org/anthology-new/D/D07/D07-1007.pdf">that's becoming fashionable</a> (Carpuat and Wu, 2007). So in this setting, if, say, you're translating "bank" from English into Spanish, your system doesn't have to decide whether it's the building or the institution that owns it -- either way, it's "banco". A riverbank, though, is an "orilla".<br />
<br />
In the general case, your system might end up learning to make distinctions that you as a human didn't know you had to make -- for example, I'm given to understand that Japanese doesn't have just one word for "brother": "older brother" and "younger brother" are different enough concepts that they get totally separate words.<br />
<br />
Making these choices is typically treated as a classification problem: you extract features from a bunch of instances of a source word in use, then do supervised learning to get a classifier with (hopefully) good accuracy on the problem of predicting whether a given instance is a "banco" usage or an "orilla" usage. The features are typically things like "which words are in the surrounding context?", or perhaps something fancier based on a parse of the sentence or knowledge about the document as a whole -- whatever you think will be predictive of what the target-language word should be. Hopefully your learning algorithm has some good way of filtering out irrelevant features.<br />
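To make that setup concrete, here's a minimal bag-of-context-words Naive Bayes sketch. The training data and helper names are invented for illustration -- real CLWSD systems (like Carpuat and Wu's) use richer features and stronger models -- but the shape of the problem is the same:

```python
from collections import Counter, defaultdict
import math

def train(labeled_contexts):
    """Count context words per sense for Naive Bayes:
    score(sense) = log P(sense) + sum of log P(word | sense)."""
    word_counts = defaultdict(Counter)   # sense -> word -> count
    sense_counts = Counter()             # sense -> training instances
    for words, sense in labeled_contexts:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
    return word_counts, sense_counts

def classify(words, word_counts, sense_counts):
    vocab = {w for counts in word_counts.values() for w in counts}
    def score(sense):
        total = sum(word_counts[sense].values())
        s = math.log(sense_counts[sense] / sum(sense_counts.values()))
        for w in words:
            # add-one smoothing: unseen context words shouldn't zero out a sense
            s += math.log((word_counts[sense][w] + 1) / (total + len(vocab)))
        return s
    return max(sense_counts, key=score)

# Hypothetical training instances: context words around English "bank",
# labeled with the Spanish word a translator actually chose.
training = [
    (["deposit", "money", "account"], "banco"),
    (["loan", "interest", "money"], "banco"),
    (["river", "muddy", "water"], "orilla"),
    (["fishing", "river", "grassy"], "orilla"),
]
wc, sc = train(training)
print(classify(["money", "deposit"], wc, sc))  # -> banco
```

Note that the labels come "for free" from a word-aligned parallel corpus -- you see which target-language word the human translator picked -- which is part of what makes the cross-lingual framing attractive.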
<br />
And then, once that's all put together, hopefully you have some extra signal to feed into your translation system, and it makes better word choices, and everybody's happy.<br />
<br />
And that's cross-lingual word/phrase-sense disambiguation!Anonymoushttp://www.blogger.com/profile/12622189575214725040noreply@blogger.com0tag:blogger.com,1999:blog-34619686.post-48755873778874381112011-06-05T19:06:00.001-04:002013-08-31T21:03:20.991-04:00Mr. Verb on metaphorOh, also: there was this great <a href="http://mr-verb.blogspot.com/2011/06/intelligence-establishment-and-metaphor.html">Mr. Verb post</a>, where they talk about government funding to do research on metaphor. Link to an article in The Atlantic: <a href="http://www.theatlantic.com/technology/archive/2011/05/why-are-spy-researchers-building-a-metaphor-program/239402/">Why Are Spy Researchers Building a 'Metaphor Program'</a>?<br />
<br />
Here's <a href="http://www.iarpa.gov/solicitations_metaphor.html">the job posting</a>, which sounds <i>awesome,</i> except that it will probably ultimately lead to people getting exploded:<br />
<blockquote>The Metaphor Program will exploit the fact that metaphors are pervasive in everyday talk and reveal the underlying beliefs and worldviews of members of a culture. In the first phase of the two-phase program, performers will develop automated tools and techniques for recognizing, defining and categorizing linguistic metaphors associated with target concepts and found in large amounts of native-language text. The resulting conceptual metaphors will be validated using empirical social science methods. In the second phase, the program will characterize differing cultural perspectives associated with case studies of the types of interest to the Intelligence Community. Performers will apply the methodology established in the first phase and will identify the conceptual metaphors used by the various protagonists, organizing and structuring them to reveal the contrastive stances.</blockquote>Anonymoushttp://www.blogger.com/profile/12622189575214725040noreply@blogger.com0