Sunday, March 31, 2013

language technology for Paraguay

Earlier this month, we went to Paraguay. Why'd we do that?

Paraguay is the only country in the Americas where most people are bilingual in a European language and an indigenous language. Paraguayans, the majority of them anyway, really do speak Guarani, or depending on context, a mixture of Spanish and Guarani called Jopará.

While we were there, we talked with Guarani-language teachers at the institutes where they're training translators and linguists. They let us sit in on some of the classes and talk with the students. We're working on building them a computer-assisted translation webapp that will help us collect lots of bilingual text! This is going to be huge.

While we were in the area, we also talked with the local One Laptop Per Child folks; there's an OLPC installation in the little town of Caacupé, which is near Asunción. The OLPC folks said that we should probably go visit local grade schools. Which we did!

It hit me, after the first visit to the schools, how much we were making use of our foreign-white-scientist privilege. We didn't have anything to do with the OLPC project -- aside from a desire to collaborate -- but here we were, wandering into schools without so much as a release form, talking to the kids. I'm trying to imagine Paraguayan scientists coming to the US to observe technology use among los niños estadounidenses.

The really interesting thing here: the kids in Caacupé weren't so surprised to see foreign scientist-looking guys coming to talk with them. They were really friendly, and eager to show off what they could do with the laptops! I get the impression this happens fairly frequently.

So there's a lot of stuff that needs to get built, to make computer use in Guarani more pleasant.
  • At a very basic level, it's hard to type the diacritics that you need for Guarani if your keyboard layout is set to Spanish. The diacritics for Guarani actually aren't that weird; they've got tildes on some vowels, but you see that in Portuguese too.
  • There's no good spellchecker. Guarani morphology is pretty complicated, so this is not an easy thing to build. But we know a guy who's working on it...
  • Text-to-speech. The kids in the schools have text-to-speech for Spanish, and they love it! There's a program on the OLPC where you can send messages to a friend's computer, and the receiving computer will speak your message. It's hilarious. But it doesn't work for Guarani. And as you get further out into the country, the kids are more likely to be monolingual Guarani speakers...
  • The computer-assisted translation website: we're working on it. I'll write more about this soon...
I gave a talk about all of this at the computational linguistics seminar: here are the slides!

Saturday, February 09, 2013

ACM Publications Board: "How can we minimally budge so you'll stop bothering us about open access?"

You may have seen the recent article from CACM, Positioning ACM for an Open Access Future. I found the article fairly upsetting. The first paragraph...
"The age of open access is upon us. Increasingly, the consensus of authors of research articles and their funding institutions is that the fruits of taxpayer-supported research should be freely available to the public. This is a compelling argument and a noble goal."
However, we're not going to do that anytime soon! They then launch into a red-herring discussion of predatory OA publishers (which are a real thing! there do exist vanity presses that have sprung up to capitalize on the OA trend) -- but this ignores how such publishers come about. You don't accidentally become such a vanity press. Serious venues with good review boards won't have the problem of "a glut of third-rate publications that add noise rather than insight to the scientific enterprise".

Afterwards, they discuss four different approaches for the ACM to not go fully OA: the first is that, optionally, authors could pay an extra fee to have their articles available from the Digital Library. The other three are simply ways in which the paywall restrictions could be lifted under some circumstances.

None of these are acceptable. Not if we believe that "the fruits of taxpayer-supported research should be freely available to the public". Why is the current situation even sort of OK?

The ACM's resistance to OA so far, its claims that figuring out a way to do it is too hard, that it's too expensive or will lead to bad publications -- as far as I can tell, these mean at least one of two things:
Which one of these is true? Both?

We need to get the ACM to stop thinking like a for-profit publisher and start thinking like an organization whose goal is to move the field forward and educate people. The ACM needs to drop both the paywall and its membership in the AAP.

Either that, or we as computing professionals need to drop the ACM.

Thursday, January 31, 2013

stupid NLP tricks: haikubot!

At Indiana, we have an IRC server where many of the CS students hang out. I keep myself logged in all the time, in a screen session that I'm usually not looking at. But even while I'm not watching, the chat logs just pile up. So I have megabytes of text from the conversations with my friends and colleagues. The clear use for this resource is, of course, generating haiku [0].

So I built a bot that does this! Here is my friend Eric Holk requesting a poem generated from snippets of text that he actually said.
< eholk> haikubot: eholk
< haikubot> do you have enough / memory to cause that was / stored unless it knew
Alright, how does this work?

I use the IRC logs to train a language model, on demand, from the things that each user said [1]. Each model is cached in case somebody asks for another haiku from that same user again, because it can take a few seconds to train. The language model is something one could easily implement (although smoothing can get a little tricky), but I just used the default bigram models from NLTK.
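For readers who want to see the moving parts, here is a minimal sketch of a per-user bigram model with a cache. It is written in plain Python and does no smoothing, so it is not the NLTK model that the bot uses. All of the names here (train_bigram_model, sample_next, model_for_user, logs_by_user) are made up for illustration.

```python
import random
from collections import Counter, defaultdict

# Hypothetical sentence-boundary markers; not taken from the haikubot code.
START, END = "<s>", "</s>"


def train_bigram_model(lines):
    """Count word bigrams over a list of chat lines (no smoothing)."""
    counts = defaultdict(Counter)
    for line in lines:
        tokens = [START] + line.lower().split() + [END]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts


def sample_next(model, prev, rng=random):
    """Sample a word that followed `prev`, in proportion to its count."""
    followers = model.get(prev)
    if not followers:
        # Unseen context: an unsmoothed model has nothing to offer here.
        return END
    words = list(followers)
    weights = [followers[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


# Cache of trained models, keyed by IRC nick, so that a second request
# for the same user does not pay the training cost again.
_model_cache = {}


def model_for_user(nick, logs_by_user):
    """Train a model for `nick` on first request, then reuse it."""
    if nick not in _model_cache:
        _model_cache[nick] = train_bigram_model(logs_by_user.get(nick, []))
    return _model_cache[nick]
```

Calling model_for_user("eholk", logs) twice returns the same object the second time, which is the whole point of the cache. A real model would also want smoothing, so that unseen word pairs don't simply end the line.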

We also need to be able to count syllables: this is pretty straightforward using the CMU Pronouncing Dictionary, which also comes with NLTK. If we don't have a stored pronunciation for a word, we back off to some heuristics about English that try to count the number of vowels (with some special rules for English spelling particularities); this comes from nltk_contrib.
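A rough sketch of that lookup-then-back-off idea follows. The three-word PRONUNCIATIONS table is a hand-written stand-in for the CMU Pronouncing Dictionary (whose vowel phones end in a stress digit), and the fallback is a deliberately simplified vowel-group heuristic, not the nltk_contrib one. The function names are hypothetical.

```python
import re

# Tiny stand-in for the CMU Pronouncing Dictionary, in its ARPAbet style:
# vowel phones carry a trailing stress digit (0, 1 or 2).
PRONUNCIATIONS = {
    "memory": ["M", "EH1", "M", "ER0", "IY0"],
    "enough": ["IH0", "N", "AH1", "F"],
    "have": ["HH", "AE1", "V"],
}


def syllables_from_phones(phones):
    """One syllable per vowel phone, i.e. per phone ending in a digit."""
    return sum(1 for phone in phones if phone[-1].isdigit())


def heuristic_syllables(word):
    """Simplified fallback: count runs of vowel letters, with one
    special rule for a silent final 'e' (as in "cause")."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    # Every word gets at least one syllable, even "hmm".
    return max(count, 1)


def count_syllables(word):
    """Use the stored pronunciation if there is one, else the heuristic."""
    phones = PRONUNCIATIONS.get(word.lower())
    if phones is not None:
        return syllables_from_phones(phones)
    return heuristic_syllables(word)
```

The heuristic will be wrong fairly often (English spelling is like that), which is why the dictionary lookup comes first.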

To actually generate a line of the haiku, we just sample words from the language model until we have the desired number of syllables. If that fails for some reason, we try again (up to 100 times), and eventually just back off to picking a word from a list of known 5-syllable or 7-syllable words.
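The retry-and-back-off loop could look something like the sketch below. It assumes two callables supplied by the caller: next_word(prev, rng), which returns the next word or None if the model can't continue, and count_syllables(word), which returns at least 1. Those names, the FALLBACK_WORDS lists, and the choice to treat overshooting the syllable target as a failed attempt are all assumptions of this sketch; only the limit of 100 tries and the 5-7-5 shape come from the description above.

```python
import random

MAX_ATTEMPTS = 100  # retry limit mentioned in the post

# Hypothetical fallback lists of known 5- and 7-syllable words.
FALLBACK_WORDS = {
    5: ["hippopotamus", "university", "imagination"],
    7: ["telecommunication"],
}


def try_line(next_word, count_syllables, target, rng):
    """One attempt: sample words until we reach `target` syllables.

    Returns the list of words on an exact hit, or None if the model
    runs dry or the last word overshoots the target.
    Assumes count_syllables never returns 0, so the loop terminates.
    """
    words, total, prev = [], 0, None
    while total < target:
        word = next_word(prev, rng)
        if word is None:
            return None
        total += count_syllables(word)
        words.append(word)
        prev = word
    return words if total == target else None


def generate_line(next_word, count_syllables, target, rng=random):
    """Retry up to MAX_ATTEMPTS times, then fall back to a single word."""
    for _ in range(MAX_ATTEMPTS):
        words = try_line(next_word, count_syllables, target, rng)
        if words is not None:
            return " ".join(words)
    return rng.choice(FALLBACK_WORDS[target])


def generate_haiku(next_word, count_syllables, rng=random):
    """Three lines of 5, 7 and 5 syllables, joined the way the bot prints them."""
    lines = [generate_line(next_word, count_syllables, n, rng) for n in (5, 7, 5)]
    return " / ".join(lines)
```

Passing the model and the syllable counter in as plain functions keeps the loop easy to test with stubs, independent of whichever language model sits behind it.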

The IRC libraries and basic IRC bot structure are adapted from ircbot-collection by Ben Collins-Sussman et al. [2].

Here's the code for haikubot! Hopefully this will be fun for somebody else too. If nothing else, it's a pretty good demonstration of some NLTK features working just fine on Python 3. It requires, at the moment, the tip of the NLTK trunk.

[0] These are of course not proper haiku as such. Just 5-7-5 syllables.
[1] There is some filtering to try to ignore, for example, code samples, and usernames at the start of lines.
[2] Related! Ben Collins-Sussman's talk: "How to Replace Yourself with an IRC Bot"

Thursday, December 27, 2012

reading: Computer Power and Human Reason

Not long ago, I got a copy of Joseph Weizenbaum's Computer Power and Human Reason from a pile of free books -- there's so much great stuff on the free books table when a professor retires!

I’d recommend reading it if you’re into the ethical issues surrounding computing or the history of AI. The book was published in 1976; our relationship with computing has changed a lot since then, and that was probably the most striking thing about reading the book now.

Weizenbaum was worried about the trust that society placed in the computer systems of the time. He describes situations in which people felt they were slaves to systems too complex for people to understand and too far removed from human judgement to be humane; examples include planning systems that told pilots where to bomb during the Vietnam War. But the systems had computers involved, and they were made by experts, so they must be right! "Garbage in; gospel out". And down came the bombs.

I'd argue that we've become less dazzled by computers as such, that we no longer think of them as infallible. But perhaps we're less likely to think about the computers themselves at all. They've become ubiquitous, just the infrastructure that makes society work. My mother (a keen observer of technology) recently remarked that it's strange that we still call them "computers" when the point is to use them for communication. We may still have problems of blind obedience, but perhaps it's better understood as blind obedience to people.

Similarly, Weizenbaum was concerned about the social power wielded by scientists, engineers, and other experts. To me, in the fair-and-balanced political climate, this sounds like a good problem to have: people used to listen to experts? Did they listen to experts when they said things that were politically inconvenient for those with money? Perhaps not...

Computer Power and Human Reason also spends some time with the exuberant claims about AI from before the AI Winter. Herbert Simon said, "... in a visible future – the range of problems they [machines] can handle will be coextensive with the range to which the human mind has been applied", which was clearly somewhat premature. But we have made progress on a lot of fronts! Weizenbaum was quite skeptical that machine translation would be any good, despite claims (which he relates in the book) that MT really just needed more processing power and more data. A few decades later, MT is often pretty good! All it took was more processing power and more data.

There's also some beautifully strange writing. Towards the beginning, he spends a few chapters explaining how computers work, in a formal, abstract way. And then we get this:
"Suppose there were an enormous telephone network in which each telephone is permanently connected to a number of other telephones; there are no sets with dials. All subscribers constantly watch the same channel on television, and whenever a commercial, i.e., an active interval, begins, they all rush to their telephones and shout either "one" or "zero," depending on what is written on a notepad attached to their apparatus. ..."
I have trouble imagining that this metaphor has helped many people understand digital logic circuits; but I enjoyed reading the book! Perhaps you'd enjoy it as well.

Saturday, July 07, 2012

this is something new and beautiful: Coursera and Udacity

Just last week, I finished the coursework for Coursera's machine learning class. It was great! I had a really good time with it, and I'm fairly proud of the accomplishment.

If you've been within earshot of me in the past few months, you probably know that I'm really excited about Coursera and Udacity and their ilk (including, but not limited to, edX, Khan Academy, and Duolingo). There are two experiences I'd like to contrast with taking a course on Coursera.

Some years ago, I was living in Atlanta and working a real job. And I went over to the Georgia Tech math department to see about taking some masters-level statistics classes, imagining that they would let me pay them lots of money in exchange for taking classes at the university I had graduated from just months prior. But it turned out that they wouldn't let me do this without being admitted to a full-time degree program.

Fewer years ago, I was starting my PhD at Indiana, and knowing exactly what I was there to learn, I picked out three classes: one on NLP, a computational linguistics class (from the Linguistics department), and one from Stats. I got a mild hassle from the department about my choices: these were all "fun" classes, and shouldn't I work on fulfilling my breadth requirements? I've since finished my IU coursework, and let me say: not all the classes I had to take as a result were very interesting, or even very well taught. Some were downright bad.

But now there are free online courses that are meant to be good, and you take only the ones you're interested in, as opposed to expensive in-person courses that may not be good but that you're obliged to take anyway -- this is huge.

Whether or not you think that teaching in person is going to stay relevant, not everybody has access to good teachers in person. This remains true even for people at universities.

Moreover, online classes lower the barriers to entering or leaving a course to almost nothing. Want to sign up for a class just to try it out? Nothing could be easier! Don't enjoy it, or it's not what you thought it was, or find out you're busy with other stuff? Nothing lost, try a different one! But if you stick it out and put in the effort, then not only have you learned something, but also you get a certificate that says you finished! (maybe these could be OpenBadges sooner or later...)

There are going to be lots of bytes spilled about these things in the coming years, but just to make it clear: I'm jazzed about helping people who want to learn things get access to material about those things. And the World Music class is starting up soon, which my mother and I are going to take! Because why not?

Saturday, June 30, 2012

happy hardware review: usb wireless adapter from ThinkPenguin

I'm in Mountain View for the summer, working on Google Translate for another internship with that company that I seem to work for fairly often. Hooray!

Unfortunately, my laptop's built-in wireless card really doesn't agree with the apartment complex's wireless. So I ordered a little USB stick wireless adapter from ThinkPenguin, and it came pretty quickly, and I plugged it in (and told wicd to look at wlan1 instead of wlan0), and it just worked! Now my wireless connection is pretty fast, and doesn't drop every five minutes! (unlike before; it was a serious pain.)

Particularly, I got this one. Their other products may also be lovely. Thanks, ThinkPenguin!

Wednesday, May 30, 2012

take five minutes: support open access

tl;dr: Sign this petition to support open access for publicly-funded research!! http://wh.gov/6TH

Here's the situation: there's lots of scholarly work being done. And you, as a citizen of a country, are paying academics to do science (or whatever), write about it, and review the work of other scholars. The work that makes it through the reviewing process gets published, typically in a journal or at a conference.

Here's the problem: a lot of that scholarly work is then inaccessible to you. You have to pay to read it, and often you have to pay a lot. If you're at a well-funded academic institution, your university library has to pay a lot. It's a serious problem even for universities as wealthy as Harvard. Where does this money go? It doesn't go to the academics who wrote the papers, or those who reviewed them: it goes to publishing companies with absurd profit margins who have trouble pointing at what value they add to the process, aside from happening to own prestigious journals.

Concretely, this is a problem for the independent researcher, for the small business developer-of-stuff who wants to get the latest developments, for the interested public who wants to read and learn and grow, for the precocious teenager. I've come to care kind of a lot about this issue: it's because I believe in science. I think it's pretty important: it should get out to as many people as possible, not just because the citizens paid for it in the first place, but also so we can make progress faster.

The National Institutes of Health have famously set up an Open Access mandate: all the research that they fund must be available to the public pretty soon after it's published. Many universities are doing the same thing. The Association for Computational Linguistics (who run the conferences and journals where I'm personally likely to publish) do a bang-up job of making all of their articles publicly available, and I'm really proud to be associated with them. But not every professional organization, and not every field's journals, are like this. Most are not!

How can you help? Right now, there's a petition on the White House website where you can ask the administration to expand the NIH-style mandate to other funding agencies: I'd really appreciate it if you'd take a minute to make an account and sign the petition. Click here: http://wh.gov/6TH

(hrm, I seem to have written about this back in 2007 too)