Wednesday, May 30, 2012

take five minutes: support open access

tl;dr: Sign this petition to support open access for publicly-funded research!! http://wh.gov/6TH

Here's the situation: there's lots of scholarly work being done. And you, as a citizen of a country, are paying academics to do science (or whatever), write about it, and review the work of other scholars. The work that makes it through the reviewing process gets published, typically in a journal or at a conference.

Here's the problem: a lot of that scholarly work is then inaccessible to you. You have to pay to read it, and often you have to pay a lot. If you're at a well-funded academic institution, your university library has to pay a lot. It's a serious problem even for universities as wealthy as Harvard. Where does this money go? It doesn't go to the academics who wrote the papers, or to those who reviewed them: it goes to publishing companies with absurd profit margins who have trouble pointing at what value they add to the process, aside from happening to own prestigious journals.

Concretely, this is a problem for the independent researcher, for the small business developer-of-stuff who wants to get the latest developments, for the interested public who wants to read and learn and grow, for the precocious teenager. I've come to care kind of a lot about this issue: it's because I believe in science. I think it's pretty important: it should get out to as many people as possible, not just because the citizens paid for it in the first place, but also so we can make progress faster.

The National Institutes of Health have famously set up an Open Access mandate: all the research that they fund must be available to the public pretty soon after it's published. Many universities are doing the same thing. The Association for Computational Linguistics (who run the conferences and journals where I'm personally likely to publish) does a bang-up job of making all of its articles publicly available, and I'm really proud to be associated with them. But not every professional organization, and not every field's journals, are like this. Most are not!

How can you help? Right now, there's a petition on the White House website where you can ask the administration to expand the NIH-style mandate to other funding agencies. I'd really appreciate it if you'd take a minute to make an account and sign the petition. Click here: http://wh.gov/6TH

(hrm, I seem to have written about this back in 2007 too)

Thursday, May 10, 2012

command line tricks: ps, grep, awk, xargs, kill

I recently learned a little bit of awk; if it's not in your command line repertoire, it's worth looking into! awk lets you do things like this:

$ ps auxww | grep weka | grep -v grep | awk '{print $2}' | xargs kill -9

Let's unpack what's going on here. First, we list all processes in wide format (those are the "auxww" options to ps), then we filter with grep to keep only the lines that include "weka". When I wrote this line, I was debugging a long-running machine learning task, so I would start it running, then if (when) I found a bug and wanted to restart, I used this command to kill it.

Now we have all the lines of output from ps that include "weka". Unfortunately, this includes the grep process that's searching for "weka"! No problem, just use "grep -v" to filter out the lines that include "grep".
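To see the two grep stages on their own, here's a small sketch that runs them over canned text instead of real ps output. The three sample lines are made up for illustration; only their general shape matches what ps prints.

```shell
# Made-up lines shaped roughly like ps output (illustrative values only).
sample='user   812  0.0 java weka.classifiers.Run
user  9001  0.0 grep weka
user    77  0.0 vim notes.txt'

# Keep lines mentioning weka, then drop the line for the grep process itself.
matches=$(printf '%s\n' "$sample" | grep weka | grep -v grep)
echo "$matches"
```

One thing to keep in mind: "grep -v grep" drops every line containing "grep", so a process you actually wanted would be skipped too if "grep" happened to appear anywhere in its command line.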

This is where awk comes in. We want to get the process numbers out of the ps output. It seems like we could use cut to just get the second column, but the columns are padded with a varying number of spaces, and cut treats every single space as its own delimiter, so we don't know which field the process number will land in! Instead, we just use a tiny awk script that prints the second whitespace-delimited thing on each line.
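Here's a quick way to see the difference between awk and cut, again on made-up sample lines rather than real ps output. The variable names are just for this sketch.

```shell
# Two made-up lines with different amounts of padding before the PID column.
sample='user       812  0.0  0.1 java weka.Run
user     14203  1.2  3.4 java weka.Run'

# awk splits on runs of whitespace, so $2 is the PID on every line.
pids_awk=$(printf '%s\n' "$sample" | awk '{print $2}')

# cut -d' ' counts each space as a delimiter, so field 2 is the empty
# string between the first two spaces.
pids_cut=$(printf '%s\n' "$sample" | cut -d' ' -f2)

echo "awk found: $pids_awk"
echo "cut found: [$pids_cut]"
```

With awk you get both process numbers; with cut you get nothing useful, because the second "field" is just the gap between two adjacent spaces.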

Finally, we use xargs to take the process numbers and pass them as arguments to kill. xargs is great: it reads whitespace-separated items from its standard input and appends them as arguments to a program (i.e., the command you give xargs as its own arguments). Usually I use xargs in combination with find, "svn status", or "git status -s" to do the same thing to batches of files: maybe delete them, or add them to version control, or whatever.
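If you want to check what xargs is going to run before letting it loose, one trick is to put echo in front of the real command. This sketch uses made-up process numbers and never calls kill; it only prints the command lines xargs would build.

```shell
# echo stands in for the real command, so nothing is actually killed.
built=$(printf '812\n14203\n' | xargs echo kill -9)
echo "$built"

# With -n 1, xargs runs the command once per item instead of batching them.
one_each=$(printf '812\n14203\n' | xargs -n 1 echo kill -9)
echo "$one_each"
```

The first form puts both process numbers on a single kill command line, which is what the pipeline above does. The second form shows the one-invocation-per-item variant, which is handy for programs that only accept a single argument.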

Thoughts? Better ways to do this sort of thing?