
Clusterfuck: A Subversive Job-Distribution Tool

20 September 2009

I should start this off by making something clear: I am an extremely lazy programmer. I long ago adopted the Ruby mantra that programmer time is more precious than machine time, and tend to write most of my code in a way that is clear and comprehensible, with little regard for speed of execution.

When working with massive NLP datasets, this is perhaps not an entirely good idea.

Lately my experiments have begun to take more and more days to complete. Part of this, of course, is Ruby’s fault (the language, or at least the current interpreter, is far from fast), and part of the blame lies with me, for writing fundamentally lazy code. I have, however, come up with something of an interesting fix for the problem: parallelizing almost all of my big experiments. Easily done, really; but then I realized that I don’t have access to any of the department’s clusters. I do, however, have a login that works on all of the public-access machines in the (huge) undergraduate labs.

Bingo.

Enter clusterfuck, a subversive job-distribution tool I’ve been working on to solve this niggly little dilemma. It’s basically a tool for automating the process of ssh-ing into multiple machines and starting jobs on each of them, writ large. It’s installable via GitHub (I’ll push a gem out to Gemcutter when I have a nice, polished version ready) and configured rake-style via clusterfiles:

Clusterfuck::Task.new do |task|
    task.hosts = %w{clark asimov}       # machines to ssh into
    task.jobs = ["hostname", "hostname", "hostname", "hostname"]  # shell commands to farm out
    task.temp = "fragments"             # directory where each job's output lands
    task.username = "SSHUSERNAME"       # ssh credentials for the target machines
    task.password = "SSHPASSWORD"
    #task.debug = true                  # uncomment for debugging output
end

You kick off a batch of cluster jobs via the command clusterfuck (which takes a clusterfile as an argument, or defaults to clusterfile if one exists). This example file will run the command hostname four times across two machines (two each, unless one goes down or is slow to respond) and save the output of each command into a directory called fragments. A bit of a toy example, to be sure, but replacing the uninteresting hostname job with something more time-consuming and extending the list of hosts to, say, several dozen idle machines can get large, decomposable jobs completed in a fraction of the time necessary to run ‘em on a single machine.
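To make that concrete, here’s a rough sketch of what a less toy-like clusterfile might look like, assuming (as the hostname example suggests) that jobs can be arbitrary shell commands. The lab hostnames, the pre-split corpus.chunk files, and the run_experiment.rb script are all hypothetical stand-ins for whatever your actual decomposable workload looks like:

Clusterfuck::Task.new do |task|
    # a pile of idle lab machines (hypothetical hostnames)
    task.hosts = %w{lab01 lab02 lab03 lab04 lab05 lab06 lab07 lab08}
    # one job per pre-split corpus chunk; run_experiment.rb is a stand-in
    # for whatever long-running script actually does the work
    task.jobs = (0...32).map { |i| "ruby run_experiment.rb corpus.chunk#{i}" }
    task.temp = "fragments"
    task.username = "SSHUSERNAME"
    task.password = "SSHPASSWORD"
end

Kick it off with clusterfuck myclusterfile (or plain clusterfuck, if the file is named clusterfile), and once everything finishes you can stitch the per-job output in fragments back together with something like cat fragments/* > results.txt, depending on how your jobs name their output.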

Why “clusterfuck”? The early versions of this tool were, to say the least, somewhat unreliable, and had a frustrating tendency to leave CPU-intensive processes floating around. Needless to say, the network admins were not happy about this state of affairs, and the entire thing turned into a giant, well, you get the idea.
