Distributed Processing and Unix Philosophy

A few months back I started working on a news classifier for Highput. I had originally planned to use Clojure to write Hadoop jobs, but most of the input data was available as a stream, and the allure of continuously producing results was too much. RSS feeds, Twitter streams, and collaborative ranking (e.g., HN, reddit) are continuous, so the classifier should be as well. Data retrieval, extraction, transformation, and loading (ETL) were handled by a distributed workflow built from three simple components: Python, Beanstalk, and Tokyo Tyrant. The producer/consumer processes are Python programs, Beanstalk is a message queue that divvies jobs out to consumers, and Tokyo Tyrant stores intermediate results.
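
To make that concrete, here's roughly what the producer end looks like. This is a minimal sketch assuming the beanstalkc client library; the tube name and job payload are illustrative, not the actual ones from the classifier.

    import json

    import beanstalkc

    # Producers connect to beanstalkd and push small job descriptions
    # into a named tube; consumers watching that tube pick them up.
    queue = beanstalkc.Connection(host='localhost', port=11300)
    queue.use('fetch')  # jobs from this producer go into the 'fetch' tube

    for url in ('http://news.ycombinator.com/rss',):
        # A job is just a small JSON payload describing the work to do.
        queue.put(json.dumps({'url': url}))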

Python is a great programming language for getting things done: it has loads of libraries, pseudocode-like syntax, and it keeps simple things simple. Aside from the throughput gained by distributing work across multiple machines, one of the advantages of using a workflow for data processing is the reduction in complexity that comes from dividing a larger problem. Each processing step in the workflow was implemented as a separate Python program. Each instance of the program requests a job from Beanstalk, does its work, then throws the result into a different job queue.

Any of the processes can be killed at any time without losing the task. That's because Beanstalk keeps track of jobs even after a worker has reserved one: if the worker fails to report back, the job is returned to the ready state and made available to other workers. Beanstalk also supports persistent queues, protecting against lost jobs if the beanstalkd process dies.

Finally, there is Tokyo Tyrant, the network interface to Tokyo Cabinet. Tokyo Cabinet received well-deserved attention in 2009; it's the best little database, supporting multiple modes useful in different circumstances: simple key/value pairs, fixed-size arrays, B-trees, and schema-free tables. Ultimately, the news classifier in production would use something with an automated process for balancing storage nodes, but Tokyo Cabinet is great for prototyping even if it isn't the long-term solution.
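
Here's a sketch of what one processing step looks like, again assuming the beanstalkc and pytyrant client libraries; extract_text and the tube names are hypothetical stand-ins for the real thing. The detail that makes workers safe to kill is the order of operations: the job is deleted only after the intermediate result is stored and the follow-up job is enqueued. If the process dies before that, beanstalkd's time-to-run timer expires and the job goes back to the ready state.

    import json

    import beanstalkc
    import pytyrant

    queue = beanstalkc.Connection(host='localhost', port=11300)
    queue.watch('fetch')    # this worker consumes jobs from 'fetch'
    queue.ignore('default')
    queue.use('extracted')  # ...and feeds the next step in the workflow

    # pytyrant exposes a Tokyo Tyrant server as a dict-like object.
    store = pytyrant.PyTyrant.open('localhost', 1978)

    def extract_text(url):
        # Hypothetical stand-in for the real fetch-and-extract step.
        return 'extracted text for %s' % url

    while True:
        job = queue.reserve()  # blocks until a job is available
        task = json.loads(job.body)
        store[task['url']] = extract_text(task['url'])  # intermediate result
        queue.put(json.dumps({'key': task['url']}))     # job for the next step
        job.delete()  # acknowledge only after the handoff is complete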

With these tools, a distributed system prototype for ETL can be built in a day. And not just a hacked-up system that should be thrown away, but one that is nearly good enough for production: add deployment management, failure detection, and recovery, and it's complete. That is significant work, but those are concerns for users of most heavier distributed system frameworks too. While this solution is ad hoc and inappropriate for some situations, it's amazing what can be done with a small set of well-designed, task-specific tools in a short period of time.