I have to give a lot of respect to the NYT. Whether this project sticks or not, it's things like this that make me think the NYT will be one of the few newspapers to survive the industry's crisis (albeit in a much smaller and much different form). They're one of the few papers at least trying to get it; others are just complaining while hemorrhaging money.
I don't care for the NYT's editorial stances or reporting for the most part, but they've really been pushing the envelope in using the website as more than a paperless version of a dead-tree product. I could point to any number of their news-related projects: their election coverage had a number of great ways to explore the results, and their Faces of the Dead feature was also... how to put this without breaking the etiquette here... I'm going to go with "technically well-executed".
They have given a number of talks about using Ruby for some of their stuff and have open-sourced other Ruby libraries. They've been a pretty active contributor to the Ruby community.
Anyone know why they built this rather than using Hive or Pig? One thing that drives me nuts is that all of these MR tools are very slow because they don't take advantage of indexes and they use inefficient storage (in this case, plain text files); fixing either would likely improve query performance considerably.
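To make the full-scan gripe concrete, here's a toy contrast in Ruby (the file name, table, and path are all made up for illustration):

    # Without an index, every "query" re-reads the entire log:
    hits = File.foreach("access.log").count { |l| l.include?(" /some/article ") }
    puts hits

    # After a one-time load into an indexed store, the same question
    # becomes a lookup instead of a scan:
    require "sqlite3"
    db = SQLite3::Database.new(":memory:")
    db.execute("CREATE TABLE hits (path TEXT)")
    db.execute("INSERT INTO hits VALUES (?)", "/some/article")
    db.execute("CREATE INDEX idx_path ON hits (path)")
    puts db.get_first_value("SELECT COUNT(*) FROM hits WHERE path = ?",
                            "/some/article")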
MR tools, especially high-level wrappers like Cascading (which gives you easy joins on Hadoop), are very good for building those indices in the first place. You can use them to process the (log) data once, load the results into a scalable DB like Hypertable (or even MySQL if the result set is small), and run your queries there.
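A minimal sketch of that process-once-then-query flow, assuming the MR job's reducer emitted tab-separated url/count pairs (the file name, table, and credentials are placeholders; I'm using the mysql2 gem here):

    require "mysql2"

    client = Mysql2::Client.new(host: "localhost", username: "stats",
                                database: "weblogs")

    # Load the reducer output (one "url\tcount" pair per line) once.
    File.foreach("part-00000") do |line|
      url, count = line.chomp.split("\t")
      client.query("INSERT INTO page_counts (url, hits)
                    VALUES ('#{client.escape(url)}', #{count.to_i})")
    end

    # Ad-hoc questions now hit an indexed table instead of rescanning raw logs.
    client.query("SELECT url, hits FROM page_counts ORDER BY hits DESC LIMIT 10")
          .each { |row| puts "#{row['hits']}\t#{row['url']}" }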
I think it's interesting that this was released by the New York Times... this could prove to be an interesting new model for newspaper publishers trying to remain viable competitors in the 21st century, given the bad rap they seem to be giving themselves these days. D=
This is pretty meaningless -- there's already a Thrift interface that allows easy job creation and control, as well as Hadoop Streaming, which lets you write map-reduce jobs in anything that speaks stdin/stdout.
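For reference, the streaming route is already about this small -- a mapper and a reducer in any language that reads stdin and writes stdout (the log field position and the jar path below are assumptions that vary by setup):

    # mapper.rb -- emit "path\t1" per request line (assumes common log format,
    # where the 7th whitespace-separated field is the request path).
    STDIN.each_line do |line|
      path = line.split[6]
      puts "#{path}\t1" if path
    end

    # reducer.rb -- streaming hands the reducer its input sorted by key,
    # so counts can be summed in a single pass over each run of keys.
    current, count = nil, 0
    STDIN.each_line do |line|
      key, val = line.chomp.split("\t")
      if key != current
        puts "#{current}\t#{count}" if current
        current, count = key, 0
      end
      count += val.to_i
    end
    puts "#{current}\t#{count}" if current

    # Run it (the streaming jar's path varies by Hadoop version):
    #   hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    #     -input logs/ -output counts/ \
    #     -mapper mapper.rb -reducer reducer.rb \
    #     -file mapper.rb -file reducer.rb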
This has dubious benefit and just adds another unnecessary layer to the process. I'm not sure why this is news.