5/10/2013

# Hadoop Table Joins

It is possible to join 2 or more tables in Hadoop.

One way is to have each Mapper load the first (smaller) table into memory, then join each record of the second table against it as the records stream through. However, this can blow out your memory with large data sets.
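Here is a rough sketch of that map-side approach, assuming the smaller table has already been shipped to every node (for example via the distributed cache) under the local name customers.txt, and that both tables are comma-delimited with the join key in the first column:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join: the small table is loaded into a HashMap in setup(),
// then every record of the large table is joined against it in map().
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    // join key -> rest of the row from the small table
    private final Map<String, String> smallTable = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // "customers.txt" is a hypothetical local file name; it is assumed the
        // small table was shipped to every node, e.g. via the distributed cache.
        BufferedReader reader = new BufferedReader(new FileReader("customers.txt"));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split(",", 2);
            smallTable.put(parts[0], parts[1]);
        }
        reader.close();
    }

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        // Each line of the big table is assumed to be: key,rest-of-row
        String[] parts = value.toString().split(",", 2);
        String match = smallTable.get(parts[0]);
        if (match != null) {
            // Emit the joined record; no Reducer is needed for this approach.
            context.write(new Text(parts[0]), new Text(parts[1] + "," + match));
        }
    }
}
```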

Another way is to join the two tables on the Reducer side: emit the join key from each Mapper, tag each record with the table it came from, and apply the join in the Reducer once all the records for a key arrive together. This also requires a lot of resources, since both tables get shuffled across the network.
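A minimal sketch of the Reducer half of that approach, assuming the Mappers have already emitted the join key and prefixed each row with an "A:" or "B:" tag to say which table it came from:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce-side join: because all rows sharing a join key arrive at the same
// reduce() call, the Reducer can separate them by table and pair them up.
public class ReduceSideJoinReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // The "A:" / "B:" tags are an assumed convention applied by the Mappers.
        List<String> tableA = new ArrayList<String>();
        List<String> tableB = new ArrayList<String>();
        for (Text value : values) {
            String v = value.toString();
            if (v.startsWith("A:")) {
                tableA.add(v.substring(2));
            } else {
                tableB.add(v.substring(2));
            }
        }
        // Inner join: emit every A row paired with every B row for this key.
        for (String a : tableA) {
            for (String b : tableB) {
                context.write(key, new Text(a + "," + b));
            }
        }
    }
}
```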

Both of these methods work, but both use a lot of memory and a fair amount of coding.

I prefer my way, which is to run your first Map/Reduce job from your Driver, then, based on the output of that first outer M/R job, call an inner Map/Reduce job.

This is how Visual Basic programmers back in the day joined two tables without having to know what a Join was. It was slow then, and it's slow now, but the coding is simpler and it is not memory intensive.
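Here is a bare-bones Driver sketch of that chained approach. The OuterMapper, InnerJoinMapper, and InnerJoinReducer classes and the intermediate path are hypothetical placeholders you would swap for your own logic:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Chained jobs: the outer job runs to completion first, and its output
// directory becomes one of the inputs of the inner job that does the join.
public class ChainedJoinDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path outerOutput = new Path("/tmp/outer-output");  // assumed intermediate path

        // Outer job: scan the first table and write its keyed output.
        Job outer = Job.getInstance(conf, "outer table scan");
        outer.setJarByClass(ChainedJoinDriver.class);
        outer.setMapperClass(OuterMapper.class);           // hypothetical Mapper
        outer.setOutputKeyClass(Text.class);
        outer.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(outer, new Path(args[0]));
        FileOutputFormat.setOutputPath(outer, outerOutput);
        if (!outer.waitForCompletion(true)) {
            System.exit(1);
        }

        // Inner job: read the outer job's output plus the second table and join.
        Job inner = Job.getInstance(conf, "inner join");
        inner.setJarByClass(ChainedJoinDriver.class);
        inner.setMapperClass(InnerJoinMapper.class);       // hypothetical Mapper
        inner.setReducerClass(InnerJoinReducer.class);     // hypothetical Reducer
        inner.setOutputKeyClass(Text.class);
        inner.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(inner, outerOutput);
        FileInputFormat.addInputPath(inner, new Path(args[1]));
        FileOutputFormat.setOutputPath(inner, new Path(args[2]));
        System.exit(inner.waitForCompletion(true) ? 0 : 1);
    }
}
```

The inner job simply treats the outer job's output directory as one more input, so neither job ever has to hold a whole table in memory.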
