It is possible to join two or more tables in Hadoop.
One way is a map-side join: the Mapper loads the first (smaller) table into memory, then streams through the second table and joins each record against the in-memory copy. With large data sets, however, this can overflow your memory.
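A minimal Python sketch of that map-side idea (table rows modeled as (key, value) tuples; the function and table names are illustrative, not Hadoop APIs):

```python
# Map-side (replicated) join: the smaller table is loaded into an
# in-memory dict once, and the larger table is streamed past it.
def map_side_join(small_table, large_table):
    # Building this hash map is the step that can overflow memory
    # when the "small" table turns out not to be small.
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)

    # Stream the large table and emit joined records.
    for key, value in large_table:
        for small_value in lookup.get(key, []):
            yield (key, small_value, value)

users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
print(list(map_side_join(users, orders)))
# -> [(1, 'alice', 'book'), (1, 'alice', 'pen')]
```

Unmatched keys (like order key 3 above) simply drop out, i.e. this sketch behaves as an inner join.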
Another way is a reduce-side join: both tables emit the join key as the map output key, each record is tagged with the table it came from, and the Reducer identifies which records belong to which table and applies the join per key. This also requires a lot of resources.
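Here is a rough Python simulation of the reduce-side approach, with the shuffle/sort phase stood in for by a plain sort (again, the names are illustrative, not Hadoop classes):

```python
from itertools import groupby

# Map phase: tag each record with its source table and key it by the
# join key. The sort stands in for Hadoop's shuffle/sort.
def map_phase(table_a, table_b):
    tagged = [(key, ("A", value)) for key, value in table_a]
    tagged += [(key, ("B", value)) for key, value in table_b]
    return sorted(tagged, key=lambda kv: kv[0])

# Reduce phase: for each key, separate the tagged records back into
# their tables and emit the cross product.
def reduce_phase(shuffled):
    for key, group in groupby(shuffled, key=lambda kv: kv[0]):
        a_vals, b_vals = [], []
        for _, (tag, value) in group:
            (a_vals if tag == "A" else b_vals).append(value)
        for a in a_vals:
            for b in b_vals:
                yield (key, a, b)

users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (2, "pen")]
print(list(reduce_phase(map_phase(users, orders))))
# -> [(1, 'alice', 'book'), (2, 'bob', 'pen')]
```

The resource cost shows up in the reducer: every record for a key has to be buffered before the join can be emitted.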
Both of these methods work, and both are memory-intensive and take a fair amount of coding.
I prefer my way, which is to run your first Map/Reduce job from your Driver, then, based on the output of that first outer M/R job, call an inner Map/Reduce job.
This is how Visual Basic programmers back in the day joined two tables without having to know what a join was. It was slow then, and it's slow now, but the coding is simpler and it is not memory-intensive.
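The driver-chained approach above amounts to a nested-loop join. A small Python sketch, where the job functions are stand-ins for launching real Hadoop jobs from a Driver (all names here are hypothetical):

```python
# Outer job: in a real Driver this would be a full M/R job whose
# output is read back before the next job is launched.
def outer_job(table_a):
    return list(table_a)

# Inner job: launched once per outer record, scanning the second
# table for matches on the join key.
def inner_job(key, a_value, table_b):
    return [(key, a_value, b_value)
            for b_key, b_value in table_b if b_key == key]

# Driver: chains the jobs, feeding each outer record to an inner run.
def driver(table_a, table_b):
    results = []
    for key, a_value in outer_job(table_a):
        results.extend(inner_job(key, a_value, table_b))
    return results

users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (2, "pen"), (1, "pen")]
print(driver(users, orders))
# -> [(1, 'alice', 'book'), (1, 'alice', 'pen'), (2, 'bob', 'pen')]
```

The slowness is visible in the structure: the second table is rescanned once per outer record, but no single step ever has to hold a whole table in memory.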