Gutting Hadoop with YARN

There's a building near my home, owned by a company and probably built in the 1960s or earlier.  Lately, they've been remodeling it to give it a fresh new feel.  The thing is, they did not bulldoze the existing building.  They simply kept the structure in place and built around it.

Now take Hadoop.  Its core was based upon two things: the HDFS file system and the MapReduce framework.  Along comes YARN, a resource management layer that sits between HDFS (now HDFS2) and MapReduce.  Although MapReduce is still bundled into the product, it's basically one of many applications which sit atop YARN.  So it's no longer the sole way of getting at the data / file system.
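To picture what "one of many applications atop YARN" means, here is a toy sketch in Python.  It is not the Hadoop API; the class ToyResourceManager, its methods, and the application names are all made up for illustration.  The only idea it tries to show is that a shared layer hands out containers, and MapReduce is just one of the applications asking for them.

```python
class ToyResourceManager:
    """Hypothetical stand-in for YARN: hands out a fixed pool of containers."""

    def __init__(self, total_containers):
        self.free = total_containers
        self.allocations = {}  # application name -> containers currently held

    def request(self, app_name, wanted):
        """Grant as many containers as are free, up to what was asked for."""
        granted = min(wanted, self.free)
        self.free -= granted
        self.allocations[app_name] = self.allocations.get(app_name, 0) + granted
        return granted

    def release(self, app_name):
        """Return all of an application's containers to the pool."""
        self.free += self.allocations.pop(app_name, 0)


# Two different kinds of applications share the same cluster.
rm = ToyResourceManager(total_containers=10)
mapreduce_got = rm.request("mapreduce-job", 6)  # a classic MapReduce job
tez_got = rm.request("tez-query", 6)            # some other engine on the same cluster
```

In this sketch the second application only gets what is left over, and it could ask again once the first one releases its containers.  The real scheduler is far more involved, with queues, priorities and per-node resources.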

In fact, with Tez, the work is planned as a single graph of tasks running directly on YARN, a lower level than a chain of MapReduce jobs, which explains the speedup.  And yes, MapReduce still runs its legacy applications.  However, as I understand it, those will also be compiled down to that lower level to take advantage of YARN (not sure if this is available now).

The way I heard it, MapReduce has to spawn multiple jobs to handle complex SQL like a Join or Group By.  Those extra jobs have to be monitored, and their intermediate results have to be written to disk, shuffled, aggregated and brought back together, which slows things down.  An engine like Tez running on YARN does not need as many jobs to get the same results.  And that's one way it is able to return queries faster.
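The difference between chaining jobs and running one pipeline can be sketched in a few lines of Python.  This is a toy, not Hadoop code: the sample data and every function name are made up, and a plain list stands in for an intermediate file written to disk between jobs.

```python
from collections import defaultdict

# Made-up sample data for illustration.
orders = [("alice", 30), ("bob", 20), ("alice", 5)]  # (customer, amount)
regions = {"alice": "east", "bob": "west"}           # customer -> region


def join_stage(order_rows, region_by_customer):
    """Join each order with its customer's region."""
    for customer, amount in order_rows:
        yield region_by_customer[customer], amount


def group_stage(rows):
    """Sum the amounts per region."""
    totals = defaultdict(int)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)


def run_as_chained_jobs():
    """Two separate jobs: the join output is fully written out in between."""
    intermediate_writes = 0
    intermediate = list(join_stage(orders, regions))  # stands in for a disk write
    intermediate_writes += 1
    result = group_stage(intermediate)                # second job reads it back
    return result, intermediate_writes


def run_as_single_pipeline():
    """One job: the join feeds the group-by directly, with no write in between."""
    intermediate_writes = 0
    result = group_stage(join_stage(orders, regions))
    return result, intermediate_writes


chained_result, chained_writes = run_as_chained_jobs()
pipeline_result, pipeline_writes = run_as_single_pipeline()
```

Both versions compute the same totals; the chained version just pays for one extra round trip through storage.  On a real cluster that round trip is a full write to the file system plus the startup cost of another job, which is where the time goes.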

So what they've done is the same thing as that building down the street from me.  They've built around the existing framework, added a basement and new walls, and made it two stories.

Very impressive.

Likewise, Microsoft has found an alternate method of storing the data instead of the HDFS file system.  They use Azure Blob Storage, which allows easy scale-out on the web.

So are things changing in the world of Hadoop?  Yes siree Bob.  Change is a good thing.

And so it goes~!