5/07/2013

# Hadoop Basics I Learned in Class

So I got training today on Hadoop.

The teacher was from Cloudera.

18 other programmers attended, as well as our CTO.

Today was day 1 of 4.

We learned the history of Hadoop: it started out as Lucene, which spawned Nutch, which in turn became Hadoop.

And the Hadoop Distributed File System (HDFS), which is a file system that sits on top of the native file system.

Data gets ingested through Hadoop commands and is split into chunks, each stored as a block, typically 64 MB. Each block is replicated 2 more times for failover and redundancy.
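Here's a minimal sketch of that ingestion step using the Java FileSystem API (the paths are hypothetical; `hadoop fs -put` on the command line does the same thing):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIngest {
    public static void main(String[] args) throws Exception {
        // Picks up cluster settings (block size, replication) from the config.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Copy a local file into HDFS; the file is split into blocks
        // and each block is replicated (3 copies by default).
        fs.copyFromLocalFile(new Path("/tmp/sales.log"),   // hypothetical source
                             new Path("/data/sales.log")); // hypothetical HDFS target
        fs.close();
    }
}
```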

The data sits in HDFS waiting to be queried.

We do that with MapReduce. The Map phase goes through each of the files, line by line, dissecting each line into key/value pairs.

Those K/V pairs are then collected from all the DataNodes and routed, by key, to the one or more DataNodes running reduce tasks, where they are reduced.

There they are aggregated (summed, averaged, min, max, etc.) and the output is written to a file within HDFS.

There's a "conductor" which keeps track of all the DataNodes via metadata, which lives on the NameNode. The NameNode happens to be a single point of failure, whereas the DataNodes are not: the cluster can lose 2 of the 3 copies of a block and the system will self-heal by re-replicating that data to other healthy nodes.

The master also keeps track of jobs (programs) via the JobTracker. It issues tasks to each of the DataNodes through their TaskTrackers and keeps a 'heartbeat' to know when a task completes or fails, and will spawn a new task on another node if it does fail.

It will also send the same task to another DataNode, where the first one to finish wins. This is called Speculative Execution, and it can be toggled on and off in the config.
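A hedged sketch of that toggle, assuming the MRv1 (Hadoop 1.x) property names; these can also be set cluster-wide in mapred-site.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NoSpeculationJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Map-side and reduce-side speculation can be turned off independently.
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        Job job = new Job(conf, "no-speculation-job");
        // ... set mapper, reducer, and paths as usual, then submit.
    }
}
```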

There's also an interim step called Sort, Shuffle and Merge.

This occurs after the Map phase: the data is sorted, shuffled across the network to the DataNode running the reduce task, and merged with the data from all the other map tasks.

The result is a key/value pair where the value is a list of items, which the Reduce phase then iterates over.

You can keep the Mapper simple, as in the WordCount sample: it parses files into raw lines of data, splits each line into words, and sends the words to the Reducer for summation.
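Here's a minimal sketch of that WordCount, close to the standard example from the Hadoop docs (new-style `org.apache.hadoop.mapreduce` API):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: one input line in, one (word, 1) pair out per word.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }
}

// Reducer: all counts for one word arrive together; sum them.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum)); // emit (word, total)
    }
}
```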

You can also apply complex logic in the Map phase, applying filters or exploding the data. And you don't have to invoke the Reduce phase at all; you can just keep the raw, detailed data from the mapper.
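For example, a map-only filter job might look like this sketch (the ERROR filter is a made-up example; setting the reducer count to zero in the driver skips the reduce phase entirely):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Keeps only lines containing "ERROR"; no aggregation, so no reducer needed.
public class ErrorFilterMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().contains("ERROR")) {
            context.write(value, NullWritable.get()); // pass the raw line through
        }
    }
}
```

In the driver you'd call `job.setNumReduceTasks(0);` and the mapper output is written straight to HDFS.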

The Mapper receives 2 main parameters, the input and the output, along with their data types. The data types must be comparable and sortable for the shuffle to work, so you must declare them appropriately.
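A sketch of a driver wiring up the WordCount classes above, showing the input/output paths and the declared types (`Text` and `IntWritable` are Writables, and keys are WritableComparable so the framework can sort them):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // Key/value types must be Writable, and keys WritableComparable,
        // so the framework can serialize and sort them during the shuffle.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```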

You can also split the work across multiple Reducers to maximize performance.
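Continuing the driver sketch above, that's one line; the default HashPartitioner decides which reducer each key goes to, and each reducer writes its own part file:

```java
// Ask for four parallel reduce tasks; output lands in part-r-00000..00003.
job.setNumReduceTasks(4);
```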

That's basically what I remember from the class today. They covered a lot, and I took detailed notes which I need to transcribe. What I do is write down almost everything the instructor says, then go back and rewrite it so I can read it, then study it. That's my learning pattern, and it got me through college.

They also touched on the administration aspects of Hadoop, as there's a whole course dedicated to that subject. Typically a small cluster consists of 10-40 DataNodes; 40+ is considered big and requires a lot of tuning for maximum performance.

There are master nodes and slave nodes. Some, like the NameNode and JobTracker, are single points of failure; the slaves aren't.

We've got three more classes to go, where they'll talk more about HBase, Pig, Hive, and maybe Impala.

HBase is a NoSQL database that sits atop HDFS. It handles sparse data with wide columns and sits within the cluster. It typically holds less data than raw HDFS, and you can specify what and how much data to ingest.
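A minimal sketch using the HBase Java client API of that era (the table and column names are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");  // hypothetical table
        // Write one cell: row "row1", column family "info", qualifier "name".
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
        table.put(put);
        // Read it back.
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        table.close();
    }
}
```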

Pig uses a scripting language called Pig Latin. It works on data sets, called in sequence, where you can sort, group, and aggregate data; the script gets converted to MapReduce, which is not that fast.

Hive is a Hadoop query language similar to SQL. It too gets converted to MapReduce jobs and is not that fast.

Impala was developed by Cloudera and was released just a few weeks ago. It too is SQL-based, except it runs natively and doesn't get converted to MapReduce, so it's extremely fast. With my SQL background, that has potential!

There's also a machine learning module, Mahout, which we may learn if time permits.

Anyway, got to rest up for tomorrow's class - should be loads of fun!!!
