4/30/2012

My Intro to Hadoop

Big Data is big news!

And what's driving Big Data?

Hadoop!

So today I did some research.

I watched the following video:



Disclaimer: this is not my information. I transcribed the following notes from the YouTube video listed above; I don't own, nor do I take credit for, any of this info...

Hadoop is downloadable from the Apache Website.

It basically works with large sets of files, structured or unstructured.

It works based on the premise of 'DataNodes'.

Map-Reduce = Computation

A way of taking the 'distributed' out of a distributed system: the framework handles spreading the work across the machines, so your code doesn't have to.

 
HDFS = Storage 

Hadoop Distributed File System - it supports all the usual file system things: make directory, copy, list, permissions, groups, file sizes, rm, etc.
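For a rough idea of what those operations look like from code, here's a small Java sketch using Hadoop's org.apache.hadoop.fs.FileSystem API. The /user/demo path and file names are just made-up placeholders, it assumes the cluster settings are on the classpath, and I haven't actually run it yet:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsTour {
        public static void main(String[] args) throws Exception {
            // Reads core-site.xml / hdfs-site.xml from the classpath to find the cluster.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path dir = new Path("/user/demo/notes");   // hypothetical directory
            fs.mkdirs(dir);                            // "make directory"

            // "copy" a local file into HDFS
            fs.copyFromLocalFile(new Path("notes.txt"), new Path(dir, "notes.txt"));

            // "list", with file sizes, permissions, owner and group
            for (FileStatus status : fs.listStatus(dir)) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes  "
                        + status.getPermission() + "  " + status.getOwner() + ":" + status.getGroup());
            }

            fs.delete(dir, true);                      // "rm" (true = recursive)
            fs.close();
        }
    }

The same kinds of operations can also be done from the command line with the hadoop fs shell.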

Hadoop works well with really big files.

It splits files into blocks, 128 MB by default.

Each block is replicated into 3 copies, which serve as backups.

The blocks / replicas go onto the individual machines that make up the cluster.

They are distributed as evenly as possible.
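As a small, untested sketch of where those numbers live: the replication factor and block size are ordinary Hadoop configuration properties and can be overridden by the client when writing a file. The property names below (dfs.replication and dfs.block.size) are the 1.x-era names as I understand them, and the file path is invented:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSettingsSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("dfs.replication", "3");                    // 3 copies of every block
            conf.setLong("dfs.block.size", 128L * 1024 * 1024);  // 128 MB blocks

            FileSystem fs = FileSystem.get(conf);
            FSDataOutputStream out = fs.create(new Path("/tmp/big-file.dat"));
            out.writeUTF("HDFS, not this code, decides which machines get the blocks");
            out.close();
            fs.close();
        }
    }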

The NameNode watches over the other nodes, kind of like a 'pointer' in the C++ language.

The NameNode watches the DataNodes looking for failures, and re-replicates blocks as needed.

If all the nodes holding a block are lost, that's bad; if you lose all but one, you are okay, because it will replicate the lost copies back from the surviving one.

It's basically a file system application written in Java.

The NameNode is a single point of failure, though it doesn't fail that often.

You can query this file system using 'Map-Reduce' jobs.

The map step turns everything into key/value pairs.

Those pairs are grouped by key and fed into a reducer.

The reducer takes keys and values and creates new keys and values.
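To make the key/value idea concrete, here's the classic word-count example sketched against Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce). The class names are my own and I haven't compiled this yet, so treat it as a sketch:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: for each line of input, emit a (word, 1) pair per word.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // key = the word, value = 1
                }
            }
        }
    }

    // Reducer: sum the 1s for each word, emitting a new (word, total) pair.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

The mapper emits (word, 1) pairs, Hadoop groups them by word, and the reducer emits new (word, total) pairs - exactly the keys-in, keys-out flow described above.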

The Elephant represents Hadoop.

Then you can query this data using a new language called 'Pig'.

This is equivalent to the 'Assembly' language of long ago, very tedious.

And then there's another language called 'Hive', which is very similar to SQL and is widely used.

The Apache 'HBase' project, a database that runs on top of Hadoop, is another one gaining steam.

And there's another 'real time' HBase-style store called 'Accumulo'.

Which has cell-level security.
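For a taste of what talking to HBase looks like from Java, here's an untested sketch using the client API from around this era; the table name, column family, and values are all made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "users");   // made-up table name

            // Write one cell: row "row1", column family "info", column "name"
            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Grace"));
            table.put(put);

            // Read it back
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));

            table.close();
        }
    }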

And many more new languages with different features.

And there's another project for data serialization called 'Avro'.

It defines data serialization that keeps the schema with all the data, like a map which travels along with the data.
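Here's a rough, untried example of that idea with Avro's Java API; the record schema and field names are invented, but the point is that the schema gets written into the users.avro file right alongside the data:

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroSketch {
        public static void main(String[] args) throws Exception {
            // A made-up record schema, defined in Avro's JSON schema language.
            Schema schema = new Schema.Parser().parse(
                  "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}");

            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "Grace");
            user.put("age", 40);

            // The data file carries the schema along with the records.
            DataFileWriter<GenericRecord> writer =
                  new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
            writer.create(schema, new File("users.avro"));
            writer.append(user);
            writer.close();
        }
    }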

You can really store a lot of data in Hadoop.

And here's a tutorial online from Cloudera.

And another YouTube link.

And this link gives a good summary.

And one last link.

I started to download Hadoop at work today and got the files unzipped and loaded, then downloaded the Java JDK from Oracle / Sun. I still need to set the JAVA_HOME config setting.


This will take some time and patience, that's for sure!

