Hadoop Basics

Data Gosinta (goes into)

When thinking about Hadoop, we think of data: how to get data into HDFS and how to get it out again.  Luckily, Hadoop has some popular tools to accomplish this.

SQOOP was created to move data back and forth easily between an External Database or flat file and HDFS or HIVE.  There are standard commands for Importing and Exporting data.  When data is imported to HDFS, SQOOP creates files in a folder on the HDFS file system.  Those folders can be partitioned in a variety of ways.  Data can be appended to the existing files through SQOOP jobs.  And you can add a WHERE clause to pull just certain data.  For example, bring in only yesterday's data and run the SQOOP job daily to populate Hadoop.
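As a rough sketch of the daily WHERE-clause import described above, the Python below builds a SQOOP command line for yesterday's rows.  The connection string, table, column and HDFS folder are made-up placeholders, and credentials (for example --username and --password-file) are left out, so treat it as an illustration rather than a finished job.

```python
import shlex
from datetime import date, timedelta


def build_sqoop_import(jdbc_url, table, date_column, target_dir, run_date=None):
    """Return the argument list for a daily Sqoop import of yesterday's rows."""
    run_date = run_date or date.today()
    yesterday = run_date - timedelta(days=1)
    # Restrict the import to a single day of data.
    where = f"{date_column} >= '{yesterday}' AND {date_column} < '{run_date}'"
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--where", where,
        "--target-dir", target_dir,
        "--append",            # add new files to an existing HDFS folder
        "--num-mappers", "1",  # single mapper, so no --split-by column is needed
    ]


cmd = build_sqoop_import(
    "jdbc:mysql://dbhost/sales",  # hypothetical connection string
    "orders",                     # hypothetical source table
    "order_date",                 # hypothetical date column
    "/data/sales/orders",         # hypothetical HDFS folder
    run_date=date(2015, 3, 2),
)
# Print a shell-safe version of the command instead of running it.
print(shlex.join(cmd))
```

Scheduling this once a day (with run_date left at its default) gives the "bring in yesterday's data" pattern.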

Once data is in HDFS, you can add a layer of HIVE on top, which gives the data a relational structure.  Once applied, the data can be queried with HIVE SQL.  When creating a table in the HIVE database schema, you can create an External table, which is basically a metadata pass-through layer that points to the actual data.  So if you drop the External table, the data remains intact.
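To make the External table idea concrete, here is a small sketch that generates the HIVE statement for one.  The table name, columns and HDFS location are invented for the example, and it assumes comma-delimited text files; the point is only that the table definition carries a LOCATION pointing at data that lives outside HIVE.

```python
def external_table_ddl(table, columns, location, delimiter=","):
    """Build a HiveQL CREATE EXTERNAL TABLE statement for delimited text files."""
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


ddl = external_table_ddl(
    "orders",  # hypothetical table name
    [("order_id", "INT"), ("order_date", "STRING"), ("amount", "DOUBLE")],
    "/data/sales/orders",  # hypothetical HDFS folder holding the files
)
print(ddl)
```

Dropping a table created this way removes only the definition; the files under the LOCATION folder stay where they are.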

From HIVE SQL, the tables are exposed through ODBC, which allows the data to be accessed by Reports, Databases, ETL tools, etc.

So as you can see from the basic description above, you can move data back and forth easily between Hadoop and your Relational Database (or flat files).

In addition, you can use a Hadoop language called PIG (not making this up) to massage the data in a structured series of steps, a form of ETL if you will.
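PIG scripts are written in PIG's own language, not Python, but the step-by-step style can be imitated in a few lines.  The sketch below uses made-up rows and mirrors three typical steps (filter, group, aggregate); it only shows the shape of such a pipeline, not how PIG itself runs.

```python
from collections import defaultdict

# Hypothetical rows, as if loaded from a comma-delimited file in HDFS.
rows = [
    ("2015-03-01", "east", 120.0),
    ("2015-03-01", "west", 80.0),
    ("2015-03-01", "east", -5.0),  # bad record, filtered out below
    ("2015-03-02", "east", 40.0),
]

# Step 1 (like FILTER): keep rows with a positive amount.
valid = [row for row in rows if row[2] > 0]

# Step 2 (like GROUP ... BY): bucket the amounts by region.
by_region = defaultdict(list)
for day, region, amount in valid:
    by_region[region].append(amount)

# Step 3 (like FOREACH ... GENERATE): one total per region.
totals = {region: sum(amounts) for region, amounts in by_region.items()}
print(totals)
```

Each step takes the output of the previous one, which is what makes this style a natural fit for ETL work.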

Hybrid Data Warehouse
You can keep the data up to date by using SQOOP, then add data from a variety of systems to build a Hybrid Data Warehouse.  Data Warehousing is a concept, a documented framework to follow with guidelines and rules.  Storing the data in both Hadoop and Relational Databases is typically known as a Hybrid Data Warehouse.

Connect to the Data
Once data is stored in the Hybrid Data Warehouse (HDW), it can be consumed by users via HIVE ODBC or Microsoft PowerBI, Tableau, Qlikview or SAP HANA or a variety of other tools sitting on top of the data layer, including Self Service tools.

Machine Learning
In addition, you could apply MAHOUT Machine Learning algorithms on your Hadoop cluster for Clustering, Classification and Collaborative Filtering.  And you can run Statistical analysis with the R language, for example with the Hadoop version of R from Revolution Analytics.

And you can receive Streaming Data.

There's Zookeeper, which is a centralized service for keeping track of things such as configuration and coordination across the cluster.

And Giraph, which gives Hadoop the ability to process Graph connections between nodes.

In Memory
And Spark, which allows faster processing by bypassing Map Reduce and running In Memory.

You can run your Hybrid Data Warehouse in the Cloud with Microsoft Azure HDInsight (on Blob storage) or Amazon Web Services.

On Premise
You can run On Premise with IBM InfoSphere BigInsights, Cloudera, Hortonworks and MapR.

Hadoop 2.0
And with the latest Hadoop 2.0, there's the addition of YARN, which is a new layer that sits between HDFS2 and the application layers.  Although Map Reduce was originally designed as the sole batch-oriented approach to processing data in HDFS, it's no longer the only way.  SQL queries over HIVE tables have been sped up through Impala, which completely bypasses Map Reduce, and through the Stinger initiative, which sits atop Tez.  Stinger also stores data in a compressed columnar format (ORC), which allows the interaction to be sped up.

New Features 2.0
With Hadoop 2.0, you can now monitor your clusters with Ambari, which has an API layer for 3rd party tools to hook into.  A well-known limitation of Hadoop has been Security, which has now been addressed as well.

HBase is a separate database that allows random read/write access to the HDFS data, and surprisingly it too sits on the HDFS cluster.  Data can be ingested into HBase and interpreted On Read, which Relational Databases do not offer.
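The "interpreted On Read" idea can be shown without any Hadoop software at all.  In the sketch below a plain Python dictionary stands in for the store (it is not the HBase API), the row keys and values are invented, and the structure is applied by a parser only at the moment a row is read.

```python
import json

# Hypothetical store: row key -> raw bytes, written without any schema check.
store = {}


def put(row_key, raw_bytes):
    """Write the bytes as-is; nothing is validated or parsed on ingest."""
    store[row_key] = raw_bytes


def read_as(row_key, parser):
    """Apply a schema (the parser) only when the row is read."""
    return parser(store[row_key])


def parse_json(raw):
    doc = json.loads(raw)
    return {"name": doc["name"], "visits": int(doc["visits"])}


def parse_pipe(raw):
    name, visits = raw.decode("utf-8").split("|")
    return {"name": name, "visits": int(visits)}


# Two rows in two different formats land in the same store.
put("user:1", b'{"name": "Ana", "visits": "7"}')
put("user:2", b"Bob|3")

print(read_as("user:1", parse_json))
print(read_as("user:2", parse_pipe))
```

A Relational Database would reject the second row at insert time unless it matched the table definition; here the decision about what the bytes mean is deferred to the reader.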

Sometimes when developing, users don't know where data is stored.  And sometimes the data can be stored in a variety of formats, because HIVE, PIG and Map Reduce can each have their own data model.  So HCatalog was created to alleviate some of the frustration.  It's a table abstraction layer, metadata service and shared schema for PIG, HIVE and Map Reduce.  It exposes info about the data to applications.

Here's a quick diagram showing the basics of SQOOP, Hive, HDFS, HIVE ODBC, etc.

I hope you enjoyed this blog on Hadoop basics.