8/25/2014

My Point of View on Hadoop Today

In the world of Data Warehousing, your data resides in four places:
  1. Variety of Data Sources (structured Relational data)
  2. Extract Transform and Load (ETL)
  3. Data Warehouse Repository (Star / Snowflake Schema)
  4. Analytics / Presentation layer (OLAP Cube)
You could also argue the Business Layer exists within the Data Warehouse and sometimes the Analytics layer.  The DW developer is tasked with scrubbing and aligning the data, denormalizing it for speed, removing duplicates, and cleansing and conforming it to DW best practices.

In the world of Hadoop, you have a similar architecture:
  1. Variety of Data Sources (structured and unstructured data)
  2. HDFS (Hadoop Distributed File System) layer containing a variety of file types
  3. Extract Load and Transform (ELT)
  4. Analytics / Presentation layer
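
To make the difference in ordering between ETL and ELT concrete, here is a minimal Python sketch.  The sample records, the function names (transform, etl, elt) and the in-memory lists standing in for the DW repository and HDFS are all assumptions for illustration; none of this is a Hadoop or DW API.

```python
# Hypothetical raw records from a source system; the field names are made up.
RAW = [
    {"id": "1", "name": " Alice ", "amount": "10.50"},
    {"id": "2", "name": "bob", "amount": "3"},
    {"id": "2", "name": "bob", "amount": "3"},  # duplicate row
]


def transform(records):
    """Cleanse and conform: trim names, cast types, drop duplicate ids."""
    seen = set()
    clean = []
    for r in records:
        key = int(r["id"])
        if key in seen:
            continue
        seen.add(key)
        clean.append({"id": key, "name": r["name"].strip().title(),
                      "amount": float(r["amount"])})
    return clean


def etl(records):
    """Data warehouse style: transform first, then load only the clean rows."""
    warehouse = []                      # stand-in for the DW repository
    warehouse.extend(transform(records))
    return warehouse


def elt(records):
    """Hadoop style: load raw data as-is, transform later when it is read."""
    raw_store = list(records)           # stand-in for files landed in HDFS
    return raw_store, transform(raw_store)


dw_rows = etl(RAW)
raw_store, view = elt(RAW)
print(len(dw_rows), len(raw_store), len(view))
```

Both paths end with the same clean rows.  The difference is that ELT also keeps the untouched raw data around, so a different transform can be applied later without going back to the source.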

Data Warehousing
A Data Warehouse must contain relational data.  It has size limitations; at some point query response degrades.  It requires beefed-up servers and is costly to host, maintain and enhance.  Good developers are hard to find.  If business rules change, or there is a merger or acquisition, it is often difficult to merge data with other repositories.  On the plus side, the DW has a solid methodology, proven over the past 20 years, with repeatable patterns.

Hadoop
Data can be relational, but that is not required.  Hadoop handles a greater volume of data.  Cost is reduced thanks to commodity hardware and lower licensing and server requirements.  It can integrate into existing Data Warehouses.  Good developers are difficult to find.  The number of Hadoop components can be overwhelming, and it is daunting for developers to learn them all and stay current.  SQL on Hadoop opens the door to existing skill sets, bypassing complex MapReduce coding.  There are a number of third-party offerings to leverage.  Hadoop is ever evolving.
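
As a rough illustration of why SQL lowers the barrier, the Python sketch below computes the same per-region total twice: once as hand-written map, shuffle and reduce steps, and once as a single SQL statement.  The sales data and function names are hypothetical, and Python's built-in sqlite3 is used only as a stand-in for a SQL-on-Hadoop engine; it is not how a real cluster runs the query.

```python
import sqlite3
from collections import defaultdict

# Hypothetical (region, sale amount) records.
SALES = [("east", 10), ("west", 5), ("east", 7), ("west", 1), ("north", 4)]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for region, amount in records:
        yield region, amount


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values to a single total."""
    return {key: sum(values) for key, values in groups.items()}


mr_totals = reduce_phase(shuffle(map_phase(SALES)))

# The same aggregation as one SQL statement (sqlite3 is only a stand-in here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", SALES)
sql_totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
conn.close()

print(mr_totals == sql_totals)
```

Anyone who already writes GROUP BY queries against a Data Warehouse can read the second half at a glance, which is the skill set SQL on Hadoop taps into.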

Conclusion
Given the extreme hype over the past year or two, some people, including me, have suggested that a hype factor is at work.  Hadoop did not replace the Data Warehouse.  It enhanced it, by creating Hybrid Data Warehousing: the best of both worlds.  Which means that finding the right skill set has gotten even more difficult.

My Viewpoint
I see more and more people interested in Hadoop, even the ones who had no idea what it was a few years ago.  Many IT people realize they must learn about Hadoop just to stay current.  In contrast, not many of their organizations have production-level clusters; they may or may not have 10-node clusters as sandboxes to interrogate data sources, run sentiment analysis and process large batch jobs.

Future of Hadoop
Hadoop 1.0 is in the past.  Hadoop 2.0 is here, bringing YARN, with Tez, Docker and a slew of other offerings alongside it.  The ecosystem has fragmented into many pieces and many vendors, with no one-size-fits-all solution.  But there's still a lot of opportunity to be had.  With Machine Learning, building models for Artificial Intelligence, large volumes of data, and processing unstructured and disparate data sources, I feel that Hadoop will be part of my career.  And if your job consists of collecting, processing, parsing or analyzing data, chances are it will become part of your skill set as well.