Hadoop is Gaining Traction in Building out Data Ecosystems

This week I got to assist on a Hadoop installation of a Master Node and three Data Nodes.  We used the Ambari installation.

At first, the install was done manually on Red Hat Linux.  I spent a good while troubleshooting, poking through all the directories and configuration files.

Then we decided to use the automated scripts.  First, the Ambari server was set up on the Master node.

And the PostgreSQL database was installed and configured.

Then we stepped through the process, applied the correct settings, and ran it through.  It threw errors because some services would not start.  So we troubleshot and tried again.

I think the major issues were that the $JAVA_HOME path was not set in all the right places, and that the HOSTNAME= setting needed the actual fully qualified domain name instead of localhost.  Using ROOT as the default user was part of it as well.
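For anyone hitting the same two problems, here is a minimal pre-flight sketch in Python.  It is not part of Ambari or Hadoop; the function names check_java_home and check_hostname are made up for illustration, and it only looks at the current process environment and the host's resolved name, not at any Hadoop configuration file.

```python
import os
import socket


def check_java_home(env):
    """Return a list of problems with JAVA_HOME in the given environment mapping."""
    problems = []
    java_home = env.get("JAVA_HOME", "")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif not os.path.isdir(java_home):
        problems.append(f"JAVA_HOME points to a missing directory: {java_home}")
    return problems


def check_hostname(name):
    """Return a list of problems with a host name intended for cluster use."""
    problems = []
    cleaned = name.strip().lower()
    if cleaned in ("", "localhost", "localhost.localdomain"):
        # Loopback names are exactly what tripped up the install described above.
        problems.append(f"host name {name!r} is a loopback name, not a real FQDN")
    elif "." not in cleaned:
        problems.append(f"host name {name!r} has no domain part")
    return problems


if __name__ == "__main__":
    # Check this machine: the environment of the current shell and its resolved name.
    issues = check_java_home(os.environ) + check_hostname(socket.getfqdn())
    for issue in issues:
        print("WARN:", issue)
    if not issues:
        print("JAVA_HOME and host name look sane")
```

Running something like this on each node before kicking off the install would have flagged the localhost and missing $JAVA_HOME cases early, at least for the account running the check.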

After that, we booted up, all services were running, mission accomplished.  Until then I had been doing all my Hadoop development locally on a laptop, running Hadoop on Hyper-V with 3 or 4 GB of RAM.  Now that I have ported the Visual Studio 2015 project over to the client servers, pointing to SQL Server 2016 and the Hadoop cluster, it runs really fast.

At this point, we're ingesting some data from Excel into HDFS, running ETL with Pig, mounting Hive tables, then Hive ORC tables, cleaning up the file remnants along the way (don't be a litterbug!), and finally pulling that data into SQL Server 2016 using PolyBase.
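To make the order of those steps concrete, here is a rough Python sketch that only builds the command lines and prints them; it does not run anything.  It assumes the Excel sheet has already been exported to a CSV file, and the file names, HDFS path, and script names are placeholders, not the ones from the actual project.  The PolyBase step happens on the SQL Server side and is left out.

```python
import shlex


def staging_commands(local_csv, hdfs_dir):
    """Commands to create an HDFS staging directory and upload one file into it."""
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        ["hdfs", "dfs", "-put", "-f", local_csv, hdfs_dir],
    ]


def transform_commands(pig_script, hive_script):
    """Commands to run the Pig ETL script, then the Hive script that builds the tables."""
    return [
        ["pig", "-f", pig_script],
        ["hive", "-f", hive_script],
    ]


def cleanup_commands(hdfs_dir):
    """Command to remove the staging directory once the ORC tables are loaded."""
    return [["hdfs", "dfs", "-rm", "-r", hdfs_dir]]


def build_pipeline(local_csv, hdfs_dir, pig_script, hive_script):
    """Return every command in the order it should run: stage, transform, clean up."""
    return (
        staging_commands(local_csv, hdfs_dir)
        + transform_commands(pig_script, hive_script)
        + cleanup_commands(hdfs_dir)
    )


if __name__ == "__main__":
    # Placeholder names for illustration only.
    for cmd in build_pipeline("sales.csv", "/staging/sales", "clean.pig", "load.hql"):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Keeping the cleanup as the last command in the list is the "don't be a litterbug" part: the staging files only go away after the Hive script has had its chance to load the ORC tables.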

What's next?  Adding Data Quality Services and Master Data Services, then flowing into a Data Warehouse using Dim and Fact tables, and finally pushing the data into an Analysis Services Tabular Model for consumption.

There are a few other things that need to be done as well, like setting up Kerberos, creating some generic users/groups to run the services, and standardizing the directory structures along the way.

Hadoop was sold as the next big thing: shove all your data in, find unlimited insights.  Then the hype wore off, because it's basically a bundled set of mini applications that is fairly complex to set up and administer, and there was a lack of qualified resources to develop on it.  SQL developers did not know Java, DBAs didn't like to code, and traditional Java programmers didn't know the data layer.

At this point in time, years later, there's still a learning curve, but the tools to push the data through have gotten better.  We are no longer required to write Java to build map/reduce jobs, and we have many additions that mimic traditional Data Warehousing concepts, along with Machine Learning, Graph databases, workflows, security, and so on.

So now we can leverage our Data Warehouses, whether new or long established, and add more data sources, including unstructured and semi-structured data, along the path.  I could definitely see more organizations taking advantage of this paradigm to beef up their data ecosystems and find those "insights" we were promised many years ago.

It's now a data-centric world.  Hop on board.  Hadoop is getting traction.