How Unstructured Data Becomes Structured - Can Hadoop Replace the EDW?

Hadoop comes in many flavors, as in vendor distributions.  With each one you get a level of support, the latest patches, priority tickets and perhaps training discounts, as well as the core components that make up Hadoop.

And Hadoop is gaining traction in the world today, because it handles unstructured data and huge volumes of data, and it runs in a parallel, distributed architecture on commodity hardware.

And if you watch videos or attend conferences, they show you in less than an hour how you too can leverage the awesome power of Hadoop.

Here's a question I've been thinking about lately.  We hear that the amount of data is increasing at astonishing rates.  Most of that data is unstructured.  And Hadoop is the tool used to process that unstructured data.  How?  By converting the unstructured data to relational data, applying Hive SQL (HiveQL) and exposing the data to external programs to visualize, slice and consume.

Well, let's look at that scenario for just a minute.  "Converting the unstructured data to relational data."  What does that mean exactly?  Who converts it?  How do they convert it?  How do you place structure on the unstructured?  How do you form logic on top of chaos?  Well, the data is not completely unstructured.  The structure is just not known at the time it's written, so it's not schema on write, it's schema on read.

So let's say a Hadoop developer is tasked with that exact process.  He or she ingests the flat files into HDFS, so the developer knows where the data was derived from, the source, and can peruse the file to look for patterns.  Then come the decisions: the first 9 characters look like a first name, the next 25 characters look like a last name, then the user id, then the web page name, etc.  The next line, however, is different.  It has a date time stamp and some other gibberish.  The developer has no real idea of what he or she is looking at unless someone explains it, there's a data dictionary already established, or the developer makes a guess at how the data is structured.
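To make that guessing game concrete, here is a minimal Python sketch of slicing a line by guessed column widths.  Only the 9 and 25 character widths come from the example above.  The user id and page widths, the field names and the sample lines are all made up for illustration, and a real file would need a real data dictionary.

```python
from typing import NamedTuple


class Field(NamedTuple):
    name: str
    width: int


# Guessed layout. Only the first two widths come from the example in the post;
# the user_id and page widths are assumptions for illustration.
GUESSED_LAYOUT = [
    Field("first_name", 9),
    Field("last_name", 25),
    Field("user_id", 8),
    Field("page", 20),
]


def parse_fixed_width(line, layout=GUESSED_LAYOUT):
    """Slice one line by the guessed widths; return None if the line is too short."""
    total = sum(f.width for f in layout)
    if len(line) < total:
        return None  # does not fit the guess, so set it aside for a human to look at
    record, pos = {}, 0
    for f in layout:
        record[f.name] = line[pos:pos + f.width].strip()
        pos += f.width
    return record


# One line that matches the guess, and one that looks like the "gibberish" line.
sample = "Jane".ljust(9) + "Doe".ljust(25) + "u1234567" + "/home".ljust(20)
odd = "2014-03-01 12:00:00 ???"

print(parse_fixed_width(sample))
print(parse_fixed_width(odd))
```

Notice that the code can only tell you a line does not fit the guess.  It cannot tell you what the line actually means, which is exactly the gap described above.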

So when the presenter at some conference tells you how easy it is to wrap relational logic on top of unstructured data, ask them how they do it.  From what I've heard and seen, many orgs are applying logic to the unstructured data BEFORE it enters HDFS.  They are cleaning and mapping the data before it gets into Hadoop.  Sure, then it's easy to apply relational tables on top.  Then apply HiveQL and do the analytics or whatever.

I find that to be the number one gap in awareness among newbies to Hadoop, the thing they just can't wrap their heads around: how do you convert unstructured data into reports for insight?

So how do people do it?  They get familiar with their data sources by interrogating them.  They use complex MapReduce and/or Pig Latin to massage and ETL (extract, transform and load) the data.  I don't know of a tool that offers a WYSIWYG interface to visually map the data.  Perhaps the Hadoop project needs such a tool.
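For anyone who hasn't seen the shape of a MapReduce job, here is a rough sketch in plain Python.  It is not the Hadoop API, just the map, shuffle and reduce phases imitated in one process.  The tab-separated "user id, page" line format and the function names are my own assumptions for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (page, 1) for each well-formed 'user_id<TAB>page' line; skip the rest."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) == 2 and parts[1]:
        yield parts[1], 1


def shuffle(pairs):
    """Group values by key, the way the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all the values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


raw_lines = ["u1\t/home", "u2\t/cart", "garbage line", "u3\t/home"]
print(run_job(raw_lines))
```

The cleaning happens in the map step, where lines that don't match the expected shape get dropped.  On a real cluster the same three phases are spread across many machines.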

Here's another thing.  Hadoop is great at handling huge volumes of data.  And as we know, we can create SQL tables which map to the data and query them using MapReduce or Tez or Impala or whatever.  It's simply a metadata layer between the raw data split across the data nodes and the user.  And as we know, there are connectors which allow data to flow between a relational SQL database and HDFS.  So we can freely move data back and forth between the two.
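Here is a small Python sketch of what I mean by a metadata layer.  The raw text is never rewritten.  The "table" is just a description of how to read it, applied at query time.  The table definition format, the column names and the pipe delimiter are all hypothetical, not how Hive actually stores its metadata.

```python
import csv
import io

# Raw data as it might sit in storage: untouched, delimiter-separated text.
RAW_FILE = "1|Jane|/home\n2|Raj|/cart\n"

# Hypothetical "table definition": only metadata describing how to read the raw text.
TABLE_DEF = {
    "name": "page_views",
    "delimiter": "|",
    "columns": [("user_id", int), ("first_name", str), ("page", str)],
}


def read_table(raw_text, table_def):
    """Apply the schema at read time; the raw text itself is never modified."""
    reader = csv.reader(io.StringIO(raw_text), delimiter=table_def["delimiter"])
    for row in reader:
        yield {
            name: cast(value)
            for (name, cast), value in zip(table_def["columns"], row)
        }


rows = list(read_table(RAW_FILE, TABLE_DEF))

# Roughly: SELECT page FROM page_views WHERE user_id > 1
print([r["page"] for r in rows if r["user_id"] > 1])
```

Swap in a different table definition and the same raw file reads as a different table, which is the whole point of applying the schema on read.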

Now some people have suggested that Hadoop will replace the Enterprise Data Warehouse (EDW), because it can store your data in a single location, removes the need for expensive licensing, cuts the cost of separate backups and redundancy, doesn't require a high-priced database administrator (DBA), and uses commodity hardware instead of a beefed-up super machine sitting on a SAN with a lot of memory for speed and uptime.  These are all valid points.

Except there's huge resistance from the Data Warehouse community.  We've been creating EDWs for 20+ years and now this new thing comes along to dethrone us?  That's ludicrous, the EDW will be here forever because it serves a purpose.  And by that, we are talking about a rigid database structure, moving data from Source to Stage to EDW to Cube, applying business rules along the way.  This is performed by a Data Warehouse developer, who costs a bit of money and who applies strict rules to the data.  By the way, they are only grabbing a small percentage of the actual data, because it's too expensive to keep all of it.  The server on which it resides is expensive, and you have to buy backup tools to archive it, and that's performed by server admins who cost money.  Now let's say an org goes out and builds the EDW, and then you want to add or change a few things.  That's very complex, and expensive.  Let's say you want to add a new division, say the phone call information.  It's not an easy task to splice in new data after the warehouse has already been built, so there's a level of complexity.

However, the process of building a data warehouse has been well defined over the years and it's actually quite cookie cutter.  There are rules for data governance, data quality and security, and these are all great aspects.

Getting back to Hadoop, if you are going to treat HDFS as a container to store vast amounts of data, why would you not move the EDW to Hadoop?  It can handle fast queries, it can handle the ingestion of data through connectors, you can bring in ALL the data now and not just bits and pieces, there's built-in redundancy, there's security now, you can even mash your EDW with other data residing on HDFS, and you can massage your data to perform ETL via Pig or MapReduce.

From my point of view, I see Hadoop replacing the EDW sooner rather than later.  It just needs some best practices and data governance applied to the framework.  Your EDW data structure and data could be ingested into HDFS in the exact same format you have today.  You would just need to tweak the ETL process and point the source data at HDFS instead of the EDW.

And that will allow your org to leverage the awesomeness of Hadoop, by allowing more data to be stored in a parallel, distributed architecture, with fast querying capability, built-in redundancy and security, all running on commodity hardware.

Because what you are really after is the parallel processing in a distributed architecture.  You don't care much about how the data is stored in HDFS.  And you don't care about the underlying techniques of how SQL runs on the cluster.  You just want a repository for your data which is consumable in near real time and reliable in an enterprise environment.

So HDFS is not the main concern, nor is MapReduce.  What you're after is the architecture: the Name Node and Data Nodes, as well as the Job Tracker and Task Trackers (under YARN in Hadoop 2, those two roles are taken over by the ResourceManager, ApplicationMasters and NodeManagers).  Those are the only things which have not been gutted in the Hadoop architecture.  You can use blob storage or other tools besides HDFS.  And MapReduce is quickly being phased out.

Yet if you think about it, the Nodes and the Trackers are the actual core of Hadoop, still intact as they were in 2006.  That's the brains behind Hadoop in my opinion.  The Data Nodes store the data, segmented into chunks and distributed across the cluster, and the Name Node knows which Data Nodes hold each chunk.  The Job Tracker keeps track of what is running where as it sends mini jobs to each of the Task Trackers, and the Task Trackers report back through heartbeats every few seconds so it knows each server is available.
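Here is a toy Python model of that bookkeeping, just to show how little is needed to express the idea.  It is not Hadoop code.  The class name, the tiny block size, the replication count and the heartbeat timeout are all made-up values, and for simplicity the one class tracks both block locations and heartbeats.

```python
import itertools

BLOCK_SIZE = 4          # bytes per block; tiny on purpose (real HDFS blocks are far larger)
REPLICATION = 2         # copies of each block; assumed value for this toy
HEARTBEAT_TIMEOUT = 10  # seconds of silence before a node counts as dead; assumed value


class ToyNameNode:
    """Remembers which nodes hold each block and which nodes checked in recently."""

    def __init__(self, data_nodes):
        self.data_nodes = list(data_nodes)
        self.block_map = {}       # (filename, block_index) -> list of node names
        self.last_heartbeat = {}  # node name -> time of the last heartbeat

    def write(self, filename, data):
        """Split data into blocks and assign each block to REPLICATION nodes, round-robin."""
        ring = itertools.cycle(self.data_nodes)
        for offset in range(0, len(data), BLOCK_SIZE):
            index = offset // BLOCK_SIZE
            self.block_map[(filename, index)] = [next(ring) for _ in range(REPLICATION)]

    def heartbeat(self, node, now):
        self.last_heartbeat[node] = now

    def live_nodes(self, now):
        """Nodes that sent a heartbeat within the timeout window."""
        return [
            n for n in self.data_nodes
            if now - self.last_heartbeat.get(n, float("-inf")) <= HEARTBEAT_TIMEOUT
        ]


nn = ToyNameNode(["dn1", "dn2", "dn3"])
nn.write("log.txt", b"abcdefghij")  # 10 bytes -> 3 blocks
nn.heartbeat("dn1", now=0)
nn.heartbeat("dn2", now=8)

print(nn.block_map)
print(nn.live_nodes(now=12))  # dn1 is too old, dn3 never reported
```

The real system layers a lot on top of this, such as re-replicating blocks when a node goes quiet, but the core is a map of where things live plus a regular "are you still there" check.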

So this is one of my longer blog posts, and it covers a few different topics within the Hadoop ecosystem.  We talked about the gap in knowledge of how unstructured data is converted to relational data, how Hadoop is primed and ready to take on EDW workloads and house your data warehouse, and finally what makes Hadoop so powerful besides HDFS and MapReduce: the Nodes and the Trackers.

Hope you enjoyed it, and thanks for reading!