11/02/2013

Store All Data in #Hadoop

If you've been keeping up with things lately, you'll notice that Hadoop is taking off like wildfire.

And why is that?  Because it can handle huge volumes of data, a variety of different kinds of data, and data acquired at rapid rates.

And how does it process data?  Batch-oriented parallel processing across commodity hardware.
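
To make the batch-parallel idea a bit more concrete, here is a minimal sketch of the map, shuffle and reduce steps in plain Python. It is not Hadoop code: the thread pool only stands in for worker nodes, and the function names and the sample input are made up for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input lines."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffle: group the emitted values by key across all mapper outputs."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks, workers=2):
    """Run the map step on each chunk in parallel, then shuffle and reduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


# Hypothetical input: each inner list plays the role of one block of a file.
chunks = [
    ["hadoop stores data", "hadoop processes data"],
    ["data arrives fast"],
]
counts = word_count(chunks)
```

Each chunk is processed independently, which is what lets the real thing spread the work across many cheap machines.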

SQL on Hadoop
So for the past year, all the excitement has surrounded SQL on Hadoop.  There's one version called Impala, which is fast because it bypasses Map-Reduce altogether.

And there's another version called Stinger, which speeds up queries by re-architecting some of the back end: it prepares processes in advance so Map-Reduce doesn't have to spin them up every time, and it leverages memory.

HIVE
So you can apply a metadata layer on top of HDFS, which resides in the Hive data warehouse, in HCatalog.  The metadata is a framework for getting at the underlying data, which can be either Managed (the data resides in the Hive workspace) or External (the table is just a pointer to the actual data).
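
Here is a small sketch of how the two kinds of tables differ at creation time. The Python helper is hypothetical and only builds a HiveQL statement as a string; the table name, columns and HDFS path are made up. The one thing it relies on is that Hive distinguishes `CREATE TABLE` from `CREATE EXTERNAL TABLE ... LOCATION`.

```python
def create_table_ddl(name, columns, external_location=None):
    """Build a HiveQL CREATE statement for a managed or an external table.

    columns is a list of (column_name, hive_type) pairs.
    With no location, the table is managed and Hive keeps the data in its own
    warehouse directory. With a location, the table is external and only
    points at files that already live at that path.
    """
    cols = ", ".join(f"{col} {typ}" for col, typ in columns)
    if external_location is None:
        return f"CREATE TABLE {name} ({cols})"
    return (
        f"CREATE EXTERNAL TABLE {name} ({cols}) "
        f"LOCATION '{external_location}'"
    )


# Hypothetical table and path, for illustration only.
columns = [("user_id", "STRING"), ("url", "STRING")]
managed = create_table_ddl("page_views", columns)
external = create_table_ddl("page_views_raw", columns, "/data/raw/page_views")
```

The practical difference shows up when you drop the table: dropping a managed table removes the data along with the metadata, while dropping an external table removes only the metadata and leaves the files where they are.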

What's Next
So what's next for Hadoop?  My guess is that it will keep gaining ground on traditional databases.  Perhaps Hadoop in the future will contain 'all data', including transactional.

The main barrier so far, in my opinion, is that you can only insert data.  You really can't do Updates or Deletes on the raw data in the files; you can only add to it.

There is work going on to make HiveQL ANSI-compliant, which means it will be very similar to the SQL used in traditional databases.  This will reduce the need for complex Map-Reduce jobs and allow more developers to get up to speed quickly, as well as leverage all the decades of experience writing SQL.

One Location
Think about the scenarios.  If you already have the transactional data within Hadoop, there's less need for importing and exporting huge volumes of data, which will speed up development time.  And you won't have to structure the data on 'write'; you can structure the data on 'read', if ever.
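
A rough sketch of structure on 'read' versus 'write', in plain Python. The list stands in for raw files sitting in HDFS, and the function names, field names and sample records are all assumptions for illustration. Nothing is validated or shaped when the data lands; each reader applies its own schema later.

```python
import json

raw_store = []  # stands in for raw files in HDFS


def write_raw(line):
    """Write: store the line exactly as it arrived, with no schema applied."""
    raw_store.append(line)


def read_with_schema(schema):
    """Read: parse each raw line and keep only the fields this reader wants.

    schema maps a field name to a function that casts the raw value.
    Fields missing from a record come back as None.
    """
    rows = []
    for line in raw_store:
        record = json.loads(line)
        rows.append({
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        })
    return rows


# Hypothetical records; note that the second one has no "note" field.
write_raw('{"order_id": "17", "amount": "19.99", "note": "gift"}')
write_raw('{"order_id": "18", "amount": "5"}')

# Two readers impose two different structures on the same raw data.
orders = read_with_schema({"order_id": int, "amount": float})
notes = read_with_schema({"order_id": int, "note": str})
```

The same raw lines serve both readers, and a reader that shows up next year can apply a schema nobody thought of today.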

Hadoop and Business Intelligence
If you think about it, the Hadoop ecosystem contains just about every facet of Business Intelligence today.  You can store data, cleanse data, ETL data, report on data, and create dashboards on data.  You can mine it, use it for predicting and clustering, and do machine learning with it.

Similar
The underlying processes for traditional databases and Hadoop are similar.  The main difference is that traditional databases max out at some point because of volume and processing power, and that's where Hadoop gets started.  So if Hadoop can handle lower-volume transactional data, it can really do both functions, and there's less of a need for traditional databases.  Perhaps it wouldn't extinguish them, just offer more functionality in a single ecosystem.

Legacy
And we still use the mainframe today, just as we still use data warehousing and traditional databases.  In the world of IT, nothing really goes away.  However, Hadoop offers a lot of the things we need to work with data, and it's gaining traction every day.

All Data
So the hype is actually turning into everyday processes, and real people are getting up to speed quickly.  Time will tell how things pan out, as anything could happen.  I'm just saying that one day in the not-too-distant future, developers may be using Hadoop a lot more than they expected.
