10/04/2016

Incorporating Hadoop into a Data Warehouse Using the Visual Studio IDE



So we went on vacation a few weeks ago.  It was great to get away, in the sense that I didn't have to work the weekend after getting back just to catch up.  Or worse, work during the vacation, taking calls and meeting sprint deadlines.  Nope, this vacation was all about relaxation.

But I programmed on vacation anyway, mostly for fun, and got up to speed on working with Hadoop from Visual Studio.

And since I've been back from vacation, I've turned that code into production code.  It reads an Excel file with 4 tabs, converts it to a text file, and pushes it to Hadoop.  Pig tweaks the data and sends it to another folder, where Hive picks it up and mounts it into a table, which then lands in an ORC binary table in Hive, including an index.  From there it will get pulled by SQL Server 2016 PolyBase, flow into a Staging database, then a Data Warehouse, and land in a Tabular Model.
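For anyone who wants the flow at a glance, here is a rough sketch that just lists the hops in order. It isn't the production code; the names `PIPELINE` and `describe` are made up for illustration, and the stage descriptions only restate what's in the paragraph above.

```python
# Each hop in the flow, in order, as (stage, what happens there).
# PIPELINE and describe are hypothetical names used only for this sketch.
PIPELINE = [
    ("Excel", "source workbook with 4 tabs"),
    ("Text file", "tabs converted to text"),
    ("Hadoop", "text file pushed to Hadoop"),
    ("Pig", "tweaks the data and sends it to another folder"),
    ("Hive table", "folder picked up and mounted into a table"),
    ("Hive ORC table", "ORC binary table, including index"),
    ("PolyBase", "SQL Server 2016 pulls the data"),
    ("Staging database", "first stop on the SQL Server side"),
    ("Data Warehouse", "loaded from staging"),
    ("Tabular Model", "final destination"),
]


def describe(pipeline):
    """Join the stage names into a single left-to-right flow string."""
    return " -> ".join(stage for stage, _ in pipeline)


print(describe(PIPELINE))
```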

Also, throw in some Data Quality Services and Master Data Services along the way.

To be honest, pulling the data from Excel has been the biggest challenge.  They say to throw all your data into Hadoop, but does that include Excel data?  There's no easy way to extract it, other than exporting to text or CSV files.  At least that's what I've seen.
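As a rough idea of what that export step can look like, here is a minimal Python sketch that writes one tab-delimited text file per worksheet. It is not the code I run in production, which lives in Visual Studio. The function names `read_workbook` and `sheets_to_text` are made up for illustration, and reading the workbook assumes the third-party openpyxl package is installed.

```python
import csv
from pathlib import Path


def read_workbook(path):
    """Return {tab name: rows} for every tab in an .xlsx file.

    Assumption: the third-party openpyxl package is installed. It is
    imported inside the function so the rest of the sketch runs without it.
    """
    from openpyxl import load_workbook

    workbook = load_workbook(path, read_only=True, data_only=True)
    return {
        sheet.title: [list(row) for row in sheet.iter_rows(values_only=True)]
        for sheet in workbook.worksheets
    }


def sheets_to_text(sheets, out_dir, delimiter="\t"):
    """Write one delimited text file per tab and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, rows in sheets.items():
        target = out_dir / f"{name}.txt"
        with target.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
            for row in rows:
                # Empty Excel cells come back as None; write them as blanks.
                writer.writerow(["" if cell is None else cell for cell in row])
        written.append(target)
    return written
```

Something like `sheets_to_text(read_workbook("input.xlsx"), "out")` would then leave one text file per tab, ready to push to Hadoop. The file and folder names here are placeholders.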

Best part about it, I'm still living in the Visual Studio world, with some Linux and Hadoop as well.  It's a nice addition to the traditional data warehouse ecosystem.  I could see a lot more businesses incorporating this methodology.  Sure, I spent a few hours troubleshooting issues, but the bottom line is the data is flowing end to end in under a few weeks.

Perhaps there are tons of developers working in this space, but I think this would be a great place to hang out for a while.  Extract, Transform and Load (ETL) is still the trickiest part of the Business Intelligence life cycle, and Hadoop provides an extra tool for enhancing existing data warehouses that have been operational for decades.  Not to mention the cloud aspect and downstream visualizations.  Once you get into the weeds of big data, it's kind of cool.

And there you have it~!


Bloom Consulting Since Year 2000