9/22/2016

Visual Studio 2015 Hadoop Component ETL using Hortonworks Hyper-V Sandbox [Part-3]

In Part 1 of this series, we installed Visual Studio 2015 Community edition and configured our Hortonworks Sandbox 2.4 in Hyper-V.

In Part 2, we created a project in Visual Studio Data Tools for SSIS that implemented the Hadoop File System Task.  We ingested a local file from the host laptop and sent it directly to a directory in HDFS on the Hyper-V virtual machine sandbox.

In this Part 3, we are going to consume the recently ingested text file from HDFS on the Hyper-V single-node cluster sandbox using Pig, and then manipulate it.

First, we open Visual Studio 2015 Community edition, open our existing project, add a new package, and drag the Hadoop Pig Task component onto the Control Flow canvas:


Next, we create a new connection to Hadoop. 


This time we use WebHCat instead of WebHDFS, since the Hadoop Pig Task submits its script through the WebHCat (Templeton) REST service:


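If you want to sanity-check that WebHCat is reachable from the host before configuring the connection, a quick request to its status endpoint should return a small JSON payload. The hostname below is the sandbox default from Part 1; substitute your VM's IP address if you didn't set up name resolution, and note that 50111 is WebHCat's default port:

curl "http://sandbox.hortonworks.com:50111/templeton/v1/status"
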
Enter the information in the Hadoop Pig Task Editor, assigning the Hadoop Connection Manager along with the Pig script:


The Pig script is as follows. It reads our Book1.txt file from the /user/root/test/ folder in HDFS, filters out the header row, keeps only the DateID and Description columns, and stores the result back into HDFS:


-- Load the tab-delimited file (PigStorage() with no arguments defaults to tab as the field delimiter)
A = LOAD '/user/root/test/Book1.txt' USING PigStorage() AS (DateID:chararray, Description:chararray, ActiveFlag:int);
-- Drop the header row, which has the literal value 'Date' in the first column
B = FILTER A BY DateID != 'Date';
-- Keep only the DateID and Description columns
C = FOREACH B GENERATE DateID, Description;
-- Write the result back to HDFS (STORE creates a directory of part files)
STORE C INTO '/user/root/test/book1.out';
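
As a side note, if you'd like to test the script outside of SSIS first, you can SSH into the sandbox, save the statements above to a file (book1.pig below is just a hypothetical name and location), and run it with the Pig client directly:

pig /root/book1.pig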




Our file has three fields: Date, Description, and ActiveFlag:


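The actual contents of Book1.txt aren't reproduced here, but a hypothetical tab-delimited version (remember, PigStorage() with no arguments expects tab-delimited fields) would look something like this, including the 'Date' header row that the FILTER step strips out:

Date	Description	ActiveFlag
2016-09-01	Sample entry one	1
2016-09-02	Sample entry two	0
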
Execute SSIS Package:


Ran with no errors:


We see the runtime details:


Along with SSIS Output:


And finally, we run over to HDFS and see our newly created output, book1.out (STORE writes a directory containing the job's part files):


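If you prefer the command line to the file browser, you can also verify the output from the sandbox shell; something along these lines should list the output directory and dump its contents (the exact part file names can vary by job):

hdfs dfs -ls /user/root/test/book1.out
hdfs dfs -cat /user/root/test/book1.out/part-*
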
And that's a simple demo of working with a Visual Studio 2015 Data Tools SSIS package to execute a Pig script using the Hadoop Pig Task.

View the next iteration, Part 4, where we'll mount our data into a Hive table.

Thanks for following along~!
