Visual Studio 2015 Hadoop Component ETL using Hortonworks Hyper-V Sandbox [Part-3]

In Part 1 of this series we installed Visual Studio 2015 Community edition as well as configured our Hortonworks Sandbox 2.4 in Hyper-V.

In Part 2, we created a project in Visual Studio Data Tools for SSIS that implemented the Hadoop File System Task. We ingested a local file from the host laptop and sent it directly to an HDFS directory in the Hyper-V virtual machine sandbox.

In this Part 3, we are going to consume the recently ingested text file from HDFS on our single-node Hyper-V sandbox cluster using Pig, and then manipulate it.

First, we open Visual Studio 2015 Community edition, open our existing project, add a new Package, and then drop the Hadoop Pig Task component onto our Control Flow canvas:

Next, we create a new connection to Hadoop. 

This time we use WebHCat instead of WebHDFS:
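Under the hood, WebHCat (formerly Templeton) is the REST service that accepts Pig job submissions, which is why the Pig Task needs it rather than the WebHDFS connection we used in Part 2. As a rough sketch of what such a submission looks like, here is a helper that builds the WebHCat request; the hostname, user, and HDFS paths are placeholders for your own sandbox settings, and the default WebHCat port 50111 is assumed:

```python
from urllib.parse import urlencode

def build_pig_submission(host, user, script_path, status_dir):
    """Build the WebHCat (Templeton) request that submits a Pig job.

    host, user, and the HDFS paths are placeholders -- adjust them
    to match your own sandbox. Returns the URL and the POST body;
    a real submission would POST the body to the URL.
    """
    url = f"http://{host}:50111/templeton/v1/pig"
    body = urlencode({
        "user.name": user,        # user the job runs as
        "file": script_path,      # HDFS path of the .pig script
        "statusdir": status_dir,  # where WebHCat writes stdout/stderr
    })
    return url, body

url, body = build_pig_submission(
    "sandbox.hortonworks.com", "root",
    "/user/root/test/script.pig", "/user/root/test/status")
print(url)
print(body)
```

The SSIS task handles all of this for us; the sketch is just to show which service the Hadoop Connection Manager is actually talking to.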

Enter the information in the Editor, assigning the Hadoop Connection Manager along with the Pig script:

The Pig script is as follows; it reads our tab-delimited Book1.txt file (PigStorage() defaults to the tab delimiter) from the /user/root/test/ folder in HDFS:

A = LOAD '/user/root/test/Book1.txt' USING PigStorage() AS (DateID:chararray, Description:chararray, ActiveFlag:int);
B = FILTER A BY DateID != 'Date';
C = FOREACH B GENERATE DateID, Description;
STORE C INTO '/user/root/test/book1.out';

Our file has 3 fields: Date, Description, and ActiveFlag. The FILTER step removes the header row, where the first field contains the literal text 'Date':
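To make the three Pig steps concrete, here is a small Python simulation of the same pipeline over a hypothetical tab-separated Book1.txt (the sample rows below are made up for illustration): A loads and splits the rows, B drops the header row whose first field is the literal 'Date', and C projects just the first two columns.

```python
# Hypothetical contents of Book1.txt (tab-separated, header row first).
book1 = """Date\tDescription\tActiveFlag
20160101\tNew Year\t1
20160704\tIndependence Day\t1
20161225\tChristmas\t0"""

# A = LOAD ... USING PigStorage(): split each line on tabs.
a = [line.split("\t") for line in book1.splitlines()]

# B = FILTER A BY DateID != 'Date': drop the header row.
b = [row for row in a if row[0] != "Date"]

# C = FOREACH B GENERATE DateID, Description: keep the first two fields.
c = [(row[0], row[1]) for row in b]

for date_id, description in c:
    print(f"{date_id}\t{description}")
```

The printed rows match what Pig's STORE writes out: the header gone, and only the DateID and Description columns remaining.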

Execute SSIS Package:

It ran with no errors:

We see the runtime details:

Along with SSIS Output:

And finally, we run over to HDFS and see our newly created output, book1.out:
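Note that Pig's STORE writes a directory of part files rather than a single file, so one way to inspect the result from outside the VM is a WebHDFS LISTSTATUS call against the output directory. A minimal sketch, assuming the sandbox hostname `sandbox.hortonworks.com` and the default WebHDFS port 50070 (both assumptions here):

```python
from urllib.parse import quote

def webhdfs_list_url(host, hdfs_dir, user):
    """Build a WebHDFS LISTSTATUS URL for an HDFS directory.

    host and user are placeholders for your own sandbox settings.
    A GET against this URL returns a JSON listing of the part files
    that Pig's STORE wrote under the output directory.
    """
    return (f"http://{host}:50070/webhdfs/v1{quote(hdfs_dir)}"
            f"?op=LISTSTATUS&user.name={user}")

print(webhdfs_list_url("sandbox.hortonworks.com",
                       "/user/root/test/book1.out", "root"))
```

Each part file can then be read with a follow-up `?op=OPEN` request, or simply viewed in the HDFS file browser as shown above.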

And that's a simple demo of working with Visual Studio 2015 Data Tools SSIS Package to execute a Pig Script using the Hadoop Pig Task.

Continue to Part 4, where we'll load our data into a Hive table.

Thanks for following along~!
