In Part 2, we created a project in Visual Studio Data Tools for SSIS that implemented the Hadoop File System Task. We ingested a local file from the host laptop and sent it directly to an HDFS directory in the Hyper-V virtual machine sandbox.
In this Part 3, we are going to consume our recently ingested text file from the HDFS single-node cluster sandbox running in Hyper-V using Pig, and then manipulate the file.
First, we open Visual Studio 2015 Community edition, open our existing project, add a new package, and then drop the Hadoop Pig Task component onto our Control Flow canvas:
Next, we create a new connection to Hadoop.
Enter the information in the editor, assigning the Hadoop Connection Manager along with the Pig script:
The Pig script is as follows; it reads our Book1.txt file from the /user/root/test/ folder in HDFS:
-- Load the tab-delimited file (PigStorage's default delimiter) with a typed schema
A = LOAD '/user/root/test/Book1.txt' USING PigStorage() AS (DateID:chararray, Description:chararray, ActiveFlag:int);
-- Drop the header row, whose first field is the literal string 'Date'
B = FILTER A BY DateID != 'Date';
-- Keep only the first two columns
C = FOREACH B GENERATE DateID, Description;
-- Write the result back to HDFS (STORE creates a directory of part files)
STORE C INTO '/user/root/test/book1.out';
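To make the transformation concrete, here is a minimal Python sketch of what the Pig script does, row by row. This assumes the file is tab-delimited (PigStorage's default when no delimiter is given); the sample rows below are hypothetical, not from the actual Book1.txt:

```python
def transform(lines):
    """Mimic the Pig relations A -> B -> C on an iterable of raw text lines."""
    out = []
    for line in lines:
        # A: LOAD ... AS (DateID, Description, ActiveFlag) -- split on tabs
        date_id, description, active_flag = line.rstrip("\n").split("\t")
        # B: FILTER A BY DateID != 'Date' -- drops the header row
        if date_id == "Date":
            continue
        # C: FOREACH B GENERATE DateID, Description -- drops ActiveFlag
        out.append((date_id, description))
    return out

# Hypothetical sample input in the shape of Book1.txt
sample = [
    "Date\tDescription\tActiveFlag",   # header row, filtered out
    "2016-01-01\tNew Year\t1",
    "2016-07-04\tIndependence Day\t1",
]
print(transform(sample))
```

The STORE step has no analogue here; in Pig it writes the resulting two-column rows back to HDFS.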
Our file has 3 fields: Date, Description, ActiveFlag:
Execute SSIS Package:
It ran with no errors:
We see the runtime details:
Along with SSIS Output:
And finally, we run over to HDFS and see our newly created output, book1.out:
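If you prefer to check the result from a terminal on the sandbox rather than the HDFS file browser, the standard `hdfs dfs` commands work. Note that STORE produces a directory named book1.out containing one or more part files, so the exact part-file names may vary:

```shell
# List the output directory Pig created (a directory, not a single file)
hdfs dfs -ls /user/root/test/book1.out

# Concatenate the part files to stdout to see the DateID/Description rows
hdfs dfs -cat /user/root/test/book1.out/part*
```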
And that's a simple demo of using a Visual Studio 2015 Data Tools SSIS package to execute a Pig script with the Hadoop Pig Task.
View the next iteration, Part 4, where we'll load our data into a Hive table.
Thanks for following along~!