Visual Studio 2015 Hadoop Component ETL using Hortonworks Hyper-V Sandbox [Part-2]

In Part 1, we loaded Visual Studio 2015, added the new Hadoop File System Task, and then jumped through hoops trying to get the Hyper-V Hortonworks Hadoop VM working.

Once Hadoop was running, I created a dummy Excel file with 3 columns, then exported it to a delimited txt file:

Then I set the Hadoop Connection to use WebHDFS, which is a built-in REST API for interacting with Hadoop files over HTTP:
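
For reference, WebHDFS calls are plain HTTP requests of the form http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>. Here is a minimal sketch of building such a URL in the shell. The host name sandbox.hortonworks.com, the port 50070, and the helper name webhdfs_url are my assumptions for illustration, so swap in whatever your own Hadoop Connection uses.

```shell
#!/bin/sh
# Sketch: build a WebHDFS REST URL of the form
#   http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>
# Host and port below are assumptions; use your own sandbox's values.
WEBHDFS_HOST="sandbox.hortonworks.com"
WEBHDFS_PORT="50070"

# webhdfs_url HDFS_PATH OPERATION [USER]   (hypothetical helper name)
webhdfs_url() {
    url="http://${WEBHDFS_HOST}:${WEBHDFS_PORT}/webhdfs/v1$1?op=$2"
    # user.name is only appended when a third argument is given
    if [ -n "${3:-}" ]; then
        url="${url}&user.name=$3"
    fi
    printf '%s\n' "$url"
}

# Example call (needs curl and a running sandbox, so it is left commented out):
# curl -s "$(webhdfs_url /user/root/test LISTSTATUS root)"
```

LISTSTATUS is the WebHDFS operation that lists a directory, so the commented-out curl call would be the HTTP counterpart of listing the test folder.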

Then set a local connection string to the file:

Then I went into Hadoop and created a "test" folder at "/user/root/test".

Here are the permission settings before:

Set permissions in Hadoop (a=all, u=user, g=group, o=others; r=read, w=write, x=execute):

hadoop fs -chmod a=rwx /user/root/test

Set permissions in Linux (for reference):
chmod a=rwx access.log (all: read, write, execute)
chmod u=rwx access.log (user, the file owner)
chmod g=rwx access.log (group)
chmod o=rwx access.log (others)
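
The symbolic modes above have octal equivalents (r=4, w=2, x=1, summed per user/group/others), and both chmod and hadoop fs -chmod accept the octal form too. Here is a small sketch that converts a nine-character permission string into its octal mode. The helper names perm_digit and mode_octal are made up for illustration.

```shell
#!/bin/sh
# perm_digit TRIPLET: turn one rwx triplet (e.g. "r-x") into its octal digit.
perm_digit() {
    d=0
    case "$1" in *r*) d=$((d + 4)) ;; esac
    case "$1" in *w*) d=$((d + 2)) ;; esac
    case "$1" in *x*) d=$((d + 1)) ;; esac
    printf '%s' "$d"
}

# mode_octal PERMS: turn nine characters (user, group, others) into octal.
mode_octal() {
    u=$(printf '%s\n' "$1" | cut -c1-3)
    g=$(printf '%s\n' "$1" | cut -c4-6)
    o=$(printf '%s\n' "$1" | cut -c7-9)
    printf '%s%s%s\n' "$(perm_digit "$u")" "$(perm_digit "$g")" "$(perm_digit "$o")"
}

# Example: a=rwx corresponds to rwxrwxrwx
# mode_octal rwxrwxrwx
```

So a=rwx on the test folder should be the same as running hadoop fs -chmod 777 /user/root/test.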

List a directory or file in Hadoop (-R = recursive, -d = list the directory entry itself rather than its contents):

hadoop fs -ls -d -R FOLDER_OR_FILENAME

And the permission settings after:

Then applied settings to the Hadoop File System Task component:

Then I ran the package and it succeeded. I also added an Execute SQL Task for no particular reason:

And a detailed view of the run:

And now we look at Hadoop HDFS to see if the file made it:

We see our Book1.txt file as expected. Remember to set the Type in the Hadoop File System Task to "Directory"; with "File", the file did not copy for some reason.
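
If you would rather check from the sandbox command line than from the file browser, hadoop fs -test -e returns exit code 0 when a path exists. Here is a minimal sketch, assuming the hadoop client is on the PATH. The helper name check_landed is made up for illustration.

```shell
#!/bin/sh
# check_landed HDFS_DIR FILE_NAME: report whether the file exists in HDFS.
# Relies on 'hadoop fs -test -e', which exits 0 when the path exists.
check_landed() {
    target="$1/$2"
    if hadoop fs -test -e "$target"; then
        printf 'OK %s\n' "$target"
    else
        printf 'MISSING %s\n' "$target"
    fi
}

# Example (run on a machine with the hadoop client installed):
# check_landed /user/root/test Book1.txt
```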

Next, we try to send our Excel file. I modified the File Connection to point to it in the same folder, but this time I shared the folder instead of using the c:\temp\hadoop directory.

Then I set the component to use the new File Connection. We could automate this in the future by looping through folders:
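
As a rough idea of what that looping could look like outside of the package, here is a sketch that walks a local folder and prints one hadoop fs -put command per file. This is a different route than the Hadoop File System Task (it assumes a machine with the hadoop client installed), and the helper name upload_dir and the DRY_RUN switch are made up for illustration.

```shell
#!/bin/sh
# DRY_RUN=1 only prints the commands; set it to 0 to really run them
# (that requires the hadoop client on the PATH).
DRY_RUN=1

# upload_dir LOCAL_DIR HDFS_DIR: handle every regular file in LOCAL_DIR,
# skipping subfolders. -f overwrites a file that already exists in HDFS.
upload_dir() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        if [ "$DRY_RUN" = "1" ]; then
            # Printed form is for reading only; paths with spaces are not quoted
            printf 'hadoop fs -put -f %s %s/\n' "$f" "$2"
        else
            hadoop fs -put -f "$f" "$2/"
        fi
    done
}

# Example: preview the uploads for a local folder
# upload_dir /tmp/hadoop /user/root/test
```

Keeping the dry run as the default makes it easy to eyeball the list of files before anything is written to HDFS.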

The package ran successfully, so we jump over to Hadoop to see if the file landed correctly:

We see two files, the txt and the xlsx.

And that concludes Part 2 of this series, where we copied files from our local machine (in this case, my work laptop) to the HDFS cluster running on the Hyper-V Hortonworks Sandbox 2.4.

Follow along in Part 3, where we apply Pig scripts to our data set, then mount the data in Hive, and perhaps query the Hive table.

Here's Part 1 in case you missed the setup details.

As always, thanks for reading!