9/23/2016

Visual Studio 2015 Hadoop Component ETL using Hortonworks Hyper-V Sandbox [Part-4]

This is part 4 in a series "Visual Studio 2015 Hadoop Component ETL using Hortonworks Hyper-V Sandbox"
 
In Part 1 of this series we installed Visual Studio 2015 Community Edition and configured our Hortonworks Sandbox 2.4 in Hyper-V.

In Part 2, we created a project in Visual Studio Data Tools for SSIS which implemented the Hadoop File System Task.  We ingested a local file from the host laptop and sent it directly to a directory in HDFS on the Hyper-V virtual machine sandbox.
 
In Part 3, we used Pig to bring in a text file, strip out the header row, and store off the data (3 fields) to an output directory.
 
To get started, we took a look at the HDFS file structure for the root user's folder. I'm using root for simplicity; in a real-world environment you'd of course want to use a different user:
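
This is just a recursive listing of the HDFS user directory from the sandbox shell; something along these lines (the path matches what we set up in Part 2):

hadoop fs -ls -R /user/root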
 

Next we create our Hadoop Connection Manager:


Next we set the WebHCat server name and IP address; we didn't need to set or enable the WebHDFS connection:
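
As a quick sanity check before running the package, you can hit the WebHCat status endpoint from the host with curl; I'm assuming the default WebHCat port of 50111 on the sandbox here:

curl http://sandbox.hortonworks.com:50111/templeton/v1/status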


Next we opened our Visual Studio project, added a new SSIS package, and dropped in 3 Hadoop Hive Task components: the first creates a database, the second creates a table, and the third inserts data into the new table:
 

Next, we open each of the 3 Hive components, set the Hadoop Connection and apply the Inline Script:

 
Here are the 3 scripts:


----------------------
-- Script 1: create the database (and switch to it)
CREATE DATABASE IF NOT EXISTS VS2015_Test;
USE VS2015_Test;

----------------------
-- Script 2: create an external table over an HDFS directory
USE VS2015_Test;
DROP TABLE IF EXISTS tblBook1;
CREATE EXTERNAL TABLE tblBook1 (DateID STRING, Description STRING, ActiveFlag INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/user/root/test/HIVE/tblBook1';

----------------------
-- Script 3: load (move) the Pig output file into the table's directory
USE VS2015_Test;
LOAD DATA INPATH '/user/root/test/book1.out/part-m-00000'
OVERWRITE INTO TABLE tblBook1;

----------------------
Here's the HDFS file structure, along with the cat'd contents of our Pig output file book1.out/part-m-00000:
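
From the sandbox shell, that's roughly:

hadoop fs -ls -R /user/root/test
hadoop fs -cat /user/root/test/book1.out/part-m-00000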


Then we executed the SSIS package:


The package completed; here are the execution details:


And the HDFS file structure changed: new files were added to our Hive directory, and the original output file was moved:


You can see the Hive directory has a new entry, "tblBook1", which is our Hive table; the file "tblBook1/part-m-00000" is our table content, moved out of the original "book1.out" directory by the LOAD DATA INPATH statement.

So let's have a look at Hive now: SHOW TABLES; to list our tables and DESCRIBE tblBook1; to get the table structure:


And now "select * from tblBook1", returns 3 rows as expected:


Let's add a Where clause, where ActiveFlag = 1:


Lastly, we tried a UNION ALL query:
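
For reference, the queries looked roughly like the following. The UNION ALL example is just illustrative (a simple self-union of the same table, since there's only one table to play with), wrapped in a subquery so it works across Hive versions:

USE VS2015_Test;
SELECT * FROM tblBook1;
SELECT * FROM tblBook1 WHERE ActiveFlag = 1;
SELECT * FROM (
  SELECT * FROM tblBook1
  UNION ALL
  SELECT * FROM tblBook1
) u;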


And there you have it: a 4-part series covering installing Visual Studio 2015, starting the Hortonworks Sandbox on Hyper-V, ingesting files, ETL using Pig, and creating a database and table and inserting data with Hive.  Full life cycle.

Although I'd still like to load SQL Server 2016 and try out a PolyBase query, this concludes our series.

As always, hope you enjoyed the posts and got some value.  Thanks for reading~!  JB

9/22/2016

Visual Studio 2015 Hadoop Component ETL using Hortonworks Hyper-V Sandbox [Part-3]

In Part 1 of this series we installed Visual Studio 2015 Community Edition and configured our Hortonworks Sandbox 2.4 in Hyper-V.

In Part 2, we created a project in Visual Studio Data Tools for SSIS which implemented the Hadoop File System Task.  We ingested a local file from the host laptop and sent it directly to a directory in HDFS on the Hyper-V virtual machine sandbox.

In this Part 3, we are going to consume the text file we just ingested into HDFS on the Hyper-V single-node sandbox, and manipulate it using Pig.

First, we open Visual Studio 2015 Community Edition, open our existing project, add a new package, and drop the Hadoop Pig Task component onto our Control Flow canvas:


Next, we create a new connection to Hadoop. 


This time we use WebHCat instead of WebHDFS:


Enter the information in the editor: assign the Hadoop Connection Manager along with the Pig script:


The Pig script is as follows; it reads our Book1.txt file from the /user/root/test/ folder in HDFS, strips out the header row, and writes the result to an output directory:


A = LOAD '/user/root/test/Book1.txt' USING PigStorage() AS (DateID:chararray, Description:chararray, ActiveFlag:int);  -- PigStorage() defaults to tab-delimited
B = FILTER A BY DateID != 'Date';  -- strip out the header row
C = FOREACH B GENERATE DateID, Description, ActiveFlag;  -- keep all 3 fields for the Hive load in Part 4
STORE C INTO '/user/root/test/book1.out';
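
If you want to test the same script outside of SSIS, you can run it straight on the sandbox; this is just a sketch, assuming the script is saved there as book1.pig:

hadoop fs -rm -r /user/root/test/book1.out    # STORE fails if the output directory already exists
pig -f book1.pig
hadoop fs -cat /user/root/test/book1.out/part-m-00000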


9/21/2016

Visual Studio 2015 Hadoop Component ETL using Hortonworks Hyper-V Sandbox [Part-2]

In Part 1, we loaded Visual Studio 2015, applied the new Hadoop File System Task, and then jumped through hoops trying to get the Hyper-V Hortonworks Hadoop VM working.

Once Hadoop was running, I created a dummy Excel file with 3 columns, then exported it to a tab-delimited txt file:
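
For reference, Book1.txt is a tab-separated file with a header row and three data rows of Date, Description and ActiveFlag; the values below are made-up placeholders just to show the shape:

Date    Description    ActiveFlag
20160901    Sample row one    1
20160902    Sample row two    0
20160903    Sample row three    1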


Then we set the Hadoop Connection to use WebHDFS, which is a built-in web API for interacting with files in HDFS:
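
WebHDFS exposes HDFS over REST on the name node. Assuming the default HDP port of 50070, you can browse the target folder from the host with curl, like so:

curl -i "http://sandbox.hortonworks.com:50070/webhdfs/v1/user/root/test?op=LISTSTATUS"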


Then set a local connection string to the file:



Then we went into Hadoop and created a "test" folder as "/user/root/test":



Here's the permissions settings before:



set permissions in Hadoop (a=all, u=user, g=group, o=others and r=read, w=write, x=execute)

hadoop fs -chmod a+rwx /user/root/test

set permissions (Linux - for reference)
chmod u=rwx access.log (user/owner) read, write, execute
chmod g=rwx access.log (group)
chmod o=rwx access.log (others)

read a directory/file listing from Hadoop (-R = recursive, -d = list the directory itself rather than its contents)

hadoop fs -ls -d -R FOLDER or FILENAME

And the permissions settings after:


Then we applied the settings to the Hadoop File System Task component:



Then we ran the package successfully (I also added an Execute SQL Task for no particular reason):



And a detailed view of the run:



And now we look at Hadoop HDFS to see if the file made it:



We see our Book1.txt file as expected. Remember to set the Type in the Hadoop File System Task to "Directory", as "File" did not copy the file for some reason.


Next, we try to send our Excel file: we modified the File Connection to the same folder, except I shared the folder instead of using the local c:\temp\hadoop path.



And we set the component to use the new file connection; in the future we could automate this to loop through folders:



The package ran successfully; we jump over to Hadoop to see if the file landed correctly:



We see 2 files, txt and xlsx.


And that concludes Part 2 of this series: copying files from our local machine (in this case, my work laptop) to the HDFS cluster running on the Hyper-V Hortonworks Sandbox 2.4.


Follow along in Part 3, where we apply Pig scripts to our data set, then mount the data in Hive, and perhaps query the Hive table.


Here's Part 1 in case you missed the setup details.


As always thanks for reading!

Visual Studio 2015 Hadoop Component ETL using Hortonworks Hyper-V Sandbox [Part-1]

Today I am going to get a project working to move files from a particular location on a local laptop up to a Hortonworks Hadoop Linux Virtual Machine running on Hyper-V on Windows.

The first step: download Visual Studio 2015.  I clicked "Select All" and started the installation process:



Load the Hyper-V instance of Hortonworks Hadoop.  You can find my previous post on how to do this here.


Starting Hyper-V, we see the "New Virtual Machine":



First, check the RAM assigned to the virtual machine; it defaults to 1024 MB. I changed it to 2048 MB, although Hortonworks recommends a minimum of 4 GB:



We first click the "Start" button, then click the "Connect" button, and the virtual machine begins to spin up:



It prompts us to log in:



username: root

password (first time, it prompts to change it): hadoop

Remember what the new password is; it can cost a bit of time trying to recall it later, wink, wink!


To get the IP address (mine is not connected to the Internet), type "ifconfig":




What's interesting is that now that I'm trying to work with my sandbox, it wouldn't run properly in the Hyper-V environment.  So we basically have to troubleshoot from the start.


I wasn't able to connect via Internet Explorer, so I tried Chrome.


I was able to log into the VM in Hyper-V using the username and new password, but it indicated a "network connection issue":



Digging in a little deeper, you have to modify your hosts file in "C:\Windows\System32\drivers\etc".  I created a mapping between the Hyper-V IP address, 192.168.0.27, and sandbox.hortonworks.com:
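
The entry in the Windows hosts file ends up looking like this (IP address, whitespace, host name):

192.168.0.27    sandbox.hortonworks.com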



The thing to remember: you must open the file as administrator or it won't let you save your changes.  I used Notepad++.

Next, you'll want to go to the Hadoop VM and update its hosts file, editing it with "vi hosts" from the /etc directory:


Next, you have to allow access to the Hyper-V VM from your laptop (which hosts the VM) in the "hosts.allow" file:


Remember, to exit the vi editor in Linux you press Esc, then colon ":", then "q"; or if you want to save your changes, Esc, then colon ":", then "wq".


In the hosts.allow file, I added the "ALL" keyword, followed by a colon ":" and the prefix of the calling IP address of my local laptop:
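
Assuming the 192.168.0.x range used throughout this series, the line looks something like this; the trailing dot makes it a prefix match, so any 192.168.0.* client is allowed:

ALL: 192.168.0.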



This set of IP addresses has access to the VM.


There's also a Hosts.Deny file, which I didn't modify:




The hosts, hosts.allow and hosts.deny files reside within the "/etc" directory on the Linux VM.


While booting up, because the virtual machine image was previously an Oracle VirtualBox image, the "eth0" interface was not able to come up.  After a bit of research, it turns out we need to tweak a setting.


So we drill into our /etc folder --> udev --> rules.d folder, to a file called "70-persistent-net.rules".  This file maps "eth0", "eth1", "eth2", and so on to MAC addresses.  I found a link that said to remove the file using the "rm filename" command; the file gets rebuilt after restarting the VM.


Here's the contents of the file:



But first, we need to visit another file in another folder: etc --> sysconfig --> network-scripts --> "ifcfg-eth0".  We view its contents:


I commented out the DHCP entries by prefixing them with "#", along with a few other entries that weren't needed, and added HWADDR="the mac address" taken from the other file:
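
The edited file ended up roughly like the sketch below; the MAC address placeholder is whatever 70-persistent-net.rules reported, and the exact set of commented-out lines may differ:

DEVICE=eth0
HWADDR=<MAC address from 70-persistent-net.rules>
ONBOOT=yes
#BOOTPROTO=dhcp
BOOTPROTO=none
IPADDR=192.168.0.27
NETMASK=255.255.255.0
GATEWAY=192.168.0.1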


 
You can see I changed the IP address to a static 192.168.0.27; that way it's on the same subnet as the 192.168.0.1 gateway, otherwise it doesn't work.




The entries with # are commented out and are not read as part of the file.  The HWADDR is how the VM ties the newly assigned IP address to the network interface's MAC address.

To restart the service, type "service network restart" and it flushes and re-reads the "eth0" settings:



Next, we do an "ifconfig" to see the IP address assigned to "eth0":


We see the IP address is 192.168.0.27 and Mask is 255.255.255.0 which is what we are expecting.


So that appears to be fixed.  Here's a Hortonworks URL for reference: https://community.hortonworks.com/articles/49082/network-issues-after-cloning-a-virtual-machine-wit.html


Now, we still have the issue of connecting to the VM from the host laptop.  So we verify our Hyper-V virtual switch; in this case, we chose "Internal":




We assigned the Internal Switch to our VM in the Settings of Hyper-V VM:



Next, we check the network adapter under the network settings of the host laptop; we assigned it a static IP of 192.168.0.10:



We also need to share our internal network device; I found this great post to assist: https://blogs.msdn.microsoft.com/virtual_pc_guy/2008/01/09/using-hyper-v-with-a-wireless-network-adapter/


We also need to modify our WiFi adapter settings; we can't use "Obtain an IP address automatically" because it must be on the same 192.168.0.x range:



So now we start up our Hyper-V Hortonworks Sandbox VM and we see the IP address finally updates:



We also ping the Hyper-V IP to make sure we have round trip access:


We then check that the IP configurations were set properly on the host local laptop:
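
From a command prompt on the host, those two checks are simply:

ping 192.168.0.27
ipconfig /all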



I noticed the DHCP was not enabled, but it seemed to be okay.


Next, our Hyper-V Hadoop24 VM loads:



We ignore the 127.0.0.1:8888 entry, as we manually changed the IP in the ifcfg-eth0 file to 192.168.0.27 earlier, and attempt to load the page via the IP address:



Next, we attempt to load using the assigned name in the local hosts file, sandbox.hortonworks.com:




And finally, we attempt to load Ambari, port 8080, and it displays the login credential info:


And we log in:



One other thing to mention: during the process, I had to set a rule in the firewall to open up some ports.
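
The exact rule isn't shown here, so as an assumption, suppose it was the Windows Firewall and the web UI ports touched in this series (8080 for Ambari, 8888 for the sandbox welcome page, plus the WebHDFS/WebHCat defaults of 50070 and 50111); a rule along those lines can be added from an elevated command prompt:

netsh advfirewall firewall add rule name="Hortonworks Sandbox" dir=in action=allow protocol=TCP localport=8080,8888,50070,50111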

Next, we'll connect to our Hyper-V sandbox environment via Visual Studio 2015 in Part 2; click the following link...


Visual Studio 2015 Hadoop Component ETL using Hortonworks Hyper-V Sandbox [Part-2]


Thanks for reading~!

Get Sh#t Done!