This is part 4 in a series "Visual Studio 2015 Hadoop Component ETL using Hortonworks Hyper-V Sandbox"
In Part 1 of this series we installed Visual Studio 2015 Community edition as well as configured our Hortonworks Sandbox 2.4 in Hyper-V.
In Part 2, we created a project in Visual Studio Data Tools for SSIS which implemented the Hadoop Task File Task. We ingested a local file on the host laptop, sent it directly to a directory in HDFS in Hyper-V Virtual Machine sandbox.
In Part 3, we used Pig to bring in a text file, strip out the header row, and store off the data (3 fields) to an output directory.
To get started, took a look at the HDFS file structure for the root folder, I'm using root for simplicity, in real world environment, you'd want to use a different user of course:
Next we create our Hadoop Connection Manager:
Next we set WebHCat server and IP address, didn't need to set or enable WebHDFS connection:
Next we opened our Visual Studio project, added a new SSIS Package, dropped in 3 Hadoop Hive Task components, first one to create a Database, then create a Table, then Insert into the new table:
Next, we open each of the 3 Hive components, set the Hadoop Connection and apply the Inline Script:
Each of the 3 script code is here:
-- Script 1
CREATE DATABASE IF NOT EXISTS VS2015_Test;
-- Script 2
DROP TABLE IF EXISTS tblBook1;
CREATE EXTERNAL TABLE tblBook1 (DateID STRING, Description STRING, ActiveFlag INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/user/root/test/HIVE/tblBook1';
-- Script 3
LOAD DATA INPATH '/user/root/test/book1.out/part-m-00000'
OVERWRITE INTO TABLE tblBook1;
Here's the HDFS file structure along with CAT file contents of our Pig output file book1.out/part-m-00000:
Then executed the SSIS Package:
Package completed with details:
And HDFS file structure changed, adding new files to our Hive directory, and moved the original output file:
You can see the Hive directory has a new entry "tblBook1" which is our Hive Table and the file "tblBook1/part-m-00000" is our table content, which got moved from the original directory of "book1.out".
So let's have a look at Hive now, SHOW TABLES; and DESCRIBE tblBook1 to get the table structure:
And now "select * from tblBook1", returns 3 rows as expected:
Let's add a Where clause, where ActiveFlag = 1:
Lastly, tried a UNION ALL query:
And there you have it. 4 part series, install Visual Studio 2015 and start Hortonworks Sandbox on Hyper-V, Ingest files, ETL using Pig and Create Database/Table/Insert data. Full life cycle.
Although I'd still like to load SQL Server 2016 and try out the Polybase query, this concludes our series.
As always, hope you enjoyed the posts and got some value. Thanks for reading~! JB
I signed up for the Hortonworks Certified Associate exam last Thursday. Figured if I sign up, I'd have to take the test. And if I tak...
Saw a post today on Twitter, " Microsoft releases CNTK, its open source deep learning toolkit, on GitHub " This is big news. Be...
This blog post is in no way an attempt to steal other people's work. It's basically an conglomeration of notes from research I did...