10/25/2016

Resolving Dts.Variables in Script Component of SSIS to call REST API

For a project, I was tasked with making a REST API call to send up some data using JSON, then receive the returned values and place them into a SQL Server table.
 
So I started out using a Script component and pieced together some code to get the REST API call to work successfully.  In order to do that, I first had to figure out the correct syntax, so I used a REST client plug-in for Chrome.
 
 
Click the Launch button and another window appears:
 
 
 
After entering the URL for the REST API, I selected POST.
 
Then I tweaked the JSON data until it processed successfully.  So the REST API call was accessible.
 
Next, I went into SSIS using Visual Studio 2015 and added some components, including a Script component.  Then I pieced together some code to send a JSON statement to the service using a hard-coded JSON string.  And that worked.
 
Next, I had to feed data into the Script component, using an Execute SQL Task.
 
Next, I added a Foreach Loop container and set the result set to a variable: User::ResultSet
 
 
 
Then I created 4 User variables to hold the contents of the result set.
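For context, here's a minimal sketch of what the source query in the Execute SQL Task might look like; the table and most column names here are hypothetical placeholders, since the real query simply returns 4 columns that map to the 4 User variables (while testing, I limited it with top 1):

-- hypothetical source query for the Execute SQL Task (Full result set)
-- the 4 columns map to the 4 User variables consumed by the Script component
SELECT TOP 1    -- remove the TOP 1 once the end-to-end flow checks out
 Street,
 City,
 State,
 Zip
FROM dbo.AddressStaging;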
 
Then within the Foreach Loop container, I added a Script component, selecting "Source", not "Destination" or "Transformation", and attached a Destination table in SQL Server.
 
 
 
Then you set the expected Output columns (the returned values) within the Script component.
 
Then edit the script.
 
I got it to run successfully when hard-coding the JSON string.  Of course, we then have to set it up to run using real data, flowing in from our Execute SQL Task.
 
So we set our "ReadOnly" variables, not "Read/Write", added the 4 User variables, and headed back to Edit Script again.
 
And in the code, we swap out our hard-coded JSON statement with a string built from our 4 User variables.
 
But we ran into an issue.  It wasn't able to read our variables.
 
Error CS0234 The type or namespace name 'Variables' does not exist in the namespace 'Microsoft.SqlServer.Dts' (are you missing an assembly reference?)

I read some posts online that suggested many possible solutions; none worked.  Then I remembered that you have to add a reference to Microsoft.SqlServer.ManagedDTS.dll to the project, which I did, and it still threw an error.
  
Then I searched in a few places and created a matrix comparing the DLL versions on the SQL Server 2016 box to the development box that has Visual Studio 2015:
 
 
You can see that the DLL could reside in multiple places, specifically the SDK directory and the GAC.  The 592 and 593 were the file sizes, so I could identify the differences.
 
It turns out the copy in the GAC was the wrong version, so I overwrote it with a newer one.
 
Reopened Visual Studio:
 
string strStreet = this.Variables.street.ToString();
 

I changed the syntax slightly, and it resolved correctly.  I was able to flow a record through, and it showed up in the database table as expected.
 
So I opened up the SQL statement, removed the "top 1", and it ran the entire way through without errors.
 
So it seems that the DLL version corrected the issue, along with changing the syntax slightly from:
 
Dts.Variables["Street"].ToString()
 
to
 
this.Variables.Street.ToString(). 
 
That's my story and I'm sticking to it.

 

Getting Started with Hortonworks Certification Program

If you are interested in getting certified in Hadoop, Hortonworks offers a complete Certification path.  You can get started here.
 
They have a few paths to choose from:
 
HDPCD
HDP Certified Developer
 
HDPCD-Spark
HDP Certified Apache Spark Developer
 
HDPCD-Java
HDP Certified Java Developer
 
HDPCA
HDP Certified Administrator
 
HCA                                       
Hortonworks Certified Associate
 
I started on the Hortonworks Certification path last week with the HCA exam; I blogged about it here.
 
I think the next exam I'd like to try is the HDPCD exam.
 
Here's the Cert Sheet

And Exam Objectives.
 
And the HDP Certified Professional FAQ
 
And a Guide for Attempting an HDP Certification Practice Exam, where you can take a sample "practice" test on AWS; you're just charged for usage time at a minimal rate.

And you can register for the exam here; you must use a credit card to secure the cost of the exam.

I'm thinking this Certification would make the most sense, as I have some experience with HDFS, Hive, Pig and a bit of Flume.  They also have the Spark Certification, which is great, but I'm into the movement of data within HDFS at this point in time, and not as much into the in-memory aspect of large scale interrogation of data sets in HDFS.  Maybe after the first exam.  The Administrator exam would come after that, and I probably won't take the Java exam, as that is more Hadoop 1.0.

This blog post simply consolidates some of the available links to get you started with Certification on the Hortonworks platform.
Thanks for reading~!

10/23/2016

Hortonworks Certified Associate

I signed up for the Hortonworks Certified Associate exam last Thursday.  Figured if I sign up, I'd have to take the test.  And if I take the test, I could pass.  In order to pass, I'd better study.
 
So I studied a bit on Friday.  And Saturday.
 
And at 9am on Sunday, I logged on to the test site, seated in my office out by the pool.  When you log on, they ask you to share your desktop, and your camera and microphone.
 
Then you are instructed to pan the camera 360 degrees around the room, as well as across the desktop.  Once I checked out, the exam started: 41 questions.
 
Most questions, I knew the answer immediately.  There were a few I wasn't sure about; those must have been the ones I got incorrect.
 
You get an hour, but I completed it in just over 25 minutes.  Scored 82%.  If you earn over 75%, you pass.
 
So I officially passed the first test for Hortonworks Certified Associate.  



And I've been getting some Hadoop real world experience on my current project with HDFS, Hive, Pig, Polybase.


I actually learned Hadoop in 2012 at a prior job, took the 2 week class with a bunch of other developers.  Just took a while to get a good project.

10/20/2016

Early Adopter of Computers 35 Years Ago


The other day, my wife and I were talking in the car.  I mentioned that my family had the first version of the IBM PC; before it was MS-DOS, it was called PC-DOS.

So I told her the story of how Microsoft licensed its software to IBM.  MS didn't actually write it though; instead, they acquired it quite cheaply.  And how that was probably the biggest strategic business maneuver of our lifetime.

When I first got on a computer, at age 14, we didn't have hard drives or a mouse.  Just dual floppy disks, a keyboard, color monitor, Epson printer, and a 1200 baud modem.  

Back then, I would call up the BBS and page the Sysop, download software and games.  

Although the Commodore 64 was out and the kid up the street had an Apple, and the TRS-80 at school, IBM was cutting edge back then.

So why didn't you major in computers in college, my wife asked.

Well, I sort of majored in Business for 2 years.  After being "undecided" around Junior year, they said you need to pick a major so you can graduate.

I had so many Anthropology credits, that I only needed a few more Anthro classes to graduate.  Although I was just shy of a Minor in Business.

The reason I didn't major in computers was this: in high school all I did was play tennis and do enough school work to get A's and B's.  Reason being, my education sort of suffered after moving to Florida.  As not one teacher realized that I could solve the Rubik's cube in a minute, or could speak for that matter.  And lastly, I never thought for a second about growing up and getting a job after college.

It wasn't until I got into the workforce and found programming as an occupation that my career started to improve exponentially.

I said to my wife, a lot of people with similar backgrounds of early access to programming and / or growing up in an IBM family, went on to build companies and become millionaires.  So in that regard, not sure what happened.  

I said, in my personal opinion, the area in which I worked had back office jobs that may have had an IT department that may have needed some programs maintained or reports created.  There wasn't much new development.

Also, the newer technology didn't float to where I was and although I tried to learn the new stuff, the jobs and projects just weren't there.  

So it could have been all those years of "maintenance programming" as well as "location".

I said to the wife that I've been programming since 1982, almost 35 years.  I was lucky to learn at a young age, on the original IBM PC, which is now in the Smithsonian.

And in 1990, I had a laptop the size of a suitcase, with an orange screen, that weighed 40-50 pounds.  After that, I had an IBM 286, 486 and on and on.

Where are we now?  Well, soon we'll have hybrid people with embedded digital devices, artificial intelligence in the mainstream, automation of everyday activities, 3-D printing everyday items, flying cars and delivery drones.  Is that good or bad?  

Well, for one thing, the quality of service would become more standardized, and we wouldn't have people give preferential treatment based on bias, special favors or ignorance.  Maybe even some accountability.

And services will get faster.  And we'll have audit trails.  And personalized service.  And business activity 24/7, not just 9-5.  And perhaps more interaction with people across the globe.  Maybe use some technology to cure diseases or benefit humanity.

Technology is finding its second wind.  As am I.

I was there 35 years ago, in the office upstairs, typing away on that IBM home computer, at the very beginning of the technology movement.  How lucky is that? 

And that's the history of computers through the eyes of an ex-tennis playing Anthropologist turned Loan Underwriter, Programmer & Data Professional Consultant.

Thanks for your time~!

Install SSDT Data Tools for Visual Studio with no Internet Connection

It is possible to install SSDT (SQL Server Data Tools) for Visual Studio with no internet connection:
 
https://msdn.microsoft.com/en-us/mt186501

    Once downloaded, run the following command using an administrator command prompt (cmd.exe run as administrator):

    SSDTSetup.exe /layout <location>

    Where <location> is the location where you wish to create the administrative install point (e.g. on a USB drive, a LAN drive or other accessible location). NOTE: You will need approximately 1.8GB of free space at <location> for the full install point because it includes all possible components that might be required.

To use the install point once created, simply run SSDTSetup.exe from the location with no arguments. This will use the install point rather than attempting to download new copies of the relevant chained components.

You have to run it from a command prompt (as administrator), which extracts the files to another folder:

 
 
 
Then cancel the installation in progress, drill into the new folder (temp), and run it again from the same administrator command prompt, but from the temp directory:

I copied the folder to c:\temp\ssdt along with 2 folders (payload & ssdt):

Now you can develop SSIS, SSRS, SSAS and Tabular projects.

10/19/2016

Self Describing Data that Tells YOU Its Story

Having data means nothing.  Unless you can interpret it.  That's a phrase I heard recently.

And it's probably true.

Data tells a story.  Data is merely accumulated pieces of raw facts.  Perhaps in relational tables or flat files or Excel docs or unstructured formats.

Even still, something's missing.

Data needs to be more descriptive.  As in attributes that describe the data.  In language we have nouns: persons, places and things.  And we have verbs: things doing something in different tenses, past, present and future.

We also have adjectives and adverbs, which describe the nouns and verbs.  Book.  What subject?  Size?  Shape?  Contains what?  Author?  Date written?  Might be valuable information.  Describes the book.

Run.  Run where?  By whom?  When?  How far?  Started and ended where?  Describes the "run".

It seems we need an inherent way of self describing data.  If someone hands you some data, wouldn't it be nice if you ran it through an interpreter, or compiler, or some data framework to load up the data in descriptive detail, without having to write queries, joins, merge data sets, etc.

Thanks for loading this data.  This data was created at this location, at this time, by this application.  The data describes a set of purchases of books.  Here's information about the books.  The author.  Who purchased them.  How much they paid.  What else they bought.  The customer demographics. And on and on.

Plug and play.  Insert data set.  Voila!  Here's the story about the data.  And of course we could interrogate it further.  With queries or natural query language, or put it into charts and graphs and visualizations, and compare against similar prior data sets.

Presto magic.  Self describing data.  Pieces of data with self describing attributes.  That can be loaded into a pre-built framework by anyone anywhere, so long as they have permissions.

And why not share some of these self describing data sets.  Put them out on the web to be consumed by REST Web Services or SOAP calls or query the data remotely.

Data without interpretation is like stacked bundles of hay.  Doesn't do much.  It's when you understand the data that it becomes valuable.  Have the data tell you its story.  By labeling your data with self describing attributes to be interpreted by you or machines.

And then have machines crawl those data sets like a network of self describing knowledge.  A world digital encyclopedia 24/7 from anywhere anytime, just expose the data sets for public consumption, keep the private data private.

That's the piece that's missing from the data centric world in my opinion.

Self Describing Data that tells you its story.

10/13/2016

Hadoop is Gaining Traction in Building out Data Ecosystems

This week I got to assist on a Hadoop installation of a Master Node and three Data Nodes.  We used the Ambari installation.

At first, the install was done manually, on Red Hat Linux.  I spent a good amount of time troubleshooting, poking through all the directories and configuration files.

And then we decided to use the automated scripts.  First, the Ambari server was set up on the Master node.

And the PostgreSQL database was installed and configured.

Then we stepped through the process, applying the correct settings and ran it through.  And it threw errors as some services would not start.  So we troubleshot and tried again.

I think the major issue was not having the $JAVA_HOME path set in all the right places.  Another issue was needing to use the actual fully qualified domain name instead of localhost in the HOSTNAME= setting, as well as using root as the default user.

After that, it booted up, all services were running, mission accomplished.  I had been doing all my Hadoop development locally on a laptop, running Hadoop on Hyper-V with 3 or 4 gigs of RAM, so after I ported the Visual Studio 2015 project over to the client servers, pointing to SQL Server 2016 and the Hadoop cluster, it runs really fast.

At this point, we're ingesting some data from Excel, into HDFS, ETL using Pig, mounting Hive tables, then Hive ORC tables, cleaning up the file remnants along the way (don't be a litter bug!) and finally, pulling that data into SQL Server 2016 using Polybase.

What's next?  Adding some Data Quality Services, Master Data Services, and then flowing into a Data Warehouse using Dim and Fact tables, and then finally, pushing the data into Analysis Services Tabular Model for consumption.

There's a few other things that need to be done as well, like set up Kerberos, create some generic users/group to run the services, and standardize the directory structures along the way.

Hadoop was sold as the next big thing: shove all your data in, find unlimited insights.  And then the hype wore off, because it's basically a bundled set of mini applications, fairly complex to set up and administer, and there was a lack of qualified resources to develop on it, as SQL developers did not know Java, DBAs didn't like to code, and traditional Java programmers didn't know the data layer.

At this point in time, years later, there's still a learning curve, but the tools to push the data through have gotten better, we are no longer required to write Java map/reduce jobs, and we have many additions that mimic traditional Data Warehousing concepts, along with Machine Learning, graph databases, workflows, security, etc.

So now we can leverage our existing or new Data Warehouses that have been around for a long time, and add more data sources including non-structured and semi-structured data along the path.  I could definitely see more organizations taking advantage of this paradigm to beef up their data ecosystems and find those "insights" we were promised many years ago.

It's now a data centric world.  Hop on board.  Hadoop is getting traction.

Removing Carriage Return Line Feeds from Excel Source in SSIS

For a project I'm working on, the goal is to bring in some data from an Excel file.  And for 3 of the tabs, the data flowed okay.  For the fourth tab, some of the field lengths exceeded 1000 characters.

And SSIS scans the first 8 rows to determine the correct field length, and it limited the field to 255 characters every time; even after changing the length manually, it reverted back.

In order to get around this, there's a registry setting you can apply to let it guess the correct field length by scanning the entire column; I blogged about that here.

That wasn't the end of the story.  Next issue: almost every row of data for 5 of the fields contained a Carriage Return / Line Feed.  What that does is break the record out into multiple rows, because the CR/LF is used as the end-of-row indicator.

I worked with a teammate, our Architect on the project, and found a working solution.

Basically, the Excel tab still gets read in as usual, but then we do a Data Conversion to DT_WSTR instead of DT_NTEXT:


Then we add a Derived Column to REPLACE the "\r\n" characters with " ":


And that corrects the problem, and the fields export correctly to the text file.  Thanks to my teammate for helping out on this one~!
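As a side note, if you ever need to do the same kind of cleanup on the SQL Server side after the data lands, the equivalent idea in T-SQL is a nested REPLACE; a rough sketch, using a hypothetical staging table and column:

-- strip carriage returns and line feeds after the load (hypothetical table/column names)
UPDATE dbo.StagingNotes
SET NoteText = REPLACE(REPLACE(NoteText, CHAR(13), ' '), CHAR(10), ' ');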

So those two issues are now fixed.  On to the next set of issues.

10/12/2016

Connecting SQL Server 2016 to Hadoop ORC tables using Polybase

When working with SQL Server 2016, you may have a need to pull data from Hadoop HDFS Hive ORC tables.  In order to do that, there are certain steps that must be configured.  Here are the steps used to create the connection:
-----------------
-- run this from SSMS
----------------- 
select SERVERPROPERTY('IsPolybaseInstalled')
-----------------
-- Create an external data source
-- Polybase connectivity configuration -
-- https://msdn.microsoft.com/en-us/library/mt143174.aspx
-- Option 0: Disable Hadoop connectivity
-- Option 1: Hortonworks HDP 1.3 on Windows Server
-- Option 1: Azure blob storage (WASB[S])
-- Option 2: Hortonworks HDP 1.3 on Linux
-- Option 3: Cloudera CDH 4.3 on Linux
-- Option 4: Hortonworks HDP 2.0 on Windows Server
-- Option 4: Azure blob storage (WASB[S])
-- Option 5: Hortonworks HDP 2.0 on Linux
-- Option 6: Cloudera 5.1, 5.2, 5.3, 5.4, and 5.5 on Linux
-- Option 7: Hortonworks 2.1, 2.2, 2.3, 2.4, and 2.5 on Linux
-- Option 7: Hortonworks 2.1, 2.2, and 2.3 on Windows Server
-- Option 7: Azure blob storage (WASB[S])

-----------------
-- Create Hadoop connectivity
----------------- 
sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
GO
RECONFIGURE
GO

----------------- 
-- configure local config's for Hadoop
----------------- 
-- C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf
-----------------
-- need to modify the yarn-site.xml file and add your classpath in the values node
-----------------
-- create connection to Hadoop
-----------------
CREATE EXTERNAL DATA SOURCE HDP WITH
(
 TYPE = HADOOP,
LOCATION = 'hdfs://myHadoopURLorIPAddress:8020'
)

-----------------
-- Create an external file format for ORC files.
-----------------
CREATE EXTERNAL FILE FORMAT EFTHDPORC
WITH (
 FORMAT_TYPE = ORC
, DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

-----------------
-- create your table in the database
-----------------
--DROP EXTERNAL TABLE MyTableORC
-----------------
CREATE EXTERNAL TABLE MyTableORC (
Field1 INT,
Field2 varchar(255),
Field3 INT,
Field4 INT,
Field5 varchar(255)
)
WITH (
LOCATION='/user/hcat/HIVE/MyTableORC',
DATA_SOURCE = HDP,
FILE_FORMAT = EFTHDPORC
)
;
-----------------

-- run in SSMS, database in which you created your table
----------------- 
SELECT
 Field1,
 Field2,
 Field3,
 Field4,
 Field5
FROM [DBStage].[dbo].[MyTableORC]

-----------------
-- The following new catalog views show external resources

-----------------
SELECT * FROM sys.external_data_sources;
SELECT * FROM sys.external_file_formats; 
SELECT * FROM sys.external_tables; 
SELECT name, type, is_external FROM sys.tables WHERE name='MyTableORC' 


In SQL Server, under External Resources, you will see your new settings:


Under External Tables, you will see the Metadata tables that point to your Hadoop cluster:



Using Polybase, you can expose your Hadoop Hive tables and query them in SQL Server, or from SSIS as a Source, just like any other data source.
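For example, once the external table resolves, loading the Hadoop data into a local staging table is just a regular INSERT ... SELECT; a minimal sketch, assuming a hypothetical staging table:

-----------------
-- pull the Hadoop data into a local staging table (hypothetical table name)
-----------------
INSERT INTO [DBStage].[dbo].[StageMyTableORC] (Field1, Field2, Field3, Field4, Field5)
SELECT
 Field1,
 Field2,
 Field3,
 Field4,
 Field5
FROM [DBStage].[dbo].[MyTableORC];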

Hope that helps you get started with Polybase in SQL Server 2016 connecting to Hadoop.  Thanks~!

10/10/2016

Remove Folder Content in HDFS using Pig Script

When working with Hadoop, we move files through the pipeline.  We send text or CSV files into HDFS, then mount the file using the Pig language, perhaps add some business logic, filter out some values, transform some data, and output the results to another file or folder.

Then we pick up that file and mount in perhaps a Hive table.  And then, maybe into an ORC table.

And we leave a trail of files littered throughout our HDFS cluster.

Next time you go to run the same scripts with new data, you may encounter an error, as the file or folder already exists.  So we need a way to clean out the existing folders, to make room for the new ones, and prevent our data flow from erroring out.

I searched for a bit and found individual pieces to remove a file, or files recursively, or even the folder.  But if the item didn't exist beforehand, it too would throw an error.

So how do you know if the file/directory already exists?  Well, the way I did it was to create a dummy file in the directory I wanted to remove recursively.  That way, I know for sure the file exists as well as the folder.  First line: create a dummy txt file; second line: remove the folder recursively:

fs -touchz /user/pig/Agency.out/tmp.txt;
fs -rm -r /user/pig/Agency.out;


I run this from the Hadoop Pig Script in Visual Studio 2015.  Seems to work.

After posting this blog, I got a tweet from Josh Fennessy with a good suggestion:
I tested and it works great!

thanks for the feedback~!

10/09/2016

Digital World Acquisitions are the Name of the Game

If one wishes to become a leader, it's important to have a good business model. Besides that, it's important to have deep pockets.

By having deep pockets, you can acquire other established businesses that are dominant in their niche.

You probably read that IBM purchased the Weather Channel and all their technology.  Not a bad resource to have, maybe feed all those data points for the past x years into Watson, mighty powerful.

And how about Microsoft's purchase of LinkedIn.  Now, one company owns your resume, mine, and that of about every working person on the planet.  Not a bad acquisition.
 
Sometimes they purchase the orgs for their assets, like the world's weather data points, resumes, etc.  But they are also purchasing the programmers that come along with them.

Like in the world of Artificial Intelligence.  Very bright people, real world experience, hard to find resources.  Just purchase the company, get the programmers as bonus.


In the digital world, you can create things out of thin air, without a full workforce, fleet of cars, etc. like Uber.  Basically a software algorithm company.  You no longer have to build widgets in a factory, distribute them to re-sellers or malls, with each person in the chain taking a piece of the profit, depending mostly on volume and repeat customers.

The digital world is creating tons of new jobs, new sectors, new markets.  And those who have the ability to adapt and change, think critically, take calculated risks, have the chance to build something new, get acquired, and drink margaritas on the beach for a while. 


And there you have it~!

10/07/2016

Starting Hyper-V with Limited Memory Solution


I've been running Hadoop in Hyper-V, yet my laptop has limited memory.  When running Hyper-V with limited memory, it becomes slow, using 3 to 4 gigs of memory when it requires 8 or more.  So here's a fix to get it running faster:
 
https://support.microsoft.com/en-us/kb/2911380
 
It will require a change to the registry: 
  1. Start Registry Editor.
  2. Locate the following registry subkey:
    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Virtualization
  3. Right-click the Virtualization folder, point to New, and then click DWORD Value.
  4. Type MemoryReserve, and then press Enter.
  5. Double-click MemoryReserve, enter 2048 in the Value data box, select the Decimal option, and then click OK.
  6. Close Registry Editor.
 
Add new node:
 
 
Reboot and start Hyper-V.

10/06/2016

Data truncated to 255 characters with Excel ODBC driver [solution]


When importing data from Excel, one method is to use Microsoft SQL Server Integration Services, or SSIS for short.

When doing so, you add a Source connection, point to Excel, select a tab, then attach a Destination connection, perhaps CSV or Flat File.

When the source connection is created, SSIS reads in the first 8 rows, after the header row, and tries to determine the correct field type.

Let's say you have a text field and the max length of the first 8 rows for a particular column is 35 characters.  It sets the field type to string, with a max length of 255 by default.

But what if the actual max length of the column is 1500?  Well, SSIS will throw an error, indicating the field will be truncated, and will give a big red X.

So one option is to go into the advanced editor, and resize the field length.  But guess what, it reverts right back, because it's reading the first 8 rows.

So how do you get around this behavior?  Well for a one time load, you can fake it out by adding a dummy row of 1500 characters and it reads that row and sizes the field accordingly. 

But if you have a recurring job, that's not going to work.  It turns out, there's a setting in the Registry that you can modify.


In RegEdit --> HKEY_LOCAL_MACHINE --> Software --> Wow6432Node --> Microsoft --> Jet --> 4.0 --> Engines --> Excel --> TypeGuessRows (Reg_DWORD), change it from 8 to 0 (zero):


Save, exit RegEdit, restart your IDE and it should pick up the actual field size accordingly, then modify your destination and good to go.  You will need to apply the same change to the production and test servers as well.  And be careful, messing with the Registry is playing with fire sometimes, so take your time and be precise with your changes.

Here's a link from Microsoft to assist.

And there you have it~!

10/04/2016

Incorporating Hadoop into Data Warehouse using Visual Studio IDE



So we went on vacation a few weeks ago.  Was great to get away.  In the sense that when you get back from vacation you don't have to work the weekend to get caught back up.  Or better yet, work on vacation taking calls and meeting sprint deadlines.  Nope, this vacation was all about relaxation.

But I programmed on vacation anyway.  Did it for fun mostly.  Got up to speed on Hadoop from Visual Studio.

And since I've been back from vacation, I leveraged that code into production code.  It reads from an Excel file, 4 tabs, converts them to text files, pushes them to Hadoop, Pig tweaks the data and sends it to another folder, which gets picked up by Hive and mounted into a table, which then lands in an ORC binary table in Hive, including an index.  That will then get pulled by SQL Server 2016 Polybase, flow into a Staging database, then a Data Warehouse, and land in a Tabular Model.

Also, throw in some Data Quality Services and Master Data Services along the way.

To be honest, pulling the data from Excel has been the biggest challenge.  They say throw all your data into Hadoop, but do they also mean Excel data?  Because there's no easy way to extract that, other than exporting to text or CSV files.  At least that's what I've seen.

Best part about it, I'm still living in the Visual Studio world, with some Linux and Hadoop as well.  It's a nice addition to the traditional data warehouse ecosystem.  I could see a lot more businesses incorporating this methodology.  Sure, I spent a few hours troubleshooting issues, but the bottom line is the data is flowing end to end in under a few weeks.

Perhaps there are tons of developers working in this space, but I think this would be a great place to hang out for a while.  The Extract, Transform and Load (ETL) is still the trickiest part of the Business Intelligence life cycle, and Hadoop provides an extra tool to leverage to enhance existing data warehouses that have been operational for decades.  Not to mention the cloud aspect and downstream visualizations.  Once you get into the weeds of big data, it's kind of cool.

And there you have it~!

10/02/2016

Intro to Orc tables in Hadoop

In Hadoop, we have tables in Hive.  We can create tables, import data, query from them, and drop them.

Another nice feature of Hadoop Hive tables is the ability to save your data in binary, columnar tables.  These are known as ORC tables.  They are fairly easy to create:

USE myHadoopDatabase;
DROP TABLE IF EXISTS NewTableORC;
CREATE EXTERNAL TABLE NewTableORC (
 SiteID INT,
 LocationKey STRING,
 SiteDescription STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION '/user/HIVE/NewTableORC';

Once created, you simply import the data using an INSERT ... SELECT statement:

USE myHadoopDatabase;
INSERT OVERWRITE TABLE NewTableORC SELECT * FROM OriginalTable;

Then drop the originating table if you wish.

USE myHadoopDatabase;
DROP TABLE OriginalTable;

That way, if someone happened to be nosy and looked through your data, if they had access to it, they would only see gibberish:


And the other benefit: it reduces the file size via built-in compression.  Here you can see the ORC table vs. the original table file size, from 477,668 down to 62,554, for about 9k rows:


There's also a built-in "ZLIB" compression table property, STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB"); it could reduce the file size further.
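As a rough sketch, here's the same table declared with ZLIB compression (the NewTableORCZlib name is just a placeholder):

USE myHadoopDatabase;
CREATE EXTERNAL TABLE NewTableORCZlib (
 SiteID INT,
 LocationKey STRING,
 SiteDescription STRING
)
STORED AS ORC
LOCATION '/user/HIVE/NewTableORCZlib'
TBLPROPERTIES ("orc.compress"="ZLIB");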

And that's a basic intro to ORC tables in Hadoop.

Get Sh#t Done!