10/02/2016

Intro to ORC tables in Hadoop

In Hadoop, structured data typically lives in Hive tables.  We can create tables, import data, query them, and drop them.
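
To follow along, here's a minimal sketch of a plain text-backed source table like the OriginalTable used later in this post (the database, table, column names, and input path are placeholders):

USE myHadoopDatabase;
CREATE TABLE OriginalTable (
 SiteID INT,
 LocationKey STRING,
 SiteDescription STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
-- Load a hypothetical tab-delimited file already sitting in HDFS
LOAD DATA INPATH '/user/HIVE/source.tsv' INTO TABLE OriginalTable;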

Another nice feature of Hive is the ability to store your data in a compressed binary columnar format known as ORC (Optimized Row Columnar).  ORC tables are fairly easy to create:

USE myHadoopDatabase;
DROP TABLE IF EXISTS NewTableORC;
-- No ROW FORMAT or delimiter clauses here: ORC is a binary format
-- and defines its own serialization, so they would be ignored
CREATE EXTERNAL TABLE NewTableORC (
 SiteID INT,
 LocationKey STRING,
 SiteDescription STRING
)
STORED AS ORC
LOCATION '/user/HIVE/NewTableORC';

Once it's created, you simply import the data with an INSERT ... SELECT statement:

USE myHadoopDatabase;
INSERT OVERWRITE TABLE NewTableORC SELECT * FROM OriginalTable;
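
Before dropping the source table, a quick sanity check that the row counts match can save headaches (a simple sketch using the tables above):

SELECT COUNT(*) FROM OriginalTable;
SELECT COUNT(*) FROM NewTableORC;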

Then drop the originating table if you wish.  Note that if the original is an EXTERNAL table, DROP TABLE removes only the Hive metadata; the underlying files remain in HDFS and would need to be deleted separately.

USE myHadoopDatabase;
DROP TABLE OriginalTable;

That way, if someone nosy with access to the underlying files browsed through your data, they would only see gibberish rather than readable tab-delimited text.
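
You can see this for yourself by dumping the first bytes of an ORC data file from the shell (a sketch; the actual file name under the table directory will vary):

hdfs dfs -cat /user/HIVE/NewTableORC/000000_0 | head -c 200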


The other benefit is reduced file size from the built-in compression.  In my case, a table of about 9,000 rows shrank from 477,668 bytes as text to 62,554 bytes as ORC.
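
You can compare the sizes yourself with the HDFS du command (the ORC path matches the LOCATION above; run the same command against wherever the original table's files live, e.g. under the Hive warehouse directory):

hdfs dfs -du -h /user/HIVE/NewTableORC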


There's also a built-in "orc.compress" table property for choosing the compression codec, set via STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB"); depending on your Hive version's default codec, setting it explicitly could reduce the file size further.
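
For example, here's a sketch of the same CREATE statement with the codec set explicitly (SNAPPY is another common choice that trades some compression ratio for speed):

USE myHadoopDatabase;
DROP TABLE IF EXISTS NewTableORC;
CREATE EXTERNAL TABLE NewTableORC (
 SiteID INT,
 LocationKey STRING,
 SiteDescription STRING
)
STORED AS ORC
LOCATION '/user/HIVE/NewTableORC'
TBLPROPERTIES ("orc.compress"="ZLIB");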

And that's a basic intro to ORC tables in Hadoop.
