In Hadoop, Hive gives us tables: we can create them, import data, query them, and drop them.
Another nice feature of Hive is the ability to save your data in binary tables using the ORC (Optimized Row Columnar) format. ORC tables are fairly easy to create:
DROP TABLE IF EXISTS NewTableORC;
CREATE EXTERNAL TABLE NewTableORC (
  -- column definitions go here (hypothetical example):
  id INT,
  name STRING
)
STORED AS ORC;
-- Note: ROW FORMAT DELIMITED / FIELDS TERMINATED BY clauses only apply
-- to text files; ORC is a binary format with its own internal layout.
Once created, you simply populate the table with a SQL SELECT statement:
INSERT OVERWRITE TABLE NewTableORC SELECT * FROM OriginalTable;
Then drop the originating table if you wish (if OriginalTable is EXTERNAL, DROP TABLE removes only the Hive metadata; the underlying files stay on HDFS):
DROP TABLE OriginalTable;
As a bonus, if someone nosy had access to the underlying files and browsed through them, they would see only binary gibberish rather than readable text (keep in mind this is obfuscation, not encryption).
The other benefit is reduced file size from built-in compression: comparing the ORC table to the original table, the file shrank from 477,668 bytes to 62,554 bytes for roughly 9,000 rows.
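For reference, those quoted sizes work out to roughly a 7.6x reduction; a quick back-of-the-envelope check:

```python
# File sizes quoted above (bytes, ~9k rows)
original_bytes = 477_668
orc_bytes = 62_554

ratio = orc_bytes / original_bytes
print(f"ORC file is {ratio:.1%} of the original "
      f"(about {original_bytes / orc_bytes:.1f}x smaller)")
# -> ORC file is 13.1% of the original (about 7.6x smaller)
```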
There's also a built-in "ZLIB" compression table property,
STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB");
which can reduce the file size further.
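To get a feel for why this helps, here is a toy illustration using Python's standard zlib module (the same DEFLATE-based algorithm behind ORC's ZLIB codec); the sample data is made up:

```python
import zlib

# Hypothetical, repetitive column-style data -- the kind of content
# typical Hive tables are full of.
rows = "\n".join(f"{i}\tactive\tus-east" for i in range(1000)).encode()
packed = zlib.compress(rows)

print(f"raw: {len(rows)} bytes, zlib: {len(packed)} bytes "
      f"({len(packed) / len(rows):.1%} of original)")
```

Repetitive values compress dramatically, which is why tables with low-cardinality columns see the biggest savings.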
And that's a basic intro to ORC tables in Hadoop.