10/12/2016

Connecting SQL Server 2016 to Hadoop ORC tables using Polybase

When working with SQL Server 2016, you may have a need to pull data from Hadoop HDFS Hive Orc tables.  In order to do that, there are certain steps that must be configured.  Here's the step used to create the connection:
-----------------
-- run this from SSMS
----------------- 
select SERVERPROPERTY('IsPolybaseInstalled')
-----------------
-- Create an external data source
-- Polybase connectivity configuration -
https://msdn.microsoft.com/en-us/library/mt143174.aspx
•Option 0: Disable Hadoop connectivity
•Option 1: Hortonworks HDP 1.3 on Windows Server
•Option 1: Azure blob storage (WASB[S])
•Option 2: Hortonworks HDP 1.3 on Linux
•Option 3: Cloudera CDH 4.3 on Linux
•Option 4: Hortonworks HDP 2.0 on Windows Server
•Option 4: Azure blob storage (WASB[S])
•Option 5: Hortonworks HDP 2.0 on Linux
•Option 6: Cloudera 5.1, 5.2, 5.3, 5.4, and 5.5 on Linux
•Option 7: Hortonworks 2.1, 2.2, 2.3, 2.4, and 2.5 on Linux
•Option 7: Hortonworks 2.1, 2.2, and 2.3 on Windows Server
•Option 7: Azure blob storage (WASB[S])

-----------------
-- Create Hadoop connectivity
----------------- 
sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
GO
RECONFIGURE
GO

----------------- 
-- configure local config's for Hadoop
----------------- 
C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf
-----------------
need to modify the yarn-site.xml file and add your classpath in the values node
-----------------
-- create connection to Hadoop
-----------------
CREATE EXTERNAL DATA SOURCE HDP WITH
(
 TYPE = HADOOP,
LOCATION = 'hdfs://myHadoopURLorIPAddress:8020'
)

-----------------
-- Create an external file format for ORC files.
-----------------
CREATE EXTERNAL FILE FORMAT EFTHDPORC
WITH (
 FORMAT_TYPE = ORC
, DATA_COMPRESSION = {
'org.apache.hadoop.io.compress.SnappyCodec'
 );

-----------------
-- create your table in the database
-----------------
--DROP EXTERNAL TABLE MyTableORC
-----------------
CREATE EXTERNAL TABLE MyTableORC (
Field1 INT,
Field2 varchar(255),
Field3 INT,
Field4 INT,
Field5 varchar(255)
)
WITH (
LOCATION='/user/hcat/HIVE/MyTableORC',
DATA_SOURCE = HDP,
FILE_FORMAT = EFTHDPORC
)
;
-----------------

-- run in SSMS, database in which you created your table
----------------- 
SELECT
 Fiel1,
 Field2,
 Field3,
 Field4,
 Field5
FROM [DBStage].[dbo].[MyTableORC]

-----------------
-- The following new catalog views show external resources

-----------------
SELECT * FROM sys.external_data_sources;
SELECT * FROM sys.external_file_formats; 
SELECT * FROM sys.external_tables; 
SELECT name, type, is_external FROM sys.tables WHERE name='MyTableORC' 


In SQL Server, under External Resources, you will see your new settings:


Under External Tables, you will see the Metadata tables that point to your Hadoop cluster:



Using Polybase, you can expose your Hadoop Hive tables, and query them in SQL server or from SSIS as Source just like any other data source.

Hope that helps get started with Polybase in SQL Server 2016 connection to Hadoop.  Thanks~!

No comments:

Post a Comment

We Interrupt this Broadcast