10/03/2018

Azure Data Factory with Azure Databricks or Azure Data Lake Analytics U-SQL

We have been building and supporting Data Warehouses for more than two decades.  Lots of standardization and design patterns have been fleshed out.

With the advent of Cloud, new tools and architectures provide new options to developers.

With that change has come discussion comparing traditional SQL Server Integration Services (SSIS) and Azure Data Factory for Extract, Transform and Load (ETL).

SSIS can pick up, transform and push data with specific design patterns, to handle multiple scenarios across a variety of data sources.  

Packages are run in a variety of ways, to automate and orchestrate data loads and assist with the heavy lifting required to populate and refresh data warehouses.

Azure Data Factory is a new tool for developers to get data into Azure.  You can set up Pipelines to lift and shift data in a secure environment, and those pipelines can be scheduled, automated, logged and monitored over time.

Once data lands in Azure, in Azure Blob Storage or Azure Data Lake Store, the ETL transformations need to cleanse the data and then move it to your landing zone, Azure SQL Data Warehouse.  

Within Azure Data Factory, you have the ability to use Azure Databricks to apply those transformations.  

Azure Databricks is an Azure service that sits atop Apache Spark, which opens up in-memory processing and a variety of languages, including Scala, Python, R, Java and SQL.  

Programmers use Notebooks to interact with the data, selecting a single language or combining several to process it.  

These jobs can be automated on a schedule, with the results pushed to your Azure SQL Data Warehouse.
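As a rough illustration, here is the kind of row-level cleansing a Databricks Python notebook might apply before the push to the warehouse.  This is a plain-Python sketch with hypothetical field names; in a real notebook you would typically express the same logic with PySpark DataFrames.

```python
import csv
import io
from datetime import datetime

def cleanse_row(row):
    """Trim whitespace, turn empty strings into None, and normalize
    the (hypothetical) order_date field to ISO format."""
    cleaned = {k: (v.strip() or None) for k, v in row.items()}
    if cleaned.get("order_date"):
        cleaned["order_date"] = datetime.strptime(
            cleaned["order_date"], "%m/%d/%Y").date().isoformat()
    return cleaned

# Sample CSV as it might land in Blob Storage or Data Lake Store.
raw = "customer,order_date,amount\n  Acme  ,10/03/2018,  42.50\n"
rows = [cleanse_row(r) for r in csv.DictReader(io.StringIO(raw))]
print(rows[0])  # {'customer': 'Acme', 'order_date': '2018-10-03', 'amount': '42.50'}
```

In a notebook, a function like this would run across the cluster rather than row by row on a single machine, but the cleansing logic is the same.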

This mirrors our traditional SSIS ETL process and transformations.

However, there is another approach within Azure, using a similar methodology.  

Instead of using Azure Databricks from within Azure Data Factory, you also have the option to use Azure Data Lake Analytics.  ADLA uses the U-SQL language.  

U-SQL combines flavors of familiar languages, like SQL, C# and LINQ, and the language is quite powerful.

You can string a series of statements together to pull in data, apply transformations and filters, and place the data into a variety of formats and locations within Azure.  

U-SQL handles structured and unstructured data, including CSV, text and JSON files, IoT and streaming data, using the Avro, ORC, text and Parquet storage formats.  

And the output can be sent back to Azure Data Lake Store, Azure SQL Database and even to your Azure SQL Data Warehouse.
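To give a feel for the language, here is a minimal, illustrative U-SQL script; the paths, columns and threshold are hypothetical.  It extracts a CSV file from Azure Data Lake Store, applies a filter using C#-flavored expressions, and writes the result back out:

```sql
// Illustrative only: input/output paths and schema are hypothetical.
@orders =
    EXTRACT customer string,
            order_date DateTime,
            amount decimal
    FROM "/input/orders.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// SQL-like query with C#-style expressions mixed in.
@large =
    SELECT customer.ToUpper() AS customer,
           order_date,
           amount
    FROM @orders
    WHERE amount > 100.0m;

OUTPUT @large
TO "/output/large_orders.csv"
USING Outputters.Csv(outputHeader: true);
```

Note the blend the post describes: the overall shape is SQL, while `customer.ToUpper()` and the `100.0m` decimal literal come straight from C#.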

A nice thing about U-SQL is the ability to write, modify and execute code from Visual Studio 2015 and 2017, which integrate nicely with Team Foundation Server and Git source code repositories.

The jobs are executed on the Azure cloud platform, where you can specify the number of Analytics Units to process the query.  Results are stored for an audit trail, including success or failure status, error codes, the exact query and a timestamp.

U-SQL is a great language addition to the Azure Data Platform.

And here's a video of the Azure Data Factory / Azure Databricks use case on YouTube:




So regarding the discussion comparing traditional SSIS and Azure Data Factory, each has benefits depending on the use case.

And when using Azure Data Factory, you have two good options: 
Azure Databricks or Azure Data Lake Analytics.

Thanks for reading~!