SSIS & U-SQL Project Using Multi-Hops to Flow Data to Azure

Building out an Azure solution, using SSIS to pull from source database, apply some data messaging with dynamic Derived Columns to swap out "\n", flow data to Azure Data Lake Store using SSIS component, pick up the CSV file using U-SQL .Csv Extractor, into U-SQL database table.

Solid concept, yet it's also automated, to build out each independent SSIS file, based on source table / field name / field type, Derived Column's also built dynamically for any Strings (Varchar  Char) fields, Data Lake Store file path dynamic, as well as U-SQL dynamically built for each table, and dynamically swap back the "\n" characters,

If you've ever looked under the hood of an SSIS package, the XML is a complex set of code, yet it has patterns, and those patterns can be duplicated.  How is this accomplished?

Using an SSIS Script Component using c#, reading from Config file and Package Variables.  It also builds unique dtsConfig file, same naming convention of package DTSX file.

Files can be set to run using series of parent packages, executed from DTExec BAT file.

I've used Visual Studio 2015 with all the required installs and Add-Ons, and then tried VS2017, and prefer the newer version.  Have ability to create U-SQL projects to write U-SQL, execute jobs from on-premise VS, view job progress in Azure Solution Explorer, and verify data on Azure side, make sure it lands in correct place, data in correct format, and lands in U-SQL, and view all previous job, both success and failure.  No ability to set breakpoints on U-SQL code in real time, so execute view, modify script, repeat.

VS integrates with source control.

One thing to keep in mind, are Nulls.  When creating tables and writing U-SQL scripts, must append "?" indicate Nulls allowed.  And when pushing data to Azure Data Lake, it transposes Nulls to "\N", so you have to handle them in the U-SQL code, although a built Null Escape function handles this.  You do need to handle special characters embedded within source raw columns such as tabs, commas, new line feeds, etc. or the column counts will be off and throw errors.

U-SQL is quite powerful, as you can nest statements, use variables, reference DLL assemblies, and have a variety of formats to work with, such as CSV, Text, JSON file source data, and variety of data formats like Avro, ORC, Parquet and Text.  You can source data from Azure Data Lake Store, Blob Storage as well as other Azure data storage types.  So it goes beyond traditional Relational Database requirements of knowing Structure ahead of time, so semi-structured and Un-Structured data is okay.

After job runs, you can view run times, description of errors and it points to the specific error with "#' sign, and keeps exact script ran for future reference, and of course, its in the Cloud so others can view what jobs are running in real time.

With U-SQL databases, you can specify Database name, apply Schema names, Table names, Column names.  Data Types are standard, although there's no direct mapping from Source system to CSV to Database to SQL Server. 

You can write your own custom functions, in c#, upload the DLL to Azure, register and reference code to handle specific scenarios.

Because you can ingest JSON, you can pull in Streaming data from Azure Blob Storage or Azure Data Lake Store, pull the data in, parse, apply rules, and send data along the trail for storage and analysis.  This is nice for IoT data, reading in lots of data from mini-packets from sensors and devices out in the field.

As you know, storing data is not terribly expensive in Azure, so archiving data from On-Premise data can save money in the long run, make the data available for Ad-Hoc reporting, as well as integrate into other data sets like Azure SQL Data Warehouse,

U-SQL is a nice addition to the Azure platform, it provides coders much flexibility, easy ramp up time and handles Unstructured data.  Basically provides a language to do the heavy lifting once data reaches the cloud.

Combing with Azure Data Factory opens up lots of opportunity, flexibility and choices.  I'm just waiting to sink teeth into building out that next.

Thanks for reading~!


In With the New U-SQL

I worked on a solid Microsoft Business Intelligence project, went okay, then off to another.  Parallel Data Warehouse, for a while, then off to another.  Then a Hadoop - PolyBase - Microsoft BI - MDS - Tabular project, went okay, then off to another.  Then a quick Tableau project, went okay, then off to another.  Then a SQL against AWS Hadoop project, went okay, then off to another.

Where are we at now?  Building out an Archive project, using Microsoft SSIS, c# script component to automate, Azure Data Lake Store and Azure Data Lake Analytics components, U-SQL Databases.

You can see, the projects differ every time.  Get up to speed, become expert level, deliver, then change technology.

Traditional Data Warehousing never excited me, I was late to the party, will slowly lose luster and support, for variety of reasons.  I think the Cloud is the place to be.

U-SQL is a newer language that combines SQL, c#, Linq and ETL.  It runs as a job, can be kicked off from On Premise IDE Visual Studio or in the Cloud.  It's dynamic, works with Structured and Unstructured data.

I think the language will mature, the IDE install will be streamlined, the reporting on U-SQL Databases will be available to more Reporting tools and Intellisense will improve.

U-SQL reminds me of combination of Hive and Pig, in the familiar Visual Studio IDE we've known for years, executes in the Cloud Azure, stores past jobs both Success and Failed, run time, Error Messages.

Quite tasty.


Current State of Machine Learning and Artificial Intelligence

Machine Learning has three main features.  

For some, recommendation engines one websites are valuable tools, perhaps suggesting good movies similar to ones you've watched already, based on preferences

For some, clustering of homes on real estate sites are good tools.  What is my home value based on similar homes within the region.

And for others, classifying images is a good tool, perhaps recognizing your friends in a picture on social media, without having to enter the tags in manually.

Recommendation, Clustering and Classifying.  Three solid pieces of machine learning or types.  And a variety of algorithms to get to that goal.

There's also Neural Networks, that are useful for artificial intelligence, they have weights that get toggled true or false, depending on incoming variables, with multiple layers, which produce an end result weighted probability.

Some common themes are lots of data, to train the model, to learn from known data.  some data needs to be labeled ahead of time, which is "Supervised Learning".  The opposite is having the model learn on its own, with the luxury of "tagged" data ahead of time, called "Unsupervised Learning".

There're also models that learn from other models, to reinforce learned behavior, the event happened, this was the result, let me track that, for future reference.

Some Neural Nets are very large, with hundreds of layers deep, to get fairly precise results.  They do require more compute power and memory and take time to process.

There are a slew of languages and tools to use when working with machine learning and artificial intelligence, both on-premise, virtual machines and in the Cloud.

The fact we have bigger data sets, better compute power, more ram and machines chips that can crunch more data faster.  In the past, typically large institutions had access to these types of machines, so Universities or large computer organizations were the only place to work on this technology.  Now you can program this on a laptop in you living room if you desire.  So things have tricked down into the hands of the many.  Thus, faster progress, cleaner solutions, better results.

The results have gotten better, in image recognition, speech recognition, translation tools from language to language, some in real time and many more.  At this point machine learning is a useful tool to assist humans in everyday activities.  It has not gotten to the point where AI is replacing everyday jobs, in most cases.  But the tides will shift at some point, where people are competing with smart machines for everyday jobs.  And when robots are tasked with specific tasks, that too will automate some of the workforce.  And when smart machines can perform complex tasks in real time with minimal error, then we will take notice as computers do not necessarily require vacation time, or health insurance, or 401k matching contributions, and they work 24/7 without complaining or forming Unions.

So, we have come a long way, computers can crunch data and translate into meaningful information for consumption, to lower costs and automate menial tasks.  Where do we go from here?  More automation, easier to use tools, better integration across domains, access in everyday tasks, embedded IoT devices that assist in real time.  Artificial Intelligence should not be interpreted strictly as computers rising up, taking over man kind.  Surely, anything is possible, at this point in time, we are still telling the machines what to do.  And they do their specific tasks, in a sort of black box, and we do not necessarily know exactly how the results were derived.

With Moore's Law, things will only get faster, cheaper, easier to use and proliferate through everyday society.  That should keep us busy for a while, for sure.


Hop Aboard the Data Bandwagon

It does appear the Cloud is the place to be, certainly gained traction past few years.  Few key players to choose from.  Offer mostly the same services, just depends on what flavor you prefer, how well it integrates into current stack, and your developer pool staff skills and availability.

It seems Machine Learning is still hot along with Artificial Intelligence, although it bypassed many folks from the data pool skills set, requiring new languages like Python, R, Scala, Spark, Notebooks and algorithms.

Big Data didn't solve all our problems, but it's a nice addition to the data stack, has some good hooks to get data in and out, as well as reporting, although it won't be our traditional transaction database as expected.

Organizations still want their data, in readable format, in timely manor, with accuracy expected.  Although hiccups in the Extract Transform and Load still seem to play havoc on our daily reporting needs.

Data Science still seems hot, although just because you have talented DS, doesn't mean they can write standard SQL or knock out some traditional reports.  The two seem mutual exclusive and don't necessarily overlap, as new college grads jump straight into DS with little to no knowledge of traditional data ecosystem.

Data Engineer is a hot position, joining disparate data sets for others to work with.

Data Management has probably risen the fastest, as new GDPR rules necessitate solid data practices.  With that, having knowledge of where all the data resides, what it contains, how to access has pushed the Metadata into the spotlight, with Data Catalogs to handle such requests.

We are no longer at the point where knowing SQL or Access or Excel can guarantee you a data position.  Data skills have proliferated, grown, splintered, gone sideways and every which way.  The Data Ecosystem has exploded and many new comers have entered the arena, as developers, architects, software vendors and applications.

Throw in Blockchain, Agile Methodology, Streaming, IoT and Domain Knowledge, you can clearly have your hands full for the next decade or so.

Suffice to say, data is hot.  Anyone entering the workforce that's looking for a solid career, should look no further than data.  It's the bread and butter of every department in every organization.  There are some good companies foaming at the mouth, to get some talent in the door, to hit the ground running, and add value across the board.

So hop aboard the Data Bandwagon.