Thoughts on Spark on Azure HDInsight

Tonight I attended the Tampa Data Science Meetup User Group.  The topic was Spark on Azure HDInsight.

We watched a video in which Scott Hanselman and a member of the Spark group discussed some of the functionality in demo format.

I sort of know what Spark does.  It's a layer that can sit on top of Hadoop and processes data in memory.  You can stream data, and query data, similar to HiveQL, using SQL or DataFrames.  It has graph technology, and it works with a variety of languages.

One thing that stood out was the ability to use notebooks.  I'd sort of seen these before but wasn't sure what they do or why you'd use them.

One is called Apache Zeppelin, which covers:

  • Data Ingestion
  • Data Discovery
  • Data Analytics
  • Data Visualization & Collaboration

The other is called Jupyter, which offers "Open source, interactive data science and scientific computing across over 40 programming languages."

You can connect to a cloud data source, query its metadata, run queries in SQL or Scala against Spark and initiate streaming jobs and query the contents in real time.

Seems like a portal to the cloud without having to connect to their VM or server, quite powerful.  I will need to check that out for sure.

One thing I will say about the online demo: a lot of what it showed overlaps with what Microsoft already offers.

In Spark on Azure HDInsight, you can stream data and view the stream in real time.  Microsoft already provides this in Event Hubs and Stream Analytics.

In Spark on Azure HDInsight, you can query Hadoop.  Microsoft already has their own flavor of Hadoop called HDInsight.

In Spark on Azure HDInsight, you can run machine learning with SparkML.  Microsoft already has their own flavor of ML called Azure ML, with a built-in WYSIWYG editor in the cloud, powerful pre-built algorithms, and lots of help files, and you can deploy solutions to a public endpoint to be consumed from clients in Excel or custom C#, Python or R.

In Spark on Azure HDInsight, you can query Graph databases.  I'm not sure if this functionality exists on Azure already, will need to research.

The other thing: the demo showed someone querying the Spark data from Power BI.  But they didn't go into the underlying tables and how to mount a table similar to Hive.  And there wasn't much mention of how to manipulate the data using an equivalent to Pig.  They did mention Kafka Streaming in traditional Hadoop.  But they didn't mention how to get the data up to the cloud.  And I'm not sure if they have Spark 2.0 in the cloud.  However, when you spin up your cluster, they pre-load some good software so you don't waste time doing installations.

One thing is for certain, if Microsoft has committers on the Spark Open Source team, that says a lot.  Just like SQL Server for Linux.

I don't think there is a technology that will not eventually end up in the cloud.  Great for renting "time" like we did back in the olden days, when you rented time on the mainframe or VAX.  Spin up fast, mount your data in Azure Storage, tear down your server, and only get billed for uptime.  How much?  Wait until the invoice shows up at the end of the month.  Like leasing a car, with variable monthly payments.

Overall, Spark on Azure HDInsight seems like a cool thing to get into, and it has all the bells and whistles of the Azure infrastructure that we all love and enjoy, our one stop shopping for any technology under the sun.

And there you have it~!  My 2 cents.