9/01/2014

My First Microsoft Azure HDInsight Hadoop Cluster

To continue with my latest trend to learn Microsoft Azure, I dabbled in HDInsight yesterday.

After logging onto Azure with the single sign-on, I created a Hadoop HDInsight cluster, which took about 5 minutes to spin up.


Here you can see the cluster being created:

It's been created, it's now running...

You can see its Dashboard...


And its usage monitor...


You should probably take note of this toggle button: you get charged while Hadoop is running, so you can shut it down here...

Here you can run Hive jobs, in the Hive Editor...

I started a Hive job; here you can see it just completed...



Here's the file browser...



With a HiveSampleData.txt file in one of the sub folders...


And the raw contents of the text file...


Going back to the main portal page... you can view by Hive...


Or you can view by MapReduce...


I downloaded and installed one of the Azure Explorer applications. Here you can see the folder contents, with each of the files contained within the folder...


And a closer look...


Here it's installing the Microsoft Web Platform Installer...




Which allows you to open a Remote Desktop session to where the entire Hadoop cluster resides. Here I scanned the HDFS folder contents...


So that's as far as I took it yesterday.  Remember to shut down your Microsoft Azure cluster when it's not in use, as it's like a power meter: every second costs money.  I've heard some stories of people racking up bills because they forgot to power their HDInsight cluster down when finished.
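
To put the power meter idea into rough numbers, here's a minimal Python sketch. The hourly rate per node is a made-up placeholder, not an actual Azure price, and the function name is just something I picked for illustration. Real billing will differ, so treat this as back-of-the-envelope arithmetic only.

```python
# Rough cost sketch for a cluster that bills for as long as it is running.
# NOTE: the rate below is a HYPOTHETICAL placeholder, not a real Azure price.

HYPOTHETICAL_RATE_PER_NODE_HOUR = 0.50  # assumed currency units per node per hour


def estimate_cluster_cost(nodes, hours_running,
                          rate_per_node_hour=HYPOTHETICAL_RATE_PER_NODE_HOUR):
    """Estimate the charge for keeping `nodes` nodes up for `hours_running` hours."""
    if nodes < 0 or hours_running < 0 or rate_per_node_hour < 0:
        raise ValueError("inputs must be non-negative")
    return nodes * hours_running * rate_per_node_hour


if __name__ == "__main__":
    # A 4 node cluster used for a 2 hour session, then shut down.
    short_session = estimate_cluster_cost(nodes=4, hours_running=2)
    # The same cluster forgotten and left running for 30 days.
    forgotten = estimate_cluster_cost(nodes=4, hours_running=24 * 30)
    print("2 hour session:", short_session)
    print("left on for 30 days:", forgotten)
```

Whatever the real rate turns out to be, the gap between the two numbers is the point: the forgotten cluster costs 360 times what the short session does.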


So from one short session connected to Microsoft Azure, I was able to provision a Hadoop HDInsight cluster in my part of the country / region, selecting a 4 node cluster, and it was available in about 5 to 10 minutes.  Once it was active, I could upload files to the Cloud using the Azure Explorer application I downloaded.  From there, you could mount your data into a Hive table, run a HiveQL query and view the contents, then expose that Hive table to other applications like Machine Learning, Power BI or what have you.  Once logged into the Remote Desktop, you have free rein to build out your Hadoop environment.  I'm not sure if I took a screenshot, but YARN was part of the Hadoop cluster, so you know it's the latest and greatest version of Hadoop.
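
For the "mount your data into a Hive table" step, here's a rough sketch of what that might look like. The Python below only builds the HiveQL text for an external table over a tab-delimited file, plus a simple query against it, and prints both so you could paste them into the Hive Editor. The table name, columns and storage path are all hypothetical examples of mine, not what my cluster actually contained.

```python
# Builds HiveQL statements as plain strings; it does not connect to any cluster.
# The table name, columns and location used below are HYPOTHETICAL examples.


def _check_identifier(name):
    """Allow only simple names (letters, digits, underscore) to keep the SQL safe."""
    if not name or not name.replace("_", "").isalnum():
        raise ValueError("invalid identifier: %r" % (name,))
    return name


def build_external_table_ddl(table, columns, location, delimiter="\\t"):
    """Return a CREATE EXTERNAL TABLE statement for a delimited text file."""
    _check_identifier(table)
    col_defs = ",\n  ".join(
        "%s %s" % (_check_identifier(col), hive_type) for col, hive_type in columns
    )
    return (
        "CREATE EXTERNAL TABLE IF NOT EXISTS %s (\n  %s\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '%s'\n"
        "STORED AS TEXTFILE\n"
        "LOCATION '%s';" % (table, col_defs, delimiter, location)
    )


def build_preview_query(table, limit=10):
    """Return a query that shows the first few rows of the table."""
    return "SELECT * FROM %s LIMIT %d;" % (_check_identifier(table), int(limit))


if __name__ == "__main__":
    ddl = build_external_table_ddl(
        table="weblogs",                        # hypothetical table name
        columns=[("client_id", "STRING"),       # hypothetical columns
                 ("query_time", "STRING"),
                 ("dwell_seconds", "INT")],
        location="/example/data/weblogs/",      # hypothetical folder in storage
    )
    print(ddl)
    print(build_preview_query("weblogs"))
```

An external table felt like the right fit here because dropping it removes only the table definition and leaves the uploaded files where they are.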

How much does this cost?  I don't know.  I do know that installing a Hadoop cluster takes time.  Time and money.  And lots of settings to tweak, dependencies to install, servers to provision.  Imagine having a real Hadoop cluster up and running in 5 to 10 minutes, pointing to your data stored in Blob Storage in the cloud.  So weigh the cost of hosting it yourself against on-demand compute power that is scalable, redundant and always available via the web browser.  This should definitely be taken seriously when preparing to roll out a Hadoop Cluster.

Thanks for reading!

Babalon