8/31/2014

My First Microsoft ML Project in Azure

Today I entered the world of Microsoft Azure Machine Learning. I posted a blog about it here:

http://www.bloomconsultingbi.com/2014/08/my-view-of-microsoft-azure-machine.html

So later today, I signed up for Microsoft Azure using the account for my employer.  It took about 5 to 10 minutes to create the account, set up a blob storage and then click on Machine Learning.

I attempted to do a complete project, based off the blog post from Sebastian Brandes, located here:

http://blogs.msdn.com/b/sbrand/archive/2014/07/22/tutorial-how-to-train-a-neural-network-with-azure-machine-learning.aspx

I took some screenshots along the way to document the steps:



Creating a Microsoft Azure account based off my company's MSDN subscription:



Created:


It gives a $100 / 30 Day free credit for signing up:


Here's the main site:


And Machine Learning:


Here I'm creating an ML site:


It's provisioning here:

And there it is, the first Machine Learning site on Microsoft Azure:


The main Machine Learning page:


And the main menu, with a variety of links:


And some sample projects to whet the appetite:


The Dashboard shows your usage:

And here are some settings for my workspace:

Shows that I have no Web Services created yet:

Here's where I clicked on a sample project from Machine Learning Samples:


Example #2:


Example #3 from Samples:


And another sample:



Here I'm uploading my own Data Set, called BreastCancer.csv, which I downloaded off the web:


Showing the progress of the upload. I really like the color scheme of the web site as well:


Here's the actual project I'm going to work on, starting from scratch:


I've dropped my uploaded csv data set onto the project canvas:


Some property info:

Adding the Project Columns component, which is linked from the dataset by dragging the circle on the dataset to the Project Columns component. This will feel very familiar if you've worked with Microsoft SQL Server Integration Services (SSIS, formerly DTS in SQL Server 2000):


Here I'm excluding the column Code Number, as it's a unique identifier and provides no added value to the project:
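
For anyone who thinks better in code, here's a rough sketch of what excluding a column amounts to. This is plain Python, not what Azure ML runs behind the scenes, and the column names and the two sample rows are placeholders I made up for illustration:

```python
import csv
import io

# Hypothetical sample: column names and values are illustrative only.
SAMPLE_CSV = """CodeNumber,ClumpThickness,Class
1000025,5,2
1017122,8,4
"""


def project_columns(rows, exclude):
    """Return copies of the row dicts without the excluded column names."""
    return [{k: v for k, v in row.items() if k not in exclude} for row in rows]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
projected = project_columns(rows, exclude={"CodeNumber"})
print(projected[0])  # the identifier column is gone, the rest is untouched
```

The identifier is dropped because it's unique per row, so a model can't learn anything general from it.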

And here you can see the link from Data Set to Project Columns:



And some more properties. In the video I'm trying to duplicate, the Class column, which indicates whether the breast cancer was benign or malignant, uses 2 for benign and 4 for malignant, so we're dividing by 2 here:


While replacing the field Class (overwrite basically):


And here we're subtracting 1, so (4/2)-1 = 1 or (2/2)-1=0, to give a 0 or 1 as final answer, which translates to most languages as False / True.  Just a different way of doing it I suppose:
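
The same arithmetic as a tiny Python sketch. The function name remap_class is my own, and the check for unexpected values is an extra I added, not something the Azure component does:

```python
def remap_class(label):
    """Map the original class code (2 = benign, 4 = malignant) to 0 or 1."""
    if label not in (2, 4):
        raise ValueError(f"unexpected class code: {label!r}")
    # Divide by 2, then subtract 1: (2 / 2) - 1 = 0 and (4 / 2) - 1 = 1.
    return label // 2 - 1


for code in (2, 4):
    print(code, "->", remap_class(code))
```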


Here we're cleaning the data, handling any missing (null) values by removing the row entirely:
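
In code, that cleaning step could look something like the sketch below. I'm assuming missing values show up as an empty field or a "?" marker; check your own file before relying on that. The sample rows are made up:

```python
# Assumption: a missing value appears as an empty string or a "?" marker.
MISSING_MARKERS = {"", "?"}


def drop_incomplete_rows(rows):
    """Keep only the rows in which every field has a value."""
    return [
        row
        for row in rows
        if not any(str(value).strip() in MISSING_MARKERS for value in row)
    ]


# Hypothetical rows for illustration.
sample = [
    ["5", "1", "2"],
    ["8", "?", "4"],  # dropped: contains a missing marker
    ["3", "", "2"],   # dropped: contains an empty field
]
print(drop_incomplete_rows(sample))
```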


Added this component, didn't change any of the settings:


Here we're splitting the data set into 80% used to train the model and 20% used to test the model, which is typical with neural networks:
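
A minimal sketch of an 80/20 split in plain Python. The shuffle and the fixed seed are my own choices so the split is random but repeatable; I don't know exactly how the Azure component picks its rows:

```python
import random


def split_rows(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows, then cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = split_rows(range(10))
print(len(train), len(test))  # 8 rows to train on, 2 held back for testing
```

The held-back 20% is never shown to the model during training, which is what makes it a fair test.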



Checking the name of the BlobStorage account, in order to place a CSV file there as output:


It prompts for the account name, key and directory path. I suggest you prefix the directory path with "results/":


Here you can see the Project running:


And it completed successfully, after a few false starts:


We then go to our BlobStorage to view the output CSV files:


Here's the first one. Although I didn't tell it to send the column headers, you can see that the first field shows a score of 97.8%, which would indicate an expected value of 1, or True, meaning breast cancer was detected. A score such as 0.05 would return a False, or no cancer detected, and so on.


Here's the other file, row by row. The first 0 or 1 column is the actual value in the original spreadsheet, and the 0/1 next to it is what the predictive model said the result would be. In most cases the two numbers matched, indicating a fairly accurate model.
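
Here's a small sketch of how the two output files relate: a score gets turned into a 0 or 1, and the predictions get compared with the actual values. The 0.5 cut-off is my assumption, and apart from the 0.978 and 0.05 scores mentioned above, the numbers are made up:

```python
def to_label(score, threshold=0.5):
    """Turn a score between 0 and 1 into a 0/1 prediction (threshold assumed)."""
    return 1 if score >= threshold else 0


def accuracy(actual, predicted):
    """Fraction of rows where the predicted label matches the actual one."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)


scores = [0.978, 0.05, 0.40, 0.10]  # first two from the post, rest hypothetical
actual = [1, 0, 1, 0]               # hypothetical actual values
predicted = [to_label(s) for s in scores]
print(predicted, accuracy(actual, predicted))
```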



And it appears no hours were used during this demo:


And here you can see the spike on Microsoft Azure where I performed the Machine Learning job and ran it, didn't take very long:

Here's the Experiment Dashboard view, with the latest project accessible:


And this shows the job history with a couple of failed runs, which I had to troubleshoot and clean up. In my case the output directory had to have the prefix of "results/":


There are a ton of useful components available in Machine Learning.  Here are a few of the groups to choose from:






Quite a lot to learn to say the least.


So as you can see from the vast number of screenshots, I was able to duplicate, to some degree, the steps in the blog post from Sebastian.  I uploaded my data set to Azure using a custom CSV file downloaded from this website, same as he did:

http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/

I now believe that there's a major difference between these roles.  A Report Writer and a Data Warehouse Developer combine to make a Business Intelligence Developer, which may or may not include Hadoop / NoSQL development.  And finally there's the Data Scientist, who works with Machine Learning and statistical algorithms to derive insights from the data.  And if he / she is using Microsoft Azure Machine Learning, he / she has the entire arsenal of solutions in a simple web browser.  It also reduces the pain of writing your own algorithms and custom code to interrogate the data and, finally, lets you push the entire solution to production with a simple mouse click.

It does get more complicated.

Thanks for reading my blog post. I do have a lot more respect for what a true data scientist does after this endeavor.

You can read my latest blog post about Predictive Models here: http://www.bloomconsultingbi.com/2014/08/so-what-is-predictive-modeling.html

Babalon