My First Microsoft ML Project in Azure

Today I entered the world of Microsoft Azure Machine Learning.  Posted a blog about it here:


So later today, I signed up for Microsoft Azure using my employer's account.  It took about 5 to 10 minutes to create the account, set up blob storage and then click on Machine Learning.

I attempted to do a complete project, based on the blog post from Sebastian Brandes, located here:


I took some screenshots along the way to document the steps:

Creating a Microsoft Azure account based on my company's MSDN subscription:


It gives a $100 / 30 Day free credit for signing up:

Here's the main site:

And Machine Learning:

Here I'm creating an ML site:

It's provisioning here:

And there it is, the first Machine Learning site on Microsoft Azure:

The main Machine Learning page:

And the main menu, with a variety of links:

And some sample projects to whet the appetite:

The Dashboard shows your usage:

And here are some settings for my workspace:

This shows that I have no Web Services created yet:

Here's where I clicked on a sample project from Machine Learning Samples:

Example #2:

Example #3 from Samples:

And another sample:

Here I'm uploading my own Data Set, called BreastCancer.csv, which I downloaded off the web:

Showing the progress of the upload.  I really like the color scheme of the web site as well:

Here's the actual project I'm going to work on, starting from scratch:

I've dropped my uploaded csv data set onto the project canvas:

Some property info:

Adding the Project Columns component, which will be linked from the Dataset by dragging the circle onto it.  If you've worked with Microsoft SQL Server Integration Services (SSIS, formerly DTS in SQL Server 2000), this will feel very familiar:

Here I'm excluding the column Code Number, as it's a unique identifier and provides no added value to the project:

And here you can see the link from Data Set to Project Columns:

And some more properties.  In the video I'm trying to duplicate, the Class column, which indicates whether the breast cancer was benign or malignant, uses 2 for benign and 4 for malignant, so we're dividing by 2 here:

While replacing the field Class (basically an overwrite):

And here we're subtracting 1, so (4/2)-1 = 1 and (2/2)-1 = 0, giving a final answer of 0 or 1, which translates in most languages to False / True.  Just a different way of doing it, I suppose:
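Put together, the two math steps are just a label encoding.  Here's a minimal Python sketch of the same mapping, assuming the 2-benign / 4-malignant coding described above:

```python
def encode_class(label):
    """Map the dataset's Class codes to 0/1: divide by 2, then subtract 1."""
    return label // 2 - 1

print(encode_class(2))  # 0 -> benign / False
print(encode_class(4))  # 1 -> malignant / True
```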

Here we're cleaning the data, handling any null values by removing the row entirely:

Added this component, didn't change any of the settings:

Here we're splitting the data set: 80% is used to train the model and 20% to test it.  This is typical with neural networks:
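The same 80/20 split can be sketched in plain Python, shuffling the rows first so both partitions are representative.  This is a stand-in to show the idea, not Azure ML's actual Split implementation:

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows, then cut them into train/test partitions."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed for repeatable splits
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

train, test = train_test_split(range(100))
print(len(train), len(test))  # 80 20
```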

Checking the name of the Blob Storage account, in order to place a CSV file there as output:

Because it prompts for account name, key and directory path, I suggest you prefix the path with "results/":

Here you can see the Project running:

And it completed successfully, after a few false starts:

We then go to our Blob Storage to view the output CSV files:

Here's the first one.  Although I didn't tell it to send the column headers, you can see the first field is a probability: a value like 0.978 indicates an expected value of 1, or True, meaning breast cancer was detected, while a value like 0.05 would return False, no cancer detected, and so on.

Here's the other file.  Row by row, you can see the first 0 or 1 column was the actual value in the original spreadsheet, and the 0/1 next to it is what the predictive model said the result would be.  In most cases the two numbers matched, indicating a fairly accurate model.
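Comparing those two columns by hand is exactly an accuracy calculation.  A quick sketch, with sample rows made up purely for illustration:

```python
def accuracy(actuals, predictions):
    """Fraction of rows where the model's 0/1 matches the actual 0/1."""
    matches = sum(a == p for a, p in zip(actuals, predictions))
    return matches / len(actuals)

actual    = [1, 0, 1, 1, 0]   # values from the original spreadsheet
predicted = [1, 0, 1, 0, 0]   # values the predictive model produced
print(accuracy(actual, predicted))  # 0.8
```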

And it appears no hours were used during this demo:

And here you can see the spike on Microsoft Azure where I performed the Machine Learning job and ran it, didn't take very long:

Here's the Experiment Dashboard view, with the latest project accessible:

And this shows the job history with a couple of failed runs, which I had to troubleshoot and clean up; in my case, the output directory had to have the prefix "results/":

There's a ton of useful components available in Machine Learning.  Here are a few of the groups to choose from:

Quite a lot to learn to say the least.

So as you can see from the vast number of screenshots, I was able to duplicate the steps from Sebastian's blog post to some degree.  However, I uploaded my data set to Azure using a custom CSV file downloaded from this website, same as he did:


I now believe there's a major difference between a Report Writer and a Data Warehouse Developer, roles which combine to make a Business Intelligence Developer, who may or may not do Hadoop / NoSQL development.  And finally there's the Data Scientist, who works with Machine Learning and statistical algorithms to derive insights from the data.  And if he or she is using Microsoft Azure Machine Learning, the entire arsenal of solutions is available in a simple web browser.  It also reduces the pain of writing your own algorithms and custom code to interrogate the data, and finally, lets you push the entire solution to production with a simple mouse click.

It does get more complicated.

Thanks for reading my blog post.  I have a lot more respect for what a true data scientist does after this endeavor.

You can read my latest blog post about Predictive Models here: http://www.bloomconsultingbi.com/2014/08/so-what-is-predictive-modeling.html

So What Is Predictive Modeling

If a user requests a report, we can get that in a relatively short time.
If they want to join disparate data sources into a Data Warehouse, we can do that too.
At some point, the user has to analyze the data.  In doing so, they bring in bias, perhaps a skewed perspective and misreadings of the data.
However, that's been the world of data for the past few decades for the average Data Professional.

We did have predictive models years ago.  I know from working at the bank approving loans.  My decisions were based on a score.  Where did that score come from?  It read in variables from the loan application in addition to the credit report.  It looked at age, number of years at residence / employment, as well as number of revolving loans, installment loans, bad pay history and inquiries.  All those factors and more were sent to the model, which returned a score.  The score had thresholds: if the customer exceeded the threshold, he/she was approved.  Otherwise, declined.

However, that predictive model was based on the statistical chance of that customer repaying the loan in full.  There were no guarantees.  My boss always said there's the chance the customer gets hit by a bus and the loan goes into default.  The score is a probability of repayment.
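A scorecard like that bank model can be sketched in a few lines of Python.  The variable names, weights and threshold here are invented purely for illustration; they are not the bank's actual model:

```python
# Hypothetical point weights for each application variable.
WEIGHTS = {
    "years_at_residence": 2,
    "years_employed": 3,
    "revolving_loans": -4,
    "delinquencies": -10,
    "inquiries": -2,
}
APPROVAL_THRESHOLD = 20  # invented cutoff

def score(application):
    """Sum the weighted application variables into a single score."""
    return sum(WEIGHTS[field] * value for field, value in application.items())

def decide(application):
    """Approve if the score clears the threshold; otherwise decline."""
    return "approve" if score(application) >= APPROVAL_THRESHOLD else "decline"

applicant = {"years_at_residence": 5, "years_employed": 8,
             "revolving_loans": 2, "delinquencies": 0, "inquiries": 1}
print(score(applicant), decide(applicant))  # 24 approve
```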

So if we had predictive scoring models in 1994, what's changed?

A lot.  First off, the amount of data has blossomed.  Second, the tools to create models have entered the hands of the average person, so no PhD required.  Third, with a little knowledge of Statistics, Data Professionals can now create end-to-end, full-life-cycle applications in a short amount of time.

Based on the new offerings from Microsoft Azure, I could log on, upload a data set, and create an application by simply dragging and dropping components onto the canvas.  I could hook them up in such a way as to cleanse and transform the data, build a model, train the model by running a subset of the data through it to create weighted averages, then run a real data set through, have it back-propagate to increase the accuracy, score the model, and output the results for analysis.

Not only that, it will seamlessly create a REST web API, allowing programmers to send data to the endpoint, have the model application score the data, and return a probability to the calling app.  What that means is the model is exposed for actual use by any number of calling applications.

What's an example of end-to-end model usage?  Say I'm running a business, a person wishes to purchase a product, and as a company we extend credit to customers.  I'd like to know if this applicant is worthy of credit.  So as the person applies online, I as a programmer send the necessary information to the REST API through code.  The REST web service receives that info, sends it through the model, and returns a score to the calling app that says this potential customer, based on their specific characteristics, will be a good paying customer, so yes, extend a line of credit.  Or the opposite: do not extend credit, because his / her profile matches a subset of people who did not repay.
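From the programmer's side, calling such a published model boils down to an authenticated JSON POST.  The endpoint URL, key and field names below are placeholders, not a real service; the actual values come from the web service's dashboard once the model is published:

```python
import json
import urllib.request

ENDPOINT = "https://example.com/score"  # placeholder scoring URL
API_KEY = "YOUR-API-KEY"                # placeholder key

def build_scoring_request(applicant):
    """Package applicant data as a JSON request for the scoring endpoint."""
    body = json.dumps({"Inputs": applicant}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer " + API_KEY},
    )

req = build_scoring_request({"age": 34, "years_employed": 6})
print(json.loads(req.data)["Inputs"])  # {'age': 34, 'years_employed': 6}
```

Sending it is then a single `urllib.request.urlopen(req)` call, with the returned probability parsed out of the JSON response.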

So I could create a model to predict if a customer is credit worthy, without having to hire PhD Mathematicians or Statisticians.  In fact, I could do it all within a web browser in a relatively short amount of time.

And this is just the tip of the iceberg.  With the growing data sets available, the possibilities are endless.  But what makes it so unique is that Microsoft Azure Machine Learning offers the entire package online, with single user login, collaboration, and an end-to-end solution with easy-to-use software and good documentation.

This is where the world of Data is heading.  I think it's incredible.

My View of Microsoft Azure Machine Learning

Read a tweet today: the Internet of Things surpassed Big Data on the Hype Cycle.

Which is interesting, because my interest in Hadoop has fallen over the past few months.  I initially liked it because it was new territory and had value in handling unstructured data and large volumes.

Except I don't have unstructured data or large volumes.  And I never got to work on a project that used Hadoop, whether setting up and administering a cluster (Hadoop DBA) or developing in Hive or Pig or Flume or whatever.

It's like the best Christmas gift (or Chanukah gift) never opened.  And now every Tom, Dick and Harry is entering the space, which is getting a little too crowded for me.

However, my real interest has always been Artificial Intelligence and Neural Networks and Machine Learning.  So when I watched the following video on Microsoft Azure Machine Learning:


It kind of sparked my creativity again.  So today I watched another video on the subject:


My basic takeaways are this.

They've removed the need to program advanced algorithms.  So no PhD in Math required.
They've removed the need to program essentially.

In that the entire process can be performed within a web browser.

First you upload your data set, or use an existing one.  You simply add the data set to your project, then start dragging, dropping and connecting steps into your workflow from the huge list of available options.

Which means you kind of have to know what each of the widgets does.  As in Classification, Clustering and Regression for starters, but then some of the advanced algorithms and what they do as well.

Microsoft Azure Machine Learning is tightly integrated into Azure, so there's single sign-on and connectors to Blob Storage, Azure Tables and SQL Azure, as well as Hive and Hadoop.

So those two factors, single sign-on and being web-browser based, are huge.

Throw in the full life cycle processing, no need to learn advanced Mathematics or Programming and I see this as the future.

Plus the ability to move to production within minutes, having it build a REST API consumable by C# code, so programmers can send in data and receive a result based on the trained model, or send batches of data and receive a batch back.

However, I'll still need to get up to speed on a deeper understanding of Statistics, how to interpret the results and what kinds of projects to work on.

Any way you look at it, this stuff is awesome~!


Machine Learning on Microsoft Azure

Reporting started out very basic.  Connect to a data source, place some fields on the screen layout, add some groups, header / footer, perhaps a few Sums, run it and you have a working report.

Great, you could send those reports to people in Excel or PDF via email.

Next, Business Intelligence and Data Warehousing allowed developers to pull from a variety of data sources, consolidate into a model, either a Star or Snowflake schema, load it into a Cube, write some reports and allow users to consume them.

Great.  Except these scenarios require an end user to interpret the data.

So what's next?  Data Science.  Machine Learning.  And if you want to get started with Microsoft Machine Learning, check out this great video:


It will blow your mind.  Because it's web based, looks easy to use and can be ported to production quickly.

This is definitely cool.


Build a Winning Culture

Culture.  Every corporation has one.  Yet you can't see it.  You can only feel it.

I've worked for a variety of companies.  Some big, some small.  I've worked on teams where we just got along; everyone was excited about the projects, we collaborated, joked, went to lunch together and had fun while we worked.

I've worked at some companies where everyone was against each other, knowledge was kept secret, people never helped each other, stabbing in the back, politics, silos of people working on the same team.

And there have been cultures in the middle.

Who sets the culture?  Well, I think it trickles down from above.  If the CEO is a jerk, chances are the atmosphere will be a struggle.  If the CEO is a cutting-edge, forward-thinking type, chances are the environment will be easygoing.

I believe that you want a culture that promotes creativity, problem solving and suggesting ideas, where people give 110% because everyone else does too.  Which means you cannot micromanage, you cannot run the people into the ground, you cannot promote every man for himself.

We've all seen the reality shows where everyone is on the same team working against each other for their own survival.  That's not how it should be.  It drains people and the quality of work suffers.

If people are constantly in fear of losing their jobs, the quality suffers.  Which trickles down to the customers, who experience bad service, shabby products and will tend to move on.

You can measure all the numbers in the world for insight, but if you want to pick up sales, drive great products which people want to purchase and retain customers, you'd best work on creating a friendly work environment.

I believe this is often overlooked.  Many companies are so focused on profit and stock price and mergers and layoffs that we need to get back to the root of the problem: happy workers.  A happy worker will pay dividends over time.

It doesn't take a Data Scientist to figure this out.  It takes common sense.


What is Scope Creep?

So what is scope creep?

You know when you order a sandwich at a restaurant, maybe a salami sandwich, the waitress brings it to you.  You look at it, say you know what, I'd really like some mustard with this, could you please bring me some.  Waitress comes back with mustard.  Oh miss, you know what, I'd really like this on a challah roll, could you do that?  Sure, the customer is always right, be right back.  She returns with your salami sandwich with mustard on a challah roll.  Oh miss, I'd really like a pickle, could you please bring me one.  Sure I'll be right back.  Miss, I've decided I'd really like a coffee instead of water and lemon, could you please bring me one.  Thanks.  Oh miss, I'd really like an extra cream for my coffee, could you be a darling?  Oh miss, can I have some extra napkins?  Oh miss, can you please split the check?  But sir, you're eating alone.  Doesn't matter, customer is always right.

And there you have scope creep.  The customer is not always right.  There are limits to what you get when you buy something.  There's some leeway, but if you take advantage of it, you will have to shell out extra money.  It's not an all you can eat buffet of unlimited changes.  There's time and effort involved, which translates to $.


My Point of View on Hadoop Today

In the world of Data Warehousing your data resides in 4 places:
  1. Variety of Data Sources (structured Relational data)
  2. Extract Transform and Load (ETL)
  3. Data Warehouse Repository (Star / Snowflake Schema)
  4. Analytics / Presentation layer (OLAP Cube)
You could also argue the Business Layer exists within the Data Warehouse and sometimes the Analytics layer.  The DW developer is tasked with scrubbing, aligning, denormalizing for speed, removing duplicate data, cleansing and conforming to DW best practices.
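The scrubbing, aligning and de-duplicating work in the ETL step can be pictured with a toy transform pass.  The two source feeds and their field names below are made up purely for illustration:

```python
# Two hypothetical source feeds with inconsistent field names and formats.
crm_rows = [{"cust": "Acme ", "sales": "100"}, {"cust": "Beta", "sales": "250"}]
erp_rows = [{"customer": "acme", "amount": 100}]

def conform(row):
    """Align field names, trim and title-case text, cast amounts to float."""
    name = (row.get("cust") or row.get("customer")).strip().title()
    amount = float(row["sales"] if "sales" in row else row["amount"])
    return (name, amount)

seen, fact_table = set(), []
for row in crm_rows + erp_rows:
    record = conform(row)
    if record not in seen:  # remove duplicate data across sources
        seen.add(record)
        fact_table.append(record)

print(fact_table)  # [('Acme', 100.0), ('Beta', 250.0)]
```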

In the world of Hadoop, you also have similar architecture:

  1. Variety of Data Sources (structured and unstructured data)
  2. HDFS (Hadoop Distributed File System) layer containing a variety of file types
  3. Extract Load and Transform (ELT)
  4. Analytics / Presentation layer

Data Warehousing
The Data Warehouse must contain relational data.  A DW has size limitations; at some point there's query response degradation.  It requires beefed-up server(s).  Costly to host, maintain and enhance.  Good developers are hard to find.  If business rules change, or there's a merger or acquisition, it's often difficult to merge data with other repositories.  The DW has a solid methodology, proven over the past 20 years, with repeatable patterns.


Hadoop

Data can be relational but is not required to be.  Handles a greater volume of data.  Reduced cost based on commodity hardware, licensing and server requirements.  Can integrate into existing Data Warehouses.  Good developers are difficult to find.  The number of Hadoop components can be overwhelming, and it's daunting for developers to learn them all and stay current.  SQL on Hadoop opens the door to existing skill sets, bypassing complex MapReduce coding.  There are a number of 3rd party offerings to leverage.  Hadoop is ever evolving.


Based on the extreme hype over the past year or two, some people, including me, have suggested it's partly hype.  Hadoop did not replace the Data Warehouse.  It enhanced it, by creating Hybrid Data Warehousing.  The best of both worlds.  Which means that finding the right skill set has gotten even more difficult.

My Viewpoint

I see more and more people interested in Hadoop, even the ones who had no idea what it was a few years ago.  Many IT people realize they must learn about Hadoop just to stay current.  In contrast, not many of these organizations have production-level clusters; they may or may not have 10 node clusters as sandboxes to interrogate data sources, run sentiment analysis and process large batch jobs.

Future of Hadoop

Hadoop 1.0 is the past.  Hadoop 2.0 is here, including YARN and Tez and Docker, along with a slew of other offerings.  It has fragmented into many pieces, many vendors and no one-size-fits-all solutions.  But there's still a lot of opportunity to be had.  With Machine Learning, building models for Artificial Intelligence, large volumes of data, and processing unstructured and disparate data sources, I feel that Hadoop will be part of my career.  And if your job consists of collecting, processing, parsing or analyzing data, chances are it will become part of your skill set as well.


Fiddling over #Hadoop #BigData

When I was a kid, my mother asked if I wanted to play the violin.  Sure, sounds like fun.

So we purchased one.  Brought it home.  Strummed the bow across the bridge.  Didn't sound too good.

Many companies are hopping aboard the Big Data Express.  They download the software and have an IT guy install it.  Not much value.

Soon I began violin lessons.  Learned the scales.  Practiced them over and over.  Soon learned to play twinkle twinkle.

Some companies have imported some internal data sets into Hadoop.  Perhaps in a 10 node cluster.  Perhaps created some Hive tables, some Pig scripts, maybe even some streaming data.

After that, I learned to play some intricate songs.  Mozart.  Complex songs with challenging note sequences.  Played concerts at the school.  Switched teachers to learn more advanced songs.  Was eventually one of the better violin players in school.

Next, these companies with 10 node clusters are parsing unstructured data and implementing HBase.  Perhaps scheduling jobs to import data and crunch through it, resulting in aggregated output, consumed by OLAP for web reporting in dashboards and reports.

Eventually, I was playing 1st chair at the Elementary School, performing concerts for parents night.  When I first got the violin, it was a piece of wood and some horse tail bow.  Didn't have much value.  It wasn't until I learned how to use it properly, to make it play songs which people wanted to hear, that it gained in value.

Integrating Hadoop into your existing reporting structure, a hybrid solution to complement your Data Warehouse, is the ideal scenario.  When you first install Hadoop, it doesn't have much value.  It's when you get the thing humming and crunching data to find insights for user consumption that it provides value.  Otherwise, it's a fancy open source software product sitting idle on a dozen commodity hardware devices.

It doesn't happen overnight.  Learning the violin or learning Hadoop.  I'm no expert at Hadoop at this point, perhaps some advanced twinkle.

And there are plenty of wrong ways to play the violin or Hadoop, which sound horrible.  I imagine that once you get them in tune, you could be playing wonderful melodies for your organization.

Have a listen to one of the better known fiddle players:

Itzhak Perlman


Estimate Your Project Wisely

Working as a consultant, you have to treat time differently than working a full time position.

Because when you bid for a contract, you have to estimate the size and effort of the work, known as scoping the project.

This is actually a difficult task, because you have to determine how many hours each task will take, validate your data and assume you will get the level of help needed from the client.

As difficult as that seems, the real difficulty is estimating the unknown factors.  Such as tracking down business rules, obtaining good test data, servers going down, getting database backups and refreshing data.

There are so many unknowns, you have to allow for them in your estimate.

And if that's not enough, the real trap to watch out for is Scope Creep.  What is that, you say?  Well, with reports, sometimes the end user has a general idea of what they want, but once they see an actual report, they always have changes.  Align this, change the font, make this bigger, you know, cosmetic changes.  And then there are always new fields to be added, extra functionality, links to other reports; there's really no end to the amount of changes they could ask for.

And this is an unknown.  So when you estimate your project, you MUST account for this, because it always happens.  When working on the project, you always throw in some extra functionality, but there's a fine line where you must go back to the client and say, this is a change request because the level of effort is longer than expected; perhaps do a Phase 2, or amend your Statement of Work to account for the extra time.

Sometimes the coding of the reports is rather quick, but it's the changes that eat up much of the budget, so watch out for this trap.  In the end, the Client has to sign off on your work so you can get paid, and you don't want to be doing work for free, so learn how to estimate your level of effort and account for the unknowns.

That's what I've learned working as a Consultant.  You want your client to be happy, you want to produce a quality product and perhaps get repeat business.  So estimate your projects wisely.

And there you have it~!


Just One Word. Data

Have you ever seen the movie The Graduate with Dustin Hoffman?

In it, some bright business guy takes Dustin aside to give him some profound advice about the future, one word in fact.


Have a watch: http://youtu.be/PSxihhBzCjk?t=30s

Fast forward to today.  That one word?

Data.  There's a future in data.

Why's that?  Data is everywhere.  Every department in every organization can leverage their data.

To understand the business.  Forecast for growth.  Modify and streamline processes.  Figure out who's doing what when.  Data can help your organization propel into the future.  Or get left in the dust.

Your choice.  If you haven't picked up on the data / information revolution, you'd better get crackin'.