8/30/2015

Empower the Users through Self Service Reporting

In 2007 or so, I worked a job for the County.  We supported many agencies.  In doing so, we wrote and maintained the applications and data.

I was assigned a new project to help out with reports.  During our first meeting, the client mentioned their data was siloed behind our walls: they never had access to their own data, except for what they could download from the internal application.

The conference room had a computer, so during the meeting I logged on, opened Microsoft Access, connected to the development instance of their data stored in Oracle, and proceeded to download their entire database in about a minute.

Opened a copy of Crystal Reports, pointed it to the Access database, and used the Wizard to create a quick report.  Within a minute, we had a report displayed on the screen from the projector.

The Agency could see how easy it was to view the data and build a report.  They couldn't believe it.  They asked for access to the database, for help purchasing 5 licenses of Crystal Reports, and for some of my time to get them up to speed on the reporting tool.

It was great to be the one to open their eyes and see their reaction.  A hidden world opened up in just a few minutes.  Empowerment.  Self service.  No longer dependent on IT.

Fast forward to today: just about every organization is aware of the benefits of accessing its data.  Back then, being a one-eyed person in the land of the blind made a huge impact.

8/29/2015

From Reporting to Hadoop to Machine Learning to Black Box Algorithms

Back in the day, the DBA was the gatekeeper to the data.  Very tight access.  They had the keys to the kingdom.

Programmers were granted access to the applications, reluctantly.  First client-server, then web, and now mobile.

And then there were the report writers, or Business Intelligence people.  DBAs never liked them much.  Always pulling reports, slowing down the database, causing locks.  Writing crappy code.

Enter Hadoop.  I didn't see many, if any, DBAs making the leap into Hadoop.  Since they don't like writing SQL, and they don't program in Java, Python, Scala, Hive or Pig, why would they like Hadoop?

Report Writers - Business Intelligence - Data Warehouse people, are they getting into Hadoop?  Some.  They have an understanding of the data, although Unstructured and Semi-Structured data is a bit foreign.  They can mount the data into SQL tables in Hive and access it from a variety of sources, including ODBC, OData feeds, Power BI, etc.  Just turn the data into something relational and they're off and running.  But they don't necessarily program Java, Python or Scala, from what I've seen.

So if the main players in the data space, the DBAs and the Report Writers / BI / Data Warehouse people, don't have a clearly defined entryway into the world of Hadoop, that could explain why there was more hype than real-life Hadoop growth.

From what I saw, the people working with Hadoop were fresh out of college, or at startups, or at major companies with the financial resources to bring in top developers.  Perhaps now more companies are jumping on the bandwagon.

It used to be, "What is Hadoop?"

Then, "What do I do with Hadoop?"

Now, Hadoop is another tool in the tool chest, for working with data.

The "shiny" newness wore off.

For companies entering the Hadoop space, I suppose they need a real-world question to answer, a use case.  Then create a Hadoop sandbox, obtain a developer and an admin, and start ingesting data.  Mash it around.  Look for insights.  Use those insights to run the business.

Reporting has always been past tense: numbers explained what happened.

The hot thing now is Machine Learning: algorithms and statistical analysis.  It enables forward thinking, predictive analysis, forecasting.

Looking forward, algorithms will become a commodity.  Just pick one off the shelf, integrate into your app, and you're off and running.

There are already sites out there to leverage existing black box algorithms.

You can't expect people with 20 years of experience in IT to magically produce a PhD; it's impossible.  The train left the station two decades ago.  Instead, bring the complexity down to a level that Data Professionals can work with.

That's how I see it.  Time will tell.

8/27/2015

Real Life Experiment Using Machine Learning to Classify News Stories

I wonder if anyone's done some analytic classification of the daily news.

If I were to classify stories into categories, I'd choose some basic group to cluster.

  • Distraction
  • Terror
  • Helplessness
  • Joyful

A distracting article would be like the OJ Simpson trial or some zany political candidate making outlandish claims with no real intent on running.  Capturing the attention of society with shocking revelations as a form of entertainment and distraction from the real issues.

A terror article would be a downed plane, a murder, or some unfortunate event somewhere on the planet, meant to shock the audience, resulting in perpetual fear.

A helplessness article would be some distressing series of events, such as global warming, rampant inflation, an eroding middle class, or corruption, where the audience has absolutely no control or influence over the outcome, creating feelings of despair.

A joyful article would be where an authority figure or random person performed some act of kindness with no expectation of anything in return; such stories are quite rare, giving the occasional uplift to the viewer.

We could perhaps add a few more buckets to capture all events in the media.  But then feed the news articles from the past 50 years into a Hadoop cluster, run some machine learning jobs to classify each story into a neatly defined bucket, and extract the ratio of news stories.  Then run some statistical analysis on the result set to produce meaningful insight into how the news portrays its stories.  And perhaps link that data to the stock market, or unemployment rates, or key events during the same time frame to draw possible conclusions.

Perhaps we may conclude that the rain in Florida for the past 60 days, which is quite unprecedented, followed by the upcoming hurricane in the gulf, amplified the damages and resulted in excessive claims, which caused the insurance companies to double their rates.  Something like that.

And then we find a similar set of events, like the half dozen hurricanes in a single year last decade, that also produced the same results.  And then you find patterns, and draw conclusions.

Or you do trend analysis and determine that it rains on every major holiday in the US based on conclusive evidence derived from the experiment.  Once you have the data to support your claim, then you dig in to find cause and effect, look for root cause analysis, cross reference other data sets.  And link that to the hurricane in New Orleans and the tsunami in Japan and the hurricane in Haiti.  Potential for mind blowing conclusions.

Simply an experiment in a real life scenario on how machine learning could be used as a tool to derive meaningful insights.

8/23/2015

Shortage of Data Scientists

In 1995, when I worked as a Crystal Reports developer, there were no peers.  Nobody was strictly a report writer.

There weren't many vendors.  Almost no literature.  Reports were an afterthought.  Right behind documentation.

Then, I noticed Business Intelligence spring up.  Scorecards, KPIs, Cubes, Dashboards.

Now, the market is flooded with report writing tools. And lots of developers. And the occupation of mere report writer is limited at best.

Data is now the hot topic. Variety of formats. Ways to manipulate and mash. Delivery mechanisms. Self Service.

The report writer has had their legs knocked out from under them. The underlying technology has grown. And a new generation is stretching the limits of what's possible.

Hadoop. Cloud. Mobile. Unstructured data. Streaming data. Statistics. Algorithms. Neural networks. Deep learning. Artificial Intelligence. Micro services. Internet of things.

The industry has matured overnight. And left the traditional report writers in the dust.

The term Data Scientist appeared out of nowhere. One who knows advanced Math, Statistics, algorithms, programming, domain knowledge, communication skills, visualizations, Hadoop, Spark along with a PhD or advanced degree.

How does a report writer accumulate all the required skills overnight?  They don't.  The new breed already has an assortment of the required skills.

Hiring companies want experienced rock stars out of the gate.  There is no established career path allowing report writers to make the leap across the chasm.  The number of skills required is quite staggering.  Few people have PhD degrees, advanced Statistics knowledge and programming skills.  It's possible to acquire these skills on your own, but that takes time and effort.  And how does one gain practical experience on the job?

Granted, there are a number of sites, such as Kaggle, that allow people to gain skills.  Still, one does not become an expert in Statistics in a few days.

There is a shortage of Data Scientists.  I'd say one reason is that people enter the field out of school rather than growing into it from the bottom up.

However, like anything in technology, in order to open up to the masses, they'll need to "dumb" it down a tad so everyday people can contribute as well.  Otherwise, the demand will exceed the supply until the schools can pump out enough qualified people.
 

8/20/2015

Using PowerShell to Pull .ppt Files off the Web

I was perusing the web looking for free online training on Data Mining.

Found a course on the site: http://www.kdnuggets.com/data_mining_course/index.html

The instructions say to prefix the URL and then append the suffix.


So I proceeded to copy-paste the list into Excel:


Copy-Pasted the results to a new tab:


Found a site with example script to pull files using PowerShell: https://blog.jourdant.me/3-ways-to-download-files-with-powershell/

Copied and modified script to the following:



Opened PowerShell ISE as Administrator, changed the directory path to the folder containing the script, loaded the script in PowerShell ISE, and ran the following command:

Set-ExecutionPolicy RemoteSigned

Then ran just file #9:


PowerShell Script file:


I'm sure there's a way to loop through all 19 files and read the list from Excel, but instead I just copied and pasted the code 18 times and modified each block:



Execute Script.ps1:



$url = "http://www.kdnuggets.com/data_mining_course/dm8-decision-tree-cart.ppt"
$output = "$PSScriptRoot\dm8-decision-tree-cart.ppt"
$start_time = Get-Date
Invoke-WebRequest -Uri $url -OutFile $output
Write-Output "Time taken: $((Get-Date).Subtract($start_time).Seconds) second(s)"

$url = "http://www.kdnuggets.com/data_mining_course/dm7-decision-tree-c45.ppt"
$output = "$PSScriptRoot\dm7-decision-tree-c45.ppt"
$start_time = Get-Date
Invoke-WebRequest -Uri $url -OutFile $output
Write-Output "Time taken: $((Get-Date).Subtract($start_time).Seconds) second(s)"

$url = "http://www.kdnuggets.com/data_mining_course/dm6-decision-tree-intro.ppt"
$output = "$PSScriptRoot\dm6-decision-tree-intro.ppt"
$start_time = Get-Date
Invoke-WebRequest -Uri $url -OutFile $output
Write-Output "Time taken: $((Get-Date).Subtract($start_time).Seconds) second(s)"

$url = "http://www.kdnuggets.com/data_mining_course/dm5-classification-basic.ppt"
$output = "$PSScriptRoot\dm5-classification-basic.ppt"
$start_time = Get-Date
Invoke-WebRequest -Uri $url -OutFile $output
Write-Output "Time taken: $((Get-Date).Subtract($start_time).Seconds) second(s)"

$url = "http://www.kdnuggets.com/data_mining_course/dm4-output-representation.ppt"
$output = "$PSScriptRoot\dm4-output-representation.ppt"
$start_time = Get-Date
Invoke-WebRequest -Uri $url -OutFile $output
Write-Output "Time taken: $((Get-Date).Subtract($start_time).Seconds) second(s)"

$url = "http://www.kdnuggets.com/data_mining_course/dm3-input-concepts.ppt"
$output = "$PSScriptRoot\dm3-input-concepts.ppt"
$start_time = Get-Date
Invoke-WebRequest -Uri $url -OutFile $output
Write-Output "Time taken: $((Get-Date).Subtract($start_time).Seconds) second(s)"

$url = "http://www.kdnuggets.com/data_mining_course/dm2-intro-machine-learning-classification.ppt"
$output = "$PSScriptRoot\dm2-intro-machine-learning-classification.ppt"
$start_time = Get-Date
Invoke-WebRequest -Uri $url -OutFile $output
Write-Output "Time taken: $((Get-Date).Subtract($start_time).Seconds) second(s)"

$url = "http://www.kdnuggets.com/data_mining_course/dm1-introduction-ml-data-mining.ppt"
$output = "$PSScriptRoot\dm1-introduction-ml-data-mining.ppt"
$start_time = Get-Date
Invoke-WebRequest -Uri $url -OutFile $output
Write-Output "Time taken: $((Get-Date).Subtract($start_time).Seconds) second(s)"

$url = "http://www.kdnuggets.com/data_mining_course/dm19-data-mining-and-society.ppt"
$output = "$PSScriptRoot\dm19-data-mining-and-society.ppt"
$start_time = Get-Date
Invoke-WebRequest -Uri $url -OutFile $output
Write-Output "Time taken: $((Get-Date).Subtract($start_time).Seconds) second(s)"

$url = "http://www.kdnuggets.com/data_mining_course/dm18-microarray-data-mining.ppt"
$output = "$PSScriptRoot\dm18-microarray-data-mining.ppt"
$start_time = Get-Date
Invoke-WebRequest -Uri $url -OutFile $output
Write-Output "Time taken: $((Get-Date).Subtract($start_time).Seconds) second(s)"

$url = "http://www.kdnuggets.com/data_mining_course/dm17-targeted-marketing-kdd-cup.ppt"
$output = "$PSScriptRoot\dm17-targeted-marketing-kdd-cup.ppt"
$start_time = Get-Date
Invoke-WebRequest -Uri $url -OutFile $output
Write-Output "Time taken: $((Get-Date).Subtract($start_time).Seconds) second(s)"

$url = "http://www.kdnuggets.com/data_mining_course/dm16-summarization-deviation-detection.ppt"
$output = "$PSScriptRoot\dm16-summarization-deviation-detection.ppt"
$start_time = Get-Date
Invoke-WebRequest -Uri $url -OutFile $output
Write-Output "Time taken: $((Get-Date).Subtract($start_time).Seconds) second(s)"

$url = "http://www.kdnuggets.com/data_mining_course/dm15-rules-regression-knn.ppt"
$output = "$PSScriptRoot\dm15-rules-regression-knn.ppt"
$start_time = Get-Date
Invoke-WebRequest -Uri $url -OutFile $output
Write-Output "Time taken: $((Get-Date).Subtract($start_time).Seconds) second(s)"

$url = "http://www.kdnuggets.com/data_mining_course/dm16-rules-regression-knn.ppt"
$output = "$PSScriptRoot\dm16-rules-regression-knn.ppt"
$start_time = Get-Date
Invoke-WebRequest -Uri $url -OutFile $output
Write-Output "Time taken: $((Get-Date).Subtract($start_time).Seconds) second(s)"

$url = "http://www.kdnuggets.com/data_mining_course/dm14-association-rules.ppt"
$output = "$PSScriptRoot\dm14-association-rules.ppt"
$start_time = Get-Date
Invoke-WebRequest -Uri $url -OutFile $output
Write-Output "Time taken: $((Get-Date).Subtract($start_time).Seconds) second(s)"

$url = "http://www.kdnuggets.com/data_mining_course/dm13-clustering.ppt"
$output = "$PSScriptRoot\dm13-clustering.ppt"
$start_time = Get-Date
Invoke-WebRequest -Uri $url -OutFile $output
Write-Output "Time taken: $((Get-Date).Subtract($start_time).Seconds) second(s)"

$url = "http://www.kdnuggets.com/data_mining_course/dm12-data-preparation.ppt"
$output = "$PSScriptRoot\dm12-data-preparation.ppt"
$start_time = Get-Date
Invoke-WebRequest -Uri $url -OutFile $output
Write-Output "Time taken: $((Get-Date).Subtract($start_time).Seconds) second(s)"

$url = "http://www.kdnuggets.com/data_mining_course/dm11-evaluation-lift-cost.ppt"
$output = "$PSScriptRoot\dm11-evaluation-lift-cost.ppt"
$start_time = Get-Date
Invoke-WebRequest -Uri $url -OutFile $output
Write-Output "Time taken: $((Get-Date).Subtract($start_time).Seconds) second(s)"

$url = "http://www.kdnuggets.com/data_mining_course/dm10-evaluation.ppt"
$output = "$PSScriptRoot\dm10-evaluation.ppt"
$start_time = Get-Date
Invoke-WebRequest -Uri $url -OutFile $output
Write-Output "Time taken: $((Get-Date).Subtract($start_time).Seconds) second(s)"

$url = "http://www.kdnuggets.com/data_mining_course/dm9-rules-regression-knn.ppt"
$output = "$PSScriptRoot\dm9-rules-regression-knn.ppt"
$start_time = Get-Date
Invoke-WebRequest -Uri $url -OutFile $output
Write-Output "Time taken: $((Get-Date).Subtract($start_time).Seconds) second(s)" 


Files #5 and #8 weren't found; other than that, all the files downloaded fine:
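In hindsight, rather than pasting the same four lines 18 times, the whole list could likely be driven by a simple loop (a sketch only, not tested against the site, assuming the same base URL and file names as above; the try/catch is there because a couple of the files don't exist on the server):

```powershell
# Base URL for the course materials
$base = "http://www.kdnuggets.com/data_mining_course"

# File names to download (abbreviated; fill in the rest from the list above)
$files = @(
    "dm1-introduction-ml-data-mining.ppt",
    "dm2-intro-machine-learning-classification.ppt",
    "dm3-input-concepts.ppt"
    # ... remaining file names ...
)

foreach ($file in $files) {
    $start_time = Get-Date
    try {
        # Download each file next to the script
        Invoke-WebRequest -Uri "$base/$file" -OutFile "$PSScriptRoot\$file"
        Write-Output "$file - Time taken: $((Get-Date).Subtract($start_time).Seconds) second(s)"
    }
    catch {
        # e.g. files #5 and #8, which returned not found
        Write-Output "$file - download failed"
    }
}
```

The same loop could also read the names from a text file with Get-Content instead of a hard-coded array.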


That's a basic example of how to pull files off the web using PowerShell ISE.  PowerShell is extremely flexible and there's so much you can do with it.  The one thing I remember from watching some online courses: use the help files (Get-Help).

Get Sh#t Done!