8/30/2013

New Data Warehouse Project

I found out yesterday at 5:45pm that we had a potential client presentation today at 2pm.

So I logged onto my boss' computer and we began to build a prototype using the AdventureWorks database.

We brought in some SSAS Cubes to PerformancePoint and created 4 dashboards and 2 KPIs with parameters and drill-through capabilities.

We worked until Midnight without a break, 6 hours.

When I got up this morning, I worked on the prototype some more until heading out for lunch around noon.

After lunch we met the potential Client and did a 2.5 hour presentation, no breaks.

When I got home, I received a call that we got the go-ahead to begin a Proof of Concept starting next Tuesday.

So I'll begin gathering specs for a few Dimension tables and a Fact table, pulling from the Transaction Database(s), applying ETL business rules in Stage and then migrating the data to the Data Warehouse where they'll get processed into the SSAS cube, finally available in SharePoint PerformancePoint and SSRS reporting.
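
To make the spec gathering concrete, here's a rough T-SQL sketch of the kind of Stage / Dimension / Fact layout I have in mind (table and column names are placeholders for illustration, not the client's actual spec):

-- Stage table: raw rows pulled from the transaction database(s);
-- the ETL business rules get applied here before the warehouse load
CREATE TABLE stg_SalesOrder (
    SourceOrderID  INT,
    CustomerCode   VARCHAR(20),
    ProductCode    VARCHAR(20),
    OrderDate      DATE,
    OrderAmount    DECIMAL(18,2)
);

-- Dimension tables: surrogate keys plus descriptive attributes
CREATE TABLE DimCustomer (
    CustomerKey    INT IDENTITY(1,1) PRIMARY KEY,
    CustomerCode   VARCHAR(20),
    CustomerName   VARCHAR(100)
);

CREATE TABLE DimProduct (
    ProductKey     INT IDENTITY(1,1) PRIMARY KEY,
    ProductCode    VARCHAR(20),
    ProductName    VARCHAR(100)
);

-- Fact table: one row per order line, keyed to the dimensions;
-- this is what gets processed into the SSAS cube
CREATE TABLE FactSales (
    DateKey        INT,
    CustomerKey    INT REFERENCES DimCustomer (CustomerKey),
    ProductKey     INT REFERENCES DimProduct (ProductKey),
    OrderAmount    DECIMAL(18,2)
);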

If we do well on that, we'll get the go-ahead to do a full-blown Data Warehouse with tons of Dashboards and Reports; it could be a long-term gig.

Nice way to go into a 3 day weekend!

Data Warehousing Isn't Dead

Data Warehousing is dead!

That's a false statement.

Just like saying the Mainframe is dead.

Which it isn't.

DW will be around for a long, long time.

As will traditional reporting.

Along with the newcomers: Self Service, Visualization, Embedded, Mobile and yes, Big Data.

They will all be part of the Data ecosystem, and a Data person will need to become an expert in a variety of skill sets, if not all of them.

Yet that's a daunting task; baseball players don't have to become Hall of Famers at every position, perhaps at one if they're lucky.

You can make an entire career on just one skill set; however, for marketability, it'd be wise to learn as many as possible.

So what are you doing reading this blog? Shouldn't you be studying some new technology to keep current?  Just kidding!

Thanks for your continued support!

8/29/2013

Master Customer Identification Database

The problem is not too much data.

The problem is not being able to link up the data between disparate data sources.

You may have a lead database, a marketing database, web traffic data, call volume data, sales data, renewals data, etc., etc.

And the customer may show up in one or many data sources.

Each with its own formatting, data types and unique identifiers.

Leads DB: Bob Smith, id = 12345
Marketing DB: Robert Smith, id = 234H12
Sales DB: Smith, Bob E., id = 999222
Web Traffic DB: 192.173.68.22, id = 44433

Call Volume DB: 724-125-4458, id = 33aahb2
Renewals DB: Robbert Smith, id = 2222333

What each company needs is a "Master Customer Identification" database.

Global CustID | LeadsDBID | MarketingDBID | SalesDBID | WebTrafficDBID | CallVolumeDBID | RenewalsDBID
123           | 12345     | 234H12        | 999222    | 44433          | 33aahb2        | 2222333

Which links all the data sources together.

Without that, the ETL involved to mash up the data sources gets real ugly real quickly.
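
A minimal sketch of what that cross-reference table could look like in SQL, and how it ties two sources together (all table and column names are made up for illustration):

-- One row per real-world customer, holding the ID each source system uses
CREATE TABLE MasterCustomer (
    GlobalCustID    INT PRIMARY KEY,
    LeadsDBID       VARCHAR(20),
    MarketingDBID   VARCHAR(20),
    SalesDBID       VARCHAR(20),
    WebTrafficDBID  VARCHAR(20),
    CallVolumeDBID  VARCHAR(20),
    RenewalsDBID    VARCHAR(20)
);

-- With that in place, mashing two sources becomes a simple pair of joins
SELECT m.GlobalCustID, l.LeadName, s.SaleAmount
FROM   MasterCustomer m
JOIN   Leads l ON l.LeadID = m.LeadsDBID
JOIN   Sales s ON s.SaleID = m.SalesDBID;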

So I'd say we need some vendor to come along and house the core customer data in a central repository that links data from several data sources, including in-house, social networks and the cloud.

Make it so #1.

8/27/2013

SSIS Master Package Protection Level Setting = DontSaveSensitive

So now that we have 6 SSIS packages working, I was asked to schedule the job with SQL Server Agent.

Sounds easy.  I tested all 6 packages individually and they ran as expected.

Then ran the "Master Package" which calls the 6 other packages and it failed.

So I turned to the most trusted advisor for solving problems, Google, which found this solution:

http://social.msdn.microsoft.com/Forums/sqlserver/en-US/74ecba4a-0986-476a-b168-cb13a48c4e82/calling-child-packages-from-master-package-is-failing-when-ssis-package-is-scheduled-to-execute-from

And sure enough that solved the issue.  You have to change the "Protection Level" to "DontSaveSensitive".

8/23/2013

Programming vs. Data People Skill Sets

Programming takes a different mind set than Business Intelligence.  For one thing, when you program, for either Client Server or Web based apps, you have many things to worry about.  The user interface: is it intuitive enough for the user?  Functionality: does the program perform the required actions?  Networking: does the app use limited networking resources by limiting the calls back to the server?  Database: have I created proper SQL statements and are they efficient?  Security: is my app prone to SQL Injection or some other hack?  Deployment: will my app run on multiple browsers or need additional plug-ins / DLLs to work properly?  Will it work on Mobile devices?  Plus programmers need to know the programming language, SQL, perhaps object oriented methodology, encryption, middle tier programming, etc.  Hence, programming apps is no easy task.

Now take Business Intelligence.  You must know data, data modeling, SQL, coding efficiencies, DBA maintenance, ETL, Reporting, Dashboards, KPIs, Analytics, Big Data, Hadoop, R, Map/Reduce, Hive, Pig, Oozie, Sqoop, SharePoint, Visual Studio, etc.  It's challenging to know everything.

Yet it's different from programming apps because there are a lot of Wizards built into the BI applications; you can even use Wizards to build your SQL, if the truth be told.  When you get into OLAP cubes and MDX, you're talking about steep learning curves.  Once you cross over to Hadoop, that goes up tremendously.  Map / Reduce is something new and requires programming skills, which many BI people never learned.  There's the administration part, which is no easy task and usually involves training.  Sure you can download a single node cluster to play around with, but that's not the same thing as setting up a 10 node cluster, administering it, setting up permissions, loading the data, writing the jobs, exporting the data, doing the analytics, etc.

To summarize, Data people have difficult jobs as well, except the challenges they face are somewhat different than programmers.

Either way, textbook learning can get you out of the gate.  Once you encounter real world problems, you're going to have to use your noggin, logic and problem solving skills.

And never stop learning!

8/22/2013

How to Avoid Data Silos

Any organization has accounting packages, marketing data, web site data and a variety of other sources.

And each app was probably built independently from other apps or purchased from 3rd party vendors.

Then they throw it over the wall to IT and say "Integrate this mess".

And after that's done, IT throws it over their fence to the BI team and says "Report on this mess".

Thus, we have infrastructures full of data silos.

Tough to integrate, few keys to relate systems back and forth, dirty data, you name it.

BI gets to carry the weight for all the dysfunction upstream.

So I say this, to avoid data silos, have the data people brought in at the beginning of a project lifecycle, not the end for the cleanup.

Reporting people, ETL people and Analytics people need to be included in the front-end design of systems, so they can assist in the relational designs and make the reporting easier and more fluid.

We have technical Data Architects who take the data once it's in the database and process it into a Data Warehouse or what have you.

What I'm saying is there needs to be someone way before that, before the transactional database is constructed, so "hooks" can be inserted to make the back end reporting simpler.

When I started in Reporting for IT in 1996, I was called in at the end of a huge project to create some reports in Crystal Reports, after the system was already in place, and the process hasn't changed in nearly 20 years.

BI PEOPLE NEED TO BE INCLUDED UP FRONT WHEN DESIGNING NEW APPS TO SPEED UP THE ANALYTIC CREATION AND SIMPLIFY REPORTING AND AVOID DATA SILOS.

8/21/2013

Finding Under Valued Assets

Business Intelligence converts data to information in a timely manner.

BI Guy: Okay boss, here's the summary report for last night's baseball game.  We ended up with 7 runs, the opponent had 5 runs.

Boss: Okay, this looks good.  Do we also have a Details Report of runs per inning?

BI Guy: Sure thing boss.  Here's the detailed version of the report.  You can see the runs per inning.

Boss:  Okay, this looks good.  Yet the numbers are ice cold.  They don't convey much other than runs per inning.  How did the runs score?  By whom?  Were there any errors?  During what inning?  Any home runs?  How many triples?  Did the defense have any double plays?  Were there any injuries?

BI Guy:  Hold on boss, that will take some time to go find the data sources, massage the data into proper formats and then translate the business rules into a report.  See you in about 6 months.

And that's how the typical BI Guy/Girl services Management.  We report on the past with cold stale data.

All this is changing however.  Take statistics in baseball, as portrayed in the movie Moneyball.

One team with a very low financial budget was forced to compete with teams with much higher budgets.  In order to compete, they studied and analyzed all of Baseball to find the players with the most value.  In other words, players whose skills were overlooked because of some character flaw.

This baseball team acquired great players for low wages and amassed a great team for a low dollar amount.  And they produced a winning ball club.

Take this model and apply it to Business or Wall Street.  How do people find valued assets at low cost?  How can you identify trends based on past behavior?  Which products are undervalued?

And that is how analyzing metrics found its way into Baseball and changed the game forever.

This is just the tip of the iceberg.

8/20/2013

Way Back Before the Internet, We Had BBSs

Back when I was a kid, we didn't have no stinkin' internet.

What we did have were BBSs (Bulletin Board Systems).



From my father's IBM PC in the den, I could load a program, connect to the 1200 baud modem, dial a number (local, not long distance), and presto, my pc would be connected to another computer.

What would I do?  I'd poke around to see what programs they had for download.

I'd search for listings of other BBSs to call and do the same thing.

They had SysOps (System Operators) whom you could page, and presto, the guy running the BBS was instant messaging back and forth with you, as the green screen of text scrolled down faster than a person could read.

And if I found some cool calendar to download, I'd send it to the dot matrix Epson printer.

This was pre-internet, and I was accessing it in the early to mid 1980s.

Long before AOL, before http://, before Windows, the operating system I used was called "PC-DOS", not "MS-DOS".

It was loaded from the floppy drive and booted into RAM; there was no hard drive.

I must have been 14 years old at the time and I was allowed to dial anywhere in the local area code, when my father wasn't on the PC connected to IBM where he worked.

So that's my story of the BBS and my jump start into the world of Computers at an early age.

Get Up to Speed Fast in #Hadoop

If you've gone down the path of Big Data and Hadoop the first thing you'll notice is the complexity.

First of all, between understanding the distributed architecture across multiple nodes and primary & secondary servers, running Map / Reduce in Java, SQL-like commands in Hive, and data transformations in Pig, you soon realize the multiple layers of understanding required.

Then throw in Sqoop (ingesting data from and exporting it back to OLTP systems), Mahout for machine learning and predictive analytics, as well as security challenges.

It soon becomes overwhelming.

Luckily the top vendors have made VMs available to download to assist the average user in getting up to speed, without having to worry about setting up the environment.

Here are links to some vendors (not in any order):

Hortonworks Sandbox
Cloudera CDH
Microsoft HDInsight for Azure
IBM Big Data Platform

Personally, I've downloaded both the Hortonworks and Cloudera VMs and even attended a two week Cloudera training course.

I really like Cloudera's Impala implementation, which lets you write a SQL-like language that bypasses the Map Reduce process.

Hortonworks has a similar project called Stinger, which uses Tez to speed up query execution, with a goal of being 100x faster than traditional Hive.

That should help you get up to speed on Hadoop, and both sites have good demos on Use Case scenarios and why you would want to leverage the awesomeness of Big Data.

Hope you enjoyed this blog and good luck in your Hadooping!

Full Time Consulting

What's the best part about being a full time consultant working from home?

Drinking.  All day.  Uh coffee that is.  What were you implying?

And providing both a service and value to the clients.

And variety of work, with multiple clients, there's multiple projects.

So you get different flavors of business and technologies.

And I can watch over our animals, two Golden Retrievers, 3 cats and the newest addition to the household:  Sammie.

8 week old Golden Puppy.  What a handful!



And this year, it looks like I will get the opportunity to attend the SQL Pass Summit!

This job offers the most freedom I've ever had, a good mix of technology and an ever-increasing skill set.

Wish I'd found it sooner!

Big Bang of Technology

We simply have too many choices.

If you don't believe me, go to any store anywhere.

The selection is mind boggling.

Same in the world of Data.

I believe we have too many choices, too much competition for limited resources.

Everybody's got a stake in the game, each vendor provides similar technology, with slight differences.

I think it's too fragmented, putting the customer at a disadvantage.

Is there a legitimate need for BI software?  Yes.

Are there 100s of Vendors to choose from?  Yes.

Is the technology changing at blistering speeds?  Yes.

How is a customer supposed to know which is the best fit for their needs when there are so many to choose from?  Too much choice can stunt action.

How are developers supposed to stay current in all varieties of technology?  It's like drinking from a fire hose.

I call it the "Big Bang of Technology", where everything's expanding outward and increasing in speed.

Funny, I thought technology was supposed to automate the daily drudgery to give us more free time, to relax at the beach, sip a cold one and spend time with the family.  Our strategy is at direct odds with reality.

Create chaos for increased simplicity.

8/19/2013

Top 16 Habits for Successful Orgs


  1. Cut costs.
  2. Sell more.
  3. Spend Marketing dollars on likely candidates.
  4. Create efficiencies in process.
  5. Don't create silos.
  6. Document the business rules.
  7. Cross train your employees.
  8. Allow creativity to blossom.
  9. Teamwork at all costs.
  10. "Can-do" attitudes rise to the top.
  11. Cliques frowned upon.
  12. Keep everyone informed.
  13. Meetings should serve a purpose.
  14. Document things in Email.
  15. Every person counts.
  16. Use data for precision analytics.

Know the Business, Programming & Data and Analytics





To be a Data Scientist you may or may not have a PhD.

Or a background in Statistics or advanced Math.

What you will have is the ability to build Models using a combination of Business Skills, Programming and Data Manipulation and Analytics.

20 years ago, you may have had access to huge data, some type of Statistical software to build Models and the know how to bring it all together.

My experience was working for a bank; they built a model of customers based on a combination of the Credit Score, their time on the job and time at residence, so when I, working as a Credit Underwriter, reviewed an application, the system basically scored the Customer and approved or declined based on factors in the Model.

So that was back in 1995.

Fast forward to today, we still have huge data, data Models and smart people, except the tools have been modified so that average Data people can do the work themselves.

Without having a PhD or Statistical Post Graduate degree.

The role of Data Scientist is not new, the role has been around for a while.

What is new is the ability for someone like me to use large sets of data, mashed from a variety of sources including Structured, Semi and Unstructured data, crunch the numbers and produce somewhat compelling Visualizations in a short period of time.

In other words, they are bringing the technology closer to the Developers rather than retraining all the Developers to use new technology.

A dumbing down if you will.

If you know the Business, Programming and Data and Analytics, you're halfway to being a Data Scientist today.

And so it goes!

8/15/2013

Information Industry is in State of Flux

Build it and they will come.  Famous lines from the movie Field of Dreams.

Same too in the world of data.

Convert data to information to gain insight to derive action.

Except the data has grown.  The technology has grown.  And people are finally starting to take notice that people who work with data are somewhat cool.

Many existing companies have produced quality apps and many new companies have popped up.

All competing for turf in the information space.

New roles have mushroomed, like the Data Scientist.

New open source technologies have appeared, like the Hadoop Distributed File System.

The goal now is to integrate that Big Data with all the traditional Business Intelligence tools.

Self Service BI has blossomed as the Business got tired of waiting for IT to get around to writing reports.

Artificial Intelligence has been around for decades, however, neural networks have gotten smarter and faster and AI appears in many products today.

Machine Learning and Predictive Analysis and Streaming Data have taken off.

There's actually a tremendous lack of talent all of a sudden to fill the demand.

However, there have been some casualties.

The traditional report writer, the data warehouse and waterfall methodology have taken somewhat of a back seat in this data frenzy.

So too has the traditional relational database, as NoSQL databases are inching closer and taking some of the pie.

We are in a state of flux at the moment, and that change is actually growth.

For those who can hang on long enough, he or she can earn a decent living in this industry.

The key is to be flexible, open to challenges, and be prepared to change on a dime.

We ain't in Kansas anymore, or This ain't your father's Oldsmobile.

And so it goes!

First Try #Hortonworks Hadoop


Today I decided to download the Hortonworks version of Hadoop.

URL: http://hortonworks.com/

Open the downloaded file in VMPlayer...

Simply open a browser on your PC, type in the URL as instructed, and a screen pops up like magic...
 

Just looking around on some of the screens, I really like the visualization of the layout and the nice look and feel.






And now to show the results...


With built in Visualizations (charts / graphs) - nice!


And feature to Export data to CSV or Excel...


And the Job Tracker shows the run query...
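
If you want to try the same thing, a query along these lines is enough to kick off a job and show up in the tracker (I'm assuming the sample_07 salary table that the sandbox tutorials reference; substitute whatever sample table your download includes):

-- Top 10 occupations by salary from the sandbox's bundled sample data
SELECT description, salary
FROM   sample_07
ORDER BY salary DESC
LIMIT  10;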



I really was up and running in about 15 minutes.

Very impressive!

At this point, I tried to log into the CentOS console.


Found the Username / Password here: http://hortonworks.com/community/forums/topic/shell-login-passwords/

and we're in (root/hadoop)


did an "ls", "cd ..", and "ls" to find my bearings...


So far I like it; I still need to play around in the command prompt a bit to see how things are connected.

Just in case I get a new project to setup a 3 node cluster (wink, wink!).

And there you have it!

First try Hortonworks Hadoop!

Here's another version of the install from Hortonworks: http://hortonworks.com/wp-content/uploads/2013/03/InstallingHortonworksSandboxonWindowsUsingVMwarePlayerv2.pdf

And a video on YouTube:
Arun Murthy - Hadoop Summit 2013 - theCUBE
http://www.youtube.com/watch?v=vNsn3kGwqX0&feature=c4-overview-vl&list=PLenh213llmcbnZLiehqCeXX07hreEpLiE

8/14/2013

Many Information Vendors Have Same Functionality

In the world of information, data is stored either in a relational database in structured format, or in a NoSQL database with Un/Semi Structured data.

For either variety, there are numerous vendors to choose from.

Which basically have the same features.

Such as a car.  We have small cars, medium cars, and big cars with the same basic functionality.

Which all have wheels, doors, brakes (hopefully), and depending on how much you want to spend, you can buy entry level models or deluxe models.

Same with Data Vendors.  They usually have a repository or way to ingest data, an ETL process to massage with business rules, perhaps store in an OLAP Cube database, a reporting mechanism, perhaps Visualization and Dashboard capabilities, an export option, and a ton of other features.

No matter which vendor you choose, you will probably get the core features, with add ons of course.

And some products do things better than other products, others less so.

What this means is greater competition for your spending dollars.

It also means a steep learning curve for developers to keep current.

Also, investing in a particular company, with service agreements and such, kind of locks you and your company in for a while.

So if you go with one approach and decide it's not the best fit, you may find some discomfort switching over to another product as well as finding new developers or training existing ones.

As the world of information grows, I would say go with a vendor who has a proven track record across the board.  Because locking your company into a flash in the pan software vendor could have downstream ramifications.

Just saying!

Cross Training not a Top Priority

Programmers frequently maintain production applications.

So you become an expert in specific code.

Well, what if you want to take a vacation?

Or what if you leave for a new position?

Or get hit by an eco friendly hybrid diesel bus?

That's why managers are always stressing, "You need to cross train somebody on this".

Unfortunately, this almost never happens.  People don't have the time, desire or incentive to learn other people's apps.

They are usually too busy to keep up with their own stuff.

There's clearly a disconnect between Managers' expectations and Programmers' reality.

I know I worked a job for 4 years without taking a real vacation.  Yes, I flew to NY for a tennis tournament, and ran the month end from the hotel lobby's free guest computer, until 3 in the morning.

Working on vacation, because Management wouldn't prioritize cross training.

Sad but true!

Documenting Software

I've worked in a bunch of different IT shops over the years.

And the pattern seems to be the same.

Not much source code documentation.

No written manuals and no comments in the code to explain the logic.

That's a travesty.  When I was a Junior programmer, my mentor instilled in me, through great fear, the necessity of writing comments into the code, no matter what.

I'd say the reason people don't comment is because many programmers suck at their jobs.

Out of sheer ignorance, lack of concern for quality, or could be job security.

With the way the economy is now, many, many programmers fear for their livelihood and will stack the cards in their favor by hiding the business rules, yes it's true.

Documenting the code is just a part of programming; if you don't document even sparsely, you are not an exceptional programmer.

Now creating manuals for how an app works, I've seen some really good ones and some really not so good ones.  From 20 page docs to a single page.

"This app does something.  It was written in a computer language.  Many users access the app."

And even if the documentation is immaculate, things change, docs don't get updated, etc.

Documentation is a fact of life for developers, better get with the program people.

Programmers Solve Mini Problems All Day

First thing you learn working as a computer programmer is every day is different.

Because we don't work in a traditional production line environment.

So our jobs are not repetitive.

By the way, I saw a tweet yesterday that said any regular 9-5 job will eventually become automated.

Yet I digress.  Back to the story.

So because our jobs are free form, we don't necessarily have someone telling us how to do our jobs.

We are basically tasked with completing an assignment, which requires lots of motivation, drive and ability to solve problems.

So when we stop to think about it, we are thinking for a good portion of the day.

"Let's see, the program needs to do this.  Wait, what if I try this, no wait, that won't work.  How about this, hmmm, let me think about that, that might work."

So we try it.  It may or may not work.  If not, we go back and try again; if it does work, we move on to the next task.

I find myself thinking throughout the day, asking questions, pondering ideas, mini hypotheses if you will.

How else are we supposed to solve problems if we don't use analytic skills?  Which means thinking to yourself, considering possibilities, experimenting, failing, succeeding, and moving forward all the time.

Programmers don't just write code, they solve mini problems throughout the day; some of those problems are related to code, others aren't.

We are in the problem solving / analytical / free form experimentation business.

8/13/2013

First Try at #SiSense

I've been seeing the company SiSense around for a while now.

Today I decided to investigate.

Their URL is: http://www.sisense.com/

I listened to Bruno Aziza on SiliconAngle, at the Hadoop 2013 conference: http://www.youtube.com/watch?v=IUsAEGPR7dQ&list=PLenh213llmcbnZLiehqCeXX07hreEpLiE

SiSense's approach is different, in that they don't spread the load across hundreds of commodity servers.  They had hardware people design the software so it can accomplish the same thing with fewer servers, by leveraging the "Cache" in addition to In-Memory and compression technologies.

They can store data into the high Terabyte range, they offer high end analytics, they have big name clients, and they offer solutions in the Cloud on a Subscription basis.

Andrew Brust did a great job summarizing the latest version called Prism:
http://www.zdnet.com/sisense-announces-prism-10x-7000015698/

So I decided to download the latest version of Prism:






I chose a very simple Excel file with a single tab and 36 rows...
 
Which brings up the BI Prism Studio:

 
With quite a few Widgets...
 
 
And the Server Console:



It has a nice look and feel to it.  The machine I'm using no longer has a relational database installed at the moment, so this is as far as I'll take it for tonight.  Once I have SQL-Server loaded, I can play around with some real data; for now, I just wanted to see the environment.

Overall, SiSense seems to have a solid product; it can connect to a variety of data sources and mash the data using drag and drop in the BI Prism Studio Utility, all without having to write a single line of code.

The next step would be to build some stunning visualizations and deploy to the Cloud.

Very impressive!

8/12/2013

Data Scientist Job Description Encompasses Everything

To become a Data Scientist, you must be all encompassing.

In that you must know the industry you're working in.

As well as all the data sources in the organization which you work.

And third party data, such as Facebook, Twitter, etc.

Then you must ensure the data is cleansed properly.

And then mash the data together with ETL (Extract, Transform and Load).

You probably want to ingest all this data into Hadoop.

Then you should probably know Map / Reduce, SQOOP, HIVE, PIG, HDFS, etc.

And then be able to apply Statistical analysis using tools such as R programming languages.

And then you must know how to analyze the data to derive insights.

As well as send this data to the user in nice looking visualizations, cubes, dashboards, reports.

Of course you must create Models to predict consumer behavior, be able to read machine generated logs and / or manufacturing sensor and streaming data, as well as have the computer learn as it goes by creating Neural Networks.

Lastly, you must be able to communicate this data-turned-information in plain English to the senior execs.

Plus have a PhD.

Simple enough.

It's equivalent to playing all 9 positions of a baseball team, being the manager, collecting the tickets before the game, selling the hot dogs and popcorn, and then getting all the gear to and from the game, driving the bus, and cleaning the uniforms.

Simply put, whoever created the job description for a true Data Scientist, how could they expect a single person to tackle every point along the trail of processing data?

It's a daunting task to say the least and a bit intimidating for someone entering the field.

Suffice to say, a Data Scientist job description encompasses everything.

Slowly Changing Dimensions

Working with Data Warehousing, one must identify the type of changes over time.

This is what's referred to as a Slowly Changing Dimension.

There are three types.

Type 1: Record gets inserted into the Data Warehouse once; the data never changes.

Type 2: Record gets inserted into the Data Warehouse; if it changes in the front end system, the record gets updated in place.

Type 3: Record gets inserted into the Data Warehouse with an active flag marking it as the current record; if changes occur in the front end database, the existing DW record is flagged as not current with a date timestamp and a new record gets inserted and set as the current record.

Type 3 keeps an audit trail of changes over time, where Type 2 only stores the current value and Type 1 never changes.

An example of Type 1 is "First Visit Date": it occurs once and never changes.
Type 2: if it's a new record, it gets inserted; if the record changed, it does an update.
Type 3: the existing row gets expired and a new row gets inserted.
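
Here's a rough T-SQL sketch of that expire-and-insert pattern (the one described as Type 3 above; Kimball's books label this same pattern Type 2).  Table, column and variable names are just for illustration:

-- Expire the current warehouse row once the source record has changed
UPDATE DimCustomer
SET    IsCurrent = 0,
       EndDate   = GETDATE()
WHERE  CustomerCode = @CustomerCode
  AND  IsCurrent = 1;

-- Insert the new version of the record as the current row
INSERT INTO DimCustomer (CustomerCode, CustomerName, StartDate, EndDate, IsCurrent)
VALUES (@CustomerCode, @NewCustomerName, GETDATE(), NULL, 1);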

Depending on your system requirements, you may use one, two or all three types in a given project.

That's Slowly Changing Dimensions to the best of my recollection.

SSIS: Suggested Best Practices and naming conventions

8/11/2013

Did You Just Blink?

Blink.

Did you catch that?  10 new technologies just popped up.

How do you expect to keep up with all the changes if you stop for even a moment?

I was busy learning my job, the business, the technology, the industry, learning a sub-set of a sub-set of a technology.

And during that time, hundreds of new subjects sprung up from nowhere.

And each has splintered into hundreds more.

Just in the world of data, there have been so many changes in the past few years, it's mind boggling.

How does one stay current in everything?  They don't.  They pick and choose their items of interest, and hope for the best.

Pick a subject, learn what you can, move to the next subject, continue.

It never ends!

Data Dictionary in NoSQL Database

NoSQL databases are known for the lack of schema.

So where's the data dictionary?

How does someone new to a project/database/application know where the data is?

And in what format / delimiter / folder/ etc.

Turns out there's HCatalog, which keeps a data dictionary with common data types, because Pig, Hive, etc. have different data types.
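
For example, a table defined once in Hive (HiveQL sketch below with a made-up table) gets registered in HCatalog, so Pig scripts and Map/Reduce jobs can read the same schema instead of re-declaring the format, delimiter and folder on their own:

CREATE TABLE web_logs (
    ip          STRING,
    request_ts  STRING,
    url         STRING,
    bytes_sent  INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

Pig can then load it through HCatLoader by table name, without knowing how the underlying files are laid out.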

Good to know.

8/09/2013

Certified Data Appraiser (CDA)

Here's a new position created today on the weekly #BIWisdom chat.

Certified Data Appraiser = one who determines the current and potential value of accumulated data sets based on agreed upon standards.

This person, certified by an organization, has the know-how and understanding of financial balance sheets as well as intricate knowledge of data, and applies an approximate dollar value to specific data sets based on a set scale.

Once a price is determined, it can then be added to the financial statement, used as collateral on a loan and insured as a corporate asset.

Free Learning Courses

If you're interested in learning some of the latest cutting edge data techniques, there's a free site which allows you to learn by attending online courses.

The site is called Coursera and the URL is: https://www.coursera.org/

They've got hundreds of courses to choose from on a wide variety of subjects.

They are taught by some of the best names in higher education today.

Many of the classes are structured in a specified # of weeks per course.

Which have online videos to watch from the convenience of your computer.

Many have homework assignments, quizzes, and final exams and even offer certificates of completion.

I plan to sign up for some courses depending on time and availability.

This is an amazing site and appears to offer tremendous value.

Even if you're an expert in your field, there's something there for everyone to gain knowledge.

And we're all students for life so sign up today!

#BigData Cloud Limitation

Big Data in the Cloud seems to be a logical solution.

The thing they don't tell you: how do you get your Volumes of data to the Cloud?

I've heard this limitation is being worked on to streamline the process.

There are existing tools similar to Windows Explorer to facilitate this process.

However, they seem to be manual.

I've actually heard someone say they ship their hard drive to the Cloud provider.

Strange but true.

Someone will need to find a working solution to this issue.

Because you can't have Big Data without lots of Data.

Just saying!

#Data Density with #BigData and #SmallData

Big Data typically has low density data.  By that I mean the volume of data is so large that the key pieces of valuable info are sparsely populated.  It takes a good data person to search the data and find the nuggets of insight in the giant haystack.

Similarly, you can have Small Data with a high density of data, where the volume is less but the concentration of insight is potentially higher.

So just because you decide to move to Big Data, don't assume that your Small Data is obsolete.

There is still value there as well.

How to Solve Any Problem

Solving a problem requires a mindset of open endedness.

You stand on one side, the solution stands on the other.

It sits there, waiting patiently, to be found.

And it's your job to uncover all the solutions that don't work.

Until the actual solution presents itself.

Has anyone ever lost their keys?

How did you solve the problem?

Did you frantically start searching in every direction, forgetting where you already looked, a haphazard approach to problem solving?

Or did you stop and get a picture in your mind of where you last remember having your keys?

You retrace your steps, searching every place they could be.  You checked everywhere, they must have vanished.  So you sit down for a minute to regroup, and what's that sound you hear?  It's the car outside, with the engine running, all the doors locked, and the ice cream melting.  Thus, you solved the problem by locating the keys.

However, you've discovered a newer, bigger problem or two.

Yikes, however, by removing all the places the keys weren't, you kept the problem open ended by not giving up the search, systematically reducing the possible solutions, until the answer presented itself.

And that's how problem solving works, you try things with an approach based on logic, until you've exhausted the search, where you stop and regroup, allow your mind to relax and free itself, and a bolt of lightning pops in with the answer, ah ha!

8/08/2013

My First Data Scientist Project - ROI

About 9 months ago, I was tasked with a project regarding ROI.

How did the customer enter the business ecosystem, what were the stopping points along the trail, did the customer purchase a product, and how much revenue was gained and spent along the way?

A difficult task nonetheless.

So I started with our Leads database.  Obtain all entries on specific dates.

Then match them up with the SalesForce data.  Were they in the Contact, Account, Lead or Opportunity tables?

SalesForce is unique as a person can be entered in one, many or none of the tables listed.  Sometimes the data is not the same between systems which causes havoc and mayhem.

Did the person download any white papers, view online webinars, how many touch points were there per individual?

Then I worked on the Great Plains data or our Sales Cube which contains only closed sales.

Now these data sets were not small by any means. 
And mashing the data using SQL-Server was quite the challenge.

I used up the tempdb resources on more than one occasion and had to have the server re-started, at which point one of the Architects issued a condescending email telling me to write more efficient queries.  I responded that I was pulling vast amounts of data across the network and across databases and mashing it all together, so crashing the SQL-Server was not unexpected behavior.

In addition to that, I was using Fuzzy Logic to mash data that was otherwise not correlated.  It basically looked for a 100% match between Country and Email Domain name (XXXX.BobsCarpet.com).  Once those matches were determined, I used an 82% match rate on the Company name (Bobs Carpet Mart vs. Bob's Carpet Mart, Inc.).
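
A stripped-down sketch of that first exact-match pass (made-up table and column names; the real queries spanned servers and databases and were far uglier):

-- Pass 1: 100% match on Country plus the e-mail domain
SELECT l.LeadID, a.AccountID
FROM   Leads l
JOIN   SalesForceAccounts a
       ON  a.Country = l.Country
       AND LOWER(a.EmailDomain) =
           LOWER(SUBSTRING(l.Email, CHARINDEX('@', l.Email) + 1, 100));
-- Pass 2 (not shown): fuzzy match on Company name,
-- keeping pairs at or above the 82% similarity threshold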

This really slowed the server down to a crawl.  However, it matched lots of records that were previously unmatchable.

I was dedicated to this ROI project for over a month.  Each day my progress would increase as I learned the business slowly but surely.  And slowly I found more data sources.  How much did we spend on AdWords from Google?  What percent went from Lead to SalesForce to Sale?  What dollar amount was associated based on dollars spent?  By Product.  By Region.  Etc., etc.

I occasionally met with the project sponsors to relay progress.  I was told this project was tried several times prior to me with unsuccessful results and was promised a steak dinner if I could solve it.

However, this was around the holiday time last year and our company announced a decision to spin off a subset of the company and my project fell to the wayside, never to be resumed.

So this story describes my first Data Scientist project: how I learned the business, had to locate the data sources, mashed data using Fuzzy Logic, and battled resource limitations.  It was basically working independently using nothing but business savvy, technology prowess and the ability to translate a business problem into code to produce analytics.

If I'm not mistaken, that's the definition of a Data Scientist.  The project could possibly have been smoother had I chosen Hadoop to store the datasets, which were quite large and definitely slowed the progress.

This project was exciting, first of all because nobody thought I could solve it, and secondly because I was swimming in data searching for nuggets of insight.

8/07/2013

Data Scientist Dream

In the world of data, we often hear Data Quality.

And what that means is the data is not dirty.

So what's clean data?

Data that can produce insights.

Because the data has lots of characteristics about specific objects.

It could be census data, weather data, stock market data, archived emails, just about anything.

And a good data person can take that data set, mine it, and turn it into information.

And what if you have multiple clean data sets which you could mash together.

To really slice and dice and find patterns and identify predictive scenarios.

Imagine how much clean data is accumulated at some of the gov't data centers.

Mashing Facebook, LinkedIn, YouTube, Twitter, cell phone companies, taxi and rental car companies, unlimited data sources.

Shove it in Hadoop and begin the search and discovery.

And store it forever.  Imagine the possibilities.

So wouldn't it be nice to have as much clean data as possible?

To derive insight, mine it over time and predict future behavior.

That's a data scientist dream!

Simple Example of Ingesting Excel into #Hadoop

So here's the way Big Data works.

You take some data, in an Excel spreadsheet.

Export it to a Comma Separated Value (CSV) file.

Upload that file to the Hadoop HDFS file system.

Mount it with a Hive table.

Expose it to Excel via the Hive connector.

And the data is back in Excel.
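
The "mount it with a Hive table" step boils down to a few lines of HiveQL, something like this (the HDFS path and columns are hypothetical):

-- External table laid over the HDFS folder where the CSV landed
CREATE EXTERNAL TABLE sales_from_excel (
    order_id  INT,
    customer  STRING,
    amount    DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/hue/sales_from_excel';

From there, the Hive connector in Excel just sees sales_from_excel as another table to query.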

As you just saw, we took Excel data, and transformed it back into Excel.

Magic you say?  We took the long way around to get the same thing.

Except that's just one scenario.

In actuality, you can dump data from anywhere, Relational DB, Sensors, Log Files, you name it.

And the data doesn't have to be well formed, it can be Semi Structured or completely Unstructured.

And the good news, as the data funnels through the process, you can apply business rules, mash it up with other data sets as well as provide fully capable Business Intelligence as the final product.

So you see, we took a simple process to explain how to make it more complex and add value and find insights!

Installing the new Power Query Add In for Excel 2010

In order to download the new Microsoft Power Query add-in, go to this URL...

Select 32 or 64 bit depending on your Excel version...











Logged into Microsoft Azure Marketplace...









Who knows what the data refers to, as it's not in English; however, the point is that you can pull data from just about anywhere in a short amount of time and visualize it, even throw it into Pivot tables as well as Power Pivot.

Hope you enjoyed the instructional blog...

Get Sh#t Done!