About 9 months ago, I was tasked with a project regarding ROI: how did the customer enter the business ecosystem, what were the stopping points along the trail, did the customer purchase a product, and how much revenue was gained and spent along the way?
A difficult task, to say the least.
So I started with our Leads database, obtaining all entries for specific dates.
Then I matched them up with the Salesforce data. Were they in the Contact, Account, Lead, or Opportunity tables?
Salesforce is unique in that a person can be entered in one, many, or none of the tables listed. Sometimes the data is not the same between systems, which causes havoc and mayhem.
Did the person download any white papers? View any online webinars? How many touch points were there per individual?
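The matching step above can be sketched roughly like this. This is a minimal, hypothetical Python sketch: the field names, table names as dictionary keys, and sample data are all invented for illustration, not the actual Leads or Salesforce schemas.

```python
# Hypothetical data shapes -- real Leads/Salesforce extracts would come
# from database queries, not inline literals.
leads = [
    {"email": "ann@example.com", "date": "2015-03-01"},
    {"email": "bob@example.com", "date": "2015-03-02"},
]

# Which emails appear in which Salesforce table (one, many, or none).
salesforce = {
    "Contact": {"ann@example.com"},
    "Lead": {"ann@example.com", "bob@example.com"},
    "Opportunity": set(),
    "Account": set(),
}

# Marketing touch points: white paper downloads, webinar views, etc.
touches = [
    {"email": "ann@example.com", "type": "white paper"},
    {"email": "ann@example.com", "type": "webinar"},
]

def classify(lead):
    """List the Salesforce tables that contain this lead's email."""
    return [table for table, emails in salesforce.items()
            if lead["email"] in emails]

def touch_points(email):
    """Count the touch points recorded for one individual."""
    return sum(1 for t in touches if t["email"] == email)

for lead in leads:
    print(lead["email"], classify(lead), touch_points(lead["email"]))
```

In the real project this was done in SQL Server across databases; the sketch only shows the shape of the lookup and the per-person touch-point count.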
Then I worked on the Great Plains data, or our Sales Cube, which contains only closed sales.
Now these data sets were not small by any means.
And mashing the data together using SQL Server was quite the challenge.
I used up the tempdb resources on more than one occasion and had to have the server restarted. One of the architects sent a condescending email telling me to write more efficient queries; I responded that I was pulling vast amounts of data across the network and across databases and mashing it all together, so crashing the SQL Server was not unexpected behavior.
In addition to that, I was using fuzzy logic to mash together data that was otherwise not correlated. It first looked for a 100% match on Country and email domain name (XXXX.BobsCarpet.com). Once those matches were determined, I required an 82% match rate on the company name (Bobs Carpet Mart vs. Bob's Carpet Mart, Inc.).
This really slowed the server to a crawl. However, it matched lots of records that were previously unmatchable.
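The two-stage rule above can be sketched in a few lines. This is an illustrative sketch, not the actual SQL Server implementation: it uses Python's standard-library `difflib.SequenceMatcher` as the similarity measure (the real project used SQL Server fuzzy matching), and the record fields and sample companies are invented.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_match(rec_a, rec_b, threshold=0.82):
    """Stage 1: require an exact match on country and email domain.
    Stage 2: accept only if company names are at least `threshold` similar."""
    if rec_a["country"] != rec_b["country"]:
        return False
    if rec_a["domain"].lower() != rec_b["domain"].lower():
        return False
    return similarity(rec_a["company"], rec_b["company"]) >= threshold

a = {"country": "US", "domain": "bobscarpet.com", "company": "Bobs Carpet Mart"}
b = {"country": "US", "domain": "bobscarpet.com", "company": "Bob's Carpet Mart, Inc."}
print(fuzzy_match(a, b))  # True: exact country/domain match, names ~82% similar
```

Requiring the cheap exact match first before running the expensive string comparison is also why this approach, done over large tables, drags a server down: the fuzzy comparison still runs over every candidate pair that survives the first filter.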
I was dedicated to this ROI project for over a month. Each day my progress increased as I learned the business, slowly but surely, and found more data sources. How much did we spend on Google AdWords? What percentage went from Lead to Salesforce to Sale? What revenue was associated with each dollar spent? By product. By region. Etc.
I occasionally met with the project sponsors to relay my progress. I was told this project had been attempted several times before me without success, and I was promised a steak dinner if I could solve it.
However, around the holidays last year our company announced a decision to spin off a subset of the company, and my project fell by the wayside, never to be resumed.
So this story describes my first data scientist project: learning the business, locating the data sources, mashing data with fuzzy logic, and battling resource limitations. It was basically working independently, using nothing but business savvy, technology prowess, and the ability to translate a business problem into code that produces analytics.
If I'm not mistaken, that's the definition of a data scientist. The project could possibly have been smoother had I chosen Hadoop to store the datasets, which were quite large and definitely slowed the progress.
This project was exciting, first of all because nobody thought I could solve it, and secondly because I was swimming in data, searching for nuggets of insight.