Databases store information, typically transaction data, from a front-end application. Each vendor has its own proprietary database. The report developer has to learn each one, typically by studying the visual data dictionary diagrams, which is very painful. Over time, they push out a plethora of reports based on custom SQL statements, jumping through hoops to apply the proper table joins, views and stored procedures, and the whole thing can become quite large and messy. And all the knowledge is in house. That gives the report developer some status, as he or she can 'get' the information you need.
However, each org typically has more than one proprietary database, each with a unique schema. How does one merge two or more databases to get complete insight into the organization? It's very difficult. It typically requires an ETL developer, a data modeler, a data warehouse architect, a team of developers and someone who knows the business domain.
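To make the schema problem concrete, here is a minimal Python sketch of mapping two differently shaped sources onto one common layout. The column names, the two mappings and the sample rows are all made up for illustration; a real ETL job would also have to reconcile IDs, formats and data types.

```python
from typing import Any, Dict, List

# Hypothetical column mappings from each vendor schema to a shared schema.
CRM_MAP = {"cust_id": "customer_id", "full_name": "name", "email_addr": "email"}
BILLING_MAP = {"CustomerNo": "customer_id", "CustName": "name", "Email": "email"}


def to_common(row: Dict[str, Any], mapping: Dict[str, str], source: str) -> Dict[str, Any]:
    """Rename a row's columns to the shared schema and tag where it came from."""
    record = {target: row[src] for src, target in mapping.items()}
    record["source"] = source
    return record


# Sample rows (invented) as they might come out of each database.
crm_rows = [{"cust_id": 1, "full_name": "Ada Lovelace", "email_addr": "ada@example.com"}]
billing_rows = [{"CustomerNo": "A-17", "CustName": "Lovelace, Ada", "Email": "ADA@EXAMPLE.COM"}]

unified: List[Dict[str, Any]] = (
    [to_common(r, CRM_MAP, "crm") for r in crm_rows]
    + [to_common(r, BILLING_MAP, "billing") for r in billing_rows]
)
```

Note that renaming columns is the easy part. The two rows above still disagree on the ID, the name format and the email casing, and that is where the business domain knowledge comes in.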
Next, let's add in data from outside the organization, like the cloud. For example, social media like Facebook, Twitter or LinkedIn. Or how about Google Analytics? How does that data get integrated into a cohesive whole? Very carefully.
You can see right off the bat that proprietary databases from vendors have created an entire industry of report writers, database developers and business intelligence tools.
Now, let's add in Hadoop. It can handle the volume you're looking for: just shove all your data into Hadoop, then apply the "links" to integrate the disparate data sets for your big-picture analytics. Still, setting up a Hadoop ecosystem is not an easy task. It requires a Hadoop admin and a Hadoop developer, and you have to pick an out-of-the-box Hadoop distribution, load it onto commodity servers and begin.
However, Microsoft offers a cloud-based solution on the Azure platform. You can store data in HDInsight Hadoop, SQL Database or SQL Data Warehouse, build models with AzureML in a WYSIWYG cloud-based IDE, and use plenty of adapters to bring in data from other cloud offerings, from on-premise systems, or data born in the cloud. And you can do your reports in Power BI.
So the ability to bring together 'all' your data is now available. But the inherent issue of merging different data sets still exists, and it still requires an ETL developer or a business domain expert. ETL still seems to be the bottleneck to easy automation of full life cycle analytics. You have to apply business rules and figure out how to match customer x from the Salesforce database to Google AdWords to Microsoft Dynamics to the call center data. There's no easy process for this, unless you bake a solution into each part along the trail, which is typically done in-house.
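As a rough illustration of what one such business rule looks like, the Python sketch below matches customer records across sources using a normalized email address as the shared key. The source names, fields and sample records are invented, and treating email as the matching key is an assumption; real systems often have no clean common key, which is exactly why this step is hard to automate.

```python
from collections import defaultdict
from typing import Any, Dict, List


def normalize_email(email: str) -> str:
    """Business rule (assumed): emails match if equal after trimming and lowercasing."""
    return email.strip().lower()


def match_customers(sources: Dict[str, List[Dict[str, Any]]]) -> Dict[str, Dict[str, Any]]:
    """Group records from every source under one normalized email key.

    If a source has several records with the same email, the last one wins.
    """
    merged: Dict[str, Dict[str, Any]] = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            merged[normalize_email(record["email"])][source_name] = record
    return dict(merged)


# Invented sample data standing in for exports from each system.
sources = {
    "salesforce": [{"email": "Jane.Doe@Example.com ", "account": "Acme"}],
    "adwords": [{"email": "jane.doe@example.com", "clicks": 12}],
    "call_center": [
        {"email": "JANE.DOE@EXAMPLE.COM", "calls": 3},
        {"email": "bob@example.com", "calls": 1},
    ],
}

profiles = match_customers(sources)
```

Here `profiles["jane.doe@example.com"]` ends up holding the Salesforce, AdWords and call center records side by side, while the second customer only shows up in the call center data. Anyone who signed up with two different email addresses would slip through, so a rule like this is a starting point, not a solution.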
We now have all the pieces required for full life cycle analytics. However, there are still a few more problems to solve before complete automation can occur.
In the meantime, the amount of data continues to expand, thanks to social media and the Internet of Things. Perhaps in the near future the missing link of automated ETL will arrive, and then we'll take all that data and apply machine learning in near real time. Then we'll be one step closer to a cross-domain artificially intelligent brain used to predict the probable future based on past events, segregate things into buckets, find anomalies and outliers in data, and create recommendation engines without the assistance of people. That should increase sales, reduce costs and create efficiency across the board. And propel humanity into the next stage of the Information Age / Revolution.
And there you have it!