If Hadoop is a Data Lake, then Developers must have good boats in order to sail the rough seas.
Hadoop holds structured, semi-structured and unstructured data. Data can be added by a variety of means, and not every developer is going to know where all the data resides, what it contains and how it integrates into the ecosystem.
Developers must rely on the metadata layer called HCatalog.
In addition to that, the data types can vary for the same field depending on whether the developer used Pig or Hive.
Data gets added at different intervals and batches run at different times and days, so knowing how fresh the data is may also prove to be a problem. For example, your sales lead data may get updated hourly, but the financial data gets loaded on Tuesdays and Thursdays at midnight. Consumers of the data will need to know the data refresh patterns.
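One way to make refresh patterns visible is to write them down and check them. Here's a minimal Python sketch, assuming you can find out when each data set was last loaded. The data set names, the `MAX_AGE` table and the `is_stale` function are all made up for illustration; they are not part of Hadoop or HCatalog.

```python
from datetime import datetime, timedelta

# Hypothetical refresh expectations, taken from the example above:
# sales leads load hourly; financials load Tuesday and Thursday at
# midnight, so the longest normal gap is Thursday to Tuesday (5 days).
MAX_AGE = {
    "sales_leads": timedelta(hours=1),
    "financials": timedelta(days=5),
}


def is_stale(dataset, last_loaded, now):
    """Return True when the data set is older than its allowed age."""
    return now - last_loaded > MAX_AGE[dataset]


# Usage: sales leads last loaded 90 minutes ago have missed a refresh.
now = datetime(2024, 1, 1, 9, 30)
print(is_stale("sales_leads", datetime(2024, 1, 1, 8, 0), now))
```

Where the last load time comes from is up to you; a load log table or file timestamps would both work for a sketch like this.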
And what if jobs fail? You may think the data is running along fine, until a user calls in to complain that the data on their reports is out of date. Creating alerts may be necessary to keep excessive tickets off the to-do list.
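An alert can start out very simple. The Python sketch below assumes you already collect a list of job runs with a name and a final status; the record layout, the status strings and the `failed_job_alerts` function are hypothetical, not an API from any scheduler.

```python
def failed_job_alerts(runs):
    """Build one alert message per job run that did not succeed."""
    return [
        f"ALERT: job {run['job']} ended with status {run['status']}"
        for run in runs
        if run["status"] != "SUCCEEDED"
    ]


# Usage: only the failed financial load produces an alert.
runs = [
    {"job": "load_sales_leads", "status": "SUCCEEDED"},
    {"job": "load_financials", "status": "FAILED"},
]
for message in failed_job_alerts(runs):
    print(message)
```

In practice the messages would go to email or a chat channel rather than the console, so the team hears about the failure before the users do.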
Many of these issues have already been addressed in current-day Business Intelligence models. Hadoop adds an extra layer, and developers need to take it into account.
The Data Lake adds value to organizations, as well as complexity. So to stay ahead of the curve, the Data Captains of the Hadoop Ships must be cognizant of the extra layers and prepare for smooth sailing.