Data Lakes Require Data Captains of the Hadoop Ships

If Hadoop is a Data Lake, then Developers must have good boats in order to sail the rough seas.

Hadoop stores structured, semi-structured and unstructured data.  Data can be added by a variety of means, and not every developer is going to know where all the data resides, what it contains and how it integrates into the ecosystem.

Developers must rely on the metadata layer called HCatalog, which exposes table schemas and storage locations to tools such as Pig and Hive.

In addition to that, the data types can vary for the same field depending on whether the developer used Pig or Hive.
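To make the type problem concrete, here is a minimal sketch of a check that compares a Hive schema against a Pig schema for the same table.  The mapping covers only a handful of primitive types as I understand them (Hive string is Pig chararray, Hive bigint is Pig long, Hive binary is Pig bytearray), and the function name and the dictionary-based schemas are hypothetical, not part of any Hadoop API.

```python
# Partial Hive-to-Pig primitive type mapping (illustrative subset only).
HIVE_TO_PIG = {
    "string": "chararray",
    "bigint": "long",
    "binary": "bytearray",
    "int": "int",
    "double": "double",
}


def find_type_mismatches(hive_schema, pig_schema):
    """Return {field: (hive_type, pig_type)} for fields whose Pig type
    does not match what the Hive type would normally map to.

    Both schemas are plain dicts of field name -> type name
    (a hypothetical representation chosen for this sketch).
    """
    mismatches = {}
    for field, hive_type in hive_schema.items():
        expected = HIVE_TO_PIG.get(hive_type.lower())
        actual = pig_schema.get(field)
        # Skip fields missing on the Pig side or types outside the subset.
        if expected is None or actual is None:
            continue
        if actual.lower() != expected:
            mismatches[field] = (hive_type, actual)
    return mismatches
```

A captain could run a check like this whenever a new script starts reading a shared table, so surprises show up before the reports do.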

Data gets added at different intervals, batches run at different times and days, so knowing how fresh the data is may also prove to be a problem.  For example, your sales lead data may get updated hourly, but the financial data gets loaded on Tuesday and Thursday at midnight.  Consumers of the data will need to know the data refresh patterns.
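One way to publish refresh patterns is a small table of the maximum age each data set is allowed to reach, plus a check against the last load time.  The sketch below assumes hypothetical data set names and assumes you can get a last-loaded timestamp from somewhere, such as a job log or a file modification time.

```python
from datetime import datetime, timedelta

# Hypothetical refresh expectations, based on the example above.
# Financials load Tuesday and Thursday at midnight, so the longest
# normal gap is Thursday to Tuesday: five days.
MAX_AGE = {
    "sales_leads": timedelta(hours=1),
    "financials": timedelta(days=5),
}


def is_stale(dataset, last_loaded, now):
    """Return True if the data set is older than its expected refresh gap."""
    return now - last_loaded > MAX_AGE[dataset]
```

For example, `is_stale("sales_leads", last_loaded, datetime.now())` would flag lead data that has missed its hourly update.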

And what if jobs fail?  You may think the data is running along fine, until a user calls in with a complaint that the data is out of date on their reports.  Creating alerts may be necessary to prevent excessive tickets in the to-do list.
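A basic alert can be as simple as scanning the latest job runs and reporting anything that did not succeed.  This is only a sketch: the record layout (a `name` and a `status` per run) and the `SUCCEEDED` status value are assumptions, and a real setup would pull these from your scheduler and send the messages by email or chat rather than return them.

```python
def build_alerts(job_runs):
    """Return one alert message per job run that did not succeed.

    job_runs is a list of dicts with "name" and "status" keys
    (a hypothetical layout chosen for this sketch).
    """
    alerts = []
    for run in job_runs:
        if run["status"] != "SUCCEEDED":
            alerts.append(
                f"ALERT: job {run['name']} ended with status {run['status']}"
            )
    return alerts
```

Hooking the output into whatever notification channel the team already watches means the developer hears about the failure before the user does.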

Many of these issues have already been addressed in current-day Business Intelligence models.  Hadoop adds an extra layer, and developers need to account for it.

The Data Lake adds value to organizations, as well as complexity.  So to stay ahead of the curve, the Data Captains of the Hadoop Ships must be cognizant of the extra layers and prepare for smooth sailing.