The reason: data never got respect. It's the Rodney Dangerfield of IT. Nobody considered the data or the downstream reporting. No wonder the data was so difficult to work with: it wouldn't join to other data sets, it took 20 tables to get a query to work (left outer, right outer, inner, cross), and people kept asking why these queries take so long to execute, why these numbers don't match the other reports, or why these great numbers are a month stale when they sure would have been useful at quarter end.
So how do we fix things? Go to the root of the matter. Tag the data at the time of creation. How do we tag it? By self-referencing the key aspects that matter downstream, perhaps in an XML tree-like structure that is both self-referencing and self-describing. Future developers or automatic ingestion programs could add to the tree downstream: original value, original field type, original date, then who modified it, when, what the new value is, perhaps why, and so on.
Self-describing data, all the way down: database -> schema -> table -> field -> row -> exact field.
Make it generic: use a standard template format, with XML for each field, table, schema and database.
And make it consumable by a generic report reader.
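To make the idea concrete, here is a minimal sketch in Python of one tagged field and a generic reader. The element and attribute names (`field`, `original`, `history`, `change`) and the sample values are assumptions made up for illustration, not an established standard.

```python
import xml.etree.ElementTree as ET

def tag_field(database, schema, table, field, row_id, value, field_type, created):
    """Wrap one value in a self-describing XML element (hypothetical layout)."""
    node = ET.Element("field", {
        "database": database, "schema": schema, "table": table,
        "name": field, "row": str(row_id),
    })
    original = ET.SubElement(node, "original", {"type": field_type, "date": created})
    original.text = str(value)
    ET.SubElement(node, "history")  # later changes are appended here
    return node

def record_change(node, new_value, who, when, why=""):
    """Append a modification entry; the original value is never overwritten."""
    change = ET.SubElement(node.find("history"), "change",
                           {"by": who, "date": when, "why": why})
    change.text = str(new_value)

def current_value(node):
    """Generic reader: latest change if any, else the original value."""
    changes = node.find("history").findall("change")
    return changes[-1].text if changes else node.find("original").text

# Sample data below is invented for the example.
node = tag_field("sales_db", "dbo", "orders", "amount", 42, 19.99, "decimal", "2018-04-01")
record_change(node, 24.99, who="jsmith", when="2018-04-15", why="price correction")
print(ET.tostring(node, encoding="unicode"))
print(current_value(node))
```

Because the original value and every later change live in the same tree, the audit trail travels with the data instead of sitting in a separate log.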
But we'd have to modify our storage databases, to handle this new self describing XML format. Okay, modify them. That's considered progress.
So what are the benefits of tagging data at time of creation?
- Audit Trails
- Data ready for Machine Learning ingestion with no manual tagging time (no human intervention)
- Consistent Patterns
- Transparent Data Sets
- Public Open Data Sets
- Store data for life
- Certify Data Sets
What if we stored the type of data and the key elements available for joining to other data? We could create programs that read the self-referencing WSDL files, perform the joins unassisted, and throw that data into a model for unsupervised machine learning, without anyone having to tag the data.
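A rough sketch of what "joins unassisted" might look like, in Python. It assumes each data set ships with a small descriptor listing its join keys; the descriptor shape (`table`, `join_keys`) and the sample rows are hypothetical, and plain dictionaries stand in for whatever XML or WSDL file would actually carry the description.

```python
def shared_keys(desc_a, desc_b):
    """Return join keys that both descriptors declare, in desc_a's order."""
    return [k for k in desc_a["join_keys"] if k in desc_b["join_keys"]]

def auto_join(desc_a, rows_a, desc_b, rows_b):
    """Inner-join two row lists on every key the descriptors share."""
    keys = shared_keys(desc_a, desc_b)
    if not keys:
        raise ValueError("descriptors declare no common join key")
    # Index the second data set by its key values for fast lookup.
    index = {}
    for row in rows_b:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in rows_a:
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**match, **row})
    return joined

# Hypothetical descriptors and rows, invented for the example.
orders_desc = {"table": "orders", "join_keys": ["customer_id"]}
customers_desc = {"table": "customers", "join_keys": ["customer_id", "region_id"]}
orders = [{"order_id": 1, "customer_id": 7, "amount": 19.99},
          {"order_id": 2, "customer_id": 8, "amount": 5.00}]
customers = [{"customer_id": 7, "region_id": 3, "name": "Acme"}]

result = auto_join(orders_desc, orders, customers_desc, customers)
print(result)
```

The program never needs to be told how the tables relate. It reads the declared keys, finds the overlap, and joins on it.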
How about exposing that data via a Web Service, so people can pull it in OData, JSON or XML format, you name it?
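As a small illustration, the same record can be rendered in whichever format the caller asks for. This Python sketch covers only JSON and XML (OData is left out because it involves a fuller protocol), and the `render` function and the sample record are assumptions for the example, not part of any particular web framework.

```python
import json
import xml.etree.ElementTree as ET

def render(record, fmt):
    """Serialize one flat record as JSON or XML (fmt is 'json' or 'xml')."""
    if fmt == "json":
        return json.dumps(record, sort_keys=True)
    if fmt == "xml":
        root = ET.Element("record")
        for key in sorted(record):
            ET.SubElement(root, key).text = str(record[key])
        return ET.tostring(root, encoding="unicode")
    raise ValueError(f"unsupported format: {fmt}")

# Sample record invented for the example.
record = {"order_id": 1, "amount": 19.99}
print(render(record, "json"))
print(render(record, "xml"))
```

A real service would sit behind an HTTP endpoint and pick the format from the request, but the serialization step would look much the same.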
And how about receiving data directly from Internet of Things (IoT) devices, processing events and micropings of data across the internet?
And how about reading from blockchain distributed ledgers across the globe, exposed via Web Service?
Or how about sending data to a Web Service and storing the returned contents back in your system? Here's a row of data with specific characteristics; what is the result? Let's store that off and build models from it.
That opens up a market for providing pre-built Web Service data, for a subscription or a per-item fee.
The potential is unlimited.
Self-describing data, tagged at time of creation. Even Rodney Dangerfield got some respect eventually.
See similar post: http://www.bloomconsultingbi.com/2018/04/apply-structure-to-data-sets-for.html