Apply Structure to Data Sets for Accountability to Ripple Across the Globe

When I began coding in Visual Basic 4, 5 and 6, there were many advantages, such as rapid application development.  The language was approachable: everyone from entry-level people to accountants could write code, and even highly complex code could be written and maintained over time.  The downside was that the code was not object oriented, so the core programmers did not consider it a true language.  It was soon shelved for the most part, although code still exists in production in many shops, as does VBA embedded in Office products.

When I programmed Java J2EE, the in thing was frameworks like Struts or Hibernate, and the list has grown tremendously.  The nice thing about frameworks is the consistency, so you can dive into someone else's code and get up to speed fairly quickly.  A framework also segregates things into nice buckets, like the front-end UI, the middle tier and the back-end database layer, so people can become experts in specific niches.  Each iteration of a framework seems to correct some things and possibly cause new issues or obstacles, and there's always the backward compatibility issue and the risk of having to completely rewrite applications.

Let's take a look at the world of data.  Data gets created and stored in dozens of applications, including Microsoft SQL Server, Microsoft Access, Oracle, Sybase and MySQL, and the list goes on and on.  We started with VSAM flat files on the mainframe, but have grown into cloud-based solutions, graph databases, NoSQL key-value databases, and Hadoop storage using HDFS or cloud storage across clusters of servers.  Data is data.  There are different types for strings, numbers and blobs.  There's XML and JSON to send data to other systems, places and people.  There's Excel to manipulate, store and transport data.  We can access reports via web, mobile, email, network drives and PDF, and we can schedule those reports to run at specific times.

When XML first arrived on the scene, there was mention of it replacing object-oriented coding patterns, which was perhaps premature speculation.  Still, you could flow data using XML and apply a transformation using XSLT.  And for web page developers, we have CSS to apply standard styles to elements so they display in a certain way, a way to organize and formalize the structure and output of web pages.  Although no design or architecture is bulletproof enough to handle every possible scenario, these are good efforts to apply standards to technologies.

Back to the world of data: why don't we have similar standards?  Why don't we have the ability to tag our output data sets with specific information?  For example, where the data was created, in what system, the date and time stamp, and the format of the data upon inception.  And then add tags along the way, so when the data gets sent to others, we have an audit log of who touched the data, when, how, and perhaps why, and maybe throw in some IP addresses, some user IDs, etc.  Perhaps embed these attributes within the data, as self-describing data viewable to others along the trail.  Then we have some accountability.  This would benefit the proliferation of data, especially in the world of open data sets.  Not just a text file describing the data, but row-level and field-level attributes as well.
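To make the idea a bit more concrete, here is a minimal Python sketch of what tagging at creation plus an audit trail might look like.  No such standard exists that I know of, so every name in it (tag_at_creation, record_touch, update_field, and the keys inside the envelope) is a hypothetical choice of mine, not an existing format.

```python
from datetime import datetime, timezone


def _now():
    # UTC timestamp in ISO 8601 form, used for every tag and audit entry.
    return datetime.now(timezone.utc).isoformat()


def tag_at_creation(rows, system, data_format):
    """Wrap raw rows in an envelope recording where and when they originated."""
    return {
        "origin": {"system": system, "format": data_format, "created_at": _now()},
        "audit_log": [],  # data-set-level trail: who, when, how, why
        "rows": [{"values": dict(row), "changes": []} for row in rows],
    }


def record_touch(dataset, user_id, ip_address, action, reason=""):
    """Append one audit entry describing who touched the data and why."""
    dataset["audit_log"].append({
        "user_id": user_id,
        "ip_address": ip_address,
        "action": action,
        "reason": reason,
        "at": _now(),
    })


def update_field(dataset, row_index, field, value, user_id, ip_address, reason=""):
    """Change one field, keeping a row-level and field-level record of the change."""
    row = dataset["rows"][row_index]
    old_value = row["values"].get(field)
    row["values"][field] = value
    row["changes"].append({
        "field": field, "old": old_value, "new": value,
        "user_id": user_id, "at": _now(),
    })
    record_touch(dataset, user_id, ip_address,
                 f"update {field} in row {row_index}", reason)
```

A caller would create the envelope once, for example tag_at_creation([{"city": "Tampa", "sales": 120.5}], system="crm-prod", data_format="csv"), and then route every later change through update_field so the trail travels with the data.  The system name and values here are made up for illustration.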

In addition to audit trails, perhaps apply attributes such as field descriptions and data types, such as text:varchar(27), float(9,2), etc., so the data could be plopped into another reporting tool without modification or effort.  Perhaps there could be a generic report format, with a generic report reader that can read any data set, plus some features to modify the data once in the tool, like group by this field, sort by this, filter by this.  It would be generic and not specific to Windows, Linux, web or mobile.  It would be completely transportable and completely dumbed down, such that the data would simply appear in the report reading and writing tool and the user could be off and running, with the knowledge of where the data was derived from and who touched it along the way.
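As a rough illustration of the generic reader idea, the Python sketch below carries a schema alongside the rows and uses it to convert values before grouping, sorting and filtering.  The layout of the data set, the sample values, and the helper names (caster_for, read_rows, filter_by, sort_by, group_sum) are all assumptions of mine for the sake of the example.

```python
from collections import defaultdict

# Hypothetical self-describing data set: field descriptions travel with the rows.
dataset = {
    "schema": [
        {"name": "region", "type": "varchar(27)", "description": "Sales region"},
        {"name": "amount", "type": "float(9,2)", "description": "Sale amount"},
    ],
    "rows": [["East", "100.50"], ["West", "75.25"], ["East", "20.00"]],
}


def caster_for(type_name):
    """Pick a Python converter from a declared type such as 'float(9,2)'."""
    base = type_name.split("(")[0].strip().lower()
    if base in ("float", "decimal"):
        return float
    if base in ("int", "integer"):
        return int
    return str  # varchar, text and anything unrecognized stay as strings


def read_rows(data):
    """Turn raw rows into typed records using only the embedded schema."""
    names = [field["name"] for field in data["schema"]]
    casters = [caster_for(field["type"]) for field in data["schema"]]
    return [
        {name: cast(value) for name, cast, value in zip(names, casters, row)}
        for row in data["rows"]
    ]


def filter_by(records, field, predicate):
    return [r for r in records if predicate(r[field])]


def sort_by(records, field, descending=False):
    return sorted(records, key=lambda r: r[field], reverse=descending)


def group_sum(records, group_field, value_field):
    """Total value_field for each distinct value of group_field."""
    totals = defaultdict(float)
    for r in records:
        totals[r[group_field]] += r[value_field]
    return dict(totals)
```

Because the reader relies only on the schema that ships with the data, the same few functions would work on any data set described this way, which is the point of keeping it generic.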

Lastly, we could offer some type of certification of the data: this set of data was validated on such and such date, by whom, what the data consists of, and it is available on the web, for free or perhaps for purchase, bundled and ready to go.  Think about the downstream ramifications.  We tagged this set of data with key attributes, it is validated and certified, and you can obtain this data set to load into your machine learning systems or artificial intelligence algorithms, perhaps for unsupervised AI.  And if others exposed their data as well, you could have unsupervised AI talking with other AI all day, every day, creating and updating models in real time with a certain degree of accuracy and accountability via audit logs.  Throw in the Internet of Things (IoT), and you have devices talking with servers and proliferating that data to others in real time, for a pinball effect of data bouncing all over the place in a digital world networked by machine learning and AI models.
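One simple way to picture the certification piece is a fingerprint of the data stored next to the validation details, so a downstream consumer can check that nothing changed after validation.  The sketch below uses SHA-256 from Python's standard hashlib module.  The certificate fields and the function names are hypothetical, and a plain hash only detects changes; proving who issued the certificate would need a real digital signature on top of this.

```python
import hashlib
import json
from datetime import date


def fingerprint(rows):
    """Hash a canonical JSON rendering of the rows (sorted keys, no extra spaces)."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def certify(rows, validated_by, description, validated_on=None):
    """Build a certificate: who validated the data, when, and what it contains."""
    return {
        "validated_by": validated_by,
        "validated_on": (validated_on or date.today()).isoformat(),
        "description": description,
        "row_count": len(rows),
        "sha256": fingerprint(rows),
    }


def is_unchanged(rows, certificate):
    """True if the rows still match the fingerprint recorded at validation time."""
    return fingerprint(rows) == certificate["sha256"]
```

Before loading a data set into a model, a consumer would call is_unchanged(rows, certificate) and refuse the load if it returns False.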

See similar post: http://www.bloomconsultingbi.com/2018/04/self-describing-data-tagged-at-time-of.html