4/28/2019

Attach Value / Pair Data Attributes for Supervised Learning

Machine learning often works best with labeled data, an approach called Supervised Learning.  The source data set carries information describing the content of the data.

For example, you have a series of pictures.  Some of the pictures have cats.  Those pictures are identified through labels to train the model.

You can also train the model without labels, using Unsupervised Learning.  The data is not labeled ahead of time, so it generally takes longer for the program to learn the content of the data.

What if all data were labeled?

We have relational databases that store relational data: tables with columns with field types.  There are no descriptors to tell the viewer much of anything about that data, unless the table or column has a meaningful name.

What if every database table had a way to store attributes within the content of the data?  One way is to add a list of value / pair descriptors attached to the table, and expose them through Metadata.

For example, take a table called "InvoiceLineItem".  What are some meaningful descriptors in value / pair format?

Company | Fred's Deli and Tire Rotation
Creation Date | 01/01/2014
Table Description | Table contains data for invoice line items
Table Granularity | Table is at invoice Line Item grain level
Table Primary Key | InvoiceLineItemID
Table Primary Key Field Type | integer
Table Field Count | 22 fields
Foreign Key 1 Table Name | Invoice
Foreign Key 1 Name | InvoiceID
Foreign Key 1 Field Type | int
etc. | etc.
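
As a rough sketch of what that list might look like in a common format, here are the same descriptors serialized to JSON with Python.  The dict layout, the "table" and "attributes" keys and the to_metadata_json name are my own assumptions for illustration, not an existing database feature.

```python
import json

# Table-level descriptors copied from the list above.
# The dict layout is an illustrative assumption.
descriptors = {
    "Company": "Fred's Deli and Tire Rotation",
    "Creation Date": "01/01/2014",
    "Table Description": "Table contains data for invoice line items",
    "Table Granularity": "Table is at invoice Line Item grain level",
    "Table Primary Key": "InvoiceLineItemID",
    "Table Primary Key Field Type": "integer",
    "Table Field Count": "22 fields",
    "Foreign Key 1 Table Name": "Invoice",
    "Foreign Key 1 Name": "InvoiceID",
    "Foreign Key 1 Field Type": "int",
}


def to_metadata_json(table_name, pairs):
    """Wrap the value / pair list in a JSON document keyed by table name."""
    return json.dumps({"table": table_name, "attributes": pairs}, indent=2)


# Table name as used in the SQL statement in this post.
payload = to_metadata_json("InvoiceLineItem", descriptors)
print(payload)
```

Any tool that can parse JSON could then read the descriptors back without knowing anything else about the table.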

Data gets added to the table, and is eventually extracted and sent to a reporting tool.  The reporting tool can interpret the value / pair data, as it's in a common format, perhaps XML or JSON or what have you, and dynamically join the InvoiceLineItem table to the Invoice table, which in turn joins to Customer and Address and Product on the fly.  The joins are based on the primary and foreign key info contained in the value / pair data, so there is no need to guess.
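
To make the idea of joining "on the fly" a bit more concrete, here is a minimal sketch of how a reporting tool might turn the foreign key descriptors into a join.  The function names are hypothetical, and it assumes the foreign key column has the same name in both tables, which the descriptors alone do not guarantee.

```python
def foreign_keys(attributes):
    """Collect (table, column) pairs from 'Foreign Key N ...' descriptors."""
    keys = []
    n = 1
    while f"Foreign Key {n} Table Name" in attributes:
        keys.append(
            (attributes[f"Foreign Key {n} Table Name"],
             attributes[f"Foreign Key {n} Name"])
        )
        n += 1
    return keys


def build_join_sql(table, attributes):
    """Build a SELECT that joins `table` to every table its foreign keys name.

    Assumes the foreign key column has the same name in both tables.
    """
    sql = f"SELECT * FROM {table}"
    for fk_table, fk_column in foreign_keys(attributes):
        sql += f" JOIN {fk_table} ON {table}.{fk_column} = {fk_table}.{fk_column}"
    return sql


# A subset of the descriptors from the list earlier in this post.
line_item_attributes = {
    "Table Primary Key": "InvoiceLineItemID",
    "Foreign Key 1 Table Name": "Invoice",
    "Foreign Key 1 Name": "InvoiceID",
}

join_sql = build_join_sql("InvoiceLineItem", line_item_attributes)
print(join_sql)
```

Repeating the same step with the Invoice table's own descriptors would be one way to walk the chain out to Customer, Address and Product.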

Basically you have a relational database that stores descriptive attributes about the data, self-contained in value / pair data and built directly into the database, combining a traditional database with NoSQL.

The data would be stored at the Table level, yet could be queried within a traditional SQL statement:

Select InvoiceLineItemID, ItemNumber, ItemAmount, ValuePair.TableDescription from dbo.InvoiceLineItem

In this example, the SQL statement pulls data directly from Value / Pair into the returned data set.
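
No database I know of supports the ValuePair.TableDescription syntax as written, so treat it as a wish.  A rough way to approximate it today is an ordinary side table holding the attributes, joined in at query time.  The sketch below uses Python's built-in sqlite3 module; the ValuePair table layout and the sample row (item 'A-100' at 19.95) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE InvoiceLineItem (
        InvoiceLineItemID INTEGER PRIMARY KEY,
        ItemNumber TEXT,
        ItemAmount REAL
    );
    -- Hypothetical side table standing in for a built-in value / pair store.
    CREATE TABLE ValuePair (
        TableName TEXT,
        AttributeName TEXT,
        AttributeValue TEXT
    );
    -- Sample data, invented for this sketch.
    INSERT INTO InvoiceLineItem VALUES (1, 'A-100', 19.95);
    INSERT INTO ValuePair VALUES
        ('InvoiceLineItem', 'TableDescription',
         'Table contains data for invoice line items');
""")

# Pull the table-level description into every returned row.
rows = conn.execute("""
    SELECT li.InvoiceLineItemID, li.ItemNumber, li.ItemAmount,
           vp.AttributeValue AS TableDescription
    FROM InvoiceLineItem AS li
    LEFT JOIN ValuePair AS vp
           ON vp.TableName = 'InvoiceLineItem'
          AND vp.AttributeName = 'TableDescription'
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

The LEFT JOIN keeps the line item rows even when a table has no description stored yet.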

And when data gets ported to other systems and exposed for reporting, the Metadata gets sent along too, so the reporting tool can interpret the Value / Pair data.  The reporting tool would know what attributes are available by reading the Metadata from Value / Pair without user intervention.

Take this a step further, with the machine learning example from earlier: if all data had value / pair attributes stored at time of creation, much more machine learning could become Supervised.  That could mean more accurate models, trained faster, with a higher degree of confidence.
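
Here is one small sketch of how that might work, using the cat pictures from earlier.  It assumes a "Label Column" descriptor in the value / pair data that names which column holds the label; that descriptor, the HasCat column and the sample rows are all hypothetical.

```python
# Hypothetical payload: rows plus value / pair attributes. The "Label Column"
# descriptor is an assumption made for this sketch.
dataset = {
    "attributes": {
        "Table Description": "Pictures, some with cats",
        "Label Column": "HasCat",
    },
    "rows": [
        {"PictureID": 1, "Width": 640, "Height": 480, "HasCat": True},
        {"PictureID": 2, "Width": 800, "Height": 600, "HasCat": False},
    ],
}


def split_features_and_labels(data):
    """Use the 'Label Column' attribute to separate features from labels.

    Returns (features, labels). labels is None when no label column is
    declared, which would leave the data set for unsupervised learning.
    """
    label_column = data["attributes"].get("Label Column")
    if label_column is None:
        return data["rows"], None
    features = [
        {k: v for k, v in row.items() if k != label_column}
        for row in data["rows"]
    ]
    labels = [row[label_column] for row in data["rows"]]
    return features, labels


features, labels = split_features_and_labels(dataset)
print(labels)
```

The point is only that a program could find the labels by reading the attributes, rather than a person pointing them out.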

Take it a step further: you could have jobs that scrape the internet, have the application read in the data files on the fly, process them, train a model, interpret the results and then take some action.  Imagine all sporting events opened their data sets online, with rows of data, with metadata and value / pair attributes, in real time.  Jobs would continuously scan for updates 24 / 7, without human intervention, and update accordingly.

Now add weather data, or stock information, or blockchain data, or commodity prices or exchange rates, or crime data or traffic data, the list is endless.

Tag the data during creation, append it to the value / pair data attached to the database, add a Metadata layer for reporting tools to identify the data set, similar to WSDL, and have machines read and process the data unassisted.

That could speed up the artificial intelligence field many times over.  It might also remove some of the human intervention needed to prep the data, and give us a lot more meaningful data.

What do you think?  Is this possible?