1/01/2017

Temple of Babel, a Case for Open Data

How many active languages are there in the world?


Apparently, the all-knowing crystal ball says "roughly 6,500".  Who knows every single language?  I surely don't.  English is my main language.  I tried learning Hebrew as a child, no luck.  Took three years of high school Spanish and two more in college.  ¿Dónde está el baño?

Sure makes it tough to communicate clearly when nobody knows what everyone else is saying.

Fear not!  We have technology that translates in real time, both spoken and written.

What about computer languages?  Why so many?  Iterations over time.  New features.  More precision.  Less memory usage.  Access to kernel-level functionality.  Tons of reasons.

Within each language, are there mandatory design patterns?  Syntax, yes; patterns, no.

That's why most people prefer to write code, not maintain someone else's.  What in the heck was this person trying to do in all this spaghetti code?  Who knows.

What about the world of data?  Are we mandated to use specific patterns for naming servers, databases, tables, schemas?  How about variable types, precision, international formatting?  It's a free-for-all.

How about the actual data?  We have free-form text boxes on the front end that accept any characters: hex, carriage returns, what have you.

What about address cleansing?  Do we mandate specific patterns?  If you've worked with address data for any length of time, you'll find out quickly how even the spelling of a city can vary drastically, let alone Rt. vs. Route, Street vs. St., or MLK vs. Martin Luther King.  It's maddening.
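To make that concrete, here's a minimal Python sketch of the kind of one-off cleanup every team ends up writing for itself.  The abbreviation map is purely illustrative, not any standard.

    # A minimal sketch of ad-hoc address normalization -- the kind of
    # one-off cleanup every team reinvents.  The abbreviation map is
    # illustrative, not a standard.
    ABBREVIATIONS = {
        "st": "street",
        "rt": "route",
        "mlk": "martin luther king",
    }

    def normalize_address(raw: str) -> str:
        # Lowercase, strip periods, and expand known abbreviations.
        words = raw.lower().replace(".", "").split()
        return " ".join(ABBREVIATIONS.get(w, w) for w in words)

    print(normalize_address("123 MLK St."))  # 123 martin luther king street
    print(normalize_address("45 Rt. 9"))     # 45 route 9

Every shop writes a slightly different version of this, which is exactly the problem.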

What we've basically done is create a Temple of Babel, such that almost no two databases are alike.  How in the world can we unite datasets seamlessly when nobody follows the same patterns?

If anything, I would propose prefixing public data sets with a five-part naming convention: Organization : Server : Database : Schema : Table.  Then open up the database to remote, real-time querying, either through real-time JOINs across disjointed networks or through API calls or Web Services.
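Something like this, where the organizations and table names are made up for illustration:

    # Hypothetical five-part names for public data sets:
    #   Organization:Server:Database:Schema:Table
    datasets = [
        "Census:srv01:Demographics:dbo:Population2016",
        "NOAA:weather1:Climate:public:DailyTemps",
    ]

    def parse_dataset_name(name: str) -> dict:
        # Split a five-part name into its labeled components.
        org, server, database, schema, table = name.split(":")
        return {"organization": org, "server": server,
                "database": database, "schema": schema, "table": table}

    print(parse_dataset_name(datasets[0]))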

Similar to Web Services, you'd provide a data version of the WSDL, so people know what data you are exposing: the five-part name, column definitions, sample data, guest access with generic credentials, and so on.
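A rough sketch of what such a descriptor could contain.  None of these field names are an existing standard; they're just one guess at the shape:

    # An invented example of a "data WSDL" -- a descriptor a publisher
    # might expose so consumers know what's available.  All field names
    # and the endpoint here are assumptions, not an existing standard.
    descriptor = {
        "name": "Census:srv01:Demographics:dbo:Population2016",
        "columns": [
            {"name": "zip_code", "type": "char(5)"},
            {"name": "population", "type": "int"},
        ],
        "sample_rows": [
            {"zip_code": "30301", "population": 48215},
        ],
        "access": {"endpoint": "https://data.example.org/query",
                   "credentials": "guest/guest"},
    }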

Once orgs start opening up their data, we would no longer need to download CSV files, save them to disk, ingest them into a database, and hunt for a matching field to join on.  Simply make the call, JOIN your data set to theirs, run the query, and get results in real time.
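In code, that might look something like the sketch below, assuming the publisher exposes a simple JSON query endpoint.  The URL, table, and columns are hypothetical:

    import urllib.request, json, sqlite3

    # Hypothetical endpoint exposing the remote table as JSON rows.
    URL = "https://data.example.org/query?table=Population2016"

    with urllib.request.urlopen(URL) as resp:
        remote_rows = json.load(resp)  # e.g. [{"zip_code": "30301", "population": 48215}, ...]

    # Stage the remote rows in an in-memory table and JOIN locally.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE remote_pop (zip_code TEXT, population INT)")
    con.executemany("INSERT INTO remote_pop VALUES (:zip_code, :population)", remote_rows)
    con.execute("CREATE TABLE customers (name TEXT, zip_code TEXT)")
    con.execute("INSERT INTO customers VALUES ('Acme', '30301')")

    query = """SELECT c.name, r.population
               FROM customers c
               JOIN remote_pop r ON r.zip_code = c.zip_code"""
    for row in con.execute(query):
        print(row)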

But why not automate that?  Keep a centralized list of all public data sets, have your computer scan the list in real time through code, and have it query the data without your assistance.  Then feed that data into machine learning, artificial intelligence, or the Internet of Things, and have the entire thing run without the assistance of a human.  Have it create the reports, find the insights, and build visualizations viewable on smartphones or tablets, or generate emails.
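A hedged sketch of that loop, with a hypothetical catalog URL and a placeholder analysis step standing in for the ML / reporting / visualization piece:

    import urllib.request, json

    # Hypothetical central catalog listing public data set endpoints.
    CATALOG_URL = "https://catalog.example.org/datasets.json"

    def fetch_json(url: str):
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    def analyze(rows):
        # Stand-in for real insight generation: reports, visualizations, emails.
        print(f"received {len(rows)} rows")

    def run_unattended():
        # Scan the catalog, query each data set, and hand the rows to
        # downstream analysis -- no human in the loop.
        for entry in fetch_json(CATALOG_URL):
            analyze(fetch_json(entry["endpoint"]))

    if __name__ == "__main__":
        run_unattended()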

Have a new data set?  Add it to the list, and soon people will be reading your data remotely, assuming you allow it.

We've got tons of data.  That data needs to be read, interrogated, aggregated, and interpreted, then combined with your data to find those nuggets of insight.  Without common frameworks, we can never unite the data into a pool of world data.  Why not expedite the process: form some standards, define a means to expose the data, keep a list of available data sets, and let the computers churn through it 24/7?

We need to tear down the data version of the Temple of Babel.  And build a platform for real Open Data.
