I went to the Tampa SQL Server BI User Group last night. Great presentation on DQS (Data Quality Services).
It's a Microsoft product, bundled into the SQL Server install along with SSRS and MDS. It installs a client on the server along with three databases. Essentially you create a Knowledge Base repository, where you customize rules for each field you wish to track. Then a power user can run a project, point it at the Knowledge Base, and it will find the issues and log them somewhere for someone to go back and correct, or to base a decision on how to handle them. Stuff like duplicate data, missing data, misspellings, synonyms, etc.
You can also call DQS from SSIS components. I asked the speaker a question: in SSIS, as the data flows through, when it finds data that needs a human pair of eyes, does it stop the flow of data? Apparently not. The data gets merged back into the ODS after it's cleaned up, and SSIS picks it up from there to move to the data warehouse. Another approach would be to use a staging database, which is how we do it.
I asked a question about that: is there a way to flag the data as it flows in from the source? You don't want to send the same records up to Azure for validation every time the ETL runs; that seems like a waste of money. He suggested a technique: create a hash based on the concatenation of the data, and check it before sending to Azure to see whether the data changed. Someone in the audience asked if CHECKSUM could be used. Sometimes, but it's not as flexible as another option, which handles upper- and lower-case changes as well as increased field lengths.
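Here's a minimal sketch of that change-detection idea, in Python rather than T-SQL (on SQL Server the more flexible option would likely be HASHBYTES, though the speaker didn't name it above); the function and variable names are illustrative, not anything DQS ships with:

```python
import hashlib

def row_fingerprint(*fields):
    """Hash the concatenation of a row's fields.

    A delimiter is inserted between fields so that ('ab', 'c') and
    ('a', 'bc') don't produce the same input. Because we hash the raw
    bytes, ANY change -- including a case change like John -> JOHN --
    yields a different fingerprint, unlike a collation-aware checksum.
    """
    joined = "|".join(str(f) for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Last fingerprint sent per business key (in practice this would be a
# column persisted in the staging table, not an in-memory dict).
sent = {}

def needs_validation(key, *fields):
    """Return True only if this row changed since the last ETL run."""
    fp = row_fingerprint(*fields)
    if sent.get(key) == fp:
        return False      # unchanged: skip the paid Azure validation call
    sent[key] = fp        # remember the new fingerprint for next run
    return True
```

On the first run every row comes back True and gets sent for cleansing; on the next run only rows whose concatenated field values changed do, which is exactly the cost-saving the question was about.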
The speaker also showed us quite a bit on how to leverage Microsoft Azure to pull in data from Melissa Data. Seems kind of cool and easy to use, with some costs associated with it.
The entire process needs sign-off, which allows the data steward to be part of the life cycle of the data. That's typically found in a Master Data Management shop, with a board of people who make key decisions on how to interpret and handle data.
In thinking about data quality, I often wonder if it will trickle into self-service data quality, as in Power BI. It would make sense that many shops have similar data patterns, and perhaps there's a way to re-use some of the knowledge bases in the cloud, or a way to import them into your Power BI notebook or something.
Because we all know that having accurate reports is key. And with the rise of self-service, it would be nice if those reports were as correct as IT's enterprise-level reports that use DQS.
Just a thought.
And there you have it~!