When you work with data, there are many facets to the job.
First you source the data, apply business rules, move the data to staging, then to the Enterprise Data Warehouse, then into the cube, add some more logic, and the data is ready for consumption.
And guess what, sometimes the end numbers do not match the expected results.
And that's when the fun begins. I actually enjoy this part of the coding.
Basically, you run a query against the raw data, showing the details.
Then you run a similar query against the data warehouse data.
Then you add them to a single Excel worksheet, sort each data set according so the columns match up.
Then add a field and subtract field 1 from data set 1 and field 1 from data set 2.
In most cases, the difference will result to zero. However, not always.
And the records that don't match, usually it skipped a record on either side, I push the data set down by one row, as to keep the data consistent going forward, then highlight the bad row and continue until there is no more data.
At that point, you sum up the differences and sum up the totals of each data set, and more often than not, the difference between the two data-set totals equals the sum of the row-level differences.
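That checksum logic is easy to sketch in a few lines of Python. The numbers below are invented for illustration, but the identity holds for any aligned data sets: the difference between the two totals equals the sum of the row-level differences, and any non-zero row is a culprit.

```python
# Made-up field values pulled from the two queries; in practice these
# would come from the raw-data query and the warehouse query.
raw = [100.0, 250.0, 75.0, 300.0]        # field 1, raw data
warehouse = [100.0, 250.0, 80.0, 300.0]  # field 1, warehouse data

# Row-by-row differences, the same as the subtraction column in Excel.
diffs = [r - w for r, w in zip(raw, warehouse)]

# The checksum: difference of totals equals sum of differences.
assert sum(raw) - sum(warehouse) == sum(diffs)

# Non-zero rows are the records causing the glitch.
bad_rows = [i for i, d in enumerate(diffs) if d != 0]
print(bad_rows)  # → [2]
```

If the assertion fails, the rows are no longer aligned, which is exactly the "pushed down by one row" situation described above.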
And what you have are the exact records causing the glitch.
At that point, you can continue the exercise at each point along the data trail to pinpoint where the data diverges.
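If you'd rather script the comparison than do it by hand in Excel, a key-based match is one alternative to shifting rows, assuming each record carries a unique key. This is a hedged sketch, not the exact process above; the keys and values are invented for illustration.

```python
# Toy data sets keyed by a (hypothetical) unique record ID.
raw = {"A1": 100.0, "A2": 250.0, "A3": 75.0, "A4": 300.0}
dwh = {"A1": 100.0, "A2": 250.0, "A4": 300.0, "A5": 50.0}

# Records skipped on the warehouse side (present in raw only).
missing_in_dwh = sorted(raw.keys() - dwh.keys())
# Records that appeared only downstream (present in warehouse only).
missing_in_raw = sorted(dwh.keys() - raw.keys())
# Records present on both sides whose values diverge.
mismatched = sorted(k for k in raw.keys() & dwh.keys() if raw[k] != dwh[k])

print("missing in DWH:", missing_in_dwh)   # → ['A3']
print("missing in raw:", missing_in_raw)   # → ['A5']
print("value mismatches:", mismatched)     # → []
```

Running the same comparison at each hop along the data trail (raw, staging, warehouse, cube) pinpoints the stage where the records first diverge.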
By following this methodology, the errors leap out of the Excel file, telling you the specific records causing the issue. Which leads to more questions: what happened to the missing data? Was it a timing issue? Is there business logic further upstream that you are unaware of? Lots of questions.
At that point, if there's no resolution, I typically go back to the client for a discussion: I walk them through my analysis, ask questions about the source data, and find out when the two data sets were run, to see whether it's a timing issue, a business logic change, etc.
I've been using this approach ever since my first reports way back when. It's a proven technique; however, it does take time and patience, knowing where to look and what questions to ask, being very methodical in the approach, and keeping copies of the interrogation queries for later use.
So there you have it, troubleshooting data sets 101.