12/31/2013

Data Warehousing 101

To work with a Data Warehouse there are many aspects to consider.

First off, where is the source data from, what database, schema, what tables are there, and what information are you trying to bring in.

And what other data sources are available.

How can all the data sets be combined to form a logical set of data.

Once your data sources are defined, how can the data be modeled.  Are there people, things, events, time/dates, what can be measured, and how would the users like to slice it.

And you begin to build your fact tables, your sums, counts, averages, mins and maxes, often referred to as "verbs" or action.

And you design your dimension tables, typically your "nouns".

And you will need to connect your Fact tables to your Dim tables, by using keys, surrogate SK which are unique identifier integers typically and your AK fields which point back to the unique identifier of the original data.

And the data warehouses I've been involved with, around 10 - 20 tables, they form a "star schema" in that the fact tables remain in the center and the dimension tables surround it, joined by SK fields.

However there's an interim step, in that the source data must be brought into the ecosystem by use of Staging tables.  These tables don't have SK fields, they usually have a unique identifier knows as the AK field.  The raw data is brought in through ETL, Extract Transform and Loan.

From stage, the ETL process determines if the record already exists in the data warehouse, if so, it performs an Update, if not it does an Insert, sometimes using the MERGE statement, depending on data volume, as it doesn't scale well past 10,000 records in SQL Server.

So the flow is an ETL process which pulls the data from the source(s), staged in interim tables, where business rules can be applied, and then brought into the actual data warehouse where the Fact tables join to the Dim tables using Surrogate keys.

From there, they can be ported into OLAP cubes, for reporting and dashboards, written in a language called MDX.

I've been doing data warehousing for a few years now, and every day I learn something new.  There's definitely a pattern to it, which is repeatable, however, it seems to me every data warehouse is unique.

Lots of orgs could benefit by building a data warehouse, because when you have access to your data, you can discover patterns, look for trends, spot anomaly's, drill down to the details, have Key Performance Metrics, Charts, Graphs, etc.

Please contact if interested in more information!

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.