In a previous post, I described a data validation technique to identify statistical outliers. The pattern I'll describe in this post adds another layer of proactive data-quality sanity checks to existing solutions. The attached SSIS package and SQL code run against AdventureWorksDW2008 to get you started, but they can easily be modified to work in your environment.
This package is meant to execute after the nightly ETL process has completed. It loops through a database, collecting the names of all tables that start with Dim or Fact (modifiable to use any pattern), and stores the date, table name, and row count. This data can be used to track the rate of growth of each of your tables, and thresholds can be set so that the right people are notified when a metric falls over or under a certain amount.
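The row-count capture step can be sketched in T-SQL like this. The target table work.TableRowCounts comes from the attached code, but its column names (LogDate, TableName, [RowCount]) are assumptions here; adjust them to match your schema:

```sql
-- Snapshot today's row counts for every Dim* and Fact* table.
-- Assumes work.TableRowCounts(LogDate, TableName, [RowCount]) exists.
DECLARE @TableName sysname, @Sql nvarchar(max);

DECLARE table_cursor CURSOR FOR
    SELECT QUOTENAME(s.name) + '.' + QUOTENAME(t.name)
    FROM sys.tables t
    JOIN sys.schemas s ON s.schema_id = t.schema_id
    WHERE t.name LIKE 'Dim%' OR t.name LIKE 'Fact%';  -- modifiable pattern

OPEN table_cursor;
FETCH NEXT FROM table_cursor INTO @TableName;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Build and run a dynamic INSERT for each table found above.
    SET @Sql = N'INSERT INTO work.TableRowCounts (LogDate, TableName, [RowCount])
                 SELECT CAST(GETDATE() AS date), ''' + @TableName + ''', COUNT_BIG(*)
                 FROM ' + @TableName + ';';
    EXEC sp_executesql @Sql;
    FETCH NEXT FROM table_cursor INTO @TableName;
END
CLOSE table_cursor;
DEALLOCATE table_cursor;
```

In the package itself, the same loop is expressed with a Foreach Loop container and an Execute SQL Task, but the logic is equivalent.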
Second, the package loops through the fact tables, getting a list of all columns whose names contain “Key”. It then tracks, by day, the number of rows that contain the default unknown key value of zero. Many data warehouse design patterns use this key when a lookup match proves unfruitful and the dimension is keyed to the unknown member. Some shops use 19000101 as the default date key, so you may need to modify this logic slightly. Thresholds can be set so that action is taken when a particular key column has an inordinate number of zero-key values.
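The zero-key check can be sketched the same way. Here the target table work.FactTableZeroKeys is from the attached code, while its column names are assumptions; the WHERE clause is where you'd substitute 19000101 for date keys if your design uses that convention:

```sql
-- Count default (zero) key values in every *Key column of the fact tables.
-- Assumes work.FactTableZeroKeys(LogDate, TableName, ColumnName, ZeroKeyCount).
DECLARE @Sql nvarchar(max) = N'';

-- Concatenate one INSERT per key column into a single batch.
SELECT @Sql = @Sql +
    N'INSERT INTO work.FactTableZeroKeys (LogDate, TableName, ColumnName, ZeroKeyCount)
      SELECT CAST(GETDATE() AS date), ''' + t.name + ''', ''' + c.name + ''', COUNT_BIG(*)
      FROM ' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name) + '
      WHERE ' + QUOTENAME(c.name) + ' = 0;'
FROM sys.columns c
JOIN sys.tables t  ON t.object_id = c.object_id
JOIN sys.schemas s ON s.schema_id = t.schema_id
WHERE t.name LIKE 'Fact%'
  AND c.name LIKE '%Key%';   -- adjust for date keys that default to 19000101

EXEC sp_executesql @Sql;
```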
Trying it out… Download this ZIP file.
1. Execute the attached SQL code against the AdventureWorksDW2008 database.
2. Create a new integration services solution and import the attached SSIS package.
3. Execute the package.
4. Analyze work.FactTableZeroKeys and work.TableRowCounts.
Update: I have added SQL queries that report the growth rates of zero-key values, null columns, and table row counts.
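A growth-rate report over the captured snapshots can look something like the sketch below. Note that LAG requires SQL Server 2012 or later; on 2008 you would self-join each snapshot to the prior day's row instead:

```sql
-- Day-over-day row growth for each table, from the captured snapshots.
SELECT TableName,
       LogDate,
       [RowCount],
       [RowCount] - LAG([RowCount]) OVER (PARTITION BY TableName
                                          ORDER BY LogDate) AS RowsAdded
FROM work.TableRowCounts
ORDER BY TableName, LogDate;
```

The same PARTITION BY / LAG pattern works for work.FactTableZeroKeys, partitioning by table and column name instead.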