Is Your Data Correct? – Applying Statistics to Identify Outliers

I recently had the pleasure of working with a very talented group of professionals on a very challenging project. Everything during the project went so smoothly we had the luxury of adding an additional layer of icing on the cake in the form of a data validation layer unlike anything I had ever implemented or seen in the field. One piece of that layer was in developing a process based on Kimball’s Design Tip #10: Is Your Data Correct? (http://www.kimballgroup.com/html/designtips.html) It would be worth your while to take a minute and read this design tip. It’s only two pages long.

To summarize, the design tip describes a process in which outliers are identified after a daily ETL load completes for further investigation by the data steward(s) of the data warehouse. This is done using a query that finds entries in the fact table that are outside of three standard deviations of the mean. In a normal distribution, 99% of the data falls above or below that threshold. This requires us to look at the entire time history of data and can be very expensive to compute.

In an effort to improve the efficiency of this process a table is used to track three accumulating numbers to aid in the calculation of the formula: number of days, the sum of a metric, and sum square of the metric. In the example below I am tracking these three values using sales by store for each day from the Contoso Retail sample database. (http://www.microsoft.com/en-us/download/details.aspx?id=18279)  The query can be used in an SSRS report set up with a data driven subscription to automatically email the interested parties when outliers occur.

The example in the design tip uses stores and departments. Contoso doesn’t have departments and my initial attempts of just using store and day of the week were not enough for me to drill down into the detail of what caused the outlier exception. I decided to use three components to slice up the total sales amount instead of two: Store, day of the week, and Product Category Name.

The following code below populates the OutliersFactSales table. It took approximately six hours to run on a quad core laptop with 8 Gig of RAM. Thank goodness this part only needs to be run once to build the history. Subsequent runs would be for only the new days added to the data warehouse with each incremental load. The end result is a table with about 17,000 rows: one row for every store for every day of the week (Sunday – Saturday) for every Product Category. The size of the table will not grow unless new product categories or stores are added. I’ve included a script to insert the data into a table so that you don’t have to run the code yourself.  Click here to download.

--SQL SNIPPET - KIMBALL OUTLIER METHOD

/* 
--COMMENTS START
create nonclustered index ix_MMFactSales on factSales
(DateKey,StoreKey,ProductKey,SalesKey);
drop table dbo.OutliersFactSales
truncate table dbo.OutliersFactSales
CREATE THE Outlier TABLE AND SEED WITH FIRST DATE IN FACT TABLE
SELECT 
		 StoreKey
		,p3.ProductCategoryKey
		,DayOfTheWeek		= DATEPART(weekday,dateKey)
		,NumberOfDays		= 1
		,SUMMetric			= SUM(SalesAmount)
		,SUMMetricSquared	= SQUARE(SUM(SalesAmount))
  INTO [ContosoRetailDW].dbo.OutliersFactSales
  FROM [ContosoRetailDW].[dbo].[FactSales] f
  JOIN [ContosoRetailDW].[dbo].DimProduct p on p.ProductKey = f.ProductKey
  JOIN [ContosoRetailDW].[dbo].DimProductSubcategory p2 on p2.ProductSubcategoryKey = p.ProductSubcategoryKey
  JOIN [ContosoRetailDW].[dbo].DimProductCategory p3 on p3.ProductCategoryKey = p2.ProductCategoryKey
  WHERE DateKey = (SELECT MIN(DATEKEY) FROM [dbo].[FactSales])			--1/1/2007
  GROUP BY StoreKey,p3.ProductCategoryKey,datekey;													
CREATE TABLE [dbo].[OutliersFactSales] (
	[StoreKey] [int] NOT NULL
	,[ProductCategoryKey] [int] NOT NULL
	,[DayOfTheWeek] [int] NOT NULL
	,[NumberOfDays] [int] NOT NULL
	,[SUMMetric] [money] NULL
	,[SUMMetricSquared] [float] NULL
	,CONSTRAINT [PK_OutliersFactSales] PRIMARY KEY CLUSTERED (
		[StoreKey] ASC
		,[ProductCategoryKey] ASC
		,[DayOfTheWeek] ASC
		) WITH (
		PAD_INDEX = OFF
		,STATISTICS_NORECOMPUTE = OFF
		,IGNORE_DUP_KEY = OFF
		,ALLOW_ROW_LOCKS = ON
		,ALLOW_PAGE_LOCKS = ON
		) ON [PRIMARY]
	) ON [PRIMARY]

--COMMENTS END
*/

DECLARE @DateToRun INT
	,@EndingDate INT;

--get starting date
SELECT @DateToRun = CAST(CONVERT(VARCHAR, MIN(DateKey), 112) AS INT)
FROM DBO.FactSales;

SELECT @EndingDate = CAST(CONVERT(VARCHAR, MAX(DateKey), 112) AS INT)
FROM DBO.FactSales;

SET @EndingDate = @EndingDate - 1;

WHILE @DateToRun <= @EndingDate
BEGIN
	-- update current records
	UPDATE a
	SET NumberOfDays = NumberOfDays + 1
		,SumMetric = a.sumMetric + m.sumMetric
		,SumMetricSquared = a.SumMetricSquared + m.SumMetricSquared
	FROM dbo.OutliersFactSales a
	INNER JOIN (
		SELECT StoreKey
			,p3.ProductCategoryKey
			,DayOfTheWeek = DATEPART(weekday, dateKey)
			,SUMMetric = SUM(SalesAmount)
			,SUMMetricSquared = SQUARE(SUM(SalesAmount))
		FROM [ContosoRetailDW].[dbo].[FactSales] f
		INNER JOIN [dbo].DimProduct p ON p.ProductKey = f.ProductKey
		INNER JOIN [dbo].DimProductSubcategory p2 ON p2.ProductSubcategoryKey = p.ProductSubcategoryKey
		INNER JOIN [dbo].DimProductCategory p3 ON p3.ProductCategoryKey = p2.ProductCategoryKey
		WHERE CAST(CONVERT(VARCHAR, DateKey, 112) AS INT) = @DateToRun --1/1/2007
		GROUP BY StoreKey
			,p3.ProductCategoryKey
			,datekey
		) m ON m.StoreKey = a.StoreKey
		AND m.DayOfTheWeek = a.DayOfTheWeek
		AND m.ProductCategoryKey = a.ProductCategoryKey;

	--add new records
	INSERT INTO dbo.OutliersFactSales
	SELECT StoreKey
		,p3.ProductCategoryKey
		,DayOfTheWeek = DATEPART(weekday, dateKey)
		,NumberOfDays = 1
		,SUMMetric = SUM(SalesAmount)
		,SUMMetricSquared = SQUARE(SUM(SalesAmount))
	FROM [ContosoRetailDW].[dbo].[FactSales] f
	INNER JOIN [ContosoRetailDW].[dbo].DimProduct p ON p.ProductKey = f.ProductKey
	INNER JOIN [ContosoRetailDW].[dbo].DimProductSubcategory p2 ON p2.ProductSubcategoryKey = p.ProductSubcategoryKey
	INNER JOIN [ContosoRetailDW].[dbo].DimProductCategory p3 ON p3.ProductCategoryKey = p2.ProductCategoryKey
	WHERE CAST(CONVERT(VARCHAR, DateKey, 112) AS INT) = @DateToRun --1/1/2007
		AND cast(StoreKey AS VARCHAR(10)) + ' ' + cast(DATEPART(weekday, dateKey) AS CHAR(1)) + ' ' + cast(p3.ProductCategoryKey AS VARCHAR(10)) NOT IN (
			SELECT cast(StoreKey AS VARCHAR(10)) + ' ' + cast(dayOfTheWeek AS CHAR(1)) + ' ' + cast(ProductCategoryKey AS VARCHAR(10))
			FROM dbo.OutliersFactSales
			)
	GROUP BY StoreKey
		,p3.ProductCategoryKey
		,DateKey;

	-- increment date
	SET @DateToRun = CONVERT(INT, CONVERT(VARCHAR, DATEADD(dd, 1, CONVERT(DATE, convert(VARCHAR, @DateToRun))), 112))

	PRINT @DateToRun
END

--- six hours later...

The next block of code is what you would base your SSRS report on. It returns the store, day of the week, product categories, and metrics for a specific date in the data warehouse for outliers of the Sales Amount metric. These results could be further refined to only include outliers that meet a certain threshold. This report could be set up with a regular or data driven subscription so that the outliers can be analyzed after each refresh.

--execute and view outliers
--select * from drop table dbo.OutliersFactSales
SELECT f.StoreKey
	,p3.ProductCategoryName
	,dayoftheweek
	,f2.NumberOfDays
	,f2.sumMetric
	,f2.SumMetricSquared
	,sumMetricAdding = SUM(SalesAmount)
	,SumMetricSquaredAdding = square(sum(SalesAmount)) --(sum(fueladded)*SUM(FuelAdded))
	,leftside = ABS(sum(SalesAmount) - (1 / f2.NumberOfDays) * f2.sumMetric)
	,rightside = 3.0 * SQRT((1.0 / (f2.NumberOfDays - 1)) * (f2.SumMetricSquared - (1.0 / f2.NumberOfDays) * (SQUARE(SUMMetric)))) --summetric*summetric))
	,LeftRightDifference = (ABS(sum(SalesAmount) - (1 / f2.NumberOfDays) * f2.sumMetric)) - (3.0 * SQRT((1.0 / (f2.NumberOfDays - 1)) * (f2.SumMetricSquared - (1.0 / f2.NumberOfDays) * (SQUARE(SUMMetric))))) --summetric*summetric))
	,AvgSalesByDay = f2.sumMetric / f2.NumberOfDays
	,Mean = SQRT((1.0 / (f2.NumberOfDays - 1.0)) * (f2.SumMetricSquared - (1.0 / f2.NumberOfDays) * (SQUARE(SUMMetric))))
--,STDEV(f2.sumMetric)
FROM [ContosoRetailDW].[dbo].[FactSales] f
INNER JOIN [ContosoRetailDW].[dbo].DimProduct p ON p.ProductKey = f.ProductKey
INNER JOIN [ContosoRetailDW].[dbo].DimProductSubcategory p2 ON p2.ProductSubcategoryKey = p.ProductSubcategoryKey
INNER JOIN [ContosoRetailDW].[dbo].DimProductCategory p3 ON p3.ProductCategoryKey = p2.ProductCategoryKey
INNER JOIN dbo.OutliersFactSales f2 ON f2.StoreKey = f.StoreKey
	AND f2.dayoftheweek = DATEPART(weekday, dateKey)
	AND f2.ProductCategoryKey = p3.ProductCategoryKey
WHERE CAST(CONVERT(VARCHAR, f.DateKey, 112) AS INT) = 20090329
	AND f2.NumberOfDays > 1
GROUP BY f.storekey
	,p3.ProductCategoryName
	,dayoftheweek
	,f2.NumberOfDays
	,f2.sumMetric
	,f2.SumMetricSquared
HAVING ABS(SUM(SalesAmount) - (1 / f2.NumberOfDays) * f2.sumMetric) > 3.0 * SQRT((1.0 / (f2.NumberOfDays - 1)) * (f2.SumMetricSquared - (1.0 / f2.NumberOfDays) * (SQUARE(SUMMetric)))) --SUMMetric*SUMMetric));
ORDER BY LeftRightDifference DESC;

These are the outliers for Sunday, March 29th 2009 sorted in descending order by the difference between the total sales and three standard deviations from the mean.

clip_image002

Let’s look at the first row. Store 307, Cameras and camcorders, Sunday there are 156 days of sample data in our OutlierFactSales table for this Store/Day of Week/Product Category. The SumMetric is the total sales for those 156 days and the SumMetricSquared is the accumulating square of those sums. The SumMetricAdding column is the amount that was added to SumMetric for this days sales and likewise for SumMetricSquaredAdding. The leftside column is the left portion of our HAVING clause (which incidentally is the same as the SumMetricAdding column.) The right side is the value that is three standard deviations from the mean. The filter in the having clause states that the left side must be greater than the right side for the record to be returned. The LeftRightDifference is the left side minus the right side. The AverageSalesByDay divides SumMetric by Number of Days.

If we look at the record at the top of the results we see the row in which the difference between the left side and right side is the greatest. This is for store 307, Cameras and camcorders. Total Sales for that Sunday were $235,261 and the average Sunday sales for cameras and camcorders for that store are only $160,040.

The query below shows the details for that record. (Careful of the embedded space after Product Category Name! LTRIM(RTRIM(‘anyone? ‘))

select * from dbo.FactSales f
JOIN [ContosoRetailDW].[dbo].DimProduct p on p.ProductKey = f.ProductKey
JOIN [dbo].DimProductSubcategory p2 on p2.ProductSubcategoryKey = p.ProductSubcategoryKey
JOIN [dbo].DimProductCategory p3 on p3.ProductCategoryKey = p2.ProductCategoryKey
JOIN dbo.DimPromotion x on f.PromotionKey = x.PromotionKey
WHERE CAST(CONVERT(VARCHAR,f.DateKey,112) AS INT) = 20090329
and StoreKey = 306 and ProductCategoryName = 'Cameras and camcorders '
order by SalesAmount Desc;

This query shows us some pretty big orders. They sold 108 units of cameras for a total of almost $35,000. Next on the list are 18 HD TVs for a total of $17,000. It looks like the Asian Spring Promotion really brought some business in. Looking at store 306 for the same product category we see they sold some large quantities of camcorders and cameras for their European Spring Promotion.

Next, let’s add an outlier and see if our query picks it up! I chose store 5, Computers because that store/category wasn’t returned by our original query. This statement changes the sales amount for a record from $2,208.80 to $22,088.00.

update f
set SalesAmount = 22088 --2208.8
from dbo.FactSales f
JOIN [ContosoRetailDW].[dbo].DimProduct p on p.ProductKey = f.ProductKey
JOIN [dbo].DimProductSubcategory p2 on p2.ProductSubcategoryKey = p.ProductSubcategoryKey
JOIN [dbo].DimProductCategory p3 on p3.ProductCategoryKey = p2.ProductCategoryKey
JOIN dbo.DimPromotion x on f.PromotionKey = x.PromotionKey
WHERE CAST(CONVERT(VARCHAR,f.DateKey,112) AS INT) = 20090329
and StoreKey = 5 and ProductCategoryName = 'Computers';

Running our query again… nothing happened! I updated the record again and added a zero, re-ran our query and it picked it up. Looking at the record we see why $22,088 wasn’t enough to trigger a response. Three standard deviations from the mean is $28,945.

I think this is a really cool and fairly simple way to add additional value to the data warehouse. I especially want to thank Cory Anderson, a member of my last project team; he went the extra mile helping me polish the data validation layer for our solution. I should also thank Ralph Kimball for publishing his design tips for all to share. Helping identify bad or suspect data in the source system once again makes the BI team company heroes.

UPDATE: 9/3/2013

I got this to work using DMX!

CREATE MINING STRUCTURE BrandSales
(
	brand_partner_id LONG KEY,
	taxonomy_level_1 TEXT DISCRETE,
	day_of_week TEXT DISCRETE,
	net_sales LONG CONTINUOUS
);

GO

ALTER MINING STRUCTURE BrandSales
ADD MINING MODEL BrandSalesClusters
(
	brand_partner_id,
	taxonomy_level_1,
	day_of_week,
	net_sales
) USING Microsoft_Clustering WITH DRILLTHROUGH;

GO

INSERT INTO MINING STRUCTURE BrandSales
(
	brand_partner_id,
	taxonomy_level_1,
	day_of_week,
	net_sales
)
OPENQUERY(IBI_DW,'SELECT brand_partner_id = c.customer_ship_to_brand_partner_id
		,taxonomy_level_1 = t.taxonomy_level_1_id
		,day_of_week = DATEPART(weekday, d.[date])
		, net_sales = SUM(net_amt)
	FROM sales.fact_sales f
	INNER JOIN common.dim_taxonomy t
		ON t.taxonomy_id = f.taxonomy_id
	INNER JOIN COMMON.dim_customer_current c
		ON c.customer_ship_to_id = f.customer_ship_to_id
			AND C.customer_id = F.CUSTOMER_ID
	INNER JOIN common.Dim_Date d
		ON d.Date_Skey = f.invoice_date_skey
	WHERE d.[date] >= ''1/1/2013''
		AND C.customer_ship_to_brand_partner_id <> 0
		AND NOT (
			(
				f.source_header_id = 1049106
				AND f.is_credit = 1
				)
			OR (
				f.source_header_id = 19765302
				AND f.is_credit = 0
				)
			)
	GROUP BY c.customer_ship_to_brand_partner_id
		,t.taxonomy_level_1_id
		,d.[date]')

SELECT FLATTENED
	t.brand_partner_id,
	t.taxonomy_level_1,
	t.day_of_week,
	t.net_sales,
	PredictCaseLikelihood() AS CaseLikelihood
FROM BrandSalesClusters
PREDICTION JOIN
	OPENQUERY(IBI_DW,'SELECT brand_partner_id = c.customer_ship_to_brand_partner_id
		,taxonomy_level_1 = t.taxonomy_level_1_id
		,day_of_week = DATEPART(weekday, d.[date])
		, net_sales = SUM(net_amt)
	FROM sales.fact_sales f
	INNER JOIN common.dim_taxonomy t
		ON t.taxonomy_id = f.taxonomy_id
	INNER JOIN COMMON.dim_customer_current c
		ON c.customer_ship_to_id = f.customer_ship_to_id
			AND C.customer_id = F.CUSTOMER_ID
	INNER JOIN common.Dim_Date d
		ON d.Date_Skey = f.invoice_date_skey
	WHERE d.[date] = ''8/24/2013''
		AND C.customer_ship_to_brand_partner_id <> 0
	GROUP BY c.customer_ship_to_brand_partner_id
		,t.taxonomy_level_1_id
		,d.[date]') AS t
	ON BrandSalesClusters.brand_partner_id = t.brand_partner_id
	AND BrandSalesClusters.taxonomy_level_1 = t.taxonomy_level_1
	AND BrandSalesClusters.day_of_week = t.day_of_week
	AND BrandSalesClusters.net_sales = t.net_sales
	//WHERE PredictCaseLikelihood()  = 0;

//3/6/2013
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s