First Steps for Designing an Enterprise Data Warehouse

These are the steps I recommend taking when starting out on the fantastic journey of building a data warehouse. Whether you’ve been doing this a long time or this is your first trip, I hope this will provide a checklist of considerations and interesting discussions that should take place. Please provide comments and suggestions for improvements that I can include in future updates to this document.

First Steps

  • Choose a well documented and established approach.
    • Kimball – Dimensional Model, de-normalized. The most common approach. The enterprise data warehouse consists of many star schema models that are combined in the semantic layer through conformed dimensions.
    • Inmon – Information Factory, normalized. Includes an additional third normal form (3NF) staging area, intended to be inaccessible for reporting, which publishes data to many de-normalized star schema data marts that are combined in the semantic layer through conformed dimensions.
    • Data Vault – A hybrid, less-established approach that some consider to be the best of both 3NF and star schema.
  • Take a cookbook approach and actually read the cookbook.  Without following an established, well documented approach, there is little to hold data architects accountable or to ensure your efforts will be successful and maintainable.
  • Decide if the EDW will support Operational Reporting and/or Analytical Reporting (detailed, line level vs. bird’s eye, aggregated) and during which phases.
  • Decide if there will be an OLAP database sitting on top of the relational data warehouse and, if not, what, if any, aggregated fact tables are going to be needed.
  • If possible, understand how the information will be presented to the end user through the semantic layer. Examples include:
    • Excel pivot tables connected to OLAP database (Microsoft SSAS, TM1 for example)
    • Reports against OLAP databases and/or the EDW relational database (SSRS, Crystal Reports, Cognos, and many others)
    • Dashboards (Power BI, Sharepoint / PerformancePoint, Tableau, Cognos, Qlik, and many more)
  • Other important considerations
    • Data Scientists – Will a separate database be required to present data in a format conducive to data mining efforts?
    • Protected data
      • How will security and access be handled?
      • How will protected information like social security numbers and/or health information be safeguarded?
    • Data Quality
      • What are the data quality requirements?
        • Proper casing, misspellings, and cosmetic changes
        • Duplicate data
        • Address cleansing
        • Complementing internal data with data from external sources for attributes such as latitude, longitude, census information, etc.
      • How will they be enforced?
        • Microsoft Data Quality Services
        • Cleansing / de-duplication routines in the ETL process using SQL or SSIS
        • Third party components for SSIS including Pragmatic Works Task Factory, Cozy Roc
    • Data Governance
      • How will the company manage the task of conforming dimensions and deciding which attributes are important to the business? Are there groupings and hierarchies that must be defined and maintained outside of the data contained within the source systems?
        • Data Stewards – Ideally, key team members throughout the business would be identified and given the responsibility to work w/ the dimensional modelers to document and make these decisions
      • Microsoft Master Data Services
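
To make the Kimball bullet concrete, here is a minimal sketch of a star schema with a conformed date dimension. All table and column names here are hypothetical, invented for illustration:

```sql
-- Hypothetical star schema: one fact table joined to conformed dimensions.
CREATE TABLE dbo.DimDate (
    DateKey      INT      NOT NULL PRIMARY KEY,  -- e.g. 20240131
    FullDate     DATE     NOT NULL,
    CalendarYear SMALLINT NOT NULL
);

CREATE TABLE dbo.DimCustomer (
    CustomerKey  INT IDENTITY  NOT NULL PRIMARY KEY, -- surrogate key
    CustomerBK   NVARCHAR(20)  NOT NULL,             -- business key from source
    CustomerName NVARCHAR(100) NOT NULL
);

CREATE TABLE dbo.FactSales (
    DateKey     INT   NOT NULL REFERENCES dbo.DimDate (DateKey),
    CustomerKey INT   NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
    SalesAmount MONEY NOT NULL
);
```

Because DimDate is conformed, a second star (say, a FactReturns table) can reuse the same dimension, which is what lets the semantic layer combine the two fact tables by date.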

DQSInstaller.exe -exportkbs Does Not Export a Backup

Hi Everybody!

It’s been a long time; but, I’m at a great place, doing work I believe in, with people I love working with on a daily basis.  Lately, I’ve been working on my latest masterpiece: my first Enterprise Data Warehouse / OLAP solution that incorporates data governance using Microsoft Master Data Services and Data Quality Services (EDW / MDS / DQS).  The engagement with the business is incredible.  Upper management is buying into the Kimball approach right down to the nine steps.  This is truly an amazing time for me!

Okay, on to why you’re probably here.  I had this problem again, nine months after the first time.  I had to scratch my head some more, resulting in even more hair loss, and I was reminded how great it is to blog every now and then.  (Pragmatic Works FTW)

This guy had the same problem and in the interest of efficiency…

“I’ve tried numerous times to export my KBs from DQS.  I run the command line tool (runas admin on the cmd prompt)  and the command executes almost immediately with no errors and no backup file as I specified (dqsinstaller.exe –exportkbs c:\PDM\DQSBackup.dqsb)  I am on SQL 2012 SP1 CU 4.  This is all very frustrating. ”

Punch line:


I don’t know if you can tell from the screenshot; but, the dash is dastardly different.

Type the command into notepad using your keyboard.  Copy and paste that into the command prompt window or your batch file and you should be good to go.
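
If you want to see the difference for yourself, T-SQL can show the code points. The en dash is U+2013 (decimal 8211), while the plain hyphen dqsinstaller.exe expects is U+002D (decimal 45):

```sql
-- UNICODE() returns the code point of the first character of the string.
SELECT UNICODE(N'–') AS EnDash,   -- 8211: what Word and web pages often hand you
       UNICODE(N'-') AS Hyphen;   -- 45: what the command line actually expects
```

Word processors and many web pages silently "autocorrect" two hyphens or a hyphen between spaces into an en dash, which is why copy-pasting the command fails while retyping it works.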

Brian Roberts–Here’s Some FREE Advice on How BI Could Help Improve Comcast Customer Service

Anyone who follows me on twitter (@MikeRMilligan) probably knows that I’ve had some bad experiences with Comcast in the past.  So, when my wife and I decided to move into a new house, rather than mess around w/ their customer service and try to get it transferred, I decided to have her set up new service in her name at the new address.  We’ve been at the new address exactly one week, and after the second interruption of service, I called to get some credit for the account only to find that we owed over $100 and were past due almost $70.  It appears that they started the service about 20 days early.

The realtor had actually called us around the middle of the month to let us know that there was a box from Comcast sitting outside the property.  My wife called their customer service then to let them know that we weren’t moving into the property for a good two weeks and that we would not be responsible if the equipment was stolen or damaged.

The first representative apparently didn’t like my pointing out that I had already entered the phone number and last four digits of the social into the automated phone system and decided to hang up on me.  I proceeded to their webpage chat rather than wait on hold again to try and get a supervisor to call me.  Once one finally did, she gave me the number to a payment center and then hung up on me as well.  I called the payment center and they couldn’t adjust my bill and suggested I speak to the retention center.  I called the number again and was automatically transferred to billing b/c of the past due amount.  They transferred me to the retention center, which didn’t service Florida, and they finally transferred me to someone who is actually trying to help.

So, how can technology help Mr. Roberts improve Comcast customer service?

  1. When I type the phone number and the last 4 digits of the SS# into your automated system, it should pass that information on to the representative so they don’t have to ask for it again.
  2. The representative should start by giving me a case number and encourage me to write it down.  This establishes accountability on the representative’s part.  In the event the representative decides to hang up on the customer, the customer has more than a first name to go by when they decide to speak to a supervisor about the situation.
  3. Track how many times the same account connects to your automated phone system.  Once it detects that a customer has called 5 or 6 times, maybe it’s time for it to automatically connect them to a supervisor?
  4. Have a way for a customer to send an email.  I don’t want to have to wait for a chat representative or a phone representative to correct your mistake.  I should be able to very simply write an email explaining the situation and know that I will get a response.  If you can’t do that, how about a snail mail address?  That’s what I’ll be using when I copy you and the BBB, FCC, and Attorney General on this blog post’s contents.

Ok, /rant off.  Mr. Roberts, I will go to bed tonight praying that Google Fiber or some alternative comes to Jacksonville.  Even if it cost twice as much and were twice as slow, I would gladly take it as an opportunity to fire Comcast.

If you want even more ideas you have my number.

In the meantime, please stop intentionally disrupting my service.  I’ll pay my bill; but, I doubt that I am past due after only seven days of service.

Address Parse Transform–Task Factory vs. SSIS+



This is an old post from September 2012 that lost the images during their upgrade.

Pragmatic Works Task Factory is a collection of twenty-six high performance SSIS components, one of which is the Address Parse Transform component. This component is used to parse unformatted address data into USPS standardized address data. It can also transform addresses from other countries, and it is extremely fast and efficient, as I will demonstrate in this blog entry. It is so easy to use, even a caveman could do it. No offense to any cavemen.

Cozy Roc also has a collection of SSIS components called SSIS+ one of which also does address parsing. It has a weird limitation that your address has to be stored in a single column. Nonetheless, I thought it would be interesting to test the two and compare the differences.

The US data set contains 350,000 records and comes from the following website: I am going to test loading this data set twice.  Once using it the way it comes, and then once again using a slightly modified version where I have concatenated the address, city, state, and zip columns into one column.  This will allow me to compare the two components side by side.
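
The concatenation for the second data set was a simple query along these lines (table and column names are assumed from the sample file):

```sql
-- Combine the separate address parts into the single column SSIS+ requires.
SELECT [Address] + ', ' + City + ', ' + [State] + ' ' + Zip AS FullAddress
FROM dbo.SampleAddresses;
```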

Before we begin, it is important to define exactly what it means for an address to conform to United States Postal Service address standards. The following excerpts were taken from Postal Addressing Standards (Publication 28) and provide a sample of their requirements.

Jointly developed by the Postal Service and mailing industry, standardized address information enhances the processing and delivery of mail, reduces undeliverable-as-addressed mail, and provides mutual cost reduction
opportunities through improved efficiency.

A standardized address is one that is fully spelled out, abbreviated by using the Postal Service standard abbreviations (shown in this publication or as shown in the current Postal Service ZIP+4 file), and uses the proper format
for the address style (shown in this publication).

Format all lines of the address with a uniform left margin. Uppercase letters are preferred on all lines of the address block. Lowercase letters in various type styles are acceptable provided they meet postal guidelines for OCR readability.

Secondary address unit designators, such as APARTMENT or SUITE, are preferred to be printed on the mail piece for address locations containing secondary unit designators. The preferred location is at the end of the Delivery Address Line. The pound sign (#) should not be used as a secondary unit designator if the correct designation, such as APT
or STE, is known or is shown in the ZIP+4 file. If the pound sign (#) is used, there must be a space between the pound sign and the secondary number.

Spell city names in their entirety. When abbreviations must be used due to labeling constraints, use only the approved 13-character abbreviations provided in the City State file.

Format the Last Line with at least one space between the city name, two-character state abbreviation, and ZIP+4 Code.
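
As a rough illustration of those rules, a hand-rolled T-SQL pass might uppercase the line and apply a few of the standard abbreviations. A real cleanser uses the full Publication 28 abbreviation tables; the table and column names here are hypothetical:

```sql
-- Crude sketch only: uppercase plus a handful of USPS suffix abbreviations.
SELECT UPPER(
         REPLACE(REPLACE(REPLACE(AddressLine,
           ' Street',     ' ST'),
           ' Avenue',     ' AVE'),
           ' Apartment ', ' APT ')) AS StandardizedLine
FROM dbo.StagingAddress;
```

This is exactly the kind of brittle string surgery that components like Address Parse Transform exist to replace.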

Test One – Transform 350,000 US Addresses into USPS Standardized Address Format

First, I created a new solution in BIDS and created a new package. I added a data flow task and renamed it to Test One. I dragged the flat file source to my data flow canvas and linked it to the CSV I downloaded containing 350K rows of sample US addresses. I set double quotes as my text qualifier and checked the box indicating that the first row contains the column names.

Now comes the fun part! I dragged the Address Parse Transform component onto my data flow designer and linked the flat file source to the component. Double clicking the component shows the setup screen below. For this test, I am going to use the second option: Address data is spread across multiple columns. The Cozy Roc SSIS+ implementation doesn’t seem to allow you to do this. I mapped the columns to each field as shown below.


Next, I clicked the Address Quality tab and checked the box to include only high quality parsed addresses. I decided to redirect errors to non-parsable output, so I selected that option from the drop-down. Finally, I checked the box to replace ‘Null’ values with an empty string. According to the dialog, sometimes data from CSV or text files actually contains the text ‘Null’ where an actual NULL value was before. That can’t be good for anyone. I can imagine a letter carrier looking for someone named Null or a city named Null.


Now, I clicked the Parsed Output tab. I decided to replace the original columns where applicable and I changed the Output Case Formatting. I selected Upper Case from the dropdown since this is what the USPS prefers. Other options include: no formatting or proper case.


When I clicked Ok, I got an error message!


I don’t have a column for AddressLine2. It doesn’t look like I can ignore this column either so I went back to the Address Data Input tab and mapped Address to that field also. I’ll just ignore the AddressLine2 column downstream.


This is a screenshot showing what my data flow looked like once I was finished adding my destination components for the parsed addresses and for the bad addresses.

Finally, for a proper test we need to record when the data flow starts and when it finishes. Sure, I could have manually modified the package to log that info into a table; but, I have BIxPress so I did it the easy way!

The first run failed.

The “output column “FirstName” (23)” failed because truncation occurred, and the truncation row disposition on “output column “FirstName” (23)” specifies failure on truncation.

Modifying my destination tables and explicitly setting the length of the string data types to 250 (using the advanced editor for the source component, on the Input and Output Properties tab) should fix that.
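
On the table side, the widening looked something like this (the destination table name is hypothetical):

```sql
-- Widen the destination column so parsed names no longer truncate.
ALTER TABLE dbo.ParsedAddress
    ALTER COLUMN FirstName NVARCHAR(250) NULL;
```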


I still received errors, so I decided to redirect error rows from the source to an error table rather than fight them, when I got a nice surprise from BIxPress!


Very cool! Of course I clicked Yes!

Next time it ran green. Let’s check the output in BIDS.


Check out what I get in BIxPress!


Not only does it tell me the row counts; but, it also tells me the buffers used! Not to mention how long it took which was almost 5 and a half minutes. Why can’t BIDS tell you that at least?

Tests are being performed on a Lenovo ThinkPad X201 Tablet running Win 7 w/ an Intel i7 quad core, 6 GB RAM, and an SSD.

Looking at the good addresses I see that it made everything uppercase like I wanted and used the USPS approved and recommended abbreviations. There are two new columns that the component made available to me: Quality and CSZQuality.


I was curious to see why one of the addresses was given a quality rating of medium so I went back to the source. That record had the following stored in the address field: 3202 W Charleston Blvd Bldg. So the ‘Bldg’ string was removed from the field and quality was set to medium. Neat!

Looking at examples where CSZQuality was medium I discovered entries where the zip code was missing; but, had all of the other elements which would still make it deliverable. Records with both columns set to medium were missing addresses and zip codes. These should be discarded or investigated.

Looking at the records in the bad addresses table it looks like most of them have incomplete information in the address field.


Test Two – Transform 350,000 US Addresses into USPS Standardized Address Format From a One Column Source

This time I modified the source to concatenate all of the address columns into one column. Let’s see how Address Parse Transform does now!


Wow! BIxPress totally blows me away again!


Of course I do!

The final destination component now shows an error. I’ve got to correct the mappings.


I deleted the mappings to address, city, state, and zip. Then I mapped addressLine1 to address, TF Address Parse Transform City to city, TF Address Parse Transform State to State, PostalCode to Zip, Country to Country, Quality to Quality, and CSZQuality to CSZQuality. AddressLine2 can just remain unmapped.


Execute Package!

BIDS results:


BIxPress Results:


It took just under 6 minutes. A couple more thousand rows went to the bad address pipeline as would be expected under the circumstances.

Next, I want to test the same functionality using Cozy Roc’s SSIS+ Address Parse component. I was unable to test this component in the same fashion as test one of the Task Factory component because this one only parses addresses that are contained in a single column.

The following screenshots document how I set the component up for the test.



First, the result in BIDs:


BIxPress results:


So it took about 30 seconds longer using Cozy Roc’s component, and there were ~30K bad addresses compared to Pragmatic Works’ ~15K.

Let’s look at some of the good addresses.  The first thing I notice is that the addresses are not uppercase, as is preferred by the USPS.  Secondly, I see that for the third line it chopped the NW off of the street address and put it in front of the city.


Now, let’s look at those same addresses in the good address table from Task Factory test two.  It also seems to have a hard time with the cities; but, the mail should still be deliverable.  Notice the city RAPIDS, which should be GRAND RAPIDS, and PARK PITTSFORD, which should be just PITTSFORD.  Nonetheless, I would still consider the mail deliverable, especially if we can assume the zip code is correct.


Nutshell: Pragmatic Works Task Factory Address Parse component is slightly faster and had half the number of bad addresses as Cozy Roc’s SSIS+ Address Parse component.  The Task Factory component also has the ability to accept addresses that are already in separate columns and the SSIS+ component does not.  The Address Parse Transform component from Pragmatic Works will be able to pay for itself in the long run by reducing the costs of returned mail.

Lost Letter Carrier

New Year’s Resolutions – Professional Goals

Happy New Year!  Here are my goals:

1.  Reboot

2.  Avoid Snowflakes!

3.  Learn more data mining and the DMX query language (and blog about it.)

4.  Learn more about Master Data Services (and blog about it.)

5.  Become a master at MDX

6.  Architect an enterprise solution in which Contoso buys AdventureWorks and Northwind (and blog about it.)

7.  One presentation per quarter and at least one really good blog entry per month

8.  Mentor someone outside of the company I work for

9.  Keep a personal log in which I will plan each day and reflect on the previous day every morning.

10.  Publish Segment Analyzer on CodePlex (and blog about it.)


Segment Analyzer is an idea I have to help analyze various customizable groups of customers, products, or any type of dimension in an Analysis Services cube.  The main force behind the solution is a .Net application that can run against any data warehouse that uses a standard table and column naming scheme.  It won’t matter what the naming scheme is as long as it is consistent; the naming criteria would be configurable.

An analyst uses a GUI to create criteria for the segments to be analyzed.  The criteria are based on dimensional attributes and on aggregated and non-aggregated facts, and support complex parenthetical and/or logic at many levels.  Drop-downs are populated based on the distinct values in the dimensional attributes and on mathematical operations for the numeric information.  Which attributes and which metrics are featured in the interface is handled through a one-time configuration process.  Segments can have parent segments in which the criteria of the parent are inherited to infinite levels of children.  This will be accomplished by parsing the segment criteria into dynamic SQL that populates bridge tables to support many-to-many dimensions in Analysis Services.  The segment dimension itself uses the parent-child hierarchy feature in SSAS.  Two known query performance killers, I know; but, the features would only be used on specialized cubes or databases built just for this purpose.

Examples of segments of customers from adventure works:

Customers that visited the store within the last six months who are female, are between 16 and 24 years old, live within 25 miles of the store and have spent at least $100 in the past six months and redeemed two promotional coupons.

Customers that visited the store within the last year, are male, live in the 12345 zip code, own one car, spent at least $200 in the past year, and redeemed 0 promotions.
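
To give a feel for the dynamic SQL, the first segment above might compile to something like this. The schema, column names, and measures are all invented for illustration; the real generator would assemble the WHERE clause from the analyst’s criteria tree:

```sql
-- Populate the bridge table that backs the many-to-many segment dimension.
INSERT INTO dbo.BridgeCustomerSegment (SegmentKey, CustomerKey)
SELECT @SegmentKey, c.CustomerKey
FROM dbo.DimCustomer AS c
JOIN (
    -- Aggregate the facts once per customer over the trailing six months.
    SELECT f.CustomerKey,
           SUM(f.SalesAmount)     AS Spend,
           SUM(f.CouponsRedeemed) AS Coupons
    FROM dbo.FactSales AS f
    JOIN dbo.DimDate  AS d ON d.DateKey = f.DateKey
    WHERE d.FullDate >= DATEADD(MONTH, -6, CAST(GETDATE() AS DATE))
    GROUP BY f.CustomerKey
) AS s ON s.CustomerKey = c.CustomerKey
WHERE c.Gender = 'F'
  AND c.Age BETWEEN 16 AND 24
  AND c.DistanceToStoreMiles <= 25
  AND s.Spend   >= 100
  AND s.Coupons >= 2;
```

A child segment would start from the parent’s generated predicate and AND its own criteria onto it, which is how the inherited criteria cascade down the parent-child hierarchy.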