First Steps for Designing an Enterprise Data Warehouse

These are the steps I recommend taking when starting out on the fantastic journey of building a data warehouse. Whether you’ve been doing this a long time or this is your first trip, I hope this will provide a checklist of considerations and interesting discussions that should take place. Please provide comments and suggestions for improvements that I can include in future updates to this document.

First Steps

  • Choose a well documented and established approach.
    • Kimball – Dimensional Model, De-normalized. Most common. Enterprise data warehouse consists of many star schema models that are combined in the semantic layer through conformed dimensions.
    • Inmon – Information Factory – Normalized. Includes an additional 3rd normal form (3NF) staging area that is supposed to be inaccessible for reporting which publishes data to many de-normalized star schema data marts that are combined in the semantic layer through conformed dimensions.
    • Data Vault – Hybrid less established approach, some consider to be the best of 3NF and star schema.
  • Take a cookbook approach and read a cookbook.  Without following an established, well documented approach, there is little to hold data architects’ accountable or ensure your efforts will be successful and maintainable.
  • Decide if the EDW will support Operational Reporting and/or Analytical Reporting (detailed, line level vs. bird’s eye, aggregated) and during which phases.
  • Decide if there will be an OLAP database sitting on top of the relational data warehouse and if not, what, if any, aggregated fact tables are going to needed.
  • If possible, understand how the information will be presented to the end user through the semantic layer. Examples include:
    • Excel pivot tables connected to OLAP database (Microsoft SSAS, TM1 for example)
    • Reports against OLAP databases and/or the EDW relational database (SSRS, Crystal Reports, Cognos, and many others)
    • Dashboards (Power BI, Sharepoint / PerformancePoint, Tableau, Cognos, Qlik, and many more)
  • Other important considerations
    • Data Scientists – Will a separate database be required to present data in a format conducive to data mining efforts?
    • Protected data
      • How will security and access be handled?
      • How will protected information like social security numbers and/or health information be safe guarded?
    • Data Quality
      • What are the data quality requirements?
        • Proper casing, misspellings, and cosmetic changes
        • Duplicate data
        • Address cleansing
        • Complementing internal data with data from external sources for attributes such as latitude, longitude, census information, etc.
      • How will they be enforced?
        • Microsoft Data Quality Services
        • Cleansing / de-duplication routines in the ETL process using SQL or SSIS
        • Third party components for SSIS including Pragmatic Works Task Factory, Cozy Roc
    • Data Governance
      • How will the company manage the task of conforming dimensions and deciding which attributes are important to the business? Are there groupings and hierarchies that must be defined and maintained outside of the data contained within in the source systems?
        • Data Stewards – Ideally, key team members throughout the business would be identified and given the responsibility to work w/ the dimensional modelers to document and make these decisions
      • Microsoft Master Data Services