What are Web Parts?

The term “web parts” can mean different things to different people.  A simple web search for “Microsoft web parts” returns some interesting results.  My introduction to web parts began with SQL Server 2008 R2 Reporting Services and the introduction of the Report Viewer web part (not to be confused with the Report Part).  This allowed you to embed Reporting Services .RDL files within a PerformancePoint 2007 dashboard.  I believe you could also embed a report in any SharePoint page using the add web part feature.

A Developer’s Introduction to Web Parts (pertaining to SharePoint 2003) states: “Developers can build Web Parts as ASP.NET custom controls. Administrators can install Web Parts on any site based on Windows SharePoint Services. Users can add Web Parts to pages by dragging and dropping in a browser, and they can personalize them by setting properties. Web Parts can connect to other Web Parts using standard interfaces.”

The Introduction to Customizing Pages by Using Web Parts (applies to SharePoint Server 2007) provides an excellent overview of Web Parts and Web Part Pages:

——————————————————————————-

A Web Part is a modular unit of information that forms the basic building block of a Web Part Page. You can add Web Parts to Web Part zones in a Web Part Page and then customize the individual Web Parts to create a unique page for your site users.

The following example uses the Image Web Part to describe the basic features of a Web Part.


Sample Image Web Part

Callout 1 The Web Part title bar contains the heading for the Web Part.

Callout 2 The Web Part menu contains functions that enable you to minimize or close the Web Part, edit the Web Part, or get Help for a specific Web Part. When the page is in edit mode, you can also use this menu to delete the Web Part or connect it to other Web Parts, depending on the type of Web Part that you are using.

Callout 3 The body of the Web Part contains the content that you specified for the type of Web Part that is being used. In this example, the Web Part is an Image Web Part, which displays an image.


A Web Part Page is a special type of Web page in which you can use Web Parts to consolidate data, such as lists and charts, and Web content, such as text and images, into a dynamic information portal that is built around a common task or special interest.

——————————————————————————-

Finally, Wikipedia defines a web part this way:

——————————————————————————-

A Web Part, also called a Web Widget, is an ASP.NET server control which is added to a Web Part Zone on Web Part Pages by users at run time. The controls enable end users to modify the content, appearance, and behavior of Web pages directly from a browser. It can be put into certain places in a web page by end users, after development by a programmer.

Web Parts can be used as an add-on ASP.NET technology to Windows SharePoint Services.

——————————————————————————-

Web Parts can be custom built for your needs or purchased from third-party vendors.

Here are some Free Microsoft SharePoint Web Parts.

As a PerformancePoint dashboard designer, I do the web part customization after the dashboard has been deployed.  The frustration for me as a developer is that the customization tasks must be repeated every time the dashboard is re-deployed.  I’ve had to create long text cheat sheets of the various web part height, width, and chrome settings for a dashboard.  If anything needs to be changed in the dashboard using Dashboard Designer that forces a re-deployment, these customizations need to be redone individually by hand.  As far as I know, there is still no way to automate this task.  This always baffled me, since these configurations can’t really be stored in source control and can’t be applied automatically as the solution is deployed from dev to QA to production.

UPDATE!  SharePoint 2013 allows for easy migrations!

One new feature of SharePoint 2013 that I really like is the new way it handles exporting and importing a dashboard project.  In previous versions of SharePoint, I had to create long lists of customizations that had to be applied to the web parts after Dashboard Designer had deployed the dashboard.  Things such as hiding filters, resizing objects, and changing the chrome settings had to be done over and over again every time we made a change to the dashboard that required a re-deployment.  Moving the project from the development environment to the QA environment and finally to the production environment was a tedious, manual process.  Furthermore, these customizations were not stored in source control.  I was pleased as punch to discover that the export/import feature of SharePoint 2013 solves both problems.  I can extract the resulting .CMP file using a CAB extractor and put the entire project into source control, and I don’t have to repeat those customizations over and over again.


PerformancePoint 2013 Dashboards: Applying Dynamic Security using CustomData and EffectiveUserName

I recently had the great pleasure of solving a PerformancePoint 2013 puzzle and I thought I’d share how I did it to help others and as a reference for my fellow colleagues.  Our team was trying to implement dynamic active directory security for PerformancePoint 2013 dashboards.  A variety of PerformancePoint web parts were tested as part of this endeavor including Analysis Services scorecards, analytic charts and grids, SSRS reports (native mode), and Excel Services.

Devin Knight does an excellent job of explaining how to get this to work in his blog post:  Analysis Services Dynamic Security.

For this method to work with PerformancePoint and Excel Services, we will need two roles:  ReadUserName and ReadCustomData.  The first one, ReadUserName, is set up exactly as Devin describes.  This is the role that Excel Services and SSRS will use.  The second one, ReadCustomData, is the same except that the UserName() function is replaced with CustomData().  The ReadCustomData role is the one referenced in the data source editor of PerformancePoint Dashboard Designer.

To test this I had to make sure that I wasn’t in any other AD groups or roles.  I only had to add myself to the ReadUserName role.  I did not have to be in the ReadCustomData role; only the service account that runs the SharePoint 2013 PerformancePoint content (a.k.a. the proxy account) needed to be.  We also have a ReadUnrestricted role, and I had to make sure I wasn’t in that role either.

Here is another link that describes how to get this working with PerformancePoint 2010:  How to configure dynamic security in Analysis Services and Performance Point 2010 without Kerberos.

Here is a link that describes various methods for testing if the security is working correctly:

Unit Testing Role Security in Analysis Services.

The series of posts below is wonderful!  I won’t repeat what is said there, but I will say I was relieved to see that this person also came to the conclusion that two roles would be needed.

http://dbaexperience.com/2013/01/24/sharepoint-2010-sql-server-2012-bi-dynamic-security-part-1/

http://dbaexperience.com/2013/01/31/sharepoint-2010-sql-server-2012-bi-dynamic-security-part-2/

http://dbaexperience.com/2013/03/12/sharepoint-2010-sql-server-2012-bi-dynamic-security-part-3/

Here is where things got interesting… No, really.

This post, Using EffectiveUserName To Impersonate in SSAS, mentions that you have to set up Kerberos on your SSAS server to get this to work.  I can tell you we got it to work without that.  I did set up the data source in SSRS as that post describes: credentials stored securely in the report server, used as Windows credentials, with the option to impersonate the authenticated user after a connection has been made to the data source.

I got a little lucky (ok, maybe a lot lucky) in that my account had certain permissions that no one else in my active directory group had.  If it weren’t for that, I don’t know if I would have ever gotten any of this working.

Our initial goal was to set these permissions on either the SharePoint proxy account or the domain service account that processes the cube.  That worked for the SSRS MDX reports, but the Excel Services dashboard pages quit working.  Currently we are faced with the possibility that these permissions will need to be set for all the dashboard users, and we are investigating what that actually means.  (http://windowsitpro.com/active-directory/hiding-data-active-directory-part-3-enabling-list-object-mode-forest)


We think the “List contents” permission is the one that made the difference.  Our Active Directory administrator sent me these links, which he felt describe what is going on pretty well.  Here is what he said:

‘The link below explains the end result as the Account authenticating for the SSRS reports needs Read access to the TokenGroupGlobalAndUniversal attribute. This is done by adding the required account to the Windows Authorization Access  group in Active Directory.

Originally it was thought the SharePointProxy account was the account being used. I did add it to the group last Saturday and it did not resolve the issue. But today it was identified that the service account that connects to the data warehouse for cube processing was the account needing to verify the users group memberships to present the reports based on this information to the clients.

I also confirmed this to be the attribute by allowing Read access to all the User Object attributes and removing them until this single attribute was remaining. Removing this last attribute denied the reports with the Unknown username or password error.’

http://sqlinsight.net/blog/ssrs/ssrs-subscriptions-not-working/

http://blogs.msdn.com/b/richin/archive/2008/08/27/ssrs-subscription-tggau-rights-authzinitializecontextfromsid-exceptions.aspx

The other piece that seemed to make a difference was that the SSAS service is running under the NT AUTHORITY\NetworkService account.

Post in the comments if you have questions.

SSAS Gotcha: The Operation Has Been Cancelled Due To Memory Pressure

My team and I recently had an issue with our SSAS instance in which users were receiving the following message from Excel:

Data Could not be retrieved from the external data source.  Error message returned by the external data source:

Server: The operation has been cancelled due to memory pressure. 


Another user reported a different error from Excel when trying to create a new connection to the Analysis Services database:

An error has occurred in the transport layer.

Errors in the OLE DB provider.  An error occurred while loading the connection dialog box components for prompting.


The following blog posts helped us identify and kill the MDX query and the user was executed:

Finding and killing SSAS Sessions

Kill a Session, SPID or Connection to Analysis Service Cube

The problem seemed to be caused by poorly constructed Excel pivot tables.  Users can do some strange things.  Things like dragging product numbers to the columns and product descriptions to the rows.  Regardless, a single user shouldn’t be able to bring down an entire server making it useless to anyone else in the company, right?

For the short term, we had to find a way to kill queries like these automatically.  Otherwise, our team could never go to lunch together.  Someone would always have to stay behind to log the locked query and restart the SSAS service if the server spun out of control.

Luckily, we were able to reproduce the condition by simply executing one of the queries we had captured on our QA server.  We tried an SSIS package that Chris Webb wrote to automatically kill locked sessions.  Unfortunately, when the issue occurred, the SSIS package couldn’t get a connection to check for locks, let alone kill them.  I also played with the Server Timeout advanced property of the SSAS instance, but that would only work reliably if we set it to a very small time period, such as 30 seconds.  Otherwise, it would blow right past the timeout and never get cancelled.  I also got the PowerShell version of Chris Webb’s method to work; but, again, it would only work reliably for short time periods.

*Make sure the user executing the Agent job is an Analysis Services SERVER administrator to get the PowerShell solution to work.  Only a server administrator can issue Cancel XMLA requests.

Killing sessions automatically with SSIS

Dealing with Long Running SSAS Queries using Powershell

After more research and testing, I found this snippet:

Check the CancelCoordinatorCount property; you can decrease it a bit from its default value, which will force the engine to check the IsCancelled property more often.  As a result, the queries will be stopped faster.  We had big problems with that, but after some post-SP2 updates and using that CancelCoordinatorCount, things got much better.

http://social.msdn.microsoft.com/Forums/en-US/sqlanalysisservices/thread/72fb777a-1198-46d0-991b-df753ea164a6

I changed the CancelCoordinatorCount property from 1000 to 100, and the PowerShell solution would work with a slightly larger threshold.  Coupled with a larger failsafe threshold on the Server Timeout property (which also started working), I was getting closer to what I wanted.

Something about that post triggered a memory and a promise to myself I had made years ago.  The part about applying a service pack.  Many years ago, I worked as a BI developer for a company that specialized in software for the casino industry.  From time to time, I would be asked to help troubleshoot a problem with a client’s BI solution.  The first thing I would always do was reboot the server.  I can’t tell you how many times this would fix the immediate problem.  Second, I would make sure all of their Windows updates and service packs were applied.  When they weren’t up to date, many times this fixed the problem.  As a consultant for Pragmatic Works, one of the first things I would always do when starting a new BI project or troubleshooting an issue was to make sure all updates and service packs were applied.

To make a long story short… applying SQL Server 2008 R2 Service Pack 1 completely fixed our problem.  Two queries that would run for almost two hours on our QA server before finally bombing with a memory pressure error ran in less than five minutes and produced results.

Old lessons learned again.

Data Mining with DMX

To my knowledge, Microsoft has three methods of performing data mining within its BI stack: the Excel add-in, the GUI and wizards within BIDS, and the DMX query language.  Obviously the key drawback to the Excel add-in is that you can’t schedule it to run automatically without human intervention.  However, the Excel add-in provides an excellent way to form your hypothesis and test it.  The GUI and the wizards are nice, and you can create repeatable processes using those tools alone.  The command-line interface that DMX provides has its own appeal to old folks like me who can remember when computer magazines included program listings.

There are basically four steps to data mining with DMX:

  1. Creation of Structures
  2. Creation of Mining Models
  3. Training
  4. Prediction

Think of structures as you would a framework for a house.  Now imagine there are those half mannequins scattered throughout the house structure.  The ones without the arms, legs, and head.  Just the torso in a skirt.  These mannequins are clothed with cheap garments.  The physical shape of the models is determined by the various data mining algorithms.  Some are fit, some are chunky, some are pencil thin.  Next imagine tons of paint being dumped on those garments by the bucket.  One color for each bucket.  At the end of the day, the garments are going to contain patterns of colors streaking down them derived from the buckets of different colored paints.  Those are your models.  Now, imagine a sophisticated computer is attached to those models.  Based on the colors of paint that were dumped on the garments during their training, we can predict the streaks of colors that will occur if an entirely new batch of paint is dumped upon those models.  That is prediction.  This is DMX.  DMX provides the language that defines the structures, models, and querying language.  Now imagine there is a purple cat in the corner of the house drinking lemonade.

The best thing about DMX is that it so closely resembles SQL.  They tried to do that with MDX; but, conceptually it just doesn’t fit quite right.  Knowing SQL can be a hindrance to knowing MDX.  Not so with DMX.

The most commonly used data types in DMX closely resemble those that we encounter with SQL:  Long, double, text, date, Boolean, and table.  Types in SQL that are not in DMX, such as integer, money, float, etc. are implicitly converted.  We are on common ground here.

Content types are a new concept for the relational developer.  The content types that have the word ‘key’ in them have parallels to our familiar primary / foreign key concepts (KEY, KEY TIME, KEY SEQUENCE).  Let’s ignore them for a moment and concentrate on what remains:  DISCRETE, CONTINUOUS, and DISCRETIZED.

DISCRETE is a categorical value such as Male or Female, Married or Single, between 32 and 37 years old, between 38 and 42 years old.  In the context of a content type it means the data is already presented that way.  The DISCRETIZED keyword just tells the engine that we want it to do the categorization.  The default is a 5-bucket approach.  If there is not enough data to support that, the engine will automagically try fewer buckets.  If that fails, it takes a clustering approach.

CONTINUOUS values are familiar to most of us; they are simply numerical values.  Age columns that include values such as 32, 37, 87, 17.  Monetary columns including amounts such as $1.42, $200.76, $432.45, etc.


MINING STRUCTURES

Before we can create any mining structures we have to open a new DMX query window.  This is accomplished by connecting to an Analysis Services database using SQL Server Management Studio, right-clicking on the SSAS instance, and selecting New Query > DMX.


All of the examples in this post will be based on two mining structures.  The first, simply called Customers, is our case-level structure.  Don’t worry about what that means for now; let’s focus on the syntax.

CREATE MINING STRUCTURE [Customers] (
	 CustomerKey	LONG	KEY
	,CustomerName	TEXT	DISCRETE
	,Gender		TEXT	DISCRETE
	,Age		LONG	CONTINUOUS
	,AgeDisc	LONG	DISCRETIZED(EQUAL_AREAS,4)
	,MaritalStatus	TEXT	DISCRETE
	,Education	TEXT	DISCRETE
	);

We have a column name, a data type, and a content type very much like the SQL syntax for creating a table.

The possible values for the various columns are: Gender – Male or Female; Age – 17, 32, 64, 42, etc.; AgeDisc – 17-22, 23-28, 29-40, etc.; MaritalStatus – Married or Single; Education – Bachelors, High School, Partial College, Graduate Degree, etc.

CREATE MINING STRUCTURE [CustomersNested] (
     CustomerKey        LONG    KEY
    ,CustomerName        TEXT    DISCRETE
    ,Gender                TEXT    DISCRETE
    ,Age                LONG    CONTINUOUS
    ,AgeDisc            LONG    DISCRETIZED(EQUAL_AREAS,4)
    ,MaritalStatus        TEXT    DISCRETE
    ,Education            TEXT    DISCRETE
    ,Purchases            TABLE (
         Product            TEXT    KEY
        ,Quantity            LONG    CONTINUOUS
        ,Discounted            BOOLEAN    DISCRETE
        )
    ) 
    WITH HOLDOUT (30 PERCENT OR 10000 CASES) REPEATABLE (42);

This second mining structure is our nested table structure.  A nested table is like a table that sits in a column within another table.  In this example, the column Purchases has a data type of TABLE.  It has a key column, the product name, a quantity column, and a discounted field.  Nested tables are like a mining structure in themselves, in that they have a name and a column list.

The HOLDOUT keyword will randomly select a certain percentage or number of records to be set aside for testing the model.

The REPEATABLE keyword, when used with a non-zero integer, will cause the same records to be set aside for testing the model each time it is populated.  This is useful for testing the behavioral consistency of your scenarios.

Simple Simon was a Singleton


Here is an example of a record that is either being used to train a mining model within our nested table example, or as a singleton query to perform a prediction upon.  Notice the nested table is sort of a table in and of itself embedded in a column.  A singleton query is when you send one row to a model to make a prediction.  They are often used in real-time prediction engines.


MINING MODELS

ALTER MINING STRUCTURE [Customers]
    ADD MINING MODEL [ClusteredCustomers]
    USING Microsoft_Clustering;

DROP MINING MODEL [ClusteredCustomers];

Here we have the simplest mining model money can buy.  This particular model would only be useful for clustering algorithms.  Notice there aren’t any column names in the mining model definition.  If no columns are specified, all columns are considered inputs.

The USING keyword designates the algorithm your model will use. Understand that all algorithms use the same structures and model declarations and the same query language.

One thing I found out the hard way is that the more mining models your structure has, the longer it takes to populate your structure. This makes sense if you think about it, because it has to train so many models. If you go through the examples on your own, I’d recommend that you issue the drop statements or your structures will take forever to populate and some examples simply won’t work.

ALTER MINING STRUCTURE [Customers]
    ADD MINING MODEL [PredictMaritalStatus](
         CustomerKey
        ,Gender
        ,Age //NOTICE WE ARE USING THE CONTINUOUS VERSION OF AGE
        ,Education
        ,MaritalStatus PREDICT
        )
    USING Microsoft_Decision_Trees;

DROP MINING MODEL [PredictMaritalStatus];

This model is against our case-level table structure; we are using the continuous version of the age column, and the model uses a decision tree algorithm.  The best way I can describe a case-level structure is any structure without nested tables.

The word PREDICT in the model above is called a usage flag.  A usage flag has three states: PREDICT, PREDICT_ONLY, and NULL (the absence of a usage flag.)

  1. PREDICT means the column is both an input and an output.
  2. PREDICT_ONLY means the column is an output only.
  3. A column that is missing a usage flag is considered an input only.

We can only perform predictions against columns that are an output.  Input columns are fed into the model and used to make predictions on the output columns.

ALTER MINING STRUCTURE Customers
ADD MINING MODEL PredictMaritalStatusBayes(
     CustomerKey
    ,CustomerName
    ,Gender
    ,AgeDisc //NOTICE WE ARE USING THE DISCRETE VERSION OF AGE
    ,Education
    ,MaritalStatus PREDICT
    )
USING Microsoft_Naive_Bayes;

DROP MINING MODEL PredictMaritalStatusBayes;

In this example notice we are using the discrete version of the age column and the model is using the Naïve Bayes algorithm.  The algorithms Microsoft Naive Bayes and Microsoft Association Rules support categorical analysis and do not support continuous types.  The Microsoft Linear Regression algorithm accepts only continuous data.

ALTER MINING STRUCTURE CustomersNested
ADD MINING MODEL PredictMaritalStatusNestedTrees(
     CustomerKey
    ,Gender         //INPUT ONLY B/C USAGE FLAG IS ABSENT
    ,AgeDisc AS Age //INPUT ONLY B/C USAGE FLAG IS ABSENT
    ,Education      //INPUT ONLY B/C USAGE FLAG IS ABSENT
    ,MaritalStatus PREDICT //INPUT AND OUTPUT
    ,Purchases (
         Product
        ,Quantity   //VALUE ATTRIBUTE
        ,Discounted //VALUE ATTRIBUTE
        )
    )
USING Microsoft_Decision_Trees(COMPLEXITY_PENALTY = .5);

This model can be described as a model used to predict a person’s marital status based on that person’s gender, age, education, the quantities of their product purchases, and whether or not those products were purchased on sale.

ALTER MINING STRUCTURE CustomersNested
    ADD MINING MODEL PredictPurchasesTrees(
         CustomerKey
        ,Gender
        ,AgeDisc AS Age
        ,Education
        ,MaritalStatus 
        ,Purchases PREDICT (    //INPUT AND OUTPUT
             Product                
            )
        )
    USING Microsoft_Decision_Trees;

This model predicts product purchases based on a person’s gender, age, education, marital status, and other products purchased.  Since the table column Purchases is marked PREDICT instead of PREDICT_ONLY, each attribute in the nested table is both input and output.  If it were marked PREDICT_ONLY, it would not take into account the other products purchased.  Note that it is the table, not the key, that is marked with the usage flag.  A model predicts the value of an attribute; here, the model predicts the rows that make up the nested table, which is why it is the table that accepts the usage flag in this example.

For a nested table it is acceptable, and in fact common, to have only a single column, the key.  Nested tables without supporting columns are often used for market basket analysis, as in the sketch below.
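
As a quick sketch of such a key-only nested table (the MarketBasket model name is hypothetical, and this assumes the CustomersNested structure created above), a market basket style model could be declared like this:

ALTER MINING STRUCTURE CustomersNested
ADD MINING MODEL MarketBasket(
     CustomerKey
    ,Purchases PREDICT (
         Product    //KEY-ONLY NESTED TABLE
        )
    )
USING Microsoft_Association_Rules;

DROP MINING MODEL MarketBasket;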

The existence of a nested table row can be inferred by the existence of a value in any non-key nested column. This model creates what are called valueless attributes.

Valueless attributes are those attributes that do not have a separate table column representing their value. They are simply EXISTING or MISSING. The value of the attributes for the single case of Simple Simon would simply be:

  1. Purchases: Tires, Existing
  2. Purchases: Fenders, Existing
  3. Purchases: Playing Cards, Existing
  4. Purchases: Clothes Pins, Existing

Theoretically, the contents of a case also contain all of the “Missing” attributes too.  So for every product you didn’t purchase, there would also be a valueless attribute.

Microsoft’s algorithms have been written to specifically handle variable length cases. This is a radical departure from other data mining packages that assume each case is identically large.

ALTER MINING STRUCTURE CustomersNested
ADD MINING MODEL PredictQuantity(
     CustomerKey
    ,Gender
    ,AgeDisc AS Age
    ,Education
    ,MaritalStatus 
    ,Purchases (            //INPUT ONLY NESTED TABLE
          Product
         ,Quantity PREDICT    //INPUT AND OUTPUT
        )
    )
USING Microsoft_Decision_Trees;    

DROP MINING MODEL PredictQuantity;

This is an example of a Nested Table without a Usage Flag.  This model predicts the quantity of products purchased based on gender, age, education, marital status, and the quantity of other purchased products. The nested table contains a predictable column that is an input and output as it is marked with the PREDICT usage flag.

A valueless attribute (existing or missing) is not necessary and is not created because the table is an input and includes a value column that is also an input.

So valueless attributes were created in the previous example because all the nested table had was Product, which was either existing or missing.  In this case they are not created; since the Quantity field is there, they are not needed.

Question: What can we do with Quantity that we can’t do w/ any other column?

ALTER MINING STRUCTURE CustomersNested
ADD MINING MODEL PredictOnlyTable(
     CustomerKey
    ,Gender
    ,AgeDisc AS Age
    ,Education
    ,MaritalStatus 
    ,Purchases PREDICT_ONLY (    //OUTPUT ONLY
          Product                
         ,Quantity                //INPUT ONLY
        )
    )
USING Microsoft_Decision_Trees;    

DROP MINING MODEL PredictOnlyTable;

Answer: Predict.

This model predicts what products are likely to be purchased based on gender, age, education, marital status, and the quantity of other items purchased. A valueless attribute is created for the outputs since in this case you can’t predict quantity.

ALTER MINING STRUCTURE CustomersNested
ADD MINING MODEL PredictOnlyTableQuantity(
     CustomerKey
    ,Gender
    ,AgeDisc AS Age
    ,Education
    ,MaritalStatus 
    ,Purchases PREDICT_ONLY (        //OUTPUT ONLY
          Product
         ,Quantity PREDICT_ONLY        //OUTPUT ONLY
        )
    )
USING Microsoft_Decision_Trees;

This model predicts the quantity of products purchased based only on gender, age, education, and marital status. This model matches the table value column’s usage flag with the table usage flag. Since there isn’t a discrepancy between the usage flags, valueless attributes are not created.

Another way to think of this: valueless attributes are created whenever the usage flag of the table column cannot be matched with the usage flag of any of the nested table’s value columns.

ALTER MINING STRUCTURE Customers
ADD MINING MODEL FilterByAge(
     CustomerKey
    ,Gender
    ,Age
    ,Education PREDICT
    ,MaritalStatus 
    )
USING Microsoft_Decision_Trees
WITH FILTER(AGE > 30);

DROP MINING MODEL FilterByAge;

Filters can include case-level and nested-level columns.

ALTER MINING STRUCTURE CustomersNested
ADD MINING MODEL FilterByBalls(
     CustomerKey
    ,Gender
    ,Age
    ,Education PREDICT
    ,MaritalStatus 
    )
USING Microsoft_Decision_Trees
WITH FILTER(EXISTS(SELECT * 
                   FROM Purchases 
                   WHERE Product = 'Bearing Ball' 
                   AND Discounted));

DROP MINING MODEL FilterByBalls;

Contents of nested tables can be filtered.

ALTER MINING STRUCTURE CustomersNested
ADD MINING MODEL FilterByNested(
     CustomerKey
    ,Gender
    ,Age
    ,Education
    ,MaritalStatus 
    ,Purchases PREDICT (
        Product
        ) WITH FILTER(NOT Discounted)
    ) 
USING Microsoft_Decision_Trees;

DROP MINING MODEL FilterByNested;

Also worth noting: columns referenced in a filter are structure columns and need not be part of the model definition, as the sketch below shows.
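
Here is a minimal sketch of that idea (the FilterByEducation model name is hypothetical); Education is used in the filter even though it is only a column of the Customers structure, not of the model:

ALTER MINING STRUCTURE Customers
ADD MINING MODEL FilterByEducation(
     CustomerKey
    ,Gender
    ,Age
    ,MaritalStatus PREDICT
    )
USING Microsoft_Decision_Trees
WITH FILTER(Education = 'Bachelors');
//Education IS A STRUCTURE COLUMN ONLY

DROP MINING MODEL FilterByEducation;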


TRAINING

Once a mining model has been trained or processed, it contains algorithmic patterns derived from the data. Models are used against new data to perform predictions on any output columns defined during their creation. Patterns are referred to as the model content.

TRAINING METHODS

  • OPENQUERY
  • SHAPE
  • Other DMX Queries
  • MDX
  • Stored Procedures
  • Row-set parameters as the source data query for INSERT INTO

This blog post will cover the first two, OPENQUERY and SHAPE.

What do you get when you combine DMX w/ MDX in your company’s data mining project? Job security.

Data Sources

  • You can use the data source in the Adventure Works SSAS sample.
  • You can create one using BIDS as part of an Analysis Services project and deploy it.
  • You can use the AssProcs assembly.

AssProcs Detailed Instructions:
http://marktab.net/datamining/2010/07/10/microsoft-decision-trees-algorithm/

AssProcs Download:  Appendix B
http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470277742,descCd-DOWNLOAD.html

CALL ASSprocs.CreateDataSource(
     'Adventure Works DW'              
    ,'Provider=SQLNCLI10.1;Data Source=localhost
      ;Integrated Security=SSPI
      ;Initial Catalog=AdventureWorksDW2008R2'
    ,'ImpersonateCurrentUser','','');

TRAINING OUR CASE LEVEL STRUCTURE USING OPENQUERY

INSERT INTO MINING STRUCTURE Customers (
         CustomerKey
        ,CustomerName
        ,Gender
        ,Age
        ,MaritalStatus
        ,Education
        )
        OPENQUERY(
             [Adventure Works DW]
            ,'Select 
                     CustomerKey
                    ,LastName 
                    ,Gender
                    ,DATEDIFF(YEAR,BirthDate,GETDATE())
                    ,MaritalStatus
                    ,EnglishEducation
              FROM dbo.DimCustomer'
              );

Random number Example

CREATE MINING STRUCTURE [CustomersRandom] (
     CustomerKey        LONG    KEY
    ,CustomerName        TEXT    DISCRETE
    ,Gender                TEXT    DISCRETE
    ,Age                LONG    CONTINUOUS
    ,MaritalStatus        TEXT    DISCRETE
    ,Education            TEXT    DISCRETE
    ,Random                DOUBLE    CONTINUOUS
    );

INSERT INTO MINING STRUCTURE CustomersRandom (
         CustomerKey
        ,CustomerName
        ,Gender
        ,Age
        ,MaritalStatus
        ,Education
        ,Random
        )
        OPENQUERY(
             [Adventure Works DW]
            ,'Select 
                     CustomerKey
                    ,LastName 
                    ,Gender
                    ,DATEDIFF(YEAR,BirthDate,GETDATE())
                    ,MaritalStatus
                    ,EnglishEducation
                    ,CAST((ABS(CHECKSUM(
                     NEWID())) % 1000) AS FLOAT)
                     /1000 AS Random
              FROM dbo.DimCustomer'
              );

TRAINING OUR NESTED TABLE STRUCTURE

INSERT INTO MINING STRUCTURE CustomersNested (
     CustomerKey
    ,CustomerName
    ,Gender
    ,Age
    ,AgeDisc
    ,MaritalStatus
    ,Education
    ,Purchases(
         SKIP
        ,Product
        ,Quantity
        ,Discounted
        )
    )
SHAPE {  
        OPENQUERY(
             [Adventure Works DW]
            ,'Select 
                 CustomerKey1 = CustomerKey
                ,LastName 
                ,Gender
                ,DATEDIFF(YEAR,BirthDate,GETDATE())
                ,DATEDIFF(YEAR,BirthDate,GETDATE())
                ,MaritalStatus
                ,EnglishEducation
               FROM dbo.DimCustomer
               ORDER BY CustomerKey'
            )
        }
APPEND (
            {
            OPENQUERY(
                 [Adventure Works DW]
                ,'Select 
                     CustomerKey2 = f.CustomerKey
                    ,p.EnglishProductName
                    ,f.OrderQuantity
                    ,CASE 
                        WHEN f.DiscountAmount > 0 
                            THEN CAST(1 AS BIT)
                        ELSE CAST(0 AS BIT)
                     END
                  FROM dbo.FactInternetSales f
                  JOIN dbo.DimProduct p 
                    ON p.ProductKey = f.ProductKey
                  ORDER BY f.CustomerKey'
                  )
            } RELATE CustomerKey1 TO CustomerKey2
        ) AS Purchases;

The SHAPE syntax is used to transform the flat source data into the hierarchical nested representation required by the mining structure.  The relationship between the outer rows and the nested record set is established using the RELATE keyword, which maps a column in the outer record set to a foreign key column in the nested record set.  You can add as many nested record sets as desired simply by providing additional definitions.  The record sets must be ordered by the columns used to relate them.  This is a key requirement (pun intended).

SKIP indicates columns that exist in the source data that will not be used to fill the structure. It is primarily used in cases where you don’t have control over the columns returned.


PREDICTION

Prediction means applying the patterns that were found in the data to estimate unknown information.  It is the final goal in any data mining scenario, and it provides the ultimate benefit of data collection and machine learning, which can dramatically influence how business is conducted.

DMX simplifies these possibilities by using a consistent syntax for prediction across all of the various algorithms.  It allows predictions to be scheduled and the results stored in various formats, including relational databases.  Some predictions are even performed in real time.

Examples Include:

  • Future values of a time series (How much will we earn next month?)
  • What other products a customer might be interested in purchasing
  • Likelihood of a customer switching to a competitor
  • Will the borrower repay the loan?
  • Does anything seem abnormal here?
  • What is the most effective advertisement to display to this customer?
  • How can I classify customers, products, or events?

QUERYING STRUCTURED DATA

Three Rules

  1. You can only select cases from models that support DRILLTHROUGH
  2. You can only see data the model can see (unfiltered)
  3. By default, you will only see columns used with the model*

*There is a way to get the additional columns (see the sketch below).
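
One way to get at those additional columns, assuming the model allows drillthrough (such as the ClusterDrill model created later in this post), is the StructureColumn function, which returns a structure column that is not part of the model definition.  A minimal sketch:

SELECT 
     CustomerKey
    ,StructureColumn('CustomerName') AS CustomerName
FROM ClusterDrill.CASES;
//CustomerName IS IN THE STRUCTURE BUT NOT THE MODEL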

Three Can Dos

  1. You can query column contents to see discrete values or continuous range values used in those columns.
  2. You can query the model content to explore data patterns discovered by the algorithms.
  3. WITH DRILLTHROUGH you can see how training data cases reinforce the validity of the patterns that were found.

PREDICTION JOIN


Consider a mining model to be a table containing all possible combinations of input and output variables.  Now imagine you are doing a traditional SQL join to a theoretical table that will determine a prediction result.  Such a theoretical table would not be pragmatic considering the number of possible combinations within nested tables, and impossible once you consider continuous columns.  Models contain patterns learned from data in a compressed format, allowing for efficient execution of predictions.

PREDICTION JOIN SYNTAX


Three Ways to Query Structured Data

  1. Select all cases
  2. Select cases as a flat record set
  3. Select only test cases

SELECT * 
FROM MINING STRUCTURE CustomersNested.CASES;
//SELECT ALL CASES

SELECT FLATTENED * 
FROM MINING STRUCTURE CustomersNested.CASES;      
//SELECT CASES AS A FLAT RECORD SET

SELECT * 
FROM MINING STRUCTURE CustomersNested.CASES 
WHERE IsTestCase();        
//SELECT ONLY TEST CASES

Select All Cases or All Test Cases Results


Select Cases as a Flat Record-set Results


WITH DRILLTHROUGH Clustering

ALTER MINING STRUCTURE CustomersNested
ADD MINING MODEL ClusterDrill(
     CustomerKey
    ,Gender
    ,Age
    ) USING Microsoft_Clustering
    WITH DRILLTHROUGH;

INSERT INTO ClusterDrill;

SELECT DISTINCT Gender 
FROM ClusterDrill;

SELECT DISTINCT RangeMin(Age),Age,RangeMax(Age)
FROM ClusterDrill;

SELECT * 
FROM ClusterDrill.CONTENT;

SELECT * 
FROM ClusterDrill.CASES 
WHERE IsInNode('001');

DROP MINING MODEL ClusterDrill;


QUERY Predict Marital Status Using Naïve Bayes

DELETE FROM MINING STRUCTURE Customers;

ALTER MINING STRUCTURE Customers
    ADD MINING MODEL PredictMaritalStatusBayes(
         CustomerKey
        ,CustomerName
        ,Gender
        ,AgeDisc
        ,Education
        ,MaritalStatus PREDICT
        )
    USING Microsoft_Naive_Bayes;

INSERT INTO MINING STRUCTURE Customers (
         CustomerKey
        ,CustomerName
        ,Gender
        ,AgeDisc 
        ,MaritalStatus
        ,Education
        )
        OPENQUERY(
             [Adventure Works DW]
            ,'Select 
                     CustomerKey
                    ,LastName 
                    ,Gender
                    ,DATEDIFF(YEAR,BirthDate
                              ,GETDATE())
                    ,MaritalStatus
                    ,EnglishEducation
              FROM dbo.DimCustomer'
              );

SELECT 
     t.CustomerName
    ,Predict(MaritalStatus) 
        AS PredictedMaritalStatus
FROM PredictMaritalStatusBayes     
PREDICTION JOIN
OPENQUERY(
            [Adventure Works DW]
            ,'SELECT
                CustomerKey
                ,CustomerName = LastName
                ,Gender
                ,AgeDisc = DATEDIFF(YEAR,BirthDate
                                    ,GETDATE())
                ,MaritalStatus
                ,Education = EnglishEducation
              FROM dbo.DimCustomer'
        ) AS t
ON     PredictMaritalStatusBayes.AgeDisc = t.AgeDisc 
AND PredictMaritalStatusBayes.Education = t.Education
AND PredictMaritalStatusBayes.Gender = t.Gender;


DMX requires a fully qualified descriptor for all the column names in the mapping because joins are bound by column name rather than column order.

Predict Marital Status Nested Trees

SELECT 
    t.CustomerName
    ,Predict(MaritalStatus) AS PredictedMaritalStatus
FROM PredictMaritalStatusNestedTrees
PREDICTION JOIN
SHAPE { OPENQUERY([Adventure Works DW]
    ,'SELECT
         CustomerKey1 = CustomerKey,CustomerName = LastName
        ,Gender,Age = DATEDIFF(YEAR,BirthDate,GETDATE())
        ,Education = EnglishEducation,MaritalStatus
      FROM dbo.DimCustomer ORDER BY CustomerKey')
       } APPEND ( {    OPENQUERY([Adventure Works DW]
    ,'SELECT
         CustomerKey2 = f.CustomerKey
        ,Product = p.EnglishProductName
        ,Quantity = f.OrderQuantity,Discounted = CASE 
            WHEN f.DiscountAmount > 0 THEN CAST(1 AS BIT)
            ELSE CAST(0 AS BIT)
          END
      FROM dbo.FactInternetSales f
      JOIN dbo.DimProduct p ON p.ProductKey = f.ProductKey
      ORDER BY f.CustomerKey')
    } RELATE CustomerKey1 TO CustomerKey2
) AS Purchases AS t
ON PredictMaritalStatusNestedTrees.Age = t.Age
AND PredictMaritalStatusNestedTrees.Gender = t.Gender
and PredictMaritalStatusNestedTrees.Purchases.Product 
    = t.Purchases.Product
AND PredictMaritalStatusNestedTrees.Purchases.Quantity 
    = t.Purchases.Quantity
AND PredictMaritalStatusNestedTrees.Purchases.Discounted 
    = t.Purchases.Discounted
and PredictMaritalStatusNestedTrees.Education 
    = t.Education;


Singleton Queries

SELECT
    Predict(MaritalStatus) AS PredictedMaritalStatus
FROM PredictMaritalStatusBayes
NATURAL PREDICTION JOIN
(SELECT 'M' AS Gender
        ,35 AS Age
        ,'Graduate Degree' AS Education) AS t;    

SELECT
    Predict(MaritalStatus) AS PredictedMaritalStatus
FROM PredictMaritalStatusBayes
NATURAL PREDICTION JOIN
(SELECT 'F' AS Gender
        ,22 AS Age
        ,'High School' AS Education) AS t;


NATURAL PREDICTION JOIN – matches columns from the source with the same names as input columns of the model.

SELECT
    Predict(MaritalStatus) AS PredictedMaritalStatus
FROM PredictMaritalStatusNestedTrees
NATURAL PREDICTION JOIN
(
    SELECT 'M' AS Gender
           ,35 AS Age
           ,'Graduate Degree' AS Education
           ,
    (        
        SELECT 'Touring Tire' AS Product
               ,2 AS QUANTITY 
        UNION
        SELECT 'Mountain-200 Silver, 38' AS Product
               ,1 AS QUANTITY 
        UNION
        SELECT 'Fender Set - Mountain' AS Product
               ,1 AS QUANTITY 
    ) AS Purchases
) AS t;


Degenerate Prediction – a prediction without source data, and therefore one that has no PREDICTION JOIN clause.  If 42% of my customers are married, the degenerate prediction for marital status would be single.
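
As a minimal example, a degenerate prediction against the PredictMaritalStatusBayes model is just a SELECT with no PREDICTION JOIN at all; it returns the most likely state across all of the training cases:

SELECT 
    Predict(MaritalStatus) AS PredictedMaritalStatus
FROM PredictMaritalStatusBayes;
//NO PREDICTION JOIN CLAUSE = DEGENERATE PREDICTION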

Predict Function Histogram

DELETE FROM MINING STRUCTURE Customers;

ALTER MINING STRUCTURE Customers
ADD MINING MODEL FilterByAge(
     CustomerKey
    ,Gender
    ,Age
    ,Education PREDICT
    ,MaritalStatus
    )
USING Microsoft_Decision_Trees
WITH FILTER(AGE > 30);

INSERT INTO MINING STRUCTURE Customers (
     CustomerKey
    ,CustomerName
    ,Gender
    ,Age
    ,MaritalStatus
    ,Education
    )
    OPENQUERY(
         [Adventure Works DW]
        ,'Select 
                 CustomerKey
                ,LastName 
                ,Gender
                ,DATEDIFF(YEAR,BirthDate,GETDATE())
                ,MaritalStatus
                ,EnglishEducation
          FROM dbo.DimCustomer'
          );

SELECT 
     'Histogram' AS Label
    ,PredictHistogram(Education) AS Hist
FROM FilterByAge;

SELECT FLATTENED
    (SELECT $Probability
     FROM PredictHistogram(Education)
     WHERE Education = 'Bachelors')
FROM FilterByAge;
//RETURNS THE PROBABILITY ASSOCIATED WITH
//THE VALUE 'BACHELORS'

SELECT FLATTENED
    (SELECT Education,$Probability
     FROM TopCount(PredictHistogram(Education)
                   ,$Probability,5))
FROM FilterByAge;
//USES TOPCOUNT FUNCTION TO RETURN TOP 5 ROWS
//OF THE HISTOGRAM TABLE BASED ON PROBABILITY

DROP MINING MODEL FilterByAge;


This is the most comprehensive prediction function for case-level columns.  It returns a table with all of the information available about the prediction of a scalar column:

  • Target – the predicted value: Bachelors, Partial College, and so on.
  • $Support – how many cases support this prediction.
  • $Probability – the computed probability of a categorical output.  For continuous values it represents the likelihood of a value being present.
  • $AdjustedProbability – a modified probability used to boost the likelihood of rare events; frequently used for predicting nested tables.
  • $Variance – the variance of a continuous prediction.  0 for discrete predictions.
  • $StDev – the standard deviation of a continuous prediction.  0 for discrete predictions.

When you’re dealing with functions like PredictHistogram that return tables, you can select rows from them like any other table.

DMX doesn’t allow TOP and ORDER BY clauses in sub-selects. TOPCOUNT AND BOTTOMCOUNT substitute for this functionality.
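
For instance, where you might reach for ORDER BY ... ASC in SQL, BottomCount can be used instead.  A sketch against the PredictMaritalStatusBayes model:

SELECT FLATTENED
    (SELECT MaritalStatus,$Probability
     FROM BottomCount(PredictHistogram(MaritalStatus)
                      ,$Probability,2))
FROM PredictMaritalStatusBayes;
//USES BOTTOMCOUNT TO RETURN THE 2 LEAST LIKELY
//ROWS OF THE HISTOGRAM TABLE BASED ON PROBABILITY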

SELECT PredictHistogram(MaritalStatus)
FROM PredictMaritalStatusBayes;

SELECT
    MaritalStatus,PredictProbability(MaritalStatus)
FROM PredictMaritalStatusBayes;        

//Shortcuts for returning values such as the 
//probability and support:

//PredictProbability
//PredictSupport
//PredictAdjustedProbability
//PredictVariance
//PredictStdDev 

SELECT
    VBA![Log](PredictProbability(MaritalStatus)) 
        AS NaturalLog
FROM PredictMaritalStatusBayes;        
//EXAMPLE SHOWING HOW TO ACCESS SCIENTIFIC 
//FUNCTIONS FROM VBA SCRIPT LANGUAGE

First, let’s look at the PredictHistogram function against the PredictMaritalStatusBayes model.


Results from the PredictProbability function:


All of these functions return their values from the most likely row of the PredictHistogram record set without requiring messy sub-select syntax.

They also let you extract the appropriate value for any row of the histogram.

The Predict function is polymorphic; its behavior depends on whether it is supplied a case-level (scalar) column reference or a table column reference.

You don’t even need to use a function when predicting a case-level column value: SELECT Education is equivalent to SELECT Predict(Education), as long as Education is a predictable column (see the sketch below).
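
In other words, the two queries below should return the same column of predictions (a sketch against the PredictMaritalStatusBayes model, where MaritalStatus is the predictable column):

SELECT MaritalStatus
FROM PredictMaritalStatusBayes;
//SELECTING A PREDICTABLE CASE-LEVEL COLUMN...

SELECT Predict(MaritalStatus)
FROM PredictMaritalStatusBayes;
//...IS EQUIVALENT TO WRAPPING IT IN PREDICT()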

Results from the VBA![Log](PredictProbability) query:


Additional functions are shortcuts that give additional information about the prediction (support, likelihood, or the part of the model used for prediction.)

A reference of all DMX functions appears in Appendix B of the Wrox download site mentioned previously regarding the AssProcs download.

PredictAssociation

This function takes a nested table and returns a ranked list of results.  It is useful for recommendation predictions.

// PredictAssociation takes a nested table 
// and returns a ranked list of results

SELECT
    PredictAssociation(Purchases,3)
FROM PredictPurchasesTrees;
//top 3 items based on the default measure
//probability

SELECT
    PredictAssociation(Purchases,3,$AdjustedProbability)
FROM PredictPurchasesTrees;
//same but uses adjusted probability to rank the items

SELECT(SELECT * 
       FROM PredictAssociation(Purchases
                               ,INCLUDE_NODE_ID
                               ,INCLUDE_STATISTICS)
    WHERE $Probability > .1)
FROM PredictPurchasesTrees;        
//Shows all items that are recommended 
//with a probability greater than 10 percent.  
//Also includes the statistics and node ID of 
//each prediction


Adjusted probability is useful in associative predictions because it penalizes items that are too popular in favor of less popular items.  For example, if 90% of your customers want milk and the engine recommends milk for a particular customer with an 80% likelihood, the engine is telling you that for this customer the chance of them wanting milk is actually less than average.

//Predicting the values of columns in nested tables 
//instead of the contents of the tables themselves 

SELECT 
    (SELECT 
        Product
        ,Predict(Quantity)
    FROM Purchases
    )
FROM PredictOnlyTableQuantity;
//returns the predicted quantity for each 
//item assuming each item is purchased.

SELECT 
    (SELECT 
        Product
    FROM Purchases
    WHERE Predict(Quantity) > .42
    )
FROM PredictOnlyTableQuantity;    
//returns a list of all products with a 
//predicted quantity of greater than .42  

SELECT 
    (SELECT 
        Product
        ,(SELECT 
            $Probability
          FROM PredictHistogram(Quantity)
          WHERE Quantity = NULL)
    FROM Purchases
    )
FROM PredictOnlyTableQuantity;        
//returns a list of all possible products 
//for each case and, in a doubly nested 
//table, returns the probability that a product 
//will NOT be purchased.

SELECT 
    (SELECT 
        Product
        ,Predict(Quantity)
    FROM Purchases
    WHERE PredictProbability(Quantity, NULL) < .8)
FROM PredictOnlyTableQuantity;    
//simplifies the previous example by using 
//a scalar function instead of the table-
//returning PredictHistogram.  Returns the 
//predicted quantity of all products whose 
//probability of NOT being purchased is 
//less than 80% (i.e. whose probability of 
//being purchased is greater than 20%).


Predicting the values of columns in nested tables, instead of the contents of the tables themselves, is an extremely useful but rare scenario.

Additional Mining Model Management Syntax

EXPORT MINING STRUCTURE Customers TO 'C:\c\Customers.abf'
WITH PASSWORD='mike';    

RENAME MINING STRUCTURE Customers TO Customer1;
DELETE FROM MINING STRUCTURE Customer1.CASES;

IMPORT FROM 'C:\c\Customers.abf' WITH PASSWORD='mike';

DROP MINING STRUCTURE customers;
DROP MINING STRUCTURE customersNested;
DROP MINING STRUCTURE customersRandom;

REFERENCE


Data Mining with Microsoft SQL Server 2008 is an excellent resource to learn more about data mining.  If you want to learn more about data mining with the Microsoft BI stack, this is a MUST read.

——————————————————————————-

This is a recording of my screen and voice during a presentation on the DMX Data Mining Extensions available in Microsoft SQL Server at SQL Saturday in Jacksonville, Florida on April 27th, 2013.

Data Mining with DMX from Mike Milligan on Vimeo.

——————————————————————————-


Generational Changes to Enterprise Data Warehouse Architecture

I had been on the job only three weeks before I began work on my secret plans.  It started out as just a simple BUS matrix but very quickly became a drill-down BUS matrix of epic proportions.  I knew from the outset that if my schemas were ever to hit a database I would need to be clever, patient, and maybe even a little cunning.  I would have to keep my cards close to my chest, revealing my intentions only in bits and easily digestible bytes, sometimes as if they were not even of my own creation.

Prior to this gig I had been working as a hired gun, traveling the nation helping those in need of business intelligence consulting.  My particular specialty was the model.  It was always about the model in my views.  Actually, I prefer a physical model where the views merely reflect those structures; but, I digress.  I don’t know how many times I have seen a solution that relies on massive contortions of MDX to accomplish what would have been more scalable and elegant by simply cooking up another fact table to support that specific type of analysis.  Working as a consultant was a lot of fun and extremely challenging.  I am grateful for the opportunity I had to really make a difference on so many teams and projects.  However, after some time, the length of the journeys, the hardships of the trail, and my longing to be with my family led me to become a FTE at a billion dollar conglomeration of almost twenty individual companies.

Being a full time employee was a completely different role in my case.  No longer was I being asked to evaluate existing architectures and recommend best practice approaches as my primary function.  Initially, my priorities were mainly focused on immediate business needs and technical challenges.  In the first few months there were several challenging tasks outside of my comfort zone.  I couldn’t even estimate the distance in which I managed to herd an entire group of databases of an instance from one piece of hardware to another.  I built the structures for a 3 tier SharePoint farm (including PowerPivot) from my own bare hands, eight fingers, and two thumbs.  I implemented the processes and structures required to track prices and at least five ways to measure cost historically only after an elusive hunt for the mysterious “actual actual cost.”

Through it all I quietly kept working on my master plan.  I planted the seeds for its growth by mentioning what I refer to as “Kimballisms” (most folks refer to them as “Design Tips”) every chance I got.  I recited them like a preacher’s boy recites passages from the Bible, especially the dimensional modeler’s mantra:  “Ease of use, Query Performance!”  At the water cooler, I’d say things like, “If only we could drill across via some sort of conformed dimension.”  During the U.S. presidential election I expressed my hope that Ralph Kimball would run.  I pondered aloud how the world would be different if Kimball had designed all of Apple’s products.  “Load data at its most atomic level!” I’d mutter to myself as I wandered through the hallways.  It even got to the point that I’d attribute almost any words of wisdom to the man: “Kimball says not to go in the water for at least 30 minutes after eating.”

One day, an opening appeared.  My manager, whom I’ll refer to as Mr. Scratch, asked me to saddle up and produce a prototype PerformancePoint dashboard that would combine information about sales and collections.  I rassled up a prototype lickity split; but, there was a big problem.  The billing information was sitting pretty on a SQL Server Analysis Services database, while the sales roster was in a completely different database.  Furthermore, variations on similar dimensions complicated the matter such that the prototype had to have two different drop-downs for basically the same information.  In one, the clerk number was in front of the storekeeper’s name, and in the other it was after the name!  This was the perfect opportunity to talk about conformed dimensions with Mr. Scratch!  Much to my surprise, he took to the idea like a cat to fish, so much so that I think he thought it was his own.  I don’t want to downplay the significance of this event.  Many existing reports and user-generated solutions would become obsolete and have to be rewritten.  This was no trivial endeavor to suggest.  My plan was taking shape!  The seeds had sprouted and the roots were taking hold.

I couldn’t reveal my entire strategy all at once.  I had to bide my time and proceed carefully.  The existing data warehouse was created through the massive efforts of an extremely effective albeit small team.  They had gone through the painful trials and tribulations of understanding disparate legacy source systems and reverse engineering their business logic through painful iterations involving buckets of blood, sweat, and tears.  They had engaged the business and delivered analytic capabilities that had been unimaginable prior to the data warehouse.  The corporate buy-in was tremendous and the team had become the champions of the organization long before I had arrived.  They had bravely fought the good fight and won the hearts and minds of the entire corporation.  As can be expected, any suggestion about improvements had to be made with cautious candor.

It’s not hard to be tall when you get to stand on the shoulders of giants.  Their existing solution was better than at least three quarters of all the solutions I had ever seen.  They had avoided many of the common pitfalls and gotchas of data warehouse and OLAP cube development.  They were properly managing a type two slowly changing dimension.  Their robust solution was working, reliable, trusted, and respected by the organization.  Important decisions were made every day based on the data provided by the data warehouse.

Eventually, it came time to share my proposed BUS matrix and drill-down BUS matrices with the data warehouse manager.  Mr. Scratch played the devil’s advocate like a well-worn fiddle.  We sat down to let him throw rocks at my model and had a rather heated debate on the subject of snowflake vs. star schema designs as he played his fiddle.  My first attempt was to cite Kimball design tip #105, “Snowflakes, Outriggers, and Bridges.”  In this tip modelers are encouraged “to handle many to one hierarchical relationships in a single dimension rather than snowflaking.”  Next, I referenced Chris Adamson’s Star Schema Central blog article Star vs. Snowflake.  Mr. Scratch was well versed in the debate strategies for refuting expert testimony and would have none of it.  That man sure can play a fiddle.  By the end of the discussion, I was convinced I would have to work something up that would measurably prove once and for all that the star schema performed better than a snowflake.  I tried, but let’s just say that the results were inconclusive.  I personally still stand by Kimball and Adamson and I will accept their expert testimony.  For more on this subject, please listen to or read the transcript of the recent SQL Down Under podcast with guest Erin Welker, “Dimensional Modeling.”  I don’t want to beat a dead horse here, so on with the story.

The next day, my manager went down to Georgia, and the discussion was eventually forgotten.  He never held me to a snowflake design pattern and we ended up with a strict star schema.  So here we are today, months after that initial design session and one week after an intense closed beta period.  Tomorrow marks the beginning of open beta, and in three to four weeks the new data warehouse and cube will replace the old.

Looking over the summary of changes as I create a PowerPivot presentation for my manager, I can reflect on all of the important design decisions that were made over the past weeks.  Most in business intelligence are familiar with Kimball’s concept of iterative changes and a dimensional lifecycle in which incremental improvements are implemented in the data warehouse on a periodic, repeating pattern.  These can include additional fact tables for existing data marts, changes to dimensions, and additional calculated measures.  Although I can’t remember exactly where, I do remember reading that a data warehouse isn’t considered mature until it has undergone its third generation.  Generational change occurs when the solution is effectively torn down and rebuilt from scratch.  This process uncovers unknown bugs and issues that may have gone undetected for years.  It forces one to rethink everything that has gone before, at least with respect to the path from the ODS to the data warehouse in this situation.

On the relational data warehouse side we went from 54 tables to 30.  Tables are differentiated by schemas that represent the business process they model.  Conformed dimensions are placed in a COMMON schema.  This was achieved primarily through the consolidation of related dimensions as attributes into a main dimension.
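
As a rough illustration of that layout (the COMMON schema comes from the actual design; the business-process schema, table, and column names below are hypothetical stand-ins), the relational side is organized something like this:

-- Hypothetical sketch of the schema layout; only COMMON comes from the actual design
CREATE SCHEMA COMMON;
GO
CREATE SCHEMA SALES;	-- hypothetical business-process schema
GO

-- A conformed dimension shared by every business process
CREATE TABLE COMMON.DimStorekeeper (
	StorekeeperKey INT IDENTITY(1, 1) PRIMARY KEY
	,ClerkNumber INT NOT NULL
	,StorekeeperName NVARCHAR(100) NOT NULL
	);

-- Fact tables live in the schema of the business process they model
CREATE TABLE SALES.FactSales (
	StorekeeperKey INT NOT NULL REFERENCES COMMON.DimStorekeeper (StorekeeperKey)
	,SalesAmount MONEY NOT NULL
	);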

On the Analysis Services side, we started with four databases containing one or two cubes each, with one fact table per cube.  Now we will have one Analysis Services database, one cube, and six measure groups (three of which provide brand-new analytic capabilities.)  We reduced the number of dimensions by about half, primarily through consolidation.  We added several new dimensions and attributes, and we have added time intelligence capabilities.

[Figure: Old version – Snowflake Schema Design]

[Figure: New version – Star Schema Design]

List of Significant Architectural Design Changes

  • Consolidated header/detail fact tables into one by allocating header values to the lower granularity (see the sketch after this list)
  • Conformed common dimensions
  • Consolidated dimensions
  • Added a new type of fact table with data that was not available before, an accumulating snapshot to track inventory levels
  • Added a new fact table to track prices and costs of products historically (and all of the components that make up those calculations I might add.)
  • Added a new fact table to track goals to actuals
  • Added attributes to represent discretization buckets
  • Proper cased the data elements themselves (everything was in all caps from the legacy systems and it’s much more purdy now)
  • Added several new behavioral attributes to existing dimensions
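
Here is a minimal sketch of the header allocation mentioned in the first bullet (the staging table and column names are hypothetical, not the actual warehouse objects): a header-level charge such as freight is spread across the order’s line items, weighted by each line’s share of the extended amount, so the consolidated fact table can live entirely at line-item grain.

-- Hypothetical illustration: allocate a header-level freight charge down to line-item grain
SELECT d.OrderNumber
	,d.LineNumber
	,d.ExtendedAmount
	,h.FreightAmount * d.ExtendedAmount
		/ SUM(d.ExtendedAmount) OVER (PARTITION BY d.OrderNumber) AS AllocatedFreight
FROM dbo.StageOrderDetail AS d
INNER JOIN dbo.StageOrderHeader AS h
	ON h.OrderNumber = d.OrderNumber;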

Next on the Agenda

  • Introduce page-level database compression (see the sketch after this list)
  • Introduce relational partitioning
  • Implement automatic analysis services partitioning mirroring the relational partitioning
  • Installation of an n-tier SharePoint 2013 farm
  • Creation of PerformancePoint 2013 dashboards
  • Schedulable forecasting using the DMX data mining query language and SSIS
  • Migration to SQL Server 2012 Reporting Services running in SharePoint integrated mode
  • Take advantage of new capabilities in PowerView and PowerPivot
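
For the first two agenda items, the T-SQL involved is fairly small.  A rough sketch, using the hypothetical SALES.FactSales table from the earlier example and made-up partition boundaries, would start something like this:

-- Hypothetical sketch: page compression plus a monthly partition function and scheme
ALTER TABLE SALES.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);
GO

CREATE PARTITION FUNCTION pfMonthly (INT)
	AS RANGE RIGHT FOR VALUES (20130101, 20130201, 20130301);
GO

CREATE PARTITION SCHEME psMonthly
	AS PARTITION pfMonthly ALL TO ([PRIMARY]);
GO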

I hope you’ve enjoyed this campfire tale.  I thought it might be good to share.  I’m tired of typing.  I’ve got blisters on my fingers!  My digits are weary and I can no longer see the forest for the trees.  Imagine me riding my old mustang off into the sunset, playing a harmonica.

 

New Year’s Resolutions – Professional Goals

Happy New Year!  Here are my goals:

1.  Reboot

2.  Avoid Snowflakes!

3.  Learn more data mining and the DMX query language (and blog about it.)

4.  Learn more about Master Data Services (and blog about it.)

5.  Become a master at MDX

6.  Architect an enterprise solution in which Contoso buys AdventureWorks and Northwind (and blog about it.)

7.  One presentation per quarter and at least one really good blog entry per month

8.  Mentor someone outside of the company I work for

9.  Keep a personal log in which I will plan each day and reflect on the previous day every morning.

10.  Publish Segment Analyzer on Codeplex (and blog about it.)

——————————————————————————-

Segment Analyzer is an idea I have to help analyze various customizable groups of customers, products, or any other type of dimension in an Analysis Services cube.  The main force behind the solution is a .NET application that can run against any data warehouse that uses a standard table and column naming scheme.  It won’t matter what the naming scheme is as long as it is consistent; the naming convention itself would be configurable.

An analyst uses a GUI to create criteria for the segments to be analyzed.  The criteria are based on dimensional attributes and on aggregated and non-aggregated facts, and they support complex parenthetical and/or logic at many levels.  Drop-downs are populated from the distinct values in the dimensional attributes, with mathematical operators offered for the numeric information.  Which attributes and which metrics are featured in the interface is handled through a one-time configuration process.  Segments can have parent segments, with the criteria of the parent inherited by children to any depth.  This will be accomplished by parsing the segment criteria into dynamic SQL that populates bridge tables to support many-to-many dimensions in Analysis Services.  The segment dimension itself uses the parent-child hierarchy feature in SSAS.  Two known query performance killers, I know; but, the features would only be used on specialized cubes or databases built just for this purpose.

Examples of customer segments from AdventureWorks:

Customers that visited the store within the last six months who are female, are between 16 and 24 years old, live within 25 miles of the store and have spent at least $100 in the past six months and redeemed two promotional coupons.

Customers that visited the store within the last year, are male, live in the 12345 zip code, own one car, spent at least $200 in the past year, and redeemed 0 promotions.
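
As a minimal sketch of the mechanics (every object name here is hypothetical, not part of an actual implementation), the parsed criteria for a segment would be turned into dynamic SQL that loads a bridge table relating the segment to its qualifying customers, which SSAS can then expose through a many-to-many dimension:

-- Hypothetical bridge table relating segments to customers (many-to-many)
CREATE TABLE dbo.BridgeCustomerSegment (
	SegmentKey INT NOT NULL
	,CustomerKey INT NOT NULL
	,CONSTRAINT PK_BridgeCustomerSegment PRIMARY KEY (SegmentKey, CustomerKey)
	);
GO

-- Dynamic SQL generated from the parsed criteria of one segment
-- (a simplified cut of the first example above)
DECLARE @SegmentKey INT = 1;
DECLARE @strSQL NVARCHAR(MAX) = N'
	INSERT INTO dbo.BridgeCustomerSegment (SegmentKey, CustomerKey)
	SELECT @SegmentKey, c.CustomerKey
	FROM dbo.DimCustomer AS c
	WHERE c.Gender = N''F''
		AND c.Age BETWEEN 16 AND 24
		AND c.MilesFromStore <= 25;';

EXEC sp_executesql @strSQL, N'@SegmentKey INT', @SegmentKey = @SegmentKey;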

SSRS Gotcha: Multi-value Parameter Design Patterns

Adding a multi-value parameter to a report is so easy in SSRS, as long as you only want one.  If you want more than one, things get a whole lot more complicated.  For example, inline SQL is no longer even an option as far as I can tell; you must create a stored procedure to handle the request.

Recently, I had the pleasure of assisting a cohort with the following dilemma: a stored procedure would return the correct results when run as inline SQL (i.e., commenting out the CREATE PROCEDURE code); but, incorrect results came back when the stored procedure was executed through SSRS.

I ran a profiler trace and it appeared at first glance as if the SSRS engine was sending the same parameters as the inline SQL.  Example: exec rpt_someSillyCode @variable=N'this,that,the other'

Notice the N?  That prefix marks the string literal as NVARCHAR (Unicode).

It turns out the situation was caused by the fact that, in his procedure declaration, the parameter length was defined as 10 characters; but, when he ran the code as inline SQL, his test variables were declared as 20 characters.  The stored procedure was silently truncating the comma-separated list it received from SSRS.
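
A minimal repro of that failure mode (the procedure name and column alias are made up for the demo): T-SQL silently truncates an argument that is longer than the declared parameter length, so the tail of the multi-value list quietly disappears.

-- Hypothetical demo of silent parameter truncation
CREATE PROCEDURE dbo.rpt_TruncationDemo
	@variable NVARCHAR(10)	-- too short for the full multi-value list
AS
BEGIN
	SELECT @variable AS WhatTheProcActuallySees;
END
GO

EXEC dbo.rpt_TruncationDemo @variable = N'this,that,the other';
-- Returns 'this,that,' -- the rest of the list is silently lost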

The whole ordeal reminded me of the fact that multi-value parameters can be a real pain so I wanted to document what I know.

There are at least three different design patterns (that I know about) when dealing with multi-value parameters in SSRS.

——————————————————————————-

Pattern #1, Dynamic SQL:

Example:

-- @variable is the multi-value parameter sent from SSRS, e.g. N'this,that,the other'
DECLARE @strSQL NVARCHAR(MAX);

-- Wrap each value in single quotes: this,that becomes 'this','that'
SET @variable = CHAR(39) + REPLACE(@variable, ',', CHAR(39) + ',' + CHAR(39)) + CHAR(39);

SET @strSQL = 
	'SELECT SUM(stuff) AS StuffSum
	 INTO #ReportTempTable
	 FROM #ADifferentTempTable
	 WHERE VariableColumn IN (' + @variable + ')';

--PRINT @strSQL  -- handy for checking the generated statement
EXECUTE (@strSQL);
-- Note: a temp table created inside EXECUTE() is dropped when that dynamic
-- batch ends, so any further work on #ReportTempTable belongs inside @strSQL.

——————————————————————————-

Design Pattern #2, Use a function to parse the multi-value:

-- Splits a delimited string into one row per value
CREATE FUNCTION dbo.SplitParameterValues (
	@InputString NVARCHAR(MAX)
	,@SplitChar NCHAR(1)	-- single-character delimiter, e.g. ','
	)
RETURNS @ValuesList TABLE (param NVARCHAR(255))
AS
BEGIN
	DECLARE @ListValue NVARCHAR(MAX)

	-- Append a trailing delimiter so the last value is captured by the loop
	SET @InputString = @InputString + @SplitChar

	WHILE @InputString <> ''
	BEGIN
		-- Everything up to the next delimiter is one value
		SELECT @ListValue = SUBSTRING(@InputString, 1, CHARINDEX(@SplitChar, @InputString) - 1)

		INSERT INTO @ValuesList (param)
		SELECT @ListValue

		-- Drop the captured value (and its delimiter) from the front of the input
		SELECT @InputString = SUBSTRING(@InputString, CHARINDEX(@SplitChar, @InputString) + 1, 
				LEN(@InputString) - CHARINDEX(@SplitChar, @InputString))
	END

	RETURN
END
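
Inside the report’s stored procedure, the function is then used something like this (the fact table and column names are placeholders):

-- Hypothetical usage of the split function inside the report procedure;
-- @variable is the multi-value parameter passed in from SSRS
SELECT SUM(f.stuff) AS StuffSum
FROM dbo.SomeFactTable AS f
WHERE f.VariableColumn IN (
		SELECT param
		FROM dbo.SplitParameterValues(@variable, ',')
		);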

——————————————————————————-

Design Pattern #3, Parse the multi-value parameters into a #temp table and join to it

This is done in a similar fashion to the function, except the parsing is done right in the stored procedure.  This can be used in conjunction with an ‘All’ member.
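
A minimal sketch of this pattern, with made-up object names and an assumed ‘All’ convention, might look like the following:

-- Hypothetical sketch: parse the multi-value parameter into a #temp table
-- inside the stored procedure itself, then join to it.
-- @variable is the multi-value parameter from SSRS, as in Pattern #1.
CREATE TABLE #SelectedValues (Value NVARCHAR(255) NOT NULL);

IF @variable = N'All'
BEGIN
	-- An 'All' member short-circuits the parsing entirely
	INSERT INTO #SelectedValues (Value)
	SELECT DISTINCT VariableColumn FROM dbo.SomeFactTable;
END
ELSE
BEGIN
	-- Walk the comma-separated list and insert one row per value
	DECLARE @list NVARCHAR(MAX) = @variable + N',';

	WHILE @list <> ''
	BEGIN
		INSERT INTO #SelectedValues (Value)
		SELECT SUBSTRING(@list, 1, CHARINDEX(',', @list) - 1);

		SET @list = SUBSTRING(@list, CHARINDEX(',', @list) + 1, LEN(@list));
	END
END

-- Join to the parsed values instead of using IN ( ... )
SELECT SUM(f.stuff) AS StuffSum
FROM dbo.SomeFactTable AS f
INNER JOIN #SelectedValues AS s
	ON s.Value = f.VariableColumn;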

——————————————————————————-

Click here to download an SSRS solution and related SQL that demonstrates all three design patterns.

——————————————————————————-

The whole multi-value parameter thing can be a huge problem once you dig into it.  Check the links below for the pages that really helped me see the light.  Many think Microsoft should have given developers the option to hide the ‘Select All’ option and provide their own ‘All’ solution (see this connect link.)  This would prevent SSRS from sending a huge comma-separated string of values that basically means all of my values, as opposed to sending just the word ‘everything.’  The clever technicians behind the links below have devised a multitude of ways around this.

Related Links:

How to Default to ALL in an SSRS Multi-select Parameter

http://dataqueen.unlimitedviz.com/2011/06/how-to-default-to-all-in-an-ssrs-multi-select-parameter/

Using Multi-Select in SSRS

http://jsimonbi.wordpress.com/2011/01/15/using-multi-select-with-stored-procedures/

Passing multi-value parameter in stored procedure (SSRS report)

http://munishbansal.wordpress.com/2008/12/29/passing-multi-value-parameter-in-stored-procedure-ssrs-report/

——————————————————————————-

Create a table-valued function in SQL as below

Step 1

CREATE FUNCTION [dbo].[FnSplit] (
	@List NVARCHAR(2000)
	,@SplitOn NVARCHAR(5)
	)
RETURNS @RtnValue TABLE (
	Id INT identity(1, 1)
	,Value NVARCHAR(100)
	)
AS
BEGIN
	WHILE (Charindex(@SplitOn, @List) > 0)
	BEGIN
		INSERT INTO @RtnValue (value)
		SELECT Value = ltrim(rtrim(Substring(@List, 1, Charindex(@SplitOn, @List) - 1)))

		SET @List = Substring(@List, Charindex(@SplitOn, @List) + len(@SplitOn), len(
					@List))
	END

	INSERT INTO @RtnValue (Value)
	SELECT Value = ltrim(rtrim(@List))

	RETURN
END

Step 2: in your stored procedure, change the parameter WHERE condition to something like below

CREATE PROCEDURE [dbo].[SomeSP]
	-- Add the parameters for the stored procedure here 
	@CostCentre NVARCHAR(255)
AS
BEGIN
	SELECT [ProjectCode]
		,[ProjectName]
		,[ProjectManager]
		,SUM([Hours]) AS [Hours MTD]
	FROM dbo.Rpt_NRMA_CATS NC
	INNER JOIN PeriodID P
		ON NC.PeriodID = P.PeriodID
	WHERE ([CostCentre]) COLLATE database_default IN (
			SELECT Value
			FROM dbo.FnSplit(@CostCentre, ',')
			)
	-- GROUP BY is required because of the SUM() aggregate
	GROUP BY [ProjectCode]
		,[ProjectName]
		,[ProjectManager]
END

Thanks Desari!

——————————————————————————-

Copied and pasted from: http://social.msdn.microsoft.com/Forums/en-US/sqlreportingservices/thread/0ead7ceb-3fdd-4625-aa82-1d4195f984b1/

——————————————————————————-

Know another way?  Please leave a comment.