The Analyze Key Influencers tool shows how column values in a data set might determine the values of a specified target column. The process creates a temporary mining model in Microsoft SQL Server Analysis Services using the Naïve Bayes algorithm. It then produces a Main Influencers report, which lists the key influencers for each distinct value of the target column. You also have the option of creating one or more Discrimination Reports, each of which compares the influencers for any two distinct values of the target column. The Discrimination Reports are only useful if your target column contains more than two distinct states.
The Naïve Bayes algorithm is a simple probabilistic classifier based on applying Bayes’ theorem with strong independence assumptions. The naïve part of the name comes from the fact that it assumes the attributes are unrelated to each other once the target value is known, so that each attribute contributes independently to the probabilities that it predicts. For example, a fruit may be considered an orange if it is round, has the color orange, has seeds, grows on a tree, etc. Even if some of these features depend on the existence of other features, a Naïve Bayes classifier treats each of them as contributing independently to the probability that the fruit is an orange. One advantage of this algorithm is that it requires only a relatively small set of data to estimate the parameters needed for classification (means and variances for continuous variables, value frequencies for discrete ones).
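To make the independence assumption concrete, here is a minimal sketch of a Naïve Bayes classifier for discrete attributes in Python. It is not the Analysis Services implementation; the function names, the add-one smoothing, and the tiny fruit data set are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict
import math

def train_naive_bayes(rows, target):
    """Count how often each class and each (attribute, value) per class occurs."""
    class_counts = Counter(row[target] for row in rows)
    value_counts = defaultdict(Counter)  # (attribute, class) -> Counter of values
    values_seen = defaultdict(set)       # attribute -> distinct values in the data
    for row in rows:
        for attr, value in row.items():
            if attr == target:
                continue
            value_counts[(attr, row[target])][value] += 1
            values_seen[attr].add(value)
    return class_counts, value_counts, values_seen

def predict(model, features):
    """Return a probability for each class, given a dict of attribute values."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    log_scores = {}
    for cls, n_cls in class_counts.items():
        score = math.log(n_cls / total)  # prior P(class)
        for attr, value in features.items():
            # The "naive" step: each attribute contributes its own factor
            # P(value | class), regardless of the other attributes.
            count = value_counts[(attr, cls)][value]
            k = len(values_seen[attr])
            score += math.log((count + 1) / (n_cls + k))  # add-one smoothing
        log_scores[cls] = score
    # Turn log scores back into probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {cls: math.exp(s - top) for cls, s in log_scores.items()}
    norm = sum(weights.values())
    return {cls: w / norm for cls, w in weights.items()}

# Hypothetical toy data, invented for this example.
fruits = [
    {"shape": "round", "color": "orange", "fruit": "orange"},
    {"shape": "round", "color": "orange", "fruit": "orange"},
    {"shape": "round", "color": "green", "fruit": "orange"},
    {"shape": "long", "color": "yellow", "fruit": "banana"},
    {"shape": "long", "color": "green", "fruit": "banana"},
    {"shape": "round", "color": "red", "fruit": "apple"},
    {"shape": "round", "color": "green", "fruit": "apple"},
]

model = train_naive_bayes(fruits, "fruit")
posterior = predict(model, {"shape": "round", "color": "orange"})
print(max(posterior, key=posterior.get))
```

Each attribute simply multiplies in its own factor, which is why the model stays small and quick to train even when there are many columns.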
This blog post will work through two examples using the sample data provided with the Microsoft SQL Server 2012 Data Mining Add-ins and another example using data from the Contoso sample database.
Which properties of a customer in the sample data help to predict a customer’s level of education?
- Open the DMAddins_SampleData.xlsx file.
- Select the Table Analysis Tools Sample sheet, click a cell within the table so that the Table Tools, Analyze ribbon appears at the top, and click the Analyze Key Influencers button.
- Select the column Education to analyze for key factors and click the link that says ‘Choose columns to be used for analysis.’
- Uncheck the ID column. This is just a sequential number that reflects nothing more than the order in which the row was inserted into the table. We also want to uncheck any other columns that have nothing to do with the customer’s education level, to streamline our analysis and improve our accuracy. Let’s also uncheck the Purchased Bike column. Click OK, and then Run.
- Once it finishes thinking, move the Discrimination based on key influencers dialog out of the way for a moment.
The Key Influencers Report for Education shows which columns, and which values of those columns, have a significant impact on the value of the Education column. According to this report, people between the ages of 37 and 46 who work in Management are very likely to have a bachelor’s degree. People who have only one car and work in a clerical profession are very likely to have attended only some college. People with two cars who work in a manual occupation and earn less than about 39K per year are likely to have attended only high school. Similar characteristics apply to those who received only a partial high school education. People who do not own an automobile are very likely to have completed a graduate degree.
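One way to picture what a report like this is measuring is to ask, for every column value, how much more often a target state occurs among rows with that value than it does overall. The sketch below does that in Python. The `key_influencers` function, the `min_support` cut-off, the ratio used as a score, and the eight made-up customer rows are all assumptions for illustration; the add-in computes its own relative impact score from the Naïve Bayes model.

```python
from collections import Counter

def key_influencers(rows, target, min_support=2):
    """Rank (column, value, state) by P(state | value) / P(state)."""
    total = len(rows)
    state_counts = Counter(row[target] for row in rows)
    value_counts = Counter()  # (column, value) -> number of rows
    pair_counts = Counter()   # (column, value, state) -> number of rows
    for row in rows:
        for col, val in row.items():
            if col == target:
                continue
            value_counts[(col, val)] += 1
            pair_counts[(col, val, row[target])] += 1

    results = []
    for (col, val, state), n in pair_counts.items():
        support = value_counts[(col, val)]
        if support < min_support:
            continue  # too few rows to say anything about this value
        conditional = n / support             # P(state | column = value)
        baseline = state_counts[state] / total  # P(state)
        results.append((col, val, state, conditional / baseline))
    results.sort(key=lambda item: item[3], reverse=True)
    return results

# Hypothetical rows, invented for this example (not the add-in sample data).
customers = [
    {"Cars": 0, "Occupation": "Professional", "Education": "Graduate Degree"},
    {"Cars": 0, "Occupation": "Management", "Education": "Graduate Degree"},
    {"Cars": 1, "Occupation": "Clerical", "Education": "Partial College"},
    {"Cars": 1, "Occupation": "Clerical", "Education": "Partial College"},
    {"Cars": 2, "Occupation": "Manual", "Education": "High School"},
    {"Cars": 2, "Occupation": "Manual", "Education": "High School"},
    {"Cars": 1, "Occupation": "Management", "Education": "Bachelors"},
    {"Cars": 2, "Occupation": "Management", "Education": "Bachelors"},
]

for col, val, state, lift in key_influencers(customers, "Education"):
    print(f"{col} = {val} favors {state} (lift {lift:.2f})")
```

A lift well above 1 means the value points toward that state; a lift near 1 means the column value tells you little about it.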
Now, back to the Discrimination report dialog that we moved out of the way. Let’s run a discrimination report that compares those with graduate degrees against those who received only a partial high school education.
We can add as many discrimination reports as we want.
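The idea behind a discrimination report can be sketched as follows: keep only the rows for the two chosen states, then for each column value compare how common it is in one state versus the other. This is a rough Python illustration, not the add-in's actual calculation. The `discriminate` function, the add-one smoothing, the log-ratio score, and the sample rows are all assumptions.

```python
import math
from collections import Counter

def discriminate(rows, target, state_a, state_b):
    """List column values that favor state_a or state_b, strongest first."""
    counts = {state_a: Counter(), state_b: Counter()}
    totals = Counter()
    for row in rows:
        state = row[target]
        if state not in counts:
            continue  # rows in any other state are ignored
        totals[state] += 1
        for col, val in row.items():
            if col != target:
                counts[state][(col, val)] += 1

    report = []
    for key in set(counts[state_a]) | set(counts[state_b]):
        # Add-one smoothing (an assumption) avoids dividing by zero when a
        # value never occurs in one of the two states.
        p_a = (counts[state_a][key] + 1) / (totals[state_a] + 2)
        p_b = (counts[state_b][key] + 1) / (totals[state_b] + 2)
        score = math.log(p_a / p_b)
        if score == 0:
            continue  # equally common in both states, so it favors neither
        favors = state_a if score > 0 else state_b
        report.append((key[0], key[1], favors, abs(score)))
    report.sort(key=lambda item: item[3], reverse=True)
    return report

# Hypothetical rows, invented for this example.
rows = [
    {"Cars": 0, "Occupation": "Professional", "Education": "Graduate Degree"},
    {"Cars": 0, "Occupation": "Management", "Education": "Graduate Degree"},
    {"Cars": 1, "Occupation": "Professional", "Education": "Graduate Degree"},
    {"Cars": 2, "Occupation": "Manual", "Education": "Partial High School"},
    {"Cars": 2, "Occupation": "Manual", "Education": "Partial High School"},
    {"Cars": 1, "Occupation": "Clerical", "Education": "Partial High School"},
    {"Cars": 1, "Occupation": "Clerical", "Education": "Bachelors"},
]

report = discriminate(rows, "Education", "Graduate Degree", "Partial High School")
for col, val, favors, strength in report:
    print(f"{col} = {val} favors {favors} ({strength:.2f})")
```

Because only two states are compared at a time, each pair of states you care about needs its own run, which mirrors adding another report in the dialog.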
The Table Analysis Tools Sample worksheet contains only 1,000 rows. When we go through the exact same steps on the Source Data sheet, which has 10,000 rows, we get remarkably similar results.
Next, I’ll run the tool to see what factors most strongly influence whether or not the customer is likely to purchase a bike.
- Give the Source Data worksheet focus. Click the Analyze Key Influencers button.
- Select BikeBuyer as the column to analyze. Uncheck ID from the columns to analyze and run the analysis.
- Go ahead and run a Discrimination report against the Yes/No values. This will demonstrate that this report is useless for target columns with only two values.
The Key Influencers Report for BikeBuyer shows us that the strongest predictors that a customer will purchase a bike are not owning any cars and being between the ages of 36 and 46. The strongest predictors that a customer will not buy a bike are owning two cars and being 64 or older.
The discrimination report shows us essentially the same thing.
For the next example, I have imported the V_Customer view from the Contoso Retail demo database which you can download from Microsoft.
If you import the data using the From Other Sources button on the Data ribbon, Excel will automatically format it as a table, which the tool requires. If you import your data from a CSV file or copy and paste it into a spreadsheet, it may not be formatted as a table.
- Once the data is in Excel and formatted as a table, click the Analyze Key Influencers button and select HouseOwnerFlag as the column to analyze.
- Click the ‘Choose columns to be used for analysis’ link, uncheck CustomerKey and Consumption, and run the analysis.
Here we see that MaritalStatus has the strongest influence on the value of HouseOwnerFlag. We also see that not having any children is a strong indicator of not owning a home.
I hope this explains how to use the Analyze Key Influencers tool sufficiently. If you have any questions, please use the comments section below.
Here are some additional links: