Tutorial: Association Rules
In this tutorial we will first look at association rules,
using the APRIORI algorithm in Weka.
APRIORI works with categorical values only, so we will
use a different dataset called "adult". This dataset contains
census data about 48,842 US adults; the aim is to predict whether their income
exceeds $50,000. The dataset is taken from the Delve website, and originally
came from the UCI Machine Learning Repository. More information about it is
available in the original UCI documentation.
Download a copy of adult.arff and load it into Weka.
This dataset is not immediately ready for use with
APRIORI. First, reduce its size by taking a random sample. You can do this with
the 'ResampleFilter' in the preprocess tab sheet: click on the label under
'Filters', choose 'ResampleFilter' from the drop down menu, set the
'sampleSizePercentage' (e.g. to 15), click 'OK' and 'Add', and click 'Apply
Filters'. The 'Working relation' is now a subsample of the original adult
dataset. Now we have to get rid of the numerical attributes. You can choose to
discard them, or to discretise them. We will discretise the first attribute
('age'): choose the 'DiscretizeFilter', set 'attributeIndices' to 'first', bins
to a low number, like 4 or 5, and the other options to 'False'. Then add this
new filter to the others. We will get rid of the other numerical attributes:
choose an 'AttributeFilter', set 'invertSelection' to 'False', and enter the
indices of the remaining numeric attributes (3,5,11-13). Apply all the filters
together now. Then click on 'Replace' to make the resulting 'Working relation'
the new 'Base relation'.
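Outside Weka, the same two preprocessing steps (random subsampling and equal-width discretisation) can be sketched in plain Python. This is an illustrative sketch only: the function names and the toy ages below are made up for the example and are not part of the adult dataset.

```python
import random

def resample(rows, percentage, seed=1):
    """Take a random sample of the given size in percent,
    like Weka's ResampleFilter with sampleSizePercentage."""
    random.seed(seed)
    k = max(1, round(len(rows) * percentage / 100))
    return random.sample(rows, k)

def discretise(values, bins=4):
    """Equal-width binning of a numeric attribute into a small
    number of labelled intervals (cf. the DiscretizeFilter)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # guard against all-equal values
    labels = []
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp max into last bin
        labels.append(f"({lo + idx * width:.1f}-{lo + (idx + 1) * width:.1f}]")
    return labels

# Toy 'age' values, discretised into 4 bins
ages = [17, 23, 35, 42, 58, 64, 71, 90]
print(discretise(ages, bins=4))
```

After binning, every attribute is categorical, which is the precondition APRIORI needs.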
Now go to the 'Associate' tab sheet and click under
'Associator'. Set 'numRules' to 25, and keep the other options on their
defaults. Click 'Start' and observe the results. What do you think about these
rules? Are they useful?
From the previous results, it is clear that some attributes
should not be examined simultaneously, because together they lead to trivial rules. Go
back to the 'Preprocess' sheet. If you have replaced the original 'Base
relation' by the 'Working relation', you can include and exclude attributes
very easily: delete all filters from the 'Filters' window, then remove the
check mark next to the attributes you want to get rid of and click 'Apply
Filters'. You now have a new 'Working relation'. Try to remove different
combinations of the attributes that lead to trivial association rules. Run
APRIORI several times and look for interesting rules. You will find that there
is often a whole range of rules which are all based on the same simpler rule.
Also, you will often get rules that don't include the target class. This is why
in most cases you would use APRIORI for dataset exploration rather than for
predictive modelling.
Exercise 2
Association
analysis is concerned with discovering interesting correlations or other
relationships between variables in large databases. We are interested in
relationships between the features themselves, rather than between features and
class as in the standard classification problem setting. Searching for
association patterns is therefore like classification, except that instead of
predicting just the class, we try to predict arbitrary attributes or attribute
combinations.
1. Fire up the Weka software, launch the
Explorer window and select the "Preprocess" tab. Open the weather.nominal
dataset ("weather.nominal.arff"; it should be in the ./data/ directory
of the Weka install).
2. Often we search for association rules
showing attribute-value conditions that occur frequently together in a given
set of data, such as: buys(X, "computer") & buys(X, "scanner") =>
buys(X, "printer") [support = 2%, confidence = 60%], where confidence and support
are measures of rule interestingness. A support of 2% means that 2% of all
transactions under analysis show that computer, scanner and printer are
purchased together. A confidence of 60% means that 60% of the customers who
purchased a computer and a scanner also bought a printer. We are interested
in association rules that apply to a reasonably large number of instances and
have a reasonably high accuracy on the instances to which they apply.
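The definitions above can be checked directly on a handful of toy transactions. The shopping baskets below are invented for illustration; they mirror the buys(X, ...) example rather than any real dataset.

```python
# Five toy transactions (shopping baskets) as sets of items.
transactions = [
    {"computer", "scanner", "printer"},
    {"computer", "scanner"},
    {"computer", "printer"},
    {"scanner"},
    {"computer", "scanner", "printer"},
]

def support(itemset, transactions):
    """Fraction of all transactions containing every item in the set."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Rule: {computer, scanner} => {printer}
print(support({"computer", "scanner", "printer"}, transactions))       # 2/5 = 0.4
print(confidence({"computer", "scanner"}, {"printer"}, transactions))  # 0.4/0.6 ≈ 0.667
```

So this rule has 40% support and about 67% confidence on the toy data, exactly the two quantities Weka reports for each rule.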
Weka has three built-in association rule learners: "Apriori",
"Predictive Apriori" and "Tertius". However, they are not capable of
handling numeric data, which is why in this exercise we use the
weather data.
(a) Select the "Associate" tab to get into the
association rule mining perspective of Weka. Under "Associator", select and
run each of the following: "Apriori", "Predictive Apriori" and
"Tertius". Briefly inspect the output produced by each associator and try to
interpret its meaning.
(b) In association rule mining the number of possible
association rules can be very large, even with tiny datasets, so it is in our
best interest to reduce the count of rules found to only the most interesting
ones. This is usually achieved by setting minimum thresholds on support and
confidence values. Still in the "Associate" view, select the "Apriori"
algorithm again, click on the textbox next to the "Choose" button and try,
in turn, different values for the following parameters: "lowerBoundMinSupport"
(minimum threshold for support) and "minMetric" (minimum threshold for
confidence). As you change these parameter values, what do you notice about
the rules that are found by the associator? Note that the parameter
"numRules" limits the maximum number of rules that the associator looks
for; you can try changing this value too.
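The effect of these thresholds is easy to see in isolation. The sketch below is not Weka's internals: it filters a hand-made list of hypothetical rules (loosely styled after weather-data rules, with invented support and confidence values) by minimum support and confidence, then caps the result like "numRules" does.

```python
# Hypothetical candidate rules: (antecedent, consequent, support, confidence).
rules = [
    ("humidity=normal",   "play=yes",        0.30, 0.92),
    ("outlook=overcast",  "play=yes",        0.29, 1.00),
    ("temperature=cool",  "humidity=normal", 0.29, 1.00),
    ("windy=false",       "play=yes",        0.43, 0.75),
]

def report(rules, min_support, min_confidence, num_rules=25):
    """Keep rules meeting both thresholds, best-confidence first,
    capped at num_rules (like Weka's numRules parameter)."""
    kept = [r for r in rules if r[2] >= min_support and r[3] >= min_confidence]
    return sorted(kept, key=lambda r: -r[3])[:num_rules]

print(len(report(rules, 0.25, 0.90)))  # stricter confidence: 3 rules pass
print(len(report(rules, 0.25, 0.70)))  # lower confidence: all 4 pass
```

Lowering either threshold can only grow (never shrink) the set of rules that qualify, which is exactly what you should observe in Weka.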
(c)
This time run the Apriori algorithm with the "outputItemSets" parameter
set to true. You will notice that the algorithm now also outputs a list of
"Generated sets of large itemsets:" at different levels. If you have the
module's Data Mining book by Witten & Frank with you, then you can compare
and contrast the Apriori associator's output with the association rules on
pages 114-116 (I will have a couple of copies circulating in the lab during
the session; just ask me for one). I also strongly recommend reading through
chapter 4.5 in your own time while playing with the weather data in Weka;
this chapter gives a nice and easy introduction to association rules. Notice
in particular how the item sets and association rules in Weka's output
compare with Tables 4.10-4.11 in the book.
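Those level-by-level itemset lists come from Apriori's core loop: frequent itemsets of size k are joined to form size-(k+1) candidates, which are then pruned by minimum support. A minimal, illustrative version of that loop over toy transactions (not the weather data, and not Weka's actual implementation) looks like this:

```python
def frequent_itemsets(transactions, min_support):
    """Generate 'large' (frequent) itemsets level by level, in the
    spirit of Weka's 'Generated sets of large itemsets:' output."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]  # size-1 candidates
    levels = []
    while level:
        # Prune: keep only candidates meeting the minimum support.
        frequent = [s for s in level
                    if sum(s <= t for t in transactions) / n >= min_support]
        if not frequent:
            break
        levels.append(frequent)
        # Join: combine frequent k-itemsets into (k+1)-item candidates.
        k = len(frequent[0]) + 1
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        level = sorted(candidates, key=sorted)
    return levels

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
for size, sets in enumerate(frequent_itemsets(transactions, 0.4), start=1):
    print(f"Size {size}: {[sorted(s) for s in sets]}")
```

With minimum support 0.4 (at least 2 of the 5 transactions), every level here survives: three 1-itemsets, three 2-itemsets and one 3-itemset.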
(d)
Compare the association rules output by Apriori and Tertius (you can do this
by navigating through the already built associator models in the "Result
list" on the right side of the screen).
Make sure that the Apriori algorithm shows at least 20
rules. Think about how the association rules generated by the two different
methods compare to each other.
Something to always remember with association rules is
that they should not be used for prediction directly, that is, without further
analysis or domain knowledge, as they do not necessarily indicate causality.
They
are however a very helpful starting point for further exploration and for
building a better understanding of our data.
As you should certainly know by this point, a correlation
matrix and a scatter plot matrix can also be very useful for identifying
associations between parameters.
Exercise 3: Boolean association rule mining in Weka
The dataset studied is the weather dataset from Weka's data folder.
The goal of this data mining
study is to find strong association rules in the weather.nominal
dataset. Answer the following questions:
a.
What type of variables are in this dataset (numeric /
ordinal / categorical)?
b.
Load the data in Weka Explorer. Select
the Associate tab. How
many different association rule mining algorithms are available?
c.
Choose Apriori algorithm with the
following parameters (which you can set by clicking on the chosen algorithm):
support threshold = 15% (lowerBoundMinSupport = 0.15), confidence threshold =
90% (metricType = confidence, minMetric = 0.9), number of rules = 50 (numRules
= 50). After starting the algorithm, how many rules do you find? Could you use
the regular weather dataset to get the results? Explain why.
d.
Paste a screenshot of the Explorer window
showing at least the first 20 rules.
e.
Define the concepts of support, confidence,
and lift for a rule. Write here the first rule discovered. What
is its support? Its confidence? Interpret the meaning of these terms and this
rule in this particular example.
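As a worked check of these three definitions, the snippet below computes support, confidence and lift for one rule from toy weather-style transactions. The four transactions are invented for illustration; they are not the actual weather.nominal instances.

```python
def support(itemset, transactions):
    """Fraction of all transactions containing the whole itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """P(rhs | lhs): support of the whole rule over support of the left side."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

def lift(lhs, rhs, transactions):
    """Confidence over the baseline support of rhs.
    Lift > 1 means lhs and rhs co-occur more than if they were independent."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)

ts = [{"outlook=overcast", "play=yes"},
      {"outlook=sunny", "play=no"},
      {"outlook=overcast", "play=yes"},
      {"outlook=rainy", "play=yes"}]

lhs, rhs = {"outlook=overcast"}, {"play=yes"}
print(support(lhs | rhs, ts))   # 2/4 = 0.5
print(confidence(lhs, rhs, ts)) # 1.0
print(lift(lhs, rhs, ts))       # 1.0 / 0.75 ≈ 1.33
```

Here the rule holds in every transaction where its left-hand side occurs (confidence 1.0), and its lift above 1 shows the co-occurrence is stronger than chance.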
f.
The Apriori algorithm generates association rules
from frequent itemsets. How many itemsets of size 4 were found? Which rule(s)
have been generated from the itemset of size 4 (temperature=mild, windy=false,
play=yes, outlook=rainy)? List their numbers in the list of rules.
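The mechanism behind question f can be sketched in a few lines: from one frequent itemset, every way of splitting it into antecedent => consequent is tried, and the splits whose confidence meets the threshold become rules. The transactions below are toy data made up for the sketch, not the real weather.nominal instances, so the counts differ from what Weka reports.

```python
from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions containing the whole itemset."""
    return sum(itemset <= t for t in transactions)

def rules_from_itemset(itemset, transactions, min_conf=0.9):
    """All antecedent => consequent splits of one frequent itemset
    whose confidence reaches min_conf."""
    rules = []
    items = sorted(itemset)
    for r in range(1, len(items)):
        for lhs_items in combinations(items, r):
            lhs = frozenset(lhs_items)
            rhs = itemset - lhs
            conf = support_count(itemset, transactions) / support_count(lhs, transactions)
            if conf >= min_conf:
                rules.append((sorted(lhs), sorted(rhs), conf))
    return rules

ts = [{"temperature=mild", "windy=false", "play=yes", "outlook=rainy"},
      {"temperature=mild", "windy=false", "play=yes"},
      {"temperature=mild", "outlook=sunny", "play=no"}]

itemset = frozenset({"temperature=mild", "windy=false", "play=yes", "outlook=rainy"})
for lhs, rhs, conf in rules_from_itemset(itemset, ts):
    print(lhs, "=>", rhs, f"conf={conf:.2f}")
```

On this toy data only the splits whose antecedent includes outlook=rainy reach the 90% confidence threshold, which is why a single 4-itemset can yield several numbered rules in Weka's output.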