Tutorial: Association Rules
In this tutorial we will first look at association rules,
using the APRIORI algorithm in Weka.
APRIORI works with categorical values only, so we will
use a different dataset called "adult". This dataset contains
census data about 48,842 US adults; the aim is to predict whether their income
exceeds $50,000. The dataset is taken from the Delve website, and originally
came from the UCI Machine Learning Repository. More information about it is
available in the original UCI documentation.
Download a copy of adult.arff and load it into Weka.
This dataset is not immediately ready for use with
APRIORI. First, reduce its size by taking a random sample. You can do this with
the 'ResampleFilter' in the preprocess tab sheet: click on the label under
'Filters', choose 'ResampleFilter' from the drop down menu, set the
'sampleSizePercentage' (e.g. to 15), click 'OK' and 'Add', and click 'Apply
Filters'. The 'Working relation' is now a subsample of the original adult
dataset. Now we have to get rid of the numerical attributes. You can choose to
discard them, or to discretise them. We will discretise the first attribute
('age'): choose the 'DiscretizeFilter', set 'attributeIndices' to 'first', bins
to a low number, like 4 or 5, and the other options to 'False'. Then add this
new filter to the others. We will get rid of the other numerical attributes:
choose an 'AttributeFilter', set 'invertSelection' to 'False', and enter the
indices of the remaining numeric attributes (3,5,11-13). Apply all the filters
together now. Then click on 'Replace' to make the resulting 'Working relation'
the new 'Base relation'.
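Outside Weka, the same two preprocessing steps (random subsampling and equal-width discretisation) can be sketched in plain Python. This is an illustrative sketch only: the function names and the toy ages below are made up for the example and are not part of the adult dataset.

```python
import random

def resample(rows, percentage, seed=1):
    """Take a random sample of the given size in percent,
    like Weka's ResampleFilter with sampleSizePercentage."""
    random.seed(seed)
    k = max(1, round(len(rows) * percentage / 100))
    return random.sample(rows, k)

def discretise(values, bins=4):
    """Equal-width binning of a numeric attribute into a small
    number of labelled intervals (cf. the DiscretizeFilter)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # guard against all-equal values
    labels = []
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp max into last bin
        labels.append(f"({lo + idx * width:.1f}-{lo + (idx + 1) * width:.1f}]")
    return labels

# Toy 'age' values, discretised into 4 bins
ages = [17, 23, 35, 42, 58, 64, 71, 90]
print(discretise(ages, bins=4))
```

After binning, every attribute is categorical, which is the precondition APRIORI needs.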
Now go to the 'Associate' tab sheet and click under
'Associator'. Set 'numRules' to 25, and keep the other options on their
defaults. Click 'Start' and observe the results. What do you think about these
rules? Are they useful?
From the previous results, it is clear that some attributes
should not be examined simultaneously, because together they lead to trivial rules. Go
back to the 'Preprocess' sheet. If you have replaced the original 'Base
relation' by the 'Working relation', you can include and exclude attributes
very easily: delete all filters from the 'Filters' window, then remove the
check mark next to the attributes you want to get rid of and click 'Apply
Filters'. You now have a new 'Working relation'. Try to remove different
combinations of the attributes that lead to trivial association rules. Run
APRIORI several times and look for interesting rules. You will find that there
is often a whole range of rules which are all based on the same simpler rule.
Also, you will often get rules that don't include the target class. This is why
in most cases you would use APRIORI for dataset exploration rather than for
predictive modelling.
Exercise 2
Association
analysis is concerned with discovering interesting correlations or other
relationships between variables in large databases. We are interested in
relationships between the features themselves, rather than between features and
class as in the standard classification problem setting. Searching for
association patterns is therefore like classification, except that instead of
predicting just the class, we try to predict arbitrary attributes or attribute
combinations.
1. Fire up the Weka software, launch the
Explorer window and select the "Preprocess" tab. Open the weather.nominal
dataset ("weather.nominal.arff"; it should be in the ./data/ directory
of the Weka install).
2. Often we search for association rules
showing attribute-value conditions that occur frequently together in a given
set of data, such as: buys(X, "computer") & buys(X, "scanner") =>
buys(X, "printer") [support = 2%, confidence = 60%], where confidence and support
are measures of rule interestingness. A support of 2% means that 2% of all
transactions under analysis show that computer, scanner and printer are
purchased together. A confidence of 60% means that 60% of the customers who
purchased a computer and a scanner also bought a printer. We are interested
in association rules that apply to a reasonably large number of instances and
have a reasonably high accuracy on the instances to which they apply.
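The definitions above can be checked directly on a handful of toy transactions. The shopping baskets below are invented for illustration; they mirror the buys(X, ...) example rather than any real dataset.

```python
# Five toy transactions (shopping baskets) as sets of items.
transactions = [
    {"computer", "scanner", "printer"},
    {"computer", "scanner"},
    {"computer", "printer"},
    {"scanner"},
    {"computer", "scanner", "printer"},
]

def support(itemset, transactions):
    """Fraction of all transactions containing every item in the set."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Rule: {computer, scanner} => {printer}
print(support({"computer", "scanner", "printer"}, transactions))       # 2/5 = 0.4
print(confidence({"computer", "scanner"}, {"printer"}, transactions))  # 0.4/0.6 ≈ 0.667
```

So this rule has 40% support and about 67% confidence on the toy data, exactly the two quantities Weka reports for each rule.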
Weka has three built-in association rule learners: "Apriori",
"Predictive Apriori" and "Tertius". However, they are not capable of
handling numeric data, which is why in this exercise we use the
weather data.
(a) Select the "Associate" tab to get into the
association rule mining perspective of Weka. Under "Associator", select and
run each of the following: "Apriori", "Predictive Apriori" and
"Tertius". Briefly inspect the output produced by each associator and try to
interpret its meaning.
(b) In association rule mining the number of possible
association rules can be very large, even with tiny datasets, so it is in our
best interest to reduce the count of rules found to only the most interesting
ones. This is usually achieved by setting minimum thresholds on support and
confidence values. Still in the "Associate" view, select the "Apriori"
algorithm again, click on the textbox next to the "Choose" button and try,
in turn, different values for the following parameters: "lowerBoundMinSupport"
(minimum threshold for support) and "minMetric" (minimum threshold for
confidence). As you change these parameter values, what do you notice about
the rules that are found by the associator? Note that the parameter
"numRules" limits the maximum number of rules that the associator looks
for; you can try changing this value too.
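The effect of these thresholds is easy to see in isolation. The sketch below is not Weka's internals: it filters a hand-made list of hypothetical rules (loosely styled after weather-data rules, with invented support and confidence values) by minimum support and confidence, then caps the result like "numRules" does.

```python
# Hypothetical candidate rules: (antecedent, consequent, support, confidence).
rules = [
    ("humidity=normal",   "play=yes",        0.30, 0.92),
    ("outlook=overcast",  "play=yes",        0.29, 1.00),
    ("temperature=cool",  "humidity=normal", 0.29, 1.00),
    ("windy=false",       "play=yes",        0.43, 0.75),
]

def report(rules, min_support, min_confidence, num_rules=25):
    """Keep rules meeting both thresholds, best-confidence first,
    capped at num_rules (like Weka's numRules parameter)."""
    kept = [r for r in rules if r[2] >= min_support and r[3] >= min_confidence]
    return sorted(kept, key=lambda r: -r[3])[:num_rules]

print(len(report(rules, 0.25, 0.90)))  # stricter confidence: 3 rules pass
print(len(report(rules, 0.25, 0.70)))  # lower confidence: all 4 pass
```

Lowering either threshold can only grow (never shrink) the set of rules that qualify, which is exactly what you should observe in Weka.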
(c)
This time run the Apriori algorithm with the "outputItemSets" parameter
set to true. You will notice that the algorithm now also outputs a list of
"Generated sets of large itemsets:" at different levels. If you have the
module's Data Mining book by Witten & Frank with you, then you can compare
and contrast the Apriori associator's output with the association rules on
pages 114-116 (I will have a couple of copies circulating in the lab during
the session; just ask me for one). I also strongly recommend reading through
chapter 4.5 in your own time while playing with the weather data in Weka;
this chapter gives a nice and easy introduction to association rules. Notice
in particular how the item sets and association rules in Weka's output
compare with Tables 4.10-4.11 in the book.
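Those level-by-level itemset lists come from Apriori's core loop: frequent itemsets of size k are joined to form size-(k+1) candidates, which are then pruned by minimum support. A minimal, illustrative version of that loop over toy transactions (not the weather data, and not Weka's actual implementation) looks like this:

```python
def frequent_itemsets(transactions, min_support):
    """Generate 'large' (frequent) itemsets level by level, in the
    spirit of Weka's 'Generated sets of large itemsets:' output."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]  # size-1 candidates
    levels = []
    while level:
        # Prune: keep only candidates meeting the minimum support.
        frequent = [s for s in level
                    if sum(s <= t for t in transactions) / n >= min_support]
        if not frequent:
            break
        levels.append(frequent)
        # Join: combine frequent k-itemsets into (k+1)-item candidates.
        k = len(frequent[0]) + 1
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        level = sorted(candidates, key=sorted)
    return levels

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
for size, sets in enumerate(frequent_itemsets(transactions, 0.4), start=1):
    print(f"Size {size}: {[sorted(s) for s in sets]}")
```

With minimum support 0.4 (at least 2 of the 5 transactions), every level here survives: three 1-itemsets, three 2-itemsets and one 3-itemset.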
(d)
Compare the association rules output by Apriori and Tertius (you can do this
by navigating through the already built associator models in the "Result
list" on the right side of the screen).
Make sure that the Apriori algorithm shows at least 20
rules. Think about how the association rules generated by the two different
methods compare to each other.
Something to always remember with association rules is
that they should not be used for prediction directly, that is, without further
analysis or domain knowledge, as they do not necessarily indicate causality.
They
are however a very helpful starting point for further exploration and for
building a better understanding of our data.
As you should certainly know by this point, a correlation
matrix and a scatter plot matrix can also be very useful for identifying
associations between parameters.
Exercise 3: Boolean association rule mining in Weka
The dataset studied is the weather dataset from Weka's data folder.
The goal of this data mining
study is to find strong association rules in the weather.nominal
dataset. Answer the following questions:
a.
What type of variables are in this dataset (numeric /
ordinal / categorical)?
b.
Load the data in Weka Explorer. Select
the Associate tab. How
many different association rule mining algorithms are available?
c.
Choose Apriori algorithm with the
following parameters (which you can set by clicking on the chosen algorithm):
support threshold = 15% (lowerBoundMinSupport = 0.15), confidence threshold =
90% (metricType = confidence, minMetric = 0.9), number of rules = 50 (numRules
= 50). After starting the algorithm, how many rules do you find? Could you use
the regular weather dataset to get the results? Explain why.
d.
Paste a screenshot of the Explorer window
showing at least the first 20 rules.
e.
Define the concepts of support, confidence,
and lift for a rule. Write here the first rule discovered. What
is its support? Its confidence? Interpret the meaning of these terms and this
rule in this particular example.
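As a worked check of these three definitions, the snippet below computes support, confidence and lift for one rule from toy weather-style transactions. The four transactions are invented for illustration; they are not the actual weather.nominal instances.

```python
def support(itemset, transactions):
    """Fraction of all transactions containing the whole itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """P(rhs | lhs): support of the whole rule over support of the left side."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

def lift(lhs, rhs, transactions):
    """Confidence over the baseline support of rhs.
    Lift > 1 means lhs and rhs co-occur more than if they were independent."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)

ts = [{"outlook=overcast", "play=yes"},
      {"outlook=sunny", "play=no"},
      {"outlook=overcast", "play=yes"},
      {"outlook=rainy", "play=yes"}]

lhs, rhs = {"outlook=overcast"}, {"play=yes"}
print(support(lhs | rhs, ts))   # 2/4 = 0.5
print(confidence(lhs, rhs, ts)) # 1.0
print(lift(lhs, rhs, ts))       # 1.0 / 0.75 ≈ 1.33
```

Here the rule holds in every transaction where its left-hand side occurs (confidence 1.0), and its lift above 1 shows the co-occurrence is stronger than chance.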
f.
The Apriori algorithm generates association rules
from frequent itemsets. How many itemsets of size 4 were found? Which rule(s)
have been generated from the itemset of size 4 (temperature=mild, windy=false,
play=yes, outlook=rainy)? List their numbers in the list of rules.
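The mechanism behind question f can be sketched in a few lines: from one frequent itemset, every way of splitting it into antecedent => consequent is tried, and the splits whose confidence meets the threshold become rules. The transactions below are toy data made up for the sketch, not the real weather.nominal instances, so the counts differ from what Weka reports.

```python
from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions containing the whole itemset."""
    return sum(itemset <= t for t in transactions)

def rules_from_itemset(itemset, transactions, min_conf=0.9):
    """All antecedent => consequent splits of one frequent itemset
    whose confidence reaches min_conf."""
    rules = []
    items = sorted(itemset)
    for r in range(1, len(items)):
        for lhs_items in combinations(items, r):
            lhs = frozenset(lhs_items)
            rhs = itemset - lhs
            conf = support_count(itemset, transactions) / support_count(lhs, transactions)
            if conf >= min_conf:
                rules.append((sorted(lhs), sorted(rhs), conf))
    return rules

ts = [{"temperature=mild", "windy=false", "play=yes", "outlook=rainy"},
      {"temperature=mild", "windy=false", "play=yes"},
      {"temperature=mild", "outlook=sunny", "play=no"}]

itemset = frozenset({"temperature=mild", "windy=false", "play=yes", "outlook=rainy"})
for lhs, rhs, conf in rules_from_itemset(itemset, ts):
    print(lhs, "=>", rhs, f"conf={conf:.2f}")
```

On this toy data only the splits whose antecedent includes outlook=rainy reach the 90% confidence threshold, which is why a single 4-itemset can yield several numbered rules in Weka's output.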