Case Study 2
Classification / Prediction / Cluster Analysis
The goal of this assignment is to review the principles and methods of prediction mining and cluster analysis, and to apply them to a dataset using the Weka data mining tool.
Heart dataset
The first dataset studied is the Cleveland
dataset from the UCI repository. This dataset describes numeric
factors of heart disease. It can be downloaded from http://www.cs.waikato.ac.nz/~ml/weka/index_datasets.html
and is contained in the datasets-numeric.jar archive.
Zoo dataset
The second dataset studied is the
zoo dataset from the UCI repository. This dataset
describes animals with categorical features. It can be downloaded from http://www.cs.waikato.ac.nz/~ml/weka/index_datasets.html
and is contained in the datasets-UCI.jar archive.
1. Prediction in Weka (100 points, 5 points per
question)
The goal of this data mining study is to predict the
severity of heart disease in the Cleveland dataset (variable num)
based on the other attributes. Answer the following questions:
a.
What types of variables are in this dataset (numeric /
ordinal / categorical)?
b.
Load the data in Weka Explorer. Select
the Classify tab. How many
different prediction algorithms are available (under functions)?
c.
Explain what prediction is in data
mining.
d.
Choose the LinearRegression algorithm.
Explain the principle of this algorithm.
e.
Results of this algorithm can be interpreted in the
following way. The first part of the output represents the coefficients of a
linear equation of the form
num = w0 + w1*a1 + … + wk*ak.
The number printed in front of each attribute ak is its weight wk. Based on this, interpret the results you get from running LinearRegression on the dataset. What is the equation of the line found?
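To make the coefficient interpretation concrete, here is a minimal Python sketch of how such a fitted equation turns attribute values into a prediction. The weights and attribute names below are made up for illustration; they are not actual Weka output for the Cleveland dataset.

```python
# Illustrative sketch: applying a fitted linear equation
#   num = w0 + w1*a1 + ... + wk*ak
# The weights here are hypothetical, not from Weka.

def predict(weights, attributes):
    """weights = [w0, w1, ..., wk]; attributes = [a1, ..., ak]."""
    w0, ws = weights[0], weights[1:]
    return w0 + sum(w * a for w, a in zip(ws, attributes))

# Hypothetical equation: num = 0.5 + 0.02*age + 0.4*cp
weights = [0.5, 0.02, 0.4]
print(round(predict(weights, [60, 3]), 2))  # = 0.5 + 0.02*60 + 0.4*3 = 2.9
```

Each weight wk says how much the prediction changes when attribute ak increases by one unit, holding the others fixed.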
f.
The second part of the results states the correlation
coefficient, which measures the statistical correlation between the
predicted and actual values (a coefficient of +1 indicates a perfect
positive relationship, 0 indicates no linear relationship, and –1
indicates a perfect negative relationship). Only positive correlations make
sense in regression, and a coefficient above 0.5 signals a large
correlational effect. The remaining figures are the mean absolute error (the
average prediction error), the root mean squared error (the square root
of the mean squared error, which is the most commonly used error measure), the
relative absolute error (which compares this error with the one obtained if the
prediction had been the mean), the root relative squared error (the square root
of the squared error relative to the one obtained if the prediction had been the
mean), and the total number of instances considered.
The overall interpretation is the following: a prediction is good when the correlation coefficient is as large as possible and all the errors are as small as possible. These figures are used to compare several prediction results. How do you evaluate the fit of the equation provided in e), that is, how strong is this prediction?
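The error measures above can be sketched in a few lines of Python. The predicted and actual values below are hypothetical, chosen only to show how each figure is computed; they are not results from the Cleveland dataset.

```python
# Sketch of the error measures reported by Weka, computed on
# hypothetical predicted vs. actual values (illustrative numbers only).
import math

actual    = [0, 1, 2, 3, 4, 0, 1, 2]
predicted = [0.2, 1.1, 1.8, 2.5, 3.6, 0.4, 0.9, 2.3]

n = len(actual)
mean_actual = sum(actual) / n

# Mean absolute error: the average prediction error.
mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / n

# Root mean squared error: square root of the mean squared error.
rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

# Relative errors compare against always predicting the mean of the actuals.
rae = sum(abs(p - a) for p, a in zip(predicted, actual)) \
      / sum(abs(mean_actual - a) for a in actual)
rrse = math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                 / sum((mean_actual - a) ** 2 for a in actual))

print(f"MAE={mae:.3f} RMSE={rmse:.3f} RAE={rae:.1%} RRSE={rrse:.1%}")
```

A relative error below 100% means the model beats the trivial strategy of always predicting the mean.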
g.
Note also that an important figure is the
square of the correlation coefficient (R2). In statistical
regression analysis, from which this prediction method originates, the most
commonly used success measures are R and R2. The latter represents
the percentage of variation in the target figure accounted for by the model.
For example, if we want to predict a sales volume based on three factors, such
as the advertising budget, the number of plays on the radio per week, and the
attractiveness of the band, and we get a correlation coefficient R of
0.8, then we learn from the model
that R2 = 64% of the
variability in the outcome (the sales volume) is accounted for by the three
factors. How much of the variability of num can be predicted by
the other attributes?
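The relation between R and R2 can be sketched as follows: compute the Pearson correlation between predicted and actual values, then square it. The values below are hypothetical, used only to illustrate the computation.

```python
# Sketch: correlation coefficient R and the share of variance R^2.
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical predicted vs. actual values (not from the dataset).
actual    = [0, 1, 2, 3, 4]
predicted = [0.5, 0.9, 2.2, 2.8, 4.1]
r = pearson_r(predicted, actual)
print(f"R = {r:.3f}, R^2 = {r * r:.1%} of variability accounted for")
```

This mirrors the sales example in the text: an R of 0.8 gives R2 = 0.64, i.e. 64% of the outcome's variability explained.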
h.
Are these results compatible with the results of
assignment #1, which used classification to predict num?
Now compare these figures with the other classifiers
provided under functions and fill in the following table (except the last row):
Method | Correlation coefficient | Mean absolute error | Root mean squared error | Relative absolute error | Root relative squared error
LinearRegression | | | | |
SMOreg | | | | |
MultilayerPerceptron | | | | |
MultilayerPerceptron (optimized) | | | | |
a.
Which prediction method provides best results with this
dataset?
b.
Try using the other functions to calculate the same
regression. What problem(s) do you face?
c.
Explain what is logistic regression, and how it differs
from linear regression.
d.
Is logistic regression in fact a prediction method? If
not, what kind of data mining method is it?
e.
In the MultilayerPerceptron function, how
many input nodes does this multilayer perceptron have?
f.
In the MultilayerPerceptron function, how
many output nodes does this multilayer perceptron have?
g.
In the MultilayerPerceptron function, how
many hidden layers does this multilayer perceptron have?
h.
After choosing GUI in the panel of MultilayerPerceptron
options, paste here a screenshot of the graphical representation of this neural
network.
i.
What is its learning rate?
j.
By changing the MultilayerPerceptron parameters,
which configuration gives you the best results?
k.
What best prediction results do you get (fill in the
table above)?
2. Clustering in Weka (50 points, 5 points per
question)
The goal of this data mining study is to find groups of
animals in the zoo dataset, and to check whether these groups
correspond to the real animal types in the dataset.
a.
What types of variables are in this dataset?
b.
How many rows / cases are there?
c.
How many animal types are represented in this dataset?
List them here.
d.
After removing the type attribute, go to
the Cluster tab. How many clustering algorithms are available in Weka?
e.
List the clustering algorithms seen in class, and map
these to the ones provided in Weka.
f.
Start using the SimpleKMeans clusterer
choosing 7 clusters. Do the clusters learnt and their centroids seem to match
the animal types?
g.
Compare results with EM clusterer (with 7
clusters), MakeDensityBasedClusterer, FarthestFirst
(with 7 clusters), and Cobweb. Which algorithm seems to provide
the best clustering match for this dataset?
h.
Explain the principles of SimpleKMeans, EM,
MakeDensityBasedClusterer, and Cobweb clustering
algorithms.
i.
Are results easy to interpret, even with the tree
visualizations provided?
j.
What would make it easier to evaluate the usefulness of
the clusters found?
a.
List some animals that are misclassified, meaning
classified in a cluster that does not correspond to their actual type, for
instance a mammal clustered with fish, or a reptile clustered with amphibians.
b.
By modifying the selected parameters, improve the
classification, explain which modifications you made, and paste here the
resulting dendrogram.
Case Study 3
In this assignment, you have to compare the performance of
four classification approaches (simply compare the accuracy of the approaches):
·
Decision Trees
·
Ripper rule learning system (JRip in WEKA)
·
SVMs (not in WEKA? If not, use SVMLight or the
like)
·
Decision Trees with AdaBoost
on three different data sets from UCI, or from other sources
of your choice.