Case Study 2
Classification / Prediction / Cluster Analysis
The goal of this assignment is to review the principles and methods of prediction mining and cluster analysis, and to apply them to a dataset using the Weka data mining tool.
Heart dataset
The first dataset studied is the Cleveland
dataset from the UCI repository. This dataset describes numeric
factors of heart disease. It can be downloaded from http://www.cs.waikato.ac.nz/~ml/weka/index_datasets.html
and is contained in the datasets-numeric.jar archive.
Zoo dataset
The second dataset studied is the
zoo dataset from the UCI repository. This dataset
describes animals with categorical features. It can be downloaded from http://www.cs.waikato.ac.nz/~ml/weka/index_datasets.html
and is contained in the datasets-UCI.jar archive.
1. Prediction in Weka (100 points, 5 points per
question)
The goal of this data mining study is to predict the
severity of heart disease in the Cleveland dataset (variable num)
based on the other attributes. Answer the following questions:
a.
What types of variables are in this dataset (numeric /
ordinal / categorical)?
b.
Load the data in Weka Explorer. Select
the Classify tab. How many
different prediction algorithms are available (under functions)?
c.
Explain what prediction is in data
mining.
d.
Choose the LinearRegression algorithm.
Explain the principle of this algorithm.
e.
Results of this algorithm can be interpreted in the
following way. The first part of the output represents the coefficients of a
linear equation of the form
num = w0 + w1*a1 + … + wk*ak.
The number printed in front of each attribute ak is its weight wk. Based on this, interpret the results you get from running LinearRegression on the dataset. What is the equation of the line found?
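To make the coefficient interpretation concrete, here is a minimal Python sketch of how such a fitted equation turns attribute values into a prediction. The weights and attribute names below are made up for illustration; they are not actual Weka output for the Cleveland dataset.

```python
# Illustrative sketch: applying a fitted linear equation
#   num = w0 + w1*a1 + ... + wk*ak
# The weights here are hypothetical, not from Weka.

def predict(weights, attributes):
    """weights = [w0, w1, ..., wk]; attributes = [a1, ..., ak]."""
    w0, ws = weights[0], weights[1:]
    return w0 + sum(w * a for w, a in zip(ws, attributes))

# Hypothetical equation: num = 0.5 + 0.02*age + 0.4*cp
weights = [0.5, 0.02, 0.4]
print(round(predict(weights, [60, 3]), 2))  # = 0.5 + 0.02*60 + 0.4*3 = 2.9
```

Each weight wk says how much the prediction changes when attribute ak increases by one unit, holding the others fixed.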
f.
The second part of the results states the correlation
coefficient, which measures the statistical correlation between the
predicted and actual values (a coefficient of +1 indicates a perfect
positive relationship, 0 indicates no linear relationship, and –1
indicates a perfect negative relationship). Only positive correlations make
sense in regression, and a coefficient above 0.5 signals a large
correlational effect. The remaining figures are the mean absolute error (the
average prediction error), the root mean squared error (the square root
of the mean squared error, which is the most commonly used error measure), the
relative absolute error (which compares this error with the one obtained if the
prediction had been the mean), the root relative squared error (the square root
of the squared error relative to the one obtained if the prediction had been the
mean), and the total number of instances considered.
The overall interpretation is the following: a prediction is good when the correlation coefficient is as large as possible and all the errors are as small as possible. These figures are used to compare several prediction results. How do you evaluate the fit of the equation provided in e), that is, how strong is this prediction?
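The error measures above can be sketched in a few lines of Python. The predicted and actual values below are hypothetical, chosen only to show how each figure is computed; they are not results from the Cleveland dataset.

```python
# Sketch of the error measures reported by Weka, computed on
# hypothetical predicted vs. actual values (illustrative numbers only).
import math

actual    = [0, 1, 2, 3, 4, 0, 1, 2]
predicted = [0.2, 1.1, 1.8, 2.5, 3.6, 0.4, 0.9, 2.3]

n = len(actual)
mean_actual = sum(actual) / n

# Mean absolute error: the average prediction error.
mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / n

# Root mean squared error: square root of the mean squared error.
rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

# Relative errors compare against always predicting the mean of the actuals.
rae = sum(abs(p - a) for p, a in zip(predicted, actual)) \
      / sum(abs(mean_actual - a) for a in actual)
rrse = math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                 / sum((mean_actual - a) ** 2 for a in actual))

print(f"MAE={mae:.3f} RMSE={rmse:.3f} RAE={rae:.1%} RRSE={rrse:.1%}")
```

A relative error below 100% means the model beats the trivial strategy of always predicting the mean.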
g.
Note also that an important figure is the
square of the correlation coefficient (R2). In statistical
regression analysis, from which this prediction method originates, the most
commonly used success measures are R and R2. The latter represents
the percentage of variation in the target figure accounted for by the model.
For example, if we want to predict a sales volume based on three factors, such
as the advertising budget, the number of plays on the radio per week, and the
attractiveness of the band, and we get a correlation coefficient R of
0.8, then we learn from the model
that R2 = 64% of the
variability in the outcome (the sales volume) is accounted for by the three
factors. How much of the variability of num can be predicted by
the other attributes?
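The relation between R and R2 can be sketched as follows: compute the Pearson correlation between predicted and actual values, then square it. The values below are hypothetical, used only to illustrate the computation.

```python
# Sketch: correlation coefficient R and the share of variance R^2.
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical predicted vs. actual values (not from the dataset).
actual    = [0, 1, 2, 3, 4]
predicted = [0.5, 0.9, 2.2, 2.8, 4.1]
r = pearson_r(predicted, actual)
print(f"R = {r:.3f}, R^2 = {r * r:.1%} of variability accounted for")
```

This mirrors the sales example in the text: an R of 0.8 gives R2 = 0.64, i.e. 64% of the outcome's variability explained.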
h.
Are these results compatible with the results of
assignment #1, which used classification to predict num?
Now compare these figures with the other classifiers
provided under functions and fill in the following table (except the last row):
Method | Correlation coefficient | Mean absolute error | Root mean squared error | Relative absolute error | Root relative squared error
LinearRegression | | | | |
SMOreg | | | | |
MultilayerPerceptron | | | | |
MultilayerPerceptron (optimized) | | | | |
a.
Which prediction method provides best results with this
dataset?
b.
Try using the other functions to calculate the same
regression. What problem(s) do you face?
c.
Explain what is logistic regression, and how it differs
from linear regression.
d.
Is logistic regression in fact a prediction method? If
not, what kind of data mining method is it?
e.
In the MultilayerPerceptron function, how
many input nodes does this multilayer perceptron have?
f.
In the MultilayerPerceptron function, how
many output nodes does this multilayer perceptron have?
g.
In the MultilayerPerceptron function, how
many hidden layers does this multilayer perceptron have?
h.
After choosing GUI in the panel of MultilayerPerceptron
options, paste here a screenshot of the graphical representation of this neural
network.
i.
What is its learning rate?
j.
By changing the MultilayerPerceptron parameters,
which configuration gives you the best results?
k.
What best prediction results do you get (fill in the
table above)?
2. Clustering in Weka (50 points, 5 points per
question)
The goal of this data mining study is to find groups of
animals in the zoo dataset, and to check whether these groups
correspond to the real animal types in the dataset.
a.
What types of variables are in this dataset?
b.
How many rows / cases are there?
c.
How many animal types are represented in this dataset?
List them here.
d.
After removing the type attribute, go to
the Cluster tab. How many clustering algorithms are available in Weka?
e.
List the clustering algorithms seen in class, and map
these to the ones provided in Weka.
f.
Start using the SimpleKMeans clusterer
choosing 7 clusters. Do the clusters learnt and their centroids seem to match
the animal types?
g.
Compare results with EM clusterer (with 7
clusters), MakeDensityBasedClusterer, FarthestFirst
(with 7 clusters), and Cobweb. Which algorithm seems to provide
the best clustering match for this dataset?
h.
Explain the principles of SimpleKMeans, EM,
MakeDensityBasedClusterer, and Cobweb clustering
algorithms.
i.
Are results easy to interpret, even with the tree
visualizations provided?
j.
What would make it easier to evaluate the usefulness of
the clusters found?
a.
List some animals that are misclassified, meaning
classified in a cluster that does not correspond to their actual type, for
instance a mammal clustered with fish, or a reptile clustered with amphibians.
b.
By modifying the selected parameters, improve the
classification, explain which modifications you made, and paste here the
resulting dendrogram.
Case Study 3
In this assignment, you have to compare the performance of
four classification approaches (simply compare the accuracy of the approaches):
·
Decision Trees
·
Ripper rule learning system (JRip in WEKA)
·
SVMs (not in WEKA? If not, use SVMLight or the
like)
·
Decision Trees with AdaBoost
on three different data sets from UCI, or from other sources
of your choice.