Exercises from WEKA textbook
1) weather.nominal.arff
What are the values that the attribute temperature can have?
Load a new dataset: click the Open file button and select the file iris.arff. How many instances does this dataset have? How many attributes? What is the range of possible values of the attribute petallength?
2) weather.nominal.arff
What is the function of the first column in the Viewer window? What is the class value of instance number 8 in the weather data?
Load the iris data and open it in the editor. How many numeric and
how many nominal attributes does this dataset have?
3) Load the weather.nominal dataset. Use the filter weka.filters.unsupervised.instance.RemoveWithValues to remove all instances in which the humidity attribute has the value high. To do this, first make the field next to the Choose button show the text RemoveWithValues, then click on it to open the Generic Object Editor window and figure out how to change the filter settings appropriately. Undo the change to the dataset that you just performed, and verify that the data has reverted to its original state.
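The effect of RemoveWithValues on a nominal attribute can be sketched in plain Python (this is an illustration of what the filter does, not WEKA's implementation; the toy instances below are hypothetical):

```python
def remove_with_values(instances, attr, value):
    """Drop every instance whose `attr` equals `value` -- the effect of
    WEKA's RemoveWithValues filter on a nominal attribute (toy sketch)."""
    return [inst for inst in instances if inst[attr] != value]

# Hypothetical miniature of the weather data.
weather = [
    {'humidity': 'high', 'play': 'no'},
    {'humidity': 'normal', 'play': 'yes'},
    {'humidity': 'high', 'play': 'yes'},
]
print(remove_with_values(weather, 'humidity', 'high'))
# only the instance with humidity = normal remains
```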
4) Load the iris data using the Preprocess panel. Evaluate C4.5 (J48 in WEKA) on this data using (a) the training set and (b) cross-validation. What is the estimated percentage of correct classifications for (a) and (b)? Which estimate is more realistic? Use the Visualize classifier errors function to find the wrongly classified test instances for the cross-validation you just performed. What can you say about the location of the errors?
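Why (a) and (b) differ can be seen in a small sketch (plain Python, toy data rather than the iris dataset): a 1-nearest-neighbour classifier scores perfectly when evaluated on its own training set, while leave-one-out cross-validation gives a more honest estimate.

```python
# Sketch: training-set accuracy vs. leave-one-out accuracy for 1-NN
# on toy 1-D data, to show why training-set evaluation is over-optimistic.

def nn_predict(train, x):
    """Label of the training point closest to x (1-nearest neighbour)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

data = [(1.0, 'a'), (1.2, 'a'), (2.9, 'b'), (3.1, 'b'), (2.0, 'a'), (2.1, 'b')]

# (a) Evaluate on the training set: every point is its own nearest neighbour.
train_acc = sum(nn_predict(data, x) == y for x, y in data) / len(data)

# (b) Leave-one-out cross-validation: hold out each point in turn.
loo_acc = sum(
    nn_predict(data[:i] + data[i + 1:], x) == y
    for i, (x, y) in enumerate(data)
) / len(data)

print(train_acc, loo_acc)  # training accuracy is 1.0; LOO accuracy is lower
```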
5) glass.arff
How many attributes are there in the dataset? What are their
names? What is the class attribute? Run the classification algorithm IBk (weka.classifiers.lazy.IBk). Use cross-validation to test its performance,
leaving the number of folds at the default value of 10. Recall
that you can examine the classifier options in the Generic Object Editor window
that pops up when you click the text beside the Choose button. The default
value of the KNN field is 1: this sets the number of neighboring
instances used when classifying.
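The idea behind IBk's KNN option can be sketched as a majority vote among the k closest training instances (a toy 1-D version in plain Python, not WEKA's implementation):

```python
from collections import Counter

def knn_predict(train, x, k=1):
    """Majority label among the k training points closest to x.
    Mirrors the role of IBk's KNN field; toy 1-D sketch."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    labels = Counter(label for _, label in neighbours)
    return labels.most_common(1)[0][0]

# Hypothetical training data (value, class).
train = [(1.0, 'a'), (1.8, 'a'), (2.2, 'a'), (2.5, 'b'), (4.0, 'b'), (4.2, 'b')]

print(knn_predict(train, 2.4, k=1))  # nearest point is (2.5, 'b') -> 'b'
print(knn_predict(train, 2.4, k=5))  # three 'a's vs. two 'b's among 5 -> 'a'
```

Changing k can flip the prediction, which is why exercises 5 and 6 ask you to compare k = 1 and k = 5.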
6) glass.arff
What is the accuracy of IBk (given in the Classifier Output box)? Run IBk again, but increase the number of neighboring
instances to k = 5 by entering this value in the KNN field. Here and throughout
this section, continue to use cross-validation as the evaluation method.
What is the accuracy of IBk with five neighboring instances (k = 5)?
7) ionosphere.arff
For J48, compare
cross-validated accuracy and the size of the trees generated for (1) the raw
data, (2) data discretized by the unsupervised discretization method in default
mode, and (3) data discretized by the same
method with binary attributes.
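In default mode the unsupervised discretization method assigns each numeric value to one of a fixed number of equal-width intervals (WEKA's Discretize filter defaults to 10 bins; treat that count as an assumption here). A minimal plain-Python sketch of equal-width binning:

```python
def equal_width_bins(values, n_bins=10):
    """Assign each value to one of n_bins equal-width intervals spanning
    [min, max] -- the default behaviour of unsupervised discretization
    (toy sketch; ignores missing values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

values = [0.0, 1.0, 2.5, 5.0, 9.9, 10.0]
print(equal_width_bins(values, n_bins=10))  # -> [0, 1, 2, 5, 9, 9]
```

The binary-attribute variant in part (3) instead replaces each discretized attribute with a set of indicator attributes, one per cut point.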
8) Apply the ranking technique to the labor negotiations data in labor.arff to determine the four most important attributes based on
information gain. On the same data, run CfsSubsetEval
for correlation-based
selection, using the BestFirst
search. Then run the wrapper method with J48 as the base learner, again using the BestFirst search. Examine the
attribute subsets that are output. Which attributes are selected by both
methods? How do they relate to the output generated by ranking
using information gain?
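The information-gain ranking in this exercise scores each attribute by how much it reduces the entropy of the class. A self-contained sketch of the computation (plain Python with toy data, not WEKA's InfoGainAttributeEval):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a class distribution, in bits."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def info_gain(attr_values, labels):
    """Information gain of an attribute: class entropy minus the
    weighted entropy of the class within each attribute value."""
    total = len(labels)
    remainder = 0.0
    for v in set(attr_values):
        subset = [l for a, l in zip(attr_values, labels) if a == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Toy example: 4 instances, one attribute, binary class.
attr = ['x', 'x', 'y', 'y']
cls  = ['yes', 'yes', 'no', 'no']
print(info_gain(attr, cls))  # splits the classes perfectly -> gain = 1.0 bit
```

Ranking simply sorts the attributes by this score, whereas CfsSubsetEval and the wrapper method evaluate whole subsets at a time, which is why their outputs can differ.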
9) Run Apriori on the weather data with each of the four rule-ranking metrics, and default settings otherwise. What is the top-ranked rule that is output for each metric?
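Apriori's four rule-ranking metrics (confidence, lift, leverage, conviction) can all be computed from the coverage counts of a rule A => B. A plain-Python sketch, using the standard formulas (the counts in the example are for outlook=overcast => play=yes in the 14-instance weather data; verify them against your own output):

```python
def rule_metrics(n, n_a, n_b, n_ab):
    """Apriori's four rule-ranking metrics for a rule A => B, from counts:
    n instances total, n_a covering A, n_b covering B, n_ab covering both."""
    p_a, p_b, p_ab = n_a / n, n_b / n, n_ab / n
    confidence = p_ab / p_a
    lift = confidence / p_b
    leverage = p_ab - p_a * p_b
    # Conviction is infinite when the rule is never violated (confidence = 1).
    conviction = (1 - p_b) / (1 - confidence) if confidence < 1 else float('inf')
    return {'confidence': confidence, 'lift': lift,
            'leverage': leverage, 'conviction': conviction}

# outlook=overcast => play=yes: A covers 4, B covers 9, both cover 4.
print(rule_metrics(14, 4, 9, 4))
```

Because the metrics weight rule coverage and accuracy differently, the top-ranked rule need not be the same under each one.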