Exercises from WEKA textbook
1) weather.nominal.arff
What are the values that the attribute temperature can have?
Load a new dataset: click the Open file button and select the file iris.arff. How many instances does this dataset have? How many attributes? What is the range of possible values of the attribute petallength?
2) weather.nominal.arff
What is the function of the first column in the Viewer window? What is the class value of instance number 8 in the weather data?
Load the iris data and open it in the editor. How many numeric and
how many nominal attributes does this dataset have?
3) Load the weather.nominal dataset. Use the filter weka.filters.unsupervised.instance.RemoveWithValues to remove all instances in which the humidity attribute has the value high. To do this, first make the field next to the Choose button show the text RemoveWithValues, then click on it to open the Generic Object Editor window and figure out how to change the filter settings appropriately. Undo the change to the dataset that you just performed, and verify that the data has reverted to its original state.
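The effect of RemoveWithValues on a nominal attribute can be sketched in plain Python (this is an illustration of what the filter does, not WEKA's implementation; the toy instances below are hypothetical):

```python
def remove_with_values(instances, attr, value):
    """Drop every instance whose `attr` equals `value` -- the effect of
    WEKA's RemoveWithValues filter on a nominal attribute (toy sketch)."""
    return [inst for inst in instances if inst[attr] != value]

# Hypothetical miniature of the weather data.
weather = [
    {'humidity': 'high', 'play': 'no'},
    {'humidity': 'normal', 'play': 'yes'},
    {'humidity': 'high', 'play': 'yes'},
]
print(remove_with_values(weather, 'humidity', 'high'))
# only the instance with humidity = normal remains
```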
4) Load the iris data using the Preprocess panel. Evaluate C4.5 (J48 in WEKA) on this data using (a) the training set and (b) cross-validation. What is the estimated percentage of correct classifications for (a) and (b)? Which estimate is more realistic? Use the Visualize classifier errors function to find the wrongly classified test instances for the cross-validation you just performed. What can you say about the location of the errors?
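Why (a) and (b) differ can be seen in a small sketch (plain Python, toy data rather than the iris dataset): a 1-nearest-neighbour classifier scores perfectly when evaluated on its own training set, while leave-one-out cross-validation gives a more honest estimate.

```python
# Sketch: training-set accuracy vs. leave-one-out accuracy for 1-NN
# on toy 1-D data, to show why training-set evaluation is over-optimistic.

def nn_predict(train, x):
    """Label of the training point closest to x (1-nearest neighbour)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

data = [(1.0, 'a'), (1.2, 'a'), (2.9, 'b'), (3.1, 'b'), (2.0, 'a'), (2.1, 'b')]

# (a) Evaluate on the training set: every point is its own nearest neighbour.
train_acc = sum(nn_predict(data, x) == y for x, y in data) / len(data)

# (b) Leave-one-out cross-validation: hold out each point in turn.
loo_acc = sum(
    nn_predict(data[:i] + data[i + 1:], x) == y
    for i, (x, y) in enumerate(data)
) / len(data)

print(train_acc, loo_acc)  # training accuracy is 1.0; LOO accuracy is lower
```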
5) glass.arff
How many attributes are there in the dataset? What are their
names? What is the class attribute? Run the classification algorithm IBk (weka.classifiers.lazy.IBk). Use cross-validation to test its performance,
leaving the number of folds at the default value of 10. Recall
that you can examine the classifier options in the Generic Object Editor window
that pops up when you click the text beside the Choose button. The default
value of the KNN field is 1: this sets the number of neighboring
instances used when classifying.
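The idea behind IBk's KNN option can be sketched as a majority vote among the k closest training instances (a toy 1-D version in plain Python, not WEKA's implementation):

```python
from collections import Counter

def knn_predict(train, x, k=1):
    """Majority label among the k training points closest to x.
    Mirrors the role of IBk's KNN field; toy 1-D sketch."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    labels = Counter(label for _, label in neighbours)
    return labels.most_common(1)[0][0]

# Hypothetical training data (value, class).
train = [(1.0, 'a'), (1.8, 'a'), (2.2, 'a'), (2.5, 'b'), (4.0, 'b'), (4.2, 'b')]

print(knn_predict(train, 2.4, k=1))  # nearest point is (2.5, 'b') -> 'b'
print(knn_predict(train, 2.4, k=5))  # three 'a's vs. two 'b's among 5 -> 'a'
```

Changing k can flip the prediction, which is why exercises 5 and 6 ask you to compare k = 1 and k = 5.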
6) glass.arff
What is the accuracy of IBk (given in the Classifier Output box)? Run IBk again, but increase the number of neighboring
instances to k = 5 by entering this value in the KNN field. Here and throughout
this section, continue to use cross-validation as the evaluation method.
What is the accuracy of IBk with five neighboring instances (k = 5)?
7) ionosphere.arff
For J48, compare
cross-validated accuracy and the size of the trees generated for (1) the raw
data, (2) data discretized by the unsupervised discretization method in default
mode, and (3) data discretized by the same
method with binary attributes.
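In default mode the unsupervised discretization method assigns each numeric value to one of a fixed number of equal-width intervals (WEKA's Discretize filter defaults to 10 bins; treat that count as an assumption here). A minimal plain-Python sketch of equal-width binning:

```python
def equal_width_bins(values, n_bins=10):
    """Assign each value to one of n_bins equal-width intervals spanning
    [min, max] -- the default behaviour of unsupervised discretization
    (toy sketch; ignores missing values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

values = [0.0, 1.0, 2.5, 5.0, 9.9, 10.0]
print(equal_width_bins(values, n_bins=10))  # -> [0, 1, 2, 5, 9, 9]
```

The binary-attribute variant in part (3) instead replaces each discretized attribute with a set of indicator attributes, one per cut point.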
8) Apply the ranking technique to the labor negotiations data in labor.arff to determine the four most important attributes based on
information gain. On the same data, run CfsSubsetEval
for correlation-based
selection, using the BestFirst
search. Then run the wrapper method with J48 as the base learner, again using the BestFirst search. Examine the
attribute subsets that are output. Which attributes are selected by both
methods? How do they relate to the output generated by ranking
using information gain?
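The information-gain ranking in this exercise scores each attribute by how much it reduces the entropy of the class. A self-contained sketch of the computation (plain Python with toy data, not WEKA's InfoGainAttributeEval):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a class distribution, in bits."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def info_gain(attr_values, labels):
    """Information gain of an attribute: class entropy minus the
    weighted entropy of the class within each attribute value."""
    total = len(labels)
    remainder = 0.0
    for v in set(attr_values):
        subset = [l for a, l in zip(attr_values, labels) if a == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Toy example: 4 instances, one attribute, binary class.
attr = ['x', 'x', 'y', 'y']
cls  = ['yes', 'yes', 'no', 'no']
print(info_gain(attr, cls))  # splits the classes perfectly -> gain = 1.0 bit
```

Ranking simply sorts the attributes by this score, whereas CfsSubsetEval and the wrapper method evaluate whole subsets at a time, which is why their outputs can differ.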
9) Run Apriori on the weather data with each of the four rule-ranking metrics, and default settings otherwise. What is the top-ranked rule that is output for each metric?
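Apriori's four rule-ranking metrics (confidence, lift, leverage, conviction) can all be computed from the coverage counts of a rule A => B. A plain-Python sketch, using the standard formulas (the counts in the example are for outlook=overcast => play=yes in the 14-instance weather data; verify them against your own output):

```python
def rule_metrics(n, n_a, n_b, n_ab):
    """Apriori's four rule-ranking metrics for a rule A => B, from counts:
    n instances total, n_a covering A, n_b covering B, n_ab covering both."""
    p_a, p_b, p_ab = n_a / n, n_b / n, n_ab / n
    confidence = p_ab / p_a
    lift = confidence / p_b
    leverage = p_ab - p_a * p_b
    # Conviction is infinite when the rule is never violated (confidence = 1).
    conviction = (1 - p_b) / (1 - confidence) if confidence < 1 else float('inf')
    return {'confidence': confidence, 'lift': lift,
            'leverage': leverage, 'conviction': conviction}

# outlook=overcast => play=yes: A covers 4, B covers 9, both cover 4.
print(rule_metrics(14, 4, 9, 4))
```

Because the metrics weight rule coverage and accuracy differently, the top-ranked rule need not be the same under each one.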