Exercises on Nearest Neighbor Learner
• We use a subset of the “Iris Plants Database” dataset (provided with WEKA in
  the “iris.arff” file).
• Each plant record (i.e., example) is represented by the following 5 attributes.
- SepalLength – the plant’s sepal length in cm.
- SepalWidth – the plant’s sepal width in cm.
- PetalLength – the plant’s petal length in cm.
- PetalWidth – the plant’s petal width in cm.
- Class – the classification attribute, with the possible values {Iris-setosa,
  Iris-versicolor, Iris-virginica}.
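As a reminder of how a nearest neighbor learner uses these attributes, here is a minimal Python sketch of 1-nearest-neighbor classification with Euclidean distance over the four numeric attributes. The three training records and the query plant are illustrative assumptions, not the subset used in the exercise, and the function names are hypothetical.

```python
import math

# Records in the attribute order described above:
# (SepalLength, SepalWidth, PetalLength, PetalWidth, Class).
# Illustrative values only; NOT the subset used in the exercise.
TRAINING = [
    (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
    (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
]


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor(query, training):
    """Return the class of the training record closest to the query."""
    closest = min(training, key=lambda rec: euclidean(query, rec[:4]))
    return closest[4]


# Hypothetical unlabeled plant (measurements in cm).
query = (5.0, 3.4, 1.5, 0.2)
print(nearest_neighbor(query, TRAINING))
```

With more training records, the same idea extends to k nearest neighbors by taking the k closest records and voting on the class.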
Exercises on Decision Tree
• Let’s assume that we have collected the following data set of users, some of
  whom decided to buy a computer and some of whom decided not to.
• Each user record (i.e., example) is represented by the following 5 attributes.
- Age, with the possible values {Young, Medium, Old}.
- Income, with the possible values {Low, Medium, High}.
- Student, with the possible values {Yes, No}.
- Credit_Rating, with the possible values {Fair, Excellent}.
- Buy_Computer – the classification attribute, with the possible values {Yes,
  No}.
UserID  Age     Income  Student  Credit_Rating  Buy_Computer
1       Young   High    No       Fair           No
2       Young   High    No       Excellent      No
3       Medium  High    No       Fair           Yes
4       Old     Medium  No       Fair           Yes
5       Old     Low     Yes      Fair           Yes
6       Old     Low     Yes      Excellent      No
7       Medium  Low     Yes      Excellent      Yes
8       Young   Medium  No       Fair           No
9       Young   Low     Yes      Fair           Yes
10      Old     Medium  Yes      Fair           Yes
11      Young   Medium  Yes      Excellent      Yes
12      Medium  Medium  No       Excellent      Yes
13      Medium  High    Yes      Fair           Yes
14      Old     Medium  No       Excellent      No
15      Medium  Medium  Yes      Fair           No
16      Medium  Medium  Yes      Excellent      Yes
17      Young   Low     Yes      Excellent      Yes
18      Old     High    No       Fair           No
19      Old     Low     No       Excellent      No
20      Young   Medium  Yes      Excellent      Yes
• We want to predict, for each of the following users, whether he or she will
  buy a computer.
- User #21. A young student with medium income and a fair credit rating.
- User #22. A young non-student with low income and a fair credit rating.
- User #23. A medium-aged student with high income and an excellent credit rating.
- User #24. An old non-student with high income and an excellent credit rating.
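To check a hand computation for the decision tree, the following Python sketch computes the entropy of the class attribute and the information gain of each of the four predictive attributes over Users #1-20. It assumes the tree is built ID3-style, i.e., by picking the attribute with the highest information gain at each node. The function and variable names are hypothetical.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Age", "Income", "Student", "Credit_Rating"]

# Users #1-20: (Age, Income, Student, Credit_Rating, Buy_Computer)
DATA = [
    ("Young", "High", "No", "Fair", "No"),
    ("Young", "High", "No", "Excellent", "No"),
    ("Medium", "High", "No", "Fair", "Yes"),
    ("Old", "Medium", "No", "Fair", "Yes"),
    ("Old", "Low", "Yes", "Fair", "Yes"),
    ("Old", "Low", "Yes", "Excellent", "No"),
    ("Medium", "Low", "Yes", "Excellent", "Yes"),
    ("Young", "Medium", "No", "Fair", "No"),
    ("Young", "Low", "Yes", "Fair", "Yes"),
    ("Old", "Medium", "Yes", "Fair", "Yes"),
    ("Young", "Medium", "Yes", "Excellent", "Yes"),
    ("Medium", "Medium", "No", "Excellent", "Yes"),
    ("Medium", "High", "Yes", "Fair", "Yes"),
    ("Old", "Medium", "No", "Excellent", "No"),
    ("Medium", "Medium", "Yes", "Fair", "No"),
    ("Medium", "Medium", "Yes", "Excellent", "Yes"),
    ("Young", "Low", "Yes", "Excellent", "Yes"),
    ("Old", "High", "No", "Fair", "No"),
    ("Old", "Low", "No", "Excellent", "No"),
    ("Young", "Medium", "Yes", "Excellent", "Yes"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the class minus the weighted entropy after splitting."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in {row[attr_index] for row in rows}:
        subset = [row[-1] for row in rows if row[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: information_gain(DATA, i) for i, name in enumerate(ATTRIBUTES)}
for name, gain in gains.items():
    print(f"{name}: {gain:.4f}")

# Attribute with the highest gain: the ID3 candidate for the root node.
best = max(gains, key=gains.get)
print("Root attribute candidate:", best)
```

Applying `information_gain` recursively to the subset of rows under each branch gives the splits for the lower levels of the tree.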
Use the WEKA tool
• Convert the dataset containing 20 examples (i.e., Users #1-20) into the ARFF
  format (supported by WEKA), and save it in the “buy_comp.arff” file.
• For each user in the set of Users #21-24, set the value of the Buy_Computer
  attribute to the prediction computed manually in Part I. Convert the data of
  these four users into the ARFF format, and save it in the
  “buy_comp_extra.arff” file.
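The conversion can also be scripted. The sketch below builds ARFF text with a nominal declaration for each of the 5 attributes, followed by one comma-separated line per user. It assumes that UserID is left out (it is not one of the 5 attributes listed) and that the relation is named `buy_comp`; the helper name `to_arff` is hypothetical. Only the first three users are shown; the remaining rows follow the same pattern.

```python
# ARFF header: relation name, one nominal attribute per column, then @data.
HEADER = """@relation buy_comp

@attribute Age {Young, Medium, Old}
@attribute Income {Low, Medium, High}
@attribute Student {Yes, No}
@attribute Credit_Rating {Fair, Excellent}
@attribute Buy_Computer {Yes, No}

@data
"""

# First three users from the table (UserID omitted by assumption).
ROWS = [
    ("Young", "High", "No", "Fair", "No"),
    ("Young", "High", "No", "Excellent", "No"),
    ("Medium", "High", "No", "Fair", "Yes"),
]


def to_arff(rows):
    """Return ARFF text: the header plus one comma-separated line per row."""
    return HEADER + "".join(",".join(row) + "\n" for row in rows)


print(to_arff(ROWS))
```

To save the result, write the returned string to a file, e.g. `open("buy_comp.arff", "w").write(to_arff(rows))` once `rows` holds all 20 users. The same header serves “buy_comp_extra.arff”, because both files must declare identical attributes.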
• Launch the WEKA tool, and then activate the “Explorer” environment.
• Open the “buy_comp” dataset (saved in the “buy_comp.arff” file).
- For each attribute and for each of its possible values, how many instances in
  each class have that value (i.e., what is the class distribution of the
  attribute values)?
• Go to the “Classify” tab. Select the Id3 classifier. Choose the “Percentage
  split” (66% for training) test mode. Run the classifier and observe the
  results shown in the “Classifier output” window.
- How many instances are used for training? How many for testing?
- Does the test set currently used include the four instances of Users #21-24?
- How many instances are incorrectly classified?
- What is the MAE (mean absolute error) of the learned decision tree (DT)?
- What can you infer from the information shown in the Confusion Matrix?
- Visualize the errors made by the learned DT. In the plot, how can you
  differentiate between the correctly and incorrectly classified instances?
  How can you see the detailed information of an incorrectly classified
  instance?
- How can you save the learned DT to a file?
- How can you visualize the structure of the learned DT?
• Now, in the “Test options” panel, select the “Supplied test set” option.
  Click the nearby “Set...” button and locate the “buy_comp_extra.arff” file.
  Run the classifier and observe the results shown in the “Classifier output”
  window.
- How many instances are used for training? How many for testing?
- Does the test set currently used include the four examples (i.e., Users
  #21-24)?
- In the “Classifier output” window, where can you find the information that
  shows for which of the four users (i.e., Users #21-24) the learned DT
  predicts correctly and for which it predicts incorrectly?
- What is the MAE (mean absolute error) of the learned DT?
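To help interpret the MAE value for a nominal class, the sketch below assumes MAE is computed per test instance as the absolute difference between the predicted class probabilities and the 0/1 indicator of the actual class, averaged over the classes and then over the instances. This is our understanding of WEKA's evaluation, so verify it against the figures in the “Classifier output” window. The predictions and labels are hypothetical, not results from the exercise.

```python
CLASSES = ["Yes", "No"]


def mean_absolute_error(predicted, actual, classes):
    """MAE over class-probability vectors (assumed convention, see lead-in).

    predicted: list of dicts mapping class -> predicted probability
    actual:    list of true class labels, same order as predicted
    """
    total = 0.0
    for dist, label in zip(predicted, actual):
        # Sum |p(c) - indicator(c == actual)| over the classes, then average.
        per_class = sum(
            abs(dist[c] - (1.0 if c == label else 0.0)) for c in classes
        )
        total += per_class / len(classes)
    return total / len(actual)


# Hypothetical hard (0/1) predictions for four test instances; one is wrong.
predicted = [
    {"Yes": 1.0, "No": 0.0},
    {"Yes": 0.0, "No": 1.0},
    {"Yes": 1.0, "No": 0.0},
    {"Yes": 1.0, "No": 0.0},
]
actual = ["Yes", "No", "Yes", "No"]

print(mean_absolute_error(predicted, actual, CLASSES))
```

Under this convention, a tree that outputs hard 0/1 predictions has an MAE equal to the fraction of misclassified test instances.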