Clustering
1) Clustering using K-Means
Get to the Weka Explorer
environment and load the training file using the Preprocess mode. Try
first with weather.arff. Get to the Cluster mode (by clicking on
the Cluster tab) and select a clustering algorithm, for example
SimpleKMeans. Then click on Start and you get the clustering result in
the output window. The actual clustering for this algorithm is shown as one
instance for each cluster representing the cluster centroid.
Scheme:
weka.clusterers.SimpleKMeans -N 2 -S 10
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Test mode: evaluate on training data
===
Clustering model (full training set) === Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Test mode: evaluate on training data
kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 16.156838252701938
Cluster centroids:
Cluster 0
Mean/Mode: rainy 75.625 86 FALSE yes
Std Devs: N/A 6.5014 7.5593 N/A N/A
Cluster 1
Mean/Mode: sunny 70.8333 75.8333 TRUE yes
Std Devs: N/A 6.1128 11.143 N/A N/A
=== Evaluation on training set ===
kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 32.31367650540387
Cluster centroids:
Cluster 0
Mean/Mode: rainy 75.625 86 FALSE yes
Std Devs: N/A 6.5014 7.5593 N/A N/A
Cluster 1
Mean/Mode: sunny 70.8333 75.8333 TRUE yes
Std Devs: N/A 6.1128 11.143 N/A N/A
Clustered Instances
0 8 ( 57%)
1 6 ( 43%)
Evaluation
The way Weka evaluates the clusterings depends on the
cluster mode you select. Four different cluster modes are available (as buttons
in the Cluster mode panel):
- Use training set (default). After generating the clustering Weka classifies the training instances into clusters according to the cluster representation and computes the percentage of instances falling in each cluster. For example, the above clustering produced by k-means shows 43% (6 instances) in cluster 0 and 57% (8 instances) in cluster 1.
- In Supplied test set or Percentage split Weka can evaluate clusterings on separate test data if the cluster representation is probabilistic (e.g. for EM).
- Classes to clusters evaluation. In this mode Weka first ignores the class attribute and generates the clustering. Then during the test phase it assigns classes to the clusters, based on the majority value of the class attribute within each cluster. Then it computes the classification error, based on this assignment and also shows the corresponding confusion matrix. An example of this for k-means is shown below.
Scheme: weka.clusterers.SimpleKMeans -N 2 -S 10
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
Ignored:
play
Test mode: Classes to clusters evaluation on training data
kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 11.156838252701938
Cluster centroids:
Cluster 0
Mean/Mode: rainy 75.625 86 FALSE
Std Devs: N/A 6.5014 7.5593 N/A
Cluster 1
Mean/Mode: sunny 70.8333 75.8333 TRUE
Std Devs: N/A 6.1128 11.143 N/A
=== Evaluation on training set ===
kMeans
======
Number of iterations: 4
Within cluster sum of squared errors: 22.31367650540387
Cluster centroids:
Cluster 0
Mean/Mode: rainy 75.625 86 FALSE
Std Devs: N/A 6.5014 7.5593 N/A
Cluster 1
Mean/Mode: sunny 70.8333 75.8333 TRUE
Std Devs: N/A 6.1128 11.143 N/A
Clustered Instances
0 8 ( 57%)
1 6 ( 43%)
Class attribute: play
Classes to Clusters:
0 1 <-- assigned to cluster
5 4 | yes
3 2 | no
Cluster 0 <-- yes
Cluster 1 <-- no
Incorrectly clustered instances : 7.0 50 %
EM
The EM clustering scheme generates probabilistic
descriptions of the clusters in terms of mean and standard deviation
for the numeric attributes and value counts (incremented by 1 and
modified with a small value to avoid zero probabilities) - for the nominal
ones. In "Classes to clusters" evaluation mode this algorithm also
outputs the log-likelihood, assigns classes to the clusters and prints the
confusion matrix and the error rate, as shown in the example below.
Clustered Instances
0 4 (
29%) 1 10 ( 71%)
Log likelihood: -8.36599
Class attribute: play
Classes to Clusters:
0 1 <-- assigned to cluster
2 7 | yes
2 3 | no
Cluster 0 <-- no
Cluster 1 <-- yes
Incorrectly clustered instances : 5.0 35.7143 %
Cobweb
Cobweb generates hierarchical clustering, where clusters are
described probabilistically. Below is an example clustering of the weather data
(weather.arff). The class attribute (play) is ignored (using the ignore
attributes panel) in order to allow later classes to clusters evaluation.
Doing this automatically through the "Classes to clusters" option
does not make much sense for hierarchical clustering, because of the large
number of clusters. Sometimes we need to evaluate particular clusters or levels
in the clustering hierarchy. We shall discuss here an approach to this.
Let us first see how Weka represents the Cobweb clusters. Below is a
copy of the output window, showing the run time information and the structure
of the clustering tree. Scheme: weka.clusterers.Cobweb -A 1.0 -C 0.234
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
Ignored:
play
Test mode: evaluate on training data
Number of merges: 2
Number of splits: 1
Number of clusters: 6
node 0 [14]
| node 1 [8]
| | leaf 2 [2]
| node 1 [8]
| | leaf 3 [3]
| node 1 [8]
| | leaf 4 [3]
node 0 [14]
| leaf 5 [6]
=== Evaluation on training set ===
Number of merges: 2
Number of splits: 1
Number of clusters: 6
node 0 [14]
| node 1 [8]
| | leaf 2 [2]
| node 1 [8]
| | leaf 3 [3]
| node 1 [8]
| | leaf 4 [3]
node 0 [14]
| leaf 5 [6]
Clustered Instances
2 2 ( 14%)
3 3 ( 21%)
4 3 ( 21%)
5 6 ( 43%)
Here is some comment on the output above:
- -A 1.0 -C 0.234 in the command line specifies the Cobweb parameters Acuity and Cutoff (see the text, page 215). They can be specified through the pop-up window that appears by clicking on area left to the Choose button.
- node N or leaf N represents a subcluster, whose parent cluster is N.
- The clustering tree structure is shown as a horizontal tree, where subclusters are aligned at the same column. For example, cluster 1 (referred to in node 1) has three subclusters 2 (leaf 2), 3 (leaf 3) and 4 (leaf 4).
- The root cluster is 0. Each line with node 0 defines a subcluster of the root.
- The number in square brackets after node N represents the number of instances in the parent cluster N.
- Clusters with [1] at the end of the line are instances.
- For example, in the above structure cluster 1 has 8 instances and its subclusters 2, 3 and 4 have 2, 3 and 3 instances correspondingly.
- To view the clustering tree right click on the last line in the result list window and then select Visualize tree.
To evaluate the Cobweb clustering using the classes
to clusters approach we need to know the class values of the
instances, belonging to the clusters. We can get this information from Weka in
the following way: After Weka finishes (with the class attribute ignored), right
click on the last line in the result list window. Then choose Visualize
cluster assignments - you get the Weka cluster visualize window.
Here you can view the clusters, for example by putting Instance_number
on X and Cluster on Y. Then click on Save and choose a file name
(*.arff). Weka saves the cluster assignments in an ARFF file. Below is
shown the file corresponding to the above Cobweb clustering.
@relation weather_clustered
@attribute Instance_number numeric
@attribute outlook {sunny,overcast,rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE,FALSE}
@attribute play {yes,no}
@attribute Cluster {cluster0,cluster1,cluster2,cluster3,cluster4,cluster5}
0,sunny,85,85,FALSE,no,cluster3
1,sunny,80,90,TRUE,no,cluster5
2,overcast,83,86,FALSE,yes,cluster2
3,rainy,70,96,FALSE,yes,cluster4
4,rainy,68,80,FALSE,yes,cluster4
5,rainy,65,70,TRUE,no,cluster5
6,overcast,64,65,TRUE,yes,cluster5
7,sunny,72,95,FALSE,no,cluster3
8,sunny,69,70,FALSE,yes,cluster3
9,rainy,75,80,FALSE,yes,cluster4
10,sunny,75,70,TRUE,yes,cluster5
11,overcast,72,90,TRUE,yes,cluster5
12,overcast,81,75,FALSE,yes,cluster2
13,rainy,71,91,TRUE,no,cluster5
To represent the cluster assignments Weka adds a new attribute Cluster and includes its corresponding values at the end of each data line. Note that all other attributes are shown, including the ignored ones (play, in this case). Also, only the leaf clusters are shown.
If we want to compute the error not only for
leaf clusters, we need to look at the clustering structure (the Visualize
tree option helps here) and determine how the leaf clusters are combined in
other clusters at higher levels of the hierarchy. For example, at the top level
we have two clusters - 1 and 5. We can get the class distribution of 5 directly
from the data (because 5 is a leaf) - 3 yes's and 3 no's. While
for cluster 1 we need its subclusters - 2, 3 and 4. Summing up the class values
we get 6 yes's and 2 no's. Finally, the majority in cluster 1 is
yes and in cluster 5 is no (could be yes too) and the
error (for the top level partitioning in two clusters) is 5/14.
Weka provides another approach to see the instances
belonging to each cluster. When you visualize the clustering tree, you can
click on a node and then see the visualization of the instances falling into
the corresponding cluster (i.e. into the leafs of the subtree). This is a very
useful feature, however if you ignore an attribute (as we did with
"play" in the experiments above) it does not show in the
visualization.