Data Preprocessing Exercises
Exercise 1
Attribute Relevance Ranking
For each step, open the indicated file in the “Preprocess” window. Then go to the “Attribute Selection” window and set the “Attribute selection mode” to “Use full training set”. For the case described below, perform attribute ranking using the following attribute selection methods with default parameters:
a) InfoGainAttributeEval; and
b) GainRatioAttributeEval.
These attribute selection methods should consider only the non-class attributes (for each data set, the class attribute is indicated above the “Start” button). Record the output of each run in a text file called “output.txt”: copy the output of the run from the “Attribute selection output” window in the Explorer and paste it at the end of the “output.txt” file.
a) Perform attribute ranking on the “contact-lenses.arff” data set using the two attribute ranking methods with default parameters. (A sketch of the quantities behind both rankers follows below.)
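For reference, here is a minimal Python sketch of what the two rankers compute: information gain and gain ratio for a single nominal attribute. The toy columns below are illustrative, not the real contact-lenses records.

    from collections import Counter
    from math import log2

    def entropy(labels):
        # Shannon entropy (in bits) of a list of discrete values.
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def info_gain(attribute, labels):
        # H(class) minus the weighted entropy of each attribute-value subset.
        n = len(labels)
        remainder = 0.0
        for value in set(attribute):
            subset = [l for a, l in zip(attribute, labels) if a == value]
            remainder += len(subset) / n * entropy(subset)
        return entropy(labels) - remainder

    def gain_ratio(attribute, labels):
        # Information gain divided by the attribute's own entropy ("split
        # information"), which penalises attributes with many distinct values.
        split_info = entropy(attribute)
        return info_gain(attribute, labels) / split_info if split_info else 0.0

    # Hypothetical columns in the spirit of contact-lenses.arff.
    age  = ["young", "young", "pre", "pre", "presbyopic", "presbyopic"]
    lens = ["soft", "none", "soft", "hard", "none", "none"]
    print(f"information gain: {info_gain(age, lens):.3f}")
    print(f"gain ratio:       {gain_ratio(age, lens):.3f}")

InfoGainAttributeEval ranks by the first quantity and GainRatioAttributeEval by the second; large disagreements between them typically involve attributes with many distinct values.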
Evaluation
Once you have performed the experiments, you should spend some time evaluating your results. In particular, try to answer at least the following questions: Why would one need attribute relevance ranking? Do these attribute-ranking methods often agree or disagree? On which data set(s), if any, do these methods disagree? Do discretization and the choice of discretization method affect the results of attribute ranking? Do missing values affect the results of attribute ranking? Record these and any other observations in a Word file called “Observations.doc”.
Exercise 2
1. Fire up the Weka (Waikato Environment for Knowledge Analysis) software, launch the Explorer window and select the “Preprocess” tab.
2. Open the iris data set (“iris.arff”; this should be in the ./data/ directory of the Weka install). What information do you have about the data set (e.g. number of instances, attributes and classes)? What type of attributes does this data set contain (nominal or numeric)? What are the classes in this data set? Which attribute has the greatest standard deviation? What does this tell you about that attribute? (You might also find it useful to open “iris.arff” in a text editor.)
3. Under “Filter” choose the “Standardize” filter and apply it to all attributes. What does it do? How does it affect the attributes' statistics? Click “Undo” to un-standardize the data, then apply the “Normalize” filter to all the attributes. What does it do? How does it affect the attributes' statistics? How does it differ from “Standardize”? Click “Undo” again to return the data to its original state.
4. At the bottom right of the window there should be a graph which visualizes the data set. Making sure “Class: class (Nom)” is selected in the drop-down box, click “Visualize All”. What can you interpret from these graphs? Which attribute(s) discriminate best between the classes in the data set? How do the “Standardize” and “Normalize” filters affect these graphs?
5. Under “Filter” choose the “AttributeSelection” filter. What does it do? Are the attributes it selects the same as the ones you chose as discriminatory above? How does its behavior change as you alter its parameters?
6. Select the “Visualize” tab. This shows you 2D scatter plots of each attribute against each other attribute (similar to the F1 vs F2 plots from tutorial 1). Make sure the drop-down box at the bottom says “Color: class (Nom)”. Pay close attention to the plots between attributes you think discriminate best between classes, and the plots between attributes selected by the “AttributeSelection” filter. Can you verify from these plots whether your thoughts and the “AttributeSelection” filter are correct? Which attributes are correlated?
Exercise 3
1. Download the Old Faithful data set.
2. Open this data in Excel. There are 2 attributes and 2 classes. Sort the data by class (be careful to sort entire rows), create a line or bar plot of each feature individually, and save the graphs in a Word file. What do you notice about the plots from a visual inspection?
3. For each class, compute each feature's minimum, maximum, mean and standard deviation.
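If you want to double-check the Excel figures programmatically, here is a pandas sketch; it assumes the sheet was exported as “faithful.csv” with hypothetical column names F1, F2 and class.

    import pandas as pd

    df = pd.read_csv("faithful.csv")   # hypothetical export; columns F1, F2, class

    # Min, max, mean and standard deviation of each feature, per class.
    stats = df.groupby("class")[["F1", "F2"]].agg(["min", "max", "mean", "std"])
    print(stats)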
4. Generate a scatter plot of F1 vs F2. Can you visually guess whether these attributes are related or not?
5. Based on the scatter plot generated in point 4, determine which data points are outliers (extreme high or low values). Do this manually by visually inspecting the scatter plot, and remove at least 5 points. (A sketch of steps 4-5 follows below.)
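A matplotlib sketch of steps 4-5 under the same assumptions as above; the outlier indices are placeholders for whatever you read off the plot.

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("faithful.csv")   # hypothetical export; columns F1, F2, class

    # Step 4: scatter plot of F1 vs F2, coloured by class.
    plt.scatter(df["F1"], df["F2"], c=df["class"].astype("category").cat.codes)
    plt.xlabel("F1"); plt.ylabel("F2"); plt.title("Old Faithful: F1 vs F2")
    plt.show()

    # Step 5: drop the points judged to be outliers on the plot (placeholder indices).
    outlier_rows = [3, 57, 101, 198, 240]
    df_clean = df.drop(index=outlier_rows)
    print(f"removed {len(df) - len(df_clean)} points, {len(df_clean)} remain")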
6. Compute the correlation between the features for each class separately and create a correlation matrix. What does it show?
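Step 6 as a pandas sketch, again assuming the hypothetical “faithful.csv” export:

    import pandas as pd

    df = pd.read_csv("faithful.csv")   # hypothetical export; columns F1, F2, class

    # One Pearson correlation matrix per class.
    for label, group in df.groupby("class"):
        print(f"class {label}:")
        print(group[["F1", "F2"]].corr(), "\n")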
7. Normalise all features to the range [0, 1]. There are several ways this can be done; we will use standard min-max normalization. Recompute point 6: has it made a difference?
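And step 7 as a sketch under the same assumptions. Note that Pearson correlation is invariant under positive linear rescaling, so min-max normalization should leave the matrices of point 6 unchanged.

    import pandas as pd

    df = pd.read_csv("faithful.csv")   # hypothetical export; columns F1, F2, class

    # Min-max normalization of each feature to [0, 1].
    for col in ["F1", "F2"]:
        df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

    # Recompute point 6 on the normalized data.
    for label, group in df.groupby("class"):
        print(f"class {label}:")
        print(group[["F1", "F2"]].corr(), "\n")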
5) Glass.arff
How many attributes are there in the dataset? What are their names? What is the class attribute? Run the classification algorithm IBk (weka.classifiers.lazy.IBk). Use cross-validation to test its performance, leaving the number of folds at the default value of 10. Recall that you can examine the classifier options in the Generic Object Editor window that pops up when you click the text beside the Choose button. The default value of the KNN field is 1; this sets the number of neighboring instances to use when classifying.
6) Glass.arff
What is the accuracy of IBk (given in the Classifier Output box)? Run IBk again, but increase the number of neighboring
instances to k = 5 by entering this value in the KNN field. Here and throughout
this section, continue to use cross-validation as the evaluation method.
What is the accuracy of IBk with five neighboring instances (k = 5)?
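To sanity-check the Weka numbers outside the GUI, here is a scikit-learn sketch of the same two runs. This is not Weka's IBk, and the file path and the class-attribute name “Type” are assumptions about your copy of glass.arff; the 10-fold accuracies should nevertheless be in the same ballpark.

    from scipy.io import arff
    import pandas as pd
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    data, _ = arff.loadarff("glass.arff")        # path assumed
    df = pd.DataFrame(data)
    X = df.drop(columns=["Type"]).to_numpy()     # "Type" assumed to be the class attribute
    y = df["Type"].str.decode("utf-8")           # nominal values load as bytes

    for k in (1, 5):
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10)
        print(f"k={k}: mean 10-fold CV accuracy = {scores.mean():.3f}")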
7) Ionosphere.arff
For J48, compare cross-validated accuracy and the size of the trees generated for (1) the raw data, (2) data discretized by the unsupervised discretization method in default mode, and (3) data discretized by the same method with binary attributes.
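For intuition about step (2), here is a small Python sketch of unsupervised equal-width discretization; Weka's Discretize filter defaults to 10 equal-width bins, with pandas.cut standing in for it on a randomly generated column. The variant in step (3) additionally re-encodes each binned attribute as binary indicator attributes.

    import numpy as np
    import pandas as pd

    values = pd.Series(np.random.default_rng(0).normal(size=100))  # stand-in column
    binned = pd.cut(values, bins=10, labels=False)                 # 10 equal-width bins
    print(binned.value_counts().sort_index())                      # instances per bin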
8) Apply the ranking technique to the labor negotiations data in labor.arff to determine the four most important attributes based on information gain. On the same data, run CfsSubsetEval for correlation-based selection, using the BestFirst search. Then run the wrapper method with J48 as the base learner, again using the BestFirst search. Examine the attribute subsets that are output. Which attributes are selected by both methods? How do they relate to the output generated by ranking using information gain?
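As a rough analogue of the wrapper method, here is a scikit-learn sketch: a decision tree (standing in for J48) scores attribute subsets by cross-validated accuracy. It is shown on iris for simplicity, since labor.arff would first need its nominal attributes encoded and its missing values handled, and the search is greedy forward selection rather than Weka's BestFirst.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.feature_selection import SequentialFeatureSelector

    X, y = load_iris(return_X_y=True)

    # Greedy forward search: repeatedly add the attribute whose inclusion
    # gives the best 10-fold cross-validated accuracy for the tree.
    selector = SequentialFeatureSelector(
        DecisionTreeClassifier(random_state=0),
        n_features_to_select=2, direction="forward", cv=10)
    selector.fit(X, y)
    print("selected attribute indices:", selector.get_support(indices=True))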
9) Run Apriori on the weather data with each of the four rule-ranking metrics, and default settings otherwise. What is the top-ranked rule that is output for each metric?
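Weka's Apriori offers confidence, lift, leverage and conviction as its metric types. The following Python sketch computes all four from first principles for a single hypothetical rule A -> B; the counts are made up, with only the total of 14 instances taken from the weather data.

    n    = 14   # total instances (the weather data has 14)
    n_A  = 6    # instances matching the antecedent A (made-up count)
    n_B  = 9    # instances matching the consequent B (made-up count)
    n_AB = 5    # instances matching both (made-up count)

    p_A, p_B, p_AB = n_A / n, n_B / n, n_AB / n

    confidence = p_AB / p_A                    # P(B | A)
    lift       = confidence / p_B              # how much A boosts B over chance
    leverage   = p_AB - p_A * p_B              # excess joint probability
    conviction = (1 - p_B) / (1 - confidence)  # ratio form; infinite when confidence = 1

    print(f"confidence={confidence:.2f} lift={lift:.2f} "
          f"leverage={leverage:.2f} conviction={conviction:.2f}")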