Data Preprocessing Exercises
Exercise 1
Attribute Relevance Ranking
For each step, open the indicated file in the “Preprocess” window. Then go to the “Attribute Selection” window and set the “Attribute selection mode” to “Use full training set”. For the case described below, perform attribute ranking using the following attribute selection methods with default parameters:
a) InfoGainAttributeEval; and
b) GainRatioAttributeEval.
These attribute selection methods should consider only the non-class attributes (for each data set, the class attribute is indicated above the “Start” button). Record the output of each run in a text file called “output.txt”: copy the output of the run from the “Attribute selection output” window in the Explorer and paste it at the end of the “output.txt” file.
a) Perform attribute ranking on the “contact-lenses.arff” data set using the two attribute ranking methods with default parameters. (A sketch of the quantities behind both rankers follows below.)
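For reference, here is a minimal Python sketch of what the two rankers compute: information gain and gain ratio for a single nominal attribute. The toy columns below are illustrative, not the real contact-lenses records.

    from collections import Counter
    from math import log2

    def entropy(labels):
        # Shannon entropy (in bits) of a list of discrete values.
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def info_gain(attribute, labels):
        # H(class) minus the weighted entropy of each attribute-value subset.
        n = len(labels)
        remainder = 0.0
        for value in set(attribute):
            subset = [l for a, l in zip(attribute, labels) if a == value]
            remainder += len(subset) / n * entropy(subset)
        return entropy(labels) - remainder

    def gain_ratio(attribute, labels):
        # Information gain divided by the attribute's own entropy ("split
        # information"), which penalises attributes with many distinct values.
        split_info = entropy(attribute)
        return info_gain(attribute, labels) / split_info if split_info else 0.0

    # Hypothetical columns in the spirit of contact-lenses.arff.
    age  = ["young", "young", "pre", "pre", "presbyopic", "presbyopic"]
    lens = ["soft", "none", "soft", "hard", "none", "none"]
    print(f"information gain: {info_gain(age, lens):.3f}")
    print(f"gain ratio:       {gain_ratio(age, lens):.3f}")

InfoGainAttributeEval ranks by the first quantity and GainRatioAttributeEval by the second; large disagreements between them typically involve attributes with many distinct values.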
Evaluation
Once you have performed the experiments, you should spend some time evaluating your results. In particular, try to answer at least the following questions: Why would one need attribute relevance ranking? Do these attribute-ranking methods often agree or disagree? On which data set(s), if any, do these methods disagree? Do discretization and the choice of discretization method affect the results of attribute ranking? Do missing values affect the results of attribute ranking? Record these and any other observations in a Word file called “Observations.doc”.
Exercise 2
1. Fire up the Weka (Waikato Environment for Knowledge Analysis) software, launch the Explorer window and select the “Preprocess” tab.
2. Open the iris data set (“iris.arff”; this should be in the ./data/ directory of the Weka install). What information do you have about the data set (e.g. number of instances, attributes and classes)? What type of attributes does this data set contain (nominal or numeric)? What are the classes in this data set? Which attribute has the greatest standard deviation? What does this tell you about that attribute? (You might also find it useful to open “iris.arff” in a text editor.)
3. Under “Filter” choose the “Standardize” filter and apply it to all attributes. What does it do? How does it affect the attributes' statistics? Click “Undo” to un-standardize the data, then apply the “Normalize” filter to all the attributes. What does it do? How does it affect the attributes' statistics? How does it differ from “Standardize”? Click “Undo” again to return the data to its original state.
4. At the bottom right of the window there should be a graph which visualizes the data set. Making sure “Class: class (Nom)” is selected in the drop-down box, click “Visualize All”. What can you interpret from these graphs? Which attribute(s) discriminate best between the classes in the data set? How do the “Standardize” and “Normalize” filters affect these graphs?
5. Under “Filter” choose the “AttributeSelection” filter. What does it do? Are the attributes it selects the same as the ones you chose as discriminatory above? How does its behavior change as you alter its parameters?
6. Select the “Visualize” tab. This shows you 2D scatter plots of each attribute against each other attribute (similar to the F1 vs F2 plots from tutorial 1). Make sure the drop-down box at the bottom says “Color: class (Nom)”. Pay close attention to the plots between attributes you think discriminate best between classes, and the plots between attributes selected by the “AttributeSelection” filter. Can you verify from these plots whether your thoughts and the “AttributeSelection” filter are correct? Which attributes are correlated?
Exercise 3
1. Download the Old Faithful data set.
2. Open this data in Excel. There are 2 attributes and 2 classes. Sort the data by class (be careful to sort entire rows), create a line or bar plot of each feature individually, and save the graphs in a Word file. What do you notice about the plots from a visual inspection?
3. For each class, compute each feature's minimum, maximum, mean and standard deviation.
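If you want to double-check the Excel figures programmatically, here is a pandas sketch; it assumes the sheet was exported as “faithful.csv” with hypothetical column names F1, F2 and class.

    import pandas as pd

    df = pd.read_csv("faithful.csv")   # hypothetical export; columns F1, F2, class

    # Min, max, mean and standard deviation of each feature, per class.
    stats = df.groupby("class")[["F1", "F2"]].agg(["min", "max", "mean", "std"])
    print(stats)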
4. Generate a scatter plot of F1 vs F2. Can you visually guess whether these attributes are related or not?
5. Based on the scatter plot generated in point 4, determine which data points are outliers (extreme high or low values). Do this manually by visually inspecting the scatter plot, and remove at least 5 points. (A sketch of steps 4-5 follows below.)
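A matplotlib sketch of steps 4-5 under the same assumptions as above; the outlier indices are placeholders for whatever you read off the plot.

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("faithful.csv")   # hypothetical export; columns F1, F2, class

    # Step 4: scatter plot of F1 vs F2, coloured by class.
    plt.scatter(df["F1"], df["F2"], c=df["class"].astype("category").cat.codes)
    plt.xlabel("F1"); plt.ylabel("F2"); plt.title("Old Faithful: F1 vs F2")
    plt.show()

    # Step 5: drop the points judged to be outliers on the plot (placeholder indices).
    outlier_rows = [3, 57, 101, 198, 240]
    df_clean = df.drop(index=outlier_rows)
    print(f"removed {len(df) - len(df_clean)} points, {len(df_clean)} remain")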
6. Compute the correlation between the features for each class separately and create a correlation matrix. What does it show?
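Step 6 as a pandas sketch, again assuming the hypothetical “faithful.csv” export:

    import pandas as pd

    df = pd.read_csv("faithful.csv")   # hypothetical export; columns F1, F2, class

    # One Pearson correlation matrix per class.
    for label, group in df.groupby("class"):
        print(f"class {label}:")
        print(group[["F1", "F2"]].corr(), "\n")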
7. Normalise all features to the range [0, 1]. There are several ways this can be done; we will use standard min-max normalization. Recompute point 6: has it made a difference?
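And step 7 as a sketch under the same assumptions. Note that Pearson correlation is invariant under positive linear rescaling, so min-max normalization should leave the matrices of point 6 unchanged.

    import pandas as pd

    df = pd.read_csv("faithful.csv")   # hypothetical export; columns F1, F2, class

    # Min-max normalization of each feature to [0, 1].
    for col in ["F1", "F2"]:
        df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

    # Recompute point 6 on the normalized data.
    for label, group in df.groupby("class"):
        print(f"class {label}:")
        print(group[["F1", "F2"]].corr(), "\n")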
5) Glass.arff
How many attributes are there in the dataset? What are their names? What is the class attribute? Run the classification algorithm IBk (weka.classifiers.lazy.IBk). Use cross-validation to test its performance, leaving the number of folds at the default value of 10. Recall that you can examine the classifier options in the Generic Object Editor window that pops up when you click the text beside the Choose button. The default value of the KNN field is 1; this sets the number of neighboring instances to use when classifying.
6) Glass.arff
What is the accuracy of IBk (given in the Classifier Output box)? Run IBk again, but increase the number of neighboring
instances to k = 5 by entering this value in the KNN field. Here and throughout
this section, continue to use cross-validation as the evaluation method.
What is the accuracy of IBk with five neighboring instances (k = 5)?
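To sanity-check the Weka numbers outside the GUI, here is a scikit-learn sketch of the same two runs. This is not Weka's IBk, and the file path and the class-attribute name “Type” are assumptions about your copy of glass.arff; the 10-fold accuracies should nevertheless be in the same ballpark.

    from scipy.io import arff
    import pandas as pd
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    data, _ = arff.loadarff("glass.arff")        # path assumed
    df = pd.DataFrame(data)
    X = df.drop(columns=["Type"]).to_numpy()     # "Type" assumed to be the class attribute
    y = df["Type"].str.decode("utf-8")           # nominal values load as bytes

    for k in (1, 5):
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10)
        print(f"k={k}: mean 10-fold CV accuracy = {scores.mean():.3f}")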
7) Ionosphere.arff
For J48, compare cross-validated accuracy and the size of the trees generated for (1) the raw data, (2) data discretized by the unsupervised discretization method in default mode, and (3) data discretized by the same method with binary attributes.
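For intuition about step (2), here is a small Python sketch of unsupervised equal-width discretization; Weka's Discretize filter defaults to 10 equal-width bins, with pandas.cut standing in for it on a randomly generated column. The variant in step (3) additionally re-encodes each binned attribute as binary indicator attributes.

    import numpy as np
    import pandas as pd

    values = pd.Series(np.random.default_rng(0).normal(size=100))  # stand-in column
    binned = pd.cut(values, bins=10, labels=False)                 # 10 equal-width bins
    print(binned.value_counts().sort_index())                      # instances per bin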
8) Apply the ranking technique to the labor negotiations data in labor.arff to determine the four most important attributes based on information gain. On the same data, run CfsSubsetEval for correlation-based selection, using the BestFirst search. Then run the wrapper method with J48 as the base learner, again using the BestFirst search. Examine the attribute subsets that are output. Which attributes are selected by both methods? How do they relate to the output generated by ranking using information gain?
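As a rough analogue of the wrapper method, here is a scikit-learn sketch: a decision tree (standing in for J48) scores attribute subsets by cross-validated accuracy. It is shown on iris for simplicity, since labor.arff would first need its nominal attributes encoded and its missing values handled, and the search is greedy forward selection rather than Weka's BestFirst.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.feature_selection import SequentialFeatureSelector

    X, y = load_iris(return_X_y=True)

    # Greedy forward search: repeatedly add the attribute whose inclusion
    # gives the best 10-fold cross-validated accuracy for the tree.
    selector = SequentialFeatureSelector(
        DecisionTreeClassifier(random_state=0),
        n_features_to_select=2, direction="forward", cv=10)
    selector.fit(X, y)
    print("selected attribute indices:", selector.get_support(indices=True))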
9) Run Apriori on the weather data with each of the four rule-ranking metrics, and default settings otherwise. What is the top-ranked rule that is output for each metric?
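Weka's Apriori offers confidence, lift, leverage and conviction as its metric types. The following Python sketch computes all four from first principles for a single hypothetical rule A -> B; the counts are made up, with only the total of 14 instances taken from the weather data.

    n    = 14   # total instances (the weather data has 14)
    n_A  = 6    # instances matching the antecedent A (made-up count)
    n_B  = 9    # instances matching the consequent B (made-up count)
    n_AB = 5    # instances matching both (made-up count)

    p_A, p_B, p_AB = n_A / n, n_B / n, n_AB / n

    confidence = p_AB / p_A                    # P(B | A)
    lift       = confidence / p_B              # how much A boosts B over chance
    leverage   = p_AB - p_A * p_B              # excess joint probability
    conviction = (1 - p_B) / (1 - confidence)  # ratio form; infinite when confidence = 1

    print(f"confidence={confidence:.2f} lift={lift:.2f} "
          f"leverage={leverage:.2f} conviction={conviction:.2f}")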