Case Study #1
Data Preprocessing with Weka
The goal of this case study is to investigate how to preprocess data using the Weka data mining tool. Weka is an open-source Java environment for data mining developed at the University of Waikato in New Zealand. It can be downloaded freely from http://www.cs.waikato.ac.nz/ml/weka/. Weka is a real asset for learning data mining because it is freely available, students can study how the different data mining models are implemented, and they can develop customized Java data mining applications. Moreover, data mining results obtained with Weka can be published in the most respected journals and conferences, which makes it a de facto development environment of choice for research in data mining, where researchers often need to develop new data mining methods.
How to use Weka
Weka can be used in four different modes: through a command line interface (CLI), through a graphical user interface called the Explorer, through the Knowledge Flow, and through the Experimenter. The Knowledge Flow allows large datasets to be processed incrementally, while the other modes can only handle small to medium-sized datasets. The Experimenter provides an environment for testing and comparing several data mining algorithms.
The explanations for this assignment focus on processing a dataset with the Explorer; nevertheless, the CLI provides the same functionality and can be chosen as an alternative. Moreover, this assignment will use only the data preprocessing capabilities of Weka, which may require at most some Java development, whereas similar functionality in SPSS/CLEMENTINE would require mastering a more complex suite of functions and learning a new programming language, called CLEM.
How to start Weka’s Explorer
- From the program icon or from the Start menu.
- From the start-up script runWeka.bat in the Weka-3-4 folder.
Both of these options start the Weka GUI Chooser. From there, the Explorer is started by clicking on the Explorer button.
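A third option, useful when no shortcuts were created at installation time, is to launch Weka from a terminal. The commands below assume that weka.jar sits in the current folder; the class weka.gui.explorer.Explorer is part of the standard Weka distribution.

    java -jar weka.jar
    java -cp weka.jar weka.gui.explorer.Explorer

The first command opens the GUI Chooser, the second starts the Explorer directly.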
Heart disease datasets
The dataset studied is the heart disease dataset from the UCI repository (datasets-UCI.jar). Two different datasets are provided: heart-h.arff (Hungarian data) and heart-c.arff (Cleveland data). These datasets describe factors of heart disease. They can be downloaded from: http://www.cs.waikato.ac.nz/~ml/weka/index_datasets.html.
The data mining project goal is to better understand the risk factors for heart disease, as represented in the 14th attribute, num (<50 means no disease, and the values >50_1 to >50_4 represent increasing levels of heart disease).
The question
on which this machine learning study concentrates is whether it is possible to
predict heart disease from the other known data about a patient. The data
mining task of choice to answer this question will be
classification/prediction, and several different algorithms will be used to
find which one provides the best predictive power.
1. Data preparation - integration
We want to merge the two datasets into one, in a step called data integration. Revise the ARFF notation from the tutorial, which is Weka's data representation language. Answer the following questions:
a. Define what data integration means.
b. Is there an entity identification or schema integration problem in this dataset? If yes, how can it be fixed?
c. Is there a redundancy problem in this dataset? If yes, how can it be fixed?
d. Are there data value conflicts in this dataset? If yes, how can they be fixed?
e. Integrate the two datasets into one single dataset, which will be used as a starting point for the next questions, and load it in the Explorer (a programmatic sketch of this merge is given after this list). How many instances do you have? How many attributes?
f. Paste a screenshot of the Explorer window.
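One possible way to perform this integration outside the Explorer is through the Weka Java API. The sketch below is only an illustration: it assumes that the two files have first been edited so that their headers (attribute names, order, and nominal values) match exactly, and the output file name heart-merged.arff is just a suggestion. The class and method names come from the standard Weka API, although details may vary between Weka versions.

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
import weka.core.converters.ArffSaver;

public class MergeHeartData {
    public static void main(String[] args) throws Exception {
        // Load the Cleveland data
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("heart-c.arff"));
        Instances cleveland = loader.getDataSet();

        // Load the Hungarian data
        loader = new ArffLoader();
        loader.setFile(new File("heart-h.arff"));
        Instances hungarian = loader.getDataSet();

        // Append the Hungarian instances to a copy of the Cleveland ones
        // (this only works if the two headers are compatible)
        Instances merged = new Instances(cleveland);
        for (int i = 0; i < hungarian.numInstances(); i++) {
            merged.add(hungarian.instance(i));
        }

        // Save the integrated dataset, which can then be opened in the Explorer
        ArffSaver saver = new ArffSaver();
        saver.setInstances(merged);
        saver.setFile(new File("heart-merged.arff"));
        saver.writeBatch();
    }
}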
2. Descriptive data summarization
Before preprocessing the
data, an important step is to get acquainted with the data – also called data
understanding in CRISP-DM.
a. Stay in the Preprocess tab for now. Study, for example, the age attribute. What is its mean? Its standard deviation? Its min and max?
b. Provide the five-number summary of this attribute. Is this figure provided in Weka? (A sketch showing how these statistics can be computed programmatically is given after this list.)
c. Specify which attributes are numeric, which are ordinal, and which are categorical/nominal.
d. Interpret the graphic shown in the lower right corner of the Explorer. What is this graphic called? What do the red and blue colors mean (pay attention to the pop-up messages that appear when dragging the mouse over the graphic)? What does this graphic represent?
e. Visualize all the attributes in graphic format. Paste a screenshot.
f. Comment on what you learn from these graphics.
g. Switch to the Visualize tab. What is the term used in the textbook to name the series of boxplots represented? By selecting the maximum jitter, and looking at the num column – the last one – can you determine which attributes seem to be the most linked to heart disease? Paste the boxplot representing the attribute you find the most predictive of heart disease (Y) as a function of num (X).
h. Does any pair of different attributes seem correlated?
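The statistics shown in the Preprocess tab can also be obtained through the Java API. The sketch below assumes the merged file heart-merged.arff from question 1.e; the mean, standard deviation, min, and max come straight from Weka's AttributeStats, while the quartiles for the five-number summary are estimated by hand with a simple sort, since Weka does not report them directly.

import java.io.File;
import java.util.Arrays;
import weka.core.AttributeStats;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class SummarizeAge {
    public static void main(String[] args) throws Exception {
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("heart-merged.arff"));
        Instances data = loader.getDataSet();

        int ageIndex = data.attribute("age").index();
        AttributeStats stats = data.attributeStats(ageIndex);

        // Same figures as displayed in the Preprocess tab
        System.out.println("mean   = " + stats.numericStats.mean);
        System.out.println("stdDev = " + stats.numericStats.stdDev);
        System.out.println("min    = " + stats.numericStats.min);
        System.out.println("max    = " + stats.numericStats.max);

        // Rough quartiles for the five-number summary
        // (missing values, if any, would appear as NaN at the end of the sorted array)
        double[] values = data.attributeToDoubleArray(ageIndex);
        Arrays.sort(values);
        int n = values.length;
        System.out.println("Q1     = " + values[n / 4]);
        System.out.println("median = " + values[n / 2]);
        System.out.println("Q3     = " + values[3 * n / 4]);
    }
}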
3. Data preparation – selection
The datasets studied have already been processed by
selecting a subset of attributes relevant for the data mining project.
a. From the documentation provided with the dataset, how many attributes were originally in these datasets?
b. With Weka, attribute selection can be achieved either from the specific Select attributes tab, or within the Preprocess tab. List the different options available in Weka for selecting attributes, with a short explanation of the corresponding method (one of them is sketched programmatically after this list).
c. In comparison with the methods for attribute selection detailed in the textbook, are any missing? Are any provided in Weka that are not covered in the textbook?
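As an illustration of one of these options, the sketch below runs a correlation-based subset evaluator (CfsSubsetEval) with a BestFirst search through the Java API, which mirrors the default configuration of the Select attributes tab in recent Weka versions. It assumes the merged file heart-merged.arff; class names are from the weka.attributeSelection package.

import java.io.File;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class SelectHeartAttributes {
    public static void main(String[] args) throws Exception {
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("heart-merged.arff"));
        Instances data = loader.getDataSet();
        // num (the 14th attribute) is the class attribute
        data.setClassIndex(data.numAttributes() - 1);

        // Correlation-based feature subset selection with best-first search
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());
        selector.setSearch(new BestFirst());
        selector.SelectAttributes(data);

        // Indices of the retained attributes (the class index is appended at the end)
        int[] selected = selector.selectedAttributes();
        for (int index : selected) {
            System.out.println(data.attribute(index).name());
        }
    }
}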
4. Data preparation - cleaning
Data cleaning deals with such defects of real-world data as incompleteness, noise, and inconsistencies. In Weka, data cleaning can be accomplished by applying filters to the data in the Preprocess tab.
a. Missing values. List the methods seen in class for dealing with missing values, and state which Weka filters implement them – if available. Remove the missing values with the method of your choice, explaining which filter you are using and why you make this choice. If a filter is not available for your method of choice, develop a new one and add it to the available filters as a Java class.
b. Noisy data. List the methods seen in class for dealing with noisy data, and which Weka filters implement them – if available.
c. Outlier detection. List the methods seen in class for detecting outliers. How would you detect outliers with Weka? Are there any outliers in this dataset, and if yes, list some of them.
d. Save the cleaned dataset into heart-cleaned.arff, and paste here a screenshot showing at least the first 10 rows of this dataset – with all the columns (a sketch of a filter-based cleaning step is given after this list).
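As one possible starting point, the sketch below applies the unsupervised ReplaceMissingValues filter, which substitutes the mean for numeric attributes and the mode for nominal ones, and writes the result to heart-cleaned.arff. This is only one of the methods listed above, and the filter names assume a standard Weka distribution.

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
import weka.core.converters.ArffSaver;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class CleanHeartData {
    public static void main(String[] args) throws Exception {
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("heart-merged.arff"));
        Instances data = loader.getDataSet();

        // Replace missing values by the mean (numeric) or the mode (nominal)
        ReplaceMissingValues replace = new ReplaceMissingValues();
        replace.setInputFormat(data);
        Instances cleaned = Filter.useFilter(data, replace);

        // Save the cleaned dataset
        ArffSaver saver = new ArffSaver();
        saver.setInstances(cleaned);
        saver.setFile(new File("heart-cleaned.arff"));
        saver.writeBatch();
    }
}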
5. Data preparation - transformation
Among the different data transformation techniques, explore those available through the Weka filters. Stay in the Preprocess tab for now. Study the following data transformations only:
a. Attribute construction – for example, adding an attribute representing the sum of two other ones. Which Weka filter permits doing this?
b. Normalize an attribute. Which Weka filter permits doing this? Can this filter perform min-max normalization? Z-score normalization? Normalization by decimal scaling? Provide detailed information about how to perform these in Weka.
c. Normalize all real attributes in the dataset using the method of your choice – state which one you choose.
d. Save the normalized dataset into heart-normal.arff, and paste here a screenshot showing at least the first 10 rows of this dataset – with all the columns (a sketch combining two of these filters is given after this list).
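To make these questions concrete, the sketch below chains two filters through the Java API: AddExpression to construct a new attribute as the sum of two existing numeric attributes, and Normalize for min-max scaling of all numeric attributes into [0,1] (Standardize would be the z-score counterpart). The filter names assume a recent Weka distribution, and the choice of attributes a4 and a5 is purely illustrative.

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
import weka.core.converters.ArffSaver;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddExpression;
import weka.filters.unsupervised.attribute.Normalize;

public class TransformHeartData {
    public static void main(String[] args) throws Exception {
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("heart-cleaned.arff"));
        Instances data = loader.getDataSet();

        // Attribute construction: sum of the 4th and 5th attributes
        // (both numeric in the standard heart attribute order; adjust indices as needed)
        AddExpression add = new AddExpression();
        add.setExpression("a4+a5");
        add.setName("a4_plus_a5");
        add.setInputFormat(data);
        Instances withSum = Filter.useFilter(data, add);

        // Min-max normalization of all numeric attributes into [0,1]
        Normalize normalize = new Normalize();
        normalize.setInputFormat(withSum);
        Instances normalized = Filter.useFilter(withSum, normalize);

        // Save the normalized dataset
        ArffSaver saver = new ArffSaver();
        saver.setInstances(normalized);
        saver.setFile(new File("heart-normal.arff"));
        saver.writeBatch();
    }
}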
6. Data preparation - reduction
Often, data mining datasets are too large to process
directly. Data reduction techniques are used to preprocess the data. Once the
data mining project has been successful on these reduced data, the larger
dataset can be processed too.
a. Stay in the Preprocess tab for now. Besides attribute selection, a reduction method is to select rows from a dataset. This is called sampling. How can sampling be performed with Weka filters? Can it perform the two main methods: Simple Random Sample Without Replacement, and Simple Random Sample With Replacement? (A sketch using the Resample filter is given after this question.)
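As one way to experiment with this, the sketch below uses the unsupervised Resample instance filter through the Java API to draw a 50% sample. Whether sampling is done with or without replacement is controlled by setNoReplacement, an option that exists in recent Weka versions; in older releases the filter samples with replacement only.

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.Resample;

public class SampleHeartData {
    public static void main(String[] args) throws Exception {
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("heart-normal.arff"));
        Instances data = loader.getDataSet();

        // Draw a 50% simple random sample without replacement
        Resample resample = new Resample();
        resample.setSampleSizePercent(50.0);
        resample.setNoReplacement(true);   // set to false for sampling with replacement
        resample.setRandomSeed(1);
        resample.setInputFormat(data);
        Instances sample = Filter.useFilter(data, resample);

        System.out.println("Original size: " + data.numInstances());
        System.out.println("Sample size:   " + sample.numInstances());
    }
}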