The KDD Process in Weka
This experiment uses the Weka data mining tool. Weka is an open-source, Java-based data mining environment developed at the University of Waikato in New Zealand.
Heart disease datasets
The dataset studied is the heart disease dataset from the UCI repository. Two different datasets are provided: heart-h.arff (Hungarian data) and heart-c.arff (Cleveland data). These datasets describe risk factors for heart disease. Both datasets are available to you on the assignment page.
The goal of the data mining project is to better understand the risk factors for heart disease, as represented in the 14th attribute, num (<50 means no disease, and values >50_1 to >50_4 represent increasing levels of heart disease).
The question on which this machine learning study concentrates is whether it is possible to predict heart disease from the other known data about a patient. The data mining task chosen to answer this question is classification/prediction, and several different algorithms will be used to find which one provides the best predictive power. However, this exercise focuses on the various aspects of the KDD process.
1. Data preparation - integration
We want to merge the two datasets into one, in a step called data integration. Revise the ARFF notation from the tutorial; ARFF is Weka's data representation language. Answer the following questions:
a. Define what data integration means (in your own words).
b. Is there an entity identification or schema integration problem in this dataset? If so, how can it be fixed?
c. Is there a redundancy problem in this dataset? If so, how can it be fixed?
d. Are there data value conflicts in this dataset? If so, how can they be fixed?
e. Integrate the two datasets into one single dataset, which will be used as a starting point for the next questions, and load it in the Explorer. How many instances do you have? How many attributes? (You could do this using Excel or another spreadsheet program. First, save your individual files as CSV files in Weka and open them in a spreadsheet program. Copy the rows from one file into the other and save the merged file as CSV. Open it in Weka and save it as ARFF. Take care of the above questions and think about rectifying potential problems.) A programmatic alternative using the Weka Java API is sketched after question f.
f. Paste a screenshot of the Explorer window.
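
For reference, the merge can also be scripted against the Weka Java API instead of going through CSV. The following is a minimal sketch, not the required procedure: it assumes the two files share a compatible header (same attributes, in the same order, with matching nominal values), and the file names are placeholders.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.core.converters.ArffSaver;
    import java.io.File;

    public class MergeHeartData {
        public static void main(String[] args) throws Exception {
            // Load both datasets (paths are placeholders).
            Instances cleveland = DataSource.read("heart-c.arff");
            Instances hungarian = DataSource.read("heart-h.arff");

            // Appending rows only works if the two headers are compatible,
            // i.e. same attributes in the same order with matching nominal values.
            Instances merged = new Instances(cleveland);
            for (int i = 0; i < hungarian.numInstances(); i++) {
                merged.add(hungarian.instance(i));
            }

            // Save the merged dataset as a new ARFF file.
            ArffSaver saver = new ArffSaver();
            saver.setInstances(merged);
            saver.setFile(new File("heart-merged.arff"));
            saver.writeBatch();
        }
    }

If the headers differ (an entity identification or schema integration problem), the copy loop will misalign values, which is exactly the kind of issue questions b-d ask you to check for.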
2. Descriptive data summarization
Before preprocessing the data, an important step is to get acquainted with the data, also called data understanding.
a. Stay in the Preprocess tab for now. Study, for example, the age attribute. What is its mean? Its standard deviation? Its min and max?
b. Provide the five-number summary of this attribute: minimum, lower quartile (Q1), median, upper quartile (Q3), and maximum. Is this summary provided in Weka? (A programmatic way to obtain these statistics is sketched after question h.)
c. Specify which attributes are numeric, which are ordinal, and which are categorical/nominal.
d. Interpret the graphic shown in the lower right corner of the Explorer. What is this type of graphic called? What do the red and blue colors mean (pay attention to the pop-up messages that appear when hovering the mouse over the graphic)? What does this graphic represent?
e. Visualize all the attributes in graphic format. Paste a screenshot.
f. Comment on what you learn from these graphics.
g. Switch to the Visualize tab. By selecting the maximum jitter and looking at the num column (the last one), can you determine which attributes seem to be the most linked to heart disease? Paste the scatter plot showing the attribute you find the most predictive of heart disease (Y) as a function of num (X).
h. Does any pair of different attributes seem correlated?
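
As a cross-check on the values read from the Preprocess tab, the same statistics can be obtained programmatically. This is a minimal sketch; it assumes the merged file from question 1e is named heart-merged.arff and that age has no missing values, and the quartile computation is a simple index-based approximation.

    import weka.core.Instances;
    import weka.core.AttributeStats;
    import weka.core.converters.ConverterUtils.DataSource;

    public class SummarizeAge {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("heart-merged.arff"); // placeholder path
            int ageIndex = data.attribute("age").index();

            // Basic statistics, as reported in the Preprocess tab.
            AttributeStats stats = data.attributeStats(ageIndex);
            System.out.println("mean   = " + stats.numericStats.mean);
            System.out.println("stddev = " + stats.numericStats.stdDev);
            System.out.println("min    = " + stats.numericStats.min);
            System.out.println("max    = " + stats.numericStats.max);

            // The median and quartiles (five-number summary) are not shown in
            // the Preprocess tab, but can be approximated from the sorted values
            // (assumes age has no missing values).
            double[] values = data.attributeToDoubleArray(ageIndex);
            java.util.Arrays.sort(values);
            System.out.println("Q1     = " + values[values.length / 4]);
            System.out.println("median = " + values[values.length / 2]);
            System.out.println("Q3     = " + values[3 * values.length / 4]);
        }
    }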
3. Data preparation - selection
The datasets studied have already been processed by selecting a subset of attributes relevant for the data mining project.
a. From the documentation provided with the dataset, how many attributes were originally in these datasets?
b. With Weka, attribute selection can be achieved either from the dedicated Select attributes tab or within the Preprocess tab. List the different options in Weka for selecting attributes, with a short explanation of each corresponding method. (A sketch of one such method, driven from the Java API, follows this question.)
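
As an illustration of one of those options, the sketch below drives correlation-based subset selection (CfsSubsetEval with BestFirst search, the default combination in the Select attributes tab) from the Java API. The file name and the choice of class index are assumptions.

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.attributeSelection.BestFirst;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class SelectAttributesExample {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("heart-merged.arff"); // placeholder path
            data.setClassIndex(data.numAttributes() - 1);          // num is the last attribute

            // Correlation-based feature subset evaluation with best-first search.
            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(new CfsSubsetEval());
            selector.setSearch(new BestFirst());
            selector.SelectAttributes(data);

            // Print the selected subset and the search summary.
            System.out.println(selector.toResultsString());
        }
    }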
4. Data preparation - cleaning
Data cleaning deals with such defects of real-world data as incompleteness, noise, and inconsistencies. In Weka, data cleaning can be accomplished by applying filters to the data in the Preprocess tab.
a. Missing values. List the methods seen in class for dealing with missing values, and which Weka filters implement them (if available). Remove the missing values with the method of your choice, explaining which filter you are using and why you made this choice. (A sketch of one option, using the Java API, follows question c.)
b. Noisy data. List the methods seen in class for dealing with noisy data, and which Weka filters implement them (if available).
c. Save the cleaned dataset into heart-cleaned.arff, and paste here a screenshot showing at least the first 10 rows of this dataset, with all the columns.
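
One possible cleaning step, sketched against the Java API: the unsupervised ReplaceMissingValues filter, which imputes the mean for numeric attributes and the mode for nominal ones. The input and output file names are placeholders, and whether imputation is the right choice here is for you to argue in question a.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.core.converters.ArffSaver;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.ReplaceMissingValues;
    import java.io.File;

    public class CleanHeartData {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("heart-merged.arff"); // placeholder path

            // ReplaceMissingValues fills numeric attributes with the mean and
            // nominal attributes with the mode of the observed values.
            ReplaceMissingValues filter = new ReplaceMissingValues();
            filter.setInputFormat(data);
            Instances cleaned = Filter.useFilter(data, filter);

            // Save the cleaned dataset.
            ArffSaver saver = new ArffSaver();
            saver.setInstances(cleaned);
            saver.setFile(new File("heart-cleaned.arff"));
            saver.writeBatch();
        }
    }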
5. Data preparation - transformation
Among the different data transformation techniques, explore those available through the Weka filters. Stay in the Preprocess tab for now. Study only the following data transformations:
a. Attribute construction, for example adding an attribute representing the sum of two others. Which Weka filter allows you to do this?
b. Normalize an attribute. Which Weka filter allows you to do this? Can this filter perform min-max normalization? Z-score normalization? Decimal scaling normalization? Provide detailed information about how to perform these in Weka.
c. Normalize all real attributes in the dataset using the method of your choice; state which one you choose. (A sketch combining attribute construction and normalization via the Java API follows question d.)
d. Save the normalized dataset into heart-normal.arff, and paste here a screenshot showing at least the first 10 rows of this dataset, with all the columns.
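
The sketch below shows one way to chain the two transformations from the Java API: AddExpression for attribute construction and Normalize for min-max scaling to [0, 1] (Standardize would give z-score normalization instead). The constructed attribute, the attribute indices used in the expression, and the file names are illustrative assumptions only.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.core.converters.ArffSaver;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.AddExpression;
    import weka.filters.unsupervised.attribute.Normalize;
    import java.io.File;

    public class TransformHeartData {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("heart-cleaned.arff"); // placeholder path

            // Attribute construction: add an attribute holding the sum of
            // attributes 1 and 4 (indices in the expression are 1-based).
            AddExpression add = new AddExpression();
            add.setExpression("a1+a4");
            add.setName("sum_a1_a4"); // hypothetical name for illustration
            add.setInputFormat(data);
            data = Filter.useFilter(data, add);

            // Min-max normalization: Normalize rescales every numeric attribute
            // to [0, 1] by default.
            Normalize normalize = new Normalize();
            normalize.setInputFormat(data);
            Instances normalized = Filter.useFilter(data, normalize);

            // Save the normalized dataset.
            ArffSaver saver = new ArffSaver();
            saver.setInstances(normalized);
            saver.setFile(new File("heart-normal.arff"));
            saver.writeBatch();
        }
    }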
6. Data preparation - reduction
Often, data mining datasets are too large to process directly, so data reduction techniques are used to preprocess the data. Once the data mining project has been successful on the reduced data, the larger dataset can be processed too.
a. Stay in the Preprocess tab for now. Besides attribute selection, another reduction method is to select rows from a dataset. This is called sampling. How can sampling be performed with Weka filters? Can they perform the two main methods, Simple Random Sample Without Replacement and Simple Random Sample With Replacement? (A sketch of both, using the Java API, follows this question.)
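
A minimal sketch of both sampling modes with the unsupervised Resample filter, whose NoReplacement flag switches between them; the 50% sample size and the input file name are arbitrary placeholders.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.instance.Resample;

    public class SampleHeartData {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("heart-cleaned.arff"); // placeholder path

            // Simple random sample without replacement: keep 50% of the rows.
            Resample withoutReplacement = new Resample();
            withoutReplacement.setSampleSizePercent(50.0);
            withoutReplacement.setNoReplacement(true);
            withoutReplacement.setInputFormat(data);
            Instances sampleA = Filter.useFilter(data, withoutReplacement);

            // Simple random sample with replacement (the default behaviour).
            Resample withReplacement = new Resample();
            withReplacement.setSampleSizePercent(50.0);
            withReplacement.setNoReplacement(false);
            withReplacement.setInputFormat(data);
            Instances sampleB = Filter.useFilter(data, withReplacement);

            System.out.println("without replacement: " + sampleA.numInstances() + " instances");
            System.out.println("with replacement:    " + sampleB.numInstances() + " instances");
        }
    }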