Prediction: Linear regression
Linear regression can be very useful in the association analysis of numerical values; indeed, regression analysis is a powerful approach to modeling the relationship between a dependent variable and one or more independent variables. Simple regression predicts from one independent variable; multiple regression predicts from more than one. The model we attempt to fit is a linear one, which means, very simply, drawing a line through the data. Of all the lines that could possibly be drawn through the data, we are looking for the one that best fits it. In fact, we look for the line that best satisfies
y = \beta_0 + \beta_1 x + \varepsilon
So the most accurate model is the one that yields the best-fitting line to the data in question: we look for the line that minimizes the sum of squared deviations between the actual and the fitted values. This is called the method of least squares (the closed-form estimates for the simple case are given below). Now that we have briefly reminded ourselves of the very basics of regression, let's move directly on to an example in Weka.
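For reference, the least-squares estimates in the simple-regression case have a well-known closed form (a standard textbook result; the Weka exercises below do this fitting for us):

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

These are the values of \beta_0 and \beta_1 that minimize \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2.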
Exercise 1
(a) In Weka, go back to the “Preprocess” tab. Open the iris data-set (“iris.arff”; this should be in the ./data/ directory of the Weka install).
(b) In the “Attributes” section (bottom left of the screen) select the “class” feature and click “Remove”. We need to do this, as linear regression cannot deal with non-numeric values.
(c) Next select the “Classify” tab to get into the classification perspective of Weka, and choose “LinearRegression” (under “functions”).
(d) Clicking on the textbox next to the “Choose” button brings up the parameter editor window. Click on the “More” button to get information about the parameters. Make sure that “attributeSelectionMethod” is set to “No attribute selection” and “eliminateColinearAttributes” is set to “False”.
(e) Finally, make sure that you select the attribute “petalwidth” in the dropdown box just under “Test Options”. Hit Start to run the regression. (A programmatic equivalent of steps (a)–(e) is sketched below.)
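If you would rather drive Weka programmatically, the following Java sketch reproduces steps (a)–(e) with Weka's Java API. The file path is an assumption; adjust it to point at your install's data directory.

import java.util.Random;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
import weka.classifiers.functions.LinearRegression;

public class IrisRegression {
    public static void main(String[] args) throws Exception {
        // (a) Load the iris data-set (path assumed; adjust to your install).
        Instances data = DataSource.read("./data/iris.arff");

        // (b) Remove the nominal "class" attribute (the last one):
        // linear regression cannot deal with non-numeric values.
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        data = Filter.useFilter(data, remove);

        // (e) Regress "petalwidth" on the remaining attributes.
        data.setClass(data.attribute("petalwidth"));

        // (c, d) LinearRegression with no attribute selection and
        // collinear-attribute elimination switched off.
        LinearRegression lr = new LinearRegression();
        lr.setAttributeSelectionMethod(
            new SelectedTag(LinearRegression.SELECTION_NONE,
                            LinearRegression.TAGS_SELECTION));
        lr.setEliminateColinearAttributes(false);

        lr.buildClassifier(data);
        System.out.println(lr);  // prints the fitted model formula
    }
}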
Inspect the results; in particular, pay attention to the linear regression model formula returned, and to the coefficients and intercept of the straight-line equation. As this is a numeric prediction/regression problem, accuracy is measured with Root Mean Squared Error, Mean Absolute Error and the like (defined below). As most of you will have noticed, you can repeat this process, regressing each of the other features in turn, and compare how well the different features can be predicted.
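For reference, given actual values y_i, predictions \hat{y}_i and n test instances, these measures are defined as

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}

The MAE weights all errors equally, while the RMSE penalizes large deviations more heavily.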
Exercise 2
• Launch the WEKA tool, and then activate the “Explorer” environment.
• Open the “cpu” dataset (contained in the “cpu.arff” file).
- For each attribute and for each of its possible values, how many instances in each class have that feature value (i.e., what is the class distribution of the feature values)?
• Go to the “Classify” tab. Select the SimpleLinearRegression learner. Choose the “Percentage split” (66% for training) test mode. Run the classifier and observe the results shown in the “Classifier output” window; a programmatic equivalent is sketched after the questions below.
- Write down the learned regression function.
- What is the MAE (mean absolute error) made by the learned regression function?
- Visualize the errors made by the learned regression function. In the plot, how can you see the detailed information of a predicted instance?
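As a sketch of what the Explorer does in this test mode, the Java code below builds SimpleLinearRegression on a 66% split and reports the MAE. The file path is an assumption, and the shuffle with seed 1 mirrors what I understand to be the Explorer's default behaviour for percentage splits; adjust both as needed.

import java.util.Random;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SimpleLinearRegression;

public class CpuPercentageSplit {
    public static void main(String[] args) throws Exception {
        // Load the cpu data (path assumed; adjust to your install).
        Instances data = DataSource.read("./data/cpu.arff");
        data.setClassIndex(data.numAttributes() - 1);  // class = last attribute

        // 66% / 34% train/test split, shuffled with seed 1 (assumed
        // to match the Explorer default).
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize,
                                        data.numInstances() - trainSize);

        SimpleLinearRegression slr = new SimpleLinearRegression();
        slr.buildClassifier(train);
        System.out.println(slr);  // the learned regression function

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(slr, test);
        System.out.println("MAE:  " + eval.meanAbsoluteError());
        System.out.println("RMSE: " + eval.rootMeanSquaredError());
    }
}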
• Now, in the “Test options” panel select the “Cross-validation” option (10 folds). Run the classifier and observe the results shown in the “Classifier output” window; a corresponding sketch follows the questions below.
- Write down the learned regression function.
- What is the MAE (mean absolute error) made by the learned regression function?
- Visualize the errors made by the learned regression function. In the plot, how can you see the detailed information of a predicted instance?
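Again as a sketch under the same path assumption, Weka's Evaluation class performs the 10-fold cross-validation directly:

import java.util.Random;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SimpleLinearRegression;

public class CpuCrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("./data/cpu.arff");  // assumed path
        data.setClassIndex(data.numAttributes() - 1);

        SimpleLinearRegression slr = new SimpleLinearRegression();

        // 10-fold cross-validation; seed 1 assumed to match the
        // Explorer default.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(slr, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println("MAE: " + eval.meanAbsoluteError());

        // The function reported in the Explorer output is the model
        // rebuilt on the full data-set after cross-validation:
        slr.buildClassifier(data);
        System.out.println(slr);
    }
}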