Prediction: Linear regression
Linear regression can be very useful in the association analysis of numerical values; indeed, regression analysis is a powerful approach to modeling the relationship between a dependent variable and one or more independent variables. Simple regression predicts from one independent variable; multiple regression predicts from more than one. The model we attempt to fit is a linear one, which means, very simply, drawing a line through the data. Of all the lines that could possibly be drawn through the data, we are looking for the one that best fits it. In fact, we look for the line that best satisfies
y = \beta_0 + \beta_1 x + \varepsilon
So the most accurate model is the one that yields the best-fitting line to the data in question: we look for the line that minimizes the sum of squared deviations between the actual and the fitted values. This is called the method of least squares (the closed-form estimates for the simple case are given below). Now that we have briefly reminded ourselves of the very basics of regression, let's move directly on to an example in Weka.
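For reference, the least-squares estimates in the simple-regression case have a well-known closed form (a standard textbook result; the Weka exercises below do this fitting for us):

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

These are the values of \beta_0 and \beta_1 that minimize \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2.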
Exercise 1
(a) In Weka, go back to the “Preprocess” tab. Open the iris data-set (“iris.arff”; this should be in the ./data/ directory of the Weka install).
(b) In the “Attributes” section (bottom left of the screen) select the “class” feature and click “Remove”. We need to do this, as linear regression cannot deal with non-numeric values.
(c) Next select the “Classify” tab to get into the classification perspective of Weka, and choose “LinearRegression” (under “functions”).
(d) Clicking on the textbox next to the “Choose” button brings up the parameter editor window. Click on the “More” button to get information about the parameters. Make sure that “attributeSelectionMethod” is set to “No attribute selection” and “eliminateColinearAttributes” is set to “False”.
(e) Finally, make sure that you select the attribute “petalwidth” in the dropdown box just under “Test Options”. Hit Start to run the regression. (A programmatic equivalent of steps (a)–(e) is sketched below.)
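If you would rather drive Weka programmatically, the following Java sketch reproduces steps (a)–(e) with Weka's Java API. The file path is an assumption; adjust it to point at your install's data directory.

import java.util.Random;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
import weka.classifiers.functions.LinearRegression;

public class IrisRegression {
    public static void main(String[] args) throws Exception {
        // (a) Load the iris data-set (path assumed; adjust to your install).
        Instances data = DataSource.read("./data/iris.arff");

        // (b) Remove the nominal "class" attribute (the last one):
        // linear regression cannot deal with non-numeric values.
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        data = Filter.useFilter(data, remove);

        // (e) Regress "petalwidth" on the remaining attributes.
        data.setClass(data.attribute("petalwidth"));

        // (c, d) LinearRegression with no attribute selection and
        // collinear-attribute elimination switched off.
        LinearRegression lr = new LinearRegression();
        lr.setAttributeSelectionMethod(
            new SelectedTag(LinearRegression.SELECTION_NONE,
                            LinearRegression.TAGS_SELECTION));
        lr.setEliminateColinearAttributes(false);

        lr.buildClassifier(data);
        System.out.println(lr);  // prints the fitted model formula
    }
}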
Inspect the results; in particular, pay attention to the linear regression model formula returned, and to the coefficients and intercept of the straight-line equation. As this is a numeric prediction/regression problem, accuracy is measured with Root Mean Squared Error, Mean Absolute Error and the like (defined below). As most of you will have noticed, you can repeat this process, regressing each of the other features in turn, and compare how well the different features can be predicted.
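For reference, given actual values y_i, predictions \hat{y}_i and n test instances, these measures are defined as

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}

The MAE weights all errors equally, while the RMSE penalizes large deviations more heavily.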
Exercise 2
• Launch the WEKA tool, and then activate the “Explorer” environment.
• Open the “cpu” dataset (contained in the “cpu.arff” file).
- For each attribute and for each of its possible values, how many instances in each class have that feature value (i.e., what is the class distribution of the feature values)?
• Go to the “Classify” tab. Select the SimpleLinearRegression learner. Choose the “Percentage split” (66% for training) test mode. Run the classifier and observe the results shown in the “Classifier output” window; a programmatic equivalent is sketched after the questions below.
- Write down the learned regression function.
- What is the MAE (mean absolute error) made by the learned regression function?
- Visualize the errors made by the learned regression function. In the plot, how can you see the detailed information of a predicted instance?
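As a sketch of what the Explorer does in this test mode, the Java code below builds SimpleLinearRegression on a 66% split and reports the MAE. The file path is an assumption, and the shuffle with seed 1 mirrors what I understand to be the Explorer's default behaviour for percentage splits; adjust both as needed.

import java.util.Random;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SimpleLinearRegression;

public class CpuPercentageSplit {
    public static void main(String[] args) throws Exception {
        // Load the cpu data (path assumed; adjust to your install).
        Instances data = DataSource.read("./data/cpu.arff");
        data.setClassIndex(data.numAttributes() - 1);  // class = last attribute

        // 66% / 34% train/test split, shuffled with seed 1 (assumed
        // to match the Explorer default).
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize,
                                        data.numInstances() - trainSize);

        SimpleLinearRegression slr = new SimpleLinearRegression();
        slr.buildClassifier(train);
        System.out.println(slr);  // the learned regression function

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(slr, test);
        System.out.println("MAE:  " + eval.meanAbsoluteError());
        System.out.println("RMSE: " + eval.rootMeanSquaredError());
    }
}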
• Now, in the “Test options” panel select the “Cross-validation” option (10 folds). Run the classifier and observe the results shown in the “Classifier output” window; a corresponding sketch follows the questions below.
- Write down the learned regression function.
- What is the MAE (mean absolute error) made by the learned regression function?
- Visualize the errors made by the learned regression function. In the plot, how can you see the detailed information of a predicted instance?
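Again as a sketch under the same path assumption, Weka's Evaluation class performs the 10-fold cross-validation directly:

import java.util.Random;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SimpleLinearRegression;

public class CpuCrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("./data/cpu.arff");  // assumed path
        data.setClassIndex(data.numAttributes() - 1);

        SimpleLinearRegression slr = new SimpleLinearRegression();

        // 10-fold cross-validation; seed 1 assumed to match the
        // Explorer default.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(slr, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println("MAE: " + eval.meanAbsoluteError());

        // The function reported in the Explorer output is the model
        // rebuilt on the full data-set after cross-validation:
        slr.buildClassifier(data);
        System.out.println(slr);
    }
}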