How to access a database using WEKA
Go to the Control Panel
Choose Administrative Tools
Choose Data Sources (ODBC)
At the User DSN tab, choose Add...
Choose database
Microsoft Access
Note: Make sure your database is not open in another
application before following the steps below.
Choose the Microsoft Access driver and click Finish
Give the source a name by typing it into the Data Source
Name field
In the Database section, choose Select...
Browse to find your database file, select it and click OK
Click OK to finalize your DSN
You will need to configure a file called
DatabaseUtils.props. This file already exists under the path weka/experiment/
in the weka.jar file (which is just a ZIP file) that is part of the Weka
download. In this directory you will also find a sample file for ODBC
connectivity, called DatabaseUtils.props.odbc, and one specifically for MS
Access, called DatabaseUtils.props.msaccess (>3.4.14, >3.5.8, >3.6.0),
also using ODBC. You should use one of the sample files as the basis for your
setup, since they already contain default values specific to ODBC access.
This file needs to be recognized when the Explorer starts.
You can achieve this by making sure it is in the working directory or the home
directory (if you are unsure what the terms working directory and home
directory mean, see the Notes section). The easiest is probably the
second alternative, as the setup will apply to all the Weka instances on your
machine.
Make sure that the file contains at least the following
lines:
jdbcDriver=sun.jdbc.odbc.JdbcOdbcDriver
jdbcURL=jdbc:odbc:dbname
where dbname is the name you gave the user DSN. (This can
also be changed once the Explorer is running.)
Start up the Weka Explorer.
Choose Open DB...
The URL should read "jdbc:odbc:dbname" where
dbname is the name you gave the user DSN.
Click Connect
Enter a Query, e.g., "select * from tablename"
where tablename is the name of the database table you want to read. Or you
could put a more complicated SQL query here instead.
Click Execute
When you're satisfied with the returned data, click OK to
load the data into the Preprocess panel.
Figure: Knowledge flow directed graph for C4.5 and K-Means.
Exercises on KnowledgeFlow component of WEKA
I.
Use the KnowledgeFlow canvas to develop a directed graph
for C4.5 execution
Goal: Set up a flow to load an ARFF file (batch mode) and perform a
cross-validation using J48 (Weka's C4.5 implementation).
Steps to be done:
- First start the KnowledgeFlow. The Weka GUI Chooser window is used to launch Weka's graphical environments; select the button labeled "KnowledgeFlow" there. Alternatively, you can launch the KnowledgeFlow from a terminal window by typing "java weka.gui.beans.KnowledgeFlow".
- Next click on the DataSources tab and choose "ArffLoader" from the toolbar (the mouse pointer will change to a "cross hairs").
- Next place the ArffLoader component on the layout area by clicking somewhere on the layout (A copy of the ArffLoader icon will appear on the layout area).
- Next specify an arff file to load by first right clicking the mouse over the ArffLoader icon on the layout. A pop-up menu will appear. Select "Configure" under "Edit" in the list from this menu and browse to the location of your arff file.
- Next click the "Evaluation" tab at the top of the window and choose the "ClassAssigner" component (which lets you choose which column is the class) from the toolbar. Place this on the layout.
- Now connect the ArffLoader to the ClassAssigner: first right click over the ArffLoader and select the "dataSet" under "Connections" in the menu. A "rubber band" line will appear. Move the mouse over the ClassAssigner component and left click - a red line labeled "dataSet" will connect the two components.
- Next right click over the ClassAssigner and choose "Configure" from the menu. This will pop up a window from which you can specify which column is the class in your data (last is the default).
- Next grab a "CrossValidationFoldMaker" component from the Evaluation toolbar and place it on the layout. Connect the ClassAssigner to the CrossValidationFoldMaker by right clicking over "ClassAssigner" and selecting "dataSet" from under "Connections" in the menu.
- Next click on the "Classifiers" tab at the top of the window and scroll along the toolbar until you reach the "J48" component in the "trees" section. Place a J48 component on the layout.
- Connect the CrossValidationFoldMaker to J48 TWICE by first choosing "trainingSet" and then "testSet" from the pop-up menu for the CrossValidationFoldMaker.
- Next go back to the "Evaluation" tab and place a "ClassifierPerformanceEvaluator" component on the layout. Connect J48 to this component by selecting the "batchClassifier" entry from the pop-up menu for J48.
- Next go to the "Visualization" toolbar and place a "TextViewer" component on the layout. Connect the ClassifierPerformanceEvaluator to the TextViewer by selecting the "text" entry from the pop-up menu for ClassifierPerformanceEvaluator.
- Now start the flow executing by selecting "Start loading" from the pop-up menu for ArffLoader. Depending on how big the data set is and how long cross validation takes you will see some animation from some of the icons in the layout (J48's tree will "grow" in the icon and the ticks will animate on the ClassifierPerformanceEvaluator). You will also see some progress information in the "Status" bar and "Log" at the bottom of the window.
- When finished you can view the results by choosing "Show results" from the pop-up menu for the TextViewer component.
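To make the role of the CrossValidationFoldMaker in the flow above more concrete, the sketch below shows in plain Java how k-fold cross-validation partitions a dataset into training and test sets. It is an illustration of the general idea only, not Weka's implementation (which may, for instance, stratify the folds by class). The class name FoldSketch, the method testFolds, and the instance count and seed used in main are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration (not Weka code) of k-fold partitioning.
class FoldSketch {
    // Shuffles the instance indices 0..n-1 and deals them into k folds.
    // Fold f holds the test indices for run f; all other indices form
    // the training set for that run.
    static List<List<Integer>> testFolds(int n, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }

    public static void main(String[] args) {
        int n = 150;  // assumed dataset size for the example
        int k = 10;   // number of folds
        List<List<Integer>> folds = testFolds(n, k, 1L);
        for (int f = 0; f < folds.size(); f++) {
            int testSize = folds.get(f).size();
            System.out.println("fold " + f + ": train=" + (n - testSize)
                               + " test=" + testSize);
        }
    }
}
```

With 150 instances and 10 folds, every run trains on 135 instances and tests on the remaining 15, and each instance is used for testing exactly once. This corresponds to the "trainingSet" and "testSet" connections made from the CrossValidationFoldMaker to J48.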
II.
Use the KnowledgeFlow canvas to develop a directed graph
for k-means execution
Exercises on Experimenter component of WEKA
- Use the Experimenter to compare any two classifiers of your choice on the iris dataset.
Questions from WEKA book
1) Weather.nominal.arff
What are the values that the attribute temperature can have?
Load a new dataset. Click the Open file button and select
the file iris.arff. How many instances
does this dataset have? How many attributes? What is the range of possible
values of the attribute petallength?
2) Weather.nominal.arff
What is the function of the first column in the Viewer window? What is the class value of instance number 8 in the weather data?
Load the iris data and open it in the editor. How many numeric and
how many nominal attributes does this dataset have?
3) Load the weather.nominal dataset. Use the filter weka.filters.unsupervised.instance.RemoveWithValues
to remove all instances in which the humidity attribute has the value
high. To do this, first make the field next to the Choose button show the text
RemoveWithValues. Then click on it to get the Generic Object Editor window, and
figure out how to change the filter settings appropriately. Undo the change to
the dataset that you just performed, and verify that the data has reverted to
its original state.
4) Load the iris data using the Preprocess panel. Evaluate C4.5 on
this data using (a) the training set and (b) cross-validation. What is the
estimated percentage of correct classifications for (a) and (b)? Which estimate
is more realistic? Use the Visualize classifier errors function
to find the wrongly classified test instances for the cross-validation
performed in part (b). What can you say about the location of the
errors?
5) Glass.arff
How many attributes are there in the dataset? What are their
names? What is the class attribute? Run the classification algorithm IBk (weka.classifiers.lazy.IBk). Use cross-validation to test its performance,
leaving the number of folds at the default value of 10. Recall
that you can examine the classifier options in the Generic Object Editor window
that pops up when you click the text beside the Choose button. The default
value of the KNN field is 1: This sets the number of neighboring
instances to use when classifying.
6) Glass.arff
What is the accuracy of IBk (given in the Classifier Output box)? Run IBk again, but increase the number of neighboring
instances to k = 5 by entering this value in the KNN field. Here and throughout
this section, continue to use cross-validation as the evaluation method.
What is the accuracy of IBk with five neighboring instances (k = 5)?
7) Ionosphere.arff
For J48, compare
cross-validated accuracy and the size of the trees generated for (1) the raw
data, (2) data discretized by the unsupervised discretization method in default
mode, and (3) data discretized by the same
method with binary attributes.
8) Apply the ranking technique to the labor negotiations data in labor.arff to determine the four most important attributes based on
information gain. On the same data, run CfsSubsetEval for correlation-based
selection, using the BestFirst search. Then run the wrapper method with J48 as
the base learner, again using the BestFirst search. Examine the
attribute subsets that are output. Which attributes are selected by both methods?
How do they relate to the output generated by ranking using information gain?
9) Run Apriori on the weather data with each of the four rule-ranking metrics,
and default settings otherwise. What is the top-ranked rule that is output for
each metric?