Binary Tree Prediction Model Search
Jennifer Pittman, Quanli Wang & Mike West
Windows version of Binary Tree Search analysis as described in
Some features:
- It has been fully optimized for speed;
- A script file interface has been added so it can be run from a command window;
- A Java programming and graphical user interface has been added so it can be run smoothly from within Matlab.
Code and Examples:
Download the bintree files to an appropriate folder.
This package runs on the Windows platform and includes:
- BinTree.exe: the main program that implements the Binary Tree Search algorithm;
- bintree.jar: the Java GUI program;
- AbsoluteLayout.jar: a helper Java file;
- example.m: a Matlab script that shows how to use the GUI within Matlab;
- collencindex.m: a Matlab helper function for graphical output;
- getnames.m: a Matlab helper function for graphical output;
- getpredictors.m: a Matlab helper function for Binary Tree Model statistics;
- plotcvpred.m: a Matlab helper function for graphical output;
- plotcvpredq.m: a Matlab helper function for graphical output;
- plottreeliks.m: a Matlab helper function for graphical output;
- plottreesplits.m: a Matlab helper function for graphical output;
- runstat.m: a Matlab helper function for Binary Tree Model statistics;
- data subdirectory: a complete sample to run the program.
Installation
- create a home directory for BinTree, say c:\bintree (Note: currently, this program does NOT work with a directory whose path contains space characters);
- copy all the files mentioned above to the home directory;
- add the full path of the home directory to the Windows system path;
- start Matlab, then type edit classpath.txt;
- append the line c:\bintree\bintree.jar to the end of classpath.txt (this assumes the home directory for BinTree is c:\bintree);
- append the line c:\bintree\AbsoluteLayout.jar to the end of classpath.txt (this assumes the home directory for BinTree is c:\bintree);
- save the changes made to classpath.txt;
- add the full path of the home directory to the Matlab path;
- close Matlab and restart Windows;
- to generate the ps files for tree outputs, you will also need to install the Graphviz package (http://www.research.att.com/sw/tools/graphviz/).
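The path-related steps above can be checked before classpath.txt is edited. The following is a minimal Python sketch (Python is not part of the package, and the function name classpath_lines is hypothetical). It rejects a home directory that contains spaces and builds the two lines to append, assuming the jar files sit directly in the home directory.

```python
import ntpath  # Windows path rules, usable on any platform

# Jar names taken from the package listing.
JARS = ("bintree.jar", "AbsoluteLayout.jar")


def classpath_lines(home_dir):
    """Return the lines to append to classpath.txt for a given home directory.

    Hypothetical helper for illustration: it only builds strings and does not
    modify classpath.txt or the Windows system path.
    """
    if " " in home_dir:
        # The program does not work with directories that contain spaces.
        raise ValueError("home directory must not contain spaces: %r" % home_dir)
    return [ntpath.join(home_dir, jar) for jar in JARS]


if __name__ == "__main__":
    for line in classpath_lines(r"c:\bintree"):
        print(line)
```

Run as a script, this prints the two lines for c:\bintree; a path such as c:\Program Files\bintree raises an error instead.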
Starting the program (basic)
- start Matlab;
- type myBinTree = bintree.Model; myBinTree.start to start the program.
Preparing data and setting up parameters
- Total Number of Samples: total number of observations in the dataset for analysis, including both training set and validation set;
- Number of Testing Samples: number of samples in the hold-out or validation set;
- Data File: a flat text file that includes data for all the variables, including response and predictors (samples in columns, variables in rows, tab delimited, training samples first, followed by validation samples if any);
- Variable File: a flat text file that describes the data file; this file can be generated with the variable editor by clicking on the Edit Variables button; see section Using variable editor for details;
- Response: the name of the response variable; must be one of the variables defined within the variable file;
- Cutoff Value and Cutoff Logic: used to create a binary response if the response variable has values other than 0’s and 1’s;
- Predictors: a list of variable names that will be used when running the algorithm; abbreviations like mgene(1,3,5) or mgene(1:5) are allowed if the duplicate property for that variable (mgene in this example) is set to more than 1; mgene(1,3,5) means mgene1, mgene3 and mgene5 will be used as predictors; mgene(1:5) means mgene1 through mgene5 will be used as predictors;
- Number of Runs: number of times the entire analysis should be performed;
- Cross Validation: whether or not cross validation should be performed;
- Leave-Out Size: the size of the leave-out group if cross validation is performed;
- Bayes Factor Criteria: minimum value of the Bayes factor necessary in order to proceed with a split;
- Maximum Number of Levels: maximum number of levels to which a tree is allowed to grow (with the root node as level 0);
- Maximum Splits For Level 0: maximum number of splits for each node at level 0 (root node); e.g., if up to 10 splits are allowed at the root node, then the algorithm will search through all the candidate predictor/threshold combinations and create trees based on the most significant splits until 10 trees have been created or all combinations have been searched;
- Maximum Splits at Level 1: maximum number of splits for each node at level 1;
- Minimum Node Size: minimum number of observations for a node to be eligible for splitting;
- Select Trees For Prediction: select the criterion used to choose trees for inclusion in predictions (i.e. in model averaging and prediction); either a minimum required (normalized) posterior probability or the number of trees with highest posterior probability can be used; e.g. if the minimum posterior probability is set to 0.05, then only those trees with posterior probabilities greater than 0.05 will be included for model averaging and prediction;
- Output Directory: the directory in which all run results will be stored;
- Maximum Number of Tree Figures: number of highest probability trees for which graphing code should be provided;
- Confidence Limits: the lower and upper confidence limits for predictions.
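As an illustration of the Data File layout described above (variables in rows, samples in columns, tab delimited, training samples before validation samples), here is a minimal Python sketch. The function name format_data_file and the toy values are assumptions made for the example. Whether the program accepts row labels or a header line is not stated here, so the sketch emits values only.

```python
def format_data_file(training, validation=()):
    """Return the text of a data file: one row per variable, one column per sample.

    `training` and `validation` are lists of samples; each sample is a list of
    values with one entry per variable, in the order used by the variable file.
    Hypothetical helper for illustration only.
    """
    samples = list(training) + list(validation)  # training first, then validation
    if not samples:
        raise ValueError("at least one sample is required")
    n_vars = len(samples[0])
    if any(len(s) != n_vars for s in samples):
        raise ValueError("every sample must have the same number of variables")
    # Transpose: row i holds variable i across all samples.
    rows = zip(*samples)
    return "\n".join("\t".join(str(v) for v in row) for row in rows) + "\n"


# Toy data: three variables per sample (a 0/1 response and two predictors).
training = [[1, 0.5, 2.1], [0, 1.5, 0.3]]
validation = [[1, 0.9, 1.1]]
text = format_data_file(training, validation)
```

With this toy data, Total Number of Samples would be 3 and Number of Testing Samples would be 1. The returned text can then be written to a file and selected as the Data File.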
Using variable editor
The Variable Editor is used to define the variable file, which describes the order, name, type and
thresholds for all the variables within the data file. For each row (variable), the
user must specify at least a name, the type of the variable (binary or
continuous) and the corresponding threshold(s). Multiple thresholds are
separated by white space. If many variables share similar names (for
example mgene1, mgene2… mgene498), the same type and the same threshold(s), they can be
defined in one line by specifying a duplicate property. The user must make
sure that the order of variables matches that of the data file. A tool bar
is provided to add a new variable, delete an existing variable and change the
order of variables if necessary. A Load button is provided to load a
previously saved variable file. The user will also have a chance to save the
inputs after clicking on the OK button, which also quits the editor.
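The duplicate property pairs with the abbreviations allowed in the Predictors field, mgene(1,3,5) and mgene(1:5). The following minimal Python sketch shows how such abbreviations map to variable names, following those two documented examples. It is not the program's own parser: the function name expand_predictor is hypothetical, and accepting a mix of single indices and ranges in one entry is an assumption.

```python
import re

# Matches an abbreviation such as "mgene(1,3,5)" or "mgene(1:5)".
_ABBREV = re.compile(r"^\s*([A-Za-z_]\w*)\(([^()]*)\)\s*$")


def expand_predictor(spec):
    """Expand one predictor entry into a list of variable names."""
    match = _ABBREV.match(spec)
    if match is None:
        return [spec.strip()]  # a plain variable name, used as-is
    base, body = match.groups()
    names = []
    for part in body.split(","):
        if ":" in part:
            first, last = (int(x) for x in part.split(":"))
            indices = range(first, last + 1)  # "1:5" includes both ends
        else:
            indices = [int(part)]
        names.extend("%s%d" % (base, i) for i in indices)
    return names
```

For example, expand_predictor("mgene(1:5)") yields the names mgene1 through mgene5, while an entry without parentheses is returned unchanged.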
Running/Stopping the program
- set the input/output/model parameters;
- hit the Run button to run the algorithm;
- hit the Stop button to stop the algorithm if you want to abandon the current run.
Saving/Loading customized user inputs
User-customized inputs can be saved and reloaded by hitting the Save/Load buttons.
Viewing outputs
Outputs can be viewed by hitting the Output button and then providing the name of the file of interest.
Quitting the program
Quit the program by hitting the OK or Cancel button.
This software is made freely available to any interested user. The authors can provide
no support nor assistance with implementations beyond the details and examples here, nor
extensions of the code for other purposes. The download has been tested to confirm all
details are operational as described here.
It is understood by the user that neither the authors nor Duke University bear any responsibility
nor assume any liability for any end-use of this software. It is expected that appropriate
credit/acknowledgement be given should the software be included as an element in other software development
or in publications.