Binary Tree Prediction Model Search
Jennifer Pittman, Quanli Wang & Mike West

Windows version of Binary Tree Search analysis as described in

Some features:

  1. It has been fully optimized for speed;
  2. A script file interface has been added so it can be run from a command window;
  3. A Java programming and graphical user interface has been added so it can be run smoothly from within Matlab.

Code and Examples:

Download the bintree files to an appropriate folder.

This package runs on the Windows platform and includes:

  1. BinTree.exe: the main program that implements the Binary Tree Search algorithm;
  2. bintree.jar: the Java GUI program;
  3. AbsoluteLayout.jar: a helper Java file;
  4. example.m: a Matlab script that shows how to use the GUI within Matlab;
  5. collencindex.m: a Matlab helper function for graphical output;
  6. getnames.m: a Matlab helper function for graphical output;
  7. getpredictors.m: a Matlab helper function for Binary Tree Model statistics;
  8. plotcvpred.m: a Matlab helper function for graphical output;
  9. plotcvpredq.m: a Matlab helper function for graphical output;
  10. plottreeliks.m: a Matlab helper function for graphical output;
  11. plottreesplits.m: a Matlab helper function for graphical output;
  12. runstat.m: a Matlab helper function for Binary Tree Model statistics;
  13. data subdirectory: a complete sample to run the program.

Installation

  1. create a home directory for BinTree, say c:\bintree (Note: currently, this program does NOT work with a directory whose path contains space characters);
  2. copy all the files mentioned above to the home directory;
  3. add the full path of the home directory to the Windows system path;
  4. start Matlab, then type edit classpath.txt;
  5. append the line c:\bintree\bintree.jar to the end of classpath.txt (this assumes the home directory for BinTree is c:\bintree);
  6. append the line c:\bintree\AbsoluteLayout.jar to the end of classpath.txt (this assumes the home directory for BinTree is c:\bintree);
  7. save the change made to classpath.txt;
  8. add the full path of the home directory to the Matlab path;
  9. close Matlab and restart Windows;
  10. to generate the ps files for tree outputs, also install the Graphviz package ( http://www.research.att.com/sw/tools/graphviz/ ).
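
Steps 1, 5 and 6 can be checked before editing classpath.txt by hand. The following is a minimal Python sketch, not part of the BinTree package: the helper name classpath_lines is hypothetical, and it only builds the two lines to append, rejecting a home directory whose path contains spaces.

```python
from pathlib import PureWindowsPath

# Jar names taken from the package list above.
JARS = ("bintree.jar", "AbsoluteLayout.jar")


def classpath_lines(home_dir):
    """Return the lines to append to classpath.txt for a BinTree home directory.

    Hypothetical helper for illustration; it does not modify any file.
    """
    # The installation notes say BinTree does not work with spaces in the path.
    if any(ch.isspace() for ch in home_dir):
        raise ValueError("home directory must not contain spaces: %r" % home_dir)
    home = PureWindowsPath(home_dir)
    return [str(home / jar) for jar in JARS]


if __name__ == "__main__":
    for line in classpath_lines(r"c:\bintree"):
        print(line)
```

PureWindowsPath is used so that the sketch produces Windows-style paths regardless of the machine it runs on.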

Starting the program (basic)

  1. start Matlab;
  2. type myBinTree = bintree.Model; myBinTree.start to start the program.

Preparing data and setting up parameters

  1. Total Number of Samples: total number of observations in dataset for analysis, including both training set and validation set;
  2. Number of Testing Samples: number of samples in hold-out or validation set;
  3. Data File: a flat text file that includes data for all the variables, including response and predictors (samples in columns, variables in rows, tab delimited, training samples first, followed by validation samples if any);
  4. Variable File: a flat text file that describes the data file; this file can be generated by using the variable editor by clicking on the Edit Variables button; see section Using variable editor for details;
  5. Response: the name of the response variable; must be one of the variables defined within the variable file;
  6. Cutoff Value and Cutoff Logic: used to create a binary response if the response variable has values other than 0s and 1s;
  7. Predictors: a list of variable names that will be used when running the algorithm; abbreviations such as mgene(1,3,5) or mgene(1:5) are allowed if the duplicate property for that variable (mgene in this example) is set to more than 1; mgene(1,3,5) means mgene1, mgene3 and mgene5 will be used as predictors; mgene(1:5) means mgene1 through mgene5 will be used as predictors;
  8. Number of Runs: number of times entire analysis should be performed;
  9. Cross Validation: whether or not cross validation should be performed;
  10. Leave-Out Size: the size of the leave-out group if cross validation is performed;
  11. Bayes Factor Criteria: minimum value of the Bayes Factor necessary in order to proceed with a split;
  12. Maximum Number of Levels: maximum number of levels to which a tree is allowed to grow (with the root node as level 0);
  13. Maximum Splits For Level 0: maximum number of splits for each node at level 0 (root node), e.g., if up to 10 splits are allowed at the root node then the algorithm will search through all the candidate predictor/threshold combinations and create trees based on the most significant splits until 10 trees have been created or all combinations have been searched;
  14. Maximum Splits at Level 1: maximum number of splits for each node at level 1;
  15. Minimum Node Size: minimum number of observations for a node to be eligible for splitting;
  16. Select Trees For Prediction: select the criterion to be used in selecting trees for inclusion in predictions (i.e. for inclusion in model averaging and prediction); either the minimum required posterior probability (normalized) or the number of trees with the highest posterior probability can be used; e.g. if the minimum posterior probability is set to 0.05, then only those trees with posterior probabilities greater than 0.05 will be included for model averaging and prediction;
  17. Output Directory: the directory in which all run results will be stored;
  18. Maximum Number of Tree Figures: number of highest probability trees for which graphing code should be provided;
  19. Confidence Limits: the lower and upper confidence limits for predictions.
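
The predictor abbreviations described in item 7 can be illustrated with a short Python sketch. This is not the parser used by BinTree; the function name expand_predictor is hypothetical, and accepting a mix of single indices and ranges inside one pair of parentheses is an assumption beyond the two forms documented above.

```python
import re

# Matches "name(...)" where the parentheses hold indices and/or ranges.
_ABBREV = re.compile(r"^(\w+)\(([^()]*)\)$")


def expand_predictor(spec):
    """Expand 'mgene(1,3,5)' or 'mgene(1:5)' into explicit variable names.

    Hypothetical helper for illustration only.
    """
    spec = spec.strip()
    match = _ABBREV.match(spec)
    if match is None:
        return [spec]  # plain variable name, nothing to expand
    base, body = match.groups()
    names = []
    for part in body.split(","):
        part = part.strip()
        if ":" in part:
            first, last = (int(x) for x in part.split(":"))
            indices = range(first, last + 1)  # a:b is inclusive of b
        else:
            indices = [int(part)]
        names.extend(f"{base}{i}" for i in indices)
    return names


if __name__ == "__main__":
    print(expand_predictor("mgene(1,3,5)"))
    print(expand_predictor("mgene(1:5)"))
```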
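
The two selection criteria in item 16 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the BinTree implementation: the function name select_trees and its arguments are hypothetical, and the input is assumed to be a list of (possibly unnormalized) posterior probabilities, one per tree.

```python
def select_trees(posteriors, min_prob=None, top_n=None):
    """Return the indices of the trees kept for model averaging.

    Exactly one criterion must be given:
      min_prob: keep trees whose normalized posterior probability exceeds it;
      top_n:    keep the top_n trees with the highest posterior probability.
    Hypothetical helper for illustration only.
    """
    if (min_prob is None) == (top_n is None):
        raise ValueError("specify exactly one of min_prob or top_n")
    total = sum(posteriors)
    normalized = [p / total for p in posteriors]
    if min_prob is not None:
        return [i for i, p in enumerate(normalized) if p > min_prob]
    # Rank tree indices by decreasing posterior probability, keep the first top_n.
    ranked = sorted(range(len(normalized)), key=lambda i: normalized[i], reverse=True)
    return sorted(ranked[:top_n])


if __name__ == "__main__":
    probs = [0.5, 0.3, 0.15, 0.04, 0.01]
    print(select_trees(probs, min_prob=0.05))
    print(select_trees(probs, top_n=2))
```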

Using variable editor

The Variable Editor is used to define the variable file, which describes the order, name, type and threshold(s) of every variable in the data file. For each row (variable), the user must specify at least a name, the type of the variable (binary or continuous) and the corresponding threshold(s). Multiple thresholds are separated by white space. If many variables share similar names (for example mgene1, mgene2, ..., mgene498), the same type and the same threshold(s), they can be defined in one line by specifying a duplicate property. The order of the variables must match that of the data file. A tool bar is provided to add a new variable, delete an existing variable and change the order of variables if necessary. A Load button is provided to load a previously saved variable file. Clicking the OK button gives the user a chance to save the inputs and then quits the editor.
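
Because the variable order must match the data file, it can help to generate the data file from the same ordered list of variable names. The following Python sketch is an illustration only: the function name format_data_file is hypothetical, each sample is assumed to be a dictionary keyed by variable name, and the output is assumed to contain values only (no header row and no name column), which the description above does not confirm.

```python
def format_data_file(variable_order, training, validation=()):
    """Build tab-delimited text with one row per variable and one column per sample.

    variable_order: variable names in the order used in the variable file.
    training, validation: sequences of dicts mapping variable name to value.
    Hypothetical helper; the values-only layout is an assumption.
    """
    samples = list(training) + list(validation)  # training samples come first
    rows = []
    for name in variable_order:
        rows.append("\t".join(str(sample[name]) for sample in samples))
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    train = [{"resp": 1, "mgene1": 0.5}, {"resp": 0, "mgene1": 1.5}]
    valid = [{"resp": 1, "mgene1": 2.5}]
    print(format_data_file(["resp", "mgene1"], train, valid), end="")
```

With this layout, Total Number of Samples would be 3 and Number of Testing Samples would be 1 for the example data.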

Running/Stopping the program

  1. set input/output/model parameters;
  2. hit the Run button to run the algorithm;
  3. hit the Stop button to stop the algorithm if you want to abandon the current run.

Saving/Loading customized user inputs

User customized inputs can be saved and reloaded by hitting the Save/Load buttons.

Viewing outputs

Outputs can be viewed by hitting the Output button and then providing the name of the file of interest.

Quitting the program

Quit the program by hitting the OK/Cancel button.



This software is made freely available to any interested user. The authors can provide no support nor assistance with implementations beyond the details and examples here, nor extensions of the code for other purposes. The download has been tested to confirm all details are operational as described here.

It is understood by the user that neither the authors nor Duke University bear any responsibility nor assume any liability for any end-use of this software. It is expected that appropriate credit/acknowledgement be given should the software be included as an element in other software development or in publications.

