What Is BAKER?

BAKER stands for "Bayesian Kernel Models". It is a software package for binary classification based on the nonparametric Bayesian kernel model developed by Liang et al. (2009) from the Department of Statistical Science at Duke University. (An empirical study of the model is available here.) BAKER runs on both MATLAB and Octave. The inputs are the training data, the test data, and the model parameters. The output is a report of training and test error rates (point estimates), together with posterior draws for measuring uncertainty in estimation and prediction. Download BAKER here.

A Brief Tutorial

Unzip the downloaded file. To start, make sure that the folder "BAKER/" and its subfolders and files are on the search path of MATLAB or Octave, then have the input data ready to be loaded. The training data should be a matrix with rows corresponding to samples, the first column holding the responses and the remaining column(s) the predictors. The responses are either 1/-1 or 1/0 and represent class labels. The test data should be formatted in the same way. The program returns the empirical training and test errors and (optionally) posterior draws of the parameters.

For example, suppose we have training data X and Y, where X is an n*p matrix with rows corresponding to samples and columns corresponding to predictors, and Y is an n*1 vector of 1/0 class labels. Similarly, we have test data Xtest, an ntest*p matrix, and Ytest, an ntest*1 vector. Let Dtrain=[Y,X] and Dtest=[Ytest,Xtest] (in MATLAB or Octave notation). To run the algorithm, type:

[errs,nm]=baker(Dtrain, Dtest, 'linear', 'T', [500,1000]);

in the command line. The third argument specifies the kernel type: 'linear' for linear kernels, 'Gaussian' for Gaussian kernels, and {'poly',k} for k-order polynomial kernels. The fourth argument is a flag for variable selection: 'T' turns it on, which samples the kernel parameter $\rho$ (and may be computationally heavy), and 'F' turns it off. The fifth argument is a 2-vector of [number of burn-in steps, number of posterior draws]. The output errs is the vector [training error; test error]. The output nm is a struct used for inference. For example, nm.rho is a matrix of posterior draws of the kernel parameters, and nm.postp_test is a matrix of posterior predictive probabilities that the test cases belong to class "1".

For a more detailed explanation of the inputs and outputs, type "help baker" in the command line.
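As a rough illustration of the data layout described above, the following sketch builds Dtrain and Dtest from toy data and checks them before calling baker. The toy data, the sizes, and the sanity checks are our own assumptions for illustration; only the baker call itself is taken from the tutorial, and it is run only if BAKER is found on the search path.

```matlab
% Toy data (illustrative only): n training and ntest test samples, p predictors.
n = 20; ntest = 10; p = 3;
X     = randn(n, p);      Y     = double(X(:, 1) > 0);      % 1/0 class labels
Xtest = randn(ntest, p);  Ytest = double(Xtest(:, 1) > 0);

% First column holds the responses, the remaining columns the predictors.
Dtrain = [Y, X];
Dtest  = [Ytest, Xtest];

% Sanity checks: same number of columns, and labels coded as 1/0.
assert(size(Dtrain, 2) == size(Dtest, 2));
assert(all(ismember(Dtrain(:, 1), [0 1])));
assert(all(ismember(Dtest(:, 1), [0 1])));

% Call BAKER only if baker.m is on the search path.
if exist('baker', 'file') == 2
    [errs, nm] = baker(Dtrain, Dtest, 'linear', 'T', [500, 1000]);
end
```

If the labels are coded as 1/-1 instead, the ismember checks would need to use [-1 1].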

Examples

Here we provide two examples to illustrate the use of BAKER.

Nonlinear Simulation

In this simulation example we have 100 samples from each of two classes.

Samples from class 0 are drawn from


$\displaystyle (x^{1},x^{2}) = (r \, \sin(\theta), r \, \cos(\theta)), \quad r \sim \mathrm{Unif}[0,1], \ \theta \sim \mathrm{Unif}[0,2\pi],$
$\displaystyle x^{j} \sim \mathrm{Unif}[-2,2], \quad \hbox{for } j=3, \ldots, 12.$

Samples from class 1 are drawn from


$\displaystyle (x^{1},x^{2}) = (r \, \sin(\theta), r \, \cos(\theta)), \quad r \sim \mathrm{Unif}[1,2], \ \theta \sim \mathrm{Unif}[0,2\pi],$
$\displaystyle x^{j} \sim \mathrm{Unif}[-2,2], \quad \hbox{for } j=3, \ldots, 12.$

The first two (signal) dimensions of the data are illustrated in Figure 1. Dimensions 3-12 are noise.

Figure 1: The first two dimensions of the data. Points from class 0 are red stars contained in the unit circle and points from class 1 are blue circles contained in an annulus.
\includegraphics[totalheight=2.5in]{data3.eps}
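Data of this form could be simulated along the following lines. This is a minimal sketch that follows the distributions given above; the variable names are our own, and a test set (whose size is not specified here) can be produced by repeating the same draws.

```matlab
n = 100;   % samples per class, as in the example
p = 12;    % 2 signal dimensions plus 10 noise dimensions

% Class 0: radius uniform on [0,1]; class 1: radius uniform on [1,2].
% The angle is uniform on [0, 2*pi] for both classes.
r0 = rand(n, 1);      th0 = 2*pi*rand(n, 1);
r1 = 1 + rand(n, 1);  th1 = 2*pi*rand(n, 1);

% Dimensions 1-2 carry the signal; dimensions 3-12 are Unif[-2,2] noise.
X0 = [r0.*sin(th0), r0.*cos(th0), 4*rand(n, p-2) - 2];
X1 = [r1.*sin(th1), r1.*cos(th1), 4*rand(n, p-2) - 2];

% Responses go in the first column, predictors in the remaining columns.
Dtrain = [zeros(n, 1), X0; ones(n, 1), X1];
```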

Suppose Dtrain and Dtest are the training and test data matrices, formatted as described in the tutorial above. In the command line type:

[errs,nm]=baker(Dtrain, Dtest, 'Gaussian', 'T', [500,1000]);

This fits the model with a Gaussian kernel (the third input is 'Gaussian') and variable selection (the fourth input is 'T'), running 1500 (500+1000) iterations of which the first 500 are burn-in. The output errs=[0;0.06] means that, for this particular run, the training error is 0 and the test error is 0.06. Inference is based on the struct output "nm". For instance, nm.rho(1,:) is a vector of posterior draws of the kernel parameter $\rho$ for the first dimension. Hence, if we are interested in the significance of each dimension, we can type:

bar(mean(nm.rho,2));

which takes the mean of the posterior draws of $\rho$ for each dimension and produces a bar plot like the one shown in Figure 2.

Figure 2: The posterior means of $\rho$ for all dimensions; the values for the signal dimensions 1-2 are large.
\includegraphics[width=\textwidth, totalheight=2.5in]{toyrho3.eps}
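Beyond the bar plot, the posterior means can also be used to rank the dimensions. The sketch below uses a made-up matrix in place of nm.rho, with rows corresponding to dimensions and columns to posterior draws (consistent with nm.rho(1,:) above). The mock values are chosen only to mimic the pattern in Figure 2 and are not output from BAKER.

```matlab
% Stand-in for nm.rho: 12 dimensions (rows) by 1000 posterior draws (columns).
% Rows 1-2 are given larger values purely for illustration.
ndraws = 1000;
rho = [0.8 + 0.1*rand(2, ndraws); 0.1*rand(10, ndraws)];

rhomean = mean(rho, 2);                  % posterior mean for each dimension
[~, order] = sort(rhomean, 'descend');   % dimensions ranked by posterior mean

% The two top-ranked dimensions.
top2 = sort(order(1:2));
disp(top2');
```

With real output, replacing rho by nm.rho gives the same ranking for the fitted model.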

High-dimensional gene expression data

In this example we have 280 samples from patients, of which 190 are tumor samples and 90 are normal. For each sample, expression data from 16063 genes (predictors) were collected. The data were randomly split into training and test sets with 180 and 100 samples, respectively. Variable selection is infeasible in this problem because the number of predictors is huge, hence we type:

[errs,nm]=baker(Dtrain, Dtest, 'linear', 'F', [1000,1000]);

This uses a linear kernel without variable selection. BAKER provides a predictive distribution on the test set through nm.postp_test, which contains the draws of the posterior predictive probability that each test data point belongs to the class with label 1. From these draws, posterior predictive intervals can be constructed, as shown in Figure 3.
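The random 180/100 split mentioned at the start of this example could be done as sketched below. The data here are mock values standing in for the real expression matrix, with only 50 predictors rather than 16063, and the variable names are our own.

```matlab
% Mock data: 280 samples, 190 tumor (label 1) and 90 normal (label 0).
n = 280; ntrain = 180; p = 50;
labels = [ones(190, 1); zeros(90, 1)];
D = [labels, randn(n, p)];     % responses first, predictors after

% Randomly permute the sample indices, then split into 180 and 100.
idx    = randperm(n);
Dtrain = D(idx(1:ntrain), :);
Dtest  = D(idx(ntrain+1:end), :);
```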

Figure 3: The posterior predictive distribution for a test set in which the first 10 samples are normal and the remaining samples are tumor. The red stars represent the posterior means and the blue lines are 90% credible intervals.
\includegraphics[totalheight=2.5in,
width=\textwidth]{geneexp_postp.eps}
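Posterior means and 90% intervals like those in Figure 3 could be computed from the draws roughly as follows. This sketch uses random numbers in place of nm.postp_test and assumes one row per test case and one column per posterior draw; that layout is our assumption, so check "help baker" for the actual orientation. The intervals are equal-tailed and read off the sorted draws.

```matlab
% Stand-in for nm.postp_test. ASSUMED layout: rows are test cases,
% columns are posterior draws.
ntest = 5; ndraws = 1000;
postp = rand(ntest, ndraws);

% Posterior mean predictive probability for each test case.
pmean = mean(postp, 2);

% 90% equal-tailed interval from the sorted draws (5th and 95th percentiles).
sorted = sort(postp, 2);
lo = sorted(:, max(1, round(0.05 * ndraws)));
hi = sorted(:, round(0.95 * ndraws));

% Point prediction: class 1 when the posterior mean exceeds 0.5.
yhat = double(pmean > 0.5);
```

A plot in the style of Figure 3 can then be drawn with errorbar(1:ntest, pmean, pmean-lo, hi-pmean).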