| barkse {barkse} | R Documentation |
BARK-SE is a customized model for BARK, Bayesian sum-of-kernels model.
For numeric response y, we have
y = f(x) + e,
where e ~ N(0,sigma^2).
For a binary response y, P(Y=1 | x) = F(f(x)), where F
denotes the standard normal cdf (probit link).
The difference between BARK-SE package and BARK package is that BARK-SE only works for the selection and equal weights scenario. However, BARK-SE can accept both categorical (ordered or non-ordered) and continuous covariates, while the current implementation of BARK only works for continuous variables.
The usage is also slightly different. BARK-SE works with a particular data class called "snpdata", which can be generated from function make.snpdata(). Check sim.snps for a example use. If all variables are categorical, one can use the functions summarize.odds() and plot.odds() to view the marginal odds for each level of each variable. Continuous variables are coded as "-1" in the data type vector.
barkse(snpdata,
classification = TRUE,
keepevery = 100, nburn = 100,
nkeep = 100, printevery = 1000,
fixed = list(),
tune = list(lstep=0.5, dpow=1, upow=0,
varphistep=.5, phistep=1),
theta = list()
)
snpdata |
The input data is a "snpdata" class, which is a list with. y # n*1, 0/1 response xid # n*1, 1~m row index of x in xunique matrix xunique # m*p, unique covariate matrix nys # m*3, number of y=0/1/* with certain covariate types # p*1, 1/2 variable type ordered/non-ordered ncats # p*1, number of categories in each variable delta # p*1, 0/1 if known the true model in simulation |
classification |
True false logical variable, indicating a classfication or regression problem. |
keepevery |
Every keepevery draw is kept to be returned to the user. |
nburn |
Number of MCMC iterations (nburntimeskeepevery) to be treated as burn in. |
nkeep |
Number of MCMC iterations kept for the posterior inference. nkeeptimeskeepevery iterations after the burn in. |
printevery |
As the MCMC runs, a message is printed every printevery draws. |
fixed |
A list of fixed hyperparameters, using the default values if not
sepecified. alpha = 1: stable index, must be 1. eps = 0.5: approximation parameter. gam = 5: intensity parameter. la = 1: first argument of the gamma prior on kernel scales. lb = 2: second argument of the gamma prior on kernel scales. pbetaa = 1: first argument of the beta prior on plambda. pbetab = 1: second argument of the beta prior on plambda. n: number of training samples, automatically generates. p: number of explainatory variables, automatically generates. meanJ: the expected number of kernels, automatically generates. |
tune |
A list of tuning parameters, not expected to change. lstep: the stepsize of the lognormal random walk on lambda. dpow: the power on the death step. upow: the power on the update step. varphistep: the stepsize of the lognormal random walk on varphi. phistep: the stepsize of the lognormal random walk on phi. |
theta |
A list of the starting values for the parameter theta,
use the defaults if nothing is given. |
BARK-SE is an Bayesian MCMC method. At each MCMC interation, we produce a draw from the joint posterior distribution, i.e. a full configuration of regression coefficents, kernel locations and kernel parameters etc.
Thus, unlike a lot of other modelling methods in R, we do not produce a single model object from which fits and summaries may be extracted. The output consists of values f*(x) (and sigma* in the numeric case) where * denotes a particular draw. The x is either a row from the training data (x.train) or the test data (x.test).
bark returns a list, including:
snpdata |
The input snpdata. |
fixed |
Fixed hyperparameters. |
tune |
Tuning parameters used. |
theta.list |
The list of draws from the posterior sampling. |
theta.nvec |
A matrix with nrow(x.train)+1 rows and (nkeep) columns, recording the number of kernels at each training sample. |
theta.delta |
A matrix with nrow(x.train)+1 rows and (nkeep) columns, recording the the inclusion of each variable at each iteration. |
theta.beta |
A matrix with nrow(x.train)+1 rows and (nkeep) columns, recording the regression coefficients. |
yhat |
A matrix with nrow(x.train) rows and (nkeep) columns.
Each column corresponds to a draw f* from
the posterior of f
and each row corresponds to a row of x.train.
The (i,j) value is f*(x) for
the j^th kept draw of f
and the i^th row of x.train. For classification problems, this is the value of the expectation for the underlying normal random variable. Burn-in is dropped. |
pred |
prediction vector (different from yhat in probit model) |
This version is very preliminary for investigating kernel models in applications with categorical data.
Ouyang, Zhi (2008) Bayesian Additive Regression Kernels.
Duke University. Ph.D. dissertation, pages 32-62.
at:
http://stat.duke.edu/people/theses/OuyangZ.html
## Simple Logistic Model for 6 Categorical Variables
# 1. Simulate the data
# 1.1 Specify the number of categories for all variables
# Integers for categorical variable, scale (sd) for continous variable
ncats <- c(2, 4, rep(3, 4))
# 1.2 Specify the types of the variables in generating the design table
# -1: continuous; 1: ordered/binary; 2: dominant;
# 3: recessive; 4: non-ordered
simtypes <- c(1, 1, 4, 1, 2, 3)
# 1.3 Specify the types goes with the snpdata class (blind)
# -1: continuous; 1: ordered/binary; 2: non-ordered
datatypes <- c(1, 1, 2, 2, 2, 2)
# 1.4 Specify the prior list for all variables
# vector of probabilities for categorical variable
# center (mean) for continous variable (assuming normal population)
priors <- list(c(.5, .5), c(.2, .3, .4, .1),
c(.7, .2, .1), c(.4, .5, .1),
c(.75, .2, .05), c(.6, .35, .05))
names <- c("SEX", "AGE", "RACE", "SNP1", "SNP2", "SNP3")
# 1.5 Specify which variables come into the logistic model
# For simplicity, the logistic regression coefficient are 0 or 0.7
# You many play with this vector to see different behaviour
delta <- c(1, 0, 1, 0, 1, 0)
# 1.6 Simulate the data from a logistic model
# You may need to use make.snpdata() to generate the proper data format.
snpdata <- sim.snps(500, simtypes=simtypes,
datatypes=datatypes, ncats=ncats,
delta=delta, priors=priors, names=names)
summary(snpdata)
# 2. Fit the model
# Note: this step may take a while
fit <- barkse(snpdata, keepevery=20, printevery=2e2)
# 3. View summary statistics
boxplot(as.data.frame(fit$theta.lambda), ylim=c(0, 1))
# Note: this step may take a while
odds <- summarize.odds(fit$snpdata, fit$theta.list)
plot.odds(odds, snpdata$ncats, ylim=c(0, 3))