Statistics 101: Data Analysis and Statistical Inference
Responses to selected student questions
This page has answers to questions posed by students about statistical topics.
Study Design
1. Is there any way to account for nonresponse
bias in a study, besides assigning more weight to those participants
that were hard to obtain?
Missing data are one of the hard issues in real studies. Many people
simply disregard them and analyze the complete data only. This is bad
practice. Resist doing this, even when your advisors or bosses tell you
it is OK. A great book on this subject is Little and Rubin (1987),
Statistical Analysis with Missing Data.
The approach I use is to impute the missing values with
scientifically-driven guesses. For example, suppose income is missing
for some units. I would use a regression model to predict their incomes
from other characteristics of these units. Then, I fill in the missing
incomes with predictions from my regression model, being sure to add
some chance deviation from the line in my imputations. This fills in
all the holes in the data set so that I can analyze it as usual. A key
idea is to impute several values of the missing value to make sure that
you can capture the uncertainty due to guessing. There are specific
rules for combining the multiply-imputed data sets that are derived
from statistical theory. This technique, called multiple imputation,
is an area of my research.
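The imputation steps above can be sketched in a few lines of Python.
This is only a minimal illustration, not the full method: the function
names (fit_line, multiply_impute, pooled_mean) and the use of None to
mark a missing value are assumptions made for this sketch. It draws the
chance deviations from a normal curve with the residual SD, and it shows
only the simplest combining rule (averaging the estimates across the
completed data sets). The full rules also combine the variances, and a
complete method would also account for uncertainty in the fitted line.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares (intercept, slope) for ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def multiply_impute(xs, ys, n_imputations=5, seed=0):
    """Return several completed copies of ys; None marks a missing value."""
    rng = random.Random(seed)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in observed],
                                [y for _, y in observed])

    # Residual SD: the typical chance deviation from the line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in observed)
    resid_sd = math.sqrt(sse / (len(observed) - 2))

    completed = []
    for _ in range(n_imputations):
        # Prediction from the line plus a fresh chance deviation.
        filled = [
            y if y is not None
            else intercept + slope * x + rng.gauss(0.0, resid_sd)
            for x, y in zip(xs, ys)
        ]
        completed.append(filled)
    return completed

def pooled_mean(completed):
    """Average the per-data-set means (point-estimate combining rule only)."""
    return sum(sum(d) / len(d) for d in completed) / len(completed)
```

Each completed data set can be analyzed as usual. The spread of the
results across the data sets reflects the uncertainty due to guessing.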
Here’s a comment on assigning more weight to participants who were hard
to obtain. This is a practice done by some statistical agencies. I do
not recommend it. The reason is that you have to make two rather
arbitrary decisions: 1) which units were hard to obtain; 2) how much
extra weight to give these units. With the multiple imputation
approach, there is at least a principled, transparent method of filling
in the missing data.
2. Can you explain more about how to
control for confounding variables and how do you know whether a study in
the media has considered confounding variables?
The ideal way to control for confounding is to use a randomized
experiment. With randomization, all the background characteristics
should be similar in the treatment groups. This minimizes the influence
of confounding factors. Additional steps to avoid confounding include
double-blinding and using placebos in control groups.
In observational studies, the best bet is to match treated units with
similar-looking control units. Matching schemes come in many flavors.
For example, one approach is to form some function of the values of the
background variables. Then, for each treated unit, find the control
unit in the database (usually there are many more controls than treateds
in databases) whose value of the function is closest to that treated
unit. My favorite matching scheme of this type is called propensity
score matching (come by office hours for more info). When it is not
possible to get reasonably close matches, then I would throw up my
hands and say that the data cannot inform me about the causal question
of interest. As dissatisfying as it is, sometimes the answer has to
be, “I can’t do it.”
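As a rough sketch of the matching idea above: suppose the function of
the background variables has already been computed as one score per
unit. The function name match_controls, the max_gap cutoff, and the
choice to use each control at most once are assumptions made for this
illustration, not a description of any particular matching software.

```python
def match_controls(treated_scores, control_scores, max_gap=None):
    """For each treated unit, find the closest unused control unit.

    Returns a list of (treated_index, control_index) pairs. Treated units
    with no control within max_gap are left unmatched.
    """
    available = set(range(len(control_scores)))
    pairs = []
    for t_idx, t_score in enumerate(treated_scores):
        if not available:
            break
        # Closest remaining control; sorted() makes ties deterministic.
        best = min(sorted(available),
                   key=lambda c: abs(control_scores[c] - t_score))
        gap = abs(control_scores[best] - t_score)
        if max_gap is not None and gap > max_gap:
            continue  # no reasonably close match: leave this unit out
        pairs.append((t_idx, best))
        available.remove(best)  # each control is used at most once
    return pairs
```

If many treated units come back unmatched, that is the signal that the
data may not be able to answer the causal question.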
Another approach is to use multiple regression to control for the
effects of other variables. This approach is extremely popular.
However, it can give misleading results. When the treated and control
groups look different, the data are really spread out. When the data
points are really spread out, it is hard to find a multiple regression
model that describes the data accurately. Furthermore, with many
variables, it is hard even just to tell if the model fits accurately.
Hence, I prefer the matching approach. I do like the idea of matching
first, then fitting a regression model using the matched treated and
control units. Regressions are more likely to fit the data well when the
points are close together.
How do you judge an observational study reported in the media? This is
hard to do, because many media types don’t report the details. They
just go for conclusions and sound bites. I look for evidence that the
researchers controlled for other variables, for example an explicit
statement to that effect. If that evidence is missing, I don’t trust
the results. When that evidence exists, I depend on the reputation of
the researchers conducting the study. If it comes from Harvard or Duke
or an equivalent, I feel more confident in the results than if it comes
from a special interest group.
As an aside, this last comment is why it is critical to be ethical in
scientific research. Reputations are so important; doing unethical
things (e.g., fudging results, not reporting failures, lying, stealing
others’ work) that weaken your reputation and your institution’s is
disastrous for you and for the progress of science and society.
3. What is the best way to eliminate non-response
bias?
It is almost impossible to eliminate it completely. The realistic goal
is to minimize its impact. This is done primarily through the design of
the survey. Research shows that people respond more frequently when:
a) The questionnaire is clear, easy to follow, and short.
b) Confidentiality is promised and kept.
c) The questionnaire begins with a statement of purpose and convinces
people that filling it out will benefit research or society.
d) Incentives are given to respond (e.g., respondents enter in a
lottery to win a prize; coupons, payments, or gifts are promised to
respondents).
e) Mailings don’t look like junk mail.
There is an enormous research literature on avoiding nonresponse. By
no means have we found the answer. This is a very active research area
for statisticians, cognitive psychologists, and sociologists. In fact,
if anyone is interested in pursuing research in this area, let me
know. There are some great summer internships available.
4. What are the negatives of historical controls
vs. contemporaneous ones?
The key idea is to compare people in the same time frame. For example,
in a comparison of surgery versus chemotherapy for breast cancer, you
wouldn’t want to use surgery patients from 20 years ago as a control
group to compare against a current chemo group. The effectiveness of
surgery has increased dramatically over 20 years, women’s health and
awareness of breast cancer have increased dramatically over 20 years, and
the practices of the medical profession have changed dramatically over
20 years. The goal is to eliminate as many confounding factors as
possible, and using people in the same time frame is one way to
eliminate time as a confounder.
5. How does one calculate a good sample size to
represent the population?
We’ll learn one method for specifying sample size after fall break that
I call the confidence interval sample size method.
6. If biases always exist, how can one discover
anything conclusive by conducting a study?
One can learn by minimizing biases sufficiently. This is the goal of
random sampling in surveys and random assignment in causal studies. When
dealing with questionnaire wording, one tries to make the questions as
clear as possible and then accepts what one gets.
One of the open problems in statistics, and perhaps the greatest
problem facing our discipline, is to find ways of measuring the amount
of bias, and then accounting for that in estimates. The person who
figures out how to do that will be very famous and will have made a
great contribution to the world! Who's in?
7. Can one do randomization with an observational
study since the subjects are putting themselves in the separate groups?
No. They’ve already assigned themselves to groups, so we’re stuck with
what they chose.
8. Can you review the basics of confounding
factors?
A confounding factor is one that might explain an apparent causal
relationship and is not a treatment. For example, in a comparison of
the effect of vegetarian diets versus non-vegetarian diets on
cholesterol levels, a potential confounder is exercise (high exercise
lowers cholesterol levels). Suppose the diets are equally effective;
that is, changing your diet does not affect cholesterol level. If the
vegetarian diet group has more avid exercisers than the non-vegetarian
group, the average cholesterol for the vegetarian group will be lower
than the average for the non-vegetarian group. Hence, we might
mistakenly conclude that the vegetarian diet reduced cholesterol levels.
The issues above are the main reason why we try to balance background
characteristics (i.e., potential confounders) in the treatment groups.
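A small simulation can make the diet example concrete. Every number
below (a base cholesterol of 200, an exercise effect of 20 points, and
exercise rates of 70% versus 30%) is made up for illustration, and the
function names are hypothetical. Diet has no effect at all in the
simulated data, yet the naive comparison still favors the vegetarians.

```python
import random

def simulate_diet_study(n=20000, seed=0):
    """Simulate cholesterol where diet has NO effect but exercise does."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        vegetarian = rng.random() < 0.5
        # Vegetarians are more likely to be avid exercisers (assumed rates).
        exerciser = rng.random() < (0.7 if vegetarian else 0.3)
        cholesterol = 200 - (20 if exerciser else 0) + rng.gauss(0, 10)
        rows.append((vegetarian, exerciser, cholesterol))
    return rows

def mean_chol(rows, vegetarian, exerciser=None):
    """Average cholesterol for a diet group, optionally within an exercise group."""
    values = [c for v, e, c in rows
              if v == vegetarian and (exerciser is None or e == exerciser)]
    return sum(values) / len(values)

rows = simulate_diet_study()
# Naive comparison: vegetarians versus non-vegetarians, ignoring exercise.
naive_gap = mean_chol(rows, True) - mean_chol(rows, False)
# Fair comparisons: the same contrast within each exercise group.
gap_exercisers = mean_chol(rows, True, True) - mean_chol(rows, False, True)
gap_nonexercisers = mean_chol(rows, True, False) - mean_chol(rows, False, False)

print(f"naive gap: {naive_gap:.1f}")
print(f"gap among exercisers: {gap_exercisers:.1f}")
print(f"gap among non-exercisers: {gap_nonexercisers:.1f}")
```

The naive gap is clearly negative, while the gaps within each exercise
group are close to zero. Balancing the confounder removes the apparent
diet effect.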
9. How is it a representative sample of the
population if the subjects are assigning themselves to the groups in an
observational study?
The issue in causal studies is that the comparison of the groups needs
to be fair. Hence, they need to have similar background
characteristics. Once we have a fair comparison, we can learn about the
effectiveness of the treatment for the people in the study.
A key issue is deciding whether to extend the results from the people
in the study to a broader population. One has to make a scientific
argument for extension. For example, suppose the vegetarian diet
example in (8) uses college students in Stat 101 as subjects. The
researchers might easily argue that the results extend to all Duke
students, since the relative effectiveness of the diets is unlikely to
depend on being in Stat 101. To argue that the results extend to all
people, they would have to make the argument that the relative
effectiveness of the diets does not depend on being a Duke student.
This is a harder argument, because Duke students tend to be more active
physically than the general population. Perhaps following a vegetarian
diet has no effect on active people, but it has a great effect on
inactive people. If so, the results would not extend beyond Duke
students.
One advantage of observational studies is that they are realistic.
There are no contrived outcomes or treatments; people are acting without
direction from researchers; and, people are taking the treatment
exactly as they would in real settings. Hence, sometimes observational
studies can be more representative of a population than randomized
experiments.
10. I understand that typically when
conducting tests to check a drug's efficacy, those who run the test give
some subjects a placebo and tell all
the subjects that what they receive may or may not be the actual drug.
This helps to see how subjects respond to the drug. However,
would there be any point to telling everyone that they were
receiving the actual drug, in other words, would this show anything if
they reacted differently? For example, if people who take a drug
respond to it, people who take the placebo (and are told that
what they receive may or may not be the drug) don't show any effect,
but people who take the placebo (and are falsely told that they are
receiving the drug) do respond in the amount equal to those who take
the actual drug, does this undermine the drug's efficacy?
In this situation, it is probably the administration of the drug that
causes the apparent effectiveness rather than the drug itself. For
example, perhaps getting attention from a doctor is all that is needed
to cure the problem. Most drugs have side effects that we seek to
avoid. Hence, given the drug's ineffectiveness, it should not be
used for this ailment.
Researchers keep the treatment assignment secret from both the subjects
and the administrators to isolate the effect of the drug. For example,
telling all people
that they're getting the drug (or telling all that they're getting the
placebo) is effectively an additional treatment. It becomes hard
to disentangle the effect of the drug from the effect of being told
you're getting the drug, since everyone got the treatment of "being told
they're getting the drug".
Most clinical trials fall under rules of informed consent, which
require researchers to tell subjects that they are participating in a
randomized study. These rules are very serious.
11. In discussing randomization, we talked
about comparing background variables to make sure they were alike in the
treatment and non-treatment groups. How do we know what
background variables are important to compare? Is it
conceivable that there is some strange factor that influences the
results that we as study designers could not foresee?
It's very hard to know which variables are causally relevant.
This is the great appeal of randomized experiments: when there are
enough units in each group, ALL background characteristics should be
reasonably well-balanced.
In observational studies, we specify important variables using
subject-specific knowledge related to the question of interest.
Statisticians seldom if ever work alone on studies. Rather,
a team of experts in the field tries to identify the important
variables, then the statistician tries to match them. One of the
drawbacks of observational studies is that it is always possible that
some unmatched, confounding variable explains the differences between
the groups. There are methods to check the sensitivity of
conclusions to such variables. For example, suppose you matched
well on a variety of relevant background characteristics, and you see a
huge difference in the sample averages in the treated and control
groups. You can make an argument that any unmatched background
characteristic would have to have a huge effect on the sample averages
to explain the difference, and it is unlikely that such a variable
exists because you controlled for all the relevant ones you could think
of.
12. How do we tell whether background
characteristics are similar enough in the two groups?
Examine the means and standard deviations of the variables in the
two groups to see if they are close. Now, what does "close" mean?
This question cannot be answered absolutely. For example,
suppose we examine low income people assigned to a job training program
and low income people not assigned to the program. The outcome
variable is salary one year after completion of the program. If
we see a difference of $10 in the average pre-study salary in the
two groups, it probably is no big deal: we don't expect a $10
difference to have a great impact on the comparison of future salaries.
On the other hand, if we see a 10% difference between groups in
the percentage of people who report that they are highly motivated to
work hard, this might have a strong impact on the results.
Researchers use prior data to decide if differences are large enough
to have strong impact. For example, in previous studies of all
low-income workers, we might discover that people who are highly
motivated average $5000 more in salary than those who are not as highly
motivated. If $5000 is a substantial number (which it is), we
would worry about making sure the percentages of highly motivated
people are similar in the two groups.
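The balance check described above might be coded as follows. This is a
minimal sketch. The function name balance_table and the layout of the
input (one dictionary per group, mapping a variable name to a list of
values) are assumptions for illustration.

```python
import statistics

def balance_table(treated, control):
    """Compare the mean and SD of each background variable across groups.

    treated and control each map a variable name to a list of values.
    """
    table = {}
    for name in treated:
        t, c = treated[name], control[name]
        table[name] = {
            "mean_treated": statistics.mean(t),
            "mean_control": statistics.mean(c),
            "sd_treated": statistics.stdev(t),
            "sd_control": statistics.stdev(c),
            "mean_diff": statistics.mean(t) - statistics.mean(c),
        }
    return table
```

The code only reports the differences. Judging whether a difference is
"close" still requires subject-matter knowledge, as in the $10 versus
$5000 discussion above.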
Regression and correlation
1. How does
one distinguish between the regression line for x on y versus y on x?
A regression of "Y on X" means that Y is the dependent
(response) variable, and X is the independent (predictor) variable.
A regression of "X on Y" means that X is the dependent
(response) variable, and Y is the independent (predictor) variable.
The lines are different, because they are based on different
interpretations. Consider a regression of Y on X. For any
given value of X, there are many possible outcomes of Y. A
regression means that the population averages of Y for each value of X
are connected by a straight line:
avg. Y = Intercept + slope*X.
Hence, the regression of Y on X tells you about what happens with Y,
given an X.
Now consider a regression of X on Y. For any given value of Y,
there are many possible outcomes of X. The regression of X
on Y means that the population averages of X for each value of Y are
connected by a straight line:
avg. X = Intercept + slope*Y.
Hence, the regression of X on Y tells you about what happens with X,
given a Y.
In real analyses, you get to choose which one is the dependent
variable and which is the independent variable. So, there won't be
any confusion. When you're looking at a graph in a paper,
whatever is on the vertical axis is the Y and whatever is on the
horizontal axis is the X.
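To see that the two regressions really give different lines, here is a
small sketch with made-up data. The helper fit_line is a hypothetical
name. It computes the usual least-squares intercept and slope.

```python
def fit_line(xs, ys):
    """Least-squares (intercept, slope) for ys regressed on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

xs = [1, 2, 3, 4, 5]
ys = [3, 2, 8, 6, 11]   # made-up data

_, slope_y_on_x = fit_line(xs, ys)   # regression of Y on X: predicts Y from X
_, slope_x_on_y = fit_line(ys, xs)   # regression of X on Y: predicts X from Y

# Drawn on the same plot (Y vertical, X horizontal), the X-on-Y line
# has slope 1 / slope_x_on_y, which is steeper than the Y-on-X line.
print(slope_y_on_x, 1 / slope_x_on_y)
```

The two lines coincide only when the points fall exactly on a straight
line. The product of the two slopes equals the squared correlation.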
2. Are you ever allowed to use a regression line of Y on X to predict a value of X
based on a value of Y?
It is possible to do this. In fact, this problem has a special
name: the "calibration problem". It is especially useful in
dose-response drug studies. For example, say you collect responses
on some health measure (e.g., change in blood pressure) for each of
several doses of a drug. Then, you fit a regression of response
on dose. You might be interested in predicting the dose that
gives a certain response (e.g., no change in blood pressure) to find
out a maximum allowable dosage. This means one wants to predict
a dose for a zero blood pressure change, even though the regression
predicts blood pressure changes based on dose.
The method for calibration involves calculus, and we won't study it
in Stat 101. Feel free to come by office hours for further
explanation. But, to answer the question, it can be and is done!
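For the curious, the simplest version of the point estimate just solves
the fitted line for the dose. The sketch below shows only that step,
with made-up data and hypothetical function names. It says nothing
about the uncertainty in the estimated dose, which is where the harder
mathematics comes in.

```python
def fit_line(xs, ys):
    """Least-squares (intercept, slope) for ys regressed on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def dose_for_response(doses, responses, target):
    """Fit response on dose, then solve the line for the dose giving target."""
    intercept, slope = fit_line(doses, responses)
    if slope == 0:
        raise ValueError("a flat line cannot be inverted")
    return (target - intercept) / slope

# Made-up data: change in blood pressure at each of four doses.
doses = [0, 1, 2, 3]
changes = [-4, -2, 0, 2]
print(dose_for_response(doses, changes, target=0))
```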
3. Is there an easier way to figure out the
correlation, "r"?
Well, I have to respond to this question with, "easier than what?"
Here are all the ways one can figure out r.
a) Use a computer to calculate the formula for correlation
based on the data. This is always what you will do in practice.
b) "Eyeball" it. With enough experience, one can get
reasonably good guesses at correlations by mentally comparing the
scatter of the points to other data sets with known correlations.
This is a useful skill for consulting, when you don't have a computer
handy. But, it's better to use a computer to get the exact value
whenever possible.
c) For a regression with one predictor, use the formula
slope = r (SD of Y / SD of X). Assuming you know
or can estimate the slope and SDs, you can solve backwards for r.
This is useful for exams and for understanding the concept of a
regression line, but you won't do this in practice.
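Method (c) amounts to one line of algebra: r = slope * (SD of X / SD
of Y). A minimal sketch, with a hypothetical function name and made-up
data:

```python
import statistics

def r_from_slope(slope, sd_x, sd_y):
    """Solve slope = r * (SD of Y / SD of X) for r."""
    return slope * sd_x / sd_y

xs = [1, 2, 3, 4, 5]
ys = [3, 2, 8, 6, 11]
slope = 2.0  # least-squares slope of ys on xs for these made-up data

r = r_from_slope(slope, statistics.stdev(xs), statistics.stdev(ys))
print(round(r, 3))
```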
4. Can you clarify the meaning of R-squared?
R-squared is the percentage of variation in the response variable
that is explained by the regression line. Think of it this way.
You've got a variable Y that has an SD of 10. Without any
other knowledge, your best guess (i.e. predicted value) at any one
person's Y is the average value of Y. And, you know that for the
Ys in the data set, the typical deviation from the average equals 10,
the SD. Now, suppose you fit a regression with the same Y
on some X, and the typical deviation around the regression line equals 1
(this is the root mean square error). Given an X, your best guess
for that person's value of Y would be the value on the regression line.
And, you know that for the Ys in the data set, the typical deviation
from the guesses on the regression line is about 1, the root MSE.
So, by fitting your regression, you've reduced the typical
deviation around your best guess from 10 to 1, a substantial
improvement. The value of R-squared corresponding to this example
equals
1 - SSE / SST = 1 - (1*1) / (10*10) = .99.
So, your regression line has explained 99% of the variation in the Ys,
where variation is measured in terms of SST. See the course pack
for definition of SSE and SST.
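Here is the same calculation as a short sketch. It assumes the standard
definitions: SSE is the sum of squared deviations of the Ys from the
regression predictions, and SST is the sum of squared deviations of the
Ys from their average. Check these against the course pack. The
function name is hypothetical.

```python
def r_squared(ys, predictions):
    """R-squared = 1 - SSE / SST."""
    mean_y = sum(ys) / len(ys)
    sse = sum((y - p) ** 2 for y, p in zip(ys, predictions))  # around the line
    sst = sum((y - mean_y) ** 2 for y in ys)                  # around the average
    return 1 - sse / sst

# The worked example: typical deviation drops from 10 (the SD) to 1 (root MSE).
example = 1 - (1 * 1) / (10 * 10)
print(example)
```

Predictions that are no better than the average of Y give an R-squared
of 0. Predictions that hit every point exactly give 1.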
5. Can you say more about multiple regression,
since it is not in the text book?
I could, but it'd be impossible to do in a short space. There
are entire courses devoted to multiple regression. The important
concept for Stat 101 is to recognize that it is just a more
sophisticated method of predicting an outcome. It uses more than
one predictor. Otherwise, it follows the same logic as simple
regression. For every combination of predictors, there is some
population average value of Y. These population average values
fall on a line given by:
avg. Y = Intercept + slope1*predictor1 + slope2*predictor2 + slope3*predictor3 + etc.
Values of Y fall around these averages just like in simple
regression. The estimates of the intercept and slopes are those in the
line whose predicted values (i.e. the values from the equation) are as
close as possible to the values of Y in the data.
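To make "as close as possible" concrete, here is a minimal sketch that
finds the least-squares intercept and slopes by solving the so-called
normal equations. Real analyses would use statistical software. The
function names and the tiny data set are made up for illustration.

```python
def fit_multiple_regression(rows, ys):
    """Least-squares [intercept, slope1, slope2, ...].

    rows is a list of predictor lists, one per unit.
    """
    design = [[1.0] + [float(v) for v in row] for row in rows]  # 1 = intercept
    k = len(design[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in design) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(design, ys))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique fit")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]

def predict(coefs, row):
    """avg. Y = Intercept + slope1*predictor1 + slope2*predictor2 + etc."""
    return coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))

# Made-up data that follow Y = 1 + 2*predictor1 + 3*predictor2 exactly.
rows = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
ys = [1, 3, 4, 6, 8]
coefs = fit_multiple_regression(rows, ys)
print(coefs)
```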
For a good introduction to multiple regression, see Neter, Kutner,
Nachtsheim and Wasserman, "Applied Linear Statistical Models".
Bayesian statistics
1. How does one specify a prior distribution?
This is a hard question and one of the central criticisms of Bayesian
statistics. There are two main approaches. One is to
convene experts on the subject of interest, and form a function
representing the experts' prior beliefs. For example, in class we
collected opinions about the average IQ of Duke professors and
expressed our prior beliefs as a normal curve. Although one might
argue whether our class are experts, this approach is similar to what
could be done in practice. A second approach is to use historical
data to make the prior beliefs. For example, we might use results
from other studies to help us specify the mean and SD of a normal curve
representing our prior beliefs. The important consideration is
that the prior distribution honestly reflects your beliefs and that you
are upfront about those beliefs.
2. Doesn't using a prior distribution taint
your results, since you are not being objective?
First, one is never purely objective in statistical analyses. For
example, the decision to assume the central limit theorem holds is
essentially a subjective one based on your beliefs that the sample size
is large enough; your decision to call certain points outliers is
subjective; and, your decision to use one-tailed or two-tailed
hypothesis tests is subjective. That said, the prior beliefs are
explicit in a Bayesian analysis: they're in the prior distribution.
If you make a prior distribution that is reasonable for the data,
based on scientific grounds, your inferences actually can be better
than if you didn't use a Bayesian analysis. So, the Bayesian
would say that the prior distribution does not "taint" the analysis;
rather, it improves it.
One thing that is usually done in Bayesian analyses is to check how
sensitive the results are to the prior beliefs. If there are
competing prior distributions that are all plausible, one can perform
the analyses with each of them. When the results are very
sensitive to the prior beliefs, this is reported honestly.
As the sample size gets large, the information from the data dominates
the information from the prior beliefs. The issue is in small
samples, when the prior beliefs really can have a big impact on
inferences. Of course, this is exactly when you want to use prior
information. For example, if you get three flips of a coin all as
heads, you wouldn't want to say there is a 0% chance of getting tails.
You'd want to incorporate the prior beliefs that the coin has a
50-50 chance of landing tails.
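The coin example can be worked through numerically. The sketch below
assumes a Beta prior centered at 50-50, which is a common textbook
choice but is not something specified above. The "prior strength"
values are likewise made up. With a Beta prior, the posterior mean
chance of tails is (prior tails + observed tails) divided by (prior
total + observed flips).

```python
def posterior_chance_of_tails(heads, tails, prior_heads=10, prior_tails=10):
    """Posterior mean chance of tails under a Beta(prior_tails, prior_heads) prior."""
    return (prior_tails + tails) / (prior_heads + prior_tails + heads + tails)

# Three flips, all heads, under weak, moderate, and strong 50-50 priors.
for strength in (1, 10, 100):
    chance = posterior_chance_of_tails(3, 0, strength, strength)
    print(f"prior strength {strength}: chance of tails = {chance:.3f}")

# With lots of data, the prior matters much less.
print(posterior_chance_of_tails(300, 0))
```

Running the analysis under several plausible priors, as in the loop, is
one simple way to check how sensitive the results are to the prior
beliefs.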
3. Why doesn't everyone use Bayesian
statistics, since there are always prior beliefs?
Good question! Many statisticians in the world would argue that
all analyses should be Bayesian. Others worry that it is
impossible to form reasonable prior distributions, so that it is too
risky to use Bayesian statistics (see Question 2 above). When you
use a prior distribution that doesn't reflect reality, you could get
inferences that also do not reflect reality.