Statistics 101
Data Analysis and Statistical Inference
 

Responses to selected student questions


This page has answers to questions posed by students about statistical topics.  Click on the relevant link to go to the appropriate section.

Click here to go to questions on study design
Click here to go to questions on regression and correlation
Click here to go to questions on Bayesian statistics
More links coming in the future.


 Study Design

1. Is there any way to account for nonresponse bias in a study, besides assigning more weight to those participants that were hard to obtain?

Missing data are one of the hard issues in real studies. Many people simply disregard them and analyze the complete data only. This is bad practice. Resist doing this, even when your advisors or bosses tell you it is OK. A great book on this subject is Little and Rubin (1987), Statistical Analysis with Missing Data.

The approach I use is to impute the missing values with scientifically driven guesses. For example, suppose income is missing for some units. I would use a regression model to predict their incomes from other characteristics of these units. Then, I fill in the missing incomes with predictions from my regression model, being sure to add some chance deviation from the line in my imputations. This fills in all the holes in the data set so that I can analyze it as usual. A key idea is to impute several values for each missing value, creating several completed data sets, so that you capture the uncertainty due to guessing. There are specific rules, derived from statistical theory, for combining the results from the multiply-imputed data sets. This technique, called multiple imputation, is an area of my research.
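
As a rough illustration of the steps just described, here is a minimal Python sketch. The data, the single predictor (age), the number of imputations (m = 5), and all function names are made up for illustration. It also simplifies real multiple imputation: a fuller treatment would account for uncertainty in the fitted regression line itself, which this sketch treats as known. The combining step uses Rubin's rules: average the estimates, then add the average within-imputation variance to (1 + 1/m) times the between-imputation variance.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual SD for y on x."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    resid_sd = (sum(r ** 2 for r in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, resid_sd

def multiply_impute(ages, incomes, m=5, seed=1):
    """Return m completed copies of incomes (None marks a missing value)."""
    rng = random.Random(seed)
    observed = [(a, y) for a, y in zip(ages, incomes) if y is not None]
    intercept, slope, resid_sd = fit_line(
        [a for a, _ in observed], [y for _, y in observed]
    )
    completed = []
    for _ in range(m):
        filled = [
            y if y is not None
            # prediction from the line plus a chance deviation around it
            else intercept + slope * a + rng.gauss(0, resid_sd)
            for a, y in zip(ages, incomes)
        ]
        completed.append(filled)
    return completed

def combine(estimates, variances):
    """Rubin's rules: pooled estimate and its total variance."""
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)
    between = statistics.variance(estimates)
    return pooled, within + (1 + 1 / m) * between

# Hypothetical data: income (in $1000s) is missing for two units.
ages = [25, 30, 35, 40, 45, 50, 55, 60]
incomes = [30, 36, None, 44, 52, None, 61, 66]

datasets = multiply_impute(ages, incomes, m=5)
means = [statistics.mean(d) for d in datasets]              # estimate per data set
mean_vars = [statistics.variance(d) / len(d) for d in datasets]  # its variance
pooled_mean, pooled_var = combine(means, mean_vars)
print(pooled_mean, pooled_var ** 0.5)
```

The pooled variance is never smaller than the average within-imputation variance; the extra piece is the price of having guessed at the missing values.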

Here’s a comment on assigning more weight to participants who were hard to obtain. This is a practice done by some statistical agencies. I do not recommend it. The reason is that you have to make two rather arbitrary decisions: 1) which units were hard to obtain; 2) how much extra weight to give these units. With the multiple imputation approach, there is at least a principled, transparent method of filling in the missing data.

2.  Can you explain more about how to control for confounding variables and how do you know whether a study in the media has considered confounding variables?

The ideal way to control for confounding is to use a randomized experiment. With randomization, all the background characteristics should be similar in the treatment groups. This minimizes the influence of confounding factors. Additional steps to avoid confounding include double-blinding and using placebos in control groups.

In observational studies, the best bet is to match treated units with similar-looking control units. Matching schemes come in many flavors. For example, one approach is to form some function of the values of the background variables. Then, for each treated unit, find the control unit in the database (usually there are many more controls than treateds in databases) whose value of the function is closest to that treated unit. My favorite matching scheme of this type is called propensity score matching (come by office hours for more info). When it is not possible to get reasonably close matches, then I would throw up my hands and say that the data cannot inform me about the causal question of interest. As dissatisfying as it is, sometimes the answer has to be, “I can’t do it.”
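
The matching step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores below are invented, the function name and the caliper of 0.05 (the largest gap I am willing to call a "reasonably close" match) are hypothetical choices, and the sketch assumes the function of the background variables (for example, a propensity score) has already been computed for every unit. Controls may be reused for more than one treated unit.

```python
def match_nearest(treated_scores, control_scores, caliper=0.05):
    """For each treated unit, return the index of the closest control,
    or None when no control lies within the caliper."""
    matches = []
    for t in treated_scores:
        best = min(range(len(control_scores)),
                   key=lambda j: abs(control_scores[j] - t))
        if abs(control_scores[best] - t) <= caliper:
            matches.append(best)
        else:
            matches.append(None)   # no acceptable match for this unit
    return matches

# Hypothetical scores for three treated units and five controls.
treated = [0.62, 0.35, 0.91]
controls = [0.10, 0.33, 0.48, 0.60, 0.66]

matches = match_nearest(treated, controls)
print(matches)   # [3, 1, None]
```

The third treated unit has no control anywhere near it, which is the "I can't do it" situation: the data do not contain a fair comparison for that unit.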

Another approach is to use multiple regression to control for the effects of other variables. This approach is extremely popular. However, it can give misleading results. When the treated and control groups look different, the data are really spread out. When the data points are really spread out, it is hard to find a multiple regression model that describes the data accurately. Furthermore, with many variables, it is hard even just to tell if the model fits accurately. Hence, I prefer the matching approach. I do like the idea of matching first, then fitting a regression model using the matched treated and control units. Regressions are more likely to fit the data well when the points are close together.

How do you judge an observational study reported in the media? This is hard to do, because many media types don’t report the details. They just go for conclusions and sound bites. I look for evidence that the researchers controlled for other variables, for example an explicit statement to that effect. If that evidence is missing, I don’t trust the results. When that evidence exists, I depend on the reputation of the researchers conducting the study. If it comes from Harvard or Duke or an equivalent, I feel more confident in the results than if it comes from a special interest group.

As an aside, this last comment is why it is critical to be ethical in scientific research. Reputations are sooooo important; doing unethical things (e.g., fudging results, not reporting failures, lying, stealing others’ work) that weaken your reputation and your institution’s is disastrous for you and for the progress of science and society.

3. What is the best way to eliminate non-response bias?

It is almost impossible to eliminate it completely. The realistic goal is to minimize its impact. This is done primarily through the design of the survey. Research shows that people respond more frequently when:

a) The questionnaire is clear, easy to follow, and short.
b) Confidentiality is promised and kept.
c) The questionnaire begins with a statement of purpose and convinces people that filling it out will benefit research or society.
d) Incentives are given to respond (e.g., respondents enter in a lottery to win a prize; coupons, payments, or gifts are promised to respondents).
e) Mailings don’t look like junk mail.

There is an enormous research literature on avoiding nonresponse. By no means have we found the answer. This is a very active research area for statisticians, cognitive psychologists, and sociologists. In fact, if anyone is interested in pursuing research in this area, let me know. There are some great summer internships available.

4. What are the negatives of historical controls vs. contemporaneous ones?

The key idea is to compare people in the same time frame. For example, in a comparison of surgery versus chemotherapy for breast cancer, you wouldn’t want to use surgery patients from 20 years ago as a control group to compare against a current chemo group. The effectiveness of surgery has increased dramatically over 20 years, women’s health and awareness of breast cancer have increased dramatically over 20 years, and the practices of the medical profession have changed dramatically over 20 years. The goal is to eliminate as many confounding factors as possible, and using people in the same time frame is one way to eliminate time as a confounder.

5. How does one calculate a good sample size to represent the population?

We’ll learn one method for specifying sample size after fall break that I call the confidence interval sample size method.

6. If biases always exist, how can one discover anything conclusive by conducting a study?

One can learn by minimizing biases sufficiently. This is the goal of random sampling in surveys and random assignment in causal studies. When dealing with questionnaire wording, one tries to make the questions as clear as possible and then accepts what one gets.

One of the open problems in statistics, and perhaps the greatest problem facing our discipline, is to find ways of measuring the amount of bias, and then accounting for that in estimates. The person who figures out how to do that will be very famous and will have made a great contribution to the world!   Who's in?

7. Can one do randomization with an observational study since the subjects are putting themselves in the separate groups?

No. They’ve already assigned themselves to groups, so we’re stuck with what they chose.

8. Can you review the basics of confounding factors?

A confounding factor is one that might explain an apparent causal relationship and is not a treatment. For example, in a comparison of the effect of vegetarian diets versus non-vegetarian diets on cholesterol levels, a potential confounder is exercise (high exercise lowers cholesterol levels). Suppose the diets are equally effective; that is, changing your diet does not affect cholesterol level. If the vegetarian diet group has more avid exercisers than the non-vegetarian group, the average cholesterol for the vegetarian group will be lower than the average for the non-vegetarian group. Hence, we might mistakenly conclude that the vegetarian diet reduced cholesterol levels.  

The issues above are the main reason why we try to balance background characteristics (i.e., potential confounders) in the treatment groups.
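
To see the diet example in numbers, here is a small Python sketch. Everything in it is invented for illustration: the cholesterol values (180 for exercisers, 220 for non-exercisers), the group sizes, and the function names. The data are built so that diet has no effect at all; cholesterol depends only on exercise.

```python
def mean(values):
    return sum(values) / len(values)

# Cholesterol depends only on exercise, never on diet (by construction).
CHOL_EXERCISER = 180
CHOL_NON_EXERCISER = 220

def make_group(diet, n_exercisers, n_non_exercisers):
    """Build (diet, exercises, cholesterol) records for one diet group."""
    return ([(diet, True, CHOL_EXERCISER)] * n_exercisers
            + [(diet, False, CHOL_NON_EXERCISER)] * n_non_exercisers)

# The vegetarian group has many more avid exercisers.
people = make_group("veg", 80, 20) + make_group("non-veg", 20, 80)

def avg_chol(diet, exercises=None):
    """Average cholesterol for a diet, optionally within one exercise level."""
    return mean([c for d, e, c in people
                 if d == diet and (exercises is None or e == exercises)])

naive_gap = avg_chol("veg") - avg_chol("non-veg")
gap_exercisers = avg_chol("veg", True) - avg_chol("non-veg", True)
gap_non_exercisers = avg_chol("veg", False) - avg_chol("non-veg", False)

print(naive_gap)            # -24.0: vegetarians look better overall
print(gap_exercisers)       # 0.0: no difference among exercisers
print(gap_non_exercisers)   # 0.0: no difference among non-exercisers
```

The 24-point gap in the overall averages comes entirely from the imbalance in exercise. Once exercisers are compared with exercisers, the apparent diet effect disappears.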

9. How is it a representative sample of the population if the subjects are assigning themselves to the groups in an observational study?

The issue in causal studies is that the comparison of the groups needs to be fair. Hence, they need to have similar background characteristics. Once we have a fair comparison, we can learn about the effectiveness of the treatment for the people in the study.

A key issue is deciding whether to extend the results from the people in the study to a broader population. One has to make a scientific argument for extension. For example, suppose the vegetarian diet example in (8) uses college students in Stat 101 as subjects. The researchers might easily argue that the results extend to all Duke students, since the relative effectiveness of the diets is unlikely to depend on being in Stat 101. To argue that the results extend to all people, they would have to make the argument that the relative effectiveness of the diets does not depend on being a Duke student. This is a harder argument, because Duke students tend to be more active physically than the general population. Perhaps following a vegetarian diet has no effect on active people, but it has a great effect on inactive people. If so, the results would not extend beyond Duke students.

One advantage of observational studies is that they are realistic. There are no contrived outcomes or treatments; people are acting without direction from researchers; and, people are taking the treatment exactly as they would in real settings. Hence, sometimes observational studies can be more representative of a population than randomized experiments.

10.  I understand that typically when conducting tests to check a drug's efficacy, those who run the test give some subjects a placebo and tell all the subjects that what they receive may or may not be the actual drug.  This helps to see how subjects respond to the drug.  However, would there be any point to telling everyone that they were receiving the actual drug, in other words, would this show anything if they reacted differently?  For example, if people who take a drug respond to it, people who take the placebo (and are told that what they receive may or may not be the drug) don't show any effect, but people who take the placebo (and are falsely told that they are receiving the drug) do respond in the amount equal to those who take the actual drug, does this undermine the drug's efficacy?


In this situation, it is probably the administration of the drug that causes the apparent effectiveness rather than the drug itself.  For example, perhaps getting attention from a doctor is all that is needed to cure the problem.  Most drugs have side effects that we seek to avoid.  Hence, given the drug's ineffectiveness, it should not be used for this ailment.  

Researchers keep the treatment secret from the administrators to isolate the effect of the drug.  For example, telling all people that they're getting the drug (or telling all that they're getting the placebo) is effectively an additional treatment.  It becomes hard to disentangle the effect of the drug from the effect of being told you're getting the drug, since everyone got the treatment of "being told they're getting the drug".

Most clinical trials fall under rules of informed consent, which require researchers to tell participants that they are taking part in a randomized study.  These rules are very serious.

11. In discussing randomization, we talked about comparing background variables to make sure they were alike in the treatment and non-treatment groups.  How do we know what background variables are important to compare?  Is it  conceivable that there is some strange factor that influences the results that we as study designers could not foresee?

It's very hard to know which variables are causally relevant.   This is the great appeal of randomized experiments: when there are enough units in each group, ALL background characteristics should be reasonably well-balanced.

In observational studies, we specify important variables using subject-specific knowledge related to the question of interest.  Statisticians seldom if ever work alone on studies.  Rather, a team of experts in the field tries to identify the important variables, then the statistician tries to match on them.  One of the drawbacks of observational studies is that it is always possible that some unmatched, confounding variable explains the differences between the groups.  There are methods to check the sensitivity of conclusions to such variables.  For example, suppose you matched well on a variety of relevant background characteristics, and you see a huge difference in the sample averages in the treated and control groups.  You can make an argument that any unmatched background characteristic would have to have a huge effect on the sample averages to explain the difference, and it is unlikely that such a variable exists because you controlled for all the relevant ones you could think of.

12.  How do we tell whether background characteristics are similar enough in the two groups?

Examine the means and standard deviations of the variables in the two groups to see if they are close.   Now, what does "close" mean?  This question cannot be answered absolutely.  For example, suppose we examine low income people assigned to a job training program and low income people not assigned to the program.  The outcome variable is salary one year after completion of the program.   If we see a difference of $10 in the average pre-study salary in the two groups, it probably is no big deal: we don't expect a $10 difference to have a great impact on the comparison of future salaries.  On the other hand, if we see a 10% difference between groups in the percentage of people who report that they are highly motivated to work hard, this might have a strong impact on the results.

Researchers use prior data to decide if differences are large enough to have strong impact.  For example, in previous studies of all low-income workers, we might discover that people who are highly motivated average $5000 more in salary than those who are not as highly motivated.  If $5000 is a substantial number (which it is), we would worry about making sure the percentages of highly motivated people are similar in the two groups.
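
A balance check of this kind is easy to compute. The sketch below is only an illustration: the variable names, the data (ten people per group), and the `balance_table` function are all hypothetical, chosen to mimic the job training example with a $10 gap in pre-study salary and a 10-point gap in the percentage who are highly motivated.

```python
import statistics

def balance_table(treated, control):
    """Compare the mean and SD of each background variable across groups."""
    rows = {}
    for name in treated:
        t, c = treated[name], control[name]
        rows[name] = {
            "mean_treated": statistics.mean(t),
            "mean_control": statistics.mean(c),
            "sd_treated": statistics.stdev(t),
            "sd_control": statistics.stdev(c),
            "mean_diff": statistics.mean(t) - statistics.mean(c),
        }
    return rows

# Hypothetical background data; motivated is coded 1 = yes, 0 = no.
treated = {
    "salary": [14000, 14500, 15000, 15500, 16000] * 2,
    "motivated": [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
}
control = {
    "salary": [14010, 14510, 15010, 15510, 16010] * 2,
    "motivated": [1, 0, 1, 0, 1, 1, 0, 1, 1, 1],
}

rows = balance_table(treated, control)
for name, row in rows.items():
    print(name, row)
```

The table only reports the differences. Deciding whether a $10 salary gap or a 10-point motivation gap matters still requires the kind of subject-matter judgment described above.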

 Regression and correlation

1.  How does one distinguish between the regression line for x on y versus y on x?

A regression of "Y on X"  means that Y is the dependent (response) variable, and X is the independent (predictor) variable.  A regression of "X on Y"  means that X is the dependent (response) variable, and Y is the independent (predictor) variable.

The lines are different, because they are based on different interpretations.   Consider a regression of Y on X.  For any given value of X, there are many possible outcomes of Y.    A regression means that the population averages of Y for each value of X are connected by a straight line:   

avg. Y = Intercept + slope*X.  

Hence, the regression of Y on X tells you about what happens with Y, given an X.

Now consider a regression of X on Y.  For any given value of Y, there are many possible outcomes of X.    The regression of X on Y means that the population averages of X for each value of Y are connected by a straight line:

avg. X = Intercept + slope*Y.

Hence, the regression of X on Y tells you about what happens with X, given a Y.

In real analyses, you get to choose which one is the dependent variable and which is the independent variable.  So, there won't be any confusion.  When you're looking at a graph in a paper, whatever is on the vertical axis is the Y and whatever is on the horizontal axis is the X.
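
To see that the two regressions really give different lines, here is a small Python sketch on a made-up data set of five points. The data and the `regress` function are hypothetical and only for illustration; the function computes the usual least-squares intercept and slope.

```python
def regress(dep, pred):
    """Least-squares (intercept, slope) for regressing dep on pred."""
    n = len(dep)
    mean_d = sum(dep) / n
    mean_p = sum(pred) / n
    slope = (sum((p - mean_p) * (d - mean_d) for p, d in zip(pred, dep))
             / sum((p - mean_p) ** 2 for p in pred))
    return mean_d - slope * mean_p, slope

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

a_yx, b_yx = regress(y, x)   # avg. Y = a_yx + b_yx * X
a_xy, b_xy = regress(x, y)   # avg. X = a_xy + b_xy * Y

# To draw the X-on-Y line on a plot with X horizontal and Y vertical,
# solve it for Y: its slope on that plot is 1 / b_xy.
slope_xy_in_plane = 1 / b_xy

print(b_yx)                # about 0.8
print(slope_xy_in_plane)   # about 1.25
```

Drawn on the same axes, the Y on X line has slope 0.8 and the X on Y line has slope 1.25. The two lines coincide only when the correlation is exactly +1 or -1.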

2. Are you ever allowed to use a regression line of Y on X to predict a value of X based on a value of Y?

It is possible to do this.  In fact, this problem has a special name: the "calibration problem".  It is especially useful in dose-response drug studies.  For example, say you collect responses on some health measure (e.g., change in blood pressure) for each of several doses of a drug.  Then, you fit a regression of response on dose.  You might be interested in predicting the dose that gives a certain response (e.g., no change in blood pressure) to find out a maximum allowable dosage.   This means one wants to predict a dose for a zero blood pressure change, even though the regression predicts blood pressure changes based on dose.

The method for calibration involves calculus, and we won't study it in Stat 101.  Feel free to come by office hours for further explanation.  But, to answer the question, it can be and is done!
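
For the curious, the simplest piece of calibration, the point estimate, just solves the fitted line for the dose. The Python sketch below shows only that step; it does not attempt the harder part (putting a margin of error on the predicted dose). The doses, blood pressure changes, and function names are made up for illustration.

```python
def fit_response_on_dose(doses, responses):
    """Least-squares (intercept, slope) for response regressed on dose."""
    n = len(doses)
    mean_d = sum(doses) / n
    mean_r = sum(responses) / n
    slope = (sum((d - mean_d) * (r - mean_r) for d, r in zip(doses, responses))
             / sum((d - mean_d) ** 2 for d in doses))
    return mean_r - slope * mean_d, slope

def dose_for_response(target, intercept, slope):
    """Invert avg. response = intercept + slope * dose to get the dose."""
    if slope == 0:
        raise ValueError("a flat line cannot be inverted")
    return (target - intercept) / slope

# Hypothetical doses (mg) and changes in blood pressure.
doses = [0, 10, 20, 30, 40]
bp_changes = [-6.2, -2.9, 0.1, 2.8, 6.2]

intercept, slope = fit_response_on_dose(doses, bp_changes)
dose_for_zero_change = dose_for_response(0.0, intercept, slope)
print(dose_for_zero_change)   # about 20
```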

3. Is there an easier way to figure out the correlation, "r"?

Well, I have to respond to this question with, "easier than what?"  Here are all the ways one can figure out r.

a)  Use a computer to calculate the formula for correlation based on the data.  This is always what you will do in practice.
b)  "Eyeball" it.  With enough experience, one can get reasonably good guesses at correlations by mentally comparing the scatter of the points to other data sets with known correlations.  This is a useful skill for consulting, when you don't have computers handy.  But, it's better to use a computer to get the exact value whenever possible.
c)  For a regression with one predictor, use the formula  slope = r (SD of Y  /  SD of X).  Assuming you know or can estimate the slope and SDs, you can solve backwards for r.  This is useful for exams and for understanding the concept of a regression line, but you won't do this in practice.
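
Method (c) amounts to one line of arithmetic, sketched here in Python. The numbers (a slope of 2, an SD of 3 for X, an SD of 10 for Y) and the function name are made up for illustration.

```python
def r_from_slope(slope, sd_x, sd_y):
    """Solve slope = r * (sd_y / sd_x) for r."""
    r = slope * sd_x / sd_y
    if abs(r) > 1:
        # A correlation outside [-1, 1] means the inputs cannot all be right.
        raise ValueError("inputs are inconsistent: |r| cannot exceed 1")
    return r

r = r_from_slope(slope=2.0, sd_x=3.0, sd_y=10.0)
print(r)   # 0.6
```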

4. Can you clarify the meaning of R-squared?

R-squared is the percentage of variation in the response variable that is explained by the regression line.  Think of it this way.  You've got a variable Y that has an SD of 10.  Without any other knowledge, your best guess (i.e. predicted value) at any one person's Y is the average value of Y.  And, you know that for the Ys in the data set, the typical deviation from the average equals 10, the SD.    Now, suppose you fit a regression with the same Y on some X, and the typical deviation around the regression line equals 1 (this is the root mean square error).  Given an X, your best guess for that person's value of Y would be the value on the regression line.  And, you know for the Ys in the data set, the typical deviation from the guesses on the regression line is about 1, the root MSE.   So, by fitting your regression, you've reduced the typical deviation around your best guess from 10 to 1, a substantial improvement.  The value of R-squared corresponding to this example equals

1 - SSE / SST   =   1 -  (1*1) / (10*10)  = .99.  

So, your regression line has explained 99% of the variation in the Ys, where variation is measured in terms of SST.  See the course pack for definition of SSE and SST.
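
Here is a small Python sketch of the same calculation done from raw data. It assumes the usual definitions, which should match the course pack: SST is the sum of squared deviations of Y from its average, and SSE is the sum of squared deviations of Y from the regression line. The five data points and the variable names are made up for illustration.

```python
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares line for Y on X.
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

# SST: squared deviations from the average (best guess without X).
sst = sum((yi - mean_y) ** 2 for yi in y)
# SSE: squared deviations from the line (best guess with X).
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - sse / sst
print(sst, sse, r_squared)   # about 10, 3.6, and 0.64
```

For this data set the regression line explains about 64% of the variation in the Ys, far less than the 99% in the example above, because the points scatter much more widely around the line relative to their overall spread.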

5. Can you say more about multiple regression, since it is not in the textbook?

I could, but it'd be impossible to do in a short space.  There are entire courses devoted to multiple regression.  The important concept for Stat 101 is to recognize that it is just a more sophisticated method of predicting an outcome.  It uses more than one predictor.  Otherwise, it follows the same logic as simple regression.  For every combination of predictors, there is some population average value of Y.  These population average values fall on a line given by:     

avg. Y =  Intercept  +  slope1 * predictor1   +  slope2* predictor2  +  slope3*predictor3 + etc.

Values of Y fall around these averages just like in simple regression. The estimates of the intercept and slopes are those in the line whose predicted values (i.e. the values from the equation) are as close as possible to the values of Y in the data.
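
As a sketch of what "as close as possible" means in practice, the Python code below finds the intercept and slopes by least squares, that is, by minimizing the sum of squared differences between the predicted and observed values of Y. Reading "as close as possible" as least squares is the standard interpretation, but the function names and the data are my own illustration. The data are constructed to follow avg. Y = 1 + 2*predictor1 + 3*predictor2 exactly, with no scatter, so that the answer is easy to check; real data would not fit perfectly.

```python
def solve(a, b):
    """Solve the square linear system a * coef = b by Gauss-Jordan
    elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def least_squares(predictors, y):
    """Return [intercept, slope1, slope2, ...] minimizing squared error."""
    rows = [[1.0] + list(p) for p in predictors]   # leading 1 carries the intercept
    k = len(rows[0])
    # Normal equations: (X'X) coef = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data with two predictors per unit.
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [1 + 2 * p1 + 3 * p2 for p1, p2 in predictors]

intercept, slope1, slope2 = least_squares(predictors, y)
print(intercept, slope1, slope2)   # about 1, 2, and 3
```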

For a good introduction to multiple regression, see Neter, Kutner, Nachtsheim and Wasserman, "Applied Linear Statistical Models".

 Bayesian statistics

1. How does one specify a prior distribution?

This is a hard question and one of the central criticisms of Bayesian statistics.  There are two main approaches.  One is to convene experts on the subject of interest, and form a function representing the experts' prior beliefs.  For example, in class we collected opinions about the average IQ of Duke professors and expressed our prior beliefs as a normal curve.  Although one might argue whether the members of our class are experts, this approach is similar to what could be done in practice.  A second approach is to use historical data to form the prior beliefs.  For example, we might use results from other studies to help us specify the mean and SD of a normal curve representing our prior beliefs.  The important consideration is that the prior distribution honestly reflects your beliefs and that you are upfront about those beliefs.

2. Doesn't using a prior distribution taint your results, since you are not being objective?

First, one is never purely objective in statistical analyses.  For example, the decision to assume the central limit theorem holds is essentially a subjective one based on your beliefs that the sample size is large enough; your decision to call certain points outliers is subjective; and, your decision to use one-tailed or two-tailed hypothesis tests is subjective.   That said, the prior beliefs are explicit in a Bayesian analysis: they're in the prior distribution.  If you make a prior distribution that is reasonable for the data, based on scientific grounds, your inferences actually can be better than if you didn't use a Bayesian analysis.  So, the Bayesian would say that the prior distribution does not "taint" the analysis; rather, it improves it.

One thing that is usually done in Bayesian analyses is to check how sensitive the results are to the prior beliefs.  If there are competing prior distributions that are all plausible, one can perform the analyses with each of them.  When the results are very sensitive to the prior beliefs, this is reported honestly.

As the sample size gets large, the information from the data dominates the information from the prior beliefs.  The issue is in small samples, when the prior beliefs really can have a big impact on inferences.  Of course, this is exactly when you want to use prior information.  For example, if you get three flips of a coin all as heads, you wouldn't want to say there is a 0% chance of getting tails.  You'd want to incorporate the prior beliefs that the coin has a 50-50 chance of landing tails.
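
The coin example can be worked out with a few lines of Python. This sketch rests on assumptions of mine, not anything fixed by the example: it uses a Beta prior centered at 50-50, and it sets the strength of that prior to be worth 10 imagined tails and 10 imagined heads. With a Beta prior, the posterior mean is simply (prior tails + observed tails) divided by (all prior flips + all observed flips).

```python
def posterior_mean_tails(prior_tails, prior_heads, tails, heads):
    """Posterior mean of P(tails) under a Beta(prior_tails, prior_heads) prior."""
    return (prior_tails + tails) / (prior_tails + prior_heads + tails + heads)

# Data alone: three flips, all heads, so the observed fraction of tails is 0.
data_only = 0 / 3

# Small sample: the prior keeps the estimate well away from 0.
small_sample = posterior_mean_tails(10, 10, tails=0, heads=3)     # 10/23

# Large sample: 300 heads in a row overwhelm the same prior.
large_sample = posterior_mean_tails(10, 10, tails=0, heads=300)   # 10/320

print(data_only, small_sample, large_sample)
```

After three heads the estimated chance of tails is about 0.43 rather than 0. After 300 heads in a row it drops to about 0.03, which shows the data dominating the prior as the sample size grows.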

3. Why doesn't everyone use Bayesian statistics, since there's always prior beliefs?

Good question!  Many statisticians in the world would argue that all analyses should be Bayesian.  Others worry that it is impossible to form reasonable prior distributions, so that it is too risky to use Bayesian statistics (see Question 2 above).  When you use a prior distribution that doesn't reflect reality, you could get inferences that also do not reflect reality.