Statistics 101
Data Analysis and Statistical Inference

Answers to extra problems on confidence intervals and hypothesis tests


1. Conceptual questions on confidence intervals

i)  False.   With a wider interval, one can be more confident that the parameter is contained in the interval.  So, confidence levels are higher.

ii)  False.   Increasing the sample size decreases the width of confidence intervals, because it decreases the standard error of the estimate. It makes sense that they should be narrower: you have more information and so can make a sharper guess as to a likely range for the parameter.

iii)  False.   Confidence intervals are not probability intervals.   95% confidence means that we have produced an interval with a procedure that works 95% of the time.  That is, 95% of all intervals produced by the procedure will contain their corresponding parameters.

iv)  True.  The standard deviation of XBAR, which is often called the standard error, has the square root of  the sample size in the denominator.  Hence, increasing the sample size by a factor of 4 (i.e., multiplying it by 4) is equivalent to multiplying the standard error by 1/2.  Hence, the interval will be half as wide.

v)  The central limit theorem is needed for confidence intervals to be valid.   However, it is also necessary that the observations be independent, and that the data be collected from random samples.  Also, confidence intervals will not remedy poorly collected data.
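
The sample-size effect described in parts ii and iv can be checked numerically.  The sketch below is only an illustration: the function name ci_width and the SD of 10 are assumptions chosen for the example, not values from the problems, and the interval uses the usual 1.96 multiplier for 95% confidence.

```python
import math

def ci_width(sd, n, z=1.96):
    """Width of a z-based confidence interval for a mean: 2 * z * sd / sqrt(n)."""
    return 2 * z * sd / math.sqrt(n)

sd = 10.0                       # hypothetical SD, chosen only for illustration
width_n = ci_width(sd, 100)     # width with n = 100
width_4n = ci_width(sd, 400)    # width after multiplying n by 4
ratio = width_4n / width_n      # quadrupling n halves the width

print(width_n, width_4n, ratio)
```

Because the width depends on n only through 1/sqrt(n), the ratio is one half whatever SD is plugged in.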
 

2.  Conceptual questions on hypothesis tests

i)  False.  A small p-value means the value of the statistic we actually observed in the sample is unlikely to have occurred when the null hypothesis is true.  Hence, a smaller p-value means it is even more unlikely the observed statistic would have occurred when the null hypothesis is true.  Hence, a smaller p-value is stronger evidence against the null hypothesis.

ii)  False.  By definition, p-values take into consideration the sample size, since the test statistic is computed by dividing by the standard error, which depends on the sample size.  Hence, when the null hypothesis is true, a small p-value should be equally likely regardless of sample size.  What is true is that when the null hypothesis is false, hypothesis tests done with small sample sizes are less likely to reject the null hypothesis.

iii)  False.   A p-value is not a probability that the null hypothesis is true.   It is the probability of observing a value of the sample statistic that is as or more extreme than what was observed, when the null hypothesis is true.

iv)   True.  Just by chance it is possible to get a sample that produces a value of the test statistic that leads to a small p-value, even though the null hypothesis is true.  This is called a Type I error.   A Type II error is when the null hypothesis is not rejected when it is in fact false.

v)   The central limit theorem is needed for hypothesis tests to be valid.   However, it is also necessary that the observations be independent, and that the data be collected from random samples.  Also, hypothesis tests will not remedy poorly collected data.

vi)    This is incorrect because the researcher is claiming that (1 - p-value) is the probability that the null hypothesis is false.  The p-value is not a probability of a null hypothesis being true or false.  See the answer to part iii.

vii)  With four units, the null hypothesis is unlikely to be rejected because the variability in the sample mean will be large.  Hence, there is not enough data to support the researcher's claim that the  alternative hypothesis is clearly not right.

viii)  You should not allow the drug to be manufactured based on this evidence.  The study was not a randomized study, so there may be differences in the background characteristics of the people who got the new drug and the people who got the old drug.   Hypothesis tests cannot fix poorly designed studies!

3. True or False:

False.  The formula for the confidence interval is fine, but clearly this is not a representative sample.  People who call in to the program to voice their opinions for new military action against Iraq are likely to hold stronger opinions than those who do not call in.  For any statistical procedure to yield valid results, the data must be collected properly!

4.  Frightening information about our citizenry.

The outcomes are dichotomous (identify, do not identify).  There is some population percentage, p, of people who can identify the Bill of Rights.   We want a 95% CI for p.  Because 507 people are randomly sampled from the population, it is reasonable to assume a box model for the data.

First, the point estimate of p is the sample percentage, 142/507  =  .28.  The standard error equals square root (.28 * (1-.28) / 507 )  =  .02.  Hence, the lower limit of the interval is .28 - 1.96 * .02   =   .241.  The upper limit of the interval is .28 + 1.96 * .02   =   .319.

To conclude, we are 95% confident that the percentage of Americans who could not identify the Bill of Rights is between 24.1 and 31.9 percent.  Yikes!!!  That's a lot higher than I ever thought it would be....
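
The interval arithmetic above can be reproduced with a few lines of Python.  This is a minimal sketch, not part of the original solution: the helper name proportion_ci is made up for illustration, and it carries full precision rather than rounding the estimate and standard error first.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Return (estimate, standard error, lower, upper) for a z-based CI for a proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se, p_hat - z * se, p_hat + z * se

# Counts from the problem: 142 of 507 sampled people.
p_hat, se, lower, upper = proportion_ci(142, 507)
print(round(p_hat, 3), round(se, 3), round(lower, 3), round(upper, 3))
```

Carrying full precision gives limits of roughly .241 and .319.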

2. For this confidence interval to be valid, we need to assume that:

a)   The data collected are representative of the population of interest (i.e., it was a simple random sample).
b)   The Central Limit Theorem holds for the sample proportion.

We've already argued why a box model holds.  Furthermore, since the sample is large, there is no need to worry about the correction factor in the standard error.

The Central Limit Theorem certainly holds, since it appears that  np > 5 and n(1-p) > 5.    We check this by  using .28 as an estimate of p, and 507 for n, so that np = .28*507 > 5  and .72 * 507 > 5.
 

3.  Comments:

a).  We would like to know exactly how the data were collected.  If they are from a random sample and there was no serious problem with nonresponse or other biases, we can trust the confidence interval.

b)  We also would like to see the question wording, so that we can make sure that the question was clear and objective.
 

5.  Improving response rates in surveys

Let p1  =  the percentage of all people in the study who would return their surveys when given a plain cover.
Let p2  =  the percentage of all people in the study who would return their surveys when given a skydiver cover.

The parameter of interest is p1-p2.

1.  If there is no difference in the probabilities of responding, then p1 = p2.   If there is a difference, then p1 not = p2.  Thus,

 
Ho :  p1 -p2 = 0     Ha:  p1 - p2  not = 0
 
2.   The p-value is the chance of seeing a value of the z-test statistic that is as or more extreme than what we got in the sample.  Since the alternative hypothesis is two-sided, "as or more extreme" means far away from zero in either the positive or negative direction.
 
The sample proportion in group 1 is  104 / 207 = 0.5024.     The sample proportion in group 2 is  109 / 213 = 0.5117.    The absolute difference between these sample proportions is  .0093.
 
Even though the sample difference is around only 1%, we still should check whether we are likely to observe such a difference by chance when p1 = p2.  We can use a hypothesis test.
 
The standard error for the hypothesis test is sqrt[ (.5024)(1-.5024)/207 + (.5117)(1-.5117)/213 ] =.0488.

Thus, the z-statistic for the test is:

z = (.0093 - 0) / .0488 = .19.

Because this is a two-sided test, the p-value is the sum of the areas under the normal curve to the left of -.19 and to the right of .19.  Looking on the table, or using JMP-IN, we get  .848 as the p-value.
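
For readers without JMP-IN or a normal table, the same test can be sketched in Python using only the standard library.  The function name two_proportion_z is an assumption made for this example.  It uses the unpooled standard error, as in the hand calculation above; a pooled standard error is a common alternative that this sketch does not use.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test for p1 - p2 = 0 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    # Area in both tails of the standard normal curve beyond |z|.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p1 - p2, se, z, p_value

# Plain cover: 104 of 207 returned.  Skydiver cover: 109 of 213 returned.
diff, se, z, p_value = two_proportion_z(104, 207, 109, 213)
print(round(diff, 4), round(se, 4), round(z, 2), round(p_value, 3))
```

The sign of z is negative here because the function computes plain minus skydiver; the two-sided p-value does not depend on the sign.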

3.    Since the p-value is so large, we cannot reject the null hypothesis.  It does not appear that the  skydiver cover has a different effect on response rates than the plain cover.   A "significance level" of 0.05 means that we consider p-values less than 0.05 to be strong evidence against the null hypothesis.

4.  The lower limit of the 95% confidence interval is -.0093 - 1.96 * .0488   =   -.104954.   The upper limit of the 95% confidence interval is -.0093 + 1.96 * .0488   =   .086311.

Hence, we are 95% confident that the difference in percentage of respondents to the plain cover and the skydiver cover is between -10.5% and 8.6%.  There isn't evidence favoring one cover over the other.

Note this is a 95% CI for p1 - p2, so that it is a likely range for the difference between plain cover response rates minus skydiver cover response rates.  Flipping the order of p1 and p2 would change the signs of the two limits.
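
The interval for p1 - p2, and the effect of flipping the order, can be checked with a short sketch.  The helper name diff_proportion_ci is hypothetical; the counts are those from the problem, and the calculation carries full precision.

```python
import math

def diff_proportion_ci(x1, n1, x2, n2, z=1.96):
    """z-based CI for p1 - p2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Plain minus skydiver.
lower, upper = diff_proportion_ci(104, 207, 109, 213)
# Skydiver minus plain: same interval with the signs flipped.
flipped_lower, flipped_upper = diff_proportion_ci(109, 213, 104, 207)

print(round(lower, 4), round(upper, 4))
print(round(flipped_lower, 4), round(flipped_upper, 4))
```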

5. For this confidence interval and hypothesis test to be valid, we need to assume that:

a).  The data in each group are representative of the study population (i.e., it was a simple randomized experiment).
b).   The Central Limit Theorem holds for the sample proportion in each group.

It was a randomized experiment, so (a) holds.

The Central Limit Theorem certainly holds, since it appears that n1 * p1 > 5  and  n1 *  (1-p1) > 5     and that     n2 * p2 > 5  and  n2 *  (1-p2) > 5.      We check this by  using .5024 as an estimate of p1 and .5117  as an estimate of p2.

Given that the surveys were sent out at random, the two groups should have similar background characteristics, so that a fair comparison can be made.

6.  Comments:

I would like to know whether the people who received the letter are representative of all skydivers.   If so, then we can generalize these results to all skydivers instead of just the people in the study.
 

6.   Questionnaire wording in action

In the first survey, the outcomes are combined into two opinions of "favor death penalty" and "not favor the death penalty".  In the second survey, we consider the two answers of  "favor life imprisonment" and "have no opinion" as one category: "not favor death penalty". Thus the outcome in both surveys is dichotomous.

1.  In the first survey, the total number of trials is (385 + 119 + 39) = 543.

Let p1 be the population percentage of people who favor the death penalty when asked this question.

We are going to get a 95% CI for p1 .  Since the data are collected by a random sample, a box model applies.

In the survey, the percentage favoring the death penalty is 385 / 543 = .709.  So, the estimated standard error is
square root (.709 * (1-.709) / 543)  = .0195.

The lower limit of the 95% confidence interval is .709 - 1.96 * .0195 = .671.   The upper limit of the 95% confidence interval is .709 + 1.96 * .0195  = .747.

Hence, according to the responses to Question 1, we are 95% confident that the percentage of Americans who favor the death penalty is between 67.1 and 74.7 percent.

2. In the second survey, the total number of trials is (286 + 194 + 31) = 511.

Let  p2 be the population percentage of people who favor the death penalty when asked this question.  Note that this value of  p2  may differ from the value of  p1  in part 1.

We are going to get a 95% CI for p2.  Since the data are collected by a random sample, a box model applies.

In this survey, the percentage favoring the death penalty is 286 / 511 = .560.   The standard error is
square root (.560 * (1-.560) / 511)  = .022.

The lower limit of the 95% confidence interval is .560 - 1.96 * .022 = .517.   The upper limit of the 95% confidence interval is .560 + 1.96 * .022  = .603.

Based on the 95% confidence interval, we believe that around 52% to 60% of people respond favorably for the death penalty when this question is asked.

3.  The wording does appear to affect the way people respond about the death penalty.  The two confidence intervals are far apart (they don't overlap at all), indicating that more people respond favorably to Question 1 than to Question 2.
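
The comparison of the two intervals can be reproduced as follows.  This is an illustrative sketch: the helper proportion_interval and the overlap check are written for this example, using the response counts given in parts 1 and 2.

```python
import math

def proportion_interval(successes, n, z=1.96):
    """Return (lower, upper) of a z-based CI for a proportion."""
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

ci_q1 = proportion_interval(385, 385 + 119 + 39)   # Question 1, n = 543
ci_q2 = proportion_interval(286, 286 + 194 + 31)   # Question 2, n = 511

# Two intervals overlap when each one starts before the other ends.
overlap = ci_q1[0] <= ci_q2[1] and ci_q2[0] <= ci_q1[1]

print([round(x, 3) for x in ci_q1], [round(x, 3) for x in ci_q2], overlap)
```

The Question 2 interval ends below the point where the Question 1 interval begins, so the overlap check is False.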

7.  Composition of Ancient Earth's Atmosphere

1.  The hypotheses for the two tests are:

       1) (Nitrogen) Ho: mu.N2 = 78.1 and Ha: mu.N2 not = 78.1;
       2) (Oxygen) Ho: mu.O2 = 20.9 and Ha: mu.O2 not = 20.9.
 

Using the t-test procedure on JMP-IN, both t-tests have p-values less than .0001. Thus, we reject both null hypotheses. The composition of ancient air appears to have had different concentrations of nitrogen and oxygen than modern air.

Let's show the test for nitrogen by hand.  The sample mean concentration of the nine samples of nitrogen is  59.59.  Their SD equals 6.25, so that the SE equals 6.25 / sqrt[9] = 2.08.  The absolute value of the test statistic equals:

|t| = | (59.59 - 78.1)/2.08 |  =  8.88.

Because the sample size is small, we use a t-distribution with 8 degrees of freedom (9-1=8) to calculate the p-value.  That is, we add the areas under the t-curve with eight degrees of freedom to the left of -8.88 and to the right of 8.88.  This area is very small, less than .0001.
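
The standard error and test statistic for nitrogen can be checked directly from the summary numbers quoted above (sample mean 59.59, SD 6.25, nine samples).  The variable names below are chosen for this sketch only.

```python
import math

n = 9                  # number of nitrogen samples
sample_mean = 59.59    # sample mean concentration
sample_sd = 6.25       # sample SD
null_mean = 78.1       # concentration under the null hypothesis

se = sample_sd / math.sqrt(n)              # standard error of the sample mean
t_stat = (sample_mean - null_mean) / se    # one-sample t statistic
df = n - 1                                 # degrees of freedom for the t-curve

print(round(se, 2), round(abs(t_stat), 2), df)
```

Without rounding the SE first, |t| comes out at about 8.88, in line with the hand calculation.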

2.   The criticisms include: 1) (definite criticism) the observations are not independent, since the samples come from the same rock; and, 2) (possible criticism) the N2 values are not symmetric around the mean, which could make the tests inaccurate with a small sample size. Although the data for nitrogen are not quite normally distributed, that doesn't bother me too much since there is such a large disparity between the sample mean of N2 and 78.1.   The bigger question is whether these samples are representative of ancient air. Who really knows if the air trapped in resin was like other air in the atmosphere?

8. The effects of logging on tropical rainforests

1.  It is important that the loggers not know about the study, because otherwise they might log areas in such a way that logging does not appear as damaging.  Plus, to make the study conditions as realistic as possible, we want the loggers to select areas as they normally would.

2.   This is a two-sample study.

Let mu.log = population mean number of species in logged plots. Let mu.not = population mean number of species in non-logged plots.

The hypotheses are: Ho: mu.not - mu.log <= 0; Ha: mu.not - mu.log > 0.

Using a t-test in JMP with the greater than alternative hypothesis, the p-value is around .02.  Thus, there is some evidence that the number of species in logged plots is less than the number in not logged plots.  The evidence is somewhat exaggerated by the use of a one-sided hypothesis. If possible, it would be helpful to collect more data before using these results to inform any policy decisions.

By hand, the difference in the sample means equals 3.83 (17.5 - 13.67).  The sample SE equals root(1.5*1.5 + 1.0188*1.0188) = 1.81.  The test statistic equals

t = (3.83 - 0) / 1.81 = 2.11.

This differs slightly from the test statistic on JMP; don't worry about it.  I can explain if you want the technical details.

3.  We note that this is a 99% CI, so it is wider than the usual 95% CI.   Using df = 19 as determined by JMP, we have a 99% CI for mu.log - mu.not of -3.83 +/- 2.86 root(1.5*1.5 + 1.0188*1.0188)  =    (-9.01,  1.34).  The confidence interval is mostly negative, suggesting that logged plots have fewer species.  However, it does contain positive values.  If we were demanding the use of a  99% CI, more evidence would be needed before stating conclusively that the logging appears to reduce species diversity.
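
The hand calculation for this interval can be sketched as below, taking the difference as logged minus not logged so that a negative value means fewer species in logged plots.  The group means, the two standard errors (1.5 and 1.0188), and the multiplier 2.86 are the values quoted in this answer; the variable names are made up for the example.  Expect the limits to differ from the reported ones in the second decimal place, since everything here starts from rounded inputs.

```python
import math

mean_not = 17.5           # sample mean, not logged plots
mean_log = 13.67          # sample mean, logged plots
group_ses = (1.5, 1.0188) # the two group standard errors quoted in the text
t_mult = 2.86             # 99% t multiplier for df = 19, as quoted in the text

# SE of a difference of independent means: root of the sum of squared SEs.
se_diff = math.sqrt(sum(s * s for s in group_ses))
diff = mean_log - mean_not                # logged minus not logged
t_stat = (mean_not - mean_log) / se_diff  # statistic for Ha: mu.not - mu.log > 0

lower = diff - t_mult * se_diff
upper = diff + t_mult * se_diff

print(round(se_diff, 2), round(t_stat, 2), round(lower, 2), round(upper, 2))
```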

4.  The assumptions include:  the data were collected in simple random samples in each group; and, the CLT applies in each group.  As long as the data are normally distributed in each group, the CLT assumption is reasonable.  However, there is an outlier in the "logged" group.  That's why we check further below.

5.  The p-value without the data point is .048.   This is weaker evidence against the null hypothesis than before, since the p-value is larger. But, the p-value remains small enough to suggest that logged plots have fewer species.  Hence, the outlier does not change our conclusions meaningfully.

6.  We conclude that there is evidence that logged plots have lower numbers of species, but that the evidence is not overwhelmingly strong. The difference in number of species is around 3.833, with a 99% CI stretching from 9 less to one more.   Whether this difference is large enough to have practical implications is a question best addressed by forestry and biology experts.

9.  Articles

1.  Explanation of the **:  assuming it is true that there is no difference in average insulin (glucose) levels between the two types of mice, the chance that we would observe a difference in average insulin (glucose) levels in the sample that is as or more extreme than the sample difference in this data set is less than .005. Such events are so rare that we reject the hypothesis that there is no difference between the two types of mice, and we conclude that there is a difference. The same logic describes the *, but now the definition of rare is an event that happens less than 5 percent of the time (but more than .5 percent of the time).

10.  Glaucoma

a)  This is a matched pairs study.   The sample average of the difference variable is -4.  The SE = 3.8.  Hence, the 95% CI  is   -4 +/- (2.36)(3.8) = (-12.982, 4.902) .  The mostly negative 95% CI suggests that glaucoma eyes may be thinner, but the data do not provide sufficiently strong evidence to say which eyes tend to be thicker.  Clearly, more data are needed before strong conclusions can be reached.

Note that there are 7 df (8 - 1 = 7) in the t-distribution used in the CI.

b)    Let mu1 = avg. thickness of eyes with glaucoma.   Let mu2 = avg. thickness of eyes without glaucoma.

Ho: mu1 - mu2  = 0
Ha: mu1 - mu2  not = 0

From JMP, we get a p-value of .327.  Thus, we cannot reject the null  hypothesis: there is not enough evidence to suggest that glaucoma and non-glaucoma eyes have different average thicknesses.

By hand, the absolute value of the test statistic equals:

|t| = | (-4 - 0) / 3.8 | = 1.05

To get the p-value, we add the area under the t-curve with 7 degrees of freedom to the left of -1.05 and to the right of 1.05.  This equals .327.
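
The tail-area calculation can be approximated without JMP or a t-table.  The sketch below is one possible approach, not the method JMP uses: it integrates the t density numerically with Simpson's rule, and the function names t_pdf and two_sided_p are assumptions made for this example.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=2000):
    """Area under the t-curve left of -|t| plus right of |t| (steps must be even)."""
    b = abs(t_stat)
    h = b / steps
    # Simpson's rule on [0, |t|]: endpoints, then weights 4 (odd) and 2 (even).
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_b = total * h / 3
    # By symmetry, the area between -|t| and |t| is twice the area from 0 to |t|.
    return 1 - 2 * area_0_to_b

mean_diff = -4.0   # sample average of the paired differences
se = 3.8           # standard error of that average
n_pairs = 8        # number of pairs, so df = 7

t_stat = (mean_diff - 0) / se
p_value = two_sided_p(t_stat, n_pairs - 1)

print(round(abs(t_stat), 2), round(p_value, 3))
```

Numerical integration is used here only because the Python standard library has no built-in t-distribution; a statistics package would give the same area directly.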

c)  We assume that the data are collected from simple random samples (they are), that the eyes are paired (they are), that each person is independent (they are), and that the differences are normally distributed (they seem to be).

 d) The inferences are only directly relevant for people who have one type of each eye.  There may be a difference in the thicknesses of eyes for people who have all glaucoma or all non-glaucoma eyes.  Any conclusions for this population are based on the scientific extrapolation that eye thicknesses for the one-of-each type person  are similar to eye thicknesses for the all-of-one type person.