Statistics 101
Data Analysis and Statistical Inference
Instructions for Lab 12
Lab Objective
The purpose of the lab is to gain some experience with multiple
regression.
Lab Procedures
Recall the SAT data from Lab 8.
In those data, we saw that the relationship between expenditures
and total score was negative. But, there was a third variable that
was strongly associated with expenditures and SAT scores, namely the
percentage taking the SAT. When we looked at scatter plots
involving these three variables, it appeared that the relationship
between the percentage taking the SAT and the scores was stronger than
the relationship between expenditures and scores, so that percent taking
the exam might explain the apparently negative relationship between
expenditures and scores.
Multiple regression is designed to parcel out the effects of these
variables. Let's run a multiple regression using the SAT data to
estimate the association between SAT scores and expenditures,
controlling for percent taking.
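To make "controlling for" concrete, here is a minimal sketch in Python. It
relies on a standard result: the multiple regression slope for
expenditures equals the slope you get after removing the part of both
expenditures and scores that is linearly explained by percent taking.
The numbers are made up for illustration (they are NOT the lab's SAT
data), the scores are generated without noise, and the helper names
slope and residuals are hypothetical.

```python
def slope(x, y):
    """Least-squares slope of y on x (one predictor, with intercept)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """Residuals from the least-squares line of y on x."""
    n = len(x)
    b = slope(x, y)
    a = sum(y) / n - b * sum(x) / n
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Made-up values, NOT the lab's SAT data.
# Expenditures rise with percent taking; scores fall with percent taking
# but rise with expenditures (true slope 10) once percent is held fixed.
percent = [5.0 + 5.0 * i for i in range(16)]
expend = [4.0 + 0.03 * p + (0.5 if i % 2 == 0 else -0.5)
          for i, p in enumerate(percent)]
score = [1100.0 - 3.0 * p + 10.0 * e for p, e in zip(percent, expend)]

# Slope ignoring percent taking.
simple_slope = slope(expend, score)

# Slope after removing percent taking from both variables.
adjusted_slope = slope(residuals(percent, expend),
                       residuals(percent, score))

print("slope ignoring percent taking:      ", simple_slope)
print("slope controlling for percent taking:", adjusted_slope)
```

In this constructed example the two slopes have opposite signs, which is
the kind of reversal the lab asks you to look for in the real data.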
Questions:
1) Fit a multiple regression for "Total SAT score" on
"Expenditures" and "Percent taking." Go to Analyze - Fit
Model. Put "Total SAT score" in the Y box, then
highlight "Expenditures" and hit Add. Next,
highlight "Percent taking" and hit Add. Then hit Run
Model.
a) What are the estimates of the intercept and the
coefficients (slopes)? Write each estimate on your report.
b) What is the value of the typical deviation of points from the
regression line?
c) What percentage of the variation in total SAT scores is
explained by this regression?
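For readers who want to see where the three quantities in parts a)
through c) come from, the following is a rough sketch in Python using
only the standard library. It is not the procedure the lab software
uses internally. The data are made up (NOT the lab's SAT data), and the
names solve and fit_ols are hypothetical helpers. The typical deviation
is computed as the square root of the residual sum of squares divided
by n minus the number of estimated coefficients.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest entry into the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(y, predictors):
    """Least-squares fit of y on the predictors, with an intercept.

    Returns (coefficients, typical deviation, R-squared, residuals).
    coefficients[0] is the intercept; the rest follow predictor order.
    """
    n = len(y)
    rows = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    k = len(rows[0])
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    sse = sum(e * e for e in resid)
    ybar = sum(y) / n
    sst = sum((yi - ybar) ** 2 for yi in y)
    rmse = math.sqrt(sse / (n - k))   # typical deviation from the fit
    r2 = 1.0 - sse / sst              # share of variation explained
    return coef, rmse, r2, resid


# Made-up values, NOT the lab's SAT data.
percent = [5.0 + 5.0 * i for i in range(16)]
expend = [4.0 + 0.03 * p + (0.5 if i % 2 == 0 else -0.5)
          for i, p in enumerate(percent)]
noise = [8, -5, 3, -9, 6, -2, -7, 4, 1, -6, 9, -3, 5, -8, 2, -4]
score = [1100.0 - 3.0 * p + 10.0 * e + w
         for p, e, w in zip(percent, expend, noise)]

coef, rmse, r2, resid = fit_ols(score, [expend, percent])
print("intercept, slope(expend), slope(percent):", coef)
print("typical deviation:", rmse)
print("percent of variation explained:", 100.0 * r2)
```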
2a) Let's examine plots of residuals versus each of the
predictors to make sure the model fits the data reasonably well.
Click on the red arrow next to Response - Total SAT score.
Click Save Columns - Residuals. This adds the
residuals from the model to your data file. Now you can look at
scatterplots of the residuals versus the predictors in the usual way we
look at scatterplots of any two variables (Fit Y by X).
Nonrandom patterns in these plots (e.g., curves) indicate the
regression assumptions do not hold for these data. Describe
what patterns (e.g., random or non-random) you see in the residual
plots.
2b) Based on the residual plots you just described, do you think the
regression assumptions hold for this model? A simple sentence saying
you think they hold or you think they do not hold will suffice.
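As a numerical companion to the residual plots, here is a simplified
sketch in Python of what a curved residual pattern looks like. To keep
it short it uses one predictor rather than two, and made-up data (NOT
the lab's SAT data) in which the response follows a logarithmic curve
by construction. The helper names fit_line and mean are hypothetical.

```python
import math


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = (sum((a - xbar) * (c - ybar) for a, c in zip(x, y))
         / sum((a - xbar) ** 2 for a in x))
    return ybar - b * xbar, b


def mean(values):
    return sum(values) / len(values)


# Made-up values, NOT the lab's SAT data. The response is curved in
# the predictor by construction; percent is already sorted.
percent = [5.0 + 5.0 * i for i in range(16)]
score = [1200.0 - 80.0 * math.log(p) for p in percent]

# Fit a straight line anyway and compute the residuals.
a, b = fit_line(percent, score)
resid = [y - (a + b * x) for x, y in zip(percent, score)]

# Average residual in the low, middle, and high range of the predictor.
low, mid, high = mean(resid[:5]), mean(resid[5:11]), mean(resid[11:])
print("mean residual, low / middle / high:", low, mid, high)
```

If the straight line were adequate, the three averages would all sit
near zero. Here the ends and the middle have opposite signs, which is
the numerical counterpart of a curve in the residual plot.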
Clearly, the plots don't look random for Percent taking. To deal with
the curved pattern, let's try using the natural logarithm of Percent
taking instead of the untransformed Percent taking. We do this because
the graph of y = log(x) is a curve, as is the relationship between
"Total SAT score" and "Percent taking."
Create a new column for "Log(Percent taking)" using Formula: highlight
"Percent taking" and select Transcendental - Log.
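The effect of the transformation can be sketched in Python as follows.
The example builds a natural-log column with math.log and compares how
much of the variation in the response a straight line explains before
and after the transformation. The data are made up (NOT the lab's SAT
data) and follow a logarithmic curve plus a small alternating
disturbance; r_squared is a hypothetical helper that applies only to a
one-predictor fit.

```python
import math


def r_squared(x, y):
    """Squared correlation of x and y (R-squared of a one-predictor fit)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Made-up values, NOT the lab's SAT data.
percent = [5.0 + 5.0 * i for i in range(16)]
score = [1200.0 - 80.0 * math.log(p) + (3.0 if i % 2 == 0 else -3.0)
         for i, p in enumerate(percent)]

# New column: natural logarithm of percent taking.
log_percent = [math.log(p) for p in percent]

r2_raw = r_squared(percent, score)
r2_log = r_squared(log_percent, score)
print("R-squared using Percent taking:     ", r2_raw)
print("R-squared using Log(Percent taking):", r2_log)
```

Because these data were generated from a logarithmic curve, the
transformed predictor fits better here. Whether the same holds for the
SAT data is what Questions 3 and 4 ask you to check.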
3) Fit the multiple regression of "Total SAT score" on
"Expenditures" and "Log(Percent taking)."
a) What are the estimates of the intercept and the coefficients
(slopes)? Write each estimate on your report.
b) What is the value of the typical deviation of points from the
regression line?
c) What percentage of the variation in total SAT scores is
explained by this regression?
4) Perform model checks like those in Question 2. Based on
the plots of residuals, do you think the regression assumptions hold
for this model? A simple sentence saying that you think they hold
or do not hold will suffice.
5) Based on the regression coefficients from 3a, do expenditures
appear to be positively or negatively associated with total SAT scores,
controlling for the (logarithm of) percent taking the test? A
short answer will suffice.
6a) Would you be willing to claim that raising expenditures
causes SAT scores to increase? Explain in at most two sentences.
6b) Would you be willing to use this regression to make
predictions about SAT scores at the school level? Explain in at
most two sentences.