Statistics 101
Data Analysis and Statistical Inference
Instructions for Lab 11
Lab Objective
The purpose of the lab is to gain some experience with multiple
regression.
Lab Procedures
Recall the SAT data from Lab 8.
In those data, the relationship between expenditures and total score
was negative. However, a third variable, the percentage taking the
SAT, was strongly associated with both expenditures and SAT scores.
In scatter plots of these three variables, the relationship between
the percentage taking the SAT and the scores appeared stronger than
the relationship between expenditures and SAT scores, so the
percentage taking the SAT might in fact explain the apparently
negative relationship between expenditures and scores.
Multiple regression is designed to help parcel out the effects of these
variables. Let's run a multiple regression using the SAT data to
estimate the association of SAT scores and expenditures, controlling
for percent taking.
Questions:
1) Let's fit a multiple regression for "Total SAT score" on
"Expenditures" and "Percent taking." Go to Analyze - Fit
Model. Put "Total SAT score" in the Y box, then
highlight "Expenditures" and hit Add. Next,
highlight "Percent taking" and hit Add. Then hit Run
Model.
a) What are the estimates of the intercept and the
coefficients (slopes)? Write each quantity down.
b) What is the value of the typical deviation of points from the
regression line?
c) What percentage of the variation in total SAT scores is
explained by this regression?
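
For anyone who wants to see the same calculation outside JMP, the fit in Question 1 can be sketched in Python with NumPy. This is only a sketch under stated assumptions: the data are randomly generated stand-ins, and the names `expend`, `percent`, and `score` are hypothetical, so the numbers it prints are not the answers to this lab.

```python
import numpy as np

# Hypothetical data, NOT the lab's SAT data: 50 made-up observations.
rng = np.random.default_rng(0)
n = 50
percent = rng.uniform(4, 80, size=n)                     # stand-in for "Percent taking"
expend = 4 + 0.03 * percent + rng.normal(0, 1, size=n)   # stand-in for "Expenditures"
score = 1150 + 10 * expend - 3 * percent + rng.normal(0, 30, size=n)

# Design matrix: a column of ones (intercept) plus the two predictors.
X = np.column_stack([np.ones(n), expend, percent])

# Least-squares estimates of the intercept and the two slopes.
coef, *_ = np.linalg.lstsq(X, score, rcond=None)
resid = score - X @ coef

# Root mean square error: the typical deviation of points from the fit,
# using n minus the number of estimated coefficients as the divisor.
rmse = np.sqrt(np.sum(resid**2) / (n - X.shape[1]))

# R-squared: the fraction of variation in the response explained by the fit.
r2 = 1 - np.sum(resid**2) / np.sum((score - score.mean())**2)

print("intercept, slope(expend), slope(percent):", coef)
print("RMSE:", rmse)
print("R-squared:", r2)
```

The three printed quantities correspond to parts a), b), and c) above, but only for the made-up data.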
2) Let's examine plots of residuals versus each of the
predictors to make sure the model fits the data reasonably well.
Click on the red arrow next to Response - Total SAT score.
Click Save Columns - Residuals. This adds the
residuals from the model to your data file. Now you can look at
scatterplots of the residuals versus the predictors in the usual way
we look at scatterplots of any two variables (Fit Y by X).
Nonrandom patterns in these plots (e.g., curves) indicate the
regression assumptions do not hold for these data. Describe
what patterns (e.g., random or non-random) you see in the residual
plots.
2b) Based on your answer to Question 2, do you think the regression
assumptions hold for this model? A simple sentence saying you
think they hold or you think they do not hold will suffice.
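
The residual check in Question 2 can also be sketched in Python. The sketch below assumes made-up data in which the response truly depends on the logarithm of a predictor, so a straight-line fit is wrong on purpose. In place of looking at a scatterplot, it uses a simple numeric stand-in (not part of the lab procedure): the correlation between the residuals and the squared distance of the predictor from its mean. All variable names are hypothetical.

```python
import numpy as np

# Hypothetical data, NOT the lab's SAT data. The made-up scores depend on
# log(percent), so a fit that is linear in percent is deliberately wrong.
rng = np.random.default_rng(1)
n = 50
percent = rng.uniform(4, 80, size=n)
expend = rng.uniform(3, 10, size=n)
score = 1300 + 5 * expend - 80 * np.log(percent) + rng.normal(0, 10, size=n)

# Fit the model that is linear in both predictors and keep the residuals
# (the analogue of saving a residual column).
X = np.column_stack([np.ones(n), expend, percent])
coef, *_ = np.linalg.lstsq(X, score, rcond=None)
resid = score - X @ coef

# Least squares forces the residuals to be linearly uncorrelated with each
# predictor, so this value is essentially zero and cannot reveal a problem.
corr_linear = np.corrcoef(resid, percent)[0, 1]

# A curve shows up as residuals of one sign at both ends of the predictor
# and the other sign in the middle, i.e. a correlation with squared distance
# from the predictor's mean.
curve = (percent - percent.mean()) ** 2
corr_curve = np.corrcoef(resid, curve)[0, 1]

print("correlation of residuals with percent:", corr_linear)
print("correlation of residuals with squared distance from mean:", corr_curve)
```

A value far from zero in the second line is the numeric counterpart of a curved pattern in the residual plot; a scatterplot remains the more informative check.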
Clearly, the plots don't look random for Percent taking. To deal with
the curved pattern, let's try using the natural logarithm of Percent
taking instead of Percent taking untransformed.
We do this because the graph of y = log(x) is a curve, which is
what we want to describe the relationship between "Total SAT score" and
"Percent taking." Create a new column for "Log(Percent taking)"
using the Formula editor: highlight "Percent taking" and select
Transcendental - Log.
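
As a rough illustration of why the transformation helps, the sketch below creates a log column in Python and compares the fit with and without it. It assumes made-up data generated from a logarithmic relationship, so the log model wins by construction; the variable names and the helper `r_squared` are hypothetical and the numbers say nothing about the real SAT data.

```python
import numpy as np

# Hypothetical data, NOT the lab's SAT data.
rng = np.random.default_rng(2)
n = 50
percent = rng.uniform(4, 80, size=n)
expend = rng.uniform(3, 10, size=n)
score = 1300 + 5 * expend - 80 * np.log(percent) + rng.normal(0, 10, size=n)

# New column: np.log is the natural logarithm.
log_percent = np.log(percent)

def r_squared(predictors, y):
    """Fit y on an intercept plus the given predictors; return R-squared."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)

r2_raw = r_squared([expend, percent], score)
r2_log = r_squared([expend, log_percent], score)

print("R-squared with percent:     ", r2_raw)
print("R-squared with log(percent):", r2_log)
```

With real data there is no guarantee that the log model fits better, which is why Question 4 repeats the residual checks.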
3) Fit the multiple regression of "Total SAT score" on
"Expenditures" and "Log(Percent taking)."
a) What are the estimates of the intercept and the coefficients
(slopes)? Write each quantity down.
b) What is the value of the typical deviation of points from the
regression line?
c) What percentage of the variation in total SAT scores is
explained by this regression?
4) Perform the model checks that you did in Question 2.
Based on the plots of residuals, do you think the regression
assumptions hold for this model? A simple sentence saying that you
think they hold or do not hold will suffice.
5) Based on the regression coefficient of Expenditures you reported
in 3a, do expenditures appear to be positively or negatively
associated with total SAT scores, controlling for the (logarithm of)
percent taking the test? A short answer will suffice.
6a) Would you be willing to claim that raising expenditures
causes SAT scores to increase? Explain in at most two sentences.
6b) Would you be willing to use this regression to make
predictions about SAT scores at the school level? Explain in at
most two sentences.