Statistics 101
Data Analysis and Statistical Inference
 

Instructions for lab 11


Lab Objective

The purpose of the lab is to gain some experience with multiple regression.

Lab Procedures


Recall the SAT data from Lab 8.  In those data, the relationship between expenditures and total SAT score was negative.  However, a third variable, the percentage taking the SAT, was strongly associated with both expenditures and SAT scores.  When we looked at scatter plots involving these three variables, the relationship between the percentage taking the SAT and the scores appeared stronger than the relationship between expenditures and SAT scores, so it might in fact explain the apparently negative relationship between expenditures and scores.

Multiple regression is designed to help parcel out the effects of these variables.  Let's run a multiple regression using the SAT data to estimate the association of SAT scores and expenditures, controlling for percent taking.
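To see what "controlling for" a variable can do, here is a minimal Python sketch on simulated data.  It does not use the SAT data; the variable names, coefficients, and noise levels are all made up for illustration.  The response depends positively on one predictor and negatively on a second predictor that is itself correlated with the first.  Under these assumptions, the slope from regressing on the first predictor alone has the opposite sign from its slope when both predictors are in the model.

```python
import numpy as np

# Simulated data for illustration only; none of these numbers come from the lab's data set.
rng = np.random.default_rng(1)
n = 100
z = rng.uniform(5.0, 80.0, size=n)                    # lurking third variable
x = 4.0 + 0.05 * z + rng.normal(0.0, 0.5, size=n)     # predictor of interest, correlated with z
y = 1000.0 + 10.0 * x - 3.0 * z + rng.normal(0.0, 5.0, size=n)

ones = np.ones(n)

# Simple regression: y on x only.
simple_coef, *_ = np.linalg.lstsq(np.column_stack([ones, x]), y, rcond=None)

# Multiple regression: y on x and z together.
multiple_coef, *_ = np.linalg.lstsq(np.column_stack([ones, x, z]), y, rcond=None)

slope_x_alone = simple_coef[1]
slope_x_given_z = multiple_coef[1]
print("slope of x, ignoring z:       ", slope_x_alone)
print("slope of x, controlling for z:", slope_x_given_z)
```

Whether the same reversal happens in the SAT data is something only the actual regression output can tell you.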

Questions:


1)  Let's fit a multiple regression for "Total SAT score" on "Expenditures" and "Percent taking."   Go to Analyze - Fit Model.  Put "Total SAT score" in the Y box, then highlight "Expenditures" and hit Add.  Next, highlight "Percent taking" and hit Add.  Then hit Run Model.

a)  What are the estimates of the intercept and the coefficients (slopes)? Write each quantity down.
b)  What is the value of the typical deviation of points from the regression line?
c)  What percentage of the variation in total SAT scores is explained by this regression?
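
For reference, the three quantities asked for in parts a) to c) can be computed by hand from a least-squares fit.  The Python sketch below does this on simulated stand-in data, not the SAT data; the names expend, pct, and score are hypothetical.  It assumes the "typical deviation" means the root mean square error, computed with n - 3 degrees of freedom because three coefficients are estimated.

```python
import numpy as np

# Simulated stand-in data (NOT the lab's SAT data); all names and numbers are hypothetical.
rng = np.random.default_rng(0)
n = 50
expend = rng.uniform(3.0, 10.0, size=n)    # first predictor
pct = rng.uniform(4.0, 80.0, size=n)       # second predictor
score = 1000.0 + 10.0 * expend - 3.0 * pct + rng.normal(0.0, 30.0, size=n)

# Design matrix: a column of ones for the intercept, then the two predictors.
X = np.column_stack([np.ones(n), expend, pct])
coef, *_ = np.linalg.lstsq(X, score, rcond=None)
intercept, b_expend, b_pct = coef

fitted = X @ coef
resid = score - fitted

# Root mean square error: the typical size of a residual,
# using n - 3 degrees of freedom (three estimated coefficients).
rmse = np.sqrt(np.sum(resid**2) / (n - X.shape[1]))

# R-squared: the share of variation in the response explained by the fit.
r2 = 1.0 - np.sum(resid**2) / np.sum((score - score.mean())**2)

print("intercept and slopes:", intercept, b_expend, b_pct)
print("RMSE:", rmse, " R-squared:", r2)
```

Multiplying R-squared by 100 gives the percentage of variation explained.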

2a)  Let's examine plots of residuals versus each of the predictors to make sure the model fits the data reasonably well.  Click on the red arrow next to Response - Total SAT score.  Click Save Columns - Residuals.  This adds the residuals from the model to your data file.  Now you can look at scatterplots of the residuals versus the predictors in the usual way we look at scatterplots of any two variables (Fit Y by X).  Nonrandom patterns in these plots (e.g., curves) indicate that the regression assumptions do not hold for these data.  Describe what patterns (e.g., random or nonrandom) you see in the residual plots.

2b) Based on your description of the residual plots, do you think the regression assumptions hold for this model?  A simple sentence saying you think they hold or you think they do not hold will suffice.
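
As a rough numerical companion to the residual plots, the Python sketch below averages the residuals within the lowest, middle, and highest third of each predictor.  The data are simulated (not the SAT data), the helper mean_resid_by_thirds is a hypothetical name, and the response is deliberately built so that it depends on x2 through a curve.  Averages that stay near zero in every third suggest no pattern; averages that swing in sign across the thirds suggest curvature.  A scatterplot remains the better tool, so treat this only as a sketch.

```python
import numpy as np

# Simulated data for illustration only (not the lab's data).
rng = np.random.default_rng(2)
n = 60
x1 = rng.uniform(3.0, 10.0, size=n)     # enters the response linearly
x2 = np.linspace(4.0, 80.0, n)          # enters the response through its logarithm
y = 1200.0 + 5.0 * x1 - 80.0 * np.log(x2) + rng.normal(0.0, 5.0, size=n)

# Fit a model that (wrongly) treats both predictors as linear.
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

def mean_resid_by_thirds(predictor, residuals):
    """Average residual in the lowest, middle, and highest third of a predictor."""
    order = np.argsort(predictor)
    return [float(residuals[idx].mean()) for idx in np.array_split(order, 3)]

low, mid, high = mean_resid_by_thirds(x2, resid)
print("mean residual by thirds of x2:", low, mid, high)
print("mean residual by thirds of x1:", mean_resid_by_thirds(x1, resid))
```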

Clearly, the plots don't look random for Percent taking.   To deal with the curved pattern, let's try using the natural logarithm of Percent taking instead of the untransformed Percent taking.  We do this because the graph of y = log(x) is a curve, which is the kind of shape we want for describing the relationship between "Total SAT score" and "Percent taking."  Create a new column for "Log(Percent taking)" by using the Formula editor: highlight "Percent taking" and select Transcendental - Log.
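
The same transformation can be sketched in Python, where np.log is the natural logarithm.  The example below uses simulated data (not the SAT data) in which the response really does depend on the logarithm of x2, and the helper sum_sq_resid is a hypothetical name.  Under that assumption, replacing x2 with log(x2) leaves much less unexplained variation.

```python
import numpy as np

# Simulated data for illustration only (not the lab's data).
rng = np.random.default_rng(3)
n = 60
x1 = rng.uniform(3.0, 10.0, size=n)
x2 = np.linspace(4.0, 80.0, n)
y = 1200.0 + 5.0 * x1 - 80.0 * np.log(x2) + rng.normal(0.0, 5.0, size=n)

# New "column": the natural logarithm (base e) of x2.
log_x2 = np.log(x2)

def sum_sq_resid(*predictors):
    """Residual sum of squares from a least-squares fit of y with an intercept."""
    X = np.column_stack([np.ones(n), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ coef) ** 2))

sse_raw = sum_sq_resid(x1, x2)        # x2 untransformed
sse_log = sum_sq_resid(x1, log_x2)    # x2 replaced by its logarithm
print("residual sum of squares, untransformed x2:", sse_raw)
print("residual sum of squares, log(x2):         ", sse_log)
```

A smaller residual sum of squares is only one piece of evidence; the residual plots for the refitted model still need to be checked.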

3)  Fit the multiple regression of "Total SAT score" on "Expenditures" and "Log(Percent taking)."
a)  What are the estimates of the intercept and the coefficients (slopes)? Write each quantity down.
b)  What is the value of the typical deviation of points from the regression line?
c)  What percentage of the variation in total SAT scores is explained by this regression?

4)  Perform the model checks that you did in Question 2.  Based on the plots of residuals, do you think the regression assumptions hold for this model?  A simple sentence saying that you think they hold or do not hold will suffice.

5)  Based on the regression coefficient of Expenditures that you reported in 3a, do expenditures appear to be positively or negatively associated with total SAT scores, controlling for the (logarithm of) percent taking the test?  A short answer will suffice.

6a)  Would you be willing to claim that raising expenditures causes SAT scores to increase?  Explain in at most two sentences.
6b)  Would you be willing to use this regression to make predictions about SAT scores at the school level?  Explain in at most two sentences.