Statistics 101
Data Analysis and Statistical Inference
 

Instructions for lab 4


Lab Objective

To gain experience with correlations and simple regressions.

Lab Procedures

This lab has several parts, so I recommend that you start it at home.  Work with the applets in lab, though, so that you can get the TAs' feedback.

Unit 1:  Predicting eruptions of Old Faithful

The geyser Old Faithful in Yellowstone National Park erupts at seemingly random times.  Or does it?  Perhaps we can predict when the next eruption will occur based on characteristics of the previous eruption.  Load the data set geyser.jmp by clicking on the link.  It consists of two variables measured on 21 eruptions of Old Faithful: the duration of the previous eruption in minutes (LAST) and the number of minutes between the previous and current eruptions (NEXT).

Questions:

1.  Describe the distribution of waiting time until eruption (NEXT).  Specifically, answer the following four questions.  What is the value of a typical waiting time?  What is the value of a typical deviation from the average waiting time?  Are there any severe outliers in waiting time?  Does a normal curve describe the histogram of waiting time reasonably well (justify your answer by referring to graphical displays)?  

2.  Is there a strong linear association between waiting time and length of previous eruption?  Provide one number that summarizes the strength of the association, and give a brief description of some relevant graph to justify that the number is an appropriate summary.
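For reference, the one-number summary of linear association is the correlation coefficient.  The following is a minimal Python sketch of how it is computed.  The function name pearson_r and the five data pairs are assumptions made for illustration; the values are made up and are not the geyser.jmp data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squares of the deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative values, NOT the geyser.jmp data.
last = [1.0, 2.0, 3.0, 4.0, 5.0]
next_wait = [52.0, 61.0, 68.0, 81.0, 88.0]
r = pearson_r(last, next_wait)
print(round(r, 3))
```

The correlation is only an appropriate summary when the scatter plot looks roughly linear, which is why the question also asks for a graph.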

3.  What is the regression equation for predicting waiting time until next eruption (Y) from length of previous eruption (X)?  

To fit a regression line, go to Analyze - Fit Y by X.  Select "NEXT" as the Y variable and "LAST" as the X variable.  Once you see the scatter plot, go to the red arrow next to Bivariate Fit.   Select Fit Line.
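If you are curious what a least-squares fit computes, here is a minimal Python sketch of the standard formulas for the slope and intercept.  The function name fit_line and the data values are assumptions made for illustration; the values are made up and are not the geyser.jmp data, so your answer must still come from JMP.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up illustrative values, NOT the geyser.jmp data.
last = [1.0, 2.0, 3.0, 4.0, 5.0]
next_wait = [52.0, 61.0, 68.0, 81.0, 88.0]
intercept, slope = fit_line(last, next_wait)
print(f"NEXT = {intercept:.1f} + {slope:.1f} * LAST")
```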

4.  What is a typical deviation of waiting time from the regression line?
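The typical deviation from the line is usually summarized by the root mean square error, the square root of the sum of squared residuals divided by n - 2.  Below is a minimal Python sketch under that assumption.  The function names and the data values are illustrative; the values are made up and are not the geyser.jmp data.

```python
import math

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(xs, ys):
    """Root mean square error of the fitted line: sqrt(SSE / (n - 2))."""
    intercept, slope = fit_line(xs, ys)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Divide by n - 2 because two quantities (intercept, slope) were estimated.
    return math.sqrt(sse / (len(xs) - 2))

# Made-up illustrative values, NOT the geyser.jmp data.
last = [1.0, 2.0, 3.0, 4.0, 5.0]
next_wait = [52.0, 61.0, 68.0, 81.0, 88.0]
typical_deviation = rmse(last, next_wait)
print(round(typical_deviation, 2))
```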

5.   Does the plot of residuals versus the predictor (LAST) suggest any violations of the regression assumptions?   Justify your answer in at most two sentences.

To obtain the plot of residuals versus the predictor values, click on the red arrow next to Linear Fit, which is just below the scatter plot.  Then, select Plot Residuals.
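The residual plot pairs each predictor value with its residual (observed value minus fitted value).  The minimal Python sketch below lists those pairs for a small example; in a listing or plot like this you would look for curvature or a spread that changes with the predictor.  The function name residuals and the data values are assumptions made for illustration; the values are made up and are not the geyser.jmp data.

```python
def residuals(xs, ys):
    """Residuals (observed minus fitted) from the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Made-up illustrative values, NOT the geyser.jmp data.
last = [1.0, 2.0, 3.0, 4.0, 5.0]
next_wait = [52.0, 61.0, 68.0, 81.0, 88.0]
res = residuals(last, next_wait)

# List residuals in order of the predictor, as a text stand-in for the plot.
for x, e in sorted(zip(last, res)):
    print(f"LAST = {x:.1f}   residual = {e:+.2f}")
```

Least-squares residuals always sum to zero, so the points in the plot scatter above and below the horizontal line at zero.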

6.  If the last eruption lasted 3.2 minutes, can you use the regression equation to predict the wait until the next eruption?  If you think so, write down the estimated average wait until the next eruption.  If you think not, explain why not in at most one sentence.

7.   If the last eruption lasted 9.6 minutes, can you use the regression equation to predict the wait until the next eruption?  If you think so, write down the estimated average wait until the next eruption.  If you think not, explain why not in at most one sentence.
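One common rule of thumb for questions like the last two is to use a regression line only for predictor values inside the range of the data used to fit it.  The minimal Python sketch below encodes that rule.  The function name predict_within_range, the coefficients, and the data values are all assumptions made for illustration; they do not come from geyser.jmp.

```python
def predict_within_range(x_new, xs, intercept, slope):
    """Return the predicted y, or None when x_new lies outside the observed x range."""
    if min(xs) <= x_new <= max(xs):
        return intercept + slope * x_new
    # Outside the observed range we have no evidence the line still holds.
    return None

# Made-up predictor values and coefficients, NOT taken from geyser.jmp.
last = [1.0, 2.0, 3.0, 4.0, 5.0]
intercept, slope = 42.4, 9.2

inside = predict_within_range(3.5, last, intercept, slope)
outside = predict_within_range(8.0, last, intercept, slope)
print(inside, outside)
```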

Unit 2:  Understanding regression better with applets

Applet 1: Drawing regression lines by eye

As discussed in class, the regression line is the line that yields the smallest sum of the squared residuals.  Just what does that mean, exactly?   Let's use some applets to illustrate this concept.  
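To make "smallest sum of the squared residuals" concrete, the minimal Python sketch below fits the least-squares line to a few points and then checks that several nudged lines all have a larger sum of squared residuals.  The function names, the nudges, and the data values are assumptions made for illustration; the data are made up.

```python
def sse(xs, ys, intercept, slope):
    """Sum of squared residuals for the line y = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [52.0, 61.0, 68.0, 81.0, 88.0]
b0, b1 = fit_line(xs, ys)
best = sse(xs, ys, b0, b1)

# Shift the intercept and/or slope; each nudged line should fit worse.
nudges = [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.5), (0.0, -0.5), (2.0, -0.5)]
for d0, d1 in nudges:
    worse = sse(xs, ys, b0 + d0, b1 + d1)
    print(f"intercept {d0:+.1f}, slope {d1:+.1f}: SSE {worse:.2f} vs best {best:.2f}")
```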

On the "Statistics on the web" (click on this link) page, open the applet "Draw your best guess at a regression line."  Read the instructions on the page, then hit the "Begin" button to try it out.  Your goal is to make the "MSE" as small as possible.  MSE stands for mean square error; it is the average squared deviation of the points from the line, so its square root is the typical deviation around the line.

In addition to playing with the applet, try the following.  You don't have to write anything on the lab report for this part, but you'll use the ideas underpinning this applet on exams.

a)  Draw a line that clearly does not fit the data at all.  Notice that the MSE is relatively large.
b)  Hit "Show minimum MSE" to get the value of the mean square error for the actual regression line.  Using that value, keep adding lines to get as close as possible to the minimum MSE.  Hint: Outliers in either the horizontal or the vertical direction can pull the line toward them.
c)  Compete against classmates or the TAs to see who gets the closest line.  Try to diagnose where your lines went wrong by comparing the actual line to the ones you fit.

Applet 2: Seeing the effect of individual points on the regression line

On the "Statistics on the web" (click on this link) page, open the applet "See how individual points can affect regression lines."  Read the instructions on the page, and see how placing various points affects where the best fitting line is located.

Try the following.  You don't have to write anything on the lab report for this part, but again, these concepts are important in statistics and so are likely to appear on exams.

d)  Add a point to make the slope decrease.
e)  Add a point to make the slope increase.
f)  Add a point to make the slope remain roughly the same but increase the intercept.
g)  Add a point to make the slope and intercept remain roughly the same.
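The same kind of experiment can be sketched outside the applet.  The minimal Python sketch below refits the line after adding one extra point, once at the mean of the x values and once far to the right, with both points the same vertical distance below the original line.  The function name fit_line and all data values are assumptions made for illustration; the data are made up.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up points lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]
_, slope_base = fit_line(xs, ys)

# Extra point at the mean of x, 6 units below the line.
_, slope_center = fit_line(xs + [3.0], ys + [0.0])

# Extra point far to the right, also 6 units below the line.
_, slope_far = fit_line(xs + [10.0], ys + [14.0])

print(slope_base, slope_center, round(slope_far, 3))
```

In this example the point far from the other x values changes the slope much more than the point at the mean of x, even though both sit equally far below the original line.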