Statistics 101
Data Analysis and Statistical Inference
 

Instructions for lab 5


Lab Objective

To gain experience with correlations and simple regressions.

Lab Procedures

Write answers to all questions on your lab sheet, and turn it in at the end of the lab period.  This lab asks you to do a lot, so definitely start it at home.

Unit 1:  Predicting eruptions of Old Faithful

The geyser Old Faithful in Yellowstone National Park erupts seemingly at random times.  Or does it?  Perhaps we can predict when the next eruption will occur based on characteristics of the previous eruption.  Load the data set geyser.jmp by clicking on the link.  It contains two variables measured on 21 eruptions of Old Faithful: the duration of the previous eruption in minutes (LAST) and the number of minutes between the previous and current eruptions (NEXT).

Questions:

1.  Describe the distribution of waiting time until eruption (NEXT).  Specifically, answer the following four questions.  What is the value of a typical waiting time?  What is the value of a typical deviation from the average waiting time?  Are there any severe outliers in waiting time?  Does a normal curve describe the histogram of waiting time reasonably well (justify your answer by referring to graphical displays)?  

2.  Is there a strong linear association between waiting time and length of previous eruption?  Provide one number that summarizes the strength of the association, and give a brief description of some relevant graph to justify that the number is an appropriate summary.

3.  What is the regression equation for predicting waiting time until next eruption (Y) from length of previous eruption (X)?  

To fit a regression model, go to Analyze - Fit Y by X.  Select "NEXT" as the Y variable and "LAST" as the X variable.  Once you see the scatter plot, go to the red arrow next to Bivariate Fit.   Select Fit Line.
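If you want to check what Fit Line is doing, the least-squares calculation can be sketched in a few lines of Python.  This is only an illustration, not part of the JMP workflow: the function name fit_line is my own, and the six data values are made up (they are NOT the values in geyser.jmp).

```python
# Illustrative values only: these are NOT the geyser.jmp measurements.
last = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]        # duration of previous eruption (min)
nxt = [52.0, 58.0, 63.0, 70.0, 74.0, 81.0]   # wait until next eruption (min)


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(last, nxt)
print(f"NEXT = {intercept:.2f} + {slope:.2f} * LAST")
```

The slope is the estimated change in average waiting time for each additional minute of the previous eruption; the intercept is the value of the line at LAST = 0.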

4.  What is a typical value of the deviation of waiting time from the predicted regression line?

5.   Does the plot of residuals versus the predictor (LAST) suggest any violations of the regression assumptions?   Justify your answer in at most two sentences.

To obtain the plot of residuals versus the predictor values, click on the red arrow next to Linear Fit, which is just below the scatter plot.  Then, select Plot Residuals.
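A residual is the observed Y minus the value predicted by the line, and the typical size of a residual (question 4) is what JMP labels Root Mean Square Error in the Summary of Fit table.  The sketch below shows both calculations under the usual simple-regression convention of dividing by n - 2.  The function names and the four data points are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import math


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(x, y, intercept, slope):
    """Observed y minus fitted y, one value per data point."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def rmse(resid):
    """Root mean square error, dividing by n - 2 (two estimated coefficients)."""
    return math.sqrt(sum(e ** 2 for e in resid) / (len(resid) - 2))


# Made-up data, not taken from geyser.jmp.
x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 1.0, 0.0, 1.0]
intercept, slope = fit_line(x, y)
resid = residuals(x, y, intercept, slope)
for xi, ei in zip(x, resid):
    print(f"x = {xi:.1f}  residual = {ei:+.2f}")
print(f"RMSE = {rmse(resid):.3f}")
```

When you look at the residual plot, you are checking whether these residuals scatter evenly around zero across the whole range of the predictor, with no curve and no fanning out.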

6.  If the last eruption lasted 3.2 minutes, can you use the regression equation to predict the wait until the next eruption?  If you think so, write down the estimated average wait until the next eruption.  If you think not, explain why not in at most one sentence.

7.   If the last eruption lasted 9.6 minutes, can you use the regression equation to predict the wait until next eruption?  If you think so, write down the estimated average wait until the next eruption.  If you think not, explain why not in at most one sentence.
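One habit worth building before you plug a number into a regression equation is to compare it with the range of X values the line was fit to.  The sketch below illustrates that check.  The function name predict_if_in_range, the coefficients, and the observed values are all hypothetical; substitute your own fitted line and the LAST values from your data.

```python
def predict_if_in_range(x_new, x_observed, intercept, slope):
    """Return the fitted value, or None when x_new lies outside the observed x range."""
    if min(x_observed) <= x_new <= max(x_observed):
        return intercept + slope * x_new
    return None  # the data say nothing about how Y behaves out here


# Hypothetical numbers for illustration only.
observed_x = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
intercept, slope = 30.0, 11.0

for x_new in (3.0, 8.0):
    prediction = predict_if_in_range(x_new, observed_x, intercept, slope)
    if prediction is None:
        print(f"x = {x_new}: outside the observed range, no prediction made")
    else:
        print(f"x = {x_new}: predicted average y = {prediction:.1f}")
```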


Unit 2:  Characteristics of mammals

Do mammals with bigger brains need more sleep?  Does sleep vary with the level of danger the animal lives in?  To answer these questions, Allison and Cicchetti (1976) gathered information on 62 different mammals.  Their data are in the file Sleeping Animals.jmp, which is in the JMP data sets folder.  Click on File - Open, and select the folder JMP In Data.  Select Sleeping Animals.jmp, then click on Open.

The variables in the data set are, in column order:

a) species;
b) average body weight of species in kg;
c) average weight of brain of species in grams;
d) average number of daily hours of non-dreaming sleep for species;
e) average number of daily hours of dreaming sleep for species;
f) average number of daily hours of total sleep for species;
g) average life span of species in years;
h) average number of weeks in gestation period;
i) an index of predation (ranges from 1 to 5, with 1 = unlikely to be preyed upon and 5 = likely to be preyed upon);
j) an index of exposure (ranges from 1 to 5, with 1 = sleeps in a well-protected den and 5 = worst exposure);
k) an index of overall danger based on a variety of factors (ranges from 1 to 5, with 1 = least danger from other animals and 5 = most danger from other animals).

Again, there are missing data in the file.  We'll ignore them for simplicity, although that is not the approach I recommend.

Questions:

8.  Using all data points, make a scatter plot of the relationship between total sleep (Y-axis) and brain weight (X-axis).  As you'll see, it's hard to detect any pattern in this plot because the horizontal axis is stretched out so far by the animals with very heavy brains.  Instead, make a scatter plot using only the animals with brains weighing less than 1000 grams.  (JMP Hint:  You'll have to exclude the appropriate rows.)

Using this scatter plot, describe the relationship between total sleep and brain weight.  Items to include in your description are the general trend of the relationship (e.g., positive and linear, negative and linear, some other pattern, no clear pattern) and whether there are any outliers or points that do not fit the pattern.  Also, don't forget to mention the mammals you excluded: do they generally follow the same trends as those in the graph?
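Excluding rows simply means splitting the data on a condition and analyzing one part while keeping the other in mind.  A minimal Python sketch of that split follows.  The species labels, column names, and values are invented for illustration and are not taken from Sleeping Animals.jmp.

```python
# Invented rows for illustration; not the Sleeping Animals.jmp data.
animals = [
    {"species": "A", "brain_g": 5.0, "total_sleep_h": 14.0},
    {"species": "B", "brain_g": 120.0, "total_sleep_h": 10.5},
    {"species": "C", "brain_g": 400.0, "total_sleep_h": 6.0},
    {"species": "D", "brain_g": 1300.0, "total_sleep_h": 8.0},
    {"species": "E", "brain_g": 4500.0, "total_sleep_h": 4.0},
]

CUTOFF_G = 1000.0  # brain weight cutoff in grams, as in question 8

# Rows used for the scatter plot.
kept = [row for row in animals if row["brain_g"] < CUTOFF_G]
# Rows set aside; look at these separately rather than forgetting them.
excluded = [row for row in animals if row["brain_g"] >= CUTOFF_G]

print("plotted: ", [row["species"] for row in kept])
print("excluded:", [row["species"] for row in excluded])
```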

9.  When computed using all the data, does the correlation between total sleep and brain weight meaningfully summarize the relationship between these two variables?  Explain in at most two sentences.
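To see how a few extreme X values can affect a correlation, it can help to compute r with and without them.  The sketch below does this for a small invented data set; the function name pearson_r and all numbers are hypothetical, and the results say nothing about what you will find in the mammal data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Invented values: five ordinary points plus one with a very large x.
x = [10.0, 20.0, 30.0, 40.0, 50.0, 5000.0]
y = [12.0, 9.0, 13.0, 8.0, 11.0, 3.0]

print(f"r using all points:          {pearson_r(x, y):.2f}")
print(f"r without the extreme point: {pearson_r(x[:-1], y[:-1]):.2f}")
```

If the two values differ sharply, r computed on all the points is describing the extreme point more than the bulk of the data.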

10.  Using all data points, describe the relationship between total sleep and the danger index.  Fit a regression line to support your interpretation.

11.  For the regression of sleep versus danger index, what is a typical deviation around the regression line?

12.  For a mammal that has a danger index of 2, what is the estimated average total sleep?

13.  Plot the residuals versus the danger index.  Does the plot of residuals versus the predictor suggest any violations of the regression assumptions?   Justify your answer in at most two sentences.