Statistics 101
    Data Analysis and Statistical Inference

    Answers for extra problems on exploratory data analysis


1.  College Degrees

The total # of people in thousands is:  (616+529+194+171+30+44+16+26) = 1626

i)  Pr(woman) = # women / 1626  =  (616 + 194 + 30 + 16) / 1626  =  856/1626  =  .5264

ii)  Pr(professional   |   male )  =  # professional males / # males  =  (44) / (529 + 171 + 44+ 26) = 44/770 = .0571

iii)  Pr(female and doctorate) = # female doctorates / 1626  =  16/1626  =  .0098

iv)  Pr(man | bachelors) = # males with bachelors / # bachelors  =  529 / (616 + 529)  =  529/1145 = .462.
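
If you'd like to check these four answers by computer, here is a minimal Python sketch. The counts are the ones used in the sums above; the dictionary layout, the variable names, and the label "masters" for the 194/171 row are my own assumptions, not part of the problem statement.

```python
# Counts in thousands, copied from the sums in the answers above.
# The layout is illustrative; the label "masters" for the 194/171 row
# is an assumption, since the answers never name that degree.
counts = {
    ("bachelors", "F"): 616, ("bachelors", "M"): 529,
    ("masters", "F"): 194, ("masters", "M"): 171,
    ("professional", "F"): 30, ("professional", "M"): 44,
    ("doctorate", "F"): 16, ("doctorate", "M"): 26,
}


def total(keep):
    """Add up the counts for every (degree, sex) cell that passes `keep`."""
    return sum(n for cell, n in counts.items() if keep(cell))


n_all = total(lambda cell: True)
n_male = total(lambda cell: cell[1] == "M")
n_bachelors = total(lambda cell: cell[0] == "bachelors")

# Marginal and joint probabilities divide by the grand total.
p_woman = total(lambda cell: cell[1] == "F") / n_all
p_female_and_doctorate = counts[("doctorate", "F")] / n_all

# Conditional probabilities divide by the size of the conditioning group.
p_professional_given_male = counts[("professional", "M")] / n_male
p_man_given_bachelors = counts[("bachelors", "M")] / n_bachelors

print(n_all, p_woman, p_professional_given_male,
      p_female_and_doctorate, p_man_given_bachelors)
```

The thing to notice is the denominator: the grand total for parts i and iii, but only the males or only the bachelors for the conditional probabilities in parts ii and iv.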

2.  Shut up, kid!!

Answers to parts i.1-i.3 and ii.1-ii.3 can be checked using JMP.  For i.4, the answer is 40%, since this is the only value that is consistent with where 120 is on the box plot.  For ii.4, the answer is 25%, since 15 - 20 covers roughly the median to the 75th percentile on the box plot.

iii)  IQ = 91.3 + 1.49 * crying.

Here the estimate of the intercept is 91.268 and the estimate of the slope is 1.493.  The correlation between IQ and crying is .455.  Remember, the intercept is the predicted IQ where crying = 0.  In the graph, the crying axis only goes down to 5, so the intercept is not where the line hits the vertical axis of the graph.

iv) The value of R-squared is .207. R squared is the proportion of the variation in IQ scores that is explained by the regression of IQ on crying time.

v) The predicted IQ for a baby who cries 35 times is 91.3 + 1.49 * 35 = 143.45.

vi) The residual for a baby who cries 35 times and has an IQ of 145 equals 145 - 143.45 = 1.55.
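
The arithmetic in parts iv through vi can be checked with a short Python sketch. It uses the rounded coefficients from part iii and the correlation reported above; the function and variable names are just illustrative choices.

```python
# Rounded estimates from the fitted line in part iii.
INTERCEPT = 91.3    # rounded from 91.268
SLOPE = 1.49        # rounded from 1.493
CORRELATION = 0.455


def predict_iq(crying):
    """Predicted IQ from the fitted line for a given crying count."""
    return INTERCEPT + SLOPE * crying


# Part v: prediction for a baby who cries 35 times.
predicted = predict_iq(35)

# Part vi: residual = observed value minus predicted value.
residual = 145 - predicted

# Part iv: with one predictor, R-squared is the squared correlation.
r_squared = CORRELATION ** 2

print(predicted, residual, r_squared)
```

Squaring the correlation of .455 gives the R-squared of .207 in part iv, up to rounding.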

3.  Speak up, kid!!

ii.1) Score = 110 - 1.13 * age.

The predicted change in score for an increase of 1 month in age is simply 1 times the estimated slope, which equals -1.13.  Thus, we predict that the score will drop by 1.13 points for every month later the baby waits before speaking.

ii.2)  Your plot should look like the plot of residuals versus Age that you obtain from JMP (use the red arrow next to "Linear Fit").
ii.3)  The typical deviation of the residuals from zero is the RMSE, which equals 11.023.

iii.2) When we exclude observation 18, we have

Score = 106 - 0.78 * age.

This shows that the slope changes from -1.13 to -0.78, so the line is flatter without observation 18.  Also, the value of R squared drops from 0.41 to 0.112.

You can tell the direction the slope and intercept will move by recognizing that point 18 is below the line, so that it pulled the line towards it.   Removing it will make the line try to get closer to points like those at low ages, which will flatten the line out.
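
To see the flattening in numbers, here is a small Python sketch that compares the two fitted lines. The coefficients are the ones reported above; the ages 10 and 40 are hypothetical values I picked for illustration, not ages taken from the data set.

```python
def score_all(age):
    """Fitted line using all children."""
    return 110 - 1.13 * age


def score_without_18(age):
    """Fitted line after excluding observation 18."""
    return 106 - 0.78 * age


# Slope interpretation: one extra month changes the prediction by the slope.
change_per_month = score_all(11) - score_all(10)

# Hypothetical ages (not from the data) to compare the two lines.
LOW_AGE, HIGH_AGE = 10, 40
gap_low = score_without_18(LOW_AGE) - score_all(LOW_AGE)
gap_high = score_without_18(HIGH_AGE) - score_all(HIGH_AGE)

print(change_per_month, gap_low, gap_high)
```

At the low age the two lines differ by about half a point, while at the high age the line without observation 18 sits about 10 points higher. That is what you would expect when a point far below the line at a high age has been pulling the line down.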

Addendum:  When you drop point 19, which is an outlier, but leave in point 18, the values of the slope and intercept do not change much.  However, dropping child 19 does increase R-squared, since the large residual attached to point 19 no longer contributes to the Residual SS in the sum of squares equation (Total SS = Regression SS + Residual SS).
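
The sum of squares argument can be illustrated with a toy example in Python. The data below are made up for this sketch (they are NOT the children's data): the points fall exactly on a line except for one outlier placed at the mean of x, so it inflates the Residual SS without changing the slope.

```python
def fit_line(xs, ys):
    """Least-squares line plus the sum of squares decomposition."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * x for x in xs]
    total_ss = sum((y - y_bar) ** 2 for y in ys)
    regression_ss = sum((f - y_bar) ** 2 for f in fitted)
    residual_ss = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return {
        "slope": slope,
        "intercept": intercept,
        "total_ss": total_ss,
        "regression_ss": regression_ss,
        "residual_ss": residual_ss,
        "r_squared": regression_ss / total_ss,
    }


# Hypothetical data: y = 2x exactly, then one outlier added at x = 5.
xs = list(range(1, 10))
ys = [2 * x for x in xs]
ys[4] += 8  # index 4 is x = 5, the mean of x

with_outlier = fit_line(xs, ys)
without_outlier = fit_line(xs[:4] + xs[5:], ys[:4] + ys[5:])

print(with_outlier["slope"], with_outlier["r_squared"])
print(without_outlier["slope"], without_outlier["r_squared"])
```

In this toy case the slope is the same with or without the outlier, but R-squared goes up once the outlier is dropped, because its large residual no longer adds to the Residual SS.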

Ideally, we'd collect data on more than just 21 kids.  With a larger sample, single points like 18 and 19 would have less influence on the fitted line, and we could pin down the relationship more precisely.
 
4.  Metabolism and Lean Body Mass

The regression of Metabolic rate on Lean Body mass fits the data reasonably well, as indicated by a mostly patternless residual plot.  The only potential assumption violation I see on the residual plot is that there is a large outlier: person 18. This could indicate a violation of the normality assumption, since such a high residual value is unlikely under normality. This outlier tends to pull the regression line towards it, which has some effect on the intercept and slope.  However, these effects are not dramatic. The estimates of the slope and intercept with the point are  26.9 and 113, with standard errors of 3.9 and 186, respectively.  The estimates without the point are 25.26 and 168.  Thus, the change in each estimate is less than half a standard error, so that neither change is substantial statistically.  Additionally, neither change is likely to be substantial practically, since we still predict about a 25 to 27 calorie burn increase for each extra kg of lean body mass.
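
The "less than half a standard error" claim can be checked with a few lines of Python. The estimates and standard errors are the ones quoted above; the variable names are illustrative.

```python
# Estimates with the outlier included, and their standard errors.
slope_with, slope_se = 26.9, 3.9
intercept_with, intercept_se = 113, 186

# Estimates with the outlier removed.
slope_without, intercept_without = 25.26, 168

# Size of each change, measured in standard errors.
slope_shift = abs(slope_with - slope_without) / slope_se
intercept_shift = abs(intercept_with - intercept_without) / intercept_se

print(round(slope_shift, 2), round(intercept_shift, 2))
```

Both shifts come out below 0.5 standard errors, which is the basis for saying neither change is substantial statistically.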

I would analyze the data with the outlier, since I have no scientific reason to justify removing it.  If the nutritionist needs to pin down the slope so precisely that a difference of about 2 calories per kg matters, then I'd recommend collecting more data.

I would tell the nutritionist that there is a strong, positive linear relationship between lean body mass and calories burned.  Increasing lean body mass by one kg is predicted to increase metabolic rate by some amount likely between 18.9 and 34.9 calories (limits from a 95% CI for the slope, which we'll discuss later in the semester).
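
As a quick consistency check on those limits, the Python sketch below works backward from the interval to the multiplier on the standard error. It does not assume a sample size or look up a t value, since neither is given here; it only uses the slope estimate, its standard error, and the two limits quoted above.

```python
slope, slope_se = 26.9, 3.9
lower, upper = 18.9, 34.9

# A symmetric interval is estimate +/- multiplier * standard error.
center = (lower + upper) / 2
half_width = (upper - lower) / 2
implied_multiplier = half_width / slope_se

print(center, half_width, round(implied_multiplier, 2))
```

The interval is centered at the slope estimate, and the implied multiplier is a little over 2, which is about what you expect for a 95% interval.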
 
5.  Survey of Youth in Custody

i)  There are kids with age = 99 in the data, which is the code for missing data.

ii)  The median is the line inside the box plot, which is around 13.
iii) 14 is right in the middle of the rectangle between the median and 75th percentile.  So, I'd say 63% would be a  reasonable guess.
iv)  14 or more is about 40% of males, and 12 or less is about 25% of males (12 is roughly the 25th percentile).  So, more males were age 14 or older than age 12 or younger.
v)  True.  The medians, 75th percentiles, and 25th percentiles are similar.  There are more outliers for males, but the bulk of each distribution is about the same.  The male box is wider because there are many more males than females.  Even though the numbers of people may differ, the distributions of ages for each sex are similar.
vi) True.  The box plot is pretty symmetric around the median.
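
The guesses in parts iii and iv come from reading between the quartile lines of the box plot. Here is a rough Python sketch of that reasoning. It assumes the ages are spread about evenly between the median and the 75th percentile, and it takes the 75th percentile to be about 15 (implied by 14 being halfway between the median of 13 and the top of the box); both are approximations read off the plot, not exact values.

```python
def percent_at_or_below(value, low, low_pct, high, high_pct):
    """Linear interpolation between two known percentiles of a box plot."""
    fraction = (value - low) / (high - low)
    return low_pct + fraction * (high_pct - low_pct)


MEDIAN_AGE = 13   # read from the box plot in part ii
Q3_AGE = 15       # assumed: 14 sits halfway between the median and this value

# Part iii: approximate percent of males at or below age 14.
pct_14_or_less = percent_at_or_below(14, MEDIAN_AGE, 50, Q3_AGE, 75)

# Part iv: the rest are older than 14, versus about 25% at age 12 or less.
pct_above_14 = 100 - pct_14_or_less

print(pct_14_or_less, pct_above_14)
```

This gives about 62.5% at or below 14, which rounds to the 63% guess in part iii, and leaves roughly 40% for the comparison with the 25% in part iv.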