Answers for extra problems on exploratory data analysis
The total # of people in thousands is: (616+529+194+171+30+44+16+26) = 1626
i) Pr(woman) = # women / 1626 = (616 + 194 + 30 + 16) / 1626 = 856/1626 = .5264
ii) Pr(professional | male ) = # professional males / # males = (44) / (529 + 171 + 44+ 26) = 44/770 = .0571
iii) Pr(female and doctorate) = # females with doctorates / 1626 = 16 / 1626 = .0098
iv) Pr(man | bachelors) = # males with bachelors / # bachelors = 529 / (616 + 529) = 529/1145 = .462.
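The four calculations above can be checked with a short Python sketch. The dictionary layout and key names are assumptions for illustration. In particular, the answers never name the degree level with counts 194 and 171, so it gets the placeholder key "level_2".

```python
# Counts in thousands, stored as (women, men) for each degree level.
# Key names are illustrative; "level_2" is a placeholder because the
# answers above do not name that degree level.
counts = {
    "bachelors": (616, 529),
    "level_2": (194, 171),
    "professional": (30, 44),
    "doctorate": (16, 26),
}

total = sum(w + m for w, m in counts.values())
women = sum(w for w, _ in counts.values())
men = sum(m for _, m in counts.values())

# i) marginal probability: all women over everyone
p_woman = women / total
# ii) conditional probability: restrict the denominator to men
p_prof_given_male = counts["professional"][1] / men
# iii) joint probability: one cell over everyone
p_female_and_doctorate = counts["doctorate"][0] / total
# iv) conditional probability: restrict the denominator to bachelors
p_man_given_bachelors = counts["bachelors"][1] / sum(counts["bachelors"])

print(round(p_woman, 4), round(p_prof_given_male, 4),
      round(p_female_and_doctorate, 4), round(p_man_given_bachelors, 3))
```

The point to notice is the denominator: marginal and joint probabilities divide by the grand total, while conditional probabilities divide by the total of the group being conditioned on.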
2. Shut up, kid!!
Answers to parts i.1-i.3 and ii.1-ii.3 can be checked using JMP.
For i.4, the answer is 40%, since this is the only value that is
consistent with where 120 is on the box plot. For ii.4, the answer
is 25%, since 15 - 20 covers roughly the median to the 75th percentile
on the box plot.
iii) IQ = 91.3 + 1.49 * crying.
The estimate of the intercept is 91.268 and the estimate of the slope is 1.493. The correlation between IQ and crying is .455. Remember, the intercept is the predicted IQ where crying = 0. In the graph, the crying axis only goes down to 5, so the intercept is not where the line hits the vertical axis of the graph.
iv) The value of R-squared is .207. R squared is the proportion of the variation in IQ scores that is explained by the regression of IQ on crying time.
v) The predicted IQ for a baby who cries 35 times is 91.3 + 1.49 * 35 = 143.45.
vi) The residual for a baby who cries 35 times and has an IQ of 145 equals 145 - 143.45 = 1.55.
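The arithmetic in parts iv through vi can be reproduced with a few lines of Python. This is only a sketch: the helper name predict_iq is made up for illustration, and the coefficients are the rounded values from part iii.

```python
# Rounded coefficients from part iii: IQ = 91.3 + 1.49 * crying
INTERCEPT = 91.3
SLOPE = 1.49


def predict_iq(crying):
    """Predicted IQ from the fitted line (hypothetical helper name)."""
    return INTERCEPT + SLOPE * crying


# Part iv: in a simple regression, R-squared is the squared correlation.
r = 0.455
r_squared = r ** 2

# Part v: prediction for a baby who cries 35 times.
predicted = predict_iq(35)

# Part vi: residual = observed value - predicted value.
residual = 145 - predicted

print(round(r_squared, 3), round(predicted, 2), round(residual, 2))
```

A positive residual means the observed IQ sits above the fitted line.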
3. Speak up, kid!!
ii.1) Score = 110 - 1.13 * age.
The predicted change in score for an increase of 1 month in age is simply 1 times the estimated slope, which equals -1.13. Thus, we predict that the score will drop by 1.13 points for every month later the baby waits before speaking.
ii.2) Your plot should look like the plot of residuals versus
Age that you obtain from JMP (use the red arrow next to "Linear Fit").
ii.3) The typical deviation of the residuals from zero is the
RMSE, which equals 11.023.
iii.2) When we exclude observation 18, we have
Score = 106 - 0.78 * age.
This shows the slope changes from -1.13 to -.78, so it moves toward zero and the line becomes flatter. Also,
the value of R squared drops from 0.41 to 0.112.
You can tell the direction the slope and intercept will move by
recognizing that point 18 is below the line, so that it pulled the line
towards it. Removing it will make the line try to get closer to
points like those at low ages, which will flatten the line out.
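To see what the flattening means for predictions, the two fitted lines can be compared side by side. This is a rough sketch: the function names are invented for illustration, and the ages in the loop are arbitrary values chosen to show the pattern, not observations from the data.

```python
def score_with_18(age):
    """Fitted line using all observations: Score = 110 - 1.13 * age."""
    return 110 - 1.13 * age


def score_without_18(age):
    """Fitted line after excluding observation 18: Score = 106 - 0.78 * age."""
    return 106 - 0.78 * age


# Illustrative ages only (not taken from the data set).
for age in (10, 20, 30, 40):
    print(age, round(score_with_18(age), 1), round(score_without_18(age), 1))
```

The two lines give nearly the same prediction at low ages and drift apart as age increases, which is what a flatter slope implies.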
Addendum: When you drop point 19, which is an outlier, but leave in point 18, the values of the slope and intercept do not change much. However, dropping child 19 does increase R-squared, since the large residual attached to point 19 no longer inflates the Residual SS in the sum of squares equation (Total SS = Regression SS + Residual SS).
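The sum of squares equation can be illustrated with a small least squares fit written in plain Python. The data below are a hypothetical toy example, NOT the 21 children from the problem, and the function name fit_and_decompose is made up. The sketch only shows the mechanism: removing a point with a large residual shrinks Residual SS relative to Total SS, so R-squared goes up.

```python
def fit_and_decompose(x, y):
    """Least squares fit of y on x; returns (total SS, regression SS, residual SS)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    total_ss = sum((yi - y_bar) ** 2 for yi in y)
    regression_ss = sum((fi - y_bar) ** 2 for fi in fitted)
    residual_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return total_ss, regression_ss, residual_ss


# Hypothetical toy data; the point (5.0, 2.0) sits well below the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [10.0, 9.0, 7.5, 7.0, 2.0, 4.5]

total_ss, regression_ss, residual_ss = fit_and_decompose(x, y)
r2_with = regression_ss / total_ss

# Refit after dropping the outlying point (index 4).
x_drop = x[:4] + x[5:]
y_drop = y[:4] + y[5:]
t2, reg2, res2 = fit_and_decompose(x_drop, y_drop)
r2_without = reg2 / t2

print(round(r2_with, 3), round(r2_without, 3))
```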
Ideally, we'd collect more data than just 21 kids. That way, we
could really get a handle on the relationship.
4. Metabolism and Lean Body Mass
The regression of metabolic rate on lean body mass fits the data reasonably well, as indicated by a mostly patternless residual plot. The only potential assumption violation I see on the residual plot is that there is a large outlier: person 18. This could indicate a violation of the normality assumption, since such a high residual value is unlikely under normality.
This outlier tends to pull the regression line towards it, which has some effect on the intercept and slope. However, these effects are not dramatic. The estimates of the slope and intercept with the point are 26.9 and 113, with standard errors of 3.9 and 186, respectively. The estimates without the point are 25.26 and 168. Thus, the change in each estimate is less than half a standard error, so that neither change is substantial statistically. Additionally, neither change is likely to be substantial practically, since we still predict an increase of about 25 to 27 calories burned for each extra kg of lean body mass.
I would analyze the data with the outlier, since I have no scientific reason to justify removing it. If the nutritionist needs to estimate the slope more precisely than saying something like "a change of 2 calories does not really matter," then I'd recommend that the nutritionist collect more data.
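The "less than half a standard error" claim can be verified directly from the reported estimates. This is a quick sketch and the variable names are illustrative.

```python
# Estimates with the outlier included, and their standard errors.
slope_with, slope_se = 26.9, 3.9
intercept_with, intercept_se = 113, 186

# Estimates with the outlier removed.
slope_without = 25.26
intercept_without = 168

# Size of each change, measured in standard errors.
slope_change_in_se = abs(slope_with - slope_without) / slope_se
intercept_change_in_se = abs(intercept_with - intercept_without) / intercept_se

print(round(slope_change_in_se, 2), round(intercept_change_in_se, 2))
```

Both ratios come out below 0.5, which is the basis for calling the changes statistically unimportant.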
I would tell the nutritionist that there is a strong, positive
linear relationship between lean body mass and calories burned.
An increase of one kg in lean body mass is associated with an
increase in metabolic rate of some amount likely between 18.9 and
34.9 calories (limits from a 95% CI for the slope, which we'll
discuss later in the semester).
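As a consistency check, the interval limits can be related back to the slope estimate and its standard error from the regression output. The sketch below assumes the interval has the usual form estimate +/- multiplier * SE. The sample size is not stated in these answers, so the multiplier is backed out from the limits, not looked up.

```python
lower, upper = 18.9, 34.9   # 95% CI limits quoted above
slope_se = 3.9              # standard error of the slope

center = (lower + upper) / 2        # should match the slope estimate
half_width = (upper - lower) / 2    # margin of error
implied_multiplier = half_width / slope_se

print(round(center, 1), round(half_width, 1), round(implied_multiplier, 2))
```

The center matches the slope estimate of 26.9, and the implied multiplier is a little above 2, which is what one expects for a 95% interval based on a modest sample.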
5. Survey of Youth in Custody
i) There are kids with age = 99 in the data, which is the code for missing data.
ii) The median is the line inside the box plot, which is
around 13.
iii) 14 is right in the middle of the rectangle between the median and
75th percentile. So, I'd say 63% would be a reasonable
guess.
iv) 14 or more is about 40% of males, and 12 or less is about 25%
of males (12 is roughly the 25th percentile). So, more males were
age 14 or older than age 12 or younger.
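The guesses in parts iii and iv amount to linear interpolation inside the box. The sketch below makes two assumptions that are not read directly off the plot: age is treated as continuous, and the 75th percentile is taken to be 15, which follows from 14 being midway between the median (about 13) and the 75th percentile. The function name is made up.

```python
def percent_below(value, lo_value, lo_pct, hi_value, hi_pct):
    """Linearly interpolate the cumulative percent between two known quantiles."""
    frac = (value - lo_value) / (hi_value - lo_value)
    return lo_pct + frac * (hi_pct - lo_pct)


median_age = 13   # read from the box plot (part ii)
q3_age = 15       # assumed: 14 is midway between the median and this value

pct_below_14 = percent_below(14, median_age, 50, q3_age, 75)
pct_14_or_more = 100 - pct_below_14

print(pct_below_14, pct_14_or_more)
```

This gives roughly 63% below 14 and a bit under 40% at 14 or more, compared with about 25% at 12 or less.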
v) True. The medians, 75th percentiles, and 25th percentiles
are similar. There are more outliers for males, but the bulk of
the distributions are the same. The male box is wider because
there are many more males than females. Even though the numbers
of people may differ, the distributions of ages for each sex are
similar.
vi) True. The box plot is pretty symmetric around the median.