Extra problems on exploratory data analysis (use JMP-IN)
For some of
these problems, you should load the JMP data sets and obtain graphical
displays. Avoid looking at numerical summaries like correlations,
means, medians, SDs, regression lines, etc. until after you have
answered the questions. On the exam, you will be presented with
graphical displays and asked questions in the spirit of those asked
here.
1. College Degrees
Here are the counts (in thousands) of earned degrees in the U.S. for a recent year, classified by degree type and sex of degree recipient.
            Bachelor's   Master's   Professional   Doctorate
Female         616          194          30            16
Male           529          171          44            26
Problems:
i) If you choose a degree recipient at random, what is the probability you pick a woman?
ii) If you choose a male degree recipient at random, what is the probability that you pick someone who earned a professional degree?
iii) If you pick a degree recipient at random, what is the probability you pick a woman with a doctorate?
iv) If you pick a Bachelor's degree recipient at random, what is the probability you pick a man?
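Each question above is a ratio of a cell count (or row or column total) to the appropriate total. The sketch below shows that arithmetic in Python. The counts are copied from the table. The dictionary layout and the helper name `total` are illustrative choices and are not part of the original problem set.

```python
from fractions import Fraction

# Counts in thousands, copied from the table above.
counts = {
    "Female": {"Bachelor's": 616, "Master's": 194, "Professional": 30, "Doctorate": 16},
    "Male": {"Bachelor's": 529, "Master's": 171, "Professional": 44, "Doctorate": 26},
}


def total(sex=None, degree=None):
    """Sum the counts, optionally restricted to one sex and/or one degree type."""
    return sum(
        n
        for s, row in counts.items()
        for d, n in row.items()
        if (sex is None or s == sex) and (degree is None or d == degree)
    )


# i) marginal probability: all women over all recipients
p_woman = Fraction(total(sex="Female"), total())
# ii) conditional probability: the denominator is restricted to men
p_prof_given_male = Fraction(total("Male", "Professional"), total(sex="Male"))
# iii) joint probability: one cell over all recipients
p_woman_and_doctorate = Fraction(total("Female", "Doctorate"), total())
# iv) conditional probability: the denominator is restricted to Bachelor's degrees
p_man_given_bachelors = Fraction(total("Male", "Bachelor's"), total(degree="Bachelor's"))

for label, p in [
    ("P(woman)", p_woman),
    ("P(professional | male)", p_prof_given_male),
    ("P(woman and doctorate)", p_woman_and_doctorate),
    ("P(man | Bachelor's)", p_man_given_bachelors),
]:
    print(f"{label}: {float(p):.3f}")
```

The main point is the choice of denominator: "choose a male degree recipient" and "pick a Bachelor's degree recipient" restrict the population, so those two answers divide by a row or column total rather than the grand total.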
2. Shut up, kid!!
Babies who cry a lot may be more easily stimulated than other babies, and this may be an indication of higher IQ. Karelitz, et al. (1964) studied the association between IQ and crying frequency with 38 babies. The researchers caused the babies to cry by snapping a rubber band on the sole of their foot (bastards...). They recorded the frequency of cries as the number of peak cries (example: WAAAHHHHH-WAAAAHHHH is two peaks) in the most active 20 seconds of crying. Three years later, they measured the babies' IQs.
The data are in the file crybabies.
Problems:
i) Obtain the histogram and box plot of
IQs. Hide the displayed quantiles and moments.
i.1) Estimate the median of IQ.
i.2) Estimate the mean of IQ.
i.3) Which is closest to the SD of IQ: 5 10
20 30 40.
i.4) Which is closest to the percentage of kids with IQ of at
least 120: 20 40 60
Check your answers using the JMP output.
ii) Repeat problems i.1 - i.3 for crying.
ii.4) Which is closest to the
percentage of kids with between 15 and 20 crying peaks: 25
50 75
iii) Obtain the scatter plot using
"crying" as a predictor for "IQ".
iii.1) Which is closest to the correlation between crying and IQ:
.10 .40 .70.
iii.2) Which is closest to the slope of the regression line:
0.5 1.0 1.5 2.0
iii.3) Which is closest to the intercept: 90 100
110.
Check your answers using the JMP output.
iv) What is the value of R squared? Explain what R squared means.
v) Predict the IQ score of a kid who has 35 crying peaks (a real cry-baby).
vi) What is the residual for a kid who has an IQ of 145 and has 35 crying peaks?
For associations involving measures of mental abilities, values of R squared in the .20 range are considered to be substantial. Think about it: from one variable, we are able to explain about 20% of the variation in IQ scores. That's pretty impressive!
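Problems iv) through vi) rest on three quantities: the fitted line, R squared (the share of the variation in the response that the line accounts for), and a residual (observed minus predicted). The Python sketch below shows how these are computed by hand. The data values are made up for illustration and are NOT the crybabies data. The function names `fit_line` and `r_squared` are likewise assumptions of this sketch, not anything from JMP.

```python
def fit_line(x, y):
    """Return the least-squares (slope, intercept) for regressing y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Return 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - ss_resid / ss_total


# Made-up values for illustration only; these are not the study's measurements.
crying = [10, 12, 15, 18, 20, 25]
iq = [95, 110, 100, 120, 105, 125]

slope, intercept = fit_line(crying, iq)
r2 = r_squared(crying, iq, slope, intercept)

# Prediction and residual for a child with 35 peaks and an observed IQ of 145.
predicted = intercept + slope * 35
residual = 145 - predicted  # observed minus predicted

print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r2:.2f}")
print(f"predicted IQ at 35 peaks={predicted:.1f}, residual={residual:.1f}")
```

With the real data, the slope and intercept come from the JMP output, and the last two lines of arithmetic are all that problems v) and vi) require.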
Reference
Karelitz, S. et al. (1964) "Relation of crying activity in early
infancy to speech and intellectual development at age three years." Child
Development 35, pp. 769--777.
3. Speak up, kid!!
Can we predict the mental abilities of toddlers from the age at which they first spoke? For 21 children, L. M. Linde of UCLA recorded the age in months at which they first spoke and their Gesell Adaptive Score, which is the result of an aptitude test taken much later. These data are presented in Moore (2000). No specific reference for Linde's work is given.
The data are in the file talkbabies.
Problems:
i) Obtain the histograms of age and
score. Practice estimating the means, SDs, medians, and several
percentages. You can check your estimates against the true
answers in JMP. A great way to practice estimating quantities
from histograms is to load random datasets in the JMP data files (look
in the JMP IN Data folder that pops up when you go to File Open in
JMP), and repeat this exercise.
ii) Fit the linear regression of "Score" on "Age".
ii.1) If a baby takes one month longer before speaking, what is
the predicted change in their Gesell Adaptive Score?
ii.2) Draw a rough plot of residuals versus Age corresponding to
the plot for the regression.
ii.3) What is a typical deviation in the residuals from zero?
iii) Child number 18 learned to speak at 42 months, which is
much later than the rest of the children. If you fit the
regression without child 18, would you expect that:
a) the slope will get further from
zero and the intercept will increase.
b) the
slope will get further from zero and the intercept will decrease.
c) the
slope will get closer to zero and the intercept will increase.
d) the
slope will get closer to zero and the intercept will decrease.
Fit the model without child 18 (include all other children) to check your answer.
Observations like child 18 are called influential points because they have a large influence on the location of the regression line. When influential points and outliers exist in the data, the first step is to check if there are any data entry errors. If so, they should be corrected. When there are no data entry errors, the analyst then decides whether or not to include the points based on scientific arguments. For example, child 18 learned to speak at a relatively old age; perhaps this child is mentally retarded. If the research question does not focus on such children, child 18 might be excluded on scientific grounds. On the other hand, child 19, who has a large residual, learned to talk at an age well within the range of other children's ages. I cannot think of any valid scientific reason to exclude this child from the analyses, even though the child is an outlier.
When there is no valid scientific reason to exclude influential points, but inclusion of them results in different conclusions, the best remedy is to collect more data. Often, individual points' influences are reduced when more data are collected. If that is not possible, I present results with and without the outlier--thereby demonstrating the effect of the influential points and the need to collect more data--but I base my conclusions on the results using all the data. Throwing away data for no reason other than to get a conclusion that you want (or don't want) is not good science.
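The practice described above, reporting the fit with and without an influential point, can be sketched in a few lines of Python. The ages and scores below are invented for illustration and are NOT the talkbabies data, so the direction and size of the change shown here say nothing about the answer for child 18. The helper names `fit_line` and `drop_index` are assumptions of this sketch.

```python
def fit_line(x, y):
    """Return the least-squares (slope, intercept) for regressing y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def drop_index(values, i):
    """Return a copy of the list with the element at position i removed."""
    return values[:i] + values[i + 1:]


# Invented values; the last child speaks much later than the others.
ages = [9, 10, 11, 12, 15, 20, 42]
scores = [100, 95, 105, 98, 96, 94, 60]

# Flag the point with the most extreme predictor value.
flagged = ages.index(max(ages))

fit_all = fit_line(ages, scores)
fit_reduced = fit_line(drop_index(ages, flagged), drop_index(scores, flagged))

print("all points:     slope=%.3f intercept=%.2f" % fit_all)
print("without point:  slope=%.3f intercept=%.2f" % fit_reduced)
```

Showing both fits side by side makes the point's influence visible to the reader, while the conclusions can still be based on the full data set.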
Reference
Moore, D. S. (2000) The Basic Practice of Statistics.
W.H. Freeman and Company: New York. pp. 117--122.
4. Weight and burning energy
How does body weight relate to the rate at which the body burns energy? Researchers believe that lean body mass, which is a person's weight leaving out all fat, influences the rate at which the body burns energy (called the metabolic rate). Let's investigate this question using data from a study described by Moore (2000).
In the study, nineteen people had their lean body mass measured (in kilograms) and their metabolic rate recorded (in calories burned per 24 hours). Twelve of the people are women, and seven are men. You can ignore people's sex for this problem.
Problem:
You are the consulting statistician to a
nutritionist. Tell the nutritionist what the relationship is
between lean body mass and metabolic rate. The nutritionist knows
statistics, which means you have to convince her that your model fits
the data well. Report any worries you have about violations of the
linear model assumptions.
(You wouldn't be able to do this on the exam, but it's a great way to
review regression and to mimic what happens in real life.)
Don't forget to examine residual plots.
The data are in the file metabolism.
Reference
Moore, D. S. (2000) The Basic Practice of Statistics.
W.H. Freeman and Company: New York.
5. Survey of Youth in Custody
In Lab 2, we used the Survey of Youth in Custody to examine the
effectiveness of random assignment of treatments. Let's analyze
these data a bit further. Download the dataset, syc2.jmp.
Obtain the side-by-side box plot of age at first arrest for male and
female inmates. (Use Fit Y-by-X, and select Quantiles).
i) Why are there several points near 100?
Exclude the points near 100, and redo
the box plot. To do this efficiently, go to Rows-Row Selection-Select Where. Choose
"agefirst", then choose "is greater than," then choose "90". The
dataset now should highlight the rows with 99 as the value of agefirst.
Go to Rows-Exclude to
exclude the rows, and refit the box plot. Minimize the window
showing the Quantiles.
ii) Estimate the median age at first arrest for males.
iii) Estimate the percentage of males with age at first arrest 14
or less.
iv) Which is larger: (a) the percentage of males with age
at first arrest of 14 or more; or (b) the percentage of males
with age at first arrest of 12 or less?
v) True or False: The distribution of ages at first
arrest is pretty similar for males and females.
vi) True or False: A normal curve describes the histogram
of males' ages pretty well.
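For readers who want to mirror the JMP exclusion step outside JMP, the Python sketch below applies the same rule (drop rows where agefirst is greater than 90) and then summarizes the remaining male ages. The records are hypothetical and are NOT taken from syc2.jmp. Only the column name "agefirst" and the cutoff of 90 come from the instructions above; the "sex" field and the helper `ages_for` are assumptions of this sketch.

```python
from statistics import median

# Hypothetical records; the real syc2.jmp values are not reproduced here.
records = [
    {"sex": "male", "agefirst": 12},
    {"sex": "male", "agefirst": 13},
    {"sex": "male", "agefirst": 14},
    {"sex": "male", "agefirst": 15},
    {"sex": "male", "agefirst": 16},
    {"sex": "male", "agefirst": 99},
    {"sex": "female", "agefirst": 13},
    {"sex": "female", "agefirst": 14},
    {"sex": "female", "agefirst": 15},
    {"sex": "female", "agefirst": 99},
]

# Same rule as "Select Where agefirst is greater than 90" followed by Exclude.
kept = [r for r in records if r["agefirst"] <= 90]


def ages_for(sex, rows):
    """Return the agefirst values for one sex."""
    return [r["agefirst"] for r in rows if r["sex"] == sex]


male_ages = ages_for("male", kept)
median_male = median(male_ages)
share_14_or_less = sum(a <= 14 for a in male_ages) / len(male_ages)

print(f"rows kept: {len(kept)} of {len(records)}")
print(f"median male age at first arrest: {median_male}")
print(f"share of males aged 14 or less: {share_14_or_less:.0%}")
```

Comparing the summaries before and after the filter is a quick way to see how strongly a handful of values near 100 can distort a box plot.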