Statistics 101
Data Analysis and Statistical Inference

Extra problems on exploratory data analysis (use JMP-IN)




For some of these problems, you should load the JMP data sets and obtain graphical displays.  Avoid looking at numerical summaries like correlations, means, medians, SDs, regression lines, etc. until after you have answered the questions.  On the exam, you will be presented with graphical displays and asked questions in the spirit of those asked here.

1.   College Degrees

Here are the counts (in thousands) of earned degrees in the U.S. for a recent year, classified by degree type and sex of degree recipient.

              Bachelor's    Master's    Professional    Doctorate
Female        616           194         30              16
Male          529           171         44              26

Problems:

i)  If you choose a degree recipient at random, what is the probability you pick a woman?

ii)  If you choose a male degree recipient at random, what is the probability that you pick someone who earned a professional degree?

iii)  If you pick a degree recipient at random, what is the probability you pick a woman with a doctorate?

iv)  If you pick a Bachelor's degree recipient at random, what is the probability you pick a man?
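The four questions differ only in which group you divide by: everyone (a marginal or joint probability) or just one row or column (a conditional probability).  Here is a minimal sketch of that bookkeeping in Python.  The helper names `total` and `probability` are made up for illustration, and the example deliberately uses questions other than the ones asked above.

```python
# Counts (in thousands) copied from the table above.
counts = {
    "Female": {"Bachelor's": 616, "Master's": 194, "Professional": 30, "Doctorate": 16},
    "Male": {"Bachelor's": 529, "Master's": 171, "Professional": 44, "Doctorate": 26},
}


def total(sex=None, degree=None):
    """Sum the counts, optionally restricted to one sex and/or one degree type."""
    return sum(
        n
        for s, row in counts.items()
        for d, n in row.items()
        if (sex is None or s == sex) and (degree is None or d == degree)
    )


def probability(event, given=None):
    """P(event | given); both are dicts with optional 'sex' and 'degree' keys."""
    given = given or {}
    joint = {**given, **event}          # cells that satisfy both conditions
    return total(**joint) / total(**given)


# Marginal: divide by all degree recipients.
p_masters = probability({"degree": "Master's"})
# Conditional: divide by female recipients only.
p_masters_given_female = probability({"degree": "Master's"}, given={"sex": "Female"})
# Joint: one cell of the table divided by all degree recipients.
p_male_and_masters = probability({"sex": "Male", "degree": "Master's"})

print(round(p_masters, 3), round(p_masters_given_female, 3), round(p_male_and_masters, 3))
```

Changing the dictionaries passed to `probability` gives the quantities asked for in i) through iv); try them by hand first.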

2.  Shut up, kid!!

Babies who cry a lot may be more easily stimulated than other babies, and this may be an indication of higher IQ.  Karelitz, et al. (1964) studied the association between IQ and crying frequency with 38 babies.  The researchers caused the babies to cry by snapping a rubber band on the sole of their foot (bastards...).  They recorded the frequency of cries as the number of peak cries (example: WAAAHHHHH-WAAAAHHHH  is two peaks)  in the most active 20 seconds of crying.  Three years later, they measured the babies' IQs.

The data are in the file crybabies.

Problems:

i)  Obtain the histogram and box plot of IQs.  Hide the displayed quantiles and moments.
i.1)  Estimate the median of IQ.
i.2)  Estimate the mean of IQ.
i.3)  Which is closest to the SD of IQ:   5   10   20   30   40.
i.4)  Which is closest to the percentage of kids with IQ of at least 120:   20   40    60

Check your answers using the JMP output.  
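If you want to see how these summaries are computed rather than read them off the JMP output, here is a rough sketch using Python's standard `statistics` module.  The scores below are invented for illustration; they are not the crybabies data.

```python
import statistics

# Made-up scores for illustration only; these are NOT the crybabies data.
iq = [87, 94, 100, 103, 106, 109, 112, 115, 120, 124, 132, 150]

mean = statistics.mean(iq)
median = statistics.median(iq)
sd = statistics.stdev(iq)  # sample SD (divides by n - 1)

# Percentage of scores that are at least 120.
pct_at_least_120 = 100 * sum(x >= 120 for x in iq) / len(iq)

print(f"mean={mean:.1f}  median={median}  SD={sd:.1f}  at least 120: {pct_at_least_120:.0f}%")
```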

ii)  Repeat problems i.1 - i.3 for crying.
ii.4)  Which is closest to the percentage of kids with between 15 and 20 crying peaks:     25   50    75

iii)  Obtain the scatter plot using "crying" as a predictor for "IQ".  
iii.1)  Which is closest to the correlation between crying and IQ:   .10   .40   .70
iii.2)  Which is closest to the slope of the regression line:  0.5   1.0   1.5   2.0
iii.3)  Which is closest to the intercept:  90   100   110

Check your answers using the JMP output.  

iv)  What is the value of R squared?  Explain what R squared means.

v)  Predict the IQ score of a kid who has 35 crying peaks (a real cry-baby).

vi)  What is the residual for a kid who has an IQ of 145 and has 35 crying peaks?

For associations involving measures of mental abilities, values of R squared in the .20 range are considered to be substantial.  Think about it: from one variable, we are able to explain about 20% of the variation in IQ scores.  That's pretty impressive!
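The quantities in iv) through vi) all come from the same fitted line.  Here is a minimal sketch of the arithmetic in Python, assuming an ordinary least-squares fit.  The data and the function names `fit_line` and `r_squared` are made up for illustration, so the numbers it prints are not the answers to the problems above.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Fraction of the variation in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_resid / ss_total


# Made-up values for illustration only; these are NOT the crybabies data.
peaks = [10, 12, 15, 17, 20, 22, 25, 30]
iq = [95, 110, 100, 120, 105, 125, 115, 135]

slope, intercept = fit_line(peaks, iq)
r2 = r_squared(peaks, iq, slope, intercept)

new_peaks, observed_iq = 18, 130               # a hypothetical child
predicted_iq = intercept + slope * new_peaks   # point on the fitted line
residual = observed_iq - predicted_iq          # observed minus predicted

print(f"slope={slope:.2f}  intercept={intercept:.1f}  R squared={r2:.2f}")
print(f"predicted={predicted_iq:.1f}  residual={residual:.1f}")
```

A positive residual means the child scored above the line's prediction; a negative one means below.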

Reference
Karelitz, S. et al. (1964) "Relation of crying activity in early infancy to speech and intellectual development at age three years." Child Development 35, pp. 769--777.
 

3.  Speak up, kid!!

Can we predict the mental abilities of toddlers from the age at which they first spoke?   For 21 children, L. M. Linde of UCLA recorded the age in months at which they first spoke and their Gesell Adaptive Score, which is the result of an aptitude test taken much later.  These data are presented in Moore (2000).  No specific reference for Linde's work is given.

The data are in the file  talkbabies.
 

Problems:

i)  Obtain the histograms of age and score.  Practice estimating the means, SDs, medians, and several percentages.  You can check your estimates against the true answers in JMP.  A great way to practice estimating quantities from histograms is to load random datasets in the JMP data files (look in the JMP IN Data folder that pops up when you go to File Open in JMP), and repeat this exercise.

ii)  Fit the linear regression of "Score" on "Age".  
ii.1)  If a baby takes one month longer before speaking, what is the predicted change in their Gesell Adaptive Score?
ii.2)  Draw a rough plot of residuals versus Age corresponding to the plot for the regression.  
ii.3)  What is a typical deviation in the residuals from zero?


iii)  Child number 18 learned to speak at 42 months, which is much later than the rest of the children.  If you fit the regression without child 18, would you expect that:
       a)  the slope will get further from zero and the intercept will increase.
       b)  the slope will get further from zero and the intercept will decrease.
       c)  the slope will get closer to zero and the intercept will increase.
       d)  the slope will get closer to zero and the intercept will decrease.

Fit the model without child 18 (include all other children) to check your answer.

Observations like child 18 are called influential points because they have a large influence on the location of the regression line.   When influential points and outliers exist in the data, the first step is to check if there are any data entry errors.  If so, they should be corrected.  When there are no data entry errors, the analyst then decides whether or not to include the points based on scientific arguments.  For example, child 18 learned to speak at a relatively old age; perhaps this child has an intellectual disability. If the research question does not focus on such children, child 18 might be excluded on scientific grounds.  On the other hand, child 19, who has a large residual, learned to talk at an age well within the range of other children's ages.  I cannot think of any valid scientific reason to exclude this child from the analyses, even though the child is an outlier.

When there is no valid scientific reason to exclude influential points, but including them changes the conclusions, the best remedy is to collect more data.  Often, individual points' influences are reduced when more data are collected.  If that is not possible, I present results with and without the influential points--thereby demonstrating their effect and the need to collect more data--but I base my conclusions on the results using all the data.  Throwing away data for no reason other than to get a conclusion that you want (or don't want) is not good science.
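To see how much a single point can move a regression line, here is a small sketch in Python that fits the line with and without one point that sits far from the rest in x.  The data and the function name `fit_line` are invented for illustration; they are not the talkbabies data, and the direction of the change here says nothing about what happens with child 18.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up data: the last point has an x value far from all the others.
x = [1, 2, 3, 4, 5, 6, 20]
y = [5, 3, 6, 4, 5, 4, 15]

slope_all, intercept_all = fit_line(x, y)
slope_without, intercept_without = fit_line(x[:-1], y[:-1])  # drop the last point

print(f"all points:         slope={slope_all:.2f}  intercept={intercept_all:.2f}")
print(f"without last point: slope={slope_without:.2f}  intercept={intercept_without:.2f}")
```

Reporting both lines side by side, as in the two print statements, is the "with and without" presentation described above.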

Reference
Moore, D. S. (2000)  The Basic Practice of Statistics.  W.H. Freeman and Company: New York.  pp. 117--122.
 

4. Weight and burning energy

How does body weight relate to the rate at which the body burns energy?  Researchers believe that lean body mass, which is a person's weight leaving out all fat, influences the rate at which the body burns energy (called the metabolic rate).  Let's investigate this question using data from a study described by Moore (2000).

In the study, nineteen people had their lean body mass measured (in kilograms) and their metabolic rate recorded (in calories burned per 24 hours).  Twelve of the people are women, and seven are men.  You can ignore people's sex for this problem.

Problem:

You are the consulting statistician to a nutritionist.  Tell the nutritionist what the relationship is between lean body mass and metabolic rate.  The nutritionist knows statistics, which means you have to convince her that your model fits the data well. Report any worries you have about violations of the linear model assumptions.

(You wouldn't be able to do this on the exam, but it's a great way to review regression and to mimic what happens in real life.)

Don't forget to examine residual plots.

The data are in the file metabolism.

Reference
Moore, D. S. (2000)  The Basic Practice of Statistics.  W.H. Freeman and Company: New York.


5. Survey of Youth in Custody

In Lab 2, we used the Survey of Youth in Custody to examine the effectiveness of random assignment of treatments.  Let's analyze these data a bit further.  Download the dataset, syc2.jmp.  

Obtain the side-by-side box plot of age at first arrest for male and female inmates.   (Use Fit Y-by-X, and select Quantiles).
i)  Why are there several points near 100?

Exclude the points near 100, and redo the box plot.   To do this efficiently, go to Rows-Row Selection-Select Where.  Choose "agefirst", then choose "is greater than," then enter "90".   The data table should now highlight the rows with 99 as the value of agefirst.  Go to Rows-Exclude to exclude the rows, and refit the box plot.  Minimize the window showing the Quantiles.
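The Select Where and Exclude steps amount to filtering rows on a condition before summarizing.  Here is a rough sketch of the same idea in Python.  The records are invented for illustration, and the column name "sex" is an assumption; only "agefirst" comes from the dataset description above.

```python
import statistics

# Made-up records for illustration; the real values are in syc2.jmp.
records = [
    {"sex": "M", "agefirst": 12},
    {"sex": "M", "agefirst": 13},
    {"sex": "M", "agefirst": 14},
    {"sex": "M", "agefirst": 15},
    {"sex": "M", "agefirst": 99},
    {"sex": "M", "agefirst": 11},
    {"sex": "F", "agefirst": 14},
    {"sex": "F", "agefirst": 15},
    {"sex": "F", "agefirst": 99},
    {"sex": "F", "agefirst": 13},
]

# Same rule as Select Where "agefirst is greater than 90", followed by Exclude.
kept = [r for r in records if not r["agefirst"] > 90]

# Group the remaining ages by sex, then take the median of each group.
ages_by_sex = {}
for r in kept:
    ages_by_sex.setdefault(r["sex"], []).append(r["agefirst"])

medians = {sex: statistics.median(ages) for sex, ages in ages_by_sex.items()}
print(medians)
```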

ii)  Estimate the median age at first arrest for males.
iii)  Estimate the percentage of males with age at first arrest 14 or less.
iv)  Which is larger:  (a) the percentage of males with age at first arrest of 14 or more;  or (b) the percentage of males with age at first arrest of 12 or less?
v)  True or False:  The  distribution of ages at first arrest is pretty similar for males and females.
vi)  True or False:  A normal curve describes the histogram of males' ages pretty well.