Statistics 103
Probability and Statistical Inference
Instructions for Lab 5
Lab Objective
The purpose of this lab is to help you pull together what you have
learned about univariate and bivariate graphical and numerical
summaries in the context of a case study. The lab will also demonstrate
that in these four weeks you have acquired most of the statistical
tools that form the basis of a published scholarly work.
Lab Procedures
Before coming to lab, read the paper by Landrigan et al. (1975),
"Neuropsychological dysfunction in children with chronic low-level
lead absorption," The Lancet, March 29, pp. 708--712. The Lancet is
one of the leading journals in medical science. I recommend that you
start this lab before the lab period so that you have time to
complete it all.
In this lab, you will have access to the data presented in this
paper. While this paper was published in 1975, it still has an
impact today. The topic of lead exposure in children remains
under investigation and new research results appear in the news almost
every month. The topic is studied by interdisciplinary teams of medical
personnel, epidemiologists, social scientists, environmentalists, and
policy makers.
Open the data file lead.jmp by clicking on this link. A description
of all the variable names in the data set (often called a Code Book)
can be found at the end of this lab.
The variable in the data set for the blood lead group has three
categories. The researchers suggest that two categories (below
40 micrograms/100 mL, and 40 micrograms/100 mL or above) are adequate.
So, let's make a new variable that recodes the group variable to
low ("L") and high ("H"). To do this, go to Columns-New Column.
Give the new column the name "group.recode". Select Data
Type-Character to tell JMP that we're inputting names. Now, click on
New Property-Formula, then Edit Formula. When the formula box pops
open, highlight "group" from the Table Columns list. Next, holding
down the Shift key, select from the Functions box the option
Conditional and then Match. The current response choices (groups 1,
2, 3) are listed. Replace each [then clause] as follows:
Match(group) {
    1     "L"
    2     "H"
    3     "H"
    else  " "
}
Make sure to include quotes around the letters. Just enter a
space in the else condition.
This replaces the 1s with Ls and the 2s and 3s with Hs. After you
click OK, you should have a variable for high and low groups.
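For readers who want to see the logic of the recode outside JMP, here is a minimal Python sketch of the same mapping. It is only an illustration: the names `RECODE` and `recode_group` are hypothetical, and the group codes in the list are made up rather than taken from lead.jmp.

```python
# Hypothetical stand-in for the JMP Match formula; not part of the lab files.
RECODE = {1: "L", 2: "H", 3: "H"}

def recode_group(group):
    """Map group code 1 to "L" and codes 2 and 3 to "H".

    Any other value falls through to a single space, mirroring the
    space entered in the else condition of the JMP formula.
    """
    return RECODE.get(group, " ")

# Made-up group codes for illustration only.
groups = [1, 2, 3, 1]
recoded = [recode_group(g) for g in groups]
print(recoded)  # ['L', 'H', 'H', 'L']
```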
Questions:
1. This question is based on your reading of the article. You don't
need to use JMP for Question 1.
a) What are the experimental units (the subjects)?
b) What is the treatment variable? What is the name
of one of the response variables?
c) Is this an observational study or a randomized
experiment?
(Note: most of the background characteristics in Table 1 are pretty
similar in the lead and control groups, except for age. A one-year
difference could have a large impact on the mental and physical
abilities of young children.)
2a) NOT HANDED IN: The main analysis compares mean performance
IQ scores (W.I.S.C. + W.P.P.S.I.) for the high lead and low lead
groups. Let's make sure we get the same results when we analyze the
data. Do the means and standard deviations for performance IQs (see
code book at end of lab for the variable name) for the high and low
groups match the means and SDs reported in the paper? JMP Tip: Put
"group.recode" in the By box to separate the data based on
"group.recode".
2b) HANDED IN: In any analysis, it is important to check whether
the means and SDs are strongly influenced by individual data points.
The low lead group has three outliers on performance IQ, and the
high lead group has one outlier on performance IQ. Exclude these
four observations and compare the means and SDs to those in part 2a.
Are any of the changes big enough that the authors should have
mentioned the effect of the outliers in their article? Report the
two new means and SDs as part of your answer, as well as a
three-sentence-maximum explanation.
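If you want to cross-check the JMP output with another tool, the with-and-without comparison in Question 2b can be sketched in Python as below. The scores are made-up illustrative values, not data from lead.jmp, and the name `summarize` and the `excluded` set of row positions are hypothetical.

```python
from statistics import mean, stdev

def summarize(scores):
    """Return (mean, sample standard deviation) of a list of scores."""
    return mean(scores), stdev(scores)

# Made-up scores for illustration only; NOT values from lead.jmp.
scores = [90, 100, 110, 140]
excluded = {3}  # hypothetical row positions flagged as outliers

# Keep every score whose position is not in the excluded set.
kept = [s for i, s in enumerate(scores) if i not in excluded]

mean_all, sd_all = summarize(scores)
mean_kept, sd_kept = summarize(kept)
print(f"all points:        mean={mean_all:.1f}, SD={sd_all:.1f}")
print(f"outliers excluded: mean={mean_kept:.1f}, SD={sd_kept:.1f}")
```

In the lab itself you would run the summary once per level of group.recode, just as the By box does in JMP.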
3. Perform computations for Question 3 and all later questions with
all data points; don't exclude anything. A comparison of means and
standard deviations might be inadequate. For example, suppose one
group has a right-skewed distribution, and the other group has a
left-skewed distribution. Just reporting means and standard
deviations does not inform the reader about such structure. Compare
the distributions of performance IQ of the high and low lead groups.
Describe any differences between the two groups' distributions of
performance IQ, e.g., compare locations of most of the data, the
spreads of the distributions, and whether there are outliers. Write
at most three sentences. Reminder: Box plots are useful for
side-by-side comparisons.
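Box plots commonly flag a point as an outlier when it lies more than 1.5 interquartile ranges beyond the quartiles. The sketch below illustrates that rule in Python on made-up scores (not values from lead.jmp). The function name `iqr_fences` is hypothetical, and the quartile method used here is an assumption; JMP may compute quartiles slightly differently, so its fences need not match exactly.

```python
from statistics import quantiles

def iqr_fences(values):
    """Return (lower_fence, upper_fence) using the 1.5 * IQR rule."""
    # quantiles() sorts the data itself; n=4 yields Q1, median, Q3.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Made-up scores for illustration only; NOT values from lead.jmp.
scores = [85, 90, 95, 100, 105, 110, 115, 120, 160]
low, high = iqr_fences(scores)
outliers = [s for s in scores if s < low or s > high]
print(low, high, outliers)
```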
4. The authors chose to categorize blood lead level rather than
use it as a continuous variable. Is there a strong linear
relationship between performance IQ and the blood lead level in 1972
measured on a continuous scale?
Data Analysis Tip: Researchers sometimes categorize continuous
variables to simplify analyses. However, when there are strong
linear relationships, categorization sacrifices information and can
lead to inaccurate results. Implicitly, splitting blood lead levels
at 40 micrograms/100 mL assumes that the average performance IQ of
all kids in the population with blood lead levels below 40 equals
some constant, i.e., their average performance IQ does not depend on
the actual blood lead levels. When categorizing, be sure to have a
valid scientific rationale for choosing the end points of the
categories.
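One common way to quantify the strength of a linear relationship, as Question 4 asks, is the sample correlation coefficient. The Python sketch below computes it from scratch and first drops pairs where the 1972 blood lead value carries the missing code 99 listed in the code book. The pairs are made up and deliberately contrived to lie on a straight line; they are not data from lead.jmp and say nothing about the real answer. The name `pearson_r` is hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up (blood lead 1972, performance IQ) pairs; NOT from lead.jmp.
pairs = [(20, 110), (30, 105), (99, 100), (40, 100), (50, 95)]

# 99 is the missing-value code for LD72, so drop those pairs first.
complete = [(x, y) for x, y in pairs if x != 99]
xs = [x for x, _ in complete]
ys = [y for _, y in complete]

r = pearson_r(xs, ys)
print(round(r, 3))
```

Forgetting to remove the 99 codes would treat them as real blood lead values and distort both the scatter plot and the correlation.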
5. Older kids typically have faster reflexes than younger kids.
Hence, when comparing finger-wrist tapping speeds for the high and
low lead groups, we want to make sure the two groups have similar
distributions of ages. Age is in funky units (e.g., 1011 means
10 years and 11 months), so I created a new variable with age in
months. This is Age mo, which is located in the last column of the
data set. Compare the distributions of age in months for the high
and low lead groups. Based on your comparisons, could the groups'
average finger-wrist tapping speeds (or some other outcome variable)
reflect effects of age differences? Explain in at most three
sentences.
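The conversion from the AGE code to months can be sketched as follows. This is one plausible way to do it, assuming the last two digits are months and the leading digits are years, as the 1011 example indicates; it is not necessarily how Age mo was actually built. The name `age_to_months` is hypothetical.

```python
def age_to_months(age_code):
    """Convert an AGE code such as 1011 (10 years, 11 months) to months."""
    years, months = divmod(age_code, 100)
    return 12 * years + months

print(age_to_months(1011))  # 131
```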
6. One of the key analyses is a regression of finger-wrist tapping
speed on age in months (bottom right corner of page 710). Let's
replicate their regression and check its validity. Use
Analyze--Fit Y by X, putting right-hand finger-wrist tapping in Y
(be sure to use the correct variable) and Age mo in X. In addition,
put group.recode in the By box. This results in a separate
regression for each group.
a) NOT HANDED IN. You should get results very similar to those in
the paper. If not, you did something wrong!
b) HANDED IN. Examine the scatter plot. Are there any patterns
(e.g., curvilinear relationships) that cause you to worry about the
validity of the regression lines as a way to summarize the trends in
the data? Or, do the regression lines do a reasonable job of fitting
the data? Justify your answer in at most three sentences.
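The separate regressions that JMP produces with the By box can be mimicked with a short least-squares sketch in Python. The rows below are made-up, deliberately tidy values (not data from lead.jmp), and the names `fit_line` and `rows` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y on x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up (group, age in months, taps) rows; NOT values from lead.jmp.
rows = [
    ("L", 60, 30), ("L", 72, 34), ("L", 84, 38),
    ("H", 60, 27), ("H", 72, 31), ("H", 84, 35),
]

# Fit one line per group, as the By box does in JMP.
fits = {}
for label in ("L", "H"):
    xs = [age for g, age, _ in rows if g == label]
    ys = [taps for g, _, taps in rows if g == label]
    fits[label] = fit_line(xs, ys)

for label, (intercept, slope) in fits.items():
    print(f"{label}: taps = {intercept:.2f} + {slope:.3f} * age_months")
```

A fitted line is only a summary; the scatter plot still has to be inspected for curvature or unusual points, which is the point of part b.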
7. For this question, assume any patterns you noticed in Question 6
arose by random chance, so that the regression model reasonably fits
the data. On page 711, Landrigan et al. state, "To
adjust these data for age, a regression of dominant-hand finger-wrist
tap data against age was plotted for each group (see figure); the
slopes of the resulting lines are nearly parallel."
a) NOT HANDED IN. Verify that the regression lines are
nearly parallel.
Parallel regression lines have the same slopes with possibly different
intercepts.
b) HANDED IN: Since the lines are parallel, what is the
difference in predicted average tapping speed
between a kid in the low lead group and a kid of the same age in the
high lead group? Assume the kids in question have ages within
the range of ages in the data.
c) HANDED IN: If the lines were not parallel, describe (in
two sentences) how the answer you got in part b might not be correct
for all ages.
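In symbols, writing $x$ for age in months, the reasoning behind parts b and c can be summarized as follows. The symbols $a_L$, $a_H$, $b_L$, $b_H$ (intercepts and slopes of the low and high lead lines) are notation introduced here for illustration, not notation from the paper.

```latex
\begin{align*}
\hat{y}_L(x) &= a_L + b_L x, \qquad \hat{y}_H(x) = a_H + b_H x, \\
\hat{y}_L(x) - \hat{y}_H(x) &= (a_L - a_H) + (b_L - b_H)\,x .
\end{align*}
```

When the slopes are equal, the second term vanishes and the predicted difference is the constant $a_L - a_H$ at every age. When the slopes differ, the difference changes with $x$, so no single number describes it for all ages.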
Code Book for lead.jmp
ID: person ID number
AREA: Residence on Aug. 1972
1 = 0-1 miles from smelter
2 = 1-2.5 miles
3 = 2.5-4.1 miles
AGE: 1011 = 10 years, 11 months
SEX: 1 = male, 2 = female
IQ TEST RESULTS
INFO - information subtest in WISC and WPPSI
COMP - comprehension subtest in WISC and WPPSI
AR - arithmetic subtest in WISC and WPPSI
DS - digit span subtest (WISC) and sentence completion (WPPSI)
V/RAW - raw score/verbal IQ
PC - picture completion subtest in WISC and WPPSI
BD - block design subtest in WISC and WPPSI
OA - object assembly subtest (WISC), animal house subtest (WPPSI)
COD - coding subtest (WISC), geometric design subtest (WPPSI)
P/RAW - raw score/performance subtest
HH/INDEX - Hollingshead index of social status
IQV - verbal IQ
IQP - performance IQ
IQF - full scale IQ (not sum or average of IQV and IQP)
TYPE OF IQ TEST: 1 = WISC (usually given to children GT 5 years),
2 = WPPSI (usually given to children LE 5 years of age)
GROUP - Blood lead level group
1 = blood lead levels below 40 micrograms/100ML in both 1972 and 1973
2 = blood lead levels GE 40 micrograms/100ML in both 1972 and 1973, or
GE 40 micrograms/100ML in 1973 alone (3 cases only)
3 = blood lead levels GE 40 micrograms/100ML in 1972 and LT 40 in 1973
LD72 - blood lead values in 1972 (micrograms/100ML), MISSING=99
LD73 - blood lead values in 1973 (micrograms/100ML)
FST2YRS - did child live for 1st 2 years within 1 mile of smelter
TOTYRS - total number of years spent within 4.1 miles of smelter
SYMPTOM DATA (AS REPORTED BY PARENTS): 1 = Yes, 2 = No
PICA
COLIC
CLUMSINESS
IRRITABILITY
CONVULSIONS
NEUROLOGICAL TEST DATA
Note: MISSING DATA (-1 or 99).
TAPS/RIGHT - # of taps for right hand in the 2-plate tapping test
(# taps in one 10 second trial)
TAPS/LEFT - # of taps for left hand in the 2-plate tapping test
(# taps in one 10 second trial)
REACTION/RIGHT - visual reaction time right hand (milliseconds)
REACTION/LEFT - visual reaction time left hand (milliseconds)
AUDITORY/RIGHT - auditory reaction time right hand (milliseconds)
AUDITORY/LEFT - auditory reaction time left hand (milliseconds)
FINGER/RIGHT - finger-wrist tapping test right hand (taps in one
10 second trial)
FINGER/LEFT - finger-wrist tapping test left hand (taps in one
10 second trial)
WWPS - Werry-Weiss-Peters Scale for hyperactivity:
0 = no activity . . . . 4 = severely hyperactive (parent reports)
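The code book marks missing values with the codes -1 or 99 for the neurological test data (and 99 for LD72). Left in place, those codes would be averaged as if they were real measurements. Below is a minimal Python sketch of screening them out; the names `MISSING_CODES` and `drop_missing_codes` are hypothetical and the values are made up, not taken from lead.jmp. Apply this only to variables the code book flags, since 99 can be a legitimate value elsewhere (an IQ score of 99, for instance).

```python
MISSING_CODES = (-1, 99)

def drop_missing_codes(values, missing_codes=MISSING_CODES):
    """Replace missing-value codes with None so they are not analyzed."""
    return [None if v in missing_codes else v for v in values]

# Made-up tapping counts for illustration only.
taps_right = [42, -1, 55, 99, 61]
cleaned = drop_missing_codes(taps_right)

# Average only the observed values.
observed = [v for v in cleaned if v is not None]
print(cleaned)
print(sum(observed) / len(observed))
```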