Statistics 101
Data Analysis and Statistical
Inference
Instructions for lab 6
Lab Objective
In this lab, we analyze an observational study with the methods we've
covered so far.
Lab Procedures
These days, it is widely understood that mothers who smoke during
pregnancy risk exposing their babies to many health problems. This
was not common knowledge forty years ago. One of the first studies
that addressed the issue of pregnancy and smoking was the Child Health
and Development Studies, a comprehensive study of all babies born
between 1960 and 1967 at the Kaiser Foundation Hospital in Oakland, CA.
The original reference for the study is Yerushalmy (1964, American
Journal of Obstetrics and Gynecology, pp. 505-518). The data
and a summary of the study are in Nolan and Speed (2000, Stat Labs, Chapter
10) and can be found at the web site
http://www.stat.berkeley.edu/users/statlabs.
You can download the data by clicking on this
link.
There were about 15,000 families in the study. We will analyze a
subset of the data, in particular 1236 male single births where the baby
lived at least 28 days. The researchers interviewed mothers early
in their pregnancy to collect information on socioeconomic and
demographic characteristics, including an indicator of whether the
mother smoked during pregnancy. This is an observational study,
because mothers decided whether or not to smoke during pregnancy; there
was no random assignment to smoke or not to smoke. The variables
in the dataset are described in the code book at the end of these
instructions.
The Surgeon General's Report (1989) makes two assertions about smoking
and pregnancy:
1) Mothers who smoke have increased rates of premature delivery
(before 270 days).
2) The newborns of mothers who smoke have smaller birth weights
at every gestational age (number of days into pregnancy when child is
born).
Let's analyze the data to see if they support the Surgeon General's
assertions.
To simplify analyses, we'll compare babies whose mothers smoke
to babies whose mothers have never smoked. The data file you have
access to has only these people, although there were other types of
smokers in the original dataset.
Questions:
Recall that in an observational study, the treatment groups should be
closely balanced on causally-relevant background variables to mitigate
the effect of confounding. Many background variables affect
gestation length and birth weight. Some of these variables are in
the dataset, others are not. As a first step in analyzing this
observational study, as when analyzing any observational study, we
compare the distributions of the available background variables for the
smokers and nonsmokers.
To do this efficiently, use Fit-Y-by-X and enter all the
background characteristics into Y and "smoke" into X. You
can highlight all the relevant background characteristics
simultaneously, and, with one click on Y, make them Y
variables. Remember, we compare the balance of background
characteristics, not outcome variables.
1. For the continuous variables, examine the box plots, means,
and standard deviations by clicking on the red arrow next to "Oneway
Analysis" and selecting Quantiles and Means and Std. Dev.
Describe succinctly any substantial differences you see in the
distributions of the background variables for the smoking and
non-smoking groups. Report the means and SDs as support for your
comparisons. You only need to report means and SDs for variables
that differ substantially in the two groups. For other variables, just
say, "The following variables were similarly distributed in the smokers
and non-smokers:" and list the variables.
Data Analysis Tip: Here is a
guideline for deciding whether the two groups' means of some background
variable are "close enough" so as not to be too imbalanced.
Compute the difference in the means, then divide by the average of
their standard deviations. If this quantity is around 0.10 or
less, the means are pretty close (within 10% of a "combined SD" for the
two groups).
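The guideline in this Tip can be sketched in a few lines of Python for anyone who wants to check the arithmetic outside JMP. This is only an illustration: the function name and the ages below are made up and are not values from the lab dataset.

```python
from statistics import mean, stdev

def standardized_difference(group_a, group_b):
    """Difference in means divided by the average of the two group SDs."""
    mean_diff = mean(group_a) - mean(group_b)
    avg_sd = (stdev(group_a) + stdev(group_b)) / 2  # stdev = sample SD
    return mean_diff / avg_sd

# Made-up illustrative values (NOT from the lab dataset): mothers' ages in years.
smokers_age = [24, 27, 31, 22, 29, 35]
nonsmokers_age = [25, 28, 30, 23, 30, 34]

d = standardized_difference(smokers_age, nonsmokers_age)
print(f"standardized difference = {d:.3f}")
# The Tip's rule of thumb: around 0.10 or less (in absolute value) is close.
print("means are close" if abs(d) <= 0.10 else "means may be imbalanced")
```

The sign of the result only tells you which group has the larger mean; compare its absolute value to 0.10.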
2. For the categorical variables, compare the percentages of
people in each category for smokers and non-smokers. To do this
for each variable, click on the red arrow next to "Contingency Table."
Uncheck everything until the only remaining checked item is "Row
%." This leaves you with the percentages of people in each
category for both smokers and nonsmokers. Mention any big
differences you find (if there are any), providing numerical evidence to
support your comparisons. Again, you only need to report
percentages for variables that differ substantially in the two groups.
For other variables, just list their names and say they were
similarly distributed among smokers and non-smokers. Be wary of
the denominators of each percentage when comparing them (e.g., a
percentage of 1/1 for smokers versus 0/1 for non-smokers is a 100%
difference, but in reality a difference of 1 person isn't going to
affect the comparisons of the two groups' outcomes that much).
Data Analysis Tip: As a rough
guideline for comparing percentages, you can use the same method
described in the previous Tip. To calculate the standard
deviation for each percentage, use SD = square root{ p (1 - p) },
where p is the percentage written as a proportion (e.g., 0.25 for 25%).
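As a rough sketch of this Tip, the same calculation for percentages might look like the following in Python. The function names and the 30% and 25% figures are hypothetical, chosen only to show the arithmetic.

```python
from math import sqrt

def proportion_sd(p):
    """SD for a proportion p (between 0 and 1): sqrt(p * (1 - p))."""
    return sqrt(p * (1 - p))

def standardized_difference_in_proportions(p_a, p_b):
    """Difference in proportions divided by the average of their SDs.

    Undefined (division by zero) if both proportions are exactly 0 or 1.
    """
    avg_sd = (proportion_sd(p_a) + proportion_sd(p_b)) / 2
    return (p_a - p_b) / avg_sd

# Made-up illustrative values (NOT from the lab dataset):
# 30% of smokers versus 25% of nonsmokers fall in some category.
d = standardized_difference_in_proportions(0.30, 0.25)
print(f"standardized difference = {d:.3f}")
```

Remember to enter percentages as proportions (0.30, not 30), and keep an eye on the denominators behind each percentage.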
Steps 1 and 2 are required steps when analyzing any observational
study. It's tempting to skip these basic checks (they can be
somewhat tedious when there are lots of background variables), but if
you do, you are liable to arrive at incorrect conclusions without
suspecting anything is wrong. Don't skip them when you analyze
data in your research!
When the background characteristics in the groups differ substantially,
you can improve the balance by matching treated and control
observations. But matching won't help much here. The sizes
of the two groups are similar (479 smokers and 539 nonsmokers), so that
any matching scheme would result in almost the entire data set being
selected (only 54 nonsmokers would be unmatched). Since the
distributions of the variables for smokers and nonsmokers aren't
dramatically different, we can feel reasonably secure that comparisons
of the smokers and nonsmokers in the entire data set are valid for
checking the assertions of the Surgeon General.
3. Analysis of premature births. A premature birth
is defined as one that occurs before a gestational age of 270 days.
Compare the percentage of premature births for mothers who smoked
during pregnancy and mothers who did not smoke during pregnancy.
Do the data provide evidence supporting the Surgeon General's
assertions? Write a concise answer, reporting also the total # of
premature births / total # of births in each group as evidence.
These percentages are based only on samples of
people, so they are subject to chance error. We'll learn in
Chapter 26 how to determine whether the difference in the
percentages could plausibly be explained by chance error.
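If you want to double-check the counts you read off in JMP, the tally for one group could be sketched as below. The gestation lengths are invented for illustration, and using None for a missing gestation length is an assumption of this sketch, not something the code book specifies.

```python
def premature_summary(gestation_days, cutoff=270):
    """Return (premature count, total count, proportion premature).

    Births with a missing gestation length (None here, an assumption)
    are excluded from both the numerator and the denominator.
    """
    known = [g for g in gestation_days if g is not None]
    n_premature = sum(1 for g in known if g < cutoff)  # strictly before 270 days
    return n_premature, len(known), n_premature / len(known)

# Made-up gestation lengths in days (NOT from the lab dataset).
smokers = [262, 281, 275, 268, 290, None]
nonsmokers = [279, 284, 265, 291, 288]

for label, days in [("smokers", smokers), ("nonsmokers", nonsmokers)]:
    count, total, proportion = premature_summary(days)
    print(f"{label}: {count}/{total} premature ({proportion:.0%})")
```

Reporting the count and the total alongside the percentage, as the question asks, lets the reader judge how much data stands behind each figure.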
4. Analysis of birth weights at gestational ages.
The Surgeon General claims that newborns of mothers who smoke have
smaller birth weights at every gestational age (number of days into
pregnancy when child is born). Perform a statistical analysis
that allows you to assess this question. Report your conclusions,
answering the question, "Do the data support the Surgeon General's
assertion?" Write a paragraph or two describing your analyses and
conclusions. Include relevant numerical or graphical evidence to
support, or cast doubt on, the Surgeon General's claims. Examine
the sensitivity of your conclusions to the effects of outliers, and
consider that there may be some gestational ages for which the data
don't provide enough evidence to make claims either way.
(This type of problem mirrors real life--no one tells you exactly how
to analyze the data. Think about it for a while before asking the
TAs for advice. Of course, if after a while you haven't thought of
a good approach, you can ask your TA.)
EXTREMELY IMPORTANT!!!!
When you report the results of an observational study like this one to
a journal or to some policy-making body, it is crucial to inform your
audience that there may be causally-relevant background characteristics
not in the dataset that are not balanced in the two groups.
Whenever possible, you also should suggest examples of such
variables (e.g., use of drugs, alcohol, caloric intake, health
condition of mother). This is the ethical thing to do, even if it
results in your analyses taking criticism. Telling the truth
about the limitations of a study does more good for society than does
hiding or not reporting them, which could lead to bad policy that
ultimately hurts people.
Code Book
Variable Description
Id
id number
birth
birth date where 1096 = January 1, 1961
gestation
length of gestation in days
bwt
birth weight in ounces (999 = unknown)
parity
total number of previous pregnancies, including fetal deaths and
stillbirths
99 = unknown
mrace
mother's race or ethnicity
0-5=white
6=mexican
7=black
8=asian
9=mix
99=unknown
mage
mother's age in years at termination of pregnancy
med
mother's education
0 = less than 8th grade
1 = 8th to 12th grade, did not graduate high school
2 = high school graduate, no other schooling
3 = high school graduate + trade school
4 = high school graduate + some college
5 = college graduate
6,7 = trade school but unclear if graduated from high school
9 = unknown
mht
mother's height in inches
mwt
mother's pre-pregnancy weight in pounds
drace
father's race or ethnicity
0-5 = white
6 = mexican
7 = black
8 = asian
9 = mix
dage
father's age in years at termination of pregnancy
ded
father's education
0 = less than 8th grade
1 = 8th to 12th grade, did not graduate high school
2 = high school graduate, no other schooling
3 = high school graduate + trade school
4 = high school graduate + some college
5 = college graduate
6,7 = trade school but unclear if graduated from high school
9 = unknown
dht
father's height
dwt
father's pre-pregnancy weight in pounds
marital
marital status of mother
1 = married
2 = legally separated
3 = divorced
4 = widowed
5 = never married
income
family yearly income in 2500 increments
0 = under 2500
1 = 2500-4999
...
9 = 15000+
98 = unknown
99 = not asked
smoke
does mother smoke?
0 = never
1 = smokes now
2 = until preg
3 = once did, not now
time
If mother quit, how long ago did she quit?
0 = never smoked,
1 = still smokes,
2 = quit during pregnancy,
3 = up to 1 yr ago,
4 = up to 2 yr ago,
5 = up to 3 yr ago,
6 = up to 4 yr ago,
7 = 5 to 9yr ago,
8 = 10+yr ago,
9 = quit and don't know,
98 = unknown
number
number of cigs smoked a day for past and current smokers
0 = never smoked
1 = 1-4
2 = 5-9
3 = 10-14
4 = 15-19
5 = 20-29
6 = 30-39
7 = 40-60
8 = 60+
9 = smoke but don't know
Premature
1 = baby born before gestational age of 270 days
0 = otherwise
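Several variables in the code book use a number to mean "unknown" or "not asked" (for example, 999 for bwt). Those codes should not be averaged as if they were real values. A minimal Python sketch of screening them out follows; the dictionary lists only the codes the code book explicitly labels unknown or not asked, and the function name and sample record are hypothetical.

```python
# Codes labeled "unknown" or "not asked" in the code book above.
UNKNOWN_CODES = {
    "bwt": {999},
    "parity": {99},
    "mrace": {99},
    "med": {9},
    "ded": {9},
    "income": {98, 99},
    "time": {98},
}

def clean_record(record):
    """Return a copy of one record with unknown codes replaced by None."""
    return {
        name: (None if value in UNKNOWN_CODES.get(name, set()) else value)
        for name, value in record.items()
    }

# Made-up record (NOT a row from the lab dataset).
raw = {"gestation": 284, "bwt": 999, "parity": 99, "med": 2, "income": 3}
cleaned = clean_record(raw)
print(cleaned)
```

Variables not listed in the dictionary pass through unchanged, so check the code book again if you add variables to an analysis.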