Statistics 121
Data Analysis for Undergraduate Research
Instructions for Project 3
These days, it is widely understood that mothers who smoke during
pregnancy risk exposing their babies to many health problems. This
was not common knowledge forty years ago. One of the first studies
to address the issue of pregnancy and smoking was the Child Health
and Development Studies, a comprehensive study of all babies born
between 1960 and 1967 at the Kaiser Foundation Hospital in Oakland, CA.
The original reference for the study is Yerushalmy (1964, American
Journal of Obstetrics and Gynecology, pp. 505-518). The data and a
summary of the study are in Nolan and Speed (2000, Stat Labs,
Chapter 10) and can be found at the web site
http://www.stat.berkeley.edu/users/statlabs.
You can download the data from the Blackboard web site for STA 121.
There were about 15,000 families in the study. We will analyze a
subset of the data, in particular 1,236 male single births where the
baby lived at least 28 days. The researchers interviewed mothers early
in their pregnancy to collect information on socioeconomic and
demographic characteristics, including an indicator of whether the
mother smoked during pregnancy. This is an observational study,
because mothers decided whether or not to smoke during pregnancy; there
was no random assignment to smoke or not to smoke. The variables
in the dataset are described in the code book at the end of these
instructions.
The Surgeon General's Report (1989) makes two assertions about smoking
and pregnancy:
1) Mothers who smoke have increased rates of premature delivery
(before 270 days).
2) The newborns of mothers who smoke have smaller birth weights
at every gestational age (number of days into pregnancy when the child
is born).
Let's analyze the data to see if they support the Surgeon General's
second assertion. We'll learn models for addressing the first
assertion in Chapter 21. One way to address the second assertion is
with a multiple regression, focusing on whether there is a relationship
between smoking (smoke) and birth weights (bwt) after controlling for
gestational age (gestation) and other relevant background
characteristics. To simplify analyses, we'll compare babies whose
mothers smoke to babies whose mothers have never smoked. The data file
you have access to has only these mothers, although there were other
types of smokers in the original dataset.
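As a rough illustration of what "controlling for gestational age" means in a multiple regression, here is a minimal Python sketch using NumPy. The function name fit_ols and the data values are made up for illustration; they are not from the study, and the course software may present the same fit differently.

```python
import numpy as np

def fit_ols(y, predictors):
    """Least-squares fit of y on an intercept plus the given predictor columns.

    Returns the coefficients in the order: intercept, then one per predictor.
    """
    y = np.asarray(y, dtype=float)
    columns = [np.ones(len(y))] + [np.asarray(p, dtype=float) for p in predictors]
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Made-up illustrative values, NOT taken from the study data.
bwt = [120, 113, 128, 108, 136, 115, 125, 110]        # birth weight, ounces
gestation = [284, 282, 279, 270, 286, 275, 290, 268]  # days
smoke = [0, 1, 0, 1, 0, 1, 0, 1]                      # 0 = never, 1 = smokes now

intercept, b_gestation, b_smoke = fit_ols(bwt, [gestation, smoke])
# b_smoke estimates the difference in mean birth weight between smokers and
# never-smokers at the same gestational age.
print(f"smoke coefficient: {b_smoke:.2f} ounces, holding gestation fixed")
```

In a real analysis you would add the other relevant background variables as further predictor columns.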
What to turn in
Write a paper of up to five pages (one-sided and double-spaced)
describing your analyses. You should include the following in your
report:
1) Your final model, including estimated coefficients, their estimated
standard errors, and p-values/CIs associated with each
coefficient. Also include R-squared and the estimate of the RMSE.
2) Interpretations of the results from the final model phrased in ways
that people with no training in statistics can understand. Make sure
you address the Surgeon General's assertion and describe the
relationship between birth weight and smoking.
3) A summary of the model checks you performed. Include
explanations of any transformations made or examined, summaries of your
checks of the regression assumptions, and justifications that the
results are not overly influenced by individual points (or warnings
that they are).
DO NOT HAND IN LOTS OF GRAPHICAL DISPLAYS. Instead, summarize
them in a succinct and informative manner.
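The quantities requested above can be computed directly from a fitted regression. The sketch below is one way to do it in Python with NumPy, under two assumptions: RMSE is taken to be the square root of SSE / (n - p), the usual regression definition, and leverage and Cook's distance are used as the influence measures. The function name ols_summary and the data are hypothetical. P-values and confidence intervals additionally require t-distribution quantiles, which are not computed here.

```python
import numpy as np

def ols_summary(X, y):
    """Summaries for a least-squares fit; X must already include an intercept column."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    coef = xtx_inv @ X.T @ y
    resid = y - X @ coef
    sse = float(resid @ resid)
    s2 = sse / (n - p)                         # residual variance estimate
    std_err = np.sqrt(s2 * np.diag(xtx_inv))   # standard error of each coefficient
    sst = float(((y - y.mean()) ** 2).sum())
    # Leverage h_i is the i-th diagonal element of the hat matrix X (X'X)^-1 X'.
    leverage = np.einsum("ij,jk,ik->i", X, xtx_inv, X)
    # Cook's distance combines residual size and leverage for each point.
    cooks = (resid ** 2 / (p * s2)) * leverage / (1 - leverage) ** 2
    return {
        "coef": coef,
        "std_err": std_err,
        "t_stat": coef / std_err,
        "r_squared": 1 - sse / sst,
        "rmse": float(np.sqrt(s2)),
        "leverage": leverage,
        "cooks_distance": cooks,
    }

# Made-up illustrative values, NOT taken from the study data.
x = [0, 1, 2, 3]
y = [1, 3, 2, 5]
X = np.column_stack([np.ones(len(x)), x])
summary = ols_summary(X, y)
print(summary["coef"], summary["r_squared"], summary["rmse"])
```

Points whose Cook's distance stands far above the rest are candidates for refitting the model without them and comparing the results.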
Some advice on analyses
In any observational study, the treatment groups should be
closely balanced on causally-relevant background variables to mitigate
the effect of confounding. Many background variables affect
smoking status, gestation length and birth weight. Some of these
variables are in the dataset, others are not. As a first step in
analyzing this observational study, as when analyzing any observational
study, compare the distributions of the available background variables
for the smokers and nonsmokers. Describe succinctly any substantial
differences you see in the distributions of the background variables
for the smoking and non-smoking groups. Report the means and SDs in a
table as support for your comparisons.
Data Analysis Tip: Here is a guideline for deciding whether the two
groups' means of some background variable are "close enough" so as not
to be too imbalanced. Compute the difference in the means, then divide
by the average of their standard deviations. If the absolute value of
this quantity is around 0.10 or less, the means are pretty close
(within 10% of a "combined SD" for the two groups).
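The guideline in the tip can be written as a short Python function. This is a minimal sketch using only the standard library; the function name and the example ages are made up for illustration and are not taken from the study data.

```python
from statistics import mean, stdev

def standardized_difference(group_a, group_b):
    """Difference in group means divided by the average of the two group SDs."""
    average_sd = (stdev(group_a) + stdev(group_b)) / 2
    return (mean(group_a) - mean(group_b)) / average_sd

# Made-up mother's ages (years) for two small groups, for illustration only.
smokers_mage = [24, 27, 31, 22, 29, 26]
nonsmokers_mage = [25, 28, 30, 23, 33, 27]

d = standardized_difference(smokers_mage, nonsmokers_mage)
# Compare the absolute value of d with the 0.10 guideline.
print(f"standardized difference: {d:.2f}")
```

The sign of the result only indicates which group has the larger mean, so it is the absolute value that is compared with 0.10.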
For the categorical variables, compare the percentages of people in
each category for smokers and non-smokers. Report the percentages for
the categories. Be wary of the denominators of each percentage when
comparing them, e.g., a percentage of 1/1 for smokers versus 0/1 for
non-smokers is a 100% difference, but in reality a difference of 1
person isn't going to affect the comparisons of the two groups'
outcomes that much.
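One way to keep the denominators in view is to report the count alongside each percentage. The sketch below does this in plain Python; the function name and the marital-status codes shown are made-up illustrative values, not taken from the study data.

```python
from collections import Counter

def category_percentages(values):
    """Map each category to (count, percentage of the group), sorted by category."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100 * n / total) for cat, n in sorted(counts.items())}

# Made-up marital codes (1 = married, 5 = never married), for illustration only.
smokers_marital = [1, 1, 1, 5, 1, 1, 1, 1]
nonsmokers_marital = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

for label, group in [("smokers", smokers_marital), ("nonsmokers", nonsmokers_marital)]:
    for cat, (n, pct) in category_percentages(group).items():
        print(f"{label}: category {cat}: {n}/{len(group)} = {pct:.1f}%")
```

Printing "n/total" next to each percentage makes it obvious when a large percentage difference rests on only one or two people.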
Steps 1 and 2 are required steps when analyzing any observational
study. It's tempting to skip these basic checks (they can be
tedious when there are lots of background variables), but when
you do, you are liable to arrive at incorrect conclusions without
suspecting anything is wrong. Don't skip them when you analyze
data in your research!
When the background characteristics in the groups differ substantially,
you can improve the balance by matching treated and control
observations. But matching won't help much here. The sizes
of the two groups are similar (479 smokers and 539 nonsmokers), so
any matching scheme would result in almost the entire data set being
selected (only 54 nonsmokers would be unmatched). Since the
distributions of the variables for smokers and nonsmokers aren't
dramatically different, we can feel reasonably secure that comparisons
of the smokers and nonsmokers in the entire data set are valid for
checking the assertions of the Surgeon General.
EXTREMELY IMPORTANT!!!!
When you report the results of an observational study like this one to
a journal or to some policy-making body, it is crucial to inform your
audience that there may be causally-relevant background characteristics
not in the dataset that are not balanced in the two groups.
Whenever possible, you also should suggest examples of such
variables (e.g., use of drugs, alcohol, caloric intake, health
condition of mother). This is the ethical thing to do, even if it
results in your analyses taking criticism. Telling the truth
about the limitations of a study does more good for society than does
hiding or not reporting them, which could lead to bad policy that
ultimately hurts people.
Code Book

Variable    Description
Id          id number
birth       birth date, where 1096 = January 1, 1961
gestation   length of gestation in days
bwt         birth weight in ounces (999 = unknown)
parity      total number of previous pregnancies, including fetal
            deaths and stillbirths (99 = unknown)
mage        mother's age in years at termination of pregnancy
med         mother's education
              0 = less than 8th grade
              1 = 8th to 12th grade, did not graduate high school
              2 = high school graduate, no other schooling
              3 = high school graduate + trade school
              4 = high school graduate + some college
              5 = college graduate
              6,7 = trade school but unclear if graduated from high school
              9 = unknown
mht         mother's height in inches
mwt         mother's pre-pregnancy weight in pounds
drace       father's race or ethnicity
              0-5 = white
              6 = mexican
              7 = black
              8 = asian
              9 = mix
dage        father's age in years at termination of pregnancy
ded         father's education
              0 = less than 8th grade
              1 = 8th to 12th grade, did not graduate high school
              2 = high school graduate, no other schooling
              3 = high school graduate + trade school
              4 = high school graduate + some college
              5 = college graduate
              6,7 = trade school but unclear if graduated from high school
              9 = unknown
dht         father's height
dwt         father's pre-pregnancy weight in pounds
marital     marital status of mother
              1 = married
              2 = legally separated
              3 = divorced
              4 = widowed
              5 = never married
income      family yearly income in 2500 increments
              0 = under 2500, 1 = 2500-4999, ..., 9 = 15000+
              98 = unknown, 99 = not asked
smoke       does mother smoke?
              0 = never
              1 = smokes now
number      number of cigarettes smoked a day for past and current smokers
              0 = never smoked
              1 = 1-4
              2 = 5-9
              3 = 10-14
              4 = 15-19
              5 = 20-29
              6 = 30-39
              7 = 40-60
              8 = 60+
              9 = smoke but don't know
Premature   = 1 if baby born before gestational age of 270 days,
            and = 0 otherwise
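Several variables in the code book use numeric codes such as 999 or 99 for unknown values. If those codes are left in place they will be treated as real measurements and distort means and regression fits. The sketch below shows one way to recode them as missing in Python. It covers only the codes listed in the code book above; the names MISSING_CODES, clean_row and is_premature are hypothetical, each record is assumed to be a dictionary keyed by variable name, and the example row is made up.

```python
# Codes that the code book marks as unknown or not asked.
MISSING_CODES = {
    "bwt": {999},
    "parity": {99},
    "med": {9},
    "ded": {9},
    "income": {98, 99},
    "number": {9},   # 9 = smokes, but the amount is not known
}

def clean_row(row):
    """Return a copy of one record with coded-unknown values replaced by None."""
    return {
        name: (None if value in MISSING_CODES.get(name, ()) else value)
        for name, value in row.items()
    }

def is_premature(gestation_days):
    """1 if the baby was born before a gestational age of 270 days, else 0."""
    return int(gestation_days < 270)

# Made-up record for illustration only.
row = {"bwt": 999, "gestation": 265, "parity": 1, "income": 98, "smoke": 1}
cleaned = clean_row(row)
print(cleaned, is_premature(cleaned["gestation"]))
```

Variables with no unknown code listed in the code book are passed through unchanged.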