Statistics 101
Data Analysis and Statistical Inference
Instructions for lab 7
Lab Objective
In this lab, we analyze an observational study with the methods we've covered
so far.
Lab Procedures
These days, it is widely understood that mothers who smoke during pregnancy
risk exposing their babies to many health problems. This was not common
knowledge forty years ago. One of the first studies that addressed
the issue of pregnancy and smoking was the Child Health and Development Studies,
a comprehensive study of all babies born between 1960 and 1967 at the Kaiser
Foundation Hospital in Oakland, CA. The original reference for the
study is Yerushalmy (1964, American Journal of Obstetrics and Gynecology,
pp. 505-518). The data and a summary of the study are in Nolan
and Speed (2000, Stat Labs, Chapter 10) and can be found at the web
site http://www.stat.berkeley.edu/users/statlabs.
You can download the data by clicking on this link.
There were about 15,000 families in the study. We will analyze a subset
of the data, in particular 1236 male single births where the baby lived at
least 28 days. The researchers interviewed mothers early in their pregnancy
to collect information on socioeconomic and demographic characteristics,
including an indicator of whether the mother smoked during pregnancy.
This is an observational study, because mothers decided whether or not to
smoke during pregnancy; there was no random assignment to smoke or not to
smoke. The variables in the dataset are described in the code book
at the end of these instructions.
The Surgeon General's Report (1989) makes two assertions about smoking and
pregnancy:
1) Mothers who smoke have increased rates of premature delivery (before 270
days).
2) The newborns of mothers who smoke have smaller birth weights at every
gestational age (number of days into pregnancy when child is born).
Let's analyze the data to see if they support the Surgeon General's assertions.
To simplify analyses, we'll compare babies whose mothers smoke to
babies whose mothers have never smoked. The data file you have access
to has only these people, although there were other types of smokers in the
original dataset.
Questions:
Recall that in an observational study, the treatment groups should be closely
balanced on causally-relevant background variables to mitigate the effect
of confounding. Many background variables affect gestation length and
birth weight. Some of these variables are in the dataset, others are
not. Let's compare the distributions of the available background variables
for the smokers and nonsmokers.
To do this efficiently, use Fit-Y-by-X and enter all the background
characteristics into Y and "smoke" into X. You can highlight
all the relevant background characteristics simultaneously, and, with one
click on Y, make them Y variables.
1. For the continuous variables, examine the box plots, means, and
standard deviations by clicking on the red arrow next to "Oneway Analysis"
and selecting "Quantiles" and "Means and Std. Dev." For
each variable, describe in a sentence or two any big differences you see
in the distributions (if there are any) for the smoking and non-smoking groups.
Report the means and SDs as support for your comparisons.
Data Analysis Tip: Here is a guideline
for deciding whether the two groups' means of some background variable are
"close enough" so as not to be too imbalanced. Compute the difference
in the means, then divide by the average of their standard deviations. If
this quantity is around 0.10 or less, the means are pretty close (within
10% of a "combined SD" for the two groups). You don't need to do this
for the lab, but it is a useful guideline for when you analyze observational
studies in the future.
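If you want to check this guideline outside JMP, the calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name standardized_difference is hypothetical, and the ages below are made-up values for mage, not the study data.

```python
from statistics import mean, stdev

def standardized_difference(smokers, nonsmokers):
    """Difference in means divided by the average of the two group SDs."""
    avg_sd = (stdev(smokers) + stdev(nonsmokers)) / 2
    return (mean(smokers) - mean(nonsmokers)) / avg_sd

# Made-up mother's ages (mage) for illustration only; NOT the study data.
mage_smokers = [25, 27, 29, 31]
mage_nonsmokers = [24, 26, 28, 30]

d = standardized_difference(mage_smokers, mage_nonsmokers)
print(f"standardized difference = {d:.2f}")

# By the guideline, a value around 0.10 or less (in absolute value)
# suggests the two group means are close.
print("close enough" if abs(d) <= 0.10 else "possibly imbalanced")
```

The sketch uses the sample standard deviation (stdev), which is a choice on our part; with groups as large as the ones in this lab, the population version would give nearly the same answer.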
2. For the categorical variables, compare the percentages of people
in each category for smokers and non-smokers. To do this for each variable,
click on the red arrow next to "Contingency Table." Uncheck everything
until the only remaining checked item is "Row %." This leaves you with
the percentages of people in each category for both smokers and nonsmokers.
Mention any big differences you find (if there are any), providing
numerical evidence to support your comparisons.
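The "Row %" table can also be reproduced by hand, which may help you see what JMP is computing. The following Python sketch is one possible way to do it; the helper row_percentages is a hypothetical name, and the values of smoke and med are made up for illustration, not taken from the study data.

```python
from collections import Counter

def row_percentages(groups, categories):
    """Percent of each category within each group (each row sums to 100)."""
    counts = Counter(zip(groups, categories))
    totals = Counter(groups)
    return {
        g: {c: 100 * counts[(g, c)] / totals[g]
            for (gg, c) in sorted(counts) if gg == g}
        for g in sorted(totals)
    }

# Made-up values for illustration only; NOT the study data.
# smoke: 0 = never, 1 = smokes now
# med:   2 = high school graduate, 5 = college graduate
smoke = [0, 0, 0, 0, 1, 1, 1, 1]
med   = [2, 2, 2, 5, 2, 2, 5, 5]

table = row_percentages(smoke, med)
for group, row in table.items():
    print(f"smoke = {group}: {row}")
```

Each row is divided by its own group total, so the percentages are comparable between smokers and nonsmokers even when the groups differ in size.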
Typically in observational studies, we match treated and control observations
to get better balance. But matching won't help much here. The sizes
of the two groups are similar (479 smokers and 539 nonsmokers), so that any
matching scheme would result in almost the entire data set being selected
(only 54 nonsmokers would be unmatched). Since the distributions of
the variables for smokers and nonsmokers aren't dramatically different, we
can compare smokers and nonsmokers in the entire data set to check the assertions
of the Surgeon General.
3. Analysis of premature births. A premature birth is
defined as one that occurs before a gestational age of 270 days. Compare
the percentage of premature births for mothers who smoked during pregnancy
and mothers who did not smoke during pregnancy. Do the data provide
evidence supporting the Surgeon General's assertions? Write a paragraph describing
your answer. For each group, report the total # of premature births / total
# of births as evidence.
These percentages are based only on samples of people,
so they are subject to chance error. We'll learn in Chapter 26
how to determine the probability that the difference in the percentages could
be plausibly explained by chance error.
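As a rough illustration of the "premature births / total births" calculation in question 3, here is a short Python sketch. It assumes the 270-day cutoff stated above; the function name premature_rate is hypothetical, and the gestation lengths are made-up values, not the study data.

```python
def premature_rate(gestation_days, cutoff=270):
    """Return (number premature, total births, percent premature)."""
    n_premature = sum(1 for days in gestation_days if days < cutoff)
    total = len(gestation_days)
    return n_premature, total, 100 * n_premature / total

# Made-up gestation lengths in days for illustration only; NOT the study data.
gestation_smokers = [262, 268, 275, 281, 290]
gestation_nonsmokers = [265, 272, 279, 284, 288]

for label, days in [("smokers", gestation_smokers),
                    ("nonsmokers", gestation_nonsmokers)]:
    n, total, pct = premature_rate(days)
    print(f"{label}: {n} / {total} = {pct:.1f}% premature")
```

Reporting the counts alongside the percentage, as the question asks, lets the reader see how many births each percentage is based on.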
4. Analysis of birth weights at gestational ages. The
Surgeon General claims that newborns of mothers who smoke have smaller birth
weights at every gestational age (number of days into pregnancy when child
is born). Perform a statistical analysis that allows you to assess
this claim. Report your conclusions, answering the question, "Do
the data support or not support the Surgeon General's assertion?" Write a
paragraph or two describing your analyses and conclusions. Include
relevant numerical evidence to support or cast doubt on the Surgeon General's
claims. Examine the sensitivity of your conclusions to the effects
of outliers, and consider that there may be some gestational ages for which
the data don't provide enough evidence to make claims either way.
(This type of problem mirrors real life--no one tells you exactly how to
analyze the data. Think about it for a while before asking the TAs
for advice. Of course, if after a while you haven't thought of a good
approach, you can ask your TA.)
EXTREMELY IMPORTANT!!!!
When you report the results of an observational study like this one to a
journal or to some policy-making body, it is crucial to inform your audience
that there may be causally-relevant background characteristics not in the
dataset that are not balanced in the two groups. Whenever possible,
you also should suggest examples of such variables (e.g., use of drugs,
alcohol, caloric intake, health condition of mother). This is the
ethical thing to do, even if it results in your analyses taking criticism.
Telling the truth about the limitations of a study does more good
for society than does hiding or not reporting them, which could lead to bad
policy that ultimately hurts people.
Code Book
Variable   Description
Id         id number
birth      birth date, where 1096 = January 1, 1961
gestation  length of gestation in days
bwt        birth weight in ounces (999 = unknown)
parity     total number of previous pregnancies, including fetal deaths and
           stillbirths (99 = unknown)
mrace      mother's race or ethnicity
           0-5 = white
           6 = mexican
           7 = black
           8 = asian
           9 = mix
           99 = unknown
mage       mother's age in years at termination of pregnancy
med        mother's education
           0 = less than 8th grade
           1 = 8th to 12th grade, did not graduate high school
           2 = high school graduate, no other schooling
           3 = high school graduate + trade school
           4 = high school graduate + some college
           5 = college graduate
           6,7 = trade school but unclear if graduated from high school
           9 = unknown
mht        mother's height in inches
mwt        mother's pre-pregnancy weight in pounds
drace      father's race or ethnicity
           0-5 = white
           6 = mexican
           7 = black
           8 = asian
           9 = mix
dage       father's age in years at termination of pregnancy
ded        father's education
           0 = less than 8th grade
           1 = 8th to 12th grade, did not graduate high school
           2 = high school graduate, no other schooling
           3 = high school graduate + trade school
           4 = high school graduate + some college
           5 = college graduate
           6,7 = trade school but unclear if graduated from high school
           9 = unknown
dht        father's height
dwt        father's weight in pounds
marital    marital status of mother
           1 = married
           2 = legally separated
           3 = divorced
           4 = widowed
           5 = never married
income     family yearly income in 2500 increments
           0 = under 2500, 1 = 2500-4999, ..., 9 = 15000+
           98 = unknown, 99 = not asked
smoke      does mother smoke?
           0 = never
           1 = smokes now
           2 = until preg
           3 = once did, not now
time       if mother quit, how long ago did she quit?
           0 = never smoked
           1 = still smokes
           2 = quit during pregnancy
           3 = up to 1 yr ago
           4 = up to 2 yr ago
           5 = up to 3 yr ago
           6 = up to 4 yr ago
           7 = 5 to 9 yr ago
           8 = 10+ yr ago
           9 = quit and don't know
           98 = unknown
number     number of cigarettes smoked a day for past and current smokers
           0 = never smoked
           1 = 1-4
           2 = 5-9
           3 = 10-14
           4 = 15-19
           5 = 20-29
           6 = 30-39
           7 = 40-60
           8 = 60+
           9 = smoke but don't know
Premature  1 = baby born before a gestational age of 270 days, 0 = otherwise
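Several variables in the code book use a numeric code such as 999 or 99 to mean "unknown." If such codes are left in, they are treated as real measurements and distort means and SDs. The Python sketch below shows one possible way to set them aside before summarizing. The names UNKNOWN_CODES and clean_value are hypothetical, the codes are copied from the code book, and the birth weights are made-up values, not the study data.

```python
# Codes meaning "unknown" or "not asked", copied from the code book.
UNKNOWN_CODES = {
    "bwt": {999},
    "parity": {99},
    "mrace": {99},
    "med": {9},
    "ded": {9},
    "income": {98, 99},
    "time": {98},
}

def clean_value(variable, value):
    """Return None when the value is an 'unknown' code for that variable."""
    return None if value in UNKNOWN_CODES.get(variable, set()) else value

# Made-up birth weights in ounces for illustration only; NOT the study data.
bwt_raw = [120, 999, 113, 128]
bwt_clean = [v for v in (clean_value("bwt", x) for x in bwt_raw)
             if v is not None]

print(sum(bwt_raw) / len(bwt_raw))      # inflated by the 999 code
print(sum(bwt_clean) / len(bwt_clean))  # mean of the known weights only
```

Variables without a listed "unknown" code, such as mage, pass through unchanged in this sketch.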