Statistics 121
Data Analysis for Undergraduate Research
 

Instructions for Project 3



These days, it is widely understood that mothers who smoke during pregnancy risk exposing their babies to many health problems.  This was not common knowledge forty years ago.  One of the first studies that addressed the issue of pregnancy and smoking was the Child Health and Development Studies, a comprehensive study of all babies born between 1960 and 1967 at the Kaiser Foundation Hospital in Oakland, CA.  The original reference for the study is Yerushalmy (1964, American Journal of Obstetrics and Gynecology, pp. 505-518).  The data and a summary of the study are in Nolan and Speed (2000, Stat Labs, Chapter 10) and can be found at the web site http://www.stat.berkeley.edu/users/statlabs.

You can download the data on the Blackboard web site for STA 121.

There were about 15,000 families in the study.  We will analyze a subset of the data, in particular 1,236 male single births where the baby lived at least 28 days.  The researchers interviewed mothers early in their pregnancy to collect information on socioeconomic and demographic characteristics, including an  indicator of whether the mother smoked during pregnancy.  This is an observational study, because mothers decided whether or not to smoke during pregnancy; there was no random assignment to smoke or not to smoke.  The variables in the dataset are described in the code book at the end of these instructions.

The Surgeon General's Report (1989) makes two assertions about smoking and pregnancy:

1)  Mothers who smoke have increased rates of premature delivery (before 270 days).  
2)  The newborns of mothers who smoke have smaller birth weights at every gestational age (number of days into pregnancy when child is born).

Let's analyze the data to see if they support the Surgeon General's second assertion.  We'll learn models for addressing the first assertion in Chapter 21.   One way to address the second assertion is with a multiple regression, focusing on whether there is a relationship between smoking (smoke) and birth weights (bwt) after controlling for gestational age (gestation) and other relevant background characteristics. 
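To make the idea of "controlling for" gestational age concrete, here is a minimal sketch of a least-squares fit of bwt on smoke and gestation, written in plain Python. The helper names (solve, fit_ols) and the tiny data set are made up for illustration and are not values from the study. In practice you would fit the model with your statistics package, which also reports p-values and confidence intervals; this sketch only shows where the coefficients, standard errors, R-squared, and RMSE come from.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy; a is not modified
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_ols(rows, y):
    """Least-squares fit of y on the predictors in rows; an intercept is added."""
    X = [[1.0] + list(r) for r in rows]
    n, p = len(X), len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)
    resid = [yi - sum(b * xi for b, xi in zip(beta, x)) for x, yi in zip(X, y)]
    sse = sum(e * e for e in resid)
    ybar = sum(y) / n
    sst = sum((yi - ybar) ** 2 for yi in y)
    rmse = math.sqrt(sse / (n - p))  # residual SD with n - p degrees of freedom
    # Standard error of coefficient j: rmse * sqrt(j-th diagonal entry of (X'X)^-1)
    se = []
    for j in range(p):
        unit = [1.0 if i == j else 0.0 for i in range(p)]
        se.append(rmse * math.sqrt(solve(xtx, unit)[j]))
    return beta, se, 1 - sse / sst, rmse

# Made-up (smoke, gestation) pairs and birth weights in ounces; illustration only.
rows = [(0, 284), (1, 282), (0, 279), (1, 286), (0, 270), (1, 275), (0, 290), (1, 268)]
bwt = [120, 113, 128, 108, 115, 110, 130, 100]

beta, se, r_squared, rmse = fit_ols(rows, bwt)
for name, b, s in zip(("intercept", "smoke", "gestation"), beta, se):
    print(f"{name:10s} estimate = {b:9.3f}   SE = {s:8.3f}")
print(f"R-squared = {r_squared:.3f}   RMSE = {rmse:.3f}")
```

The coefficient on smoke is the estimated difference in average birth weight between smokers and nonsmokers at the same gestational age, which is the comparison the second assertion is about.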

To simplify analyses, we'll compare babies whose mothers smoke to babies whose mothers have never smoked.  The data file you have access to has only these mothers, although there were other types of smokers in the original dataset.

What to turn in

Write a paper of up to five pages (one-sided and double-spaced) describing your analyses.  Include the following in your report: 

1) Your final model, including estimated coefficients, their estimated standard errors, and p-values/CIs associated with each coefficient.  Also include R-squared and the estimate of the RMSE.

2) Interpretations of the results from the final model, phrased so that people with no training in statistics can understand them.  Make sure you address the Surgeon General's assertion and describe the relationship between birth weight and smoking.

3) A summary of the model checks you performed.  Include explanations of any transformations made or examined, summaries of your checks of the regression assumptions, and justifications that the results are not overly influenced by individual points (or warnings that they are). 


DO NOT HAND IN LOTS OF GRAPHICAL DISPLAYS.  Instead, summarize them in a succinct and informative manner.

Some advice on analyses

In any observational study, the treatment groups should be closely balanced on causally-relevant background variables to mitigate the effect of confounding.  Many background variables affect smoking status, gestation length and birth weight.  Some of these variables are in the dataset, others are not.  As a first step in analyzing this observational study, as when analyzing any observational study, compare the distributions of the available background variables for the smokers and nonsmokers.  Describe succinctly any substantial differences you see in the distributions of the background variables for the smoking and non-smoking groups.  Report the means and SDs in a table as support for your comparisons. 

Data Analysis Tip:  Here is a guideline for deciding whether the two groups' means of some background variable are "close enough" so as not to be too imbalanced.  Compute the difference in the means, then divide by the average of their standard deviations.  If this quantity is around 0.10 or less, the means are pretty close (within 10% of a "combined SD" for the two groups).
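The guideline in the tip can be sketched in a few lines of Python. The function name standardized_difference and the ages below are made up for illustration; they are not values from the study. The sketch compares the absolute value of the quantity to 0.10, on the assumption that the direction of the difference does not matter for judging balance.

```python
from statistics import mean, stdev

def standardized_difference(group_a, group_b):
    """Difference in means divided by the average of the two groups' SDs."""
    avg_sd = (stdev(group_a) + stdev(group_b)) / 2
    return (mean(group_a) - mean(group_b)) / avg_sd

# Made-up mother's ages (mage) for illustration only; not values from the study.
smokers = [24, 27, 31, 22, 29, 35, 26]
nonsmokers = [25, 28, 30, 23, 33, 36, 27]

d = standardized_difference(smokers, nonsmokers)
print(f"standardized difference = {d:.3f}")
# Guideline from the tip: around 0.10 or less in size means the means are close.
print("close enough" if abs(d) <= 0.10 else "possibly imbalanced")
```

You would repeat this calculation for each continuous background variable (for example mage, mht, and mwt) and report the results next to the table of means and SDs.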

For the categorical variables, compare the percentages of people in each category for smokers and non-smokers.  Report the percentages for the categories.  Be wary of the denominators of each percentage when comparing them, e.g., a percentage of 1/1 for smokers versus 0/1 for non-smokers is a 100% difference, but in reality a difference of 1 person isn't going to affect the comparisons of the two groups' outcomes that much.
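As a rough sketch of this comparison, the Python below tabulates counts and percentages by category for each group and prints the denominator alongside each percentage, so that a large percentage based on very few people is easy to spot. The function name category_percentages and the marital codes are made up for illustration only.

```python
from collections import Counter

def category_percentages(values):
    """Return ({category: (count, percent)}, n), where n is the group size."""
    n = len(values)
    counts = Counter(values)
    table = {cat: (k, 100.0 * k / n) for cat, k in sorted(counts.items())}
    return table, n

# Made-up marital status codes for illustration only; not values from the study.
smokers = [1, 1, 1, 5, 3, 1, 1, 2]
nonsmokers = [1, 1, 1, 1, 5, 1, 1, 1, 1, 3]

for label, group in (("smokers", smokers), ("nonsmokers", nonsmokers)):
    table, n = category_percentages(group)
    for cat, (k, pct) in table.items():
        # Show the count and denominator next to the percentage.
        print(f"{label:10s} marital={cat}: {k}/{n} = {pct:.1f}%")
```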

These two balance checks (one for the continuous background variables, one for the categorical ones) are required steps when analyzing any observational study.  It's tempting to skip these basic checks (they can be tedious when there are lots of background variables), but when you do you are liable to arrive at incorrect conclusions without suspecting anything is wrong.  Don't skip them when you analyze data in your research!

When the background characteristics in the groups differ substantially, you can improve the balance by matching treated and control observations.  But matching won't help much here.  The sizes of the two groups are similar (479 smokers and 539 nonsmokers), so any one-to-one matching scheme would select almost the entire data set (only 539 - 479 = 60 nonsmokers would be unmatched).  Since the distributions of the variables for smokers and nonsmokers aren't dramatically different, we can feel reasonably secure that comparisons of the smokers and nonsmokers in the entire data set are valid for checking the assertions of the Surgeon General.

EXTREMELY IMPORTANT!!!!
When you report the results of an observational study like this one to a journal or to some policy-making body, it is crucial to inform your audience that there may be causally-relevant background characteristics not in the dataset that are not balanced in the two groups.  Whenever possible, you should also suggest examples of such variables (e.g., use of drugs or alcohol, caloric intake, health condition of mother).  This is the ethical thing to do, even if it results in your analyses taking criticism.  Telling the truth about the limitations of a study does more good for society than does hiding or not reporting them, which could lead to bad policy that ultimately hurts people.


Code Book
Variable             Description
Id                        id number

birth                   birth date where 1096 = January 1, 1961

gestation            length of gestation in days

bwt                     birth weight in ounces (999 = unknown)

parity                 parity = total number of previous pregnancies, including fetal deaths and still births. 99=unknown

mrace                 mother's race or ethnicity
                             0-5=white
                                6=mexican
                                7=black
                                8=asian
                                9=mix
                              99=unknown

mage                  mother's age in years at termination of pregnancy

med                   mother's education
                                 0 =  less than 8th grade
                                 1 =  8th to 12th grade. did not graduate high school
                                 2 = high school graduate, no other schooling
                                 3 = high school graduate + trade school
                                 4 = high school graduate + some college
                                 5 = college graduate
                                 6,7 = trade school but unclear if graduated from high school
                                 9 = unknown

mht                      mother's height in inches

mwt                     mother's pre-pregnancy weight in pounds

drace                    father's race or ethnicity
                               0-5 = white
                                  6 = mexican
                                  7 = black
                                  8 = asian
                                  9 = mix

dage                      father's age in years at termination of pregnancy

ded                       father's education 
                                 0 =  less than 8th grade
                                 1 =  8th to 12th grade. did not graduate high school
                                 2 = high school graduate, no other schooling
                                 3 = high school graduate + trade school
                                 4 = high school graduate + some college
                                 5 = college graduate
                                 6,7 = trade school but unclear if graduated from high school
                                 9 = unknown

dht                      father's height

dwt                     father's pre-pregnancy weight in pounds

marital               marital status of mother
                              1 = married
                              2 = legally separated
                              3 = divorced
                              4 = widowed
                              5 = never married

income               family yearly income in 2500 increments.  0 = under 2500, 1 = 2500-4999, ..., 9 = 15000+.   98=unknown, 99=not asked

smoke                does mother smoke?
                               0 = never
                               1 = smokes now

number                 number of cigs smoked a day for past and current smokers
                                    0 = never smoked
                                    1 = 1-4
                                    2 = 5-9
                                    3 = 10-14
                                    4 = 15-19
                                    5 = 20-29
                                    6 = 30-39
                                    7 = 40-60
                                    8 = 60+
                                    9 = smoke but don't know

Premature               = 1 if baby born before gestational age of 270, and = 0 otherwise.
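
Before computing any means or fitting any model, the special codes in the code book need to be handled, or a value such as bwt = 999 will be treated as a real birth weight.  The Python sketch below is one possible way to do this, assuming each observation is stored as a dictionary keyed by variable name.  The names MISSING_CODES and clean_record are made up for illustration.  Only the unknown codes actually listed in the code book are included; other variables may have their own codes, which you should check in the data.

```python
# Unknown / not-asked codes exactly as listed in the code book.
MISSING_CODES = {
    "bwt": {999},
    "parity": {99},
    "mrace": {99},
    "med": {9},
    "ded": {9},
    "income": {98, 99},
}

def clean_record(record):
    """Return a copy with unknown codes set to None and race codes 0-5 collapsed."""
    cleaned = dict(record)
    for var, codes in MISSING_CODES.items():
        if cleaned.get(var) in codes:
            cleaned[var] = None
    for var in ("mrace", "drace"):
        value = cleaned.get(var)
        if value is not None and 0 <= value <= 5:
            cleaned[var] = 0  # codes 0 through 5 all mean "white"
    return cleaned

# Made-up record for illustration only; not a row from the study.
example = {"bwt": 999, "gestation": 280, "mrace": 3, "med": 2, "income": 98, "smoke": 1}
print(clean_record(example))
```

Note that number = 9 ("smoke but don't know") is left alone here, since it describes a smoker whose amount is unknown rather than an unknown smoking status.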