Statistics 101
Data Analysis and Statistical Inference
 

Instructions for lab 6


Lab Objective

In this lab, we analyze an observational study with the methods we've covered so far.

Lab Procedures


These days, it is widely understood that mothers who smoke during pregnancy risk exposing their babies to many health problems.  This was not common knowledge forty years ago.  One of the first studies that addressed the issue of pregnancy and smoking was the Child Health and Development Studies, a comprehensive study of all babies born between 1960 and 1967 at the Kaiser Foundation Hospital in Oakland, CA.  The original reference for the study is Yerushalmy (1964, American Journal of Obstetrics and Gynecology, pp. 505-518).  The data and a summary of the study are in Nolan and Speed (2000, Stat Labs, Chapter 10) and can be found at the web site http://www.stat.berkeley.edu/users/statlabs.

You can download the data by clicking on this link.

There were about 15,000 families in the study.  We will analyze a subset of the data, in particular 1236 male single births where the baby lived at least 28 days.  The researchers interviewed mothers early in their pregnancy to collect information on socioeconomic and demographic characteristics, including an  indicator of whether the mother smoked during pregnancy.  This is an observational study, because mothers decided whether or not to smoke during pregnancy; there was no random assignment to smoke or not to smoke.  The variables in the dataset are described in the code book at the end of these instructions.

The Surgeon General's Report (1989) makes two assertions about smoking and pregnancy:

1)  Mothers who smoke have increased rates of premature delivery (before 270 days).  
2)  The newborns of mothers who smoke have smaller birth weights at every gestational age (number of days into pregnancy when child is born).

Let's analyze the data to see if they support the Surgeon General's assertions.  

To simplify analyses, we'll compare babies whose mothers smoke to babies whose mothers have never smoked.  The data file you have access to has only these people, although there were other types of smokers in the original dataset.

Questions:


Recall that in an observational study, the treatment groups should be closely balanced on causally-relevant background variables to mitigate the effect of confounding.   Many background variables affect gestation length and birth weight.  Some of these variables are in the dataset, others are not.  As a first step in analyzing this observational study, as when analyzing any observational study, we compare the distributions of the available background variables for the smokers and nonsmokers.

To do this efficiently, use Fit-Y-by-X and enter all the background characteristics into Y and "smoke" into X.  You can highlight all the relevant background characteristics simultaneously, and, with one click on Y, make them Y variables.  Remember, we compare the balance of background characteristics, not outcome variables.

1.  For the continuous variables, examine the box plots, means, and standard deviations by clicking on the red arrow next to "Oneway Analysis" and selecting Quantiles and Means and Std. Dev.  Describe succinctly any substantial differences you see in the distributions of the background variables for the smoking and non-smoking groups.  Report the means and SDs as support for your comparisons.  You only need to report means and SDs for variables that differ substantially in the two groups.  For other variables, just say, "The following variables were similarly distributed in the smokers and non-smokers:" and list the variables.

Data Analysis Tip:  Here is a guideline for deciding whether the two groups' means of some background variable are "close enough" so as not to be too imbalanced.  Compute the difference in the means, then divide by the average of their standard deviations.  If this quantity is around 0.10 or less, the means are pretty close (within 10% of a "combined SD" for the two groups).
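For those who want to check the arithmetic outside JMP, here is a minimal Python sketch of the guideline in the Tip above.  The function name and the ages are made up for illustration and are not from the lab data.  The sketch assumes the sample SD (dividing by n - 1), which is what Python's statistics.stdev computes.

```python
from statistics import mean, stdev

def standardized_difference(group_a, group_b):
    """Difference in means divided by the average of the two groups' SDs."""
    diff = mean(group_a) - mean(group_b)
    avg_sd = (stdev(group_a) + stdev(group_b)) / 2  # sample SDs (n - 1)
    return diff / avg_sd

# Made-up mothers' ages in years, for illustration only.
smokers_mage = [24, 27, 31, 22, 29, 35, 26]
nonsmokers_mage = [25, 28, 30, 23, 33, 36, 27]

d = standardized_difference(smokers_mage, nonsmokers_mage)
print(f"standardized difference = {d:.2f}")
if abs(d) <= 0.10:
    print("means are close (within about 10% of a combined SD)")
else:
    print("means differ by more than about 10% of a combined SD")
```

The sign only tells you which group has the larger mean; compare the absolute value to 0.10.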

2.  For the categorical variables, compare the percentages of people in each category for smokers and non-smokers.  To do this for each variable, click on the red arrow next to "Contingency Table."  Uncheck everything until the only remaining checked item is "Row %."  This leaves you with the percentages of people in each category for both smokers and nonsmokers.  Mention any big differences you find (if there are any), providing numerical evidence to support your comparisons.  Again, you only need to report percentages for variables that differ substantially in the two groups.  For other variables, just list their names and say they were similarly distributed among smokers and non-smokers.  Be wary of the denominators of each percentage when comparing them (e.g., a percentage of 1/1 for smokers versus 0/1 for non-smokers is a 100% difference, but in reality a difference of 1 person isn't going to affect the comparisons of the two groups' outcomes that much.)

Data Analysis Tip:   As a rough guideline for comparing percentages, you can use the same method described in the previous Tip.  To calculate the standard deviation for each percentage, use SD = square root{ p(1 - p) }, where p is the percentage written as a proportion (for example, 0.25 for 25%).
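The same check for percentages can be sketched in Python as follows.  The function names and the two proportions are hypothetical, chosen only to illustrate the formula; enter each percentage as a proportion between 0 and 1.

```python
from math import sqrt

def proportion_sd(p):
    """SD for a proportion p (between 0 and 1): square root of p(1 - p)."""
    return sqrt(p * (1 - p))

def standardized_difference_in_proportions(p_a, p_b):
    """Difference in proportions divided by the average of their SDs."""
    avg_sd = (proportion_sd(p_a) + proportion_sd(p_b)) / 2
    return (p_a - p_b) / avg_sd

# Made-up proportions in some category, for illustration only.
p_smokers = 0.30
p_nonsmokers = 0.25

d = standardized_difference_in_proportions(p_smokers, p_nonsmokers)
print(f"standardized difference = {d:.2f}")
```

The calculation breaks down (division by zero) if both proportions are exactly 0 or exactly 1, which is another reason to look at the denominators before comparing percentages.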

Steps 1 and 2 are required steps when analyzing any observational study.   It's tempting to skip these basic checks (they can be somewhat tedious when there are lots of background variables), but if you do, you are liable to arrive at incorrect conclusions without suspecting anything is wrong.  Don't skip them when you analyze data in your research!

When the background characteristics in the groups differ substantially, you can improve the balance by matching treated and control observations.  But matching won't help much here.  The sizes of the two groups are similar (479 smokers and 539 nonsmokers), so any matching scheme would result in almost the entire data set being selected (only 54 nonsmokers would be unmatched).  Since the distributions of the variables for smokers and nonsmokers aren't dramatically different, we can feel reasonably secure that comparisons of the smokers and nonsmokers in the entire data set are valid for checking the assertions of the Surgeon General.

3.  Analysis of premature births.   A premature birth is defined as one that occurs before a gestational age of 270 days.   Compare the percentage of premature births for mothers who smoked during pregnancy and mothers who did not smoke during pregnancy.   Do the data provide evidence supporting the Surgeon General's first assertion?  Write a concise answer, and report the number of premature births out of the total number of births in each group as evidence.

These percentages are based only on samples of people, so they are subject to chance error.  We'll learn in Chapter 26 how to assess whether the difference in the percentages could plausibly be explained by chance error.
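As a rough illustration of the counts and percentages asked for in question 3, here is a small Python sketch.  The function name and the gestation lengths are made up and are not from the lab data; the 270-day cutoff is the definition of a premature birth given above.

```python
def premature_rate(gestations, cutoff_days=270):
    """Return (number premature, number of births, percentage premature)."""
    n_births = len(gestations)
    n_premature = sum(1 for g in gestations if g < cutoff_days)
    return n_premature, n_births, 100.0 * n_premature / n_births

# Made-up gestation lengths in days, for illustration only.
smokers = [265, 280, 259, 284, 271, 268, 290, 277]
nonsmokers = [282, 279, 266, 291, 285, 274, 288, 270]

for label, gestations in [("smokers", smokers), ("nonsmokers", nonsmokers)]:
    k, n, pct = premature_rate(gestations)
    print(f"{label}: {k}/{n} premature ({pct:.1f}%)")
```

A birth at exactly 270 days is not counted as premature here, because the definition says "before" 270 days.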

4.  Analysis of birth weights at gestational ages.    The Surgeon General claims that newborns of mothers who smoke have smaller birth weights at every gestational age (number of days into pregnancy when child is born).   Perform a statistical analysis that allows you to assess this claim.  Report your conclusions, answering the question, "Do the data support the Surgeon General's assertion?"  Write a paragraph or two describing your analyses and conclusions.  Include relevant numerical or graphical evidence to support, or cast doubt on, the Surgeon General's claims.  Examine the sensitivity of your conclusions to the effects of outliers, and consider that there may be some gestational ages for which the data don't provide enough evidence to make claims either way.

(This type of problem mirrors real life--no one tells you exactly how to analyze the data.   Think about it for a while before asking the TAs for advice.  Of course, if after a while you haven't thought of a good approach, you can ask your TA.)


EXTREMELY IMPORTANT!!!!
When you report the results of an observational study like this one to a journal or to some policy-making body, it is crucial to inform your audience that there may be causally-relevant background characteristics not in the dataset that are not balanced in the two groups.  Whenever possible, you also should suggest examples of such variables (e.g., use of drugs or alcohol, caloric intake, health condition of mother).  This is the ethical thing to do, even if it results in your analyses taking criticism.   Telling the truth about the limitations of a study does more good for society than does hiding or not reporting them, which could lead to bad policy that ultimately hurts people.


Code Book
Variable             Description
Id                        id number

birth                   birth date where 1096 = January 1, 1961

gestation            length of gestation in days

bwt                     birth weight in ounces (999 = unknown)

parity                 total number of previous pregnancies, including fetal deaths and stillbirths (99 = unknown)

mrace                 mother's race or ethnicity
                             0-5=white
                                6=mexican
                                7=black
                                8=asian
                                9=mix
                              99=unknown

mage                  mother's age in years at termination of pregnancy

med                   mother's education
                                 0 =  less than 8th grade
                                 1 =  8th to 12th grade. did not graduate high school
                                 2 = high school graduate, no other schooling
                                 3 = high school graduate + trade school
                                 4 = high school graduate + some college
                                 5 = college graduate
                                 6,7 = trade school but unclear if graduated from high school
                                 9 = unknown

mht                      mother's height in inches

mwt                     mother's pre-pregnancy weight in pounds

drace                    father's race or ethnicity
                               0-5 = white
                                  6 = mexican
                                  7 = black
                                  8 = asian
                                  9 = mix

dage                      father's age in years at termination of pregnancy

ded                       father's education 
                                 0 =  less than 8th grade
                                 1 =  8th to 12th grade. did not graduate high school
                                 2 = high school graduate, no other schooling
                                 3 = high school graduate + trade school
                                 4 = high school graduate + some college
                                 5 = college graduate
                                 6,7 = trade school but unclear if graduated from high school
                                 9 = unknown

dht                      father's height in inches

dwt                     father's weight in pounds

marital               marital status of mother
                              1 = married
                              2 = legally separated
                              3 = divorced
                              4 = widowed
                              5 = never married

income               family yearly income in 2500 increments.  0 = under 2500, 1 = 2500-4999, ..., 9 = 15000+.   98=unknown, 99=not asked

smoke                does mother smoke?
                               0 = never
                               1 = smokes now
                               2 = until preg
                               3 = once did, not now

time                    If mother quit, how long ago did she quit?
                                0 = never smoked,
                                1 = still smokes,
                                2 = quit during pregnancy,
                                3 = up to 1 yr ago,
                                4 = up to 2 yr ago,
                                5 = up to 3 yr ago,
                                6 = up to 4 yr ago,
                                7 = 5 to 9 yr ago,
                                8 = 10+ yr ago,
                                9 = quit and don't know,
                                98 = unknown

number                 number of cigs smoked a day for past and current smokers
                                    0 = never smoked
                                    1 = 1-4
                                    2 = 5-9
                                    3 = 10-14
                                    4 = 15-19
                                    5 = 20-29
                                    6 = 30-39
                                    7 = 40-60
                                    8 = 60+
                                    9 = smoke but don't know

Premature               = 1 if baby born before gestational age of 270 days, and = 0 otherwise.
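
Several variables in the code book use special numbers (such as 999 or 99) to mean "unknown."  If you ever analyze data like these outside JMP, those codes must be treated as missing before computing means or SDs.  The Python sketch below shows one way to do that; the names UNKNOWN_CODES and clean_row, and the example record, are hypothetical, and the mapping covers only the codes that the code book above explicitly marks as unknown or not asked.

```python
# Codes the code book marks as "unknown" or "not asked", by variable name.
UNKNOWN_CODES = {
    "bwt": {999},
    "parity": {99},
    "mrace": {99},
    "med": {9},
    "ded": {9},
    "income": {98, 99},
    "time": {98},
}

def clean_row(row):
    """Return a copy of row with unknown codes replaced by None."""
    return {
        name: (None if value in UNKNOWN_CODES.get(name, ()) else value)
        for name, value in row.items()
    }

# Made-up record, for illustration only.
record = {"gestation": 284, "bwt": 999, "parity": 1, "income": 98, "smoke": 0}
print(clean_row(record))
```

Without this step, a single 999 in bwt would be averaged in as if it were a 999-ounce baby.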