Statistics 101
Data Analysis and Statistical Inference
Instructions for lab 2
Lab Objective
To verify the benefits of random sampling and to learn what to look for
when reading journal articles containing surveys.
Lab Procedures
Unit 1: The benefits of randomization
In a survey, the sampled data should be representative of the target population.
The way to guarantee representative data is amazingly simple: collect
data from randomly selected units in the population. We demonstrated
this in class with several examples; now you'll investigate whether this
is true by using some real data.
Open the file agpop from the course directory. This file is taken from the 1992 U.S.
Census of Agriculture. It contains data on agricultural characteristics
of all 3,078 counties in the United States. Variables include acres92
(number of acres devoted to farming in 1992), farms92 (number
of farms in 1992), largef92 (the number of farms with more than 1,000
acres), smallf92 (the number of farms with fewer than 9 acres),
and similar variables for the 1987 and 1982 censuses. Also included
are county and state names, and a variable indicating the county's region
of the country (West, Northeast, North Central, South). For more
information on the Census of Agriculture, including data from the 1997
census, you can visit the web site of the National Agricultural Statistics Service.
Data analysis tip: When looking at a data set for the first time, it is always
a good idea to play around with it to get a feel for what it contains.
For example, I looked for my home county (Morris County, NJ) and found
more farmland than I expected. I also saw some -99 values in the data.
Obviously, it is not possible to have negative numbers of farms or negative
acres of land. The conclusion I can reach is that the -99s identify
missing data. Missing data require special care, and you should seek
out a professional statistician when you have lots of missing data.
For this lab, we'll be stupid and treat -99s as genuine values.
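To see why the -99 code matters, here is a minimal Python sketch (the lab itself uses JMP). The acreage values are made up for illustration and are not taken from agpop. It compares the mean when -99 is treated as a genuine value with the mean when those entries are dropped.

```python
from statistics import mean

MISSING_CODE = -99  # sentinel the data set appears to use for missing values

# Hypothetical acreage values for six counties (not real agpop data).
acres = [120_000, 85_000, MISSING_CODE, 240_000, MISSING_CODE, 55_000]

# Treating -99 as a genuine value, as this lab does for simplicity.
naive_mean = mean(acres)

# Dropping the -99 entries before averaging.
observed = [a for a in acres if a != MISSING_CODE]
clean_mean = mean(observed)

print(f"mean with -99 kept:    {naive_mean:,.1f}")
print(f"mean with -99 dropped: {clean_mean:,.1f}")
print(f"counties flagged missing: {len(acres) - len(observed)}")
```

Dropping the flagged values is only the simplest option. It is not a substitute for the careful treatment of missing data recommended above.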
The Census of Agriculture is a census (duh), so the
data set can be used to obtain quantities for the entire population.
For example, we can calculate the total number of acres devoted to farming
in the whole United States, the total number of farms in the whole United
States, etc. Let's use JMP to get some of these quantities.
Select the Analyze menu option in JMP, then
click on Distribution. You'll see a box with the names of
the variables. Highlight farms92, then click Y, Columns.
Do this with all the 1992 variables. Now click OK to get
summaries of the variables, such as the means, medians, and many other statistics we'll use later in the
semester. You'll also see histograms and possibly other graphical displays.
We'll use those later in the semester as well.
Write down the population means on scrap paper for
use in a later question.
Since we have the actual population means, there's
no need to take random samples. There's no point in estimating a number
when you can know it exactly! However, our objective for lab is to
see if random sampling works in a real data set. So, here's what we'll
do. We'll use JMP to take a random sample of 500 counties. If
random selection truly gives a representative sample, the averages of the
variables in the sample should be close to the averages of the variables in
the whole population of 3,078 counties.
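The comparison described above can be sketched in Python, outside JMP. This is an illustration under stated assumptions: the population is synthetic (a right-skewed stand-in for a variable like farms92), and only the sizes 3,078 and 500 come from the lab.

```python
import random
from statistics import mean

random.seed(101)  # fixed seed so the sketch is repeatable

N_COUNTIES = 3078   # population size from the lab
SAMPLE_SIZE = 500   # sample size used in the lab

# Synthetic, right-skewed stand-in for a county-level variable.
population = [random.lognormvariate(6, 1) for _ in range(N_COUNTIES)]

# Simple random sample without replacement.
sample = random.sample(population, SAMPLE_SIZE)

pop_mean = mean(population)
sample_mean = mean(sample)
relative_gap = abs(sample_mean - pop_mean) / pop_mean

print(f"population mean: {pop_mean:.1f}")
print(f"sample mean:     {sample_mean:.1f}")
print(f"relative gap:    {relative_gap:.1%}")
```

If random selection is doing its job, the relative gap should be small even though the sample covers only about one county in six.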
At first glance, it may seem preposterous to claim
that 500 counties can represent 3,078 counties. Look at the ranges
of some of the variables: acres92 has a smallest value of -99
and a largest value of 7,229,985 acres, and the number of large farms in
a county stretches from 0 to 579. How are we possibly going to get
a sample that reflects the characteristics of all these wide-ranging variables
with only 500 out of 3,078 counties?!? Let's see what happens....
Questions:
1) Take a random sample. Based
on comparisons between the sample means and population means, does it seem
that picking counties at random provides a representative sample? Justify
your answer with at most two sentences.
It's easy to take a simple random sample in JMP.
Select Tables from the menu options, then select Subset.
Choose the option for Random Sample. Enter 500
as the sample size, i.e. the number of counties to be sampled.
Hit OK and you get a new data table with 500 randomly sampled counties.
If you want to take another random sample to check if it was just dumb
luck, close this new data table and repeat the previous instructions.
2) Noodle around with the data for a while.
Be creative and investigate whatever questions interest you.
Mention three of your findings from the data on the lab report
you hand in to the TA. It may be helpful to use some of the
JMP commands from the last lab.
Free Food Alert! The four people who find the most interesting relationships
in the data, as judged by the TAs, get free dinner at Satisfaction with
Prof. Reiter.
The sample size 500 was chosen arbitrarily.
Later in the semester, we'll learn a principled method of choosing sample
sizes.
To me, what's amazing about this is that you usually
get pretty close by just throwing darts. In fact, you would be hard
pressed to get closer on all variables by any non-random method of selecting
data. I dare you to try.
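One way to check the "usually pretty close" claim is to repeat the sampling many times and look at how far off the sample mean gets. The Python sketch below does this on a synthetic population. The data and the choice of 200 repeats are assumptions made for illustration, not part of the lab.

```python
import random
from statistics import mean

random.seed(2)  # fixed seed so the sketch is repeatable

N_COUNTIES = 3078
SAMPLE_SIZE = 500
N_REPEATS = 200  # hypothetical number of repeated samples

# Synthetic, right-skewed stand-in for a county-level variable.
population = [random.lognormvariate(6, 1) for _ in range(N_COUNTIES)]
pop_mean = mean(population)

# Relative error of the sample mean in each repeated random sample.
errors = []
for _ in range(N_REPEATS):
    sample = random.sample(population, SAMPLE_SIZE)
    errors.append(abs(mean(sample) - pop_mean) / pop_mean)

errors.sort()
print(f"typical (middle) relative error: {errors[N_REPEATS // 2]:.1%}")
print(f"worst relative error:            {errors[-1]:.1%}")
```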
Data analysis tip: Here's a generic method for taking a random sample.
First, give each unit on the sampling frame a distinct number in the
range 1 to N, where N is the total number of units on your sampling frame.
Second, open JMP and create a file with numbers from 1 to N. Third,
pick a random sample of numbers from this file using the same method as
in the agpop example. Finally, collect data for those units whose
numbers were picked in the sample.
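The same generic method can be sketched in Python instead of JMP. The function name pick_unit_numbers and the eight-unit frame are hypothetical, chosen only to illustrate the steps.

```python
import random

def pick_unit_numbers(n_units, sample_size, seed=None):
    """Return sorted unit numbers (1..n_units) chosen by simple random sampling."""
    if not 0 <= sample_size <= n_units:
        raise ValueError("sample_size must be between 0 and n_units")
    rng = random.Random(seed)
    # range(1, n_units + 1) plays the role of the file of numbers 1 to N.
    return sorted(rng.sample(range(1, n_units + 1), sample_size))

# Hypothetical sampling frame of 8 units, numbered 1 to 8 by position.
frame = ["unit A", "unit B", "unit C", "unit D",
         "unit E", "unit F", "unit G", "unit H"]

chosen = pick_unit_numbers(len(frame), 3, seed=7)
selected_units = [frame[i - 1] for i in chosen]  # unit numbers are 1-based
print(chosen, selected_units)
```

Passing a seed makes the draw repeatable, which is handy if someone later asks how the sample was picked.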
Unit 2: Reading newspaper stories and journal articles about
surveys
In the next part of the lab, you'll be asked questions about several
newspaper stories and journal articles describing surveys and causal studies.
The objective of this part of the lab is to give you some guidelines for
what to look for when reading about study designs in journals and the media.
You won't understand all the statistical methods used in the studies; we haven't
learned them yet. By the end of the semester, you will understand those
methods. For now, we focus on the study designs.
You should read the articles and complete as many questions as possible
before lab. Lab time will be used for you to ask questions and to
participate in discussions of the articles.
Article 1: A survey to estimate the prevalence of child abuse,
and the news story that accompanies it.
How prevalent is child abuse in the U.S.? What types of child
abuse are most common? Data on these important issues are scarce.
In the early 1990s, Finkelhor and Dziuba-Leatherman (1994) ran a survey
to address these questions. Their survey is described in an article
in the journal Pediatrics, which can be found on-line at Ovid.
Download the file by clicking on this link.
The reference for the article is:
Finkelhor, D. and Dziuba-Leatherman, J. (1994). "Children
as victims of violence: A national survey". Pediatrics 94, pp. 413-420.
Also, click on this link to a news summary in the San Francisco Examiner describing
the results of the study. This can take a long time to print out,
so you may want to view it on your computer screen only.
Questions:
1. What is the definition of abuse used in the survey and, implicitly,
in the title of the newspaper article?
2. What is the target population of interest? What, if any,
evidence is in the article to suggest the 2,000 children who responded are
representative demographically of the target population of interest?
3. What procedure did the researchers use to collect data from the children?
4. What percentage of the children are not included in the survey
because their parent refused permission? What is the total percentage
of children who did not respond to the survey, either because their parent
refused permission or they did not want to answer?
5. The researchers admit that a sizeable chunk of parents refused permission,
but they don't do anything else about it. If we could get the data
for the children of these non-responding parents, do you think that the incidence
rates for the abuses would increase or decrease relative to those in the
report? Justify your answers in at most two sentences.
6. The assault rate reported in the news article is 15.6%. Notice
that the newspaper article does not define precisely what they mean by assault.
This is typical of news coverage: report a number without its definition.
Find this rate in the report, and state precisely what it measures.
Article 2: Is St. John's wort effective for treating depression?
St. John's wort is an herb that is reputed to elevate moods. In
the early 1990s, anecdotal evidence suggested that St. John's wort can effectively
treat depression. However, the anecdotal evidence was shaky--like all
anecdotal evidence--because it did not control for aspects of patients' background
characteristics. That is, the evidence was not collected from studies
that compared people who took St. John's wort and similar people who did
not.
In 1993, Congress established the National Center for
Complementary and Alternative Medicine (NCCAM) within the National Institutes
of Health (NIH) for the purpose of supporting
clinical trials to evaluate the effectiveness of alternative medicine.
Their first major, multi-centered study investigated the effectiveness
of St. John's wort in treating moderately severe cases of depression.
The study cost $6 million to run. In an October 1, 1997, NIH news release
announcing this study, the director of the National Institute of Mental
Health stated:
"This study will give us definitive answers about whether St. John's
wort works for clinical depression. The study will be the first rigorous
clinical trial of the herb that will be large enough and long enough to
fully assess whether it produces a therapeutic effect."
The study and its conclusions are reported by Davidson, J. R. T., et
al. (2002) in the Journal of the American Medical Association,
one of the most prestigious journals in medical science. An
April 9, 2002, NIH news release
summarized the results of the study as follows:
"An extract of the herb St. John's wort was no more effective
for treating major depression of moderate severity than placebo, according
to research published in the April 10 issue of the Journal of the American
Medical Association."
Below is the reference to the article by Davidson et al. (2002),
as well as a link. Click on the link, then click on "pdf of this article".
If you have trouble opening the pdf file, you can click on "full text."
Read the article and answer the questions below.
Click for a direct link to the article.
Here is the reference for the article.
Davidson, J. R. T., et al. (2002). "Effect of Hypericum perforatum
(St. John's wort) in major depressive disorder". Journal of
the American Medical Association, vol. 287, no. 14.
Questions:
The article uses the technical names "sertraline" for the drug Zoloft (which
is manufactured by Pfizer and is a cousin of Prozac) and "Hypericum perforatum"
for St. John's wort. We will replace these with the popular names
in discussing the results of the study.
1. a) What are the treatments? What are the dosages for the
treatments?
b) How many people are assigned to each treatment?
c) How are people selected to participate in the study?
d) How long did the study last?
e) What are the main outcome measures?
2. Give three examples of people who are excluded from the study.
Why do you think the authors excluded these people from the study?
3. In your own words, write an explanation of how the subjects were
assigned to treatments for someone who hasn't read the article. (Simply
copying or paraphrasing the sentences from the paper will not earn credit.)
4. Based on Table 1, are the three groups reasonably well-balanced
on background characteristics before the study began? If not, which
variables are not balanced?
5. The authors write at great length to convince us that the study
is double-blind. Why is double-blinding important for this study?
Article 3: What are the effects on employment of increasing
the minimum wage?
Classical economic theory predicts that increasing wages decreases employment.
This is one of the main arguments against raising the minimum wage.
The theory is informative, but it should not be trusted in isolation.
We need evidence from data to see whether the theory is correct.
Card and Krueger (1994) assessed the effects of raising the minimum wage
by examining wages and employment practices in fast food restaurants.
Here is a direct link to their article. It is a pdf file, so you'll need Adobe Acrobat to
read it. If you don't have this, follow the instructions below.
I got the article originally from JSTOR, an on-line journal search engine
available to Duke. To open the article, go to the Duke libraries home
page. Click on "E-Journals", then click on "Economics" under "Social Sciences".
Click on "A", and scroll down until you see "The American economic review,
1911-1918". Click on that, then select "Browse this journal". Select
"Vol. 81 - Vol. 88", then scroll down to Vol. 84 and select Issue 4. Scroll
down until you see the article, and select "View Article" to see the contents.
The reference for the article is:
Card, D. and Krueger, A. B. (1994). "Minimum wages and employment:
A case study of the fast-food industry in New Jersey and Pennsylvania". The
American Economic Review 84, pp. 772-793.
Questions:
1. a) What are the treatments?
b) What are the units of study? Why do the authors
use these units for study?
c) How many units are used in each treatment group for the
difference of differences analysis reported in row 4 of Table 3?
d) What are the main outcome measures?
2. Is this an observational study or a randomized experiment? Justify
your answer in one sentence.
3. Based on Table 2, were the two groups balanced on background
characteristics before the minimum wage took effect? If not, which
variables are not balanced?
4. The authors worked hard to obtain information from restaurants
that did not respond to the first wave of the survey. Why might ignoring
those missing restaurants be an unwise decision when estimating the effect
of the minimum wage on employment?
5. In all observational studies, there could be other factors that
affect the treatment groups' responses and thereby explain apparent causal
effects. Did Card and Krueger examine any alternative hypotheses?
If so, describe their analysis of any one of these alternatives.
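For reference, the arithmetic behind a "difference of differences" comparison, as mentioned in question 1c, can be sketched in a few lines of Python. The function name and all numbers below are hypothetical and are not the values in Card and Krueger's Table 3.

```python
def difference_of_differences(treated_before, treated_after,
                              control_before, control_after):
    """Change in the treated group minus change in the comparison group."""
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change

# Hypothetical average employment per restaurant (made-up numbers).
estimate = difference_of_differences(
    treated_before=20.0, treated_after=21.0,
    control_before=23.0, control_after=21.5,
)
print(f"difference of differences: {estimate:+.1f}")
```

Subtracting the comparison group's change is meant to remove changes over time that would have happened in both groups anyway.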