Statistics 101
Data Analysis and Statistical Inference
 

Instructions for lab 2


Lab Objective

To verify the benefits of random sampling and to learn what to look for when reading journal articles containing surveys.

Lab Procedures

Unit 1: The benefits of randomization 

In a survey, the sampled data should be representative of the target population.   The way to guarantee representative data is amazingly simple: collect data from randomly selected units in the population.  We demonstrated this in class with several examples; now you'll investigate whether this is true by using some real data.

Open the file  agpop from the course directory.  This file is taken from the 1992 U.S. Census of Agriculture.  It contains data on agricultural characteristics of all 3,078 counties in the United States.  Variables include acres92 (number of acres devoted to farming in 1992), farms92 (number of farms in 1992), largef92 (the number of farms with more than 1,000 acres), smallf92 (the number of farms with fewer than 9 acres), and similar variables for the 1987 and 1982 censuses.  Also included are county and state names, and a variable indicating the county's region of the country (West, Northeast, North Central, South).  For more information on the Census of Agriculture, including data from the 1997 census, you can visit the web site of the National Agricultural Statistics Service.

Data analysis tip:  When looking at a data set for the first time, it is always a good idea to play around with it to get a feel for what it contains. For example, I looked for my home county (Morris County, NJ) and found more farmland than I expected.  I also saw some -99 values in the data.  Obviously, it is not possible to have negative numbers of farms or negative acres of land.  The conclusion I can reach is that the -99s identify missing data.  Missing data require special care, and you should seek out a professional statistician when you have lots of missing data.  For this lab, we'll be stupid and treat -99s as genuine values.
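To see why the -99s deserve care, here is a minimal Python sketch (outside of JMP) comparing an average that treats -99 as a genuine value with one that drops it first.  The acreage numbers are made up for illustration, and the helper name mean_ignoring_missing is hypothetical.  Dropping missing values is only the simplest possible fix, not a recommendation for real analyses.

```python
from statistics import mean

MISSING_CODE = -99  # the code the tip above infers to mean "missing"


def mean_ignoring_missing(values, missing_code=MISSING_CODE):
    """Average only the values that are not the missing-data code."""
    observed = [v for v in values if v != missing_code]
    if not observed:
        raise ValueError("no observed values to average")
    return mean(observed)


# Made-up acreage values for five hypothetical counties; one is missing.
acres = [1200, 800, -99, 1500, 500]

naive = mean(acres)                     # treats -99 as a real acreage
cleaned = mean_ignoring_missing(acres)  # drops the -99 before averaging

print("treating -99 as genuine:", naive)
print("dropping -99:           ", cleaned)
```

With only one missing value in five, the two averages already differ noticeably.  In a file with thousands of rows and only a few -99s, the difference would be much smaller.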

The Census of Agriculture is a census (duh), so the data set can be used to obtain quantities for the entire population.  For example, we can calculate the total number of acres devoted to farming in the whole United States, the total number of farms in the whole United States, etc.  Let's use JMP to get some of these quantities.

Select the Analyze menu option in JMP, then click on Distribution.  You'll see a box with the names of the variables.  Highlight farms92, then click Y, Columns.  Do this with all the 1992 variables.  Now click OK to get summaries of the variables, such as the means, medians, and many other statistics we'll use later in the semester.  You'll also see histograms and possibly other graphical displays.  We'll use those later in the semester as well.

Write down the population means on scrap paper for use in a later question.
 
Since we have the actual population means, there's no need to take random samples.  There's no point in estimating a number when you can know it exactly!  However, our objective for lab is to see if random sampling works in a real data set.  So, here's what we'll do.  We'll use JMP to take a random sample of 500 counties.  If random selection truly gives a representative sample, the averages of the variables in the sample should be close to the averages of the variables in the whole population of 3,078 counties.

At first glance, it may seem preposterous to claim that 500 counties can represent 3,078 counties.  Look at the ranges of some of the variables: acres92 has a smallest value of -99 and a largest value of 7,229,985 acres, and the number of large farms in a county stretches from 0 to 579.  How are we possibly going to get a sample that reflects the characteristics of all these wide-ranging variables with only 500 out of 3,078 counties?!?  Let's see what happens....
 

Questions:

1)  Take a random sample.   Based on comparisons between the sample means and population means, does it seem that picking counties at random provides a representative sample?  Justify your answer with at most two sentences.

It's easy to take a simple random sample in JMP.  Select Tables from the menu options, then select Subset.   Choose the option for Random Sample.   Enter 500 as the sample size, i.e., the number of counties to be sampled.   Hit OK and you get a new data table with 500 randomly sampled counties.   If you want to take another random sample to check whether it was just dumb luck, close this new data table and repeat the previous instructions.
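If you are curious what the Subset step amounts to, here is a rough Python sketch of the same idea: draw 500 of 3,078 units without replacement and compare the sample mean to the population mean.  The population below is made up (it is NOT the agpop data), and the function name and seed are arbitrary choices for illustration.

```python
import random
from statistics import mean


def simple_random_sample(units, n, seed=None):
    """Draw n distinct units; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(units, n)


# Stand-in population: 3,078 made-up, right-skewed values, one per "county".
# These only mimic a wide-ranging variable; they are not the real census data.
population = [100 * (i % 50 + 1) ** 2 for i in range(3078)]

sample = simple_random_sample(population, 500, seed=101)

pop_mean = mean(population)
samp_mean = mean(sample)
rel_error = abs(samp_mean - pop_mean) / pop_mean

print(f"population mean: {pop_mean:.1f}")
print(f"sample mean:     {samp_mean:.1f}")
print(f"relative error:  {rel_error:.1%}")
```

Change the seed to draw a different sample, which is the code equivalent of closing the new data table and repeating the instructions.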

2)  Noodle around with the data for a while.  Be creative and investigate whatever questions interest you.   Mention three of your findings from the data on the lab report you hand in to the TA.   It may be helpful to use some of the JMP commands from the last lab.

Free Food Alert!  The four people who find the most interesting relationships in the data, as judged by the TAs, get free dinner at Satisfaction with Prof. Reiter.
 

The sample size 500 was chosen arbitrarily.  Later in the semester, we'll learn a principled method of choosing sample sizes.  

To me, what's amazing about random sampling is that you usually get pretty close by just throwing darts.  In fact, you would be hard-pressed to get closer on all variables by any non-random method of selecting data.  I dare you to try.
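One way to check the "usually" in that claim is to throw the darts many times.  The sketch below, again in Python with a made-up population rather than the agpop data, draws 200 separate random samples of 500 and counts how often the sample mean lands within 10% of the population mean.  The 10% cutoff, the 200 repeats, and the seed are all arbitrary choices for illustration.

```python
import random
from statistics import mean


def relative_errors(population, n, repeats, rng):
    """Relative error of the sample mean for each of several random samples."""
    pop_mean = mean(population)
    errors = []
    for _ in range(repeats):
        sample = rng.sample(population, n)  # sampling without replacement
        errors.append(abs(mean(sample) - pop_mean) / pop_mean)
    return errors


# Made-up, right-skewed population of 3,078 values (not the real census data).
population = [100 * (i % 50 + 1) ** 2 for i in range(3078)]

rng = random.Random(2024)
errors = relative_errors(population, n=500, repeats=200, rng=rng)

# Share of samples whose mean is within 10% of the population mean.
share_close = sum(e < 0.10 for e in errors) / len(errors)

print(f"samples within 10% of the population mean: {share_close:.0%}")
print(f"worst relative error seen: {max(errors):.1%}")
```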

Data analysis tip:  Here's a generic method for taking a random sample.  First, give each unit on the sampling frame a distinct number in the range 1 to N, where N is the total number of units on your sampling frame.  Second, open JMP and create a file with numbers from 1 to N.  Third, pick a random sample of numbers from this file using the same method as in the agpop example.  Finally, collect data for those units whose numbers were picked in the sample.
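The same four steps can be sketched in a few lines of Python instead of JMP.  The sampling frame below is a hypothetical list of 20 numbered units, and the function name sample_unit_numbers is made up for this example.

```python
import random


def sample_unit_numbers(N, n, seed=None):
    """Pick n distinct numbers from 1..N, every set of n equally likely."""
    if not 0 <= n <= N:
        raise ValueError("sample size must be between 0 and N")
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, N + 1), n))


# Step 1: give each unit on the (hypothetical) sampling frame a distinct number.
frame = {number: f"unit-{number}" for number in range(1, 21)}  # N = 20

# Steps 2 and 3: pick a random sample of numbers from 1 to N.
picked = sample_unit_numbers(N=len(frame), n=5, seed=7)

# Step 4: collect data for the units whose numbers were picked.
selected_units = [frame[number] for number in picked]

print(picked)
print(selected_units)
```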


Unit 2:  Reading newspaper stories and journal articles about surveys


In the next part of the lab, you'll be asked questions about several newspaper stories and journal articles describing surveys and causal studies. The objective of this part of the lab is to give you some guidelines for what to look for when reading about study designs in journals and the media.  You won't understand all the statistical methods used in the study; we haven't learned them yet.  By the end of the semester, you will understand those methods.  For now, we focus on the study designs.

You should read the articles and complete as many questions as possible before labs.  Lab time will be used for you to ask questions and to participate in discussions of the articles.

Article 1:  A survey to estimate the prevalence of child abuse, and the news story that accompanies it.

How prevalent is child abuse in the U.S.?   What types of child abuse are most common?  Data on these important issues are scarce.  

In the early 1990s, Finkelhor and Dziuba-Leatherman (1994) ran a survey to address these questions.  Their survey is described in an article in  the journal Pediatrics, which can be found on-line at Ovid.   Download the file by clicking on this link.

The reference for the article is:
Finkelhor, D. and Dziuba-Leatherman, J.  (1994).  "Children as victims of violence: A national survey". Pediatrics 94, pp. 413-420.

Also, click on this link to a news summary in the San Francisco Examiner describing the results of the study.  This can take a long time to print out, so you may want to view it on your computer screen only.

Questions:

1. What is the definition of abuse used in the survey and, implicitly, in the title of the newspaper article?
   
2.  What is the target population of interest?  What, if any, evidence is in the article to suggest that the 2,000 children who responded are demographically representative of the target population of interest?

3. What procedure did the researchers use to collect data from the children?

4.  What percentage of the children are not included in the survey because their parent refused permission?  What is the total percentage of children who did not respond to the survey, either because their parent refused permission or they did not want to answer?  

5. The researchers admit that a sizeable chunk of parents refused permission, but they don't do anything else about it.  If we could get the data for the children of these non-responding parents, do you think that the incidence rates for the abuses would increase or decrease relative to those in the report?  Justify your answers in at most two sentences.

6. The assault rate reported in the news article is 15.6%.  Notice that the newspaper article does not define precisely what it means by assault.  This is typical of news coverage: report a number without its definition.  Find this rate in the report, and state exactly what it measures.


Article 2:  Is St. John's wort effective for treating depression?

St. John's wort is an herb that is reputed to elevate moods.  In the early 1990s, anecdotal evidence suggested that St. John's wort can effectively treat depression.  However, the anecdotal evidence was shaky--like all anecdotal evidence--because it did not control for aspects of patients' background characteristics.  That is, the evidence was not collected from studies that compared people who took St. John's wort and similar people who did not.

In 1993, Congress established the National Center for Complementary and Alternative Medicine (NCCAM) within the National Institutes of Health (NIH) for the purpose of supporting clinical trials to evaluate the effectiveness of alternative medicine.  Its first major, multicenter study investigated the effectiveness of St. John's wort in treating moderately severe cases of depression. The study cost $6 million to run. In an October 1, 1997, NIH news release announcing this study, the director of the National Institute of Mental Health stated:


"This study will give us definitive answers about whether St. John's wort works for clinical depression. The study will be the first rigorous clinical trial of the herb that will be large enough and long enough to fully assess whether it produces a therapeutic effect."

The study and its conclusions are reported by Davidson, J. R. T., et al. (2002) in the Journal of the American Medical Association, one of the most prestigious journals in medical science.  An April 9, 2002, NIH news release summarized the results of the study as follows:

"An extract of the herb St. John's wort was no more effective for treating major depression of moderate severity than placebo, according to research published in the April 10 issue of the Journal of the American Medical Association."

Below is the reference to the article by Davidson et al. (2002), as well as a link.  Click on the link, then click on "pdf of this article".  If you have trouble opening the pdf file, you can click on "full text."  Read the article and answer the questions below.

Click for a direct link to the article.

Here is the reference for the article.

Davidson, J. R. T. et al. (2002). "Effect of Hypericum perforatum (St. John's wort) in major depressive disorder".  Journal of the American Medical Association, vol 287, no 14.

Questions:

The article uses the technical names "sertraline" for the drug Zoloft (which is manufactured by Pfizer and is a cousin of Prozac) and "Hypericum perforatum" for St. John's wort.  We will replace these with the popular names in discussing the results of the study.

1.  a) What are the treatments?  What are the dosages for the treatments?
     b) How many people are assigned to each treatment?
     c) How are people selected to participate in the study?
     d) How long did the study last?
     e) What are the main outcome measures?

2.  Give three examples of people who are excluded from the study.  Why do you think the authors excluded these people from the study?

3.  In your own words, write an explanation of how the subjects were assigned to treatments for someone who hasn't read the article.  (Simply copying or paraphrasing the sentences from the paper will not earn credit.)

4.  Based on Table 1, are the three groups reasonably well-balanced on background characteristics before the study began?  If not, which variables are not balanced?

5.  The authors write at great length to convince us that the study is double-blind.  Why is double-blinding important for this study?


Article 3:  What are the effects on employment of increasing the minimum wage?

Classical economic theory predicts that increasing wages decreases employment.  This is one of the main arguments against raising the minimum wage.   The theory is informative, but it should not be trusted in isolation.  We need evidence from data to see whether the theory is correct.

Card and Krueger (1994) assessed the effects of raising the minimum wage by examining wages and employment practices in fast food restaurants.

Here is a direct link to their article.  It is a pdf file, so you'll need Adobe Acrobat to read it.  If you don't have this, follow the instructions below.

I got the article originally from JSTOR, an on-line journal search engine available to Duke.  To open the article, go to the Duke libraries home page. Click on "E-Journals", then click on "Economics" under "Social Sciences".  Click on "A", and scroll down until you see "The American economic review, 1911-1918".  Click on that, then select "Browse this journal".  Select "Vol. 81 - Vol. 88", then scroll down to Vol. 84 and select Issue 4.  Scroll down until you see the article, and select "View Article" to see the contents.

The reference for the article is:

Card, D. and Krueger, A. B. (1994).  "Minimum wages and employment: A case study of the fast-food industry in New Jersey and Pennsylvania".  The American Economic Review,  84.  pp. 772-793.

Questions:

1. a) What are the treatments?
    b) What are the units of study?  Why do the authors use these units for study?
    c) How many units are used in each treatment group for the difference of differences analysis reported in row 4 of Table 3?
    d) What are the main outcome measures?
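A note on question 1c: a difference of differences (often called difference-in-differences) compares the before-to-after change in the treated group with the before-to-after change in the comparison group.  Here is a minimal Python sketch of the arithmetic.  The four averages are made-up numbers chosen for easy mental math, not figures from the article, and the function name is hypothetical.

```python
def difference_in_differences(treated_before, treated_after,
                              control_before, control_after):
    """Change in the treated group minus change in the comparison group."""
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change


# Made-up group averages for illustration only (not from Card and Krueger).
effect = difference_in_differences(
    treated_before=10.0, treated_after=12.0,   # treated group rises by 2
    control_before=10.0, control_after=11.0,   # comparison group rises by 1
)

print(effect)  # 2 - 1 = 1.0
```

Subtracting the comparison group's change is meant to remove whatever would have happened to both groups anyway, so that what remains can be attributed to the treatment, provided the two groups would otherwise have changed in the same way.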

2.  Is this an observational study or a randomized experiment?  Justify your answer in one sentence.

3.  Based on Table 2, were the two groups balanced on background characteristics before the minimum wage took effect?  If not, which variables are not balanced?

4.  The authors worked hard to obtain information from restaurants that did not respond to the first wave of the survey.  Why might ignoring those missing restaurants be an unwise decision when estimating the effect of the minimum wage on employment?

5.  In all observational studies, there could be other factors that affect the treatment groups' responses and thereby explain apparent causal effects.  Did Card and Krueger examine any alternative hypotheses?   If so, describe their analysis of any one of these alternatives.