Statistics 101
Data Analysis and Statistical Inference
 

Instructions for lab 2


Lab Objective

To verify the benefits of random sampling and to learn what to look for when reading journal articles containing surveys.

Lab Procedures

Unit 1: The benefits of randomization 

In a survey, the sampled data should be representative of the target population.   The way to guarantee representative data is amazingly simple: collect data from randomly selected units in the population.  We demonstrated this in class with several examples; now you'll investigate whether this is true by using some real data.

Open the file agpop from the course directory.  This file is taken from the 1992 U.S. Census of Agriculture.  It contains data on agricultural characteristics of all 3,078 counties in the United States.  Variables include acres92 (number of acres devoted to farming in 1992), farms92 (number of farms in 1992), largef92 (the number of farms with more than 1,000 acres), smallf92 (the number of farms with fewer than 9 acres), and similar variables for the 1987 and 1982 censuses.  Also included are county and state names, and a variable indicating the county's region of the country (West, Northeast, North Central, South).  For more information on the Census of Agriculture, including data from the 1997 census, you can visit the web site of the National Agricultural Statistics Service.

Data analysis tip:  When looking at a data set for the first time, it is always a good idea to play around with it to get a feel for what it contains. For example, I looked for my home county (Morris County, NJ) and found more farmland than I expected.  I also saw some -99 values in the data.  Obviously, it is not possible to have negative numbers of farms or negative acres of land.  The conclusion I can reach is that the -99s identify missing data.  Missing data require special care, and you should seek out a professional statistician when you have lots of missing data.  For this lab, we'll be stupid and treat -99s as genuine values.
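
To see why the -99 codes matter, here is a small sketch in Python rather than JMP.  The county values are made up for illustration and are not taken from agpop, and the assumption that -99 marks missing data is only the inference from the tip above.  The sketch compares the mean when -99 is kept as a genuine value with the mean when those entries are dropped.

```python
from statistics import mean

MISSING_CODE = -99  # assumed sentinel for missing data (inferred, not documented)

# Hypothetical farm counts for six counties (NOT real agpop values).
farms = [120, 450, -99, 310, -99, 275]

# Treat -99 as a genuine value, which is what this lab does for simplicity.
naive_mean = mean(farms)

# Treat -99 as missing: drop those entries before averaging.
observed = [x for x in farms if x != MISSING_CODE]
clean_mean = mean(observed)

print(f"mean with -99 kept:       {naive_mean:.2f}")
print(f"mean with -99 dropped:    {clean_mean:.2f}")
print(f"number of missing values: {len(farms) - len(observed)}")
```

With only a few -99s among thousands of counties the effect on a mean is small, but it is never zero, which is one reason missing data deserve care.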

The Census of Agriculture is a census (duh), so the data set can be used to obtain quantities for the entire population.  For example, we can calculate the total number of acres devoted to farming in the whole United States, the total number of farms in the whole United States, etc.  Let's use JMP to get some of these quantities.

Select the Analyze menu option in JMP, then click on Distribution.  You'll see a box with the names of the variables.  Highlight farms92, then hit Y-columns.  Do this with all the 1992 variables.  Now click OK to get summaries of the variables, such as the means, medians, and many other statistics we'll use later in the semester.  You'll also see histograms and possibly other graphical displays.  We'll use those later in the semester as well.

Write down the population means on scrap paper for use in a later question.
 
Since we have the actual population means, there's no need to take random samples.  There's no point in estimating numbers when you can know them exactly!  However, our objective for lab is to see if random sampling works in a real data set.  So, here's what we'll do.  We'll use JMP to take a random sample of 500 counties.  If random selection truly gives a representative sample, the averages of the variables in the sample should be close to the averages of the variables in the whole population of 3,078 counties.

At first glance, it may seem preposterous to claim that 500 counties can represent 3,078 counties.  Look at the ranges of some of the variables: acres92 has a smallest value of -99 and a largest value of 7,229,985 acres, and the number of large farms in a county stretches from 0 to 579.  How are we possibly going to get a sample that reflects the characteristics of all these wide-ranging variables with only 500 out of 3,078 counties?!?  Let's see what happens....
 

Questions:

1)  Take a random sample.   Based on comparisons between the sample means and population means, does it seem that picking counties at random provides a representative sample?  Justify your answer with at most two sentences.

It's easy to take a simple random sample in JMP.  Select Tables from the menu options, then select Subset.  Choose the option for Random Sample.   Enter 500 as the sample size, i.e. the number of counties to be sampled.   Hit OK and you get a new data table with 500 randomly sampled counties.   If you want to take another random sample to check if it was just dumb luck, close this new data table and repeat the previous instructions.

2)  Noodle around with the data for a while.  Be creative and investigate whatever questions interest you.   Mention three of your findings from the data on the lab report you hand in to the TA.   It may be helpful to use some of the JMP commands from the last lab.

Free Food Alert!  The four people who find the most interesting relationships in the data, as judged by the TAs, get free dinner at Satisfaction with Prof. Reiter.
 

The sample size 500 was chosen arbitrarily.  Later in the semester, we'll learn a principled method of choosing sample sizes.  

To me, what's amazing about this is that you usually get pretty close by just throwing darts.  In fact, you would be hard pressed to get closer on all variables by any non-random method of selecting data. I dare you to try.
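
If you want to see the dart-throwing at work outside JMP, here is a rough sketch in Python.  It does not use the agpop file: the "population" below is 3,078 simulated, right-skewed values standing in for something like acres92, and the lognormal shape, the seed, and the 200 repetitions are all my own assumptions.  The sketch draws many simple random samples of 500 and records how far each sample mean lands from the population mean.

```python
import random
from statistics import mean, median

random.seed(101)  # fixed seed so the sketch is reproducible

N_COUNTIES = 3078   # population size, from the lab
SAMPLE_SIZE = 500   # sample size, from the lab
N_REPEATS = 200     # number of repeated samples (arbitrary choice)

# Simulated, right-skewed values standing in for a variable like acres92.
# These are NOT the real agpop data.
population = [random.lognormvariate(12, 1) for _ in range(N_COUNTIES)]
pop_mean = mean(population)

def sample_mean(values, n):
    """Mean of a simple random sample of size n, drawn without replacement."""
    return mean(random.sample(values, n))

# Relative distance between each sample mean and the population mean.
rel_errors = [
    abs(sample_mean(population, SAMPLE_SIZE) - pop_mean) / pop_mean
    for _ in range(N_REPEATS)
]

print(f"population mean:                {pop_mean:,.0f}")
print(f"median relative error:          {median(rel_errors):.1%}")
print(f"largest relative error in runs: {max(rel_errors):.1%}")
```

Try changing SAMPLE_SIZE to 50 or 1,000 and rerunning to see how the typical error moves.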

Data analysis tip:  Here's a generic method for taking a random sample.  First, give each unit on the sampling frame a distinct number in the range 1 to N, where N is the total number of units on your sampling frame.  Second, open JMP and create a file with numbers from 1 to N.  Third, pick a random sample of numbers from this file using the same method as in the agpop example.  Finally, collect data for those units whose numbers were picked in the sample.
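
The four steps above can also be sketched in Python instead of JMP.  The function name draw_simple_random_sample and the tiny frame of unit names are hypothetical, made up for this illustration.  The idea is the same: number the units 1 to N, pick distinct numbers at random, and then look up the units that were picked.

```python
import random

def draw_simple_random_sample(frame, n, seed=None):
    """Return (unit number, unit) pairs for a simple random sample of size n."""
    rng = random.Random(seed)
    N = len(frame)
    # Steps 1 and 2: give each unit on the frame a distinct number from 1 to N.
    unit_numbers = range(1, N + 1)
    # Step 3: pick n distinct numbers at random (sampling without replacement).
    picked = sorted(rng.sample(unit_numbers, n))
    # Step 4: collect the units whose numbers were picked.
    return [(k, frame[k - 1]) for k in picked]

# Hypothetical sampling frame of eight units.
frame = ["unit A", "unit B", "unit C", "unit D",
         "unit E", "unit F", "unit G", "unit H"]

for number, unit in draw_simple_random_sample(frame, 3, seed=2):
    print(number, unit)
```

Because the numbers are drawn without replacement, no unit can be selected twice, and asking for more units than the frame contains raises an error.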


Unit 2:  Reading newspaper stories and journal articles about surveys


In the next part of the lab, you'll be asked questions about several newspaper stories and journal articles describing surveys and causal studies. The objective of this part of the lab is to give you some guidelines for what to look for when reading about study designs in journals and the media.  You won't understand all the statistical methods used in the study; we haven't learned them yet.  By the end of the semester, you will understand those methods.  For now, we focus on the study designs.

You should read the articles and complete the questions before lab.

Article 1:  A survey to estimate the prevalence of child abuse, and the news story that accompanies it.

How prevalent is child abuse in the U.S.?   What types of child abuse are most common?  Data on these important issues are scarce.  

In the early 1990s, Finkelhor and Dziuba-Leatherman (1994) ran a survey to address these questions.  Their survey is described in an article in the journal Pediatrics.   Download the journal article by clicking on this link.

The reference for the article is:
Finkelhor, D. and Dziuba-Leatherman, J.  (1994).  "Children as victims of violence: A national survey". Pediatrics 94, pp. 413-420.

Also, click on this link to a news summary in the San Francisco Examiner describing the results of the study.  This can take a long time to print out, so you may want to view it on your computer screen only.

FOR ALL QUESTIONS, WRITE NO MORE THAN THREE SENTENCES.  TAs WILL NOT READ MORE THAN THE FIRST THREE SENTENCES WHEN GRADING.

Questions:

1.  a) What is the definition of abuse used in the survey?
     b) Why do the researchers choose to use this definition?  
   
2.  a) What is the target population of interest?  
     b) Was random sampling employed in the survey design?  If so, when was it used?
     c) How do the 2000 sampled children compare demographically to the target population of interest?

3.  a) The researchers conducted phone interviews of children, after asking permission from caretakers to speak to their children. What percentage of the children are not included in the survey because their caretakers refused permission?  
     b) If we could get the data for the children of these non-cooperative caretakers, do you think that the incidence rates for the abuses would increase or decrease relative to those in the report?  Justify your answers.

4.  The results of the survey are compared to those of two other national surveys (the NCS and the NYS) in the journal article, and one other survey  (the NCS) in the newspaper article.
a) Why might the results from the survey differ from those in the NCS?
b) Why might the results from the survey differ from those in the NYS?

5. The assault rate reported in the news article is 15.6%.  Notice that the newspaper article does not define precisely what it means by assault.  This is typical of news coverage: a number is reported without its definition.  Find this rate in the journal article, and state what it measures.


Article 2:  Sports and binge drinking on college campuses

Do college students who are avid sports fans engage in binge drinking more frequently than college students who are not sports fans?  

Nelson and Wechsler (2002) analyze this question using data from the College Alcohol Study (CAS), a survey run by the Harvard School of Public Health.  Download the Nelson and Wechsler journal article by clicking on this link.  This paper has not yet been published.

You also will need to look at the article by Wechsler, Lee, Kuo, and Lee (2000).  It has a thorough description of the sampling design.  Click on this link to download the Wechsler, Lee, Kuo, and Lee article.  The reference for this paper is:  

Wechsler, H., Lee, J., Kuo, M., and Lee, H. (2000). "College binge drinking in the 1990s: A continuing problem--results of the Harvard School of Public Health 1999 College Alcohol Study."  Journal of American College Health, 48, pp. 199-210.

FOR ALL QUESTIONS, WRITE NO MORE THAN THREE SENTENCES.  TAs WILL NOT READ MORE THAN THE FIRST THREE SENTENCES FOR ANY PROBLEM WHEN GRADING.

Questions:

1.  a) What is the definition of binge drinking used in the survey?
     b) What is the definition of sports fan used by Nelson and Wechsler?  
     c) What is the definition of sports school used by Nelson and Wechsler?
   
2.  a) What is the target population of interest?  
     b) What procedures were used to collect data from this target population?  Write a list of the main aspects of the data collection stage.

3.  a) What did the researchers do to encourage students to respond to the survey?
     b) What was the average response rate in the participating colleges?
     c) These data do not tell us whether the nonresponding students have different characteristics than the responding students.  Based on your own knowledge, do you think (i) nonrespondents are more likely to be binge drinkers than respondents, (ii) nonrespondents and respondents are equally likely to be binge drinkers, or (iii) nonrespondents are less likely to be binge drinkers than respondents?  Justify your answer in at most three sentences.

4. a) Summarize the results in Table 1 in three or fewer sentences.
    b) Summarize the results in Table 2 in three or fewer sentences.
    c) Summarize the results in Table 3 in three or fewer sentences.

5.  On page 9 of their article, Nelson and Wechsler say, "The study has several limitations.  As with other surveys, these data are subject to reporting bias."  Explain why this is a limitation of the study.