Instructions for lab 2
Lab Objective
To get more practice using JMP commands, and to illustrate the benefits of random sampling in surveys and causal studies.
Lab Procedures
Unit 1: The benefits of randomization in surveys
In a survey, the sampled data should be representative of the target population. The simplest way to guarantee representative data is to collect data from randomly selected units in the population. We'll illustrate this using real data.
Open the file agpop from the course directory. This file is taken from the 1992 U.S. Census of Agriculture. It contains data on agricultural characteristics of all 3,078 counties in the United States. Variables include acres92 (number of acres devoted to farming in 1992), farms92 (number of farms in 1992), largef92 (the number of farms with more than 1,000 acres), smallf92 (the number of farms with fewer than 9 acres), and similar variables for the 1987 and 1982 censuses. Also included are county and state names, and a variable indicating the county's region of the country (West, Northeast, North Central, South). For more information on the Census of Agriculture, including data from the 1997 census, you can visit the web site of the National Agricultural Statistics Service.
Data analysis tip: When looking at a data set for the first time, it is always a good idea to play around with it to get a feel for what it contains. For example, I saw some -99 values in the data. Obviously, it is not possible to have negative numbers of farms or negative acres of land. The conclusion I can reach is that the -99s identify missing data. Missing data require special care, and you should seek out a professional statistician when you have lots of missing data. For this lab, we'll be stupid and treat -99s as genuine values.
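To see why missing-data codes deserve care, here is a minimal Python sketch (not part of the JMP workflow). The seven farm counts are made up for illustration and are not taken from agpop; only the -99 code comes from the tip above.

```python
import statistics

# Hypothetical county values (NOT real agpop data); -99 marks a missing entry.
farms = [120, 455, -99, 310, 89, -99, 672]

# Treating -99 as a genuine value, which is what this lab does for simplicity.
mean_with_codes = statistics.mean(farms)

# Dropping the -99 codes before averaging.
observed = [x for x in farms if x != -99]
mean_observed = statistics.mean(observed)

print(f"mean with -99s kept:    {mean_with_codes:.1f}")
print(f"mean with -99s dropped: {mean_observed:.1f}")
```

Keeping the codes pulls the average down, which is one reason the choice of how to handle them matters.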
Questions: Exploring the data
Hints for questions 1 - 3: With smart sorts and summaries, you can answer these questions very quickly.
Select the Analyze menu option in JMP, then click on Distribution. You'll see a box with the names of the variables. Highlight farms92, then hit Y-columns. Do this with all the 1992 variables. Now click Okay to get summaries of the variables, such as the means, medians, and many other statistics we'll use later in the semester. You'll also see histograms and possibly other graphical displays. We'll use those later in the semester as well.
Write down the population means on scrap paper for use in a later question.
Since we have the actual population means, there's no need to take random samples. There's no point in estimating numbers when you can know them exactly! However, our objective for lab is to see if random sampling works in a real data set. So, here's what we'll do. We'll use JMP to take a random sample of 500 counties. If random selection truly gives a representative sample, the averages of the variables in the sample should be close to the averages of the variables in the whole population of 3,078 counties.
At first glance, it may seem preposterous to claim that 500 counties can represent 3,078 counties. Look at the ranges of some of the variables: acres92 has a smallest value of -99 and a largest value of 7,229,985 acres, and the number of large farms in a county stretches from 0 to 579. How are we possibly going to get a sample that reflects the characteristics of all these wide-ranging variables with only 500 out of 3,078 counties? Let's see what happens....
Question:
4) Take a random sample. Based on comparisons between the sample means and population means, does it seem that picking counties at random provides a representative sample? Talk to the TA or instructor about your conclusions, and any questions that you may have. After you talk to the TA or instructor, they will give you credit for answering this question.
It's easy to take a simple random sample in JMP from a data file. First, make sure that no columns are highlighted. Then, select Tables from the menu options, then select Subset. Choose the option for Random - sample size. Enter 500 as the sample size, i.e. the number of counties to be sampled. Hit OK and you get a new data table with 500 randomly sampled counties. If you want to take another random sample to check if the results from the first sample were just dumb luck, close this new data table and repeat the previous instructions.
The sample size 500 was chosen arbitrarily. Later in the semester, we'll learn a principled method of choosing sample sizes.
To me, what's amazing about this is that you usually get pretty close by just throwing darts. In fact, you would be hard pressed to get closer on all variables by any non-random method of selecting data. I dare you to try at home.
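For readers who want to try the same experiment outside JMP, here is a minimal Python sketch. It assumes a simulated, skewed population of 3,078 values as a stand-in for one agpop column (the lognormal parameters and the seed are arbitrary choices for illustration), so the printed numbers will not match the real data.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable (an arbitrary choice)

# Hypothetical stand-in for one population column with 3,078 counties.
# These are simulated, skewed values, NOT the real agpop data.
population = [random.lognormvariate(6, 1) for _ in range(3078)]

# Simple random sample of 500 counties, drawn without replacement.
sample = random.sample(population, 500)

pop_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {pop_mean:.1f}")
print(f"sample mean:     {sample_mean:.1f}")
print(f"relative gap:    {abs(sample_mean - pop_mean) / pop_mean:.1%}")
```

Changing the seed and rerunning plays the same role as closing the subset table in JMP and drawing a fresh sample.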
Data analysis tip: Here's a generic method for taking a random sample from a population. First, give each unit on the sampling frame a distinct number in the range 1 to N, where N is the total number of units on your sampling frame. Second, open a new data file in JMP and create a single column with numbers from 1 to N. Third, pick a random sample of these numbers from this file using the Subset - Random - sample size method. Finally, collect data for those units whose numbers were picked in the sample.
To generate a column of numbers in JMP that go from 1 to N, first create a new variable (new column). After highlighting that column, go to Cols - Column Info. Select New Property - Formula. Then, select Edit Formula. Next, select Row - Count. Enter 1 in the from box; enter the number N in the to box and the steps box. Click Okay until you get back to the data sheet.
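The four-step method in the data analysis tip above can also be sketched in a few lines of Python. The frame of six unit names and the sample size of 3 are hypothetical, chosen only to keep the illustration small.

```python
import random

# Hypothetical sampling frame: any list of units will do.
frame = ["unit_a", "unit_b", "unit_c", "unit_d", "unit_e", "unit_f"]
N = len(frame)
n = 3  # sample size, chosen arbitrarily for the illustration

# Steps 1-2: number the units 1..N.
numbers = list(range(1, N + 1))

# Step 3: pick n of those numbers at random, without replacement.
picked = sorted(random.sample(numbers, n))

# Step 4: collect data for the units whose numbers were picked.
sampled_units = [frame[k - 1] for k in picked]
print(picked, sampled_units)
```

Sampling the numbers rather than the units themselves mirrors the JMP approach: the random draw only needs the frame's size, and the lookup happens afterwards.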
What are the characteristics of youth doing time? The 1987 Survey of Youth in Custody sampled juveniles and young adults in long-term, state-operated juvenile institutions. Residents of 206 facilities at the end of 1987 were interviewed about family background, previous criminal history, and drug and alcohol use.
Open the data set syc2.jmp by clicking on the link. The data set contains 28 variables for 2,621 youths. The variables we use are described below:
crimtype : most serious crime in current offense
1 = violent (e.g. murder, rape, robbery, assault)
2 = property (e.g., burglary, larceny, arson, fraud, motor vehicle theft)
3 = drug (drug possession or trafficking)
4 = public order (weapons violation, perjury, failure to appear in court)
5 = juvenile-status offense (truancy, running away, incorrigible behavior)
9 = missing
numarr : number of times arrested
agefirst : age at first arrest
alcuse : Did the youth drink alcohol at all during the year before being sent to the institution?
1 = yes; 2 = no, didn't drink during the year before; 3 = no, don't drink at all; 9 = missing.
everdrug : Did the youth ever use illegal drugs?
0 = no; 1 = yes; 9 = missing.
The variables have missing data, filled in with 99s and 9s. Since the purpose of this lab is to see how well random assignment to treatments works, we'll be stupid and treat the 99s and 9s as if they are real values. Again, this is not good practice; contact a statistician for help when you encounter missing data in your research.
Questions:
5) Let's look at the characteristics of these youths before illustrating random assignment of treatments. For this question, the JMP command for all three parts is given after part c.
a) Before looking at the data, guess what two types of crimes are most common among institutionalized youths (you don't need to write your guesses on the lab report). Okay, now let's look at the data. What two types of crimes did most of these youths commit? Report the percentages of youths who committed these two crime types on your lab report.
b) Before looking at the data, guess the average age at first arrest for institutionalized youths (you don't need to write your guess on the lab report). What is the average age at first arrest in the data? Report the average age at first arrest on your lab report.
c) Before looking at the data, guess the percentage of youths in institutions who drank alcohol in the year before being sent there (you don't need to write your guess on the lab report). What is the percentage of these youths who drank alcohol in that year? Report the percentage on your lab report.
You can get summaries of all the variables by using Analyze - Distribution. Enter all the variable names into the Y-columns box before hitting OK.
6) Can you conclude using these data alone that using alcohol increases the chance that youths will go to institutions? Explain your answer in three or fewer sentences.
Now let's randomly assign half the youths to one group, and half to another group. Go to Rows - Row Selection - Select Randomly... Enter 1312 in the blank box. Hit OK. This highlights a random sample of 1312 rows. Next, make sure that no columns are highlighted (the rows should be), and go to Tables - Subset. Select the option for all variables, and name this subset "Group 1". Select "Selected Rows" as the option, and click OK.
This creates the first group. To create the second group, which comprises the remainder, go back to the syc2 file with all youths. Select Rows - Row Selection - Invert Row Selection. Now you have highlighted all the youths who were not selected in Group 1. Make sure that no columns are highlighted, then go to Tables - Subset. Select the option for all variables, and name this subset "Group 2". Select "Selected Rows" as the option, and click OK. This creates the second group.
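The select-then-invert procedure above amounts to drawing one random subset of rows and putting everything else in the other group. A minimal Python sketch of that idea, assuming rows are simply numbered 1 to 2621 (the variable names are hypothetical, and no syc2 data is loaded):

```python
import random

n_youths = 2621      # number of rows in syc2
group1_size = 1312   # the size entered in Select Randomly

rows = list(range(1, n_youths + 1))  # row numbers 1..2621

# Randomly select the rows for Group 1 (like Select Randomly).
group1 = set(random.sample(rows, group1_size))

# Everything not selected goes to Group 2 (like Invert Row Selection).
group2 = [r for r in rows if r not in group1]

print(len(group1), len(group2))  # prints "1312 1309"
```

Because 2621 is odd, the two groups cannot be exactly equal in size; every youth still lands in exactly one group.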