Instructions for lab 1
Lab Objective
To become familiar with the software package JMP.
Lab Procedures
JMP gives us an enormous advantage over people who learned about and
performed statistical analyses back in the pre-computer days. It
allows us to avoid the drudgery of long, arithmetical calculations in
favor of understanding concepts and analyzing data. You may find
JMP a little annoying at times (all computer software is), but I
suspect that you will be thankful for its existence once we start analyzing
data. JMP is also easier to work with and has more capabilities
than Excel.
To start JMP, click on the Windows Start button in the
lower left corner of the screen, then select Program-Math and
Statistics-JMP 7. The software should open with a title
page showing. Close the title page and you're ready to work.
Like most Windows-based programs, JMP has menu choices, including File,
Edit, Tables, Analyze, Graph, Tools, Window, and Help.
As the semester progresses, we will learn about the options under these
menu choices. For now, we focus on creating and downloading JMP
data sets.
Unit 1: Creating new data sets
Sometimes you need to create your own data sets from scratch (e.g., when analyzing data you collected for your final project). This tutorial familiarizes you with creating JMP data sets.
To create a new data set, click on the New Data Table button
on the JMP menu bar (first icon in the top row), or navigate through
the menus to File - New Data Table. The result is a blank spreadsheet
that will become the data set. The first step is to tell JMP the
number of rows in the data set. Click on the red arrow next to Rows,
or select Rows from the menu choices (this appeared after you
created a blank data sheet), and select Add
Rows. Enter the number of rows you want. To keep
things simple, we'll use 9 rows for this part of the lab.
Here are the data for nine people:
Sex Number
m 1
m 2
f 1
m 3
f 2
f 2
m 3
f 1
f 2
We're going to input the nine numbers in the first column. "Column 1" is a meaningless name for a variable, so let's rename it. Double click on the box containing "Column 1" and change the variable name to "Number".
Data analysis tip: When you create data sets for your own research, give your variables descriptive labels. It is easier to interpret analyses when the output has descriptive labels than when the output has labels like "Column 1", "Column 2", "Column 3", etc. Descriptive labels also make your data set comprehensible to others who may need or want to use it. Finally, if you use the data set in future analyses, you won't have to spend lots of time trying to decipher uninformative variable names.
Let's add another column to record the sexes. Click on Cols in the menu, and select New Column. Change the name of the column to "Sex" by writing over the "Column 2". You see a button for Data Type, which allows you to specify whether the column contains numbers (numeric), labels or names (character), or row states (we won't use this). Choose character. Next is Modeling Type, which helps JMP decide what graphs to show you. The two modeling types we use are continuous and nominal. We'll learn about these in more detail later, but the basic idea is to select continuous for numbers and nominal for variables that are labels. JMP displays variables that have numbers as data with a blue triangle and variables that have labels (or names) as data with a red bar chart.
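If you ever want to double-check your data entry outside JMP, the same nine-row table can be written down in a few lines of Python. This is only a sketch and not part of the lab; the names `rows` and `people` are my own choices, not anything JMP uses.

```python
# The nine people from the table above, one (sex, number) pair per row.
rows = [
    ("m", 1), ("m", 2), ("f", 1),
    ("m", 3), ("f", 2), ("f", 2),
    ("m", 3), ("f", 1), ("f", 2),
]

# Give each column a descriptive name, just as we renamed the columns in JMP.
people = [{"Sex": sex, "Number": number} for sex, number in rows]

print(len(people))  # number of rows entered
print(people[0])    # first row, to spot-check the entry
```

Note that "Sex" holds labels (character data) and "Number" holds numbers, which mirrors the Data Type choice you made in JMP.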
After you input all the data, answer the following questions.
You don't have to turn in anything for questions A and B. Their
purpose is to get you familiar with JMP.
Questions:
A) How many people picked each number?
With nine people it's straightforward to look at the data and get an
accurate count. But with a large data set, counting
the occurrences of each number "by hand" would be cumbersome. In
such settings, you can make life easier by sorting the numbers in
increasing order and then counting the occurrences. Let's do this
in JMP just to get familiar with this handy command.
Select the Tables menu option and click on Sort.
Select the variable "Number" and place it in the By box.
Hit Sort. You get a sorted data set in a new
table. Sorting is useful for many data analyses. In
fact, you may want to use it again later in the lab.
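For the curious, here is roughly what sort-then-count looks like in Python. It is a sketch for comparison only, using the nine numbers from Unit 1; the variable names are my own.

```python
from collections import Counter

# The "Number" column from the nine-person data set, in the order entered.
numbers = [1, 2, 1, 3, 2, 2, 3, 1, 2]

# sorted() returns a new list and leaves the original alone,
# much like JMP putting the sorted data in a new table.
sorted_numbers = sorted(numbers)
print(sorted_numbers)

# Counter tallies how many times each value occurs.
counts = Counter(numbers)
for value in sorted(counts):
    print(value, counts[value])
```

Compare the tallies with your own count from the sorted JMP table.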
B) If you want to sort the data first by sex and then by
number (i.e., have all the females first with numbers in increasing
order and all the males second with numbers in increasing order), which
sequence of commands would you use? Try them both to see what
happens.
-- Select the Tables menu option and click on Sort.
Select the variable "Number" and place it in the By box.
Then select the variable "Sex" and place it in the By box. Hit Sort.
-- Select the Tables menu option and click on Sort.
Select the variable "Sex" and place it in the By box.
Then select the variable "Number" and place it in the By box.
Hit Sort.
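The order in which sort variables are listed matters in any software, not just JMP. The Python sketch below sorts the nine rows both ways so you can compare the results with what you see in JMP. It assumes a sort key that is a pair of values, compared left to right; the variable names are my own.

```python
# The nine people from Unit 1 as (sex, number) pairs.
rows = [
    ("m", 1), ("m", 2), ("f", 1),
    ("m", 3), ("f", 2), ("f", 2),
    ("m", 3), ("f", 1), ("f", 2),
]

# Key = (number, sex): number is compared first, sex only breaks ties.
by_number_then_sex = sorted(rows, key=lambda r: (r[1], r[0]))

# Key = (sex, number): sex is compared first, number only breaks ties.
by_sex_then_number = sorted(rows, key=lambda r: (r[0], r[1]))

print(by_number_then_sex)
print(by_sex_then_number)
```

The first variable in the key decides the main grouping, and the second only orders rows within each group.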
Okay, that's enough of the basics of creating your own spreadsheet.
Now for some real data that someone else has collected.
Unit 2: Downloading data sets
Load in the data set forbes94, which contains the 1994 compensation information for Chief Executive Officers (CEOs) of several large companies. To open this data set, click here. Take a look at those total compensation figures.... Yikes!! Why did I decide to go into academia?
When you get a data set, the first thing to do is figure out how many variables and how many units of observation you have to play with. This is pretty easy in JMP. Each column represents a variable, and each row represents a unit of observation. Hence, there are 800 CEOs in this data set. There is also a mix of numeric (blue) and character (red) variables in the data set.
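The rows-and-columns bookkeeping works the same way outside JMP. Here is a small Python sketch that counts observations and variables in a table. The three rows and the column names are made up for illustration; they are not taken from forbes94.

```python
import csv
import io

# Made-up stand-in for a data file (NOT the real forbes94 values).
text = """Company,Industry,Salary
Company A,Food,500
Company B,Retail,650
Company C,Food,700
"""

# Each line after the header becomes one row (one unit of observation).
table = list(csv.DictReader(io.StringIO(text)))

n_rows = len(table)            # units of observation
n_cols = len(table[0].keys())  # variables
print(n_rows, "rows and", n_cols, "columns")
```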
Let's get into some analyses. Write your answers on a blank
piece of paper to be turned in at the end
of lab as your lab report. You're permitted and encouraged to talk
about questions with your classmates, but write up your lab report in your
own words. Feel free to ask for help from the TAs or classmates
if you get stuck. And, please give your TAs some love by turning
in a neat, easy-to-read lab report.
Questions:
1) JMP displays missing values with dots. True or false:
There are more than five CEOs whose values of total
compensation are missing in the data file. (Hint: You can do this quickly
using the sort command.)
Data analysis tip: It is common for some data to be missing from a file. Unfortunately, there is no universally accepted way of representing missing values. Some software packages, like JMP, use a dot or period. Other packages use an "NA" for not available. Some data producers, like federal agencies, use extreme values of a variable (e.g., -99) to indicate missing values. Using extreme values is bad practice: how does the user know if the value is an actual value or if it is a dummy for missingness? When you get a data set from someone, learn how they code missing data before doing any further analyses.
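To make the tip concrete, here is a Python sketch of translating several missing-value codes into one common representation before counting them. The list of codes and the sample entries are assumptions for illustration; in real work the codes must come from the data producer's documentation, and -99 should be treated as missing only if that documentation says so.

```python
# Codes that this hypothetical data producer uses for "missing".
MISSING_CODES = {".", "NA", "-99", ""}

def parse_value(raw):
    """Return a float, or None when the entry is a missing-value code."""
    raw = raw.strip()
    if raw in MISSING_CODES:
        return None
    return float(raw)

# Made-up column entries, as they might appear in a raw file.
raw_column = ["1200", ".", "850", "NA", "-99", "430"]

values = [parse_value(x) for x in raw_column]
n_missing = sum(v is None for v in values)
print(n_missing, "missing out of", len(values))
```

Once every code maps to the same marker (here, None), counting the missing entries is a one-liner and the dummy values can no longer sneak into an average.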
2) What is the salary (not total compensation) of the CEO of Duke Power Company?
We need to search through the data set for Duke Power, then read
off the salary of its CEO. One approach is to look at the
company names row by row. For those who find joy only in tedium,
this is the preferred approach. All others should go to the Edit
menu option, and select Search and then Find. Type
in "Duke Power," selecting nothing else. Another approach
is to sort the data alphabetically by Company, and hunt for Duke
Power. By the way, check out Blockbuster and Disney. I was
mildly surprised that Blockbuster is considered a retail industry
rather than an entertainment industry. I also didn't expect Disney to be a
"travel" industry. Who knew....
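The Find approach amounts to scanning one column for a piece of text. A rough Python equivalent is sketched below. The table is made up (placeholder company names and salaries, not forbes94), and `find_company` is a helper name I invented for this example.

```python
# Made-up rows (NOT the real forbes94 values); company names are placeholders.
table = [
    {"Company": "Company A", "Salary": 500},
    {"Company": "Company B Power", "Salary": 650},
    {"Company": "Company C", "Salary": 700},
]

def find_company(rows, text):
    """Return every row whose Company contains `text`, ignoring case."""
    needle = text.lower()
    return [row for row in rows if needle in row["Company"].lower()]

matches = find_company(table, "power")
for row in matches:
    print(row["Company"], row["Salary"])
```

Ignoring case and matching on part of the name guards against small differences in how a company name was typed.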
3) Which CEO has the highest total compensation? Who has
the lowest total compensation?
4) Which industry type has the highest average CEO total
compensation? Be careful not to misread the decimals
when you answer the question.
There are way too many CEOs to figure this out by hand. Let JMP do all the work. Select the Tables menu option and click on Summary. Put the variable "Wide Industry" in the Group box, then highlight "Total Compensation". Next, click on Statistics to pull down a menu of summary statistics. Select the Mean (and just for kicks, one other summary that interests you). Click OK. You should see a table of the statistics you selected for the industries, ordered alphabetically by industry. You may need to scroll down to see all industries.
Each row in the table reports the statistics for one industry, aggregated
over the CEOs in that industry. For example, there are 62 CEOs in "Food"
industries, and their average total compensation equals $2,740,661.31.
That's a lot of Twinkies.
In general, the summary command is useful for comparing means and
other statistics for several groups. It's worth remembering.
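If you are wondering what the summary command does behind the scenes, the idea is "split the rows into groups, then compute a statistic within each group." Here is a Python sketch of that idea. The industries and dollar figures are made up for illustration and are not the forbes94 values.

```python
from collections import defaultdict
from statistics import mean

# Made-up (industry, total compensation) pairs, NOT the real forbes94 values.
rows = [
    ("Food", 300.0), ("Retail", 450.0), ("Food", 500.0),
    ("Energy", 250.0), ("Retail", 150.0),
]

# Step 1: split the compensation values into one list per industry.
groups = defaultdict(list)
for industry, compensation in rows:
    groups[industry].append(compensation)

# Step 2: one summary row per industry, ordered alphabetically.
summary = {
    industry: {"n": len(values), "mean": mean(values)}
    for industry, values in sorted(groups.items())
}

for industry, stats in summary.items():
    print(industry, stats["n"], stats["mean"])
```

Reporting the group size next to the mean is a good habit, because a mean based on one or two CEOs deserves less trust than a mean based on sixty.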
5) How many of these CEOs got their undergraduate degree from
Duke?
6) Let's assume all the CEOs from UNC schools graduated from
UNC Chapel Hill. Assuming this, there are many more CEOs with
undergraduate degrees from Carolina than there are from Duke.
Your friends at Carolina use this to argue that their graduates
are more likely to be CEOs than Duke graduates. Defend our
school! Use the CEO counts to make a statistical
argument that Duke does not lag behind Carolina in producing
CEOs.
Write two or fewer sentences to justify your answer. Hint: To come up with a
good argument, you need information about UNC and Duke that is not in
this data set. Once you've identified what you need, ask the TAs
for that information. There is more than one correct answer (it's
an argument, after all), and there are wrong answers.
7) Highest attained educational degree is in the variable
"Grad degree". Which degree has the highest total
compensation: MBA (business), JD (law), MD (physician), PhD,
or no graduate degree? Use highest average
total compensation as your criterion, and choose only from these categories.
Data analysis tip: We cannot say definitively from these data that
obtaining one degree results in higher compensation than obtaining
other degrees. There are small numbers of people in some degree
groups. Statisticians typically hesitate to make strong
statements based on only a few observations. Plus, there could be
lots of reasons why certain degrees have higher compensation than
others; it may not be just the effect of the degree that drives
compensation. We'll talk more about these issues throughout the
course.
8) Explore the data to answer at least one question that
interests you. Report your findings to one of the TAs or the
instructor; you don't have to write anything on your lab sheet for this
question. The TAs will give you credit for answering this problem
when you report to them. Ask your TAs for help with JMP if needed.
You may want to begin your list of JMP commands by adding instructions
for the methods you used in Lab 1. We'll use sorting and
summarizing by groups for Lab 2 (and for later labs), so it will be
helpful to have commands for those data analysis tools handy.
(Obviously, don't turn in this list; it's yours!)
This ends the lab. Remember to turn in your lab sheet to the TA. Include your name and lab time on the sheet.
DON'T FORGET TO LOG OFF FROM YOUR MACHINE.