Introduction to Statistics (MAT/SST 115.03 2008S)
Primary: [Front Door] [Syllabus] [Current Outline] [R] - [Academic Honesty] [Instructions]
Groupings: [Applets] [Assignments] [Data] [Examples] [Handouts] [Labs] [Outlines] [Projects] [Readings] [Solutions]
External Links: [R Front Door] [SamR's Front Door]
In R, you use the cor
function to
find the correlation between two samples. You can call
cor
on a data frame with two columns. You can also
call cor
on two vectors. (Since the correlation
coefficient is symmetrical, it doesn't really matter which one you
enter first.)
Unfortunately, cor
does not like NA values, so
you have to remove them from the frame before calling
cor
. (Removing them from the vectors is harder,
since you want to remove values from the same place in both vectors.
In cases in which either vector has an NA value, combine them into a
data frame first.) The na.omit
function does
the hard work for you.
In this activity, you will be working with data from the file
Cars99.csv
. As you should be able to guess
by now, you can load those data with
Cars99 = read.csv("/home/rebelsky/Stats115/Data/Cars99.csv")
Let's learn a little about the data set.
> summary(Cars99)Model Page.Number City.MPG Highway.MPG Fuel.Capacity
Acura Integra: 1 Min. : 61.0 Min. :17.00 Min. :23.00 Min. :10.30
Acura RL : 1 1st Qu.:100.0 1st Qu.:19.00 1st Qu.:26.00 1st Qu.:14.50
Acura TL : 1 Median :160.0 Median :20.50 Median :28.50 Median :16.20
Audi A4 : 1 Mean :153.9 Mean :20.96 Mean :28.74 Mean :16.38
Audi A6 : 1 3rd Qu.:207.0 3rd Qu.:23.00 3rd Qu.:30.75 3rd Qu.:18.43
Audi A8 : 1 Max. :249.0 Max. :30.00 Max. :38.00 Max. :23.70
(Other) :103 NA's : 3.00 NA's : 3.00 NA's : 1.00
Weight Front.Weight Acceleration.0.to.30 Acceleration.0.to.60
Min. :1845 Min. :46.00 Min. : 2.400 Min. : 5.600
1st Qu.:2845 1st Qu.:59.00 1st Qu.: 3.300 1st Qu.: 8.800
Median :3175 Median :62.00 Median : 3.500 Median : 9.500
Mean :3186 Mean :60.41 Mean : 3.548 Mean : 9.733
3rd Qu.:3545 3rd Qu.:63.00 3rd Qu.: 3.900 3rd Qu.:10.900
Max. :4145 Max. :65.00 Max. : 4.500 Max. :12.500
NA's : 1.00 NA's :36.000 NA's :36.000
Time.for.Quarter.Mile Type
Min. :14.10 family :28
1st Qu.:16.80 large :12
Median :17.40 luxury :13
Mean :17.40 small :25
3rd Qu.:18.20 sports :16
Max. :19.10 upscale:15
NA's :36.00
Note that many variables have some NA values.
Because the data set has NA values, we'll need to do a bit of cleanup first. (Yay!) First, we'll extract the two columns of interest.
tmp = data.frame(TfQM = Cars99$Time.for.Quarter.Mile, Weight=Cars99$Weight)
Next, we'll remove the rows with an NA value.
tmp = na.omit(tmp)
Finally, we'll compute the correlation coefficients.
cor(tmp)
We could also express that more concisely as
cor(na.omit(data.frame(Cars99$Time.for.Quarter.Mile, Cars99$Weight)))
That is,
We'll use a similar strategy in future activities.
Here are the first few computations (for B, C, and D). You should be able to figure out the rest. Remember to look at p. 536 to figure out what columns to use.
cor(na.omit(data.frame(Cars99$Acceleration.0.to.60, Cars99$Time.for.Quarter.Mile))) cor(na.omit(data.frame(Cars99$Page.Number, Cars99$Fuel.Capacity))) cor(na.omit(data.frame(Cars99$Weight, Cars99$City.MPG)))
You can load the data with
StateData = read.csv("/home/rebelsky/Stats115/Data/Governors05.csv")
Since that data set turns out to have many more columns than are reported in the table on p. 554, it's useful to see a quick summary of the data.
>
summary(StateData)
State Eastwest Region Median.Housing.Value Median.Household.Income
Alabama : 1 east:26 Midwest :12 Min. : 70700 Min. :32397
Alaska : 1 west:24 Northeast: 9 1st Qu.: 88475 1st Qu.:38414
Arizona : 1 South :16 Median :108200 Median :42621
Arkansas : 1 West :13 Mean :117990 Mean :43172
California: 1 3rd Qu.:139825 3rd Qu.:47980
Colorado : 1 Max. :272700 Max. :56409
(Other) :44
Governor.Salary
Min. : 70000
1st Qu.: 95340
Median :111352
Mean :117984
3rd Qu.:132161
Max. :179000
You can plot those data with
plot(StateData$Median.Housing.Value, StateData$Governor.Salary)
We are fortunate that these data do not include any NA
values. Hence, we can compute the correlation coeficient directly.
cor(StateData$Median.Housing.Value, StateData$Governor.Salary)
It's been a long time since we've updated a frame. We can use
fix
to bring up the data editor on that
frame.
fix(StateData)
After changing the appropriate cell, click somewhere else in the grid before closing the edit window. Your changes are not saved until you close the window. (You also cannot continue working in R until you close the window.)
At least in the Unix version of R, it seems that you change a cell by deleting the old contents and then enter the new contents. The new salary should be 194780.
You may want to check to make sure that the salary was, indeed, updated.
StateData[11,]
See if you can figure this one out on your own. The data are
stored in TVlife06.csv
. You'll ned to
figure out the columns before trying to compute the correlation
coefficient.
If you can't figure it out, a solution is available at the end of the topic page
Parts a-c are all simple calculations that you should be able to do on your own.
Part d is there primarily to confirm your answers above, so let's just look at the code. (You had to figure out code for 27-3, so I'm not sure that there's an advantage to requiring you to do so again.)
HousePrices = read.csv("/home/rebelsky/Stats115/Data/HousePricesAG.csv") cor(HousePrices$Price, HousePrices$Size)
Okay, more data. Let's load it in.
ExamScores = read.csv("/home/rebelsky/Stats115/Data/ExamScores.csv")
They wouldn't give us more NA
values, would they?
>
summary(ExamScores)
Student Exam1 Exam2
S01 : 1 Min. :72.00 Min. :50.00
S02 : 1 1st Qu.:82.25 1st Qu.:68.50
S03 : 1 Median :88.50 Median :80.00
S04 : 1 Mean :87.38 Mean :78.65
S05 : 1 3rd Qu.:95.00 3rd Qu.:91.00
S06 : 1 Max. :98.00 Max. :97.00
(Other):18 NA's : 1.00
Yes, I guess they would. Let's get rid of it.
ExamScores = na.omit(ExamScores)
plot(ExamScores$Exam1, ExamScores$Exam2)
You should be able to compute this one on your own.
You can change the exam 1 grades en masse with
ExamScores$Exam1 = ExamScores$Exam1 - 10
If you look at 27-6 d., you should be able to figure out how to double the grades on exam 2.
exam1 = c(...) exam2 = exam1 + 10 plot(exam1, exam2) cor(exam1,exam2)
We can continue to use the exam 1 grades from part h. We just need to change the way we compute the exam 2 grade.
exam2 = exam1 * 2 plot(exam1, exam2) cor(exam1,exam2)
Have we overwhelmed you with data yet? I hope not. Anyway, let's load the data and check for potential problems. I've separated the data into two files.
Draft1970 = read.csv("/home/rebelsky/Stats115/Data/Draft1970.csv") Draft1971 = read.csv("/home/rebelsky/Stats115/Data/Draft1971.csv")
Let's check the column (variable) names for the data set.
>
names(Draft1970)
[1] "Month" "Day.of.Month" "Day.of.Year" "Draft.Number"
>
names(Draft1971)
[1] "Month" "Day.of.Month" "Day.of.Year" "Draft.Number"
Some notes on the data: The 1970 data have a draft number for February 29, since they were drafting people born in a number of different years. The 1971 data do not have a draft number for February 29 because they were only for men born in 1951, which did not have a February 29. In case you care, the lottery the authors are describing as conducted in 1970 was actually held on December 1, 1969 to determine the order for those to be drafted into service in 1970, and the “Data for 1971” are for a lottery held in 1970 to determine the order for those to be drafted into service in 1971. The authors note that http://www.landscaper.net/draft.htm is the source of their data, so I trust the dates on that page for these corrections.
You will find it easiest to determine your draft number by opening the data set in the editor.
edit(Draft1970)
We begin by reading in the file.
>
TVlife = read.csv("/home/rebelsky/Stats115/Data/TVlife06.csv")
We determine the column names (and, along the way, learn a bit more about each variable).
>
summary(TVlife)
Country Life.Expectancy TVs.per.K
Angola : 1 Min. :38.45 Min. : 5.0
Australia: 1 1st Qu.:59.70 1st Qu.: 87.0
Cambodia : 1 Median :70.72 Median :177.0
Canada : 1 Mean :67.02 Mean :298.4
China : 1 3rd Qu.:77.16 3rd Qu.:570.2
Egypt : 1 Max. :81.25 Max. :844.0
(Other) :16
Remember that the X variable goes first.
>
cor(TVlife$TVs.per.K, TVlife$Life.Expectancy)
[1] 0.7432492
You will need to create a data frame (or two vectors) in order to do the analysis. Here's a start.
RNCtemps = data.frame(Month=1:12, AvgTemp = c(39, 42, ...))
You can load the data from the file Solitaire.csv
.
One nice thing about R's cor
is that, when given
a data frame as a parameter, it computes all the pairs of correlation
coefficients.
cor(Solitaire)
You can load the data from the file Climate.csv
.
We can once again take advantage of R's willingness to compute simultaneously lots of correlation coefficients. However, this time we need to throw away one column (the city names). So, we need to coefficients for columns 2-9.
Climate2 = Climate[,2:9]
You should be able to figure out the rest.
Primary: [Front Door] [Syllabus] [Current Outline] [R] - [Academic Honesty] [Instructions]
Groupings: [Applets] [Assignments] [Data] [Examples] [Handouts] [Labs] [Outlines] [Projects] [Readings] [Solutions]
External Links: [R Front Door] [SamR's Front Door]
Copyright (c) 2007-8 Samuel A. Rebelsky.
This work is licensed under a Creative Commons
Attribution-NonCommercial 2.5 License. To view a copy of this
license, visit http://creativecommons.org/licenses/by-nc/2.5/
or send a letter to Creative Commons, 543 Howard Street, 5th Floor,
San Francisco, California, 94105, USA.