Introduction to Statistics (MAT/SST 115.03 2008S)
In R, you use the cor function to find the correlation between two samples. You can call cor on a data frame with two columns. You can also call cor on two vectors. (Since the correlation coefficient is symmetric, it doesn't matter which one you enter first.)
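As a small sketch of the two calling styles, the example below uses two short vectors of made-up values (x and y are illustrative names and are not taken from the course data). It assumes only base R.

```r
# Made-up illustrative values; not from any course data set.
x = c(1, 2, 3, 4, 5)
y = c(2, 4, 5, 4, 5)

# Calling cor on two vectors gives a single number.
r.xy = cor(x, y)
r.yx = cor(y, x)   # same value, because the coefficient is symmetric

# Calling cor on a two-column data frame gives a 2-by-2 matrix.
# The diagonal entries are 1 (each column against itself) and the
# off-diagonal entries are the coefficient computed above.
m = cor(data.frame(x, y))
m
```

The matrix form is more than you need for two columns, but it is the same function doing the same computation.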
Unfortunately, cor does not handle NA values gracefully: by default, its result is NA whenever the input contains one. So you have to remove them from the frame before calling cor. (Removing them from the vectors is harder, since you need to remove values from the same positions in both vectors. If either vector has an NA value, combine the vectors into a data frame first.) The na.omit function does the hard work for you.
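To see why the data frame step matters, here is a minimal sketch with two made-up vectors whose NA values sit in different positions (a, b, and clean are illustrative names only). The last line uses the use argument of cor, which is an alternative to the approach taken in this handout.

```r
# Made-up illustrative values; the NA entries are in different positions.
a = c(1, 2, NA, 4, 5, 6)
b = c(2, 1, 4, NA, 6, 5)

# With the default settings, any NA in the input makes the result NA.
cor(a, b)

# Putting the vectors in a data frame lets na.omit drop whole rows,
# so the values that remain are still paired correctly.
clean = na.omit(data.frame(a, b))
nrow(clean)                    # rows 3 and 4 are dropped, leaving 4
r = cor(clean$a, clean$b)

# Alternative, not the approach used in this handout: ask cor itself
# to use only the complete pairs.
r.alt = cor(a, b, use = "complete.obs")
```

Both routes use exactly the same pairs of values, so r and r.alt agree.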
In this activity, you will be working with data from the file Cars99.csv. As you should be able to guess by now, you can load those data with
Cars99 = read.csv("/home/rebelsky/Stats115/Data/Cars99.csv")
Let's learn a little about the data set.
> summary(Cars99)
           Model      Page.Number       City.MPG      Highway.MPG    Fuel.Capacity
 Acura Integra:  1   Min.   : 61.0   Min.   :17.00   Min.   :23.00   Min.   :10.30
 Acura RL     :  1   1st Qu.:100.0   1st Qu.:19.00   1st Qu.:26.00   1st Qu.:14.50
 Acura TL     :  1   Median :160.0   Median :20.50   Median :28.50   Median :16.20
 Audi A4      :  1   Mean   :153.9   Mean   :20.96   Mean   :28.74   Mean   :16.38
 Audi A6      :  1   3rd Qu.:207.0   3rd Qu.:23.00   3rd Qu.:30.75   3rd Qu.:18.43
 Audi A8      :  1   Max.   :249.0   Max.   :30.00   Max.   :38.00   Max.   :23.70
 (Other)      :103                   NA's   : 3.00   NA's   : 3.00   NA's   : 1.00

     Weight      Front.Weight   Acceleration.0.to.30  Acceleration.0.to.60
 Min.   :1845   Min.   :46.00   Min.   : 2.400        Min.   : 5.600
 1st Qu.:2845   1st Qu.:59.00   1st Qu.: 3.300        1st Qu.: 8.800
 Median :3175   Median :62.00   Median : 3.500        Median : 9.500
 Mean   :3186   Mean   :60.41   Mean   : 3.548        Mean   : 9.733
 3rd Qu.:3545   3rd Qu.:63.00   3rd Qu.: 3.900        3rd Qu.:10.900
 Max.   :4145   Max.   :65.00   Max.   : 4.500        Max.   :12.500
                NA's   : 1.00   NA's   :36.000        NA's   :36.000

 Time.for.Quarter.Mile       Type
 Min.   :14.10          family :28
 1st Qu.:16.80          large  :12
 Median :17.40          luxury :13
 Mean   :17.40          small  :25
 3rd Qu.:18.20          sports :16
 Max.   :19.10          upscale:15
 NA's   :36.00
Note that many variables have some NA values.
Because the data set has NA values, we'll need to do a bit of cleanup first. (Yay!) First, we'll extract the two columns of interest.
tmp = data.frame(TfQM = Cars99$Time.for.Quarter.Mile, Weight=Cars99$Weight)
Next, we'll remove the rows with an NA value.
tmp = na.omit(tmp)
Finally, we'll compute the correlation coefficients.
cor(tmp)
We could also express that more concisely as
cor(na.omit(data.frame(Cars99$Time.for.Quarter.Mile, Cars99$Weight)))
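That is, we build a two-column data frame, drop the rows that contain an NA value, and pass the result to cor.
We'll use a similar strategy in future activities.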
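Since the same pattern recurs, one option is to wrap it in a small helper. The sketch below is only an illustration: the name cor_pair is hypothetical (it is not part of R or of the course materials), and the sample values are made up.

```r
# Hypothetical helper, not part of R or the course files:
# correlation of two equal-length vectors, skipping every position
# where either vector has an NA value.
cor_pair = function(x, y) {
  clean = na.omit(data.frame(x = x, y = y))
  cor(clean$x, clean$y)   # a single number rather than a 2-by-2 matrix
}

# Made-up illustrative values; only positions 1, 4, and 5 are complete.
u = c(10, 12, NA, 15, 18)
v = c(3, NA, 5, 6, 8)
cor_pair(u, v)
```

With the course data loaded, the computation above would then read cor_pair(Cars99$Time.for.Quarter.Mile, Cars99$Weight).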
Here are the first few computations (for B, C, and D). You should be able to figure out the rest. Remember to look at p. 536 to figure out what columns to use.
cor(na.omit(data.frame(Cars99$Acceleration.0.to.60, Cars99$Time.for.Quarter.Mile)))
cor(na.omit(data.frame(Cars99$Page.Number, Cars99$Fuel.Capacity)))
cor(na.omit(data.frame(Cars99$Weight, Cars99$City.MPG)))
Copyright (c) 2007-8 Samuel A. Rebelsky.
This work is licensed under a Creative Commons
Attribution-NonCommercial 2.5 License. To view a copy of this
license, visit http://creativecommons.org/licenses/by-nc/2.5/
or send a letter to Creative Commons, 543 Howard Street, 5th Floor,
San Francisco, California, 94105, USA.