Introduction to Statistics (MAT/SST 115.03 2008S)
In R, you use the cor function to find the correlation between two samples. You can call cor on a data frame with two columns. You can also call cor on two vectors. (Since the correlation coefficient is symmetrical, it doesn't really matter which one you list first.)
cor does not like NA values, so you have to remove them from the frame before calling cor. (Removing them from the vectors is harder, since you want to remove values from the same place in both vectors. In cases in which either vector has an NA value, combine them into a data frame first.) The na.omit function does the hard work for you.
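As a small illustration of these points, here is a sketch that uses made-up vectors. The names x, y, y.missing, frame, and cleaned are invented for this example and are not part of the course data.

```r
# Made-up data for illustration only; not taken from Cars99.
x = c(1, 2, 3, 4, 5)
y = c(2.1, 3.9, 6.2, 8.1, 9.8)

# Calling cor on two vectors gives a single number.
# The order of the arguments does not matter.
cor(x, y)
cor(y, x)

# Calling cor on a two-column data frame gives a 2-by-2 matrix;
# the off-diagonal entries hold the same coefficient.
frame = data.frame(x, y)
cor(frame)

# With an NA in either vector, cor returns NA by default.
y.missing = c(2.1, NA, 6.2, 8.1, 9.8)
cor(x, y.missing)

# na.omit drops every row of the frame that contains an NA,
# so the two columns stay lined up.
cleaned = na.omit(data.frame(x, y.missing))
nrow(cleaned)
cor(cleaned)
```

Because na.omit works on whole rows, the second entry of x is dropped along with the missing entry of y.missing, which is the pairing you want.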
In this activity, you will be working with data from the file
Cars99.csv. As you should be able to guess
by now, you can load those data with
Cars99 = read.csv("/home/rebelsky/Stats115/Data/Cars99.csv")
Let's learn a little about the data set.
summary(Cars99)
           Model      Page.Number       City.MPG      Highway.MPG    Fuel.Capacity
 Acura Integra:  1   Min.   : 61.0   Min.   :17.00   Min.   :23.00   Min.   :10.30
 Acura RL     :  1   1st Qu.:100.0   1st Qu.:19.00   1st Qu.:26.00   1st Qu.:14.50
 Acura TL     :  1   Median :160.0   Median :20.50   Median :28.50   Median :16.20
 Audi A4      :  1   Mean   :153.9   Mean   :20.96   Mean   :28.74   Mean   :16.38
 Audi A6      :  1   3rd Qu.:207.0   3rd Qu.:23.00   3rd Qu.:30.75   3rd Qu.:18.43
 Audi A8      :  1   Max.   :249.0   Max.   :30.00   Max.   :38.00   Max.   :23.70
 (Other)      :103                   NA's   : 3.00   NA's   : 3.00   NA's   : 1.00
     Weight      Front.Weight    Acceleration.0.to.30   Acceleration.0.to.60
 Min.   :1845   Min.   :46.00   Min.   : 2.400         Min.   : 5.600
 1st Qu.:2845   1st Qu.:59.00   1st Qu.: 3.300         1st Qu.: 8.800
 Median :3175   Median :62.00   Median : 3.500         Median : 9.500
 Mean   :3186   Mean   :60.41   Mean   : 3.548         Mean   : 9.733
 3rd Qu.:3545   3rd Qu.:63.00   3rd Qu.: 3.900         3rd Qu.:10.900
 Max.   :4145   Max.   :65.00   Max.   : 4.500         Max.   :12.500
                NA's   : 1.00   NA's   :36.000         NA's   :36.000
 Min.   :14.10   family :28
 1st Qu.:16.80   large  :12
 Median :17.40   luxury :13
 Mean   :17.40   small  :25
 3rd Qu.:18.20   sports :16
 Max.   :19.10   upscale:15
Note that many variables have some NA values.
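If you want the NA counts without reading through the whole summary, one possible approach is sketched here. The frame cars and its values are made up for this example; with the real data you would use Cars99 in its place.

```r
# Made-up frame for illustration only; not the Cars99 data.
cars = data.frame(
  Weight = c(2500, NA, 3100, 3400),
  City.MPG = c(28, 24, NA, NA)
)

# is.na marks each missing entry as TRUE; colSums counts the TRUEs
# in each column.
missing.counts = colSums(is.na(cars))
missing.counts

# complete.cases marks the rows that have no NA at all.
sum(complete.cases(cars))
```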
Because the data set has NA values, we'll need to do a bit of cleanup first. (Yay!) First, we'll extract the two columns of interest.
tmp = data.frame(TfQM = Cars99$Time.for.Quarter.Mile, Weight=Cars99$Weight)
Next, we'll remove the rows with an NA value.
tmp = na.omit(tmp)
Finally, we'll compute the correlation coefficients.
cor(tmp)
We could also express that more concisely as
cor(na.omit(data.frame(Cars99$Time.for.Quarter.Mile, Cars99$Weight)))
We'll use a similar strategy in future activities.
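Since the same three steps recur, they could be wrapped in a small function. The sketch below is only one way to do that: the helper name paired.cor and the vectors weight and time are invented for this example. It also compares the result against the use argument of cor, an alternative that this activity does not otherwise rely on.

```r
# Hypothetical helper (the name paired.cor is invented here).
# It builds a two-column frame, drops rows with an NA, and returns
# the single coefficient from the 2-by-2 matrix that cor produces.
paired.cor = function(a, b) {
  cleaned = na.omit(data.frame(a, b))
  cor(cleaned)[1, 2]
}

# Made-up data for illustration only; not taken from Cars99.
weight = c(2500, 2800, NA, 3400, 3900)
time = c(16.1, NA, 17.0, 17.8, 18.4)

paired.cor(weight, time)

# cor can also skip incomplete pairs itself when asked to.
cor(weight, time, use = "complete.obs")
```

Both calls use only the three pairs in which neither value is NA, so they should agree.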
Here are the first few computations (for B, C, and D). You should be able to figure out the rest. Remember to look at p. 536 to figure out what columns to use.
cor(na.omit(data.frame(Cars99$Acceleration.0.to.60, Cars99$Time.for.Quarter.Mile)))
cor(na.omit(data.frame(Cars99$Page.Number, Cars99$Fuel.Capacity)))
cor(na.omit(data.frame(Cars99$Weight, Cars99$City.MPG)))
Copyright (c) 2007-8 Samuel A. Rebelsky.
This work is licensed under a Creative Commons
Attribution-NonCommercial 2.5 License. To view a copy of this
license, visit the Creative Commons website
or send a letter to Creative Commons, 543 Howard Street, 5th Floor,
San Francisco, California, 94105, USA.