Introduction to Statistics (MAT/SST 115.03 2008S)

R notes for Activity 27-1: Car Data

In R, you use the cor function to find the correlation between two samples. You can call cor on a data frame with two columns. You can also call cor on two vectors. (Since the correlation coefficient is symmetrical, it doesn't really matter which one you enter first.)
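Here is a small sketch of the two calling styles. The vectors x and y are made-up values for illustration, not part of the car data.

```r
# Hypothetical toy data; not taken from Cars99.
x = c(1, 2, 3, 4, 5)
y = c(2, 4, 5, 4, 5)

# Two vectors: cor returns a single number.
r_xy = cor(x, y)
r_yx = cor(y, x)   # same value, since the order does not matter

# A data frame with two columns: cor returns a 2-by-2 matrix.
# The off-diagonal entries hold the correlation between x and y.
m = cor(data.frame(x, y))

print(r_xy)
print(r_yx)
print(m)
```

When you pass a data frame, the number you want is in the off-diagonal cells of the matrix; the diagonal is always 1.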

Unfortunately, cor does not like NA values, so you have to remove them from the frame before calling cor. (Removing them from the vectors is harder, since you want to remove values from the same place in both vectors. In cases in which either vector has an NA value, combine them into a data frame first.) The na.omit function does the hard work for you.
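Here is a small sketch of that cleanup on two made-up vectors, a and b (hypothetical values, with NA in different positions). It assumes a current version of R, in which cor with its default settings returns NA when an input has a missing value.

```r
# Hypothetical toy data with NA values in different places.
a = c(1, 2, NA, 4, 5, 6)
b = c(2, NA, 3, 5, 4, 7)

# With default settings, the missing values carry through to the result.
raw = cor(a, b)
print(raw)

# Combine the vectors into a frame, then drop every row
# that has an NA in either column.
clean = na.omit(data.frame(a, b))
print(nrow(clean))   # rows 2 and 3 are dropped, so 4 rows remain

# Now cor has only complete pairs to work with.
print(cor(clean))
```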

In this activity, you will be working with data from the file Cars99.csv. As you should be able to guess by now, you can load those data with

Cars99 = read.csv("/home/rebelsky/Stats115/Data/Cars99.csv")

Let's learn a little about the data set.

> summary(Cars99)
           Model      Page.Number       City.MPG      Highway.MPG    Fuel.Capacity  
 Acura Integra:  1   Min.   : 61.0   Min.   :17.00   Min.   :23.00   Min.   :10.30  
 Acura RL     :  1   1st Qu.:100.0   1st Qu.:19.00   1st Qu.:26.00   1st Qu.:14.50  
 Acura TL     :  1   Median :160.0   Median :20.50   Median :28.50   Median :16.20  
 Audi A4      :  1   Mean   :153.9   Mean   :20.96   Mean   :28.74   Mean   :16.38  
 Audi A6      :  1   3rd Qu.:207.0   3rd Qu.:23.00   3rd Qu.:30.75   3rd Qu.:18.43  
 Audi A8      :  1   Max.   :249.0   Max.   :30.00   Max.   :38.00   Max.   :23.70  
 (Other)      :103                   NA's   : 3.00   NA's   : 3.00   NA's   : 1.00  
     Weight      Front.Weight
 Min.   :1845   Min.   :46.00   Min.   : 2.400       Min.   : 5.600      
 1st Qu.:2845   1st Qu.:59.00   1st Qu.: 3.300       1st Qu.: 8.800      
 Median :3175   Median :62.00   Median : 3.500       Median : 9.500      
 Mean   :3186   Mean   :60.41   Mean   : 3.548       Mean   : 9.733      
 3rd Qu.:3545   3rd Qu.:63.00   3rd Qu.: 3.900       3rd Qu.:10.900      
 Max.   :4145   Max.   :65.00   Max.   : 4.500       Max.   :12.500      
                NA's   : 1.00   NA's   :36.000       NA's   :36.000      
 Time.for.Quarter.Mile      Type   
 Min.   :14.10         family :28  
 1st Qu.:16.80         large  :12  
 Median :17.40         luxury :13  
 Mean   :17.40         small  :25  
 3rd Qu.:18.20         sports :16  
 Max.   :19.10         upscale:15  
 NA's   :36.00                     

Note that many variables have some NA values.
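If you want just the NA counts without the rest of the summary, one possible approach is to combine is.na with colSums. The sketch below uses a made-up frame named toy (a hypothetical stand-in) so that it runs without the data file; on the real data you would call colSums(is.na(Cars99)).

```r
# Hypothetical toy frame standing in for Cars99.
toy = data.frame(City.MPG = c(17, NA, 23),
                 Weight   = c(2845, 3175, 3545))

# is.na gives TRUE for each missing cell; colSums counts the TRUEs per column.
na_counts = colSums(is.na(toy))
print(na_counts)
```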

27-1 a. Your first correlation coefficient

Because the data set has NA values, we'll need to do a bit of cleanup first. (Yay!) First, we'll extract the two columns of interest.

tmp = data.frame(TfQM = Cars99$Time.for.Quarter.Mile, Weight=Cars99$Weight)

Next, we'll remove the rows with an NA value.

tmp = na.omit(tmp)

Finally, we'll compute the correlation coefficient.

cor(tmp)

We could also express that more concisely as

cor(na.omit(data.frame(Cars99$Time.for.Quarter.Mile, Cars99$Weight)))

That is,

  • Create a new data frame with just the columns of interest.
  • Remove the rows that contain NA from that frame.
  • Compute the correlation coefficient.

We'll use a similar strategy in future activities.
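Since the same three steps come up repeatedly, you could wrap them in a small function. The sketch below is one way to do that; the name cor_complete is hypothetical (it is not part of R), and the vectors u and v are made-up values. For comparison, it also calls cor with its use = "complete.obs" argument, which asks cor to skip incomplete pairs itself.

```r
# cor_complete is a hypothetical helper name, not a built-in function.
cor_complete = function(x, y) {
  frame = na.omit(data.frame(x, y))   # build the frame, drop rows with NA
  cor(frame$x, frame$y)               # return the coefficient as one number
}

# Hypothetical toy data with NA values in different places.
u = c(10, 12, NA, 15, 19, 21)
v = c(3.1, NA, 2.8, 2.5, 2.0, 1.7)

r1 = cor_complete(u, v)
r2 = cor(u, v, use = "complete.obs")   # built-in alternative

print(r1)
print(r2)
```

The two results should agree, because both approaches keep only the pairs in which neither value is NA.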

27-1 b. More correlation coefficients

Here are the first few computations (for B, C, and D). You should be able to figure out the rest. Remember to look at p. 536 to figure out what columns to use.

cor(na.omit(data.frame(Cars99$, Cars99$Time.for.Quarter.Mile)))
cor(na.omit(data.frame(Cars99$Page.Number, Cars99$Fuel.Capacity)))
cor(na.omit(data.frame(Cars99$Weight, Cars99$City.MPG)))

Samuel A. Rebelsky

Copyright (c) 2007-8 Samuel A. Rebelsky.

This work is licensed under a Creative Commons Attribution-NonCommercial 2.5 License. To view a copy of this license, send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.