Introduction to Statistics (MAT/SST 115.03 2008S)

R Notes for Topic 27: Correlation Coefficient


R notes for Activity 27-1: Car Data

In R, you use the cor function to find the correlation between two samples. You can call cor on a data frame with two columns. You can also call cor on two vectors. (Since the correlation coefficient is symmetrical, it doesn't really matter which one you enter first.)

Unfortunately, cor does not like NA values, so you have to remove them from the frame before calling cor. (Removing them from the vectors is harder, since you want to remove values from the same place in both vectors. In cases in which either vector has an NA value, combine them into a data frame first.) The na.omit function does the hard work for you.

In this activity, you will be working with data from the file Cars99.csv. As you should be able to guess by now, you can load those data with

Cars99 = read.csv("/home/rebelsky/Stats115/Data/Cars99.csv")

Let's learn a little about the data set.

> summary(Cars99)
           Model      Page.Number       City.MPG      Highway.MPG    Fuel.Capacity  
 Acura Integra:  1   Min.   : 61.0   Min.   :17.00   Min.   :23.00   Min.   :10.30  
 Acura RL     :  1   1st Qu.:100.0   1st Qu.:19.00   1st Qu.:26.00   1st Qu.:14.50  
 Acura TL     :  1   Median :160.0   Median :20.50   Median :28.50   Median :16.20  
 Audi A4      :  1   Mean   :153.9   Mean   :20.96   Mean   :28.74   Mean   :16.38  
 Audi A6      :  1   3rd Qu.:207.0   3rd Qu.:23.00   3rd Qu.:30.75   3rd Qu.:18.43  
 Audi A8      :  1   Max.   :249.0   Max.   :30.00   Max.   :38.00   Max.   :23.70  
 (Other)      :103                   NA's   : 3.00   NA's   : 3.00   NA's   : 1.00  
     Weight      Front.Weight   Acceleration.0.to.30 Acceleration.0.to.60
 Min.   :1845   Min.   :46.00   Min.   : 2.400       Min.   : 5.600      
 1st Qu.:2845   1st Qu.:59.00   1st Qu.: 3.300       1st Qu.: 8.800      
 Median :3175   Median :62.00   Median : 3.500       Median : 9.500      
 Mean   :3186   Mean   :60.41   Mean   : 3.548       Mean   : 9.733      
 3rd Qu.:3545   3rd Qu.:63.00   3rd Qu.: 3.900       3rd Qu.:10.900      
 Max.   :4145   Max.   :65.00   Max.   : 4.500       Max.   :12.500      
                NA's   : 1.00   NA's   :36.000       NA's   :36.000      
 Time.for.Quarter.Mile      Type   
 Min.   :14.10         family :28  
 1st Qu.:16.80         large  :12  
 Median :17.40         luxury :13  
 Mean   :17.40         small  :25  
 3rd Qu.:18.20         sports :16  
 Max.   :19.10         upscale:15  
 NA's   :36.00                     

Note that many variables have some NA values.

27-1 a. Your first correlation coefficient

Because the data set has NA values, we'll need to do a bit of cleanup first. (Yay!) First, we'll extract the two columns of interest.

tmp = data.frame(TfQM = Cars99$Time.for.Quarter.Mile, Weight=Cars99$Weight)

Next, we'll remove the rows with an NA value.

tmp = na.omit(tmp)

Finally, we'll compute the correlation coefficients.

cor(tmp)

We could also express that more concisely as

cor(na.omit(data.frame(Cars99$Time.for.Quarter.Mile, Cars99$Weight)))

That is,

  • Create a new data frame with just the columns of interest.
  • Remove the rows that contain NA from that frame.
  • Compute the correlation coefficient.

We'll use a similar strategy in future activities.

27-1 b. More correlation coefficients

Here are the first few computations (for B, C, and D). You should be able to figure out the rest. Remember to look at p. 536 to figure out what columns to use.

cor(na.omit(data.frame(Cars99$Acceleration.0.to.60, Cars99$Time.for.Quarter.Mile)))
cor(na.omit(data.frame(Cars99$Page.Number, Cars99$Fuel.Capacity)))
cor(na.omit(data.frame(Cars99$Weight, Cars99$City.MPG)))

R notes for Activity 27-2: Governors' Salaries

You can load the data with

StateData = read.csv("/home/rebelsky/Stats115/Data/Governors05.csv")

Since that data set turns out to have many more columns than are reported in the table on p. 554, it's useful to see a quick summary of the data.

> summary(StateData)
        State    Eastwest        Region   Median.Housing.Value Median.Household.Income
 Alabama   : 1   east:26   Midwest  :12   Min.   : 70700       Min.   :32397          
 Alaska    : 1   west:24   Northeast: 9   1st Qu.: 88475       1st Qu.:38414          
 Arizona   : 1             South    :16   Median :108200       Median :42621          
 Arkansas  : 1             West     :13   Mean   :117990       Mean   :43172          
 California: 1                            3rd Qu.:139825       3rd Qu.:47980          
 Colorado  : 1                            Max.   :272700       Max.   :56409          
 (Other)   :44                                                                        
 Governor.Salary 
 Min.   : 70000  
 1st Qu.: 95340  
 Median :111352  
 Mean   :117984  
 3rd Qu.:132161  
 Max.   :179000  

27-2 b. Scatterplots

You can plot those data with

plot(StateData$Median.Housing.Value, StateData$Governor.Salary)

27-2 d. Correlation Coefficient

We are fortunate that these data do not include any NA values. Hence, we can compute the correlation coeficient directly.

cor(StateData$Median.Housing.Value, StateData$Governor.Salary)

27-2 f. Changing Salaries

It's been a long time since we've updated a frame. We can use fix to bring up the data editor on that frame.

fix(StateData)

After changing the appropriate cell, click somewhere else in the grid before closing the edit window. Your changes are not saved until you close the window. (You also cannot continue working in R until you close the window.)

At least in the Unix version of R, it seems that you change a cell by deleting the old contents and then enter the new contents. The new salary should be 194780.

You may want to check to make sure that the salary was, indeed, updated.

StateData[11,]

R notes for Activity 27-3: Televisions and Life Expectancy

See if you can figure this one out on your own. The data are stored in TVlife06.csv. You'll ned to figure out the columns before trying to compute the correlation coefficient.

If you can't figure it out, a solution is available at the end of the topic page

R notes for Activity 27-5: House Prices

Parts a-c are all simple calculations that you should be able to do on your own.

Part d is there primarily to confirm your answers above, so let's just look at the code. (You had to figure out code for 27-3, so I'm not sure that there's an advantage to requiring you to do so again.)

HousePrices = read.csv("/home/rebelsky/Stats115/Data/HousePricesAG.csv")
cor(HousePrices$Price, HousePrices$Size)

R notes for Activity 27-6: Exam Score Improvements

Okay, more data. Let's load it in.

ExamScores = read.csv("/home/rebelsky/Stats115/Data/ExamScores.csv")

They wouldn't give us more NA values, would they?

> summary(ExamScores)
    Student       Exam1           Exam2      
 S01    : 1   Min.   :72.00   Min.   :50.00  
 S02    : 1   1st Qu.:82.25   1st Qu.:68.50  
 S03    : 1   Median :88.50   Median :80.00  
 S04    : 1   Mean   :87.38   Mean   :78.65  
 S05    : 1   3rd Qu.:95.00   3rd Qu.:91.00  
 S06    : 1   Max.   :98.00   Max.   :97.00  
 (Other):18                   NA's   : 1.00  

Yes, I guess they would. Let's get rid of it.

ExamScores = na.omit(ExamScores)

27-6 a. Scatterplots

plot(ExamScores$Exam1, ExamScores$Exam2)

27-6 b. Yet another correlation cofficient

You should be able to compute this one on your own.

27-6 d. Changing exam 1 grades.

You can change the exam 1 grades en masse with

ExamScores$Exam1 = ExamScores$Exam1 - 10

27-6 f. Changing exam 2 grades

If you look at 27-6 d., you should be able to figure out how to double the grades on exam 2.

27-6 h. Making up data

exam1 = c(...)
exam2 = exam1 + 10
plot(exam1, exam2)
cor(exam1,exam2)

27-6 j. Making up data

We can continue to use the exam 1 grades from part h. We just need to change the way we compute the exam 2 grade.

exam2 = exam1 * 2
plot(exam1, exam2)
cor(exam1,exam2)

R notes for Activity 27-7: Draft Lottery

Have we overwhelmed you with data yet? I hope not. Anyway, let's load the data and check for potential problems. I've separated the data into two files.

Draft1970 = read.csv("/home/rebelsky/Stats115/Data/Draft1970.csv")
Draft1971 = read.csv("/home/rebelsky/Stats115/Data/Draft1971.csv")

Let's check the column (variable) names for the data set.

> names(Draft1970)
[1] "Month"        "Day.of.Month" "Day.of.Year"  "Draft.Number"
> names(Draft1971)
[1] "Month"        "Day.of.Month" "Day.of.Year"  "Draft.Number"

Some notes on the data: The 1970 data have a draft number for February 29, since they were drafting people born in a number of different years. The 1971 data do not have a draft number for February 29 because they were only for men born in 1951, which did not have a February 29. In case you care, the lottery the authors are describing as conducted in 1970 was actually held on December 1, 1969 to determine the order for those to be drafted into service in 1970, and the “Data for 1971” are for a lottery held in 1970 to determine the order for those to be drafted into service in 1971. The authors note that http://www.landscaper.net/draft.htm is the source of their data, so I trust the dates on that page for these corrections.

27-7 a. Your draft number

You will find it easiest to determine your draft number by opening the data set in the editor.

edit(Draft1970)

Additional R notes for Activity 27-3: Televisions and Life Expectancy

We begin by reading in the file.

> TVlife = read.csv("/home/rebelsky/Stats115/Data/TVlife06.csv")

We determine the column names (and, along the way, learn a bit more about each variable).

> summary(TVlife)
      Country   Life.Expectancy   TVs.per.K    
 Angola   : 1   Min.   :38.45   Min.   :  5.0  
 Australia: 1   1st Qu.:59.70   1st Qu.: 87.0  
 Cambodia : 1   Median :70.72   Median :177.0  
 Canada   : 1   Mean   :67.02   Mean   :298.4  
 China    : 1   3rd Qu.:77.16   3rd Qu.:570.2  
 Egypt    : 1   Max.   :81.25   Max.   :844.0  
 (Other)  :16                                  

Remember that the X variable goes first.

> cor(TVlife$TVs.per.K, TVlife$Life.Expectancy)
[1] 0.7432492

R Notes for Topic-27 Homework Activities

Activity 27-12: Monthly Temperatures

You will need to create a data frame (or two vectors) in order to do the analysis. Here's a start.

RNCtemps = data.frame(Month=1:12, AvgTemp = c(39, 42, ...))

Activity 27-18: Solitaire

You can load the data from the file Solitaire.csv.

One nice thing about R's cor is that, when given a data frame as a parameter, it computes all the pairs of correlation coefficients.

cor(Solitaire)

Activity 27-19: Climatic Conditions

You can load the data from the file Climate.csv.

We can once again take advantage of R's willingness to compute simultaneously lots of correlation coefficients. However, this time we need to throw away one column (the city names). So, we need to coefficients for columns 2-9.

  Climate2 = Climate[,2:9]

You should be able to figure out the rest.

Creative Commons License

Samuel A. Rebelsky, rebelsky@grinnell.edu

Copyright (c) 2007-8 Samuel A. Rebelsky.

This work is licensed under a Creative Commons Attribution-NonCommercial 2.5 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/2.5/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.