# R Notes for Topic 27: Correlation Coefficient

## R notes for Activity 27-1: Car Data

In R, you use the `cor` function to find the correlation between two samples. You can call `cor` on a data frame with two columns. You can also call `cor` on two vectors. (Since the correlation coefficient is symmetrical, it doesn't really matter which one you enter first.)

Unfortunately, `cor` does not like NA values, so you have to remove them from the frame before calling `cor`. (Removing them from the vectors is harder, since you want to remove values from the same place in both vectors. In cases in which either vector has an NA value, combine them into a data frame first.) The `na.omit` function does the hard work for you.

In this activity, you will be working with data from the file `Cars99.csv`. As you should be able to guess by now, you can load those data with

```Cars99 = read.csv("/home/rebelsky/Stats115/Data/Cars99.csv")
```

Let's learn a little about the data set.

```> summary(Cars99)
`           Model      Page.Number       City.MPG      Highway.MPG    Fuel.Capacity  `
` Acura Integra:  1   Min.   : 61.0   Min.   :17.00   Min.   :23.00   Min.   :10.30  `
` Acura RL     :  1   1st Qu.:100.0   1st Qu.:19.00   1st Qu.:26.00   1st Qu.:14.50  `
` Acura TL     :  1   Median :160.0   Median :20.50   Median :28.50   Median :16.20  `
` Audi A4      :  1   Mean   :153.9   Mean   :20.96   Mean   :28.74   Mean   :16.38  `
` Audi A6      :  1   3rd Qu.:207.0   3rd Qu.:23.00   3rd Qu.:30.75   3rd Qu.:18.43  `
` Audi A8      :  1   Max.   :249.0   Max.   :30.00   Max.   :38.00   Max.   :23.70  `
` (Other)      :103                   NA's   : 3.00   NA's   : 3.00   NA's   : 1.00  `
`     Weight      Front.Weight   Acceleration.0.to.30 Acceleration.0.to.60`
` Min.   :1845   Min.   :46.00   Min.   : 2.400       Min.   : 5.600      `
` 1st Qu.:2845   1st Qu.:59.00   1st Qu.: 3.300       1st Qu.: 8.800      `
` Median :3175   Median :62.00   Median : 3.500       Median : 9.500      `
` Mean   :3186   Mean   :60.41   Mean   : 3.548       Mean   : 9.733      `
` 3rd Qu.:3545   3rd Qu.:63.00   3rd Qu.: 3.900       3rd Qu.:10.900      `
` Max.   :4145   Max.   :65.00   Max.   : 4.500       Max.   :12.500      `
`                NA's   : 1.00   NA's   :36.000       NA's   :36.000      `
` Time.for.Quarter.Mile      Type   `
` Min.   :14.10         family :28  `
` 1st Qu.:16.80         large  :12  `
` Median :17.40         luxury :13  `
` Mean   :17.40         small  :25  `
` 3rd Qu.:18.20         sports :16  `
` Max.   :19.10         upscale:15  `
` NA's   :36.00                     `
```

Note that many variables have some NA values.

### 27-1 a. Your first correlation coefficient

Because the data set has NA values, we'll need to do a bit of cleanup first. (Yay!) First, we'll extract the two columns of interest.

```tmp = data.frame(TfQM = Cars99\$Time.for.Quarter.Mile, Weight=Cars99\$Weight)
```

Next, we'll remove the rows with an NA value.

```tmp = na.omit(tmp)
```

Finally, we'll compute the correlation coefficients.

```cor(tmp)
```

We could also express that more concisely as

```cor(na.omit(data.frame(Cars99\$Time.for.Quarter.Mile, Cars99\$Weight)))
```

That is,

• Create a new data frame with just the columns of interest.
• Remove the rows that contain NA from that frame.
• Compute the correlation coefficient.

We'll use a similar strategy in future activities.

### 27-1 b. More correlation coefficients

Here are the first few computations (for B, C, and D). You should be able to figure out the rest. Remember to look at p. 536 to figure out what columns to use.

```cor(na.omit(data.frame(Cars99\$Acceleration.0.to.60, Cars99\$Time.for.Quarter.Mile)))
cor(na.omit(data.frame(Cars99\$Page.Number, Cars99\$Fuel.Capacity)))
cor(na.omit(data.frame(Cars99\$Weight, Cars99\$City.MPG)))
```

## R notes for Activity 27-2: Governors' Salaries

You can load the data with

```StateData = read.csv("/home/rebelsky/Stats115/Data/Governors05.csv")
```

Since that data set turns out to have many more columns than are reported in the table on p. 554, it's useful to see a quick summary of the data.

````>` `summary(StateData)`
`        State    Eastwest        Region   Median.Housing.Value Median.Household.Income`
` Alabama   : 1   east:26   Midwest  :12   Min.   : 70700       Min.   :32397          `
` Alaska    : 1   west:24   Northeast: 9   1st Qu.: 88475       1st Qu.:38414          `
` Arizona   : 1             South    :16   Median :108200       Median :42621          `
` Arkansas  : 1             West     :13   Mean   :117990       Mean   :43172          `
` California: 1                            3rd Qu.:139825       3rd Qu.:47980          `
` Colorado  : 1                            Max.   :272700       Max.   :56409          `
` (Other)   :44                                                                        `
` Governor.Salary `
` Min.   : 70000  `
` 1st Qu.: 95340  `
` Median :111352  `
` Mean   :117984  `
` 3rd Qu.:132161  `
` Max.   :179000  `
```

### 27-2 b. Scatterplots

You can plot those data with

```plot(StateData\$Median.Housing.Value, StateData\$Governor.Salary)
```

### 27-2 d. Correlation Coefficient

We are fortunate that these data do not include any `NA` values. Hence, we can compute the correlation coeficient directly.

```cor(StateData\$Median.Housing.Value, StateData\$Governor.Salary)
```

### 27-2 f. Changing Salaries

It's been a long time since we've updated a frame. We can use `fix` to bring up the data editor on that frame.

```fix(StateData)
```

After changing the appropriate cell, click somewhere else in the grid before closing the edit window. Your changes are not saved until you close the window. (You also cannot continue working in R until you close the window.)

At least in the Unix version of R, it seems that you change a cell by deleting the old contents and then enter the new contents. The new salary should be 194780.

You may want to check to make sure that the salary was, indeed, updated.

```StateData[11,]
```

## R notes for Activity 27-3: Televisions and Life Expectancy

See if you can figure this one out on your own. The data are stored in `TVlife06.csv`. You'll ned to figure out the columns before trying to compute the correlation coefficient.

If you can't figure it out, a solution is available at the end of the topic page

## R notes for Activity 27-5: House Prices

Parts a-c are all simple calculations that you should be able to do on your own.

Part d is there primarily to confirm your answers above, so let's just look at the code. (You had to figure out code for 27-3, so I'm not sure that there's an advantage to requiring you to do so again.)

```HousePrices = read.csv("/home/rebelsky/Stats115/Data/HousePricesAG.csv")
cor(HousePrices\$Price, HousePrices\$Size)
```

## R notes for Activity 27-6: Exam Score Improvements

Okay, more data. Let's load it in.

```ExamScores = read.csv("/home/rebelsky/Stats115/Data/ExamScores.csv")
```

They wouldn't give us more `NA` values, would they?

````>` `summary(ExamScores)`
`    Student       Exam1           Exam2      `
` S01    : 1   Min.   :72.00   Min.   :50.00  `
` S02    : 1   1st Qu.:82.25   1st Qu.:68.50  `
` S03    : 1   Median :88.50   Median :80.00  `
` S04    : 1   Mean   :87.38   Mean   :78.65  `
` S05    : 1   3rd Qu.:95.00   3rd Qu.:91.00  `
` S06    : 1   Max.   :98.00   Max.   :97.00  `
` (Other):18                   NA's   : 1.00  `
```

Yes, I guess they would. Let's get rid of it.

```ExamScores = na.omit(ExamScores)
```

### 27-6 a. Scatterplots

```plot(ExamScores\$Exam1, ExamScores\$Exam2)
```

### 27-6 b. Yet another correlation cofficient

You should be able to compute this one on your own.

### 27-6 d. Changing exam 1 grades.

You can change the exam 1 grades en masse with

```ExamScores\$Exam1 = ExamScores\$Exam1 - 10
```

### 27-6 f. Changing exam 2 grades

If you look at 27-6 d., you should be able to figure out how to double the grades on exam 2.

### 27-6 h. Making up data

```exam1 = c(...)
exam2 = exam1 + 10
plot(exam1, exam2)
cor(exam1,exam2)
```

### 27-6 j. Making up data

We can continue to use the exam 1 grades from part h. We just need to change the way we compute the exam 2 grade.

```exam2 = exam1 * 2
plot(exam1, exam2)
cor(exam1,exam2)
```

## R notes for Activity 27-7: Draft Lottery

Have we overwhelmed you with data yet? I hope not. Anyway, let's load the data and check for potential problems. I've separated the data into two files.

```Draft1970 = read.csv("/home/rebelsky/Stats115/Data/Draft1970.csv")
```

Let's check the column (variable) names for the data set.

````>` `names(Draft1970)`
`[1] "Month"        "Day.of.Month" "Day.of.Year"  "Draft.Number"`
`>` `names(Draft1971)`
`[1] "Month"        "Day.of.Month" "Day.of.Year"  "Draft.Number"`
```

Some notes on the data: The 1970 data have a draft number for February 29, since they were drafting people born in a number of different years. The 1971 data do not have a draft number for February 29 because they were only for men born in 1951, which did not have a February 29. In case you care, the lottery the authors are describing as conducted in 1970 was actually held on December 1, 1969 to determine the order for those to be drafted into service in 1970, and the “Data for 1971” are for a lottery held in 1970 to determine the order for those to be drafted into service in 1971. The authors note that http://www.landscaper.net/draft.htm is the source of their data, so I trust the dates on that page for these corrections.

### 27-7 a. Your draft number

You will find it easiest to determine your draft number by opening the data set in the editor.

```edit(Draft1970)
```

## Additional R notes for Activity 27-3: Televisions and Life Expectancy

We begin by reading in the file.

````>` `TVlife = read.csv("/home/rebelsky/Stats115/Data/TVlife06.csv")`
```

We determine the column names (and, along the way, learn a bit more about each variable).

````>` `summary(TVlife)`
`      Country   Life.Expectancy   TVs.per.K    `
` Angola   : 1   Min.   :38.45   Min.   :  5.0  `
` Australia: 1   1st Qu.:59.70   1st Qu.: 87.0  `
` Cambodia : 1   Median :70.72   Median :177.0  `
` Canada   : 1   Mean   :67.02   Mean   :298.4  `
` China    : 1   3rd Qu.:77.16   3rd Qu.:570.2  `
` Egypt    : 1   Max.   :81.25   Max.   :844.0  `
` (Other)  :16                                  `
```

Remember that the X variable goes first.

````>` `cor(TVlife\$TVs.per.K, TVlife\$Life.Expectancy)`
`[1] 0.7432492`
```

## R Notes for Topic-27 Homework Activities

### Activity 27-12: Monthly Temperatures

You will need to create a data frame (or two vectors) in order to do the analysis. Here's a start.

```RNCtemps = data.frame(Month=1:12, AvgTemp = c(39, 42, ...))
```

### Activity 27-18: Solitaire

You can load the data from the file `Solitaire.csv`.

One nice thing about R's `cor` is that, when given a data frame as a parameter, it computes all the pairs of correlation coefficients.

```cor(Solitaire)
```

### Activity 27-19: Climatic Conditions

You can load the data from the file `Climate.csv`.

We can once again take advantage of R's willingness to compute simultaneously lots of correlation coefficients. However, this time we need to throw away one column (the city names). So, we need to coefficients for columns 2-9.

```  Climate2 = Climate[,2:9]
```

You should be able to figure out the rest.

Samuel A. Rebelsky, rebelsky@grinnell.edu

Copyright (c) 2007-8 Samuel A. Rebelsky.

This work is licensed under a Creative Commons Attribution-NonCommercial 2.5 License. To view a copy of this license, visit `http://creativecommons.org/licenses/by-nc/2.5/` or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.