Introduction to Statistics (MAT/SST 115.03 2008S)

R Notes for Topic 28: Least Squares Regression

Least Squares Regression

R provides the lm function to compute the least-squares regression line. (The “lm” stands for “linear model”.) You give it a formula, built with the ~ operator, that pairs the response vector with the explanatory vector.

lm(response ~ explanatory)

For example, if we had a data frame called People with one column called FootLength and another called Height, we might compute the coefficients as follows. (We get somewhat different values than given in Activity 28-1 because we're not working with exactly the same data set.)

> lm(People$Height~People$FootLength)

Call:
lm(formula = People$Height ~ People$FootLength)

Coefficients:
      (Intercept)  People$FootLength  
           38.668              1.022  

That's a lot of text, and not in a particularly usable format. Fortunately, we can use the coef function to grab the values from the result.

> coef(lm(People$Height~People$FootLength))
      (Intercept) People$FootLength 
        38.668071          1.022173 
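
If you'd like to experiment without the class data, here is a minimal sketch using a made-up data frame. (The numbers are invented for illustration; they are not the Activity 28-1 data.) It also uses lm's data argument, which lets you write bare column names in the formula and gives shorter labels on the coefficients.

```r
# Made-up data for illustration only; not the class data set.
People = data.frame(
  FootLength = c(24, 25, 26, 27, 28, 29, 30),
  Height     = c(63, 64, 66, 66, 68, 69, 70)
)

# The data argument tells lm where to look up Height and FootLength.
fit = lm(Height ~ FootLength, data = People)
coef(fit)

# The People$... form gives the same numbers, just with longer labels.
coef(lm(People$Height ~ People$FootLength))
```

Either form is fine for this class; the data form just saves some typing.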

We can even grab and name the two coefficients.

> ab = coef(lm(People$Height~People$FootLength))
> a = ab[1]
> b = ab[2]
> a
> b

We can then use those values in predictions, such as predicting the height (in inches) of someone with a foot length of 28 centimeters. (Your guess is as good as mine as to why they switch units.)

> a + 28*b

Yeah, the “(Intercept)” label that R attaches to the result is annoying. Ignore it for now.
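
If the label really bothers you, one option is to strip the names with unname. The sketch below uses made-up data (not the class data set) and also shows predict, which does the same arithmetic for you. The names fit, byHand, and byPredict are just ones I picked for the example.

```r
# Made-up data for illustration only; not the class data set.
People = data.frame(
  FootLength = c(24, 25, 26, 27, 28, 29, 30),
  Height     = c(63, 64, 66, 66, 68, 69, 70)
)
fit = lm(Height ~ FootLength, data = People)

# unname() drops the "(Intercept)" and "FootLength" labels.
ab = unname(coef(fit))
a = ab[1]
b = ab[2]
byHand = a + 28*b
byHand

# predict() takes the new explanatory values in a data frame whose
# column name matches the one used in the formula.
byPredict = predict(fit, newdata = data.frame(FootLength = 28))
byPredict
```

The two predictions should agree; predict just labels its answer with a row number.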

We can even plot the line, using abline(a,b). In this case, we want to put it on a scatterplot of height vs. foot length.

> plot(People$Height ~ People$FootLength, main="Height (in Inches) vs. Foot Length (in cm)")
> abline(a,b)
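
As a shortcut, abline will also accept the fitted model itself, so you don't have to pull out a and b first. Here is a sketch with made-up data (again, not the class data set).

```r
# Made-up data for illustration only; not the class data set.
People = data.frame(
  FootLength = c(24, 25, 26, 27, 28, 29, 30),
  Height     = c(63, 64, 66, 66, 68, 69, 70)
)
fit = lm(Height ~ FootLength, data = People)

plot(Height ~ FootLength, data = People,
     main = "Height (in Inches) vs. Foot Length (in cm)")
# abline reads the intercept and slope from the model's coefficients.
abline(fit)
```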

R notes for Activity 28-2: House Prices

You can load the data with

HousePrices = read.csv("/home/rebelsky/Stats115/Data/HousePricesAG.csv")

The columns are Address, Price, Bedrooms, Bathrooms, and Size.
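
If you want to check the columns of a data frame yourself, names, head, and str are handy. The sketch below writes a tiny made-up CSV file (two invented houses, using the column names listed above) to a temporary file so that it runs anywhere; Houses and path are names I made up for the example.

```r
# Write a tiny made-up CSV to a temporary file so the sketch is self-contained.
path = tempfile(fileext = ".csv")
writeLines(c(
  "Address,Price,Bedrooms,Bathrooms,Size",
  "1 Example St,250000,3,2,1100",
  "2 Example St,310000,4,2,1400"
), path)

Houses = read.csv(path)
names(Houses)   # the column names
head(Houses)    # the first few rows
str(Houses)     # the type of each column
```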

You can plot house price vs. size (without the regression line) with

plot(HousePrices$Price ~ HousePrices$Size, 
  ylab = "House Price (in $)",
  xlab = "House Size (in sq. ft.)")

28-2 b. Computing coefficients

For this problem, you should simply use R as a calculator, entering the values in the formulae.
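
Here is a sketch of that calculation, assuming the formulae in your text are the usual ones for the least-squares line: b = r * (s_y / s_x) and a = ybar - b * xbar. The sizes and prices below are made up for illustration; in the activity you would type in the summary values you are given.

```r
# Made-up sizes (sq. ft.) and prices ($) for illustration only.
size  = c(900, 1100, 1250, 1400, 1600, 1850)
price = c(210000, 250000, 265000, 310000, 330000, 395000)

r = cor(price, size)
b = r * sd(price) / sd(size)       # slope
a = mean(price) - b * mean(size)   # intercept
c(a, b)

# lm should agree with the hand computation.
coef(lm(price ~ size))
```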

28-2 c. Checking with technology

Since this is your first time using lm, we'll go through all of the steps. First, we just ask R for the fitted model.

lm(HousePrices$Price ~ HousePrices$Size)

That output should be enough to confirm your answer. However, you may find it helpful to have the intercept (a) and slope (b) in variables, so we'll do that, too.

ab = coef(lm(HousePrices$Price ~ HousePrices$Size))
a = ab[1]
b = ab[2]

28-2 d. Predicting prices

Now we see why it was useful to put a and b in variables.

a + b*1242

28-2 l. Explaining variability with least squares lines

In case you missed it, the description of the proportion of variability explained by the least squares line is given in the text on the top of p. 579.
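
I can't reproduce the text's wording here, but assuming it describes the usual idea (compare the variability left over around the line to the total variability in the response), the following sketch shows three ways to get the same number. The data are made up for illustration.

```r
# Made-up sizes (sq. ft.) and prices ($) for illustration only.
size  = c(900, 1100, 1250, 1400, 1600, 1850)
price = c(210000, 250000, 265000, 310000, 330000, 395000)
fit = lm(price ~ size)

sse = sum((price - fitted(fit))^2)   # variability left over around the line
sst = sum((price - mean(price))^2)   # total variability in the prices

byHand    = 1 - sse/sst
byCor     = cor(price, size)^2
bySummary = summary(fit)$r.squared
c(byHand, byCor, bySummary)
```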

R notes for Activity 28-3: Animal Trotting Speeds

Let's start by gathering the data, building the scatterplot, computing the parameters of the least-squares line, and plotting that line. Since we're using the plot to explore data, and not for presentations, we won't worry about labels.

TrotSpeeds = read.csv("/home/rebelsky/Stats115/Data/TrotSpeeds.csv")
plot(TrotSpeeds$Trot.Speed ~ TrotSpeeds$Body.Mass)
ab = coef(lm(TrotSpeeds$Trot.Speed ~ TrotSpeeds$Body.Mass))
a = ab[1]
b = ab[2]
abline(a,b)

We'll also compute the r² value.

r = cor(TrotSpeeds$Trot.Speed, TrotSpeeds$Body.Mass)
r^2

28-3 d. A residual plot

Okay, the first thing we have to do is compute the residuals. So, we need to predict the values and subtract those predicted values from the observed values.

predicted = a + b*TrotSpeeds$Body.Mass
residuals = TrotSpeeds$Trot.Speed - predicted
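
If you'd like to check your arithmetic, R will also hand you the residuals directly through resid. The sketch below uses made-up numbers (not the TrotSpeeds data), with column names chosen to match the ones above.

```r
# Made-up body masses and trot speeds for illustration only.
Trot = data.frame(
  Body.Mass  = c(5, 20, 60, 150, 400, 900),
  Trot.Speed = c(1.2, 1.9, 2.6, 3.0, 3.4, 3.6)
)
fit = lm(Trot.Speed ~ Body.Mass, data = Trot)
ab = coef(fit)

# Observed minus predicted, as in the notes.
byHand = Trot$Trot.Speed - (ab[1] + ab[2]*Trot$Body.Mass)

# The same values, straight from the fitted model.
builtIn = resid(fit)

cbind(byHand, builtIn)
```

For a least-squares line the residuals should also sum to (essentially) zero, which makes a quick sanity check.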

Now, we're ready to plot. You should be able to figure out the plot command yourself. Remember, the form is

plot(response ~ explanatory)

You may find it useful to add a horizontal line for the residual of 0.
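
One way to draw that horizontal line is abline with the h argument. Here is a sketch on made-up numbers (not the TrotSpeeds data); the dashed line style is just my preference.

```r
# Made-up body masses and trot speeds for illustration only.
Trot = data.frame(
  Body.Mass  = c(5, 20, 60, 150, 400, 900),
  Trot.Speed = c(1.2, 1.9, 2.6, 3.0, 3.4, 3.6)
)
fit = lm(Trot.Speed ~ Body.Mass, data = Trot)
res = resid(fit)

plot(res ~ Trot$Body.Mass)
abline(h = 0, lty = 2)   # dashed horizontal line at a residual of 0
```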


28-3 e. A logarithmic transformation

We'll start by computing the logs.

log10BodyMass = log10(TrotSpeeds$Body.Mass)

You can create the plot with

plot(TrotSpeeds$Trot.Speed ~ log10BodyMass)

28-3 f. New least-squares line

The R code is fairly straightforward.

lm(TrotSpeeds$Trot.Speed ~ log10BodyMass)
ab = coef(lm(TrotSpeeds$Trot.Speed ~ log10BodyMass))
a = ab[1]
b = ab[2]
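
As an alternative, you can apply the transformation right in the formula rather than building a separate vector first. The sketch below, on made-up numbers (not the TrotSpeeds data), shows that both approaches give the same coefficients.

```r
# Made-up body masses and trot speeds for illustration only.
Trot = data.frame(
  Body.Mass  = c(5, 20, 60, 150, 400, 900),
  Trot.Speed = c(1.2, 1.9, 2.6, 3.0, 3.4, 3.6)
)

# Approach from the notes: compute the logs first.
log10BodyMass = log10(Trot$Body.Mass)
viaVector = coef(lm(Trot$Trot.Speed ~ log10BodyMass))

# Alternative: take the log inside the formula.
viaFormula = coef(lm(Trot.Speed ~ log10(Body.Mass), data = Trot))

viaVector
viaFormula
```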

The value of r² is computed by

r = cor(TrotSpeeds$Trot.Speed, log10BodyMass)
r^2

28-3 g. Another residual plot

This plot is a bit subtle, since the residuals are computed from the log (base 10) of the body mass, but the X axis should still be the original body mass.

predicted = a + b * log10BodyMass
residuals = TrotSpeeds$Trot.Speed - predicted
plot(residuals ~ TrotSpeeds$Body.Mass)

R notes for Activity 28-4: Textbook Prices

We'll load the data using our standard strategy.

TBP = read.csv("/home/rebelsky/Stats115/Data/TextbookPrices.csv")

28-4 b. Scatterplots

R is happy to make you a grid of scatterplots, using each pair of explanatory/response variable.
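
One way to get such a grid is the pairs function. The sketch below is an assumption on my part rather than the command from class, and it uses a made-up data frame with the Price, Pages, and Year columns (not the real TextbookPrices data). Selecting just the numeric columns keeps the grid readable.

```r
# Made-up textbook data for illustration only.
TBP = data.frame(
  Price = c(35.00, 89.95, 120.50, 15.99, 64.00, 102.25),
  Pages = c(220, 540, 810, 150, 400, 690),
  Year  = c(2001, 2005, 2006, 1998, 2003, 2007)
)

# One scatterplot for each pair of the selected columns.
numericCols = TBP[, c("Price", "Pages", "Year")]
pairs(numericCols)
```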


If you'd rather do the individual scatterplots, we can write

plot(TBP$Price ~ TBP$Pages)
plot(TBP$Price ~ TBP$Year)

28-4 d. Least-squares line

Since this is a self-check exercise, you should figure out how to do this and the remaining problems using the prior answers.

Creative Commons License

Samuel A. Rebelsky

Copyright (c) 2007-8 Samuel A. Rebelsky.

This work is licensed under a Creative Commons Attribution-NonCommercial 2.5 License. To view a copy of this license, visit or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.