Introduction to Statistics (MAT/SST 115.03 2008S)
R provides the lm function to compute the least-squares regression
line. (The “lm” stands for “linear model”.) You need to provide it
with a paired set of vectors, which you create with the ~ (tilde)
operator.
lm(response ~ explanatory)
For example, if we had a data frame called People, with one column
called FootLength and another called Height, we might compute the
coefficients as follows.
(We get somewhat different values than given in Activity 28-1 because
we're not working with exactly the same data set.)
lm(formula = People$Height ~ People$FootLength)
That's a lot of text, and not in a particularly usable format.
Fortunately, we can use the coef function to grab the values from
the result.
We can even grab and name the two coefficients.
ab = coef(lm(People$Height~People$FootLength))
a = ab[1]
b = ab[2]
We can then use those values in predictions, such as predicting the height (in inches) of someone with a foot size of 28 centimeters. (Your guess is as good as mine as to why they switch units.)
a + 28*b
Yeah, that “(Intercept)” is annoying. Ignore it for now.
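If you'd like to try these steps without the course data, here is a minimal sketch. The data frame and its values are invented for illustration (they are not the Activity 28-1 data), and using unname to drop the “(Intercept)” label is just one option, not necessarily the approach the course intends.

```r
# Hypothetical data: foot lengths (cm) and heights (in). Values are invented.
People = data.frame(
  FootLength = c(23, 25, 26, 28, 30),
  Height     = c(62, 66, 67, 71, 74)
)

# Fit the least-squares line and pull out the two coefficients.
ab = coef(lm(People$Height ~ People$FootLength))
a = ab[1]   # intercept
b = ab[2]   # slope

# The prediction inherits the name "(Intercept)" from a.
labelled = a + 28 * b

# unname() strips the labels and leaves just the number.
plain = unname(a) + 28 * unname(b)

print(labelled)
print(plain)
```

Both values are the same prediction; only the label differs.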
We can even plot the line, using abline.
In this case, we want to put it on a scatterplot of height vs. foot
length.
plot(People$Height ~ People$FootLength, main="Height (in Inches) vs. Foot Length (in cm)")
abline(a,b)
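As a side note, abline will also accept the fitted model itself, which saves extracting the coefficients first. The sketch below uses an invented People data frame (not the course data) and draws the line both ways so you can see that they coincide.

```r
# Hypothetical data: values are invented for illustration.
People = data.frame(
  FootLength = c(23, 25, 26, 28, 30),
  Height     = c(62, 66, 67, 71, 74)
)

# Scatterplot of height vs. foot length.
plot(People$Height ~ People$FootLength)

# abline() accepts a fitted linear model and draws its line.
model = lm(People$Height ~ People$FootLength)
abline(model)

# Equivalent: pass the intercept and slope explicitly (dashed, drawn on top).
ab = coef(model)
abline(a = ab[1], b = ab[2], lty = 2)
```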
You can load the data with
HousePrices = read.csv("/home/rebelsky/Stats115/Data/HousePricesAG.csv")
The columns include Price and Size.
You can plot house price vs. size (without the regression line) with
plot(HousePrices$Price ~ HousePrices$Size, ylab = "House Price (in $)", xlab = "House Size (in sq. ft.)")
For this problem, you should simply use R as a calculator, entering the values in the formulae.
Since this is your first time using lm, we'll go through all of the
steps. First, we just ask R for the summary.
lm(HousePrices$Price ~ HousePrices$Size)
That summary should be enough to confirm your answer. However, you may find it helpful to have the intercept (a) and slope (b) in variables, so we'll do that, too.
ab = coef(lm(HousePrices$Price ~ HousePrices$Size))
a = ab[1]
b = ab[2]
Now we see why it was useful to put the intercept and slope in variables.
a + b*1242
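R can also do this arithmetic for you with predict. The sketch below is an illustration under assumptions: the house data are invented (not HousePricesAG.csv), and the model is fit with the data= argument so that the formula uses bare column names, which predict needs in order to match the new value to the right variable.

```r
# Hypothetical data: values are invented, not the course's house prices.
HousePrices = data.frame(
  Size  = c(900, 1100, 1300, 1500, 1800),
  Price = c(250000, 310000, 345000, 400000, 470000)
)

# With data=, the formula refers to columns by name.
model = lm(Price ~ Size, data = HousePrices)

# By hand, as in the text.
ab = coef(model)
byHand = unname(ab[1] + ab[2] * 1242)

# With predict(): newdata must have a column named Size.
viaPredict = unname(predict(model, newdata = data.frame(Size = 1242)))

print(byHand)
print(viaPredict)
```

The two results agree; predict is mostly a convenience when you have many new values to plug in.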
In case you missed it, the description of the proportion of variability explained by the least-squares line is given in the text on the top of p. 579.
Let's start by gathering the data, building the scatterplot, computing the parameters of the least-squares line, and plotting that line. Since we're using the plot to explore data, and not for presentations, we won't worry about labels.
TrotSpeeds = read.csv("/home/rebelsky/Stats115/Data/TrotSpeeds.csv")
plot(TrotSpeeds$Trot.Speed ~ TrotSpeeds$Body.Mass)
ab = coef(lm(TrotSpeeds$Trot.Speed ~ TrotSpeeds$Body.Mass))
a = ab[1]
b = ab[2]
abline(a,b)
We'll also compute the r^2 value.
r = cor(TrotSpeeds$Trot.Speed, TrotSpeeds$Body.Mass)
r^2
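To connect r^2 with the “proportion of variability explained” idea, here is a small sketch that computes it three ways and checks that they agree. The x and y values are invented stand-ins for body mass and trot speed, not the course data.

```r
# Hypothetical data: values are invented for illustration.
x = c(5, 20, 60, 150, 400, 900)
y = c(1.1, 1.6, 2.0, 2.5, 2.9, 3.0)

model = lm(y ~ x)

# Route 1: square the correlation coefficient.
r2FromCor = cor(y, x)^2

# Route 2: ask the model summary.
r2FromSummary = summary(model)$r.squared

# Route 3: one minus (leftover variation / total variation).
sse = sum(resid(model)^2)      # variation remaining after fitting the line
sst = sum((y - mean(y))^2)     # total variation in y
r2FromSums = 1 - sse / sst

print(c(r2FromCor, r2FromSummary, r2FromSums))
```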
Okay, the first thing we have to do is compute the residuals. So, we need to predict the values and subtract those predicted values from the observed values.
predicted = a + b*TrotSpeeds$Body.Mass
residuals = TrotSpeeds$Trot.Speed - predicted
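If you want to check residuals computed by hand, R's fitted and resid functions return the same quantities directly from the model. A sketch with invented values (not the course data):

```r
# Hypothetical data: values are invented for illustration.
x = c(5, 20, 60, 150, 400, 900)
y = c(1.1, 1.6, 2.0, 2.5, 2.9, 3.0)

model = lm(y ~ x)
ab = coef(model)

# By hand: predicted = a + b*x, residual = observed - predicted.
predicted = unname(ab[1]) + unname(ab[2]) * x
residualsByHand = y - predicted

# Built in: fitted() gives the predictions, resid() the residuals.
residualsBuiltIn = unname(resid(model))

print(residualsByHand)
print(residualsBuiltIn)
```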
Now, we're ready to plot. You should be able to figure out the plot command yourself. Remember, the form is plot(response ~ explanatory).
You may find it useful to add a horizontal line for the residual of 0.
We'll start by computing the logs.
log10BodyMass = log10(TrotSpeeds$Body.Mass)
You can create the plot with
plot(TrotSpeeds$Trot.Speed ~ log10BodyMass)
The R is fairly straightforward.
lm(TrotSpeeds$Trot.Speed ~ log10BodyMass)
ab = coef(lm(TrotSpeeds$Trot.Speed ~ log10BodyMass))
a = ab[1]
b = ab[2]
abline(a,b)
The value of r^2 is computed by
r = cor(TrotSpeeds$Trot.Speed, log10BodyMass)
r^2
This plot is a bit subtle, since the residuals are computed from the log (base 10) of the body mass, but the X axis should still be the original body mass.
predicted = a + b * log10BodyMass
residuals = TrotSpeeds$Trot.Speed - predicted
plot(residuals ~ TrotSpeeds$Body.Mass)
abline(h=0)
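As an alternative to building log10BodyMass first, the transformation can go inside the formula. The sketch below uses an invented TrotSpeeds data frame (same column names as in the commands above, made-up values) and confirms that both routes give the same coefficients before drawing the residual plot against the original body mass.

```r
# Hypothetical data: column names follow the text, values are invented.
TrotSpeeds = data.frame(
  Body.Mass  = c(5, 20, 60, 150, 400, 900),
  Trot.Speed = c(1.1, 1.6, 2.0, 2.5, 2.9, 3.0)
)

# Route 1: transform inside the formula.
model = lm(Trot.Speed ~ log10(Body.Mass), data = TrotSpeeds)

# Route 2: precompute the logs, as in the text.
log10BodyMass = log10(TrotSpeeds$Body.Mass)
ab = coef(lm(TrotSpeeds$Trot.Speed ~ log10BodyMass))

# Residuals come from the log fit, but the X axis is the original body mass.
plot(resid(model) ~ TrotSpeeds$Body.Mass)
abline(h = 0)
```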
We'll load the data using our standard strategy.
TBP = read.csv("/home/rebelsky/Stats115/Data/TextbookPrices.csv")
R is happy to make you a grid of scatterplots, using each pair of explanatory/response variables.
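One way to get such a grid is the pairs function; calling plot on a data frame of numeric columns produces the same kind of display. This is a sketch with an invented TBP data frame: the column names Price, Pages, and Year are taken from the commands in this section, but the values are made up, and the real file may have other columns.

```r
# Hypothetical data: values are invented, not TextbookPrices.csv.
TBP = data.frame(
  Price = c(30, 55, 80, 120, 95),
  Pages = c(150, 300, 450, 800, 600),
  Year  = c(1998, 2001, 2003, 2006, 2005)
)

# A grid with one scatterplot for every pair of columns.
pairs(TBP)

# Plotting the whole data frame draws the same kind of grid.
plot(TBP)
```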
If you'd rather do the individual scatterplots, we can write
X11()
plot(TBP$Price ~ TBP$Pages)
X11()
plot(TBP$Price ~ TBP$Year)
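X11() opens a new window only on systems that have X11 available. If it isn't available where you work, one alternative is to put both plots side by side in a single window with par(mfrow = ...). A sketch, again with invented values for TBP:

```r
# Hypothetical data: values are invented, not TextbookPrices.csv.
TBP = data.frame(
  Price = c(30, 55, 80, 120, 95),
  Pages = c(150, 300, 450, 800, 600),
  Year  = c(1998, 2001, 2003, 2006, 2005)
)

# Split the plotting area into 1 row and 2 columns, saving the old settings.
old = par(mfrow = c(1, 2))

plot(TBP$Price ~ TBP$Pages)
plot(TBP$Price ~ TBP$Year)

# Restore the previous layout so later plots fill the whole window.
par(old)
```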
Since this is a self-check exercise, you should figure out how to do this and the remaining problems using the prior answers.
Copyright (c) 2007-8 Samuel A. Rebelsky.
This work is licensed under a Creative Commons
Attribution-NonCommercial 2.5 License. To view a copy of this
license, send a letter to Creative Commons, 543 Howard Street, 5th Floor,
San Francisco, California, 94105, USA.