NYC Marathon --- simple linear regression

New York City Marathon data analysis

This example comes from the article "The effects of temperature on marathon running' performance," by David E. Martin and John F. Buoncristianni, Chance Magazine, Vol. 12, No. 4, 1999, pages 20-24. The data are in marathon.df and come from New York City Marathon results from 1978 through 1998. Winning times, in minutes, are given for Men and Women, along with Temp=the high temperature on marathon day in Fahrenheit degrees. For this exercise, we will use regression to predict winning time for women by Temp.

First, obtain a scatterplot:
attach(marathon.df)
plot(Temp,Women)

Since the relationship is linear, we add a line to the data:
abline(lm(Women ~ Temp))

For the R code below, the lm(Women ~ Temp) command is discussed in the R manual sections 11.2 and 11.3; pages 55 and 56; it fits a simple linear regression model to the data. Since we want to discuss and use this model further we need to assign it to a variable. In particuler, residuals and fitted values (among other quantities) are automatically saved into the lm1 object.

Here is the R code:

fm1 <- lm(Women ~ Temp) # fm1 is just an arbitrary name.

cbind(Year,Temp,Women,fm1$fit,fm1$resid)

   Year Temp   Women
1  1978   75 152.500 149.1805  3.3195126
2  1979   80 147.550 149.7759 -2.2259105
3  1980   50 145.700 146.2034 -0.5033723
4  1981   54 145.483 146.6797 -1.1967107
5  1982   52 147.233 146.4415  0.7914585
6  1983   59 147.000 147.2751 -0.2751337
7  1984   79 149.500 149.6568 -0.1568259
8  1985   72 148.567 148.8232 -0.2562336
9  1986   65 148.100 147.9896  0.1103586
10 1987   64 150.283 147.8706  2.4124432
11 1988   67 148.117 148.2278 -0.1108106
12 1989   56 145.500 146.9179 -1.4178799
13 1990   73 150.750 148.9423  1.8076818
14 1991   57 147.533 147.0370  0.4960355
15 1992   51 144.667 146.3225 -1.6554569
16 1993   73 146.400 148.9423 -2.5423182
17 1994   70 147.617 148.5851 -0.9680644
18 1995   62 148.100 147.6324  0.4676124
19 1996   49 148.300 146.0843  2.2157123
20 1997   61 148.717 147.5133  1.2036971
21 1998   55 145.283 146.7988 -1.5157953



What should the following R code produce?

sum(fm1$resid*Temp)
sum(fm1$resid*rep(1,21))

The object fm1 is packed with lots of information from the regression calculations. R provides some generic "extracting" functions that will let us get at these pieces of information, which are actually saved by R in fm1 as a list. You can read about these functions in section 11.3.

For example,
fitted(fm1) # prints out the fitted (or predicted) values.

residuals(fm1) # prints out the residuals. We can also refer to these quantities using the fm1$resid and fm1$fit

A simple, approximate prediction interval

The residuals estimate the errors between predicted and actual y-values, since a residual is defined by:

residual = actual value - fitted value

One should check that the residuals are roughly normal:

hist(residuals(fm1))
qqnorm(residuals(fm1))
qqline(residuals(fm1)) # remember that a normal quantile plot will look straight for normal data.

Now, one can use the st.dev(residuals) to obtain an estimate of precision in predictions made with the regression line:
st.dev(residuals)

Plus or minus 2 of these should provide a rough 95% margin of error.

Accurate prediction intervals and confidence intervals for a conditional mean

To predict at a new value:


predict(fm1,newdata=data.frame(Temp=50),interval="prediction")
To get a confidence interval for E(Y|x):
predict(fm1,newdata=data.frame(Temp=50),interval="confidence")