NYC Marathon --- confidence and prediction intervals

New York City Marathon data: confidence and prediction intervals

This example comes from the article "The effects of temperature on marathon running' performance," by David E. Martin and John F. Buoncristianni, Chance Magazine, Vol. 12, No. 4, 1999, pages 20-24. The data are in marathon.df and come from New York City Marathon results from 1978 through 1998. Winning times, in minutes, are given for Men and Women, along with Temp=the high temperature on marathon day in Fahrenheit degrees. For this exercise, we will use regression to predict winning time for women by Temp.

Enter the following data into the data frame marathon.df .

 Year Temp     Men   Women
 1978   75 132.200 152.500
 1979   80 131.700 147.550
 1980   50 129.683 145.700
 1981   54 128.217 145.483
 1982   52 129.483 147.233
 1983   59 128.983 147.000
 1984   79 134.883 149.500
 1985   72 131.567 148.567
 1986   65 131.100 148.100
 1987   64 131.017 150.283
 1988   67 128.333 148.117
 1989   56 128.017 145.500
 1990   73 132.650 150.750
 1991   57 129.467 147.533
 1992   51 129.483 144.667
 1993   73 130.067 146.400
 1994   70 131.350 147.617
 1995   62 131.000 148.100
 1996   49 129.900 148.300
 1997   61 128.200 148.717
 1998   55 128.750 145.283

First, obtain a scatterplot:
attach(marathon.df)
plot(Temp,Women)

Since the relationship is linear, we add a line to the data:
abline(lm(Women ~ Temp))

For the R code below, the lm(Women ~ Temp) command is discussed in the R manual sections 11.2 and 11.3; pages 55 and 56; it fits a simple linear regression model to the data. Since we want to discuss and use this model further we need to assign it to a variable. In particuler, residuals and fitted values (among other quantities) are automatically saved into the lm1 object.

Here is the R code:

fm1 <- lm(Women ~ Temp) # fm1 is just an arbitrary name.

cbind(Year,Temp,Women,fm1$fit,fm1$resid)

   Year Temp   Women
1  1978   75 152.500 149.1805  3.3195126
2  1979   80 147.550 149.7759 -2.2259105
3  1980   50 145.700 146.2034 -0.5033723
4  1981   54 145.483 146.6797 -1.1967107
5  1982   52 147.233 146.4415  0.7914585
6  1983   59 147.000 147.2751 -0.2751337
7  1984   79 149.500 149.6568 -0.1568259
8  1985   72 148.567 148.8232 -0.2562336
9  1986   65 148.100 147.9896  0.1103586
10 1987   64 150.283 147.8706  2.4124432
11 1988   67 148.117 148.2278 -0.1108106
12 1989   56 145.500 146.9179 -1.4178799
13 1990   73 150.750 148.9423  1.8076818
14 1991   57 147.533 147.0370  0.4960355
15 1992   51 144.667 146.3225 -1.6554569
16 1993   73 146.400 148.9423 -2.5423182
17 1994   70 147.617 148.5851 -0.9680644
18 1995   62 148.100 147.6324  0.4676124
19 1996   49 148.300 146.0843  2.2157123
20 1997   61 148.717 147.5133  1.2036971
21 1998   55 145.283 146.7988 -1.5157953



What should the following R code produce?

sum(fm1$resid*Temp)
sum(fm1$resid*rep(1,21))

The object fm1 is packed with lots of information from the regression calculations. R provides some generic "extracting" functions that will let us get at these pieces of information, which are actually saved by R in fm1 as a list. You can read about these functions in section 11.3.

For example,
fitted(fm1) # prints out the fitted (or predicted) values.

residuals(fm1) # prints out the residuals. We can also refer to these quantities using the fm1$resid and fm1$fit

A simple, approximate prediction interval

The residuals estimate the errors between predicted and actual y-values, since a residual is defined by:

residual = actual value - fitted value

One should check that the residuals are roughly normal:

hist(residuals(fm1))
qqnorm(residuals(fm1))
qqline(residuals(fm1)) # remember that a normal quantile plot will look straight for normal data.

Now, one can use the st.dev(residuals) to obtain an estimate of precision in predictions made with the regression line:
st.dev(residuals)

Plus or minus 2 of these should provide a rough 95% margin of error.

Accurate prediction intervals and confidence intervals for a conditional mean

To predict at a new value:


predict(fm1,newdata=data.frame(Temp=50),interval="prediction")
To get a confidence interval for E(Y|x):
predict(fm1,newdata=data.frame(Temp=50),interval="confidence")