New York City Marathon data analysis
This example comes from the article "The effects of temperature on marathon running' performance," by David E. Martin and John F. Buoncristianni, Chance Magazine, Vol. 12, No. 4, 1999, pages 20-24. The data are in marathon.df and come from New York City Marathon results from 1978 through 1998. Winning times, in minutes, are given for Men and Women, along with Temp=the high temperature on marathon day in Fahrenheit degrees. For this exercise, we will use regression to predict winning time for women by Temp.
First, obtain a scatterplot:
attach(marathon.df)
plot(Temp,Women)
Since the relationship is linear, we add a line to the data:
abline(lm(Women ~ Temp))
For the R code below, the lm(Women ~ Temp) command is discussed in the R manual sections 11.2 and 11.3; pages 55 and 56; it fits a simple linear regression model to the data. Since we want to discuss and use this model further we need to assign it to a variable. In particuler, residuals and fitted values (among other quantities) are automatically saved into the lm1 object.
Here is the R code:
fm1 <- lm(Women ~ Temp) # fm1 is just an arbitrary name. cbind(Year,Temp,Women,fm1$fit,fm1$resid) Year Temp Women 1 1978 75 152.500 149.1805 3.3195126 2 1979 80 147.550 149.7759 -2.2259105 3 1980 50 145.700 146.2034 -0.5033723 4 1981 54 145.483 146.6797 -1.1967107 5 1982 52 147.233 146.4415 0.7914585 6 1983 59 147.000 147.2751 -0.2751337 7 1984 79 149.500 149.6568 -0.1568259 8 1985 72 148.567 148.8232 -0.2562336 9 1986 65 148.100 147.9896 0.1103586 10 1987 64 150.283 147.8706 2.4124432 11 1988 67 148.117 148.2278 -0.1108106 12 1989 56 145.500 146.9179 -1.4178799 13 1990 73 150.750 148.9423 1.8076818 14 1991 57 147.533 147.0370 0.4960355 15 1992 51 144.667 146.3225 -1.6554569 16 1993 73 146.400 148.9423 -2.5423182 17 1994 70 147.617 148.5851 -0.9680644 18 1995 62 148.100 147.6324 0.4676124 19 1996 49 148.300 146.0843 2.2157123 20 1997 61 148.717 147.5133 1.2036971 21 1998 55 145.283 146.7988 -1.5157953
What should the following R code produce?
sum(fm1$resid*Temp) sum(fm1$resid*rep(1,21))
The object fm1 is packed with lots of information from the regression calculations. R provides some generic "extracting" functions that will let us get at these pieces of information, which are actually saved by R in fm1 as a list. You can read about these functions in section 11.3.
For example,
fitted(fm1) # prints out the fitted (or predicted) values.
residuals(fm1) # prints out the residuals. We can also refer to these quantities using the fm1$resid and fm1$fit
A simple, approximate prediction interval
The residuals estimate the errors between predicted and actual y-values, since a residual is defined by:
residual = actual value - fitted value
One should check that the residuals are roughly normal:
hist(residuals(fm1))
qqnorm(residuals(fm1))
qqline(residuals(fm1)) # remember that a normal quantile plot will look straight
for normal data.
Now, one can use the st.dev(residuals) to obtain an estimate of precision
in predictions made with the regression line:
st.dev(residuals)
Plus or minus 2 of these should provide a rough 95% margin of error.
Accurate prediction intervals and confidence intervals for a conditional mean
To predict at a new value:
predict(fm1,newdata=data.frame(Temp=50),interval="prediction")To get a confidence interval for E(Y|x):
predict(fm1,newdata=data.frame(Temp=50),interval="confidence")