When an experimenter has made several observations of the values of two quantities x and y and believes that the relation between them is described by some linear equation y = mx + b, she often applies the method of least squares to determine the choice of m and b that best fits her observations. Here's a brief explanation of this method with an example of its application:
We begin with thirty-two observations about the weight and measured fuel consumption of automobiles. Each of the observations involves a different model of automobile.
| Model | Weight (in kg) | Fuel consumption (in ml/km) |
|---|---|---|
| Buick Park Avenue | 1603 | 92 |
| Chrysler Le Baron | 1299 | 91 |
| Chrysler Le Baron GTC | 1365 | 87 |
| Mercury Grand Marquis | 1716 | 96 |
| Oldsmobile Cutlass Calais | 1357 | 95 |
| Oldsmobile Cutlass Supreme | 1521 | 84 |
| Pontiac Grand Am | 1272 | 78 |
Here's a plot of the observed weight and fuel consumption for each car:
That there is a linear relationship between an automobile's weight and its fuel consumption is at least plausible. One could draw a line on the preceding plot that would be on or near most of the data points:
The method of least squares is a way of choosing which line to draw. It calculates the slope and y-intercept (the coefficients m and b in the equation y = mx + b) of the line that minimizes the sum of the squares of the vertical distances between the data points and the line. (The vertical distances are squared so that even one large mismatch between the observed value of y and the value that would have been predicted from the equation of the line and the observed value of x is heavily penalized.)
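To make the quantity being minimized concrete, here is a minimal Python sketch (not the Scheme program the exercise asks for). The function name `sum_of_squared_distances` and the two candidate lines are hypothetical, chosen only for illustration; the data are the seven rows listed in the table above.

```python
def sum_of_squared_distances(points, m, b):
    """Sum of the squared vertical distances between each (x, y) point
    and the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)


# (weight in kg, fuel consumption in ml/km) pairs from the table above.
cars = [(1603, 92), (1299, 91), (1365, 87), (1716, 96),
        (1357, 95), (1521, 84), (1272, 78)]

# Two hypothetical candidate lines: a horizontal one and a sloped one.
flat_line = sum_of_squared_distances(cars, 0.0, 90.0)
sloped_line = sum_of_squared_distances(cars, 0.036, 37.5)
print(flat_line, sloped_line)
```

The method of least squares picks, among all possible pairs of m and b, the one for which this total is smallest.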
The derivation of the formulas for calculating m and b involves calculus, so I'm not going to present it here. The formulas themselves are relatively easy to understand. One begins by computing five values directly from the observations:
n, the number of observations that were made (32, in the example above).
xsum, the sum of the observed x values -- the weights of the automobiles, in the example: 1337 + 1603 + ... + 1217 = 43841.
ysum, the sum of the observed y values -- the fuel consumption statistics: 80 + 92 + ... + 71 = 2776.
xsqsum, the sum of the squares of the observed x values: 1337² + 1603² + ... + 1217² = 61805287.
xysum, the sum of the products of corresponding x and y values: (1337)(80) + (1603)(92) + ... + (1217)(71) = 3865857.
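The five values can be gathered in one pass over the observations. The following Python sketch is only an illustration: the function name `summary_values` and the three sample points are hypothetical and are not taken from the automobile data.

```python
def summary_values(observations):
    """Return (n, xsum, ysum, xsqsum, xysum) for a list of (x, y) pairs."""
    n = len(observations)
    xsum = sum(x for x, _ in observations)
    ysum = sum(y for _, y in observations)
    xsqsum = sum(x * x for x, _ in observations)
    xysum = sum(x * y for x, y in observations)
    return n, xsum, ysum, xsqsum, xysum


# A small made-up data set, small enough to check by hand.
sample = [(1, 2), (2, 3), (3, 5)]
print(summary_values(sample))  # prints (3, 6, 10, 14, 23)
```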
The slope m of the desired line is
    m = (n * xysum - xsum * ysum) / (n * xsqsum - xsum²)
and the y-intercept b of the desired line is
    b = (ysum - m * xsum) / n
In the example, m = (32 * 3865857 - 43841 * 2776)/(32 * 61805287 - 43841²) = 2004808/55735903, or approximately 0.036; then b = (2776 - (2004808/55735903) * 43841) / 32, which works out to be about 37.5. So the equation of the desired line is y = 0.036 x + 37.5. (This is the line shown on the second plot above.)
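The arithmetic in the example can be checked with a few lines of Python. This is a sketch, not the Scheme program the exercise asks for. The function name `least_squares_line` is hypothetical, and the guard against a zero denominator (which occurs when every observed x value is the same) is an addition that the text above does not discuss.

```python
def least_squares_line(n, xsum, ysum, xsqsum, xysum):
    """Return (m, b) for the best-fit line y = m*x + b,
    computed from the five summary values."""
    denominator = n * xsqsum - xsum ** 2
    if denominator == 0:
        # Added guard: all x values are equal, so no unique line exists.
        raise ValueError("all x values are equal; no unique line exists")
    m = (n * xysum - xsum * ysum) / denominator
    b = (ysum - m * xsum) / n
    return m, b


# The five values quoted in the text for the thirty-two automobiles.
m, b = least_squares_line(32, 43841, 2776, 61805287, 3865857)
print(f"y = {m:.3f} x + {b:.1f}")  # prints y = 0.036 x + 37.5
```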
The exercise is to write a stand-alone Scheme program that prompts the user for any positive number of observations of the values of two related quantities, then calculates and prints out the equation of the line that best fits those observations, according to the method of least squares.
In collecting the data from the user, your program should prompt the user appropriately at each step. It should recognize the symbol end as a sentinel indicating that no more observations are available. It should refuse to accept any input from the user other than a real number or the symbol end, printing a warning message and repeating the prompt if it receives such input. It should signal an error if the user supplies the symbol end before providing any observations or after supplying the first value in a pair.
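The control flow these requirements imply can be sketched in Python before it is written in Scheme. Everything named here is hypothetical: `read_value`, `read_observations`, and the `read_line` and `write` parameters are illustrative choices, and Python's `float` parsing is only a rough stand-in for Scheme's notion of a real number.

```python
import math


def read_value(read_line, prompt, write):
    """Return a float, or None if the sentinel 'end' was entered.
    Any other input produces a warning and a repeated prompt."""
    while True:
        text = read_line(prompt).strip()
        if text == "end":
            return None
        try:
            value = float(text)
        except ValueError:
            value = math.nan
        if math.isfinite(value):
            return value
        write("The input must be in the form of a numeral.")


def read_observations(read_line, write=print):
    """Collect (x, y) pairs until 'end' is entered in place of an x value."""
    observations = []
    while True:
        x = read_value(read_line, "x: ", write)
        if x is None:
            if not observations:
                raise ValueError("end supplied before any observations")
            return observations
        y = read_value(read_line, "y: ", write)
        if y is None:
            raise ValueError("end supplied after the first value of a pair")
        observations.append((x, y))


# Scripted input standing in for a user at the keyboard.
scripted = iter(["1337", "80", "1603", "ninety-two", "92", "end"])
print(read_observations(lambda prompt: next(scripted)))
```

Passing the built-in `input` as `read_line` would make the sketch interactive; the scripted iterator is used here only so that the example runs without a user.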
Here is what a typical run of the program might look like:
    bourbaki% scheme least-squares.ss
    Chez Scheme Version 5.0c
    Copyright (c) 1994 Cadence Research Systems
    Type in real numbers giving the observed values of two related quantities.
    x: 1337
    y: 80
    x: 1603
    y: ninety-two
    The input must be in the form of a numeral.
    y: 92
    ...
    x: 1217
    y: 71
    x: end
    Calculating the coefficients of the linear equation ...
    y = 0.03596977696763969 x + 37.47028149880338
created October 15, 1997
last revised October 15, 1997
John David Stone (firstname.lastname@example.org)