When an experimenter has made several observations of the values of two quantities x and y and believes that the relation between them is described by some linear equation y = mx + b, she often applies the method of least squares to determine the choice of m and b that best fits her observations. Here's a brief explanation of this method with an example of its application:

We begin with thirty-two observations about the weight and measured fuel consumption of automobiles. Each of the observations involves a different model of automobile.

| Model | Weight (kg) | Fuel consumption (ml/km) |
|---|---|---|
| Buick Century | 1337 | 80 |
| Buick Park Avenue | 1603 | 92 |
| Buick Regal | 1575 | 92 |
| Buick Skylark | 1262 | 86 |
| Chevrolet Beretta | 1202 | 83 |
| Chevrolet Cavalier | 1146 | 80 |
| Chevrolet Corsica | 1209 | 73 |
| Chevrolet Lumina | 1530 | 85 |
| Chrysler Le Baron | 1299 | 91 |
| Chrysler Le Baron GTC | 1365 | 87 |
| Dodge Daytona | 1261 | 88 |
| Dodge Intrepid | 1504 | 89 |
| Dodge Spirit | 1265 | 87 |
| Eagle Vision | 1492 | 89 |
| Ford Escort | 1070 | 75 |
| Ford Mustang | 1259 | 103 |
| Ford Probe | 1188 | 81 |
| Ford Taurus | 1476 | 83 |
| Geo Metro | 748 | 48 |
| Lincoln Continental | 1646 | 97 |
| Mercury Grand Marquis | 1716 | 96 |
| Mercury Topaz | 1180 | 90 |
| Oldsmobile Achieva | 1232 | 85 |
| Oldsmobile Cutlass Calais | 1357 | 95 |
| Oldsmobile Cutlass Supreme | 1521 | 84 |
| Oldsmobile 98 | 1677 | 92 |
| Plymouth Acclaim | 1263 | 87 |
| Pontiac Grand Am | 1272 | 78 |
| Pontiac Sunbird | 1280 | 89 |
| Saturn SL | 1217 | 71 |

Here's a plot of the observed weight and fuel consumption for each car:

That there is a linear relationship between an automobile's weight and its fuel consumption is at least plausible. One could draw a line on the preceding plot that would be on or near most of the data points:

The method of least squares is a way of choosing which line to draw. It calculates the slope and y-intercept (the coefficients m and b in the equation y = mx + b) of the line that minimizes the sum of the squares of the vertical distances between the data points and the line. (The vertical distances are squared so that even one large mismatch between the observed value of y and the value that would have been predicted from the equation of the line and the observed value of x is heavily penalized.)
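In symbols, the quantity being minimized can be written as follows. This is only a restatement of the paragraph above; the subscripted names x_i and y_i for the i-th of n observations, and the name S for the sum, are notation introduced here for illustration.

```latex
S(m, b) = \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr)^2
```

Each term is the square of the vertical distance between the observed value y_i and the value m x_i + b that the line predicts for x_i.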

The derivation of the formulas for calculating m and b involves calculus, so I'm not going to present it here. The formulas themselves are relatively easy to understand. One begins by computing five values directly from the observations:

• n, the number of observations that were made (32, in the example above).

• xsum, the sum of the observed x values -- the weights of the automobiles, in the example: 1337 + 1603 + ... + 1217 = 43841.

• ysum, the sum of the observed y values -- the fuel consumption statistics: 80 + 92 + ... + 71 = 2776.

• xsqsum, the sum of the squares of the observed x values: 1337² + 1603² + ... + 1217² = 61805287.

• xysum, the sum of the products of corresponding x and y values: (1337)(80) + (1603)(92) + ... + (1217)(71) = 3865857.
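The five quantities above can be accumulated in a single pass over the data. The following is a minimal sketch, assuming the observations are held in a list of `(x . y)` pairs; that representation and the name `five-sums` are illustrative choices, not requirements of the exercise.

```scheme
;; Sketch: accumulate n, xsum, ysum, xsqsum, and xysum in one pass.
;; Assumption: observations is a list of (x . y) pairs; the name
;; five-sums and this representation are illustrative only.
(define (five-sums observations)
  (let loop ((rest observations)
             (n 0) (xsum 0) (ysum 0) (xsqsum 0) (xysum 0))
    (if (null? rest)
        ;; Return the five values as a list, in the order listed above.
        (list n xsum ysum xsqsum xysum)
        (let ((x (car (car rest)))
              (y (cdr (car rest))))
          (loop (cdr rest)
                (+ n 1)
                (+ xsum x)
                (+ ysum y)
                (+ xsqsum (* x x))
                (+ xysum (* x y)))))))

;; The first three observations from the table:
(five-sums '((1337 . 80) (1603 . 92) (1575 . 92)))
```

Applied to the first three rows of the table, as in the last line, this returns `(3 4515 264 6837803 399336)`.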

The slope m of the desired line is

(n * xysum - xsum * ysum) / (n * xsqsum - xsum²)

and the y-intercept b of the desired line is

(ysum - m * xsum) / n.

In the example, m = (32 * 3865857 - 43841 * 2776)/(32 * 61805287 - 43841²) = 2004808/55735903, or approximately 0.036; then b = (2776 - (2004808/55735903) * 43841) / 32, which works out to be about 37.5. So the equation of the desired line is y = 0.036 x + 37.5. (This is the line shown on the second plot above.)
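The two formulas translate almost directly into Scheme. Here is a hedged sketch that takes the five values as arguments and returns the slope and intercept as a two-element list; the procedure name `least-squares-line` and the choice of return value are assumptions made for illustration.

```scheme
;; Sketch: compute the slope m and y-intercept b from the five values.
;; The name least-squares-line is hypothetical.
(define (least-squares-line n xsum ysum xsqsum xysum)
  (let* ((m (/ (- (* n xysum) (* xsum ysum))
               (- (* n xsqsum) (* xsum xsum))))
         (b (/ (- ysum (* m xsum)) n)))
    (list m b)))

;; The five values from the worked example:
(define example-line
  (least-squares-line 32 43841 2776 61805287 3865857))

(exact->inexact (car example-line))   ; slope, approximately 0.036
(exact->inexact (cadr example-line))  ; intercept, approximately 37.5
```

The denominator n * xsqsum - xsum² is zero whenever all of the observed x values are equal (which includes the case of a single observation), so a complete program would have to decide what to do in that situation.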

The exercise is to write a stand-alone Scheme program that prompts the user for any positive number of observations of the values of two related quantities, then calculates and prints out the equation of the line that best fits those observations, according to the method of least squares.

In collecting the data from the user, your program should prompt the user appropriately at each step. It should recognize the symbol `end` as a sentinel indicating that no more observations are available. It should accept only a real number or the symbol `end`; on receiving any other input, it should print a warning message and repeat the prompt. It should signal an error if the user supplies the symbol `end` before providing any observations or after supplying the first value in a pair.
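One possible shape for the validation step is sketched below. It is not a complete solution: the name `read-value`, the explicit port argument, and the decision to treat end-of-file like the sentinel are all assumptions made for this sketch, and the two error conditions described above are left to the caller.

```scheme
;; Sketch: prompt until a real number or the symbol end is supplied.
;; Assumptions: the name read-value and the port argument are
;; illustrative; end-of-file is treated like the sentinel so that the
;; loop cannot run forever on exhausted input.
(define (read-value label port)
  (display label)
  (let ((datum (read port)))
    (cond ((eof-object? datum) 'end)
          ((real? datum) datum)          ; a real number: accept it
          ((eq? datum 'end) 'end)        ; the sentinel: pass it along
          (else                          ; anything else: warn and retry
           (display "The input must be in the form of a numeral.")
           (newline)
           (read-value label port)))))
```

A call such as `(read-value "x[1]: " (current-input-port))` reads from the keyboard. The caller can then check whether the result is the symbol `end` and, depending on how many values have been collected so far, either stop collecting or signal an error.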

Here is what a typical run of the program might look like:

```
bourbaki% scheme least-squares.ss
Chez Scheme Version 5.0c

Type in real numbers giving the observed values of two related quantities.

x[1]: 1337
y[1]: 80

x[2]: 1603
y[2]: ninety-two
The input must be in the form of a numeral.
y[2]: 92

...

x[32]: 1217
y[32]: 71

x[33]: end

Calculating the coefficients of the linear equation ...

y = 0.03596977696763969 x + 37.47028149880338
```

This document is available on the World Wide Web as

```
http://www.math.grin.edu/~stone/courses/scheme/exercise-6.html
```

created October 15, 1997
last revised October 15, 1997