The text file /u2/stone/datasets/hitters.dat is supposed to
contain statistics about the careers of 130 baseball players, one on each
line, in the following format:
- Columns 1 through 22 contain the player's name, left-justified.
- Column 23 is blank.
- Columns 24 through 28 contain the number of times the player
was at bat, right-justified.
- Column 29 is blank.
- Columns 30 through 33 contain the number of hits the player made,
right-justified.
- Column 34 is blank.
- Columns 35 through 37 contain the number of doubles the player made,
right-justified.
- Column 38 is blank.
- Columns 39 through 41 contain the number of triples the player made,
right-justified.
- Column 42 is blank.
- Columns 43 through 45 contain the number of home runs the player made,
right-justified.
- Column 46 is blank.
- Columns 47 through 50 contain the number of runs the player batted in,
right-justified.
- Column 51 is blank.
- Columns 52 through 55 contain the player's career batting average,
written as a decimal point followed by three digits.
- Column 56 is blank.
- Columns 57 through 60 contain the player's career slugging percentage,
written as a decimal point followed by three digits.
- Column 61 is blank.
- Columns 62 through 65 contain the player's career on-base percentage,
written as a decimal point followed by three digits.
Here's a sample line:
Kiner                   5205 1451 216  39 369 1015 .279 .548 .397
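As a rough sketch of how one such line might be taken apart, the column ranges above can be turned into string slices (column N of the file is index N - 1 in a zero-based string). The names `Hitter` and `parse_hitter` are made up for this illustration and are not part of the assignment, and Python is only one possible choice of language; the sample line is rebuilt from the field widths so that its spacing is exact.

```python
from typing import NamedTuple


class Hitter(NamedTuple):
    """One record from the data file (field names are illustrative)."""
    name: str
    at_bats: int
    hits: int
    doubles: int
    triples: int
    home_runs: int
    rbi: int
    batting_avg: float
    slugging: float
    on_base: float


def parse_hitter(line: str) -> Hitter:
    """Slice one fixed-width line; column N in the text is index N - 1."""
    return Hitter(
        name=line[0:22].rstrip(),        # columns 1-22, left-justified
        at_bats=int(line[23:28]),        # columns 24-28
        hits=int(line[29:33]),           # columns 30-33
        doubles=int(line[34:37]),        # columns 35-37
        triples=int(line[38:41]),        # columns 39-41
        home_runs=int(line[42:45]),      # columns 43-45
        rbi=int(line[46:50]),            # columns 47-50
        batting_avg=float(line[51:55]),  # columns 52-55
        slugging=float(line[56:60]),     # columns 57-60
        on_base=float(line[61:65]),      # columns 62-65
    )


# The sample line, rebuilt with the field widths listed above.
sample = "%-22s %5d %4d %3d %3d %3d %4d %s %s %s" % (
    "Kiner", 5205, 1451, 216, 39, 369, 1015, ".279", ".548", ".397")
kiner = parse_hitter(sample)
print(kiner.name, kiner.at_bats, kiner.batting_avg)
```

Slicing by position rather than splitting on blanks keeps names that contain spaces intact, which is presumably why the file uses fixed columns.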
Let us define the disparity between two players as the sum of the
following quantities:
- the difference between the number of times they were at bat;
- 3 times the difference between the number of hits they made;
- 10 times the difference between the number of doubles they made;
- 15 times the difference between the number of triples they made;
- 15 times the difference between the number of home runs they made;
- 6 times the difference between the number of runs they batted in;
- 12000 times the difference between their batting averages;
- 8000 times the difference between their slugging percentages;
- 9000 times the difference between their on-base percentages.
Each of these differences should always be non-negative. (In other words,
you should take the absolute value of the result of subtracting one number
from the other.)
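The definition above amounts to a weighted sum of absolute differences, which might be sketched as follows. The names `WEIGHTS` and `disparity`, and the choice to pass each player's nine statistics as a plain tuple, are assumptions made for this illustration; the second player's numbers are invented purely to have something to compare against.

```python
# Weights in the order the quantities are listed above.
WEIGHTS = (1, 3, 10, 15, 15, 6, 12000, 8000, 9000)


def disparity(a, b):
    """Weighted sum of absolute differences between two stat tuples.

    Each argument holds (at bats, hits, doubles, triples, home runs,
    runs batted in, batting average, slugging, on-base), in that order.
    """
    total = sum(w * abs(x - y) for w, x, y in zip(WEIGHTS, a, b))
    # With three-digit averages every term is a whole number in exact
    # arithmetic, so rounding only removes floating-point noise.
    return round(total)


kiner_stats = (5205, 1451, 216, 39, 369, 1015, 0.279, 0.548, 0.397)
other_stats = (5000, 1400, 200, 40, 300, 1000, 0.280, 0.500, 0.400)  # invented
print(disparity(kiner_stats, other_stats))  # prints 2081
```

Because each difference is taken as an absolute value, the result does not depend on the order of the two players, and a player's disparity from himself is 0.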
The assignment is to find and print out, for each of the players, which
five other players are "most similar" to him (in the sense that their
disparity from him is smallest). The output should consist of six lines
for each player, with the player's name on the first line and the five most
similar players and their computed disparities, in ascending order by
disparity, on subsequent lines. Thus the output for each player should
look like this:
Most similar to Hackman:
Wilde (disparity 1199)
Flakey (disparity 1416)
Stryke (disparity 1644)
Leimo (disparity 2209)
Whiffer (disparity 2262)
The six-line entries themselves should be printed in alphabetical order by
the name of the player each entry describes.
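One possible shape for the selection and printing step is sketched below. It assumes the players are held in a dictionary keyed by name (so names must be unique) and that a disparity function is supplied by the caller; `most_similar` and `report` are hypothetical names. The assignment does not say how to break ties between equal disparities, so ordering tied players by name is an assumption of this sketch. The toy data uses a single invented number per player in place of a full record, and asks for two neighbours rather than five to keep the output short.

```python
import heapq


def most_similar(name, players, disparity, k=5):
    """Return the k (disparity, other_name) pairs closest to `name`."""
    me = players[name]
    scored = ((disparity(me, stats), other)
              for other, stats in players.items() if other != name)
    # Pairs compare by disparity first, then by name (tie rule assumed).
    return heapq.nsmallest(k, scored)


def report(players, disparity, k=5):
    """Build the output lines, one entry per player, alphabetical by name."""
    lines = []
    for name in sorted(players):
        lines.append(f"Most similar to {name}:")
        for d, other in most_similar(name, players, disparity, k):
            lines.append(f"{other} (disparity {d})")
    return lines


# Toy data: one invented number per player stands in for the full record.
toy = {"Hackman": 10, "Wilde": 12, "Flakey": 15, "Stryke": 3}
print("\n".join(report(toy, lambda a, b: abs(a - b), k=2)))
```

With 130 players, comparing every player against every other takes fewer than 17,000 disparity computations, so nothing cleverer than this direct approach should be needed.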
This exercise will be due on Wednesday, October 9.
created October 1, 1996
last revised October 2, 1996