A word index for a text file is an alphabetical list of the words that occur in that file, together with, for each word, the numbers of the lines on which it occurs. A word, for this purpose, is defined as a non-null string of adjacent alphabetic characters preceded by a non-alphabetic character (or the beginning of the file) and followed by a non-alphabetic character (or the end of the file).
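The definition of a word can be made concrete with a short sketch. The procedure below is only one possible approach, and line-words is a hypothetical helper name rather than anything the assignment prescribes. It assumes that a single line of the file has already been read into a string, and it relies on the standard predicate char-alphabetic? to decide which characters belong to a word.

```scheme
;; line-words is a hypothetical helper, not part of the assignment.
;; It returns the words of LINE, in order, where a word is a maximal
;; run of adjacent alphabetic characters.
(define (line-words line)
  (let loop ((chars (string->list line))
             (current '())    ; characters of the word in progress, reversed
             (words '()))     ; completed words, reversed
    (cond ((null? chars)
           ;; End of the line: finish any word still in progress.
           (reverse (if (null? current)
                        words
                        (cons (list->string (reverse current)) words))))
          ((char-alphabetic? (car chars))
           ;; An alphabetic character extends the current word.
           (loop (cdr chars) (cons (car chars) current) words))
          ((null? current)
           ;; A separator with no word in progress is simply skipped.
           (loop (cdr chars) '() words))
          (else
           ;; A separator ends the current word.
           (loop (cdr chars)
                 '()
                 (cons (list->string (reverse current)) words))))))

(line-words "navel-stone beside")   ; => ("navel" "stone" "beside")
```

Because every non-alphabetic character is treated as a separator, the hyphen splits navel-stone into two words, as the definition requires.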
Since the texts for which word indices are compiled are fairly long, an unabridged word index would include very large and cumbersome lists for common function words such as ``the'' and ``of.'' Usually, therefore, there is a list of stop words that are to be ignored, or at least handled differently, during the indexing process. The file /home/stone/courses/scheme/examples/stop-list is an alphabetical list of seventy-five stop words for English (each one on a separate line).
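As a rough illustration of how stop words might be ignored, the sketch below filters a list of words against a list of stop words. The names sample-stop-words, stop-word?, and remove-stop-words are all hypothetical, and the two-word sample list is only a stand-in built from the examples mentioned above; it is not the contents of the real stop-list file, which would have to be read from disk.

```scheme
;; sample-stop-words is a made-up stand-in for the real stop list.
(define sample-stop-words '("of" "the"))

;; Is WORD (a lower-case string) one of the strings in STOP-WORDS?
(define (stop-word? word stop-words)
  (if (member word stop-words) #t #f))

;; Return the elements of WORDS that are not stop words, in order.
(define (remove-stop-words words stop-words)
  (cond ((null? words) '())
        ((stop-word? (car words) stop-words)
         (remove-stop-words (cdr words) stop-words))
        (else
         (cons (car words)
               (remove-stop-words (cdr words) stop-words)))))

(remove-stop-words '("the" "cave" "of" "oracles") sample-stop-words)
;; => ("cave" "oracles")
```

A linear search with member is the simplest choice here. Since the stop list is alphabetical, a faster lookup would be possible, but for seventy-five words the difference is unlikely to matter.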
For instance, suppose that we have a text file containing A. E. Housman's poem ``The oracles'':
'Tis mute, the word they went to hear on high Dodona mountain
When winds were in the oakenshaws and all the cauldrons tolled,
And mute's the midland navel-stone beside the singing fountain,
And echoes list to silence now where gods told lies of old.

I took my question to the shrine that has not ceased from speaking,
The heart within, that tells the truth and tells it twice as plain;
And from the cave of oracles I heard the priestess shrieking
That she and I should surely die and never live again.

Oh priestess, what you cry is clear, and sound good sense I think it;
But let the screaming echoes rest, and froth your mouth no more.
'Tis true there's better boose than brine, but he that drowns must drink it;
And oh, my lass, the news is news that men have heard before.

The King with half the East at heel is marched from lands of morning;
Their fighters drink the rivers up, their shafts benight the air.
And he that stands will die for nought, and home there's no returning.
The Spartans on the sea-wet rock sat down and combed their hair.
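Here is a word index for this file, omitting the stop words. The line numbers count the blank line that separates each stanza from the next, so the second stanza begins on line 6: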
again 9
air 17
before 14
benight 17
beside 3
better 13
boose 13
brine 13
cauldrons 2
cave 8
ceased 6
clear 11
combed 19
cry 11
die 9, 18
dodona 1
drink 13, 17
drowns 13
east 16
echoes 4, 12
fighters 17
fountain 3
froth 12
gods 4
good 11
hair 19
half 16
has 6
have 14
hear 1
heard 8, 14
heart 7
heel 16
high 1
home 18
is 11, 14, 16
king 16
lands 16
lass 14
let 12
lies 4
list 4
live 9
marched 16
men 14
midland 3
morning 16
mountain 1
mouth 12
must 13
mute 1, 3
navel 3
never 9
news 14
no 12, 18
not 6
nought 18
oakenshaws 2
oh 11, 14
old 4
oracles 8
plain 7
priestess 8, 11
question 6
rest 12
returning 18
rivers 17
rock 19
sat 19
screaming 12
sea 19
sense 11
shafts 17
should 9
shrieking 8
shrine 6
silence 4
singing 3
sound 11
spartans 19
speaking 6
stands 18
stone 3
surely 9
tells 7
think 11
tis 1, 13
told 4
tolled 2
took 6
true 13
truth 7
twice 7
went 1
were 2
wet 19
will 18
winds 2
within 7
word 1
Notice that, as a consequence of the way words are identified, punctuation and whitespace act only as separators and do not appear in the index. The difference between upper- and lower-case letters is ignored; for instance, both the occurrence of the word `Oh' at the beginning of line 11 and the occurrence of `oh' in line 14 are included in the entry for `oh'. If a word occurs more than once in a line, as does the word `news' in line 14, the line number is nevertheless listed only once in the index entry for that word.
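The two rules about case and repeated line numbers can be sketched as follows. This is only one way to represent the index, chosen for illustration: an association list in which each entry is a lower-case word followed by its line numbers, most recent first. The names lowercase-word and record-occurrence are hypothetical, and the sketch assumes that occurrences are recorded in the order in which they appear in the file, so that a repeat within a line can be detected by looking at the most recently recorded line number alone.

```scheme
;; Both procedures are hypothetical helpers, not part of the assignment.

;; Convert every letter of STR to lower case.
(define (lowercase-word str)
  (list->string (map char-downcase (string->list str))))

;; INDEX is an association list whose entries have the form
;; (word line ...), with the most recently recorded line number first.
;; Return an index that also records WORD as occurring on LINE.
(define (record-occurrence index word line)
  (let ((key (lowercase-word word)))
    (cond ((null? index)
           ;; First occurrence of this word anywhere in the file.
           (list (list key line)))
          ((string=? (car (car index)) key)
           (if (= (cadr (car index)) line)
               index    ; already recorded for this line: no change
               (cons (cons key (cons line (cdr (car index))))
                     (cdr index))))
          (else
           (cons (car index)
                 (record-occurrence (cdr index) word line))))))

(record-occurrence
 (record-occurrence
  (record-occurrence (record-occurrence '() "Oh" 11) "news" 14)
  "oh" 14)
 "news" 14)
;; => (("oh" 14 11) ("news" 14))
```

The entries in this sketch are neither alphabetized nor in ascending line order, so a sorting step and a reversal of each list of line numbers would still be needed before the index is written out.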
The assignment is to write a Scheme procedure that, given the name of a
text file, creates a new file containing an abridged word index (that is,
one containing no entries for the stop words) for that file. The procedure
should take as its argument a string that identifies the text file to be
indexed. The name of the new file should be the same as that of the
specified file, except with the extension .index attached at the
end; for instance, if our sample file above is called oracles.txt, the index file should be oracles.txt.index.
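The naming rule for the output file amounts to a single string operation. In the sketch below, index-file-name is a hypothetical helper name; the assignment does not require a separate procedure for this step.

```scheme
;; index-file-name is a hypothetical helper, not part of the assignment.
;; Return the name of the index file for the text file TEXT-FILE-NAME.
(define (index-file-name text-file-name)
  (string-append text-file-name ".index"))

(index-file-name "oracles.txt")   ; => "oracles.txt.index"
```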
For each test run that you submit, include the original text file and the index file along with your log of the interaction with DrScheme.
This exercise will be due at 2:15 p.m. on Friday, April 29.