Preparing a word index

A word index for a text file is an alphabetical list of the words that occur in that file, each accompanied by the numbers of the lines in which it occurs. A word, for this purpose, is a non-empty string of adjacent alphabetic characters that is preceded by a non-alphabetic character (or the beginning of the file) and followed by a non-alphabetic character (or the end of the file). For instance, suppose that we have a text file containing one of Percy Bysshe Shelley's sonnets:

I met a traveller from an antique land
Who said:  Two vast and trunkless legs of stone
Stand in the desert.  Near them, on the sand,
Half sunk, a shattered visage lies, whose frown,
And wrinkled lip, and sneer of cold command,
Tell that its sculptor well those passions read
Which yet survive, stamped on these lifeless things,
The hand that mocked them, and the heart that fed;
And on the pedestal these words appear:
``My name is Ozymandias, king of kings:
Look on my works, ye Mighty, and despair!''
Nothing beside remains.  Round the decay
Of that colossal wreck, boundless and bare
The lone and level sands stretch far away.

Here is the word index for this file:

a           1, 4
an          1
and         2, 5, 8, 9, 11, 13, 14
antique     1
appear      9
away        14
bare        13
beside      12
boundless   13
cold        5
colossal    13
command     5
decay       12
desert      3
despair     11
far         14
fed         8
from        1
frown       4
half        4
hand        8
heart       8
i           1
in          3
is          10
its         6
king        10
kings       10
land        1
legs        2
level       14
lies        4
lifeless    7
lip         5
lone        14
look        11
met         1
mighty      11
mocked      8
my          10, 11
name        10
near        3
nothing     12
of          2, 5, 10, 13
on          3, 7, 9, 11
ozymandias  10
passions    6
pedestal    9
read        6
remains     12
round       12
said        2
sand        3
sands       14
sculptor    6
shattered   4
sneer       5
stamped     7
stand       3
stone       2
stretch     14
sunk        4
survive     7
tell        6
that        6, 8, 13
the         3, 8, 9, 12, 14
them        3, 8
these       7, 9
things      7
those       6
traveller   1
trunkless   2
two         2
vast        2
visage      4
well        6
which       7
who         2
whose       4
words       9
works       11
wreck       13
wrinkled    5
ye          11
yet         7

Notice that, as a consequence of the way words are identified, punctuation and whitespace act only as separators and do not appear in the index. The difference between upper- and lower-case letters is ignored; for instance, the occurrence of the word `And' at the beginning of line 5 is included in the list of occurrences of `and'. If a word occurs more than once in a line, as does the word `that' in line 8, the line number is nevertheless listed only once in the index entry for that word.
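
The conventions just described (maximal runs of alphabetic characters, case folding, and one listing per line) can be illustrated with a minimal Scheme sketch. It is only one possible reading, not a prescribed design. The procedure names finish-word, line->words, and record-line are hypothetical. The sketch assumes that a line of the file is already available as a string and that lines are processed in ascending order, so that a repeated line number can only be the most recently recorded one.

```scheme
;; Hypothetical helper: if a word is in progress (its characters are
;; held in reverse order in CURRENT), add it to WORDS; otherwise
;; return WORDS unchanged.
(define (finish-word current words)
  (if (null? current)
      words
      (cons (list->string (reverse current)) words)))

;; Hypothetical helper: return the words of LINE, in order of
;; occurrence, folded to lower case.  Every non-alphabetic character
;; acts only as a separator.
(define (line->words line)
  (let loop ((chars (string->list line))
             (current '())   ; characters of the word in progress, reversed
             (words '()))    ; completed words, most recent first
    (cond ((null? chars)
           (reverse (finish-word current words)))
          ((char-alphabetic? (car chars))
           (loop (cdr chars)
                 (cons (char-downcase (car chars)) current)
                 words))
          (else
           (loop (cdr chars) '() (finish-word current words))))))

;; Hypothetical helper: add line number N to LINES (most recent first)
;; unless N is already the most recent entry.
(define (record-line n lines)
  (if (and (pair? lines) (= n (car lines)))
      lines
      (cons n lines)))

(line->words "And wrinkled lip, and sneer of cold command,")
;; => ("and" "wrinkled" "lip" "and" "sneer" "of" "cold" "command")

(record-line 8 (record-line 8 '(6)))
;; => (8 6)
```

Because record-line keeps the most recent line number first, the list for a word would have to be reversed before it is written out in ascending order.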

The assignment is to write a Scheme procedure that, as a side effect, creates a new file containing a word index for a specified text file. The procedure should take as its argument a string that identifies the text file to be indexed. The name of the new file should be the name of the specified file with the extension .index appended; for instance, if our sample file above is called Ozymandias.txt, the index file should be Ozymandias.txt.index.
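
As a hedged illustration of the naming rule and of the side effect, the following Scheme sketch builds the index file's name and writes already-computed entries to that file. The names index-file-name, entry->string, and write-index-file are hypothetical. The sketch assumes that each entry is a pair whose car is a word and whose cdr is a non-empty list of line numbers in ascending order. The two spaces between a word and its line numbers are likewise an assumption, since the assignment does not fix a column layout.

```scheme
;; Hypothetical helper: the index file's name is the text file's name
;; with ".index" appended.
(define (index-file-name text-file-name)
  (string-append text-file-name ".index"))

;; Hypothetical helper: format one entry as a line of text, with the
;; line numbers separated by commas.  LINES must be non-empty.
(define (entry->string word lines)
  (let loop ((rest (cdr lines))
             (text (string-append word "  " (number->string (car lines)))))
    (if (null? rest)
        text
        (loop (cdr rest)
              (string-append text ", " (number->string (car rest)))))))

;; Hypothetical helper: write each entry on its own line of the index
;; file.  ENTRIES is assumed to be in alphabetical order already.
(define (write-index-file text-file-name entries)
  (call-with-output-file (index-file-name text-file-name)
    (lambda (port)
      (for-each (lambda (entry)
                  (display (entry->string (car entry) (cdr entry)) port)
                  (newline port))
                entries))))

(index-file-name "Ozymandias.txt")
;; => "Ozymandias.txt.index"

(entry->string "that" '(6 8 13))
;; => "that  6, 8, 13"
```

R5RS leaves the effect of call-with-output-file unspecified when a file of that name already exists, so a repeated test run may need to delete the old index file first or use whatever replacement option the implementation provides.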

For each test run that you submit, include the original text file and the index file along with your log of the interaction with DrScheme.


This document is available on the World Wide Web as

http://www.cs.grinnell.edu/~stone/courses/scheme/exercises/word-index.xhtml

created November 20, 2001
last revised November 20, 2001

John David Stone (stone@cs.grinnell.edu)