Exercise #5: Automatic indexing

The assignment this time is to write a program that does automatic indexing. It should read in text that has been divided into pages, look for names, words, and phrases that should be entered in an alphabetical index to the text, and compile and write out a list of those names, words, and phrases, each followed by a list of the pages on which it occurs. Such an index might begin like this:

abstract data type 52, 61, 74, 76, 99
abstraction 5, 10, 68, 69
Ada programming language 62
algorithm 6, 13, 16, 19, 23, 34
ambiguity 10
analysis 11, 13, 20

The text to be indexed should be read from standard input. You can assume that it will begin with page 1 and that the ASCII form-feed character, Chr (12), immediately precedes each subsequent page. The index should be written to standard output, one entry -- with its list of page numbers -- on each line, even if this makes some lines unusually long. The entries should be alphabetized without regard to case (specifically, as if all were entirely in upper case). Non-letter characters in the index entries should be arranged according to their ASCII values.
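The paging and ordering rules above can be sketched as follows. This is only an illustration in Python, not a required design: the names split_pages, sort_key, and format_index are hypothetical, and the sketch assumes that entries differing only in case stay separate entries, which the rules above do not settle.

```python
def split_pages(text):
    """Split the source text into pages; element 0 is page 1."""
    # Chr (12), the ASCII form feed, is written "\f" in Python.
    return text.split("\f")


def sort_key(entry):
    # Compare entries as if they were entirely in upper case. Python compares
    # strings by code point, so non-letters fall in ASCII order.
    return entry.upper()


def format_index(index):
    """Turn a mapping {entry: set of page numbers} into output lines."""
    lines = []
    for entry in sorted(index, key=sort_key):
        pages = ", ".join(str(p) for p in sorted(index[entry]))
        lines.append(f"{entry} {pages}")  # one entry per line, however long
    return lines
```

Because the key is the upper-case form, ``Ada programming language'' sorts between ``abstraction'' and ``algorithm'' rather than ahead of every lower-case entry.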

Each name, word, or phrase that should be recorded in the index will be marked, in the source text, with the ASCII commercial-at character, @. If the character immediately following the commercial-at is a left-parenthesis, then the text of the index entry is the string of characters between that left-parenthesis and the next following right-parenthesis. If the character after the commercial-at is not a left-parenthesis, then the index entry is the string of characters between the commercial-at and the next following space character or end of a line of text. So, for instance, a page containing the line

level of @(abstract data types), where data are grouped in

should be listed in the index under ``abstract data types,'' and a page containing the line

therefore, require careful @analysis and formal procedures

should be listed in the index under ``analysis.''
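For well-formed text, the two marking forms can be recognized with a single regular expression. The following is a minimal sketch in Python, assuming that each line is passed without its newline and that a parenthesized entry closes on the line where it opens; the names MARK and extract_entries are hypothetical, and no errors are detected here.

```python
import re

# Either "@(" up to the next ")", or "@" up to the next space or end of line.
MARK = re.compile(r"@(?:\(([^)]*)\)|([^ ]*))")


def extract_entries(line):
    """Return the index entries marked in one line of well-formed text."""
    return [m.group(1) if m.group(1) is not None else m.group(2)
            for m in MARK.finditer(line)]
```

Note that the unparenthesized form runs to the next space, so punctuation typed directly after a marked word becomes part of the entry.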

It is an error for the text of an index entry to include a commercial-at character, a left-parenthesis, or a form-feed character. It is an error for the source text to contain a commercial-at character followed by a left-parenthesis, if there is no subsequent right-parenthesis. It is an error for the source text to contain a commercial-at character not followed by a left-parenthesis if there is no subsequent space or end of line. It is an error for the index entry to be the null string (i.e., for the commercial-at character to be immediately followed by a space or the end of the line, or for the commercial-at to be followed immediately by a left-parenthesis and then an immediate right-parenthesis).
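One way to check for these errors is an explicit left-to-right scan of each line. The sketch below is in Python and rests on several assumptions: the function name find_marks and the error strings are hypothetical, the line is passed without its newline, and a parenthesized entry must close on the same line. A final line that ends without a newline is treated here as ending normally.

```python
def find_marks(line):
    """Scan one line (given without its newline); return (entries, errors)."""
    entries, errors = [], []
    i = 0
    while True:
        at = line.find("@", i)
        if at == -1:
            return entries, errors
        if line.startswith("(", at + 1):
            start = at + 2
            close = line.find(")", start)
            if close == -1:
                # Assumption: the entry may not continue onto a later line.
                errors.append("no closing right-parenthesis")
                return entries, errors
            entry, i = line[start:close], close + 1
        else:
            start = at + 1
            end = line.find(" ", start)
            if end == -1:
                end = len(line)  # the entry runs to the end of the line
            entry, i = line[start:end], end
        if entry == "":
            errors.append("null entry")
        elif any(c in entry for c in "@(\f"):
            errors.append("forbidden character in entry")
        else:
            entries.append(entry)
```

The scan resumes after each entry, so one line can yield several entries or several errors.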

If the program detects any of these errors, the index should not be produced. Instead, the program should print the page and line number at which it detected the error, and should then proceed through the rest of the source text in ``syntax-checking mode'' -- that is, it should look only for other errors, printing out the page and line number of each one that it detects.
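The switch into syntax-checking mode can be sketched as a single pass that stops recording entries once any error has been seen. This Python sketch assumes that line numbers restart at 1 on each page and that a caller supplies scan_line, a hypothetical helper mapping one line to a pair (entries, errors); build_index is likewise a hypothetical name.

```python
def build_index(text, scan_line):
    """Return (index, error_locations) for the whole source text.

    index maps each entry to the set of pages on which it occurs; it is
    None if any error was detected. error_locations lists (page, line)
    pairs in the order the errors were detected.
    """
    index = {}
    error_locations = []
    for page_no, page in enumerate(text.split("\f"), start=1):
        # Split on "\n" only; str.splitlines would also break on other
        # control characters.
        for line_no, line in enumerate(page.split("\n"), start=1):
            entries, errors = scan_line(line)
            error_locations.extend((page_no, line_no) for _ in errors)
            if not error_locations:
                # Still in normal mode: record the entries for this page.
                for entry in entries:
                    index.setdefault(entry, set()).add(page_no)
    if error_locations:
        return None, error_locations
    return index, []
```

A caller would print either the formatted index or one line per element of error_locations; the exact wording of an error report is not specified above, so any format that shows the page and line number should do.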

For this exercise, please submit the source code for your main program. If you use any of the modules in /u2/stone/courses/fundamentals/modules, you don't have to submit a copy; but if you develop and use any modules of your own, or make changes in mine, please submit the source code for the modules as well.

If you like, you can submit the input and output from test runs, but I'll be running your program on a variety of test cases of my own, so it's not necessary for me to see your test output if you'd rather not submit it.

This exercise will be due on Monday, November 4.


created October 27, 1996
last revised October 30, 1996