Programming assignment #3: Compiling an index

The problem

One of the difficulties involved in compiling the index of a book is simply consolidating and sorting the data that the indexer collects. Typically, the indexer reads through the book, making a note of every occurrence of a name, term, or topic that seems appropriate for an index entry. Such a note might contain the number of the page (or range of pages) containing that occurrence and the text of the heading and sub-heading (if any) for the relevant index entry. For instance, here is a selection of the notes that an indexer might make in going through a book:

79: Exponential sums
95: Matrix/inverse
100: Subtraction/floating-point
105: Exponential sums
113: Exponential sums
152: Primitive recursive function
178: Subtraction
191-192: Subtraction
197: Subtraction
265: Subtraction
314: Matrix/inverse
366: Exponential sums
425: Matrix/null space
468: Subtraction/complex
482: Matrix/inverse
506: Subtraction/power series
602: Subtraction/continued fractions
625-626: Matrix/null space
214: Subtraction/floating-point
219: Subtraction/floating-point
230: Subtraction/floating-point
232: Subtraction/floating-point
238-239: Subtraction/floating-point
249: Subtraction/floating-point
250: Subtraction
657: Matrix/inverse
197: Subtraction

The same information would show up in the finished index thus:

Exponential sums, 79, 105, 113, 366
Matrix
    inverse, 95, 314, 482, 657
    null space, 425, 625-626
Primitive recursive function, 152
Subtraction, 178, 191-192, 197, 250, 265
    complex, 468
    continued fractions, 602
    floating-point, 100, 214, 219, 230, 232, 238-239, 249
    power series, 506

The objective

Your program should read in an indexer's notes from a text file specified on the command line, compile the index, and write it out to standard output (System.out).

Whoever prepares the input file is supposed to make sure that each line of the input file either is empty (so that BufferedReader.readLine() returns a string of length 0) or contains an indexer's note, consisting of a page number or page range, a colon, a space, and a heading, optionally followed by a slash and a sub-heading. The text of a heading or sub-heading will never contain either a colon or a slash. A page range will consist of two page numbers separated by a hyphen.
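One way to recognize a note of this shape is with a regular expression. The following is only a sketch under stated assumptions: the class name NoteParser, the method parse, and the choice to return the pieces as an array of strings are all hypothetical, and a single page is treated as a range that starts and ends on the same page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NoteParser {
    // Page number, optional "-page", colon, space, heading, optional "/sub-heading".
    // Headings and sub-headings may contain anything except a colon or a slash.
    private static final Pattern NOTE =
        Pattern.compile("(\\d+)(?:-(\\d+))?: ([^:/]+)(?:/([^:/]+))?");

    /**
     * Returns {startPage, endPage, heading, subHeading}, where subHeading is
     * null if the note has none, or returns null if the line is not a note.
     */
    public static String[] parse(String line) {
        Matcher m = NOTE.matcher(line);
        if (!m.matches()) {
            return null;                       // wrong structure
        }
        String start = m.group(1);
        String end = (m.group(2) != null) ? m.group(2) : start;
        return new String[] { start, end, m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String[] note = parse("238-239: Subtraction/floating-point");
        // prints "238 | 239 | Subtraction | floating-point"
        System.out.println(String.join(" | ", note));
    }
}
```

Because matches() requires the whole line to fit the pattern, a line with a second slash or a stray colon in the heading is rejected rather than partly accepted.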

If your program encounters a line in the input file that isn't empty and doesn't have the right structure to be an indexer's note, it should echo that line to System.err, appropriately labelled, but leave it out of the index. It should then go on to the next line of the input file (instead of, say, crashing).
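The skip-and-continue behavior might be sketched as follows. The class name NoteFilter, the method accept, and the wording of the label are assumptions made for illustration; the assignment asks only that the line be echoed to System.err with an appropriate label. Including the line number in the label is likewise a choice, not a requirement.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintStream;

public class NoteFilter {
    // Assumed shape of a note: page or page range, ": ", heading, optional "/sub-heading".
    private static final String NOTE_REGEX = "\\d+(-\\d+)?: [^:/]+(/[^:/]+)?";

    /**
     * Returns true if the line is a well-formed note. Empty lines are
     * silently skipped; any other malformed line is echoed to err.
     */
    public static boolean accept(String line, int lineNumber, PrintStream err) {
        if (line.isEmpty()) {
            return false;                      // empty lines are permitted
        }
        if (line.matches(NOTE_REGEX)) {
            return true;
        }
        err.println("Malformed note on line " + lineNumber + ": " + line);
        return false;
    }

    public static void main(String[] args) throws IOException {
        int kept = 0;
        int lineNumber = 0;
        // args[0] is taken to be the name of the input file.
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                lineNumber++;
                if (accept(line, lineNumber, System.err)) {
                    kept++;                    // a real program would index the note here
                }
            }
        }
        System.out.println(kept + " well-formed notes");
    }
}
```

Passing the error stream in as a parameter keeps the check easy to exercise on its own, since a test can hand it any PrintStream.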

In the index that you write out, the headings should be alphabetized, the sub-headings associated with each heading should be alphabetized and indented four spaces, and the page numbers associated with each heading or sub-heading should be arranged in ascending numerical order. Page ranges should be sorted according to their starting page numbers. Duplicate page numbers associated with the same heading (or the same heading and sub-heading), such as the two "197: Subtraction" notes in the sample data set above, should be consolidated (that is, the duplicated page number should appear only once in the index).
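Sorted collections can do most of this ordering and consolidation work. The sketch below is one possible arrangement, not a prescribed design: the class name IndexBuilder and its methods are hypothetical, a single page is stored as a range whose start and end are equal, and alphabetizing is assumed to ignore case (so "Matrix" and "matrix" would be merged into one entry).

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class IndexBuilder {
    // Orders page references by starting page, then by ending page, so that
    // identical references compare as equal and the set keeps only one copy.
    private static final Comparator<int[]> BY_START =
        Comparator.<int[]>comparingInt(r -> r[0]).thenComparingInt(r -> r[1]);

    // heading -> sub-heading ("" stands for the heading itself) -> page references
    private final Map<String, Map<String, TreeSet<int[]>>> entries =
        new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    public void add(String heading, String subHeading, int start, int end) {
        String sub = (subHeading == null) ? "" : subHeading;
        entries.computeIfAbsent(heading, h -> new TreeMap<>(String.CASE_INSENSITIVE_ORDER))
               .computeIfAbsent(sub, s -> new TreeSet<>(BY_START))
               .add(new int[] { start, end });
    }

    public String format() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Map<String, TreeSet<int[]>>> e : entries.entrySet()) {
            out.append(e.getKey());
            // The "" key sorts first, so the heading's own pages land on its line.
            for (Map.Entry<String, TreeSet<int[]>> s : e.getValue().entrySet()) {
                if (!s.getKey().isEmpty()) {
                    out.append("\n    ").append(s.getKey());
                }
                for (int[] r : s.getValue()) {
                    out.append(", ").append(r[0]);
                    if (r[1] != r[0]) {
                        out.append('-').append(r[1]);
                    }
                }
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        IndexBuilder index = new IndexBuilder();
        index.add("Subtraction", null, 197, 197);
        index.add("Subtraction", "floating-point", 238, 239);
        index.add("Matrix", "inverse", 95, 95);
        index.add("Subtraction", null, 191, 192);
        index.add("Subtraction", null, 197, 197);   // duplicate, stored once
        System.out.print(index.format());
    }
}
```

Run as written, main should print "Matrix" alone on a line, then "    inverse, 95", then "Subtraction, 191-192, 197", then "    floating-point, 238-239". The TreeSet does the consolidating: adding page 197 a second time has no effect, because the comparator reports it equal to the reference already stored.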

Submit the output from a test run on the data set above, along with the data sets and output for any other test runs you'd like me to consider. I reserve the right to run your program on additional data sets of my own nefarious contrivance.

This assignment will be due on Friday, April 18.