In some simple text applications, such as English word-lists, we might be willing to accept a reduced character set containing only thirty-two characters in exchange for a more concise representation of strings. Let's use the following NIC ("Non-standard Interchange Code") for this purpose:
00000 space
00001 small-letter-a
00010 small-letter-b
00011 small-letter-c
00100 small-letter-d
00101 small-letter-e
00110 small-letter-f
00111 small-letter-g
01000 small-letter-h
01001 small-letter-i
01010 small-letter-j
01011 small-letter-k
01100 small-letter-l
01101 small-letter-m
01110 small-letter-n
01111 small-letter-o
10000 small-letter-p
10001 small-letter-q
10010 small-letter-r
10011 small-letter-s
10100 small-letter-t
10101 small-letter-u
10110 small-letter-v
10111 small-letter-w
11000 small-letter-x
11001 small-letter-y
11010 small-letter-z
11011 apostrophe
11100 comma
11101 hyphen-minus
11110 full-stop
11111 end-of-line
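As a rough illustration of the table, the mapping from a character to its NIC code might be sketched as below. The function name NICCode and the -1 result for characters outside the set are hypothetical choices made only for this sketch, not part of the assignment; the sketch also assumes an ASCII-compatible character set, in which the small letters are contiguous. End-of-line (code 31) is not handled here, since a Pascal text file reports line breaks through EOLn rather than as an ordinary character.

```pascal
program NICCodeDemo;

{ Hypothetical helper: returns the NIC code for ch, or -1 if ch is not }
{ in the NIC character set.  Assumes 'a'..'z' are contiguous (ASCII).  }
function NICCode (ch: Char): Integer;
begin
  if (ch >= 'a') and (ch <= 'z') then
    NICCode := Ord (ch) - Ord ('a') + 1   { small letters map to 1..26 }
  else if ch = ' ' then
    NICCode := 0
  else if ch = '''' then
    NICCode := 27                         { apostrophe }
  else if ch = ',' then
    NICCode := 28
  else if ch = '-' then
    NICCode := 29
  else if ch = '.' then
    NICCode := 30
  else
    NICCode := -1                         { not an NIC character }
end;

begin
  WriteLn (NICCode ('a'));
  WriteLn (NICCode ('z'));
  WriteLn (NICCode ('.'))
end.
```

Run as written, this should print the codes 1, 26, and 30, matching the table entries for small-letter-a, small-letter-z, and full-stop.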
Now, since an NIC character occupies only five bits, it should be possible to fit six of them into the thirty-two-bit space occupied by an integer variable. If one were to open up a text file, read in the characters in groups of six, pack each such group into an integer, and write the integer to a (binary) file, the resulting file would be one-third smaller than the original, since every six bytes of the original would be reduced to four in the compressed version.
How can we pack NIC characters into an integer? Well, there are four arithmetic operations on integers that can be used for setting and clearing groups of five bits: multiplication by 32, addition, remainder on division by 32, and quotient on division by 32. Specifically:
If a variable of type Integer contains a non-negative
integer less than 33554432 (which is 2^25), multiplying that value
by 32 will have the effect of shifting the bit-pattern that
represents it five positions to the left, filling in zero bits on the right
end. For instance, the integer value 42 is represented by the bit
pattern 00000000000000000000000000101010; multiplying it by
32 yields 1344, which has the bit pattern
00000000000000000000010101000000.
If a variable of type Integer contains a non-negative
integer that is evenly divisible by 32, the five rightmost bits in
the bit-pattern that represents it will all be zeroes, so that adding a
non-negative value less than 32 to it will store the last five bits of that value
in the rightmost five bits. For instance, 1344 is evenly divisible
by 32 and has the bit pattern
00000000000000000000010101000000. Adding 23 to
1344 yields 1367, which has the bit pattern
00000000000000000000010101010111; note that only the rightmost
five bits have changed and that those five bits now form the binary numeral
for 23.
If a variable of type Integer contains a non-negative
integer, the remainder after division of its value by 32 is equal to
the value of its last five bits, interpreted as a binary numeral. For
instance, dividing 1367, which has the bit pattern
00000000000000000000010101010111, by 32 leaves a
remainder of 23, which is the value expressed by its last five
bits.
If a variable of type Integer contains a non-negative
integer less than 1073741824 (which is 2^30), dividing that value by
32 and keeping the quotient will have the effect of shifting the
bit-pattern of that value five positions to the right, filling in with zero
bits on the left end. For example, dividing 1367, which has the bit
pattern 00000000000000000000010101010111, by 32 yields
a quotient of 42, which has the bit pattern
00000000000000000000000000101010.
This means that we can pack six NIC characters into an integer variable by initializing that variable to 0, adding an integer less than 32 that encodes the first character to store that character into the rightmost five bits, multiplying by 32 to shift it leftwards, adding the encoding for the second character, multiplying by 32 again, and repeating the process until six groups of five bits each have been inserted.
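The four operations and the packing loop can be tried out with a small sketch like the one below. The names PackGroup and CodeGroup are hypothetical, introduced only for illustration, and the sketch assumes (as this handout does) that Integer is thirty-two bits wide; on a compiler whose Integer is narrower, a full group of six codes would need a wider type such as LongInt. The sample group is chosen so that the result reproduces the value 1367 used in the examples above.

```pascal
program PackDemo;

const
  GroupSize = 6;

type
  CodeGroup = array [1 .. GroupSize] of Integer;

var
  Sample: CodeGroup;
  I: Integer;

{ Hypothetical helper: packs six NIC codes (each in 0..31) into one Integer. }
function PackGroup (var Codes: CodeGroup): Integer;
var
  K, Accumulator: Integer;
begin
  Accumulator := 0;
  for K := 1 to GroupSize do
    { Shift the codes stored so far five bits left, then add the next one. }
    { On the first pass the shift is harmless, since 0 * 32 is still 0.    }
    Accumulator := Accumulator * 32 + Codes[K];
  PackGroup := Accumulator
end;

begin
  { The four operations, applied to the numbers used in the text. }
  WriteLn (42 * 32);      { shift left five bits: 1344 }
  WriteLn (1344 + 23);    { store 23 in the low five bits: 1367 }
  WriteLn (1367 mod 32);  { read the low five bits: 23 }
  WriteLn (1367 div 32);  { shift right five bits: 42 }

  { Three spaces, then a (1), j (10), w (23). }
  for I := 1 to 3 do
    Sample[I] := 0;
  Sample[4] := 1;
  Sample[5] := 10;
  Sample[6] := 23;
  WriteLn (PackGroup (Sample))   { (1 * 32 + 10) * 32 + 23 = 1367 }
end.
```

Starting the loop with a multiplication rather than an addition is a small simplification of the steps described above; the two orderings give the same result because the variable starts at 0.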
To recover the characters from the finished integer, we divide it by 32 and decode the remainder to get the sixth character, then replace it with the quotient after division by 32 to shift the bit pattern rightwards, then divide by 32 and decode the remainder to get the fifth character, then replace with the quotient to shift rightwards again, and so on.
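The recovery process can be sketched in the same spirit. Again, the names UnpackGroup and CodeGroup are hypothetical and a thirty-two-bit Integer is assumed; the sketch recovers only the five-bit codes and leaves the decoding of each code into a character aside.

```pascal
program UnpackDemo;

const
  GroupSize = 6;

type
  CodeGroup = array [1 .. GroupSize] of Integer;

var
  Sample: CodeGroup;
  I: Integer;

{ Hypothetical helper: splits a packed Integer into six NIC codes. }
procedure UnpackGroup (Value: Integer; var Codes: CodeGroup);
var
  K: Integer;
begin
  { The last code packed sits in the rightmost five bits, so work backwards. }
  for K := GroupSize downto 1 do
  begin
    Codes[K] := Value mod 32;   { read the rightmost five bits }
    Value := Value div 32       { shift the remaining bits rightwards }
  end
end;

begin
  UnpackGroup (1367, Sample);
  for I := 1 to GroupSize do
    Write (Sample[I], ' ');
  WriteLn
end.
```

For the value 1367 this should recover the codes 0, 0, 0, 1, 10, and 23, in that order: three spaces followed by a, j, and w.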
The exercise is to write and test a procedure for converting text
files to files of the Pascal data type file of Integer,
containing NIC characters packed into integers in the manner described,
and the converse procedure for restoring text files from the compressed
file of Integer versions:
    type
      NICFile = file of Integer;

    procedure TextToNIC (var Source: Text; var Target: NICFile);
    procedure NICToText (var Source: NICFile; var Target: Text);

Each of these procedures should presuppose that Source has already been opened for input and Target for output before it is called; do not use Reset or Rewrite in the body of either procedure. You may assume, as standard Pascal does, that every text file ends with a line break.
I propose to check your procedures by running them on a variety of test files, one of which is /u2/stone/datasets/wordlist.dat. The others will remain my little secret for the time being.
There are some design issues that you'll have to resolve:
What should the TextToNIC procedure do if the source file
contains a character that is not in the NIC character set?
What should the TextToNIC procedure do if the number of
characters in the source file is not an exact multiple of 6?
What should the NICToText procedure do if the last
character recovered from the source file is not
end-of-line?
Be sure to document your decisions on these issues.
This time, you need only submit listings of your procedures. These procedures will be due on Wednesday, September 25.