Exercise #3: Packing

In some simple text applications, such as English word-lists, we might be willing to accept a reduced character set containing only thirty-two characters in exchange for a more concise representation of strings. Let's use the following NIC ("Non-standard Interchange Code") for this purpose:

00000 space
00001 small-letter-a
00010 small-letter-b
00011 small-letter-c
00100 small-letter-d
00101 small-letter-e
00110 small-letter-f
00111 small-letter-g
01000 small-letter-h
01001 small-letter-i
01010 small-letter-j
01011 small-letter-k
01100 small-letter-l
01101 small-letter-m
01110 small-letter-n
01111 small-letter-o
10000 small-letter-p
10001 small-letter-q
10010 small-letter-r
10011 small-letter-s
10100 small-letter-t
10101 small-letter-u
10110 small-letter-v
10111 small-letter-w
11000 small-letter-x
11001 small-letter-y
11010 small-letter-z
11011 apostrophe
11100 comma
11101 hyphen-minus
11110 full-stop
11111 end-of-line
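
The table can be turned into a pair of lookup functions. The following is only a sketch, assuming a Free Pascal-style compiler (typed constants and Pos on a character); the names NICChars, NICCode, NICChar, EndOfLineCode, and NotInNIC are hypothetical and are not part of the exercise. Returning NotInNIC for a character outside the set is just a placeholder: what to do in that case is one of the design decisions the exercise leaves open.

```pascal
program NICCodes;

const
  { Printable NIC characters in code order: position 1 holds code 0
    (space) and position 31 holds code 30 (full stop). }
  NICChars: string[31] = ' abcdefghijklmnopqrstuvwxyz'',-.';
  EndOfLineCode = 31;   { end-of-line is not a printable character }
  NotInNIC = -1;        { hypothetical marker, an assumption of this sketch }

{ Return the NIC code (0..30) of ch, or NotInNIC if ch is not in the set. }
function NICCode (ch: Char): Integer;
var
  position: Integer;
begin
  position := Pos(ch, NICChars);
  if position = 0 then
    NICCode := NotInNIC
  else
    NICCode := position - 1
end;

{ Return the printable character for a code in the range 0..30. }
function NICChar (code: Integer): Char;
begin
  NICChar := NICChars[code + 1]
end;

begin
  WriteLn(NICCode('w'));     { 23, i.e. 10111 }
  WriteLn(NICCode(''''));    { 27, the apostrophe }
  WriteLn(NICCode('Q'));     { -1, capital letters are not in the set }
  WriteLn(NICChar(23))       { w }
end.
```

Code 31 is kept out of the string because Pascal reports a line break through Eoln rather than as an ordinary character, so the caller has to handle it separately.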

Now, since an NIC character occupies only five bits, it should be possible to fit six of them (thirty bits in all) into the thirty-two-bit space occupied by an integer variable. If one were to open up a text file, read in the characters in groups of six, pack each such group into an integer, and write the integer to a (binary) file, the resulting file would be one-third smaller than the original, since every six bytes of the original would be reduced to four in the compressed version.

How can we pack NIC characters into an integer? Well, there are four arithmetic operations on integers that can be used for setting and clearing groups of five bits: multiplication by 32, addition, remainder on division by 32, and quotient on division by 32. Specifically:

  • If a variable of type Integer contains a non-negative integer less than 33554432 (which is 2^25), multiplying that value by 32 will have the effect of shifting the bit-pattern that represents it five positions to the left, filling in zero bits on the right end. For instance, the integer value 42 is represented by the bit pattern 00000000000000000000000000101010; multiplying it by 32 yields 1344, which has the bit pattern 00000000000000000000010101000000.

  • If a variable of type Integer contains a non-negative integer that is evenly divisible by 32, the five rightmost bits in the bit-pattern that represents it will all be zeroes, so that adding a value less than 32 to it will store the last five bits of that value in the rightmost five bits. For instance, 1344 is evenly divisible by 32 and has the bit pattern 00000000000000000000010101000000. Adding 23 to 1344 yields 1367, which has the bit pattern 00000000000000000000010101010111; note that only the rightmost five bits have changed and that those five bits now form the binary numeral for 23.

  • If a variable of type Integer contains a non-negative integer, the remainder after division of its value by 32 is equal to the value of its last five bits, interpreted as a binary numeral. For instance, dividing 1367, which has the bit pattern 00000000000000000000010101010111, by 32 leaves a remainder of 23, which is the value expressed by its last five bits.

  • If a variable of type Integer contains a non-negative integer less than 1073741824 (which is 2^30), dividing that value by 32 and keeping the quotient will have the effect of shifting the bit-pattern of that value five positions to the right, filling in with zero bits on the left end. For example, dividing 1367, which has the bit pattern 00000000000000000000010101010111, by 32 yields a quotient of 42, which has the bit pattern 00000000000000000000000000101010.
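
The four operations can be tried on the numbers used in the examples above. This is a minimal sketch, assuming Free Pascal, where LongInt is a 32-bit type; it stands in for the handout's Integer, whose size depends on the compiler.

```pascal
program ShiftDemo;

var
  value: LongInt;   { 32 bits in Free Pascal; stands in for Integer }

begin
  value := 42;
  value := value * 32;      { shift left five places: 1344 }
  WriteLn(value);
  value := value + 23;      { fill the five low bits: 1367 }
  WriteLn(value);
  WriteLn(value mod 32);    { read the five low bits: 23 }
  WriteLn(value div 32)     { shift right five places: 42 }
end.
```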

    This means that we can pack six NIC characters into an integer variable by initializing that variable to 0, adding an integer less than 32 that encodes the first character to store that character into the rightmost five bits, multiplying by 32 to shift it leftwards, adding the encoding for the second character, multiplying by 32 again, and repeating the process until six groups of five bits each have been inserted.
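
The packing step might be sketched as follows. CodeGroup, PackGroup, and Sample are hypothetical names, and LongInt is again assumed to be the 32-bit type (as in Free Pascal). The loop multiplies before it adds; because the total starts at 0, this produces the same result as the add-then-multiply order described above.

```pascal
program PackDemo;

type
  CodeGroup = array [1..6] of Integer;   { six NIC codes, each 0..31 }

const
  { The codes for p, a, c, k, e, d, taken from the NIC table. }
  Sample: CodeGroup = (16, 1, 3, 11, 5, 4);

{ Pack six five-bit codes into one 32-bit integer, first code leftmost. }
function PackGroup (codes: CodeGroup): LongInt;
var
  total: LongInt;
  i: Integer;
begin
  total := 0;
  for i := 1 to 6 do
    total := total * 32 + codes[i];   { shift left, then fill the low bits }
  PackGroup := total
end;

begin
  WriteLn(PackGroup(Sample))   { 538029220 }
end.
```

The total never exceeds 2^25 - 1 before the final multiplication, so the result stays below 2^30 and the two high bits of the integer remain zero.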

    To recover the characters from the finished integer, we reverse the process: divide the integer by 32 and decode the remainder to get the sixth character, then replace the integer with the quotient of that division to shift the bit pattern rightwards; divide by 32 again and decode the remainder to get the fifth character, replace the integer with the quotient once more, and so on until all six characters have been recovered.
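
The recovery step can be sketched in the same way. UnpackGroup and CodeGroup are hypothetical names, LongInt is assumed to be 32 bits (as in Free Pascal), and the packed value 538029220 is the six-code group 16 1 3 11 5 4 (the letters p, a, c, k, e, d) worked out by hand from the NIC table.

```pascal
program UnpackDemo;

type
  CodeGroup = array [1..6] of Integer;   { six NIC codes, each 0..31 }

{ Split a packed integer into six five-bit codes, last code first. }
procedure UnpackGroup (packedValue: LongInt; var codes: CodeGroup);
var
  i: Integer;
begin
  for i := 6 downto 1 do
  begin
    codes[i] := packedValue mod 32;      { the low five bits }
    packedValue := packedValue div 32    { shift right five places }
  end
end;

var
  recovered: CodeGroup;
  k: Integer;

begin
  UnpackGroup(538029220, recovered);
  for k := 1 to 6 do
    Write(recovered[k], ' ');            { the codes 16 1 3 11 5 4 }
  WriteLn
end.
```

Filling the array from position 6 down to position 1 puts the codes back in their original order, since the last character packed is the first one recovered.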

    The exercise is to write and test a procedure for converting text files to files of the Pascal data type file of Integer, containing NIC characters packed into integers in the manner described, and the converse procedure for restoring text files from the compressed file of Integer versions:

    type
      NICFile = file of Integer;
    
    procedure TextToNIC (var Source: Text; var Target: NICFile);
    
    procedure NICToText (var Source: NICFile; var Target: Text);
    
    Each of these procedures should presuppose that Source has already been opened for input and Target for output before it is called; do not use Reset or Rewrite in the body of either procedure. You may assume, as standard Pascal does, that every text file ends with a line break.

    I propose to check your procedures by running them on a variety of test files, one of which is /u2/stone/datasets/wordlist.dat. The others will remain my little secret for the time being.

    There are some design issues that you'll have to resolve:

  • What should the TextToNIC procedure do if the source file contains a character that is not in the NIC character set?

  • What should the TextToNIC procedure do if the number of characters in the source file is not an exact multiple of 6?

  • What should the NICToText procedure do if the last character recovered from the source file is not end-of-line?

    Be sure to document your decisions on these issues.

    This time, you need only submit listings of your procedures. These procedures will be due on Wednesday, September 25.


    created September 12, 1996
    last revised September 17, 1996