A table is a data structure that stores zero or more values, each of which is associated with and accessed through a distinct key, just as an array element is associated with and accessed through the index indicating the position in the array at which it is stored. A typical small example is a structure that might be used in a program that plays a word game in which each letter of a word is worth some number of points (Scrabble, for instance). The table would contain twenty-six integer values, one for each possible letter key; although the values may include duplicates, the keys do not. Here are the operations normally required for tables:
create
  Inputs: none.
  Output: result, a table.
  Preconditions: none.
  Postcondition: result contains no values.
search
  Inputs: store, a table, and opener, a key.
  Outputs: found, a Boolean, and sought, a value of the type stored in store.
  Preconditions: none.
  Postcondition: Either found is false and opener is not associated with any value stored in store, or found is true and sought is the value associated with opener in store.
insert
  Inputs: store, a table; opener, a key; associate, a value of the type stored in store.
  Outputs: none.
  Preconditions: none.
  Postconditions: At output, store contains associate and associates that value with opener; with any other key, it associates the same value as it did at input.
For many tables, including the one in the word-game example, the range of possible keys is a small ordinal type, and then the simplest and most natural implementation is an array in which the keys themselves are used as subscripts:
module SimpleTables;

export

  type
    Key = 'A' .. 'Z';
    Value = Integer;
    Table = ^TableRecord;

  { The CreateTable function constructs and returns an empty table. }
  function CreateTable: Table;

  { The SearchTable procedure looks in a given table for a value associated
    with a specified key, setting Found to True if it finds such a value
    and to False if it does not.  In addition, if the search is
    successful, the value is returned as the value of the parameter
    Sought. }
  procedure SearchTable (Store: Table; Opener: Key; var Found: Boolean;
                         var Sought: Value);

  { The InsertInTable procedure associates a given key with a given value
    in a given table. }
  procedure InsertInTable (var Store: Table; Opener: Key;
                           Associate: Value);

  { The DeallocateTable procedure disposes of all the storage
    associated with the table, leaving its argument undefined. }
  procedure DeallocateTable (var Store: Table);

implement

  const
    Absent = -1;
      { a conventional indication that no value is associated with a given
        key }

  type
    TableRecord = array [Key] of Value;

  function CreateTable: Table;
  var
    Result: Table;
    Index: Key;
  begin
    New (Result);
    for Index := 'A' to 'Z' do
      Result^[Index] := Absent;
    CreateTable := Result
  end;

  procedure SearchTable (Store: Table; Opener: Key; var Found: Boolean;
                         var Sought: Value);
  begin
    { Assert (Store <> nil); }
    if Store^[Opener] = Absent then
      Found := False
    else begin
      Found := True;
      Sought := Store^[Opener]
    end
  end;

  procedure InsertInTable (var Store: Table; Opener: Key;
                           Associate: Value);
  begin
    { Assert (Store <> nil); }
    Store^[Opener] := Associate
  end;

  procedure DeallocateTable (var Store: Table);
  begin
    { Assert (Store <> nil); }
    Dispose (Store);
    Store := nil
  end;

end.
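As a usage illustration, the word-game table can be filled and consulted roughly as follows. This is a minimal sketch rather than a client of the module above: it inlines the same array-indexed idea in one stand-alone program (a Free Pascal-compatible compiler is assumed), and the ScoreTable type, the FillScores and WordScore names, the letter values, and the sample word are assumptions made for illustration.

```pascal
program WordGameScores;

{ A minimal sketch, not the SimpleTables module itself: the same
  array-indexed table idea, inlined so that the program stands alone.
  FillScores, WordScore, and the letter values are assumptions. }

const
  Absent = -1;   { same convention as above: no value for this key }

type
  Key = 'A' .. 'Z';
  Value = Integer;
  ScoreTable = array [Key] of Value;

var
  Scores: ScoreTable;

{ Start with an empty table, then insert one value per letter key. }
procedure FillScores (var Store: ScoreTable);
const
  Points: array [Key] of Value =
    (1, 3, 3, 2, 1, 4, 2, 4, 1, 8, 5, 1, 3,
     1, 1, 3, 10, 1, 1, 1, 1, 4, 4, 8, 4, 10);
var
  Index: Key;
begin
  for Index := 'A' to 'Z' do
    Store[Index] := Absent;
  for Index := 'A' to 'Z' do
    Store[Index] := Points[Index]
end;

{ Add up the values of the upper-case letters in Letters; any other
  character, and any letter with no value, is skipped. }
function WordScore (var Store: ScoreTable; const Letters: string): Integer;
var
  I, Total: Integer;
begin
  Total := 0;
  for I := 1 to Length (Letters) do
    if (Letters[I] >= 'A') and (Letters[I] <= 'Z') then
      if Store[Letters[I]] <> Absent then
        Total := Total + Store[Letters[I]];
  WordScore := Total
end;

begin
  FillScores (Scores);
  WriteLn ('HASH scores ', WordScore (Scores, 'HASH'))   { 4 + 1 + 1 + 4 }
end.
```

Each lookup is a single subscript operation, which is why search and insert take the same time no matter how many letters already have values.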
However, there is a problem if the keys do not belong to an ordinal type or
if the range of possible keys is vastly larger than the number of elements
to be stored. For instance, at Grinnell College, student IDs are
nine-digit numbers (that is, in the range from 000000000 to 999999999),
while there are only about thirteen hundred students. If student numbers
are used as keys in a table, it would not make sense to allocate an array
of one billion storage locations just to store thirteen hundred records.
Using students' names as keys would be even worse, since in Pascal there's
just no way to use a string as an array subscript.
There are, however, various alternatives. They all call for storing the
key along with the value, in a record that can be examined during the
search. One possibility is to build a singly-linked list of key-and-value
records, using linear search for the search operation and
linear insertion -- keeping the list sorted by key -- for the
insert operation. A better candidate, if the number of values
is at all large, would be a binary search tree.
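The linked-list alternative might be sketched as follows. This is a minimal illustration under stated assumptions, not a replacement for the modules in these notes: keys are long integers, values are integers, and the SearchList and InsertInList names are invented here.

```pascal
program SortedListTable;

{ A minimal sketch of the linked-list alternative: key-and-value
  records kept in ascending order of key.  Integer keys and values and
  all identifiers here are assumptions made for illustration. }

type
  Key = LongInt;
  Value = Integer;
  Link = ^Component;
  Component = record
    K: Key;
    V: Value;
    Next: Link
  end;

var
  Store: Link;
  WasFound: Boolean;
  Retrieved: Value;

{ Linear search; because the list is sorted, the search can stop as
  soon as it reaches a key that is not less than Opener. }
procedure SearchList (Head: Link; Opener: Key; var Found: Boolean;
                      var Sought: Value);
var
  Looking: Boolean;
begin
  Found := False;
  Looking := True;
  while Looking and (Head <> nil) do
    if Head^.K < Opener then
      Head := Head^.Next
    else begin
      Looking := False;
      if Head^.K = Opener then begin
        Found := True;
        Sought := Head^.V
      end
    end
end;

{ Linear insertion: advance to the first component whose key is not
  less than Opener, then either overwrite its value or splice a new
  component in front of it. }
procedure InsertInList (var Head: Link; Opener: Key; Associate: Value);
var
  Fresh: Link;
begin
  if Head = nil then begin
    New (Head);
    Head^.K := Opener;
    Head^.V := Associate;
    Head^.Next := nil
  end
  else if Head^.K = Opener then
    Head^.V := Associate
  else if Head^.K < Opener then
    InsertInList (Head^.Next, Opener, Associate)
  else begin
    New (Fresh);
    Fresh^.K := Opener;
    Fresh^.V := Associate;
    Fresh^.Next := Head;
    Head := Fresh
  end
end;

begin
  Store := nil;
  InsertInList (Store, 300, 3);
  InsertInList (Store, 100, 1);
  InsertInList (Store, 200, 2);
  SearchList (Store, 200, WasFound, Retrieved);
  if WasFound then
    WriteLn ('200 -> ', Retrieved)
  else
    WriteLn ('200 not found');
  SearchList (Store, 250, WasFound, Retrieved);
  WriteLn ('250 present? ', WasFound)
end.
```

Both operations may have to walk the whole list, so their running time grows in proportion to the number of values stored.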
However, both of these implementation structures are less efficient than
the array implementation, in which the search and
insert operations can be performed in constant time,
regardless of the number of values already stored in the table; they are
said to be O(1) operations. (Linear search takes time proportional to the
number of values stored, and a search in a binary search tree takes time
proportional to the height of the tree.) A different implementation
structure, known as a hash table, makes it possible to approach the
constant-time performance of arrays, at least on average, even when the
keys are not suitable array subscripts.
The idea is to interpose some computation between the key and the array
subscript -- a computation that is typically encapsulated in a hash
function that takes keys as arguments and returns subscripts into some
appropriately sized array as values. When inserting an item into a hash
table, one applies this function to the key and places the key-and-value
record at the position it specifies; to recover it again, one again applies
the hash function to the key and looks at the position indicated by the
result.
Of course, if the hash function maps a gigantic range of possible keys into a much smaller range of array subscripts, it can't be one-to-one. On the contrary, it is inevitable that there will be cases in which the hash function assigns the same array subscript to different keys. When the distinct keys of two elements of the hash table are mapped to the same array subscript, a collision occurs. The implementer of a hash table must provide some mechanism for resolving collisions, that is, for finding an alternative storage location for an element that cannot be stored in a previously occupied position proposed by the hash function.
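A collision is easy to provoke deliberately. The sketch below assumes a hash function that simply takes the remainder on division by the table size; the table size of 1609, the sample ID, and the identifiers are assumptions made for illustration.

```pascal
program CollisionDemo;

{ A minimal sketch: two distinct nine-digit keys that differ by a
  multiple of the table size land on the same subscript.  The sample
  IDs and the table size are illustrative assumptions. }

const
  ArraySize = 1609;

type
  Key = LongInt;   { wide enough for nine-digit IDs }

var
  First, Second: Key;

function HashKey (Opener: Key): Integer;
begin
  HashKey := Opener mod ArraySize + 1   { subscripts run from 1 }
end;

begin
  First := 123456789;
  Second := First + 5 * ArraySize;   { a different key, same remainder }
  WriteLn (First, ' hashes to ', HashKey (First));
  WriteLn (Second, ' hashes to ', HashKey (Second));
  if HashKey (First) = HashKey (Second) then
    WriteLn ('collision: both keys map to the same subscript')
end.
```

With a billion possible keys and 1609 subscripts, each subscript is shared by more than six hundred thousand possible keys, so no choice of hash function can rule collisions out.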
Since a Pascal array has a fixed size that is determined at compilation time, one must know in advance the maximum number of values that will be stored in a given hash table and choose a size that is at least that large. In fact, for good performance, the array should be about twenty percent larger than that maximum; otherwise the time required to deal with collisions, which rises very sharply when the hash table is almost full, will outweigh any decrease in running time obtained by using a random-access data structure.
There are various mechanisms for resolving collisions. The earliest proposal was to use the array subscript returned by the hash function as the starting point for a linear search through adjacent locations within the table, incrementing the subscript until an empty position is found; as soon as the linear search encounters a position that is not already occupied, the incoming element can be inserted. If the end of the array is encountered before an unused location is reached, the search ``wraps around'' to the beginning and continues from there.
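Linear probing with wrap-around might be sketched as follows. A deliberately tiny table is used so that the probe sequence can be followed by hand; positive integer keys, the use of 0 to mark an empty slot, and all the identifiers are assumptions made for illustration.

```pascal
program LinearProbingDemo;

{ A minimal sketch of linear probing with wrap-around.  Keys are
  positive integers, 0 marks an empty slot, and every identifier is an
  assumption made for illustration. }

const
  ArraySize = 7;
  Empty = 0;

type
  PositionNumber = 1 .. ArraySize;
  SlotArray = array [PositionNumber] of Integer;

var
  Slots: SlotArray;
  Index: PositionNumber;

function HashKey (Opener: Integer): PositionNumber;
begin
  HashKey := Opener mod ArraySize + 1
end;

{ Start at the hashed position and step forward one slot at a time,
  wrapping from the last slot to the first, until the key or an empty
  slot turns up.  The caller must keep at least one slot empty, or an
  unsuccessful search would never end. }
function FindSlot (var Store: SlotArray; Opener: Integer): PositionNumber;
var
  Position: PositionNumber;
begin
  Position := HashKey (Opener);
  while (Store[Position] <> Empty) and (Store[Position] <> Opener) do
    if Position = ArraySize then
      Position := 1
    else
      Position := Position + 1;
  FindSlot := Position
end;

procedure InsertKey (var Store: SlotArray; Opener: Integer);
begin
  Store[FindSlot (Store, Opener)] := Opener
end;

begin
  for Index := 1 to ArraySize do
    Slots[Index] := Empty;
  InsertKey (Slots, 13);   { 13 mod 7 + 1 = 7 }
  InsertKey (Slots, 20);   { also hashes to 7; wraps around to 1 }
  InsertKey (Slots, 27);   { also hashes to 7; probes 7, 1, then 2 }
  for Index := 1 to ArraySize do
    WriteLn ('slot ', Index, ': ', Slots[Index])
end.
```

The three keys end up in slots 7, 1, and 2, so a later key that hashes to 1 or 2 must also be pushed along, even though it never collided with a key of its own hash value.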
This linear probing strategy, however, performs poorly as the table fills up, because the data tend to clump together, leading to long stretches of occupied slots separated by sparsely occupied stretches. A better idea, called secondary hashing (also known as double hashing), applies another hash function to the key to figure out how many positions in the array to jump over, after finding an occupied position, before trying to insert the new value again.
It turns out to be much easier to write a satisfactory hash function if the
size of the hash table is a prime number, and for secondary hashing it is
also helpful if that size is two greater than another prime number. Here's
an implementation of hash tables, using secondary hashing, that can store
up to about 1350 values without encountering too many collisions. A
FullTable function is provided so that the application
programmer can detect that this limit has been reached; by adjusting the
MaximumLoad constant, one can permit additional entries, with
declining efficiency, up to one less than the actual size of the array.
(But if every position in the array were occupied, the
SearchTable procedure would not work correctly -- on an
unsuccessful search, it would enter an infinite loop. So
MaximumLoad must be strictly less than
ArraySize.)
module Tables;

$search 'keys.o, values.o'$

import
  Keys, Values;

export

  type
    Table = ^TableRecord;

  { The CreateTable function constructs and returns an empty table. }
  function CreateTable: Table;

  { The SearchTable procedure looks in a given table for a value associated
    with a specified key, setting Found to True if it finds such a value
    and to False if it does not.  In addition, if the search is
    successful, the value is returned as the value of the parameter
    Sought. }
  procedure SearchTable (Store: Table; Opener: Key; var Found: Boolean;
                         var Sought: Value);

  { The InsertInTable procedure associates a given key with a given value
    in a given table. }
  procedure InsertInTable (var Store: Table; Opener: Key;
                           Associate: Value);

  { The FullTable function determines whether new elements can be added to
    the table, returning True if the table is already full and False if
    there is room for at least one more value. }
  function FullTable (Store: Table): Boolean;

  { The DeallocateTable procedure disposes of all the storage
    associated with the hash table, leaving its argument undefined. }
  procedure DeallocateTable (var Store: Table);

implement

  const
    ArraySize = 1609;
      { the number of slots in the underlying array; suitable for
        secondary hashing, because both 1607 and 1609 are prime }
    MaximumLoad = 1350;
      { the largest number of values that can be accommodated without an
        excessive number of collisions }

  type
    PositionNumber = 1 .. ArraySize;
      { range of position numbers in the hash table }
    KeyAndValue = record
                    K: Key;
                    V: Value
                  end;
      { information to be stored in one slot of the hash table }
    LoadRange = 0 .. MaximumLoad;
      { A hash table's ``load'' is the number of elements stored in it;
        this is the range of possible loads in this implementation. }
    TableRecord = record
                    Arr: array [PositionNumber] of KeyAndValue;
                    Load: LoadRange
                  end;
      { The Load field keeps track of the number of positions actually in
        use. }

  { The InUse function determines whether a given position in a hash table
    is actually occupied by a value previously inserted. }
  function InUse (Store: Table; Position: PositionNumber): Boolean;
  begin
    { Assert (Store <> nil); }
    InUse := not AbsentValue (Store^.Arr[Position].V)
  end;

  { The FindPosition procedure determines whether a value associated with a
    given key is present in a given hash table, setting Found to True if it
    is and to False if it is not; the variable parameter Position is set to
    the position within the table occupied by the value associated with
    that key, if it is present, and otherwise to an empty position
    appropriate for inserting a new element with the specified key. }
  procedure FindPosition (Store: Table; Opener: Key;
                          var Found: Boolean; var Position: PositionNumber);
  var
    Looking: Boolean;
      { indicates whether the search is to continue beyond the present
        position }
  begin
    { Assert (Store <> nil); }
    with Store^ do begin
      Looking := True;
      Position := HashKey (Opener, ArraySize);
      while Looking do
        { Test InUse first: the key field of an unused slot has never
          been assigned, so comparing it with Opener is meaningless. }
        if not InUse (Store, Position) then begin
          Found := False;
          Looking := False
        end
        else if EqualKeys (Arr[Position].K, Opener) then begin
          Found := True;
          Looking := False
        end
        else
          Position := ProbeWithKey (Position, Opener, ArraySize)
    end
  end;

  function CreateTable: Table;
  var
    Result: Table;
      { the hash table under construction }
    Index: PositionNumber;
      { counts off the positions in the hash table }
  begin
    New (Result);
    with Result^ do begin
      Load := 0;
      for Index := 1 to ArraySize do
        AssignAbsentValue (Arr[Index].V)
    end;
    CreateTable := Result
  end;

  procedure SearchTable (Store: Table; Opener: Key; var Found: Boolean;
                         var Sought: Value);
  var
    Position: PositionNumber;
      { the position at which the item is found }
  begin
    { Assert (Store <> nil); }
    FindPosition (Store, Opener, Found, Position);
    if Found then
      AssignValue (Sought, Store^.Arr[Position].V)
  end;

  procedure InsertInTable (var Store: Table; Opener: Key;
                           Associate: Value);
  var
    Found: Boolean;
      { indicates whether the position was already occupied }
    Position: PositionNumber;
      { the position at which the new entry is to be inserted }
  begin
    { Assert (Store <> nil); }
    FindPosition (Store, Opener, Found, Position);
    { Assert (Found or not FullTable (Store)); }
    with Store^, Arr[Position] do begin
      AssignValue (V, Associate);
      if not Found then begin
        AssignKey (K, Opener);
        Load := Load + 1
      end
    end
  end;

  function FullTable (Store: Table): Boolean;
  begin
    { Assert (Store <> nil); }
    FullTable := (Store^.Load = MaximumLoad)
  end;

  procedure DeallocateTable (var Store: Table);
  var
    Index: PositionNumber;
      { counts off the positions in the hash table }
  begin
    { Assert (Store <> nil); }
    with Store^ do
      for Index := 1 to ArraySize do
        if InUse (Store, Index) then
          with Arr[Index] do begin
            DeallocateKey (K);
            DeallocateValue (V)
          end;
    { The table itself is disposed of only once, after every slot has
      been examined. }
    Dispose (Store);
    Store := nil
  end;

end.
This implementation makes several demands on the Keys and
Values modules that it imports, but most of them are trivial
to fulfill: It must be possible to test keys for equality, to assign a key
to a key variable (perhaps by copying it) and a value to a value variable,
and to deallocate keys and values. It must be possible, with the
AssignAbsentValue procedure, to store something in a variable
of type Value that indicates that the location is not in use
-- some dummy value that can play the role of Absent in the
earlier implementation -- and it must be possible to detect this dummy
value with the AbsentValue function.
Most importantly, however, one must write a function HashKey
that takes a key as an argument and returns a value that is an index into
the hash table's array, and a function ProbeWithKey that
provides an alternative index (derived from a key, the previously obtained
index, and the size of the hash table's array) when a collision has
occurred. There are two constraints on these functions: (1) Since
HashKey in particular will be invoked very frequently, it
should be simple and fast. (2) Since hash tables work best when the
collision rate is low, HashKey and ProbeWithKey
should ``randomize'' the keys; in other words, the values they produce
should not conform to any pattern that may characterize the keys.
When the keys are integers and their range is many times larger than the
range of hash table indices, a simple and commonly used hash function
divides each key by the size of the hash table and returns the remainder
(adding 1 if the array subscripts start at 1 rather than at 0). In many
applications, this method has too high a collision rate if the hash table
size has any small divisors, which is one reason for choosing a prime
number as the size of the hash table. The other reason is that when the
table size is prime, a ProbeWithKey function based on a linear
transformation, applied iteratively to its own outputs,
will generate all the hash table indices before repeating.
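The effect of small divisors can be seen in a small experiment. The sketch below assumes keys that are all multiples of 10 (a pattern chosen only for illustration) and counts how many distinct remainders they produce for a table of size 1000 and for one of size 1009, which is prime; the identifiers are likewise assumptions.

```pascal
program DivisorDemo;

{ A minimal sketch: keys that are all multiples of 10 are hashed by the
  remainder method into a table of size 1000 and into one of size 1009
  (a prime).  The key pattern, the sizes, and the identifiers are
  assumptions made for illustration. }

const
  KeyCount = 500;
  MaxSize = 1009;

type
  SlotFlags = array [0 .. MaxSize] of Boolean;

var
  Used: SlotFlags;

{ Count how many distinct remainders the keys 0, 10, 20, ... produce
  for a table of the given size. }
function DistinctSlots (TableSize: Integer): Integer;
var
  I, Slot, Count: Integer;
begin
  for I := 0 to MaxSize do
    Used[I] := False;
  Count := 0;
  for I := 0 to KeyCount - 1 do begin
    Slot := (10 * I) mod TableSize;
    if not Used[Slot] then begin
      Used[Slot] := True;
      Count := Count + 1
    end
  end;
  DistinctSlots := Count
end;

begin
  WriteLn ('size 1000: ', DistinctSlots (1000), ' distinct slots');
  WriteLn ('size 1009: ', DistinctSlots (1009), ' distinct slots')
end.
```

With size 1000 the five hundred keys crowd into only 100 slots, because the keys and the table size share the divisor 10; with the prime size 1009 every key gets a slot of its own.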
Here is how the HashKey and ProbeWithKey
functions might look in the Keys module:
type
  Key = 0 .. MaxInt;

{ The HashKey function maps any non-negative integer key into some position
  within the hash table, in a pseudo-random way. }
function HashKey (Opener: Key; ArraySize: Integer): Integer;
begin
  HashKey := Opener mod ArraySize + 1
end;

{ The ProbeWithKey function implements a pseudo-random permutation of the
  positions in the hash table; given any position, it yields another
  position, using a formula that makes the result dependent on the value
  of the key (assumed to be a non-negative integer). }
function ProbeWithKey (LastPosition: Integer; Opener: Key;
                       ArraySize: Integer): Integer;
var
  HashedKey: Integer;
    { an independent pseudo-random number derived from the search key }
begin
  HashedKey := Opener mod (ArraySize - 2) + 1;
  ProbeWithKey := (LastPosition + HashedKey) mod ArraySize + 1
end;
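That repeated probing reaches every position can be checked on a small scale. The sketch below copies the two formulas above into a stand-alone program and tries them on a table of size 13, which is prime and two greater than the prime 11; the small size, the sample key, and the counting loop are assumptions made for illustration.

```pascal
program ProbeCycleDemo;

{ A minimal sketch: the HashKey and ProbeWithKey formulas, tried on a
  small table whose size is prime and two greater than another prime.
  The sample key and the counting loop are assumptions. }

const
  ArraySize = 13;

type
  Key = 0 .. MaxInt;

var
  Visited: array [1 .. ArraySize] of Boolean;
  Index, Step, Position, Distinct: Integer;
  Sample: Key;

function HashKey (Opener: Key; ArraySize: Integer): Integer;
begin
  HashKey := Opener mod ArraySize + 1
end;

function ProbeWithKey (LastPosition: Integer; Opener: Key;
                       ArraySize: Integer): Integer;
var
  HashedKey: Integer;
begin
  HashedKey := Opener mod (ArraySize - 2) + 1;
  ProbeWithKey := (LastPosition + HashedKey) mod ArraySize + 1
end;

begin
  Sample := 12345;
  for Index := 1 to ArraySize do
    Visited[Index] := False;
  Distinct := 0;
  Position := HashKey (Sample, ArraySize);
  { Probe ArraySize times, recording each position the first time it
    is seen. }
  for Step := 1 to ArraySize do begin
    if not Visited[Position] then begin
      Visited[Position] := True;
      Distinct := Distinct + 1
    end;
    Position := ProbeWithKey (Position, Sample, ArraySize)
  end;
  WriteLn (Distinct, ' of ', ArraySize, ' positions visited')
end.
```

Each probe advances the position by HashedKey + 1 places (counting circularly), a step that is never a multiple of the table size; since that size is prime, thirteen probes visit all thirteen positions.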
When the keys are real numbers, or when they are integers in a relatively
narrow range, another common hash function involves multiplying by some
irrational number, discarding the integer part of the result, and then
multiplying by the size of the hash table and discarding the fractional
part. One frequently chosen irrational is phi, the limit of the ratios
between successive Fibonacci numbers, (1 + sqrt(5))/2.
When the key is a character string, the hash function usually works by converting the key into an integer or real value and then applying one of the preceding techniques. Summing the ordinal values of the characters in the string often fails to disperse the keys sufficiently. Here are two better methods:
(1) If the number of entries is typically small, use a hash table of size 128, and compute the hash function by treating each character as a sequence of seven bits, performing a bitwise exclusive-or operation to combine it with a ``running total'' and doing a one-bit circular shift on the result (moving each bit one position leftwards and then removing the leftmost bit and placing it at the right end) after each such operation.
(2) Add each character's ordinal value to a running total, multiplying by some constant (perhaps phi) after each addition; then discard the integer part of the final total, multiply by the size of the hash table, and discard the fractional part.
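Two of these methods might be sketched as follows: the multiplicative method for numeric keys and method (1) for strings. The function names, the sample keys, the restriction to non-negative keys, and the use of the xor, shl, shr, and and operators on integers (available in Free Pascal and Turbo Pascal, but not in ISO Pascal) are assumptions made for illustration.

```pascal
program HashFunctionSketches;

{ Minimal sketches of two of the methods described above.  All names
  are assumptions; the bitwise operators are a dialect extension. }

const
  Phi = 1.6180339887;   { (1 + sqrt(5))/2, rounded }

{ Multiplicative method, for non-negative keys: keep the fractional
  part of Opener * Phi, scale it by the table size, and drop the
  fraction.  The result lies in 0 .. TableSize - 1. }
function MultiplicativeHash (Opener: Real; TableSize: Integer): Integer;
var
  Product: Real;
begin
  Product := Opener * Phi;
  Product := Product - Int (Product);
  MultiplicativeHash := Trunc (Product * TableSize)
end;

{ Method (1) for strings: fold the low seven bits of each character
  into a seven-bit running total with exclusive-or, then rotate the
  total one bit to the left.  The result lies in 0 .. 127. }
function RotatingHash (const Opener: string): Integer;
var
  I, Total: Integer;
begin
  Total := 0;
  for I := 1 to Length (Opener) do begin
    Total := Total xor (Ord (Opener[I]) and 127);
    { shift left, drop the bit that fell off, and bring the old top
      bit around to the right end }
    Total := ((Total shl 1) and 127) or (Total shr 6)
  end;
  RotatingHash := Total
end;

begin
  WriteLn ('MultiplicativeHash (3, 128) = ', MultiplicativeHash (3, 128));
  WriteLn ('RotatingHash (''begin'') = ', RotatingHash ('begin'));
  WriteLn ('RotatingHash (''end'') = ', RotatingHash ('end'))
end.
```

Both functions return indices that start at 0, so 1 must be added to the result when the array subscripts start at 1.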
Recently, one active area of research in computer science has been devising algorithms to find hash functions that are tailored to specific values, in the sense that among those particular values no collisions whatever will take place. For instance, one might design such a function for Pascal's reserved words and predefined identifiers, so that, when a compiler's symbol table is implemented as a hash table, these frequently occurring strings will not cause unnecessary collisions.
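The goal can be illustrated with a naive search, which is not one of the research algorithms alluded to above: try successive multipliers in a simple string hash until one is found that sends a fixed set of words to distinct slots. The word list, the table size of 31, the hash formula, and every identifier are assumptions made for illustration, and a Free Pascal-compatible compiler is assumed.

```pascal
program PerfectHashSearch;

{ A naive illustration only: search for a multiplier under which a
  simple string hash is collision-free on one fixed list of words.
  The list, the table size, and the formula are all assumptions. }

const
  WordCount = 8;
  TableSize = 31;
  MaxMultiplier = TableSize - 1;
    { larger multipliers only repeat these, modulo TableSize }

type
  WordList = array [1 .. WordCount] of string;

const
  Words: WordList =
    ('begin', 'end', 'if', 'then', 'else', 'while', 'do', 'var');

var
  Multiplier: Integer;
  Done: Boolean;

{ Running total times Multiplier plus the next character, reduced
  modulo the table size at every step to keep the numbers small. }
function WordHash (const W: string; Multiplier: Integer): Integer;
var
  I: Integer;
  Total: LongInt;
begin
  Total := 0;
  for I := 1 to Length (W) do
    Total := (Total * Multiplier + Ord (W[I])) mod TableSize;
  WordHash := Total
end;

{ True when no two words in the list share a slot. }
function CollisionFree (Multiplier: Integer): Boolean;
var
  Taken: array [0 .. TableSize - 1] of Boolean;
  I, Slot: Integer;
  Ok: Boolean;
begin
  for I := 0 to TableSize - 1 do
    Taken[I] := False;
  Ok := True;
  for I := 1 to WordCount do begin
    Slot := WordHash (Words[I], Multiplier);
    if Taken[Slot] then
      Ok := False
    else
      Taken[Slot] := True
  end;
  CollisionFree := Ok
end;

begin
  Multiplier := 1;
  Done := False;
  while not Done and (Multiplier <= MaxMultiplier) do
    if CollisionFree (Multiplier) then
      Done := True
    else
      Multiplier := Multiplier + 1;
  if Done then
    WriteLn ('multiplier ', Multiplier, ' gives no collisions')
  else
    WriteLn ('no collision-free multiplier up to ', MaxMultiplier)
end.
```

A function found this way is collision-free only for the listed words; any other identifier may still collide with one of them or with another identifier.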
Still another method for handling collisions -- perhaps the one most frequently used today -- is to implement the hash table, not as an array of elements, but as an array of singly-linked lists of elements. The hash function is applied to the key to determine which of these lists the new element should be added to; in the event of a collision, one simply puts all of the elements that hash to the same array subscript into the same list (often, in this context, called a bucket). In contrast to linear probing and secondary hashing, this method has the advantage that it can if necessary accommodate more elements than there are positions in the array, though with a progressive degradation of performance as the average list grows longer and the linear search down such a list comes to occupy a larger fraction of the running time.
Also, it is far easier to delete an element from a hash table that uses buckets than from one that uses linear probing or secondary hashing as its collision-resolution mechanism, so bucketing is the method of choice for a table that must support a delete operation. But in most applications of tables either deletions are never needed, or they can be saved up and performed at a time when the table must be completely rebuilt anyway.
Here's a version of the Tables module that uses buckets:
module TablesWithBuckets;

$search 'keys.o, values.o'$

import
  Keys, Values;

export

  type
    Table = ^TableRecord;

  { The CreateTable function constructs and returns an empty table. }
  function CreateTable: Table;

  { The SearchTable procedure looks in a given table for a value associated
    with a specified key, setting Found to True if it finds such a value
    and to False if it does not.  In addition, if the search is
    successful, the value is returned as the value of the parameter
    Sought. }
  procedure SearchTable (Store: Table; Opener: Key; var Found: Boolean;
                         var Sought: Value);

  { The InsertInTable procedure associates a given key with a given value
    in a given table. }
  procedure InsertInTable (var Store: Table; Opener: Key;
                           Associate: Value);

  { The FullTable function determines whether new elements can be added to
    the table, returning True if the table is already full and False if
    there is room for at least one more value. }
  function FullTable (Store: Table): Boolean;

  { The DeallocateTable procedure disposes of all the storage
    associated with the hash table, leaving its argument undefined. }
  procedure DeallocateTable (var Store: Table);

implement

  const
    ArraySize = 1609;
      { the number of slots in the underlying array; a prime, which helps
        a remainder-based HashKey function disperse the keys }

  type
    PositionNumber = 1 .. ArraySize;
      { range of position numbers in the hash table }
    KeyAndValue = record
                    K: Key;
                    V: Value
                  end;
      { information to be stored in one component of a bucket }
    Link = ^Component;
    Component = record
                  Datum: KeyAndValue;
                  Next: Link
                end;
      { components of a singly-linked list of key-and-value records }
    TableRecord = array [PositionNumber] of Link;
      { A table is an array of these singly-linked lists. }

  function CreateTable: Table;
  var
    Result: Table;
      { the hash table under construction }
    Index: PositionNumber;
      { counts off the positions in the hash table }
  begin
    New (Result);
    for Index := 1 to ArraySize do
      Result^[Index] := nil;
    CreateTable := Result
  end;

  procedure SearchTable (Store: Table; Opener: Key; var Found: Boolean;
                         var Sought: Value);
  var
    Traverser: Link;
      { points to successive components of the bucket in which the
        value sought might be found }
  begin
    { Assert (Store <> nil); }
    Traverser := Store^[HashKey (Opener, ArraySize)];
    Found := False;
    while not Found and (Traverser <> nil) do
      if EqualKeys (Traverser^.Datum.K, Opener) then
        Found := True
      else
        Traverser := Traverser^.Next;
    if Found then
      AssignValue (Sought, Traverser^.Datum.V)
  end;

  procedure InsertInTable (var Store: Table; Opener: Key;
                           Associate: Value);

    { The InsertOrAppend procedure checks to see whether the list it is
      given is empty.  If so, it replaces that list with a newly allocated,
      one-element list in which the only element contains the key Opener
      and the value Associate.  Otherwise, it checks to see whether the
      key of the component at the head of the list matches Opener.  If so,
      it simply overwrites the value stored in that component with
      Associate.  Otherwise, it ignores the current component and calls
      itself recursively to advance down the list until either the end is
      reached or a component with a key matching Opener is found. }
    procedure InsertOrAppend (var HeadOfList: Link);
    begin
      if HeadOfList = nil then begin
        New (HeadOfList);
        HeadOfList^.Datum.K := Opener;
        HeadOfList^.Datum.V := Associate;
        HeadOfList^.Next := nil
      end
      else if EqualKeys (HeadOfList^.Datum.K, Opener) then
        HeadOfList^.Datum.V := Associate
      else
        InsertOrAppend (HeadOfList^.Next)
    end;

  begin { procedure InsertInTable }
    { Assert (Store <> nil); }
    InsertOrAppend (Store^[HashKey (Opener, ArraySize)])
  end;

  function FullTable (Store: Table): Boolean;
  begin
    FullTable := False { buckets never fill }
  end;

  procedure DeallocateTable (var Store: Table);
  var
    Index: PositionNumber;
      { counts off the positions in the hash table }

    procedure DeallocateList (var HeadOfList: Link);
    begin
      if HeadOfList <> nil then begin
        DeallocateList (HeadOfList^.Next);
        DeallocateKey (HeadOfList^.Datum.K);
        DeallocateValue (HeadOfList^.Datum.V);
        Dispose (HeadOfList)
      end
    end;

  begin { procedure DeallocateTable }
    { Assert (Store <> nil); }
    for Index := 1 to ArraySize do
      DeallocateList (Store^[Index]);
    Dispose (Store);
    Store := nil
  end;

end.
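The module above does not provide a delete operation, but the ease of deletion claimed for buckets can be illustrated on a single list. In the sketch below, integer keys and values stand in for the imported Key and Value types, and the Push, DeleteFromBucket, and BucketLength names are assumptions; a table-level delete would apply DeleteFromBucket to the bucket selected by the hash function.

```pascal
program BucketDeleteDemo;

{ A minimal sketch of deletion from one bucket.  Integer keys and
  values and all identifiers are assumptions made for illustration. }

type
  Link = ^Component;
  Component = record
    K: Integer;
    V: Integer;
    Next: Link
  end;

var
  Bucket: Link;

{ Put a new component at the head of the list. }
procedure Push (var HeadOfList: Link; Opener, Associate: Integer);
var
  Fresh: Link;
begin
  New (Fresh);
  Fresh^.K := Opener;
  Fresh^.V := Associate;
  Fresh^.Next := HeadOfList;
  HeadOfList := Fresh
end;

{ Unlink and dispose of the component whose key matches Opener, if
  there is one.  Because HeadOfList is a variable parameter, assigning
  to it rewrites whichever pointer led to the component: the bucket
  variable itself or the Next field of the preceding component. }
procedure DeleteFromBucket (var HeadOfList: Link; Opener: Integer);
var
  Doomed: Link;
begin
  if HeadOfList <> nil then begin
    if HeadOfList^.K = Opener then begin
      Doomed := HeadOfList;
      HeadOfList := HeadOfList^.Next;
      Dispose (Doomed)
    end
    else
      DeleteFromBucket (HeadOfList^.Next, Opener)
  end
end;

{ Count the components in a list. }
function BucketLength (HeadOfList: Link): Integer;
var
  Count: Integer;
begin
  Count := 0;
  while HeadOfList <> nil do begin
    Count := Count + 1;
    HeadOfList := HeadOfList^.Next
  end;
  BucketLength := Count
end;

begin
  Bucket := nil;
  Push (Bucket, 10, 1);
  Push (Bucket, 20, 2);
  Push (Bucket, 30, 3);
  DeleteFromBucket (Bucket, 20);
  WriteLn ('components left: ', BucketLength (Bucket))
end.
```

Removing a component disturbs nothing else in the bucket. With linear probing or secondary hashing, by contrast, simply marking a slot as empty could cut off a probe sequence that passes through it, so that elements stored beyond it would no longer be found.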