Hash tables and hash functions

A table is a data structure that stores zero or more values, each of which is associated with and accessed through a distinct key, just as an array element is associated with and accessed through the index indicating the position in the array at which it is stored. A typical small example is a structure that might be used in a program that plays a word game in which each letter of a word is worth some number of points (Scrabble, for instance). The table would contain twenty-six integer values, one for each possible letter key; although the values may include duplicates, the keys do not. Here are the operations normally required for tables:

create
Inputs: none.
Output: result, a table.
Preconditions: none.
Postcondition: result contains no values.

search
Inputs: store, a table, and opener, a key.
Outputs: found, a Boolean, and sought, a value of the type stored in store.
Preconditions: none.
Postcondition: Either found is false and opener is not associated with any value stored in store, or found is true and sought is the value associated with opener in store.

insert
Inputs: store, a table; opener, a key; associate, a value of the type stored in store.
Outputs: none.
Preconditions: none.
Postconditions: At output, store contains associate and associates that value with opener; with any other key, it associates the same value as it did at input.

For many tables, including the one in the word-game example, the range of possible keys is a small ordinal type, and then the simplest and most natural implementation is an array in which the keys themselves are used as subscripts:

module SimpleTables;

export

  type
    Key = 'A' .. 'Z';
    Value = Integer;
    Table = ^TableRecord;

  { The CreateTable function constructs and returns an empty table. }

  function CreateTable: Table;

  { The SearchTable procedure looks in a given table for a value associated
    with a specified key, setting Found to True if it finds such a value
    and to False if it does not.  In addition, if the search is successful,
    the value is returned as the value of the parameter Sought. }

  procedure SearchTable (Store: Table; Opener: Key; var Found: Boolean;
    var Sought: Value);

  { The InsertInTable procedure associates a given key with a given value
    in a given table. }

  procedure InsertInTable (var Store: Table; Opener: Key;
    Associate: Value);

  { The DeallocateTable procedure disposes of all the storage
    associated with the table, leaving its argument undefined. }

  procedure DeallocateTable (var Store: Table);

implement

  const
    Absent = -1;
      { a conventional indication that no value is associated with a given
        key }

  type
    TableRecord = array [Key] of Value;

  function CreateTable: Table;
  var
    Result: Table;
    Index: Key;
  begin
    New (Result);
    for Index := 'A' to 'Z' do
      Result^[Index] := Absent;
    CreateTable := Result
  end;

  procedure SearchTable (Store: Table; Opener: Key; var Found: Boolean;
    var Sought: Value);
  begin
    { Assert (Store <> nil); }
    if Store^[Opener] = Absent then
      Found := False
    else begin
      Found := True;
      Sought := Store^[Opener]
    end
  end;

  procedure InsertInTable (var Store: Table; Opener: Key;
    Associate: Value);
  begin
    { Assert (Store <> nil); }
    Store^[Opener] := Associate
  end;

  procedure DeallocateTable (var Store: Table);
  begin
    { Assert (Store <> nil); }
    Dispose (Store);
    Store := nil
  end;

end.

However, there is a problem if the keys do not belong to an ordinal type or if the range of possible keys is vastly larger than the number of elements to be stored. For instance, at Grinnell College, student IDs are nine-digit numbers (that is, in the range from 000000000 to 999999999), while there are only about thirteen hundred students. If student numbers are used as keys in a table, it would not make sense to allocate an array of one billion storage locations just to store thirteen hundred records. Using students' names as keys would be even worse, since in Pascal there's just no way to use a string as an array subscript.

There are, however, various alternatives. They all call for storing the key along with the value, in a record that can be examined during the search. One possibility is to build a singly-linked list of key-and-value records, using linear search for the search operation and linear insertion -- keeping the list sorted by key -- for the insert operation. A better candidate, if the number of values is at all large, would be a binary search tree.
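The linked-list alternative might be sketched as follows. This is only an illustration under stated assumptions, not part of the modules in this handout: the names Node, SearchList, and InsertInList are invented here, keys are taken to be long integers and values integers, and the storage is never reclaimed.

```pascal
program ListTableSketch;

type
  Key = LongInt;
  Value = Integer;
  Link = ^Node;
  Node = record
           K: Key;
           V: Value;
           Next: Link
         end;

var
  Store: Link;
  Present: Boolean;
  Points: Value;

{ Linear search down a list sorted by key; it can stop as soon as it
  reaches a key greater than the one sought. }
procedure SearchList (Head: Link; Opener: Key; var Found: Boolean;
  var Sought: Value);
var
  Looking: Boolean;
begin
  Found := False;
  Looking := True;
  while Looking do
    if Head = nil then
      Looking := False
    else if Head^.K < Opener then
      Head := Head^.Next
    else begin
      Looking := False;
      if Head^.K = Opener then begin
        Found := True;
        Sought := Head^.V
      end
    end
end;

{ Linear insertion that keeps the list sorted by key. }
procedure InsertInList (var Head: Link; Opener: Key; Associate: Value);
var
  Fresh: Link;
  InsertHere: Boolean;
begin
  if Head = nil then
    InsertHere := True
  else
    InsertHere := Head^.K > Opener;
  if InsertHere then begin
    New (Fresh);
    Fresh^.K := Opener;
    Fresh^.V := Associate;
    Fresh^.Next := Head;
    Head := Fresh
  end
  else if Head^.K = Opener then
    Head^.V := Associate
  else
    InsertInList (Head^.Next, Opener, Associate)
end;

begin
  Store := nil;
  InsertInList (Store, 305, 3);
  InsertInList (Store, 101, 1);
  InsertInList (Store, 202, 2);
  SearchList (Store, 202, Present, Points);
  if Present then
    WriteLn ('202 -> ', Points)
  else
    WriteLn ('202 is absent')
end.
```

Both operations walk the list one component at a time, so their running time grows in proportion to the number of values stored.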

However, both of these implementation structures are less efficient than the array implementation, in which the search and insert operations can be performed in constant time, regardless of the number of values already stored in the table; they are said to be O(1) operations. A different implementation structure, known as a hash table, makes it possible to achieve the constant-time performance of arrays even when the keys are not suitable array subscripts. The idea is to interpose some computation between the key and the array subscript -- a computation that is typically encapsulated in a hash function that takes keys as arguments and returns subscripts into some appropriately sized array as values. When inserting an item into a hash table, one applies this function to the key and places the key-and-value record at the position it specifies; to recover it again, one again applies the hash function to the key and looks at the position indicated by the result.

Of course, if the hash function maps a gigantic range of possible keys into a much smaller range of array subscripts, it can't be one-to-one. On the contrary, it is inevitable that there will be cases in which the hash function assigns the same array subscript to different keys. When the distinct keys of two elements of the hash table are mapped to the same array subscript, a collision occurs. The implementer of a hash table must provide some mechanism for resolving collisions, that is, for finding an alternative storage location for an element that cannot be stored in a previously occupied position proposed by the hash function.
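A collision is easy to provoke. The sketch below assumes a hash function that takes the remainder on division by an array size of 1609 and adds 1; the function and the two keys are chosen purely for illustration.

```pascal
program CollisionSketch;

const
  ArraySize = 1609;

{ Hypothetical hash function: the remainder on division by the array
  size, shifted so that subscripts run from 1 to ArraySize. }
function HashKey (Opener: LongInt): Integer;
begin
  HashKey := Opener mod ArraySize + 1
end;

begin
  { 1614 = 5 + ArraySize, so both keys leave the remainder 5 and are
    assigned subscript 6. }
  WriteLn (HashKey (5));
  WriteLn (HashKey (1614));
  if HashKey (5) = HashKey (1614) then
    WriteLn ('collision')
end.
```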

Since a Pascal array has a fixed size that is determined at compilation time, one must know in advance the maximum number of values that will be stored in a given hash table and choose a size that is at least that large. In fact, for best performance, it should be about twenty percent larger than that maximum; otherwise the time required to deal with collisions, which rises very sharply when the hash table is almost full, will outweigh any decrease in running time obtained by using a random-access data structure.

There are various mechanisms for resolving collisions. The earliest proposal was to use the array subscript returned by the hash function as the starting point for a linear search through adjacent locations within the table, incrementing the subscript until an empty position is found; as soon as the linear search encounters a position that is not already occupied, the incoming element can be inserted. If the end of the array is encountered before an unused location is reached, the search ``wraps around'' to the beginning and continues from there.
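Here is a minimal sketch of that strategy. The seven-slot array, the Empty marker, and the InsertKey function are assumptions made for the example; the table is kept deliberately tiny so that the wraparound is visible.

```pascal
program LinearProbingSketch;

const
  ArraySize = 7;
  Empty = -1;   { marks an unused slot; keys are assumed non-negative }

type
  PositionNumber = 1 .. ArraySize;

var
  Slots: array [PositionNumber] of LongInt;
  Index: PositionNumber;

{ Returns the position at which Opener is stored, inserting it if it is
  not already present.  Assumes that at least one slot is empty;
  otherwise the loop would never terminate. }
function InsertKey (Opener: LongInt): PositionNumber;
var
  Position: PositionNumber;
begin
  Position := Opener mod ArraySize + 1;
  while (Slots[Position] <> Empty) and (Slots[Position] <> Opener) do
    Position := Position mod ArraySize + 1;   { step forward, wrapping }
  Slots[Position] := Opener;
  InsertKey := Position
end;

begin
  for Index := 1 to ArraySize do
    Slots[Index] := Empty;
  WriteLn (InsertKey (6));    { home position 7 is free }
  WriteLn (InsertKey (13));   { 7 is taken, so the search wraps to 1 }
  WriteLn (InsertKey (20))    { 7 and 1 are taken, so it lands at 2 }
end.
```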

This linear probing strategy, however, does not work well, because the data tend to clump together as the table fills up, leading to long stretches of occupied slots separated by sparsely occupied stretches. A better idea, called secondary hashing, applies another hash function to the key to figure out how many positions in the array to jump over, after finding an occupied position, before trying to insert the new value again.
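The same three keys used to illustrate linear probing can show the difference. In this sketch the seven-slot array, the Empty marker, and the InsertKey function are again assumptions; the jump is taken to be the key's remainder on division by two less than the array size, plus 1, and the stepping formula mirrors the one in the ProbeWithKey function given later in this handout.

```pascal
program SecondaryHashingSketch;

const
  ArraySize = 7;   { prime, and two greater than the prime 5 }
  Empty = -1;      { marks an unused slot; keys are assumed non-negative }

type
  PositionNumber = 1 .. ArraySize;

var
  Slots: array [PositionNumber] of LongInt;
  Index: PositionNumber;

{ Returns the position at which Opener is stored, inserting it if it is
  not already present.  Assumes that at least one slot is empty. }
function InsertKey (Opener: LongInt): PositionNumber;
var
  Position: PositionNumber;
  Jump: Integer;   { derived from the key by a second hash function }
begin
  Position := Opener mod ArraySize + 1;
  Jump := Opener mod (ArraySize - 2) + 1;
  while (Slots[Position] <> Empty) and (Slots[Position] <> Opener) do
    Position := (Position + Jump) mod ArraySize + 1;
  Slots[Position] := Opener;
  InsertKey := Position
end;

begin
  for Index := 1 to ArraySize do
    Slots[Index] := Empty;
  WriteLn (InsertKey (6));    { home position 7 is free }
  WriteLn (InsertKey (13));   { collides at 7, jumps to 5 }
  WriteLn (InsertKey (20))    { collides at 7, jumps to 2 }
end.
```

All three keys have the same home position, but because their jumps differ they scatter across the table instead of piling up in adjacent slots.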

It turns out to be much easier to write a satisfactory hash function if the size of the hash table is a prime number, and for secondary hashing it is also helpful if that size is two greater than another prime number. Here's an implementation of hash tables, using secondary hashing, that can store up to about 1350 values without encountering too many collisions. A FullTable function is provided so that the application programmer can detect that this limit has been reached; by adjusting the MaximumLoad constant, one can permit additional entries, with declining efficiency, up to one less than the actual size of the array. (But if every position in the array were occupied, the SearchTable procedure would not work correctly -- on an unsuccessful search, it would enter an infinite loop. So MaximumLoad must be strictly less than ArraySize.)

module Tables;

$search 'keys.o, values.o'$
import
  Keys, Values;

export

  type
    Table = ^TableRecord;

  { The CreateTable function constructs and returns an empty table. }

  function CreateTable: Table;

  { The SearchTable procedure looks in a given table for a value associated
    with a specified key, setting Found to True if it finds such a value
    and to False if it does not.  In addition, if the search is successful,
    the value is returned as the value of the parameter Sought. }

  procedure SearchTable (Store: Table; Opener: Key; var Found: Boolean;
    var Sought: Value);

  { The InsertInTable procedure associates a given key with a given value
    in a given table. }

  procedure InsertInTable (var Store: Table; Opener: Key;
    Associate: Value);

  { The FullTable function determines whether new elements can be added to
    the table, returning True if the table is already full and False if
    there is room for at least one more value. }

  function FullTable (Store: Table): Boolean;

  { The DeallocateTable procedure disposes of all the storage
    associated with the hash table, leaving its argument undefined. }

  procedure DeallocateTable (var Store: Table);

implement

  const
    ArraySize = 1609;
      { the number of slots in the underlying array; suitable for
        secondary hashing, because both 1607 and 1609 are prime }
    MaximumLoad = 1350;
      { the largest number of values that can be accommodated without an
        excessive number of collisions }

  type
    PositionNumber = 1 .. ArraySize;
      { range of position numbers in the hash table }
    KeyAndValue = record
                    K: Key;
                    V: Value
                  end;
      { information to be stored in one slot of the hash table }
    LoadRange = 0 .. MaximumLoad;
      { A hash table's ``load'' is the number of elements stored in it;
        this is the range of possible loads in this implementation. }
    TableRecord = record
            Arr: array [PositionNumber] of KeyAndValue;
            Load: LoadRange
          end;
    { The Load field keeps track of the number of positions actually in
      use. }

  { The InUse function determines whether a given position in a hash table
    is actually occupied by a value previously inserted. }

  function InUse (Store: Table; Position: PositionNumber): Boolean;
  begin
    { Assert (Store <> nil); }
    InUse := not AbsentValue (Store^.Arr[Position].V);
  end;

  { The FindPosition procedure determines whether a value associated with a
    given key is present in a given hash table, setting Found to True if it
    is and to False if it is not; the variable parameter Position is set to
    the position within the table occupied by the value associated with
    that key, if it is present, and otherwise to an empty position
    appropriate for inserting a new element with the specified key. }

  procedure FindPosition (Store: Table; Opener: Key;
    var Found: Boolean; var Position: PositionNumber);
  var
    Looking: Boolean;
      { indicates whether the search is to continue beyond the present
        position } 
  begin
    { Assert (Store <> nil); }
    with Store^ do begin
      Looking := True;
      Position := HashKey (Opener, ArraySize);
      while Looking do
        { Test InUse first: the key field of an unused slot is undefined. }
        if not InUse (Store, Position) then begin
          Found := False;
          Looking := False
        end
        else if EqualKeys (Arr[Position].K, Opener) then begin
          Found := True;
          Looking := False
        end
        else
          Position := ProbeWithKey (Position, Opener, ArraySize)
    end
  end;

  function CreateTable: Table;
  var
    Result: Table;
      { the hash table under construction }
    Index: PositionNumber;
      { counts off the positions in the hash table }
  begin
    New (Result);
    with Result^ do begin
      Load := 0;
      for Index := 1 to ArraySize do
        AssignAbsentValue (Arr[Index].V);
    end;
    CreateTable := Result
  end;
        
  procedure SearchTable (Store: Table; Opener: Key; var Found: Boolean;
    var Sought: Value);
  var
    Position: PositionNumber;
      { the position at which the item is found }
  begin
    { Assert (Store <> nil); }
    FindPosition (Store, Opener, Found, Position);
    if Found then
      AssignValue (Sought, Store^.Arr[Position].V)
  end;

  procedure InsertInTable (var Store: Table; Opener: Key;
    Associate: Value);
  var
    Found: Boolean;
      { indicates whether the position was already occupied }
    Position: PositionNumber;
      { the position at which the new entry is to be inserted }
  begin
    { Assert (Store <> nil); }
    FindPosition (Store, Opener, Found, Position);
    { Assert (Found or not FullTable (Store)); }
    with Store^, Arr[Position] do begin
      AssignValue (V, Associate);
      if not Found then begin
        AssignKey (K, Opener);
        Load := Load + 1
      end
    end
  end;

  function FullTable (Store: Table): Boolean;
  begin
    { Assert (Store <> nil); }
    FullTable := (Store^.Load = MaximumLoad)
  end;

  procedure DeallocateTable (var Store: Table);
  var
    Index: PositionNumber;
      { counts off the positions in the hash table }
  begin
    { Assert (Store <> nil); }
    with Store^ do
      for Index := 1 to ArraySize do
        if InUse (Store, Index) then
          with Arr[Index] do begin
            DeallocateKey (K);
            DeallocateValue (V)
          end;
    Dispose (Store);
    Store := nil
  end;

end.

This implementation makes several demands on the Keys and Values modules that it imports, but most of them are trivial to fulfill: It must be possible to test keys for equality, to assign a key to a key variable (perhaps by copying it) and a value to a value variable, and to deallocate keys and values. It must be possible, with the AssignAbsentValue procedure, to store something in a variable of type Value that indicates that the location is not in use -- some dummy value that can play the role of Absent in the earlier implementation -- and it must be possible to detect this dummy value with the AbsentValue function.
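For integer keys and values, these supporting routines might look something like the sketch below. The routine names are the ones the Tables module calls, but the parameter names, the choice of -1 as the dummy value, and the assumption that genuine values are never negative are all guesses made for illustration.

```pascal
program KeysAndValuesSketch;

const
  AbsentMarker = -1;   { assumption: genuine values are never negative }

type
  Key = 0 .. MaxInt;
  Value = Integer;

var
  K: Key;
  V: Value;

function EqualKeys (Left, Right: Key): Boolean;
begin
  EqualKeys := (Left = Right)
end;

procedure AssignKey (var Target: Key; Source: Key);
begin
  Target := Source
end;

procedure DeallocateKey (var Doomed: Key);
begin
  { nothing to release: an integer key occupies no dynamic storage }
end;

procedure AssignValue (var Target: Value; Source: Value);
begin
  Target := Source
end;

procedure AssignAbsentValue (var Target: Value);
begin
  Target := AbsentMarker
end;

function AbsentValue (Candidate: Value): Boolean;
begin
  AbsentValue := (Candidate = AbsentMarker)
end;

procedure DeallocateValue (var Doomed: Value);
begin
  { nothing to release here either }
end;

begin
  AssignAbsentValue (V);
  WriteLn (AbsentValue (V));   { the dummy value is detected }
  AssignKey (K, 42);
  AssignValue (V, 7);
  WriteLn (EqualKeys (K, 42), ' ', AbsentValue (V))
end.
```

If keys or values were strings or pointers to records, the assignment routines would presumably copy the data and the deallocation routines would dispose of it.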

Most importantly, however, one must write a function HashKey that takes a key as an argument and returns a value that is an index into the hash table's array, and a function ProbeWithKey that provides an alternative index (derived from a key, the previously obtained index, and the size of the hash table's array) when a collision has occurred. There are two constraints on these functions: (1) Since HashKey in particular will be invoked very frequently, it should be simple and fast. (2) Since hash tables work best when the collision rate is low, HashKey and ProbeWithKey should ``randomize'' the keys; in other words, the values they produce should not conform to any pattern that may characterize the keys.

When the keys are integers and their range is many times larger than the range of hash table indices, the best and most commonly used hash function divides each key by the size of the hash table and returns the remainder (adding 1 if the array subscripts start at 1 rather than at 0). In many applications, this method has too high a collision rate if the hash table size has any small divisors, which is one reason for choosing a prime number as the size of the hash table. The other reason is that when the table size is prime, a ProbeWithKey function based on a linear transformation, applied iteratively to its own outputs, will generate all the hash table indices before repeating.
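One way to pick a table size that meets both requirements is to search upwards from the desired minimum for a prime that is two greater than another prime. The sketch below does this by trial division; the names IsPrime and SuitableSize are invented for the example.

```pascal
program TableSizeSketch;

{ Trial division: adequate for numbers the size of a hash table. }
function IsPrime (N: LongInt): Boolean;
var
  Divisor: LongInt;
  Prime: Boolean;
begin
  Prime := (N >= 2);
  Divisor := 2;
  while Prime and (Divisor * Divisor <= N) do begin
    if N mod Divisor = 0 then
      Prime := False;
    Divisor := Divisor + 1
  end;
  IsPrime := Prime
end;

{ The smallest size, not less than Minimum, such that both the size and
  the size minus two are prime. }
function SuitableSize (Minimum: LongInt): LongInt;
var
  Candidate: LongInt;
begin
  Candidate := Minimum;
  while not (IsPrime (Candidate) and IsPrime (Candidate - 2)) do
    Candidate := Candidate + 1;
  SuitableSize := Candidate
end;

begin
  WriteLn (IsPrime (1607), ' ', IsPrime (1609));
  WriteLn (SuitableSize (1600))   { 1609 }
end.
```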

Here is how the HashKey and ProbeWithKey functions might look in the Keys module:

type
  Key = 0 .. MaxInt;

{ The HashKey function maps any non-negative integer key into some position
  within the hash table, in a pseudo-random way. }

function HashKey (Opener: Key; ArraySize: Integer): Integer;
begin
  HashKey := Opener mod ArraySize + 1
end;

{ The ProbeWithKey function implements a pseudo-random permutation of the
  positions in the hash table; given any position, it yields another
  position, using a formula that makes the result dependent on the value
  of the key (assumed to be a non-negative integer). }

function ProbeWithKey (LastPosition: Integer; Opener: Key;
  ArraySize: Integer): Integer; 
var
  HashedKey: Integer;
    { an independent pseudo-random number derived from the search key }
begin
  HashedKey := Opener mod (ArraySize - 2) + 1;
  ProbeWithKey := (LastPosition + HashedKey) mod ArraySize + 1
end;
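The claim that the probe sequence reaches every position before repeating can be checked directly. The sketch below repeats the two functions above, with the key type widened to LongInt so that a nine-digit key fits, and counts how many distinct positions are visited; the key 123456789 and the Visited array are assumptions made for the test.

```pascal
program ProbeCycleSketch;

const
  TableSize = 1609;

var
  Visited: array [1 .. TableSize] of Boolean;
  Index, Position, Steps: Integer;
  SampleKey: LongInt;

function HashKey (Opener: LongInt; ArraySize: Integer): Integer;
begin
  HashKey := Opener mod ArraySize + 1
end;

function ProbeWithKey (LastPosition: Integer; Opener: LongInt;
  ArraySize: Integer): Integer;
var
  HashedKey: Integer;
begin
  HashedKey := Opener mod (ArraySize - 2) + 1;
  ProbeWithKey := (LastPosition + HashedKey) mod ArraySize + 1
end;

begin
  SampleKey := 123456789;
  for Index := 1 to TableSize do
    Visited[Index] := False;
  Position := HashKey (SampleKey, TableSize);
  Steps := 0;
  { Keep probing until some position comes up a second time. }
  while not Visited[Position] do begin
    Visited[Position] := True;
    Steps := Steps + 1;
    Position := ProbeWithKey (Position, SampleKey, TableSize)
  end;
  WriteLn (Steps)   { 1609: every position was visited once }
end.
```

Each probe advances by HashedKey + 1 positions, a step between 2 and 1608; since 1609 is prime, no such step shares a divisor with the table size, and so the sequence cannot return to its starting point early.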

When the keys are real numbers, or when they are integers in a relatively narrow range, another common hash function involves multiplying by some irrational number, discarding the integer part of the result, and then multiplying by the size of the hash table and discarding the fractional part. One frequently chosen irrational is phi, the limit of the ratios between successive Fibonacci numbers, (1 + sqrt(5))/2.
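A sketch of this multiplicative method follows. The function name HashReal, the ten-digit approximation to phi, and the addition of 1 (to suit arrays whose subscripts start at 1) are assumptions; the key is taken to be non-negative and small enough that Trunc does not overflow.

```pascal
program MultiplicativeHashSketch;

const
  Phi = 1.6180339887;   { approximately (1 + sqrt(5))/2 }
  ArraySize = 1609;

var
  Index: Integer;

function HashReal (Opener: Real): Integer;
var
  Product, Fraction: Real;
begin
  Product := Opener * Phi;
  Fraction := Product - Trunc (Product);   { discard the integer part }
  { Fraction lies in [0, 1), so the result lies in 1 .. ArraySize. }
  HashReal := Trunc (Fraction * ArraySize) + 1
end;

begin
  { Consecutive keys are sent to widely separated positions. }
  for Index := 1 to 5 do
    WriteLn (Index, ' -> ', HashReal (Index))
end.
```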

When the key is a character string, the hash function usually works by converting the key into an integer or real value and then applying one of the preceding techniques. Summing the ordinal values of the characters in the string often fails to disperse the keys sufficiently. Here are two better methods: (1) If the number of entries is typically small, use a hash table of size 128, and compute the hash function by treating each character as a sequence of seven bits, performing a bitwise exclusive-or operation to combine it with a ``running total'' and doing a one-bit circular shift on the result (moving each bit one position leftwards and then removing the leftmost bit and placing it at the right end) after each such operation. (2) Add each character's ordinal value to a running total, multiplying by some constant (perhaps phi) after each addition; discard the integer part of the result, multiply by the size of the hash table, and discard the fractional part.
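The first method might be sketched as follows. The sketch assumes a Pascal dialect that provides a String type and allows xor on integers (Turbo Pascal and Free Pascal do; standard Pascal does not), and it adds 1 at the end on the assumption that the table's subscripts run from 1 to 128. The name HashString is invented here.

```pascal
program StringHashSketch;

function HashString (Opener: String): Integer;
var
  Total, Index: Integer;
begin
  Total := 0;
  for Index := 1 to Length (Opener) do begin
    { Combine the low seven bits of the character with the total. }
    Total := Total xor (Ord (Opener[Index]) mod 128);
    { One-bit circular left shift within seven bits: doubling and
      reducing mod 128 drops the leftmost bit, and Total div 64
      restores that bit at the right end. }
    Total := (Total * 2) mod 128 + Total div 64
  end;
  HashString := Total + 1
end;

begin
  WriteLn (HashString ('A'));
  { Unlike a simple sum, this function is sensitive to the order of
    the characters, so the next two results differ. }
  WriteLn (HashString ('ab'), ' ', HashString ('ba'))
end.
```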

Recently, one active area of research in computer science has been devising algorithms to find hash functions that are tailored to specific values, in the sense that among those particular values no collisions whatever will take place. For instance, one might design such a function for Pascal's reserved words and predefined identifiers, so that, when a compiler's symbol table is implemented as a hash table, these frequently occurring strings will not cause unnecessary collisions.

Still another method for handling collisions -- perhaps the one most frequently used today -- is to implement the hash table, not as an array of elements, but as an array of singly-linked lists of elements. The hash function is applied to the key to determine which of these lists the new element should be added to; in the event of a collision, one simply puts all of the elements that hash to the same array subscript into the same list (often, in this context, called a bucket). In contrast to linear probing and secondary hashing, this method has the advantage that it can if necessary accommodate more elements than there are positions in the array, though with a progressive degradation of performance as the average list grows longer and the linear search down such a list comes to occupy a larger fraction of the running time.

Also, it is far easier to delete an element from a hash table that uses buckets than from one that uses linear probing or secondary hashing as its collision-resolution mechanism, so bucketing is the method of choice for a table that must support a delete operation. But in most applications of tables either deletions are never needed, or they can be saved up and performed at a time when the table must be completely rebuilt anyway.
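To see why deletion from a bucket is easy, consider this sketch of removing one component from a singly-linked list. It is a hypothetical illustration rather than part of any module in this handout: the names Push, DeleteFromList, and ListLength are invented, and the components hold only keys, with values omitted for brevity.

```pascal
program BucketDeleteSketch;

type
  Key = LongInt;
  Link = ^Component;
  Component = record
                K: Key;
                Next: Link
              end;

var
  Bucket: Link;

{ Adds a key at the front of a bucket. }
procedure Push (var HeadOfList: Link; Opener: Key);
var
  Fresh: Link;
begin
  New (Fresh);
  Fresh^.K := Opener;
  Fresh^.Next := HeadOfList;
  HeadOfList := Fresh
end;

{ Unlinks and disposes of the component whose key matches Opener, if
  there is one; the rest of the bucket is undisturbed. }
procedure DeleteFromList (var HeadOfList: Link; Opener: Key);
var
  Doomed: Link;
begin
  if HeadOfList <> nil then begin
    if HeadOfList^.K = Opener then begin
      Doomed := HeadOfList;
      HeadOfList := HeadOfList^.Next;
      Dispose (Doomed)
    end
    else
      DeleteFromList (HeadOfList^.Next, Opener)
  end
end;

function ListLength (HeadOfList: Link): Integer;
var
  Count: Integer;
begin
  Count := 0;
  while HeadOfList <> nil do begin
    Count := Count + 1;
    HeadOfList := HeadOfList^.Next
  end;
  ListLength := Count
end;

begin
  Bucket := nil;
  Push (Bucket, 5);
  Push (Bucket, 1614);
  Push (Bucket, 3223);
  DeleteFromList (Bucket, 1614);
  WriteLn (ListLength (Bucket))   { 2 }
end.
```

With linear probing or secondary hashing, by contrast, simply marking a slot as unused could cut off the elements that were placed beyond it on the same probe sequence, since a search stops at the first unused position it meets.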

Here's a version of the Tables module that uses buckets:

module TablesWithBuckets;

$search 'keys.o, values.o'$
import
  Keys, Values;

export

  type
    Table = ^TableRecord;

  { The CreateTable function constructs and returns an empty table. }

  function CreateTable: Table;

  { The SearchTable procedure looks in a given table for a value associated
    with a specified key, setting Found to True if it finds such a value
    and to False if it does not.  In addition, if the search is successful,
    the value is returned as the value of the parameter Sought. }

  procedure SearchTable (Store: Table; Opener: Key; var Found: Boolean;
    var Sought: Value);

  { The InsertInTable procedure associates a given key with a given value
    in a given table. }

  procedure InsertInTable (var Store: Table; Opener: Key;
    Associate: Value);

  { The FullTable function determines whether new elements can be added to
    the table, returning True if the table is already full and False if
    there is room for at least one more value. }

  function FullTable (Store: Table): Boolean;

  { The DeallocateTable procedure disposes of all the storage
    associated with the hash table, leaving its argument undefined. }

  procedure DeallocateTable (var Store: Table);

implement

  const
    ArraySize = 1609;
      { the number of slots in the underlying array; a prime number, so
        that HashKey disperses the keys well }

  type
    PositionNumber = 1 .. ArraySize;
      { range of position numbers in the hash table }
    KeyAndValue = record
                    K: Key;
                    V: Value
                  end;
      { information to be stored in one slot of the hash table }
    Link = ^Component;
    Component = record
                  Datum: KeyAndValue;
                  Next: Link
                end;
      { components of a singly-linked list of key-and-value records }
    TableRecord = array [PositionNumber] of Link;
      { A table is an array of these singly-linked lists. }

  function CreateTable: Table;
  var
    Result: Table;
      { the hash table under construction }
    Index: PositionNumber;
      { counts off the positions in the hash table }
  begin
    New (Result);
    for Index := 1 to ArraySize do
      Result^[Index] := nil;
    CreateTable := Result
  end;
        
  procedure SearchTable (Store: Table; Opener: Key; var Found: Boolean;
    var Sought: Value);
  var
    Traverser: Link;
      { points to successive components of the bucket in which the
        value sought might be found }
  begin
    { Assert (Store <> nil); }
    Traverser := Store^[HashKey (Opener, ArraySize)];
    Found := False;
    while not Found and (Traverser <> nil) do
      if EqualKeys (Traverser^.Datum.K, Opener) then
        Found := True
      else
        Traverser := Traverser^.Next;
    if Found then
      AssignValue (Sought, Traverser^.Datum.V)
  end;

  procedure InsertInTable (var Store: Table; Opener: Key;
    Associate: Value);

    { The InsertOrAppend procedure checks to see whether the list it is
      given is empty.  If so, it replaces that list with a newly allocated,
      one-element list in which the only element contains the key Opener
      and the value Associate.  Otherwise, it checks to see whether the
      key of the component at the head of the list matches Opener.  If so,
      it simply overwrites the value stored in that component with
      Associate.  Otherwise, it ignores the current component and calls
      itself recursively to advance down the list until either the end is
      reached or a component with a key matching Opener is found. }

    procedure InsertOrAppend (var HeadOfList: Link);
    begin
      if HeadOfList = nil then begin
        New (HeadOfList);
        AssignKey (HeadOfList^.Datum.K, Opener);
        AssignValue (HeadOfList^.Datum.V, Associate);
        HeadOfList^.Next := nil
      end
      else if EqualKeys (HeadOfList^.Datum.K, Opener) then
        AssignValue (HeadOfList^.Datum.V, Associate)
      else
        InsertOrAppend (HeadOfList^.Next)
    end;
    
  begin { procedure InsertInTable }
    { Assert (Store <> nil); }
    InsertOrAppend (Store^[HashKey (Opener, ArraySize)])
  end;

  function FullTable (Store: Table): Boolean;
  begin
    FullTable := False  { buckets never fill }
  end;

  procedure DeallocateTable (var Store: Table);
  var
    Index: PositionNumber;
      { counts off the positions in the hash table }

    procedure DeallocateList (var HeadOfList: Link);
    begin
      if HeadOfList <> nil then begin
        DeallocateList (HeadOfList^.Next);
        DeallocateKey (HeadOfList^.Datum.K);
        DeallocateValue (HeadOfList^.Datum.V);
        Dispose (HeadOfList)
      end
    end;

  begin { procedure DeallocateTable }
    { Assert (Store <> nil); }
    for Index := 1 to ArraySize do
      DeallocateList (Store^[Index]);
    Dispose (Store);
    Store := nil
  end;

end.

created May 5, 1996
last revised December 10, 1996