Searching

Linear search

To search a data structure is to examine its elements singly until one has either found an element that has a desired property or concluded that the data structure contains no such element. For instance, one might search a list of integers for an even element, or a vector of pairs for a pair having the string "elephant" as its cdr. Scheme's predefined assq, assv, and assoc procedures search association lists.
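
As a small illustration of the predefined procedures, here is a sketch that searches an association list by key. The list legs and its contents are made up for this example; assq compares keys with eq? and assoc compares them with equal?, so assoc is the one to use when the keys are strings.

```scheme
;; A hypothetical association list pairing animal names with leg counts.
(define legs '((spider . 8) (elephant . 4) (ostrich . 2)))

(assq 'elephant legs)                 ; => (elephant . 4)
(assq 'snake legs)                    ; => #f, since no pair has snake as its car

;; String keys need assoc, which compares with equal?.
(assoc "b" '(("a" . 1) ("b" . 2)))    ; => ("b" . 2)
```

Each of these procedures returns the whole pair on success, so that the caller can distinguish a stored value of #f from an unsuccessful search.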

In a linear data structure, such as a flat list or a vector, there is an obvious algorithm for conducting a search: Start at the beginning of the data structure and traverse it, testing each element. Eventually one will either find an element that has the desired property or reach the end of the structure without finding such an element, thus conclusively proving that there is no such element. Here's a vector version of the linear-search algorithm:

;;; linear-search: find the position of an element in a given vector that
;;; satisfies a given predicate

;;; Givens:
;;;   TEST?, a unary predicate.
;;;   VEC, a vector.

;;; Result:
;;;   OUTCOME, either a natural number or #F.

;;; Precondition:
;;;   Every element of VEC satisfies the preconditions that TEST? imposes on
;;;   its argument.

;;; Postconditions:
;;;   (1) If no element of VEC satisfies TEST?, OUTCOME is #F.
;;;   (2) If at least one element of VEC satisfies TEST?, then OUTCOME is
;;;       a position in VEC, and the element at position OUTCOME in VEC
;;;       satisfies TEST?

(define linear-search
  (lambda (test? vec)
    (let ((len (vector-length vec)))
      (let kernel ((position 0))
        (cond ((= position len) #f)
              ((test? (vector-ref vec position)) position)
              (else (kernel (+ position 1))))))))

Here are two examples of the use of this procedure:

> (define sample (vector 1 3 5 7 8 11 13))
> (linear-search even? sample)
4
> (linear-search (right-section = 12) sample)
#f

This search procedure returns #f if the search is unsuccessful; if it is successful, it returns the position in the specified vector at which the desired element can be found. There are many variants of this idea: One might, for instance, prefer to signal an error or display a diagnostic message if a search is unsuccessful. In the case of a successful search, one might simply return #t (if all that is needed is an indication of whether an element having the desired property is present in or absent from the vector), or one might return the element found rather than its position in the vector.
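
One way to get several of these variants from a single procedure is to let the caller say what to do in each case. The following is only a sketch: the name linear-search-with and its two extra parameters, on-success and on-failure, are hypothetical and are not part of the procedure defined above.

```scheme
;;; linear-search-with (hypothetical name): like linear-search, but the
;;; caller supplies ON-SUCCESS, a procedure applied to the position and
;;; the element found, and ON-FAILURE, a procedure of no arguments that
;;; is invoked if no element of VEC satisfies TEST?.

(define linear-search-with
  (lambda (test? vec on-success on-failure)
    (let ((len (vector-length vec)))
      (let kernel ((position 0))
        (cond ((= position len) (on-failure))
              ((test? (vector-ref vec position))
               (on-success position (vector-ref vec position)))
              (else (kernel (+ position 1))))))))

(define sample (vector 1 3 5 7 8 11 13))

;; Return the element found rather than its position.
(linear-search-with even? sample
                    (lambda (position element) element)
                    (lambda () #f))                      ; => 8

;; Report only whether a suitable element is present.
(linear-search-with even? sample
                    (lambda (position element) #t)
                    (lambda () #f))                      ; => #t
```

A caller who prefers a diagnostic to a #f result could pass an on-failure procedure that displays a message or signals an error instead.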

Binary search

The linear-search algorithm just described can be quite slow if the data structure to be searched is large. If one has a number of searches to carry out in the same data structure, it is often more efficient to "pre-process" the values, sorting them and transferring them to a vector, before starting those searches. One can then use the much faster binary-search algorithm.
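
The pre-processing step might look something like the following sketch, which sorts a list by insertion and then transfers the result to a vector. The names insert-in-order and list->sorted-vector are hypothetical, and insertion sort is used here only because it is short; any correct sorting procedure would serve, and a faster one is preferable for large collections.

```scheme
;;; insert-in-order (hypothetical helper): insert NEW into SORTED, a list
;;; already ordered by MAY-PRECEDE?, so that the result is also ordered.

(define insert-in-order
  (lambda (may-precede? new sorted)
    (cond ((null? sorted) (list new))
          ((may-precede? new (car sorted)) (cons new sorted))
          (else (cons (car sorted)
                      (insert-in-order may-precede? new (cdr sorted)))))))

;;; list->sorted-vector (hypothetical name): construct a vector containing
;;; the elements of LS, ordered by MAY-PRECEDE?.

(define list->sorted-vector
  (lambda (may-precede? ls)
    (let kernel ((rest ls)
                 (sorted '()))
      (if (null? rest)
          (list->vector sorted)
          (kernel (cdr rest)
                  (insert-in-order may-precede? (car rest) sorted))))))

(list->sorted-vector <= '(11 3 8 1 13 7 5))   ; => #(1 3 5 7 8 11 13)
```

The sorting is done once; every later search of the resulting vector can then take advantage of the ordering.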

Binary search is a more specialized algorithm than linear search. It requires a random-access structure, as opposed to one that offers only sequential access, and it is limited to the kind of test in which one is looking for a particular value that has a unique relative position in some ordering. For instance, one could use a binary search to look for an element equal to 12 in a vector of integers, since 12 is uniquely located between integers less than 12 and integers greater than 12; but one wouldn't use binary search to look for an even integer, since the even integers don't have a unique position in any natural ordering of the integers.

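The idea in a binary search is to divide the sorted vector into two approximately equal parts, examining the element at the point of division to determine which of the parts must contain the value sought. Actually, there are usually three possibilities: (1) The element at the point of division must follow the value sought in the ordering, so that the value sought, if it is present at all, lies in the part before the point of division; the search continues recursively in that part. (2) The value sought must follow the element at the point of division, so that it can lie only in the part after the point of division; the search continues recursively there. (3) Neither must follow the other, so that the element at the point of division is the value sought; the search has succeeded, and the recursion bottoms out.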

There is one other way in which the recursion can bottom out: If, in some recursive call, the subvector to be searched (which will be half of a half of a half of ... of the original vector) contains no elements at all, then the search obviously cannot succeed and the procedure should take the appropriate failure action.

Here, then, is the basic binary-search algorithm. It is curried, so that the ordering predicate is to be supplied first and separately; binary-search returns a customized searching procedure that one can, in turn, apply to a vector and the item one is looking for. The identifiers lower-bound and upper-bound denote the starting and ending positions of the part of the vector within which the value sought must lie, if it is present at all. As in the reading on sorting by merging, let's adopt the convention that the starting position is "inclusive" (it is the first position that is in the subvector) and the ending position is "exclusive" (it is the position after the last position in the subvector).

;;; binary-search: given an ordering predicate, construct and return a
;;; procedure that finds the position of a given value in a given vector
;;; ordered by that predicate

;;; Given:
;;;   MAY-PRECEDE?, a binary predicate.

;;; Result:
;;;   SEEKER, a binary procedure.

;;; Precondition:
;;;   MAY-PRECEDE? is an ordering relation.

;;; Postcondition:
;;;   Given a vector VEC, every element of which satisfies the
;;;   preconditions that MAY-PRECEDE? imposes on either of its arguments,
;;;   and which is ordered by MAY-PRECEDE?, and a value SOUGHT, SEEKER
;;;   returns either #F (if SOUGHT is not an element of VEC) or a zero-based
;;;   position in VEC at which SOUGHT occurs.

(define binary-search
  (lambda (may-precede?)
    (lambda (vec sought)
      (let kernel ((lower-bound 0)
                   (upper-bound (vector-length vec)))
        (if (< lower-bound upper-bound)
            (let* ((midpoint (quotient (+ lower-bound upper-bound) 2))
                   (middle-element (vector-ref vec midpoint)))
              (cond ((not (may-precede? middle-element sought))
                     (kernel lower-bound midpoint))
                    ((not (may-precede? sought middle-element))
                     (kernel (+ midpoint 1) upper-bound))
                    (else midpoint)))
            #f)))))

In each recursive call to kernel, the length of the subvector within which the value sought must lie, if it is present at all, is cut roughly in half. Since a vector of n elements can be halved only about log2 n times before nothing is left (about twenty times for a vector of a million elements), binary search is typically much, much faster than linear search, which may have to examine every element.
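
To see the effect of the halving, one can instrument the search so that it reports how many elements it examined instead of where it found the value. The sketch below assumes two hypothetical procedures that are not part of the reading: binary-search-probes, which has the same structure as binary-search but carries a counter, and iota-vector, which builds the ordered vector of the integers from 0 up to but not including n.

```scheme
;;; binary-search-probes (hypothetical name): given an ordering predicate,
;;; return a procedure that searches as binary-search does but returns the
;;; number of elements it examined, whether or not the search succeeded.

(define binary-search-probes
  (lambda (may-precede?)
    (lambda (vec sought)
      (let kernel ((lower-bound 0)
                   (upper-bound (vector-length vec))
                   (probes 0))
        (if (< lower-bound upper-bound)
            (let* ((midpoint (quotient (+ lower-bound upper-bound) 2))
                   (middle-element (vector-ref vec midpoint)))
              (cond ((not (may-precede? middle-element sought))
                     (kernel lower-bound midpoint (+ probes 1)))
                    ((not (may-precede? sought middle-element))
                     (kernel (+ midpoint 1) upper-bound (+ probes 1)))
                    (else (+ probes 1))))
            probes)))))

;;; iota-vector (hypothetical helper): construct the vector of the
;;; integers 0, 1, ..., N - 1, in ascending order.

(define iota-vector
  (lambda (n)
    (let ((vec (make-vector n 0)))
      (let fill ((i 0))
        (if (< i n)
            (begin
              (vector-set! vec i i)
              (fill (+ i 1)))
            vec)))))

;; Supplying the ordering predicate first yields a customized procedure.
(define count-probes (binary-search-probes <=))

(count-probes (iota-vector 1000) 999)       ; at most 10 elements examined
(count-probes (iota-vector 1000000) -1)     ; at most 20, though -1 is absent
```

A linear search for a value that is absent from the million-element vector would have to test all one million elements before giving up.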