Sorting a set of values -- arranging them in a fixed order, usually alphabetical or numerical -- is one of the commonest computing applications. In most cases, it is such a tiresome, error-prone, and time-consuming process for human beings that the programmer should automate it whenever possible. For this reason, it is an application that has been studied intently by computer scientists.
One of the clear results of these investigations is that no one algorithm for sorting is best in all cases. Ideally, one uses different algorithms depending on whether one is sorting a small set or a large one, on whether the individual elements of the set occupy a lot of storage (so that moving them around in memory is time-consuming), on how easy it is to compare two elements to figure out which one should precede the other, and so on. In this course we'll be looking at two of the most generally useful algorithms for sorting: the insertion sort, which is the subject of today's lab, and the merge sort, which we'll talk about in the next lab.
Imagine first that we're given a set of values and a rule for arranging
them. The values might actually be stored either in a list or in a
vector; let's assume first that they are in a list. The rule typically
takes the form of a predicate of arity 2 that can be applied to any two
values in the set to determine whether the first of them should precede the
second when the values have been sorted. (For example, if one wants to
sort a set of real numbers into ascending numerical order, the rule should
be the predicate <; if one wants to sort a set of strings
into alphabetical order, ignoring case, the rule should be
string-ci<?, and so on.)
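To make the idea of an ordering rule concrete, here is a small sketch of how such predicates behave when they are applied to two values. The sample values are our own illustrative choices.

```scheme
;; An ordering rule is a predicate of arity 2: it returns #t when its
;; first argument should precede its second, and #f otherwise.
(< 3 5)                          ; => #t   3 precedes 5 in ascending order
(< 5 3)                          ; => #f
(< 3 3)                          ; => #f   equal values: neither precedes the other
(string-ci<? "Apple" "banana")   ; => #t   case is ignored
```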
The insertion sort works by taking the values one by one and inserting each
one into a new list that it constructs, constantly maintaining the
condition that the elements of the new list are in the desired order with
respect to one another. Clearly, this condition will not be maintained if
each element is added to the new list at the beginning, using
cons; instead, the insertion sort adds each element at a
carefully selected position within the new list, placing the new element
after each previously placed element that precedes it according
to the given precedence rule, but before every such element that
it precedes. The following procedure, insert, adds a new
element to a list in exactly this way. For the moment, we'll assume that
the elements of the list are real numbers and that we want to sort them
into ascending order; < is therefore used as the ordering
predicate.
(define insert
  (lambda (new-element ls)
    (cond ((null? ls) (list new-element))
          ((< new-element (car ls)) (cons new-element ls))
          (else (cons (car ls) (insert new-element (cdr ls)))))))
In English: If the list into which the new element is to be inserted is empty, return a list containing only the new element. If the new element precedes the first element of the existing list, then, since the existing list is assumed to be sorted already, it must also precede every element of the existing list, so cons the new element onto the front of the existing list and return the result. Otherwise, we haven't yet found the place, so issue a recursive call to insert the new element into the cdr of the current list, then reattach its car at the beginning of the result.
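It may help to watch the recursion unfold on one small case. The sketch below repeats the definition so that it can be run on its own; the sample list is our own choice.

```scheme
;; The insert procedure from above, repeated so that this example
;; is self-contained.
(define insert
  (lambda (new-element ls)
    (cond ((null? ls) (list new-element))
          ((< new-element (car ls)) (cons new-element ls))
          (else (cons (car ls) (insert new-element (cdr ls)))))))

;; Unfolding (insert 5 '(1 3 7 9)) one recursive call at a time:
;;   (insert 5 '(1 3 7 9))
;;   (cons 1 (insert 5 '(3 7 9)))           ; 5 does not precede 1
;;   (cons 1 (cons 3 (insert 5 '(7 9))))    ; 5 does not precede 3
;;   (cons 1 (cons 3 (cons 5 '(7 9))))      ; 5 precedes 7, so stop here
(insert 5 '(1 3 7 9))   ; => (1 3 5 7 9)
```

Notice that each cons of a bypassed element has to wait until the recursive call inside it returns.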
Test the insert procedure by inserting a number into an empty
list; into a list of larger numbers, arranged in ascending order; into a
list of smaller numbers, arranged in ascending order; into a mixed list,
arranged in ascending order; into a list of copies of the number to be
inserted.
What happens if ls is not in ascending order when
insert is invoked?
Modify the insert procedure so that it inserts a string into a
list of strings that are in alphabetical order:
> (modified-insert "dog" '("ape" "bear" "cat" "emu" "frog"))
("ape" "bear" "cat" "dog" "emu" "frog")
The preceding version of the insert procedure is not
tail-recursive. When dealing with long lists, you may want to use the
following tail-recursive version, which uses space more economically:
(define insert
  (lambda (new-element ls)
    (let loop ((rest ls)
               (bypassed '()))
      (cond ((null? rest) (revappend bypassed (list new-element)))
            ((< new-element (car rest))
             (revappend bypassed (cons new-element rest)))
            (else (loop (cdr rest) (cons (car rest) bypassed)))))))
(define revappend
  (lambda (ls-1 ls-2)
    (if (null? ls-1)
        ls-2
        (revappend (cdr ls-1) (cons (car ls-1) ls-2)))))
(The revappend procedure takes two lists and returns the
result of concatenating the reverse of the first one to the front
of the second one.)
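As a rough illustration of what revappend does, consider the last step of inserting 5 into the list (1 3 7 9) with the tail-recursive version: by then the loop has bypassed 1 and 3, so bypassed is (3 1) and rest is (7 9). The values are our own example.

```scheme
;; The revappend procedure from above, repeated so that this example
;; is self-contained.
(define revappend
  (lambda (ls-1 ls-2)
    (if (null? ls-1)
        ls-2
        (revappend (cdr ls-1) (cons (car ls-1) ls-2)))))

;; Each call moves the first element of ls-1 onto the front of ls-2:
;;   (revappend '(3 1) '(5 7 9))
;;   (revappend '(1) '(3 5 7 9))
;;   (revappend '() '(1 3 5 7 9))
(revappend '(3 1) '(5 7 9))   ; => (1 3 5 7 9)
```

Because bypassed accumulates elements in reverse order, reversing it while reattaching it restores the original order.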
Now let's return to the overall process of sorting an entire list. The insertion sort algorithm simply takes up the elements of the list to be sorted one by one and inserts each one into a new list, initially empty:
(define insertion-sort
  (lambda (old-ls)
    (do ((rest old-ls (cdr rest))
         (new-ls '() (insert (car rest) new-ls)))
        ((null? rest) new-ls)))) ;; no body
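To see how the do-expression builds up the result, here is a sketch of a variant that records every intermediate value of new-ls. The name insertion-sort-steps and the extra loop variable steps are hypothetical additions of ours, not part of the lab. The variant relies on the fact that all of the step expressions in a do-expression are evaluated using the bindings from the previous iteration.

```scheme
;; The (non-tail-recursive) insert procedure, repeated so that this
;; example is self-contained.
(define insert
  (lambda (new-element ls)
    (cond ((null? ls) (list new-element))
          ((< new-element (car ls)) (cons new-element ls))
          (else (cons (car ls) (insert new-element (cdr ls)))))))

;; Hypothetical helper: like insertion-sort, but it returns a list of
;; every value that new-ls takes on, earliest first.
(define insertion-sort-steps
  (lambda (old-ls)
    (do ((rest old-ls (cdr rest))
         (new-ls '() (insert (car rest) new-ls))
         (steps '() (cons new-ls steps)))
        ((null? rest) (reverse (cons new-ls steps))))))

(insertion-sort-steps '(3 1 2))   ; => (() (3) (1 3) (1 2 3))
```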
Redefine insert using trace-define, so that you
can follow the sequence of calls, then use the insertion-sort
procedure to sort the values 7, 6, 12, 4, 10, 8, 5, and 1.
Test the insertion-sort procedure on some potentially
troublesome arguments: an empty list, a list containing only one element, a
list containing all equal values, a list in which the elements are
originally in descending numerical order.
By writing the specific predicate < into the definition of
insert, we restricted the preceding version of
insertion-sort so that it applies only to lists of real
numbers and always returns a list in ascending numerical order. Let's go
back now and lift that restriction.
According to the original specification, insertion-sort should
take two arguments, the list ls and a predicate
precedes? that compares elements of that list. Since in many
applications the nature of the desired ordering is known before the
particular list to be ordered and is constant over many applications to
different lists, it makes sense to curry the sorting algorithm so that it
takes these arguments separately. As a first draft, we might try this:
(define insertion-sort ;; Be careful: This version doesn't quite work.
  (lambda (precedes?)
    (lambda (old-ls)
      (do ((rest old-ls (cdr rest))
           (new-ls '() (insert (car rest) new-ls)))
          ((null? rest) new-ls)))))
The problem is that the actual use of the ordering rule is not in
the body of insertion-sort, but inside the insert
procedure, which is entirely separate. Even if we went back to that
procedure and changed < to precedes?, the sort
still wouldn't work, because inside the insert procedure the
identifier precedes? wouldn't be bound to anything.
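One way around the difficulty, though not the one this lab goes on to use, would be to hand the predicate to the insertion procedure as an extra argument on every call. The names insert-by and insertion-sort-by in the following sketch are hypothetical, chosen so as not to collide with the lab's own definitions.

```scheme
;; Hypothetical variant: the ordering predicate is an explicit parameter
;; of the insertion procedure, so it is bound inside that procedure.
(define insert-by
  (lambda (precedes? new-element ls)
    (cond ((null? ls) (list new-element))
          ((precedes? new-element (car ls)) (cons new-element ls))
          (else (cons (car ls)
                      (insert-by precedes? new-element (cdr ls)))))))

(define insertion-sort-by
  (lambda (precedes?)
    (lambda (old-ls)
      (do ((rest old-ls (cdr rest))
           (new-ls '() (insert-by precedes? (car rest) new-ls)))
          ((null? rest) new-ls)))))

((insertion-sort-by >) '(7 6 12 4))   ; => (12 7 6 4)
```

This works, but the predicate has to be threaded through every recursive call even though it never changes.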
What we'd really like to do is pick up the entire definition of
insert and put it inside the definition of
insertion-sort, at a point where the identifier
precedes? has been bound:
(define insertion-sort
  (lambda (precedes?)
    (define insert
      (lambda (new-element ls)
        (cond ((null? ls) (list new-element))
              ((precedes? new-element (car ls)) (cons new-element ls))
              (else (cons (car ls) (insert new-element (cdr ls)))))))
    (lambda (old-ls)
      (do ((rest old-ls (cdr rest))
           (new-ls '() (insert (car rest) new-ls)))
          ((null? rest) new-ls)))))
It is a pleasant surprise to discover that it is legal to do exactly this
in Scheme. The identifiers that are introduced through such embedded
definitions are local and behave as though they were bound by
means of a letrec-expression; indeed, the Scheme standard
specifies that the preceding code must be semantically identical to the
following version using letrec:
(define insertion-sort
  (lambda (precedes?)
    (letrec ((insert
              (lambda (new-element ls)
                (cond ((null? ls) (list new-element))
                      ((precedes? new-element (car ls)) (cons new-element ls))
                      (else (cons (car ls) (insert new-element (cdr ls))))))))
      (lambda (old-ls)
        (do ((rest old-ls (cdr rest))
             (new-ls '() (insert (car rest) new-ls)))
            ((null? rest) new-ls))))))
However, many people find the version that uses a local definition more
readable. (Also, under Chez Scheme, trace-define works for
embedded procedure definitions, which is sometimes helpful when you're
trying to debug a program.)
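Here is a sketch of how the curried procedure might be put to use on lists of numbers. The names sort-ascending and sort-descending are our own, introduced only for illustration.

```scheme
;; The version with the embedded definition, repeated so that this
;; example is self-contained.
(define insertion-sort
  (lambda (precedes?)
    (define insert
      (lambda (new-element ls)
        (cond ((null? ls) (list new-element))
              ((precedes? new-element (car ls)) (cons new-element ls))
              (else (cons (car ls) (insert new-element (cdr ls)))))))
    (lambda (old-ls)
      (do ((rest old-ls (cdr rest))
           (new-ls '() (insert (car rest) new-ls)))
          ((null? rest) new-ls)))))

;; Supplying only the ordering predicate yields a sorting procedure
;; that can be named and applied to many different lists.
(define sort-ascending (insertion-sort <))
(define sort-descending (insertion-sort >))

(sort-ascending '(7 6 12 4 10))    ; => (4 6 7 10 12)
(sort-descending '(7 6 12 4 10))   ; => (12 10 7 6 4)
```

Each call to insertion-sort creates its own local insert, closed over the predicate that was supplied in that call.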
Disciplined programmers use embedded definitions only for procedures.
Beginners are sometimes tempted to use them instead of let or
let* to create local names for intermediate results in a
computation, but this is usually a mistake. Unlike top-level definitions,
embedded definitions are not sequential -- the bindings are
mutually recursive and simultaneous, as in letrec-expressions,
not successive, as in let*-expressions.
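The contrast can be sketched with two small procedures, both hypothetical examples of ours rather than part of the lab. In the first, each intermediate result depends on the one before it, so let* is the appropriate tool; in the second, the embedded definitions name procedures that refer to each other, which is exactly what the letrec-like behavior is for.

```scheme
;; Intermediate results: each binding may use the ones before it.
(define mean-and-spread
  (lambda (a b)
    (let* ((sum (+ a b))
           (mean (/ sum 2))
           (spread (abs (- a mean))))
      (list mean spread))))

;; Embedded definitions of procedures: my-even? mentions my-odd?
;; before my-odd? is defined, which is fine because the bindings
;; are mutually recursive.
(define parity
  (lambda (n)   ; n is assumed to be a non-negative integer
    (define my-even?
      (lambda (k) (if (zero? k) #t (my-odd? (- k 1)))))
    (define my-odd?
      (lambda (k) (if (zero? k) #f (my-even? (- k 1)))))
    (if (my-even? n) 'even 'odd)))

(mean-and-spread 2 6)   ; => (4 2)
(parity 7)              ; => odd
```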
Figure out how to invoke the curried version of insertion-sort
so that it arranges the strings "bear", "emu",
"frog", "ape", "dog", and
"cat" into alphabetical order.
Finally, let's consider the rather different case in which the values that we want to arrange are presented as a vector and the goal of the sorting algorithm is to overwrite the old arrangement of those values with a new, sorted arrangement of the same values. Instead of constructing a new vector, we partition the original vector into two subvectors: a sorted subvector, in which all of the elements are in the correct order relative to one another, and an unsorted subvector in which the elements are still in their original positions. The two subvectors are not actually separated; instead, we just keep track of a boundary between them inside the original vector. Items to the left of the boundary are in the sorted subvector; items to its right, in the unsorted one. Initially the boundary is at the left end of the vector. The plan is to shift it, one position at a time, to the right end. When it arrives, the entire vector has been sorted.
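The condition that holds to the left of the boundary can be stated as a small checking procedure. The name sorted-prefix? is hypothetical; the lab itself does not need such a procedure, but it may be handy when experimenting.

```scheme
;; Hypothetical checker: returns #t if, among the positions to the left
;; of boundary, no element precedes the element just before it.
(define sorted-prefix?
  (lambda (precedes? vec boundary)
    (let loop ((i 1))
      (cond ((>= i boundary) #t)
            ((precedes? (vector-ref vec i) (vector-ref vec (- i 1))) #f)
            (else (loop (+ i 1)))))))

(define sample (vector 1 4 6 3 2))
(sorted-prefix? < sample 0)   ; => #t   an empty sorted part is trivially in order
(sorted-prefix? < sample 3)   ; => #t   1, 4, 6 are in order
(sorted-prefix? < sample 4)   ; => #f   3 is out of place
```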
Here's the plan for the main algorithm, then. Once again, we use currying so that the ordering rule can be provided before the vector.
(define insertion-sort!
  (lambda (precedes?)
    ;; The definition of the INSERT! procedure goes here.
    (lambda (vec)
      (let ((len (vector-length vec)))
        (do ((boundary 0 (+ boundary 1)))
            ((= boundary len))
          (insert! (vector-ref vec boundary) vec boundary))))))
The insert! procedure takes three arguments: an element to be
inserted into the sorted part of the vector, the vector itself, and the
current boundary position. The new element can be inserted at any position
up to and including the current boundary position, but it must be placed in
the correct order relative to elements to the left of that boundary. This
means that any elements that should follow the new one should be shifted
one position to the right in order to make room for the new one. (Elements
that precede the new one can keep their current positions.)
(define insert!
  (lambda (new-element vec boundary)
    (do ((test-position boundary (- test-position 1)))
        ((or (zero? test-position)
             (precedes? (vector-ref vec (- test-position 1)) new-element))
         (vector-set! vec test-position new-element))
      (vector-set! vec test-position (vector-ref vec (- test-position 1))))))
In English: Starting at the boundary and working from right to left,
examine each position in turn as a candidate for the position at which
new-element should be inserted. If the position number is 0
(so that we've reached the left end of the vector), or if the element just
to the left of the current position is supposed to precede
new-element, stop and put new-element in the
current candidate position. Otherwise, fill in the current candidate
position by copying the element just to its left into it and proceed to
the next iteration, in which the position of the element just copied will
be overwritten one way or the other.
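A single call can be traced on a small vector. In the sketch below, precedes? is bound at top level to < purely so that insert! can be tried in isolation; that binding, like the sample vector, is an assumption made for this demonstration only. The vector is built with the vector procedure rather than written as a literal, since some implementations do not allow literal vectors to be modified.

```scheme
;; Assumption for this demonstration only: a top-level ordering rule.
(define precedes? <)

;; The insert! procedure from above, repeated so that this example
;; is self-contained.
(define insert!
  (lambda (new-element vec boundary)
    (do ((test-position boundary (- test-position 1)))
        ((or (zero? test-position)
             (precedes? (vector-ref vec (- test-position 1)) new-element))
         (vector-set! vec test-position new-element))
      (vector-set! vec test-position (vector-ref vec (- test-position 1))))))

;; Positions 0 through 2 are already sorted; the boundary is at position 3.
(define sample (vector 2 5 8 4 9))
(insert! (vector-ref sample 3) sample 3)
;; Contents of the vector after each step:
;;   #(2 5 8 8 9)   8 does not precede 4, so 8 is copied one place right
;;   #(2 5 5 8 9)   5 does not precede 4, so 5 is copied one place right
;;   #(2 4 5 8 9)   2 precedes 4, so 4 is stored at position 1
sample   ; => #(2 4 5 8 9)
```

The element beyond the boundary (9, at position 4) is never examined or moved.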
Assemble the full definition of insertion-sort! by editing the
definition of insert! into the version given above. Create
and name a vector containing the strings "bear",
"emu", "frog", "ape",
"dog", and "cat". Rearrange the elements of the
vector into alphabetical order by means of an appropriate call to
insertion-sort!. (Note that the sorting occurs as a side
effect of this call -- the value of the do-expression in the
body of insertion-sort! is unspecified -- so to confirm that
the sorting procedure worked you'll have to inspect the vector again
afterwards.)
This document is available on the World Wide Web as
http://www.math.grin.edu/~stone/courses/scheme/sorting-methods.html
created November 21, 1997
last revised November 24, 1997