
Filtering lists

Summary: We consider procedures and processes for selecting and removing data from lists.

Introduction: Filtering

As you’ve seen multiple times, when we work with data sets, particularly data sets that we did not create ourselves, we often have to “filter” them. Here are some examples:

  • A data set of places may include the latitude and longitude for some places, but not all. If we were using the latitude and longitude, we would want to remove the elements that do not have latitude and longitude.
  • Given a list of values that are supposedly grades, we might want to remove the negative numbers because they are unlikely to represent grades (or represent instances in which data was incorrectly entered).
  • We might care only about a subset of the data in our data set. For example, we might have a data set that contains information for cities in every state, but we only care about cities in the midwest or cities within a certain distance of our current location.

We’ve generally used the term “cleaning” for the first two examples and “filtering” for the third, but they have the same general goal: Removing elements from a list.

Initially, we had to come up with some fairly “creative” processes for removing data from a list. For example, to remove the negative values from a list of real numbers, we decided to

  • add a zero to the front,
  • sort the list,
  • find the index of the zero, and
  • remove everything up to and including that index.

Of course, that reorders the values. Hence, to keep them in the same order, we ended up writing a much more complex and much longer set of instructions.
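
The four steps above might be sketched as follows in plain Racket. The name remove-negatives-by-sorting is hypothetical, and the sketch assumes the standard sort, index-of (list first, then value), and drop procedures rather than anything from the course library.

```racket
#lang racket

;; Hypothetical helper (not from the reading): remove the negative
;; values from a list of real numbers using the sort-based steps.
(define remove-negatives-by-sorting
  (lambda (lst)
    (let* ([with-zero (cons 0 lst)]            ; add a zero to the front
           [sorted (sort with-zero <)]         ; sort the list
           [zero-index (index-of sorted 0)])   ; find the index of the zero
      ;; remove everything up to and including that index
      (drop sorted (+ zero-index 1)))))

(remove-negatives-by-sorting (list 3 -1 2 -7 5))
```

For the input '(3 -1 2 -7 5), the result is '(2 3 5) rather than '(3 2 5), which illustrates the reordering problem.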

The filter procedure

The problem of filtering is common enough that DrRacket includes a built-in procedure, filter, to help us filter lists. filter takes two parameters, a unary (one-parameter) predicate and a list of values, and selects all the values for which the predicate holds.

> (define stuff (list -5 10 18 23 14.0 87 1/2 0.5 -12.2))
> stuff
'(-5 10 18 23 14.0 87 1/2 0.5 -12.2)
> (filter inexact? stuff)
'(14.0 0.5 -12.2)
> (filter negative? stuff)
'(-5 -12.2)
> (filter integer? stuff)
'(-5 10 18 23 14.0 87)
> (filter (section <= 0 <> 10) stuff)
'(10 1/2 0.5)

That seems pretty powerful, doesn’t it? Believe it or not, by the end of this course, you’ll be able to write filter yourself.
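
As a preview, here is a minimal sketch of one way such a procedure might be written. It relies on recursion, which this reading does not cover, and it is only an illustration: the name my-filter is hypothetical, and the built-in filter is not necessarily implemented this way.

```racket
#lang racket

;; my-filter is a hypothetical name, chosen so that we do not
;; shadow the built-in filter.
(define my-filter
  (lambda (pred? lst)
    (cond
      ;; Nothing left to examine.
      [(null? lst) '()]
      ;; The first value satisfies pred?, so keep it.
      [(pred? (car lst))
       (cons (car lst) (my-filter pred? (cdr lst)))]
      ;; Otherwise, skip the first value.
      [else (my-filter pred? (cdr lst))])))

(my-filter negative? (list -5 10 18 23 14.0 87 1/2 0.5 -12.2))
```

The last expression selects the same two values, -5 and -12.2, that the built-in filter selected above.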

Selecting from lists of compound values

In these first examples, we’ve selected elements from simple lists. But what if we want to do more complex selection, such as selecting values from a table or selecting strings that meet some criterion?

As an example, let’s return to our list of capitals from the reading on tables.

(define capitals
  '(("Alabama" "Montgomery" 32.361538 -86.279118)
    ("Alaska" "Juneau" 58.301935 -134.41974)
    ...
    ("Wyoming" "Cheyenne" 41.145548 -104.802042)))

Suppose we wanted all the cities north of latitude 39.72. If we cared only about the latitudes, we could first extract all of the latitudes and then filter.

> (map caddr capitals)
'(32.361538
  58.301935
  33.448457
  34.736009
  ...
  43.074722
  41.145548)
> (filter (section > <> 39.72) (map caddr capitals))
'(58.301935
  39.7391667
  41.767
  43.613739
  ...
  43.074722
  41.145548)
> (length capitals)
50
> (length (filter (section > <> 39.72) (map caddr capitals)))
27

But what if we want the whole entry? In that case, we could write a compound predicate, one that first extracts the latitude and then compares the result to 39.72.

> (filter (o (section > <> 39.72) 
             caddr) 
          capitals)
'(("Alaska" "Juneau" 58.301935 -134.41974)
  ("Colorado" "Denver" 39.7391667 -104.984167)
  ("Connecticut" "Hartford" 41.767 -72.677)
  ...
  ("Wisconsin" "Madison" 43.074722 -89.384444)
  ("Wyoming" "Cheyenne" 41.145548 -104.802042))
> (filter (o (section < <> 39.72) 
             caddr) 
          capitals)
'(("Alabama" "Montgomery" 32.361538 -86.279118)
  ("Arizona" "Phoenix" 33.448457 -112.073844)
  ("Arkansas" "Little Rock" 34.736009 -92.331122)
  ...
  ("Texas" "Austin" 30.266667 -97.75)
  ("Virginia" "Richmond" 37.54 -77.46)
  ("West Virginia" "Charleston" 38.349497 -81.633294))
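
The section and o procedures come from the course library. For comparison, here is a sketch of the same compound predicate in plain Racket, written once with lambda and once with the standard compose procedure. We assume that o behaves like compose, applying the rightmost procedure first. The names sample-capitals and north-of-line? are hypothetical; the entries are copied from the table above.

```racket
#lang racket

;; A small sample of entries copied from the capitals table.
(define sample-capitals
  '(("Alabama" "Montgomery" 32.361538 -86.279118)
    ("Alaska" "Juneau" 58.301935 -134.41974)
    ("Texas" "Austin" 30.266667 -97.75)
    ("Wyoming" "Cheyenne" 41.145548 -104.802042)))

;; Hypothetical predicate: extract the latitude (the third element)
;; and then compare it to 39.72.
(define north-of-line?
  (lambda (entry)
    (> (caddr entry) 39.72)))

;; The same predicate built by composition: caddr runs first,
;; and its result is passed to the comparison.
(define north-of-line-composed?
  (compose (lambda (latitude) (> latitude 39.72))
           caddr))

(filter north-of-line? sample-capitals)
(filter north-of-line-composed? sample-capitals)
```

Both calls keep the whole entries for Alaska and Wyoming.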

We might do the same thing when working with strings. Let’s start with a simple char-vowel? predicate.

;;; Procedure:
;;;   char-vowel?
;;; Parameters:
;;;   ch, a character [unverified]
;;; Purpose:
;;;   Determine if ch represents a "traditional" vowel in the English
;;;   language (aeiou).
;;; Produces:
;;;   is-vowel?, a Boolean value
;;; Preconditions:
;;;   [No additional]
;;; Postconditions:
;;;   * If ch is one of #\a #\e #\i #\o #\u #\A #\E #\I #\O #\U
;;;     then is-vowel? is true.
;;;   * Otherwise, is-vowel? is false.
;;; Problems:
;;;   There are words in which y and w are vowels.  This procedure will
;;;   not work in those situations.
(define char-vowel?
  (lambda (ch)
    (<= 0 (index-of (char-downcase ch) (list #\a #\e #\i #\o #\u)))))

We can then use that to find all the entries whose capital starts with a vowel.

> (filter (o char-vowel?
             (section string-ref <> 0)
             cadr)
          capitals)
'(("Georgia" "Atlanta" 33.76 -84.39)
  ("Indiana" "Indianapolis" 39.790942 -86.147685)
  ("Maine" "Augusta" 44.323535 -69.765261)
  ("Maryland" "Annapolis" 38.972945 -76.501157)
  ("New York" "Albany" 42.659829 -73.781339)
  ("Oklahoma" "Oklahoma City" 35.482309 -97.534994)
  ("Texas" "Austin" 30.266667 -97.75)
  ("Washington" "Olympia" 47.042418 -122.893077))
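
The definition of char-vowel? above appears to rely on the course library's index-of, which takes the value first and seems to return a negative number when the value is absent. Standard Racket's index-of takes the list first and returns #f instead. As a sketch under that assumption, here is a plain-Racket variant that uses memv, together with the same three-step composition. The names sample-capitals and capital-starts-with-vowel? are hypothetical.

```racket
#lang racket

;; A small sample of entries copied from the capitals table.
(define sample-capitals
  '(("Alabama" "Montgomery" 32.361538 -86.279118)
    ("Georgia" "Atlanta" 33.76 -84.39)
    ("Texas" "Austin" 30.266667 -97.75)
    ("Wyoming" "Cheyenne" 41.145548 -104.802042)))

;; Variant of char-vowel? that avoids the course library.
;; memv returns a list tail or #f, so (and ... #t) yields a Boolean.
(define char-vowel?
  (lambda (ch)
    (and (memv (char-downcase ch) (list #\a #\e #\i #\o #\u))
         #t)))

;; Read right to left: take the capital's name (cadr), take its
;; first character (string-ref at index 0), and test for a vowel.
(define capital-starts-with-vowel?
  (compose char-vowel?
           (lambda (str) (string-ref str 0))
           cadr))

(filter capital-starts-with-vowel? sample-capitals)
```

On this sample, the filter keeps the entries for Georgia (Atlanta) and Texas (Austin).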

Writing more complex predicates

You may be asking yourself “What do they mean by more complex predicates? Those predicates already seem fairly complex.” But you’ll soon find that they follow a fairly straightforward pattern: You extract a single datum from a compound value (typically with list-ref, string-ref, or one of the cadr-like procedures) and then apply a basic predicate to it.

But we can also do some other interesting things with predicates that may feel complex at first and become more straightforward with practice. Let’s consider three (negation, conjunction, and disjunction), which correspond to the basic Boolean operations of not, and, and or.

Negation

The negation of a predicate, given by (negate pred?), holds exactly when pred? does not hold. For example, since char-vowel? holds for vowels, (negate char-vowel?) holds for consonants.

> (filter char-vowel? (list #\a #\b #\c #\d #\e #\f #\g))
'(#\a #\e)
> (filter (negate char-vowel?) (list #\a #\b #\c #\d #\e #\f #\g))
'(#\b #\c #\d #\f #\g)

Similarly, (negate integer?) holds for all values that are not integers.

> (filter integer? (list 1 1/2 3.4 4 "two" 'three 8.0))
'(1 4 8.0)
> (filter (negate integer?) (list 1 1/2 3.4 4 "two" 'three 8.0))
'(1/2 3.4 "two" three)

The negate procedure comes with DrRacket, but may not be in all implementations of Scheme.
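
In an implementation that lacks negate, a version for unary predicates might be sketched as follows. The name my-negate is hypothetical, and the built-in negate is more general than this.

```racket
#lang racket

;; Hypothetical stand-in for negate: given a unary predicate,
;; build a new predicate that holds exactly when pred? does not.
(define my-negate
  (lambda (pred?)
    (lambda (val)
      (not (pred? val)))))

(filter (my-negate integer?) (list 1 1/2 3.4 4 "two" 'three 8.0))
```

This selects the same values as (negate integer?) did above: 1/2, 3.4, "two", and the symbol three.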

Conjunction

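Just as and can be used to combine two Boolean values, conjoin can be used to combine two unary predicates. The conjunction of two predicates is a new predicate that holds only when both of the predicates hold. Note that the predicates are evaluated left-to-right, and evaluation stops as soon as one of them fails. That is why, in the last example below, exact? is never applied to "two": integer? has already rejected it.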

> (filter integer? (list 1 1/2 3.4 4 "two" 'three 8.0))
'(1 4 8.0)
> (filter exact? (list 1 1/2 3.4 4 "two" 'three 8.0))
. . exact?: contract violation
  expected: number?
  given: "two"
> (filter (conjoin integer? exact?) (list 1 1/2 3.4 4 "two" 'three 8.0))
'(1 4)
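
To see why the order matters, consider this sketch of a two-predicate conjunction. The name my-conjoin is hypothetical, and the built-in conjoin accepts any number of predicates. Because and stops at the first false result, the second predicate is never applied to a value that the first one rejects.

```racket
#lang racket

;; Hypothetical two-predicate version of conjoin.
(define my-conjoin
  (lambda (pred1? pred2?)
    (lambda (val)
      ;; and evaluates left to right and stops at the first #f,
      ;; so pred2? only sees values for which pred1? holds.
      (and (pred1? val) (pred2? val)))))

(filter (my-conjoin integer? exact?) (list 1 1/2 3.4 4 "two" 'three 8.0))
```

The result is '(1 4). Swapping the two predicates would apply exact? to "two" and trigger the contract violation shown above.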

We can use conjunction to find all of the capitals north of latitude 39.72 whose names start with a vowel.

> (filter (conjoin (o (section > <> 39.72)
                      caddr)
                   (o char-vowel?
                      (section string-ref <> 0)
                      cadr))
          capitals)
  
'(("Indiana" "Indianapolis" 39.790942 -86.147685)
  ("Maine" "Augusta" 44.323535 -69.765261)
  ("New York" "Albany" 42.659829 -73.781339)
  ("Washington" "Olympia" 47.042418 -122.893077))

Note that we could also have written two filter operations to achieve this goal. However, this approach is more efficient because we do not produce an intermediate list.
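
The two approaches can be compared on the earlier list of mixed values. This is only a sketch; the names are hypothetical, and no timing is measured here.

```racket
#lang racket

(define mixed-values (list 1 1/2 3.4 4 "two" 'three 8.0))

;; Two passes: the inner filter builds the intermediate list
;; '(1 4 8.0), which the outer filter then traverses again.
(define two-passes
  (filter exact? (filter integer? mixed-values)))

;; One pass: each value is tested by both predicates in turn,
;; and no intermediate list is built.
(define one-pass
  (filter (conjoin integer? exact?) mixed-values))

two-passes
one-pass
```

Both produce '(1 4). In the two-pass version, the integer? filter must be the inner one so that exact? only ever sees numbers.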

Disjunction

As you might guess, disjunction is the predicate equivalent of or. The disjunction procedure in DrRacket is called disjoin. Let’s use disjoin to select all the symbols and strings from a list.

> (filter symbol? (list 1 'two 3.0 4.5 "five" "six" 'seven 80/10 9+10i 'ten))
'(two seven ten)
> (filter string? (list 1 'two 3.0 4.5 "five" "six" 'seven 80/10 9+10i 'ten))
'("five" "six")
> (filter (disjoin symbol? string?) (list 1 'two 3.0 4.5 "five" "six" 'seven 80/10 9+10i 'ten))
'(two "five" "six" seven ten)
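
As with the other combinators, a two-predicate disjunction might be sketched as follows. The name my-disjoin is hypothetical, and the built-in disjoin accepts any number of predicates.

```racket
#lang racket

;; Hypothetical two-predicate version of disjoin.
(define my-disjoin
  (lambda (pred1? pred2?)
    (lambda (val)
      ;; or evaluates left to right and stops at the first true result.
      (or (pred1? val) (pred2? val)))))

(filter (my-disjoin symbol? string?)
        (list 1 'two 3.0 4.5 "five" "six" 'seven 80/10 9+10i 'ten))
```

This selects the same symbols and strings as the disjoin example above, in their original order.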

Self checks

You can find the list of state capitals in the self check in the reading on tables.

Check one: Some predicates

a. What does the predicate (section < 45 <> 55) compute?

b. Check your answer by filtering the list of capitals by that predicate.

c. What does the following compute?

> (filter (conjoin (o (section > <> 39.72)
                      caddr)
                   (o (section < <> 89.5)
                      cadddr))
          capitals)

d. Check your answer.

Check two: Strange combinations

What does each of the following predicates compute when used with the list of capitals?

a. (o odd? truncate caddr)

b. (o odd? truncate cadddr)

c. (negate (o odd? truncate caddr))

d. (negate (o odd? truncate cadddr))

e. (conjoin (o odd? truncate caddr) (o odd? truncate cadddr))

f. (disjoin (o odd? truncate caddr) (o odd? truncate cadddr))

g. (conjoin (negate (o odd? truncate caddr)) (negate (o odd? truncate cadddr)))

h. (negate (conjoin (o odd? truncate caddr) (o odd? truncate cadddr)))