# CSC 301.01, Class 39: Improved string-matching algorithms

*Overview*

- Preliminaries
- Notes and news
- Upcoming work
- Extra credit
- Questions

- Quick review
- The hash-code approach
- Keeping track of look-ahead
- Building the table

### News / Etc.

- Warning! Our Chair is visiting class this week.
- Cut/Close/Balance period finishes on Friday at 5:00 p.m.
- You may start Add/Drop on Monday the 4th.

### Upcoming work

- Homework 10 due next Wednesday.

### Extra Credit (Academic/Artistic)

- **NEW** CS Table Tuesday: Exotic PLs

### Extra credit (Peer)

- Swim meet this coming weekend. 6pm to ??? today. 9am tomorrow.
- Chamber Ensembles Saturday at 4:00 p.m.
- One Acts this weekend.

### Extra Credit (Misc)

- Mental Health Campus Resource Fair. Friday at 4pm in JRC 209.
- **NEW** Newtown film on Tuesday.

### Other good things

- Festival of trees at Drake THIS afternoon
- Jazz Ensemble Concert TONIGHT at 7:30 p.m.
- YGB Saturday at 2:00 p.m.
- Collegium Concert Sunday at 2:00 p.m.

### Questions

## Quick review

Our goal: Given a *source* of length n and a *target* of length m, find the
first (or all) matches of the source in the target.

Approach one: Try every position.

- Correct.
- Potentially inefficient. O(nm)
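
The try-every-position approach can be sketched as follows. This is a minimal Python sketch, not code from class; the name `naive_find` and the convention of returning -1 when there is no match are assumptions made for illustration.

```python
def naive_find(target, source):
    """Return the first index at which source occurs in target, or -1."""
    n = len(source)  # length of the source (what we look for)
    m = len(target)  # length of the target (what we look in)
    for start in range(m - n + 1):
        # Compare the source against the target window beginning at start.
        k = 0
        while k < n and target[start + k] == source[k]:
            k += 1
        if k == n:
            return start  # every character of the source matched
    return -1


print(naive_find("aaaacaaaaab", "aaab"))  # 7
```

Each of the roughly m starting positions can cost up to n character comparisons, which is where the O(nm) bound comes from.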

## The hash-code approach

*a.k.a. the Rabin-Karp algorithm*

- Write a hash function that is easy to update
- Constant time to compute hash(t[i+1..i+n]) from hash(t[i..i+n-1]), t[i], and t[i+n].

- Compute the hash code of the source.
- Compute each hash code in the target.
- If any hash codes match, compare the strings character by character, O(n) per comparison.

*What’s the typical hash function?*

Let `alpha` be the number of letters in the alphabet (or “a prime”, or “a prime at least as large as the number of letters in the alphabet”).

```
s[i]*alpha^(n-1) + s[i+1]*alpha^(n-2) + ... + s[i+n-1]*alpha^0
```

or

```
s[i]*alpha^0 + s[i+1]*alpha^1 + ... + s[i+n-1]*alpha^(n-1)
```

We use the former. That allows us to update by chopping off the leftmost (highest-order) term, multiplying by alpha, and adding the new lowest-order term.

Note: We only need to compute alpha^(n-1) once.
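
The hash-code approach, using the first polynomial above and the chop/multiply/add update, might look like the sketch below. The function name `rabin_karp_find`, the default `alpha=256`, and the modulus `1_000_000_007` are assumptions chosen for illustration, not values from class.

```python
def rabin_karp_find(target, source, alpha=256, modulus=1_000_000_007):
    """Return the first index at which source occurs in target, or -1."""
    n = len(source)
    m = len(target)
    if n == 0:
        return 0
    if n > m:
        return -1
    high = pow(alpha, n - 1, modulus)  # alpha^(n-1), computed only once
    source_hash = 0
    window_hash = 0
    for k in range(n):
        # Horner's rule gives s[0]*alpha^(n-1) + ... + s[n-1]*alpha^0.
        source_hash = (source_hash * alpha + ord(source[k])) % modulus
        window_hash = (window_hash * alpha + ord(target[k])) % modulus
    for i in range(m - n + 1):
        # Equal hash codes only suggest a match, so compare the strings too.
        if window_hash == source_hash and target[i:i + n] == source:
            return i
        if i + n < m:
            # Chop off the leftmost term, multiply by alpha, add the new term.
            window_hash = (window_hash - ord(target[i]) * high) % modulus
            window_hash = (window_hash * alpha + ord(target[i + n])) % modulus
    return -1


print(rabin_karp_find("aaaacaaaaab", "aaab"))  # 7
```

Python integers do not overflow, so the modulus here only keeps the hash codes small; in a language with fixed-width integers it is what prevents overflow.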

*Why might Sam think this is not an O(n+m) algorithm?*

- To avoid overflow, we need to mod by something (MAX_INT, some large prime)
- That means we have to check after the hash codes match.
- If we are *really* unlucky, we can match, and then match again, and then match again.
- What’s an input that causes problems? Find two characters, a and b, that hash to the same value after the mod.
- source is aaaaaaab
- target is aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab

- We may have to do longer strings for a more sensible modulus, but we will still find some overlaps.
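
To see the bad case concretely, the sketch below counts how many windows of the target have the same hash code as the source; each such window costs an O(n) string comparison. Everything here is an assumption for illustration: the name `count_hash_hits`, the deliberately tiny modulus 101, and the use of the character with code point 198 as the “b” that collides with “a” (198 and 97 differ by exactly 101). It also assumes the source is no longer than the target.

```python
def count_hash_hits(target, source, alpha, modulus):
    """Count the target windows whose hash code equals the source's."""
    n = len(source)
    m = len(target)
    high = pow(alpha, n - 1, modulus)
    source_hash = 0
    window_hash = 0
    for k in range(n):
        source_hash = (source_hash * alpha + ord(source[k])) % modulus
        window_hash = (window_hash * alpha + ord(target[k])) % modulus
    hits = 0
    for i in range(m - n + 1):
        if window_hash == source_hash:
            hits += 1  # a real matcher would now compare n characters
        if i + n < m:
            window_hash = (window_hash - ord(target[i]) * high) % modulus
            window_hash = (window_hash * alpha + ord(target[i + n])) % modulus
    return hits


SMALL_MODULUS = 101
A = "a"
B = chr(ord("a") + SMALL_MODULUS)  # code point 198, same as 'a' mod 101
source = A * 7 + B
target = A * 33 + B

print(count_hash_hits(target, source, 256, SMALL_MODULUS))  # 27
print(count_hash_hits(target, source, 256, 1_000_000_007))  # 1
```

With the tiny modulus, all 27 windows look like matches and 26 of them fail only at the last character; with the large prime, only the real match is checked.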

*Why might Sam be wrong?*

It is rare that the modulus is smaller than the number of values.

- Randomness may help!
- Perhaps we can then say “expected O(n+m)”

## Keeping track of look-ahead

Let us return to the original (try matching at position 0, then at position 1, then at position 2, …).

Suppose our source is `aaab` and our input is `aaaa___`. What should we do upon seeing that the `b` and the fourth `a` don’t match? In particular, do we really have to look at t[1] and t[2]?

- No. We already know that t[1] is a and t[2] is a, so we can next compare t[3] with s[2].

We can do even better if we try to match “aaab” against “aaaaac”. Once we hit the c, we can start again immediately afterwards.

If we develop some knowledge about the source, we can figure out how much of a prefix of the source string we can keep when we have a failed match.

```
a a a b
0 0 0 2
```

Knuth-Morris-Pratt: Match using this “preserve” table.

```
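Inputs:
  target, a string
  source, a string
  P, the table described above
Steps:
  i = 0; // Index into target
  j = 0; // Index into source
  while (i < length(target) && j < length(source))
    if (target[i] == source[j])
      ++i;
      ++j;
    else if (j == 0)
      ++i;
    else
      j = P[j];
  // Check after the loop so that a match ending at the
  // last character of target is still reported.
  if (j == length(source))
    return MATCH at i-j.
  else
    return NO MATCH.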
```
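
A runnable Python sketch of the matching loop follows. The name `kmp_find`, the parameter name `table`, and the -1 return value for “no match” are assumptions for illustration. The table is passed in by hand, as in the pseudocode, and the test for a completed match happens after the loop so that a match ending at the last character of the target is reported.

```python
def kmp_find(target, source, table):
    """Return the first index at which source occurs in target, or -1.

    table[j] is the position in source to resume from after a
    mismatch at source[j] (the "preserve" table, built elsewhere).
    """
    i = 0  # index into target
    j = 0  # index into source
    while i < len(target) and j < len(source):
        if target[i] == source[j]:
            i += 1
            j += 1
        elif j == 0:
            i += 1  # nothing to preserve; move on in the target
        else:
            j = table[j]  # keep part of the prefix; i does not move back
    return i - j if j == len(source) else -1


print(kmp_find("aaaacaaaaab", "aaab", [0, 0, 0, 2]))  # 7
```

Note that `i` never decreases, which is the point of the approach: no character of the target is revisited from an earlier position.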

*Run this algorithm on target of aaaacaaaaab with source of aaab.*

- P[0] = 0; P[1] = 0; P[2] = 0; P[3] = 2

E.g.,

- i = 0, j = 0, source[j] = a, target[i] = a, MATCH
- i = 1, j = 1, source[j] = a, target[i] = a, MATCH
- i = 2, j = 2, source[j] = a, target[i] = a, MATCH
- i = 3, j = 3, source[j] = b, target[i] = a, FAIL TO MATCH, so j = P[3] = 2
- i = 3, j = 2, source[j] = a, target[i] = a, MATCH

### Think about this weekend

*Build a table for this pattern*

- pattern: a b a c a b

How would you start?