CSC 301.01, Class 39: Improved string-matching algorithms

Overview

• Preliminaries
• Notes and news
• Upcoming work
• Extra credit
• Questions
• Quick review
• The hash-code approach
• Building the table

News / Etc.

• Warning! Our Chair is visiting class this week.
• Cut/Close/Balance period finishes on Friday at 5:00 p.m.
• You may start Add/Drop on Monday the 4th.

Upcoming work

• NEW CS Table Tuesday: Exotic PLs

Extra credit (Peer)

• Swim meet this coming weekend. 6pm to ??? today. 9am tomorrow.
• Chamber Ensembles Saturday at 4:00 p.m.
• One Acts this weekend.

Extra Credit (Misc)

• Mental Health Campus Resource Fair. Friday at 4pm in JRC 209.
• NEW Newtown film on Tuesday.

Other good things

• Festival of trees at Drake THIS afternoon
• Jazz Ensemble Concert TONIGHT at 7:30 p.m.
• YGB Saturday at 2:00 p.m.
• Collegium Concert Sunday at 2:00 p.m.

Quick review

Our goal: Given a source of length n and a target of length m, find the first (or all) matches of the source in the target.

Approach one: Try every position.

• Correct.
• Potentially inefficient. O(nm)
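
As a point of reference, here is a minimal Python sketch of "try every position". The function name naive_find and the convention of returning -1 for no match are assumptions made for illustration, not something fixed in class.

```python
def naive_find(source, target):
    """Return the index of the first match of source in target, or -1."""
    n, m = len(source), len(target)
    for i in range(m - n + 1):          # every candidate start position
        k = 0
        while k < n and target[i + k] == source[k]:
            k += 1                      # up to n comparisons per position
        if k == n:
            return i
    return -1

print(naive_find("aaab", "aaaacaaaaab"))
```

The nested loops are where the O(nm) bound comes from: up to m - n + 1 start positions, each costing up to n character comparisons.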

The hash-code approach

a.k.a. the Rabin-Karp algorithm

• Write a hash function that is easy to update
• Constant time to compute hash(t[i+1..i+n]) from hash(t[i..i+n-1]), t[i], and t[i+n].
• Compute the hash code of the source.
• Compute each hash code in the target.
• If any hash codes match, compare the individual strings, O(n)

What’s the typical hash function?

Let alpha be the number of letters in the alphabet. (or “a prime”) (or “a prime at least as large as the number of letters in the alphabet”)

s[i]*alpha^(n-1) + s[i+1]*alpha^(n-2) + ... + s[i+n-1]*alpha^0


or

s[i]*alpha^0 + s[i+1]*alpha^1 + ... + s[i+n-1]*alpha^(n-1)


We use the former. That allows us to update by chopping off the leftmost term (subtracting s[i]*alpha^(n-1)), multiplying by alpha, and adding the new lowest-order term, s[i+n].

Note: We only need to compute alpha^(n-1) once.
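
The update can be sketched in Python as follows. This is only an illustration under some assumptions: ALPHA = 256 and the use of ord() for character values are choices made here, and the names window_hash and roll are hypothetical. Python integers do not overflow, so no modulus is needed yet.

```python
ALPHA = 256  # assumed alphabet size (one value per byte); not fixed by the notes

def window_hash(s, i, n):
    """Hash of s[i..i+n-1]: s[i]*ALPHA^(n-1) + ... + s[i+n-1]*ALPHA^0."""
    h = 0
    for k in range(n):
        h = h * ALPHA + ord(s[i + k])
    return h

def roll(h, s, i, n, top):
    """Given h = hash of s[i..i+n-1], return the hash of s[i+1..i+n].
    top must equal ALPHA ** (n - 1), computed once by the caller."""
    h -= ord(s[i]) * top   # chop off the leftmost term
    h *= ALPHA             # shift every remaining term up one power
    h += ord(s[i + n])     # add the new lowest-order term
    return h

t = "abracadabra"
n = 4
top = ALPHA ** (n - 1)     # computed once, as noted above
h = window_hash(t, 0, n)
for i in range(len(t) - n):
    h = roll(h, t, i, n, top)
    assert h == window_hash(t, i + 1, n)   # rolled value agrees with a fresh hash
```

Each call to roll does a fixed number of arithmetic operations, regardless of n.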

Why might Sam think this is not an O(n+m) algorithm?

• To avoid overflow, we need to mod by something (MAX_INT, some large prime)
• That means different strings can share a hash code, so we have to compare the actual strings whenever the hash codes match.
• If we are really unlucky, we can match, and then match again, and then match again.
• What’s an input that causes problems? Find two characters, a and b, that hash (after the mod) to the same value.
• source is aaaaaaab
• target is aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab
• We may have to do longer strings for a more sensible modulus, but we will still find some overlaps.
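
To see the bad case concretely, here is a hedged Python sketch of the hash-code approach that also counts how many full string comparisons it performs. The function name rabin_karp, alpha = 256, and both moduli are assumptions chosen for the demo. Because it uses ord() values, the colliding pair is 'a' (97) and 'h' (104) with modulus 7, standing in for the a and b above.

```python
def rabin_karp(source, target, alpha=256, mod=7):
    """Return (index of first match or -1, number of full string comparisons)."""
    n, m = len(source), len(target)
    if n == 0 or n > m:
        return -1, 0
    top = pow(alpha, n - 1, mod)        # alpha^(n-1) mod `mod`, computed once
    hs = ht = 0
    for k in range(n):                  # hash the source and the first window
        hs = (hs * alpha + ord(source[k])) % mod
        ht = (ht * alpha + ord(target[k])) % mod
    checks = 0
    for i in range(m - n + 1):
        if hs == ht:                    # hash codes match: must verify, O(n)
            checks += 1
            if target[i:i + n] == source:
                return i, checks
        if i + n < m:                   # roll the window one position right
            ht = ((ht - ord(target[i]) * top) * alpha + ord(target[i + n])) % mod
    return -1, checks

src = "a" * 7 + "h"
tgt = "a" * 33 + "h"
print(rabin_karp(src, tgt, mod=7))          # tiny modulus: every window collides
print(rabin_karp(src, tgt, mod=1_000_003))  # larger prime: only the real match
```

With the tiny modulus, every one of the 27 windows triggers a full comparison. With the larger prime, only the true match at index 26 does.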

Why might Sam be wrong?

It is rare that the modulus is smaller than the number of character values, so two single characters rarely collide.

• Randomness may help!
• Perhaps we can then say “expected O(n+m)”

Let us return to the original (try matching at position 0, then at position 1, then at position 2, …).

Suppose our source is aaab and our target is aaaa___. What should we do upon seeing that the b and the fourth a don’t match? In particular, do we really have to look at t[1] and t[2] again?

• We know that t[1] is a and t[2] is a, so we can next match t[3] to s[2]

We can do even better if we try to match “aaab” against “aaaaac”. Once we hit the c, we can start again immediately afterwards.

If we develop some knowledge about the source, we can figure out how much of a prefix of the source string we can keep when we have a failed match.

a a a b
0 0 0 2
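
One possible way to compute such a table, shown only as a brute-force sketch: it is not necessarily how the table gets built in class. The rule assumed here is that P[j] is the largest k < j such that the first k characters of the source equal the k characters just before position j, and source[k] differs from source[j] (so we do not retry a character that is sure to fail). If no such k exists, P[j] = 0. The name preserve_table is hypothetical.

```python
def preserve_table(source):
    """Brute-force 'preserve' table under the rule stated above."""
    P = [0] * len(source)
    for j in range(1, len(source)):
        for k in range(j - 1, 0, -1):               # try the longest prefix first
            keeps_prefix = source[:k] == source[j - k:j]
            if keeps_prefix and source[k] != source[j]:
                P[j] = k
                break
    return P

print(preserve_table("aaab"))
```

For aaab this reproduces the row 0 0 0 2 shown above. The sketch favors clarity over speed; it does far more comparisons than a linear-time construction would.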


Knuth-Morris-Pratt: Match using this “preserve” table.

Inputs:
  target, a string
  source, a nonempty string
  P, the table described above
Steps:
  i = 0; // Index into target
  j = 0; // Index into source
  while (i < length(target))
    if (target[i] == source[j])
      ++i;
      ++j;
      // Check here so a match that ends at the last character of target is reported
      if (j == length(source))
        return MATCH at i-j
    else if (j == 0)
      ++i;
    else
      j = P[j];
  return NO MATCH
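
A runnable Python sketch of the matching loop above. It takes the table P as an argument rather than building it, and it tests for a complete match immediately after advancing j, so that a match ending at the last character of the target is still reported. The name kmp_find and the -1 return value for no match are assumptions.

```python
def kmp_find(source, target, P):
    """Return the index of the first match of source in target, or -1.
    P[j] says how much of the source prefix to keep after a mismatch at j."""
    if not source:
        return 0                        # assumed convention for an empty source
    i = 0                               # index into target
    j = 0                               # index into source
    while i < len(target):
        if target[i] == source[j]:
            i += 1
            j += 1
            if j == len(source):        # matched the whole source
                return i - j
        elif j == 0:
            i += 1                      # nothing to preserve; move on
        else:
            j = P[j]                    # keep a prefix; i does not move back
    return -1

print(kmp_find("aaab", "aaaacaaaaab", [0, 0, 0, 2]))
```

Note that i never decreases, which is the key difference from trying every position.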


Run this algorithm on target of aaaacaaaaab with source of aaab.

• P[0] = 0; P[1] = 0; P[2] = 0; P[3] = 2

E.g.,

• i = 0, j = 0, source[j] = a, target[i] = a, MATCH
• i = 1, j = 1, source[j] = a, target[i] = a, MATCH
• i = 2, j = 2, source[j] = a, target[i] = a, MATCH
• i = 3, j = 3, source[j] = b, target[i] = a, FAIL TO MATCH, so j = P[3] = 2
• i = 3, j = 2, source[j] = a, target[i] = a, MATCH