CSC 301.01, Class 39: Improved string-matching algorithms
- Notes and news
- Upcoming work
- Extra credit
- Quick review
- The hash-code approach
- Keeping track of look-ahead
- Building the table
News / Etc.
- Warning! Our Chair is visiting class this week.
- Cut/Close/Balance period finishes on Friday at 5:00 p.m.
- You may start Add/Drop on Monday the 4th.
- Homework 10 due next Wednesday.
Extra Credit (Academic/Artistic)
- NEW CS Table Tuesday: Exotic PLs
Extra credit (Peer)
- Swim meet this coming weekend. 6pm to ??? today. 9am tomorrow.
- Chamber Ensembles Saturday at 4:00 p.m.
- One Acts this weekend.
Extra Credit (Misc)
- Mental Health Campus Resource Fair. Friday at 4pm in JRC 209.
- NEW Newtown film on Tuesday.
Other good things
- Festival of trees at Drake THIS afternoon
- Jazz Ensemble Concert TONIGHT at 7:30 p.m.
- YGB Saturday at 2:00 p.m.
- Collegium Concert Sunday at 2:00 p.m.
Our goal: Given a source of length n and a target of length m, find the first (or all) matches of the source in the target.
Approach one: Try every position.
- Potentially inefficient. O(nm)
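The try-every-position approach can be sketched as follows. This is a minimal illustration, not code from class; the function name `naive_search` is made up for this example. The nested loops are where the O(nm) bound comes from.

```python
def naive_search(target: str, source: str) -> int:
    """Return the index of the first match of source in target, or -1."""
    n, m = len(source), len(target)
    for start in range(m - n + 1):
        # Compare source against target[start:start+n], one character at a time.
        k = 0
        while k < n and target[start + k] == source[k]:
            k += 1
        if k == n:
            return start
    return -1


print(naive_search("aaaacaaaaab", "aaab"))  # 7
```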
The hash-code approach
a.k.a. the Rabin-Karp algorithm
- Write a hash function that is easy to update
- Constant time to compute hash(t[i+1..i+n]) from hash(t[i..i+n-1]), t[i], and t[i+n].
- Compute the hash code of the source.
- Compute each hash code in the target.
- If any hash codes match, compare the individual strings, O(n)
What’s the typical hash function?
Let alpha be the number of letters in the alphabet (or “a prime”, or “a prime at least as large as the number of letters in the alphabet”). Two candidates for the hash of the n characters starting at s[i]:
s[i]*alpha^(n-1) + s[i+1]*alpha^(n-2) + ... + s[i+n-1]*alpha^0
s[i]*alpha^0 + s[i+1]*alpha^1 + ... + s[i+n-1]*alpha^(n-1)
We use the former. That allows us to update by chopping off the leftmost term, multiplying by alpha, and adding the lower order term.
Note: We only need to compute alpha^(n-1) once.
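Here is a hedged sketch of the whole hash-code approach using the first (high-order-first) polynomial. The name `rabin_karp`, the default `alpha = 256`, and the default modulus are assumptions made for this example, not values from class. The update line does exactly the chop, multiply, add described above, and `high` is the alpha^(n-1) that is computed only once.

```python
def rabin_karp(target: str, source: str,
               alpha: int = 256, modulus: int = 1_000_000_007) -> int:
    """Return the index of the first match of source in target, or -1."""
    n, m = len(source), len(target)
    if n == 0:
        return 0
    if n > m:
        return -1

    high = pow(alpha, n - 1, modulus)  # alpha^(n-1), computed once

    def full_hash(s: str) -> int:
        # s[0]*alpha^(n-1) + ... + s[n-1]*alpha^0, reduced mod modulus
        h = 0
        for ch in s:
            h = (h * alpha + ord(ch)) % modulus
        return h

    source_hash = full_hash(source)
    window_hash = full_hash(target[:n])
    for i in range(m - n + 1):
        # Equal hash codes only suggest a match, so compare the strings too.
        if window_hash == source_hash and target[i:i + n] == source:
            return i
        if i + n < m:
            # Chop off the leftmost term, multiply by alpha, add the new character.
            window_hash = ((window_hash - ord(target[i]) * high) * alpha
                           + ord(target[i + n])) % modulus
    return -1


print(rabin_karp("aaaacaaaaab", "aaab"))  # 7
```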
Why might Sam think this is not an O(n+m) algorithm?
- To avoid overflow, we need to mod by something (MAX_INT, some large prime)
- That means we have to check after the hash codes match.
- If we are really unlucky, we can match, and then match again, and then match again.
- What’s an input that causes problems? Find two characters, a and b, that hash to the same value mod the modulus.
- source is aaaaaaab
- target is aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab
- We may have to do longer strings for a more sensible modulus, but we will still find some overlaps.
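To see the unlucky case concretely, here is a small experiment. It is a sketch under loud assumptions: the colliding characters are 'a' and 'h' (their character codes differ by 7), the modulus is a deliberately tiny 7, and `count_false_hits` is a made-up helper. It counts the windows whose hash code matches the source even though the strings differ; each one costs an O(n) comparison.

```python
def count_false_hits(target: str, source: str, alpha: int, modulus: int) -> int:
    """Count windows whose hash equals the source's hash but whose text differs.

    Assumes 0 < len(source) <= len(target).
    """
    n, m = len(source), len(target)
    high = pow(alpha, n - 1, modulus)

    def full_hash(s: str) -> int:
        h = 0
        for ch in s:
            h = (h * alpha + ord(ch)) % modulus
        return h

    source_hash = full_hash(source)
    window_hash = full_hash(target[:n])
    false_hits = 0
    for i in range(m - n + 1):
        if window_hash == source_hash and target[i:i + n] != source:
            false_hits += 1
        if i + n < m:
            window_hash = ((window_hash - ord(target[i]) * high) * alpha
                           + ord(target[i + n])) % modulus
    return false_hits


source = "a" * 7 + "h"   # plays the role of aaaaaaab
target = "a" * 33 + "h"  # a long run of a's, then the colliding character

print(count_false_hits(target, source, alpha=256, modulus=7))              # 26
print(count_false_hits(target, source, alpha=256, modulus=1_000_000_007))  # 0
```

With the tiny modulus, every window before the real match is a false hit; with the large modulus, none are.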
Why might Sam be wrong?
- It is rare that the modulus is smaller than the number of distinct character values, so a single-character collision like the one above is unusual.
- Randomness may help!
- Perhaps we can then say “expected O(n+m)”
Keeping track of look-ahead
Let us return to the original (try matching at position 0, then at position 1, then at position 2, …).
Suppose our source is “aaab” and our input is “aaaa___”. What should we do upon seeing that the b and the fourth a don’t match? In particular, do we really have to look at t[1] and t[2] again?
- We know that t[1] is a and t[2] is a, so we can next match t[3] to s[2].
We can do even better if we try to match “aaab” against “aaaaac”. Once we hit the c, we can start again immediately afterwards.
If we develop some knowledge about the source, we can figure out how much of a prefix of the source string we can keep when we have a failed match.
source: a a a b
P:      0 0 0 2
Knuth-Morris-Pratt: Match using this “preserve” table.
Inputs:
  target, a string
  source, a string
  P, the table described above
Steps:
  i = 0; // Index into target
  j = 0; // Index into source
  while (i < length(target))
    if (j == length(source))
      return MATCH at i-j
    else if (target[i] == source[j])
      ++i; ++j
    else if (j == 0)
      ++i
    else
      j = P[j]
  // The match may end exactly at the end of the target
  if (j == length(source))
    return MATCH at i-j
  return NO MATCH
Run this algorithm on target of aaaacaaaaab with source of aaab.
- P[0] = 0; P[1] = 0; P[2] = 0; P[3] = 2
- i = 0, j = 0, source[j] = a, target[i] = a, MATCH
- i = 1, j = 1, source[j] = a, target[i] = a, MATCH
- i = 2, j = 2, source[j] = a, target[i] = a, MATCH
- i = 3, j = 3, source[j] = b, target[i] = a, FAIL TO MATCH, so j = P[3] = 2
- i = 3, j = 2, source[j] = a, target[i] = a, MATCH
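To finish the trace mechanically, here is a minimal Python sketch of the matching loop. The name `kmp_search` is made up for this example, and the table is passed in by hand rather than built. The loop stops as soon as j reaches the end of the source, and the check after the loop also catches a match that ends exactly at the end of the target (as it does in this run).

```python
def kmp_search(target: str, source: str, P: list) -> int:
    """Return the index of the first match of source in target, or -1.

    P is the "preserve" table: on a failed match at source[j] (j > 0),
    continue comparing the same target character against source[P[j]].
    """
    i = 0  # Index into target
    j = 0  # Index into source
    while i < len(target) and j < len(source):
        if target[i] == source[j]:
            i += 1
            j += 1
        elif j == 0:
            i += 1
        else:
            j = P[j]  # Keep a prefix of the source; do not back up in the target.
    return i - j if j == len(source) else -1


print(kmp_search("aaaacaaaaab", "aaab", [0, 0, 0, 2]))  # 7
```

Note that i never decreases, which is the point of keeping track of look-ahead.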
Think about this weekend
Build a table for this pattern
- pattern: a b a c a b
How would you start?