CSC 301.01, Class 40: Knuth-Morris-Pratt, concluded

Overview

  • Preliminaries
    • Notes and news
    • Upcoming work
    • Extra credit
    • Questions
  • Review of Knuth-Morris-Pratt
  • Backtracking multiple times
  • Running time analysis
  • Building the table

News / Etc.

  • Add/Drop period has started. I’m letting CSC 322 over-enroll to 24.

Upcoming work

Extra Credit (Academic/Artistic)

  • CS Table Tuesday: Esoteric PLs

Extra credit (Peer)

  • Pub-free quiz, Wednesday

Extra Credit (Misc)

  • Newtown film on Tuesday.

Other good things

  • Musical this weekend.
  • More music stuff.

Questions

For swap, will we only be swapping neighboring characters?
Yes.

Review of Knuth-Morris-Pratt

Inputs:
  target, a string
  source, a string
  P, a table in which P[s] gives the number of already-matched characters to preserve after a mismatch at source[s]
Steps:     
01:  t = 0; // Index into target
02:  s = 0; // Index into source
03:  while (t < length(target))
04:    if (s == length(source))
05:      return MATCH at t-s.
06:    else if (target[t] == source[s])
07:      ++t;
08:      ++s;
09:    else if s == 0
10:      ++t
11:    else
12:      s = P[s]
13:    end if
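14:  end while
15:  if (s == length(source))
16:    return MATCH at t-s. // The match ends at the last character of target
17:  return NO MATCH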
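
The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the version written in class: the function name kmp_search and the -1 return value for "no match" are assumptions, the table P is supplied by the caller, and the check for a complete match is done right after advancing so that a match ending at the last character of target is still reported.

```python
def kmp_search(target, source, P):
    """Return the index in target where source first matches, or -1.

    P[s] is the number of matched characters to keep after a
    mismatch at source[s] (the table is assumed to be given).
    """
    if len(source) == 0:
        return 0  # assumption: the empty string matches at position 0
    t = 0  # index into target
    s = 0  # index into source
    while t < len(target):
        if target[t] == source[s]:
            t += 1
            s += 1
            if s == len(source):
                return t - s  # complete match
        elif s == 0:
            t += 1  # nothing to preserve; move on in target
        else:
            s = P[s]  # backtrack in source; t stays put
    return -1


print(kmp_search("abababc", "ababc", [0, 0, 0, 0, 2]))
```

With the table 0 0 0 0 2 for ababc, the call above backtracks once and then finishes the match.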

Questions?

Backtracking multiple times

I had suggested that line 12 could be executed in consecutive iterations of the loop. Can you come up with an example in which that happens in more than two iterations?

Consider:

  • source: ababc
  • target1: abababc
  • target2: ababdababc
  • table: 0: 0, 1: 0, 2: 0, 3: 0, 4: 2. Or, lined up under the source:
a b a b c
0 0 0 0 2

When matching source against target1, we’re fine until we hit position 4. At that point, we rewind s to 2, and compare a in source to a in target, and move on. Only one backtrack.

a b a b c
a b a b a b c
        * Fail

Try here:
    a b a b c
a b a b a b c
        * Match
    a b a b c
a b a b a b c
          * Match
    a b a b c
a b a b a b c
            * Match
Done

When matching source against target2, we’re fine until we hit position 4. At that point, we rewind s to 2, and compare a at position 2 in source to d at position 4 in target. That causes another backtrack to position 0. The a does not match the d, so we’re now on to line 10, and we advance in the target.

a b a b c
a b a b d a b a b c
* MATCH
a b a b c
a b a b d a b a b c
  * MATCH
a b a b c
a b a b d a b a b c
    * MATCH
a b a b c
a b a b d a b a b c
      * MATCH
a b a b c
a b a b d a b a b c
        * FAIL, SHIFT
    a b a b c
a b a b d a b a b c
        * FAIL, SHIFT
        a b a b c
a b a b d a b a b c
        * FAIL, ADVANCE
          a b a b c
a b a b d a b a b c
          *
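
To check a hand trace like the one above, it can help to have the loop record what it does at each step. The following is a rough sketch under assumptions: the name trace_kmp and the event labels are made up for illustration, and the table 0 0 0 0 2 is the one given for ababc.

```python
def trace_kmp(target, source, P):
    """Return one (t, s, event) tuple per loop iteration.

    Stops at the end of target or as soon as source is fully matched.
    """
    events = []
    t = s = 0
    while t < len(target) and s < len(source):
        if target[t] == source[s]:
            events.append((t, s, "MATCH"))
            t += 1
            s += 1
        elif s == 0:
            events.append((t, s, "FAIL, ADVANCE"))
            t += 1
        else:
            events.append((t, s, "FAIL, SHIFT"))
            s = P[s]  # corresponds to line 12 of the pseudocode
    return events


P = [0, 0, 0, 0, 2]
# Show only what happens while t is stuck at position 4 of target2.
for t, s, event in trace_kmp("ababdababc", "ababc", P):
    if t == 4:
        print(t, s, event)
```

For target2 the shift step shows up twice in a row at t = 4 (with s = 4, then s = 2) before the advance; for target1 it shows up only once.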

Sam suggests

0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6
a b a c a b a c a b a c a b a c d
* 0 * 1 * 1 * 3 4 * * 1 * * * * 8

Running time analysis

  • In cases one and two, we advance in the target string.
  • In case three, we retreat in the source string. And we can do that repeatedly.
  • How do we know that the algorithm is still O(m + build-table(n)), where m is the length of target and n is the length of source?
  • We’re going to amortize. We can only move backwards in the source string if we’ve moved forward in the source string.
  • We only move forward in the source string when we move forward in the target string. (If we move forward in the source string, we move forward in the target string.)
  • You can move forward in the target string at most m times.
  • You can move forward in the source string at most m times.
  • You can move backward in the source string at most m times.
  • It’s a linear algorithm, plus the cost of building the table.
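
The amortized argument above can be checked empirically by counting moves. This is only a sketch: the name count_moves is hypothetical, and the table 0 0 1 2 for the source aaab is an assumed table in which P[s] is the length of the longest prefix of source that is also a proper suffix of the first s characters.

```python
def count_moves(target, source, P):
    """Count advances in target and backtracks in source.

    Runs the matching loop until target is exhausted or source
    is fully matched, and returns (advances, backtracks).
    """
    advances = 0
    backtracks = 0
    t = s = 0
    while t < len(target) and s < len(source):
        if target[t] == source[s]:
            t += 1
            s += 1
            advances += 1
        elif s == 0:
            t += 1
            advances += 1
        else:
            s = P[s]  # strictly decreases s, so it must be "paid for"
            backtracks += 1
    return advances, backtracks


advances, backtracks = count_moves("aaaaaaaa", "aaab", [0, 0, 1, 2])
m = len("aaaaaaaa")
# Every backtrack undoes at least one earlier advance in source.
print(advances <= m, backtracks <= advances)
```

Since each backtrack lowers s by at least one and s only rises when t does, the loop runs at most 2m times in total.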

Building the table

We’ll build the table using something like the algorithm above.

We will try to match the string to itself at each position, but do so efficiently.

How would you build the table? (Assume the length of source is n.)

Here’s a not-quite-right solution.

01: P[0] = *
02: p = -1
03: for s = 1 to n-1
04:   // Find the first match of the character at s
05:   while p > 0 and source[p+1] != source[s]
06:     p = P[p]
07:   if source[p+1] == source[s]
08:     p = p+1
09:   P[s] = p
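
For comparison, here is one way the table could be built in Python. It is a sketch of the standard prefix-function construction, not necessarily the corrected version of the pseudocode above or the one developed in class. The names build_table and skip_doomed_fallbacks are hypothetical, and index 0 is set to 0 on the assumption that the matching loop handles s == 0 separately.

```python
def build_table(source):
    """P[s] = length of the longest proper prefix of source[:s]
    that is also a suffix of source[:s]."""
    n = len(source)
    P = [0] * n
    k = 0  # border length of the prefix processed so far
    for s in range(2, n):
        # Shrink the border until it can be extended by source[s-1].
        while k > 0 and source[k] != source[s - 1]:
            k = P[k]
        if source[k] == source[s - 1]:
            k += 1
        P[s] = k
    return P


def skip_doomed_fallbacks(source, P):
    """Optional refinement: if falling back to P[s] would compare the
    same character that just failed, fall back further right away."""
    Q = list(P)
    for s in range(1, len(source)):
        if source[P[s]] == source[s]:
            Q[s] = Q[P[s]]
    return Q


plain = build_table("ababc")
print(plain)
print(skip_doomed_fallbacks("ababc", plain))
```

For ababc the plain table comes out as 0 0 0 1 2, and the refined one as 0 0 0 0 2, which is the table used in the backtracking examples. Either table works with the matching loop; the refined one just avoids a comparison that is certain to fail.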

Let’s talk about the final

In class, Wednesday morning of finals week.

  • Exam will be four or five problems.
  • Two pages of notes plus textbooks. Sam will bring copies of the textbooks or printouts of the appropriate sections.
    • 8.5x11 inches or A4.
    • No more than 1 mm thick.
    • Closed computer.
  • Mostly applying algorithms and techniques.
    • E.g., Here’s a red-black tree. Remove the root or add an element.
    • E.g., Here’s a loop invariant and some code; finish the code. (Code will normally be in C, the most [fill in adjective] of languages.)
      • No memory leaks to worry about.
  • Full semester!
    • Yes, you need to know (or be able to read) every damn algorithm we’ve talked about.