# CSC 301.01, Class 38: Basics of string matching

*Overview*

- Preliminaries
- Notes and news
- Upcoming work
- Extra credit
- Questions

- Approximate substring matching
- Exact substring matching
- The brute-force approach
- Review: The hash-code approach

### News / Etc.

- Warning! Our Chair is visiting class this week.
- Cut/Close/Balance period finishes on Friday at 5:00 p.m.
- You may start Add/Drop on Monday the 4th.

### Upcoming work

- Homework 9 due tonight.
- Please use the folder attached to my bulletin board.

- Homework 10 due next Wednesday.
- Final is in-class on the morning of Wednesday, December 13.

### Extra Credit (Academic/Artistic)

- CS Extras Thursday: Anya’s Advisor.

### Extra Credit (Peer)

- Swim meet this coming weekend.
- Chamber Ensembles Saturday at 4:00 p.m.
- One Acts this weekend (Saturday at 8, Sunday at 2).

### Extra Credit (Misc)

- Mental Health Campus Resource Fair. Friday at 4:00 p.m. in JRC 209.

### Other good things

- Jazz Ensemble Concert Friday at 7:30 p.m.
- YGB Saturday at 2:00 p.m.
- Collegium Concert Sunday at 2:00 p.m.
- Festival of trees at Drake Friday afternoon

### Questions

- What should we do on the tourist problem if an edge includes an invalid city?
    - The preconditions of the problem have been violated.
- For the tourist problem, can we print as we go or do we have to wait until we’ve read all the input?
    - Whatever is easier.
- Should we try it online?
    - Sure. Send me the results of your tests. (Optional.)

## Approximate substring matching: review

Reminder: We use dynamic programming to match strings.

```
             source
    C    "" s1 s2 s3 s4 ...
        +--+--+--+--+--+--
t     ""|  |  |  |  |  | ...
a       +--+--+--+--+--+--
r     t1|  |  |  |  |  | ...
g       +--+--+--+--+--+--
e     t2|  |  |  |  |  | ...
t         .  .  .  .  .
```

Meaning: `C[i,j]` is the cost of converting `substring(source,i)` to `substring(target,j)`.

### The algorithm

```
C[0,0] = 0
C[i,0] = d + C[i-1,0]   (for i>0)
C[0,j] = a + C[0,j-1]   (for j>0)
C[i,j] = if (s[i] == t[j]) then
           min(d+C[i-1,j], C[i-1,j-1], a+C[i,j-1])
         else
           min(d+C[i-1,j], r+C[i-1,j-1], a+C[i,j-1])
```

```
             source
    C    "" s1 s2 s3 s4 ...
        +--+--+--+--+--+--
t     ""| 0| d|2d|3d|4d| ...
a       +--+--+--+--+--+--
r     t1| a|  |  |  |  | ...
g       +--+--+--+--+--+--
e     t2|2a|  |  |  |  | ...
t         .  .  .  .  .
```
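The recurrence and base cases above can be sketched in Python. This is a minimal sketch, not the code from class: the function name `edit_cost` and the default costs of 1 are assumptions, and the table is stored as `C[i][j]` with `i` indexing the source and `j` indexing the target, as in the recurrence.

```python
def edit_cost(source, target, a=1, d=1, r=1):
    """Cost of converting source into target.

    a, d, and r are the costs to add, delete, and replace one character.
    C[i][j] is the cost of converting the first i characters of source
    into the first j characters of target.
    """
    n, m = len(source), len(target)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    # Base cases: delete every source character, or add every target character.
    for i in range(1, n + 1):
        C[i][0] = d + C[i - 1][0]
    for j in range(1, m + 1):
        C[0][j] = a + C[0][j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Python strings are 0-indexed, so s[i] in the notes is source[i - 1].
            diag = 0 if source[i - 1] == target[j - 1] else r
            C[i][j] = min(d + C[i - 1][j],
                          diag + C[i - 1][j - 1],
                          a + C[i][j - 1])
    return C[n][m]
```

Filling the table takes one constant-time step per cell, which is where the O(mn) running time comes from.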

Note that we could also make `a`, `d`, and `r` functions of the
position and character.
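One way to picture costs that depend on position and character is to pass the three costs in as functions. The sketch below is only an illustration: the name `edit_cost_fn`, the argument order of the cost functions, and the vowel-based cost model are all hypothetical.

```python
def edit_cost_fn(source, target, add, delete, replace):
    """Edit cost where each operation's price is computed by a function.

    add(j, ch) and delete(i, ch) take a 1-based position and a character;
    replace(i, old, new) takes a 1-based source position and two characters.
    (These signatures are an assumption made for this sketch.)
    """
    n, m = len(source), len(target)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        C[i][0] = delete(i, source[i - 1]) + C[i - 1][0]
    for j in range(1, m + 1):
        C[0][j] = add(j, target[j - 1]) + C[0][j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s, t = source[i - 1], target[j - 1]
            diag = 0 if s == t else replace(i, s, t)
            C[i][j] = min(delete(i, s) + C[i - 1][j],
                          diag + C[i - 1][j - 1],
                          add(j, t) + C[i][j - 1])
    return C[n][m]


# Hypothetical cost model: swapping one vowel for another is cheap.
VOWELS = set("aeiou")

def unit(position, ch):
    return 1

def vowel_replace(position, old, new):
    return 1 if old in VOWELS and new in VOWELS else 2
```

With this model, `edit_cost_fn("pat", "pet", unit, unit, vowel_replace)` charges only for the cheap vowel swap.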

## Approximate substring matching: Additional issues

### Variant one

*How would you update the algorithm to accommodate *substring* matching
rather than whole string matching? (E.g., to figure out the best place
to align "habit" in "alphabetical"?)*

A brute force strategy

- Build the table for source vs “”
- Build the table for source vs t[1]
- Build the table for source vs t[1..2]
- Build the table for source vs t[1..3]
- …
- Build the table for source vs t[1..m]
- Build the table for source vs t[2]
- Build the table for source vs t[2..3]
- Build the table for source vs t[2..4]
- …
- Build the table for source vs t[2..m]
- Build the table for source vs t[3]
- …

This should work, in that it looks at every possible solution, but it’s
*really* inefficient.

A better solution

- Set column 0 (the column for the empty source) to all 0’s. That represents “we don’t care about leading characters” in the target.
- Minimize across the last column (the column for the full source) to find the cost of the best match.
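The two changes in the better solution can be sketched as follows. This is a sketch under assumptions: the name `best_substring_cost` is made up, costs default to 1, and returning the end position of the best match alongside its cost is an addition for illustration.

```python
def best_substring_cost(source, target, a=1, d=1, r=1):
    """Return (cost, end): the cheapest way to align source with some
    substring of target, and the index in target just past that substring.
    """
    n, m = len(source), len(target)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        C[i][0] = d + C[i - 1][0]
    # C[0][j] stays 0 for every j: skipping leading target characters is free.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = 0 if source[i - 1] == target[j - 1] else r
            C[i][j] = min(d + C[i - 1][j],
                          diag + C[i - 1][j - 1],
                          a + C[i][j - 1])
    # Minimize over the cells for the full source: trailing target
    # characters after the match are free too.
    best_end = min(range(m + 1), key=lambda j: C[n][j])
    return C[n][best_end], best_end
```

For "habit" in "alphabetical", the best alignment is with "habet", at the cost of one replacement. The table is the same size as before, so this is still O(mn), in contrast to building a separate table for every substring.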

### Variant two

*The whole string matching problem only gives us a cost. How do
we update the algorithm so that each cell includes both cost and
steps to achieve that cost? S[i,j] are the steps to achieve
C[i,j].*

```
C[0,0] = 0
S[0,0] = <>
C[i,0] = d + C[i-1,0]   (for i>0)
S[i,0] = S[i-1,0] ++ <D.i>
C[0,j] = a + C[0,j-1]   (for j>0)
S[0,j] = S[0,j-1] ++ <A.j>
C[i,j] = if (s[i] == t[j]) then
           min(d+C[i-1,j], C[i-1,j-1], a+C[i,j-1])
         else
           min(d+C[i-1,j], r+C[i-1,j-1], a+C[i,j-1])
S[i,j] = if (C[i,j] == d+C[i-1,j])
           S[i-1,j] ++ <D.i>
         ...
```
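Here is one possible Python rendering of the cost-plus-steps table. The delete case follows the recurrence above. The remaining cases are elided in the notes, so the replace and add cases below, and the step label `R.i`, are assumptions about how they might be filled in, not the version from class.

```python
def edit_steps(source, target, a=1, d=1, r=1):
    """Return (cost, steps) for converting source into target.

    Steps are strings such as "D.2" (delete source character 2) and
    "A.3" (add target character 3). "R.i" (replace source character i)
    is a hypothetical label; a matching character records no step.
    """
    n, m = len(source), len(target)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    S = [[[] for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):
        C[i][0] = d + C[i - 1][0]
        S[i][0] = S[i - 1][0] + [f"D.{i}"]
    for j in range(1, m + 1):
        C[0][j] = a + C[0][j - 1]
        S[0][j] = S[0][j - 1] + [f"A.{j}"]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            diag = C[i - 1][j - 1] + (0 if same else r)
            C[i][j] = min(d + C[i - 1][j], diag, a + C[i][j - 1])
            if C[i][j] == d + C[i - 1][j]:
                S[i][j] = S[i - 1][j] + [f"D.{i}"]
            elif C[i][j] == diag:
                # Assumed case: keep the character, or replace it.
                S[i][j] = S[i - 1][j - 1] + ([] if same else [f"R.{i}"])
            else:
                # Assumed case: add target character j.
                S[i][j] = S[i][j - 1] + [f"A.{j}"]
    return C[n][m], S[n][m]
```

Copying a list of steps into every cell is simple but costs extra time and space. A common alternative is to store only which neighbor each cell came from and walk back from the final cell.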

## Exact substring matching

Approximate matching is hard. The best algorithm we’ve seen (the dynamic-programming table) is O(mn). Exact substring matching is easier. There are a few versions of the problem.

- Find the first instance of source in target.
- Find all instances of source in target.
- Find any instance of source in target.

## The brute-force approach

```
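// Try each starting position s at which source could still fit in target
for (s = 0; s <= length(target) - length(source); s++)
  {
    // Advance i for as long as the characters match
    for (i = 0; i < length(source) && target[s+i] == source[i]; i++)
      ;
    // If we ran off the end of source, every character matched
    if (i == length(source))
      return s;
  } // for s
return -1; // Not found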
```

*What is the asymptotic upper bound on the algorithm and what is an input
that reaches that upper bound?*

Worst case O(mn).

- source: `habit`
- target: `habihabihabihabihabihabihabihabihabihabihabihabihabihabihabihabihabihabihabihabihabihabihabihabit`

A worse one

- source:
`hhhhhhhhhhi`

- target:
`hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhi`
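To see why inputs like these are expensive, it can help to count character comparisons. The sketch below is a Python version of the brute-force search with a counter added. The name `brute_force_find` and the choice to return the count alongside the index are assumptions made for illustration.

```python
def brute_force_find(source, target):
    """Return (index, comparisons): the first index at which source
    appears in target (or -1), and how many character comparisons
    the brute-force search made along the way.
    """
    comparisons = 0
    for s in range(len(target) - len(source) + 1):
        i = 0
        while i < len(source):
            comparisons += 1
            if target[s + i] != source[i]:
                break
            i += 1
        if i == len(source):
            return s, comparisons
    return -1, comparisons
```

Trying `brute_force_find("h" * 10 + "i", "h" * 100 + "i")` shows the pattern: every starting position matches all ten h’s before failing (or succeeding) on the last character, so each position costs a full 11 comparisons.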

Our question for next class: Can we do better than O(mn) and, if so, how?

## Review: The hash-code approach

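This doesn’t solve our problem in general, since hash collisions can still force character-by-character comparisons and leave the worst case at O(mn), although it does in most cases.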
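As a reminder of the idea, here is a minimal sketch, assuming the hash-code approach means the rolling-hash technique often called Rabin-Karp. The name `hash_find`, the base, and the modulus are arbitrary choices for this sketch, not values from class.

```python
def hash_find(source, target, base=256, mod=1_000_003):
    """Return the first index at which source appears in target, or -1.

    Compares hash codes first and compares characters only when the
    hash codes agree.
    """
    m, n = len(source), len(target)
    if m > n:
        return -1
    # Weight of the leading character in a window of length m.
    high = pow(base, m - 1, mod) if m > 0 else 0
    hs = ht = 0
    for k in range(m):
        hs = (hs * base + ord(source[k])) % mod
        ht = (ht * base + ord(target[k])) % mod
    for s in range(n - m + 1):
        # Equal hash codes might be a collision, so verify the characters.
        if hs == ht and target[s:s + m] == source:
            return s
        if s + m < n:
            # Slide the window: drop target[s], append target[s + m].
            ht = ((ht - ord(target[s]) * high) * base
                  + ord(target[s + m])) % mod
    return -1
```

When hash codes rarely collide, most positions are rejected in constant time. With many collisions (for example, `mod=1`, where every hash code is 0), every position falls back to a full comparison and the behavior is the same as brute force.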