# Class 07: From Specification to Optimal DFA, Continued

Held Friday, September 13, 2002

Summary

Today we continue our consideration of how to move from the readable but declarative regular exprssion notation to the executable but obtuse finite automaton notation.

Notes

• If you have not done so already, please start reading Chapter 4 of the Red Dragon.
• I plan to email you about the project later today.

Overview

• From NFA to DFA
• From DFA to optimal DFA
• Reminder: Lexical analysis using automata

## A Different Exampe

• For today, we'll try a slightly weird, but also interesting example for which I'll draw a DFA
• Strings in the language
• Are followed by an odd # of b's and an a
• Can repeat the (odd # of b's and a) with an intervening "aa"
• Can repeat the (odd # of b's and a) with an intervening "ca"
• Must have a c after all of the repeated (odd # of b's and a)
• Have an optional (odd # of b's) at the end

## From DFA to Optimal DFA

• As you may be able to tell, the DFAs constructed by the NFA to DFA construction are is fairly large.
• Can we build an equivalent DFA that's smaller? Yes!
• Strategy: Group states into sets of equivalent states.
• Surprisingly, it's not enough to do this bottom-up (that is, if two states have the same edges, join them together).
• Sometimes you need to join states in pairs (or more)
• Even more surprisingly, it's better to do it top-down (assume that all states can be joined and refine that decision)
• Here's some pseudocode
```Assume all non-final states can be treated as the same state
Assume all final states can be treated as the same state
For each group of states treated as equivalent
as the same state
For each symbol, s
If there are two "equivalent" states q1,q2 such that
edge(q1,s) and edge(q2,s) lead to non-equivalent states,
split q1 and q2 into different equivalencies
figure out where the other states in the equivalency go
End For // each symbol
End for // each pair of states
```

## From Token Definitions to DFA

• How do we convert a series of token definitions (using regular expressions) to a tokenizing DFA?
• We build an NFA for each of the token definitions.
• We augment the final states with token type and priority.
• We put 'em together in one big automaton by adding a new start state with epsilon transitions to the start states of all the other automata.
• We convert the result to a DFA.
• When we make the determination of a final state, we use the associated token type of the highest priority final state in the NFA.

## Tokenizing with Finite Automata

• We've seen how to match with finite automata, but how do we tokenize?
• We only tokenize with deterministic finite automata.
• We begin by attaching a token (or, sometimes, a set of instructions) to each final state.
• When we reach an appropriate final state (possibly not the first final state), we stop and return the corresponding token.
• How do we decide whether a final state is appropriate? We always keep track of the last final state we encountered, but peek ahead until we find another final state or hit a dead end.
```/**
* Find the first token in the candidate string, starting
* at a particular position.
*/
public token findToken(String candidate, starting_pos)
begin
State current_state = q0;
for i = starting_pos to the length of candidate
current_state = edge(current_state,candidate.symbolAt(i))
if current_state is a final state
final_found = current_state
final_pos = i
end if
if current_state is undefined then
exit the for loop
end if
end for
if (final_found is defined) then
return the token given by final_found at position final_pos
else
no token can be found
end if
end
```

## History

Thursday, 29 August 2002

• First version, based somewhat on outlines from CS362 2001S.

Disclaimer: I usually create these pages on the fly, which means that I rarely proofread them and they may contain bad grammar and incorrect details. It also means that I tend to update them regularly (see the history for more details). Feel free to contact me with any suggestions for changes.

This document was generated by Siteweaver on Fri Dec 6 10:37:58 2002.
The source to the document was last modified on Wed Sep 4 10:08:34 2002.
This document may be found at `http://www.cs.grinnell.edu/~rebelsky/Courses/CS362/2002F/Outlines/outline.07.html`.

You may wish to validate this document's HTML ; ; Check with Bobby

Glimmer Labs: The Grinnell Laboratory for Interactive Multimedia Experimentation & Research
glimmer@grinnell.edu