# Class 7: From regular expression to finite automaton

Held Friday, September 11, 1998

Notes

• No quiz today! However, we will still discuss the core question: how might you write a regular expression or finite automaton to match comments (or our simplified version of comments).
• Next Tuesday at 4:15 there is a talk by Jennifer Hepner on Trapezoids with Integer Sides. Be there or be a trapezoid.
• Hopefully, we will finish up on lexical analysis this week. You may want to read the next chapter of the book, on parsing.
• Take home question: find or develop a more formal definition of nondeterministic finite automaton.
• You can find software for the course in `/usr/local/student/courses/csc362`

## Tokenizing with finite automata

• We've seen how to match with finite automata, but how do we tokenize?
• We only tokenize with deterministic finite automata.
• We begin by attaching a token (or, sometimes, a set of instructions) to each final state.
• When we reach an appropriate final state (possibly not the first final state), we stop and return the corresponding token.
• How do we decide whether a final state is appropriate? We keep track of the last final state we encountered and peek ahead until we find another final state or hit a dead end.
```
/**
 * Find the first token in the candidate string, starting
 * at a particular position.
 */
public Token findToken(String candidate, int starting_pos) {
  State current_state = q0
  for i = starting_pos to the length of candidate
    current_state = delta(current_state, candidate.symbolAt(i))
    if current_state is undefined then
      exit the for loop
    end if
    if current_state is a final state then
      final_found = current_state
      final_pos = i
    end if
  end for
  if final_found is defined then
    return the token given by final_found at position final_pos
  else
    report that no token can be found
  end if
} // findToken
```
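The pseudocode above can be made concrete. Here is a minimal Java sketch of longest-match scanning over a hand-coded DFA; the token set (INT = `[0-9]+`, ID = `[a-z]+`) and all names are my own choices for illustration, not the automata from the lecture.

```java
/**
 * Longest-match scanning with a hand-coded DFA for two hypothetical
 * token types: INT = [0-9]+ and ID = [a-z]+.
 */
public class FindToken {
    static final int DEAD = -1, START = 0, IN_INT = 1, IN_ID = 2;

    // delta: the DFA's transition function
    static int delta(int state, char c) {
        boolean digit = c >= '0' && c <= '9';
        boolean letter = c >= 'a' && c <= 'z';
        switch (state) {
            case START:  return digit ? IN_INT : letter ? IN_ID : DEAD;
            case IN_INT: return digit ? IN_INT : DEAD;
            case IN_ID:  return letter ? IN_ID : DEAD;
            default:     return DEAD;
        }
    }

    // Token type attached to each final state (null for non-final states).
    static String tokenType(int state) {
        if (state == IN_INT) return "INT";
        if (state == IN_ID)  return "ID";
        return null;
    }

    /** Return "TYPE:lexeme" for the longest token at start, or null. */
    static String findToken(String candidate, int start) {
        int state = START;
        String lastType = null;   // token of the last final state seen
        int lastPos = -1;         // position where we saw it
        for (int i = start; i < candidate.length(); i++) {
            state = delta(state, candidate.charAt(i));
            if (state == DEAD) break;          // dead end: stop peeking ahead
            String t = tokenType(state);
            if (t != null) { lastType = t; lastPos = i; }
        }
        if (lastType == null) return null;     // no token can be found
        return lastType + ":" + candidate.substring(start, lastPos + 1);
    }
}
```

Note that on input `abc42` the scanner passes through a final state after each letter but keeps peeking, returning only the longest match (`ID:abc`); a second call starting at position 3 then yields `INT:42`.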

## From regular expression to DFA

• We've now seen two mechanisms for describing tokens (and, in effect, for tokenizing): regular expressions and finite automata.
• How do we go from one to another? We convert regular expressions to nondeterministic finite automata using a surprisingly naive translation. We convert NFAs to DFAs. We then "optimize" the DFAs.
• We'll do this with a trio of expressions: one for an odd variant of identifiers (`(a|bb)[a-d]*`) and two for simple keywords (`aa` and `dd`).

## From regular expression to NFA

• Our regular expression to NFA converter builds a finite automaton with one final state (and one start state). It uses the recursive definitions of regular expressions to build a finite automaton for a compound expression from the finite automata for the base expressions.
• The finite automaton for the empty string has two states with an epsilon transition between the states. The first state is the start state. The second state is the final state.
• The finite automaton for a single symbol is similar, except that the transition is labeled with the symbol.
• The finite automaton for alternation has a new start state with epsilon edges to the start states of the automata for the subexpressions and a new final state with epsilon edges from the final states of the automata for the subexpressions.
• The finite automaton for concatenation uses the start state of the automaton for the first expression and the final state of the automaton for the second expression. There is an epsilon transition from the final state of the first automaton to the start state of the second automaton.
• The finite automaton for repetition has a new start state and a new final state. There are epsilon transitions:
• from the start state to the final state,
• from the start state to the start state of the subautomaton,
• from the final state of the subautomaton to the final state, and
• from the final state of the subautomaton to the start state of the subautomaton.
• If it seems to you that some of these extra states and epsilon transitions are unnecessary, you're probably right. However, you can't get rid of all of them. For example, consider what happens if you don't create any extra states for concatenation or repetition.
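The recursive construction above can be sketched directly in code. The fragment below is one possible Java encoding (all names are my own; the notes don't prescribe a representation): each fragment has exactly one start and one final state, and the four builders mirror the four cases in the notes.

```java
import java.util.*;

/**
 * Thompson-style NFA fragments, one builder per case in the notes.
 * Each fragment has exactly one start state and one final state.
 * EPS (character 0) marks epsilon transitions.  A sketch, not the
 * lecture's exact construction.
 */
class Nfa {
    static final char EPS = 0;                  // label for epsilon edges
    static int nextId = 0;                      // fresh-state counter
    // shared edge table: state -> list of (label, target) pairs
    static final Map<Integer, List<int[]>> edges = new HashMap<>();

    final int start, fin;                       // single start/final state
    Nfa(int start, int fin) { this.start = start; this.fin = fin; }

    static int newState() { return nextId++; }
    static void addEdge(int from, char label, int to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>())
             .add(new int[]{label, to});
    }

    /** Single symbol c: start --c--> final. */
    static Nfa symbol(char c) {
        int s = newState(), f = newState();
        addEdge(s, c, f);
        return new Nfa(s, f);
    }

    /** Concatenation: epsilon edge from a's final to b's start. */
    static Nfa concat(Nfa a, Nfa b) {
        addEdge(a.fin, EPS, b.start);
        return new Nfa(a.start, b.fin);
    }

    /** Alternation: new start/final states, epsilon edges in and out. */
    static Nfa alt(Nfa a, Nfa b) {
        int s = newState(), f = newState();
        addEdge(s, EPS, a.start); addEdge(s, EPS, b.start);
        addEdge(a.fin, EPS, f);   addEdge(b.fin, EPS, f);
        return new Nfa(s, f);
    }

    /** Repetition: the four epsilon edges listed in the notes. */
    static Nfa star(Nfa a) {
        int s = newState(), f = newState();
        addEdge(s, EPS, f);              // zero repetitions
        addEdge(s, EPS, a.start);        // enter the subautomaton
        addEdge(a.fin, EPS, f);          // leave after a repetition
        addEdge(a.fin, EPS, a.start);    // go around again
        return new Nfa(s, f);
    }
}
```

Because every builder returns a fragment with one start and one final state, the builders compose freely, which is exactly why the construction insists on that invariant.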

## From token definitions to NFA

• How do we convert a series of token definitions (using regular expressions) to a tokenizing NFA?
• We build an NFA for each of the token definitions.
• We augment the final states with token type and priority.
• We put 'em together in one big automaton by adding a new start state with epsilon transitions to the start states of all the other automata.

## From NFA to DFA

• How do we turn this lovely NFA into a deterministic computing machine?
• Basically, we build a DFA that simultaneously follows all the paths the NFA could take.
• Each state in the DFA is a set of states in the corresponding NFA.
• The algorithm is again fairly simple
• Terminology:
• qx is a state in the NFA and q0 is its start state
• QX is a state in the DFA and Q0 is its start state
```
Q0 = { q0 }
// but there are some states we can reach from q0 at no cost
Q0 = epsilon-closure(Q0)
while there are states we haven't processed
  pick one such state, Qn
  for each symbol s
    let tmp be a new set
    for each q in Qn
      add the states in delta(q, s) to tmp
    end for
    tmp = epsilon-closure(tmp)
    if tmp is not in the DFA then
      let Qi be a new state
      Qi = tmp
    else
      let Qi be the state equivalent to tmp
    end if
    add an edge labeled s from Qn to Qi in the automaton
  end for
end while
for each Qi
  if there is a q in Qi that is a final state then
    Qi is a final state
  end if
end for
```
• When we mark a DFA state as final, we use the token type of the highest-priority final NFA state it contains.
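The subset construction leans on the epsilon-closure helper used twice above. Here is one possible Java sketch of it (the map-of-sets encoding of epsilon edges is my own choice, not from the notes):

```java
import java.util.*;

/**
 * Epsilon-closure: all NFA states reachable from a given set of
 * states using epsilon transitions alone.  A sketch; epsEdges maps
 * each state to the set of states its epsilon edges reach.
 */
class Closure {
    static Set<Integer> epsilonClosure(Set<Integer> states,
                                       Map<Integer, Set<Integer>> epsEdges) {
        Set<Integer> closure = new HashSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);  // states left to process
        while (!work.isEmpty()) {
            int q = work.pop();
            for (int r : epsEdges.getOrDefault(q, Set.of()))
                if (closure.add(r))     // newly reached: its edges matter too
                    work.push(r);
        }
        return closure;
    }
}
```

The worklist guarantees each state is expanded at most once, so the closure of chained epsilon edges (q0 → q2 → q3) is found without revisiting states.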


Disclaimer Often, these pages were created "on the fly" with little, if any, proofreading. Any or all of the information on the pages may be incorrect. Please contact me if you notice errors.