Held Monday, February 5, 2001
Today we consider how to move from the readable but declarative regular
expression notation to the executable but obtuse finite automaton.
- If you have not done so already, please start reading Chapter 4 of
the Red Dragon.
- Please don't forget lab this week.
- I'm still working on writing up last Wednesday's class.
- Regular expressions vs. DFAs.
- Regular expressions as lexical descriptions.
- From lexical specification to DFA
- From one regular expression to NFA
- From many regular expressions to NFA
- We've seen how to match with finite automata, but how do we tokenize?
- We only tokenize with deterministic finite automata.
- We begin by attaching a token (or, sometimes, a set of instructions)
to each final state.
- When we reach an appropriate final state (possibly not the first
final state), we stop and return the corresponding token.
- How do we decide whether a final state is appropriate? We always
keep track of the last final state we encountered, but peek ahead
until we find another final state or hit a dead end.
  * Find the first token in the candidate string, starting
  * at a particular position.
  public Token findToken(String candidate, int starting_pos)
    State current_state = q0
    final_found = undefined
    for i = starting_pos to the length of candidate
      current_state = edge(current_state, candidate.symbolAt(i))
      if current_state is undefined then
        exit the for loop
      if current_state is a final state then
        final_found = current_state
        final_pos = i
    if final_found is defined then
      return the token given by final_found at position final_pos
    otherwise, report that no token can be found
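The loop above can be sketched in runnable form. The DFA encoding here (a transition dictionary and a map from final states to tokens) and all the names are illustrative choices, not part of the notes.

```python
# A sketch of the findToken loop above. Assumes a DFA encoded as a
# dict mapping (state, symbol) -> state, and a dict mapping each
# final state to its token. Both encodings are illustrative.

def find_token(edges, finals, candidate, start_pos, start_state=0):
    """Return (token, end_pos) for the longest match starting at
    start_pos, or None if no final state is ever reached."""
    state = start_state
    last_token = None          # token from the last final state seen
    last_pos = None            # position just past that match
    for i in range(start_pos, len(candidate)):
        state = edges.get((state, candidate[i]))  # missing edge -> None
        if state is None:                         # dead end: stop peeking
            break
        if state in finals:                       # remember last final state
            last_token = finals[state]
            last_pos = i + 1
    if last_token is not None:
        return last_token, last_pos
    return None                                   # no token can be found
```

A tokenizer would call this repeatedly, advancing the starting position past each returned token.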
- We've now seen two mechanisms for describing tokens (and, in
  effect, for tokenizing): regular expressions and finite automata.
- How do we go from one to the other? We convert regular expressions
  to nondeterministic finite automata using a surprisingly naive
  translation. We convert the NFAs to DFAs. We then "optimize" the
  resulting DFAs.
- We'll do this with a trio of expressions: one for an odd variant
  of identifiers ((a|bb)[a-d]*) and two for simple keywords (aa and
- Our regular expression to NFA converter builds a finite automaton with
one final state (and one start state). It uses the recursive definitions
of regular expressions to build a finite automaton for a compound
expression from the finite automata for the base expressions.
- The finite automaton for the empty string has two states with an
epsilon transition between the states. The first state is the
start state. The second state is the final state.
- The finite automaton for a single symbol is similar, except that the
transition is labeled with the symbol.
- The finite automaton for alternation has a new start state with
epsilon edges to the start states of the automata for the
subexpressions and a new
final state with epsilon edges from the final states of the
automata for the subexpressions.
- The finite automaton for concatenation uses the start state of the
automaton for the first expression and the final state of the
  automaton for the second expression. There is an epsilon transition
from the final state of the first automaton to the start state of
the second automaton.
- The finite automaton for repetition has a new start state and a
new final state. There are also epsilon transitions
- from the start state to the final state,
- from the start state to the start state of the subautomaton,
- from the final state of the subautomaton to the final state, and
  - from the final state of the subautomaton to the start state of
    the subautomaton.
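The constructions above can be sketched as follows. The representation (an NFA as a (start, final, edges) triple, with label None standing for epsilon) is an assumption for illustration; each builder produces exactly one start state and one final state, as the notes require.

```python
# A sketch of the RE-to-NFA constructions above. An NFA is a triple
# (start, final, edges), where edges is a list of (from, label, to)
# and a label of None means an epsilon transition.
import itertools

_ids = itertools.count()

def _new_state():
    return next(_ids)

def epsilon_nfa():
    s, f = _new_state(), _new_state()
    return s, f, [(s, None, f)]          # one epsilon edge, start to final

def symbol_nfa(c):
    s, f = _new_state(), _new_state()
    return s, f, [(s, c, f)]             # same shape, edge labeled c

def alt_nfa(a, b):                       # alternation
    s, f = _new_state(), _new_state()
    (sa, fa, ea), (sb, fb, eb) = a, b
    edges = ea + eb + [(s, None, sa), (s, None, sb),   # new start to both starts
                       (fa, None, f), (fb, None, f)]   # both finals to new final
    return s, f, edges

def concat_nfa(a, b):                    # concatenation
    (sa, fa, ea), (sb, fb, eb) = a, b
    # keep the first automaton's start and the second's final
    return sa, fb, ea + eb + [(fa, None, sb)]

def star_nfa(a):                         # repetition
    s, f = _new_state(), _new_state()
    sa, fa, ea = a
    edges = ea + [(s, None, f),          # match the empty string
                  (s, None, sa),         # enter the subautomaton
                  (fa, None, f),         # leave after a match
                  (fa, None, sa)]        # or loop for another repetition
    return s, f, edges
```

For example, `star_nfa(concat_nfa(symbol_nfa('a'), symbol_nfa('b')))` builds an NFA for (ab)*.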
- If it seems to you that some of these extra states and epsilon transitions
  are unnecessary, you're probably right. However, you can't get rid of all
  of them. For example, consider what happens if you don't create any extra
  states for concatenation or repetition.
- How do we convert a series of token definitions (using regular
expressions) to a tokenizing NFA?
- We build an NFA for each of the token definitions.
- We augment the final states with token type and priority.
- We put 'em together in one big automaton by adding a new start
state with epsilon transitions to the start states of all
the other automata.
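The combining step above can be sketched as follows. The representation (each token NFA as a triple whose finals map final states to a (token type, priority) pair, with lower numbers winning ties) is an illustrative assumption, not from the notes.

```python
# A sketch of combining per-token NFAs into one big tokenizing NFA,
# assuming each NFA is (start, finals, edges) where finals maps a
# final state to (token_type, priority) and state numbers are already
# disjoint across the automata. Label None means epsilon.

def combine(token_nfas, new_start):
    """Add a new start state with an epsilon transition to the start
    state of every token automaton; keep all the augmented finals."""
    all_edges = []
    all_finals = {}
    for start, finals, edges in token_nfas:
        all_edges += edges
        all_edges.append((new_start, None, start))  # epsilon from new start
        all_finals.update(finals)
    return new_start, all_finals, all_edges
```

After the subset construction, a DFA state containing several of these final states would use the priorities to decide which token to report (e.g., keyword over identifier).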
Monday, 22 January 2001
- Created as a blank outline.
Monday, 5 February 2001
- Filled in the details, based primarily on
outline 7 of
- Updated formatting.
- Added introductory material.