[Instructions] [Search] [Current] [News] [Syllabus] [Handouts] [Outlines] [Assignments]

**Held** Friday, September 11, 1998

**Notes**

- No quiz today! However, we will still discuss the core question: how might you write a regular expression or finite automaton to match comments (or our simplified version of comments).
- Next Tuesday at 4:15 there is a talk by Jennifer Hepner on
  *Trapezoids with Integer Sides*. Be there or be a trapezoid.
- Hopefully, we will finish up on lexical analysis this week. You may want to read the next chapter of the book, on parsing.
- Take-home question: find or develop a more formal definition of
  *nondeterministic finite automaton*.
- You can find software for the course in
  `/usr/local/student/courses/csc362`

- We've seen how to match with finite automata, but how do we tokenize?
- We only tokenize with deterministic finite automata.
- We begin by attaching a token (or, sometimes, a set of instructions)
to each final state.
- When we reach an appropriate final state (possibly not the first final state), we stop and return the corresponding token.

- How do we decide whether a final state is appropriate? We always
keep track of the last final state we encountered, but peek ahead
until we find another final state or hit a dead end.
```
/**
 * Find the first token in the candidate string, starting
 * at a particular position.
 */
public token findToken(String candidate, starting_pos) {
  State current_state = q0;
  for i = starting_pos to the length of candidate
    current_state = delta(current_state, candidate.symbolAt(i))
    if current_state is a final state
      final_found = current_state
      final_pos = i
    end if
    if current_state is undefined then
      exit the for loop
    end if
  end for
  if (final_found is defined) then
    return the token given by final_found at position final_pos
  else
    no token can be found
  end if
} // findToken
```
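The "remember the last final state, keep peeking ahead" strategy can be made concrete. Here is a minimal runnable sketch in Python; the transition table `DELTA`, the `FINAL` map, and the token name `'ID'` are illustrative assumptions, not the course's actual code. The DFA below recognizes the identifier pattern (a|bb)[a-d]* used later in these notes.

```python
# DFA transition table: (state, symbol) -> state.  State 0 is the start
# state; missing entries mean the transition is undefined (a dead end).
DELTA = {
    (0, 'a'): 2, (0, 'b'): 1,
    (1, 'b'): 2,
    (2, 'a'): 2, (2, 'b'): 2, (2, 'c'): 2, (2, 'd'): 2,
}
FINAL = {2: 'ID'}  # final state -> token type (names are assumptions)

def find_token(candidate, start):
    """Return (token, end_position) for the longest match, or None."""
    state, found = 0, None
    for i in range(start, len(candidate)):
        state = DELTA.get((state, candidate[i]))
        if state is None:      # dead end: stop peeking ahead
            break
        if state in FINAL:     # remember the last final state we saw
            found = (FINAL[state], i)
    return found

print(find_token("bbcd", 0))   # -> ('ID', 3): munches all of "bbcd"
print(find_token("ba", 0))     # -> None: "b" alone never reaches a final state
```

Note that a dead end after some final state is harmless: `found` still holds the last final state reached, which is exactly the "appropriate final state" the pseudocode returns.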

- We've now seen two mechanisms for describing tokens (and, in effect, for tokenizing): regular expressions and finite automata.
- How do we go from one to another? We convert regular expressions to nondeterministic finite automata using a surprisingly naive translation. We convert NFAs to DFAs. We then "optimize" the DFAs.
- We'll do this with a trio of expressions: one for an odd variant of identifiers (`(a|bb)[a-d]*`) and two for simple keywords (`aa` and `dd`).

- Our regular expression to NFA converter builds a finite automaton with one final state (and one start state). It uses the recursive definitions of regular expressions to build a finite automaton for a compound expression from the finite automata for the base expressions.
- The finite automaton for the empty string has two states with an epsilon transition between the states. The first state is the start state. The second state is the final state.
- The finite automaton for a single symbol is similar, except that the transition is labeled with the symbol.
- The finite automaton for alternation has a new start state with epsilon edges to the start states of the automata for the subexpressions and a new final state with epsilon edges from the final states of the automata for the subexpressions.
- The finite automaton for concatenation uses the start state of the automaton for the first expression and the final state of the automaton for the second expression. There is an epsilon transition from the final state of the first automaton to the start state of the second automaton.
- The finite automaton for repetition has a new start state and a
new final state. There are epsilon transitions
- from the start state to the final state,
- from the start state to the start state of the subautomaton,
- from the final state of the subautomaton to the final state, and
- from the final state of the subautomaton to the start state of the subautomaton.

- If it seems to you that some of these extra states and epsilon transitions are unnecessary, you're probably right. However, you can't get rid of all of them. For example, consider what happens if you don't create any extra states for concatenation or repetition.
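The recursive construction described above translates almost directly into code. The following is a hedged sketch in Python under one possible representation (not the course's): an NFA is a triple `(start, final, edges)`, where `edges` is a list of `(state, symbol, state)` triples and a symbol of `None` stands for an epsilon transition. All function names are illustrative.

```python
import itertools
_ids = itertools.count()          # fresh state numbers

def _new():
    return next(_ids)

def empty():                      # the empty string: one epsilon edge
    s, f = _new(), _new()
    return (s, f, [(s, None, f)])

def symbol(c):                    # a single symbol: one labeled edge
    s, f = _new(), _new()
    return (s, f, [(s, c, f)])

def alternation(m, n):            # new start/final, epsilon edges in and out
    s, f = _new(), _new()
    edges = m[2] + n[2] + [(s, None, m[0]), (s, None, n[0]),
                           (m[1], None, f), (n[1], None, f)]
    return (s, f, edges)

def concatenation(m, n):          # epsilon edge from m's final to n's start
    return (m[0], n[1], m[2] + n[2] + [(m[1], None, n[0])])

def repetition(m):                # the four epsilon edges listed above
    s, f = _new(), _new()
    edges = m[2] + [(s, None, f), (s, None, m[0]),
                    (m[1], None, f), (m[1], None, m[0])]
    return (s, f, edges)

# Building the identifier NFA from the example, (a|bb)[a-d]*:
a_or_bb = alternation(symbol('a'), concatenation(symbol('b'), symbol('b')))
```

The sketch is deliberately as naive as the translation itself: each rule adds exactly the states and epsilon edges listed in the notes, with no attempt to optimize.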

- How do we convert a series of token definitions (using regular expressions) to a tokenizing NFA?
- We build an NFA for each of the token definitions.
- We augment the final states with token type and priority.
- We put 'em together in one big automaton by adding a new start state with epsilon transitions to the start states of all the other automata.
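The merging step is small enough to sketch directly. This uses the same assumed representation as before, extended so `finals` maps each final state to a `(token_type, priority)` pair; it also assumes the sub-automata's state numbers are disjoint and do not include the new start state.

```python
def combine(nfas, new_start=0):
    """Merge token NFAs under one new start state with epsilon edges.

    Each NFA is (start, finals, edges); finals maps final states to
    (token_type, priority).  These names are illustrative assumptions.
    """
    finals, edges = {}, []
    for start, nfa_finals, nfa_edges in nfas:
        finals.update(nfa_finals)
        edges.extend(nfa_edges)
        edges.append((new_start, None, start))  # epsilon to each sub-start
    return (new_start, finals, edges)

# Two toy token NFAs (hand-built for brevity, states disjoint):
ident = (1, {2: ('ID', 1)}, [(1, 'a', 2)])
kw_aa = (3, {5: ('KW_AA', 0)}, [(3, 'a', 4), (4, 'a', 5)])
big = combine([ident, kw_aa])
```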

- How do we turn this lovely NFA into a deterministic computing machine?
- Basically, we build a DFA that simultaneously follows all the paths that
we can go through in the NFA.
- Each state in the DFA is a set of states in the corresponding NFA.

- The algorithm is again fairly simple.
- Terminology:
    - qx is a state in the NFA and q0 is its start state
    - QX is a state in the DFA and Q0 is its start state
```
Q0 = { q0 }
// but there are some states we can reach from q0 at no cost
Q0 = epsilon-closure(Q0)
while there are states we haven't processed
  pick one such state, Qn
  for each symbol s
    let tmp be a new set
    for each q in Qn
      add delta(q,s) to tmp
    end for
    tmp = epsilon-closure(tmp)
    if tmp is not in the DFA then
      let Qi be a new state
      Qi = tmp
      add Qi to the DFA
    else
      let Qi be the state equivalent to tmp
    end if
    add an edge from Qn to Qi in the automaton
  end for
end while
for each Qi
  if there is a q in Qi that is a final state then
    Qi is a final state
  end if
end for
```

- When we decide that a DFA state is final, we give it the token type associated with the highest-priority final state among the NFA states it contains.
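The subset construction and the final-state marking can be sketched as runnable code. This again assumes the illustrative edge-list NFA representation (with `None` for epsilon) and, as an additional assumption, that a *lower* priority number means *higher* priority.

```python
def epsilon_closure(states, edges):
    """All NFA states reachable from `states` via epsilon edges alone."""
    closure, work = set(states), list(states)
    while work:
        q = work.pop()
        for (p, c, r) in edges:
            if p == q and c is None and r not in closure:
                closure.add(r)
                work.append(r)
    return frozenset(closure)

def subset_construction(start, edges, alphabet):
    """Build a DFA whose states are sets of NFA states."""
    q0 = epsilon_closure({start}, edges)
    dfa, work = {q0: {}}, [q0]
    while work:                          # states we haven't processed yet
        Qn = work.pop()
        for s in alphabet:
            tmp = {r for (p, c, r) in edges if p in Qn and c == s}
            Qi = epsilon_closure(tmp, edges)
            if Qi not in dfa:            # a genuinely new DFA state
                dfa[Qi] = {}
                work.append(Qi)
            dfa[Qn][s] = Qi
    return q0, dfa

def dfa_finals(dfa, finals):
    """Mark DFA finals using the highest-priority NFA final they contain.

    `finals` maps NFA final states to (token, priority); lower number =
    higher priority (an assumption, not the course's convention).
    """
    out = {}
    for Q in dfa:
        hits = [finals[q] for q in Q if q in finals]
        if hits:
            out[Q] = min(hits, key=lambda t: t[1])[0]
    return out

# A tiny combined NFA: epsilon from 0 to an identifier branch (1 -a-> 2)
# and a keyword branch (3 -a-> 4 -a-> 5), with the keyword higher priority.
EDGES = [(0, None, 1), (0, None, 3), (1, 'a', 2), (3, 'a', 4), (4, 'a', 5)]
q0, dfa = subset_construction(0, EDGES, ['a'])
```

Note that the empty set shows up naturally as the DFA's dead state: when no NFA state has a move on `s`, `tmp` is empty and its closure is the empty set.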

Back to Introduction to finite automata. Back to Introduction to grammars and parsing.


**Disclaimer** Often, these pages were created "on the fly" with little, if any, proofreading. Any or all of the information on the pages may be incorrect. Please contact me if you notice errors.

Source text last modified Mon Sep 14 10:53:20 1998.

This page generated on Wed Sep 16 11:21:17 1998 by SiteWeaver.

Contact our webmaster at rebelsky@math.grin.edu