Held Friday, February 2, 2001
Today we consider finite automata, which serve as another representation
of data for lexical analysis.
- A few of you didn't show up to lab yesterday and didn't warn me in
advance. That's a no-no.
- I'd like to know what you came up with for ``strings that neither
start nor end with an a''.
- Don't forget all the cool Thursday events in the coming weeks
- Feb. 8: Summer CS Opportunities
- Feb. 15: Summer Math Research Opportunities
- Feb. 22: Pete Broadwell, Elizabeth Hulse, Chris Kern, and Erin Nichols
talk about Web Raveler.
- March 1: Josh Vickery talks about annotations. Rachel Heck, Kumail
Nanjiani, Ellie Raulerson, and Isabel Staicut talk about invading
- March 8: Lea Wittie, alumni scholar, talks about debugging.
- Don't forget the Grinnell
Entrepreneurs' forthcoming conference on Saturday, Feb. 17th.
- Context: What we're doing and why
- Regular expressions
- Finite automata
- Looking ahead
- How will we learn about lexical analysis in this class?
- We will conclude our discussion of regular expressions.
- We will then consider how we might combine regular expressions to
describe all of the tokens of a language. [Today]
- Because there seems to be a gap between regular expressions and
computation, we will then consider finite automata, a
more computational way to express tokens and, more importantly,
the process of tokenizing. [Today]
- Although finite automata are covered in some depth in CS341, we
will spend some time considering some issues in the language
of finite automata. [Today]
- Since regular expressions are often easier to develop and write
than finite automata, we will consider how to automatically
translate regular expressions to finite automata.
- We will spend lab next week exploring a tool for building
lexical analyzers, such as
- As you might guess, we use regular expressions during lexical analysis
by writing regular expressions to describe the tokens of our language.
- An identifier is (more or less) given by [a-zA-Z][a-zA-Z0-9]*
- The keyword
if is given by if.
- Integers are almost given by [+-]?[0-9]+.
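These definitions can be tried out directly. Here is a small sketch using
Python's re module; the pattern names are mine, not part of the notes:

```python
import re

# Illustrative versions of the token definitions above.
IDENTIFIER = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")
KEYWORD_IF = re.compile(r"if")
INTEGER    = re.compile(r"[+-]?[0-9]+")

def matches(pattern, s):
    """Return True if the whole string s is in the pattern's language."""
    return pattern.fullmatch(s) is not None

print(matches(IDENTIFIER, "x27"))   # True
print(matches(INTEGER, "-42"))      # True
print(matches(IDENTIFIER, "if"))    # True: 'if' also looks like an identifier
```

Note that the last line already hints at the overlap problem discussed next.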
- Are there problems with using these definitions for lexical analysis?
- One problem is that different regular expressions can match the same string.
if is matched by the regular expression for
the if keyword. However, it is also matched by the definition of
identifiers.
- Most languages don't permit keywords to be used as identifiers.
- Some do, but that can make it easier to write unreadable code, as
in the following (which I've formatted for more readability)
if then = if then if = else else if = then
- In addition, regular expressions are normally used for the membership test
(is this string in this language?). For membership
testing, we already know the bounds of a string. During lexical analysis,
we may not know these bounds. For example, is a given sequence of
characters one token or two? How many tokens are there in
if 3 then?
- There are languages in which the answer to the second question is ``one''.
- Two basic principles are used to answer these questions.
- The principle of longest substring says that if regular expression
A indicates that the first n characters of the string form a token,
regular expression B indicates that the first m characters of the string
form a token, and n is less than m, then the first m characters of the
string form a token of type B.
- This principle works when we are deciding between differing length
tokens, but what if we have two tokens of the same length?
The principle of rule priority says that the highest-priority
rule (regular expression plus name) is used. Which one has highest
priority? Usually the first.
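The two principles can be sketched as a tiny tokenizer. This is my own
illustrative code, not a production scanner: rules are tried in order,
strictly longer matches win (longest substring), and ties go to the
earlier rule (rule priority).

```python
import re

RULES = [                      # (name, pattern); names are mine
    ("IF",    re.compile(r"if")),
    ("ID",    re.compile(r"[a-zA-Z][a-zA-Z0-9]*")),
    ("NUM",   re.compile(r"[+-]?[0-9]+")),
    ("SPACE", re.compile(r"\s+")),
]

def next_token(text, pos):
    best = None                # (name, lexeme)
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Strictly longer matches win; ties go to the earlier rule.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        tok = next_token(text, pos)
        if tok is None:
            raise ValueError("no token at position " + str(pos))
        if tok[0] != "SPACE":
            tokens.append(tok)
        pos += len(tok[1])
    return tokens

print(tokenize("if if3 then 42"))
```

Observe that "if" becomes an IF token (rule priority beats the identifier
rule) while "if3" becomes an identifier (longest substring beats IF).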
- Regular expressions provide a simple, easy to understand, and elegant
mechanism for describing the tokens of a language.
- Unfortunately, it is difficult to compute with regular expressions.
- Hence, we will consider a more computational approach to lexical
analysis: finite automata.
- What are finite automata?
- They are very simple computing machines
(simple enough that there are things that they cannot compute that
more complex computing machines can compute).
- What do they compute?
- Typically, they compute membership in simple languages.
- That is, given a string, they determine whether or not
that string is in the language.
- As with regular expressions, they can be modified to find
tokens in larger sequences of symbols.
- Because they are simple,
finite automata can easily be translated to machine code.
- Formally, a deterministic finite automaton is a five-tuple
(Q, Sigma, delta, q0, F), in which
- Q is a finite set of states;
- Sigma is the base alphabet, that is, a finite set of symbols;
- delta is a transition function, a (possibly partial) function from
state/symbol pairs to states;
- q0 in Q is the start state;
- F subset Q is the set of final states.
- How do we use an automaton to determine whether a string is in the language
given by that automaton? Using a matching function.
/* Determine whether a string is in the language given by the
 * current automaton. */
public boolean inLanguage(String candidate)
  State current_state = q0;
  for each symbol sym in candidate
    current_state = delta(current_state, sym)
    if current_state is undefined, return false
  return (current_state is a final state);
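The five-tuple and the matching function can be made concrete. Below is a
sketch in Python, using a DFA of my own devising for the lab's language
``strings that neither start nor end with an a'' (over the alphabet {a, b});
the undefined-transition check mirrors the pseudocode.

```python
# A concrete five-tuple. This DFA is my own example, not one from class.
Q      = {"q0", "q1", "q2"}
SIGMA  = {"a", "b"}
DELTA  = {                      # partial: missing pairs mean "undefined"
    ("q0", "b"): "q1",          # first symbol is b
    ("q1", "b"): "q1",          # still ends with b
    ("q1", "a"): "q2",          # currently ends with a
    ("q2", "a"): "q2",
    ("q2", "b"): "q1",
}
START  = "q0"
FINAL  = {"q0", "q1"}           # empty string, or starts and ends with b

def in_language(candidate):
    """Follow delta symbol by symbol, rejecting on undefined transitions."""
    state = START
    for sym in candidate:
        state = DELTA.get((state, sym))
        if state is None:       # transition undefined: reject
            return False
    return state in FINAL

print(in_language("bab"))   # True
print(in_language("ab"))    # False (starts with a)
print(in_language("ba"))    # False (ends with a)
```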
- As this suggests, it is also helpful to have an explicit or implicit
error state for undefined transitions.
- Typically, we represent finite automata pictorially.
- Each state is represented by a circle.
- The transition function, delta, is represented by directed and
labeled edges between states. For example, if (q1,b,q2) is in
delta, then there is an edge from q1 to q2 labeled b.
- Final states have a double circle.
- The start state has an incoming arrow.
- Sigma is not represented explicitly.
- You may find it easier to represent some tokens with finite
automata and others with regular expressions.
- However, they have the same computational power.
- Neither can represent the language ``strings of a's and
b's with equal numbers of each''.
- It turns out that it's interesting and useful to consider a variant of
finite automata in which
- some transitions are ``free'', in that they do not require
any symbols in order to make the transition. Typically
such edges are labeled with epsilon.
- there may be multiple edges with the same label exiting
the same node
- Such finite automata are often called nondeterministic finite
automata or NFAs.
- When is a string in the language given by such an automaton? When
there is some path through the automaton that uses the
appropriate symbols and free transitions. [We'd like the computer
to guess the correct path.]
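One way to avoid guessing is to track every state the NFA could be in at
once. The following sketch (with an NFA of my own invention, accepting
strings over {a, b} ending in "ab") takes the epsilon-closure of the
current state set and advances the whole set on each symbol:

```python
EPS = ""                        # marks free (epsilon) transitions
DELTA = {                       # (state, symbol) -> set of next states
    ("n0", "a"): {"n0", "n1"},  # nondeterministic: two edges labeled a
    ("n0", "b"): {"n0"},
    ("n1", "b"): {"n2"},
    ("n2", EPS): {"n3"},        # a free transition
}
START, FINAL = "n0", {"n3"}

def eps_closure(states):
    """All states reachable from the given set via free transitions alone."""
    result, todo = set(states), list(states)
    while todo:
        s = todo.pop()
        for t in DELTA.get((s, EPS), set()):
            if t not in result:
                result.add(t)
                todo.append(t)
    return result

def nfa_accepts(candidate):
    current = eps_closure({START})
    for sym in candidate:
        moved = set()
        for s in current:
            moved |= DELTA.get((s, sym), set())
        current = eps_closure(moved)
    return bool(current & FINAL)

print(nfa_accepts("bab"))   # True
print(nfa_accepts("ba"))    # False
```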
- Do we gain any power from this nondeterminism? Surprisingly, no.
As we'll see next class, nondeterministic finite automata can
easily be converted to deterministic finite automata.
- Why is all this important?
- Because regular expressions (which we like to write) are
declarative, not imperative. That is, they say what we want
to match, but not how.
- Finite automata, on the other hand, include a nice procedure
for deciding membership.
- Fortunately, with a little bit of effort, you can turn any regular
expression (or set of regular expressions) into a DFA.
- First, you convert each regular expression to an NFA.
- Then you join the NFA's together, giving a new NFA.
- Then you convert the NFA to a DFA.
- We'll do all that tomorrow.
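As a preview of the last of those steps, the NFA-to-DFA conversion (the
subset construction) can be sketched in a few lines: each DFA state is a
set of NFA states. The NFA here is my own small example (strings over
{a, b} ending in "ab") and omits epsilon edges for brevity.

```python
NFA_DELTA = {                   # (state, symbol) -> set of next states
    ("n0", "a"): {"n0", "n1"},
    ("n0", "b"): {"n0"},
    ("n1", "b"): {"n2"},
}
SIGMA = {"a", "b"}
NFA_START, NFA_FINAL = "n0", {"n2"}

def subset_construction():
    """Build DFA states (sets of NFA states) reachable from the start."""
    start = frozenset({NFA_START})
    dfa_delta, final = {}, set()
    todo, seen = [start], {start}
    while todo:
        state = todo.pop()
        if state & NFA_FINAL:   # any NFA final state makes the set final
            final.add(state)
        for sym in SIGMA:
            nxt = frozenset(t for s in state
                            for t in NFA_DELTA.get((s, sym), set()))
            dfa_delta[(state, sym)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return start, dfa_delta, final

start, delta, final = subset_construction()
print(len({s for s, _ in delta}))   # 3 distinct DFA states explored
```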
Monday, 22 January 2001
- Created as a blank outline.
Thursday, 1 February 2001
- Filled in the details, many of which were taken from
outline 6 of the previous version of this course.
- Updated formatting.