Held Monday, September 16, 2002
Today we move from lexing to parsing.
- Quick quiz: Where in Pascal can you not tell immediately whether a
character read is part of the current token?
- Congratulations to Senator French (or is that the Senator from France?).
- Are there any questions on homework 1?
- Don't forget the Math/CS picnic on Friday!
- Is anyone bringing parents to class on Friday?
- Yes, I know that last year's notes occasionally appear in these pages.
I'll get rid of them eventually.
- I'll be hanging out in
Algorithms class from 2:15-3:05 this week.
- As many of you have probably noted, different people learn and understand
material differently. Some people learn best from formal notation; others
learn well from casual analysis; still others learn best from examples.
- I try to balance the ways in which I teach things. Let me know
if I can better serve your best learning style.
- Limits of regular expressions
- BNF Grammars
- Context-Free Grammars
- As we've seen before, there are some limits of regular expressions
and finite automata.
- In particular, there are some
simple languages that we cannot express with either.
- Consider the language of strings of a's and
b's with equal numbers of a's and b's.
- This language cannot be expressed by a finite automaton.
- The proof is a proof by contradiction. Suppose it could be expressed
by a deterministic finite automaton, A.
- A has a finite number of states, n.
- Record the sequence of states the automaton goes through given
a^(n+1) as input.
- At least one state, q, must be duplicated in that sequence.
Suppose it occurs after a^i and again after a^j, with i
and j different.
- Since both a^i and a^j leave the automaton in state q, whatever
follows them is treated identically. If the automaton accepts
a^i b^i (which it should), then it will also accept
a^j b^i (which it should not).
- Since every regular expression and every NFA can be represented by
a DFA, that language also cannot be expressed by NFAs or regular
expressions.
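The pigeonhole step in the proof can be made concrete. The sketch below is only an illustration: the four-state machine `DELTA` is a made-up DFA (not one from these notes) that tries to count a's but saturates at its last state, and `run` and `find_repeat` are hypothetical helper names.

```python
# A made-up 4-state DFA over {a, b}: state s "remembers" up to 3 unmatched a's.
# This machine is an assumption for illustration; state 0 is start and accepting.
N_STATES = 4
DELTA = {}
for s in range(N_STATES):
    DELTA[(s, "a")] = min(s + 1, N_STATES - 1)  # count up, saturating at 3
    DELTA[(s, "b")] = max(s - 1, 0)             # count down, never below 0


def run(delta, start, text):
    """Return the state the DFA reaches after reading text from start."""
    state = start
    for ch in text:
        state = delta[(state, ch)]
    return state


def find_repeat(delta, start, n):
    """Find i < j such that a^i and a^j lead to the same state.

    The n + 1 prefixes a^0 .. a^n visit only n states, so by the
    pigeonhole principle two of them must land on the same state.
    """
    seen = {}
    state = start
    for count in range(n + 1):
        if state in seen:
            return seen[state], count
        seen[state] = count
        state = delta[(state, "a")]
    raise AssertionError("unreachable for a total DFA with n states")


i, j = find_repeat(DELTA, 0, N_STATES)
should_accept = run(DELTA, 0, "a" * i + "b" * i)      # a^i b^i
should_reject = run(DELTA, 0, "a" * j + "b" * i)      # a^j b^i
```

For this machine, `find_repeat` reports i = 3 and j = 4, and both test strings end in the same state, so the DFA cannot accept one and reject the other. A DFA with more states would fail the same way, only with different i and j.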
- There are a host of other similar languages that cannot be expressed
by finite automata.
- balanced parentheses
- So, what do we do? We move on to a more powerful notation.
- Note that you may need pretty powerful notations.
- For example, our "equal numbers of a's and b's" language
may not be implementable on any machine with finite memory.
- In order to express more complicated (or more sophisticated) languages,
we need a more powerful tool.
- Preferably, this tool will support some form of recursion or looping.
- For example, a string has equal numbers of a's and b's if
- the string is empty or
- the string has at least one a and at least one b and, if we
remove an a and a b from the string, we end up with a string
with equal numbers of a's and b's.
- Alternatively, a string has equal numbers of a's and b's if
- the string is empty or
- the string is built by adding an a and a b
to a string with an equal number of a's and b's.
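The first recursive definition translates almost directly into code. This is a minimal sketch, assuming the input contains only the characters a and b; the function name `equal_as_and_bs` is made up for illustration.

```python
def equal_as_and_bs(s):
    """Check the recursive definition: empty, or remove one a and one b and recur."""
    if s == "":
        return True
    if "a" in s and "b" in s:
        # Remove the first a and the first b, then test what is left.
        rest = s.replace("a", "", 1).replace("b", "", 1)
        return equal_as_and_bs(rest)
    # Non-empty, but missing an a or a b to remove.
    return False
```

Unlike a finite automaton, the recursion is free to go as deep as the string is long, which is the extra power the definition relies on.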
- The most common tool used for these recursive definitions is the
Backus-Naur Form (BNF) grammar.
- A grammar is a formal set of rules that specify the legal utterances
in the language.
- Formally, BNF grammars are four-tuples that contain
- Sigma, the alphabet from which strings are built
(note that this alphabet may have been generated from the original
program via a lexer). These are also called the terminal symbols
of the grammar.
- N, the nonterminals in the language. You
can think of these as names for the structures in the language or as
names for groups of strings.
- S in N, the start symbol. In BNF grammars,
all the valid utterances are of a single type.
- P, the productions that present rules for
generating valid utterances.
- The set of productions is the core part of a grammar. Often, language
definitions do not formally specify the other parts because they can
be derived from the grammar.
- A production is, in effect, a statement that one sequence of symbols
(a string of terminals and nonterminals) can be replaced by another
sequence of terminals and nonterminals. You can also view a production
as an equivalence: you can represent the left-hand side by the
right-hand side.
- If we read productions from left to right, we have a mechanism
for building strings in the language.
- If we read productions from right to left, we have a mechanism
for testing language membership.
- What do productions look like? It depends on the notation one uses
for nonterminals and terminals, which may depend on the available
- We'll use words that begin with capital letters to indicate
nonterminals, words that begin with lowercase letters to indicate
terminals, and stuff in quotation marks to indicate particular
phrases. We'll use
::= to separate the two parts of a rule.
- We might indicate the standard form of a Pascal program with
Program ::= PROGRAM
- Similarly, we might indicate a typical compound statement in Pascal with
Compound-Statement ::= BEGIN Statement-List END
- Lists of things are often defined recursively, using multiple definitions.
Here is one for non-empty statement lists. It says that
a statement list can
be a statement; it can also be a statement followed by a semicolon
and another statement list.
Statement-List ::= Statement
Statement-List ::= Statement SEMICOLON Statement-List
- The statements may then be defined with
Statement ::= Assignment-Statement
Statement ::= Compound-Statement
Assignment-Statement ::= identifier ASSIGN Expression
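To see how the other parts of the four-tuple can be recovered from the productions, here is a rough sketch that stores the statement rules as data. The `Grammar` class and `grammar_from_productions` are hypothetical names. Two assumptions are baked in: all-caps names such as SEMICOLON and lowercase names such as identifier are treated as terminals, and the left-hand side of the first production is taken as the start symbol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    terminals: frozenset      # Sigma
    nonterminals: frozenset   # N
    start: str                # S
    productions: tuple        # P, as (lhs, rhs) pairs; rhs is a tuple of symbols


PRODUCTIONS = (
    ("Statement-List", ("Statement",)),
    ("Statement-List", ("Statement", "SEMICOLON", "Statement-List")),
    ("Statement", ("Assignment-Statement",)),
    ("Statement", ("Compound-Statement",)),
    ("Assignment-Statement", ("identifier", "ASSIGN", "Expression")),
)


def is_nonterminal(symbol):
    """Assumed convention: capitalized but not all-caps means nonterminal."""
    return symbol[0].isupper() and not symbol.isupper()


def grammar_from_productions(productions):
    """Derive Sigma, N, and S from the productions alone."""
    symbols = {lhs for lhs, _ in productions}
    symbols |= {sym for _, rhs in productions for sym in rhs}
    nonterminals = frozenset(s for s in symbols if is_nonterminal(s))
    terminals = frozenset(symbols - nonterminals)
    start = productions[0][0]  # assumption: first rule defines the start symbol
    return Grammar(terminals, nonterminals, start, tuple(productions))


grammar = grammar_from_productions(PRODUCTIONS)
```

With these rules, the terminals come out as SEMICOLON, ASSIGN, and identifier; everything else, including the not-yet-defined Expression, is a nonterminal.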
- How do we use BNF grammars? We show derivations by starting
with the start symbol and repeatedly applying productions. Eventually,
we end up with a sequence of terminals.
- Any string that can be produced in this way is in the language.
- The attempt to show that a string is in a language given by a
BNF grammar is called parsing that string.
- We'll soon see a more formal way of parsing.
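A derivation can be sketched as repeated list rewriting. The example below uses the statement rules from these notes plus one made-up rule, Expression ::= identifier, which is an assumption added only so that the derivation can finish in terminals. The helper name `derive` is also hypothetical.

```python
# Productions as (lhs, rhs) pairs.
LIST_MANY = ("Statement-List", ("Statement", "SEMICOLON", "Statement-List"))
LIST_ONE = ("Statement-List", ("Statement",))
STMT_ASSIGN = ("Statement", ("Assignment-Statement",))
ASSIGN = ("Assignment-Statement", ("identifier", "ASSIGN", "Expression"))
EXPR_ID = ("Expression", ("identifier",))  # assumed rule, not from the notes


def derive(start, steps):
    """Apply each production to the leftmost occurrence of its left-hand side.

    Returns every intermediate sequence of symbols, starting with [start].
    """
    form = [start]
    history = [list(form)]
    for lhs, rhs in steps:
        i = form.index(lhs)      # raises ValueError if lhs is not present
        form[i:i + 1] = rhs      # replace the nonterminal with the right-hand side
        history.append(list(form))
    return history


STEPS = [LIST_MANY, STMT_ASSIGN, ASSIGN, EXPR_ID,
         LIST_ONE, STMT_ASSIGN, ASSIGN, EXPR_ID]
history = derive("Statement-List", STEPS)
final = " ".join(history[-1])
```

Each entry of `history` is one line of the derivation, and the last one is the token sequence for two assignment statements separated by a semicolon. Parsing runs the other way: given that final sequence, find the steps.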
Thursday, 29 August 2002
- First version, based somewhat on outlines from