Class 27: An Expression Grammar
Held Friday, April 9
- Congratulations to Dima: his team placed second (of 23 teams)
in the Iowa
Collegiate Mathematics Contest. His team also placed 27th in
the nationwide Putnam exam (the highest Grinnell's placed for about
- Congratulations to Wyatt: his team placed sixth in the Iowa
Collegiate Mathematics Contest.
- In practice, we build a language from multiple grammars. One grammar
is used to build a simpler alphabet which the next grammar can then
process in more interesting ways.
- For example, the process of finding identifiers, numbers, and such
is much easier than that of finding the various types of statements
in a language.
- If it's possible to build a fast program to find identifiers and such,
this can simplify and speed the other program.
- The simple processing is often called lexical analysis.
It typically uses regular expressions.
- The more complicated processing is then called syntax analysis.
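To make the lexical-analysis idea concrete, here is a minimal sketch of a regular-expression lexer in Python. The token names and the particular patterns are illustrative assumptions, not part of any specific language definition.

```python
import re

# Each token class is a named regular expression; the master pattern
# tries them in order.  (Token names and patterns are assumptions.)
TOKEN_PATTERNS = [
    ("number",     r"\d+"),
    ("identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("op",         r"[+\-*/=;()]"),
    ("skip",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))

def lex(source):
    """Convert a source string into a list of (kind, text) tokens."""
    tokens = []
    for match in MASTER.finditer(source):
        kind = match.lastgroup          # name of the pattern that matched
        if kind != "skip":
            tokens.append((kind, match.group()))
    return tokens

print(lex("a+b*2"))
# [('identifier', 'a'), ('op', '+'), ('identifier', 'b'), ('op', '*'), ('number', '2')]
```

The output alphabet (`identifier`, `number`, `op`) is exactly the simpler alphabet that the syntax analyzer then works over.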
- In addition to leading the Fortran team, Backus (remember him)
worked as part of the
Algol team. One of his key roles in that team was formally defining
the syntax of Algol.
- As you know, many computer scientists have a tendency to
``go meta'' (e.g., rather than writing a program for a
particular problem, we write one for a class of problems, and
then go back and decide to write a program generator).
- Along those lines, Backus worked on a formal system for defining
programming-language syntax.
- Many of his ideas were informed by the earlier work of Chomsky.
- Naur worked on improving Backus's notation.
- (A note on abbreviations: BNF is ``Backus-Naur Form'', for the
inventors of the notation; CNF is ``Chomsky Normal Form'', for
a particular way of writing grammars. Many people get the ``N''s
confused.)
- BNF grammars are four-tuples that contain
- Sigma, the alphabet from which strings are built
(note that this alphabet may have been generated from the original
program via a lexer). These are also called the terminal symbols
of the grammar.
- N, the nonterminals in the language. You
can think of these as names for the structures in the language or as
names for groups of strings.
- S in N, the start symbol. In BNF grammars,
all the valid utterances are of a single type.
- P, the productions that present rules for
generating valid utterances.
- The set of productions forms the core part of a grammar. Often, language
definitions do not formally specify the other parts because they can
be derived from the grammar.
- A production is, in effect, a statement that one sequence of symbols
(a string of terminals and nonterminals) can be replaced by another
sequence of terminals and nonterminals. You can also view a production
as an equivalence: you can represent the left-hand side by the
right-hand side.
- What do productions look like? It depends on the notation one uses
for nonterminals and terminals, which may depend on the available
character set.
- Some use
::= for productions.
- Others use
-> for productions.
- We'll use words that begin with capital letters to indicate
nonterminals, words that begin with lowercase letters to indicate
terminals, and stuff in quotation marks to indicate particular
phrases. We'll use
::= to separate the two parts of a rule.
- For example, we might indicate the standard form of a Pascal program with
Program ::= 'program' identifier '(' Identifier-List ')' ';' Block '.'
- Similarly, we might indicate a typical compound statement in Pascal with
Compound-Statement ::= 'begin' Statement-List 'end'
- Lists are defined recursively, using multiple definitions. Here's
one for non-empty statement lists. It says, ``a statement list can
be a statement; it can also be a statement followed by a semicolon
and another statement list''.
Statement-List ::= Statement
Statement-List ::= Statement ';' Statement-List
- The statements may then be defined with
Statement ::= Assignment-Statement
Statement ::= Compound-Statement
Assignment-Statement ::= identifier '=' Expression
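One can write a set of productions like these down as an ordinary data structure and use it to generate strings. The following Python sketch mirrors the productions above; the alternatives for Expression are an assumed placeholder, since the notes have not yet defined expressions.

```python
import random

# The productions above, encoded as a dictionary mapping each nonterminal
# to its list of alternative right-hand sides.  Expression's alternatives
# are an assumption; the rest mirror the productions in the notes.
GRAMMAR = {
    "Statement-List": [["Statement"],
                       ["Statement", ";", "Statement-List"]],
    "Statement": [["Assignment-Statement"]],
    "Assignment-Statement": [["identifier", "=", "Expression"]],
    "Expression": [["identifier"], ["number"]],
}

def derive(symbol, rng=random):
    """Expand `symbol` into a list of terminals by repeatedly replacing
    each nonterminal with a randomly chosen right-hand side."""
    if symbol not in GRAMMAR:              # terminals derive themselves
        return [symbol]
    result = []
    for part in rng.choice(GRAMMAR[symbol]):
        result.extend(derive(part, rng))
    return result

print(" ".join(derive("Statement")))
```

Running `derive` on `"Statement-List"` shows the recursion in the two-production list definition at work: each expansion either stops with one statement or appends a semicolon and recurses.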
- How do we use BNF grammars? We can use them to derive legal
strings in the language by starting with the start symbol and
repeatedly replacing strings that match left-hand-sides with the
corresponding right-hand-sides. This is called a derivation.
- When do we stop the derivation? When we run out of nonterminals
in the string.
- Note that the derivation can be represented as a tree, with edges
from lhs (parent) to rhs (children). Such a derivation is called
a parse tree.
- We can also build a membership predicate.
If we have a particular string in mind, we try to make a parse tree
that matches the string.
- We can build the parse tree bottom-up or top-down.
- If we automate the construction of the parse tree, we have built
a parser. You can study parsers in CS362.
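As a small sketch of the membership-predicate idea, here is a hand-written, top-down test for the Statement-List grammar above. To keep the sketch short it assumes Statement covers only assignments, and it represents tokens as plain strings; both choices are illustrative.

```python
def is_assignment(tokens):
    """Assignment-Statement ::= identifier '=' Expression
    (Expression restricted here to an identifier or a number)."""
    return (len(tokens) == 3 and tokens[0] == "identifier"
            and tokens[1] == "=" and tokens[2] in ("identifier", "number"))

def is_statement_list(tokens):
    """Statement-List ::= Statement | Statement ';' Statement-List"""
    if is_assignment(tokens):                  # a single statement
        return True
    for i, tok in enumerate(tokens):           # try each ';' as the split point
        if (tok == ";" and is_assignment(tokens[:i])
                and is_statement_list(tokens[i + 1:])):
            return True
    return False

print(is_statement_list(
    ["identifier", "=", "number", ";", "identifier", "=", "identifier"]))  # True
```

Matching the string against the productions this way, from the start symbol downward, is exactly the top-down construction of a parse tree.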
- While typical language grammars have only one nonterminal on the
left, it is possible to have grammars that include multiple things
on the left.
Rock Anything Hardplace ::= Rock Hardplace
- It turns out to be much harder to parse using grammars that have
multiple nonterminals on the left, so we'll work mostly with grammars
that have only one nonterminal on the left. These are called
context-free grammars, since the replacement of a nonterminal
requires no context. That is, we can replace a nonterminal without
considering the surrounding symbols.
- Hopefully, you studied these other types of grammars in CSC341.
- Let us consider a grammar one might use to define simple arithmetic
expressions over variables and numbers.
- Each number is an expression. We get numbers from the lexer.
- Each identifier is an expression (we won't worry about types right now).
- Each application of a unary operator to an expression is an expression.
- The legal unary operators are the plus symbol and the minus symbol.
UnOp ::= '+'
UnOp ::= '-'
- The application of a binary operator to two expressions is an expression.
- And there are a number of binary operators. For shorthand, we'll write
a vertical bar (representing "or") to show alternatives.
BinOp ::= '+' | '-' | '*' | '/'
- A parenthesized expression is an expression.
- How do we show that
a+b*2 is an expression? First,
we observe that the
lexer converts this to
identifier + identifier * number
- The derivation follows. A right arrow (=>) is used to indicate
one derivation step.
Exp => // Exp ::= Exp BinOp Exp
Exp BinOp Exp => // Exp ::= identifier
identifier BinOp Exp => // BinOp ::= '+'
identifier + Exp => // Exp ::= Exp BinOp Exp
identifier + Exp BinOp Exp => // Exp ::= number
identifier + Exp BinOp number => // Exp ::= identifier
identifier + identifier BinOp number => // BinOp ::= '*'
identifier + identifier * number
- Note that we can have multiple choices at each step. For example,
we might have expanded the second
Exp rather than the
Exp BinOp Exp.
- To describe these simultaneous choices, we often write visual
descriptions of context-free derivations using parse trees. The
interior nodes of the tree are the nonterminals. A derivation is shown by
connecting a node (representing the lhs) to children representing
the rhs of a derivation.
                Exp
             /   |   \
          Exp  BinOp  Exp
           |     |   /  |  \
  identifier     +  Exp BinOp Exp
                     |    |    |
            identifier    *  number
- As in many other areas of computer science, careless grammar design
can lead to ambiguity in understanding (and parsing)
the objects we describe.
- An ambiguous grammar is one that provides multiple parse trees for
the same string.
- Why is this a problem? Often the parse trees provide a natural
mechanism for evaluating, compiling, understanding, or otherwise
using the parsed expression.
- For example, one might evaluate parsed expressions by evaluating the
subtrees and then applying the appropriate operation. We might get
different results if we had different parse trees.
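To see this concretely, here is a short sketch that evaluates two trees bottom-up, with parse trees written as nested tuples (an encoding chosen purely for illustration):

```python
import operator

# Each tree is either a number or a tuple (op, left, right);
# evaluate the subtrees, then apply the operation at the root.
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def evaluate(tree):
    if isinstance(tree, (int, float)):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left), evaluate(right))

# Two parse trees for  8 - 3 - 2:
left_first  = ("-", ("-", 8, 3), 2)    # (8 - 3) - 2
right_first = ("-", 8, ("-", 3, 2))    # 8 - (3 - 2)
print(evaluate(left_first), evaluate(right_first))  # 3 7
```

The same string yields 3 under one tree and 7 under the other, which is exactly why an ambiguous grammar is a problem.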
- How do we get around ambiguity? Here's one technique:
- We identify potential areas of ambiguity (often, using tools that
tell us about such ambiguities).
- We decide which parse tree is correct for our intended meaning.
- We rewrite our grammar to eliminate the ambiguity.
- We ensure that the new grammar describes an identical language
to the old grammar.
- The standard expression grammar is ambiguous. Why? Because there
are multiple ways to parse expressions with two operations, such as
num + num * num.
- It might be parsed (in shorthand) as
          Exp
        /  |  \
     Exp   +   Exp
      |      /  |  \
     num   Exp  *  Exp
- It might also be parsed as
          Exp
        /  |  \
     Exp   *   Exp
   /  |  \      |
 Exp  +  Exp   num
- Similarly, num - num - num might be parsed with
two different trees.
- Is this a problem? Yes, in both cases.
- Which tree is correct?
- In the first case (num+num*num), it's the first tree, which shows
us doing addition
after multiplication. This is because multiplication has
precedence over addition.
- In the second case (num-num-num), it's the second tree, which shows us
doing the second subtraction second. This is because subtraction is
left associative.
- How do we resolve these problems? It turns out that it's easiest
to resolve the two problems separately.
- However, solving one does give us a clue to the other.
- We're going to ignore the unary operators for this discussion.
- How do we handle precedence?
- By considering what's wrong with the misparsed tree(s).
- By identifying the rules that lead to the incorrect trees.
- By dividing expressions up into categories to eliminate such problems.
- What's wrong with that tree?
- There's a subtree of a ``multiplication tree'' that contains
an unparenthesized addition.
- Since one executes the subtree first, one is forced to do the
operations out of order.
- What does this suggest about categories? That we need a
category for ``does not include unparenthesized addition''. We'll call
it a Term.
- It may be an odd name, but it's the one that tradition dictates.
- What is a
Term? Something that may include
multiplication but does not include unparenthesized addition.
- It need not include multiplication.
- What other categories do we need? We also need a category
for stuff that can include addition.
- It could be ``stuff that includes unparenthesized addition''.
- It could be ``stuff that may (but need not) include unparenthesized
addition''.
- Which do you prefer? We'll use the second, since it's
close to our definition of Term.
- We observe that there are two possibilities for a Term.
- Something that includes multiplication as the ``top level'' operation.
- Something that doesn't include multiplication as the ``top level''
operation. We'll call this a Factor.
- What expressions include multiplication as the top level operation?
We need multiplication. We also need safe ``arguments''. But we've
defined such safe arguments as Terms:
Term ::= Term MulOp Term
MulOp ::= '*' | '/'
- What are the other
Terms? Those that don't include
multiplication. We've already called those Factors:
Term ::= Factor
- What are the factors? The expressions that include parentheses and
the base expressions.
Factor ::= '(' Exp ')'
Factor ::= num
Factor ::= id
- Now, let's return to general expressions. These may or may not
include addition at the root.
- As with
Term, we can separate them into those that do and
those that don't.
Exp ::= Exp AddOp Exp
Exp ::= Term
AddOp ::= '+' | '-'
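The Exp/Term/Factor layering translates almost directly into a hand-written evaluator. The Python sketch below replaces the left-recursive rules with loops, a standard move when hand-writing such parsers (as a side effect, the loops also make the operators left associative, which the grammar itself leaves open); the tokenizer and all names are illustrative assumptions.

```python
def tokenize(source):
    """Turn '8-3-2' into tokens: ints for numbers, 1-char strings for operators."""
    tokens, i = [], 0
    while i < len(source):
        if source[i].isdigit():
            j = i
            while j < len(source) and source[j].isdigit():
                j += 1
            tokens.append(int(source[i:j]))
            i = j
        else:
            tokens.append(source[i])
            i += 1
    return tokens

def parse_exp(tokens, pos):
    """Exp ::= Term (AddOp Term)*"""
    value, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        rhs, pos = parse_term(tokens, pos + 1)
        value = value + rhs if op == "+" else value - rhs
    return value, pos

def parse_term(tokens, pos):
    """Term ::= Factor (MulOp Factor)*"""
    value, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("*", "/"):
        op = tokens[pos]
        rhs, pos = parse_factor(tokens, pos + 1)
        value = value * rhs if op == "*" else value / rhs
    return value, pos

def parse_factor(tokens, pos):
    """Factor ::= '(' Exp ')' | num"""
    if tokens[pos] == "(":
        value, pos = parse_exp(tokens, pos + 1)
        return value, pos + 1          # skip the ')'
    return tokens[pos], pos + 1        # a number token

def evaluate(source):
    value, _ = parse_exp(tokenize(source), 0)
    return value

print(evaluate("1+2*3"), evaluate("8-3-2"))  # 7 3
```

Because parse_exp only ever calls parse_term for its operands, an addition can never appear unparenthesized inside a multiplication, which is precisely the precedence property the layered grammar encodes.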
- Have we eliminated the problem with precedence? If we try
to misparse our original expression, we'll find that we can't even
build the offending tree.
- Do we have the same language? Informally, yes. It's clear that
we haven't added any strings. Have we removed any? No, though I wouldn't
want to have to argue that formally.
- Those of you who've taken CSC341 may want to exercise those
skills by proving that the two grammars are equivalent.
- Created Tuesday, January 19, 1999 as a blank outline.
- Filled in the details on Friday, April 9, 1999.
Many of the details on
language syntax were based on
outline 5 and
outline 6 of
CS302 98S, although they were rewritten to accommodate changes to the structure of the
class (we're covering topics in a much different order).
- Removed uncovered details on Monday, April 12, 1999.