# Class 08: Introduction to Grammars and Parsing

Held Monday, September 16, 2002

Summary

Today we move from lexing to parsing.

Notes

• Quick quiz: Where in Pascal can you not tell immediately whether a character read is part of the current token?
• Congratulations to Senator French (or is that the Senator from France?).
• Are there any questions on homework 1?
• Don't forget the Math/CS picnic on Friday!
• Is anyone bringing parents to class on Friday?
• Yes, I know that last year's notes occasionally appear in these pages. I'll get rid of them eventually.
• I'll be hanging out in Ben Gum's Algorithms class from 2:15-3:05 this week.
• As many of you have probably noted, different people learn and understand material differently. Some people learn best from formal nootation; others learn well from casual analysis; still others learn best from examples.
• I try to balance the ways in which I teach things. Let me know if I can better serve your best learning style.

Assignments

Overview

• Limits of regular expressions
• BNF Grammars
• Form
• Some examples
• Context-Free Grammars

## Limits of Regular Expressions and Finite Automata

• As we've seen before, there are some limits of regular expressions and finite automata.
• In particular, there are some simple languages that we cannot express.
• Consider the language strings of `a`'s and `b`'s with equal numbers of `a`'s and `b`'s
• This language cannot be expressed by a finite automaton.
• The proof is a proof by contradiction. Suppose it could be expressed by a deterministic finite automaton, A.
• A has a finite number of states, n
• Record the sequence of states the automaton goes through given `a`(n+1) as input.
• At least one state, q, must be duplicated in that sequence. Suppose it happens after `a`i and `a`j, with i and j different.
• If the automaton accepts `a`i`b`i and `a`j`b`j (which it should), then it will also accept `a`j`b`i and `a`i`b`j and (which it should not).
• Sinc every regular expression and every NFA can be represented by DFA, that language also cannot be expressed by NFAs or regular expressions.
• There are a host of other similar languages that cannot be expressed by finite automata.
• anbn
• palindromes
• balanced parentheses
• ...
• So, what do we do? We move on to a more powerful notation.
• Note that you may need pretty powerful notations.
• For example, our equal numbers of `a`'s and `b`'s may not be implementable on any machine with finite memory.

## An Introduction to Grammars

• In order to express more complicated (or more sophisticated) languages, we need a more powerful tool.
• Preferably, this tool will support some form of recursion or looping.
• For example, a string has equal numbers of `a`'s and `b`'s if
• the string is empty or
• the string has at least one `a` and at least one `b` and, if we remove an `a` and a `b` from the string, we end up with a string with equal numbers of `a`'s and `b`'s.
• Alternatively, a string has equal numbers of `a`'s and `b`'s if
• the string is empty or
• the string is built by adding an `a` and a `b` to a string with an equal number of `a`'s and `b`'s.
• The most common tool used for these recursive definisions is the Backus-Naur Form (BNF) grammar.
• A grammar is a formal set of rules that specify the legal utterances in the language.

## BNF grammars

• Formally, BNF grammars are four-tuples that contain
• Sigma, the alphabet from which strings are built (note that this alphabet may have been generated from the original program via a lexer). These are also called the terminal symbols of the grammar.
• N, the nonterminals in the language. You can think of these as names for the structures in the language or as names for groups of strings.
• S in N, the start symbol. In BNF grammars, all the valid utterances are of a single type.
• P, the productions that present rules for generating valid utterances.
• The set of productions is the core part of a grammar. Often, language definitions do not formally specify the other parts because they can be derived from the grammar.
• A production is, in effect, a statement that one sequence of symbols (a string of terminals and nonterminals) can be replaced by another sequence of terminals and nonterminals. You can also view a production as an equivalence: you can represent the left-hand-side by the right-hand-side.
• If we read productions from left to right, we have a mechanism for building strings in the language.
• If we read productions from right to left, we have a mechanism for testing language membership.
• What do productions look like? It depends on the notation one uses for nonterminals and terminals, which may depend on the available character set.
• We'll use words that begin with capital letters to indicate nonterminals, words that begin with lowercase letters to indicate terminals, and stuff in quotation marks to indicate particular phrases. We'll use `::=` to separate the two parts of a rule.
• We might indicate the standard form of a Pascal program with
```Program ::= PROGAM
identifier
OPEN_PAREN
Identifier-List
CLOSE_PAREN
Declaration-List
Compound-Statement
END_PROGRAM
```
• Similarly, we might indicate a typical compound statement in Pascal with
```Compound-Statement ::= BEGIN
Statement-List
END
```
• Lists of things are often defined recursively, using multiple definitions. Here's one for non-empty statement lists. It says, a statement list can be a statement; it can also be a statement followed by a semicolon and another statement list.
```Statement-List ::= Statement
Statement-List ::= Statement SEMICOLON Statement-List
```
• The statements may then be defined with
```Statement ::= Assignment-Statement
Statement ::= Compound-Statement
...
Assignment-Statement ::= identifier ASSIGN Expression
...
```
• How do we use BNF grammars? We show derivations by starting with the start symbol and repeatedly applying productions. Eventually, we end up with a sequence of terminals.
• Any string that can be produced in this way is in the language.
• The attempt to show that a string is in a language given by a BNF grammar is called parsing that string.
• We'll soon see a more formal way of parsing.

## History

Thursday, 29 August 2002

• First version, based somewhat on outlines from CS362 2001S.

Disclaimer: I usually create these pages on the fly, which means that I rarely proofread them and they may contain bad grammar and incorrect details. It also means that I tend to update them regularly (see the history for more details). Feel free to contact me with any suggestions for changes.

This document was generated by Siteweaver on Fri Dec 6 10:38:00 2002.
The source to the document was last modified on Wed Sep 4 10:08:34 2002.
This document may be found at `http://www.cs.grinnell.edu/~rebelsky/Courses/CS362/2002F/Outlines/outline.08.html`.

You may wish to validate this document's HTML ; ; Check with Bobby

Glimmer Labs: The Grinnell Laboratory for Interactive Multimedia Experimentation & Research
glimmer@grinnell.edu