# Class 8: Introduction to grammars and parsing

Held Monday, September 14, 1998

Handouts

• The NFA and DFA we developed (or worked on developing) in the previous class.

Notes

• We're wrapping up lexical analysis today. Make sure that you understand everything (but feel free to ask questions in subsequent classes if you feel that you don't).
• Yet another day of no quizzes.
• Note that I've modified the section on ``from NFA to DFA'' slightly.
• Take home question: Write a context-free grammar for ``strings of a's and b's with equal numbers of a's and b's''. How would you show that this grammar is correct?
• Not all of you have filled out the survey on the first debates. Please do so today. The current trend is to use something other than a debate for your public presentations, possibly to have you lead a short (half-class) discussion of some topic we would not otherwise cover.
• If you're going to miss class, please let me know (preferably before class; otherwise after class).

## Tokenizing, concluded

### From NFA to DFA

• How do we turn this lovely NFA into a deterministic computing machine?
• Basically, we build a DFA that simultaneously follows all the paths that we can go through in the NFA.
• Each state in the DFA is a set of states in the corresponding NFA.
• The algorithm is again fairly simple.
• Terminology:
• qx is a state in the NFA and q0 is its start state.
• QX is a state in the DFA (that is, a set of NFA states) and Q0 is its start state.
```
Q0 = { q0 }
// but there are some states we can reach from q0 at no cost
Q0 = epsilon-closure(Q0)
while there are states we haven't processed
    pick one such state, Qn
    for each symbol s
        let tmp be a new set
        for each q in Qn
            add every state in delta(q, s) to tmp
        end for
        tmp = epsilon-closure(tmp)
        if tmp is not in the DFA then
            let Qi be a new state
            Qi = tmp
        else
            let Qi be the state equivalent to tmp
        end if
        add an edge labeled s from Qn to Qi in the automaton
    end for
end while
for each Qi
    if there is a q in Qi that is a final state then
        Qi is a final state
    end if
end for
```
• A DFA state may contain several NFA final states. When we mark such a DFA state as final, we give it the token type of the highest-priority NFA final state it contains.
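• The pseudocode above can be sketched in Python roughly as follows. This is only an illustration, not the course's reference implementation: the dictionary encoding of the NFA, the use of `None` as the epsilon label, the function names, and the sample NFA for (a|b)*ab are all assumptions made for the example. Empty sets are simply left out, which amounts to an implicit dead state.

```python
from collections import deque

EPS = None  # label used for epsilon moves (an assumption of this sketch)

def epsilon_closure(states, delta):
    """All NFA states reachable from `states` using only epsilon moves."""
    closure = set(states)
    stack = list(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, EPS), ()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return frozenset(closure)

def nfa_to_dfa(alphabet, delta, q0, finals):
    """Subset construction; each DFA state is a frozenset of NFA states."""
    start = epsilon_closure({q0}, delta)
    dfa_delta = {}
    seen = {start}
    work = deque([start])           # DFA states we haven't processed yet
    while work:
        Qn = work.popleft()
        for s in alphabet:
            tmp = set()
            for q in Qn:
                tmp |= set(delta.get((q, s), ()))
            Qi = epsilon_closure(tmp, delta)
            if not Qi:
                continue            # no move on s: implicit dead state
            dfa_delta[(Qn, s)] = Qi
            if Qi not in seen:      # tmp is not in the DFA yet
                seen.add(Qi)
                work.append(Qi)
    # A DFA state is final if it contains at least one NFA final state.
    dfa_finals = {Q for Q in seen if Q & finals}
    return start, dfa_delta, dfa_finals, seen

# Hypothetical NFA for (a|b)*ab: state 0 is the start, state 2 is final.
delta = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
start, dfa_delta, dfa_finals, dfa_states = nfa_to_dfa("ab", delta, 0, {2})

def accepts(word):
    """Run the constructed DFA on `word`."""
    state = start
    for ch in word:
        state = dfa_delta.get((state, ch))
        if state is None:
            return False
    return state in dfa_finals
```

• For this small NFA the construction produces only three DFA states, {0}, {0,1}, and {0,2}. Token priorities are not modeled here.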

### From DFA to Optimal DFA

• As you can probably tell, the DFAs created by this algorithm can be huge: an NFA with n states can yield a DFA with as many as 2^n states.
• Hence it behooves us to simplify them.
• How do you build a DFA with fewer states from a DFA? Once again, the process is relatively easy.
• Once again, the process involves building states from sets of states.
• This time, however, we will need to split sets (making them smaller) rather than add elements to sets.
• We will use Q for the states in the original DFA and Z for states in the new DFA.
• Our goal is to find states that behave differently.
• We begin with two sets: a set of the final states and a set of the nonfinal states (obviously, these are different kinds of states). [If there are different actions or token names attached to the final states, we may want to begin with more sets, one for each ``type'' of final state.]
• Then, we try to split sets. How do we know when to split a set, Z? We look for a symbol, s, and two states, Qi and Qj, both in Z, such that delta(Qi,s) is in Zk and delta(Qj,s) is in Zl, with Zk and Zl different. Qi and Qj clearly behave differently on s, and thus must be placed in different sets. We repeat until no set can be split; each remaining set becomes one state of the new DFA.
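• The splitting process can be sketched in Python as below. This is a simple partition-refinement loop written for illustration, not a tuned algorithm; the function name, the dictionary encoding of delta, and the four-state sample DFA are assumptions. A missing transition is treated as a move to an implicit dead state.

```python
def minimize(states, alphabet, delta, finals):
    """Split sets of states until all members of a set agree on every symbol."""
    # Begin with two sets: the final states and the nonfinal states.
    partition = [g for g in (set(finals), set(states) - set(finals)) if g]
    changed = True
    while changed:
        changed = False
        # Map each state to the index of the set that currently holds it.
        group_of = {q: i for i, g in enumerate(partition) for q in g}
        new_partition = []
        for g in partition:
            # Two states stay together only if, for every symbol,
            # their successors lie in the same set.
            buckets = {}
            for q in g:
                sig = tuple(group_of.get(delta.get((q, s))) for s in alphabet)
                buckets.setdefault(sig, set()).add(q)
            if len(buckets) > 1:
                changed = True
            new_partition.extend(buckets.values())
        partition = new_partition
    return partition

# Hypothetical DFA for strings over {a, b} that end in "ab".
# States 0 and 2 behave identically, so they should end up in one set.
delta = {
    (0, "a"): 1, (0, "b"): 2,
    (1, "a"): 1, (1, "b"): 3,
    (2, "a"): 1, (2, "b"): 2,
    (3, "a"): 1, (3, "b"): 2,
}
groups = minimize({0, 1, 2, 3}, "ab", delta, {3})
```

• Each set in `groups` corresponds to one state Z of the smaller DFA. To handle several token types, one would start with one set per type of final state rather than a single set of finals.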

### lex, Jlex, and flex

• Because all of these conversions are "so easy", there is no reason for you to do them by hand (other than to make sure that you could write a program to do it for you if you were required to do so).
• The original lexical analysis tool was `lex`.
• The GNU version is `flex`.
• The Java version is `Jlex`.
• `lex` is in `/usr/bin/lex` and `flex` is in `/usr/local/bin/flex`. Blade has installed `Jlex` in the course directory.
• Blade has also printed manuals for the various tools.

## Limits of regular expressions and finite automata

• As we've mentioned before, there are some limits of regular expressions and finite automata.
• In particular, there are some ``simple'' languages that we cannot express.
• Consider the language ``strings of a's and b's with equal numbers of a's and b's''.
• This language cannot be expressed by a finite automaton.
• The proof is a proof by contradiction. Suppose it could be expressed by a deterministic finite automaton, A.
• A has a finite number of states, n.
• Record the sequence of states the automaton goes through given a^(n+1) as input.
• That sequence contains n+2 states (counting the start state), so at least one state, q, must be duplicated in it. Suppose the automaton is in q after reading a^i and again after reading a^j, with i and j different.
• If the automaton accepts a^i b^i (which it should), then it will also accept a^j b^i (which it should not), because both strings leave it in the same state.
• Since every regular expression and every NFA can be represented by a DFA, that language also cannot be expressed by NFAs or regular expressions.
• There are a host of other similar languages that cannot be expressed in that way.
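• To make the argument concrete, here is a small Python sketch that, given any complete DFA over {a, b}, finds two different powers of a that lead to the same state. The helper names and the three-state "counting mod 3" DFA are assumptions chosen for the example; the sample DFA is a plausible but doomed attempt at recognizing the language.

```python
def run(delta, start, word):
    """Return the state the DFA reaches after reading `word`."""
    q = start
    for ch in word:
        q = delta[(q, ch)]
    return q

def find_collision(delta, start, n_states):
    """Return (i, j) with i < j such that a^i and a^j reach the same state.

    Reading a^0, a^1, ..., a^n visits n+1 states, but only n distinct
    states exist, so a repeat must occur (pigeonhole principle).
    """
    seen = {}
    q = start
    for i in range(n_states + 1):
        if q in seen:
            return seen[q], i
        seen[q] = i
        q = delta[(q, "a")]
    raise AssertionError("unreachable for a complete DFA with n_states states")

# Hypothetical DFA: tracks (number of a's - number of b's) mod 3
# and accepts when the count is 0.
delta = {}
for q in range(3):
    delta[(q, "a")] = (q + 1) % 3
    delta[(q, "b")] = (q - 1) % 3
finals = {0}

i, j = find_collision(delta, 0, 3)
good = "a" * i + "b" * i   # equal numbers of a's and b's
bad = "a" * j + "b" * i    # unequal numbers, since i != j
same_state = run(delta, 0, good) == run(delta, 0, bad)
```

• Because `good` and `bad` end in the same state, the DFA must either accept both or reject both, so it gets one of them wrong. The same search succeeds for every DFA, whatever its size.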

## An introduction to grammars

• In order to express more complicated (or more sophisticated) languages, we need a more powerful tool.
• Preferably, this tool will support some form of recursion or looping. For example, a string has equal numbers of a's and b's if
• the string is empty or
• the string has an a and a b and, if we remove an a and a b from the string, we end up with a string with equal numbers of a's and b's.
• The most common tool used is the Backus-Naur Form (BNF) grammar.
• A grammar is a formal set of rules that specify the legal utterances in the language.
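• The recursive description of ``equal numbers of a's and b's'' given earlier in this section translates almost directly into code. The Python sketch below is only meant to show the recursion; the function name is an assumption, and it assumes the input contains no characters other than a and b.

```python
def balanced(s):
    """True if s (a string of a's and b's) has equal numbers of each."""
    # Base case: the empty string has zero a's and zero b's.
    if s == "":
        return True
    # Recursive case: remove one a and one b, then check what is left.
    if "a" in s and "b" in s:
        rest = s.replace("a", "", 1).replace("b", "", 1)
        return balanced(rest)
    # Only a's or only b's remain, so the counts differ.
    return False
```

• Note that this checks membership by recursion on the string; a grammar instead describes how to generate every string in the language.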

### BNF grammars

• BNF grammars are four-tuples that contain
• Sigma, the alphabet from which strings are built (note that this alphabet may have been generated from the original program via a lexer). These are also called the terminal symbols of the grammar.
• N, the nonterminals in the language. You can think of these as names for the structures in the language or as names for groups of strings.
• S in N, the start symbol. In BNF grammars, all the valid utterances are of a single type.
• P, the productions that present rules for generating valid utterances.
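• The set of productions is the core part of a grammar. Often, language definitions do not formally specify the other parts because they can be derived from the productions: the nonterminals are the symbols that appear on left-hand sides, the terminals are the remaining symbols, and the start symbol is conventionally the left-hand side of the first production.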
• A production is, in effect, a statement that one sequence of symbols (a string of terminals and nonterminals) can be replaced by another sequence of terminals and nonterminals. In a BNF grammar, the left-hand side is always a single nonterminal. You can also view a production as an equivalence: you can represent the left-hand side by the right-hand side.
• What do productions look like? It depends on the notation one uses for nonterminals and terminals, which may depend on the available character set.
• We'll use words that begin with capital letters to indicate nonterminals, words that begin with lowercase letters to indicate terminals, and stuff in quotation marks to indicate particular phrases. We'll use `::=` to separate the two parts of a rule.
• We might indicate the standard form of a Pascal program with
```
Program ::= 'program'
            identifier
            '('
            Identifier-List
            ')'
            Declaration-List
            Compound-Statement
            '.'
```
• Similarly, we might indicate a typical compound statement in Pascal with
```
Compound-Statement ::= 'begin'
                       Statement-List
                       'end'
```
• Lists are defined recursively, using multiple definitions. Here's one for non-empty statement lists. It says, "a statement list can be a statement; it can also be a statement followed by a semicolon and another statement list".
```
Statement-List ::= Statement
Statement-List ::= Statement ';' Statement-List
```
• The statements may then be defined with
```
Statement ::= Assignment-Statement
Statement ::= Compound-Statement
...
Assignment-Statement ::= identifier ':=' Expression
...
```
• How do we use BNF grammars? We show derivations by starting with the start symbol and repeatedly applying productions. Eventually, we end up with a sequence of terminals.
• Any string that can be produced in this way is in the language.
• The attempt to show that a string is in a language given by a BNF grammar is called parsing that string.
• We'll soon see a more formal way of parsing.
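• As an informal illustration of derivations, the Python sketch below stores a few productions as (left-hand side, right-hand side) pairs, recovers the nonterminals and terminals from them, and carries out a leftmost derivation. The data layout and function name are assumptions, the grammar is a cut-down version of the statement grammar above using Pascal's `:=` for assignment, and the `Expression` rule is hypothetical (the notes do not define expressions).

```python
productions = [
    ("Statement-List", ("Statement",)),                               # 0
    ("Statement-List", ("Statement", ";", "Statement-List")),         # 1
    ("Statement", ("Assignment-Statement",)),                         # 2
    ("Assignment-Statement", ("identifier", ":=", "Expression")),     # 3
    ("Expression", ("identifier",)),                                  # 4 (hypothetical)
]

# The other parts of the grammar can be derived from the productions.
nonterminals = {lhs for lhs, _ in productions}
terminals = {sym for _, rhs in productions for sym in rhs} - nonterminals
start_symbol = productions[0][0]

def derive(start, productions, choices):
    """Leftmost derivation: at each step, rewrite the leftmost nonterminal
    using the production whose index comes next in `choices`."""
    nts = {lhs for lhs, _ in productions}
    form = [start]
    steps = [list(form)]
    for k in choices:
        lhs, rhs = productions[k]
        # Position of the leftmost nonterminal in the current form.
        pos = next(i for i, sym in enumerate(form) if sym in nts)
        if form[pos] != lhs:
            raise ValueError(f"production {k} does not apply to {form[pos]}")
        form[pos:pos + 1] = rhs
        steps.append(list(form))
    return steps

# Derive: identifier := identifier ; identifier := identifier
steps = derive(start_symbol, productions, [1, 2, 3, 4, 0, 2, 3, 4])
for form in steps:
    print(" ".join(form))
```

• The last line printed contains only terminals, so that string is in the language. Parsing is the reverse problem: given the string, find a sequence of choices like the one passed to `derive`.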

Disclaimer: Often, these pages were created "on the fly" with little, if any, proofreading. Any or all of the information on the pages may be incorrect. Please contact me if you notice errors.