# Midterm Solutions

• Preliminary Notes
• Problem 1: Tokenizing HTML files
• Introductory notes
• Solution
• Difficulties
• Problem 2: Comparing automata
• 2.a: A regular expression
• 2.b: A finite automaton
• 2.c: A grammar
• 2.d: A parser
• A predictive parser
• A shift-reduce parser
• Difficulties
• 2.e: Reflection
• My Thoughts
• Problem 3: A grammar for regular expressions
• Solution
• Difficulties
• Problem 4: Attribute grammars for binary numbers
• Problem 4.a: Nonnegative binary numbers
• Solution
• Difficulties
• Alternate Solutions
• Problem 4.b: Two's Complement
• Solution
• Alternate Solutions
• Problem 5: Details from Appel
• Problem 5.a: Details from the Red Book
• Problem 5.b: Details from the Green Book

## Preliminary Notes

In general, the exams were awful. While most of you did at least one question nearly perfectly (which question it was varied widely), many of you also missed some questions completely. As a makeup, I will give two one-hour, closed book, one problem exams while I am away. Each makeup exam can replace one problem on your exam (and you may decide which problem).

On a more casual note, please remember that the word ``grammar'' has no ``e''s in it. If anyone misspells ``grammar'' again, I will penalize you severely.

## Problem 1: Tokenizing HTML files

In the abstract, we can think of an HTML file as containing a sequence of two ``types'' of tokens: tags and chunks of text. We might describe tags (at least a simplified version of tags) as:

A tag is a sequence of characters beginning with a less-than sign, < and ending with a greater-than sign, >. The first greater-than sign that appears after the opening less-than sign closes the tag, unless that greater-than sign appears within a quoted string. Quoted strings begin with either a single quote, ', or a double quote, ", and must end with the same symbol. A single-quoted string cannot contain single quotes and a double-quoted string cannot contain double quotes. Single quotes can appear within double-quoted strings and double quotes can appear within single-quoted strings.

A chunk is any nonempty sequence of characters that's not a tag.

Write a `lex`, `flex`, or `Jlex` program that describes tags and chunks.

### Introductory notes

Some of you assumed that quoted strings had to appear within the tag. However, the definition does say ``The first greater-than sign that appears after the opening less-than sign closes the tag, unless [...]'', suggesting that some characters don't appear in quoted strings.

The ``any nonempty sequence of characters that's not a tag'' was somewhat ambiguous. For example, is ```<a`'' a tag? It's really an error, rather than a tag or a chunk. In general, a tag is anything not containing a less-than sign.

A few of you decided to do some error checking for problems like the one noted above. You received a small amount of extra credit for error checking.

Surprisingly, the request to write a program that ``describes'' tags and chunks was also ambiguous. Traditionally, a `lex` program identifies the nonterminals (in this case, tag and chunk) and returns a note that it's found such a symbol. The program should also ensure that the corresponding input string is stored in yytext. Many of you ignored this component and instead printed out some message. I did not penalize you for reasonable alternatives.

### Solution

This is approximate lex-syntax. I cared more about the ability to describe tokens than about the lex stuff.

```\<((\"[^\"]*\")|(\'[^\']*\')|([^<>\'\"]))*\>	{ return TOKEN_TAG; }
```

Some of you chose to take advantage of the ``contextual expressions'', starting and ending tags. However, this does make it more difficult to maintain the string for the token. In the sample code below, I've used `startTag` to begin the string for a tag, `addToTag` to add more text to it, and `getTag` to get that string.

```%START TAG
%{
Code for TOKEN_TAG and TOKEN_CHUNK
%}
%%
\<		{ startTag(); addToTag("<"); BEGIN TAG; }
yytext = getTag();
BEGIN INITIAL;
%%
```

Note that the final rule for `TAG` takes advantage of lex's ``the first matching rule wins'' strategy. However, it does mean that `<">` is treated as a tag, which is incorrect. The correct line should be

```<TAG>[^\'\"]	{ addToTag(yytext); }
```

It is also possible to use contexts for ``in a string'' as in the following version.

```\<		{ startTag(); addToTag("<"); BEGIN TAG; }
yytext = getTag();
BEGIN INITIAL;
<TAG>\"		{ addToTag("\"); BEGIN DOUBLEQUOTE; }
<TAG>\'		{ addToTag("'"); BEGIN SINGLEQUOTE; }
<SINGLEQUOTE>\'	{ addToTag(yytext); BEGIN TAG; }
<DOUBLEQUOTE>\"	{ addToTag(yytext); BEGIN TAG; }
```

In this case, it is valid to use the dot, since the previous pattern is a single character rather than a full string.

### Difficulties

Some of you treated single characters as chunks. This means that a character, rather than ``sequence of characters'' is returned.

A surprising number of you forgot about the ``negation set'', leading to overly complicated grammars. Others of you assumed that you could negate regular expressions, which is not legal.

## Problem 2: Comparing automata

In this problem, you will build a variety of representations for the language ``strings of `a`'s and `b`'s beginning with an `a` and ending with a `b`.''

### 2.a: A regular expression

Build a regular expression for the language of ``strings of `a`'s and `b`'s beginning with an `a` and ending with a `b`.''

a([ab]*)b

### 2.b: A finite automaton

Build an optimal deterministic finite automaton for the language of ``strings of `a`'s and `b`'s beginning with an `a` and ending with a `b`.'' You can use the formal translation from regular expressions to finite automata or you can build the DFA by hand.

I've done the optimization by hand and then represented the automaton textually, rather than graphically. Basically, the first a puts you in an intermediate state, a b in that state puts you in a final state. An a in the final state brings you back to the intermediate state.

Alphabet: { a b }
Initial state: q0
Final states: { q2 }

 State Symbol New-State q0 a q1 q1 a q1 q1 b q2 q2 a q1 q2 b q2

Here's an ASCII-graphic to represent that automaton.

```
a           b
+----+      +----+
a    v    |  b   v    |
---> (0) ---> (1) --+---> (2) --+
^           |
+-----------+
a
```

### 2.c: A grammar

Write a context-free grammar for the language of ``strings of `a`'s and `b`'s beginning with an `a` and ending with a `b`.''

This should be somewhat straightforward. I've used `E` for ``any sequence of `a`s and `b`s''.

```S ::= a E b
E ::= a E
|  b E
|  epsilon
```

### 2.d: A parser

Build a predictive parsing table for the language of ``strings of `a`'s and `b`'s beginning with an `a` and ending with a `b`.'' It is likely that you will want to use one of the formal mechanisms for converting a context-free grammar to a predictive parser. You may find that you need to modify the context-free grammar from the previous step in order to build the parsing table.

#### A predictive parser

To build a predictive parser, we must first build the first and follow tables. Note that I've used the end-of-string marker, `\$` as the follow symbol of the start state.

```Symbol	First	Follow	Nullable
S	a	\$
E       a,b     b	yes
```

Now, recall that a predictive parsing table has rows indexed by nonterminals and columns indexed by terminals. An entry in the table is a production that should be executed when one is trying to match the nonterminal and sees the particular symbol.

If `N ::= alpha` then for each element, `s` of first(alpha), we put `N ::= alpha` at `(N,s)`.

If `N ::= alpha` and nullable(alpha), then for each element, `s` of follow(N), we put `N ::= alpha` at `(N,s)`.

For convenience, I'll just write the right hand sides of the production in the table.

```	a	b	\$

S	a E B

E	a E	b E
epsilon
```

Whoops! There are two entries in one position in the table. This means that this particular grammar cannot be parsed by an LL(1) parser. What can we do? We can try an LL(2) parser or we can change the grammar.

Here's an LL(2) parser that works

```	aa	ab	ba	bb	a\$	b\$

S	a E b	a E b

E	a E	a E	b E	b E		epsilon
```

A small change, but it makes a big difference. We know know that when the remaining characters are `b` and the end-of-string mark, we reduce `E` to the empty string.

It is somewhat more difficult to produce an appropriate grammar. Here is one I came up with. The `EndsWithB` nonterminal is used to indicate a sequence of `a`s and `b`s ending in `b`. The `Bs` nonterminal is used for a sequence of 0 or more `b`s.

```S ::= a NeedsB
NeedsB ::= a NeedsB	// If you see an a, you still need a b
NeedsB ::= b Bs Extra	// A sequence of b's followed by ...
Extra ::= epsilon	// (1) The empty string or
Extra ::= a NeedsB	// (2) An a and something ending in b
Bs ::= b Bs
Bs ::= epsilon
```

Now, we can build the helper tables.

```Symbol	First	Follow	Nullable
S	a	\$
NeedsB	a,b	\$
Extra	a	\$	yes
Bs	b	a,\$	yes
```

And finally the parse table.

```	a		b		\$

S	a NeedsB

NeedsB	a NeedsB	b Bs Extra

Extra	a NeedsB			epsilon

Bs	epsilon		b Bs		epsilon
```

#### A shift-reduce parser

Some of you suggested that it might be better to build a shift-reduce parser. However, this is also difficult.

We start by building an LR(0) automaton, with hopes of doing LR(0) or SLR(0) matching.

```s0:
S' ::= * S \$
S ::= * a E b
s1:
S' ::= S * \$
s2:
S ::= a * E b
E ::= * a E
E ::= * b E
E ::= *
```

Whoops! We can already tell that this won't work because we can't decide between shifting a b and reducing to epsilon.

Let's try an LR(1) automaton.

```s0:
S' ::= * S \$ , ?
S ::= * a E b , \$
s1:
S' ::= S * \$ , ?
s2:
S ::= a * E b , \$
E ::= * a E , b
E ::= * b E , b
E ::= * , b
```

Whoops! We still have the same problem. We don't know whether to shift or to reduce upon seeing a b in state 2. Let's try rewriting the grammar slightly.

```r0: S ::= a E b
r1: E ::= E a
r2: E ::= E b
r3: E ::= epsilon
```

Now, let's try building an LR(0) automaton.

```s0:
S' ::= * S \$
S ::= * a E b
s1:
S' ::= S * \$
s2:
S ::= a * E b
E ::= * E a
E ::= * E b
E ::= *
```

So far, so good. We can reduce to epsilon when we see an `a` or a `b`. A little bit odd, but let's see what else it gives us.

```s3:
S ::= a E * b
E ::= E * a
E ::= E * b
s4:
E ::= E a *
s5:
S ::= a E b *
E ::= E b *
```

Now, there seems to be a reduce-reduce conflict in state s5. However, you'll note that follow(S) includes only \$ and follow(E) includes only `a` and `b`. Hence, we can successfully build an SLR(1) automaton.

```    a  b  \$   S  E
s0  s2        g1
s1       acc
s2  r3 r3        g3
s3  s4 s5
s4  r1 r1
s5  r2 r2 r0
```

#### Difficulties

Many of you got this problem wrong. Why? Because you either produced an invalid table (one with two entries) or because you decided to drop one of the overlapping entries and produced a table that wouldn't parse. For example, a surprising number of you turned in something similar to the following table.

```	a	b	\$

S	a E b

E	a E	b E	epsilon
```

But consider what happens when we try to match `abb`

```To Match		Input

S			a b b \$
a E b			a b b \$
E b			b b \$
b E b			b b \$
E b			b \$
b E b			b \$
E b			\$
b			\$
Boom!
```

### 2.e: Reflection

What does your work on this problem suggest to you?

#### My Thoughts

For some languages, it is significantly simpler to build a deterministic finite automaton than to build a shift-reduce parser. In fact, this is why we have lexers in the first place: the simpler things can be done quicker and more efficiently with a lexer than with a parser.

I've also noted that left-recursive grammars are easier to parse with a shift-reduce parser than are right-recursive grammars.

This also makes it somewhat clear that the declarative definitions of languages (regular expressions and grammars) are significantly simpler than the operational definitions (automata and parsers).

A number of you noted that there was a clear progression between the four parts, with each step building on the last. Many also noted that it's nice that we can automate the translation between declarative definitions and operational definitions.

## Problem 3: A grammar for regular expressions

Write an unambiguous BNF grammar for the language of regular expressions over symbols (which are given by the `SYMBOL` token). That is, your grammar should accept only valid regular expressions using the base alphabet plus parenthesization, concatenation, alternation, Kleene star, question mark, plus, and set (bracket) notations. Make sure that your grammar is unambiguous and gives appropriate precedence to the various mechanisms for combining regular expressions.

### Solution

Note that we give precedence by using different nonterminals for each of the types of operators, with higher precedence operators appearing lower in the grammar. In that sense, regular expressions are just like arithmetic expressions, except that we use ``nothing'' for multiplication and a vertical bar for alternation.

```regexp ::= regexp OR regterm		// Alternation
regexp ::= regterm
regterm ::= regterm regfactor		// Concatenation
regterm ::= regterm DOT regfactor	// Alternate concat
regterm ::= regfactor
regfactor ::= regfactor STAR
regfactor ::= regfactor QUESTION
regfactor ::= regfactor PLUS
regfactor ::= regstuff
regstuff ::= SYMBOL
regstuff ::= LPAREN regexp RPAREN
regstuff ::= LBRACKET setcontents RBRACKET
regstuff ::= LBRACKET negation setcontents RBRACKET
regstuff ::= epsilon
setcontents ::= setcontents element
setcontents ::= element
element ::= SYMBOL DASH SYMBOL
element ::= SYMBOL
```

The alternate forms of concatenation and the ``negative sets'' were not required.

### Difficulties

A surprising number of you wrote incredibly ambiguous grammars. The problem did call for an unambiguous grammar.

A surprising number of you ignored precedence. The problem did call for you to give appropriate precedence.

A number of you ignored the restrictions on sets. Recall that sets can contain only symbols and ranges (not other regular expressions).

## Problem 4: Attribute grammars for binary numbers

Answer one of the following two questions. You may receive a modicum of extra credit if you successfully answer both.

### Problem 4.a: Nonnegative binary numbers

Here is a simple grammar that describes binary numbers.

```binum1 ::= bit binum2
binum ::= bit
bit ::= ZERO
bit ::= ONE
```

Peter and Paula Positive have decided to write an attribute grammar for nonnegative binary numbers. In that representation, the ith bit from the right is the 2ith place.

They've decided to begin their grammar as follows

```binum1 ::= bit binum2
binum1.value = bit.value + binum2.value
```

They're now a little bit stumped as to how they're going to get an appropriate value for the `bit`s that make up a binary number. Add rules that will do so. You may need to add other attributes in order to successfully compute that value.

Note that you need to preserve their one rule, which means that the value of a bit depends on its position in the string.

#### Solution

In order to give the appropriate value to each bit, we need to determine its position in the string. The grammar has conveniently been written so that the position is given by the length of the substring to the right.

```binum1 ::= bit binum2
bit.pos = binum2.length
binum1.value = bit.value + binum2.value
binum1.length = 1 + binum2.length
binum ::= bit
bit.pos = 0
binum.value = bit.value
binum.length = 1
bit ::= ZERO
bit.value = 0
bit ::= ONE
bit.value = 2bit.pos
```

#### Difficulties

A few of you ignored the restriction that ``you need to preserve their one rule''. That restriction was there for a reason. Without that restriction, the problem should be trivial.

Some of you updated attributes. Recall that attributes only get set once, and cannot be changed.

#### Alternate Solutions

A few of you used what I would call a ``hack'', but one that I accepted anyway. You assigned an attribute to `bit.value` and then used it immediately, as in the following.

```binum1 ::= bit binum2
bit.value = bit.val * 2binum2.length
binum1.value = bit.value + binum2.value
binum1.length = 1 + binum2.length
binum ::= bit
binum.value = bit.value
bit ::= ZERO
bit.val = 0
bit ::= ONE
bit.val = 1
```

### Problem 4.b: Two's Complement

Thelma and Charles Bitwise have decided to implement an attribute grammar for binary numbers in two's complement notation. They've decided that all such numbers must have at least two bits. So far, all they've been able to come up with is a core BNF grammar.

```twoscomplement ::= ZERO bits
twoscomplement ::= ONE bits
bits1 ::= bits2 bit
bits ::= bit
bit ::= ZERO
bit ::= ONE
```

Complete their attribute grammar. If you deem it necessary, you may also modify the underlying BNF grammar.

#### Solution

We know that the first bit in a two's complement number gives the sign of the number. An initial 0 indicates that it's a positive number, an initial 1 indicates that it's a negative number. Positive numbers are easy to translate. For negative numbers, we may recall that two's complement is quite similar to one's complement, and in one's complement, each 0 is worth -2pos and each 1 is worth 0. We need to subtract one from that result.

```twoscomplement ::= ZERO bits
bits.sign = positive
twoscomplement.value = bits.value
twoscomplement ::= ONE bits
bits.sign = negative
twoscomplement.value = bits.value - 1
bits1 ::= bits2 bit
bit.sign = bits1.sign
bits2.sign = bits1.sign
bits1.value = 2 * bits1.value + bit.value
bits ::= bit
bit.sign = bits.sign
bits.value = bit.value
bit ::= ZERO
bit.value = if bit.sign == positive then 0 else -1
bit ::= ONE
bit.value = if bit.sign == positive then 1 else 0
```

#### Alternate Solutions

A few of you took advantage of the rule like ``if a number is in two's complement, is negative, and has b bits, then you can compute 2^b - (value of remaining bits in unsigned notation)''. A source for or explanation of that rule would have been nice.

## Problem 5: Details from Appel

These problems are intended to test your reading of Appel. They focus on various details. Answer either the Java section or the C section.

### Problem 5.a: Details from the Red Book

5.a.i. In your introductory programming courses, you should have learned to indent your code to enhance readability. In particular, you typically select an indenting style that makes it clear what values are parameters of what method.

Unfortunately, the dictates of the printed page can make it difficult to indent appropriately. In addition, in certain cases other requirements besides nesting come into play.

Consider the code for the abstract syntax tree for the let expression from page 104 of the red book. Rewrite the code so that it follows the traditional rules of indentation (in particular, that any argument to a method should be indented further than the start of the method name).

Why do you think Appel chose this alternate indentation style?

I've noted that Appel is inconsistent in his construction of abstract syntax trees. In some places, he writes the commands for creating the trees (e.g., with `new`) but leaves out the positions. In others, he includes the positions but skips the `new`s. I'll use a hybrid, which has neither `new` nor position.

I've also tried to fill in some of the details that Appel elided.

Here is the original Tiger code

```let var a := 5
function f() : int = g(a)
function g(i: int) = f()
in
f()
end
```

Note that this is a semantically invalid program, since `f` returns integers but `g` is a procedure and therefore returns nothing. In addition, the program (if legal), appears to run forever.

Nonetheless, here is the abstract syntax tree

```LetExp(
DecList(
VarDec(symbol("a"), null, IntExp(5)),
DecList(
FunctionDec(
symbol("f"), // Function name
null,        // Parameter list
symbol(int), // Type
CallExp(     // Body
symbol("g"), // Called function
ExpList(     // Actual parameters
VarExp(SimpleVar(symbol("a"))) // First parameter
null)), // No other parameters
FunctionDec( // Related function
symbol("g"),  // Function name
FieldList(symbol("i"), symbol("int"), null), // Parameter list
null, // Type (this is a procedure)
CallExp(symbol("f"), null),
null) // No more related functions!
) // End outermost function declaration
) // End sub declaration list
), // End outermost declaration list
CallExp(symbol("f"), null)
) // end LetExp
```

And you thought Scheme and LISP had a lot of parentheses!

It is likely that Appel chose his indentation scheme so that the elements of a list appeared at the same indentation level. For example, you will note that all three declarations appear at the same indent, ``nicely'' lined up.

5.a.ii. In the code at the beginning of section 5.4 on page 123 of the red book, there are two consecutive calls to `beginScope`. Why?

One call begins a new scope for type variables, the other begins a new scope for function and normal variables.

5.a.iii. In the code for function declarations on page 125 of the red book, Appel does one `put` before beginning a scope and then multiple `put`s after beginning the scope. Why?

The first `put` adds the function name and return type to the current scope. That name and type may be needed by subsequent functions, so they must be added before beginning a new scope. After the new scope begins, the parameters are added. Since the parameters are not needed after the body of the function is type-checked, they must be done in a separate scope.

5.a.iv. In the code for variable declarations on page 124 of the red book, there is an instruction to `return null;`. Why?

Appel has noted that the `Exp` class is a detail left for a subsequent chapter. Eventually, in addition to type checking we'll want to build intermediate representation trees. The `Exp` is used in building such trees. Since we're presently doing only type checking, we need not return any intermediate representation. Does the translation still serve a purpose? Yes, it updates the symbol table.

However, it is important to note that the function does have a return type, rather than be declared as `void`. This is because Appel has planned ahead for the subsequent sections.

### Problem 5.b: Details from the Green Book

5.b.i. In your introductory programming courses, you should have learned to indent your code to enhance readability. In particular, you typically select an indenting style that makes it clear what values are parameters of what method.

Unfortunately, the dictates of the printed page can make it difficult to indent appropriately. In addition, in certain cases other requirements besides nesting come into play.

Consider the code for the abstract syntax tree for the let expression from page 99 of the green book. Rewrite the code so that it follows the traditional rules of indentation (in particular, that any argument to a method should be indented further than the start of the method name).

Why do you think Appel chose this alternate indentation style?

5.b.ii. In the code at the beginning of section 5.4 on page 118 of the green book, there are two consecutive calls to `S_beginScope`. Why?

5.b.iii. In the code for function declarations on page 120 of the red book, Appel does one `put` before beginning a scope and then multiple `put`s after beginning the scope. Why?

5.b.iv. Why does `transExp` return a struct but `transDec` return nothing?

Disclaimer Often, these pages were created "on the fly" with little, if any, proofreading. Any or all of the information on the pages may be incorrect. Please contact me if you notice errors.