Compilers (CS362 2002F)

Project, Phase 1: Lexical Analysis

Assigned: Monday, September 16, 2002
Due: Friday, September 27, 2002

Summary: In this stage of the project, you will design and build the lexical analyzer for your Pascal compilers.

Warning: You may be required to use each other's lexers at the next stage of the project.

Group Work: You should work in groups of size 3. For this stage, you can choose your own groups. I am likely to reassign groups for the next stage of the project. (Ananta does not count toward the 3 in a group.)

Building a Lexical Analyzer

1. Begin by deciding on the natural tokens for Pascal. You should do this in the next day or so.

2. Design Java classes for tokens. You should have a general Token interface or abstract class and then make classes for the individual tokens that subclass or implement Token (or that subclass a subclass or implement a subinterface of Token).

3. Write regular expressions for the tokens.

4. Design a generic Lexer class that provides appropriate methods, such as nextToken and peekToken.

5. Implement the lexical analyzer (yeah, you knew there had to be a hard part, didn't you?).

6. Write a test program that reads in files and prints out their tokens (and, possibly, reports on errors as it encounters them).
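
One possible shape for steps 2 and 4 is sketched below. Treat it as a starting point, not a required design: every name other than Token, Lexer, nextToken, and peekToken is hypothetical (IdentifierToken, IntegerToken, AssignToken, EndOfInputToken, BufferedLexer, scan), and the choice to implement peekToken with a one-token buffer is my assumption. I have also left out I/O exceptions to keep the sketch short.

```java
/** Base class for all tokens; the subclass tells you what kind of token it is. */
abstract class Token {
    private final String text;   // the characters that made up the token

    protected Token(String text) {
        this.text = text;
    }

    String getText() {
        return text;
    }

    @Override
    public String toString() {
        return getClass().getSimpleName() + "(" + text + ")";
    }
}

// A few hypothetical token classes; a real lexer needs many more.
class IdentifierToken extends Token {
    IdentifierToken(String name) { super(name); }
}

class IntegerToken extends Token {
    private final int value;
    IntegerToken(String digits) {
        super(digits);
        this.value = Integer.parseInt(digits);
    }
    int getValue() { return value; }
}

class AssignToken extends Token {
    AssignToken() { super(":="); }
}

class EndOfInputToken extends Token {
    EndOfInputToken() { super("<eof>"); }
}

/** The generic interface that the rest of the compiler sees. */
interface Lexer {
    /** Consume and return the next token. */
    Token nextToken();
    /** Return the next token without consuming it. */
    Token peekToken();
}

/** Supplies peekToken for any lexer that can scan one token at a time. */
abstract class BufferedLexer implements Lexer {
    private Token lookahead;   // null when nothing has been peeked

    /** Subclasses do the real work of reading one token from the input. */
    protected abstract Token scan();

    public Token nextToken() {
        if (lookahead != null) {
            Token t = lookahead;
            lookahead = null;
            return t;
        }
        return scan();
    }

    public Token peekToken() {
        if (lookahead == null) {
            lookahead = scan();
        }
        return lookahead;
    }
}
```

With this arrangement, a concrete lexer only has to implement scan, and the test program from step 6 can loop on nextToken until it sees an EndOfInputToken.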

Implementation Options

You are free to implement the lexical analyzer in any manner you see fit. There are a number of options.

You can hand-code the analyzer. This solution potentially gives you the most freedom (and perhaps even the most efficiency). For example, you can probably deal with some non-regular issues with this solution.

You can design and build the finite automaton for your tokens and then convert it to Java. This option may require a lot of hand work.

You can rely on built-in Java classes, particularly StringTokenizer and StreamTokenizer. From my scan of those classes, you may have difficulty converting them from tokenizers for C-like languages to tokenizers for Pascal-like languages.

You can use an existing Java lexical analyzer generator. This is the approach that many commercial compilers take. However, you will probably need to put a wrapper class around the generated code to give it a more generic interface. You will also need to learn the specifics of one of these systems. Students from the last session of CS362 recommend against this option.

You can write your own lexical analyzer generator. You'll certainly learn a lot about lexical analysis, regular expressions, and automata if you choose this solution. However, it also requires much more effort than the other options.
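
To give a sense of what "build the finite automaton and convert it to Java" might look like, here is a minimal sketch of a table-driven automaton. It assumes only two token kinds, identifiers (a letter followed by letters and digits) and unsigned integers (one or more digits). The class name TinyDfa and everything in it are hypothetical, and a real Pascal lexer needs many more states and character classes.

```java
/** A tiny hand-built DFA for identifiers and unsigned integers. */
class TinyDfa {
    // States of the automaton.
    static final int START = 0, IN_IDENT = 1, IN_INT = 2, DEAD = 3;

    // Character classes (the columns of the transition table).
    static final int LETTER = 0, DIGIT = 1, OTHER = 2;

    // TRANSITIONS[state][characterClass] gives the next state.
    static final int[][] TRANSITIONS = {
        /* START    */ { IN_IDENT, IN_INT,   DEAD },
        /* IN_IDENT */ { IN_IDENT, IN_IDENT, DEAD },
        /* IN_INT   */ { DEAD,     IN_INT,   DEAD },
        /* DEAD     */ { DEAD,     DEAD,     DEAD },
    };

    /** Map a character to its column in the transition table. */
    static int classify(char c) {
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
            return LETTER;
        }
        if (c >= '0' && c <= '9') {
            return DIGIT;
        }
        return OTHER;
    }

    /**
     * Return the length of the longest identifier or integer that
     * begins at position start, or 0 if neither begins there.
     */
    static int longestMatch(String input, int start) {
        int state = START;
        int matched = 0;
        for (int i = start; i < input.length(); i++) {
            state = TRANSITIONS[state][classify(input.charAt(i))];
            if (state == DEAD) {
                break;
            }
            // IN_IDENT and IN_INT are both accepting states.
            matched = i - start + 1;
        }
        return matched;
    }
}
```

For example, TinyDfa.longestMatch("count := 12", 0) reports a match of length 5 (the identifier count). The hand work comes from growing the table to cover operators, comments, strings, and real numbers.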

Some Suggestions

Whether or not you use, I'd recommend that you look at its interface to decide what aspects you find useful or not useful.

Hand-coding your lexer is probably the best way to go at this stage.

Feel free to talk to others in the class so that you end up with a uniform interface.

Think about what information you want to store in a token in addition to its type (which may be represented by the class of the token). For example, is it useful to store the line number and position of the token?

Think about how you're going to handle keywords. You can write separate regular expressions for keywords, or you can treat most keywords as variations of identifiers and simply add a separate keyword-check procedure that you call whenever you read an identifier.
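
The last two suggestions can be sketched together. The code below is only an illustration under some assumptions of mine: the names KeywordTable, PositionedToken, and fromWord are hypothetical, the keyword list is deliberately partial, and the token kind is stored as a string rather than as a class only to keep the example short. The lookup ignores case because Pascal does not distinguish upper and lower case in reserved words.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Keyword check that runs after the lexer has read an identifier-like word. */
class KeywordTable {
    // A partial list of Pascal reserved words, for illustration only.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
        "program", "begin", "end", "var", "procedure", "function",
        "if", "then", "else", "while", "do"));

    /** Normalize the case of the word, then look it up. */
    static boolean isKeyword(String word) {
        return KEYWORDS.contains(word.toLowerCase(Locale.ROOT));
    }
}

/** A token that remembers where it appeared in the source file. */
class PositionedToken {
    final String kind;    // "KEYWORD" or "IDENTIFIER" in this sketch
    final String text;    // the characters as written in the source
    final int line;       // line on which the token starts
    final int column;     // column at which the token starts

    PositionedToken(String kind, String text, int line, int column) {
        this.kind = kind;
        this.text = text;
        this.line = line;
        this.column = column;
    }

    /** Classify a word the lexer has already matched as identifier-like. */
    static PositionedToken fromWord(String word, int line, int column) {
        String kind = KeywordTable.isKeyword(word) ? "KEYWORD" : "IDENTIFIER";
        return new PositionedToken(kind, word, line, column);
    }

    @Override
    public String toString() {
        return kind + "(" + text + ")@" + line + ":" + column;
    }
}
```

Storing the line and column costs little and lets later phases report errors in terms the programmer can find in the source file.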



Wednesday, 7 February 2001

Monday, 16 February 2002


Disclaimer: I usually create these pages on the fly, which means that I rarely proofread them and they may contain bad grammar and incorrect details. It also means that I tend to update them regularly (see the history for more details). Feel free to contact me with any suggestions for changes.

This document was generated by Siteweaver on Wed Nov 20 08:44:14 2002.
The source to the document was last modified on Fri Sep 20 10:32:25 2002.

Glimmer Labs: The Grinnell Laboratory for Interactive Multimedia Experimentation & Research