Held Friday, September 6, 2002
Today we continue our investigation of regular expressions.
- We'll have class Monday in 2428 so that Marc C. can let his Calculus
class play with our computers.
- I've added links to the previous versions of this class to the links
on most of the pages in the course, in case you want to see what we
did the last two times I taught this course.
- I finally remembered to update the
class news. Sorry for the delay.
- Simplifying Regular Expressions
- Common Shorthands
- Some Sample Regular Expressions
- Using Regular Expressions for Describing Tokens
- As you may have noted, we need a lot of parentheses in writing regular
- At one level, these parentheses are necessary to resolve ambiguities.
- For example, as R.S|T really ((R.S)|T) or (R.(S|T))?
- Does it matter?
- Similarly, is R.S.T really ((R.S).T) or (R.(S.T))?
- Does it matter?
- Is R.T* really ((R.T)*) or (R.(T*))?
- Does it matter?
- However, we can use the same techniques as we use with simple algebra
to avoid these parentheses: Assign precedence to operations.
- Kleene star has the highest precedence.
- Concatenation has the next highest precedence.
- Alternation has the lowest precedence.
- We also tend to elide the concatenations symbol, just as we elide the
- Some examples
- ab* is (a.(b*)) and not ((a.b)*)
- a|bc is (a|(b.c)) and not ((a|b).c)
- When it's appropriate, we assume that the operations are left-associative.
- RST is ((R.S).T)) and not (R.(S.T))
- There are a number of shorthands for regular expression.
- None add
any expressive power and all can be
compiled into normal regular
- A postfix plus sign (+) for
at least one instance. a+ is
- Brackets for alternation of larger sets of symbols. [abc] is a|b|c.
[a-c] is also a|b|c.
- Negation of sets of symbols (which should work only if the set of
symbols is finite). [^abc] is "anything except a, b, or c.".
If sigma is "lowercase letters", [^abc] is shorthand for
- A postfix question mark for
optional. a? is shorthand for
- A power represents repeated concatentation. an is
shorthand for aa...a.
- Naming regular expressions and then using names within other regular
expressions (provided the inclusion is not recursive).
- We can translate any regular expression with names to a normal
regular expression by substituting the named expression for the name.
- Some of the shorthands available in Perl and such really extend the
power of regular expressions, and are not permitted in standard
- "The thing matched in an earlier part of the pattern."
- "Not this regular expression"
- All strings of a's and b's: (a+b)*
- Strings of a's and b's with exactly one b: a*ba*
- Strings of one or more a's: aa* (also a*a, also a*aa*)
- Regular expressions, while powerful, also have some limitations.
- In particular, there are relative simple languages you cannot describe
with regular expressions.
- Here are a few:
- Strings of a's and b's with equal numbers of a's and b's.
- How do we use regular expressions for lexical analysis?
- We start by writing a regular expression for each nonterminal in the
- For example:
- You may have noted that we need to be careful to distingish the symbols
used in building regular expressions from the similar characters.
- Most regular expression implementations provide some mechanism for
distinguishing the two.
- Note that we have only discussed how to describe tokens, not how a
typical lexical analyzer uses those descriptions.
- In particular, what happens if multiple regular expressions apply?
- Two general rules apply:
- Use the longest possible match.
- If two different patterns have an equally long match, use the one
that appears first.
- Note that the first rule can make our matching less efficient, since we
may have to "look ahead" a great deal to determine whether or not
something matches a longer pattern and find that it doesn't.
- In practice, this problem rarely happens.
- In the next few classes, we'll learn how to turn regular expression
specifications into programs that match regular expressions or tokenize.
Thursday, 29 August 2002
- First version, based somewhat on outlines from
Friday, 6 September 2002
- Filled in the body. Some is taken from the previous outline. Most