Software Design (CSC-223 97F)
Outline of Class 33: Regular Expressions
- While there is a lot more we could say about profilers, we'll move
on to new topics after a quick summary of an issue that I think is
- Don't forget Friday's talk at 2:15 on chaos theory.
- On Monday, November 24, at 7:00pm in ARH 224 there will be two
short talks on CS-related summer internships: one on engineering
Java applets (by our own Omar Ghaffar), the other on a FreeNet project.
- So, what did you learn in response to the end-of-class question?
- Short assignment for next class: find "unmatched" less-than signs
in an HTML file. That is, you want to find lines in which
less-than signs are not followed by a greater-than sign (the
greater-than sign should appear on the same line and before any
other less-than signs).
- Once we determined that adding profiling lines by hand was excessively
painful, we might then strive to develop a "profiler generator" that
- Read the source text of another program.
- Inserted appropriate profiling commands.
- Wrote the resulting program to a new file.
- This is an instance of what one might call a meta-program,
a program that operates on other programs. Other common meta-programs
include editors and compilers.
- Are there others that you can think of? (That exist or that you
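A minimal sketch of the three steps above (read a program, insert profiling commands, write the result), assuming the program being profiled is Python source with simple one-line function headers and four-space indentation. The names add_profiling, instrument_file, and _profile_counts are hypothetical, chosen only for this illustration; this is a toy, not a robust profiler generator.

```python
import re
from pathlib import Path

# Matches a simple one-line function header such as "def f(x):".
DEF_LINE = re.compile(r"^(\s*)def\s+(\w+)\s*\(.*\):\s*$")


def add_profiling(source):
    """Return a copy of `source` that counts calls to each function."""
    out = [
        "from collections import Counter",
        "_profile_counts = Counter()  # hypothetical name for the call counters",
    ]
    for line in source.splitlines():
        out.append(line)
        m = DEF_LINE.match(line)
        if m:
            indent, name = m.groups()
            # Insert the counting statement as the first line of the body
            # (assumes the body is indented by four spaces).
            out.append(f"{indent}    _profile_counts[{name!r}] += 1")
    return "\n".join(out) + "\n"


def instrument_file(src_path, dst_path):
    """Read one program, insert profiling commands, write a new file."""
    Path(dst_path).write_text(add_profiling(Path(src_path).read_text()))
```

Running the instrumented program and then inspecting _profile_counts shows how many times each function was called.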
- As Bentley repeatedly stresses, Unix is a filter-based environment,
and filtering programs provide a great deal of power.
- By developing general but powerful filters, you can combine them
in interesting ways to solve a wide variety of problems.
- I'm not going to repeat the spell-checking example again; you should
know it by heart by now.
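The idea of combining small, general filters can be sketched in Python with generators standing in for Unix pipes. The function names select, translate, and pipeline are assumptions made for this illustration, not the names of any Unix tool.

```python
def select(lines, keep):
    """Selection filter: pass along only lines for which keep(line) is true."""
    return (line for line in lines if keep(line))


def translate(lines, change):
    """Translation filter: pass along change(line) for every line."""
    return (change(line) for line in lines)


def pipeline(lines, *filters):
    """Feed the output of each filter into the next, like `a | b | c`."""
    for f in filters:
        lines = f(lines)
    return lines


text = ["The cat", "a dog", "THE END"]
result = list(
    pipeline(
        text,
        lambda ls: translate(ls, str.lower),              # fold case first
        lambda ls: select(ls, lambda l: l.startswith("the")),
    )
)
# result == ["the cat", "the end"]
```

Each filter knows nothing about the others; the power comes from the composition.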
- What are some of the tools that make a good filtering system?
- Selection filters, like
- Translation filters, like
- Pagers, like
- Other utilities, like
- To make these filters both general and useful, we often need to
rely on a language for describing patterns (e.g., "select lines
that look like ...").
- One of the most common pattern languages is that of regular
expressions.
- Regular expressions are (relatively) easy to express.
- Regular expressions are (relatively) easy to implement.
- Like many things in computer science, regular expressions are
defined recursively:
- A special value, epsilon, is a regular expression
that matches the empty string.
- Any single character is a regular expression that matches that
character.
- The concatenation of any two regular expressions is a regular
expression that matches the two regular expressions in sequence.
- The alternation ("or") of any two regular expressions is a
regular expression that matches either of the two regular expressions.
- The Kleene star (repetition) of a regular expression, R, is
the union of epsilon, R, RR, RRR, RRRR, and so on.
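The recursive definition above translates fairly directly into a data structure with one case per rule. The following is a minimal sketch, not an efficient implementation; the class names (Epsilon, Char, Concat, Alt, Star) and the helper functions ends and matches are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass


class Regex:
    """Base class for the five kinds of regular expression."""


@dataclass(frozen=True)
class Epsilon(Regex):
    pass


@dataclass(frozen=True)
class Char(Regex):
    c: str


@dataclass(frozen=True)
class Concat(Regex):
    left: Regex
    right: Regex


@dataclass(frozen=True)
class Alt(Regex):
    left: Regex
    right: Regex


@dataclass(frozen=True)
class Star(Regex):
    inner: Regex


def ends(r, s, i):
    """Return the set of positions j such that r matches s[i:j]."""
    if isinstance(r, Epsilon):
        return {i}
    if isinstance(r, Char):
        return {i + 1} if i < len(s) and s[i] == r.c else set()
    if isinstance(r, Concat):
        return {k for j in ends(r.left, s, i) for k in ends(r.right, s, j)}
    if isinstance(r, Alt):
        return ends(r.left, s, i) | ends(r.right, s, i)
    if isinstance(r, Star):
        # Zero repetitions end at i; keep applying the inner expression
        # until no new end positions appear.
        seen, frontier = {i}, {i}
        while frontier:
            frontier = {k for j in frontier for k in ends(r.inner, s, j)} - seen
            seen |= frontier
        return seen
    raise TypeError(f"not a regular expression: {r!r}")


def matches(r, s):
    """True if r matches the whole string s."""
    return len(s) in ends(r, s, 0)
```

For instance, with a_or_b = Alt(Char("a"), Char("b")), the call matches(Star(a_or_b), "abba") is true, while matches(Concat(a_or_b, a_or_b), "a") is false.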
- Of course, we need a notation for each of these.
- A single character is normally written as that character.
- Concatenation is notated by writing the two RE's in sequence,
with parentheses as necessary.
- Alternation is notated by writing the two RE's separated by
a vertical bar, with parentheses as necessary.
- Kleene star is notated by writing the RE with a subsequent star.
- Star has precedence over concatenation, which has precedence over alternation.
- I'm not sure about associativity.
- For example,
a matches the character "a".
ab matches the sequence "ab".
a|b matches either the character "a" or the character "b".
(a|b)(a|b) matches any length-two sequence of
"a"s and "b"s.
(a|b)* matches any sequence of "a"s and "b"s.
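These examples, and the precedence rule, can be checked with Python's re module, whose notation for concatenation, alternation (|), and star coincides with the notation used here. The helper name whole_match is an assumption made for this sketch.

```python
import re


def whole_match(pattern, text):
    """True if the pattern matches the entire text, not just a part of it."""
    return re.fullmatch(pattern, text) is not None


# Each pattern is paired with strings it should match in full.
examples = {
    "a": ["a"],
    "ab": ["ab"],
    "a|b": ["a", "b"],
    "(a|b)(a|b)": ["aa", "ab", "ba", "bb"],
    "(a|b)*": ["", "a", "abba"],
}
all_ok = all(whole_match(p, t) for p, ts in examples.items() for t in ts)

# Precedence: star binds tighter than concatenation, so ab* means a(b*) ...
star_binds_tighter = whole_match("ab*", "abbb") and not whole_match("ab*", "abab")
# ... and concatenation binds tighter than alternation, so a|bc means a|(bc).
concat_binds_tighter = whole_match("a|bc", "bc") and not whole_match("a|bc", "ac")
```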
- It's also possible to think of regular expressions as denoting
sets. We did so in class.
- Unix provides expressions that are often called "regular expressions"
but that have some differences from "theoretical" regular expressions.
- Many of these differences are motivated by a need for a more
concise notation, and don't affect the theoretical basis of
regular expressions.
- Other differences do affect the theoretical basis.
- What are the differences in Unix regular expressions?
- Alternation is relegated to special "extended regular expressions"
which you don't get by default.
- Sets of characters can be written between square brackets. For example,
[abc] is shorthand for
of course, that we could write the latter.
- In sets, you can write ranges. For example,
is shorthand for
- You can negate the elements of a set by putting a caret
^ as the first element of the set. For example,
[^a-zA-Z] matches any character that isn't a letter.
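Sets, ranges, and negated sets can be tried out in Python, whose re module uses the same bracket notation for these cases. This is a small sketch; the function name non_letters is hypothetical.

```python
import re

one_of_abc = re.compile(r"[abc]")       # a set: "a", "b", or "c"
lowercase = re.compile(r"[a-z]")        # a range inside a set
non_letter = re.compile(r"[^a-zA-Z]")   # a negated set


def non_letters(text):
    """Return every character of text that isn't a letter."""
    return non_letter.findall(text)


# [abc] accepts exactly the same single characters as (a|b|c).
same_as_alternation = all(
    bool(one_of_abc.fullmatch(ch)) == bool(re.fullmatch(r"(a|b|c)", ch))
    for ch in "abcdxyz"
)
```

For example, non_letters("Hi, Bob!") returns the comma, the space, and the exclamation point.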
- You can anchor your regular expressions at the beginning of
the line with ^ and at the end of the line with $
(alternately, you can think of ^ as a special character
that matches the beginning of the line and $ as a special
character that appears at the end of the line).
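A short sketch of anchoring, using Python's re.search on lines that carry no trailing newline (Python's ^ and $ behave like the Unix ones in that setting). The sample lines are made up for the illustration.

```python
import re

lines = ["the end", "at the", "the"]

# ^the : "the" must appear at the beginning of the line.
starts_with_the = [l for l in lines if re.search(r"^the", l)]
# the$ : "the" must appear at the end of the line.
ends_with_the = [l for l in lines if re.search(r"the$", l)]
# ^the$ : the line must consist of exactly "the".
exactly_the = [l for l in lines if re.search(r"^the$", l)]
```

Without anchors, the pattern the would select all three lines.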
- You can repeat a regular expression with the Kleene star
* (meaning zero or more repetitions) or the plus
+ (meaning one or more repetitions).
- You can match any character with a period (.).
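The difference between star, plus, and period can be sketched with Python's re module, which uses the same three symbols. One caveat worth labeling: in Python the period does not match a newline by default. The helper name whole is hypothetical.

```python
import re


def whole(pattern, text):
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


star_allows_zero = whole(r"ab*", "a")          # zero b's is fine for *
plus_needs_one = not whole(r"ab+", "a")        # + insists on at least one b
plus_allows_many = whole(r"ab+", "abbb")
dot_is_any_char = whole(r"a.c", "abc") and whole(r"a.c", "a-c")
dot_needs_a_char = not whole(r"a.c", "ac")     # . must consume one character
```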
- Obviously, when you want to use any of the characters with
alternate meanings described above literally, you'll need to quote it
(with a backslash \). You don't need to quote things you
put in sets.
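A sketch of quoting in Python, where the backslash plays the same role. The function find_literal is a hypothetical helper; re.escape is the standard-library call that adds the backslashes for you.

```python
import re

# Unquoted, the period matches any character, so "3x14" is accepted.
unquoted_matches_too_much = re.search(r"3.14", "3x14") is not None
# Quoted with a backslash, only a literal period will do.
quoted_is_literal = re.search(r"3\.14", "3x14") is None
# Inside a set, the period is already literal.
set_is_literal = (
    re.search(r"3[.]14", "3.14") is not None
    and re.search(r"3[.]14", "3x14") is None
)


def find_literal(needle, text):
    """Search for needle as plain text, quoting any special characters."""
    return re.search(re.escape(needle), text) is not None
```

For example, find_literal("a.b*c", "xx a.b*c yy") is true, while find_literal("a.b*c", "axbbc") is false.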
- You can parenthesize expressions with \( and \)
(plain parentheses in extended regular expressions).
- You can repeat parenthesized expressions with \#, where # is a
digit between 1 and 9, representing the number of the parenthesized
expression.
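Repeating a parenthesized expression by number means "match again exactly what that group matched". A minimal sketch in Python, which writes groups with plain parentheses and refers back to the first one as \1; the names doubled and first_doubled are assumptions made for the illustration.

```python
import re

# (.) remembers one character; \1 demands that same character again.
doubled = re.compile(r"(.)\1")


def first_doubled(text):
    """Return the first doubled character pair in text, or None."""
    m = doubled.search(text)
    return m.group(0) if m else None


# (a|b)\1 matches "aa" or "bb", but not "ab": the repeat must be the
# same text the group matched, not merely another match of the group.
same_text_only = (
    re.fullmatch(r"(a|b)\1", "aa") is not None
    and re.fullmatch(r"(a|b)\1", "ab") is None
)
```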
- You can get more information on Unix regular expressions with
% man 5 regexp
or from my
formatted version of the man page.
- Unix provides a number of tools that use regular expressions.
- Two of the most popular are grep and sed.
- grep (named for the ed command g/re/p, "globally search for a
regular expression and print") permits you to extract lines from a
file that match a regular expression.
- sed ("stream editor") allows you to modify lines
based on a regular expression.
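The most common use of sed, substituting text that matches a regular expression, can be approximated in Python. This is only a rough analogue of a command like sed 's/pattern/replacement/g'; the function name substitute is hypothetical, and Python's pattern and replacement syntax differ from sed's in some details.

```python
import re


def substitute(lines, pattern, replacement):
    """Replace every match of pattern on each line with replacement."""
    regex = re.compile(pattern)
    return [regex.sub(replacement, line) for line in lines]


edited = substitute(["room 12, floor 3", "no digits here"], r"[0-9]+", "N")
# edited == ["room N, floor N", "no digits here"]
```

Lines without a match pass through unchanged, which is what makes the tool usable as a filter.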
- These are tools that follow what one might call a "use/understand
cycle": the more you use them, the better you understand them; the
better you understand them, the more you use them.
grep's primary purpose is information extraction --
it's used to extract appropriate lines from a file.
- It can also be used in what one might call "negative mode" (-v),
returning only the lines in a file that don't match.
- To use "extended regular expressions" (which permit alternation),
you can use the -E flag, or egrep on some systems.
- You can use the -i flag for case-insensitive matching.
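The behavior of grep described above (select matching lines, invert the selection, ignore case) can be sketched as a small Python function. The function and its parameter names (invert, ignore_case) are assumptions for this illustration; Python's re syntax is closest to extended regular expressions, so alternation is always available here.

```python
import re


def grep(pattern, lines, invert=False, ignore_case=False):
    """Return the lines that contain a match for pattern.

    invert=True mimics -v (keep the lines that do NOT match);
    ignore_case=True mimics -i.
    """
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [line for line in lines if bool(regex.search(line)) != invert]


menu = ["Apple pie", "banana", "apple tart"]
plain = grep("apple", menu)                      # ["apple tart"]
no_case = grep("apple", menu, ignore_case=True)  # ["Apple pie", "apple tart"]
negative = grep("apple", menu, invert=True)      # ["Apple pie", "banana"]
either = grep("apple|banana", menu)              # ["banana", "apple tart"]
```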