Author: Samuel A. Rebelsky
rebelsky@math.grin.edu
Version: 1.0 of November 1998
copy.pl: Copying files
cpy.pl: Copying files specified on the command line
insert-navbar.pl: Inserting one navigation bar
insert-navbars.pl: Inserting two navigation bars
put-in-table.pl: Putting the current page in a table
insert-sitename.pl: Inserting the site name in the title
highlight.pl: Highlighting specified text
setcolor.pl: Setting the background color
Often, it is useful or necessary to create new versions of HTML files without modifying the original file. For example, you might want to insert a navigation bar at the top and bottom of the page, insert a site name in the document title, update some text, or otherwise modify the page. Many of these tasks are core aspects of site-level authoring.
Why wouldn't you want to modify the original page? Perhaps because you don't control the original page. Perhaps because you will want to modify the page differently in different contexts. Perhaps because you don't want the page author to have access to your modifications. There are a number of reasons.
How do you modify the page? For a single page or even a few pages, it may be feasible to do it by hand. For more than a few pages or for more complex modifications, it is more convenient and more reliable to do some form of batch processing. Typically, you will write instructions in a programming language, macro language, or pattern matching language.
What language is appropriate? For many such tasks, Perl (Practical Extraction and Report Language) is the appropriate language, and it is the language used by many web technicians. While Perl is not a particularly elegant language, it does provide useful solutions to many problems of text manipulation. Perl also provides an appropriate mechanism for CGI scripting, so that modification can be done on demand rather than as a batch. Perl is also well-supported free software.
This rather short document is intended to provide a quick ``getting started'' guide to using Perl for these types of tasks. It is not a comprehensive overview of Perl (the standard reference is Programming Perl, 2nd Edition by Larry Wall, Tom Christiansen, and Randal L. Schwartz) nor is it a complete tutorial in the language (Learning Perl, 2nd Edition by Randal L. Schwartz and Tom Christiansen is a good one). Rather, it is intended to provide a set of examples that novices can build upon.
Similarly, the solutions given here are not always the most efficient or the most elegant. They are presented for clarity and for beginners. To best support beginners, many advanced features of Perl (such as modules) are ignored.
Many of the sample scripts presented herein assume that the HTML is correct
(e.g., no missing <body> tags, only one
<body>) and reasonably well formatted
(e.g., a <body> tag falls on a line by itself). It
is strongly recommended that all page authors verify their HTML with a
program like weblint or the W3C validation service.
We'll begin with a simple task: copying a file. Why do we begin with such a simple task? Because all modifications require copying a file. They differ only in that they also change or insert parts of the file.
Here is a Perl script to copy a file. Note that it is written for a Unix version of Perl. Windows and Macintosh users will use a slightly different syntax, one that should be described in the documentation of their version of Perl.
#!/usr/local/bin/perl
# Copy a file, prompting for input and output file names.
# Get the name of the input file
print "Please enter the name of the input file: ";
$infile = <STDIN>;
chop($infile);
# Get the name of the output file
print "Please enter the name of the output file: ";
$outfile = <STDIN>;
chop($outfile);
# Process the two files
open(INFILE, "< $infile");
open(OUTFILE, "> $outfile");
while (<INFILE>) {
  print OUTFILE $_;
} # while
close(OUTFILE);
close(INFILE);
# That's it
exit 0;
What does each part of the script mean?
The instruction to run Perl
The file begins with
#!/usr/local/bin/perl
This indicates, in effect, that it is a Perl file. (More precisely, it indicates that the particular version of perl stored in a particular directory is to be used to interpret the rest of the script.)
A comment
# Copy a file, prompting for input and output file names.
Any line (or portion of a line) that begins with a pound sign (#) is a comment. It is intended as a note for the reader of the Perl script, and is ignored by Perl.
Output
print "Please enter the name of the input file: ";
As you might guess, this is an instruction to print some text to the current text window. Note that this instruction, like almost every Perl instruction, ends with a semicolon.
Input
$infile = <STDIN>;
This odd line tells the program to read one line from the keyboard and
to store the result in a container (variable) named $infile.
In Perl, all standard variable names are preceded by a dollar sign. A
container is simply a place in which you can store values and later retrieve
those values. You can also think of it as providing a convenient name for
values.
Update
chop($infile);
Typically, when you read a line of text, you also read the invisible carriage return at the end of the line. This instruction tells Perl to remove the last character of $infile, which here is that carriage return.
So far ...
At this point, we've asked the user to enter a file name (with the
print statement), read in a file name (with the
<STDIN>), stored it in a container named
$infile (with $infile =), and removed
the carriage return (with chop). We are now ready to
read in the name of the output file, using a similar strategy.
Get name of output file
# Get the name of the output file
print "Please enter the name of the output file: ";
$outfile = <STDIN>;
chop($outfile);
This differs from the code that gets the name of the input file only in the prompt and in the name of the container used to store the file name.
Open the files
open(INFILE, "< $infile");
open(OUTFILE, "> $outfile");
Perl makes a distinction between file names (which can really be any
string) and the actual files they represent. The open
instructions open the files for reading (using the less-than sign)
or writing (using the greater-than sign).
Read the input file
while (<INFILE>) {
  ...
} # while
This says ``as long as there are still lines left to process in the input file, read a line and do whatever falls between the two braces''. The braces are necessary.
Write a line to the output file
print OUTFILE $_;
This tells Perl to print the line it just read (given by $_)
to the file given by OUTFILE. Why does Perl use
$_ for ``the line just read''? Just because.
Close the files
close(OUTFILE);
close(INFILE);
It is considered good practice to close (indicate that you're
done with) any files that you open.
Conclude
# That's it
exit 0;
The exit command is a way to indicate ``I'm done.'' The
zero indicates ``the program ended normally''. You can use other
numbers to indicate errors. We'll see how to do that in the next program.
You may want to make a copy of this file, run it, and observe the results before you go on to the next example.
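The script above assumes that both files can be opened and that every line typed at the keyboard ends in a carriage return. The following is a minimal sketch of a more defensive version of the same steps, assuming Perl 5.6 or later (for the three-argument form of open). The names copy_file and strip_newline are hypothetical helpers introduced only for this illustration; they are not part of the original scripts.

```perl
use strict;
use warnings;

# copy_file (hypothetical helper): copy $infile to $outfile one line at
# a time, stopping with a message if either file cannot be opened.
sub copy_file {
  my ($infile, $outfile) = @_;
  open(my $in,  '<', $infile)  or die "Cannot read $infile: $!\n";
  open(my $out, '>', $outfile) or die "Cannot write $outfile: $!\n";
  while (my $line = <$in>) {
    print $out $line;
  }
  close($out) or die "Cannot finish writing $outfile: $!\n";
  close($in);
  return 1;
}

# strip_newline (hypothetical helper): remove a trailing newline if,
# and only if, there is one.  chop, by contrast, always removes the
# last character, whatever it is.
sub strip_newline {
  my ($text) = @_;
  chomp($text);
  return $text;
}
```

With these definitions, a call such as copy_file("old.html", "new.html") behaves like copy.pl when all goes well, and reports the reason (in $!) when a file is missing or cannot be written.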
You may have noted that it's somewhat inconvenient to have to respond to questions for the input and output file. In environments in which it is possible to execute programs from a command line (as in Unix and the DOS shell), it is also possible to specify arguments to a Perl script. Here is a Perl script that copies the first file specified on the command line to the second file.
#!/usr/local/bin/perl
# Copy a file, taking input and output file names from the
# command line.
# Test the number of arguments
if ($#ARGV != 1) {
  print STDERR "Usage: cpy.pl inputfile outputfile\n";
  exit 1;
}
# Get the name of the input and output files
$infile = $ARGV[0];
$outfile = $ARGV[1];
# Process the two files
open(INFILE, "< $infile");
open(OUTFILE, "> $outfile");
while (<INFILE>) {
  print OUTFILE $_;
} # while
close(OUTFILE);
close(INFILE);
# That's it
exit 0;
Let's consider the new parts of this file.
Testing number of arguments
if ($#ARGV != 1) {
  print STDERR "Usage: cpy.pl inputfile outputfile\n";
  exit 1;
}
The if part tests a condition and, if that condition holds,
executes the part in braces. Here the condition is ``are there other than
two arguments'' (even though it doesn't look like it). $#ARGV gives the
number of the last argument and, traditionally, arguments are
numbered starting at 0, so $#ARGV is 1 exactly when there are two arguments.
The print STDERR line prints an error message. Note that a
\n is used to add a carriage return to the output. We
exit with a value of 1 to indicate an error.
Getting file names
We are now ready to read the input and output file names from the command line.
# Get the name of the input and output files
$infile = $ARGV[0];
$outfile = $ARGV[1];
As mentioned above, the arguments are numbered starting at 0. To access
argument n, you write $ARGV[n].
The rest
Once we've stored the input and output file names in two containers, we process the files as in the original copy program.
From now on, we will use this second mechanism for reading file names. It is up to you which you prefer.
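Because $#ARGV is the number of the last argument rather than the number of arguments, the test in cpy.pl is easy to misread. The sketch below shows the two values side by side for an ordinary list. The name describe_args is a hypothetical helper used only for this illustration.

```perl
use strict;
use warnings;

# describe_args (hypothetical helper): report how many values were
# passed and the index of the last one.
sub describe_args {
  my @args = @_;
  my $count = scalar(@args);  # number of elements
  my $last  = $#args;         # index of the last element, -1 if none
  return "count=$count last=$last";
}

print describe_args("in.html", "out.html"), "\n";
print describe_args(), "\n";
```

The index of the last element is always one less than the count, so a script that expects two arguments could equally well test scalar(@ARGV) != 2, which some readers may find clearer.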
One standard mechanism for giving all the pages in a site a somewhat uniform
``look and feel'' is to insert a navigation bar (a set of links,
potentially with icons) at the top and bottom of every page. For our purposes, assume that the navigation bar is stored in a file called navbar and
contains the HTML code for the navigation bar.
To insert a navigation bar at the top of the page, insert the following into
the script, after the line that reads print OUTFILE $_. An
appendix contains the full program.
# Insert the navigation bar at the top
if (m/<body/i) {
  open(NAVBAR, "< navbar");
  while ($line = <NAVBAR>) {
    print OUTFILE $line;
  }
  close(NAVBAR);
}
What is happening in this code? First of all, we have
a test given by an if statement.
if (m/<body/i) {
  ...
}
The (m/<body/i) asks whether the line we just read matches
(m) a less-than sign and then the word body (<body),
independent of case (i). The slashes are used to separate the
parts of the test. If we've matched the beginning of the body of an HTML
document, we want to put the navigation bar next.
open(NAVBAR, "< navbar");
while ($line = <NAVBAR>) {
  print OUTFILE $line;
}
close(NAVBAR);
As you might guess, this opens a file called navbar (the one
that stores the HTML for the navigation bar), reads the
lines of the file, writes them to our output file, and closes the file.
Note that we've used a somewhat different ``read from file'' loop,
in which we explicitly call the line read $line (rather than
the default of $_ which is used for the lines of the primary
input file).
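To see which lines the test accepts, it can help to try the pattern on a few sample strings. The following is a small sketch, not part of the original scripts. The name starts_body is a hypothetical helper, and the =~ operator is used to apply the match to a named variable rather than to $_.

```perl
use strict;
use warnings;

# starts_body (hypothetical helper): apply the same test as the script
# to a line passed as an argument.  Returns 1 for a match, 0 otherwise.
sub starts_body {
  my ($line) = @_;
  return ($line =~ m/<body/i) ? 1 : 0;
}

# Try the test on a few sample lines.
foreach my $line ('<body>', '<BODY BGCOLOR="white">', '</body>',
                  '<p>The body of work</p>') {
  my $verdict = starts_body($line) ? "matches" : "does not match";
  printf "%-26s %s\n", $line, $verdict;
}
```

Note that </body> does not match, because the slash falls between the less-than sign and the word body.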
What if we want to insert a second navigation bar at the bottom of the page?
We'll put a similar test in, but this time we want to look for the ``end of
body'' tag, </body>. Unfortunately, the slash has meaning
to Perl, so we must ``quote'' it by prefacing it with a backslash, as in
if (m/<\/body>/i) {
  open(NAVBAR, "< navbar");
  while ($line = <NAVBAR>) {
    print OUTFILE $line;
  }
  close(NAVBAR);
}
However, since we want the navigation bar to appear before the end of
the body, we insert this code before the command to print OUTFILE $_,
as you can see from the program
insert-navbars.pl, given in the appendix.
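The ordering of the two tests around the print is the whole trick, so it may be worth seeing it in isolation. The sketch below works on lists of lines held in memory rather than on files, and it assumes the navigation bar has already been read into a list (so the navbar file need not be reopened for each bar). The name add_navbars is a hypothetical helper introduced for this illustration.

```perl
use strict;
use warnings;

# add_navbars (hypothetical helper): given the lines of a page and the
# lines of a navigation bar (both as array references), return the
# lines of the new page.
sub add_navbars {
  my ($page, $navbar) = @_;
  my @result;
  foreach my $line (@$page) {
    # End of body: the bar goes in before the line itself.
    push(@result, @$navbar) if $line =~ m/<\/body/i;
    # The line itself is always copied, exactly once.
    push(@result, $line);
    # Start of body: the bar goes in after the line itself.
    push(@result, @$navbar) if $line =~ m/<body/i;
  }
  return @result;
}

my @page   = ("<html>\n", "<body>\n", "<p>Hello</p>\n", "</body>\n", "</html>\n");
my @navbar = ("<p>[Front Door] [Links]</p>\n");
print add_navbars(\@page, \@navbar);
```

Each line of the page is copied exactly once. Only the position of the two push tests relative to that copy decides whether the bar lands above or below the tag.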
Sometimes, rather than putting a navigation bar at the top or bottom, we'd like to put a variety of things ``around'' the page, typically a site-specific heading at the top of the page and a table of contents along the left-hand side. There are a number of mechanisms for doing this: you can put the page in a frame, you can make clever use of cascading style sheets, or you can put the page in a table. There are reasons to use (or not to use) each of the mechanisms. While Perl isn't really needed for frames or style sheets, it does provide a simple way to put the page in the table.
In effect, all we need to do is insert something like
<TABLE>
<TR valign="middle">
<TD>icon</TD>
<TD>site name</TD>
</TR>
<TR valign="top">
<TD>table of contents</TD>
<TD>
at the beginning of the document (after the body tag) and then
</TD>
</TR>
</TABLE>
at the end of the document (before the end body tag).
We could store these pieces of HTML code in a separate file and use the strategy given in the previous section to insert those files. However, it may be more convenient to include the text directly as part of the Perl script. To do so, we need a way to write multiple lines of text. Here's a sample of doing that in Perl.
# Insert the parts of the table that come before the original page.
if (m/<body/i) {
  print OUTFILE <<"START_TABLE";
<TABLE>
<TR valign="middle">
<TD>icon</TD>
<TD>site name</TD>
</TR>
<TR valign="top">
<TD>table of contents</TD>
<TD>
START_TABLE
}
In this case, the <<"START_TABLE" tells Perl to print
everything up to (but not including) the line that reads START_TABLE.
You can, of course, choose whatever text you want. The program put-in-table.pl given in the appendix
provides further details.
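Because the marker is given in double quotes, Perl fills in the values of any variables that appear in the multi-line text, just as it does in an ordinary double-quoted string. The following is a minimal sketch of that behavior. The name table_start and its $sitename parameter are hypothetical, chosen only for this illustration, and the table is shortened to keep the example small.

```perl
use strict;
use warnings;

# table_start (hypothetical helper): return the opening lines of a
# table as a single string, with the site name filled in.
sub table_start {
  my ($sitename) = @_;
  # Everything up to the line that reads START_TABLE is the text.
  # That closing line must start in the first column.
  return <<"START_TABLE";
<TABLE>
<TR>
<TD>icon</TD>
<TD>$sitename</TD>
</TR>
START_TABLE
}

print table_start("AACE");
```

Writing the marker in single quotes (<<'START_TABLE') would instead copy the text exactly, leaving $sitename as those literal characters.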
In all of the previous examples, we assumed that we wanted to insert text before or after the current line. (We were able to do so because body and end body tags are typically on lines by themselves). What if we want to insert text within the current line?
For example, it is good design practice to include the site name as part of the document's title (so that, for example, it provides a clearer name in the history or bookmarks lists, helping the reader to distinguish the thousands of pages called ``Home page'' or ``Links'' or such). But the title tag usually immediately precedes the title on the same line. We'd like to insert the site name directly after the title tag.
We can do so using Perl's substitute command (s), which
is quite similar to the match command (m). In this particular
case, we might write
s/(<title>)/$1$sitename: /i;
This tells Perl that whenever it sees a title tag (<title>)
it should substitute (s) a title tag, the site name, a colon, and
a space ($1$sitename: ), and that it doesn't matter what case
is used for the title tag (i). As before, the slashes separate
the parts of the command. The $1 indicates ``the text that was
matched'' and is used so that we maintain the capitalization of the title
tag. In order to use the matched pattern, we need to surround it by parentheses.
If we didn't care about reusing the same text, we might instead write
(perhaps more clearly)
s/<title>/<TITLE>$sitename: /i;
The insert-sitename.pl program
given in the appendix contains the complete set of instructions for
inserting a site name.
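The effect of the $1 is easiest to see by running the substitution on title lines written in different cases. The sketch below wraps the command from the text in a small function. The name add_sitename is a hypothetical helper, and the =~ operator applies the substitution to a named variable rather than to $_.

```perl
use strict;
use warnings;

# add_sitename (hypothetical helper): return a copy of $line with the
# site name, a colon, and a space inserted after the title tag.
sub add_sitename {
  my ($line, $sitename) = @_;
  $line =~ s/(<title>)/$1$sitename: /i;
  return $line;
}

print add_sitename("<TITLE>Links</TITLE>\n", "AACE");
print add_sitename("<title>Home page</title>\n", "AACE");
```

In both cases the tag keeps whatever capitalization it had in the input, and a line with no title tag is returned unchanged.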
There are a wide variety of uses for the substitute command. As we see in the subsequent sections, we might use substitute commands to highlight pieces of text in the HTML file or to change the color of the document.
At times, it is useful to provide a copy of a document with all the instances of a particular word highlighted. Most typically, this is done when presenting the result of a search, but one might also want to highlight a company's name or some other relevant text.
Again, the substitute command is relatively simple. Suppose we wanted to highlight all instances of AACE. We simply add the line
s/AACE/<b>AACE<\/b>/gi;
The g tells Perl to replace every match on the line rather than only the first.
More generally, we can use a variable to hold the text we wish to highlight,
and use the $1 to make sure we duplicate the text exactly.
s/($highlight)/<b>$1<\/b>/gi;
The highlight.pl program given in the
appendix provides more details. One problem with this program is that
it also highlights inappropriate text (e.g., text that appears within a tag).
Eliminating the problem is left as an exercise for more experienced scripters
(feel free to contact Sam Rebelsky for a sample script, but don't expect to
understand all of it).
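Two smaller pitfalls are easier to address. First, without a g after the final slash, a substitution changes only the first match on each line. Second, if the text to highlight contains characters that have a special meaning in patterns (such as + or .), it should be quoted with \Q and \E. The following is a minimal sketch under those assumptions. The name highlight is a hypothetical helper, and the sketch still highlights text inside tags.

```perl
use strict;
use warnings;

# highlight (hypothetical helper): return a copy of $line in which every
# occurrence of $text, in any case, is surrounded by <b> and </b>.
sub highlight {
  my ($line, $text) = @_;
  # \Q...\E treats $text as plain characters, not as a pattern.
  # g replaces every match on the line; i ignores case.
  $line =~ s/(\Q$text\E)/<b>$1<\/b>/gi;
  return $line;
}

print highlight("Learn C++ or c++ today\n", "C++");
```

Because $1 holds whatever was actually matched, each occurrence keeps its original capitalization.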
At times, we may not only want to insert some text, but also delete some. For example, we may want to set the background color (or text color or background image or ...) used for a document to a site standard. Here, we don't want to keep the body tag, but rather insert a different one. At the same time, we don't really know the precise string to match, since there are many things that can go in a body tag. We use the power of patterns.
For example, to throw away all of the body attributes, and just set the background color to white, we might write the following.
s/<body[^>]*>/<body bgcolor="white">/i;
There are two special patterns here. The [^>] reads
``anything but a greater-than sign''. The * reads
``as many copies as necessary''. Hence, the whole pattern reads
``a less-than sign, the word `body' (case insensitive), as many characters
that are not greater-than signs as necessary, and then a greater-than sign'',
essentially the standard structure of a body tag.
The program setcolor.pl in the appendix
provides a more general solution to setting the color.
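As a quick check of the pattern, the sketch below applies it to a few sample lines. The name set_bgcolor is a hypothetical helper introduced for this illustration. Like the rest of this document, it assumes that the whole body tag sits on one line.

```perl
use strict;
use warnings;

# set_bgcolor (hypothetical helper): return a copy of $line in which any
# body tag, whatever its attributes, is replaced by one that sets only
# the background color.
sub set_bgcolor {
  my ($line, $color) = @_;
  $line =~ s/<body[^>]*>/<body bgcolor="$color">/i;
  return $line;
}

print set_bgcolor("<BODY TEXT=\"black\" LINK=\"blue\">\n", "white");
print set_bgcolor("<body>\n", "white");
print set_bgcolor("</body>\n", "white");
```

The first two lines both become a plain body tag with a white background. The end tag is left alone, because the pattern requires the word body to follow the less-than sign directly.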
The following are the full programs discussed above. They are presented as complete programs for the convenience of the reader (e.g., so that they can be copied, pasted, and modified). They are listed in order of appearance in the original text.
copy.pl: Copying files
#!/usr/local/bin/perl
# Copy a file, prompting for input and output file names.
# Get the name of the input file
print "Please enter the name of the input file: ";
$infile = <STDIN>;
chop($infile);
# Get the name of the output file
print "Please enter the name of the output file: ";
$outfile = <STDIN>;
chop($outfile);
# Process the two files
open(INFILE, "< $infile");
open(OUTFILE, "> $outfile");
while (<INFILE>) {
  print OUTFILE $_;
} # while
close(OUTFILE);
close(INFILE);
# That's it
exit 0;
cpy.pl: Copying files specified on the command line
#!/usr/local/bin/perl
# Copy a file, taking input and output file names from the
# command line.
# Test the number of arguments
if ($#ARGV != 1) {
  print STDERR "Usage: cpy.pl inputfile outputfile\n";
  exit 1;
}
# Get the name of the input and output files
$infile = $ARGV[0];
$outfile = $ARGV[1];
# Process the two files
open(INFILE, "< $infile");
open(OUTFILE, "> $outfile");
while (<INFILE>) {
  print OUTFILE $_;
} # while
close(OUTFILE);
close(INFILE);
# That's it
exit 0;
insert-navbar.pl: Inserting one navigation bar
#!/usr/local/bin/perl
# Insert a navigation bar at the top of the page.
# Get the input and output files
if ($#ARGV != 1) {
  print STDERR "Usage: insert-navbar.pl inputfile outputfile\n";
  exit 1;
}
$infile = $ARGV[0];
$outfile = $ARGV[1];
# Process the two files
open(INFILE, "< $infile");
open(OUTFILE, "> $outfile");
while (<INFILE>) {
  # Add the current line (whether or not it starts or ends the body)
  print OUTFILE $_;
  # Have we hit the start of the body? If so, append the navigation bar.
  if (m/<body/i) {
    open(NAVBAR, "< navbar");
    while ($line = <NAVBAR>) {
      print OUTFILE $line;
    }
    close(NAVBAR);
  }
} # while
close(OUTFILE);
close(INFILE);
# That's it
exit 0;
insert-navbars.pl: Inserting two navigation bars
#!/usr/local/bin/perl
# Insert a navigation bar at the top and bottom of the page.
# Get the input and output files
if ($#ARGV != 1) {
  print STDERR "Usage: insert-navbars.pl inputfile outputfile\n";
  exit 1;
}
$infile = $ARGV[0];
$outfile = $ARGV[1];
# Process the two files
open(INFILE, "< $infile");
open(OUTFILE, "> $outfile");
while (<INFILE>) {
  # Have we hit the end of the body? If so, insert the navigation bar
  if (m/<\/body>/i) {
    open(NAVBAR, "< navbar");
    while ($line = <NAVBAR>) {
      print OUTFILE $line;
    }
    close(NAVBAR);
  }
  # Add the current line (whether or not it starts or ends the body)
  print OUTFILE $_;
  # Have we hit the start of the body? If so, append the navigation bar.
  if (m/<body/i) {
    open(NAVBAR, "< navbar");
    while ($line = <NAVBAR>) {
      print OUTFILE $line;
    }
    close(NAVBAR);
  }
} # while
close(OUTFILE);
close(INFILE);
# That's it
exit 0;
put-in-table.pl: Putting the current page in a table
#!/usr/local/bin/perl
# Place the current page in the middle of a table so as to provide a
# consistent interface for the site. The table is
# +---+-----------+
# | X | Site Name |
# +---+-----------+
# | T | |
# | O | |
# | C | Page |
# | | |
# | | |
# +---+-----------+
# where X is an icon representing the site and TOC is a site table of contents.
#
# This example version inserts some stuff specific to AACE, the Association
# for the Advancement of Computers in Education, the sponsors of Webnet.
# Get the input and output files
if ($#ARGV != 1) {
  print STDERR "Usage: put-in-table.pl inputfile outputfile\n";
  exit 1;
}
$infile = $ARGV[0];
$outfile = $ARGV[1];
# Process the two files
open(INFILE, "< $infile");
open(OUTFILE, "> $outfile");
while (<INFILE>) {
  # Have we hit the end of the body? If so, end the table.
  if (m/<\/body>/i) {
    print OUTFILE <<"END_TABLE";
</TD>
</TR>
</TABLE>
END_TABLE
  }
  # Copy the current line to the output file
  print OUTFILE $_;
  # Have we hit the start of the body? If so, start the table.
  if (m/<body/i) {
    print OUTFILE <<"START_TABLE";
<TABLE>
<TR valign="middle">
<TD><IMG SRC="logo.gif" alt="AACE"></TD>
<TD><font size=16>AACE: Advancing Computers in Education</font></TD>
</TR>
<TR valign="top">
<TD>
<a href="http://www.aace.org">AACE</a> <br>
<a href="http://www.aace.org/conf/edmedia/">EdMedia</a> <br>
<a href="http://www.aace.org/conf/webnet/">Webnet</a>
</TD>
<TD>
START_TABLE
  }
} # while
close(OUTFILE);
close(INFILE);
# That's it
exit 0;
insert-sitename.pl: Inserting the site name in the title
#!/usr/local/bin/perl
# Insert the site name into an HTML file.
# Test the number of arguments
if ($#ARGV != 2) {
  print STDERR "Usage: insert-sitename.pl sitename inputfile outputfile\n";
  exit 1;
}
# Get the site name as well as the names of the input and output files
$sitename = $ARGV[0];
$infile = $ARGV[1];
$outfile = $ARGV[2];
# Process the two files
open(INFILE, "< $infile");
open(OUTFILE, "> $outfile");
while (<INFILE>) {
  # Insert the site name after the title
  s/(<title>)/$1$sitename: /i;
  print OUTFILE $_;
} # while
close(OUTFILE);
close(INFILE);
# That's it
exit 0;
highlight.pl: Highlighting specified text
#!/usr/local/bin/perl
# Highlight some text in an HTML file.
# Test the number of arguments
if ($#ARGV != 2) {
  print STDERR "Usage: highlight.pl text-to-highlight inputfile outputfile\n";
  exit 1;
}
# Get the text to highlight as well as the names of the input and output files
$highlight = $ARGV[0];
$infile = $ARGV[1];
$outfile = $ARGV[2];
# Process the two files
open(INFILE, "< $infile");
open(OUTFILE, "> $outfile");
while (<INFILE>) {
  # Highlight every instance of the text on this line (g), in any case (i)
  s/($highlight)/<b>$1<\/b>/gi;
  print OUTFILE $_;
} # while
close(OUTFILE);
close(INFILE);
# That's it
exit 0;
setcolor.pl: Setting the background color
#!/usr/local/bin/perl
# Set the background color of an HTML file.
# Test the number of arguments
if ($#ARGV != 2) {
  print STDERR "Usage: setcolor.pl color inputfile outputfile\n";
  exit 1;
}
# Get the color as well as the names of the input and output files
$color = $ARGV[0];
$infile = $ARGV[1];
$outfile = $ARGV[2];
# Process the two files
open(INFILE, "< $infile");
open(OUTFILE, "> $outfile");
while (<INFILE>) {
  # Replace the body tag with one that sets only the background color
  s/<body[^>]*>/<body bgcolor="$color">/i;
  print OUTFILE $_;
} # while
close(OUTFILE);
close(INFILE);
# That's it
exit 0;
Released: November 1998
This document was created as a handout for a tutorial on site-level authoring for the Webnet 1998 World Conference on the Internet. While we will discuss Perl for only a short time during the tutorial, experience suggests that many participants will want to learn more.
Version 1.0 can be found on the web at
http://www.math.grin.edu/~rebelsky/Tutorials/SiteLevel/Webnet1998/Perl/siteperl.html.
The sample HTML documents can be found on the web at
http://www.math.grin.edu/~rebelsky/Tutorials/SiteLevel/Webnet1998/Perl/index-of-samples.html.
Source text written by Samuel A. Rebelsky.
Source text last modified Fri Nov 6 09:23:02 1998.
This page generated on Fri Nov 6 09:23:49 1998 by SiteWeaver.