Description: In this laboratory, you will develop a program that analyzes simplified web log files. Your program will identify a page that has the most hits (if several pages tie for the most hits, you need identify only one) and a machine that generated the most hits.
Purpose: In this lab, you will gain experience with Dictionaries in Java. You will also gain some knowledge of pages on the web and prepare yourself for assignment 3. More advanced users may gain some experience with Enumerations in Java.
You can find some sample log files in
/home/rebelsky/Examples. The files are
log.short, log.medium, and
log.long. Each line of each file contains a host name and
the "path" to a file on the web server. You may want to look at
these files to better familiarize yourself with their content. You can
look at access_log to see the format our server normally
creates.
I've started to develop a
simple analysis tool for log files. Right
now, all it does is count the number of accesses for each page by
storing a Counter for each page it finds. The code is a little weird,
but you should be able to understand it. Make a copy of that code with
% example PageCounter.java
You may want to read the documentation for
rebelsky.util.Counter and
java.util.Hashtable. In reading the documentation
for Hashtable, pay particular attention to get,
put, and containsKey.
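As a quick illustration of those three methods, here is a minimal sketch that uses only java.util.Hashtable. The class name HashtableBasics, the page paths, and the notes stored in the table are all made up for this example; they are not part of the lab code.

```java
import java.util.Hashtable;

public class HashtableBasics {
    // Build a small table that maps a page path to a note about that page.
    // The paths and notes are invented for illustration.
    public static Hashtable sampleTable() {
        Hashtable table = new Hashtable();
        table.put("/index.html", "front page");
        table.put("/lab.html", "lab handout");
        // Putting a key that is already present replaces the old value.
        table.put("/lab.html", "revised lab handout");
        return table;
    }

    public static void main(String[] args) {
        Hashtable table = sampleTable();
        // containsKey answers "have we stored anything under this key?"
        System.out.println(table.containsKey("/index.html"));
        System.out.println(table.containsKey("/missing.html"));
        // get returns the stored value as an Object, or null if the key is absent.
        System.out.println(table.get("/lab.html"));
        System.out.println(table.get("/missing.html"));
    }
}
```

Run as written, this prints true, false, revised lab handout, and null, one per line. A current compiler may note that the code uses unchecked operations, because the table is declared without type parameters, as in the lab code.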
One of the strange parts of the page counter is the line
((Counter) pages.get(page)).increment();
What does it say? It says look up page in the
dictionary pages. Since get returns
an object and we know that that object is a Counter, we
tell Java about its real type. Since it's a Counter, we
can (and do) increment it.
To tell Java more about the type of an object, you preface the object with the type in parentheses. This is called casting the object.
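The same cast can be seen in isolation in the sketch below. Since rebelsky.util.Counter is not available outside the course account, the sketch uses a hypothetical stand-in called TinyCounter; it is only assumed to behave like Counter (a starting value, increment, and value), and the class name CastingDemo is likewise invented.

```java
import java.util.Hashtable;

public class CastingDemo {
    // Hypothetical stand-in for the course's Counter class, so that this
    // sketch compiles on its own. It is not the real Counter.
    static class TinyCounter {
        private int count;
        TinyCounter(int start) { count = start; }
        void increment() { count = count + 1; }
        int value() { return count; }
    }

    // Store a counter, fetch it back, cast it, and increment it once.
    public static int hitsAfterTwoVisits() {
        Hashtable pages = new Hashtable();
        String page = "/index.html";
        pages.put(page, new TinyCounter(1));

        // get hands back a plain Object; Java does not remember what we stored.
        Object found = pages.get(page);
        // The cast tells Java that this Object is really a TinyCounter.
        TinyCounter counter = (TinyCounter) found;
        counter.increment();

        // Cast and method call combined in one expression, as in the lab code.
        return ((TinyCounter) pages.get(page)).value();
    }

    public static void main(String[] args) {
        System.out.println(hitsAfterTwoVisits());
    }
}
```

The program prints 2: the counter starts at 1 and is incremented once. If the object stored under the key were not a TinyCounter, the cast would fail at run time with a ClassCastException.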
args
Finally, a program that actually uses the args that we've
included so frequently. As you may be able to tell from the documentation,
Java assumes that some programs will be run from the command line, just
like mkdir and a host of others. Hence, your Java program
will need to be able to access the other values on the command line. Java
passes them to your main routine as an array of strings.
For example, if someone typed
% ji YourProgram alpha beta gamma
args.length would be 3 (there are three arguments)
args[0] would be alpha
args[1] would be beta
args[2] would be gamma
In this program, we use the 0th argument as the name of the file to process.
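If you would like to see the array for yourself, a small sketch such as the following prints each argument with its index. The class name ArgsDemo and the helper describe are invented for this example.

```java
public class ArgsDemo {
    // Build a description of the argument array, one entry per line.
    public static String describe(String[] args) {
        String result = "args.length = " + args.length;
        for (int i = 0; i < args.length; i++) {
            result = result + "\nargs[" + i + "] = " + args[i];
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(describe(args));
    }
}
```

Running it as java ArgsDemo alpha beta gamma prints args.length = 3 followed by the three arguments, numbered 0 through 2.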
Extend the page counter so that it prints out the most frequently accessed page. If there are many pages that are accessed the same number of times, you only need print one of those pages.
Extend the page counter so that it counts the number of different hosts that accessed pages on this server.
Extend the page counter so that it prints out the host that accessed the most pages.
Extend the page counter so that if there are many pages accessed the
"maximum" number of times, it prints out all of them. How will you do
this? Once you've determined this maximum number of times, you can use
pages.keys(), which gives you an Enumeration
(almost like a list) of the keys. Then you can step through the hash
table, checking the counter for each key and seeing if it equals the
maximum number of accesses.
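The two passes described above might look something like the following sketch. It is one possible shape, not the required solution. TinyCounter is again a hypothetical stand-in for rebelsky.util.Counter, and the class name, the method names, and the sample data are all invented.

```java
import java.util.Enumeration;
import java.util.Hashtable;
import java.util.Vector;

public class MaxPagesDemo {
    // Hypothetical stand-in for the course's Counter class.
    static class TinyCounter {
        private int count;
        TinyCounter(int start) { count = start; }
        int value() { return count; }
    }

    // First pass: step through the keys and remember the largest count.
    public static int maxHits(Hashtable pages) {
        int max = 0;
        Enumeration keys = pages.keys();
        while (keys.hasMoreElements()) {
            Object key = keys.nextElement();
            int hits = ((TinyCounter) pages.get(key)).value();
            if (hits > max) {
                max = hits;
            }
        }
        return max;
    }

    // Second pass: collect every key whose count equals the target.
    public static Vector pagesWithHits(Hashtable pages, int target) {
        Vector result = new Vector();
        Enumeration keys = pages.keys();
        while (keys.hasMoreElements()) {
            String key = (String) keys.nextElement();
            if (((TinyCounter) pages.get(key)).value() == target) {
                result.addElement(key);
            }
        }
        return result;
    }

    // Invented sample data: two pages tie for the most hits.
    public static Hashtable sampleTable() {
        Hashtable pages = new Hashtable();
        pages.put("/a.html", new TinyCounter(3));
        pages.put("/b.html", new TinyCounter(1));
        pages.put("/c.html", new TinyCounter(3));
        return pages;
    }

    public static void main(String[] args) {
        Hashtable pages = sampleTable();
        int max = maxHits(pages);
        System.out.println("Maximum hits: " + max);
        System.out.println("Pages with that many hits: " + pagesWithHits(pages, max));
    }
}
```

Note that a Hashtable makes no promise about the order in which keys() returns the keys, so the pages may be listed in any order.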
Here's the code for the sample usage analyzer.
import java.io.EOFException;     // So we can determine end-of-file
import java.util.Hashtable;      // For storing information on accesses
import rebelsky.io.SimpleOutput; // Yes, we're generating output
import rebelsky.io.SimpleReader; // And reading input
import rebelsky.util.Counter;    // For counting accesses

/**
 * Count a series of web page accesses and report on the most
 * frequently accessed page. Takes the name of the file
 * containing this information from the command line.
 *
 * @author Samuel A. Rebelsky
 * @version 1.0 of February 1998
 */
public class PageCounter
{
  /**
   * Count those pages.
   *
   * @exception Exception
   *   when any trouble occurs. Yes, that's right. This crashes and
   *   burns horribly without any real error checking.
   */
  public static void main(String[] args)
    throws Exception
  {
    // Input to the program.
    SimpleReader file;
    // Output from the program.
    SimpleOutput out = new SimpleOutput();
    // The hash table that stores the pages we've seen.
    Hashtable pages = new Hashtable();
    // The host that requested the page
    String host;
    // The page that was requested
    String page;
    // The number of pages we've processed
    Counter processed = new Counter();

    // Sanity check. Was the program called correctly?
    if (args.length != 1) {
      out.println("Usage: java PageCounter filename");
      out.println("   or: ji PageCounter filename");
      System.exit(1);
    }

    // Initialize input
    file = new SimpleReader(args[0]);

    // Read lines until end of file
    try {
      while (true) {
        // Read the host and page.
        host = file.readString();
        page = file.readString();
        // Skip anything else on the line.
        file.readLine();
        // And process ...
        // If we've already seen the page, just increment its counter.
        if (pages.containsKey(page)) {
          ((Counter) pages.get(page)).increment();
        }
        // If we haven't seen the page, build a new counter with base
        // value 1.
        else {
          pages.put(page, new Counter(1));
        }
        // Note that we've processed another line
        processed.increment();
      } // while
    } // try
    catch (EOFException e) {
      // Do nothing except exit the loop.
    }

    // Okay, we're done, report anything interesting.
    out.println("We've processed " + processed.value() + " lines.");
    out.println("We've seen " + pages.size() + " different pages.");
    // ...

    // That's it.
    System.exit(0);
  } // main
} // PageCounter
Disclaimer: Often, these pages were created "on the fly" with little, if any, proofreading. Any or all of the information on the pages may be incorrect. Please contact me if you notice errors.
Source text last modified Thu Feb 12 21:15:53 1998.
This page generated on Thu Feb 12 21:19:44 1998 by SiteWeaver.
Contact our webmaster at rebelsky@math.grin.edu