
Assignment 5 - Distant Reading

Summary
For this assignment, you will put your data science skills to work by exploring ways to gather more complex information about text files.
Collaboration
You must work with your assigned partner(s) on this assignment. You may discuss this assignment with anyone, provided you credit such discussions when you submit the assignment.
Submitting
Email your answers to csc151-01-grader@grinnell.edu. The subject of your email should be [CSC151 01] Assignment 5 - Distant Reading, and the body should contain your answers to all parts of the assignment. Scheme code should be in the body of the message, not in an attachment.
Warning
So that this assignment is a learning experience for everyone, we may spend class time publicly critiquing your work.
Preparation
In preparation for this assignment, pick a dozen or so moderate-length texts (at least fifty and no more than a few hundred pages) from Project Gutenberg. At least three but no more than six of the texts should be by the same author.

Background

A number of scholars in the humanities have begun exploring computer-based approaches to uncovering ideas or themes in texts, an approach that some term “distant reading” (to contrast it with the “close reading” that is so core to the understanding of works). In this assignment, we will explore a variety of simple techniques for exploring texts with computers. (Traditional distant reading uses much more sophisticated techniques.)

Problem 1: Word lengths

Topics: files, strings, displaying data with histograms

In this problem, you will analyze texts based on the lengths of the words in the text.

Document and write a procedure, (explore-lengths fname), that takes the name of a text file as input and produces a graph of the frequencies of the word lengths in the file. You will likely want to follow a series of steps similar to the following.

a. Read all of the words from the file.

b. Convert the list of words to a list of lengths.

c. Tally the lengths using tally-all.

d. Scale the tallies by dividing each by the total number of words. (That gives us a frequency between 0 and 1.)

e. Put the tallies in order using sort.

f. Display the data using plot with discrete-histogram.
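
To make the middle steps concrete, here is a minimal sketch that converts words to lengths, tallies them, scales the tallies, and sorts them, using an in-memory list of words rather than a file. The names tally-values and length-frequencies are hypothetical and are not part of the assignment. tally-values is only a stand-in for tally-all, assuming tallies are represented as (value count) lists; the course's tally-all may differ in detail.

```racket
#lang racket

;; Hypothetical stand-in for tally-all (an assumption, not the course
;; procedure): count how often each value appears and return a list
;; of (value count) lists, in no particular order.
(define (tally-values lst)
  (define counts
    (for/fold ([h (hash)]) ([v (in-list lst)])
      (hash-update h v add1 0)))
  (for/list ([(v n) (in-hash counts)])
    (list v n)))

;; Hypothetical helper: given a list of words (reading the file is
;; omitted), produce (length frequency) lists ordered by length.
(define (length-frequencies words)
  (define lengths (map string-length words))        ; words -> lengths
  (define tallies (tally-values lengths))           ; lengths -> tallies
  (define total (length words))
  (define scaled                                     ; tallies -> frequencies
    (map (lambda (t) (list (first t) (/ (second t) total)))
         tallies))
  (sort scaled < #:key first))                       ; order by word length

(length-frequencies '("the" "cat" "chased" "the" "dog"))
;; => '((3 4/5) (6 1/5))
```

Note that Racket keeps the frequencies as exact fractions; each one is still a number between 0 and 1.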

Using this procedure, create histograms for the books you chose. Write a short note about any similarities or differences you see.

In turning in the assignment, do not submit the histograms themselves. Rather, submit the documentation and code for explore-lengths and the instructions for building the histograms.

Problem 2: Word lengths, revisited

Topics: files, strings, displaying data with dot plots

Another characteristic of books we might explore is the relative proportion of long words to short words.

Document and write a procedure, (compare-word-lengths files), that takes a list of file names as input and produces a scatterplot (a points plot) with one point for each file. The x coordinate of each point is the percentage of words in the file with seven or more characters, and the y coordinate is the percentage of words with four or fewer characters.
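
As a rough illustration of how the coordinates for a single file might be computed, the sketch below works on a list of words that has already been read. The name length-point is hypothetical, and the sketch assumes a non-empty list; it does not read files or draw the plot.

```racket
#lang racket

;; Hypothetical helper (not part of the assignment): given a non-empty
;; list of words, return (x y), where x is the percentage of words with
;; seven or more characters and y is the percentage of words with four
;; or fewer characters.
(define (length-point words)
  (define total (length words))
  (define long-count
    (count (lambda (w) (>= (string-length w) 7)) words))
  (define short-count
    (count (lambda (w) (<= (string-length w) 4)) words))
  (list (* 100 (/ long-count total))
        (* 100 (/ short-count total))))

(length-point '("tiny" "gigantic" "words" "cat"))
;; => '(25 50)
```

Words of five or six characters, such as "words" above, count toward neither coordinate, so the two percentages need not add up to 100.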

Problem 3: Subsequent words

Topics: files, strings, lists

One of the more substantive things about books that computers can help us explore is words that the author tends to use together.

Document and write a procedure, (subsequent-words filename word) that, given the name of a file and a word, makes a list of all the words that follow one, two, or three words after word. For example, suppose the file "story.txt" contains the following.

the cat chased the dog around the cat bowls and the dog dish

We should get output like the following.

> (subsequent-words "/home/username/Desktop/story.txt" "dog")
'("around" "the" "cat" "dish")
> (subsequent-words "/home/username/Desktop/story.txt" "cat")
'("chased" "the" "dog" "bowls" "and" "the")
> (subsequent-words "/home/username/Desktop/story.txt" "the")
'("cat" "chased" "the" "dog" "around" "the" "cat" "bowls" "and" "dog" "dish")

Hint: You may find it easiest to start by building a list of four-tuples like the following.

'(("the" "cat" "chased" "the")
  ("cat" "chased" "the" "dog")
  ("chased" "the" "dog" "around")
  ("the" "dog" "around" "the")
  ...)

There are many approaches to building those lists. One fairly straightforward one is to make four lists, each of which is “off by one” from the previous one, and then to join the elements together with map. E.g., we’d start with the lists

  • '("the" "cat" "chased" "the" "dog" "around" "the" "cat" "bowls" ...)
  • '("cat" "chased" "the" "dog" "around" "the" "cat" "bowls" ... "")
  • '("chased" "the" "dog" "around" "the" "cat" "bowls" ... "" "")
  • '("the" "dog" "around" "the" "cat" "bowls" ... "" "" "")
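
One possible way to realize the “off by one” idea is sketched below. The name four-tuples is hypothetical, and padding with three empty strings is just one way to keep the four lists the same length; you may prefer a different approach.

```racket
#lang racket

;; Hypothetical helper (not part of the assignment): build one
;; four-tuple per word, padding past the end of the list with "".
(define (four-tuples words)
  (define n (length words))
  ;; Three empty strings let every shifted list have n elements.
  (define padded (append words '("" "" "")))
  ;; The list that starts `offset` words later than the original.
  (define (shifted offset)
    (take (drop padded offset) n))
  ;; map with four lists walks them in parallel, joining the
  ;; corresponding elements with list.
  (map list (shifted 0) (shifted 1) (shifted 2) (shifted 3)))

(four-tuples '("the" "cat" "chased" "the" "dog"))
;; => '(("the" "cat" "chased" "the")
;;      ("cat" "chased" "the" "dog")
;;      ("chased" "the" "dog" "")
;;      ("the" "dog" "" "")
;;      ("dog" "" "" ""))
```

With this padding, the last few tuples contain empty strings, so a procedure that collects the following words would need to skip them.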

Problem 4: Common connections

Topics: files, strings, tallying, sorting

Document and write a procedure, (common-connections filename word), that takes as input a file name and a word and produces a list of the five words that most commonly follow close after the word (one, two, or three words away) and the number of times they appear nearby.

> (common-connections "/home/username/Desktop/something-weird.txt" "jabberwock")
'(("alice" 191)
  ("borogoves" 83)
  ("vorpal" 23)
  ("sword" 18)
  ("wabe" 11))

Next, pick three words that you expect to appear in six of your books and find the most common connections to those words in each of those six books.
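
The last step of common-connections, choosing the five most frequent entries, might look something like the following sketch. It assumes the tallies are already available as (word count) lists, which may not match the format your tallying procedure produces; the name top-five is hypothetical.

```racket
#lang racket

;; Hypothetical helper (not part of the assignment): given a list of
;; (word count) lists, return the five entries with the largest
;; counts, or all of them if there are fewer than five.
(define (top-five tallies)
  (define sorted (sort tallies > #:key second))
  (take sorted (min 5 (length sorted))))

(top-five '(("a" 3) ("b" 10) ("c" 1) ("d" 7) ("e" 2) ("f" 5)))
;; => '(("b" 10) ("d" 7) ("f" 5) ("a" 3) ("e" 2))
```

The min guards against files in which fewer than five distinct words follow the given word.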

Problem 5: Categorizing words

Topics: files, strings, conditionals, tallying

We’ve seen a number of ways to categorize words. They may be short or long. They may start with or contain certain letters. They may contain repeated letters. They may be near other words. They may be common. They may be uncommon.

First, pick and describe between six and ten categories. Then, document and write a procedure, (categorize-word word), that gives the category for a word as a string. You should use "uncategorized" for words that do not fit into your categories. For example,

> (categorize-word "aardvark")
"starts-with-vowel"
> (categorize-word "jabberwocky")
"Carrollian"
> (categorize-word "defenestrate")
"uncommon"
> (categorize-word "madam")
"palindrome"
> (categorize-word "Grinnell")
"proper-name"
> (categorize-word "elephant")
"starts-with-vowel"
> (categorize-word "me")
"short"
> (categorize-word "hello")
"uncategorized"

If a word falls into multiple categories, categorize-word should return only one of them.

Next, document and write a procedure, categorize-words-in-file, that takes a file name as input and creates a histogram of the categories in alphabetical order. For each category, you should indicate the percentage of words that fall in that category.

Finally, categorize six of the books you chose and see whether the categorization tells you anything about the book.

Note: You will almost certainly use a conditional to write categorize-word.
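
To illustrate the shape such a conditional might take, here is a small sketch with only three categories. The categories, their order, the length thresholds, and the names categorize-word-sketch, palindrome?, and starts-with-vowel? are all assumptions chosen for illustration; your own categories will differ.

```racket
#lang racket

;; Hypothetical predicate: does the word read the same backwards,
;; ignoring case?
(define (palindrome? word)
  (define chars (string->list (string-downcase word)))
  (equal? chars (reverse chars)))

;; Hypothetical predicate: is the first letter a, e, i, o, or u?
(define (starts-with-vowel? word)
  (and (positive? (string-length word))
       (memv (char-downcase (string-ref word 0))
             '(#\a #\e #\i #\o #\u))
       #t))

;; cond tries the tests in order and stops at the first that holds,
;; so a word that fits several categories still gets exactly one.
(define (categorize-word-sketch word)
  (cond
    [(and (> (string-length word) 2) (palindrome? word)) "palindrome"]
    [(starts-with-vowel? word) "starts-with-vowel"]
    [(<= (string-length word) 3) "short"]
    [else "uncategorized"]))

(map categorize-word-sketch '("madam" "elephant" "me" "hello"))
;; => '("palindrome" "starts-with-vowel" "short" "uncategorized")
```

Because the order of the clauses decides which category wins, it is worth describing that order in your documentation.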

Evaluation

We will primarily evaluate your work on correctness (does your code compute what it’s supposed to and are your procedure descriptions accurate); clarity (is it easy to tell what your code does and how it achieves its results; is your writing clear and free of jargon); and concision (have you kept your work short and clean, rather than long and rambly).