Reading data from files

Summary: In this laboratory, you will explore some of the ways you can work with data that have been stored in files.

Preliminaries

a. Open your answers to the lab on tables.

b. Start DrRacket.

c. Update to the latest version of the csc151 package.

d. Add the following lines to your definitions pane.

#lang racket
(require csc151)

e. Save a copy of the file us-zip-codes.csv on your desktop. Since it’s a large file, you will probably want to right-click on the link and use Save Link As….

f. Add to your definitions pane a line to read that file.

(define zips (read-csv-file "/home/username/Desktop/us-zip-codes.csv"))

Note that you will need to replace username with your username.

g. Confirm that the first few lines of the file contain zip code entries of the form you expect.

> (take zips 5)

h. Save a copy of the file us-zip-codes.txt. That file contains information about the source of the zip code data.

i. Browse that file either by opening it or by reading it into DrRacket with

> (file->lines "/home/username/Desktop/us-zip-codes.txt")

Exercises

Exercise 1: Simple explorations of the data

a. Determine how many entries there are in the zip code table.

b. Determine the last few entries in the zip code table.

Hint: Use drop.

c. Find the middle five elements of the zip code table. (You need not be exact, just close.)

Hint: Use drop and take with appropriate values.
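
As a small illustration of how take and drop combine, here is a sketch on a made-up ten-element list. The names sample, last-three, and middle-three are hypothetical and are not part of the zip code data; the same pattern should carry over to larger lists.

```racket
#lang racket
;; Hypothetical stand-in for a table: ten numbered entries.
(define sample (range 10))          ; '(0 1 2 3 4 5 6 7 8 9)

;; Last three entries: drop everything except the final three.
(define last-three
  (drop sample (- (length sample) 3)))

;; Three entries near the middle: drop roughly half of the rest,
;; then take three.
(define middle-three
  (take (drop sample (quotient (- (length sample) 3) 2)) 3))

last-three    ; '(7 8 9)
middle-three  ; '(3 4 5)
```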

Exercise 2: Sorting by city

a. In the tables lab, you wrote a procedure, sort-by-city, to sort zip-code entries by city name. Copy that procedure (and any other associated procedures) into your definitions pane. Add a citation to your prior partner or partners.

b. Create a copy of the zip code data sorted by city.

(define zips-by-city (sort-by-city zips))

c. Verify that the sorting worked by examining the first few lines, the last few lines, and some lines in the middle.

Exercise 3: Finding data

With more than 42,000 lines in the zip code file, it will be hard for us to manually find any particular city. But that’s why we have computers.

The assoc procedure will search a list of lists based on the first value in each element list. For example, we can find the entry for zip code 50112 with the instruction (assoc "50112" zips).

a. Using that command, find the entry for zip code 50112.

b. There is no city with a zip code of 00000. Determine what assoc does when you try to search for that value.

c. assoc only works when we’re using the first value, in this case the zip code. What if we want to search by city? Write a sequence of Scheme instructions to find the entry for Grinnell.
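
To make the behavior of assoc concrete, here is a minimal sketch on made-up entries. It assumes each entry has the form (zip latitude longitude city state); the positions of the city and state are an assumption, and the values are invented. The findf procedure from standard Racket is one possible way to search on a field other than the first.

```racket
#lang racket
;; Hypothetical entries in the assumed form
;; (zip latitude longitude city state). All values are made up.
(define sample-zips
  (list (list "11111" 40.0 -90.0 "Alphaville" "AA")
        (list "22222" 41.0 -91.0 "Betatown" "BB")))

;; assoc compares its first argument with the car of each entry
;; and returns the first entry that matches.
(define found (assoc "22222" sample-zips))

;; To search on another field, one option is findf with a predicate
;; that inspects the fourth element of each entry.
(define by-city
  (findf (lambda (entry) (equal? (fourth entry) "Betatown"))
         sample-zips))

found    ; '("22222" 41.0 -91.0 "Betatown" "BB")
by-city  ; '("22222" 41.0 -91.0 "Betatown" "BB")
```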

Exercise 4: Reflecting on data

a. In the tables lab, we identified a potential problem for working with the zip code data. Remind yourself of that problem.

b. Determine whether we are likely to have the same problem when working with the larger data set. (In your notes, you should write down the command you used to make this determination.)

Exercise 5: Cleaning data

While cleaning data is the subject of an upcoming reading and lab, we will have to clean these data before we can move on. Here’s a magic incantation that removes all of the entries that have no latitude.

> (define zips-clean (filter (negate (o string? cadr)) zips))

Confirm that this approach works.
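
If the incantation seems opaque, the following sketch may help. It uses only standard Racket and made-up entries, and it assumes that o composes procedures the way Racket's compose does and that a missing latitude shows up as a string (here, an empty string). The names sample-zips, latitude-is-string?, and sample-clean are hypothetical.

```racket
#lang racket
;; Hypothetical entries. The second one has empty strings where the
;; latitude and longitude should be (an assumption about missing data).
(define sample-zips
  (list (list "11111" 40.0 -90.0 "Alphaville" "AA")
        (list "22222" "" "" "Betatown" "BB")
        (list "33333" 42.0 -92.0 "Gammaburg" "CC")))

;; (compose string? cadr) asks: "is the second element a string?"
(define latitude-is-string? (compose string? cadr))

;; negate flips the predicate, so filter keeps only the entries
;; whose second element is NOT a string.
(define sample-clean
  (filter (negate latitude-is-string?) sample-zips))

(length sample-zips)   ; 3
(length sample-clean)  ; 2
```

Comparing the lengths before and after filtering, as above, is one simple way to check how many entries were removed.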

Exercise 6: Extreme cities

a. Find the entries for the five northernmost cities in the zip code database.

b. Find the entries for the five southernmost cities in the zip code database.

c. Write instructions to provide those cities in more human-friendly form (just the city and state, not the zip, latitude, or longitude).
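
One possible approach, sketched on made-up data, is to sort on the latitude field and then take from the front. The entry layout (zip latitude longitude city state) is an assumption, and city-and-state is a hypothetical helper, not something provided by the csc151 package.

```racket
#lang racket
;; Hypothetical entries in the assumed form
;; (zip latitude longitude city state). All values are made up.
(define sample-zips
  (list (list "11111" 40.0 -90.0 "Alphaville" "AA")
        (list "22222" 45.0 -91.0 "Betatown" "BB")
        (list "33333" 30.0 -92.0 "Gammaburg" "CC")))

;; Larger latitudes are farther north, so sort in descending order
;; using the second field as the key.
(define north-to-south (sort sample-zips > #:key cadr))

;; Build a friendlier "City, ST" string from an entry.
(define (city-and-state entry)
  (string-append (fourth entry) ", " (fifth entry)))

(map city-and-state (take north-to-south 2))
;; '("Betatown, BB" "Alphaville, AA")
```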

Exercise 7: Tallying states

a. Write an instruction to extract just the states from the cleaned list of zip codes.

b. Create a tally of the states using tally-all.

c. Order that tally from largest to smallest.

d. Determine which three states appear most often in this list.
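
For intuition about what a tally involves, here is a minimal sketch in standard Racket on a made-up list of states. The helper tally-sketch is hypothetical. It produces a list of (value count) lists, and whether tally-all returns exactly that shape is an assumption you should check against your own results.

```racket
#lang racket
;; Hypothetical list of state abbreviations.
(define sample-states (list "IA" "MN" "IA" "IL" "IA" "MN"))

;; Count occurrences in an immutable hash, then convert the hash
;; to a list of (value count) lists.
(define (tally-sketch lst)
  (hash-map
   (for/fold ([counts (hash)]) ([x lst])
     (hash-update counts x add1 0))
   list))

;; Order the tally from largest count to smallest.
(define sorted-tally
  (sort (tally-sketch sample-states) > #:key cadr))

sorted-tally  ; '(("IA" 3) ("MN" 2) ("IL" 1))
```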

Exercise 8: Text data

Project Gutenberg provides an extensive collection of public domain books in a variety of forms, including “plain text”.

a. Navigate to the Project Gutenberg Web site and download a few books in plain text format. Strive for short- to medium-length books. Jane Eyre is okay. The Complete Works of William Shakespeare is not.

b. Pick one of the books you’ve downloaded and write instructions to read the characters, words, and lines from the book. Call the results book-letters, book-words, and book-lines. For example,

> (define book-letters (file->chars "/home/rebelsky/Desktop/pg1260.txt"))
> (length book-letters)
1070329

c. Grab the first 20 characters, 10 words, and 5 lines from the book.

Note: You may discover that Project Gutenberg adds a header.
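
The following sketch shows the same three views of a text (characters, words, lines) using only standard Racket procedures, on a tiny temporary file so that it is self-contained. It is not how the csc151 procedures are implemented; in particular, splitting on whitespace leaves punctuation attached to words, which may differ from what you see with the csc151 package. All of the names are hypothetical.

```racket
#lang racket
;; Write a tiny stand-in "book" to a temporary file.
(define sample-path (make-temporary-file))
(display-to-file "It was a dark night.\nThe rain fell.\n"
                 sample-path
                 #:exists 'truncate)

(define sample-text (file->string sample-path))
(define sample-chars (string->list sample-text))  ; list of characters
(define sample-words (string-split sample-text))  ; split on whitespace
(define sample-lines (file->lines sample-path))   ; one string per line

(take sample-chars 5)  ; '(#\I #\t #\space #\w #\a)
(take sample-words 3)  ; '("It" "was" "a")
sample-lines           ; '("It was a dark night." "The rain fell.")
```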

Exercise 9: Tallying

Unfortunately, the tally-all procedure is slow. It works fine for moderate-sized data sets, but poorly for large data sets.

a. Write instructions to tally the first 10,000 words in your chosen text. (You should, of course, store the tally in a variable.)

b. Find the five most-frequently-used non-trivial words.

For those with extra time

If you find that you have extra time, you may try any of the following. (That is, you need not do them in order from first to last.)

Extra 1: The beginning and the end

In an earlier exercise, you found the five most-frequently-used non-trivial words in the first 10,000 words of the text. Determine how frequently each of those words is used in the last 10,000 words of the text.

Extra 2: Digrams

In analyzing texts, it is often useful to create “digrams”, pairs of neighboring elements in the text.

a. Create a list of the first 1001 words in your text.

b. Using that list, create a list of the first 1000 word digrams in the text.

Hint: Use (map (section string-append <> " " <>) ... ...).

c. Can you figure out which word most frequently follows “the”?

d. Is your answer different if you first convert the words to lowercase using string-downcase?

(define sample-words (map1 string-downcase (take book-words 1000)))
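
The idea behind building digrams can be sketched on a five-word list: pair every word except the last with every word except the first. This version uses a plain lambda from standard Racket rather than the csc151 helper suggested in the hint, and the names tiny-words and tiny-digrams are hypothetical.

```racket
#lang racket
;; Hypothetical five-word text.
(define tiny-words (list "the" "cat" "saw" "the" "dog"))

;; Walk two lists in parallel: all but the last word alongside
;; all but the first word. Five words yield four digrams.
(define tiny-digrams
  (map (lambda (a b) (string-append a " " b))
       (drop-right tiny-words 1)
       (rest tiny-words)))

tiny-digrams  ; '("the cat" "cat saw" "saw the" "the dog")
```

In the same way, a list of 1001 words yields 1000 digrams.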

Extra 3: Why generate digrams?

We’ve seen one use of generating digrams: Digrams can tell us a bit more about the author, such as what words the author uses immediately after or before another word.

But digrams (or their extensions) can also be used to generate text. Read this ‘blog post and reflect on possibilities.

Extra 4: Who needs punctuation?

While the list of characters we’ve received from file->chars is comprehensive, it contains a lot of characters we might not want. Let’s consider a way to get a more “useful” set of characters.

a. Write instructions to create a single string that consists of the first five hundred or so words of the text, separated by spaces.

b. Convert that string to a list of characters using string->list.

c. Using those characters, create a list of two-letter digrams.

d. Reflect on how those digrams might be useful.