Reading data from files
Summary: In this laboratory, you will explore some of the ways to work with data that have been stored in files.
a. Open your answers to the lab on tables.
b. Start DrRacket.
c. Update to the latest version of the
d. Add the following lines to your definitions pane.
#lang racket
(require csc151)
e. Save a copy of the file us-zip-codes.csv on your desktop. Since it’s a large file, you will probably want to right-click on the link and use Save Link As….
f. Add to your definitions pane a line to read that file.
(define zips (read-csv-file "/home/username/Desktop/us-zip-codes.csv"))
Note that you will need to replace
username with your username.
g. Confirm that the first few lines of the file contain zip code entries of the form you expect.
> (take zips 5)
h. Save a copy of the file us-zip-codes.txt on your desktop.
That file contains information about the source of the zip code data.
i. Browse that file either by opening it or by reading it into DrRacket with
> (file->lines "/home/username/Desktop/us-zip-codes.txt")
Exercise 1: Simple explorations of the data
a. Determine how many entries there are in the zip code table.
b. Determine the last few entries in the zip code table.
c. Find the middle five elements of the zip code table. (You need not be exact, just close.)
Hint: Use drop and take with appropriate values.
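The following is a minimal sketch of how drop and take can carve out the middle or the end of a list. It uses a short list of numbers as a stand-in for the zip code table; the names sample, start, middle-five, and last-three are made up for illustration.

```racket
#lang racket
;; A small stand-in for the zip code table (hypothetical data): 1, 2, ..., 20.
(define sample (range 1 21))

;; How many elements to skip so that five elements sit roughly in the middle.
(define start (quotient (- (length sample) 5) 2))

;; Drop everything before the middle, then keep five elements.
(define middle-five (take (drop sample start) 5))

;; The last few entries: drop all but the final three.
(define last-three (drop sample (- (length sample) 3)))
```

The same pattern should carry over to zips once you know its length, with larger numbers in place of 5 and 3.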
Exercise 2: Sorting by city
a. In the tables lab, you wrote a procedure,
sort-by-city, to sort zip-code entries by city name. Copy that
procedure (and any other associated procedures) into your definitions
pane. Add a citation to your prior partner or partners.
b. Create a copy of the zip code data sorted by city.
(define zips-by-city (sort-by-city zips))
c. Verify that the sorting worked by examining the first few lines, the last few lines, and some lines in the middle.
Exercise 3: Finding data
With more than 42,000 lines in the zip code file, it will be hard for us to manually find any particular city. But that’s why we have computers.
The assoc procedure searches a list of lists based on the first value in each element list. For example, we can find the entry for zip code 50112 with the instruction
(assoc "50112" zips).
a. Using that command, find the entry for zip code 50112.
b. There is no city with a zip code of 00000. Determine what assoc does when you try to search for that value.
c. assoc only works when we search on the first value, in this case the zip code. What if we want to search by city? Write a sequence of Scheme instructions to find the entry for Grinnell.
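As a rough illustration of how assoc behaves, here is a sketch on a tiny hand-made table. The entries, their approximate coordinates, and the column order (zip, latitude, longitude, city, state) are assumptions made for the example; check the real file before relying on them. The sketch also shows findf, a standard Racket procedure that searches with an arbitrary predicate, which is one of several ways to search on a column other than the first.

```racket
#lang racket
;; Hypothetical entries shaped like (zip latitude longitude city state).
;; The real file may order its columns differently.
(define sample-zips
  (list (list "50112" 41.7 -92.7 "Grinnell" "IA")
        (list "52240" 41.6 -91.5 "Iowa City" "IA")
        (list "55057" 44.4 -93.1 "Northfield" "MN")))

;; assoc compares its first argument to the first element of each entry.
(define found (assoc "52240" sample-zips))

;; When nothing matches, assoc returns #f rather than signaling an error.
(define missing (assoc "00000" sample-zips))

;; findf returns the first entry that satisfies a predicate.
;; Here the city is assumed to be the fourth element of each entry.
(define by-city
  (findf (lambda (entry) (equal? (fourth entry) "Northfield"))
         sample-zips))
```

Because both assoc and findf return #f on failure, it is worth checking the result before taking it apart with car or cadr.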
Exercise 4: Reflecting on data
a. In the tables lab, we identified a potential problem for working with the zip code data. Remind yourself of that problem.
b. Determine whether we are likely to have the same problem when working with the larger data set. (In your notes, you should write down the command you used to make this determination.)
Exercise 5: Cleaning data
While cleaning data is the subject of an upcoming reading and lab, we will have to clean these data before we can move on. Here’s a magic incantation that removes all of the entries that have no latitude.
> (define zips-clean (filter (negate (o string? cadr)) zips))
Confirm that this approach works.
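To see why the incantation works, it may help to take it apart on a small example. The sketch below assumes that o in the csc151 library is function composition, and it uses Racket's standard compose in its place. It also assumes that a missing latitude shows up as a string (here, an empty string) while a real latitude is a number. The entries and the name latitude-missing? are hypothetical.

```racket
#lang racket
;; Hypothetical entries: the second element is a number when the latitude
;; is known and an empty string when it is missing (an assumption about
;; how the CSV reader represents blank cells).
(define sample-zips
  (list (list "50112" 41.7 -92.7 "Grinnell" "IA")
        (list "99999" "" "" "Nowhere" "ZZ")
        (list "52240" 41.6 -91.5 "Iowa City" "IA")))

;; (compose string? cadr) asks: is the second element of the entry a string?
(define latitude-missing? (compose string? cadr))

;; negate flips the predicate, so filter keeps only the entries
;; whose second element is not a string.
(define sample-clean (filter (negate latitude-missing?) sample-zips))
```

One way to confirm the result on the real data is to compare the lengths of zips and zips-clean, or to filter zips-clean with the un-negated predicate and check that nothing is left.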
Exercise 6: Extreme cities
a. Find the entries for the five northernmost cities in the zip code database.
b. Find the entries for the five southernmost cities in the zip code database.
c. Write instructions to provide those cities in more human-friendly form (just the city and state, not the zip, latitude, or longitude).
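One possible approach, sketched on hand-made data, is to sort the entries by latitude and then take a few from the front. The sketch uses Racket's standard sort with a #:key argument. The entries, their approximate coordinates, the column order, and the helper city-and-state are all assumptions made for illustration.

```racket
#lang racket
;; Hypothetical entries shaped like (zip latitude longitude city state).
(define sample-zips
  (list (list "50112" 41.7 -92.7 "Grinnell" "IA")
        (list "78701" 30.3 -97.7 "Austin" "TX")
        (list "55057" 44.4 -93.1 "Northfield" "MN")
        (list "52240" 41.6 -91.5 "Iowa City" "IA")))

;; Sort by latitude (the second element), largest first.
(define north-to-south (sort sample-zips > #:key cadr))

;; The two northernmost entries in this small sample.
(define two-northernmost (take north-to-south 2))

;; Keep only the city and state, assumed to be the fourth and fifth elements.
(define (city-and-state entry)
  (list (fourth entry) (fifth entry)))

(define friendly (map city-and-state two-northernmost))
```

Swapping > for < would order the entries from south to north instead. Note that this only works on cleaned data, since > cannot compare a string with a number.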
Exercise 7: Tallying states
a. Write an instruction to extract just the states from the cleaned list of zip codes.
b. Create a tally of the states using tally-all.
c. Order that tally from largest to smallest.
d. Determine which three states appear most often in this list.
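For readers curious about what a tally involves, here is a minimal sketch in plain Racket. The procedure tally below is a hypothetical stand-in written for this example; it is not the csc151 procedure, and the library's version may return its results in a different shape or order.

```racket
#lang racket
;; tally: a hypothetical helper (not from csc151) that counts how often
;; each value appears and returns a list of (value count) lists.
(define (tally lst)
  (define counts
    (for/fold ([h (hash)]) ([x lst])
      ;; Add one to the count for x, starting from 0 if x is new.
      (hash-update h x add1 0)))
  (for/list ([(value count) (in-hash counts)])
    (list value count)))

;; A short made-up list of states.
(define states '("IA" "MN" "IA" "TX" "IA" "MN"))

;; Order the tally from largest count to smallest.
(define ordered (sort (tally states) > #:key cadr))
```

Once a tally is ordered this way, take gives the most common values directly.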
Exercise 8: Text data
Project Gutenberg provides an extensive collection of public domain books in a variety of forms, including “plain text”.
a. Navigate to the Project Gutenberg Web site and download a few books in plain text format. Strive for short- to medium-length books. Jane Eyre is okay. The Complete Works of William Shakespeare is not.
b. Pick one of the books you’ve downloaded and write instructions to read the characters, words, and lines from the book. Call the results book-letters, book-words, and book-lines. For example,
> (define book-letters (file->chars "/home/rebelsky/Desktop/pg1260.txt"))
> (length book-letters)
1070329
c. Grab the first 20 characters, 10 words, and 5 lines from the book.
Note: You may discover that Project Gutenberg adds a header.
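The sketch below shows one way to get characters, words, and lines from a text file using only standard Racket procedures (file->string, string->list, string-split, and file->lines). It writes a tiny made-up "book" to a temporary file so that it runs on its own. The csc151 reading procedures may treat punctuation and whitespace differently, so treat this as an approximation rather than a description of what they do.

```racket
#lang racket
;; Write a tiny stand-in "book" to a temporary file so the sketch is self-contained.
(define path (make-temporary-file))
(display-to-file "It was a dark night.\nThe rain fell.\n" path #:exists 'truncate)

;; Read the whole file as one string, then split it in different ways.
(define demo-text (file->string path))
(define demo-letters (string->list demo-text))   ; every character, including newlines
(define demo-words (string-split demo-text))     ; split on whitespace; punctuation stays attached
(define demo-lines (file->lines path))           ; one string per line

;; Clean up the temporary file.
(delete-file path)
```

With lists like these in hand, take extracts the first few characters, words, or lines, and drop skips past a header.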
Exercise 9: Tallying
The tally-all procedure is slow. It works fine for moderate-sized data sets, but poorly for large data sets.
a. Write instructions to tally the first 10,000 words in your chosen text. (You should, of course, store the tally in a variable.)
b. Find the five most-frequently-used non-trivial words.
For those with extra time
If you find that you have extra time, you may try any of the following. (That is, you need not do them in order from first to last.)
Extra 1: The beginning and the end
In an earlier exercise, you found the five most-frequently-used non-trivial words in the first 10,000 words of the text. Determine how frequently each of those words is used in the last 10,000 words of the text.
Extra 2: Digrams
In analyzing texts, it is often useful to create “digrams”, pairs of neighboring elements in the text.
a. Create a list of the first 1001 words in your text.
b. Using that list, create a list of the first 1000 word digrams in the text.
Hint: (map (section string-append <> " " <>) ... ...).
c. Can you figure out which word most frequently follows “the”?
d. Is your answer different if you first convert the words to lowercase?
(define sample-words (map1 string-downcase (take book-words 1000)))
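The idea of pairing each word with its neighbor can be sketched on a short made-up list of words. This version uses a plain lambda and the standard procedures drop-right and rest rather than any csc151 helper; the names words, digrams, and after-the are chosen for the example.

```racket
#lang racket
;; A short made-up word list.
(define words '("the" "cat" "saw" "the" "dog" "and" "the" "cat"))

;; Pair each word with its successor by mapping over two lists in parallel:
;; the list without its last word and the list without its first word.
(define digrams
  (map (lambda (a b) (string-append a " " b))
       (drop-right words 1)
       (rest words)))

;; Collect every word that immediately follows "the".
(define after-the
  (for/list ([a (drop-right words 1)]
             [b (rest words)]
             #:when (equal? a "the"))
    b))
```

Tallying a list like after-the is one way to see which word most often follows a given word.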
Extra 3: Why generate digrams?
We’ve seen one use of generating digrams: Digrams can tell us a bit more about the author, such as what words the author uses immediately after or before another word.
But digrams (or their extensions) can also be used to generate text. Read this ‘blog post and reflect on possibilities.
Extra 4: Who needs punctuation?
While the list of characters we’ve received from file->chars is comprehensive, it contains a lot of characters we might not want.
Let’s consider a way to get a more “useful” set of characters.
a. Write instructions to create a single string that consists of the first five hundred or so words of the text, separated by spaces.
b. Convert that string to a list of characters using string->list.
c. Using those characters, create a list of two-letter digrams.
d. Reflect on how those digrams might be useful.