Summary: We explore techniques for visualizing a simple multivariate data set.
a. We'll be working in both DrScheme and GIMP, so open a new DrScheme window and a new GIMP window. In the GIMP, open a ScriptFu console. In that console, load our course GIMP library.
b. In order to visualize data, you'll need some data sets with which to work. Let's create a simple set to get you familiar with techniques.
c. Open that file in DrScheme to see its basic structure.
Note: This exercise (and the subsequent exercises) will only work
if you've loaded the
Right now, we have the data in a file. If we are to visualize it, we'll need to convert it to a usable form.
a. Open the input file and name the port with
(define us1 (open-input-file "/home/username/Desktop/us1.txt"))
b. Read the three lines from the file with
(define headings (read-line us1))
(define gdp-line (read-line us1))
(define population-line (read-line us1))
c. Close the port with
d. Look at the form of gdp-line. What type is it? What do you expect to have to do in order to use it to plot points?
e. The ScriptFu procedure (strbreakup str sep) allows us to separate a compound string into a list of strings, using sep to determine where to break the string. Separate the three lines you read with
(define heading-parts (strbreakup headings "\t"))
(define gdp-parts (strbreakup gdp-line "\t"))
(define population-parts (strbreakup population-line "\t"))
f. Look at the form of gdp-parts. Is it what you expected? How would you convert it to a list of numbers?
g. Convert the three lists of parts to lists of numbers with
(define years (map string->number (cddr heading-parts)))
(define gdps (map string->number (cddr gdp-parts)))
(define populations (map string->number (cddr population-parts)))
h. You now have three separate lists of values. You'd like to group them into data points. Here's one straightforward way to do so:
(define make-triplet-points
  (lambda (lst1 lst2 lst3)
    (if (null? lst1)
        null
        (cons (vector (car lst1) (car lst2) (car lst3))
              (make-triplet-points (cdr lst1) (cdr lst2) (cdr lst3))))))
What do you expect the value of data-points to be after the following definition?
(define data-points (make-triplet-points years populations gdps))
i. Check your prediction experimentally.
j. You may find that there's an extra data point at the end of your list. If so, remove it with
(define data-points (reverse (cdr (reverse data-points))))
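If you'd like to experiment with the conversion steps outside of ScriptFu, the following is a minimal sketch of the same idea in plain Scheme. It assumes each line has two label fields followed by one value per year (which is what the cddr calls above suggest). The procedure split-on-char is a hypothetical stand-in for strbreakup, and the sample strings are made-up placeholders, not values from the data file.

```scheme
;; Sketch only: split-on-char is a hypothetical stand-in for ScriptFu's
;; strbreakup. It splits str at every occurrence of the character sep.
(define split-on-char
  (lambda (str sep)
    (let loop ((chars (string->list str))
               (current '())
               (parts '()))
      (cond
        ;; End of string: finish the field in progress and return the parts.
        ((null? chars)
         (reverse (cons (list->string (reverse current)) parts)))
        ;; Separator: close the current field and start a new one.
        ((char=? (car chars) sep)
         (loop (cdr chars)
               '()
               (cons (list->string (reverse current)) parts)))
        ;; Any other character: add it to the field in progress.
        (else
         (loop (cdr chars) (cons (car chars) current) parts))))))

;; Placeholder rows (not real data): two label fields, then the values.
(define sample-headings "name\tunit\t2000\t2001")
(define sample-values "x\tunit\t10\t20")

;; Drop the two label fields with cddr, then convert the rest to numbers.
(define sample-years
  (map string->number (cddr (split-on-char sample-headings #\tab))))
(define sample-numbers
  (map string->number (cddr (split-on-char sample-values #\tab))))
```

With these placeholders, sample-years is (2000 2001) and sample-numbers is (10 20). Note that a line ending in a tab yields an empty final field, which string->number cannot turn into a number; that is one plausible source of the extra data point mentioned in step j.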
a. Create a new 340x320 image with
b. Set the foreground color to black and the pen to the pixel (1x1) or similar small brush.
(set-fgcolor BLACK)
(set-brush "pixel (1x1 square)")
c. Using the line procedure, draw two lines on the image. One line should go from (10,10) to (10,310). The other should go from (10,310) to (310,310). These lines are intended to represent the x and y axes. (Which is which?)
d. Suppose we wanted to draw a circle of radius 8 at the point (200,100) in the coordinate system represented by the two axes you just drew. Where should that circle be centered in the GIMP coordinate system?
e. What instructions would you write to draw a blue circle of radius 8 centered at the point (200,100) in the new coordinate system?
f. Test your instructions experimentally.
g. Write a procedure, (draw-circle image x y r), that draws a circle of radius r centered at (x,y) in the new coordinate system. The circle should use the current background color as its primary color and should have a black outline.
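One way to think about the conversion between the two coordinate systems is sketched below. It assumes the axes described above, with the plot's origin at image pixel (10,310) and the y axis growing upward. The names origin-x, origin-y, plot-x->image-x, and plot-y->image-y are hypothetical helpers, not part of the course library.

```scheme
;; Sketch only: convert plot coordinates to GIMP image coordinates,
;; assuming the plot origin sits at image pixel (10,310).
(define origin-x 10)
(define origin-y 310)

;; Plot x grows to the right, just as image columns do.
(define plot-x->image-x
  (lambda (x)
    (+ origin-x x)))

;; Image rows count downward from the top, so plot y is subtracted.
(define plot-y->image-y
  (lambda (y)
    (- origin-y y)))
```

Under these assumptions, (plot-x->image-x 200) is 210 and (plot-y->image-y 100) is 210. A draw-circle procedure could apply these two helpers before calling whatever drawing procedure you choose.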
Suppose we want to plot GDP (dependent) vs. population (independent). Clearly, neither set of values falls in the range 0 .. 300. What should we do? We should scale them. For example, we might divide population by 300,000,000 (slightly more than the largest population), which gives us a number between 0 and 1. We then multiply by 300 to get a number between 0 and 300.
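The arithmetic described above might be sketched as follows. The names max-population, plot-width, and scale-population are hypothetical; the constants are the ones mentioned in the paragraph.

```scheme
;; Sketch only: scale a population into the range 0 .. 300.
(define max-population 300000000) ; slightly more than the largest population
(define plot-width 300)

(define scale-population
  (lambda (population)
    ;; Dividing gives a fraction between 0 and 1; multiplying by the
    ;; plot width stretches that fraction across the x axis.
    (* plot-width (/ population max-population))))
```

For example, a population of 150,000,000 is half of the maximum, so it lands at x = 150. GDP can be scaled the same way with a different maximum.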
a. Write a procedure,
(plot img data-point), that, given a
data point (a vector of the form
#(year population gdp)),
b. Plot all of the points computed in exercise 0 with
(map (lambda (point) (plot img point)) data-points)
Of course, now that we have values plotted, it makes sense to give the reader some sense as to what values each spot represents. Fortunately, the gimp.scm library now contains a procedure, (text img str x y font size), that draws str in the given font and size, with the upper-left-hand corner at (x,y).
a. Using that procedure, add the string "0" directly below the origin with
(text img "0" 10 311 "Sans" 8)
b. Add the maximum x value directly below 300 with
(text img "300M" 310 311 "Sans" 8)
c. Add similar labels to the y axis.
As you probably noticed, the values we computed in the previous step are close enough to each other that it's hard to tell how much the points differ. What should we do? Instead of treating the origin as (0,0), we want to treat it as some other base value, such as (280,000,000, 10,000,000,000,000).
To continue that example, if an x coordinate of 0 represents a population of 280,000,000, then the values we need to convert to the range (0 .. 300) are
So, we divide by 20,000,000 or so and then multiply by 300.
This strategy gives us a nice spread of x values. (We'll leave it to you to decide whether or not it's intellectually honest.)
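As a rough sketch of the shifted scaling, assuming a base population of 280,000,000 and a range of 20,000,000 as in the example above (the names base-population, population-range, and scale-shifted-population are hypothetical):

```scheme
;; Sketch only: scale a population so that the base value maps to x = 0
;; and the base value plus the range maps to x = 300.
(define base-population 280000000)
(define population-range 20000000)

(define scale-shifted-population
  (lambda (population)
    ;; Subtract the base first, then scale what remains to 0 .. 300.
    (* 300 (/ (- population base-population) population-range))))
```

With these constants, a population of 290,000,000 lands at x = 150, halfway along the axis. The same pattern works for GDP with its own base value and range.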
plot so that it
b. Create a new 320x320 image and draw axes on that image.
c. Label the axes. For example, the origin should now have an x value
d. Plot all of the points computed in exercise 0 with
(map (lambda (point) (plot img point)) data-points)
You now have a plot in which you can tell the points apart. But which point represents which year? In this case, it seems straightforward, since we know that the US population and GDP seem to increase each year. However, there are certainly cases in which points are less predictable.
So, how can we represent the year? One possibility is to change the shade of blue we use to plot each point. For every year, we'll set the red component to 0 and the green component to 0. For the year 2000, we'll use a blue component of 32; for 2001, 64; for 2002, 96; for 2003, 128; for 2004, 160; and for 2005, 192.
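The blue components listed for 2000 through 2004 go up by 32 each year, so they can be computed rather than looked up. Here is a minimal sketch; the name year->blue is hypothetical, and the formula simply assumes that the step of 32 per year continues.

```scheme
;; Sketch only: compute the blue component for a year, assuming the
;; pattern 2000 -> 32, 2001 -> 64, ... continues in steps of 32.
(define year->blue
  (lambda (year)
    (* 32 (- year 1999))))
```

For example, (year->blue 2000) is 32 and (year->blue 2004) is 160. Keep in mind that color components normally stop at 255, so this formula only makes sense for a short span of years.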
plot to set the blue component as described in the
b. Plot all of the points computed in exercise 0.
Of course, in addition to paying attention to GDP, we might also care about GDP per capita.
a. Write a procedure, add-gdppc, that, given a data point of the form (year population gdp), adds a fourth element which represents the GDP per capita.
plot so that the size of the circle depends on the GDP per capita. (Allow the radius of the circle to vary between 5 and 10.)
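One possible way to map a GDP per capita to a radius between 5 and 10 is linear interpolation, sketched below. The name gdppc->radius and its parameters low and high (the smallest and largest GDP per capita you expect) are hypothetical; you would need to choose those bounds yourself.

```scheme
;; Sketch only: map gdppc linearly so that low gives radius 5 and
;; high gives radius 10. Assumes low < high.
(define gdppc->radius
  (lambda (gdppc low high)
    (+ 5 (* 5 (/ (- gdppc low) (- high low))))))
```

A value halfway between low and high gives a radius of 7.5 (possibly shown as the fraction 15/2, depending on your Scheme).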
In computing an x and y value for each point, we relied on some human analysis and intuition to decide on how to convert each population to an x value and each GDP to a y value. How might you automate the computation of conversion values?
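One possible approach, offered only as a starting point, is to compute the smallest and largest values in a list and build a scaling procedure from them. The name make-scaler is hypothetical, and the sketch assumes the list contains at least two different numbers.

```scheme
;; Sketch only: given a list of numbers and a size, return a procedure
;; that maps the smallest number to 0 and the largest to size.
;; Assumes lst holds at least two different values; otherwise the
;; division below would be a division by zero.
(define make-scaler
  (lambda (lst size)
    (let ((low (apply min lst))
          (high (apply max lst)))
      (lambda (value)
        (* size (/ (- value low) (- high low)))))))
```

For example, after (define scale-x (make-scaler (list 10 20 30) 300)), the call (scale-x 20) gives 150. Because this puts the extreme points exactly on the edges of the plot, you might prefer to widen the range a bit before scaling.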
I usually create these pages
on the fly, which means that I rarely
proofread them and they may contain bad grammar and incorrect details.
It also means that I tend to update them regularly (see the history for
more details). Feel free to contact me with any suggestions for changes.
This document was generated by
Siteweaver on Thu Nov 30 21:46:01 2006.
The source to the document was last modified on Thu Nov 30 21:45:01 2006.
Samuel A. Rebelsky, email@example.com
http://creativecommons.org/licenses/by-nc/2.5/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.