# Multivariate Data Visualization

Summary: We explore techniques for visualizing a simple multivariate data set.

## Preparation

a. We'll be working in both DrScheme and GIMP, so open a new DrScheme window and a new GIMP window. In the GIMP, open a ScriptFu console. In that console, load our course GIMP library.

```
(load "/home/rebelsky/Web/Courses/CS151/2006F/Examples/gimp.scm")
```

b. In order to visualize data, you'll need some data sets with which to work. Let's create a simple set to get you familiar with techniques.

• Open a new Web browser window or tab to `http://devdata.worldbank.org/data-query/`.
• Select the United States (and only the United States) from the list of available countries.
• Click to go on to the next screen.
• Select GDP (current US\$) and Population, total.
• Click to go on to the next screen.
• Select all the available years.
• Click to go on to the next screen.
• Under Data export options, select Save data as ASCII file.
• Follow the instructions and save the result on your desktop as `us1.txt`.

c. Open that file in DrScheme to see its basic structure.

## Exercises

### Exercise 1: Preprocessing the Data

Note: This exercise (and the subsequent exercises) will only work if you've loaded the `gimp.scm` library.

Right now, we have the data in a file. If we are to visualize it, we'll need to convert it to a usable form.

a. Open the input file and name the port with

```
(define us1 (open-input-file "/home/username/Desktop/us1.txt"))
```

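b. Read the three lines from the file with the definitions below. (They assume that the GDP row comes before the population row. Check the file you opened in step c of the preparation, and swap the last two names if the rows are in the other order.)

```
(define headings (read-line us1))
(define gdp-line (read-line us1))
(define population-line (read-line us1))
```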

c. Close the port with

```
(close-input-port us1)
```

d. Look at the form of `gdp-line`. What type is it? What do you expect to have to do in order to use it to plot points?

e. The ScriptFu procedure `(strbreakup str sep)` allows us to separate a compound string into a list of strings, using `sep` to determine where to break the string. Separate the lines with

```
(define heading-parts (strbreakup headings "\t"))
(define gdp-parts (strbreakup gdp-line "\t"))
(define population-parts (strbreakup population-line "\t"))
```
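Because `strbreakup` is a ScriptFu procedure, it may not be available if you want to experiment in DrScheme. Here is a minimal sketch of a similar splitter in standard Scheme. The name `split-on-char` is hypothetical (it is not part of `gimp.scm`), and unlike `strbreakup` it takes a character, not a string, as its separator.

```scheme
;; Split str into a list of strings, breaking at each occurrence of sep.
;; A hypothetical stand-in for strbreakup, not the library's own code.
(define split-on-char
  (lambda (str sep)
    (let loop ((chars (string->list str))
               (current '())   ; characters of the piece being built, reversed
               (pieces '()))   ; completed pieces, reversed
      (cond
        ((null? chars)
         (reverse (cons (list->string (reverse current)) pieces)))
        ((char=? (car chars) sep)
         (loop (cdr chars)
               '()
               (cons (list->string (reverse current)) pieces)))
        (else
         (loop (cdr chars) (cons (car chars) current) pieces))))))

;; A made-up comma-separated line, used only for illustration.
(define sample-pieces (split-on-char "Year,2000,2001" #\,))
```

Note that with this sketch a line ending in the separator produces a final empty string, which is worth keeping in mind when you look at the last element of each list.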

f. Look at the form of `gdp-parts`. Is it what you expected? How would you convert it to a list of numbers?

g. Convert the three lists of parts to lists of numbers with

```
(define years (map string->number (cddr heading-parts)))
(define gdps (map string->number (cddr gdp-parts)))
(define populations (map string->number (cddr population-parts)))
```
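To see what these definitions do, it can help to try the same pattern on a small list first. The sketch below uses made-up strings (`sample-parts` is a hypothetical name, and the real file's leading entries will differ). It assumes standard Scheme behavior, in which `string->number` returns `#f` for a string that is not a number.

```scheme
;; Made-up parts: two leading non-numeric entries, two numbers, and an
;; empty string such as a trailing separator might leave behind.
(define sample-parts (list "label" "code" "1.5E+03" "2000" ""))

;; cddr skips the two leading entries; string->number converts the rest.
(define sample-numbers (map string->number (cddr sample-parts)))

;; The last element is #f because "" is not a valid number.
(define last-converted (caddr sample-numbers))
```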

h. You are now stuck with three lists of values. You'd like to group them into data points. Here's one straightforward way to do so:

```
(define make-triplet-points
  (lambda (lst1 lst2 lst3)
    (if (null? lst1)
        null
        (cons (vector (car lst1) (car lst2) (car lst3))
              (make-triplet-points (cdr lst1) (cdr lst2) (cdr lst3))))))
```

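i. What do you expect the value of `data-points` to be after the following definition? (Population comes before GDP here so that each point has the form `#(year population gdp)`, which is the form the later exercises assume.)

```
(define data-points (make-triplet-points years populations gdps))
```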
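In standard Scheme, `map` accepts more than one list, so the same grouping can also be sketched without explicit recursion. The name `group-triplets` and the sample values below are made up for illustration, and this form is not guaranteed to work in every ScriptFu interpreter.

```scheme
;; Group corresponding elements of three equal-length lists into vectors.
(define group-triplets
  (lambda (lst1 lst2 lst3)
    (map vector lst1 lst2 lst3)))

;; Made-up values, not taken from the World Bank file.
(define sample-points
  (group-triplets (list 1 2) (list 10 20) (list 100 200)))
```

The explicit recursion in `make-triplet-points` stops when its first list runs out, so the two versions can behave differently when the lists have different lengths.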

j. You may find that there's an extra data point at the end of your list. If so, remove it with

```
(define data-points (reverse (cdr (reverse data-points))))
```

### Exercise 2: A Coordinated Drawing System

a. Create a new 340x320 image with `create-image`. Name the result `img`, since the later steps refer to the image by that name.

b. Set the foreground color to black and the pen to the pixel (1x1) or similar small brush.

```
(set-fgcolor BLACK)
(set-brush "pixel (1x1 square)")
```

c. Using the `line` procedure, draw two lines on the image. One line should go from (10,10) to (10,310). The other should go from (10,310) to (310,310). These lines are intended to represent the x and y axes. (Which is which?)

d. Suppose we wanted to draw a circle of radius 8 at the point (200,100) in the coordinate system represented by the two axes you just drew. Where should that circle be centered in the GIMP coordinate system?

e. What instructions would you write to draw a blue circle of radius 8 centered at the point (200,100) in the new coordinate system?

f. Write a procedure, `(draw-circle image x y r)`, that draws a circle of radius `r` centered at `(x,y)` in the new coordinate system. The circle should use the current background color as its primary color and should have a black outline.

### Exercise 3: Plotting Data Points

Suppose we want to plot GDP (dependent) vs. population (independent). Clearly, neither set of values falls in the range 0 .. 300. What should we do? We should scale them. For example, we might divide population by 300,000,000 (slightly more than the largest population), which gives us a number between 0 and 1. We then multiply by 300 to get a number between 0 and 300.
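The scaling arithmetic can be sketched as a small helper. The name `scale-to-axis` is hypothetical, and the sample population is made up so that the result is easy to check by hand.

```scheme
;; Scale value into the range 0 .. 300, where largest is a number
;; slightly bigger than any value we expect to plot.
(define scale-to-axis
  (lambda (value largest)
    (* 300 (/ value largest))))

;; A made-up population of 150,000,000 is half of 300,000,000,
;; so it lands halfway along the axis.
(define sample-x (scale-to-axis 150000000 300000000))
```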

a. Write a procedure, `(plot img data-point)`, that, given a data point (a vector of the form `#(year population gdp)`),

• computes an x value by dividing population by 300,000,000 and then multiplying by 300;
• computes a y value by dividing GDP by 15,000,000,000,000 and then multiplying by 300;
• draws a blue circle centered at (x,y); and
• returns the list `(x y)`.

b. Plot all of the points computed in exercise 1 with

```
(map (lambda (point) (plot img point)) data-points)
```

### Exercise 4: Labels

Of course, now that we have values plotted, it makes sense to give the reader some sense of what value each spot represents. Fortunately, the `gimp.scm` library now contains a procedure, `(text img str x y font size)`, that draws `str` in the given font and size, with the upper-left corner at (x,y).

a. Using that procedure, add the string `"0"` directly below the origin with

```
(text img "0" 10 311 "Sans" 8)
```

b. Add the maximum x value directly below 300 with

```
(text img "300M" 310 311 "Sans" 8)
```

c. Add similar labels to the y axis.

### Exercise 5: Shifting

As you probably noticed, the values we computed in the previous step are close enough to each other that it's hard to tell how much the points differ. What should we do? Instead of treating the origin as (0,0), we want to treat it as some other base value, such as (280,000,000, 10,000,000,000,000).

To continue that example, if an x coordinate of 0 represents a population of 280,000,000, then the values we need to convert to the range (0 .. 300) are

• 2,224,000 (2.82224E+08 - 2.8E+08)
• 5,318,000 (2.85318E+08 - 2.8E+08)
• 8,369,000 (2.88369E+08 - 2.8E+08)
• 10,810,000 (2.9081E+08 - 2.8E+08)
• 13,655,400 (2.936554E+08 - 2.8E+08)
• 16,496,600 (2.964966E+08 - 2.8E+08)

So, we divide by 20,000,000 or so and then multiply by 300.

This strategy gives us a nice spread of x values. (We'll leave it to you to decide whether or not it's intellectually honest.)
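The shift-then-scale arithmetic can be sketched as follows. The name `shift-and-scale` is hypothetical. The base of 280,000,000 and the divisor of 20,000,000 are the ones suggested above, and the two populations are the first and last values from the list.

```scheme
;; Map value into 0 .. 300, treating base as 0 and base + span as 300.
(define shift-and-scale
  (lambda (value base span)
    (* 300 (/ (- value base) span))))

;; The smallest and largest populations listed above.
;; exact->inexact turns the exact fraction into a decimal for display.
(define first-x
  (exact->inexact (shift-and-scale 282224000 280000000 20000000)))
(define last-x
  (exact->inexact (shift-and-scale 296496600 280000000 20000000)))
```

With these inputs, `first-x` comes out at roughly 33 and `last-x` at roughly 247, so the six points spread across most of the axis.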

a. Rewrite `plot` so that it

• computes an x value by subtracting 280,000,000 from the population, dividing by an appropriate value, and then multiplying by 300;
• computes a y value by subtracting 10,000,000,000,000 from the GDP, dividing by an appropriate value, and then multiplying by 300;
• draws a blue circle centered at (x,y); and
• returns the list `(x y)`.

b. Create a new 320x320 image and draw axes on that image.

c. Label the axes. For example, the origin should now have an x value of `280M`.

d. Plot all of the points computed in exercise 1 with

```
(map (lambda (point) (plot img point)) data-points)
```

### Exercise 6: Changing Colors

You now have a plot in which you can tell the points apart. But which point represents which year? In this case, it seems straightforward, since we know that the US population and GDP seem to increase each year. However, there are certainly cases in which points are less predictable.

So, how can we represent the year? One possibility is to change the shade of blue we use to plot each point. For every year, we'll set the red component to 0 and the green component to 0. For the year 2000, we'll use a blue component of 32; for 2001, 64; for 2002, 96; for 2003, 128; for 2004, 160; and for 2005, 192.
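The blue components above follow a simple pattern: each year adds 32. Here is a minimal sketch of that arithmetic, assuming the years run consecutively from 2000. The name `year->blue` is hypothetical, and the sketch computes only the number, not a GIMP color.

```scheme
;; Compute the blue component for a year: 32 for 2000, plus 32 for
;; each year after that.
(define year->blue
  (lambda (year)
    (* 32 (+ 1 (- year 2000)))))

(define blue-2000 (year->blue 2000))
(define blue-2003 (year->blue 2003))
```

A blue component can be at most 255, so this particular formula runs out of room after 2006.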

a. Rewrite `plot` to set the blue component as described in the previous paragraph.

b. Plot all of the points computed in exercise 1.

### Exercise 7: Computing New Values

Of course, in addition to paying attention to GDP, we might also care about GDP per capita.

a. Write a procedure, `add-gdppc`, that, given a data point of the form `#(year population gdp)`, adds a fourth element that represents the GDP per capita.

b. Rewrite `plot` so that the size of the circle depends on the GDP per capita. (Allow the radius of the circle to vary between 5 and 10.)
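One way to keep the radius between 5 and 10 is linear interpolation between a lowest and a highest GDP per capita. The sketch below is only one possibility. The name `gdppc->radius` and the bounds of 30,000 and 40,000 are assumptions chosen for illustration, not values taken from the data.

```scheme
;; Map gdppc linearly so that lowest gives a radius of 5 and
;; highest gives a radius of 10.
(define gdppc->radius
  (lambda (gdppc lowest highest)
    (+ 5 (* 5 (/ (- gdppc lowest) (- highest lowest))))))

;; Made-up bounds; a value halfway between them gives a radius
;; halfway between 5 and 10.
(define middle-radius (gdppc->radius 35000 30000 40000))
```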

## For Those With Extra Time

In computing an x and y value for each point, we relied on some human analysis and intuition to decide on how to convert each population to an x value and each GDP to a y value. How might you automate the computation of conversion values?
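As a starting point, here is one possible direction, sketched in standard Scheme: take the smallest value in a list as the base and the difference between the largest and smallest as the divisor. The name `make-scaler` is hypothetical, and the sketch assumes the list contains at least two different numbers.

```scheme
;; Given a list of numbers, build a procedure that maps the smallest
;; number to 0 and the largest to 300.
;; Assumes nums is non-empty and not all equal (otherwise span is 0
;; and the division fails).
(define make-scaler
  (lambda (nums)
    (let* ((lowest (apply min nums))
           (highest (apply max nums))
           (span (- highest lowest)))
      (lambda (value)
        (* 300 (/ (- value lowest) span))))))

;; Made-up values for illustration.
(define scale-sample (make-scaler (list 10 20 30)))
(define scaled-middle (scale-sample 20))
```

This version puts the extreme points exactly on the edges of the plot, so you may prefer to widen the range a little before scaling.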

## History

Disclaimer: I usually create these pages on the fly, which means that I rarely proofread them and they may contain bad grammar and incorrect details. It also means that I tend to update them regularly (see the history for more details). Feel free to contact me with any suggestions for changes.

This document was generated by Siteweaver on Thu Nov 30 21:46:01 2006.
The source to the document was last modified on Thu Nov 30 21:45:01 2006.
This document may be found at `http://www.cs.grinnell.edu/~rebelsky/Courses/CS151/2006F/Labs/multivariate-visualization.html`.

Samuel A. Rebelsky, rebelsky@grinnell.edu

Copyright © 2006 Samuel A. Rebelsky. This work is licensed under a Creative Commons Attribution-NonCommercial 2.5 License. To view a copy of this license, visit `http://creativecommons.org/licenses/by-nc/2.5/` or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.