# Exercise #3: Finding correlations in polling data

In the project on summarizing poll responses, you learned to apply some basic techniques involving higher-order procedures to a practical problem. This exercise calls upon you to use similar techniques to achieve a slightly subtler and more sophisticated analysis of the data.

## Coefficients of correlation

Two variable quantities are said to be positively correlated when high values of one are regularly found to be associated with high values of the other and low values of one with low values of the other. One can even quantify the degree of correlation between the two variables: If they are invariably and proportionately associated, the correlation is perfect and the coefficient of correlation is 1. On the other hand, if the association is not invariable (if, for instance, it can be overridden or obscured by other factors), the coefficient of correlation is less than 1. When the coefficient of correlation is 0, the two quantities show no linear association at all. It is also possible for there to be a negative correlation, in which high values of one quantity are associated with low values of the other; if this backwards association is invariable and proportionate, the coefficient of correlation is -1.

Here is one way to compute a coefficient of correlation: Let's call the two quantities X and Y. Take a lot of simultaneous measurements of the two quantities; let's suppose that there are n such simultaneous measurements and call them (x1, y1), ..., (xn, yn). Compute the arithmetic mean (the average) of the observed values of each variable -- (x1 + ... + xn)/n for X and (y1 + ... + yn)/n for Y. Call these means mX and mY respectively.

Go through the observations again and subtract the appropriate mean from each observed value to get its divergence from the mean. (Some of the divergences will be positive and others negative -- be sure not to discard the signs.) The result will be a collection of n observed divergences for each variable: (x1 - mX, y1 - mY), ..., (xn - mX, yn - mY).
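The two steps so far (computing the means, then the signed divergences) might be sketched in Scheme as follows. This is only one possible arrangement: the procedure names `mean`, `divergences`, and `paired-divergences` are hypothetical, and the sketch assumes that each observation is a pair built with `cons`.

```scheme
;; Hypothetical helper names; each observation is assumed to be a pair
;; (x . y) built with cons.

;; Arithmetic mean of a non-empty list of numbers.
(define mean
  (lambda (numbers)
    (/ (apply + numbers) (length numbers))))

;; Signed divergence of each number from the mean of the whole list.
(define divergences
  (lambda (numbers)
    (let ((m (mean numbers)))
      (map (lambda (value) (- value m)) numbers))))

;; Convert a list of (x . y) observations into a list of paired
;; divergences ((x1 - mX) . (y1 - mY)), ...
(define paired-divergences
  (lambda (observations)
    (map cons
         (divergences (map car observations))
         (divergences (map cdr observations)))))
```

For the observations `(1 . 2)`, `(2 . 4)`, and `(3 . 6)`, the means are 2 and 4, so `(paired-divergences (list (cons 1 2) (cons 2 4) (cons 3 6)))` yields `((-1 . -2) (0 . 0) (1 . 2))`.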

The coefficient of correlation is a fraction. To compute its numerator, go through the list of paired divergences, find the product of each X-divergence and the corresponding Y-divergence, and add up all of those products. To compute the denominator, find the sum of the squares of the X-divergences, and the sum of the squares of the Y-divergences, multiply those sums together, and take the square root of the result.
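In symbols, the recipe just described can be written as a single formula. The letter r for the coefficient is a notational assumption; the other symbols are the ones introduced above.

$$
r = \frac{\sum_{i=1}^{n} (x_i - m_X)(y_i - m_Y)}
         {\sqrt{\left(\sum_{i=1}^{n} (x_i - m_X)^2\right)\left(\sum_{i=1}^{n} (y_i - m_Y)^2\right)}}
$$

As a small check, take the three observations (1, 2), (2, 4), (3, 6). Here mX = 2 and mY = 4, so the paired divergences are (-1, -2), (0, 0), (1, 2):

$$
r = \frac{(-1)(-2) + (0)(0) + (1)(2)}{\sqrt{(1 + 0 + 1)(4 + 0 + 4)}}
  = \frac{4}{\sqrt{16}} = 1
$$

The result is 1, as expected, since each y value is exactly twice the corresponding x value.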

The first part of the programming assignment is to design, write, and test a Scheme procedure that takes as its argument a list of pairs of real numbers, similar to the collection of simultaneous observations of two variables described above, and computes and returns the coefficient of correlation. You can take it as a precondition of this procedure that not all of the observed values for either variable will be equal (and hence that there will be at least two measurements of each variable, i.e., the given list will contain at least two pairs). Without this precondition, every divergence for one variable could be zero, and the denominator of the fraction would then be zero as well.

## Converting polling data into quantities

The preceding method of computing correlations doesn't seem directly applicable to the polling data, since there are no real numbers there. But one can convert some of the kinds of data there into quantities by assigning numerical values to the various responses to a poll question.

For instance, on one of the issue questions, one could assign the numerical value +1 to a "yes" answer, -1 to a "no" answer, and 0 to "no opinion." Then it would be possible to find out whether responses to two issue questions were correlated by computing the correlation coefficients for the numerical equivalents of the respondents' answers.

Similarly, in looking for correlations involving voters as opposed to non-voters, one might want to assign the numerical value 1 to the `X--`, `-X-`, and `--X` answers to the question about the respondent's 2004 Presidential vote, and 0 to the `---` answer.

Design, write, and test procedures that assign numerical values to poll responses in the ways described here. Use these procedures and the one you developed above to determine, for each of the sixteen issue questions, the degree of correlation between voting and favoring that issue.
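As a starting point, the two conversions described above might look something like the sketch below. The procedure names are hypothetical. The sketch also assumes that issue responses are stored as the strings `"yes"`, `"no"`, and `"no opinion"` and that the vote is stored as a three-character string; the actual encoding in the data file may differ, in which case the comparisons need to be adjusted.

```scheme
;; Assumption: an issue response is one of the strings "yes", "no",
;; or "no opinion". Anything other than "yes" or "no" is treated as
;; having no opinion.
(define issue-response->number
  (lambda (response)
    (cond ((string=? response "yes") 1)
          ((string=? response "no") -1)
          (else 0))))

;; Assumption: the 2004 Presidential vote is a three-character string
;; such as "X--" or "---". Only "---" counts as not having voted.
(define voted->number
  (lambda (vote)
    (if (string=? vote "---") 0 1)))
```

With these definitions, `(map issue-response->number (list "yes" "no" "no opinion"))` gives `(1 -1 0)`, and `(map voted->number (list "X--" "---" "--X"))` gives `(1 0 1)`.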

## Filtering and correlating

Sometimes a correlation will be more noticeable if one derives it from only part of a set of data -- for instance, only from the responses of Bush voters, or only the responses of Poweshiek County residents. Filtering the observations first usually makes the results a little less certain, since the effective size of the sample is smaller, but sometimes exposes patterns that are much harder to detect in the data set as a whole.

Defining any Scheme procedures that you think might be helpful, find out which of the issue questions is most strongly correlated, either positively or negatively, with having voted for Kerry among independent voters. (In other words, filter the data first, keeping only the independent voters, and then compute correlation coefficients for Kerry voting and each of the issue questions, and determine which of the coefficients has the greatest absolute value.)
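The filter-then-pair step might be sketched as follows. Everything about the record layout here is an assumption made for illustration: each record is taken to be a three-element list consisting of a party affiliation string and two values that have already been converted to numbers. The real poll records are laid out differently, so the selectors would need to change. The procedure `keep-if` is defined by hand because a filtering procedure is not part of standard R5RS Scheme.

```scheme
;; keep-if: return the elements of items that satisfy keep?, in order.
(define keep-if
  (lambda (keep? items)
    (cond ((null? items) '())
          ((keep? (car items))
           (cons (car items) (keep-if keep? (cdr items))))
          (else (keep-if keep? (cdr items))))))

;; Hypothetical record layout: (affiliation vote-value issue-value).
;; These three records are made up for illustration only.
(define sample-records
  (list (list "independent" 1 -1)
        (list "republican" 0 1)
        (list "independent" 0 1)))

;; Does this (hypothetical) record belong to an independent voter?
(define independent?
  (lambda (record)
    (string=? (car record) "independent")))

;; Pair up the two numeric fields of a record as one observation.
(define record->observation
  (lambda (record)
    (cons (cadr record) (caddr record))))

;; Filter first, then build the list of observations to correlate.
(define independent-observations
  (map record->observation (keep-if independent? sample-records)))
```

Here `independent-observations` is `((1 . -1) (0 . 1))`, a list of pairs in the form that a correlation procedure of the kind described earlier would accept.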

This exercise will be due at 2:15 p.m. on Friday, March 18.