Introduction to Statistics (MAT/SST 115.03 2008S)

Techniques of Statistical Inference: Proportions


As you saw (or will see) in Topics 16-18, there are two principal techniques of statistical inference: confidence intervals and tests of significance. To compute these values in R, you simply need to follow the formulae. We can make it easier to compute with the formulae by using names.

Confidence Intervals

We'll start with confidence intervals. The lower bound of a confidence interval is

c_lower = p_hat - z_star*sqrt(p_hat*(1-p_hat)/n)

where c_lower is the lower bound of the interval, p_hat is the sample proportion, z_star is the desired confidence level (aka the “critical value”), and n is the sample size.

Similarly, the upper bound of a confidence interval is

c_upper = p_hat + z_star*sqrt(p_hat*(1-p_hat)/n)

One way to use these formulae is to plug in the values directly. For example, to find the bounds of the 99% confidence interval (z_star is 2.576) when p_hat is 68% and the population is 2032, we might write

c_lower = .68 - 2.576*sqrt(.68*(1-.68)/2032)
c_upper = .68 + 2.576*sqrt(.68*(1-.68)/2032)
c_lower
c_upper

We echo the c_lower and c_upper so that R not only computes the values, but also prints them out.

However, that's a lot of work to change the formula each time we want to compute a new confidence interval. Instead, we can assign values to the names p_hat, z_star, and n and then use the formulae with the mnemonics.

p_hat = .68
z_star = 2.576
n = 2032
c_lower = p_hat - z_star*sqrt(p_hat*(1-p_hat)/n)
c_upper = p_hat + z_star*sqrt(p_hat*(1-p_hat)/n)
c_lower
c_upper

In fact, we can even use R to compute z_star, so that we don't have to look it up or memorize it each time. For a confidence level of 99%, we need the opposite of the z-score for a proportion of .05. (Why .05? Because we want only 1% outside of the confidence level, and half of the values will be below the interval and half above.) We can use qnorm (the inverse to pnorm) to compute that z-score.

z_star = -qnorm(.05)
z_star

If we don't want to compute the .05 by hand, we can even use a formula for it.

confidence_level = .99
z_star = -qnorm((1-confidence_level)/2)
z_star

You'll note that both of these computations give a slightly different result than the 2.576 that we've been using. As you've figured out from your repeated use of the normal table, it is common practice to use approximations of most values, and R computes a more precise approximation.

Returning to our original example (99% confidence interval for a sample proportion of .68 and size of 2032), we might ask R to compute the confidence interval with

confidence_level = .99
p_hat = .68
n = 2032
z_star = -qnorm((1-confidence_level)/2)
c_lower = p_hat - z_star*sqrt(p_hat*(1-p_hat)/n)
c_upper = p_hat + z_star*sqrt(p_hat*(1-p_hat)/n)
c_lower
c_upper

Significance Tests

We can use the same strategy here that we used above: Write the formulae for the test statistic and the p-value and then make sure the variables used in those formulae are defined as we'd like.

z = (p_hat-pi_0)/sqrt(pi_0*(1-pi_0)/n)
p_left = pnorm(z)
p_right = 1-pnorm(z)
p_both = 2*pnorm(-abs(z))

where p_hat is the sample proportion, pi_0 is the hypothesized population proportion, and n is the sample size.

So, for example, if we want to compute the p-value of getting a test proportion at least as large as 24/74 (.324) with a hypothesized population proportion of .25 when the sample size is 74, we would write,

p_hat = 24/74
pi_0 = .25
n = 72
z = (p_hat-pi_0)/sqrt(pi_0*(1-pi_0)/n)
p = 1 - pnorm(z)
p

Similarly, to compute the double-sided p-value when the test proportion is .68, the hypothesized population proprotion is .65, and the sample size is 2032, we would write

p_hat = .68
pi_0 = .65
n = 2032
z = (p_hat-pi_0)/sqrt(pi_0*(1-pi_0)/n)
p = 2*pnorm(-abs(z))
p

Making Your Own Functions

Warning! This section contains R concepts beyond the scope of this course.

Of course, rather than copying and pasting the same code every time, we might find it useful to create functions that encapsulate the computation. (We may also ask ourselves why R does not include these functions, but that's a question for another day.) It is not our goal to teach you how to write functions in R, so for this section, we'll just give you some code that you might use.

A Function for Confidence Intervals

We can also turn this code into a function of three parameters.

proportion_ci = function(confidence_level, p_hat, n) {
  z_star = -qnorm((1-confidence_level)/2)
  c_lower = p_hat - z_star*sqrt(p_hat*(1-p_hat)/n)
  c_upper = p_hat + z_star*sqrt(p_hat*(1-p_hat)/n)
  c(c_lower,c_upper)
}

Now, rather than naming each value separately, we can call this function like we'd call any other R function.

> proportion_ci(.95,.68,2032)
[1] 0.6597178 0.7002822
> proportion_ci(.99,.68,2032)
[1] 0.6533446 0.7066554
> proportion_ci(.95,.72,996)
[1] 0.6921154 0.7478846
> proportion_ci(.95,.64,1036)
[1] 0.6107713 0.6692287

Functions for Computing p-values

We'll start with a function to compute the test statistic for proportions, since it's used in any computation of p-values.

proportion_z = function(pi_0, p_hat, n) {
  (p_hat-pi_0)/sqrt(pi_0*(1-pi_0)/n)
}

Now, there are three different ways to compute p-values: At least as extreme to the left, at least as extreme, or at least as extreme in both directions. We'll represent these as separate funtions.

proportion_p_left = function(pi_0, p_hat, n) {
  pnorm(proportion_z(pi_0, p_hat, n))
}
proportion_p_right = function(pi_0, p_hat, n) {
  1 - pnorm(proportion_z(pi_0, p_hat, n))
}
proportion_p_both = function(pi_0, p_hat, n) {
  2*pnorm(-abs(proportion_z(pi_0, p_hat, n)))
}

Here's how we'd start to fill in the table for Activity 17-4.

> proportion_z(.25, .30, 50)
[1] 0.8164966
> proportion_p_right(.25, .30, 50)
[1] 0.2071081
> proportion_z(.25, .30, 100)
[1] 1.154701
> proportion_p_right(.25, .30, 100)
[1] 0.1241065
> proportion_z(.25, .30, 150)
[1] 1.414214
> proportion_p_right(.25, .30, 150)
[1] 0.0786496

Similarly, here's how we'd start to fill out the table for Activity 18-1.

> proportion_ci(.99, .68, 2032)
[1] 0.6533446 0.7066554
> proportion_p_both(.65, .68, 2032)
[1] 0.004578886
> proportion_p_both(.6667, .68, 2032)
[1] 0.2034319
> proportion_p_both(.7, .68, 2032)
[1] 0.04914258
> proportion_p_both(.707, .68, 2032)
[1] 0.007492398

Creative Commons License

Samuel A. Rebelsky, rebelsky@grinnell.edu

Copyright (c) 2007-8 Samuel A. Rebelsky.

This work is licensed under a Creative Commons Attribution-NonCommercial 2.5 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/2.5/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.