MATH 335: Basic R document with examples and summaries

Introduction to R and your first reading assignment

These on-line notes are for Tom Moore's Math 335/336 class for the 2008/2009 school year. They are meant to provide an exposition that complements the document "An Introduction to R" but is simpler, less exhaustive, and more tailored to the set of R functions and properties that we will need in the class. It will not be a polished document, but with luck it may be a somewhat useful one. I urge my students to pay attention to the examples included below. Work through the code yourself, try it out, play with it by changing things around, and through this process become engaged with R. R is a wonderful language, one worth learning.

You should also get used to the fuller document. Let the text and examples below suggest sections of the longer document to you, then read and study these sections on your own.

At some point soon, and certainly before working through Example 1 below, you should read chapters 1 and 2 of "An Introduction to R." In particular you'll need to pay at to these points:

Simple example to illustrate basics of vector manipulation and command-line work

We will work through a very simple example. I rolled a regular die 10 times and obtained these results: 6, 4, 3, 6, 2, 3, 4, 1, 1, 3. I will now ask R to do some things with my data. Keep in mind, that this is a very simple example to illustrate some very basic ideas in R. It would help to work this out in R for yourself just to open the learning paths that flow through your fingertips into your brain.

First, get into R: on Mathlan, this means typing a capital R at the user prompt (%). Note: Linux and R are both case-senstive environments; upper-class words letters are different from lower-class letters.

Now you will be working with the R prompt: >

Note: I will give example R commands without showing the > prompt. In part this is that my examples just show the part you actually type in. Another advantage is that by suppressing the > you can literally do a cut-and-paste operation from the Web browser to your R session to implement the commands. Still, I encourage you at this early stage to actually type things in by hand.

I now define an object called myrolls, a numerical vector containing the 10 numbers listed above. You do that this way:


myrolls <- c(6, 4, 3, 6, 2, 3, 4, 1, 1, 3)

Type that line into R. The <- symbol is an assignment operator that puts the object on the right into the named object myrolls . If you read chapter 2, you find that the c function is one way to define a vector; especially useful for short vectors. Note: The <- symbol is two key strokes: < followed by -. It can be replaced by the single-stroke = symbol (or even the underscore key), but I prefer the <- symbol, because it keeps in front of us the direction of the assignment: the thing on the right is assigned to the object name on the left.

Suppose I now want to add up the values of my rolls. I do that using the R sum function.


sum(myrolls)

. If you type this into R, you will see that R outputs on the screen the value 33. But keep in mind that I could also have saved that result into another object, e.g.:

mysum <- sum(myrolls)

This may not be useful in this example, but the idea is useful. If one types that line, R gives no output; to see the result, just type the single name of the object:

mysum

and you will see the 33.

Here are a few more examples, illustrating some basic function calls or commands in R. Try them out.


sum(myrolls)
max(myrolls)
min(myrolls)
prod(myrolls)
mean(myrolls)
median(myrolls)
myrolls^2
1/myrolls
table(myrolls)
tabulate(myrolls)
unique(myrolls)

Try out these commands in R and see if you can figure out what each of the functions does. As you try and figure out what each does, you might make up simple other vectors to put into myrolls to see if what you conjecture the function is doing, it really is doing. Another way to find out what an R function does is to use R's help facility. R's help facility is explained on page 4 of "An Introduction to R" document. I encourage you to use R help, even though it will not always be pefectly easy to use; R is similar to most software in that using on-line help is a learning process.

Note on commenting lines

R uses the # symbol to identify a comment. A comment can be an entire line or the last part of a line. Any words or characters that follow the # symbol are ignored by R; they are non-exectuted or non-interpreted characters. Comments are simply there to help you, the user, either remember what the line does or to communicate what the line does to some other reader. Here are a couple of examples.


sum(myrolls)  # finds the sum of my 10 rolls of the die.
# The effect of the preceding line typed or copied into R is identical to:
sum(myrolls)

Finally, it is important to be able to manipulate various R objects in logical, rather than numerical ways. Here are some examples that illustrate logical operations on a vector. I use in-line comments to tell you what is going on.


myrolls==6  #creates a logical vector, i.e., a vector of TRUE-FALSE
              #values.  Type it in and observe the results.

sum(myrolls==6)  #the  sum  function expects a numerical argument
                 #but when it encounters the logical vector argument, R
                 #automatically coerces the TRUEs to 1s and the FALSEs
                 #to 0s.  The upshot is to count the number of 6s among
                 #my 10 die rolls; there were two.

In Math 335, we will mostly be using R to simulate chance experiments, things like rolling dice, tossing coins, and pulling cards from a deck. R contains functions that make this easy. The main function that we will use early in the course is the sample function. Later we will learn about functions that simulate special probability distributions such as the binomial, Poisson, and normal distributions.

The sample function takes a random sample of items from a vector. I have copied the output to the help call: help(sample). The function expects a first argument, denoted by x there, a vector from which to select items at random. The second argument, denoted size there, is how many items to select. The third argument, named replace tells the function whether to sample with or without replacement. The words "replace = FALSE" indicates that R has built in a default value of FALSE for this argument, meaning that the default condition is to sample without replacement. The final argument, named prob , allows the user to optionally tell R to weight the chances of each item in the vector being selected. Both replace and prob are optional arguments. You can leave them out and the function will assume the default values. But the first and second argument are not optional: the first object you list in the function should be a vector and R will select items from this vector; the second argument your list should be a number, to tell R how many items to select. We will incur both instances of with and without replacement sampling, so will often use the words "replace = TRUE", but we'll seldom be using the prob argument.


sample                 package:base                 R Documentation

Random Samples and Permutations

Description:

     'sample' takes a sample of the specified size from the elements of
     'x' using either with or without replacement.

Usage:

     sample(x, size, replace = FALSE, prob = NULL)

Arguments:

       x: Either a (numeric, complex, character or logical) vector of
          more than one element from which to choose, or a positive
          integer.

    size: non-negative integer giving the number of items to choose.

 replace: Should sampling be with replacement?

 prob: A vector of probability weights for obtaining the elements of
          the vector being sampled.

The first thing to note is that we can define a die as a vector containing the numbers 1, 2, 3, 4, 5, 6. We can define this a couple of ways:


die <- c(1,2,3,4,5,6)  # or ...
die <- 1:6  # read more about defining patterned vectors in
            # section 2.3 of An Introduction to R

Now take some random samples, and see what is going on by trying these several times.

sample(die)
sample(die,2)
sample(die,2,replace=T)
sample(1:6,2,replace=T) # this is equivalent to the previous command
sum(sample(1:6,2,replace=T)) # roll a pair of die; add the two values

The replicate function.

The replicate function will take any R function call and replicate it any number of times. We will apply it to the very last R function call above, the one that summed the roll of a pair of dice. We will use replicate to roll and sum the pair of dice 20 times.


replicate(20,sum(sample(1:6,2,replace=T)))

Try this out, maybe more than once. Of course one can put the results into a vector for further process, which suppresses the output. Try this:

twenty.sums <- replicate(20,sum(sample(1:6,2,replace=T)))

To see the results, one can just type twenty.sums at the R prompt. But, instead, let's introduce a few other functions and R conventions here. Try each of these to confirm they do "as advertised" by the corresponding comments.

table(twenty.sums) # summarizes how many times of the twenty each of the possible sums occurs.

unique(twenty.sums) # produces a vector of just the unique values in
                    # the vector
sort(unique(twenty.sums)) # sorts the previous vector

twenty.sums == 7 # produces a True-False vector, with True in the i-th
                 # spot only if the i-th roll was a 7.

We now look at a more extended example.

Example 1: Fermat-Pascal dice problems

Fermat and Pascal discussed this problem. The Chevalier de Mere had been winning money by betting even money he could roll an ace (i.e., a "1") within four rolls of a fair die. It turns out that his chance of winning this bet is slightly in his favor (probability = .518). After a while people caught on to being at a disadvantage, so de Mere looked for a different game and came up with this: "I will roll at least one `double-ace' within 24 rolls of a pair of dice." (He used proportional reasoning to come up with the second game, based upon the first game. Do you see how?)

First we will simulate the single-die game using R. We will look at two solutions.

  1. Solution:

  2. We will use R to "sneak up on" a full simulation of the one-die game, ending up with a simulation that uses the powerful replicate function.

    Copy the lines below into a session of R. The comment for each line tells what that line does. Only the final line is the aimed-for full-scale simulation; that last line simulates the rolling of a single-die 100,000 times and outputing the proportion of those times an "ace" is obtained. But, I advise you to type in the other lines---maybe even more than once a piece---to confirm that they do what is claimed by the comments for those lines.

    sample(1:6,1)  # outputs a single die roll
    rolls <- 4 # the problem asks for 4 rolls of the die
    replicate(rolls, sample(1:6,1)) # produces 4 die rolls; how many are ace?
    replicate(rolls, sample(1:6,1))==1 #produces a true-false vector, where
                                       #to the question:"was the roll an ace?"
    sum(replicate(rolls, sample(1:6,1))==1) # gives the number of rolls that
                                            # were aces; repeat this line some
    reps <- 10 # assigns a value of 10 to the variable reps
    replicate(reps, sum(replicate(rolls, sample(1:6,1))==1)>0) # creates a 
                                       #true-false vector 1=true and 0=false
                                       #to the question: "Was at least one of
                                       #the 4 rolls an ace?"
    sum(replicate(reps, sum(replicate(rolls, sample(1:6,1))==1)>0)) 
                                       #counts the number of times at least
                                       #once 'ace' was rolled in 4 rolls.
    reps <- 100000 # now change reps to 100,000;
    sum(replicate(reps, sum(replicate(rolls, sample(1:6,1))==1)>0))/reps
                                       #compute the proportion of times at
                                       #least one ace occurred.
    #Compare your proportion to the theoretical probability of 0.5177469
    
    

    Exercise: The two-dice problem

    Show that with 24 rolls of a pair of dice, the probability of at least one instance of a "double ace" is slightly less than .50, but with 25 rolls, the probability is slightly greater than .50.

    Solution

    
    reps <- 100000 # Warning: When writing R simulations, always start with
                   # smaller reps until you know the code works.
    
    # Note:  Here is the solution:
    reps <- 100000
    sum(replicate(reps,sum(replicate(24,(sum(sample(1:6,2,replace=T)==c(1,1))==2)==TRUE)) > 0))/reps
    
    [1] 0.494
    
    
    sum(replicate(reps,sum(replicate(25,sum(sample(1:6,2,replace=T)==c(1,1))==2) > 0))>0)/reps
    
    [1] 0.50351
    
    > 1-(35/36)^24
    [1] 0.4914039
    > 1-(35/36)^25
    [1] 0.5055315
    
    

    But you should have approached this solutions in steps, as follows. Go line by line, predict what the lines is asking of R, and then type it in to see if you are understanding the line. By working your way through all the lines of code, you should understand the solution above better.

    reps <- 10  # start with a small number for trying out code
    sample(1:6,2,replace=T)
    sample(1:6,2,replace=T)==c(1,1)
    sum(sample(1:6,2,replace=T)==c(1,1))==2
    (sum(sample(1:6,2,replace=T)==c(1,1))==2)==TRUE
    replicate(24,(sum(sample(1:6,2,replace=T)==c(1,1))==2)==TRUE)
    sum(replicate(24,(sum(sample(1:6,2,replace=T)==c(1,1))==2)==TRUE))
    sum(replicate(24,(sum(sample(1:6,2,replace=T)==c(1,1))==2)==TRUE)) > 0
    sum(sum(replicate(24,(sum(sample(1:6,2,replace=T)==c(1,1))==2)==TRUE)) > 0
    replicate(reps,sum(replicate(24,(sum(sample(1:6,2,replace=T)==c(1,1))==2)==TRUE)) > 0)
    
    sum(replicate(reps,sum(replicate(24,(sum(sample(1:6,2,replace=T)==c(1,1))==2)==TRUE)) > 0))
    
    sum(replicate(reps,sum(replicate(24,(sum(sample(1:6,2,replace=T)==c(1,1))==2)==TRUE)) > 0))/reps
    
    
    
    Summary of commands to this point

    Here is a summary of basic R information you have picked up from reading about and doing R:

    Vector basics

    A vector is an ordered list of numbers or character strings. The vector [1,2,2,3,2,60] and the vector [2,1,2,3,60,2] are different even though they each contain the same number of elements and the same number of each element. We can assign a vector using one of three basic methods, which we sometimes combine. Here, through examples, are basic vector inputs:

    
    v1 <- c(1,2,3,4,5,6)  # c is called the concatenate function
    v2 <- 1:6  # v2 and v1 are the same
    v3 <- rep(17,6)  # v3 is the number 17, repeated 6 times
    v4 <- c(1:6, rep(6,17)) # try this one and see!
    v5 <- seq(-1,4,.5) # gives a vector on values, beginning with -1, going
                       # in steps of .5, until the final value of 4.
    v6 <- numeric(23) # produces a vector of 23 zeroes, equivalent to ...
    v7 <- rep(0,23) # same as v6
    
    
    Subsetting vectors is an important concept. We have encountered some of the components of this process in previous examples.

    Consider the vector v4 above. Suppose we want to identify which entries are 6s. We can form a True-False vector (as we have seen before) through a logical operator denoted == (two equal signs, to distinguish it from the single equal sign, which R reserves for other purposes.) Other comparison operators are less than (<), less than or equal to (<=), greater than (>), greater than or equal to (>=), and not equal to !=.

    
    v4==6  # produces a T-F vector of the same length as v4; length 23
    v6 <- v4==6  # assigns this T-F vector to another variable name, v6
    v4[v6] # produces a vector consisting of just the 6s.
    v4[v4==6] # equivalent to the preceding command
    v4[v4 < 6]
    

    We can also use logical operators, often in combination with comparisons. These are discussed on page 9 of the Introduction to R document and are and (&), or (|), and negation (!).

    Loops and other iterative or conditional control functions

    See page 40 of the Introduction to R document for a discussion of these ideas. The one we will use first in the course is the for function. Here is a simple example.

    v <- numeric(5)  # this step defines a vector consisting of 5 0s.
    for (i in 1:5) {
    v[i] <- v[i] + i
    }
    
    
    The result of this code into R is to sum the numbers 1+2+3+4+5. Without the initialization step, R returns an error message. The impact of the for loop is akin to sigma notation you encounter in mathematics classes. The syntax is standard, and is given on page 40. The curly brackets ({ .... }) enclose the R code that gets repeated for values of 1 beginning at 1, and going through 2, 3, 4, and finally 5. The curly brackets are unnecessary if the code for the loop can be contained on a single line, but the curly brackets give you the ability to put a long string of R code inside the loop.

    Example: Drawing a full house

    Here is code that uses a for loop to simulate drawing a full house in poker. A full house means getting 3 cards of one rank and 2 cards of another in a sample of 5 cards from a regular deck. For example 3 Kings and 2 7's constitue a full house.

    
    reps <- 100000  
    success <- 0  # success counts how many full houses have been found
    deck <- rep(1:13,4)  # for purposes of this problem, suits don't
                         # matter
    for (i in 1:reps) {   # begin a loop with index i
             success <- success +
                 ( sort(tabulate(sample(deck,5)),decreasing=T)[1]==3 &
                    sort(tabulate(sample(deck,5)),decreasing=T)[2]==2 )
    }
    

    We note now several details of importance. The vector deck is as simple as we can get away with. Since we don't have to carefully distinguish suits, nor face cards for that matter, we can just repeat the numbers 1, 2, ..., 13 to represent Ace, 2, 3, ..., King. We need 4 copies because of sampling without replacement. The use of tabulate is interesting. Let's play with tabulate a bit. You should try the following. Once you see what tabulate does to a vector, you understand why we need to invoke the decreasing=T argument value.

    
    hand1 <- c(3,13,12,12,1) 
    hand2 <- c(3,3,12,12,3)
    
    tabulate(hand1)
    tabulate(hand2)
    
    sort(tabulate(hand1), decreasing=T)
    sort(tabulate(hand2), decreasing=T)
    
    sort(tabulate(hand1), decreasing=T)[1]
    sort(tabulate(hand1), decreasing=T)[2]
    
    sort(tabulate(hand2), decreasing=T)[1]
    sort(tabulate(hand2), decreasing=T)[2]
    
    
    Example: Waiting for a success to occur

    Suppose you flip a coin repeatedly. Suppose the coin may not be "fair" so let p be the probability of "heads" on a given flip. Flip the coin until the first success occurs. Let X equal the flip on which you get the first head. Possible values for X are 1, 2, 3, ... . Find the expected or average value of X , when p = 1/6.

    We use this example to illustrate use of a while loop, which is described on page 41 (cryptically) of our Introduction to R document.

    How the code works: while loops.

    After initializing p, reps, and results (a vector of all 0s of dimension reps ), we begin a for loop to replicate the experiment 10,000 times. Within the for loop we implement a while loop. The while loop executes a succession of steps until the condition success==0 is false. The variable success is initialized to be 0 at the start of each new run of the experiment, since the game here is to repeat the coin flip until the first success, i.e., "head", is encountered. The counter X is initialized to 0 and at each step through the while loop X is incremented by 1, indicating that one more coin flip has occurred trying to get that first success.

    The second line of the while loop is key: The command rbinom(1,1,p) is a single flip of our coin, with 1=success and 0=failure. So as long as failures are happening, the variable success stays at the value 0; but it becomes 1 at the first success (heads), and this triggers the termination of the while loop, at which point the value of X is the trial number on which the first success occurred.

    
    p <- 1/6
    
    reps <- 10000
    results <- numeric(reps)
    for (i in 1:reps) {
       success <- 0
       X <- 0
       while (success==0) {
         success <- success + rbinom(1,1,p) # note: rbinom(1,1,p) is
                                            # a single coin flip. (see
                                            # below on random numbers)
       X <- X + 1  # increment X each time through the loop.
       }
       X <- 
       results[i] <- X
    }
    
    mean(results)
    [1] 6.0344  # The theoretical answer here is 1/p = 6.
    
    
    

    Probability Distributions: computing pdf, CDF, and generating random numbers

    For each of several "named" distributions, including the binomial and hypergeometric, R provides a collection of functions that:

    Examples for the Binomial distribution

    The format of the dbinom function is dbinom(x,size,prob), where x=the number of successes you want the probability of, size=the number of trials (more conventionally called n), and prob=the probability of success on a single trial (more conventionally called p).

    Supper we have an 80% free throw shooter who makes 10 trials. Let X=the number of successful free throws in the 10 trials. Then:

    dbinom(7,10,.8) # gives the probability of exactly 7 successes;
    pbinom(7,10,.8) # gives the probability of 7 or fewer successes.
    
    
    Try these two examples; the answers should be .2013266 and .3222005.

    Examples for the Hypergeometric distribution

    The format of the dhyper function is dhyper(x,m,n,k), where x=the number of white balls you want the probability of, m=the number of white balls in the urn, n=the number of black balls in the urn, and k=the number of balls you draw from the urn. Similarly, for the phyper function, the format is phyper(x,m,n,k)

    Here is the translation from R's notation and language to that of Larsen and Marx (the course textbook):

    R                                         Larsen&Marx
    -----------------------------------------------------------------
    urn with white and black balls           red and white chips
    focus on white balls                     focus on red chips
    m= # of white balls in urn               r= # of red chips in urn
    n= # of black balls in urn               w= # of white chips
    m+n=total # balls in urn                 N = r+w = total # chips
    k=sample size                            n=sample size
    
    -----------------------------------------------------------------
    

    Suppose you have an urn with 30 balls, 10 red and 20 white (using the Larsen and Marx language. You select 15 at random. What is the probability that the sample contains 8 red? contains 8 or more red?

    Obtain the answers via:

    dhyper(8,10,20,15)  # answer is 0.02248876
    1-phyper(7,10,20,15) # answer is 0.02508746
    
    
    On that last one, do you see why we entered 7 instead of 8 in the argument?

    Generating random numbers

    We have learned about using the sample function to generate random samples from a vector, either with or without replacement. We can use this function to simulate random outcomes from either the hypergeometric or binomail distribution.

    Example: Simulate the free-throw problem.

    Let's simulate many repetitions of the free-throw problem above. Here is a solution using what we already know.

    
    n <- 10
    x <- c(rep(0,2),rep(1,8))  # we want 80% chance of making a FT
    sample(x,n,replace=T) # generates 10 FTs, 0 = miss; 1 = make
    sum(sample(x,n,replace=T)) # counts the number of makes
    
    reps <- 100
    replicate(reps, sum(sample(x,n,replace=T))) # produces reps instances
                                 # of the number of makes in 10 FTs
    
    sum(replicate(reps, sum(sample(x,n,replace=T))) <= 7)/reps 
                          # estimates the probability of getting 7 or fewer.
    

    But R has built-in functions to simulate from well-known distributions. The all begin with the letter r , for example rbinom to generate random numbers from the binomial. So that we can simulate, say, 20 repetitions of 10 free throws when p=.8 by:

    rbinom(20, 10, .8)
     [1]  9  9  7  9  7  8  9  9  8  7  9  7  6  9  8 10  8  8  9  7
    
    results <- rbinom(20, 10, .8)
    sum(results <= 7)
    
    [1] 7
    
    

    Notice that 7 of the 20 resulted in 7 or fewer free-throws.

    On page 33 of the Introduction to R manual, you can find a full listing of the "named" distribution available in R. In our year of study, we will be working with the following:

    
    Distribution        R name        additional arguments
    -------------------------------------------------------
    
    binomial            binom          size, prob
    chi-squared         chisq          df, ncp
    exponential         exp            rate
    F                   f              df1, df2, ncp
    gamma               gamma          shape, scale
    hypergeometric      hyper          m, n, k
    normal              norm           mean, sd
    Poisson             pois           lambda
    Student's t         t              df, ncp
    uniform             unif           min, max
    
    

    Keep in mind that with each "name" there are really three R functions implied. For example, with binom the 3 functions are dbinom , pbinom , and rbinom for, respectively, the pdf, the CDF, and for generating random numbers from the distributions.

    Example : Generate 10000 random numbers from the interval of real numbers [0, 432]. Then generate the vector of first digits from this, using the function first.digit . Then make a histogram of the first digits.

    Solution:

    
    x <- runif(10000, 0, 432)
    y <- first.digit(x)
    hist(y,breaks=seq(.5,9.5)) # breaking bins at .5 marks makes
                               # a nicer histogram
    
    # Here is the first.digit function:
    first.digit <- function(y) {
    # y should be a positive real number.
    k <- floor(log10(y))
    floor(y/(10^k))
    }