Introduction to Statistics (MAT/SST 115.03 2008S)

R notes for Activity 7-5


Exercise 7-5.f

Preliminaries

This exercise asks you to create a variety of histograms. Different software packages make very different choices about where to put the breaks between the bars of a histogram. For example, the book shows a histogram in which the first subinterval has a midpoint of 0. Many packages instead start the first subinterval at 0, so that the first break after 0 falls at the subinterval width.

R's hist lets you control where the breaks fall. However, this means that you have to specify a vector of breaks. Fortunately, the seq function makes it easy to build that vector.
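
As a small sketch of how seq builds a vector of breaks (the numbers here are made up for illustration and are not taken from the book), note that the number of subintervals is always one less than the number of breaks.

```r
# Build breaks from -2.5 to 12.5 in steps of 5 (illustrative values only).
breaks = seq(from=-2.5, to=12.5, by=5)
print(breaks)

# The number of subintervals is one less than the number of breaks.
subintervals = length(breaks) - 1
print(subintervals)

# The midpoint of each subinterval is the average of its two endpoints.
midpoints = (breaks[-1] + breaks[-length(breaks)]) / 2
print(midpoints)
```

Starting the breaks half a step below 0 is what centers the first subinterval on 0.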

As you found with R's dot plots, you may need to generate the axes with a separate command. In general, your command to create a histogram will look something like the following:

hist(vector,
  breaks=seq(from=min,to=max,by=step),
  axes=FALSE,
  main="Title",
  xlab="Label of X Axis",
  ylab="Label of Y Axis"
)
axis(1, seq(from=min,to=max,by=step))
axis(2, seq(from=min,to=max,by=step))

The Activity

To complete this part of the activity, you'll need to start by reading in the data.

Diabetes = read.csv("/home/rebelsky/Stats115/Data/Diabetes.csv")

To get the histogram shown on p. 127, you would use

hist(Diabetes$AgeAtDiagnosis,
  breaks=seq(from=-2.5,to=92.5,by=5),
  axes=FALSE,
  main="Diabetes Diagnoses",
  xlab="Age of Diabetes Diagnosis",
  ylab="Number of People"
)
axis(1, seq(from=0,to=90,by=5))
axis(2, seq(from=0,to=70,by=10))

If, however, you think that the first bar should include the values 0, 1, 2, 3, and 4 (and, because hist by default includes the right endpoint of each subinterval, 5 as well), rather than just 0, 1, and 2 (after all, who has a negative age?), you might use

hist(Diabetes$AgeAtDiagnosis,
  breaks=seq(from=0,to=90,by=5),
  axes=FALSE,
  main="Diabetes Diagnoses",
  xlab="Age of Diabetes Diagnosis",
  ylab="Number of People"
) 
axis(1, seq(from=0,to=90,by=5))
axis(2, seq(from=0,to=70,by=10))
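
If you are unsure which bar a value that sits exactly on a break ends up in, you can ask hist for the counts without drawing anything. The following is a minimal sketch using a small made-up vector of ages (not the Diabetes data); the names ages, default_counts, and left_counts are just for illustration.

```r
# A small made-up vector of ages (not the Diabetes data).
ages = c(0, 1, 4, 5, 5, 6, 10)
breaks = seq(from=0, to=10, by=5)

# Default: each subinterval includes its right endpoint,
# so the two 5s are counted in the first bar.
default_counts = hist(ages, breaks=breaks, plot=FALSE)$counts
print(default_counts)

# With right=FALSE, each subinterval includes its left endpoint instead,
# so the two 5s are counted in the second bar.
left_counts = hist(ages, breaks=breaks, right=FALSE, plot=FALSE)$counts
print(left_counts)
```

Adding right=FALSE to one of the hist commands above would therefore move any age that falls exactly on a break into the bar to its right.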

Notice, however, that this slight shift in subintervals can have a significant effect on how we look at the data. Compare, for example, the lower end of the graph in each case.
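
To see the effect in numbers rather than pictures, here is a hedged sketch with ten made-up ages (again, not the Diabetes data). The same values are counted once with subintervals centered on 0, 5, and 10, and once with subintervals that start at 0.

```r
# Made-up ages, chosen only to illustrate the effect.
ages = c(0, 1, 2, 3, 3, 4, 6, 7, 8, 9)

# Subintervals centered on 0, 5, and 10: breaks at -2.5, 2.5, 7.5, 12.5.
centered = hist(ages, breaks=seq(from=-2.5, to=12.5, by=5), plot=FALSE)$counts
print(centered)

# Subintervals starting at 0: breaks at 0, 5, 10.
from_zero = hist(ages, breaks=seq(from=0, to=10, by=5), plot=FALSE)$counts
print(from_zero)
```

With the centered breaks the first bar is shorter than the second; with breaks starting at 0 the first bar is the taller one, even though the data have not changed.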

The book then asks us to decrease the number of subintervals to 10. To achieve that result, we want to make each subinterval of width 10. We would therefore use

hist(Diabetes$AgeAtDiagnosis,
  breaks=seq(from=-5,to=95,by=10),
  axes=FALSE,
  main="Diabetes Diagnoses",
  xlab="Age of Diabetes Diagnosis",
  ylab="Number of People"
)
axis(1, seq(from=0,to=90,by=10))
axis(2, seq(from=0,to=130,by=10))
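
If you would rather tell R how many subintervals you want than work out the width yourself, seq also accepts a length.out argument in place of by. The sketch below assumes the same endpoints as above (-5 and 95); remember that 10 subintervals require 11 breaks.

```r
# Ask seq for 11 evenly spaced breaks, which define 10 subintervals.
breaks = seq(from=-5, to=95, length.out=11)
print(breaks)

# The width of each subinterval is the difference between neighboring breaks.
width = diff(breaks)[1]
print(width)
```

The resulting vector can be passed to hist as breaks=breaks in place of the seq call that uses by=10.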

The book next asks us to decrease the number of subintervals to 5. In this case, we want to make each of size about 20 or so.

hist(Diabetes$AgeAtDiagnosis,
  breaks=seq(from=-10,to=90,by=20),
  axes=FALSE,
  main="Diabetes Diagnoses",
  xlab="Age of Diabetes Diagnosis",
  ylab="Number of People"
)
axis(1, seq(from=0,to=90,by=20))
axis(2, seq(from=0,to=200,by=10))

Finally, the book asks us to increase the number of subintervals to 30. The size of each should be 3. Let's start at -1.5 (so that the first subinterval is centered at 0).

hist(Diabetes$AgeAtDiagnosis,
  breaks=seq(from=-1.5,to=90,by=3),
  axes=FALSE,
  main="Diabetes Diagnoses",
  xlab="Age of Diabetes Diagnosis",
  ylab="Number of People"
)
axis(1, seq(from=0,to=90,by=3))
axis(2, seq(from=0,to=50,by=10))
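
One thing worth checking in a command like this is where the breaks actually end, because seq stops at the last value that does not exceed to, and hist reports an error if any data value falls outside the breaks. The following sketch uses three made-up ages (not the Diabetes data) to show how you might check; the variable names are only for illustration.

```r
# The breaks used above: seq stops at the last value that does not exceed `to`.
breaks = seq(from=-1.5, to=90, by=3)
last_break = max(breaks)
print(last_break)

# There are 30 subintervals, one less than the number of breaks.
subintervals = length(breaks) - 1
print(subintervals)

# Made-up ages: check whether the breaks span the whole range of the data.
ages = c(3, 42, 89)
covered = min(breaks) <= min(ages) && max(breaks) >= max(ages)
print(covered)
```

If covered turns out to be FALSE for your data, extending to (for example, to 91.5) adds one more subinterval at the top so that every value is counted.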

After you have completed this exercise, you should review the commands you copied and pasted to see how they differed, and use those differences to help you understand R's hist function.

Creative Commons License

Samuel A. Rebelsky, rebelsky@grinnell.edu

Copyright (c) 2007-8 Samuel A. Rebelsky.

This work is licensed under a Creative Commons Attribution-NonCommercial 2.5 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/2.5/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.