For reasons I no longer remember I decided a while back to reproduce Figure 1.5.2 from Casella and Berger (page 32):
It's a plot of the cumulative distribution function of a geometric distribution with p = 0.3. The geometric distribution is best explained with coin-flipping. Let's say we keep flipping a coin until we observe a head. Now let X be a random variable that equals the number of tosses required to see a head. So if we get a head on the first flip of the coin, that means we needed one toss. If it takes two tosses, we got a tail on the first toss and then a head on the second toss. The graph above visualizes the possibilities for 15 flips with probability of getting heads set to 0.3. We see that the probability of taking only one toss is 0.3, which makes sense since the probability of getting heads is 0.3. The probability of requiring two tosses is about 0.5. And so on. The straight lines indicate the probabilities only change at the whole numbers. There is no such thing as, say, 2.23 flips. This is often referred to as a step function. We see that there's a high probability of getting a head by the 5th or 6th flip. This is a simple waiting-time distribution that can be used to model the number of successes before a failure and vice-versa.
To recreate this plot in R I used the pgeom function, which returns the cumulative probability of the geometric distribution for a given number of failures before a success occurs and a given probability of success. For example, the probability of 0 failures before observing a success with p = 0.3:
pgeom(0, prob = 0.3)
## [1] 0.3
The probability of 1 failure before observing a success with p = 0.3:
pgeom(1, prob = 0.3)
## [1] 0.51
To generate all the probabilities in the plot, we can do the following:
y <- pgeom(q = 0:14, prob = 0.3)
Notice that instead of using 1:15
(number of total flips for success), we use 0:14
. That's because pgeom works with the number of failures before success. Now that we have our y coordinates, we can create our first version of the plot as follows:
plot(x = 1:15, y, pch = 19)
Now let's add the horizontal lines using the segments
function:
plot(x = 1:15, y, pch = 19)
segments(x0 = 0:15, y0 = c(0, y),
x1 = 1:16, y1 = c(0, y), lwd = 2)
The x0
and y0
coordinates are where the line starts. The x1
and y1
coordinates are where the line ends. Since the lines are horizontal, the y coordinates are the same for the start and end postions. The y coordinates include 0, so we add that value to y
with the c
function. The lwd = 2
argument makes the line a little thicker and darker.
Our plot is basically done, but just for fun I wanted to see how close I could make it look like the version in the book. That means relabeling the axes, moving the axis labels to the ends, and removing the lines at the top and right side of the plot. It also means moving the axis tick marks inside the plotting area. After some trial and error and deep dives into R documentation, here's what I was able to come up with:
plot(x = 1:15, y, pch = 19,
yaxt = "n", xaxt = "n", ylim = c(0,1.05), xlim = c(0,15.5),
bty="l", xaxs="i", yaxs = "i", xlab = "", ylab = "")
segments(x0 = 0:15, y0 = c(0, y),
x1 = 1:16, y1 = c(0, y), lwd = 2)
axis(side = 1, at = 0:15,
labels = 0:15, tcl = 0.5, family = "serif")
axis(side = 2, at = seq(0,1,0.1),
labels = c(0,paste(".",1:9),1), las=1, tcl = 0.5, family = "serif")
mtext(text = expression(italic(x)), side = 4,
at = 0, las = 1, line = 0.5, family = "serif")
mtext(text = expression(italic(F[x](x))), side = 3,
at = 0, line = 0.5, family = "serif")
In the plot
function:
- the
yaxt = "n"
andxaxt = "n"
arguments say “don't label the axes”. I instead use theaxis
function to create the axes. - the
ylim = c(0,1.05)
andxlim = c(0,15.5)
arguments tell the axes to end at 1.05 and 15.5, respectively. I wanted them to extend beyond the last value just as they do in the book. - the
bty="l"
argument says “draw a box around the plot like a capital letter L” - the
xaxs="i"
andyaxs = "i"
arguments ensures the axes fit within the original range of the data. The default is to extend the range by 4 percent at each end. Again, I was trying to recreate the graph in the book. Notice how the origin has the x-axis and y-axis 0 values right next to one another. - The
xlab = ""
andylab = ""
set blank axis labels. I instead use themtext
function to add axis labels.
The segments
function remained unchanged.
The axis
function allows us to explicitly define how the axis is created.
- The
side
argument specifies which side of the plot we're placing the axis.1
is the bottom,2
is the left. at
is where we draw the tick marks.labels
are how we label the tick marks.- The
tcl
argument specifies how long to make the tick marks and in what direction. A positive value extends the tick marks into the plotting region. - The
las
argument in the secondaxis
function makes the labels on the y-axis horizontal.
Finally I used the mtext
function to create the axis labels. mtext
writes text into the margins of a plot and can take some trial and error to get the placement of text just so.
- The
text
argument is what text we want to place in the graph. In this case I make use of theexpression
function which allows us to create mathematical notation. For example, the syntaxexpression(italic(F[x](x)))
returns \(F_x (x)\) - The
side
argument again refers to where in the plot to place the text.3
is top and4
is right. This means the y-axis label is actually in the top of the plot and the x-axis label is on the right. A little bit of a hack. at
says where to place the text along the axis parallel to the margin. In both cases we use 0. We want the y-axis label at the 0 point corresponding to the x-axis, and the x-axis label at the 0 point corresponding to the y-axis. A little confusing, I think.- The
las
argument rotates the x label to be horizontal - The
line
argument specifies on which margin line to place the text, starting at 0 counting outwards. This is one that takes some trial and error to get just right. - The
family
argument specifies the type of font to use. “serif” is like Times New Roman.
Not perfect, but close enough. Of course I much prefer the R defaults when it comes to plotting layout. Even though R allows us to recreate this graph, I don't think it's necessarily a “good” graph.
I also decided to tackle this using ggplot. Here's how far I got.
library(ggplot2)
dat <- data.frame(x = 1:15, y = pgeom(q = 0:14, prob = 0.3))
dat2 <- data.frame(x = 0:15, y = c(0, dat$y[-16]), xend = 1:16, yend = c(0,dat$y[-16]))
ggplot(dat, aes(x = x, y = y)) + geom_point() +
geom_segment(aes(x = x, y = y, xend = xend, yend = yend), data = dat2) +
scale_x_continuous(breaks = 0:15, labels = 0:15) +
scale_y_continuous(breaks = seq(0,1,0.1), labels = c(0,paste(".",1:9),1)) +
labs(x = "x", y = expression(italic(F[x](x)))) +
theme(panel.background = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
axis.title.x = element_text(face = "italic", hjust = 1),
axis.title.y = element_text(face = "italic", hjust = 1, angle = 0))
You can see I couldn't figure out how to move the axis ticks into the plotting region, or how to place axis labels at the ends of the axis, or how to get the origin to start at precisely (0, 0). I'm not saying it can't be done, just that I lost interest in trying to go further. And a very strong argument can be made that these are things you shouldn't do anyway! But as I said at the outset, this was all for fun.