Category Archives: Continuous distributions

A Note on Boxplots in R

As a statistical consultant I frequently use boxplots. They're a great way to quickly visualize the distribution of a continuous measure by some grouping variable. More often than not, however, the person I'm helping doesn't regularly use boxplots (if at all) and is not sure what to make of them.

Fortunately, boxplots are pretty easy to explain. The line in the middle of the box is the median. The box itself represents the middle 50% of the data. The box edges are the 25th and 75th percentiles. But what about the whiskers? That seems to be the part that trips people up.

Before we go further, let's make some boxplots in R:

set.seed(9)
y <- rnorm(n = 100, mean = rep(c(22:25),each=25),sd = rep(c(5,8),each=50))
x <- gl(n=4,k=25)
boxplot(y ~ x)

boxplots of y for the four groups

Above I generate 100 random normal values, 25 each from four distributions: N(22,5), N(23,5), N(24,8) and N(25,8). Then I generate a 4-level grouping variable. Finally I make the boxplot.

The black lines in the “middle” of the boxes are the median values for each group. For group 1, that appears to be a shade above 20. The vertical size of each box is the interquartile range, or IQR. It measures the spread of the data, sort of like standard deviation. The tops and bottoms of the boxes are referred to as “hinges”. We see groups 1 and 2 have less variability than groups 3 and 4, which makes sense given the way we generated the data. The IQR for group 1 looks to be about 5 judging from the height of the box. Again this makes sense given group 1 data was randomly generated from a normal distribution with a standard deviation of 5.

And now we come to the “whiskers”, or the flattened arrows extending out of the box. What do they represent and how are they calculated? They represent the reasonable extremes of the data. That is, these are the minimum and maximum values that do not exceed a certain distance from the middle 50% of the data. What is that certain distance? By default in R, it's \(1.5 \times IQR\). If no points exceed that distance, then the whiskers are simply the minimum and maximum values. That's the case in group 4. If there are points beyond that distance, the largest point that does not exceed that distance becomes the whisker. And the points beyond the distance are plotted as single points.
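If you want to verify the whisker rule for yourself, here's a rough sketch using the y and x generated above. (One caveat: R computes the box hinges with fivenum, which can differ slightly from the 25th and 75th percentiles returned by quantile.)

# check the default whisker rule for group 1
y1 <- y[x == "1"]
h <- fivenum(y1)[c(2, 4)]              # lower and upper hinges
iqr <- diff(h)                         # interquartile range
lo <- min(y1[y1 >= h[1] - 1.5 * iqr])  # lower whisker
hi <- max(y1[y1 <= h[2] + 1.5 * iqr])  # upper whisker
c(lo, hi)
boxplot.stats(y1)$stats                # first and fifth values are the plotted whiskers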

For groups 1 through 3, there appear to be single points beyond the whiskers. (I say “appear to be single” because technically we could have overplotting.) We might think of these as outliers, data points that are too big or too small compared to the rest of the data. Group 4 does not appear to have outliers.

When you create a boxplot in R, you can actually create an object that contains the plotted data. Just call boxplot as you normally would and save the result to a variable. Setting plot = F returns the numbers without drawing the plot again.

bp <- boxplot(y ~ x, plot = F)
bp
## $stats
##          [,1]     [,2]      [,3]     [,4]
## [1,] 16.06564 12.90309  8.300651 12.05522
## [2,] 18.53334 18.90152 19.281150 19.14307
## [3,] 20.75958 21.98459 25.924704 23.00494
## [4,] 24.18153 24.84778 27.721310 30.84133
## [5,] 31.22629 31.65432 38.016463 36.66171
## 
## $n
## [1] 25 25 25 25
## 
## $conf
##          [,1]     [,2]     [,3]     [,4]
## [1,] 18.97475 20.10557 23.25761 19.30830
## [2,] 22.54441 23.86361 28.59179 26.70159
## 
## $out
## [1]  8.911472 36.409950 41.843672
## 
## $group
## [1] 1 2 3
## 
## $names
## [1] "1" "2" "3" "4"

We see the bp object is a list with 6 different elements. The first element, stats, contains the plotted points in each group. So looking at column 1, we see that the bottom and top whiskers are 16.0656374 and 31.226286, the 25th and 75th percentiles are 18.5333406 and 24.1815345, and the median is 20.759577.

The out element of the bp object contains the outliers while the group element contains the outliers' respective groups.
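A quick way to see which outlier goes with which group is to split the out element by the group element:

# match each outlier to its group
split(bp$out, bp$group)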

If you want to change how whiskers are determined, you can change the range argument in the boxplot function. The default setting is 1.5. Here we try it set to 1:

boxplot(y ~ x, range=1)

boxplots with range = 1

Now we have more “outliers” for groups 1 – 3 but still none for group 4. Obviously the “outlier” classification is subjective. You'll also notice the size of the boxes didn't change. They will always capture the middle 50% of the data no matter the value of range. If we set range to 0, then the whiskers will extend to the minimum and maximum values for each group. In that case you get a plot of what is known as Tukey's five-number summary: minimum, 25th percentile, median, 75th percentile and the maximum. In fact there's a function in R to calculate the five-number summary called fivenum. Here's how we can use it on our data:

tapply(y,x,fivenum)
## $`1`
## [1]  8.911472 18.533341 20.759577 24.181534 31.226286
## 
## $`2`
## [1] 12.90309 18.90152 21.98459 24.84778 36.40995
## 
## $`3`
## [1]  8.300651 19.281150 25.924704 27.721310 41.843672
## 
## $`4`
## [1] 12.05522 19.14307 23.00494 30.84133 36.66171
boxplot(y ~ x, range=0)

boxplots with range = 0

A nice complement to boxplots is the stripchart, a one-dimensional scatter plot of the data. Here's how it works "out of the box":

stripchart(y ~ x)

default stripchart of y by x

I usually like to rotate stripcharts and use the familiar open circles with some “jitter”:

stripchart(y ~ x, vertical = TRUE, pch=1, method="jitter")

vertical stripchart of y by x with jitter

Plotting stripcharts and boxplots side-by-side can be useful to visualize the spread and distribution of data. You can also calculate means and medians and add them with the points function:

op <- par(mfrow=c(1,2))
means <- tapply(y,x,mean) # calculate means
medians <- tapply(y,x,median) # calculate medians
boxplot(y ~ x, ylab="y")
points(x = 1:4,y=means,pch=19) # add means
stripchart(y ~ x, vertical = TRUE, pch=1, method="jitter")
points(x = 1:4,y=means,pch=19, col="blue") # add means
points(x = 1:4,y=medians,pch=19, col="red") # add medians

boxplot and stripchart side-by-side with means and medians added

par(op) # reset graphics parameters

Understanding Z-scores

The first time I took a statistics class I was mystified with z-scores. I understood how to find a z-score and look up the corresponding probability in a Standard Normal Distribution table. But I had no idea why I did it. I was just following orders. In this post I hope to explain the z-score and why it’s useful.

The first thing to understand about a z-score is that it’s simply a transformation, just like transforming inches to feet. Let’s say you and I each make our own paper airplane, throw them and measure the distance traveled. Mine flies 60 inches. Yours flies 78 inches. Yours flew 18 inches further than mine. We can transform inches to feet by multiplying by \( \frac{1}{12} \). \( 18 \times \frac{1}{12} = 1.5 \). Your airplane flew 1.5 feet further. A z-score is pretty much the same idea. You take an observation from a Normal distribution, measure how far it is from the mean of the distribution, and then convert that distance to a number of standard deviations. A quick example: I observe a measure of 67 inches from a Normal distribution with a mean of 69 and a standard deviation of 4. The z-score is calculated as \( z = \frac{67-69}{4} = -0.5\). That tells me my observation is half a standard deviation away from the mean. The negative tells me it’s less than the mean, which you knew before even doing the calculation. Also, notice there are no units. A z-score doesn’t have units like inches or kilograms. It’s just the number of standard deviations.
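Here’s that little calculation in R, just to make it concrete:

# z-score for an observation of 67 from a Normal distribution with mean 69 and sd 4
(67 - 69) / 4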

So that’s your z-score: distance expressed as number of standard deviations. By itself it’s kind of interesting and informative. But it’s rare you stop with the z-score. Usually in statistics the z-score is the first step toward finding a probability. For example, let’s say a candy maker produces mints that have a label weight of 20.4 grams. Assume that the distribution of the weights of these mints is Normal with a mean of 21.37 and a standard deviation of 0.4. If we select a mint at random off the production line, what is the probability it weighs more than 22.07 grams? This is a classic statistics problem. I pulled this one from Probability and Statistical Inference 7th ed. by Hogg and Tanis.

What this problem boils down to is finding the area of the red region below:

The probability of picking a piece of candy that weighs more than 22.07 grams is equal to the red area under the Normal (21.37, 0.4) distribution curve above. Nowadays this is easily done with computers and calculators. In Excel, you simply enter =1-NORMDIST(22.07,21.37,0.4,TRUE) to find the probability as 0.040059. In R, you do 1-pnorm(22.07,mean=21.37,sd=0.4). However, not so long ago, finding this area was no easy task. Mathematically speaking, finding the area means solving the following integral:

$$ \frac{1}{0.4\sqrt{2\pi}}\int_{22.07}^{\infty}\exp\left[-(x-21.37)^{2}/\left(2(0.4)^{2}\right)\right]dx $$

Believe me when I say this is not an easy calculus problem to solve. There is no closed form solution for it. The answer has to be approximated using a power series.

Clever statisticians, however, realized that every Normal distribution can be transformed to a Standard Normal distribution using z-scores. That is, all z-scores come from a Normal distribution with mean 0 and standard deviation 1. So what they did was create a table with approximate probabilities for z-scores ranging from 0 to something like 3.49. (To see one of these tables, Google “Z-score table”). This allowed statisticians (and students) to easily solve problems like the candy problem above by transforming the value of interest to a z-score and then looking up the probability in a chart.

For our problem above, we would find \( z = \frac{22.07-21.37}{0.4} = 1.75\). Now the problem is to find the probability of exceeding 1.75 in a Normal distribution with mean 0 and standard deviation 1. We would then look up 1.75 in a table and see the resulting probability. Usually the probability (or area under the curve) was calculated to the left of the z-score. For our problem we need area to the right of the z-score. If the table gave area to the left, we would simply subtract from 1. In a typical table, the probability for 1.75 is given as 0.9599. That’s area to the left. The area to the right is 1 - 0.9599 = 0.0401.
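If you’d rather let R do the work, here’s the candy example computed a few equivalent ways:

# P(weight > 22.07) three ways
1 - pnorm(22.07, mean = 21.37, sd = 0.4)  # work with the original distribution
1 - pnorm(1.75)                           # standardize first, then use N(0,1)
pnorm(1.75, lower.tail = FALSE)           # same thing without the subtraction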

For many years, that was the TRUE value of a z-score. It allowed you to calculate areas under the normal curve by giving you a link to a table of pre-calculated values. I think most statistics classes still teach this method of finding probabilities, or at least mention it. But truth be told it’s obsolete. The math is now easily handled by computer. So while the z-score is still a useful descriptive measure of how far an observation lies from the mean of the Normal distribution from which it came, it’s no longer needed as it was in the past to find the area under the Normal curve.

So there you go, the story behind z-scores. Hopefully this post added to your understanding. If anything, I hope you appreciate what they meant to statisticians before modern computing. They truly were the only way to calculate probabilities based on a Normal distribution.

Understanding qq-plots

qq-plot is short for quantile-quantile plot. You plot one set of quantiles against another and you see if their coordinate pairs form a straight line. One of the quantiles is your sample observations placed in ascending order. For example, you take the height of 30 people and place them in order from smallest to largest. That will be your y-axis values. The other quantile will be the quantiles of a probability distribution that you believe your sample may have come from. For example, you may think your heights are Normally distributed. So your x-axis values would be the quantiles of a standard Normal distribution. Since you have 30 heights, you’ll need 30 Normal quantiles. The way that’s done is to find the quantiles for \( \frac{1}{31} , \frac{2}{31} ,\dots, \frac{30}{31} \). In other words you solve for z in \( P(Z \le z) = \frac{1}{31}\). That turns out to be -1.85. And then you do the same for all the fractions. Finally you plot your heights versus the z-values and see if they’re in a somewhat straight line. If they are, then you can be pretty safe in your assumption that your sample comes from a Normal distribution.

For Normal distributions this can be made intuitive by recalling the standard Normal transformation: \( z = \frac{x-\mu}{\sigma}\). That’s the formula we use to create z-scores, which allow us to use the standard normal distribution to find probabilities. Solving for x we get \( x = z\sigma + \mu\). You may notice that looks just like the formula for a straight line but with different variables. Remember this one from algebra? \( y = mx + b\). That’s what we have but with the mean (\( \mu\)) serving as the intercept and the standard deviation (\( \sigma\)) as the slope. And instead of y and x, we have x and z. If we wanted to make a straight line, we would plug in our z-values into \( x = z\sigma + \mu\) and find our x values. That would give us a set of coordinate pairs that we could plot and then connect the dots to make a straight line.

Now of course we don’t know \( \mu\) and \( \sigma\). We have to estimate them from our sample. Furthermore we don’t know if our sample came from a Normal distribution. But when constructing a Normal qq-plot, we assume our sample did come from a Normal distribution. We have our straight line formula \( x = z\sigma + \mu\) and our x values. If our x values came from a Normal distribution, then they would have corresponding z-scores. We don’t know what they are (because we don’t know \( \mu\) and \( \sigma\)), but we can estimate them with the quantiles of a standard normal distribution, with the number of quantiles equal to our sample size. Let’s do this in R. We’ll generate a random sample of 30 observations from a Normal distribution with mean = 10 and standard deviation = 2.

# generate random sample from N(10,2)
x <- sort(rnorm(30,10,2))
k <- (1:30)/31
z <- qnorm(k)
plot(z,x,main='Normal qq-plot')

That gives us the following plot:

The points look like they're on a straight line, as they should. R has a built-in function for this called qqnorm. All you have to do is type:

qqnorm(x)

And you get:

Which looks like what we did the first time but with axis labels.

If we want we can draw a straight line through the plot to help us gauge just how close to a straight line the points lie. One choice is to draw the least-squares line. Another is to use the formula \( x = z\sigma + \mu\) using our mean and standard deviation estimates for \( \mu\) and \( \sigma\) . In other words plot the line \( x = zs + \overline{x}\). Let's do both:

# create qq-plot with straight lines
par(mfrow=c(1,2))
plot(z,x,main='Normal qq-plot')
abline(lm(x~z)) # draw best fitting line
plot(z,x,main='Normal qq-plot')
abline(a=mean(x),b=sd(x)) # draw line with slope = std dev and intercept = x-bar

Here's the plot:

It's really hard to tell the difference between the two, but the points seem to lie pretty close to both straight lines.

So that's how Normal qq-plots work. But keep in mind that qq-plots are not just for assessing Normality. You can assess whether or not a sample came from, say, a chi-square distribution or a gamma distribution. Let's generate 30 points from a chi-square distribution with 4 degrees of freedom, and then construct two qq-plots: one with chi-square quantiles and the other with Normal quantiles.

# generate random sample from chi-sq(df=4)
c <- sort(rchisq(30,df=4))
# compare normal and chi-square qq-plots
k <- (1:30)/31
q <- qnorm(k)
cq <- qchisq(k,df=4)
par(mfrow=c(1,2))
plot(cq,c,main='chi-square qq-plot')
abline(lm(c~cq)) # draw best fitting line
plot(q,c,main='Normal qq-plot')
abline(lm(c~q)) # draw best fitting line

This gives us the following plots:

We see in the left plot our two quantiles lying pretty close to a straight line. They should. That's the chi-square qq-plot and our data are from a chi-square distribution. On the right however we see a fit that's not so good, especially in the extremes. That's our Normal qq-plot. Since our data come from a chi-square distribution, which is skewed right, it makes sense that the Normal qq-plot would show large deviations from a straight line in the "tails" of the plot.

Explaining and simulating an F distribution

I remember feeling very uncomfortable with the F distribution when I started learning statistics. It comes up when you’re learning ANOVA, which is rather complicated if you’re learning it for the first time. You learn about partitioning the sum of squares, determining degrees of freedom, calculating mean square error and mean square between-groups, putting everything in a table and finally calculating something called an F statistic. Oh, and don’t forget the assumptions of constant variance and normal distributions (although ANOVA is pretty robust to those violations). There’s a lot to take in! And that’s before you calculate a p-value. And how do you do that? You use an F distribution, which when I learned it meant I had to flip to a massive table in the back of my stats book. Where did that table come from and why do you use it for ANOVA? I didn’t know, I didn’t care. It was near the end of the semester and I was on overload.

I suspect this happens to a lot of people. You work hard to understand the normal distribution, hypothesis testing, confidence intervals, etc. throughout the semester. Then you get to ANOVA sometime in late November (or April) and it just deflates you. You will yourself to understand the rules of doing it, but any sort of intuition for it escapes you. You just want to finish the class.

This post attempts to provide an intuitive explanation of what an F distribution is with respect to one-factor ANOVA. Maybe someone will read this and benefit from it.

ANOVA stands for ANalysis Of VAriance. We’re analyzing variances to determine if there is a difference in means between more than 2 groups. In one-factor ANOVA we’re using one factor to determine group membership. Maybe we’re studying a new drug and create three groups: group A on 10 mg, group B on 5 mg, and group C on placebo. At the end of the study we measure some critical lab value. We take the mean of that lab value for the three groups. We wish to know if there is a difference in the means between those three population groups (assuming each group contains a random sample from three very large populations).

The basic idea is to estimate two variances and form a ratio of those variances. If the variances are about the same, the ratio will be close to 1 and we have no evidence that the population means differ based on our sample.

One variance is simply the mean of the group variances. Say you have three groups. Take the variance of each group and then find the average of those variances. That’s called the mean square error (MSE). In symbol-free mathematical speak, for three groups we have:

MSE = [variance(group A) + variance(group B) + variance(group C)] / 3

The other variance is the variance of the sample means multiplied by the number of items in each group (assuming equal sample sizes in each group). That’s called mean square between groups (MSB). It looks something like this:

MSB = variance(group means)*(n)

The F statistic is the ratio of these two variances: F = MSB/MSE
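To convince yourself these really are the usual ANOVA mean squares, here is a quick standalone sketch (the data below are made up just for illustration); the hand-rolled F matches what R reports:

# sanity check: hand-rolled MSE, MSB and F versus R's one-way ANOVA
set.seed(1)
grp  <- gl(3, 20)                          # three groups of 20 observations
vals <- rnorm(60, mean = 10, sd = 4)       # all groups drawn from the same population
mse  <- mean(tapply(vals, grp, var))       # average of the group variances
msb  <- 20 * var(tapply(vals, grp, mean))  # n times the variance of the group means
c(byHand = msb/mse, anova = anova(lm(vals ~ grp))[["F value"]][1])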

Now if the groups have the same means and same variance and are normally distributed, the F statistic has an F distribution. In other words, if we were to run our experiment hundreds and hundreds of times on three groups with the same mean and variance from a normal distribution, calculate the F statistic each time, and then make a histogram of all our F statistics, our histogram would have a shape that can be modeled with an F distribution.

That’s something we can do in R with the following code:

# generate 4000 F statistics for 5 groups with equal means and variances
R <- 4000
Fstat <- vector(length = R)
for (i in 1:R){
  g1 <- rnorm(20,10,4)
  g2 <- rnorm(20,10,4)
  g3 <- rnorm(20,10,4)
  g4 <- rnorm(20,10,4)
  g5 <- rnorm(20,10,4)
  mse <- (var(g1)+var(g2)+var(g3)+var(g4)+var(g5))/5
  M <- (mean(g1)+mean(g2)+mean(g3)+mean(g4)+mean(g5))/5
  msb <- ((((mean(g1)-M)^2)+((mean(g2)-M)^2)+((mean(g3)-M)^2)+((mean(g4)-M)^2)+((mean(g5)-M)^2))/4)*20
  Fstat[i] <- msb/mse
}
# plot a histogram of F statistics and superimpose F distribution (4, 95)
ylim <- (range(0, 0.8))
x <- seq(0,6,0.01)
hist(Fstat,freq=FALSE, ylim=ylim)
curve(df(x,4,95),add=T) # 5 - 1 = 4; 100 - 5 = 95

So I have 5 normally distributed groups each with a mean of 10 and standard deviation of 4. I draw a random sample of 20 from each and calculate the MSE and MSB as I outlined above. I then calculate the F statistic. And I repeat 4000 times. Finally I plot a histogram of my F statistics and superimpose a theoretical F distribution on top of it. I get the resulting figure:

This is the distribution of the F statistic when the group samples come from Normal populations with identical means and variances. The smooth curve is an F distribution with 4 and 95 degrees of freedom. The 4 is Number of Groups - 1 (or 5 - 1). The 95 is from Total Number of Observations - Number of Groups (or 100 - 5). If we examine the figure we see that we most likely get an F statistic around 1. The bulk of the area under the curve is between 0.5 and 1.5. This is the distribution we would use to find a p-value for any experiment that involved 5 groups of 20 members each. When the population means of the groups are different, we get an F statistic greater than 1. The bigger the differences, the larger the F statistic. The larger the F statistic, the more confident we are that the population means between the 5 groups are not the same.
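To actually get a p-value, we find the area under that curve to the right of our observed F statistic. For example, for a hypothetical F statistic of 2.5 from a 5-group, 20-per-group experiment:

# area to the right of a hypothetical F statistic of 2.5
pf(2.5, df1 = 4, df2 = 95, lower.tail = FALSE)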

There is much, much more that can be said about ANOVA. I'm not even scratching the surface. But what I wanted to show here is a visual of the distribution of the F statistic. Hopefully it'll give you a better idea of what exactly an F distribution is when you're learning ANOVA for the first time.

The moment generating function of the gamma

The moment-generating function of the gamma distribution is \( M(t) = (1-\theta t)^{-\alpha}\). Such a friendly little guy. Not what you would expect when you start with this:

$$ M(t) = \int_{0}^{\infty}e^{tx}\frac{1}{\Gamma (\alpha)\theta^{\alpha}}x^{\alpha - 1}e^{-x/\theta}dx$$

How do we get there? First let’s combine the two exponential terms and move the gamma fraction out of the integral:

$$ M(t) = \frac{1}{\Gamma (\alpha)\theta^{\alpha}}\int_{0}^{\infty}x^{\alpha - 1}e^{tx-x/\theta}dx$$

Multiply \( tx\) in the exponential by \( \frac{\theta}{\theta} = 1\), add the two terms together and factor out \( -x\):

$$ M(t) = \frac{1}{\Gamma (\alpha)\theta^{\alpha}}\int_{0}^{\infty}x^{\alpha - 1}e^{-x(1-\theta t)/\theta}dx$$

Now we’re ready to do a substitution in the integral. Let \( u = \frac{x(1-\theta t)}{\theta}\). That means we have \( du = \frac{1-\theta t}{\theta}dx\). Now solve those for x and dx, respectively. We get \( x = \frac{\theta u}{1-\theta t}\) and \( dx = \frac{\theta}{1-\theta t}du\). Now substitute those into the integral to see something that looks truly awful:

$$ M(t) = \frac{1}{\Gamma (\alpha)\theta^{\alpha}}\int_{0}^{\infty}\left(\frac{\theta u}{1-\theta t}\right)^{\alpha - 1}e^{-u}\frac{\theta}{1-\theta t}du$$

If we re-write everything inside the integral with exponents we can do some cancellation and make this integral more attractive:

$$ M(t) = \frac{1}{\Gamma (\alpha)\theta^{\alpha}}\int_{0}^{\infty}(1-\theta t)^{-\alpha + 1}(1-\theta t)^{-1}(\theta u)^{\alpha-1}\theta e^{-u}du$$

Using the rules of exponents we see that \( (1-\theta t)^{-\alpha + 1}(1-\theta t)^{-1} = (1- \theta t)^{-\alpha}\) which can be moved out of the integral since it doesn’t have \( u\) in it:

$$ M(t) = \frac{(1- \theta t)^{-\alpha}}{\Gamma (\alpha)\theta^{\alpha}}\int_{0}^{\infty}(\theta u)^{\alpha-1}\theta e^{-u}du$$

Using the rules of exponents again we see that \( \theta^{\alpha -1}\theta = \theta^{\alpha}\):

$$ M(t) = \frac{(1- \theta t)^{-\alpha}}{\Gamma (\alpha)\theta^{\alpha}}\int_{0}^{\infty}\theta^{\alpha}u^{\alpha - 1} e^{-u}du$$

Now notice we can cancel out the two \( \theta^{\alpha}\) terms:

$$ M(t) = \frac{(1- \theta t)^{-\alpha}}{\Gamma (\alpha)}\int_{0}^{\infty}u^{\alpha - 1} e^{-u}du$$

The integral is now the gamma function: \( \Gamma (\alpha) = \int_{0}^{\infty}u^{\alpha - 1} e^{-u}du\). Make that substitution:

$$ M(t) = \frac{(1- \theta t)^{-\alpha}}{\Gamma (\alpha)}\Gamma(\alpha)$$

Cancel out the \( \Gamma(\alpha)\) terms and we have our nice-looking moment-generating function:

$$ M(t) = (1- \theta t)^{-\alpha}$$

If we take the derivative of this function and evaluate at 0 we get the mean of the gamma distribution:

$$ M'(t) = -\alpha (1-\theta t)^{-\alpha -1}(-\theta)$$

$$ M'(0) = -\alpha(1 - 0)^{-\alpha -1}(-\theta) = \alpha\theta = E(X)$$

Recall that \( \theta\) is the mean time between events and \( \alpha\) is the number of events. Multiply them together and you have the mean. This makes sense. If the mean wait time between events is 2 minutes (\( \theta = 2\)) and we’re interested in the distribution of times until 4 events occur (\( \alpha = 4\)), it seems reasonable to think the mean wait time to observe the 4th event would be 8 minutes (\( 4(2)\)).

If we take the second derivative of the moment-generating function and evaluate at 0, we get the second moment about the origin which we can use to find the variance:

$$ M''(t) = (-\alpha - 1)\theta\alpha(1-\theta t)^{-\alpha - 2}(-\theta)$$

$$ M''(0) = (-\alpha - 1)\theta\alpha(1-0)^{-\alpha - 2}(-\theta) = (-\alpha - 1)(-\theta^{2})\alpha$$

Now find the variance:

$$ Var(X) = (-\alpha - 1)(-\theta^{2})\alpha - (\alpha\theta)^{2}$$

$$ Var(X) = \alpha^{2}\theta^{2} + \alpha\theta^{2} - \alpha^{2}\theta^{2}$$

$$ Var(X) = \alpha\theta^{2}$$

Going back to our example with \( \alpha = 4\) (number of events) and \( \theta=2\) (mean time between events), we have as our variance \( 4(2^{2})=16\). The square root of that gives our standard deviation as 4.
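If you want a quick sanity check on those two results, here’s a small simulation sketch: generate many waiting times as the sum of 4 exponential wait times with a mean of 2 and look at their mean and variance.

# simulate waiting times until the 4th event, with a mean of 2 minutes between events
set.seed(1)
w <- replicate(10000, sum(rexp(4, rate = 1/2)))
mean(w)  # should be close to alpha * theta = 8
var(w)   # should be close to alpha * theta^2 = 16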

Plugging the shape (\( \alpha = 4\)) and scale (\( \theta = 2\)) parameters into the dgamma function in R, we can plot this particular gamma distribution:

xval <- seq(0,30, by = 0.01)
ga.4 <- dgamma(xval, shape=4, scale=2)
plot(xval,ga.4,col=1, type = "l", ylab="f(x)", xlab="x")
title(main="Gamma Distribution, scale = 2, shape = 4")

gamma with scale = 2 and shape = 4

Deriving the gamma distribution

Let’s take a look at the gamma distribution:

$$ f(x) = \frac{1}{\Gamma(\alpha)\theta^{\alpha}}x^{\alpha-1}e^{-x/\theta}$$

I don’t know about you, but I think it looks pretty horrible. Typing it up in LaTeX is no fun and gave me second thoughts about writing this post, but I’ll plow ahead.

First, what does it mean? One interpretation of the gamma distribution is that it’s the theoretical distribution of waiting times until the \( \alpha\)-th change for a Poisson process. In another post I derived the exponential distribution, which is the distribution of times until the first change in a Poisson process. The gamma distribution models the waiting time until the 2nd, 3rd, 4th, 38th, etc, change in a Poisson process.

As we did with the exponential distribution, we derive it from the Poisson distribution. Let W be the random variable that represents waiting time. Its cumulative distribution function then would be

$$ F(w) = P(W \le w) = 1 - P(W > w)$$

But notice that \( P(W > w)\) is the probability of fewer than \( \alpha\) changes in the interval [0, w]. The probability of that in a Poisson process with mean \( \lambda w\) is

$$ = 1 - \sum_{k=0}^{\alpha-1}\frac{(\lambda w)^{k}e^{- \lambda w}}{k!}$$

To find the probability distribution function we take the derivative of \( F(w)\). But before we do that we can simplify matters a little by expanding the summation to two terms:

$$ = 1 - \frac{(\lambda w)^{0}e^{-\lambda w}}{0!}-\sum_{k=1}^{\alpha-1}\frac{(\lambda w)^{k}e^{- \lambda w}}{k!} = 1 - e^{-\lambda w}-\sum_{k=1}^{\alpha-1}\frac{(\lambda w)^{k}e^{- \lambda w}}{k!}$$

Why did I know to do that? Because my old statistics book did it that way. Moving on…

$$ F'(w) = 0 - e^{-\lambda w}(-\lambda)-\sum_{k=1}^{\alpha-1}\frac{k!\left[(\lambda w)^{k}e^{- \lambda w}(-\lambda)+e^{- \lambda w}k( \lambda w)^{k-1}\lambda\right]}{(k!)^{2}}$$

After lots of simplifying…

$$ = \lambda e^{-\lambda w}+\lambda e^{-\lambda w}\sum_{k=1}^{\alpha-1}[\frac{(\lambda w)^{k}}{k!}-\frac{(\lambda w)^{k-1}}{(k-1)!}]$$

And we’re done! Technically we have the gamma probability distribution. But it’s a little too bloated for mathematical taste. And of course it doesn’t match the form of the gamma distribution I presented in the beginning, so we have some more simplifying to do. Let’s carry out the summation for a few terms and see what happens:

$$ \sum_{k=1}^{\alpha-1}[\frac{(\lambda w)^{k}}{k!}-\frac{(\lambda w)^{k-1}}{(k-1)!}] = $$

$$ [\lambda w - 1] + [\frac{(\lambda w)^{2}}{2!}-\lambda w]+[\frac{(\lambda w)^{3}}{3!} - \frac{(\lambda w)^{2}}{2!}] +[\frac{(\lambda w)^{4}}{4!} - \frac{(\lambda w)^{3}}{3!}] + \ldots$$

$$ + [\frac{(\lambda w)^{\alpha - 2}}{(\alpha - 2)!} - \frac{(\lambda w)^{\alpha - 3}}{(\alpha - 3)!}] + [\frac{(\lambda w)^{\alpha - 1}}{(\alpha - 1)!} - \frac{(\lambda w)^{\alpha - 2}}{(\alpha - 2)!}] $$

Notice that everything cancels except the -1 and the final positive term, \( \frac{(\lambda w)^{\alpha -1}}{(\alpha -1)!}\), so we’re left with

$$ = -1 + \frac{(\lambda w)^{\alpha -1}}{(\alpha -1)!}$$

Plugging that back into our expression for \( F'(w)\) gives us

$$ = \lambda e^{-\lambda w} + \lambda e^{-\lambda w}[-1 + \frac{(\lambda w)^{\alpha -1}}{(\alpha -1)!}]$$

This simplifies to

$$ =\frac{\lambda (\lambda w)^{\alpha -1}}{(\alpha -1)!}e^{-\lambda w}$$

Now that’s a lean formula, but still not like the one I showed at the beginning. To get the “classic” formula we do two things:

  1. Let \( \lambda = \frac{1}{\theta}\), just as we did with the exponential
  2. Use the fact that \( (\alpha -1)! = \Gamma(\alpha)\)

Doing that takes us to the end:

$$ = \frac{\frac{1}{\theta}(\frac{w}{\theta})^{\alpha-1}}{\Gamma(\alpha)}e^{-w/\theta} = \frac{1}{\theta}(\frac{1}{\theta})^{\alpha -1}w^{\alpha - 1}\frac{1}{\Gamma(\alpha)}e^{-w/\theta} = (\frac{1}{\theta})^{\alpha}\frac{1}{\Gamma(\alpha)}w^{\alpha -1}e^{-w/\theta}$$

$$ = \frac{1}{\Gamma(\alpha)\theta^{\alpha}}w^{\alpha-1}e^{-w/\theta}$$

We call \( \alpha \) the shape parameter and \( \theta \) the scale parameter because of their effect on the shape and scale of the distribution. Holding \( \theta\) (scale) at a set value and trying different values of \( \alpha\) (shape) changes the shape of the distribution (at least when you go from \( \alpha =1\) to \( \alpha =2\)):

gamma - hold scale, change shape

Holding \( \alpha\) (shape) at a set value and trying different values of \( \theta\) (scale) changes the scale of the distribution:

gamma - hold shape, change scale

In the applied setting \( \theta\) (scale) is the mean wait time between events and \( \alpha\) is the number of events. If we look at the first figure above, we’re holding the wait time at 1 and changing the number of events. We see that the probability of waiting 5 minutes or longer increases as the number of events increases. This is intuitive as it would seem more likely to wait 5 minutes to observe 4 events than to wait 5 minutes to observe 1 event, assuming a one minute wait time between each event. The second figure holds the number of events at 4 and changes the wait time between events. We see that the probability of waiting 10 minutes or longer increases as the time between events increases. Again this is pretty intuitive as you would expect a higher probability of waiting more than 10 minutes to observe 4 events when there is a mean wait time of 4 minutes between events versus a mean wait time of 1 minute.
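If you want to draw figures like the two described above, here’s a sketch with dgamma; the particular shape and scale values (1 through 4) are just guesses at what was plotted:

# hold scale at 1 and vary shape, then hold shape at 4 and vary scale
xval <- seq(0, 30, by = 0.01)
par(mfrow = c(1, 2))
plot(xval, dgamma(xval, shape = 1, scale = 1), type = "l", xlab = "x", ylab = "f(x)",
     main = "scale = 1; shape = 1 to 4")
for (a in 2:4) lines(xval, dgamma(xval, shape = a, scale = 1), lty = a)
plot(xval, dgamma(xval, shape = 4, scale = 1), type = "l", xlab = "x", ylab = "f(x)",
     main = "shape = 4; scale = 1 to 4")
for (s in 2:4) lines(xval, dgamma(xval, shape = 4, scale = s), lty = s)
par(mfrow = c(1, 1))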

Finally notice that if you set \( \alpha = 1\), the gamma distribution simplifies to the exponential distribution.
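Here’s a quick numerical check of that last claim:

# a gamma with shape = 1 is just an exponential
xs <- 0:5
all.equal(dgamma(xs, shape = 1, scale = 2), dexp(xs, rate = 1/2))  # TRUE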

Update 5 Oct 2013: I want to point out that \( \alpha \) and \( \theta \) can take continuous values like 2.3, not just integers. So what I’ve really derived in this post is the relationship between the gamma and Poisson distributions.

The forgetful exponential distribution

The exponential distribution has the quirky property of having no memory. Before we wade into the math and see why, let’s consider a situation where there is memory: drawing cards. Let’s say you have a well-shuffled deck of 52 cards and you draw a single card. What’s the probability of drawing an ace? Since there are 4 aces in a deck of 52 cards, the probability is \( \frac{4}{52}\). We draw our card and it’s not an ace. We set the card aside, away from the deck, and draw again. Now our probability of drawing an ace is \( \frac{4}{51}\). We have a slightly better chance on the 2nd draw. The condition that we have already selected a card that wasn’t an ace changes the probability we draw an ace. This doesn’t happen with the exponential distribution.

Let’s say we have a state-of-the-art widget (version 2.0) that has a lifespan that can be described with an exponential distribution. Further, let’s say the mean lifespan is 60 months, or 5 years. Thanks to the “no memory” property, the probability of the widget lasting another 7 years is the same whether it is brand new or 5 years old. In math words:

$$ P(X > 7 + 5 | X>5) = P(X>7)$$

That means if I bought a widget that was 5 years old, it has the same probability of lasting another 7 years as a brand new widget has for lasting 7 years. Not realistic but certainly interesting. Showing why this is the case is actually pretty straight-ahead.

We want to show that for the exponential distribution, \( P(X > y + x | X > x) = P(X > y)\).

Recall the cumulative distribution of an exponential distribution is \( P(X \le x)=F(x) = 1 - e^{-x/\theta}\). That’s the probability of an event occurring before a certain time x. The complement of the cumulative distribution is the probability of an event occurring after a certain time:

$$ P(X > x) = 1 - P(X \le x) = 1 - (1 - e^{-x/ \theta} ) = e^{-x/ \theta}$$

Also recall the definition of conditional probability: \( P(A |B) = \frac{P(A \cap B)}{P(B)}\)

Let’s plug into the equality we want to prove and see what happens:

$$ P(X > y + x | X > x) = \frac{P(X>y + x \cap X>x)}{P(X > x)} = \frac{P(X>y + x)}{P(X > x)}$$

$$ =\frac{e^{-(x+y)/\theta}}{e^{-x/\theta}} = \frac{e^{-x/\theta}e^{-y/\theta}}{e^{-x/\theta}} = e^{-y/\theta} = P(X>y)$$

There you go. Not too bad.
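Here’s the same thing checked numerically with pexp, using the widget example (mean lifespan \( \theta = 5\) years):

# P(X > 12 | X > 5) versus P(X > 7), with theta = 5 years
theta <- 5
pexp(12, rate = 1/theta, lower.tail = FALSE) / pexp(5, rate = 1/theta, lower.tail = FALSE)
pexp(7, rate = 1/theta, lower.tail = FALSE)  # same answer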

We can actually go the other direction as well. That is, we can show that if \( P(X > y + x | X > x) = P(X > y)\) is true for a continuous random variable X, then X has an exponential distribution. Here’s how:

\( P(X > y + x | X > x) = P(X > y)\) (given)

\( \frac{P(X > y + x \cap X > x)}{P(X > x)} = P(X > y)\) (using the definition of conditional probability)

\( \frac{P(X > y + x)}{P(X > x)} = P(X > y)\) (If X > y + x, then X > x)

\( \frac{1-F(y + x)}{1-F(x)}=1-F(y)\) (substitute the expressions involving the cdf)

Now substitute in generic function terminology, say \( h(x) = 1 - F(x)\):

$$ \frac{h(y + x)}{h(x)}=h(y)$$

Rearranging terms gives us \( h(y + x)=h(y)h(x)\)

Now for that equality to hold, the function h(x) has to have an exponential form, where the variable is in the exponent, like this: \( a^{x}\). Recall that \( a^{x}a^{y}=a^{x+y}\). If \( h(x) = a^{x}\), then our equality above works. So we let \( h(x)=a^{x}\). That allows us to make the following conclusion:

$$ 1-F(x) = h(x) = a^{x} = e^{\ln a^{x}} = e^{x \ln a}$$

Now let \( b = \ln a\). We get \( 1-F(x) = e^{bx}\). Solving for F(x) we get \( F(x) = 1 - e^{bx}\). Since \( F(\infty) = 1\), b must be negative, so we can write \( b = -\frac{1}{\theta}\) for some \( \theta > 0\). That gives us the cumulative distribution function for an exponential distribution: \( F(x) = 1 - e^{-x/\theta}\).

That’s the memoryless property for you. Or maybe it’s called the forgetfulness property. I can’t remember.

Deriving the exponential distribution

The exponential distribution looks harmless enough:

$$ f(x) = \frac{1}{\theta}e^{-x/\theta}$$

It looks like someone just took the exponential function and multiplied it by \( \frac{1}{\theta}\), and then for kicks decided to do the same thing in the exponent except with a negative sign. If we integrate this for all \( x>0 \) we get 1, demonstrating it’s a probability distribution function. So is this just a curiosity someone dreamed up in an ivory tower? No it actually turns out to be related to the Poisson distribution.

Recall that the Poisson describes the distribution of probability associated with a Poisson process. That is, the numbers of events occurring over time (or on some object) in non-overlapping intervals are independent. For example, maybe 911 phone calls for a particular city arrive at a rate of 3 per hour. The interval of 7 pm to 8 pm is independent of 8 pm to 9 pm. The expected number of calls for each hour is 3. The Poisson distribution allows us to find, say, the probability the city’s 911 number receives more than 5 calls in the next hour, or the probability they receive no calls in the next 2 hours. It deals with discrete counts.
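Here are those two examples worked out in R, assuming the made-up rate of 3 calls per hour:

# P(more than 5 calls in the next hour)
ppois(5, lambda = 3, lower.tail = FALSE)
# P(no calls in the next 2 hours); the mean for a 2-hour interval is 3 * 2 = 6
dpois(0, lambda = 6)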

Now what if we turn it around and ask instead how long until the next call comes in? Now we’re dealing with time, which is continuous as opposed to discrete. We’re limited only by the precision of our watch. Let’s be more specific and investigate the time until the first change in a Poisson process. Before diving into math, we can develop some intuition for the answer. If events in a process occur at a rate of 3 per hour, we would probably expect to wait about 20 minutes for the first event. Three per hour implies once every 20 minutes. But what is the probability the first event occurs within 20 minutes? What about within 5 minutes? How about after 30 minutes? (Notice I’m saying within and after instead of at. When finding probabilities of continuous events we deal with intervals instead of specific points. The probability of an event occurring at a specific point in a continuous distribution is always 0.)

Let’s create a random variable called W, which stands for wait time until the first event. The probability the wait time is less than or equal to some particular time w is \( P(W \le w)\). Let’s say w=5 minutes, so we have \( P(W \le 5)\). We can take the complement of this probability and subtract it from 1 to get an equivalent expression:

$$ P(W \le 5) = 1 - P(W>5)$$

Now \( P(W>5)\) implies no events occurred before 5 minutes. That is, nothing happened in the interval [0, 5]. More generally, \( P(W>w)\) is the probability that nothing happens in the interval [0, w]. What is that probability? Well now we’re dealing with events again instead of time. And for that we can use the Poisson. If events occur at a rate of \( \lambda\) per unit time, the mean number of events in an interval of length w is \( \lambda w\), so:

Probability of no events in interval [0, w] = \( P(X=0) = \frac{(\lambda w)^{0}e^{-\lambda w}}{0!}=e^{-\lambda w}\)

So we have \( P(W \le w) = 1 - P(W>w) = 1 - e^{-\lambda w}\)

That’s the cumulative distribution function. If we take the derivative of the cumulative distribution function, we get the probability distribution function:

$$ F'(w)=f(w)=\lambda e^{-\lambda w}$$

And there we have the exponential distribution! Usually we let \( \lambda = \frac{1}{\theta}\). And that gives us what I showed in the beginning:

$$ f(x) = \frac{1}{\theta}e^{-x/\theta}$$

Why do we do that? That allows us to have a parameter in the distribution that represents the mean waiting time until the first change. Recall my previous example: if events in a process occur at a mean rate of 3 per hour, or 3 per 60 minutes, we expect to wait 20 minutes for the first event to occur. In symbols, if \( \lambda\) is the mean number of events, then \( \theta=\frac{1}{\lambda}\), the mean waiting time for the first event. So if \( \lambda=3\) is the mean number of events per hour, then the mean waiting time for the first event is \( \theta=\frac{1}{3}\) of an hour. Notice that \( \lambda=3=\frac{1}{1/3}=\frac{1}{\theta}\).

Tying everything together, if we have a Poisson process where events occur at, say, 12 per hour (or 1 every 5 minutes), then the probability that exactly 1 event occurs during the next 5 minutes is found using the Poisson distribution. The mean number of events in a 5-minute interval is \( \lambda = 12 \times \frac{5}{60} = 1\), so:

$$ P(X=1) = \frac{1^{1}e^{-1}}{1!}=0.368$$

But the probability that we wait less than some time for the first event, say 5 minutes, is found using the exponential distribution with \( \theta = 5\) minutes (the mean wait time between events, since the rate is \( \frac{1}{5}\) of an event per minute):

$$ P(X<5)=\int_{0}^{5}\frac{1}{5}e^{-x/5}dx=1-e^{-5/5}=1-e^{-1}=0.632$$
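Here are those same two probabilities computed in R, as a quick check:

# probability of exactly one event in the next 5 minutes (Poisson mean of 1)
dpois(1, lambda = 1)
# probability the first event arrives within 5 minutes (mean wait of 5 minutes)
pexp(5, rate = 1/5)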

Now it may seem we have a contradiction here. We have a 63% chance of witnessing the first event within 5 minutes, but only a 37% chance of witnessing exactly one event in the next 5 minutes. While the two statements seem similar, they’re actually assessing two very different things. The Poisson probability is the chance we observe exactly one event in the next 5 minutes. Not 2 events, Not 0, Not 3, etc. Just 1. That’s a fairly restrictive question. We’re talking about one outcome out of many. The exponential probability, on the other hand, is the chance we wait less than 5 minutes to see the first event. This is inclusive of all times before 5 minutes, such as 2 minutes, 3 minutes, 4 minutes and 15 seconds, etc. There are many times considered in this calculation. So we’re likely to witness the first event within 5 minutes with a better than even chance, but there’s only a 37% chance that all we witness in that 5-minute span is exactly one event. The latter probability of 37% is similar to the idea that you’re likely to get 5 heads if you toss a fair coin 10 times. That is indeed the most likely outcome, but that outcome only has about a 25% chance of happening. Not impossible, but not exactly what I would call probable. Again it has to do with considering only 1 outcome out of many. The Poisson probability in our question above considered one outcome while the exponential probability considered the infinity of outcomes between 0 and 5 minutes.

Investigating the chi-square test of independence

Survey a random sample of 500 men and ask if they have seen Star Wars more than twice. Survey 500 women and ask the same question. We might guess that the proportion of people in each gender who say “yes” will differ significantly. In other words, the likelihood of answering the question one way or another depends on the gender. What if we surveyed a random sample of 500 men and 500 women and asked whether or not their birthday was in December? We probably wouldn’t expect gender to influence the likelihood of having a birthday in December.

In my first example, the two variables are…

  1. Have you seen Star Wars more than twice? (Yes/No)
  2. Gender (Male/Female)

In the second example the two variables are…

  1. Is your birthday in December? (Yes/No)
  2. Gender (Male/Female)

Our question is whether or not these variables are associated. The quick way to answer this question is the chi-square test of independence. There are many good web pages that explain how to do this test and interpret the results, so I won’t re-invent the iPhone. Instead, what I want to demonstrate is why it’s called the “chi-square” test of independence.

Let’s look at the statistic for this test:

$$ \sum_{i=1}^{k}\frac{(O_{i}-E_{i})^{2}}{E_{i}}$$

k = the number of combinations possible between the two variables. For example,

  1. Male & seen Star Wars more than twice
  2. Male & have not seen Star Wars more than twice
  3. Female & seen Star Wars more than twice
  4. Female & have not seen Star Wars more than twice

O = the OBSERVED number of occurrences for each combination. Perhaps in my survey I found 12 females out of 500 who have seen Star Wars more than twice and 488 who have not. By contrast, I found 451 males who have seen it more than twice and 49 who have not.

E = the EXPECTED number of occurrences IF the variables were NOT associated.
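To make O and E concrete, here is a quick sketch using the made-up counts above (451 and 49 for the men, 12 and 488 for the women):

# observed counts as a 2 x 2 table
tab <- matrix(c(451, 49, 12, 488), nrow = 2,
              dimnames = list(seen = c("yes", "no"), gender = c("male", "female")))
chisq.test(tab, correct = FALSE)$expected  # the E values under independence
chisq.test(tab, correct = FALSE)           # the test itself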

The chi-square test of independence is a hypothesis test. Therefore we make a falsifiable assumption about the two variables and calculate the probability of witnessing our data given our assumption. In this case, the falsifiable assumption is that the two variables are independent. If the two variables really are independent of one another, how likely are we to see the data we just saw? That’s a rough interpretation of the p-value.

Now let’s use R to generate a bunch of data on two independent categorical variables (each with two levels), run 1000 chi-square tests on these variables, and plot a histogram of the chi-square statistic:

p <- 0.4
n1 <- 500
n2 <- 500
n <- n1 + n2
x <- rep(NA,1000)
for (i in 1:1000){
	n11 <- sum(rbinom(n1,1,p))
	n21 <- n1 - n11
	n12 <- sum(rbinom(n2,1,p))
	n22 <- n2 - n12

	table <- matrix(c(n11,n21,n12,n22),nrow=2)
	test <- chisq.test(table,correct=F)

	x[i] <- test$statistic
	}
hist(x, freq=F)


You can see that in each iteration I generated two samples of 500 Bernoulli values (binomial with size 1) with p = 0.4. You can think of it as flipping a coin 500 times (or asking 500 girls if they have seen Star Wars more than twice). I then took the sum of the "successes" (or girls who said yes). The probabilities for both samples are the same, so they really are independent of one another. After that I calculated the chi-square test statistic using R's chisq.test function and stored it in a vector that holds 1000 values. Finally R made us a nice histogram. As you can see, when you run the test 1000 times, you most likely will get a test statistic value below 4.

Now let's plot a chi-square distribution with one degree of freedom on top of the histogram and show that the chi-square test statistic does indeed follow a chi-square distribution approximately. Note that I put approximately in italics. I really wanted you to see that. Oh, and why one degree of freedom? Because given the row and column totals in your 2 x 2 table where you store your results, you only need one cell to figure out the other three. For example, if I told you I had 1000 data points in my Star Wars study (500 men and 500 women), and that I had 650 "yes" responses to my question and 350 "no" responses, I would only need to tell you, say, that 160 women answered "yes" in order for you to figure out the other three values (men-yes, men-no, women-no). In other words, only one out of my four table cells is free to vary. Hence the one degree of freedom. Now about that histogram.

h <- hist(x, plot=F)
ylim <- range(0, h$density, 0.75)
hist(x, freq=F, ylim=ylim)
curve(dchisq(x,df=1), add = TRUE)

When I said it was approximate, I wasn't kidding, was I? Below 1 the fit isn't very good. Above 2 it gets much better. Overall though it's not bad. It's approximate, and that's often good enough. Hopefully this helps you visually see where this test statistic gets its name. Often this test gets taught in a class or presented in a textbook with no visual aids whatsoever. The emphasis is on the tables and the math. But here I wanted to crawl under the hood and show you how the chi-square distribution comes into play.

Explaining and simulating a chi-square distribution

Randomly select several values, say 10, from a standard Normal distribution. That’s a Normal distribution with a mean of 0 and standard deviation of 1. Now square each value and sum them all up. That sum is an observation from a chi-square distribution with 10 degrees of freedom. If you picked 12 values and did the same, you would have an observation from a chi-square distribution with 12 degrees of freedom.

We can do this in R as follows:

# random sample of 10 values from N(0,1) summed up
s <- rnorm(10)
sum(s^2)

When I did it, I got the following:

s <- rnorm(10)
s
[1] 2.3235602  0.1339985  0.5645995  1.3603184 -1.1383273  0.1735691
[7] 0.9075259  0.2842374 -0.4214184 -0.1069466
s^2
[1] 5.39893181 0.01795559 0.31877258 1.85046616 1.29578902 0.03012623
[7] 0.82360328 0.08079092 0.17759345 0.01143758
sum(s^2)
[1] 10.00547

Now let's repeat this 1000 times and make a histogram:

x <- rep(NA, 1000)
for (j in 1:1000){
  x[j] <- sum(rnorm(10)^2)
}
hist(x)

Obviously we don't see the Normal-looking bell-shape. It skews right. Also notice that since we squared all our values, every sum is positive, which means the histogram starts at 0. Now let's plot a chi-square curve with 10 degrees of freedom on top of our histogram:

h <- hist(x, plot=F)
ylim  <- range(0, h$density, 0.10)
hist(x, freq=F, ylim=ylim)
curve(dchisq(x,df=10), add = TRUE)

I had to experiment a little to get the 0.10 value for the range statement. That was necessary to keep R from setting the displayed range of the y-axis too small.

The formula for the chi-square curve (i.e., the probability density function) is

$$ f(x)=\frac{1}{\Gamma(r/2)2^{r/2}}x^{r/2-1}e^{-x/2}$$

where r is degrees of freedom and x must be positive. And, yes, I think it looks pretty frightening if you're not used to that sort of thing.