Category Archives: Simulation

A Combinatorial Simulation

I have started a new book called The Art of R Programming by Norman Matloff and I’m really digging it. I won’t blog about each chapter the way I did Machine Learning for Hackers, but I did come across something I thought made good blog material.

In Chapter 8 (Doing Math and Simulations in R), Matloff presents an example on combinatorial simulation. The motivation for the example is the following problem:

Three committees, of size 3, 4 and 5, are chosen from 20 people. What is the probability that persons A and B are chosen for the same committee?

Says Matloff: “This problem is not hard to solve analytically, but we may wish to check our solution using simulation…” I suppose it’s not that hard, but if you don’t think about these kinds of problems on a regular basis they can trip you up. He doesn’t give the analytic solution but does provide code for a neat simulation of it:

# 8.6.3 combinatorial simulation
sim <- function(nreps) {
  commdata <- list()
  commdata$countabsamecomm <- 0
  for (rep in 1:nreps) {
    commdata$whosleft <- 1:20
    commdata$numabchosen <- 0
    commdata <- choosecomm(commdata, 5)
    if (commdata$numabchosen > 0) next
    commdata <- choosecomm(commdata, 4)
    if (commdata$numabchosen > 0) next
    commdata <- choosecomm(commdata, 3)
  }
  print(commdata$countabsamecomm/nreps)
}

choosecomm <- function(comdat, comsize) {
  committee <- sample(comdat$whosleft, comsize)
  comdat$numabchosen <- length(intersect(1:2, committee))
  if (comdat$numabchosen == 2)
    comdat$countabsamecomm <- comdat$countabsamecomm + 1
  comdat$whosleft <- setdiff(comdat$whosleft, committee)
  return(comdat)
}

That's quite a bit of code. The first block is the main function, called "sim". The second block is the other function, "choosecomm", that gets called in the first function. The whole thing makes creative use of a list and the intersect() and setdiff() functions. If you want to simulate selecting three committees of 3, 4 and 5 people 100,000 times and see how often persons 1 and 2 (our A and B) land on the same committee, enter sim(100000) at the command line. You should get about 0.10.

Now how do we solve this using math and probability? First off, instead of 20 people, I like to think of 20 chips, 18 of which are white and 2 of which are red. Imagine they're in a bag: we reach in, scoop out 12, and immediately and randomly divide them into three groups of 3, 4 and 5. What is the probability one of those groups has both red chips? To calculate this we need to count the possible selections that include both red chips and divide that by all possible selections. First let's do it for the committee of 5. The number of possible combinations of 5 from 20 is \( \binom{20}{5}\). The number of combinations including both red chips is \( \binom{2}{2} \times \binom{18}{3}\). We can calculate this in R using the choose() function as follows:

c5 <- choose(18,3)/choose(20,5)

We then repeat this for the other two committees of 4 and 3, like so:

c4 <- choose(18,2)/choose(20,4)
c3 <- choose(18,1)/choose(20,3)

Summing these up we get c3 + c4 + c5 = 0.10. Now it may seem strange that we always set up our denominator as if we're drawing from all 20 people. After all, in the simulation above there's a specific order: first 5 are chosen, and only if the two people are not both on that committee do we draw 4 and then 3. It seems like that order should be taken into account when you solve this mathematically. But you don't have to. That's why I like to imagine just scooping 12 people up and assigning them instant membership on a random basis. While that's not how committee selection usually happens in "real life", it makes thinking about an analytic solution easier.

Incidentally we can simulate the scooping up and assigning of instant membership in R as well:

sim2 <- function(nreps) {
  cnt <- 0
  for (i in 1:nreps) {
    s <- sample(20, 12)
    if (all(1:2 %in% s[1:3])) {cnt <- cnt + 1; next}
    if (all(1:2 %in% s[4:7])) {cnt <- cnt + 1; next}
    if (all(1:2 %in% s[8:12])) cnt <- cnt + 1
  }
  print(cnt/nreps)
}

I call my function "sim2", because I'm super creative like that. Feed my function the number of simulations you want to run and it does the following for each simulation:

  1. samples 12 numbers from 1 through 20 and assigns them to a vector called "s"
2. checks if 1 and 2 are in the first three elements of s (committee of 3)
3. checks if 1 and 2 are in the next four elements of s (committee of 4)
4. checks if 1 and 2 are in the last five elements of s (committee of 5)

Notice the "next" in the first two if statements to improve efficiency. (There's no need to check subsequent if statements if the current one is true.) Obviously the sample() function scoops up my 12 people. The indexing of the "s" vector provides my instant committee membership: the first 3 are the committee of 3, the next 4 the committee of 4, and the last 5 the committee of 5. The nice thing about my function is that it's way faster than the one in the book. When I timed both functions running 100,000 simulations I got the following results:

> system.time(sim(100000))
[1] 0.1016
 user system elapsed 
 7.74   0.00    7.74 
> system.time(sim2(100000))
[1] 0.10159
 user system elapsed 
 1.42   0.00    1.42

Much faster, by 6 seconds. However I must say that Matloff wasn't going for speed and says as much throughout the book. He's writing R code the reader can understand. Just want to mention that lest anyone think I'm trying to one-up the author.
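If you did want to chase even more speed, here's one further variation of my own (not from the book, and the name sim3 is just my label): let replicate() handle the looping and take the mean of a logical vector.

# a further variation (my own sketch, not from the book)
sim3 <- function(nreps) {
  hits <- replicate(nreps, {
    s <- sample(20, 12)             # scoop up 12 people
    all(1:2 %in% s[1:3]) ||         # committee of 3
    all(1:2 %in% s[4:7]) ||         # committee of 4
    all(1:2 %in% s[8:12])           # committee of 5
  })
  mean(hits)                        # proportion of reps with persons 1 and 2 together
}

Calling sim3(100000) should also give roughly 0.10. I wouldn't expect a dramatic difference from sim2, since the work per iteration is the same; it just trades the explicit loop and counter for replicate() and mean().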

Evidence of the truth of the Central Limit Theorem

The Central Limit Theorem is really amazing if you think about it. It says that the sum of a large number of independent random variables will be approximately normally distributed almost regardless of their individual distributions. Now that’s a mouthful and perhaps doesn’t sound terribly amazing. So let’s break it down a bit. “A large number of independent random variables” means random variables from practically any mix of distributions. I could take 6 observations from a binomial distribution, 2 from a uniform and 3 from a chi-squared, and sum them all up. That sum has an approximately normal distribution. In other words, if I were to repeatedly take observations like that and calculate the sum (say 1,000 times) and make a histogram of my 1,000 sums, I would see something that looks like a Normal distribution. We can do this in R (with a different mix of distributions):

# Example of CLT at work
tot <- vector(length = 1000)
for(i in 1:1000){
  s1 <- rnorm(10,32,5)
  s2 <- runif(12)
  s3 <- rbinom(30,10,0.2)
  tot[i] <- sum(s1,s2,s3)
}
hist(tot)

See what I mean? I took 10 random variables from a Normal distribution with mean 32 and standard deviation 5, 12 random variables from a uniform (0,1) distribution, and 30 random variables from a Binomial (10, 0.2) distribution, and calculated the sum. I repeated this 1,000 times and then made a histogram of my sums. The shape looks like a Normal distribution, does it not?
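If you want to push the visual check a little further, here's an optional addition of my own (not part of the original example): superimpose the Normal curve implied by the theoretical mean and variance of the sum. Adding the means and (by independence) the variances gives a mean of 10(32) + 12(0.5) + 30(2) = 386 and a variance of 10(25) + 12(1/12) + 30(1.6) = 299.

# overlay the implied Normal curve (my addition, not in the original example)
hist(tot, freq = FALSE)
curve(dnorm(x, mean = 386, sd = sqrt(299)), add = TRUE)

The histogram should hug that curve reasonably well.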

Now this is NOT a proof of the Central Limit Theorem. It's just evidence of its truth. But I think it's pretty convincing evidence and a good example of why the Central Limit Theorem is truly central to all of statistics.

Explaining and simulating an F distribution

I remember feeling very uncomfortable with the F distribution when I started learning statistics. It comes up when you’re learning ANOVA, which is rather complicated if you’re learning it for the first time. You learn about partitioning the sum of squares, determining degrees of freedom, calculating mean square error and mean square between-groups, putting everything in a table and finally calculating something called an F statistic. Oh, and don’t forget the assumptions of constant variance and normal distributions (although ANOVA is pretty robust to those violations). There’s a lot to take in! And that’s before you calculate a p-value. And how do you do that? You use an F distribution, which when I learned it meant I had to flip to a massive table in the back of my stats book. Where did that table come from and why do you use it for ANOVA? I didn’t know, I didn’t care. It was near the end of the semester and I was on overload.

I suspect this happens to a lot of people. You work hard to understand the normal distribution, hypothesis testing, confidence intervals, etc. throughout the semester. Then you get to ANOVA sometime in late November (or April) and it just deflates you. You will yourself to understand the rules of doing it, but any sort of intuition for it escapes you. You just want to finish the class.

This post attempts to provide an intuitive explanation of what an F distribution is with respect to one-factor ANOVA. Maybe someone will read this and benefit from it.

ANOVA stands for ANalysis Of VAriance. We’re analyzing variances to determine if there is a difference in means between more than 2 groups. In one-factor ANOVA we’re using one factor to determine group membership. Maybe we’re studying a new drug and create three groups: group A on 10 mg, group B on 5 mg, and group C on placebo. At the end of the study we measure some critical lab value. We take the mean of that lab value for the three groups. We wish to know if there is a difference in the means between those three population groups (assuming each group contains a random sample from three very large populations).

The basic idea is to estimate two variances and form a ratio of those variances. If the variances are about the same, the ratio will be close to 1 and we have no evidence that the population means differ based on our sample.

One variance is simply the mean of the group variances. Say you have three groups. Take the variance of each group and then find the average of those variances. That’s called the mean square error (MSE). In symbol-free mathematical speak, for three groups we have:

MSE = [variance(group A) + variance(group B) + variance(group C)] / 3

The other variance is the variance of the sample means multiplied by the number of items in each group (assuming equal sample sizes in each group). That’s called mean square between groups (MSB). It looks something like this:

MSB = variance(group means)*(n)

The F statistic is the ratio of these two variances: F = MSB/MSE
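To make that ratio concrete before the bigger simulation below, here's a small sketch of my own (the group names and the seed are just my choices): compute MSE, MSB and F by hand for three simulated groups of 20, then check the answer against R's built-in ANOVA.

# hand-computed F statistic vs. R's built-in ANOVA (my own sanity check)
set.seed(1)                                        # any seed will do
g1 <- rnorm(20, 10, 4)
g2 <- rnorm(20, 10, 4)
g3 <- rnorm(20, 10, 4)
mse <- (var(g1) + var(g2) + var(g3)) / 3           # mean of the group variances
msb <- 20 * var(c(mean(g1), mean(g2), mean(g3)))   # n times variance of the group means
msb / mse                                          # the F statistic
y   <- c(g1, g2, g3)
grp <- factor(rep(1:3, each = 20))
anova(lm(y ~ grp))                                 # its F value should equal msb/mse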

Now if the groups have the same means and same variance and are normally distributed, the F statistic has an F distribution. In other words, if we were to run our experiment hundreds and hundreds of times on three groups with the same mean and variance from a normal distribution, calculate the F statistic each time, and then make a histogram of all our F statistics, our histogram would have a shape that can be modeled with an F distribution.

That’s something we can do in R with the following code:

# generate 4000 F statistics for 5 groups with equal means and variances
R <- 4000
Fstat <- vector(length = R)
for (i in 1:R){
  g1 <- rnorm(20,10,4)
  g2 <- rnorm(20,10,4)
  g3 <- rnorm(20,10,4)
  g4 <- rnorm(20,10,4)
  g5 <- rnorm(20,10,4)
  mse <- (var(g1)+var(g2)+var(g3)+var(g4)+var(g5))/5
  M <- (mean(g1)+mean(g2)+mean(g3)+mean(g4)+mean(g5))/5
  msb <- ((((mean(g1)-M)^2)+((mean(g2)-M)^2)+((mean(g3)-M)^2)+((mean(g4)-M)^2)+((mean(g5)-M)^2))/4)*20
  Fstat[i] <- msb/mse
}
# plot a histogram of F statistics and superimpose F distribution (4, 95)
ylim <- (range(0, 0.8))
x <- seq(0,6,0.01)
hist(Fstat,freq=FALSE, ylim=ylim)
curve(df(x,4,95),add=T) # 5 - 1 = 4; 100 - 5 = 95

So I have 5 normally distributed groups each with a mean of 10 and standard deviation of 4. I take 20 random samples from each and calculate the MSE and MSB as I outlined above. I then calculate the F statistic. And I repeat 4000 times. Finally I plot a histogram of my F statistics and super-impose a theoretical F distribution on top of it. I get the resulting figure:

This is the distribution of the F statistic when the group samples come from Normal populations with identical means and variances. The smooth curve is an F distribution with 4 and 95 degrees of freedom. The 4 is the number of groups minus 1 (5 - 1). The 95 is the total number of observations minus the number of groups (100 - 5). If we examine the figure we see that we most likely get an F statistic around 1. The bulk of the area under the curve is between 0.5 and 1.5. This is the distribution we would use to find a p-value for any experiment that involved 5 groups of 20 members each. When the population means of the groups are different, we tend to get an F statistic greater than 1. The bigger the differences, the larger the F statistic. And the larger the F statistic, the stronger the evidence that the population means of the 5 groups are not all the same.
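To make the p-value connection explicit, suppose such an experiment produced an F statistic of 2.8 (a number I'm making up purely for illustration). The p-value is just the area under this curve to the right of 2.8:

# upper-tail area of the F(4, 95) distribution beyond a hypothetical F = 2.8
pf(2.8, df1 = 4, df2 = 95, lower.tail = FALSE)

An F statistic near 1 would give a large p-value; a much bigger F would give a tiny one.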

There is much, much more that can be said about ANOVA. I'm not even scratching the surface. But what I wanted to show here is a visual of the distribution of the F statistic. Hopefully it'll give you a better idea of what exactly an F distribution is when you're learning ANOVA for the first time.

Stabilization simulation of relative frequency

One of my favorite simulations is seeing how the relative frequency of an event stabilizes around a theoretical probability as the number of trials increases. For example, we say the probability of rolling a 5 when rolling a single die is \( \frac{1}{6}\). Although it’s an obvious probability, it’s still fun to simulate 10,000 rolls of a die and see that indeed the relative frequency of the occurrence of 5’s stabilizes around \( \frac{1}{6}\). When this concept is presented in a textbook it is often accompanied by a squiggly line graph that settles down around a straight line that represents the theoretical probability. That’s what I want to show how to do in this post, except using a dice example that does not have an obvious probability.

Let’s say we roll a single fair Las Vegas die six times and define success as follows:

  • If the number on the die matches the number of the roll at least once, we have success. (So if it’s our second roll and we roll a 2, we have success.)
  • If we go all six rolls with no number matching the number of the roll, we have failure.

While each roll of the die has a probability of \( \frac{1}{6}\) of matching the attempt number, there is no obvious intuition for the probability of success as we defined it above. Theoretically, though, we can calculate it easily enough. The probability of NOT rolling the same die value as the attempt number on any one roll is \( \frac{5}{6}\). Since the rolls are independent, the probability of NOT matching on all six tries is \( (\frac{5}{6})^{6}\). Subtracting \( (\frac{5}{6})^{6}\) from 1 gives the probability of matching the attempt number at least once. So our probability is \( 1-(\frac{5}{6})^{6}=0.665102\). That’s pretty high. Even though success on a single roll has probability \( \frac{1}{6}\), the probability of at least one success in 6 rolls is about \( \frac{2}{3}\). The idea that we have a better than even chance of seeing a success in 6 rolls may be intuitive, but the exact probability of 0.665102 is certainly not.
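We can also let R do that arithmetic. Since the number of matches in six independent rolls is binomial with n = 6 and p = 1/6, both of the following return 0.665102 (this is just a double-check of the algebra above):

# theoretical probability of at least one match in six rolls
1 - (5/6)^6
1 - pbinom(0, size = 6, prob = 1/6)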

We can simulate this experiment many times using a computer and check whether or not the relative frequency of success converges to the theoretical probability we calculated. We can then make one of those cool line graphs and see if the relative frequency settles down around 0.665. Here’s how I did it in R:

p <- rep(NA, 1000)
s <- rep(NA, 1000)
for (j in 1:1000){
  ex <- rbinom(1, 6, (1/6))      # run the experiment
  s[j] <- ifelse(ex >= 1, 1, 0)  # success or failure?
  p[j] <- sum(s, na.rm = TRUE)/j
}

The first two lines create empty vectors to store the relative frequency (p) and whether or not we saw success in each trial (s). I then execute a for loop with 1000 iterations. Each time through the loop I run the experiment using the rbinom function, I check to see if I have success using the ifelse function (storing 1 if "yes" and 0 otherwise), and then calculate the relative frequency (p). Notice that the relative frequency uses information from the previous trials in each iteration. For example, I could see a sequence such as...

  1. roll six times, get a success: p = 1/1
  2. roll six times, get a success: p= 2/2
  3. roll six times, no success: p = 2/3
  4. roll six times, get a success: p = 3/4
  5. roll six times, get a success: p = 4/5

Each time the denominator increases and matches the trial number. The numerator is always the previous numerator plus 1 if there was a success, or plus 0 if no success.

Once you run the simulation, you can graph it with the following:

n <- 1:1000
plot(n,p,type="l",main="convergence to probability as n increases")
lines(c(1,1000),c(0.665, 0.665))

The first line creates my x-axis values. The second line makes the plot. The third line adds a horizontal line in the graph to represent the theoretical probability. Here's how my graph turned out:

Click on it to see the graph in full-size glory.
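As an aside, the bookkeeping in the loop can also be done in a more vectorized style. This is just my own variation (the names s2 and p2 are mine), not a correction of the code above:

# a vectorized version of the same simulation
s2 <- rbinom(1000, 6, 1/6) >= 1       # TRUE if at least one match in 6 rolls
p2 <- cumsum(s2) / (1:1000)           # running relative frequency
plot(1:1000, p2, type = "l", main = "convergence to probability as n increases")
abline(h = 1 - (5/6)^6, lty = 2)      # theoretical probability

The resulting squiggly line should settle down around 0.665 just the same.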

Simulating outcomes of a simple experiment

One of the first things you learn in an introductory mathematical statistics class is how to assign a probability model to the outcomes of a simple experiment. For example, say we have an urn containing the following:

  • 2 blue marbles
  • 3 green marbles
  • 3 red marbles
  • 2 yellow marbles

The experiment is to select one marble at random and note the color. Our intuition tells us to assign probabilities as follows:

  • P(blue) = \( \frac{2}{10}\)
  • P(green) = \( \frac{3}{10}\)
  • P(red) = \( \frac{3}{10}\)
  • P(yellow) = \( \frac{2}{10}\)

If we code our colors as 1=blue, 2=green, 3=red and 4=yellow, then our probability model is given by the probability mass function (p.m.f.):

$$ f(1)=\frac{2}{10}, f(2)=\frac{3}{10}, f(3)=\frac{3}{10}, f(4)=\frac{2}{10} $$

We can construct a probability histogram of this p.m.f. using R as follows:

ht <- rep(c(1,2,3,4),c(2,3,3,2))
hist(ht,freq=F,breaks=c(0.5,1.5,2.5,3.5,4.5))

This gives us the nice histogram:

If we were to actually carry out this experiment a number of times and make a histogram, we would expect to see a similar shape. This can be easily simulated on a computer. Unfortunately many textbooks don't actually tell you how to do this. Probably because doing so would mean documenting some method using some program that will likely change or be obsolete at some point. But here I show you how it can be done in R. And I can't imagine this method will go out of date any time soon.

The easy part is the simulation. One call to the rmultinom function does the trick:

rmultinom(1, 500, c(2,3,3,2))

In rmultinom(n, size, prob), the first argument is how many vectors of counts to generate and the second is how many draws each vector summarizes. So the 1 means I want a single vector of counts, and the 500 means that vector tallies 500 draws of one marble each. The c(2,3,3,2) gives the number of marbles of each color in the urn; R normalizes these weights to probabilities, so I also could have used the probabilities directly: c(0.2,0.3,0.3,0.2).

When I call this function I need to remember the results, so I give it a name, "h":

h <- rmultinom(1, 500, c(2,3,3,2))

"h" stores a matrix of numbers:

h
    [,1]
[1,] 100
[2,] 159
[3,] 141
[4,] 100

These are the results of the 500 experiments: I picked 100 blues, 159 greens, 141 reds and 100 yellows. Now to make the histogram. What I chose to do was create a vector of 500 items using the codes I assigned above (1=blue, 2=green, 3=red, and 4=yellow). I can do this using the rep() function as follows:

hx <- rep(c(1,2,3,4),c(h[1,1],h[2,1],h[3,1],h[4,1]))

The first part - c(1,2,3,4) - says I want to repeat the numbers 1, 2, 3, and 4. The second part - c(h[1,1],h[2,1],h[3,1],h[4,1]) - indicates how many times to repeat each number. h[1,1] references row 1, column 1 of the matrix called "h"; h[2,1] references row 2, column 1 of the matrix; and so forth. Now I can use the hist function to create our histogram:

hist(hx,freq=F,breaks=c(0.5,1.5,2.5,3.5,4.5))

Here is the result:

As expected, it looks pretty close to the theoretical histogram we created first. Again, nothing mind-blowing in this post. But I did want to document one way to simulate these simple experiments we so often see early in our statistical studies.
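One more aside of my own: the draws can also be simulated one marble at a time with sample(), which skips the rep() bookkeeping at the cost of not getting the tidy count matrix (hx2 is just my name for the result):

# simulate 500 single-marble draws directly
hx2 <- sample(1:4, size = 500, replace = TRUE, prob = c(2, 3, 3, 2)/10)
hist(hx2, freq = F, breaks = c(0.5, 1.5, 2.5, 3.5, 4.5))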

Explaining and simulating a chi-square distribution

Randomly select several values, say 10, from a standard Normal distribution. That’s a Normal distribution with a mean of 0 and standard deviation of 1. Now square each value and sum them all up. That sum is an observation from a chi-square distribution with 10 degrees of freedom. If you picked 12 values and did the same, you would have an observation from a chi-square distribution with 12 degrees of freedom.

We can do this in R as follows:

# random sample of 10 values from N(0,1) summed up
s <- rnorm(10)
sum(s^2)

When I did it, I got the following:

s <- rnorm(10)
s
[1] 2.3235602  0.1339985  0.5645995  1.3603184 -1.1383273  0.1735691
[7] 0.9075259  0.2842374 -0.4214184 -0.1069466
s^2
[1] 5.39893181 0.01795559 0.31877258 1.85046616 1.29578902 0.03012623
[7] 0.82360328 0.08079092 0.17759345 0.01143758
sum(s^2)
[1] 10.00547

Now let's repeat this 1000 times and make a histogram:

x <- rep(NA, 1000)
for (j in 1:1000){
  x[j] <- sum(rnorm(10)^2)
}
hist(x)

You can click on the image to see a bigger version. Obviously we don't see the Normal-looking bell shape. The distribution skews right. Also notice that since we squared all our values, every sum is positive, which means the histogram starts at 0. Now let's plot a chi-square curve with 10 degrees of freedom on top of our histogram:

h <- hist(x, plot=F)
ylim  <- range(0, h$density, 0.10)
hist(x, freq=F, ylim=ylim)
curve(dchisq(x,df=10), add = TRUE)

I had to experiment a little to get the 0.10 value for the range statement. That was necessary to keep R from making the displayed range of the y-axis too small to show the whole curve.
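If you'd rather not guess at that value, one option (my own tweak, a drop-in replacement for the ylim line above) is to let the peak of the theoretical density set the upper limit:

# include the chi-square density's peak in the range calculation
ylim <- range(0, h$density, dchisq(seq(0, 30, 0.1), df = 10))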

The formula for the chi-square curve (i.e., the probability density function) is

$$ f(x)=\frac{1}{\Gamma(r/2)2^{r/2}}x^{r/2-1}e^{-x/2}$$

where r is degrees of freedom and x must be positive. And, yes, I think it looks pretty frightening if you're not used to that sort of thing.
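If you want to convince yourself that this formula is what R's dchisq() function computes, here's a quick check of my own (the points in xs are arbitrary):

# evaluate the density formula by hand and compare with dchisq() for r = 10
r  <- 10
xs <- c(2, 5, 10, 15)
f  <- 1 / (gamma(r/2) * 2^(r/2)) * xs^(r/2 - 1) * exp(-xs/2)
cbind(formula = f, dchisq = dchisq(xs, df = r))   # the two columns should match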