Working through Gelman and Hill’s book can be maddening. The exposition is wonderful but recreating their examples leads to new and exciting levels of frustration. Take the height and earnings example in chapter 4. It’s a simple linear regression of earnings on height. You’d think you could download the data from the book web site, use the code in the book, and reproduce the example. They sure lead you to think that. But it doesn’t work that way. For starters, when you load the data it has 2029 records. However the output of the regression in the book shows n = 1192. So subsetting needs to be done. As far as I can tell, the subsetting is not discussed in the book.
Now the author has an “earnings” folder on his site under an “examples” directory, which contains a R file called “earnings_setup.R“. (Never mind the book itself doesn’t appear to mention this directory on the web site.) So this seems to be the place where we find out how to subset the data. The key line of code is
ok <- !is.na (earn+height+sex+age) & earn>0 & yearbn>25
which creates a logical vector to subset the data. But when you run it and request the dimensions of the data frame you have 1059 records, not 1192! After trial and error I believe the subset to reproduce the results in the book should be
ok <- !is.na (earn+height+sex) & earn>0
That gave me 1192. For the record, here’s my full code:
heights <- read.dta ("http://www.stat.columbia.edu/~gelman/arm/examples/earnings/heights.dta")
attach(heights)
male <- 2 - sex # make male 1 (and therefore female = 0)
ok <- !is.na (earn+height+sex) & earn>0
heights.clean <- as.data.frame (cbind (earn, height, sex, male)[ok,])
heights.clean$log.earn <- log(heights.clean$earn)
OK, so now(!) we can reproduce their example:
earn.logmodel.3 <- lm (log.earn ~ height + male + height:male, data=heights.clean)
The reason I was interested in this example was for another example in chapter 7 on "using simulation to represent uncertainty in regression coefficients" (p. 142). In other words, instead of using the standard errors and intervals obtained from the predict() function in R, we compute uncertainties by simulation. It seems easy enough using the sim() function from the book's arm package. You do something like the following:
library(arm)
sim.1 <- sim(earn.logmodel.3, 1000)
The sim function takes two arguments: your model and the number of simulations. It returns a vector of simulated residual standard deviations and a matrix of simulated regression coefficients. We can use these to calculate confidence intervals for coefficients that do not have standard errors in the regression output. Their example asks "what can be said about the coefficient of height among men?" If you look up at the model, you can see it contains an interaction term of height and male. To answer the question we can't just look at the standard error of the height coefficient or just the interaction coefficient. There is no simple way to do what they ask using regression output. So the book instructs us to use the output of the sim function to answer this as follows:
height.for.men.coef <- sim.1$beta[,2] + sim.1$beta[,4]
quantile(height.for.men.coef, c(0.025,0.975))
Except that doesn't work. More frustration. It produces an error that says "Error in sim.1$beta : $ operator not defined for this S4 class" (instead of giving us a 95% interval). With some googling and persistence I was able to determine that the following is how it should be done:
height.for.men.coef <- sim.1@coef[,2] + sim.1@coef[,4]
quantile(height.for.men.coef, c(0.025,0.975))
2.5% 97.5%
-0.0004378039 0.0507464098
Notice that "@coef" replaces "$beta". And with that I was able to finally reproduce the example I was interested in!
Now about this simulation function. While I appreciate functions that save time and make life easy, I do like to know how they work. Fortunately Gelman and Hill provide pseudo-code in the book. It goes like this:
- Run your regression to compute the vector \( \hat{\beta} \) of estimated parameters, the unscaled estimation covariance matrix \( V_{\beta} \), and the residual variance \( \hat{\sigma^{2}} \)
- Create n random simulations for the coefficient vector \( \beta \) and residual standard deviation. For each simulation draw:
- Simulate \(\sigma = \hat{\sigma}\sqrt{(n - k)/X} \) where X is a random draw from the \( \chi^{2} \) distribution with n - k degrees of freedom.
- Given the random draw of \( \sigma \), simulate \( \beta \) from a multivariate normal distribution with mean \( \hat{\beta} \) and variance matrix \( \sigma^{2}V_{\beta}\)
Not too bad. Let's use this to manually run our own simulations, so we have an idea of how the sim() function works. (Plus you may not want to use the arm package as it requires loading 9 more packages.)
Step 1 is easy enough. That's just running your regression as you normally would. Next we need to extract our estimated parameters, the unscaled covariance matrix and the residual standard deviation. We also need to snag degrees of freedom for our chi-square random draw. Here's how we can get them:
# extract coefficents
earn.logmodel.3$coef
# extract residual standard error from model
summary(earn.logmodel.3)$sigma
# extract unscaled covariance matrix
summary(earn.logmodel.3)$cov.unscaled
# extract k and n - k; first two elements in vector
summary(earn.logmodel.3)$df
Let's use this information to do a single simulation.
s.hat <- summary(earn.logmodel.3)$sigma
n.minus.k <- summary(earn.logmodel.3)$df[2]
library(MASS) # need for mvrnorm function to simulate draws from multivariate normal dist'n
# simulate residual standard deviation
(s.hat.sim <- s.hat*sqrt(n.minus.k/rchisq(1, n.minus.k)))
[1] 0.8591814
# use simulated residual standard deviation to simulate regression coefficients
mvrnorm(1,earn.logmodel.3$coef, s.hat.sim^2*summary(earn.logmodel.3)$cov.unscaled)
(Intercept) height male height:male
7.906605029 0.025163124 -0.160828921 0.007904422
That seems to work. How about we try doing 1000 simulations. Here's one way:
n <- 1000
sim.2.sigma <- rep(NA,n)
sim.2.coef <- matrix(NA,n,4)
for (i in 1:n){
sim.2.sigma[i] <- s.hat*sqrt(n.minus.k/rchisq(1, n.minus.k))
sim.2.coef[i,] <- mvrnorm(1,earn.logmodel.3$coef,sim.2.sigma[i]^2*summary(earn.logmodel.3)$cov.unscaled)
}
Now let's see how our simulation results compare to what we got using the sim() function:
height.for.men.coef.2 <- sim.2.coef[,2] + sim.2.coef[,4]
quantile(height.for.men.coef.2, c(0.025,0.975))
2.5% 97.5%
-0.001828216 0.049499381
Looks similar to what we got above. Nice. It's probably better to just use the sim() function for this sort of thing, but at least now we know a little more about what it's doing.