Monthly Archives: June 2013

Editing a lot of variable names in a R data frame

Someone I work with asked about how to easily update lots of variable names in a R data frame after importing a CSV file. Apparently the column headers in the CSV file are long and unwieldy, and simply opening the CSV file beforehand and editing the variable names is not desirable. So what to do? Well you’re going to have to actually type the new variable names at least once. There’s no getting around that. But it would be nice to type them only once and never again. This would be useful if you need to import the same CSV file over and over as it gets updated over time. Maybe it’s a quarterly report or weekly data dump of some sort. In that case we can do something like the following:

# read in data set and save variable names into a vector
ds <- read.csv("dataset.csv",header=TRUE)
old_names <- names(ds)

# write variable names to text file
writeLines(old_names, "names.txt")

We imported the CSV file, saved the variable names to a vector, and then wrote that vector to a text file with each variable on its own line. Now we can open the text file in an editor and edit the variable names. I find this easier than doing it in the context of R code. No need for quotes, functions, assignment operators, etc. When you’re done, save and close your text file. Now you’re ready to change your variable names:

names(ds) <- readLines("names.txt")

Done! The next time you need to import the same CSV file, you can just run these two lines:

ds <- read.csv("dataset.csv",header=TRUE)
names(ds) <- readLines("names.txt")

Simulation to Represent Uncertainty in Regression Coefficients

Working through Gelman and Hill’s book can be maddening. The exposition is wonderful but recreating their examples leads to new and exciting levels of frustration. Take the height and earnings example in chapter 4. It’s a simple linear regression of earnings on height. You’d think you could download the data from the book web site, use the code in the book, and reproduce the example. They sure lead you to think that. But it doesn’t work that way. For starters, when you load the data it has 2029 records. However the output of the regression in the book shows n = 1192. So subsetting needs to be done. As far as I can tell, the subsetting is not discussed in the book.

Now the author has an “earnings” folder on his site under an “examples” directory, which contains a R file called “earnings_setup.R“. (Never mind the book itself doesn’t appear to mention this directory on the web site.) So this seems to be the place where we find out how to subset the data. The key line of code is

ok <- !is.na (earn+height+sex+age) & earn>0 & yearbn>25

which creates a logical vector to subset the data. But when you run it and request the dimensions of the data frame you have 1059 records, not 1192! After trial and error I believe the subset to reproduce the results in the book should be

ok <- !is.na (earn+height+sex) & earn>0

That gave me 1192. For the record, here’s my full code:

heights <- read.dta ("http://www.stat.columbia.edu/~gelman/arm/examples/earnings/heights.dta")
attach(heights)
male <- 2 - sex # make male 1 (and therefore female = 0)
ok <- !is.na (earn+height+sex) & earn>0
heights.clean <- as.data.frame (cbind (earn, height, sex, male)[ok,])
heights.clean$log.earn <- log(heights.clean$earn)

OK, so now(!) we can reproduce their example:

earn.logmodel.3 <- lm (log.earn ~ height + male + height:male, data=heights.clean)

The reason I was interested in this example was for another example in chapter 7 on "using simulation to represent uncertainty in regression coefficients" (p. 142). In other words, instead of using the standard errors and intervals obtained from the predict() function in R, we compute uncertainties by simulation. It seems easy enough using the sim() function from the book's arm package. You do something like the following:

library(arm)
sim.1 <- sim(earn.logmodel.3, 1000)

The sim function takes two arguments: your model and the number of simulations. It returns a vector of simulated residual standard deviations and a matrix of simulated regression coefficients. We can use these to calculate confidence intervals for coefficients that do not have standard errors in the regression output. Their example asks "what can be said about the coefficient of height among men?" If you look up at the model, you can see it contains an interaction term of height and male. To answer the question we can't just look at the standard error of the height coefficient or just the interaction coefficient. There is no simple way to do what they ask using regression output. So the book instructs us to use the output of the sim function to answer this as follows:

height.for.men.coef <- sim.1$beta[,2] + sim.1$beta[,4]
quantile(height.for.men.coef, c(0.025,0.975))

Except that doesn't work. More frustration. It produces an error that says "Error in sim.1$beta : $ operator not defined for this S4 class" (instead of giving us a 95% interval). With some googling and persistence I was able to determine that the following is how it should be done:

height.for.men.coef <- sim.1@coef[,2] + sim.1@coef[,4]
quantile(height.for.men.coef, c(0.025,0.975))
         2.5%         97.5% 
-0.0004378039  0.0507464098 

Notice that "@coef" replaces "$beta". And with that I was able to finally reproduce the example I was interested in!

Now about this simulation function. While I appreciate functions that save time and make life easy, I do like to know how they work. Fortunately Gelman and Hill provide pseudo-code in the book. It goes like this:

  1. Run your regression to compute the vector \( \hat{\beta} \) of estimated parameters, the unscaled estimation covariance matrix \( V_{\beta} \), and the residual variance \( \hat{\sigma^{2}} \)
  2. Create n random simulations for the coefficient vector \( \beta \) and residual standard deviation. For each simulation draw:
    1. Simulate \(\sigma = \hat{\sigma}\sqrt{(n - k)/X} \) where X is a random draw from the \( \chi^{2} \) distribution with n - k degrees of freedom.
    2. Given the random draw of \( \sigma \), simulate \( \beta \) from a multivariate normal distribution with mean \( \hat{\beta} \) and variance matrix \( \sigma^{2}V_{\beta}\)

Not too bad. Let's use this to manually run our own simulations, so we have an idea of how the sim() function works. (Plus you may not want to use the arm package as it requires loading 9 more packages.)

Step 1 is easy enough. That's just running your regression as you normally would. Next we need to extract our estimated parameters, the unscaled covariance matrix and the residual standard deviation. We also need to snag degrees of freedom for our chi-square random draw. Here's how we can get them:

# extract coefficents
earn.logmodel.3$coef

# extract residual standard error from model
summary(earn.logmodel.3)$sigma

# extract unscaled covariance matrix
summary(earn.logmodel.3)$cov.unscaled

# extract k and n - k; first two elements in vector
summary(earn.logmodel.3)$df

Let's use this information to do a single simulation.

s.hat <- summary(earn.logmodel.3)$sigma
n.minus.k <- summary(earn.logmodel.3)$df[2]

library(MASS) # need for mvrnorm function to simulate draws from multivariate normal dist'n

# simulate residual standard deviation
(s.hat.sim <- s.hat*sqrt(n.minus.k/rchisq(1, n.minus.k)))
[1] 0.8591814

# use simulated residual standard deviation to simulate regression coefficients
mvrnorm(1,earn.logmodel.3$coef, s.hat.sim^2*summary(earn.logmodel.3)$cov.unscaled)
(Intercept)       height         male  height:male 
7.906605029  0.025163124 -0.160828921  0.007904422 

That seems to work. How about we try doing 1000 simulations. Here's one way:

n <- 1000
sim.2.sigma <- rep(NA,n)
sim.2.coef <- matrix(NA,n,4)
for (i in 1:n){
 sim.2.sigma[i] <- s.hat*sqrt(n.minus.k/rchisq(1, n.minus.k))
 sim.2.coef[i,] <- mvrnorm(1,earn.logmodel.3$coef,sim.2.sigma[i]^2*summary(earn.logmodel.3)$cov.unscaled)
}

Now let's see how our simulation results compare to what we got using the sim() function:

height.for.men.coef.2 <- sim.2.coef[,2] + sim.2.coef[,4]
quantile(height.for.men.coef.2, c(0.025,0.975))
        2.5%        97.5% 
-0.001828216  0.049499381 

Looks similar to what we got above. Nice. It's probably better to just use the sim() function for this sort of thing, but at least now we know a little more about what it's doing.

Scraping Virginia Tech Football Data, Part 2

In an earlier post I described how I went about scraping football data off the Virginia Tech athletics web site. Basically I wrote a function that scraped data one season at a time. It worked, but when I decided to return to this project I realized what a pain it would be to do all seasons from 1987 to 2012. I would have to find the ID number for the first game and last game, then run the function for every season. And then do bowl games because the ID number for those games does not follow the pattern of regular season games. It wouldn’t have been unreasonable to just call the function for every season. Typing it all up probably wouldn’t take more than a few minutes. But it’s inefficient so I decided to modify the program to do all seasons at once. Here’s what I came up with.

First I needed to get all the ID numbers. Here’s my code:

allids <- data.frame(season=numeric(0),ID=character(0)) #will hold all box score ids; 322 games from 1987 - 2012
for (i in 1987:2012){
  url <- paste("http://www.hokiesports.com/football/stats/",i,sep="")
  wp <- readLines(url)  
  box <- grep("showstats\\.html\\?[0-9]+",wp,value=TRUE) # split the element at "?"
  ids <- sub(".*?html\\?([0-9]{4,5}).*", "\\1", box)
  ids <- data.frame(season=i,ID=ids)
  allids <- rbind(allids,ids)
}

So I go to each season's page, read the web page source code, find each line that contains "showstats.html?xxxx" (where xxxx = ID number), and pull the ID number into a vector. That last part requires a fancy pants regular expression which took me a while to figure out. It basically says "find everything per the rule in the double quotes, and substitute it with the sub-expression in the parentheses". This is known as "tagging". The part in the parentheses is the tag: ([0-9]{4,5}). It's represented in the sub function with \\1. For more information, see page 99 of Phil Spector's book, Data Manipulation with R. Anyway, I then create a 2 column data frame called "allids" that contains season and ID numbers:

> head(allids)
  season   ID
1   1987 5803
2   1987 5804
3   1987 5805
4   1987 5806
5   1987 5807
6   1987 5808

Now I can use the ID column in my for loop to retrieve drive summaries for every season, like so.

for (i in allids[,2]){
	url <- paste("http://www.hokiesports.com/football/stats/showstats.html?",i,sep="")
	web_page <- readLines(url)

To see the final (for now) version of the code and the CSV file of the data, see my GitHub page for this project. You'll notice I had to build in an error check as it turned out that not all games had drive summaries:

if (length(grep("Virginia Tech Drive Summary", web_page)) == 0) next

That says if grep doesn't find a line with "Virginia Tech Drive Summary" then go to the next iteration in the loop. For some reason the 1994 season only has two games with drive summaries.

So now I have a lot of data and I guess that means I should analyze it. I suppose that will be a future blog post.