Category Archives: Using R

Comparing Multilevel Models using Deviance Statistics (Ch 4 of ALDA)

The tour of Applied Longitudinal Data Analysis (ALDA) by Singer and Willett continues today with section 4.6, Comparing Models using Deviance Statistics. In the section prior to this they walk through building a model by way of examining hypothesis tests for fixed effects and variance components. While the former will be familiar to those who’ve done classical linear regression, the latter is likely something new. And perhaps not advised. As I mentioned in my previous post (last paragraph), they state in section 3.6.2 that “statisticians disagree as to the nature, form, and effectiveness of these tests.” They also “suggest you examine them only with extreme caution.” Therefore I decided not to blog about that particular tactic and instead focus on “a superior method for testing hypotheses about variance components.” (Also their words.) Of course I refer to the title of this post: Deviance Statistics.

As I’ve done in my other ALDA posts, I’m going to forgo exposition and get down to business. This post is meant to serve as a reference guide for my future self and maybe yours as well.

  • The Deviance Statistic is used to test the hypothesis that additional model predictors do not improve the fit of the model. The null hypothesis is that the coefficients of the additional predictors are 0.
  • To use the Deviance Statistic, one model must be nested in the other. That is, the smaller model can be derived from the bigger model by setting certain coefficients in the bigger model equal to 0.
  • Deviance = -2 * (Log Likelihood (LL) of model)
  • Deviance Statistic = -2 * (LL of model nested in bigger model – LL of bigger model)
  • Smaller Deviance is better. If adding more predictors to a model reduces deviance, that may be a good thing. The hypothesis test using the Deviance Statistic helps us determine whether or not the reduction in deviance is significant. A large p-value tells us no, it is not significant and that our model is not improved by the additional predictors. A small p-value tells us to reject the null and keep the extra predictors.
  • The distribution of the deviance statistic is chi-square with DF equal to the number of extra parameters in the bigger model.
  • Deviance obtained under Restricted Maximum Likelihood (REML) should only be used if the two models compared have the same fixed effects and differ only in their random effects. If this is not the case, the deviance obtained using Full ML should be used instead.

Example

The example in Chapter 4 of ALDA involves alcohol use by adolescents. 82 adolescents were surveyed over time (3 waves). Some of the data collected include:

alcuse, a continuous measure of alcohol use based on a rating scale (the response)
age_14, age of participant centered about 14 (14 = 0, 15 = 1, 16 = 2)
coa, an indicator whether or not participant is a child of an alcoholic (1 = yes, 0 = no)
id, an arbitrary label used to group measurements by person

The model building process is reproduced in R on the UCLA stats consulting site. They use the nlme package. I will use the lme4 package below to demonstrate the use of the deviance statistic.

# read in and attach the data
alcohol1 <- read.table("http://www.ats.ucla.edu/stat/r/examples/alda/data/alcohol1_pp.txt", header=T, sep=",")
attach(alcohol1)
library(lme4)

We're going to fit a model that only has age_14 as a predictor. Then we're going to build a model that has age_14 and coa as predictors. Notice the first model is "nested" in the second. In other words we can get the first model from the second model by setting the coa coefficients to 0.

FIRST MODEL
$$ alcuse = \gamma_{00} + \gamma_{10}*age14 + \zeta_{0i} + \zeta_{1i}*age14 + \epsilon_{ij} $$

SECOND MODEL
$$ alcuse = \gamma_{00} + \gamma_{10}*age14 + \gamma_{01}*coa + \gamma_{11}*age14*coa + \zeta_{0i} + \zeta_{1i}*age14 + \epsilon_{ij} $$

Is the second model better than the first? The null hypothesis is no, it is not better.

$$ H_{0}: \gamma_{01} = \gamma_{11} = 0 $$

The second model has two additional fixed effects and no change in the random effects. Therefore to carry out this test, both models need to be fitted using Full Maximum Likelihood. (Note the argument "REML = FALSE" in the calls to lmer() below.)

# FIRST MODEL
model.b1 <- lmer(alcuse ~ age_14 + (age_14 | id), alcohol1, REML = FALSE)
summary(model.b1)

# SECOND MODEL
model.c1 <- lmer(alcuse ~ age_14*coa + (age_14 | id), alcohol1, REML = FALSE)
summary(model.c1)

Now we're ready to carry out the test. We can access the deviance of each model from the summary object, like so:

summary(model.b1)@AICtab$deviance
[1] 636.6111
summary(model.c1)@AICtab$deviance
[1] 621.2026

Notice the deviance of the bigger model is smaller than the deviance of the nested model. Is the reduction in deviance significant? To carry out the test we take the deviance of the smaller nested model and subtract from it the deviance of the bigger model. The difference is then compared to a chi-square distribution for significance. In this case, we'll compare the difference to a chi-square distribution with 2 degrees of freedom since the bigger model has two extra coefficients.

dev <- summary(model.b1)@AICtab$deviance - summary(model.c1)@AICtab$deviance
dev
[1] 15.40846
1 - pchisq(dev,2)
[1] 0.0004509163

Now that's a small p-value. That's the probability we would see a difference this large (or larger) in deviance if the two coefficients really were 0. We reject the null hypothesis and conclude our model is improved by including the two coefficients associated with the coa predictor. If we were planning to do several such tests, we could write a function to make the process go a little faster.

# function to calculate deviance statistic and return p-value
# a = nested model object, b = bigger model object, df = degrees of freedom
dev <- function(a, b, df){
  return(1 - pchisq(summary(a)@AICtab$deviance - 
                    summary(b)@AICtab$deviance, 
                    df))
}

dev(model.b1,model.c1,2)
[1] 0.0004509163
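One note for posterity: the @AICtab slot syntax above comes from an older release of lme4. In more recent versions the summary object is a list rather than an S4 object, so those lines will error. If that happens to you, something like the following should give the same test (a sketch, assuming the models were fit with REML = FALSE as above):

# deviance statistic and likelihood ratio test with a more recent lme4
deviance(model.b1) - deviance(model.c1)   # difference in ML deviances
anova(model.b1, model.c1)                 # reports the Chisq statistic, Df, and p-value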

Unconditional Multilevel Models for Change (Ch 4 of ALDA)

In Chapter 4 (section 4.4) of Applied Longitudinal Data Analysis (ALDA), Singer and Willett recommend fitting two simple unconditional models before you begin multilevel model building in earnest. These two models “allow you to establish: (1) whether there is systematic variation in your outcome that is worth exploring; and (2) where that variation resides (within or between people).” (p. 92) This is a great idea. Why start building models if there is no variation to explain? In this post I want to summarize these two models for reference purposes.

Model 1: The Unconditional Means Model

  • The keyword here is “means”. This model has one fixed effect that estimates the grand mean of the response across all occasions and individuals.
  • The main reason to fit this model is to examine the random effects (i.e., the within-person and between-person variance components). This tells us the amount of variation that exists at the within-person level and the between-person level.
  • Model specification: \( Y_{ij} = \gamma_{00} + \zeta_{0i} + \epsilon_{ij} \)
    • \( \gamma_{00} \) = grand mean (fixed effect)
    • \( \zeta_{0i} \) = the amount person i’s mean deviates from the population mean (between-person)
    • \( \epsilon_{ij} \) = the amount the response on occasion j deviates from person i’s mean (within-person)
    • \( \epsilon_{ij} \sim N(0,\sigma_{\epsilon}^{2}) \)
    • \( \zeta_{0i} \sim N(0, \sigma_{0}^{2}) \)
  • Use the intraclass correlation coefficient to describe the proportion of the total outcome variation that lies “between” people: \( \rho = \sigma_{0}^{2} / (\sigma_{0}^{2} + \sigma_{\epsilon}^{2}) \)
  • In the unconditional means model the intraclass correlation coefficient is also the “error autocorrelation coefficient”, which estimates the average correlation between any pair of composite residuals: \( \zeta_{0i} + \epsilon_{ij} \)
  • Sample R code for fitting the unconditional means model (where “id” = person-level grouping indicator):
    library(nlme)
    lme(response ~ 1, data=dataset, random= ~ 1 | id)
    

    Or this:

    library(lme4)
    lmer(response ~ 1 + (1 | id), dataset)
    

To replicate the Unconditional Means Model example in ALDA, the UCLA stats page suggests the following:

alcohol1 <- read.table("http://www.ats.ucla.edu/stat/r/examples/alda/data/alcohol1_pp.txt", 
                       header=T, sep=",")
library(nlme)
model.a <- lme(alcuse~ 1, alcohol1, random= ~1 |id)
summary(model.a)

This works OK, but returns slightly different results because it fits the model using REML (Restricted Maximum Likelihood) instead of ML (Maximum Likelihood). It also reports standard deviations rather than the estimated between-person variance \( \sigma_{0}^{2} \) (though you can square them, or call VarCorr() on the fit, to recover the variance). We can fix the first issue by including the argument method="ML". The lmer() function, however, makes it easier to replicate the example and obtain the same results presented in the book, as follows (notice we explicitly request ML with the argument REML = FALSE):

model.a1 <- lmer(alcuse ~ 1 + (1 | id), alcohol1, REML = FALSE)
summary(model.a1)

The output provides the values discussed in the book in the "Random effects" section under the variance column:

> summary(model.a1)
Linear mixed model fit by maximum likelihood 
Formula: alcuse ~ 1 + (1 | id) 
   Data: alcohol1 
   AIC   BIC logLik deviance REMLdev
 676.2 686.7 -335.1    670.2     673
Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 0.56386  0.75091 
 Residual             0.56175  0.74950 
Number of obs: 246, groups: id, 82

Fixed effects:
            Estimate Std. Error t value
(Intercept)   0.9220     0.0957   9.633

The "Random effect" id has variance = 0.564. That's the between-person variance. The "Random effect" Residual has variance = 0.562. That's the within-person variance. We can access these values using "summary(model.a1)@REmat" and calculate the intraclass correlation coefficient like so:

icc_n <- as.numeric(summary(model.a1)@REmat[1,3])
icc_d <- as.numeric(summary(model.a1)@REmat[1,3]) + 
         as.numeric(summary(model.a1)@REmat[2,3])
icc_n / icc_d
[1] 0.5009373
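As with the deviance example above, the @REmat slot only exists in older releases of lme4. With a current version, the same intraclass correlation can be computed from VarCorr(); here's a sketch (not the book's code):

vc <- as.numeric(NA) # placeholder removed; see below
vc <- as.data.frame(VarCorr(model.a1))   # columns include grp and vcov (the variances)
between <- vc$vcov[vc$grp == "id"]       # between-person variance
within  <- vc$vcov[vc$grp == "Residual"] # within-person variance
between / (between + within)             # intraclass correlation, roughly 0.50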

Model 2: The Unconditional Growth Model

  • This model partitions and quantifies variance across people and time.
  • The fixed effects estimate the starting point and slope of the population average change trajectory.
  • Model specification: \( Y_{ij} = \gamma_{00} + \gamma_{10}*time_{ij} + \zeta_{0i} + \zeta_{1i}*time_{ij} + \epsilon_{ij} \)
    • \( \gamma_{00} \) = average intercept (fixed effect)
    • \( \gamma_{10} \) = average slope (fixed effect)
    • \( \zeta_{0i} \) = the amount person i's intercept deviates from the population intercept
    • \( \zeta_{1i} \) = the amount person i's slope deviates from the population slope
    • \( \epsilon_{ij} \) = the amount the response on occasion j deviates from person i's true change trajectory
    • \( \epsilon_{ij} \sim N(0,\sigma_{\epsilon}^{2}) \)
    • \( \zeta_{0i} \sim N(0, \sigma_{0}^{2}) \)
    • \( \zeta_{1i} \sim N(0, \sigma_{1}^{2}) \)
    • \( \zeta_{0i}\) and \( \zeta_{1i} \) have covariance \( \sigma_{01} \)
  • The residual variance \( \sigma_{\epsilon}^{2} \) summarizes the average scatter of an individual's observed outcome values around his/her own true change trajectory. Compare this to the same value in the unconditional means model to see if within-person variation is systematically associated with linear time.
  • The level-2 variance components, \( \sigma_{0}^{2} \) and \( \sigma_{1}^{2} \) quantify the unpredicted variability in the intercept and slope of individuals. That is, they assess the scatter of a person's intercept and slope about the population average change trajectory. DO NOT compare to the same values in the unconditional means model since they have a different interpretation.
  • The level-2 covariance \( \sigma_{01} \) quantifies the population covariance between the true initial status (intercept) and true change (slope). Interpretation is easier if we re-express the covariance as a correlation coefficient: \( \hat{\rho}_{01} = \hat{\sigma}_{01} / \sqrt{\hat{\sigma}_{0}^{2}\hat{\sigma}_{1}^{2}} \)
  • Sample R code for fitting the unconditional growth model (where "id" = person-level grouping indicator):
    lme(response ~ time , data=dataset, random= ~ time | id)
    

    Or this:

    lmer(response ~ time + (time | id), dataset)
    

To replicate the Unconditional Growth Model example in ALDA, the UCLA stats page suggests the following:

alcohol1 <- read.table("http://www.ats.ucla.edu/stat/r/examples/alda/data/alcohol1_pp.txt", 
                       header=T, sep=",")
library(nlme)
model.b <- lme(alcuse ~ age_14 , data=alcohol1, random= ~ age_14 | id, method="ML")
summary(model.b)

However, I think the following is better as it gives the same values as the book:

model.b1 <- lmer(alcuse ~ age_14 + (age_14 | id), alcohol1, REML = FALSE)
summary(model.b1)

For instance it provides variance values instead of standard deviation values. It doesn't really matter in the long run, but it makes it easier to quickly check your work against the book. Here's the output:

> summary(model.b1)
Linear mixed model fit by maximum likelihood 
Formula: alcuse ~ age_14 + (age_14 | id) 
   Data: alcohol1 
   AIC   BIC logLik deviance REMLdev
 648.6 669.6 -318.3    636.6   643.2
Random effects:
 Groups   Name        Variance Std.Dev. Corr   
 id       (Intercept) 0.62436  0.79017         
          age_14      0.15120  0.38885  -0.223 
 Residual             0.33729  0.58077         
Number of obs: 246, groups: id, 82

Fixed effects:
            Estimate Std. Error t value
(Intercept)  0.65130    0.10508   6.198
age_14       0.27065    0.06246   4.334

Correlation of Fixed Effects:
       (Intr)
age_14 -0.441

Again the main section to review is the "Random effects". The Residual variance (within-person) has decreased to 0.337 from 0.562. We can conclude that \( (0.562 - 0.337)/0.562 = 0.40 \) (i.e., 40%) of the within-person variation in the response is systematically associated with linear time. We also see the negative correlation (-0.223) between the true initial status (intercept) and true change (slope). However, the book notes this correlation is not statistically significant. As you can see this is not something the output of the lmer object reports. The book mentions in chapter 3 (p. 73) that statisticians disagree about the effectiveness of such significance tests on variance components, and I can only assume the authors of the lme4 package question their use. Finally, we notice the level-2 variance components: 0.624 and 0.151. These provide a benchmark for predictors' effects as the authors proceed to build models.
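As a quick check on that arithmetic, the two residual variances can be pulled from the fits and compared directly. Here's a sketch, again assuming a recent version of lme4 where VarCorr() is the way to get at the variance components:

vc.a <- as.data.frame(VarCorr(model.a1))   # unconditional means model
vc.b <- as.data.frame(VarCorr(model.b1))   # unconditional growth model
s2.a <- vc.a$vcov[vc.a$grp == "Residual"]  # within-person variance, ~0.562
s2.b <- vc.b$vcov[vc.b$grp == "Residual"]  # within-person variance, ~0.337
(s2.a - s2.b) / s2.a                       # proportion associated with linear time, ~0.40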

The Multilevel Model for Change (Ch 3 of ALDA) – revisited

In my previous post I talked about how to replicate the example in Chapter 3 of ALDA. It was mostly about finding the dataset and modifying the R code on the UCLA web site so it works. In this post I want to talk a little more about the statistics.

Recall the example involves two groups of infants: one assigned to a program to enhance cognitive functioning and the other acting as a control. Cognitive measurements were taken at three different time periods for both groups. Did the group assigned to the program perform differently than the control group? To answer this question the authors postulate a linear model where cognitive test results are explained by time, as follows:

$$ Y = \beta_{0} + \beta_{1}*time $$

But the intercept and slope coefficients in that model are modeled as follows:

$$ \beta_{0} = \gamma_{00} + \gamma_{01}*program $$

$$ \beta_{1} = \gamma_{10} + \gamma_{11}*program $$

So we have two levels of modeling happening at the same time. The first level concerns within-person change while the second level concerns between-person differences in change. We can consolidate the two levels into one formula, like this:

$$ Y = \gamma_{00} + \gamma_{10}*time + \gamma_{01}*program + \gamma_{11}*time*program $$

So we have an intercept and three coefficients. When we fit the model to the data, we get:

$$ Y = 107.84 - 21.13*time + 6.85*program + 5.27*program*time $$

All of which are significant coefficients. When program = 0, our linear model is \( Y = 107.84 - 21.13*time \). When program = 1, our linear model is \( Y = 114.69 - 15.86*time \). The intercept in these models is interpreted as the cognitive score at the first measurement (when the infants were 12 months old). We can see that infants in the program had a higher performance at the first measurement than those not in the program: 114.69 - 107.84 = 6.85. The slope tells us the rate of decline of cognitive performance (decline?). We see the infants in the program had a slower rate of decline over time: -15.86 versus -21.13. Or put another way: -15.86 - (-21.13) = 5.27, which is the coefficient of the interaction. That is, the difference in slopes between the two groups is 5.27.
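As a sanity check, those two fitted lines can be read straight off the fixed effects. Here's a small sketch, assuming the lme() fit shown a little further down has been assigned to a name, say m1 (my made-up name, not the book's):

b <- fixef(m1)   # fixed-effect estimates (fixef() is provided by the nlme package)
b["(Intercept)"]                  # control-group intercept, about 107.84
b["(Intercept)"] + b["program"]   # program-group intercept, about 114.69
b["time"]                         # control-group slope, about -21.13
b["time"] + b["time:program"]     # program-group slope, about -15.86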

Now it’s interesting to note that the model does not appear to make use of the fact that each subject contributed three waves of data. Our dataset has 309 records. But those 309 records are in 103 groups. Those 103 groups are the 103 infants. Each infant contributed three cognitive test scores over time. The dataset has a variable, ID, to indicate those groups. But ID is not in our model. Shouldn’t we make use of that information? Well, in fact we did. Have a look at the R code:

lme(cog ~ time*program, data=ch3, random = ~ time | id, method="ML", 
    control=list(opt = "optim"))

Notice how “id” indicates the grouping structure in the “random” argument. Time is specified as the random effect and “| id” indicates it is grouped by “id” (i.e., the 103 infants). In so many words, this allows us to capture the variability in each infant’s own change trajectory. We can think of plotting the cognitive test score for one infant over time and fitting a line to those three points. There will be some error in that line. But not as much error as if we fit a single line to all the infants in a group across the three time points. In that latter scenario we’re not accounting for the grouping of the measurements by infant. We can actually see what happens if we don’t account for this grouping by doing an Analysis of Covariance (ANCOVA).

With ANCOVA, we’re basically doing regression with continuous and categorical variables. The usual approach to ANCOVA is to think of doing a regular ANOVA analysis but blocking on a continuous variable. For example, comparing cholesterol levels (y) between a treated group and a reference group adjusted for age (x, in years). We’re interested in the treatment effect but we want to account for the effect of age.

We can naively do ANCOVA with the Chapter 3 example from ALDA as follows:

lm(cog ~ program + time + time:program, data=ch3)

Look at the results:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   107.841      1.773  60.822  < 2e-16 ***
program         6.855      2.363   2.901  0.00399 ** 
time          -21.133      2.747  -7.694 1.99e-13 ***
program:time    5.271      3.660   1.440  0.15087 

Now compare those to the multilevel modelling results, obtained from the call to lme() above:

                 Value Std.Error  DF   t-value p-value
(Intercept)  107.84074  2.048799 204  52.63608  0.0000
time         -21.13333  1.903664 204 -11.10140  0.0000
program        6.85466  2.730259 101   2.51063  0.0136
time:program   5.27126  2.536850 204   2.07788  0.0390

Notice the similarities? That's right, both return the same model coefficients! But compare the standard errors. The most dramatic difference is for the interaction between time and program. In the ANCOVA analysis the interaction appears to be insignificant (SE = 3.7; p = 0.15). But in the multilevel model it's significant at the 5% level (SE = 2.5; p = 0.04). We see that the ANCOVA model does not take into account the change trajectories at the individual level, and is thus not sensitive enough to detect the significant difference in rates of cognitive decline between the infants in the program and those in the control group.

The Multilevel Model for Change (Ch 3 of ALDA)

ALDA stands for the book, Applied Longitudinal Data Analysis, by Singer and Willett. I’ve had this lying around for a while and decided to give it another go. If you know a little about regression and want to get started with data analysis using multilevel models, this is probably a good place to start. It’s light on the math and heavy on application. Well, maybe not that heavy. They don’t actually provide anything in the way of computer code when it comes time to fit a model. They just do it and produce a table of results. Fortunately, our friends at the UCLA Statistical Consulting Group have generously provided programs to reproduce the examples in the book using R, SAS, and Stata, among others. So that’s where I turned when I wanted to replicate the example running through chapter 3.

The example involves 103 African-American infants born into low-income families. When the children were 6 months old, about half were randomly assigned to an early intervention program designed to enhance cognitive functioning. The other half received no intervention and acted as a control. Cognitive performance was measured at ages 12, 18 and 24 months by way of a test. The researchers wanted to examine the effects of program participation on the test results. There’s actually a lot more to this study. Here’s the abstract. But the authors scaled it back and took a portion of the data to create a friendly introductory example of multilevel modeling. And I have to say, they succeeded. It is indeed an easy-to-follow example. The chapter can be read in one sitting and gives you a sense of what multilevel modeling is all about. In this case we want to model individual change in cognitive ability over time with a straight line,

$$ y = \beta_{0} + \beta_{1}(time)$$

but then also model the intercept and slope of that line with respect to program:

$$ \beta_{0} = \gamma_{00} + \gamma_{01}(program)$$

$$ \beta_{1} = \gamma_{10} + \gamma_{11}(program)$$

So we have two levels of modeling going on. Hence, “multilevel” modeling.

Naturally I wanted to replicate the Chapter 3 results in R. That’s why I’m reading the book, so I can learn how to do this stuff. So I hit up the UCLA site to get the necessary R code. And there at the top the page, before I can even get started, is a roadblock:

Please note that the “early_int” data file (which is used in Chapter 3) is not included among the data files. This was done at the request of the researchers who contributed this data file to ensure the privacy of the participants in the study. Although the web page shows how to obtain the results with this data file, we regret that visitors do not have access to this file to be able to replicate the results for themselves.

Seriously? C’mon. Did the authors not think about this when writing the book? Couldn’t they come up with a different example that can be replicated? Very frustrating. So I did what any honorable statistician would do and gave up, turned off the computer, and went outside to enjoy nature. Just kidding: I Googled the name of the dataset used in the R code on the UCLA web site (earlyint_pp.txt). The very first result takes you to this Harvard web page, where the data in question is available in SAS and Stata formats. Now the authors of ALDA are both Harvard professors. And here is the data – that I’m not supposed to have access to – available on a Harvard web page for an old statistics class taught by one of the authors of the book. I guess the researchers changed their mind? Anyway, I now had the data and could try to replicate the results. And you’ll be happy to know the data contained no identifying information. Nov 2024 update: see this GitHub repo for the data.

It had four variables: (1) an ID number for the infant, (2) the test score, (3) age at test, and (4) an indicator for program participation.

I downloaded the Stata dataset because I can read that into R using the foreign package:

library(foreign)
early.int <- read.dta("earlyint_pp.dta")

With the data loaded, I was ready to start replicating the results. This required me to copy and paste the R code from the UCLA web site. I don’t usually go for copying-and-pasting, but in this case I was OK with it. It was mostly exploratory analysis. And it worked like a charm. That is until I got to the part where you actually fit a multilevel model to the data. Running the following R code produced an error:

library(nlme)
lme(cog ~ time*program, data=early.int, random = ~ time | id, method="ML")

Error in lme.formula(cog ~ time * program, data = early.int, random = ~time |  : 
  nlminb problem, convergence error code = 1
  message = iteration limit reached without convergence (10)

Hmm. Seemed to work for the UCLA person. What to do? Long story short, I needed to specify a different optimizer. The default optimizer for lme is “nlminb”. I needed to specify “optim”, like so:

model1 <- lme(cog ~ time*program, data=early.int, random = ~ time | id, 
              method="ML", control=list(opt = "optim"))
summary(model1)

That worked! Or at least it gave me some output. Upon closer inspection the random effects were slightly different, but everything else looked about the same. Why does “optim” work and not “nlminb”? I have no idea. Obviously they go about optimization in different ways. But I’m not too worried at the moment since I’m just getting my feet wet with multilevel modeling.

In researching the error above I discovered I could also fit the multilevel model using the following code:

library(lme4)
mod1 <- lmer(cog ~ time*program + (time | id), early.int)
summary(mod1)

The output from this is also slightly different from the book as well as the output from the lme call above using “optim”. The coefficients of the fixed effects are all the same, but the random effects are slightly different.

So what does all this mean? How do you interpret the output? What is multilevel modeling? Well, that wasn’t quite the point of this post. I wanted to document for posterity how to replicate the results of ALDA Chapter 3 using R despite the message to the contrary on the UCLA site. Besides, if you’ve read this far then you're probably also working through ALDA and don't need me to repeat what's in the book. However, if you insist, I point you to this solid PowerPoint presentation that boils Chapter 3 of ALDA to its essence. It tells you everything you need to know about modeling this example and interpreting the results.

Using a Bootstrap to Estimate Power and Significance Level

I’ve been reading Common Errors in Statistics (and How to Avoid Them) by Phillip Good and James Hardin. It’s a good bathroom/bedtime book. You can pick it up and put it down as you please. Each chapter is self-contained and contains bite-size, easy-to-read sections. I’m really happy with it so far.

Anyway, chapter 3 had a section on computing Power and sample size that inspired me to hop on the computer:

If the data do not come from one of the preceding distributions, then we might use a bootstrap to estimate the power and significance level.

In preliminary trials of a new device, the following test results were observed: 7.0 in 11 out of 12 cases and 3.3 in 1 out of 12 cases. Industry guidelines specified that any population with a mean test result greater than 5 would be acceptable. A worst-case or boundary-value scenario would include one in which the test result was 7.0 3/7th of the time, 3.3 3/7th of the time, and 4.1 1/7th of the time. i.e., \((7 \times \frac{3}{7}) + (3.3 \times \frac{3}{7}) + (4.1 \times \frac{1}{7}) = 5\)

The statistical procedure required us to reject if the sample mean of the test results were less than 6. To determine the probability of this event for various sample sizes, we took repeated samples with replacement from the two sets of test results.

If you want to try your hand at duplicating these results, simply take the test values in the proportions observed, stick them in a hat, draw out bootstrap samples with replacement several hundred times, compute the sample means, and record the results.

Well of course I want to try my hand at duplicating the results. Who wouldn’t?

The idea here is to bootstrap from two samples: (1) the one they drew in the preliminary trial with mean = 6.69, and (2) the hypothetical worst-case boundary example with mean = 5. We bootstrap from each and calculate the proportion of samples with mean less than 6. The proportion of results with mean less than 6 from the first population (where the true mean = 6.69) can serve as a proxy for Type I error, or the significance level. This is the proportion of times we make the wrong decision: we conclude the mean is less than 6 when in fact it’s really 6.69. The proportion of results with mean less than 6 from the second population (where the true mean = 5) can serve as a proxy for power. This is the proportion of times we make the correct decision: we conclude the mean is less than 6 when in fact it’s really 5.

In the book they show the following table of results:
[Table 3.2 from the book]

We see they have computed the significance level (Type I error) and power for three different sample sizes. Here’s me doing the same thing in R:

# starting sample of test results (mean = 6.69)
el1 <- c(7.0,3.3)
prob1 <- c(11/12, 1/12)

# hypothetical worst-case population (mean = 5)
el2 <- c(7, 3.3, 4.1)
prob2 <- c(3/7, 3/7, 1/7)

n <- 1000
for (j in 3:5){ # loop through sample sizes
  m1 <- double(n)
  m2 <- double(n)
  for (i in 1:n) {
    m1[i] <- mean(sample(el1, j, replace=TRUE, prob=prob1)) # test results
    m2[i] <- mean(sample(el2, j, replace=TRUE, prob=prob2)) # worst-case
  }
  print(paste("Type I error for sample size =", j, "is", sum(m1 < 6.0)/n)) 
  print(paste("Power for sample size =", j, "is", sum(m2 < 6.0)/n))
}

To begin I define vectors containing the values and their probability of occurrence. Next I set n = 1000 because I want to do 1000 bootstrap samples. Then I start the first of two for loops. The first is for my sample sizes (3 - 5) and the next is for my bootstrap samples. Each time I begin a new sample size loop I need to create two empty vectors to store the means from each bootstrap sample. I call these m1 and m2. As I loop through my 1000 bootstrap samples, I take the mean of each sample and assign it to the ith element of the m1 and m2 vectors. m1 holds the sample means from the test results and m2 holds the sample means from the worst-case boundary scenario. Finally I print the results using the paste function. Notice how I calculate the proportion. I create a logical vector with m1 < 6.0 (and likewise m2 < 6.0). This returns a vector of TRUEs and FALSEs, which sum() treats as 1s and 0s. I then sum this vector to get the number of times the mean was less than 6. Dividing that by n (1000) gives me the proportion. Here are my results:

[1] "Type I error for sample size = 3 is 0.244"
[1] "Power for sample size = 3 is 0.845"
[1] "Type I error for sample size = 4 is 0.04"
[1] "Power for sample size = 4 is 0.793"
[1] "Type I error for sample size = 5 is 0.067"
[1] "Power for sample size = 5 is 0.886"

Pretty much the same thing! I guess I could have used the boot function in the boot package to do this. That’s probably more efficient. But this was a clear and easy way to duplicate their results.
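For what it's worth, here's a more compact way to run the same simulation using replicate() and a small helper function of my own (not the boot package), reusing the el1/prob1 and el2/prob2 vectors defined above:

# proportion of bootstrap sample means falling below 6
prop.below.6 <- function(values, probs, size, n = 1000) {
  means <- replicate(n, mean(sample(values, size, replace = TRUE, prob = probs)))
  mean(means < 6)
}

# Type I error uses the observed test results; power uses the worst-case population
sapply(3:5, function(j) c(size = j,
                          type1.error = prop.below.6(el1, prob1, j),
                          power = prop.below.6(el2, prob2, j)))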

Editing a lot of variable names in an R data frame

Someone I work with asked how to easily update lots of variable names in an R data frame after importing a CSV file. Apparently the column headers in the CSV file are long and unwieldy, and simply opening the CSV file beforehand and editing the variable names is not desirable. So what to do? Well, you’re going to have to actually type the new variable names at least once. There’s no getting around that. But it would be nice to type them only once and never again. This would be useful if you need to import the same CSV file over and over as it gets updated over time. Maybe it’s a quarterly report or a weekly data dump of some sort. In that case we can do something like the following:

# read in data set and save variable names into a vector
ds <- read.csv("dataset.csv",header=TRUE)
old_names <- names(ds)

# write variable names to text file
writeLines(old_names, "names.txt")

We imported the CSV file, saved the variable names to a vector, and then wrote that vector to a text file with each variable on its own line. Now we can open the text file in an editor and edit the variable names. I find this easier than doing it in the context of R code. No need for quotes, functions, assignment operators, etc. When you’re done, save and close your text file. Now you’re ready to change your variable names:

names(ds) <- readLines("names.txt")

Done! The next time you need to import the same CSV file, you can just run these two lines:

ds <- read.csv("dataset.csv",header=TRUE)
names(ds) <- readLines("names.txt")

Simulation to Represent Uncertainty in Regression Coefficients

Working through Gelman and Hill’s book can be maddening. The exposition is wonderful but recreating their examples leads to new and exciting levels of frustration. Take the height and earnings example in chapter 4. It’s a simple linear regression of earnings on height. You’d think you could download the data from the book web site, use the code in the book, and reproduce the example. They sure lead you to think that. But it doesn’t work that way. For starters, when you load the data it has 2029 records. However the output of the regression in the book shows n = 1192. So subsetting needs to be done. As far as I can tell, the subsetting is not discussed in the book.

Now the author has an “earnings” folder on his site under an “examples” directory, which contains an R file called “earnings_setup.R”. (Never mind that the book itself doesn’t appear to mention this directory on the web site.) So this seems to be the place where we find out how to subset the data. The key line of code is

ok <- !is.na (earn+height+sex+age) & earn>0 & yearbn>25

which creates a logical vector to subset the data. But when you run it and request the dimensions of the data frame you have 1059 records, not 1192! After trial and error I believe the subset to reproduce the results in the book should be

ok <- !is.na (earn+height+sex) & earn>0

That gave me 1192. For the record, here’s my full code:

library(foreign) # needed for read.dta()
heights <- read.dta("http://www.stat.columbia.edu/~gelman/arm/examples/earnings/heights.dta")
attach(heights)
male <- 2 - sex # make male 1 (and therefore female = 0)
ok <- !is.na (earn+height+sex) & earn>0
heights.clean <- as.data.frame (cbind (earn, height, sex, male)[ok,])
heights.clean$log.earn <- log(heights.clean$earn)

OK, so now(!) we can reproduce their example:

earn.logmodel.3 <- lm (log.earn ~ height + male + height:male, data=heights.clean)

The reason I was interested in this example was for another example in chapter 7 on "using simulation to represent uncertainty in regression coefficients" (p. 142). In other words, instead of using the standard errors and intervals obtained from the predict() function in R, we compute uncertainties by simulation. It seems easy enough using the sim() function from the book's arm package. You do something like the following:

library(arm)
sim.1 <- sim(earn.logmodel.3, 1000)

The sim function takes two arguments: your model and the number of simulations. It returns a vector of simulated residual standard deviations and a matrix of simulated regression coefficients. We can use these to calculate confidence intervals for coefficients that do not have standard errors in the regression output. Their example asks "what can be said about the coefficient of height among men?" If you look up at the model, you can see it contains an interaction term of height and male. To answer the question we can't just look at the standard error of the height coefficient or just the interaction coefficient. There is no simple way to do what they ask using regression output. So the book instructs us to use the output of the sim function to answer this as follows:

height.for.men.coef <- sim.1$beta[,2] + sim.1$beta[,4]
quantile(height.for.men.coef, c(0.025,0.975))

Except that doesn't work. More frustration. It produces an error that says "Error in sim.1$beta : $ operator not defined for this S4 class" (instead of giving us a 95% interval). With some googling and persistence I was able to determine that the following is how it should be done:

height.for.men.coef <- sim.1@coef[,2] + sim.1@coef[,4]
quantile(height.for.men.coef, c(0.025,0.975))
         2.5%         97.5% 
-0.0004378039  0.0507464098 

Notice that "@coef" replaces "$beta". And with that I was able to finally reproduce the example I was interested in!

Now about this simulation function. While I appreciate functions that save time and make life easy, I do like to know how they work. Fortunately Gelman and Hill provide pseudo-code in the book. It goes like this:

  1. Run your regression to compute the vector \( \hat{\beta} \) of estimated parameters, the unscaled estimation covariance matrix \( V_{\beta} \), and the residual variance \( \hat{\sigma}^{2} \)
  2. Create n random simulations for the coefficient vector \( \beta \) and residual standard deviation. For each simulation draw:
    1. Simulate \(\sigma = \hat{\sigma}\sqrt{(n - k)/X} \) where X is a random draw from the \( \chi^{2} \) distribution with n - k degrees of freedom.
    2. Given the random draw of \( \sigma \), simulate \( \beta \) from a multivariate normal distribution with mean \( \hat{\beta} \) and variance matrix \( \sigma^{2}V_{\beta}\)

Not too bad. Let's use this to manually run our own simulations, so we have an idea of how the sim() function works. (Plus you may not want to use the arm package as it requires loading 9 more packages.)

Step 1 is easy enough. That's just running your regression as you normally would. Next we need to extract our estimated parameters, the unscaled covariance matrix and the residual standard deviation. We also need to snag degrees of freedom for our chi-square random draw. Here's how we can get them:

# extract coefficients
earn.logmodel.3$coef

# extract residual standard error from model
summary(earn.logmodel.3)$sigma

# extract unscaled covariance matrix
summary(earn.logmodel.3)$cov.unscaled

# extract k and n - k; first two elements in vector
summary(earn.logmodel.3)$df

Let's use this information to do a single simulation.

s.hat <- summary(earn.logmodel.3)$sigma
n.minus.k <- summary(earn.logmodel.3)$df[2]

library(MASS) # need for mvrnorm function to simulate draws from multivariate normal dist'n

# simulate residual standard deviation
(s.hat.sim <- s.hat*sqrt(n.minus.k/rchisq(1, n.minus.k)))
[1] 0.8591814

# use simulated residual standard deviation to simulate regression coefficients
mvrnorm(1,earn.logmodel.3$coef, s.hat.sim^2*summary(earn.logmodel.3)$cov.unscaled)
(Intercept)       height         male  height:male 
7.906605029  0.025163124 -0.160828921  0.007904422 

That seems to work. How about we try doing 1000 simulations. Here's one way:

n <- 1000
sim.2.sigma <- rep(NA, n)
sim.2.coef <- matrix(NA, n, 4)
for (i in 1:n){
  sim.2.sigma[i] <- s.hat*sqrt(n.minus.k/rchisq(1, n.minus.k))
  sim.2.coef[i,] <- mvrnorm(1, earn.logmodel.3$coef, sim.2.sigma[i]^2*summary(earn.logmodel.3)$cov.unscaled)
}

Now let's see how our simulation results compare to what we got using the sim() function:

height.for.men.coef.2 <- sim.2.coef[,2] + sim.2.coef[,4]
quantile(height.for.men.coef.2, c(0.025,0.975))
        2.5%        97.5% 
-0.001828216  0.049499381 

Looks similar to what we got above. Nice. It's probably better to just use the sim() function for this sort of thing, but at least now we know a little more about what it's doing.

Scraping Virginia Tech Football Data, Part 2

In an earlier post I described how I went about scraping football data off the Virginia Tech athletics web site. Basically I wrote a function that scraped data one season at a time. It worked, but when I decided to return to this project I realized what a pain it would be to do all seasons from 1987 to 2012. I would have to find the ID number for the first game and last game, then run the function for every season. And then do bowl games because the ID number for those games does not follow the pattern of regular season games. It wouldn’t have been unreasonable to just call the function for every season. Typing it all up probably wouldn’t take more than a few minutes. But it’s inefficient so I decided to modify the program to do all seasons at once. Here’s what I came up with.

First I needed to get all the ID numbers. Here’s my code:

allids <- data.frame(season=numeric(0),ID=character(0)) #will hold all box score ids; 322 games from 1987 - 2012
for (i in 1987:2012){
  url <- paste("http://www.hokiesports.com/football/stats/",i,sep="")
  wp <- readLines(url)  
  box <- grep("showstats\\.html\\?[0-9]+",wp,value=TRUE) # lines containing box score links
  ids <- sub(".*?html\\?([0-9]{4,5}).*", "\\1", box)     # pull out the ID number after "?"
  ids <- data.frame(season=i,ID=ids)
  allids <- rbind(allids,ids)
}

So I go to each season's page, read the web page source code, find each line that contains "showstats.html?xxxx" (where xxxx = ID number), and pull the ID number into a vector. That last part requires a fancy pants regular expression, which took me a while to figure out. It basically says "match the whole string against the pattern in double quotes and replace it with just the sub-expression in the parentheses". This is known as "tagging". The part in the parentheses is the tag: ([0-9]{4,5}). It's referenced in the sub function with \\1. For more information, see page 99 of Phil Spector's book, Data Manipulation with R. (There's also a quick illustration of the substitution just below.) Anyway, I then create a two-column data frame called "allids" that contains season and ID numbers:

> head(allids)
  season   ID
1   1987 5803
2   1987 5804
3   1987 5805
4   1987 5806
5   1987 5807
6   1987 5808
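To see what that tagged regular expression is doing, here's a quick illustration on a single made-up line of HTML (not actual hokiesports.com source):

x <- '<a href="showstats.html?12345">Box Score</a>'   # made-up example line
sub(".*?html\\?([0-9]{4,5}).*", "\\1", x)             # whole match replaced by the tagged ID
[1] "12345"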

Now I can use the ID column in my for loop to retrieve drive summaries for every season, like so:

for (i in allids[,2]){
	url <- paste("http://www.hokiesports.com/football/stats/showstats.html?",i,sep="")
	web_page <- readLines(url)

To see the final (for now) version of the code and the CSV file of the data, see my GitHub page for this project. You'll notice I had to build in an error check as it turned out that not all games had drive summaries:

if (length(grep("Virginia Tech Drive Summary", web_page)) == 0) next

That says if grep doesn't find a line with "Virginia Tech Drive Summary" then go to the next iteration in the loop. For some reason the 1994 season only has two games with drive summaries.

So now I have a lot of data and I guess that means I should analyze it. I suppose that will be a future blog post.

Playing with R Markdown

I was playing with R Markdown in R Studio and thought I’d share my results. I can’t believe how easy this is to use! In R Studio, just go File…New…R Markdown. This opens a new template with some helpful code ready to use. This is your Rmd file. It’s basically a script file, except not only can you submit R code from this file, you can also save the output and graphs into one fancy HTML page. This is so much easier than copying-and-pasting R code into a WordPress post and saving/uploading/linking to images created in R. For example, see this post I wrote a while back. It took a long time to make.

Anyway, when you’re done, you click the Knit HTML button and your Rmd file is “knitted” into an HTML file. There’s a little bit of extra markup you need to add to ensure proper formatting, but it’s minimal. Check out this page to see how it works: you just surround chunks of your R code with some simple markup.

So here’s what I did. First I worked through the Introductory session of Modern Applied Statistics with S (MASS) by Venables and Ripley. Here is my Rmd file and final output. Next I worked problem 4 from chapter 3 from Data Analysis Using Regression and Multilevel/Hierarchical Models by Gelman and Hill. Here’s the Rmd file and final output. The final output links are the cool parts. Those are just HTML files with embedded images. I uploaded the Rmd files so you could see the marked-up R code. As you’ll see, there’s not much there. If you want your R code and associated output and graphs to be nicely formatted in HTML, just surround it with

```{r}
# ...your R code...
```

That’s it. You can also create headers using double-hashtags.
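For example, a minimal piece of an Rmd file might look like this (the header text and chunk contents are just filler I made up):

## A header made with double hashtags

```{r}
summary(cars)
plot(cars)
```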

A Logistic Regression Checklist

I recently read The Checklist Manifesto by Atul Gawande and was fascinated by how relatively simple checklists can improve performance and results in such complex endeavors as surgery or flying a commercial airplane. I decided I wanted to make a checklist of my own for logistic regression. It ended up not being a checklist on how to do it per se, but rather a list of important facts to remember. Here’s what I came up with.

  • Logistic regression models the probability that y = 1, \( P(y_{i} = 1) = logit^{-1}(X_{i}\beta) \) where \( logit^{-1} = \frac{e^{x}}{1+e^{x}} \)
  • Logistic predictions are probabilistic. It predicts a probability that y = 1. It does not make a point prediction.
  • The function \( logit^{-1} = \frac{e^{x}}{1+e^{x}} \) transforms continuous values to the range (0,1).
  • Dividing a regression coefficient by 4 will give an upper bound of the predictive difference corresponding to a unit difference in x. For example if \(\beta = 0.33\), then \(0.33/4 = 0.08\). This means a unit increase in x corresponds to no more than an 8% positive difference in the probability that y = 1.
  • The odds of success (i.e., y = 1) increase multiplicatively by \( e^{\beta} \) for every one-unit increase in x. That is, exponentiating logistic regression coefficients can be interpreted as odds ratios. For example, let’s say we have a regression coefficient of 0.497. Exponentiating gives \( e^{0.497} = 1.64 \). That means the odds of success increase by 64% for each one-unit increase in x. Recall that odds = \( \frac{p}{1-p} \). If our predicted probability at x is 0.674, then the odds of success are \(\frac{0.674}{0.326} = 2.07 \). Therefore at x + 1, the odds will increase by 64% from 2.07 to \(2.07(1.64) = 3.40\). Notice that \( 1.64 = \frac{3.40}{2.07}\), which is an odds ratio. The ratio of odds at x + 1 to odds at x will always be \( e^{\beta}\), where \( \beta\) is a logistic regression coefficient.
  • Plots of raw residuals from logistic regression are generally not useful. Instead it’s preferable to plot binned residuals “by dividing the data into categories (bins) based on their fitted values, and then plotting the average residual versus the average fitted value for each bin.” (Gelman & Hill, p. 97). Example R code for doing this can be found here.
  • The error rate is the proportion of cases for which the model’s prediction is wrong. We predict y = 1 when the fitted probability exceeds 0.5 and y = 0 otherwise, and then count how often that prediction disagrees with the observed outcome. It’s not good if your error rate equals the null rate, which is the error rate you’d get by always guessing the more common outcome. In other words, if most of your cases are y = 1 and you guessed that all cases are y = 1, the null rate is the percentage you’d guess wrong. Let’s say your data has 58% of y = 1 and 42% of y = 0; then the null rate is 42%. Further, let’s say you do some logistic regression on this data and your model has an error rate of 36%. That is, 36% of the time it predicts the wrong outcome. This means your model does only 6 percentage points better than simply guessing that all cases are y = 1. (See the short sketch after this list for these calculations in R.)
  • Deviance is a measure of error. Lower deviance is better. When an informative predictor is added to a model, we expect deviance to decrease by more than 1. If not, then we’ve likely added a non-informative predictor to the model that just adds noise.
  • If a predictor x is completely aligned with the outcome so that y = 1 when x is above some threshold and y = 0 when x is below some threshold, then the coefficient estimate will explode to some gigantic value. This means the parameter cannot be estimated. This is an identifiability problem called separation.
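To make a few of these facts concrete, here's a small sketch on simulated data (the sample size and "true" coefficients below are made up):

invlogit <- function(x) exp(x) / (1 + exp(x))   # the logit^{-1} function

set.seed(1)
x <- rnorm(500)
y <- rbinom(500, 1, invlogit(0.5 + 1.2 * x))    # arbitrary true intercept and slope

fit <- glm(y ~ x, family = binomial)
b <- coef(fit)["x"]
b / 4        # divide-by-4 rule: rough upper bound on the change in P(y = 1) per unit of x
exp(b)       # odds ratio: multiplicative change in the odds per one-unit increase in x

p.hat <- fitted(fit)
error.rate <- mean((p.hat > 0.5) != y)    # proportion of wrong 0/1 predictions
null.rate  <- min(mean(y), 1 - mean(y))   # error rate of always guessing the more common outcome
c(error.rate = error.rate, null.rate = null.rate)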

Most of this comes from Chapter 5 of Data Analysis Using Regression and Multilevel/Hierarchical Models by Gelman and Hill. I also pulled a little from chapter 5 of An Introduction to Categorical Data Analysis by Agresti.