Monthly Archives: May 2024

Hypothesis testing: section 4.6 of Regression and Other Stories

Gelman, et al. tell a story from long ago where someone sent them a fax (that’s right, a fax) asking for help with suspected voter fraud. The story is in section 4.6 (page 63) and is included to provide an example of constructing a hypothesis test. They provide data and code for this example in the Coop folder on GitHub. The point of this post is to document some changes I made to the code to help me understand it.

The story involves the election of a board of directors for a “residential organization”. 5553 people were allowed to vote for up to 6 people. 27 candidates were running for the board. Votes were tallied after 600 people voted, then again at 1200, 2444, 3444, 4444, and the end after all 5553 people voted. What aroused suspicion was the fact that the proportion of votes for the candidates remained steady each time the votes were tallied. According to the author of the fax: “the election was rigged…[it] is a fixed vote with fixed percentages being assigned to each and every candidate making it impossible to participate in an honest election.”

Let’s read in the data and demonstrate what they’re talking about. Notice this data is the rare CSV without column headers. The data consists of 27 rows, one for each candidate, showing cumulative vote totals.

data <- read.csv("https://raw.githubusercontent.com/avehtari/ROS-Examples/master/Coop/data/Riverbay.csv", 
                 header = FALSE)
# drop the 1st and 8th columns; they contain candidate names, which we don't need
votes <- data[,2:7]
head(votes)
##    V2  V3  V4   V5   V6   V7
## 1 208 416 867 1259 1610 2020
## 2  55 106 215  313  401  505
## 3 133 250 505  716  902 1129
## 4 101 202 406  589  787  976
## 5 108 249 512  745  970 1192
## 6  54  94 196  279  360  451

Now let’s calculate the proportion of votes received at each interval and create a basic line plot. Each line below represents the proportion of votes received by a candidate at each of the six intervals. Notice how the lines are mostly flat. This is what prompted the emergency fax.

vote_p <- apply(votes, 2, proportions)
matplot(t(vote_p), type = "l", col = 1, lty = 1)

Gelman, et al. demonstrate this using separate plots for the top 8 vote-getters (Fig 4.5). They also divide by number of voters instead of total votes received. (Remember, each voter gets to vote for up to six people.) This simply changes the denominator, and hence, the y-axis. The steady vote patterns remain.

voters <- c(600,1200,2444,3444,4444,5553)
vote_p <- sweep(votes, 2, voters, FUN = "/")
matplot(t(vote_p), type = "l", col = 1, lty = 1)

They note that the data in this plot is not independent since proportions at times 2 and beyond include votes that came before. To address this, they create a matrix that contains the number of votes received in each interval instead of cumulative totals.

interval_votes <- t(apply(votes, 1, diff))
interval_votes <- cbind(votes[,1], interval_votes)
head(interval_votes)
##           V3  V4  V5  V6  V7
## [1,] 208 208 451 392 351 410
## [2,]  55  51 109  98  88 104
## [3,] 133 117 255 211 186 227
## [4,] 101 101 204 183 198 189
## [5,] 108 141 263 233 225 222
## [6,]  54  40 102  83  81  91

After taking differences the lines still seem mostly stable.

interval_p <- apply(interval_votes, 2, proportions)
matplot(t(interval_p), type = "l", col = 1, lty = 1)

Again, the authors divide by number of voters instead of total votes to create these plots, but the result is the same with a different y-axis. Here’s how I would do the calculations and create the plot.

interval_voters <- c(600, diff(voters))
interval_p <- sweep(interval_votes, 2, interval_voters, FUN = "/")
matplot(t(interval_p), type = "l", col = 1, lty = 1)

And now comes the hypothesis test. What is the probability of seeing steady proportions like this if the votes really were coming in at random? I’ll quote the book here: “Because the concern was that the votes were unexpectedly stable as the count proceeded, we define a test statistic to summarize variability.” The test statistic in this case is, for each candidate, the standard deviation of their sample proportions across the six intervals. We can quickly get these from the interval_p object we created above.

test_stat <- apply(interval_p, 1, sd)

Now we need to calculate the theoretical test statistic. For this we assume each candidate has a fixed but unknown proportion of voters who will vote for them, \(\pi_i\). Under the null, the six intervals where votes are tallied are random samples of the voters. So at each time point we can think of the proportion as a draw from a distribution with mean \(\pi_i\) and standard deviation \(\sqrt{\pi_i(1 - \pi_i)/n_t}\), where \(n_t\) is the number of voters at each interval. To calculate this, we first need to estimate \(\pi_i\) with \(p_i\), the observed proportion of votes each candidate received. This is the last column of the votes data frame divided by the total number of voters, 5553.

p_hat <- votes[,6]/5553

Then we take the average of the variances calculated at each time point and take the square root to get the theoretical test statistic.

theory_test_stat <- sapply(p_hat, function(x)sqrt(mean(x*(1-x)/interval_voters)))
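As a quick sanity check (my addition, reusing the objects created above), we can work the formula out by hand for the first candidate; the two values should match test_stat[1] and theory_test_stat[1].

# candidate 1: observed sd of interval proportions vs. the sd implied by the binomial model
c(observed    = sd(interval_p[1, ]),
  theoretical = sqrt(mean(p_hat[1] * (1 - p_hat[1]) / interval_voters)))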

Under the null, the observed test statistics should be very close to the theoretical test statistics. This is assessed in Fig 4.7 in the book. I replicate the plot as follows:

plot(x = votes[,6], y = test_stat, xlab = "total # of votes for the candidate",
     ylab = "sd of separate vote proportions")
points(x = votes[,6], y = theory_test_stat, pch = 19)

The authors note that “the actual standard deviations appear consistent with the theoretical model.”

Personally, I think the plot would be a little more effective if they zoomed out. Some of the dramatic-looking departures are only off by about 0.01. For example:

plot(x = votes[,6], y = test_stat, xlab = "total # of votes for the candidate",
     ylab = "sd of separate vote proportions", ylim = c(0,0.05))
points(x = votes[,6], y = theory_test_stat, pch = 19)

Another null-hypothesis approach is the chi-square test of association. Under the null, the number of votes is not associated with the interval when votes were tallied. We can run this test for each candidate and look at the p-values. If there is no association for any candidate, we should see a fairly uniform scatter of p-values. On the other hand, if there were “suspiciously little variation over time,” we would see a surplus of high p-values. Here’s how I carried out these calculations. I first created the 2-way tables of yes/no versus time for each candidate, then applied the chi-square test to each table and extracted the p-value from each result. A uniform QQ plot shows the p-values are mostly uniformly distributed.

tables <- apply(interval_votes, 1, function(x) rbind(x, interval_voters - x), 
      simplify = FALSE)
chisq_out <- lapply(tables, chisq.test, correct = FALSE)
p_values <- sapply(chisq_out, function(x)x$p.value)
qqplot(ppoints(27), p_values)
qqline(p_values, distribution = qunif)
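To make the table construction concrete, here’s a look at the first candidate’s table and test (my addition, assuming the objects above). The first row is the number of interval voters who voted for the candidate; the second row is the number who did not.

tables[[1]]     # 2 x 6 table of yes/no votes by interval for candidate 1
chisq_out[[1]]  # chi-square test of association for that table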

Finally the authors mention that a single test on the entire 27 x 6 table could be performed. This seems like the easiest approach of all.

chisq.test(interval_votes, correct = F)
## 
##  Pearson's Chi-squared test
## 
## data:  interval_votes
## X-squared = 114.72, df = 130, p-value = 0.8279

My R code differs quite a bit from the R code provided by the authors. I’m not saying mine is better; it just makes more sense to me. Maybe someone else will find this approach useful.

Parametric Bootstrap of Kolmogorov–Smirnov Test

Zeimbekakis, et al. recently published an article in The American Statistician titled On Misuses of the Kolmogorov–Smirnov Test for One-Sample Goodness-of-Fit. One of the misuses they discuss is using the KS test with parameters estimated from the sample. For example, let’s sample some data from a normal distribution.

x <- rnorm(200, mean = 8, sd = 8)
c(xbar = mean(x), s = sd(x))
##     xbar        s 
## 8.333385 7.979586

If we wanted to assess the goodness-of-fit of this sample to a normal distribution, the following is a bad way to use the KS test:

ks.test(x, "pnorm", mean(x), sd(x))
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.040561, p-value = 0.8972
## alternative hypothesis: two-sided

The appropriate way to use the KS test is to actually supply hypothesized parameters. For example:

ks.test(x, "pnorm", 8, 8)
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.034639, p-value = 0.9701
## alternative hypothesis: two-sided

The conclusions of both tests are the same: we fail to reject the null hypothesis that the sample is from a normal distribution with the stated mean and standard deviation. However, the former test is very conservative. Zeimbekakis, et al. show this via simulation; I present a simplified version below. The basic idea is that if the test were valid, the p-values would be uniformly distributed and the points in the uniform distribution QQ-plot would fall along a diagonal line. Clearly that’s not the case.

n <- 200
rout <- replicate(n = 1000, expr = {
  x <- rnorm(n, 8 , 8)
  xbar <- mean(x)
  s <- sd(x)
  ks.test(x, "pnorm", xbar, s)$p.value
})
hist(rout, main = "Histogram of p-values")

qqplot(x = ppoints(n), y = rout, main = "Uniform QQ-plot")
qqline(rout, distribution = qunif)

Conclusion: using fitted parameters in place of the true parameters in the KS test yields conservative results. The authors state in the abstract that this “has been ‘discovered’ multiple times.”

When done the right way, the KS test yields uniformly distributed p-values.

rout2 <- replicate(n = 1000, expr = {
  x <- rnorm(n, 8 , 8)
  ks.test(x, "pnorm", 8, 8)$p.value
})
hist(rout2)

qqplot(x = ppoints(n), y = rout2, main = "Uniform QQ-plot")
qqline(rout2, distribution = qunif)

Obviously it’s difficult to know which parameters to supply to the KS test. Above we knew to supply 8 as the mean and standard deviation because that’s what we used to generate the data. But what to do in real life? Zeimbekakis, et al. propose a parametric bootstrap to approximate the null distribution of the KS test statistic. The steps to implement the bootstrap are as follows:

  1. draw a random sample from the fitted distribution
  2. get estimates of parameters of random sample
  3. obtain the empirical distribution function
  4. calculate the bootstrapped KS statistic
  5. repeat steps 1 – 4 many times

Let’s do it. Notice the authors use MASS::fitdistr() to obtain MLE parameter estimates. For the normal distribution this returns the same mean as mean(x), but a slightly smaller (i.e., biased) estimate of the standard deviation, because the MLE divides by n rather than n - 1.
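A quick check of that claim, using the x sampled earlier (this snippet is mine, not part of the authors’ code):

n <- length(x)
c(mle_sd    = unname(MASS::fitdistr(x, "normal")$estimate["sd"]),
  sample_sd = sd(x),
  rescaled  = sd(x) * sqrt((n - 1)/n))  # the MLE equals sd(x) shrunk by sqrt((n-1)/n)

With that noted, the following code is a simplified version of what the authors provide with the paper.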

param  <- MASS::fitdistr(x, "normal")$estimate
ks <- ks.test(x, function(x)pnorm(x, param[1], param[2]))
stat <- ks$statistic
B <- 1000
stat.b <- double(B)
n <- length(x)

## bootstrapping
for (i in 1:B) {
  # (1) draw a random sample from a fitted dist
  x.b <- rnorm(n, param[1], param[2])
  # (2) get estimates of parameters of random sample
  fitted.b <- MASS::fitdistr(x.b, "normal")$estimate
  # (3) get empirical distribution function
  Fn <- function(x)pnorm(x, fitted.b[1], fitted.b[2])
  # (4) calculate bootstrap KS statistic
  stat.b[i] <- ks.test(x.b, Fn)$statistic
}
mean(stat.b >= stat)
## [1] 0.61

The p-value is the proportion of statistics greater than or equal to the observed statistic calculated with estimated parameters.

Let’s turn this into a function and show that it returns uniformly distributed p-values when used with multiple samples. Again this is a simplified version of the R code the authors generously shared with their paper.

ks.boot <- function(x, B = 1000){
  param  <- MASS::fitdistr(x, "normal")$estimate
  ks <- ks.test(x, function(k)pnorm(k, param[1], param[2]))
  stat <- ks$statistic
  stat.b <- double(B)
  n <- length(x)
  for (i in 1:B) {
    x.b <- rnorm(n, param[1], param[2])
    fitted.b <- MASS::fitdistr(x.b, "normal")$estimate
    Fn <- function(x)pnorm(x, fitted.b[1], fitted.b[2])
    stat.b[i] <- ks.test(x.b, Fn)$statistic
  }
  mean(stat.b >= stat)
}

Now replicate the function with many samples. This takes a while to run; it took my Windows 11 PC with an Intel i7 chip about 100 seconds.

rout_boot <- replicate(n = 1000, expr = {
  x <- rnorm(n, 8 , 8)
  ks.boot(x)
})
hist(rout_boot)

qqplot(x = ppoints(n), y = rout_boot, main = "Uniform QQ-plot")
qqline(rout_boot, distribution = qunif)

Age Adjustment: section 2.4 of Regression and Other Stories

In section 2.4 of Regression and Other Stories, Gelman, et al. explain the necessity of age adjustment when investigating mortality rates. The book is freely available as a PDF online and the section of interest is on pages 31-33. Upon first reading, I had trouble understanding what they were doing. In particular, I didn’t follow Figure 2.12. I’m aware that says more about me than about the authors. Fortunately, there’s a footnote in the book that says all data and code are available in the AgePeriodCohort folder on GitHub. Good, I thought, I’ll look at the code and figure out what’s going on. Famous last words.

The R script that performs the age adjustment is births.R. It clocks in at over 400 lines and has practically no comments outside of the occasional “Sum it up.” As you run the code, you’ll find the script generates several plots not in the book. In addition, the plots that are in the book are generated in a different order. Trying to parse the R code to help me understand the exposition was frustrating. But I persisted.

Reading the bibliographic note at the end of the chapter revealed that the age adjustment example was first discussed on Gelman’s blog. In the blog post he walks through the process of age adjustment, creates the same plots that appear in the book, and provides the R code. This is basically the births.R script. He says at the end, “the code is ugly. Don’t model your code after my practices! If any of you want to make a statistics lesson out of this episode, I recommend you clean the code.”

This blog post is my statistics lesson trying to understand and clean this code.

Fig 2.11 (a)

The data apparently come from the CDC, but I’m using the data file Gelman provides with his R code. The data show the number of deaths per age per gender per year for white non-Hispanics in the US. For example, the first row shows 1291 female deaths (Male = 0) in 1999 for those who were 35 years old. The total population of 35-year-old women in 1999 was 1,578,829. The rate is 1291/1,578,829 x 100,000 = 81.8, or about 82 deaths per 100,000.

data <- read.table("white_nonhisp_death_rates_from_1999_to_2013_by_sex.txt", 
                   header=TRUE)
head(data)
##   Age Male Year Deaths Population Rate
## 1  35    0 1999   1291    1578829 81.8
## 2  35    0 2000   1264    1528463 82.7
## 3  35    0 2001   1186    1377466 86.1
## 4  35    0 2002   1194    1333639 89.5
## 5  35    0 2003   1166    1302188 89.5
## 6  35    0 2004   1166    1325435 88.0
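As a quick check (my calculation, not part of the provided code), the Rate column is just Deaths divided by Population, scaled to deaths per 100,000:

with(data[1, ], Deaths / Population * 1e5)  # about 81.8, matching the Rate in row 1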

The first plot is the mortality rate of the 45-54 age group from 1999 to 2013. We first sum both Deaths and Population by year and then calculate the mortality rate by dividing Deaths by Population. This is a nice opportunity to use the base R pipe operator, |>.

aggregate(cbind(Deaths, Population) ~ Year, data = data, FUN = sum, 
          subset = Age %in% 45:54) |> 
  transform(Rate = Deaths/Population) |> 
  plot(Rate ~ Year, data = _, type = "l", ylab = "Death Rate")

This is the third plot in Gelman’s blog post titled “So take the ratio!”

Fig 2.11 (b)

The second plot shows the average age of the 45-54 group increasing as the baby boomers move through. First we sum the Population by Age and Year for the 45-54 group. Then we take the mean age per year, weighted by the population.

aggregate(Population ~ Age + Year, data = data, sum, 
          subset = Age %in% 45:54) |> 
  aggregate(Population ~ Year, data = _, 
            function(x)weighted.mean(45:54, x)) |> 
  plot(Population ~ Year, data = _, type = "l")

To help make this clear, let’s find the mean age of the 45-54 group in 1999. First find the population for each age in 1999:

tmp <- aggregate(Population ~ Age + Year, data = data, sum, 
          subset = Age %in% 45:54 & Year == 1999)
tmp$Population
##  [1] 3166393 3007083 2986252 2805975 2859406 2868751 2804957 3093631 2148382
## [10] 2254975

To find the mean age of the 45-54 group in 1999, we need to weight each age with the population. We can do that with the weighted.mean() function.

weighted.mean(45:54, tmp$Population)
## [1] 49.25585

The code above does this for 1999-2013. I think it’s worth noting that while the plot looks dramatic, the average age only increases from about 49.2 to 49.7. But I suppose when you’re dealing with millions of people that increase makes a difference.
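To see the actual numbers instead of eyeballing the plot, we can save the aggregated result and look at its range (a small addition of mine, reusing the pipeline above):

mean_age <- aggregate(Population ~ Age + Year, data = data, sum, 
                      subset = Age %in% 45:54) |> 
  aggregate(Population ~ Year, data = _, 
            function(x)weighted.mean(45:54, x))
range(mean_age$Population)  # increases from roughly 49.2 to 49.7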

This is the fourth plot in Gelman’s blog post titled “But the average age in this group is going up!”

Fig 2.11 (c)

This is where I began to struggle when reading the book.

This figure is titled “The trend in raw death rates since 2005 can be explained by age-aggregation bias”. It is the eighth plot in the blog post, where it gets a bit more motivation. Let’s recreate the plots in the blog post that lead up to it.

The first one to recreate is the sixth plot in the blog post. It’s basically the previous plot rescaled as a rate. It’s created by first calculating the age-specific death rates in 1999 and then, for each year, taking the weighted mean of those rates using that year’s population at each age as the weights.

dr1999 <- aggregate(cbind(Deaths, Population) ~ Age, data = data, FUN = sum, 
          subset = Age %in% 45:54 & Year == 1999) |> 
  transform(Rate = Deaths/Population) 

# Now create plot
aggregate(Population ~ Age + Year, data = data, sum, 
          subset = Age %in% 45:54) |> 
  aggregate(Population ~ Year, data = _, 
            function(x)weighted.mean(dr1999$Rate, x)) |> 
  plot(Population ~ Year, data = _, type = "l", ylab = "Reconstructed death rate")

Next he combines this plot with the plot of the raw death rate (Fig 2.11 (a)). This is the seventh plot in the blog post.

years <- 1999:2013

Raw <- aggregate(cbind(Deaths, Population) ~ Year, data = data, FUN = sum, 
          subset = Age %in% 45:54) |> 
  transform(Rate = Deaths/Population)

Expected <- aggregate(Population ~ Age + Year, data = data, sum, 
          subset = Age %in% 45:54) |> 
  aggregate(Population ~ Year, data = _, 
            function(x)weighted.mean(dr1999$Rate, x))

plot(years, Raw$Rate, type="l", ylab="Death rate for 45-54 non-Hisp whites")
lines(years, Expected$Population, col="green4")
text(2002.5, .00404, "Raw death rate", cex=.8)
text(2009, .00394, "Expected just from\nage shift", col="green4", cex=.8)

Then finally he says, “We can sharpen this comparison by anchoring the expected-trend-in-death-rate-just-from-changing-age-composition graph at 2013, the end of the time series, instead of 1999.” This means we calculate the age-specific death rates in 2013 and then, for each year, take the weighted mean of those rates using that year’s population at each age as the weights. This is the dr2013 data frame. Then we create the same plot as above, except now using the 2013 death rates.

dr2013 <- aggregate(cbind(Deaths, Population) ~ Age, data = data, FUN = sum, 
                    subset = Age %in% 45:54 & Year == 2013) |> 
  transform(Rate = Deaths/Population) 

Raw <- aggregate(cbind(Deaths, Population) ~ Year, data = data, FUN = sum, 
                 subset = Age %in% 45:54) |> 
  transform(Rate = Deaths/Population)

Expected <- aggregate(Population ~ Age + Year, data = data, sum, 
                      subset = Age %in% 45:54) |> 
  aggregate(Population ~ Year, data = _, 
            function(x)weighted.mean(dr2013$Rate, x))

plot(years, Raw$Rate,  type="l", 
     ylab="Death rate for 45-54 non-Hisp whites")
lines(years, Expected$Population, col="green4")
text(2002.5, 0.00395, "Raw death rate", cex=.8)
text(2002, .00409, "Expected just from\nage shift", col="green4", cex=.8)

Gelman notes, “since 2003, all the changes in raw death rate in this group can be explained by changes in age composition.”

Fig 2.12 (a)

This is the first plot showing age-adjusted death rates. Gelman explains this as follows in his blog post: “for each year in time, we take the death rates by year of age and average them, thus computing the death rate that would’ve been observed had the population distribution of 45-54-year-olds been completely flat each year.” The book calls it “the simplest such adjustment, normalizing each year to a hypothetical uniformly distributed population in which the number of people is equal at each age from 45 through 54.” I found this latter explanation a little confusing.

To create this plot we first sum Deaths and Populations by age and year for the 45-54 age group, then calculate the death rate, and then simply take the mean rate by year. That’s it. Gelman takes the additional step of rescaling the rate so that the rate is 1 in 1999.

aggregate(cbind(Deaths, Population) ~ Age + Year, data = data, sum, 
          subset = Age %in% 45:54) |> 
  transform(Rate = Deaths/Population) |> 
  aggregate(Rate ~ Year, data = _, mean) |> 
  transform(AA_Rate = Rate/Rate[1]) |>   # relative to 1999
  plot(AA_Rate ~ Year, data = _, type = "l", 
       ylab = "age-adjusted death rate, relative to 1999")

Fig 2.12 (b)

In the book, this plot shows two different age adjustments, even though the exposition says there are three. They’re probably referring to the original blog post plot, which does show three. I recreate the plot in the blog post, which is the second-to-last plot.

This plot shows (1) age-adjustment using the simple mean of rates, (i.e., the plot above), (2) age-adjustment using the distribution of ages in 1999, and (3) age-adjustment using the distribution of ages in 2013.

This plot requires the most work of all. First we need to get the total population for all ages in 1999 and 2013. These are used to make the age adjustments.

pop1999 <- aggregate(Population ~ Age, data = data, 
                     subset = Year == 1999 & Age %in% 45:54, sum)[["Population"]]
pop2013 <- aggregate(Population ~ Age, data = data, 
                     subset = Year == 2013 & Age %in% 45:54, sum)[["Population"]]

Next we calculate age-adjusted rates using the population distributions from 1999 and 2013. Again we sum Deaths and Populations by age and year for the 45-54 age group and calculate the death rate. Then we calculate the average rate by year using the population distributions to calculate a weighted mean.

# age-adjustment from Fig 2.12 (a)
aa_rate_uniform <- aggregate(cbind(Deaths, Population) ~ Age + Year, 
                             data = data, sum, 
                             subset = Age %in% 45:54) |> 
  transform(Rate = Deaths/Population) |> 
  aggregate(Rate ~ Year, data = _, mean) |> 
  transform(AA_Rate = Rate/Rate[1])

aa_rate_1999 <- aggregate(cbind(Deaths, Population) ~ Age + Year, 
                          data = data, sum,
                          subset = Age %in% 45:54) |> 
  transform(Rate = Deaths/Population) |> 
  aggregate(Rate ~ Year, data = _, function(x)weighted.mean(x, pop1999))

aa_rate_2013 <- aggregate(cbind(Deaths, Population) ~ Age + Year, 
                          data = data, sum, 
                          subset = Age %in% 45:54) |> 
  transform(Rate = Deaths/Population) |> 
  aggregate(Rate ~ Year, data = _, function(x)weighted.mean(x, pop2013))

Now we can make the plot. Notice we find the range of all the data to help set the limits of the y-axis. Also notice we rescale the plot so all lines begin at 1.

rng <- range(aa_rate_uniform$Rate/aa_rate_uniform$Rate[1], 
             aa_rate_1999$Rate/aa_rate_1999$Rate[1], 
             aa_rate_2013$Rate/aa_rate_2013$Rate[1])
plot(years, aa_rate_uniform$Rate/aa_rate_uniform$Rate[1], type = "l", ylim=rng,
     ylab = "age-adjusted death rate, relative to 1999")
lines(years, aa_rate_1999$Rate/aa_rate_1999$Rate[1], lty=2)
lines(years, aa_rate_2013$Rate/aa_rate_2013$Rate[1], lty=3)
text(2003, 1.053, "Using 1999\nage dist", cex=.8)
text(2004, 1.032, "Using 2013\nage dist", cex=.8)

The point of this plot is to demonstrate it doesn’t matter how the age-adjustment is done.
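If you want to put numbers on “doesn’t matter,” one option (my addition, assuming the aa_rate_* objects above) is to bind the three rescaled series together and look at their year-by-year spread:

adj <- cbind(uniform  = aa_rate_uniform$Rate/aa_rate_uniform$Rate[1],
             dist1999 = aa_rate_1999$Rate/aa_rate_1999$Rate[1],
             dist2013 = aa_rate_2013$Rate/aa_rate_2013$Rate[1])
rownames(adj) <- years
round(apply(adj, 1, function(x) diff(range(x))), 3)  # spread among the three adjustments, by year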

Fig 2.12 (c)

The final plot shows age-adjusted death rates broken down by sex. This is basically the same code as Fig 2.12 (a) but with Male included in the calls to aggregate(). To rescale the y-axis so it starts at 1, we need to divide each vector of rates by its respective 1999 value.

aa_rate_sex <- aggregate(cbind(Deaths, Population) ~ Age + Year + Male, 
                         data = data, sum, 
          subset = Age %in% 45:54) |> 
  transform(Rate = Deaths/Population) |> 
  aggregate(Rate ~ Year + Male, data = _, mean) 

plot(years, aa_rate_sex$Rate[aa_rate_sex$Male == 0]/
       aa_rate_sex$Rate[aa_rate_sex$Year == 1999 & aa_rate_sex$Male == 0], 
     col="red", type = "l",
     ylab = "Death rate relative to 1999")
lines(years, aa_rate_sex$Rate[aa_rate_sex$Male == 1]/
        aa_rate_sex$Rate[aa_rate_sex$Year == 1999 & aa_rate_sex$Male == 1], 
      col="blue", type = "l")
text(2011.5, 1.075, "Women", col="red")
text(2010.5, 1.02, "Men", col="blue")

Gelman called his code “ugly”, but it’s his code and he understands it. I don’t claim my code is any better, but it’s my code and I understand it.