Tag Archives: regression and other stories

US Names: section 2.3 of Regression and Other Stories

In section 2.3 of Regression and Other Stories, Gelman et al. discuss using graphs to learn more about data. The third example involves the exploration of names in the United States from 1880 to 2010. In this post I recreate figures 2.6 – 2.9 using R code that makes more sense to me. The original data and code are in this GitHub repo.

First read in the data and drop the superfluous first column which just contains row numbers:

allnames <- read.csv("https://github.com/avehtari/ROS-Examples/raw/master/Names/data/allnames_clean.csv")
allnames$X <- NULL
dim(allnames)
## [1] 98012   133

Notice this data set is quite large and “wide” with over 98,000 rows and 133 columns. There is one row per name per sex with counts per year spanning from 1880 to 2010. Here’s a portion of the data:

allnames[1:3,1:6]
##   name sex X1880 X1881 X1882 X1883
## 1 Mary   F  7065  6919  8149  8012
## 2 Anna   F  2604  2698  3143  3306
## 3 Emma   F  2003  2034  2303  2367

Apparently 7,065 girls were named Mary in 1880, and then 6,919 in 1881, and so on.

Figs 2.6 and 2.7 show the distribution of the last letters of boys’ first names for the years 1906, 1956, and 2006. I wrote a function to create the plots for a given year.

last_letter <- function(year){
  xyear <- paste0("X", year)
  d <- subset(allnames, sex == "M", select = c("name", xyear))
  n <- nchar(d$name)
  d$last <- substr(d$name, n, n)
  d$last <- factor(d$last, levels = letters, labels = letters)
  p_data <- aggregate(d[[xyear]], by = list(d[["last"]]), sum)
  names(p_data) <- c("letter", "count")
  p_data$p <- p_data[["count"]]/sum(p_data[["count"]])
  barplot(p ~ letter, data = p_data, 
          main = paste("Last letter of boys' names in", year))
}

Now create the plots:

last_letter(1906)

last_letter(1956)

last_letter(2006)

The 2006 plot shows a trend in giving boys names that end with the letter “n”, such as Ethan, Jayden, Aiden, Mason, and Logan.

Fig 2.8 looks at the distribution of the letters of boys’ last names over time. To create this plot I found it helpful to use the {tidyr} and {dplyr} packages. First I reshaped the data to “long” format using pivot_longer() so there is one row per name per sex per year. Then I removed the “X” from the year value and once again extracted the last letter of the last names using the nchar() and substr() functions.

library(tidyr)
allnames_long <- pivot_longer(allnames, cols = X1880:X2010, 
                              names_to = "year", values_to = "count")
allnames_long$year <- sub("X", "", allnames_long$year)
n <- nchar(allnames_long$name)
allnames_long$last <- substr(allnames_long$name, n, n)
allnames_long$last <- factor(allnames_long$last, 
                             levels = letters, labels = letters)

Next I used some {dplyr} functions to calculate the proportions of boys’ names ending in one of the 26 English letters by year. In the last step I create a new variable called “last2” to indicate if the letter was an N, D or Y. This is to help create a plot matching figure 2.8

library(dplyr)
last_letter_males <- allnames_long |> 
  filter(sex == "M") |> 
  group_by(year, last) |> 
  summarise(letter_count = sum(count)) |> 
  mutate(p = letter_count/sum(letter_count),
         last2 = case_when(last == "n" ~ "n",
                           last == "d" ~ "d",
                           last == "y" ~ "y",
                           .default = "other"))

Before I create the plot, I cut the data into two sets: one with letters N, D and Y; and one with the rest of the letters.

ndy <- subset(last_letter_males, last2 != "other")
other <- subset(last_letter_males, last2 == "other")

And now to create the plot using {ggplot2}. Using two data sets allows me to tweak the lines for the letters N, D and Y The “Set2” palette is a good all-purpose color blind friendly palette.

library(ggplot2)
ggplot() +
  aes(x = year, y = p, group = last) +
  geom_line(data = other, alpha = 1/5) +
  geom_line(mapping = aes(color = last2), data = ndy, linewidth = 1.5) +
  scale_x_discrete(breaks = c("1900", "1950", "2000")) +
  scale_color_brewer(palette = "Set2") +
  theme_minimal() +
  labs(color = "letter")

Names ending in D and Y had their moments in the sun, but now it’s all about names ending in N.

Figure 2.9 displays trends in the concentrations of names in the top 10. This was the easiest plot to create. Again I used some {dplyr} reasoning to create the data. After I calculate proportions of counts by year and sex, I sort the data in descending order, slice off the top 10 names, and then sum the proportions by year and sex.

names_conc <- allnames_long |>
  group_by(year, sex) |> 
  mutate(p = count/sum(count)) |> 
  arrange(desc(p)) |> 
  slice_head(n = 10) |> 
  summarize(total_p = sum(p))

And here’s the plot:

ggplot(names_conc) + 
  aes(x = year, y = total_p, group = sex, color = sex) +
  geom_line() +
  scale_x_discrete(breaks = c("1900", "1950", "2000")) +
  scale_color_brewer(palette = "Set2") +
  ylim(c(0, 0.45)) +
  theme_minimal()

Pre 1900, about 40% of all boys’ names were in the top 10. Fast forward 100 years to 2000, only about 10% of names, boy or girl, are in the top 10. We see much more variability in names in the 21st century.

Hypothesis testing: section 4.6 of Regression and Other Stories

Gelman, et al. tell a story from long ago where someone sent them a fax (that’s right, a fax) asking for help with suspected voter fraud. The story is in section 4.6 (page 63) and is included to provide an example of constructing a hypothesis test. They provide data and code for this example in the Coop folder on Github. The point of this post is to document some changes I made to the code to help me understand it.

The story involves the election of a board of directors for a “residential organization”. 5553 people were allowed to vote for up to 6 people. 27 candidates were running for the board. Votes were tallied after 600 people voted, then again at 1200, 2444, 3444, 4444, and the end after all 5553 people voted. What aroused suspicion was the fact that the proportion of votes for the candidates remained steady each time the votes were tallied. According to the author of the fax: “the election was rigged…[it] is a fixed vote with fixed percentages being assigned to each and every candidate making it impossible to participate in an honest election.”

Let’s read in the data and demonstrate what they’re talking about. Notice this data is the rare CSV without column headers. The data consists of 27 rows, one for each candidate, showing cumulative vote totals.

data <- read.csv("https://raw.githubusercontent.com/avehtari/ROS-Examples/master/Coop/data/Riverbay.csv", 
                 header = FALSE)
# drop 1st and 8th columns; contain candidate names which we don't need.
votes <- data[,2:7]
head(votes)
##    V2  V3  V4   V5   V6   V7
## 1 208 416 867 1259 1610 2020
## 2  55 106 215  313  401  505
## 3 133 250 505  716  902 1129
## 4 101 202 406  589  787  976
## 5 108 249 512  745  970 1192
## 6  54  94 196  279  360  451

Now let’s calculate the proportion of votes received at each interval and create a basic line plot. Each line below represents proportion of votes received for a candidate at each of the six intervals. Notice how the lines are mostly flat. This is what prompted the emergency fax.

vote_p <- apply(votes, 2, proportions)
matplot(t(vote_p), type = "l", col = 1, lty = 1)

Gelman, et al. demonstrate this using separate plots for the top 8 vote-getters (Fig 4.5). They also divide by number of voters instead of total votes received. (Remember, each voter gets to vote for up to six people.) This simply changes the denominator, and hence, the y-axis. The steady vote patterns remain.

voters <- c(600,1200,2444,3444,4444,5553)
vote_p <- sweep(votes, 2, voters, FUN = "/")
matplot(t(vote_p), type = "l", col = 1, lty = 1)

They note that the data in this plot is not independent since proportions at times 2 and beyond include votes that came before. To address this, they create a matrix that contains number of votes received at each interval instead of cumulative totals.

interval_votes <- t(apply(votes, 1, diff))
interval_votes <- cbind(votes[,1], interval_votes)
head(interval_votes)
##           V3  V4  V5  V6  V7
## [1,] 208 208 451 392 351 410
## [2,]  55  51 109  98  88 104
## [3,] 133 117 255 211 186 227
## [4,] 101 101 204 183 198 189
## [5,] 108 141 263 233 225 222
## [6,]  54  40 102  83  81  91

After taking differences the lines still seem mostly stable.

interval_p <- apply(interval_votes, 2, proportions)
matplot(t(interval_p), type = "l", col = 1, lty = 1)

Again, the authors divide by number of voters instead of total votes to create these plots, but the result is the same with a different y-axis. Here’s how I would do the calculations and create the plot.

interval_voters <- c(600, diff(voters))
interval_p <- sweep(interval_votes, 2, interval_voters, FUN = "/")
matplot(t(interval_p), type = "l", col = 1, lty = 1)

And now comes the hypothesis test. What is the probability of seeing steady proportions like this if the votes really were coming in at random? I’ll quote the book here: “Because the concern was that the votes were unexpectedly stable as the count proceeded, we define a test statistic to summarize variability.” The test statistic in this case is the standard deviations of the sample proportions. We can quickly get these from the interval_p object we created above.

test_stat <- apply(interval_p, 1, sd)

Now we need to calculate the theoretical test statistic. For this we assume each candidate has a fixed but unknown proportion of voters who will vote for them, \(\pi_i\). Under the null, the six intervals where votes are tallied are random samples of the voters. So at each time point we can think of the proportion as a draw from a distribution with mean \(\pi_i\) and standard deviation \(\sqrt{\pi_i(1 – \pi_i)/n_t}\), where \(n_t\) is the number of voters at each interval. To calculate this, we first need to estimate \(\pi_i\) with \(p_i\), the observed proportion of votes each candidate received. This is the last column of the votes data frame divided by the total number of voters, 5553.

p_hat <- votes[,6]/5553

Then we take the average of the variances calculated at each time point and take the square root to get the theoretical test statistic.

theory_test_stat <- sapply(p_hat, function(x)sqrt(mean(x*(1-x)/interval_voters)))

Under the null, the observed test statistics should be very close to the theoretical test statistics. This is assessed in Fig 4.7 in the book. I replicate the plot as follows:

plot(x = votes[,6], y = test_stat, xlab = "total # of votes for the candidate",
     ylab = "sd of separate vote proportions")
points(x = votes[,6], y = theory_test_stat, pch = 19)

The authors note that “the actual standard deviations appear consistent with the theoretical model.”

Personally I think the plot would be a little more effective if they zoomed out a little. Some of the dramatic looking departures are only off by 0.01. For example:

plot(x = votes[,6], y = test_stat, xlab = "total # of votes for the candidate",
     ylab = "sd of separate vote proportions", ylim = c(0,0.05))
points(x = votes[,6], y = theory_test_stat, pch = 19)

Another null hypothesis approach is the chi-square test of association. Under the null, the number of votes is not associated with the interval when votes were tallied. We can run this test for each candidate and look at the p-values. If there is no association for each candidate we should see a fairly uniform scatter of p-values. On the other hand, if there was “suspiciously little variation over time” we would see a surplus of high p-values. Here’s how I carried out these calculations. I first created the 2-way tables of yes/no versus time for each candidate. I then applied the chi-square test to each table, and to that result, I extracted each p-value. A uniform QQ plot shows the p-values are mostly uniformly distributed.

tables <- apply(interval_votes, 1, function(x) rbind(x, interval_voters - x), 
      simplify = FALSE)
chisq_out <- lapply(tables, chisq.test, correct = FALSE)
p_values <- sapply(chisq_out, function(x)x$p.value)
qqplot(ppoints(27), p_values)
qqline(p_values, distribution = qunif)

Finally the authors mention that a single test on the entire 27 x 6 table could be performed. This seems like the easiest approach of all.

chisq.test(interval_votes, correct = F)
## 
##  Pearson's Chi-squared test
## 
## data:  interval_votes
## X-squared = 114.72, df = 130, p-value = 0.8279

My R code differs quite a bit from the R code provided by the authors. I’m not saying mine is better, it just makes more sense to me. Maybe someone else will find this approach useful.

Parametric Bootstrap of Kolmogorov–Smirnov Test

Zeimbekakis, et al. recently published an article in The American Statistician titled On Misuses of the Kolmogorov–Smirnov Test for One-Sample Goodness-of-Fit. One of the misues they discuss is using the KS test with parameters estimated from the sample. For example, let’s sample some data from a normal distribution.

x <- rnorm(200, mean = 8, sd = 8)
c(xbar = mean(x), s = sd(x))
##     xbar        s 
## 8.333385 7.979586

If we wanted to assess the goodness-of-fit of this sample to a normal distribution, the following is a bad way to use the KS test:

ks.test(x, "pnorm", mean(x), sd(x))
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.040561, p-value = 0.8972
## alternative hypothesis: two-sided

The appropriate way to use the KS test is to actually supply hypothesized parameters. For example:

ks.test(x, "pnorm", 8, 8)
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.034639, p-value = 0.9701
## alternative hypothesis: two-sided

The results of both tests are the same. We fail to reject the null hypothesis that the sample is from a Normal distribution with the stated mean and standard deviation. However, the former test is very conservative. Zeimbekakis, et al. show this via simulation. I show a simplified version of this simulation. The basic idea is that if the test were valid, the p-values would be uniformly distributed and the points in the uniform distribution QQ-plot would fall along a diagonal line. Clearly that’s not the case.

n <- 200
rout <- replicate(n = 1000, expr = {
  x <- rnorm(n, 8 , 8)
  xbar <- mean(x)
  s <- sd(x)
  ks.test(x, "pnorm", xbar, s)$p.value
})
hist(rout, main = "Histogram of p-values")

qqplot(x = ppoints(n), y = rout, main = "Uniform QQ-plot")
qqline(rout, distribution = qunif)

Conclusion: using fitted parameters in place of the true parameters in the KS test yields conservative results. The authors state in the abstract that this “has been ‘discovered’ multiple times.”

When done the right way, the KS test yields uniformly distributed p-values.

rout2 <- replicate(n = 1000, expr = {
  x <- rnorm(n, 8 , 8)
  ks.test(x, "pnorm", 8, 8)$p.value
})
hist(rout2)

qqplot(x = ppoints(n), y = rout2, main = "Uniform QQ-plot")
qqline(rout2, distribution = qunif)

Obviously it’s difficult to know which parameters to supply to the KS test. Above we knew to supply 8 as the mean and standard deviation because that’s what we used to generate the data. But what to do in real life? Zeimbekakis, et al. propose a parametric bootstrap to approximate the null distribution of the KS test statistic. The steps to implement the bootstrap are as follows:

  1. draw a random sample from the fitted distribution
  2. get estimates of parameters of random sample
  3. obtain the empirical distribution function
  4. calculate the bootstrapped KS statistic
  5. repeat steps 1 – 4 many times

Let’s do it. The following code is a simplified version of what the authors provide with the paper. Notice they use MASS::fitdistr() to obtain MLE parameter estimates. This returns the same mean for the normal distribution but a slightly smaller (i.e. biased) estimated standard deviation.

param  <- MASS::fitdistr(x, "normal")$estimate
ks <- ks.test(x, function(x)pnorm(x, param[1], param[2]))
stat <- ks$statistic
B <- 1000
stat.b <- double(B)
n <- length(x)

## bootstrapping
for (i in 1:B) {
  # (1) draw a random sample from a fitted dist
  x.b <- rnorm(n, param[1], param[2])
  # (2) get estimates of parameters of random sample
  fitted.b <- MASS::fitdistr(x.b, "normal")$estimate
  # (3) get empirical distribution function
  Fn <- function(x)pnorm(x, fitted.b[1], fitted.b[2])
  # (4) calculate bootstrap KS statistic
  stat.b[i] <- ks.test(x.b, Fn)$statistic
}
mean(stat.b >= stat)
## [1] 0.61

The p-value is the proportion of statistics greater than or equal to the observed statistic calculated with estimated parameters.

Let’s turn this into a function and show that it returns uniformly distributed p-values when used with multiple samples. Again this is a simplified version of the R code the authors generously shared with their paper.

ks.boot <- function(x, B = 1000){
  param  <- MASS::fitdistr(x, "normal")$estimate
  ks <- ks.test(x, function(k)pnorm(k, param[1], param[2]))
  stat <- ks$statistic
  stat.b <- double(B)
  n <- length(x)
  for (i in 1:B) {
    x.b <- rnorm(n, param[1], param[2])
    fitted.b <- MASS::fitdistr(x.b, "normal")$estimate
    Fn <- function(x)pnorm(x, fitted.b[1], fitted.b[2])
    stat.b[i] <- ks.test(x.b, Fn)$statistic
  }
  mean(stat.b >= stat)
}

Now replicate the function with many samples. This takes a moment to run. It took my Windows 11 PC with an Intel i7 chip about 100 seconds to run.

rout_boot <- replicate(n = 1000, expr = {
  x <- rnorm(n, 8 , 8)
  ks.boot(x)
})
hist(rout_boot)

qqplot(x = ppoints(n), y = rout_boot, main = "Uniform QQ-plot")
qqline(rout_boot, distribution = qunif)