Category Archives: Using R

The stats::filter() function

The base R function filter() in the {stats} package can be used to calculate moving averages. It is one of the base R functions masked when the {dplyr} package is loaded.

Before we see how it works, let’s create some toy data.

n <- 100
t <- as.Date(seq(n), origin = as.Date("01/01/2022", "%m/%d/%Y"))
set.seed(1)
y <- cos(pi*seq(-2,2, length.out = n)) + rnorm(n, sd = 0.5)
d <- data.frame(t, y)

library(ggplot2)
ggplot(d) +
  aes(t, y) +
  geom_line()

3 day moving average

Let’s say we want to calculate a 3 day moving average. There are two ways to approach this:

  1. on any given day, take the average of the previous day, the current day, and the next day. In other words, use data from both sides of the current day.

  2. on any given day, take the average of the prior two days and the current day. In other words, use data from only one side of the current day.

The first approach is the default of the filter() function. The first argument is the vector of data we want to calculate the moving average for. The second argument is the filter we want to apply to the data. These are the coefficients applied to the data before summing. For a 3 day moving average this is a vector of three 1/3 values, which we can create using rep(1/3, 3). The sides = 2 argument says to use data on both sides of the current day.

d$ma3 <- stats::filter(d$y, filter = rep(1/3, 3), sides = 2)

Let’s look at the first three values. There’s an NA for day 1 because we have no data for the prior day.

d$ma3[1:3]
## [1]        NA 0.7735613 1.1199731

The first 3 day moving average is 0.7735613 calculated at day 2. This is the average of days 1, 2, and 3.

mean(y[1:3])
## [1] 0.7735613

Notice we can get the same result by multiplying each data point by 1/3 and summing. This is the “filter” applied to the data by the filter() function.

sum(1/3 * y[1:3])
## [1] 0.7735613

One reason to calculate a moving average is to smooth out day-to-day variation. Below we plot the original data with the 3 day moving average superimposed.

ggplot(d) +
  aes(t, y) +
  geom_line(alpha = 1/4) +
  geom_line(aes(y = ma3), color  = "red")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

The other approach is to use one side of the data. That means setting sides = 1. Now we get 2 NAs at the beginning because days 1 and 2 do not have two prior days.

d$ma3 <- stats::filter(d$y, filter = rep(1/3, 3), sides = 1)
d$ma3[1:3]
## [1]        NA        NA 0.7735613

odd versus even numbers

Odd numbers are good to use for sides = 2 because we’ll get an equal number of days before and after the current day. If an even number is used, “more of the filter is forward in time than backward” (?filter).

For example, consider a 6 day moving average:

d$ma6 <- stats::filter(d$y, filter = rep(1/6, 6), sides = 2)
d$ma6[1:3]
## [1]        NA        NA 0.9133886

The first 6 day moving average occurs at day 3, which is the average of the current day, the 2 previous days, and the 3 following days. Notice “more of the filter is forward in time than backward”:

1 2 [3] 4 5 6

sum(1/6 * y[1:6])
## [1] 0.9133886

Other functions

Three other functions for calculating moving averages are:

  1. runmean() from the {caTools} package
  2. frollmean() from the {data.table} package
  3. roll_mean() from the {RcppRoll} package

Below I use all three to replicate the stats::filter() result.

For caTools::runmean() we need to specify endrule = "NA", otherwise it uses an algorithm to calculate means at the extremes using windows smaller than 3.

For data.table::frollmean() we need to specify align = "center", which is the same as sides = 2 for filter(). Otherwise it defaults to align = "right" which is equivalent to sides = 1 for filter().

For RcppRoll::roll_mean() we need to specify fill = NA to pad the output with missing values in the extremes. Otherwise, in this case, it returns a vector of length 98 instead of 100.

d$ma3 <- stats::filter(d$y, filter = rep(1/3, 3), sides = 2)
d$ma3_runmean <- caTools::runmean(d$y, k = 3, endrule = "NA")
d$ma3_frollmean <- data.table::frollmean(d$y, n = 3, align = "center")
d$ma3_rcpproll <- RcppRoll::roll_mean(d$y, n = 3, align = "center",
                                      fill = NA)
head(d[,c("ma3", "ma3_runmean", "ma3_frollmean", "ma3_rcpproll")])
##         ma3 ma3_runmean ma3_frollmean ma3_rcpproll
## 1        NA          NA            NA           NA
## 2 0.7735613   0.7735613     0.7735613    0.7735613
## 3 1.1199731   1.1199731     1.1199731    1.1199731
## 4 1.1049153   1.1049153     1.1049153    1.1049153
## 5 1.0532159   1.0532159     1.0532159    1.0532159
## 6 0.8003626   0.8003626     0.8003626    0.8003626

US Names: section 2.3 of Regression and Other Stories

In section 2.3 of Regression and Other Stories, Gelman et al. discuss using graphs to learn more about data. The third example involves the exploration of names in the United States from 1880 to 2010. In this post I recreate figures 2.6 – 2.9 using R code that makes more sense to me. The original data and code are in this GitHub repo.

First read in the data and drop the superfluous first column which just contains row numbers:

allnames <- read.csv("https://github.com/avehtari/ROS-Examples/raw/master/Names/data/allnames_clean.csv")
allnames$X <- NULL
dim(allnames)
## [1] 98012   133

Notice this data set is quite large and “wide” with over 98,000 rows and 133 columns. There is one row per name per sex with counts per year spanning from 1880 to 2010. Here’s a portion of the data:

allnames[1:3,1:6]
##   name sex X1880 X1881 X1882 X1883
## 1 Mary   F  7065  6919  8149  8012
## 2 Anna   F  2604  2698  3143  3306
## 3 Emma   F  2003  2034  2303  2367

Apparently 7,065 girls were named Mary in 1880, and then 6,919 in 1881, and so on.
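As a quick sanity check using base subsetting (this lookup isn’t in the original post), we can pull that count directly:

allnames[allnames$name == "Mary" & allnames$sex == "F", "X1880"]
## [1] 7065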

Figs 2.6 and 2.7 show the distribution of the last letters of boys’ first names for the years 1906, 1956, and 2006. I wrote a function to create the plots for a given year.

last_letter <- function(year){
  xyear <- paste0("X", year)
  d <- subset(allnames, sex == "M", select = c("name", xyear))
  n <- nchar(d$name)
  d$last <- substr(d$name, n, n)
  d$last <- factor(d$last, levels = letters, labels = letters)
  p_data <- aggregate(d[[xyear]], by = list(d[["last"]]), sum)
  names(p_data) <- c("letter", "count")
  p_data$p <- p_data[["count"]]/sum(p_data[["count"]])
  barplot(p ~ letter, data = p_data, 
          main = paste("Last letter of boys' names in", year))
}

Now create the plots:

last_letter(1906)

last_letter(1956)

last_letter(2006)

The 2006 plot shows a trend in giving boys names that end with the letter “n”, such as Ethan, Jayden, Aiden, Mason, and Logan.
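As a quick check (not part of the original figures), we can list the boys’ names ending in “n” with the largest 2006 counts; names like Ethan, Jayden, and Aiden should appear near the top.

boys <- subset(allnames, sex == "M", select = c("name", "X2006"))
len <- nchar(boys$name)
ends_n <- boys[substr(boys$name, len, len) == "n", ]
# names ending in "n", sorted by 2006 count
head(ends_n[order(ends_n$X2006, decreasing = TRUE), ])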

Fig 2.8 looks at the distribution of the last letters of boys’ names over time. To create this plot I found it helpful to use the {tidyr} and {dplyr} packages. First I reshaped the data to “long” format using pivot_longer() so there is one row per name per sex per year. Then I removed the “X” from the year value and once again extracted the last letter of each name using the nchar() and substr() functions.

library(tidyr)
allnames_long <- pivot_longer(allnames, cols = X1880:X2010, 
                              names_to = "year", values_to = "count")
allnames_long$year <- sub("X", "", allnames_long$year)
n <- nchar(allnames_long$name)
allnames_long$last <- substr(allnames_long$name, n, n)
allnames_long$last <- factor(allnames_long$last, 
                             levels = letters, labels = letters)

Next I used some {dplyr} functions to calculate the proportions of boys’ names ending in each of the 26 English letters by year. In the last step I create a new variable called “last2” to indicate if the letter was an N, D or Y. This is to help create a plot matching figure 2.8.

library(dplyr)
last_letter_males <- allnames_long |> 
  filter(sex == "M") |> 
  group_by(year, last) |> 
  summarise(letter_count = sum(count)) |> 
  mutate(p = letter_count/sum(letter_count),
         last2 = case_when(last == "n" ~ "n",
                           last == "d" ~ "d",
                           last == "y" ~ "y",
                           .default = "other"))

Before I create the plot, I cut the data into two sets: one with letters N, D and Y; and one with the rest of the letters.

ndy <- subset(last_letter_males, last2 != "other")
other <- subset(last_letter_males, last2 == "other")

And now to create the plot using {ggplot2}. Using two data sets allows me to tweak the lines for the letters N, D and Y. The “Set2” palette is a good all-purpose, color-blind-friendly palette.

library(ggplot2)
ggplot() +
  aes(x = year, y = p, group = last) +
  geom_line(data = other, alpha = 1/5) +
  geom_line(mapping = aes(color = last2), data = ndy, linewidth = 1.5) +
  scale_x_discrete(breaks = c("1900", "1950", "2000")) +
  scale_color_brewer(palette = "Set2") +
  theme_minimal() +
  labs(color = "letter")

Names ending in D and Y had their moments in the sun, but now it’s all about names ending in N.

Figure 2.9 displays trends in the concentration of names in the top 10. This was the easiest plot to create. Again I used some {dplyr} functions to create the data. After I calculate proportions of counts by year and sex, I sort the data in descending order, slice off the top 10 names, and then sum the proportions by year and sex.

names_conc <- allnames_long |>
  group_by(year, sex) |> 
  mutate(p = count/sum(count)) |> 
  arrange(desc(p)) |> 
  slice_head(n = 10) |> 
  summarize(total_p = sum(p))

And here’s the plot:

ggplot(names_conc) + 
  aes(x = year, y = total_p, group = sex, color = sex) +
  geom_line() +
  scale_x_discrete(breaks = c("1900", "1950", "2000")) +
  scale_color_brewer(palette = "Set2") +
  ylim(c(0, 0.45)) +
  theme_minimal()

Pre-1900, about 40% of all boys’ names were in the top 10. Fast forward 100 years to 2000, and only about 10% of names, boy or girl, are in the top 10. We see much more variability in names in the 21st century.

Age Adjustment: section 2.4 of Regression and Other Stories

In section 2.4 of Regression and Other Stories, Gelman, et al. explain the necessity of age adjustment when investigating mortality rates. The book is freely available as a PDF online and the section of interest is on pages 31-33. Upon first reading, I had trouble understanding what they were doing. In particular I didn’t follow Figure 2.12. I’m aware that says more about me than the authors. Fortunately, there’s a footnote in the book that says all data and code are available in the AgePeriodCohort folder on GitHub. Good, I thought, I’ll look at the code and figure out what’s going on. Famous last words.

The R script that performs the age adjustment is births.R. It clocks in at over 400 lines and has practically no comments outside of the occasional “Sum it up.” As you run the code, you’ll find the script generates several plots not in the book. In addition, the plots that are in the book are generated in a different order. Trying to parse the R code to help me understand the exposition was frustrating. But I persisted.

Reading the bibliographic note at the end of the chapter indicated the age adjustment example was first discussed on Gelman’s blog. In the blog post he walks through the process of age adjustment, creating the same plots in the book, and provides the R code. This is basically the births.R script. He says at the end, “the code is ugly. Don’t model your code after my practices! If any of you want to make a statistics lesson out of this episode, I recommend you clean the code.”

This blog post is my statistics lesson trying to understand and clean this code.

Fig 2.11 (a)

The data apparently come from the CDC, but I’m using the data file Gelman provides with his R code. The data show the number of deaths per age per gender per year for white non-hispanics in the US. For example, the first row shows 1291 female deaths (Male = 0) in 1999 for those who were 35 years old. The total population of 35 year old women in 1999 was 1,578,829. The rate is 1291/1,578,829 x 100,000 = 81.8, or about 82 deaths per 100,000.

data <- read.table("white_nonhisp_death_rates_from_1999_to_2013_by_sex.txt", 
                   header=TRUE)
head(data)
##   Age Male Year Deaths Population Rate
## 1  35    0 1999   1291    1578829 81.8
## 2  35    0 2000   1264    1528463 82.7
## 3  35    0 2001   1186    1377466 86.1
## 4  35    0 2002   1194    1333639 89.5
## 5  35    0 2003   1166    1302188 89.5
## 6  35    0 2004   1166    1325435 88.0
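As a quick check of the Rate column, we can recompute the rate for the first row:

# rate per 100,000 for 35 year old women in 1999; about 81.8
with(data[1, ], Deaths/Population * 100000)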

The first plot is mortality rate of the 45-54 age group from 1999 – 2013. We first sum both Deaths and Population by year and then calculate the Mortality Rate by dividing Deaths by Population. This is a nice opportunity to use the base R pipe operator, |>.

aggregate(cbind(Deaths, Population) ~ Year, data = data, FUN = sum, 
          subset = Age %in% 45:54) |> 
  transform(Rate = Deaths/Population) |> 
  plot(Rate ~ Year, data = _, type = "l", ylab = "Death Rate")

This is the third plot in Gelman’s blog post titled “So take the ratio!”

Fig 2.11 (b)

The second plot shows the average age of the 45-54 age group increasing as the baby boomers move through. First we sum the Population by Age and Year for the 45-54 group. Then we use that data to compute the mean age per year, weighted by the population.

aggregate(Population ~ Age + Year, data = data, sum, 
          subset = Age %in% 45:54) |> 
  aggregate(Population ~ Year, data = _, 
            function(x)weighted.mean(45:54, x)) |> 
  plot(Population ~ Year, data = _, type = "l")

To help make this clear, let’s find the mean age of the 45-54 group in 1999. First find the population for each age in 1999:

tmp <- aggregate(Population ~ Age + Year, data = data, sum, 
          subset = Age %in% 45:54 & Year == 1999)
tmp$Population
##  [1] 3166393 3007083 2986252 2805975 2859406 2868751 2804957 3093631 2148382
## [10] 2254975

To find the mean age of the 45-54 group in 1999, we need to weight each age with the population. We can do that with the weighted.mean() function.

weighted.mean(45:54, tmp$Population)
## [1] 49.25585

The code above does this for 1999-2013. I think it’s worth noting that while the plot looks dramatic, the average age only increases from about 49.2 to 49.7. But I suppose when you’re dealing with millions of people that increase makes a difference.

This is the fourth plot in Gelman’s blog post titled “But the average age in this group is going up!”
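To confirm that roughly 49.2 to 49.7 range, we can save the weighted means instead of piping them straight into plot(). This is the same aggregation as above, just stored in an object:

avg_age <- aggregate(Population ~ Age + Year, data = data, sum, 
                     subset = Age %in% 45:54) |> 
  aggregate(Population ~ Year, data = _, 
            function(x)weighted.mean(45:54, x))
range(avg_age$Population)  # roughly 49.2 to 49.7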

Fig 2.11 (c)

This is where I began to struggle when reading the book.

This figure is titled “The trend in raw death rates since 2005 can be explained by age-aggregation bias”. This is the eighth plot in the blog post where it has a bit more motivation. Let’s recreate the plots in the blog post leading up to this plot.

The first of these is the sixth plot in the blog post. It’s basically the previous plot rescaled as a rate. It’s created by first calculating the age-specific death rates in 1999, and then, for each year, taking the mean of those rates weighted by the population at each age.

dr1999 <- aggregate(cbind(Deaths, Population) ~ Age, data = data, FUN = sum, 
          subset = Age %in% 45:54 & Year == 1999) |> 
  transform(Rate = Deaths/Population) 

# Now create plot
aggregate(Population ~ Age + Year, data = data, sum, 
          subset = Age %in% 45:54) |> 
  aggregate(Population ~ Year, data = _, 
            function(x)weighted.mean(dr1999$Rate, x)) |> 
  plot(Population ~ Year, data = _, type = "l", ylab = "Reconstructed death rate")

Next he combines this plot with the plot of the raw death rate (Fig 2.11 (a)). This is the seventh plot in the blog post.

years <- 1999:2013

Raw <- aggregate(cbind(Deaths, Population) ~ Year, data = data, FUN = sum, 
          subset = Age %in% 45:54) |> 
  transform(Rate = Deaths/Population)

Expected <- aggregate(Population ~ Age + Year, data = data, sum, 
          subset = Age %in% 45:54) |> 
  aggregate(Population ~ Year, data = _, 
            function(x)weighted.mean(dr1999$Rate, x))

plot(years, Raw$Rate, type="l", ylab="Death rate for 45-54 non-Hisp whites")
lines(years, Expected$Population, col="green4")
text(2002.5, .00404, "Raw death rate", cex=.8)
text(2009, .00394, "Expected just from\nage shift", col="green4", cex=.8)

Then finally he says, “We can sharpen this comparison by anchoring the expected-trend-in-death-rate-just-from-changing-age-composition graph at 2013, the end of the time series, instead of 1999.” This means we need to calculate the death rate in 2013, and then take the weighted mean of that rate by using the total population for each age group. This is the dr2013 data frame. Then we create the same plot as above except now using the death rate in 2013.

dr2013 <- aggregate(cbind(Deaths, Population) ~ Age, data = data, FUN = sum, 
                    subset = Age %in% 45:54 & Year == 2013) |> 
  transform(Rate = Deaths/Population) 

Raw <- aggregate(cbind(Deaths, Population) ~ Year, data = data, FUN = sum, 
                 subset = Age %in% 45:54) |> 
  transform(Rate = Deaths/Population)

Expected <- aggregate(Population ~ Age + Year, data = data, sum, 
                      subset = Age %in% 45:54) |> 
  aggregate(Population ~ Year, data = _, 
            function(x)weighted.mean(dr2013$Rate, x))

plot(years, Raw$Rate,  type="l", 
     ylab="Death rate for 45-54 non-Hisp whites")
lines(years, Expected$Population, col="green4")
text(2002.5, 0.00395, "Raw death rate", cex=.8)
text(2002, .00409, "Expected just from\nage shift", col="green4", cex=.8)

Gelman notes, “since 2003, all the changes in raw death rate in this group can be explained by changes in age composition.”

Fig 2.12 (a)

This is the first plot showing age-adjusted death rates. Gelman explains this as follows in his blog post: “for each year in time, we take the death rates by year of age and average them, thus computing the death rate that would’ve been observed had the population distribution of 45-54-year-olds been completely flat each year.” The book calls it “the simplest such adjustment, normalizing each year to a hypothetical uniformly distributed population in which the number of people is equal at each age from 45 through 54.” I found this latter explanation a little confusing.

To create this plot we first sum Deaths and Populations by age and year for the 45-54 age group, then calculate the death rate, and then simply take the mean rate by year. That’s it. Gelman takes the additional step of rescaling the rate so that the rate is 1 in 1999.

aggregate(cbind(Deaths, Population) ~ Age + Year, data = data, sum, 
          subset = Age %in% 45:54) |> 
  transform(Rate = Deaths/Population) |> 
  aggregate(Rate ~ Year, data = _, mean) |> 
  transform(AA_Rate = Rate/Rate[1]) |>   # relative to 1999
  plot(AA_Rate ~ Year, data = _, type = "l", 
       ylab = "age-adjusted death rate, relative to 1999")

Fig 2.12 (b)

In the book, this plot shows two different age adjustments, even though the exposition says there are three. They’re probably referring to the original blog post plot which does show three. I recreate the plot in the blog post, which is the second to last plot.

This plot shows (1) age-adjustment using the simple mean of rates, (i.e., the plot above), (2) age-adjustment using the distribution of ages in 1999, and (3) age-adjustment using the distribution of ages in 2013.

This plot requires the most work of all. First we need to get the total population for each age in the 45-54 group in 1999 and 2013. These are used to make the age adjustments.

pop1999 <- aggregate(Population ~ Age, data = data, 
                     subset = Year == 1999 & Age %in% 45:54, sum)[["Population"]]
pop2013 <- aggregate(Population ~ Age, data = data, 
                     subset = Year == 2013 & Age %in% 45:54, sum)[["Population"]]

Next we calculate age-adjusted rates using the population distributions from 1999 and 2013. Again we sum Deaths and Populations by age and year for the 45-54 age group and calculate the death rate. Then we calculate the average rate by year using the population distributions to calculate a weighted mean.

# age-adjustment from Fig 2.12 (a)
aa_rate_uniform <- aggregate(cbind(Deaths, Population) ~ Age + Year, 
                             data = data, sum, 
                             subset = Age %in% 45:54) |> 
  transform(Rate = Deaths/Population) |> 
  aggregate(Rate ~ Year, data = _, mean) |> 
  transform(AA_Rate = Rate/Rate[1])

aa_rate_1999 <- aggregate(cbind(Deaths, Population) ~ Age + Year, 
                          data = data, sum,
                          subset = Age %in% 45:54) |> 
  transform(Rate = Deaths/Population) |> 
  aggregate(Rate ~ Year, data = _, function(x)weighted.mean(x, pop1999))

aa_rate_2013 <- aggregate(cbind(Deaths, Population) ~ Age + Year, 
                          data = data, sum, 
                          subset = Age %in% 45:54) |> 
  transform(Rate = Deaths/Population) |> 
  aggregate(Rate ~ Year, data = _, function(x)weighted.mean(x, pop2013))

Now we can make the plot. Notice we find the range of all the data to help set the limits of the y-axis. Also notice we rescale the plot so all lines begin at 1.

rng <- range(aa_rate_uniform$Rate/aa_rate_uniform$Rate[1], 
             aa_rate_1999$Rate/aa_rate_1999$Rate[1], 
             aa_rate_2013$Rate/aa_rate_2013$Rate[1])
plot(years, aa_rate_uniform$Rate/aa_rate_uniform$Rate[1], type = "l", ylim=rng,
     ylab = "age-adjusted death rate, relative to 1999")
lines(years, aa_rate_1999$Rate/aa_rate_1999$Rate[1], lty=2)
lines(years, aa_rate_2013$Rate/aa_rate_2013$Rate[1], lty=3)
text(2003, 1.053, "Using 1999\nage dist", cex=.8)
text(2004, 1.032, "Using 2013\nage dist", cex=.8)

The point of this plot is to demonstrate it doesn’t matter how the age-adjustment is done.

Fig 2.12 (c)

The final plot shows age adjusted death rates broken down by sex. This is basically the same code as Fig 2.12 (a) but with Male included in the calls to aggregate(). To rescale the y-axis so it starts at 1 we need to divide each vector of rates by the respective 1999 value.

aa_rate_sex <- aggregate(cbind(Deaths, Population) ~ Age + Year + Male, 
                         data = data, sum, 
          subset = Age %in% 45:54) |> 
  transform(Rate = Deaths/Population) |> 
  aggregate(Rate ~ Year + Male, data = _, mean) 

plot(years, aa_rate_sex$Rate[aa_rate_sex$Male == 0]/
       aa_rate_sex$Rate[aa_rate_sex$Year == 1999 & aa_rate_sex$Male == 0], 
     col="red", type = "l",
     ylab = "Death rate relative to 1999")
lines(years, aa_rate_sex$Rate[aa_rate_sex$Male == 1]/
        aa_rate_sex$Rate[aa_rate_sex$Year == 1999 & aa_rate_sex$Male == 1], 
      col="blue", type = "l")
text(2011.5, 1.075, "Women", col="red")
text(2010.5, 1.02, "Men", col="blue")

Gelman called his code “ugly”, but it’s his code and he understands it. I don’t claim my code is any better, but it’s my code and I understand it.

Encoding a contrast in a linear model

This is a note to myself on how to encode a contrast in a linear model.

The following data come from the text Design and Analysis of Experiments by Dean and Voss (1999). It involves the lifetime per unit cost of nonrechargeable batteries. Four types of batteries are considered:

  1. alkaline, name brand
  2. alkaline, store brand
  3. heavy duty, name brand
  4. heavy duty, store brand

We can read the data in from the textbook web site.

URL <- "https://corescholar.libraries.wright.edu/cgi/viewcontent.cgi?filename=12&article=1007&context=design_analysis&type=additional"
bat <- read.table(URL, header=TRUE)
names(bat) <- tolower(names(bat))
bat$type <- factor(bat$type)
head(bat)
##   type lpuc order
## 1    1  611     1
## 2    2  923     2
## 3    1  537     3
## 4    4  476     4
## 5    1  542     5
## 6    1  593     6

A quick look at the means suggests the alkaline store brand battery is the best battery for the money.

means <- tapply(bat$lpuc, bat$type, mean)
means
##      1      2      3      4 
## 570.75 860.50 433.00 496.25

We might want to make comparisons between these means. Three such comparisons are as follows:

  1. compare battery duty (alkaline vs heavy duty)
  2. compare battery brand (name brand versus store brand)
  3. compare the interaction (levels 1/4 vs levels 2/3)

These are the comparisons presented in the text (p. 171). We can make these comparisons “by hand”.

# compare battery duty
mean(means[1:2]) - mean(means[3:4])
## [1] 251
# compare battery brand
mean(means[c(1,3)]) - mean(means[c(2,4)])
## [1] -176.5
# compare interaction
mean(means[c(1,4)]) - mean(means[c(2,3)])
## [1] -113.25

We can also make these comparisons using a contrast matrix. Below we create the contrast as a matrix object and name it K. Then we use matrix multiplication to calculate the same comparisons above.

K <- matrix(c(1/2, 1/2, -1/2, -1/2,
              1/2, -1/2, 1/2, -1/2,
              1/2, -1/2, -1/2, 1/2), 
            byrow = T, ncol = 4)
K %*% means
##         [,1]
## [1,]  251.00
## [2,] -176.50
## [3,] -113.25

This particular contrast matrix is orthogonal. “Two contrasts are orthogonal if the sum of the products of corresponding coefficients (i.e. coefficients for the same means) adds to zero.” (source) We can show this using the crossprod() function.

crossprod(K[1,],K[2,])
##      [,1]
## [1,]    0
crossprod(K[2,],K[3,])
##      [,1]
## [1,]    0
crossprod(K[1,],K[3,])
##      [,1]
## [1,]    0
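A more compact check is to multiply K by its transpose; the off-diagonal zeros confirm the rows are mutually orthogonal.

K %*% t(K)
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1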

Obviously we would like to calculate standard errors and confidence intervals for these comparisons. One way is to fit a model using lm() and do the comparisons as a follow-up using a package such as {multcomp}. To do this we use the glht() function with our contrast, K.

library(multcomp)
m <- lm(lpuc ~ type, data = bat)
comp <- glht(m, linfct = mcp(type = K))
confint(comp)
## 
##   Simultaneous Confidence Intervals
## 
## Multiple Comparisons of Means: User-defined Contrasts
## 
## 
## Fit: lm(formula = lpuc ~ type, data = bat)
## 
## Quantile = 2.7484
## 95% family-wise confidence level
##  
## 
## Linear Hypotheses:
##        Estimate  lwr       upr      
## 1 == 0  251.0000  184.1334  317.8666
## 2 == 0 -176.5000 -243.3666 -109.6334
## 3 == 0 -113.2500 -180.1166  -46.3834

Another way is to encode the contrasts directly in our model. We can do that using the capital C() function. The only difference here is that we need to transpose the matrix. The comparisons need to be defined on the columns instead of the rows. We could use the t() function to transpose K, but we’ll go ahead and create a new matrix to make this clear.

K2 <- matrix(c(1/2, 1/2, -1/2, -1/2,
              1/2, -1/2, 1/2, -1/2,
              1/2, -1/2, -1/2, 1/2), 
            ncol = 3)

Now we refit the model using the contrast in the model formula. Notice the coefficients are the same comparisons we calculated above. The intercept is the grand mean of lpuc, (i.e. mean(bat$lpuc)).

m2 <- lm(lpuc ~ C(type, K2), data = bat)
summary(m2)
## 
## Call:
## lm(formula = lpuc ~ C(type, K2), data = bat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -66.50 -33.56 -18.12  38.19  72.75 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    590.12      12.16  48.511 3.85e-15 ***
## C(type, K2)1   251.00      24.33  10.317 2.55e-07 ***
## C(type, K2)2  -176.50      24.33  -7.255 1.01e-05 ***
## C(type, K2)3  -113.25      24.33  -4.655 0.000556 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.66 on 12 degrees of freedom
## Multiple R-squared:  0.9377, Adjusted R-squared:  0.9222 
## F-statistic: 60.24 on 3 and 12 DF,  p-value: 1.662e-07

And now we can use confint() to get the confidence intervals.

confint(m2)
##                  2.5 %     97.5 %
## (Intercept)   563.6202  616.62977
## C(type, K2)1  197.9905  304.00954
## C(type, K2)2 -229.5095 -123.49046
## C(type, K2)3 -166.2595  -60.24046

These are narrower than what we got with {multcomp}. That’s because {multcomp} uses a family-wise confidence interval that adjusts for making multiple comparisons.
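As an aside, if I recall the {multcomp} interface correctly, unadjusted (per-comparison) intervals can also be requested through the calpha argument of confint(); something like the sketch below (see ?confint.glht to confirm).

# unadjusted intervals from {multcomp}; assumes the calpha argument
# and the univariate_calpha() helper
confint(comp, calpha = univariate_calpha())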

Another property of orthogonal contrasts is that their estimates are uncorrelated. We can see this by calling vcov() on the fitted model object that directly uses the contrast matrix we created.

zapsmall(vcov(m2))
##              (Intercept) C(type, K2)1 C(type, K2)2 C(type, K2)3
## (Intercept)     147.9818       0.0000       0.0000       0.0000
## C(type, K2)1      0.0000     591.9271       0.0000       0.0000
## C(type, K2)2      0.0000       0.0000     591.9271       0.0000
## C(type, K2)3      0.0000       0.0000       0.0000     591.9271

We can also use the {emmeans} package to make these comparisons as follows using the original model.

library(emmeans)
emm_out <- emmeans(m, "type")
contrast(emm_out, list(c1 = c(1, 1, -1, -1)/2,
                       c2 = c(1, -1, 1, -1)/2,
                       c3 = c(1, -1, -1, 1)/2)) |>
  confint()
##  contrast estimate   SE df lower.CL upper.CL
##  c1            251 24.3 12      198    304.0
##  c2           -176 24.3 12     -230   -123.5
##  c3           -113 24.3 12     -166    -60.2
## 
## Confidence level used: 0.95

Some R Fundamentals

I recently came across the book R Programming for Bioinformatics at my local library and decided to check it out. I don’t do bioinformatics and the book is a little old (published in 2009), but I figured I would browse through it anyway. Chapter 2 is titled R Language Fundamentals. As I was flipping through it I found several little nuggets of information that I had either forgotten about over the years or never knew in the first place. I decided to document them here.

Variable names

Variable names cannot begin with a digit or underscore, and if they begin with a period, the period cannot be followed by a digit. But we can bend these rules by quoting the names with backticks.

`_evil` <- "probably not wise"
`_evil`
## [1] "probably not wise"
`.666_number of the beast` <- sqrt(666^2)
`.666_number of the beast`
## [1] 666
rm(`_evil`, `.666_number of the beast`)

Attributes

Attributes can be attached to any R object except NULL. They can be useful for storing metadata among many other things. For example, add a source for a dataset.

d <- VADeaths
attr(d, "source") <- "Molyneaux, L., Gilliam, S. K., and Florant, L. C.(1947) Differences in Virginia death rates by color, sex, age, and rural or urban residence. American Sociological Review, 12, 525–535."

To see the source:

attr(d, "source")
## [1] "Molyneaux, L., Gilliam, S. K., and Florant, L. C.(1947) Differences in Virginia death rates by color, sex, age, and rural or urban residence. American Sociological Review, 12, 525–535."

To see all attributes of an object:

attributes(d)
## $dim
## [1] 5 4
## 
## $dimnames
## $dimnames[[1]]
## [1] "50-54" "55-59" "60-64" "65-69" "70-74"
## 
## $dimnames[[2]]
## [1] "Rural Male"   "Rural Female" "Urban Male"   "Urban Female"
## 
## 
## $source
## [1] "Molyneaux, L., Gilliam, S. K., and Florant, L. C.(1947) Differences in Virginia death rates by color, sex, age, and rural or urban residence. American Sociological Review, 12, 525–535."

To remove an attribute:

attr(d, "source") <- NULL

Not all attributes are displayed when attributes() is called on an object. For example, after fitting a linear model, it appears there are only two attributes.

m <- lm(dist ~ speed, data = cars)
attributes(m)
## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"        
## 
## $class
## [1] "lm"

However, elements of the model object also have attributes. For example, the terms element has 10 attributes.

out <- attributes(m$terms)
length(out)
## [1] 10
names(out)
##  [1] "variables"    "factors"      "term.labels"  "order"        "intercept"   
##  [6] "response"     "class"        ".Environment" "predvars"     "dataClasses"
attr(m$terms, "factors")
##       speed
## dist      0
## speed     1

The colon operator

I often forget the colon operator can work with decimal values.

2.5:10.5
## [1]  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5

And can go backwards:

10.2:1.2
##  [1] 10.2  9.2  8.2  7.2  6.2  5.2  4.2  3.2  2.2  1.2

zero length vectors

The sum of a zero length vector is 0, but the product of a zero length vector is 1.

x <- numeric()
length(x)
## [1] 0
sum(x)
## [1] 0
prod(x)
## [1] 1

This ensures expected behavior when working with sums and products:

# 12 + 0
sum(12, x)
## [1] 12
# 12 * 1
prod(12, x)
## [1] 12

.Machine

The .Machine variable holds information about the numerical characteristics of your machine. For example, the largest integer my machine can represent:

.Machine$integer.max
## [1] 2147483647

If I add 1 to that, the result is numeric, not an integer.

x <- .Machine$integer.max
x2 <- x + 1
is.integer(x2)
## [1] FALSE

If I add 1L (an explicit integer) to that, the result is a warning and an NA. My machine cannot represent that integer.

x2 <- x + 1L
## Warning in x + 1L: NAs produced by integer overflow
x2
## [1] NA

Recoding factors

There are several convenience functions in other packages for recoding variables, such as recode in the {car} package, case_when in {dplyr}, and a bunch of functions in the {forcats} package. But it’s good to remember how to use base R to recode factors. Create a list with the recoding definitions and assign it to the levels of the factor.

g <- sample(letters[1:5], 30, replace = TRUE)
g <- factor(g)
g
##  [1] e c d c d c b a e c a d e e d b e c b c b c b d d c b a d e
## Levels: a b c d e

Put “a” and “b” into one group, “c” and “d” into another group, and keep “e” in its own group.

lst <- list("A" = c("a", "b"),
            "B" = c("c", "d"),
            "C" = "e")
levels(g) <- lst
g
##  [1] C B B B B B A A C B A B C C B A C B A B A B A B B B A A B C
## Levels: A B C

If we like we can add an attribute to store the definition.

attr(g, "recoding") <- c("A = {ab}, B = {cd}, C = {e}")
g
##  [1] C B B B B B A A C B A B C C B A C B A B A B A B B B A A B C
## attr(,"recoding")
## [1] A = {ab}, B = {cd}, C = {e}
## Levels: A B C

lists can have dimensions

Something more interesting than useful is that lists can have dimensions.

M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
Xsq <- chisq.test(M) # produces 9 element list
Xsq <- unclass(Xsq) # remove htest class
dim(Xsq) <- c(3,3)
Xsq
##      [,1]         [,2]                         [,3]     
## [1,] 30.07015     "Pearson's Chi-squared test" numeric,6
## [2,] 2            "M"                          table,6  
## [3,] 2.953589e-07 table,6                      table,6
Xsq[1,3]
## [[1]]
##          A        B        C
## A 703.6714 319.6453 533.6834
## B 542.3286 246.3547 411.3166

Environments

We are not restricted to creating objects in the Global Environment. We can create our own environments using the new.env() function and then create objects in that environment. We can use the dollar sign operator or the assign() function.

e1 <- new.env()
e1$mod <- lm(dist ~ speed, data = cars)
e1$cumTotal <- function(x)tail(cumsum(x), n = 1)
assign("vals", c(20, 23, 34, 19), envir = e1)
ls(e1)
## [1] "cumTotal" "mod"      "vals"
ls() # list objects in Global Environment
##  [1] "d"   "e1"  "g"   "lst" "m"   "M"   "out" "x"   "x2"  "Xsq"

We can access objects in our environment using the dollar sign operator or the get() and mget() functions.

e1$cumTotal(c(2,4,6))
## [1] 12
get("vals", envir = e1)
## [1] 20 23 34 19
mget(c("mod", "vals"), envir = e1) # get more than one object
## $mod
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932  
## 
## 
## $vals
## [1] 20 23 34 19

We can save the environment and reload it in a future session.

save(e1, file = "e1.Rdata")
rm(e1)
load(file = "e1.Rdata")

We can also change the environment associated with an object that was created in the Global Environment.

f <- function(x)(vals + 1000) # vals object defined in e1 environment
environment(f) <- e1
f
## function(x)(vals + 1000)
## <environment: 0x000001b5ec98d4e0>
f()
## [1] 1020 1023 1034 1019

Notice that if we remove the environment using rm(), the function still remains in that environment and we still have access to its objects.

rm(e1)
f
## function(x)(vals + 1000)
## <environment: 0x000001b5ec98d4e0>
f()
## [1] 1020 1023 1034 1019

rm(e1) simply removes the binding between the symbol “e1” and the structure that contains the objects. Since the environment can be reached as the environment of f(), it remains available.
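We can verify this by listing the objects in the environment attached to f():

ls(environment(f))
## [1] "cumTotal" "mod"      "vals"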

Brackets and Dollar Signs

I found this sentence enlightening: “One way of describing the behavior of the single bracket operator is that the type of the return value matches the type of the value it is applied to.” (p. 28) I like this in favor of metaphors involving trains.

lst <- list(a1 = 1:5, b = c("d", "g"), c = 99)
lst["a1"] # returns a list
## $a1
## [1] 1 2 3 4 5

[[ and $ extract a single element.

lst[["a1"]]
## [1] 1 2 3 4 5
lst$a1
## [1] 1 2 3 4 5

The $ operator supports partial matching.

lst$a
## [1] 1 2 3 4 5

The [ and [[ operators support expressions, but not partial matching.

ans <- "c"
lst[ans]
## $c
## [1] 99
lst[[ans]]
## [1] 99

If names are duplicated in named vectors, then only the value corresponding to the first one is returned when subsetting with brackets.

x <- c("a" = 1, "a" = 2)
x["a"]
## a 
## 1

The %in% operator can be useful to get all elements with the same name.

x[names(x) %in% "a"]
## a a 
## 1 2

Matrix indexing

I don’t work with arrays that often, but when I do I often forget that I can index them with a matrix. Below I extract the value in row 1, column 4, from each of the 3 layers of the iris3 array.

m <- matrix(c(1,4,1,
              1,4,2,
              1,4,3), 
            ncol = 3, byrow = TRUE)
iris3[m]
## [1] 0.2 1.4 2.5

Of course we can get the same result (in this case) using subsetting indices.

iris3[1,4,]
##     Setosa Versicolor  Virginica 
##        0.2        1.4        2.5

Negative subscripts

Negative subscripts can appear on the left side of assignment.

x <- 1:10
x[-(2:4)] <- 99
x
##  [1] 99  2  3  4 99 99 99 99 99 99

Subsetting without dimensions

Use empty single brackets, [], to select all elements without changing any attributes.

x <- matrix(10:1, ncol = 2)
x
##      [,1] [,2]
## [1,]   10    5
## [2,]    9    4
## [3,]    8    3
## [4,]    7    2
## [5,]    6    1
x[] <- sort(x)
x
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10

Would you rather

Someone texted me the following image and asked for my response.

So what would I rather do?

  • Flip 3 coins and win if all match
  • Roll 3 dice and win if none match

I would rather do the one with the highest probability of happening. So let’s calculate that using R.

There are two ways to solve problems like this. One is to use the appropriate probability distribution and calculate it mathematically. The other is to enumerate all possible outcomes and find the proportion of interest. Let’s do the latter first.

We’ll use the expand.grid function to create a data frame of all possible outcomes of flipping 3 coins. All we have to do is give it 3 vectors representing coins. It’s such a small sample space we can eyeball it and see the probability of getting all heads or all tails is 2/8 or 0.25.

s1 <- expand.grid(c("H", "T"), c("H", "T"), c("H", "T"))
s1
##   Var1 Var2 Var3
## 1    H    H    H
## 2    T    H    H
## 3    H    T    H
## 4    T    T    H
## 5    H    H    T
## 6    T    H    T
## 7    H    T    T
## 8    T    T    T

If we wanted to use R to determine the proportion, we could use apply to apply a function to each row and return the number of unique values, and then find the proportion of 1s. (1 means there was only one unique value: all “H” or all “T”)

count1 <- apply(s1, 1, function(x)length(unique(x)))
mean(count1 == 1)
## [1] 0.25

We can do the same with rolling 3 dice. Give expand.grid 3 dice and have it generate all possible results. This will generate 216 possibilities, so there’s no eyeballing this for an answer. As before, we’ll apply a function to each row and determine if there are any duplicates. The anyDuplicated function returns the location of the first duplicate within a vector. If there are no duplicates, the result is a 0, which means we just need to find the proportion of 0s.

s2 <- expand.grid(1:6, 1:6, 1:6)
count2 <- apply(s2, 1, anyDuplicated)
mean(count2 == 0)
## [1] 0.5555556

An easier way to do that is to use permutation calculations. There are \(6^3 = 216\) possible results from rolling 3 dice. Furthermore there are \(6 \cdot 5 \cdot 4 = 120\) possible non-matching results. There are 6 possibilities for the first die, only 5 for the second, and 4 for the third. We can then divide to find the probability: \(120/216 \approx 0.56\).
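The same arithmetic in R:

(6 * 5 * 4) / 6^3
## [1] 0.5555556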

Clearly we’d rather roll the dice (assuming fair coins and fair dice).

We can also answer the question using the binomial probability distribution. The binomial distribution is appropriate when you have:

  • two outcomes (heads vs tails, no match vs one or more matches)
  • independent events
  • same probability for each event

The dbinom function calculates probabilities of binomial outcomes. Below we use it to calculate the probability of 0 heads (3 tails) and 3 heads, and sum the total. The x argument is the number of “successes”, for example 0 heads (or tails, whatever you call a “success”). The size argument is the number of trials, or coins in this case. The prob argument is the probability of success on each trial, or for each coin in this case.

dbinom(x = 0, size = 3, prob = 0.5) + 
  dbinom(x = 3, size = 3, prob = 0.5)
## [1] 0.25

We can also frame the dice rolling as a binomial outcome, but it’s a little trickier. Think of rolling the dice one at a time:

  • It doesn’t matter what we roll the first time. We don’t care if it’s a 1 or 6 or whatever. We’re certain to get something, so the probability is 1.
  • The second roll is a “success” if it does not match the first roll. That’s one dice roll (size = 1) where success (x = 1) happens with probability 5/6.
  • The third and final roll is a “success” if it doesn’t match either of the first two rolls. That’s one dice roll (size = 1) where success (x = 1) happens with probability 4/6.

Notice we don’t add these probabilities but rather multiply them.

1 * dbinom(x = 1, size = 1, prob = 5/6) *
  dbinom(x = 1, size = 1, prob = 4/6)
## [1] 0.5555556

We added the coin flipping probabilities because they each represented mutually exclusive events. They cannot both occur. There is one way to get 0 successes and one way to get 3 successes. Together they sum to the probability of all matching (ie, all heads or all tails).

We multiplied the dice rolling probabilities because they each represented events that could occur at the same time, and they were conditional if we considered each die in turn. This requires the multiplication rule of probability.

Fixing the p-value note in an HTML stargazer table

If you insert a stargazer table into an R Markdown document that outputs an HTML file, you may get a p-value note that looks like this:

p<0.1; p<0.05; p<0.01

What should be displayed is this:

*p<0.1; **p<0.05; ***p<0.01

This is the legend for understanding the “stars” in the regression table. What’s happening is the asterisks are being treated as markdown code, which is adding italics and bolding to the note instead of simply showing the asterisks verbatim. (Recall that in Markdown surrounding text with one asterisk adds italics and surrounding text with two asterisks adds bolding.)

To fix this we can add the following two arguments to the stargazer function:

notes.append = FALSE

notes = c("<sup>&sstarf;</sup>p<0.1; <sup>&sstarf;&sstarf;</sup>p<0.05; <sup>&sstarf;&sstarf;&sstarf;</sup>p<0.01")

&sstarf; is an HTML entity for a star character. notes.append = FALSE says to NOT append the note but rather replace the existing note. The notes argument specifies the note we want to add using the &sstarf; HTML entity.
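Putting it together, here’s a minimal sketch of the full call. The model here is a made-up example (any fitted model will do), and the chunk would use results='asis' so the HTML is rendered:

library(stargazer)
fit <- lm(mpg ~ wt + hp, data = mtcars)  # example model
stargazer(fit, type = "html", notes.append = FALSE,
          notes = c("<sup>&sstarf;</sup>p<0.1; <sup>&sstarf;&sstarf;</sup>p<0.05; <sup>&sstarf;&sstarf;&sstarf;</sup>p<0.01"))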

Most useful tests for an ANCOVA model

In his book Regression Modeling Strategies, 2nd Ed, Frank Harrell provides a list of what he calls the “most useful tests” for a 2-level factor \(\times\) numeric model (Table 2.2, p. 19). This is often called an Analysis of Covariance, or ANCOVA. The basic idea is we have a numeric response with substantial variability and we seek to understand the variability by modeling the mean of the response as a function of a categorical variable and a numeric variable.

Let’s simulate some data for such a model and then see how we can use R to carry out these tests.

n <- 400
set.seed(1)
sex <- factor(sample(x = c("f", "m"), size = n, replace = TRUE))
age <- round(runif(n = n, min = 18, max = 65))
y <- 1 + 0.8*age + 0.4*(sex == "m") - 0.7*age*(sex == "m") + rnorm(n, mean = 0, sd = 8)
dat <- data.frame(y, age, sex)

The data contain a numeric response, y, that is a function of age and sex. I set the “true” coefficient values to 1, 0.8, 0.4, and -0.7. They correspond to \(\beta_0\) through \(\beta_3\) in the following model:

\[y = \beta_0 + \beta_1 age + \beta_2 sex + \beta_3 age \times sex\]

In addition the error component is a Normal distribution with a standard deviation of 8.

Now let’s model the data and see how close we get to recovering the true parameter values.

mod <- lm(y ~ age * sex, dat)
summary(mod)
## 
## Call:
## lm(formula = y ~ age * sex, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.8986  -5.8552  -0.2503   6.0507  30.6188 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.27268    1.93776   0.141    0.888    
## age          0.79781    0.04316  18.484   <2e-16 ***
## sexm         2.07143    2.84931   0.727    0.468    
## age:sexm    -0.72702    0.06462 -11.251   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.661 on 396 degrees of freedom
## Multiple R-squared:  0.7874, Adjusted R-squared:  0.7858 
## F-statistic:   489 on 3 and 396 DF,  p-value: < 2.2e-16

While the coefficient estimates for age and the age \(\times\) sex interaction are pretty close to the true values, the same cannot be said for the intercept and sex coefficients. The residual standard error of 8.661 is close to the true value of 8.

We can see in the summary output of the model that four hypothesis tests, one for each coefficient, are carried out for us. Each tests whether the coefficient is equal to 0. Of those four, only one qualifies as one of the most useful tests: the last one for age:sexm. This tests if the effect of age is independent of sex and vice versa. Stated two other ways, it tests if age and sex are additive, or if the age effect is the same for both sexes. To get a better understanding of what we’re testing, let’s plot the data with fitted age slopes for each sex.

library(ggplot2)
ggplot(dat, aes(x = age, y = y, color = sex)) +
  geom_point() +
  geom_smooth(method="lm")


Visually it appears the effect of age is not independent of sex. It seems more pronounced for females. Is this effect real or maybe due to chance? The hypothesis test in the summary output for age:sexm evaluates this. Obviously the effect seems very real. We are not likely to see a difference in slopes this large if there truly was no difference. It does appear the effect of age is different for each sex. The estimate of -0.72 is the estimated difference in slopes (or age effects) between males and females.

The other three hypothesis tests are not very useful.

  • Testing if the Intercept is 0 is testing whether y is 0 for females at age 0.
  • Testing if age is 0 is testing whether age is associated with y for males.
  • Testing if sexm is 0 is testing whether sex is associated with y for subjects at age 0.

Other more useful tests, as Harrell outlines in Table 2.2, are as follows:

  • Is age associated with y?
  • Is sex associated with y?
  • Are either age or sex associated with y?

The last one is answered in the model output. That’s the F-statistic in the last line. It tests whether all coefficients (except the intercept) are equal to 0. The result of this test is conclusive. At least one of the coefficients is not 0.
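One way to reproduce that overall F test is to compare the fitted model to an intercept-only model:

# F statistic should match the 489 on 3 and 396 DF reported above
anova(lm(y ~ 1, data = dat), mod)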

To test if age is associated with y, we need to test if both the age and age:sexm coefficients are equal to 0. The car package by John Fox provides a nice function for this purpose called linearHypothesis. It takes at least two arguments. The first is the fitted model object and the second is a vector of hypothesis tests. Below we specify we want to test if “age = 0” and “age:sexm = 0”.

library(car)
linearHypothesis(mod, c("age = 0", "age:sexm = 0"))
## Linear hypothesis test
## 
## Hypothesis:
## age = 0
## age:sexm = 0
## 
## Model 1: restricted model
## Model 2: y ~ age * sex
## 
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    398 55494                                  
## 2    396 29704  2     25790 171.91 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The result is once again conclusive. The p-value is virtually 0. It does indeed appear that age is associated with y.

Likewise, to test if sex is associated with y, we need to test if both the sex and age:sexm coefficients are equal to 0.

linearHypothesis(mod, c("sexm = 0", "age:sexm = 0"))
## Linear hypothesis test
## 
## Hypothesis:
## sexm = 0
## age:sexm = 0
## 
## Model 1: restricted model
## Model 2: y ~ age * sex
## 
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
## 1    398 119354                                 
## 2    396  29704  2     89651 597.6 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As expected this test confirms that sex is associated with y, just as we specified when we simulated the data.

Now that we have established that age is associated with y, and that the association differs for each sex, what exactly is that association for each sex? In other words what are the slopes of the lines in our plot above?

We can sort of answer that with the model coefficients.

round(coef(mod),3)
## (Intercept)         age        sexm    age:sexm 
##       0.273       0.798       2.071      -0.727

That corresponds to the following model:

\[y = 0.273 + 0.798 age + 2.071 sex - 0.727 age \times sex\]

When sex is female, the fitted model is

\[y = 0.273 + 0.798 age \]

This says the slope of age is about 0.8 when sex is female.

When sex is male, the fitted model is

\[y = (0.273 + 2.071) + (0.798 - 0.727) age \]
\[y = 2.344 + 0.071 age \]

This says the slope of age is about 0.07 when sex is male.
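We can also get the male slope directly from the coefficients instead of doing the arithmetic by hand:

# age slope for males = age coefficient + interaction coefficient (about 0.07)
unname(coef(mod)["age"] + coef(mod)["age:sexm"])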

How certain are we about these estimates? That’s what standard error is for. For the age slope estimate for females the standard error is provided in the model output for the age coefficient. It shows about 0.04. Adding and subtracting 2 \(\times\) 0.04 to the coefficient gives us a rough 95% confidence interval. Or we could just use the confint function:

confint(mod, parm = "age")
##         2.5 %    97.5 %
## age 0.7129564 0.8826672

The standard error of the age slope estimate for males takes a little more work. Another car function useful for this is the deltaMethod function. It takes at least three arguments: the model object, the quantity expressed as a character phrase that we wish to estimate a standard error for, and the names of the parameters. The function then calculates the standard error using the delta method. Here’s one way to do it for our model:

deltaMethod(mod, "b1 + b3", parameterNames = paste0("b", 0:3)) 
##           Estimate         SE       2.5 %    97.5 %
## b1 + b3 0.07079277 0.04808754 -0.02345709 0.1650426

The standard error is similar in magnitude, but since our estimate is so small the resulting confidence interval overlaps 0. This tells us the effect of age on males is too small for our data to determine if the effect is positive or negative.

Another way to get the estimated age slopes for each sex, along with standard errors and confidence intervals, is to use the margins package. We use the margins function with our model object and specify that we want to estimate the marginal effect of age at each level of sex. (“marginal effect of age” is another way of saying the effect of age at each level of sex)

library(margins)
margins(mod, variables = "age", at = list(sex = c("f", "m")))
## Average marginal effects at specified values
## lm(formula = y ~ age * sex, data = dat)
##  at(sex)     age
##        f 0.79781
##        m 0.07079

This does the formula work we did above. It plugs in sex and returns the estimated slope coefficient for age. If we wrap the call in summary we get the standard errors and confidence intervals.

summary(margins(mod, variables = "age", at = list(sex = c("f", "m"))))
##  factor    sex    AME     SE       z      p   lower  upper
##     age 1.0000 0.7978 0.0432 18.4841 0.0000  0.7132 0.8824
##     age 2.0000 0.0708 0.0481  1.4722 0.1410 -0.0235 0.1650
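As an aside, the {emmeans} package (not used in the original version of this post) can produce similar estimates of the age slope for each sex via its emtrends() function; a sketch, assuming emmeans is installed:

library(emmeans)
emtrends(mod, ~ sex, var = "age")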

Using natural splines in linear modeling

Take a look at this scatterplot:

[scatterplot of x versus y showing a non-linear relationship]

It's clear there is a relationship between x and y, but the relationship is non-linear. How can we fit a linear model to this data? We could try fitting a polynomial model. The relationship seems to “change directions” four different times, so we could try fitting a 4th-degree polynomial model.

modp <- lm(y ~ poly(x, 4))
# add fitted line
plot(x,y)
lines(x, fitted(modp))


This sort of captures the general nature of the relationship but the peaks and valleys just aren't quite right. They either over-predict or under-predict. We would like to do better.

Another approach might be to use a nonparametric regression approach such as loess. If we set the span parameter to 0.5, which controls the amount of smoothing, we get a decent fitting model:

modl <- loess(y ~ x, span = 0.5)
plot(x, y)
lines(x, predict(modl))


This matches what we get when we use ggplot with the smooth geom:

library(ggplot2)
ggplot(data.frame(x, y), aes(x, y)) +
  geom_point() +
  geom_smooth(se = F, span = 0.5)


But the drawback is we have no prediction equation. This is a non-parametric approach, hence no parameters were estimated.

This leads us to restricted cubic splines, or natural splines. The basic idea is to model a non-linear relationship such as the one in our example with piecewise cubic polynomials. Let's go ahead and first use natural splines in our linear model and then talk a little more about what's happening behind the scenes. Below we first load the splines package (a recommended package that comes with base R) so we have access to the ns function (natural splines). Notice we call ns on our predictor and specify df as 4. Specifying df = 4 implies 3 interior knots (ie, not including two “boundary knots”).

library(splines)
modns <- lm(y ~ ns(x, df = 4))
plot(x, y)
lines(x, predict(modns))


This looks better than the polynomial model. And unlike the loess fit, we have a prediction equation:

summary(modns)
## 
## Call:
## lm(formula = y ~ ns(x, df = 4))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.355 -1.302 -0.052  1.279  5.325 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.6489     0.8242   0.787    0.433    
## ns(x, df = 4)1  11.9996     1.0508  11.420   <2e-16 ***
## ns(x, df = 4)2  50.1753     1.0438  48.071   <2e-16 ***
## ns(x, df = 4)3  72.0807     2.1079  34.195   <2e-16 ***
## ns(x, df = 4)4  19.5918     0.9794  20.004   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.114 on 95 degrees of freedom
## Multiple R-squared:  0.977,  Adjusted R-squared:  0.976 
## F-statistic:  1008 on 4 and 95 DF,  p-value: < 2.2e-16

Obviously the coefficients defy interpretation, but we can work with them as we would any other linear model. For example, we can test for linearity, that is \(H_0: \beta_2 = \beta_3 = \beta_4 = 0\). In our model that means testing that the last 3 coefficients are equal to 0. The car package provides the powerful linearHypothesis function for this purpose.

library(car)
linearHypothesis(modns, names(coef(modns))[3:5])
## Linear hypothesis test
## 
## Hypothesis:
## ns(x, df = 4)2 = 0
## ns(x, df = 4)3 = 0
## ns(x, df = 4)4 = 0
## 
## Model 1: restricted model
## Model 2: y ~ ns(x, df = 4)
## 
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1     98 18311.8                                  
## 2     95   424.4  3     17887 1334.7 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It's no surprise that the test is highly significant. We can also verify that our model with natural splines is superior to the polynomial model via AIC. (Recall a lower AIC is better.)

AIC(modp, modns)
##       df      AIC
## modp   6 521.2002
## modns  6 440.3375

Now that we have played with natural splines, let's back up and try to get a better understanding of what's going on. To begin with, here's how we generated the data:

x <- 1:100         # independent variable
k <- c(25, 50, 75) # 3 interior knots

# function to construct variables x2, x3, x4
u <- function(x) ifelse(x > 0, x, 0) # returns x when x > 0, otherwise 0

x2 <- u(x - k[1])
x3 <- u(x - k[2])
x4 <- u(x - k[3])

# generate data
set.seed(1)
y <- 0.8 + 1*x + -1.2*x2 + 1.4*x3 + -1.6*x4 + rnorm(100,sd = 2.2)
plot(x, y)


Our first predictor is x, which is simply the numbers 1 through 100. Next we define 3 “knots” at 25, 50, and 75. We then use those knots to construct three additional variables, x2, x3, and x4 (see the snippet after this list for what they look like near the first knot):

  • x2 is equal to x - 25 when x - 25 is greater than 0, and 0 otherwise
  • x3 is equal to x - 50 when x - 50 is greater than 0, and 0 otherwise
  • x4 is equal to x - 75 when x - 75 is greater than 0, and 0 otherwise
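
Here's that quick peek at a few rows around the first knot, just to make the construction concrete (an illustrative aside):

# rows 23 through 27: x2 is 0 until x passes 25, then equals x - 25
data.frame(x, x2, x3, x4)[23:27, ]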

Finally we generate our dependent variable, y, as a function of x, x2, x3, and x4, plus some noise from a Normal distribution with mean 0 and standard deviation 2.2. The formula on the right side of the assignment operator is our true model. This is technically called a linear spline function. The best linear model we could fit to the data would be the following:

mod <- lm(y ~ x + x2 + x3 + x4)
summary(mod)
## 
## Call:
## lm(formula = y ~ x + x2 + x3 + x4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0699 -1.2797  0.1007  1.2322  4.9155 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.34673    0.77485   1.738   0.0854 .  
## x            0.97506    0.04425  22.035   <2e-16 ***
## x2          -1.14818    0.07094 -16.186   <2e-16 ***
## x3           1.35246    0.06220  21.743   <2e-16 ***
## x4          -1.57494    0.06865 -22.941   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.01 on 95 degrees of freedom
## Multiple R-squared:  0.9792, Adjusted R-squared:  0.9783 
## F-statistic:  1117 on 4 and 95 DF,  p-value: < 2.2e-16
plot(x, y)
lines(x, fitted(mod))


So we see that to make the trajectory of our data change directions 4 times, we needed to create 4 predictors, 3 of which were derived from \(x\) itself. This might make intuitive sense if we think of a simple parabola: its trajectory takes two directions (up, then down), and it has two coefficients, for the \(x\) and \(x^2\) terms. It's a 2nd degree polynomial. Likewise for a 3rd degree polynomial, a 4th degree polynomial, and so forth. There's only one \(x\), but the trajectory of \(y\) changes depending on the degree of the polynomial.

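To make this concrete, here's a quick illustrative sketch (not a figure from the original analysis) comparing polynomial fits of increasing degree on our data:

# compare polynomial fits of degree 2, 3, and 4 on the same data
op <- par(mfrow = c(1, 3))
for (d in 2:4) {
  plot(x, y, main = paste0("degree ", d))
  lines(x, fitted(lm(y ~ poly(x, d))), lwd = 2)
}
par(op)  # restore the previous plotting settings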

The takeaway is that when our response variable has a non-linear relationship with a predictor variable in real life, we may need more than just a single slope coefficient to model the relationship. In the example we've been using, the slope, or trajectory, changes directions 4 times, which suggests using four predictors instead of one. We tried a 4th degree polynomial of \(x\) but saw that it didn't work as well as we would have liked. A recommended approach, then, is to try natural splines.

The basic, and I mean very basic, idea of natural splines is to fit a 3rd degree polynomial to the data between each pair of knots and then join those pieces together smoothly. For example, below is our data with knots defined at 0, 25, 50, 75, and 100.

plot(x,y)
abline(v = c(0,25,50,75,100))


With 5 knots, we have 4 regions of data. Within those 4 regions, natural splines essentially allow us to fit 4 different 3rd degree polynomials, all smoothed together. The magic is in how the additional predictors, the columns of the basis matrix, are generated. Using the ns function in the splines package, we can create a basis matrix that allows us to fit a natural cubic spline using regular regression functions such as lm and glm. How the basis matrix is generated is quite complicated and probably something you'll just want to take on faith, like I do.
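
We can at least look at the basis matrix ns produces; each column becomes a predictor in the regression. Here's a quick peek (an illustrative aside):

# the basis matrix for interior knots at 25, 50, 75: 100 rows, 4 columns
B <- ns(x, knots = c(25, 50, 75))
dim(B)
head(round(B, 3))
attr(B, "Boundary.knots")  # boundary knots default to the range of x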

We can sort of see the natural spline in action if we fit the model and then color the fitted line differently between the knots. Below we regress \(y\) on a natural spline of \(x\) with knots defined at 25, 50, and 75. We then color the fitted line segments differently between the knots.

plot(x,y)
mod.ns <- lm(y ~ ns(x, knots = c(25,50,75)))
lines(x[1:25], fitted(mod.ns)[1:25],col=1)
lines(x[26:50], fitted(mod.ns)[26:50],col=2)
lines(x[51:75], fitted(mod.ns)[51:75],col=3)
lines(x[76:100], fitted(mod.ns)[76:100],col=4)


Notice the red and green interior lines have a noticeable 3rd degree polynomial shape. The exterior black and blue lines look more like a 2nd degree polynomial. That's because natural splines are constrained to be linear in the tails (ie, beyond the boundary knots). For this reason, natural splines are sometimes called restricted cubic splines. If we didn't want that constraint, we could use the bs function to generate a B-spline basis matrix. Here's what that looks like.

plot(x,y)
mod.bs <- lm(y ~ bs(x, knots = c(25,50,75)))
lines(x[1:25], fitted(mod.bs)[1:25],col=1)
lines(x[26:50], fitted(mod.bs)[26:50],col=2)
lines(x[51:75], fitted(mod.bs)[51:75],col=3)
lines(x[76:100], fitted(mod.bs)[76:100],col=4)


Notice the blue line in particular curls up at the boundary, as would be expected of a 3rd degree polynomial. But that's not the only difference. Look at the summary output. Notice the bs function generated a basis matrix with 6 columns instead of 4.

summary(mod.bs)
## 
## Call:
## lm(formula = y ~ bs(x, knots = c(25, 50, 75)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8247 -1.1582 -0.0468  1.2780  5.0283 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      1.983      1.218   1.628  0.10699    
## bs(x, knots = c(25, 50, 75))1    6.623      2.281   2.903  0.00461 ** 
## bs(x, knots = c(25, 50, 75))2   33.328      1.529  21.794  < 2e-16 ***
## bs(x, knots = c(25, 50, 75))3    8.281      1.840   4.501 1.96e-05 ***
## bs(x, knots = c(25, 50, 75))4   60.284      1.722  35.000  < 2e-16 ***
## bs(x, knots = c(25, 50, 75))5   40.697      1.913  21.278  < 2e-16 ***
## bs(x, knots = c(25, 50, 75))6   38.377      1.688  22.736  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.052 on 93 degrees of freedom
## Multiple R-squared:  0.9788, Adjusted R-squared:  0.9774 
## F-statistic: 714.2 on 6 and 93 DF,  p-value: < 2.2e-16

So fitting a B-spline means actually fitting a more complicated model. For this reason, and the fact that B-splines can be poorly behaved in the tails (Harrell, p. 24), most statisticians recommend working with natural splines.

To wrap up, let's go over some guidelines for using natural splines with real data. First off, you don't want to go looking at your data and guessing how many times it changes direction to determine your knots or degrees of freedom! I did that simply as a way to help explain how splines work. In practice you'll want to determine degrees of freedom based on sample size and how important you suspect a predictor to be. Harrell states that 4 degrees of freedom is usually sufficient. If your sample is on the small side, perhaps choose 3 degrees of freedom. If it's large, go with 5. And notice we're talking about degrees of freedom, not knots. The location of the knots isn't that crucial, and ns will automatically place knots at quantiles of the predictor.
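
For example, with df = 4 and no knots supplied, ns picks the 3 interior knot locations for us from quantiles of the predictor. We can see where they landed by inspecting the attributes of the basis (a quick aside):

# where did ns() place the interior and boundary knots for df = 4?
attr(ns(x, df = 4), "knots")
attr(ns(x, df = 4), "Boundary.knots")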

Something else to remember is that the coefficients of a model with natural splines defy any sort of interpretation. So forget using the “1-unit increase in x leads to a __ increase in y” method to explain an association. An alternative approach is an effect plot, which allows you to visualize your model given certain predictor values. Here's a quick demonstration using a simplified example that comes with the powerful effects package. Below we model the log of prestige, a prestige score for someone's occupation, as a function of logged income and a natural spline basis of education with 4 degrees of freedom. Calling summary and anova on the model object reveals that the natural spline appears warranted and is highly significant.

library(effects)
mod.pres1 <- lm(log(prestige) ~ log(income) + ns(education, 4),
                data=Prestige)
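
Here are those calls, with the output omitted; the spline term for education comes out highly significant, as noted above:

summary(mod.pres1)  # coefficients for log(income) and the 4 spline terms
anova(mod.pres1)    # sequential F test for the ns(education, 4) term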

But what does it mean? What is the association between prestige and education when, say, holding income at the mean value? An effect plot sheds some light.

eff.log <- Effect("education", mod.pres1, transformation=list(inverse=exp))
plot(eff.log)


This shows that from about 9 to 14, education has a pretty dramatic effect on prestige scores, but the effect is rather uncertain in the extremes, below 9 and above 14. From 10 to 14, it looks like a 2-level increase in education is worth about a 10-point increase in prestige scores. We couldn't guess that from the summary output, but we can sort of infer it from the effect plot. Again, this is just an example and not a replication of the original analysis.

Reference:

Harrell, F. E. (2015). Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis (2nd ed.). Springer.

Recreating a Geometric CDF plot from Casella and Berger

For reasons I no longer remember I decided a while back to reproduce Figure 1.5.2 from Casella and Berger (page 32):

It's a plot of the cumulative distribution function of a geometric distribution with p = 0.3. The geometric distribution is best explained with coin flipping. Let's say we keep flipping a coin until we observe a head. Now let X be a random variable that equals the number of tosses required to see a head. So if we get a head on the first flip of the coin, that means we needed one toss. If it takes two tosses, we got a tail on the first toss and then a head on the second. The graph above visualizes the possibilities for 15 flips with the probability of getting heads set to 0.3. We see that the probability of needing only one toss is 0.3, which makes sense since the probability of getting heads is 0.3. The probability of needing two or fewer tosses is about 0.5. And so on. The horizontal lines indicate that the cumulative probability only changes at the whole numbers; there is no such thing as, say, 2.23 flips. This is often referred to as a step function. We see that there's a high probability of getting a head by the 5th or 6th flip. The geometric is a simple waiting-time distribution that can be used to model the number of failures before the first success, or vice versa.
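
If you want to convince yourself of those numbers, a quick simulation works. (This is an extra aside; rgeom returns the number of failures before the first success, so the number of tosses is that value plus one.)

set.seed(1)
tosses <- rgeom(1e5, prob = 0.3) + 1  # tosses needed to see the first head
mean(tosses == 1)   # should be near 0.3
mean(tosses <= 2)   # should be near 0.51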

To recreate this plot in R I used the pgeom function, which returns the cumulative probability of the geometric distribution for a given number of failures before a success occurs and a given probability of success. For example, the probability of 0 failures before observing a success with p = 0.3:

pgeom(0, prob = 0.3)
## [1] 0.3

The probability of at most 1 failure before observing a success with p = 0.3:

pgeom(1, prob = 0.3)
## [1] 0.51

To generate all the probabilities in the plot, we can do the following:

y <- pgeom(q = 0:14, prob = 0.3)
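
As a quick sanity check (an extra aside, not needed for the plot), these cumulative probabilities are just running sums of the point probabilities returned by dgeom:

# pgeom() accumulates dgeom(): P(X <= q) = P(X = 0) + ... + P(X = q)
all.equal(y, cumsum(dgeom(0:14, prob = 0.3)))  # should be TRUE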

Notice that instead of using 1:15 (the total number of flips needed for a success), we use 0:14. That's because pgeom works with the number of failures before a success. Now that we have our y coordinates, we can create our first version of the plot as follows:

plot(x = 1:15, y, pch = 19)


Now let's add the horizontal lines using the segments function:

plot(x = 1:15, y, pch = 19)
segments(x0 = 0:15, y0 = c(0, y),
         x1 = 1:16, y1 = c(0, y), lwd = 2)


The x0 and y0 coordinates are where each line starts. The x1 and y1 coordinates are where each line ends. Since the lines are horizontal, the y coordinates are the same for the start and end positions. The y coordinates include 0, so we add that value to y with the c function. The lwd = 2 argument makes the lines a little thicker and darker.

Our plot is basically done, but just for fun I wanted to see how close I could make it look like the version in the book. That means relabeling the axes, moving the axis labels to the ends, and removing the lines at the top and right side of the plot. It also means moving the axis tick marks inside the plotting area. After some trial and error and deep dives into R documentation, here's what I was able to come up with:

plot(x = 1:15, y, pch = 19, 
     yaxt = "n", xaxt = "n", ylim = c(0,1.05), xlim = c(0,15.5), 
     bty="l", xaxs="i", yaxs = "i", xlab = "", ylab = "")
segments(x0 = 0:15, y0 = c(0, y),
         x1 = 1:16, y1 = c(0, y), lwd = 2)
axis(side = 1, at = 0:15, 
     labels = 0:15, tcl = 0.5, family = "serif")
axis(side = 2, at = seq(0,1,0.1), 
     labels = c(0, paste0(".", 1:9), 1), las = 1, tcl = 0.5, family = "serif")
mtext(text = expression(italic(x)), side = 4, 
      at = 0, las = 1, line = 0.5, family = "serif")
mtext(text = expression(italic(F[x](x))), side = 3, 
      at = 0, line = 0.5, family = "serif")


In the plot function:

  • the yaxt = "n" and xaxt = "n" arguments say “don't draw the default axes”. I instead use the axis function to create the axes.
  • the ylim = c(0,1.05) and xlim = c(0,15.5) arguments tell the axes to end at 1.05 and 15.5, respectively. I wanted them to extend beyond the last value just as they do in the book.
  • the bty="l" argument says “only draw the bottom and left sides of the plot box, like a capital letter L”
  • the xaxs="i" and yaxs = "i" arguments ensure the axes fit within the original range of the data. The default is to extend the range by 4 percent at each end. Again, I was trying to recreate the graph in the book. Notice how the origin has the x-axis and y-axis 0 values right next to one another.
  • The xlab = "" and ylab = "" set blank axis labels. I instead use the mtext function to add axis labels.

The segments function remained unchanged.

The axis function allows us to explicitly define how the axis is created.

  • The side argument specifies which side of the plot we're placing the axis. 1 is the bottom, 2 is the left.
  • at is where we draw the tick marks.
  • labels are how we label the tick marks.
  • The tcl argument specifies how long to make the tick marks and in what direction. A positive value extends the tick marks into the plotting region.
  • The las argument in the second axis function makes the labels on the y-axis horizontal.

Finally I used the mtext function to create the axis labels. mtext writes text into the margins of a plot and can take some trial and error to get the placement of text just so.

  • The text argument is what text we want to place in the graph. In this case I make use of the expression function which allows us to create mathematical notation. For example, the syntax expression(italic(F[x](x))) returns \(F_x (x)\)
  • The side argument again refers to where in the plot to place the text. 3 is top and 4 is right. This means the y-axis label is actually in the top of the plot and the x-axis label is on the right. A little bit of a hack.
  • at says where to place the text along the axis parallel to the margin. In both cases we use 0. We want the y-axis label at the 0 point corresponding to the x-axis, and the x-axis label at the 0 point corresponding to the y-axis. A little confusing, I think.
  • The las argument rotates the x label to be horizontal.
  • The line argument specifies on which margin line to place the text, starting at 0 counting outwards. This is one that takes some trial and error to get just right.
  • The family argument specifies the type of font to use. “serif” is like Times New Roman.

Not perfect, but close enough. Of course I much prefer the R defaults when it comes to plotting layout. Even though R allows us to recreate this graph, I don't think it's necessarily a “good” graph.

I also decided to tackle this using ggplot. Here's how far I got.

library(ggplot2)
dat <- data.frame(x = 1:15, y = pgeom(q = 0:14, prob = 0.3))
dat2 <- data.frame(x = 0:15, y = c(0, dat$y),
                   xend = 1:16, yend = c(0, dat$y))

ggplot(dat, aes(x = x, y = y)) + geom_point() +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend), data = dat2) +
  scale_x_continuous(breaks = 0:15, labels = 0:15) +
  scale_y_continuous(breaks = seq(0, 1, 0.1), labels = c(0, paste0(".", 1:9), 1)) +
  labs(x = "x", y = expression(italic(F[x](x)))) +
  theme(panel.background = element_blank(),
        axis.line.x = element_line(color = "black"),
        axis.line.y = element_line(color = "black"),
        axis.title.x = element_text(face = "italic", hjust = 1),
        axis.title.y = element_text(face = "italic", hjust = 1, angle = 0))


You can see I couldn't figure out how to move the axis ticks into the plotting region, or how to place axis labels at the ends of the axis, or how to get the origin to start at precisely (0, 0). I'm not saying it can't be done, just that I lost interest in trying to go further. And a very strong argument can be made that these are things you shouldn't do anyway! But as I said at the outset, this was all for fun.