I was reading this sobering article about the rise of suicide rates in the United States and decided to read the NCHS data brief on which it was based. I noticed the source data was freely available online through the CDC Wonder databases, specifically the Detailed Mortality database, and figured I would take a look. One graph that caught my eye was Figure 2, which showed that suicide rates for females were higher in 2014 than in 1999 for all age groups under 75 years.
That's a very sad and concerning graph. What happened in the last 15 years? I have no theories about the increase in suicide rates, but I do admire the CDC statisticians who tackle this kind of gut-wrenching data, day in and day out. Let's join them for a moment and recreate this graph in R.
First we need to get the data from the Detailed Mortality database. Instead of directly downloading raw data, we submit a query for the data we want. There is a 7-step form to complete.
Step 1: Group by ICD-10 113 Cause List, Cause of death, Age Groups, Gender, and Year. This uses all available pull-down lists.
Step 2: leave as is. (ie, all US states)
Step 3: leave as is. (ie, all genders, ages, and races)
Step 4: select 1999 and 2014 using Ctrl + Click to select both years
Step 5: leave as is. (ie, all weekdays, autopsies, and places of death)
Step 6: select the ICD-10 113 Cause List radio button and select “Intentional self-harm (suicide) (*U03,X60-X84,Y87.0)”
Step 7: click Send.
The results should appear in your web browser. If everything looks good, click the Export button to download a tab-delimited file of the results. It will have a .txt extenstion. Now that we have our data, we can go to R and get to work.
Before we read in the data we should open it in a text editor and see what we got. We have a header row, so that's good. If we scroll down, we'll see there's about 70 lines of metadata. This is data about our data, such as where it came from, when we downloaded it, etc. We don't want that in our data frame, but we should save it with our data frame as an attribute.
To start we'll use readLines
to read in the data one line at a time. I chose to do this as a way to programmatically figure out where the metadata begins. I noticed it begins after a divider of three dashes (“—”), so I use grep
in combination with min
to find the first occurence of this divider and save a n
.
raw <- readLines("Underlying Cause of Death, 1999-2014 (3).txt")
## Warning in readLines("Underlying Cause of Death, 1999-2014 (3).txt"): ## incomplete final line found on 'Underlying Cause of Death, 1999-2014 ## (3).txt'
n <- min(grep(pattern = "---",raw))
Now I can use n
with the nrows
argument in the read.delim
function to tell it how many rows I want to read in. Notice I have to say n - 2
. I have to subtract the header row and the divider row itself. I also specify “Not Applicable” and “Unreliable” as NA values. Finally, take a look at my file name. Yep, it took me a few tries to get the data I wanted. I'm not too proud to admit that. :)
dat <- read.delim(file = "Underlying Cause of Death, 1999-2014 (3).txt", header = TRUE, nrows = n-2,
na.strings = c("Not Applicable","Unreliable"))
Let's go ahead and save the metadata with our data frame. We can do that with the comment
function. We don't need it to create the graph, but it's good practice to preserve this kind of information. To see the metadata after we save, just enter comment(dat)
. It's not pretty when it's printed to the console, but we have it if we need it.
comment(dat) <- raw[n:length(raw)]
Now we need to clean up the data. First order of business is to reorder the age groups. By default, R brought them in as a factor and put the 15-24 age group as the first level. We want that first level to be the 5-14 age group. It's already a factor, but we reorder the levels by making the column a factor again and specifying the levels in the specified order. The original 5th level comes first and then everything else. There may be a better way of doing this, but it works.
dat$Ten.Year.Age.Groups.Code <- factor(dat$Ten.Year.Age.Groups.Code,
levels = levels(dat$Ten.Year.Age.Groups.Code)[c(5,1,2,3,4,6,7,8,9,10)])
We only need females, so let's subset. I make a new data frame called dat2
.
dat2 <- subset(dat, Gender=="Female")
Now if we look at Figure 2 again, the x-axis has 6 age groups, but our data has 9 age groups. So we need to combine a few age groups. Specifically we need to combine the “25-34” and “35-44” groups into “25-44”, combine the “45-54” and “55-64” groups into “45-64”, and the “75-84” and “85+” into “75+”. That's not too bad. We just reset the levels.
levels(dat2$Ten.Year.Age.Groups.Code) <- c("5-14", "15-24", "25-44", "25-44", "45-64", "45-64",
"65-74", "75+", "75+", "NS")
Since we have 6 age groups instead of 9, we need to combine death totals for the three new age groups. We also need to combine death totals for all the various causes of deaths. We can do that with aggregate
. Below we sum Deaths
by Ten.Year.Age.Groups.Code
and Year
. This returns a new data frame that I call dat3
.
(dat3 <- aggregate(Deaths ~ Ten.Year.Age.Groups.Code + Year, data=dat2, sum))
## Ten.Year.Age.Groups.Code Year Deaths ## 1 5-14 1999 50 ## 2 15-24 1999 575 ## 3 25-44 1999 2359 ## 4 45-64 1999 1868 ## 5 65-74 1999 420 ## 6 75+ 1999 469 ## 7 5-14 2014 151 ## 8 15-24 2014 990 ## 9 25-44 2014 3018 ## 10 45-64 2014 4195 ## 11 65-74 2014 828 ## 12 75+ 2014 477 ## 13 NS 2014 1
Next I need to calculate “deaths per 100,000” for each age group. We do this by dividing Deaths by the age group population and multiplying by 100,000. I need to add the age group populations to dat3
to do this. First I need to sum the populations for my new age groups. I can use aggregate again with dat2
, but I need to remove duplicate rows of age group population values before I do this. I decided to save the de-duped data frame to tmp and not overwrite dat2
tmp <- subset(dat2, select = c("Year", "Ten.Year.Age.Groups.Code", "Population"))
tmp <- tmp[!duplicated(tmp),]
(dat4 <- aggregate(Population ~ Ten.Year.Age.Groups.Code + Year, data=tmp, sum))
## Ten.Year.Age.Groups.Code Year Population ## 1 5-14 1999 19908282 ## 2 15-24 1999 18860184 ## 3 25-44 1999 42614694 ## 4 45-64 1999 31011068 ## 5 65-74 1999 10125424 ## 6 75+ 1999 10371906 ## 7 5-14 2014 20161424 ## 8 15-24 2014 21456371 ## 9 25-44 2014 41900194 ## 10 45-64 2014 42789506 ## 11 65-74 2014 14049245 ## 12 75+ 2014 11842674
Now I'm ready to merge dat3
and dat4
so I have Deaths and Populations in one data frame. Before I do that I remove the row with age group of “NS” which means “Not Stated”. We don't need that row.
dat3 <- subset(dat3, Ten.Year.Age.Groups.Code != "NS")
dat5 <- merge(dat3,dat4,by = c("Year","Ten.Year.Age.Groups.Code"))
And finally we can calculate Rate. I decided to also sort the data frame as well.
dat5$Rate <- round(dat5$Deaths/dat5$Population * 100000,1)
(dat5 <- dat5[order(dat5$Ten.Year.Age.Groups.Code, dat5$Year),])
## Year Ten.Year.Age.Groups.Code Deaths Population Rate ## 4 1999 5-14 50 19908282 0.3 ## 10 2014 5-14 151 20161424 0.7 ## 1 1999 15-24 575 18860184 3.0 ## 7 2014 15-24 990 21456371 4.6 ## 2 1999 25-44 2359 42614694 5.5 ## 8 2014 25-44 3018 41900194 7.2 ## 3 1999 45-64 1868 31011068 6.0 ## 9 2014 45-64 4195 42789506 9.8 ## 5 1999 65-74 420 10125424 4.1 ## 11 2014 65-74 828 14049245 5.9 ## 6 1999 75+ 469 10371906 4.5 ## 12 2014 75+ 477 11842674 4.0
A quick comparison of our rates with those in Figure 2 shows that we're in agreement. Now it's time to re-create the graph. For this I chose to use ggplot2. To match the colors, I uploaded a picture of the original graph to this wonderful site and figured out the shades of green were #6D9F40 (dark green) and #CFDABA (light green).
library(ggplot2)
ggplot(dat5, aes(x=Ten.Year.Age.Groups.Code, y=Rate, fill=factor(Year))) +
geom_bar(stat="identity", position = "dodge") +
labs(x="Age (years)",y="Deaths per 100,000 in specified group") +
scale_fill_manual("Year", values = c("#6D9F40","#CFDABA")) +
annotate(geom = "text", x = (rep(1:6,each=2) + c(-1,1)*0.22),
y = dat5$Rate + 0.2,
label = sprintf("%.1f", dat5$Rate)) +
theme_bw() +
scale_y_continuous(breaks=seq(0,10,2))
The first line defines the aesthetics. I want Year
to be treated like a categorical variable, so I call it with factor
. The next lines says, “don't count the rows to create the bar graph, use the actual values of Rate
(ie, identity).” It also says “dodge”“ the Years so they're next to each other (instead of stacked). The third line labels the axes. The fourth line titles the legend and defines the colors. The annontate
line adds the counts to the plot. That took some trial and error! The x and y values are the coordinates. The label is the Rate
values rendered as character with one decimal place. The theme_bw()
line is an easy way to get rid of the default grey background and the last line relabels the y axis.