Machine Learning for Hackers, Chapter 9

In this chapter we learn a little about Multidimensional Scaling, or MDS for short. Like PCA in Chapter 8, MDS is a topic usually taught in multivariate statistics classes. I guess it’s classified as machine learning because of the computing required to do it, though the same can be said of almost any statistical method. The MDS that is presented is the classical type (as opposed to the non-metric type). The case study is exploring similarity in the US Senate based on votes. As it turns out, a proper MDS analysis shows Democrats and Republicans indeed vote differently and are quite polarized. Not surprising but better than anecdotal evidence from some political pundit.

Now if you’ve been following my series of posts on this book (and I’m pretty sure you haven’t) then you know I don’t summarize the chapter but rather highlight the R code that I found interesting and new. But for the first time I really don’t have much to highlight. It’s a short chapter and the data for the case study required very little preparation. I was actually able to follow all the R code without running to the help pages and Google. Still, though, I feel like I should document something from this chapter.

How about reading in a bunch of files? I feel like this is something I will forget how to do one day, so I should probably write about it to help form some memory. First they define a variable that stores a path, and then they call the list.files function to create a vector of file names:

data.dir <- "data/roll_call/"
data.files <- list.files(data.dir)

Next they use the lapply function to create one big list object that contains the contents of all those files, which happen to be Stata data sets:

library(foreign)  # provides read.dta for Stata files
rollcall.data <- lapply(data.files, function(f)
  read.dta(paste(data.dir, f, sep=""), convert.factors=FALSE))

So the lapply function applies the read.dta function to each element of the data.files vector, each of which is a file name. Inside the read.dta call, you'll notice the paste function is used to build the full path from the directory and the file name. Just dandy. That never fails to impress.
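Here's a small self-contained sketch of the same pattern that runs without the roll call data. It's my own toy version, not the authors' code: the directory name and the CSV contents are made up, and it uses read.csv instead of read.dta. It also passes full.names = TRUE to list.files, which returns complete paths so the paste step isn't needed.

```r
# Toy setup (hypothetical data): write three small CSV files to a temp directory.
tmp.dir <- file.path(tempdir(), "roll_call_demo")
dir.create(tmp.dir, showWarnings = FALSE)
for (i in 1:3) {
  toy <- data.frame(id = 1:2, vote = c(i, i + 1))
  write.csv(toy, file.path(tmp.dir, paste0("file", i, ".csv")), row.names = FALSE)
}

# full.names = TRUE returns the directory path along with each file name.
demo.files <- list.files(tmp.dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file into one list of data frames.
demo.data <- lapply(demo.files, read.csv)

length(demo.data)      # number of files read
head(demo.data[[1]])   # contents of the first file
```

Whether to paste the path together or ask list.files for full paths is mostly a matter of taste; the second just leaves one less thing to get wrong.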

If you want to inspect some of the data, head(rollcall.data) won't help much. Since it's a list object, head returns the first six list elements in full rather than the first few rows, so you have to do something like this:

head(rollcall.data[[1]])

That shows the first 6 rows of the data stored in the first list element.
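To convince myself of the difference between single and double brackets, here's a quick sketch with a made-up list of two data frames standing in for rollcall.data (toy.list and its contents are hypothetical):

```r
# Hypothetical list of two small data frames.
toy.list <- list(data.frame(x = 1:10), data.frame(x = 11:20))

class(toy.list[1])     # single brackets return a sub-list
class(toy.list[[1]])   # double brackets return the element itself

nrow(head(toy.list[[1]]))   # head() on a data frame keeps the first 6 rows
length(head(toy.list))      # head() on a list counts elements, not rows

# Dimensions of every element at once
sapply(toy.list, dim)
```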

Later on they use the strsplit function to split first and last names of senators by comma. I'm pretty sure I've posted about this before, but it doesn't hurt to mention it again. Running a vector through strsplit returns a list. For example:

 strsplit(congress$name[1], "[, ]")
[[1]]
[1] "SHELBY" ""       "RIC"

If I just want the last name, I can do this:

 strsplit(congress$name[1], "[, ]")[[1]][1]
[1] "SHELBY"
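The empty string in the middle of that result comes from the pattern: "[, ]" matches a single comma or a single space, so the comma and the space after it each trigger a split. A short sketch with a couple of made-up entries in the same format (toy.names is my own stand-in for congress$name):

```r
toy.names <- c("SHELBY, RIC", "HEFLIN, HOW")

# "[, ]" matches one comma OR one space, so ", " produces two splits
# with an empty string between them.
strsplit(toy.names[1], "[, ]")
# [[1]]
# [1] "SHELBY" ""       "RIC"

# Splitting on ", " as a unit avoids the empty piece.
strsplit(toy.names[1], ", ")[[1]]
# [1] "SHELBY" "RIC"

# A whole vector in, a list with one element per string out.
pieces <- strsplit(toy.names, ", ")
length(pieces)
```

Either pattern works if all you want is the first piece, which is why the empty string is harmless here.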

Now the authors use the strsplit function in an sapply function, like this:

congress.names <- sapply(congress$name, function(n) 
strsplit(n, "[, ]")[[1]][1])

That gives us this:

 head(congress.names)
SHELBY, RIC HEFLIN, HOW STEVENS, TH MURKOWSKI,  DECONCINI,  MCCAIN, JOH 
"SHELBY"    "HEFLIN"   "STEVENS" "MURKOWSKI" "DECONCINI"    "MCCAIN"

Notice each element in the vector has a "name". Ugly, I think, though it doesn't matter in the context of what they do. But what if you wanted a clean vector of last names without the "names"? You can do this:

congress.names <- sapply(congress$name, function(n) 
strsplit(n, "[, ]")[[1]][1],USE.NAMES = FALSE)
head(congress.names)
[1] "SHELBY"    "HEFLIN"    "STEVENS"   "MURKOWSKI" "DECONCINI" "MCCAIN"

Much better. Notice the USE.NAMES = FALSE setting. It tells sapply not to use the input strings as names for the elements of the result.
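Here's the same idea as a sketch I can run on its own, with made-up input (toy.names and the last.name helper are hypothetical names of mine). It also checks two other ways of getting an unnamed vector, unname and vapply, against the USE.NAMES = FALSE result:

```r
toy.names <- c("SHELBY, RIC", "HEFLIN, HOW", "MCCAIN, JOH")

# Helper: keep only the first piece of the split, i.e. the last name.
last.name <- function(n) strsplit(n, "[, ]")[[1]][1]

named <- sapply(toy.names, last.name)
names(named)   # the original strings are used as names

plain <- sapply(toy.names, last.name, USE.NAMES = FALSE)
names(plain)   # NULL

# Two alternatives that give the same unnamed result
identical(plain, unname(named))
identical(plain, vapply(toy.names, last.name, character(1), USE.NAMES = FALSE))
```

vapply is a bit stricter than sapply because it checks that every result is a single character string, which can be handy when the input is messier than this.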

If you want to see the rest of their code, including how they carry out the MDS analysis and MDS clustering plots, see their GitHub page. For more on how exactly MDS works, see Chapter 14 of A Handbook of Statistical Analyses Using R.
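For a quick feel of what classical MDS does, here's a minimal sketch using base R's cmdscale function on a made-up vote matrix. This is my own toy example, not the authors' analysis: the six "senators", their votes, and the choice of plain Euclidean distance are all assumptions for illustration.

```r
# Hypothetical votes: rows are senators, columns are roll calls (1 = yea, -1 = nay).
votes <- rbind(
  A1 = c( 1,  1, -1, -1,  1),
  A2 = c( 1,  1, -1, -1, -1),
  A3 = c( 1,  1, -1,  1,  1),
  B1 = c(-1, -1,  1,  1, -1),
  B2 = c(-1, -1,  1,  1,  1),
  B3 = c(-1, -1,  1, -1, -1)
)

# Pairwise Euclidean distances between senators (an assumed choice of metric).
d <- dist(votes)

# Classical MDS: find 2-D coordinates whose distances approximate d.
coords <- cmdscale(d, k = 2)
coords

# The first coordinate separates the two voting blocs.
plot(coords, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(coords, labels = rownames(coords))
```

The sign of each axis is arbitrary, so the two groups may land on either side of zero; what matters is that they land on opposite sides.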
