Category Archives: Machine Learning for Hackers

Machine Learning for Hackers, Chapter 12

In the final chapter the authors introduce Support Vector Machines (SVM). What SVM do are generalize linear decision boundaries for classification. Like the other chapters, this one avoids math and simply shows you how to use SVM in R. It then analyzes the spam mail dataset from chapter 3 using SVM, kNN and regularized logistic regression and compares their performance.

The first section of the chapter demonstrates SVM using a toy dataset. The dataset has two predictors (X and Y) and a response (a classification of 0 or 1). The dataset looks like this plotted:


I show this figure because the book is printed in black and white and you can’t tell class 0 from class 1 in the picture on page 276. That’s a problem that plagues this entire chapter. Anyway, what we have here is data that require a non-linear decision boundary. In other words we can’t draw a single straight line through the picture to indicate a boundary between the blue and red dots. This is where SVM comes in handy. Here’s the SVM code they use to analyze this dataset. As you’ll see they play around with the svm function from the e1071 package to show how changing certain parameters affects the quality of predictions. I didn’t get anything new out of this section that I felt compelled to blog about. It’s basically an extended example of how to use the svm function similar to what you might see on a help page. They do make use of the melt function from the reshape package, which is very cool and useful, but I already wrote about that in the Chapter 8 post.

The second section is “Comparing Algorithms”. They resurrect the spam dataset from chapter 3 and analyze it using SVM, kNN and regularized logistic regression. Recall the spam dataset was converted to a term document matrix (TDM) where the emails are the rows and the columns are terms. So email #1 would be row 1. In that row we would see the number of times the words in the column headers appeared. For example. the word “banner” appears 0 times in email #1. Column 1 of the spam dataset indicates whether or not the email is spam. They create testing and training sets and use those to create models and compare the different classification methods. Again, there’s nothing too fancy here where R programming is concerned, or at least nothing I haven’t already covered. They determine logistic regression works best for classifying spam email.

Now here’s something interesting worth noting. Instead of prepping the spam data as they did in Chapter 3, they skip directly to loading it into R as follows:

load('data/dtm.RData')

But how did they save that data as dtm.Rdata? By doing something like this:

save(dtm, file="dtm.RData")

…where dtm was a matrix. I say “matrix” because that’s what dtm.RData is when it gets loaded.

Another snippet worth noting is how they create their training and test sets. First they randomly select numbers as the indices for selecting the training data from the number of rows from the spam matrix. Then they take whatever numbers were not selected and make those the indices for the test data:

training.indices <- sort(sample(1:nrow(dtm), round(0.5 * nrow(dtm))))
test.indices <- which(! 1:nrow(dtm) %in% training.indices)

The dtm dataset has 3249 rows. So the sample function selects (without replacement) from the numbers 1 through 3249. The second argument to the sample function says how many to sample, which in this case is \( 0.5 \times 3249 = 1624.5 \). That gets rounded to 1624 by the round function. So 1624 numbers are sampled without replacement from the numbers 1 through 3249. And then they get sorted in ascending order. Those are the indices for the training data. To get the test data, they use the value matching %in% operator to look for matches, but negate the matches to get all the numbers that were not selected with the sample function.

The code "1:nrow(dtm) %in% training.indices" returns a logical vector with 3249 elements. 1624 are TRUE, the rest are FALSE. The code "! 1:nrow(dtm) %in% training.indices" returns the opposite logical vector. Now 1624 are FALSE and the rest are TRUE. Finally, inserting that strip of code into the which function returns the indices of the vector that are FALSE. Therefore both training.indices and test.indices are vectors of numbers from 1:3249, with no overlap between them. For example:

> training.indices[1:10]
 [1] 1 2 6 9 11 12 15 16 18 19
> test.indices[1:10]
 [1] 3 4 5 7 8 10 13 14 17 24

These vectors of numbers are then used to select the training and test data as follows:

train.x <- dtm[training.indices, 3:ncol(dtm)]
train.y <- dtm[training.indices, 1]
test.x <- dtm[test.indices, 3:ncol(dtm)]
test.y <- dtm[test.indices, 1]

I thought that was good to go through and understand since all the functions in this section require data (or are capable of taking data) in this format. For example:

library('glmnet')
regularized.logit.fit <- glmnet(train.x, train.y, family = c('binomial'))
library('e1071')
linear.svm.fit <- svm(train.x, train.y, kernel = 'linear')
library('class')
knn.fit <- knn(train.x, test.x, train.y, k = 50)

And that does it for me! I'm ready to move on to another book.

Machine Learning for Hackers, Chapter 11

Are you a heavy Twitter user? Would you like to build a custom recommendation system in R that will recommend new Twitter feeds for you to follow? If you answered “yes” to both questions then this is the chapter for you. Here’s the code. Have fun.

Now I don’t use Twitter, so this chapter was not very interesting for me. I should probably use Twitter and get with it and join the 2010’s. But at the moment I don’t, so building a Twitter recommendation system in R doesn’t do me any good. I guess I could have followed along and built the recommendation system in the book for one of the authors, but I just couldn’t motivate. It’s a lot of code that requires new packages. Plus there’s a section that requires using a program outside of R called Gephi. It’s an extremely specific case study. It’s not like some of the other chapters where they introduce a general machine learning method, like kNN or PCA, and show an example of how to use it. There is nothing general about this chapter. At one point they do make use of hierarchical clustering to find similarities between Twitter users. Clustering is a common machine learning tactic. But they only give it a couple of pages. I hate to sound down on the chapter. It’s well done and the authors put a lot of work into it. Personally, though, I just couldn’t get excited about it.

Having said that, I did see some interesting things in the R code that I wanted to file away and remember. The first was creating an empty vector. In a loop they run, they sometimes get a null result and want to store an empty vector for that pass. They do this as follows:

y <- c(integer(0), integer(0))

What caught my eye was the way rbind handles those vectors. It ignores them. I could see that being potentially useful. Here's a toy demonstration:

x <- c(5, 4)
y <- c(integer(0), integer(0))
z <- c(3, 2)
rbind(x,y,z)
   [,1] [,2]
x     5    4
z     3    2

Another snippet I wanted to note was how to identify unique values in a matrix. You use the unique function, which I was familiar with. What I didn't know was that you can do it for a matrix by calling the columns of the matrix, like this:

mat <- matrix(floor(runif(10,2,8)),5,2)
mat
           [,1] [,2]
[1,]        4      3
[2,]        2      7
[3,]        7      2
[4,]        3      7
[5,]        5      7
> unique(c(mat[,1],mat[,2]))
[1] 4 2 7 3 5

This chapter also used a base function that I had never seen before called duplicated. It returns a vector of True/False values that tells you whether or not the value occurs previously in the vector. I was surprised I hadn't heard of it before. Here's a demo of duplicated:

x <- c(1:5,2:6)
x
[1] 1 2 3 4 5 2 3 4 5 6
duplicated(x)
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
!duplicated(x)
[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
x[duplicated(x)]
[1] 2 3 4 5
x[!duplicated(x)]
[1] 1 2 3 4 5 6

Finally they make use of sapply in one of their functions and I wanted to document its use. The suite of apply functions always give me trouble for some reason. I get the gist of what they do, I just have a hard time remember why I would want to use one over the other. Anyway, what they do is use sapply to run a function over each row of a matrix. That seems like a useful thing to remember. First let's create a 100 x 3 data frame of values:

a <- rnorm(100,2,4)
b <- rnorm(100,3,4)
c <- rnorm(100,5,4)
df <- data.frame(a,b,c)

Now, let's say I wanted to create a True/False vector indicating which rows contained a value less than 0. Here's how I can do that using sapply:

sapply(1:nrow(df), function(i) ifelse(any(df[i,]<0),1,0))

The authors did something similar to that and then used the vector of T/F values to subset another vector. Of course if I wanted to subset my example data frame above to show only rows that contain a value less than o, then I could do this:

subset(df, a < 0 | b <0 | c < 0)

So that's the helpful R code I gleaned from Chapter 11. I feel like there should be more. It's a long chapter with a lot of code. But much of it either involves functions from specific packages or uses common R functions I've already documented or know about. Next up is chapter 12, the final chapter.

Machine Learning for Hackers, Chapter 10

k-nearest neighbors (kNN) is the topic of chapter 10. The authors say it’s the most intuitive of the machine learning algorithms covered in the book and I guess I would agree, provided you understand the euclidean distance formula, understand what a distance matrix is, and know a little about how R works.

They provide a toy dataset (named “df”) to demonstrate how kNN works. It looks like this:

         X        Y Label
1 2.373546 5.398106     0
2 3.183643 4.387974     0
3 2.164371 5.341120     0
4 4.595281 3.870637     0
5 3.329508 6.433024     0
6 2.179532 6.980400     0

Think of the rows as points lying in an XY plane. Notice they also have labels. The dataset has 100 rows, with 50 labeled 0 and 50 labeled 1. To run the kNN algorithm, we need to create a distance matrix that stores the distance between all the points. So row 1 of the distance matrix would have the distances of all points from the first point (2.373546, 5.398106). Row 2 would have the distances of all points from (3.183643, 4.387974). And so on. The authors actually write a function to create the distance matrix, which is good for illustrating how the distance matrix is created, but I want to note you can use the built-in “dist” function for this:

distance <- as.matrix(dist(df, diag=TRUE, upper=TRUE))

I set the "diag" and "upper" parameters to TRUE in order to match the matrix they create. The result is this:

 distance[1:3,1:3]
          1        2         3
1 0.0000000 1.294845 0.2167983
2 1.2948454 0.000000 1.3954937
3 0.2167983 1.395494 0.0000000

That's just a portion of the matrix. The full matrix is 100 x 100. The first row (or first column) refers to point 1 (2.373546, 5.398106). distance[1,1] is 0 because that's how far point 1 is from itself. distance[1,2] is 1.294845 because that's how far point 2 (3.183643, 4.387974) is from point 1. The distance between two points A(x_{1},y_{1}) and B(x_{2},y_{2}) is calculated using the Euclidean distance formula:

$$ AB = \sqrt{(x_{1}-x_{2})^{2} + (y_{1}-y_{2})^{2}}$$

$$ 1.294845 = \sqrt{(2.373546-3.183643)^{2} + (5.398106-4.387974)^{2}}$$

Once we have our distance matrix, we can find the k-nearest neighbors for a particular point by sorting the row from smallest to greatest distance. We can use the order function to help us do this. For example, let's find the 5 nearest neighbors for the first point:

indices <- order(distance[1,])[2:6]
indices
[1] 46 3 36 13 29
distance[1,indices]
46        3         36        13        29 
0.1796931 0.2167983 0.2212696 0.2916801 0.3561144

The order function returns the index positions of the nearest neighbors in the distance matrix. The index position can also be thought of as the identifying point number. So the closest neighbor of point 1 is point 46. The second-closest neighbor of point 1 is point 3. The command "distance[1,indices]" shows us the 5 closest points and their distances. What are their labels?

df[indices,'Label']
[1] 0 0 0 0 0

They're all 0's. If we wanted to predict the label of point 1, we could use majority rule. In this case since the 5-nearest neighbors are all labeled "0", we would predict point 1 is also labeled "0". (In truth, it is.) The authors carry this out for the entire dataset of 100 points and correctly predict the label for 93 of them given their 5-nearest neighbors.

The case study in this chapter is recommending R packages. The dataset is a list of 2487 packages, listed 50 times over for 50 programmers, with an indicator of whether or not that package is installed for the programmer. Using kNN they build a list of 25 recommendations for each programmer. Here's the code.

Next up is Chapter 11, Analyzing Social Graphs. It looks like the case study will involve building an engine to recommend people to follow on Twitter. I guess I'm a poopy stick-in-the-mud because this doesn't appeal to me. I'm not really into the Twitter. But I'll keep an open mind, slog through it and learn as much as I can.

Machine Learning for Hackers, Chapter 9

In this chapter we learn a little about Multidimensional Scaling, or MDS for short. Like PCA in Chapter 8, MDS is a topic usually taught in multivariate statistics classes. I guess it’s classified as machine learning because of the computing required to do it, though the same can be said of almost any statistical method. The MDS that is presented is the classical type (as opposed to the non-metric type). The case study is exploring similarity in the US Senate based on votes. As it turns out, a proper MDS analysis shows Democrats and Republicans indeed vote differently and are quite polarized. Not surprising but better than anecdotal evidence from some political pundit.

Now if you’ve been following my series of posts on this book (and I’m pretty sure you haven’t) then you know I don’t summarize the chapter but rather highlight the R code that I found interesting and new. But for the first time I really don’t have much to highlight. It’s a short chapter and the data for the case study required very little preparation. I was actually able to follow all the R code without running to the help pages and Google. Still, though, I feel like I should document something from this chapter.

How about reading in a bunch of files. I feel like this something I will forget how to do one day, so I should probably write about it to help form some memory. First they define a variable that stores a path and then they call the list.files function to create a vector of file names:

data.dir <- "data/roll_call/"
data.files <- list.files(data.dir)

Next they use the lapply function to create one big list object that contains the contents of all those files, which happen to be Stata data sets:

rollcall.data <- lapply(data.files, function(f) 
read.dta(paste(data.dir, f, sep=""), convert.factors=FALSE))

So the lapply function applies the read.dta function to each element in the data.files vector, which is file name. But in the read.dta function you'll notice the paste function is called to create a full path with the file name. Just dandy. That never fails to impress.

If you want to inspect some of the data, you can't just do head(rollcall.data). Since it's a list object you have to do something like this:

head(rollcall.data[[1]])

That shows the first 6 rows of the data stored in the first list element.

Later on they use the strsplit function to split first and last names of senators by comma. I'm pretty sure I've posted about this before, but it doesn't hurt to mention it again. Running a vector through strplit returns a list. For example:

 strsplit(congress$name[1], "[, ]")
[[1]]
[1] "SHELBY" ""       "RIC"

If I just want the last name, I can do this:

 strsplit(congress$name[1], "[, ]")[[1]][1]
[1] "SHELBY"

Now the authors use the strsplit function in an sapply function, like this:

congress.names <- sapply(congress$name, function(n) 
strsplit(n, "[, ]")[[1]][1])

That gives us this:

 head(congress.names)
SHELBY, RIC HEFLIN, HOW STEVENS, TH MURKOWSKI,  DECONCINI,  MCCAIN, JOH 
"SHELBY"    "HEFLIN"   "STEVENS" "MURKOWSKI" "DECONCINI"    "MCCAIN"

Notice each element in the vector has a "name". Ugly, I think, though it doesn't matter in the context of what they do. But what if you wanted a clean vector of last names without the "names"? You can do this:

congress.names <- sapply(congress$name, function(n) 
strsplit(n, "[, ]")[[1]][1],USE.NAMES = FALSE)
head(congress.names)
[1] "SHELBY"    "HEFLIN"    "STEVENS"   "MURKOWSKI" "DECONCINI" "MCCAIN"

Much better. Notice the USE.NAMES = FALSE setting. Pretty self-explanatory what it does.

If you want to see the rest of their code including how they carry out the MDS analysis and MDS clustering plots, see their git hub page. For more on how exactly MDS works, see Chapter 14 of A Handbook of Statistical Analyses Using R.

Machine Learning for Hackers, Chapter 8

This is a tidy chapter on principle component analysis (PCA). There are many fine tutorials on PCA to be found on the internet. This one is special because it doesn’t start with a nice data set. They have to first beat it into shape. They’re goal is to develop a market index based on 25 stocks. They will do PCA and use the first component to develop an index.

After they read in a CSV file with the date, they have to translate a column of dates into a date object. They use a package called “lubridate” for this, but it seems the as.Date function works just as well. This is what I did:

prices <- read.csv('data/stock_prices.csv')
prices <- transform(prices, Date = as.Date(Date))

Next they have to reshape the data. The prices dataset looks like this:

 head(prices)
  Date     Stock Close
1 2011-05-25 DTE 51.12
2 2011-05-24 DTE 51.51
3 2011-05-23 DTE 51.47
4 2011-05-20 DTE 51.90
5 2011-05-19 DTE 51.91
6 2011-05-18 DTE 51.68

They want to transform this so the column headers are the Date and all the Stocks. They use the cast function from the reshape package.

library(reshape)
date.stock.matrix <- cast(prices, Date ~ Stock, value = 'Close')

The result:

        Date   ADC   AFL ARKR AZPN  CLFD   DDR   DTE  ENDP FLWS     FR GMXR
1 2002-01-02 17.70 23.78 8.15 17.10 3.19 18.80 42.37 11.54 15.77 31.16 4.50
2 2002-01-03 16.14 23.52 8.15 17.41 3.27 18.55 42.14 11.48 17.40 31.45 4.37
3 2002-01-04 15.45 23.92 7.79 17.90 3.28 18.46 41.79 11.60 17.11 31.46 4.45
4 2002-01-07 16.59 23.12 7.79 17.49 3.50 18.30 41.48 11.90 17.38 31.10 4.38
5 2002-01-08 16.76 25.54 7.35 17.89 4.24 18.57 40.69 12.41 14.62 31.40 4.30
6 2002-01-09 16.78 26.30 7.40 18.25 4.25 18.41 40.81 12.27 14.27 31.05 4.25

Now after they do this they mention the new dataset has missing entries (NAs) that need to be removed. But they don't tell you how they found them. Furthermore they proceed to not just remove a row of NAs, but an entire column of a stock because it had a value on a day when all other stocks were NA! Here's what they do:

# remove a row and a column
prices <- subset(prices, Date != "2002-02-01")
prices <- subset(prices, Stock != "DDR")
date.stock.matrix <- cast(prices, Date ~ Stock, value = 'Close')

That reduces the data set to 2366 rows and 25 columns. Instead of 25 stocks, we now have 24. (Remember, the first column is Date.) What confused me is why they didn't do something like this:

date.stock.matrix <- na.omit(date.stock.matrix)

That reduces the data set by two rows to 2365 rows but keeps all the columns. That means you have 25 stocks to develop your index instead of 24. But let's back up for a moment and think about how they found those missing values in the first place. They don't tell you in the book. They just say that they notice they have missing entries and need to remove them. Here's one way you could find out if you didn't know whether or not there were missing values. First get the dimensions of your data frame:

dim(date.stock.matrix)
[1] 2367 26

Now run the following to see if the dimensions change:

dim(date.stock.matrix[complete.cases(date.stock.matrix),])
[1] 2365 26

They do! Two rows have missing values. But which two rows? Try this

row.has.na <- apply(date.stock.matrix, 1, function(x){any(is.na(x))})
date.stock.matrix[row.has.na,]

That returns the two rows in the data frame with missing values, in this case rows 22 and 875. Again, we can use the na.omit code above to clear out those rows. I'm not sure why the authors didn't do this. When they run their PCA on their final data set with 24 stocks, the first component accounts for 29% of the variance. But when I run it on the data set with all 25 stocks and only two rows removed, the first component accounts for almost 33% of the variance. Not a huge improvement, but still improvement.

A couple of other "that was cool" nuggets I took from this chapter were the way they looked at the correlation matrix (to get a feel if PCA was warranted) and their use of the melt function from the reshape package. First the correlation trick:

cor.matrix <- cor(date.stock.matrix[,2:ncol(date.stock.matrix)])
corrs <- as.numeric(cor.matrix)
library(ggplot2)
ggplot(data.frame(Correlation = corrs), aes(x = Correlation, fill = 1))
 + geom_density() + opts(legend.position = 'none')

They calculate the correlation matrix of the closing values, convert the correlation matrix to a vector, and then make a density plot. This visually shows you that most of the correlations are positive and that PCA doesn't seem out of the question. I thought that was a neat trick.

Now the melt function. They use this function to convert a data set with both their market index data and the DJI such that both can be plotted over time. Let's look at before and after and see what it does.

Before melt:

 head(comparison)
       Dates MarketIndex       DJI
1 2002-01-02 -1.0346886 -0.3251888
2 2002-01-03 -1.0265119 -0.2602872
3 2002-01-04 -1.0199864 -0.2027079
4 2002-01-07 -1.0230787 -0.2439139
5 2002-01-08 -0.9953525 -0.2744783
6 2002-01-09 -0.9873846 -0.3115893

After melt:

alt.comparison <- melt(comparison, id.vars='Dates')
names(alt.comparison) <- c('Dates', 'Index', 'Price')
 head(alt.comparison)
       Dates       Index      Price
1 2002-01-02 MarketIndex -1.0346886
2 2002-01-03 MarketIndex -1.0265119
3 2002-01-04 MarketIndex -1.0199864
4 2002-01-07 MarketIndex -1.0230787
5 2002-01-08 MarketIndex -0.9953525
6 2002-01-09 MarketIndex -0.9873846

So what they want to do is plot the indices over time and put them both on the same graph. In the first data set, each row is a date with both index values on that date. The melt function says to identify the other two columns by date. Each column name in the first data set becomes a grouping factor in the second data set. The values that are in columns 2 and 3 of the first data set are now all in column three in the second data set. It's hard to explain in words but makes sense if you stare at the before-and-after long enough.

Machine Learning for Hackers, Chapter 7

The title of this chapter, “Optimization: Breaking Codes”, sounds more exciting than it really is. The code breaking is attempted via an optimization algorithm called the Metropolis method. The authors show you how to encrypt text using a simple Caesar cipher and then how to implement the Metropolis method to attempt to decipher the message. Their test message is “here is some sample text”. They encrypt it using the Caesar cipher, so it looks like “ifsf jt tpnf tbnqmf ufyu “. The encryption rule is simply replacing every letter with the following letter in the alphabet. The letter a is replaced with b, b with c, and so on until you replace z with a. They manage to decipher their test text through the course of 50,000 Metropolis steps (on the 45,609th attempt). I suppose this is a fun demonstration of optimization. It just seemed a little outdated. I guess I was expecting to see them crack an actual message from World War II or something.

They start the chapter by demonstrating how to use the R optim() function. For some reason this function has always confused me. Probably because I don’t use it that often. But I have to say I feel much better about it now. They give clear demonstrations of its use and visually show you how it works. Their example involves the height and weight dataset from chapter 2. First we do linear regression and find coefficients for the slope and intercept. Then we find the same coefficients using optim(). The function minimized in the call to optim() is a hand-built sum of squared errors function. I thought this was a great way to demonstrate how optim() works. But after this, they do something puzzling: they continue to demonstrate how to use optim() with Ridge Regression. Now Ridge Regression is a type of regression that helps remedy multicollinearity. If you have predictor variables that are highly correlated, you may get some weird coefficients in your linear modelling results. I’m talking about getting a negative value where you would expect a positive value, that sort of thing. Ridge regression helps address this by “modifying the method of least squares to allow biased estimators of the regression coefficients” (Kutner, et al., p. 432). Knowing that, it doesn’t make much sense to demonstrate Ridge regression on height and weight data. There’s only one predictor. I thought maybe they were going to do more with Ridge regression in this chapter, but they don’t. They move on to code breaking. It just seemed out of place to bring up Ridge regression in the context of single predictor regression.

I won’t go into the code breaking program. The code is available on the book’s git hub page. What I wanted to review for my own benefit is their function for encrypting text using the Caesar cipher. I thought it was a good example of using R’s indexing capabilities with list objects. First we need a vector with the English alphabet and two empty list objects for the ciphers: one for encrypting text and another for deciphering text:

english.letters <- letters
caesar.cipher <- list()
inverse.caesar.cipher <- list()

Now we create the ciphers:

# need to use index %% 26 + 1 to handle letter "z" at end
# 26 %% 26 + 1 = 0 + 1 = 1
for (index in 1:length(english.letters))
{
 caesar.cipher[[english.letters[index]]] <- english.letters[index %% 26 + 1]
 inverse.caesar.cipher[[english.letters[index %% 26 + 1]]] <- english.letters[index]
}

The "index %% 26 + 1" confused me for a couple of reasons. First off, the %% operator means modulus and I'm used to see something like 12 %% 5, where the first number is bigger than the second. The answer to that is of course 2. If you divide 12 by 5 you get remainder 2, and that's what the %% operator returns. But in the code above, you'll notice the "index" value is less than 26 on every iteration except the last. How do you calculate something like 2 %% 26? I wasn't sure so I had to google around and figure it out. It turns out the modulus operation a %% n is defined as follows:

$$ a \, mod \, n = a - \lfloor \frac{a}{n} \rfloor \times n$$

So, 2 %% 26 is calculated like so:

$$2 \, mod \, 26 = 2 - \lfloor \frac{2}{26} \rfloor \times 26 = 2 - \lfloor 0.0769 \rfloor \times 26 = 2 - (0 \times 26) = 2$$

Once I was clear on that, I was confused why we needed to use the %% operator at all. Again if you look at the code above, you'll see "index %% 26 + 1". When index = 2, "index %% 26" gives 2. When index = 3, "index %% 26" gives 3. And so on. Why not just use "index + 1"? Well, when we get to the letter "z", we need to substitute "a". The index number for "a" is 1. But if we do "index + 1", then we would get 27 at the letter "z" since its index number is 26. That would not index anything in the 26 element "english.letters" vector and produce an error. Therefore we need to use the %% operator to allows us to recycle through the numbers 1 - 26. When index = 26 (i.e., when we get to the letter "z"), we do 26 %% 26 + 1, which equals 1 since 26/26 has remainder equal 0.

Now that I understand how the ciphers are generated, I can print them and have a look if I like:

print(caesar.cipher) # encrypt text
print(inverse.caesar.cipher) # decipher text

Since these are lists, we investigate with indices using double brackets. For example, to see what we substitute for letter "h", we do the following:

caesar.cipher[["h"]]

That returns "i". It seems like a fun extension to this would be to create a general function that allowed you to specify the substitution index. In the Caesar cipher, this is 1. It would be neat to specify, say, 13 so "a" becomes "n", "b" becomes "o", and so on. Anyway, moving on...

Now we use the ciphers we just made to encrypt text. To do this the authors build two functions. The first encrypts text letter-by-letter using the cipher you give it. The second function encrypts words using the first function. Let's have a look at these:

apply.cipher.to.string <- function(string, cipher)
{
 output <- ''
 for (i in 1:nchar(string))
 {
 output <- paste(output, cipher[[substr(string, i, i)]], sep='')
 }
 return(output)
}
apply.cipher.to.text <- function(text, cipher)
{
 output <- c()
 for (string in text)
 {
 output <- c(output, apply.cipher.to.string(string, cipher))
 }
 return(output)
}

The first function loops through each letter of a word (or "string"). Using the substr function, we extract each letter, use that letter to look up the substitution in the cipher, and "paste" the substituted letter to the previously substituted letter. At the conclusion of the loop we have encrypted the word using our cipher. The second function uses this first function to encrypt whole phrases. Here is how the authors encrypt the message "here is some sample text" using the Caesar cipher:

apply.cipher.to.text(c('here', 'is', 'some', 'sample','text'), caesar.cipher)

That gives us "ifsf" "jt" "tpnf" "tbnqmf" "ufyu" . If we wanted to decipher the text, we would use the inverse cipher as follows:

apply.cipher.to.text(c('ifsf', 'jt', 'tpnf', 'tbnqmf', 'ufyu'), inverse.caesar.cipher)

Again it would be a good exercise to generalize these functions to accept a whole text file and return an encrypted text file. Not that I'm going to do that, but it sounds like it would make a good project assignment in some programming class.

Machine Learning for Hackers, Chapter 6

Chapter 6 covers orthogonal polynomial regression, cross-validation, regularization and a touch of logistic regression. The case study is predicting popularity of O’Reilly’s top 100 books based on their back cover descriptions. As usual I picked up some useful tricks from this chapter. Let’s get into it.

If you have a data frame called “df” with a variable \(x\) and you want to add, say, \(x^{2}\) to it (and name it “X2”), you can do the following:

df <- transform(df, X2 = X ^ 2)

Now it turns out R has a function called poly that does this for you:

poly(X , degree=2, raw=TRUE)

So you could leave your data frame as is (say it has two columns, Y and X) and do polynomial regression like so:

lm(Y ~ poly(X, degree=2, raw=TRUE), data = df)

That's straight-up raw polynomial regression. If you leave out the "raw = TRUE" parameter (or set it to FALSE, its default value), the poly function generates orthogonal polynomials. As the book states: "instead of naively adding simple powers of x, we add more complicated variants of x that work like successive powers of x, but aren't correlated with each other in the way that \(x\) and \(x^{2}\) are." Two vectors are orthogonal when their dot product equals 0. You can verify the poly produces orthogonal vectors. Here's a simple example:

x <- 1:10
p <- poly(x, 2)
p[,1] %*% p[,2]

That returns 0. I don't understand how R creates these orthogonal vectors. It returns a couple of items called "alpha" and "norm2" which I'm pretty sure have something to do with it. But the help pages and Google searches yielded nothing. (Though I will say this guy's blog post was useful.) Anyway, they illustrate polynomial regression using some toy data:

set.seed(1)
x <- seq(0, 1, by = 0.01)
y <- sin(2 * pi * x) + rnorm(length(x), 0, 0.1)
df <- data.frame(X = x, Y = y)
# raw polynomial regression
m1 <- lm(Y ~ poly(X,degree=5,raw=TRUE), data=df)
# orthogonal polynomial regression
m2 <- lm(Y ~ poly(X,degree=5), data=df)

If you want to make predictions using your models, do something like this:

d <- seq(0,1,by = 0.05)
p1 <- predict(m1, data.frame(X=d))
p2 <- predict(m2, data.frame(X=d))

If you want to see how your model visually performs, do something like this:

plot(x, y, xlab = "X", ylab = "Y", las = 1, xlim = c(0, 1))
lines(d, p1, col = "red")
lines(d, p2, col = "blue")

Finally if you want to update your orthogonal model, you can do something like this:

m1.2 <- lm(Y ~ poly(X,degree=5,raw=TRUE)[,-4], data=df)
m1.3 <- lm(Y ~ poly(X,degree=5,raw=TRUE)[,c(-2,-4)], data=df)

Now they only play around with orthogonal regression for the most part and they use cross validation. They present a general way to randomly split data into a testing and training sets and then loop through polynomial degrees evaluating RMSE at each step for both the training and testing sets. See their 14th and 15th code snippets for this code. I think it's quite useful.

After this they go into Regularization, which is some heavy-duty black box stuff. I guess they do a good job of showing you how to do it using the "glmnet" package. Still, though, I have no idea what's going on under the hood. I would feel somewhat disingenuous using this for real. I have no doubt it works. I'm not questioning their application or the glmnet package. I just feel uneasy using methods that I don't understand. I'm always afraid someone will ask me to explain how it works.

Finally they use Regularization and logistic regression to build a model for predicting whether or not a book will be in the top 50 (of the 100 best selling books) given its back-cover description. Instead of summarizing that analysis I wanted to write a little about how they hammer the data into shape to analyze it. First they read in the data, then they load the "tm" package and start calling functions:

ranks <- read.csv('data/oreilly.csv', stringsAsFactors = FALSE)
library('tm')
documents <- data.frame(Text = ranks$Long.Desc.)
row.names(documents) <- 1:nrow(documents) # not sure why this is necessary
corpus <- Corpus(DataframeSource(documents))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords('English'))

The last four lines is where the magic happens. First they make a "corpus". This is sort of like a list where the elements are text documents. The next line makes everything lower case. The next collapses multiple white space characters to a single blank. The last removes the most common words in English, like "the" and "an". When you're done, you can't view the corpus like you would a data frame. You can't just type the name at the command line and hit enter. You have to use the inspect() function, like this:

inspect(corpus) # all documents
corpus[[2]] # second document

Pretty cool. But it's still not ready for regression analysis. You have to produce a Document Term Matrix (DTM) from your corpus. This is done like so:

dtm <- DocumentTermMatrix(corpus)

Easy enough. But, again, how to view it? Use the inspect() function, like this:

# first five documents and first five terms
inspect(dtm[1:3,1:5])
# all documents and first five terms
inspect(dtm[,1:5])

The rows indicate the document (in this case numbered 1 - 100) and the column headers are the words (or terms). The values in the matrix tell you how many times the term appeared in a document. Here's a quick way to do some analysis on those terms:

# find those terms that occur at least five times
findFreqTerms(dtm, 5)

The last step to get everything ready for analysis is to convert the DTM into a matrix:

x <- as.matrix(dtm)

We do this because the glmnet function requires the predictors to be in a matrix, and the DTM represents the predictors. (The response is an indicator: top-50 book or not.) Now the dimension of the dtm matrix is 100 x 5605! That's a lot of columns! The authors have this to say: "Because our data set has more columns than rows, unregularized regression will always produce an overfit model. For that reason, we have to use some form of regularization to get any meaningful results." They probably could have eliminated some of the columns. When I inspected the DTM I saw terms such as "!)</li", "#</--", "#</,", "#</em", "#</em," and "&amp;". These are clearly just left over HTML codes. I can't see how they contribute to this analysis. But even if they were removed the matrix would still have many more columns than rows.

Machine Learning for Hackers, Chapter 5

This chapter covers linear regression. The case study is predicting page views for web sites given unique visitors, whether the site advertises, and whether the site is in English. The funny thing is that while they spend a lot of time explaining how to do linear regression in R, they never get around to explaining how to make a prediction using your regression model. (The name of the chapter is actually called “Regression: Predicting Page Views”.) I’ll explain how to do prediction in this post. But first I want to say the authors do a fine job of explaining the “big idea” of linear regression. It’s not an easy topic to cover in under 30 pages. If you’re new to the subject this is a great introduction. I highly recommend it. (And we all know my recommendation means a great deal.)

So let’s say you follow along with the book and a create a linear model to predict page views given unique visitors:

lm.fit <- lm(log(PageViews) ~ log(UniqueVisitors), data = top.1000.sites)

The results of the regression are as follows:

Coefficients:
                   Estimate Std. Error t value   Pr(>|t|) 
(Intercept)        -2.83441    0.75201  -3.769   0.000173 ***
log(UniqueVisitors) 1.33628    0.04568  29.251    < 2e-16 ***
---
Residual standard error: 1.084 on 998 degrees of freedom
Multiple R-squared: 0.4616, Adjusted R-squared: 0.4611 
F-statistic: 855.6 on 1 and 998 DF, p-value: < 2.2e-16

If we want to predict Page Views for a site with, say, 500,000,000 Unique Visitors, we use the coefficient estimates to calculate the following:

$$-2.83441 + 1.33628*\text{log}(500,000,000)$$

To easily do this in R we can submit code like this:

new <- data.frame(UniqueVisitors = 500000000)
exp(predict(lm.fit, new, interval = "prediction"))

This gives us not only a prediction (fit) but a prediction interval as well (lwr and upr):

fit           lwr        upr
1 24732846396 2876053856 2.12692e+11

Also notice we had to wrap the predict function in the exp function. That's because we ran the regression with the data transformed in the log scale. Once we make our prediction we need to transform it back to the original scale. I'm surprised the authors don't cover this. They do mention the predict function as a way to see what the regression model predicts for the original data values. But they don't talk about how to use your regression results to make a prediction for a new value that wasn't in your data. Perhaps that will be covered in future chapters.

Another minor complaint I have is a statement they make on page 150 in the 2nd paragraph: "...the Residual standard error is simply the RMSE of the model that we could compute using sqrt(mean(residuals(lm.git) ^ 2))" where lm.fit is the model object created when regressing Page Views on Unique Visitors. That's incorrect. The residual standard error is computed for this model with sqrt((sum(residuals(lm.fit)^2))/998), where 998 = 1000 - 2. In other words, instead of dividing the sum of the squared residuals by 1000 (i.e., taking the mean), we divide by the number of observations minus the number of coefficients in our model. In this case we have 1000 observations and two coefficients in our model (the intercept and slope). So we divide by 998.

While I didn't learn anything new about linear regression in this chapter, I did learn something about plotting lm model objects in R. I've known for a while you can easily create diagnostic plots by submitting the plot function with a model object as the argument. This creates 4 plots in a 2 x 2 grid. What I learned in this chapter is that you can specify which plot is produced by issuing an additional argument "which = ". For example, to just see a plot of residuals versus fitted values, you submit

plot(lm.fit, which = 1)

The book says issuing this statement will "plot the residuals against the truth." I think "truth" should say "fitted values". I'm not sure I've ever heard "fitted values" referred to as the "truth".

Anyway, as I said, I think this chapter is a solid introduction to linear regression. That's a good thing because it appears the next chapter steps it up a notch.

Machine Learning for Hackers, Chapter 4

This chapter explains how to create a ranking algorithm to prioritize email messages in your inbox. The authors mention Google popularizing the idea of creating a “priority inbox” and point to a paper that describes their strategies. They then proceed to use some of these strategies to create their own ranking program using emails from the SpamAssassin public corpus.

This was a long and unpleasant chapter for me to slog through. For one I don’t think I will ever be asked to design a ranking algorithm for emails. I suppose this concept could be extended to other applications but the chapter doesn’t go there. Another sticking point was that the algorithm the authors build is necessarily incomplete because we’re only dealing with emails received (instead of both emails sent and received). The authors acknowledge this up front and say “the methods and algorithms used in this chapter should be considered as exercises only and not examples of how enterprise priority inbox systems should be implemented.” So I was learning how to implement something sort of like what Google does (but not really) that isn’t actually a traditional machine learning method. Personally that wasn’t huge motivation to pick through the authors’ butt-load of R code and figure out what they were doing. I don’t mean to be critical of the authors. I’m impressed with what they did. I mean, think about it: they wrote a dense 33-page chapter on how to build a ranking program in R as an exercise. That’s dedication. I guess I just couldn’t get rid of that nagging feeling that I would never make use of the material in this chapter in my professional life. So my goal became to study their R code and focus on becoming a better programmer.

When all was said and done, their R code wasn’t too hard to follow. I thought they did a great job of breaking everything down into manageable functions. In fact I would say what they do in this chapter is provide a good model for building a long, complicated R program. Having said that, the portions that jumped out at me as being both unfamiliar and potentially useful were the sections that used regular expressions.

I took a 4-week Coursera class that touched briefly on regular expressions. I also found this tutorial that I thought was useful. Other than that I really don’t know much about them. My impression of regular expressions is that unless you use them on a regular basis you’re not going to master them. So my strategy is to just know what they’re used for and remember that. When the time comes, I’ll have to dig in and figure out how to write them, because there is no way I’m going to teach myself how to use them and have total recall several months (or years) from now.

The first instance they use regular expressions is to go through an email message and pull out the lines that contain “From:” in order to eventually strip out an email address:

msg.vec[grepl("From: ", msg.vec)]

The grepl function takes a regular expression as its first argument and returns a logical vector of TRUE/FALSE the same length as its second argument, the vector it combs looking for matches. So here msg.vec is the email message. grepl goes through each line and looks for the text “From: “. If it finds it, it returns TRUE, otherwise FALSE. This is no doubt the easiest example of a regular expression because it’s a literal string. But that first argument could be a really sophisticated regular expression. Finally, grepl is used to index msg.vec such that only lines with “From: ” are returned.

Now the lines with “From: ” contain extraneous characters such as angle brackets and of course colons. To address this they use the strsplit function, which splits a character element into a list by a given regular expression. Here’s how they use it:

strsplit(from, '[":<> ]')

where “from” is the character element. Here’s an example of how it works:

test <- "From: Test Guy "
strsplit(test, '[":<> ]')
 [[1]]
 [1] "From" "" "Test" "Guy" ""
 [6] "test@guy.com"

You can see it splits the character element by the characters in the regular expression. (The square brackets mean match anything inside the square brackets for one character position once and only once.) If I wanted to extract the email address, I could do the following:

grepl("@", test2[[1]])
 [1] FALSE FALSE FALSE FALSE FALSE TRUE

That shows me which element contains the email address. So let’s save our strsplit result and use grepl to pull out the email address:

test2 <- strsplit(test, '[":<> ]')
test2[[1]][grepl("@", test2[[1]])]
 [1] "test@guy.com"

Another function they use is the gsub function which basically does a search and replace based on a regular expression. Here’s an example:

test <- "   20 Dec 2011      "
test
 [1] "   20 Dec 2011     "
gsub("^\\s+|\\s+$", "", test)
 [1] "20 Dec 2011"

Notice the spaces before and after the date. The gsub function searches for leading or trailing spaces and replaces them with nothing. And behold that regular expression. Pretty, isn't it? The caret ^ means search the beginning, the "\s" means match any whitespace characters. The extra backslash "\" in front of it means escape the backslash in front of the "s"! (This is unique to R, I think.) The + sign means match the previous character 1 or more times. The pipe "|" means "or". So search for leading spaces OR trailing spaces.  After the pipe is the search string for trailing spaces. It's just like the search string for leading spaces, except instead of a caret at the front, it has a dollar sign at the end, which means look only at the end of the target string.

Again, there's a massive amount of R programming going on in this chapter and I barely even began to scratch the outside of the wrapping around the surface. Even though I question how practical the actual program is, I do think it's an excellent example of how to organize and tackle a complicated and long project in R.

Machine Learning for Hackers, Chapter 3

In this chapter you build a naïve Bayes classifier to classify an email as spam or not spam (or “ham” as they call it) using only the body of the message. The idea is to take a bunch of spam messages and build a Term Document Matrix (TDM). A TDM is a big grid where the row labels are the terms in the spam emails and the column headers are the spam emails. If row 21 is “viagra” and column 20 is the 20th spam email, their cell [21,20] displays the number of times the word “viagra” appeared in that email. Using the TDM they create a data frame which contains the terms and the percentage of documents in which a given term occurs. For example, of the 500 spam emails, the term “html” appears in 33% of the messages. That gets treated as a conditional probability. The probability of seeing the term “html” in an email given the email is spam is 0.33. The same process is carried out for emails that are not spam. So we have two data frames of terms (one for spam and one for not spam) and their estimated probability of occurrence.

To classify a new email as either spam or not spam we compare it to both data frames and see which terms in the email are in the respective data frame. If a term is in the date frame, we pull out its probability. If not, we assign it a low probability of 0.000001. We multiply the probabilities together along with the prior probability of the message being spam, which they set as 0.5 for starters.

For example, say we have a message with 5 words in the body. None of the words in the email appear in the spam data frame, so we calculate \( 0.5*0.000001^{5}\). However say two of the words appear in the good email data frame, each with probability of 0.05 of occurring. So we calculate \( 0.5*0.05^{2}*0.000001^{3}\). Since the latter calculation is greater than the former, we classify the email as not spam. That’s how you do naïve Bayes classification.

Now if you’re like me you’re probably thinking, where’s the denominator? I mean, I’m no Bayesian expert, but I do know Bayes’s theorem has a denominator. Well it turns out the denominator is the same in both calculations so they drop it. (I have to give a shout out to this guy’s explanation. That’s where I went to figure it out.) You’ll also notice we’re multiplying probabilities as if they’re independent. That’s the naïve assumption here.

Using the language of the probability, here’s Bayes’s Theorem for two terms, where A = message is spam, B = some word, and C = some other word

$$ P(A|B \cap C ) = \frac{ P(B|A \cap C ) P(C|A) P(A) }{ P(B|C) P(C) }$$

For not spam this is…

$$ P(A^{\prime} | B \cap C) = \frac{P(B|A^{\prime} \cap C)P(C|A^{\prime})P(A^{\prime})}{P(B|C)P(C)}$$

Notice the denominators are the same so we can drop them when comparing the two. Since we’re also assuming independence, the numerator changes to \( P(B|A)P(C|A)P(A)\) and \(P(B|A^{\prime})P(C|A^{\prime})P(A^{\prime})\), respectively.

The book does not explain the classification algorithm in the detail I just did. I guess since the book is “for hackers” they didn’t want to get bogged down in math formulas and explanations of Bayes’s Theorem. I get that. But still I wanted to understand why it worked and not just how to do it R.

The R code itself for this chapter is not too hard to understand. A few nuggets I got out of it are…

The readLines() function will return each line of text as a separate element of a character vector.
The dir() function will return a listing of all file names in a directory.
The intersect() function will return the values in common between two vectors.