Chapter 6 covers orthogonal polynomial regression, cross-validation, regularization and a touch of logistic regression. The case study is predicting popularity of O’Reilly’s top 100 books based on their back cover descriptions. As usual I picked up some useful tricks from this chapter. Let’s get into it.
If you have a data frame called "df" with a variable X and you want to add, say, \(X^{2}\) to it (and name it "X2"), you can do the following:
df <- transform(df, X2 = X ^ 2)
Now it turns out R has a function called poly that does this for you:
poly(X, degree=2, raw=TRUE)
So you could leave your data frame as is (say it has two columns, Y and X) and do polynomial regression like so:
lm(Y ~ poly(X, degree=2, raw=TRUE), data = df)
That's straight-up raw polynomial regression. If you leave out the "raw = TRUE" parameter (or set it to FALSE, its default value), the poly function generates orthogonal polynomials. As the book states: "instead of naively adding simple powers of x, we add more complicated variants of x that work like successive powers of x, but aren't correlated with each other in the way that \(x\) and \(x^{2}\) are." Two vectors are orthogonal when their dot product equals 0. You can verify that poly produces orthogonal vectors. Here's a simple example:
x <- 1:10
p <- poly(x, 2)
p[,1] %*% p[,2]
That returns 0 (well, a number within floating-point fuzz of 0). I don't understand how R creates these orthogonal vectors. poly also returns a couple of items called "alpha" and "norm2" which I'm pretty sure have something to do with it, but the help pages and Google searches yielded nothing. (Though I will say this guy's blog post was useful.)
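For what it's worth, alpha and norm2 are tucked into the attributes of the object poly returns, so you can at least look at them. Here's a quick aside of my own (not from the book) that also shows why the orthogonal columns are nicer to work with than raw powers:

x <- 1:10
cor(x, x^2)        # the raw powers are strongly correlated
p <- poly(x, 2)
cor(p[,1], p[,2])  # the orthogonal columns are not (essentially 0)

# alpha and norm2 live in the "coefs" attribute; predict() uses them to
# rebuild the same basis when the model is given new data
attributes(p)$coefs

Anyway, they illustrate polynomial regression using some toy data: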
set.seed(1)
x <- seq(0, 1, by = 0.01)
y <- sin(2 * pi * x) + rnorm(length(x), 0, 0.1)
df <- data.frame(X = x, Y = y)
# raw polynomial regression
m1 <- lm(Y ~ poly(X, degree=5, raw=TRUE), data=df)

# orthogonal polynomial regression
m2 <- lm(Y ~ poly(X, degree=5), data=df)
If you want to make predictions using your models, do something like this:
d <- seq(0, 1, by = 0.05)
p1 <- predict(m1, data.frame(X=d))
p2 <- predict(m2, data.frame(X=d))
If you want to see how your model visually performs, do something like this:
plot(x, y, xlab = "X", ylab = "Y", las = 1, xlim = c(0, 1))
lines(d, p1, col = "red")
lines(d, p2, col = "blue")
Finally, if you want to update your raw polynomial model by dropping some of the terms, you can do something like this:
# drop the 4th-degree term
m1.2 <- lm(Y ~ poly(X, degree=5, raw=TRUE)[,-4], data=df)

# drop the 2nd- and 4th-degree terms
m1.3 <- lm(Y ~ poly(X, degree=5, raw=TRUE)[,c(-2,-4)], data=df)
From this point on they mostly stick with orthogonal polynomials, and they bring in cross-validation. They present a general way to randomly split the data into training and testing sets and then loop through polynomial degrees, evaluating RMSE at each step for both the training and testing sets. See their 14th and 15th code snippets for that code; I think it's quite useful.
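I won't copy their snippets here, but the general shape of the procedure is something like the following sketch of my own (the 50/50 split and the range of degrees are my choices, not necessarily theirs):

rmse <- function(y, y.hat) sqrt(mean((y - y.hat)^2))

# randomly split the rows of df into training and testing halves
set.seed(1)
indices <- sample(1:nrow(df), round(0.5 * nrow(df)))
training.df <- df[indices, ]
test.df <- df[-indices, ]

# fit orthogonal polynomials of increasing degree and record RMSE
# on both the training and testing sets
performance <- data.frame()
for (d in 1:10) {
  fit <- lm(Y ~ poly(X, degree = d), data = training.df)
  performance <- rbind(performance,
                       data.frame(Degree = d,
                                  TrainRMSE = rmse(training.df$Y, predict(fit)),
                                  TestRMSE  = rmse(test.df$Y, predict(fit, newdata = test.df))))
}
performance

Typically the training RMSE keeps dropping as the degree goes up while the testing RMSE bottoms out and then climbs back up, which is the overfitting picture they're after.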
After this they go into Regularization, which is some heavy-duty black box stuff. I guess they do a good job of showing you how to do it using the "glmnet" package. Still, though, I have no idea what's going on under the hood. I would feel somewhat disingenuous using this for real. I have no doubt it works. I'm not questioning their application or the glmnet package. I just feel uneasy using methods that I don't understand. I'm always afraid someone will ask me to explain how it works.
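That said, the mechanics are not much code. Here's a sketch of the sort of call involved, run against the same toy data as above (the degree-10 basis and the use of cv.glmnet to pick a lambda are my choices, not necessarily what the book does):

library('glmnet')

# glmnet wants the predictors as a matrix; here the columns are polynomial
# terms in X, and lambda controls how strongly the coefficients are shrunk
# toward zero
glmnet.fit <- glmnet(poly(df$X, degree = 10), df$Y)
print(glmnet.fit)  # one row per lambda: nonzero coefficients, % deviance explained

# cross-validate over the lambda path to pick a value, then inspect
# the coefficients at that lambda
cv.fit <- cv.glmnet(poly(df$X, degree = 10), df$Y)
cv.fit$lambda.min
coef(cv.fit, s = 'lambda.min')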
Finally they use Regularization and logistic regression to build a model for predicting whether or not a book will be in the top 50 (of the 100 best selling books) given its back-cover description. Instead of summarizing that analysis I wanted to write a little about how they hammer the data into shape to analyze it. First they read in the data, then they load the "tm" package and start calling functions:
ranks <- read.csv('data/oreilly.csv', stringsAsFactors = FALSE)
library('tm')
documents <- data.frame(Text = ranks$Long.Desc.)
row.names(documents) <- 1:nrow(documents)  # not sure why this is necessary
corpus <- Corpus(DataframeSource(documents))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords('English'))
The last four lines are where the magic happens. First they make a "corpus". This is sort of like a list where the elements are text documents. The next line makes everything lower case. The next collapses runs of whitespace characters to a single blank. The last removes the most common words in English, like "the" and "an". When you're done, you can't view the corpus like you would a data frame. You can't just type the name at the command line and hit enter. You have to use the inspect() function, like this:
inspect(corpus)  # all documents
corpus[[2]]      # just the second document
Pretty cool. But it's still not ready for regression analysis. You have to produce a Document Term Matrix (DTM) from your corpus. This is done like so:
dtm <- DocumentTermMatrix(corpus)
Easy enough. But, again, how to view it? Use the inspect() function, like this:
# first three documents and first five terms
inspect(dtm[1:3, 1:5])

# all documents and first five terms
inspect(dtm[, 1:5])
The rows correspond to the documents (numbered 1 through 100 here) and the columns to the terms (words). The values in the matrix tell you how many times each term appeared in each document. Here's a quick way to do some analysis on those terms:
# find those terms that occur at least five times
findFreqTerms(dtm, 5)
The last step to get everything ready for analysis is to convert the DTM into a matrix:
x <- as.matrix(dtm)
We do this because the glmnet function requires the predictors to be in a matrix, and the DTM holds the predictors. (The response is an indicator: top-50 book or not.) Now the dimension of the dtm matrix is 100 x 5605! That's a lot of columns! The authors have this to say: "Because our data set has more columns than rows, unregularized regression will always produce an overfit model. For that reason, we have to use some form of regularization to get any meaningful results." They probably could have eliminated some of the columns. When I inspected the DTM I saw terms such as "!)</li", "#</--", "#</,", "#</em", "#</em," and "&". These are clearly just leftover HTML fragments, and I can't see how they contribute to the analysis. But even if they were removed, the matrix would still have many more columns than rows.
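For completeness, here's roughly what the final fit looks like. This is my own sketch, not the book's code: in particular, I'm assuming the 100 rows of the CSV (and therefore of x) are already ordered by sales rank, so the first 50 rows are the top-50 books. If the file is ordered differently, you'd build y from the actual rank column instead.

library('glmnet')
set.seed(1)

# response: 1 if the book is in the top 50, 0 otherwise
# (assumes the 100 rows are ordered by sales rank)
y <- as.numeric(1:100 <= 50)

# regularized logistic regression (family = 'binomial') fit over a whole
# path of lambda values
regularized.fit <- glmnet(x, y, family = 'binomial')

# cross-validation to pick a lambda
cv.fit <- cv.glmnet(x, y, family = 'binomial')
cv.fit$lambda.min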