Category Archives: Using R

The standard deviation of the sampling distribution of the mean

Someone on Reddit posted Chapter 1 of Howard Wainer’s book, Picturing the Uncertain World. The name of the chapter is The Most Dangerous Equation and its subject is the formula for the standard deviation of the sampling distribution of the mean. This little guy:

$$ \sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}}$$

Why is it dangerous? The danger comes from not understanding it. Wainer gives several examples of how ignorance of this equation has “led to billions of dollars of loss over centuries, yielding untold hardship.” But before diving into the examples he briefly describes the formula. It’s not a proof but rather a heuristic demonstration of how it works:

…if we measure, for example, the heights of, say, 1000 students at a particular high school, we might find that the average height is 67 inches, but heights might range from perhaps as little as 55 inches to as much as 80 inches. A number that characterizes this variation is the standard deviation….But now suppose we randomly grouped the 1000 children into 10 groups of 100 and calculated the average within each group. The variation of these 10 averages would likely be much smaller than [the standard deviation of all 1000 students]…

This got me to thinking about how I could demonstrate this in R. It seems like we could generate a population, then take every possible sample of size n to create the sampling distribution, and then show that the standard deviation of the sampling distribution is indeed equal to the population standard deviation divided by the square root of n. Let’s do it.

First I load the gtools package which has a function called permutations(). As you might guess, it takes a vector of values and generates all permutations of a given size. After that I generate a “population” of 10 values using the rnorm() function. So my 10 values come from a normal distribution with mean 10 and standard deviation 2, but I’m treating these 10 values as if they’re my entire population for this toy example.

library(gtools)
# generate population of size 10
pop <- rnorm(10,10,2)

Now we're ready to generate the sampling distribution. I decided to let n = 3. This means we need to generate every possible sample of size 3. This is where the permutations() function comes in. The first argument tells it how big the source vector is, the second states the size of the target vectors, and the third is the source vector. The last argument tells it to allow repeats. This is important. It replicates sampling with replacement, which is necessary to demonstrate this formula using a finite population.

# generate the sampling distribution
# first generate all samples of size 3
sdist <- permutations(10,3,pop,repeats=TRUE)

If you look at the variable sdist, you'll see it's a 1000 x 3 matrix. That's 1000 permutations of size 3 from our original population. Next we take the mean of each row (or sample of size 3):

# find the mean of all samples (in rows)
sdm <- apply(sdist,1,mean)

The variable "sdm" is our sampling distribution of the mean. We took every possible sample of size 3 from our population and calculated the mean of each sample. Now the first thing to note is that the mean of the sampling distribution is equal to the mean of our population:

mean(pop) == mean(sdm)
[1] TRUE

But what I really wanted to verify was the formula \( \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\), or equivalently \(\sigma_{\bar{x}}^{2} = \frac{\sigma^{2}}{n}\):

sum(sdm^2)/1000 - (mean(sdm)^2) == (sum(pop^2)/10 - (mean(pop)^2))/3
[1] TRUE

Aha! It's true! Of course this isn't a proof, but it came out like we expected. Notice I had to manually find the population variance in each case using \( \sigma^{2} = E(X^{2}) - \mu^{2}\). That's because the var() and sd() functions in R divide by n-1.
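
As a side note (my addition, not in the original post), you can get the same check out of var() by rescaling its n - 1 denominator back to n:

# same verification using var(), rescaled to the population variance
pop.var <- var(pop) * (10 - 1)/10      # population variance of the 10 values
sdm.var <- var(sdm) * (1000 - 1)/1000  # population variance of the 1000 sample means
all.equal(sdm.var, pop.var/3)          # should be TRUE (up to floating point)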

Using Simulation to Compute Confidence Intervals

I’ve been working through Gelman and Hill’s book, Data Analysis Using Regression and Multilevel/Hierarchical Models. I originally wanted to read it and blog about each chapter the way I did Machine Learning for Hackers. But that book had 12 chapters. This one has 25. And the chapters are longer and denser. I don’t think I have the appetite for such an endeavor. However, I’m still working through the book and want to at least blog about certain topics that catch my eye. Today’s post comes from Chapter 2, Concepts and Methods from Basic Probability and Statistics.

In reviewing confidence intervals, the authors mention using simulation to form confidence intervals for statistics that have complicated standard error formulas. It’s one thing to compute a confidence interval for a mean. The standard error is \( \frac{s}{\sqrt{n}} \). But what about a ratio? Well, that’s more difficult. Thus the book gives a nice example on how to use simulation to find a confidence interval for a ratio (p. 20). Instead of reproducing it verbatim, I thought I would apply the strategy to finding a confidence interval for a median.

My old stats textbook, Probability and Statistical Inference by Hogg and Tanis (7th edition) has a problem in section 6.10 (#7) that gives you measurements of one of the front legs of 20 different spiders and asks you to find a 95% confidence interval for the median. Now this chapter presents a method for calculating this, which is covered in depth at this very nice Penn State web site. Since the problem is odd-numbered, I can flip to the back and see the answer is (15.40, 17.05). Let’s try finding a 95% confidence interval for this data using simulation.

First off, here is the data:

x <- scan("Exercise_6_10-07.txt",skip=1)
> sort(x)
 [1] 13.55 13.60 14.05 15.10 15.25 15.40 15.45 15.75 16.25 16.40 16.45 16.65
[13] 16.80 16.95 17.05 17.55 17.75 19.05 19.05 20.00
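
If you don't happen to have that text file, the sorted listing above is enough to recreate the vector directly (my addition):

# the 20 front-leg measurements (mm), typed in from the sorted output above
x <- c(13.55, 13.60, 14.05, 15.10, 15.25, 15.40, 15.45, 15.75, 16.25, 16.40,
       16.45, 16.65, 16.80, 16.95, 17.05, 17.55, 17.75, 19.05, 19.05, 20.00)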

The measurements are in millimeters. The median of this data is 16.425, easily calculated in R with the median() function. Now what about a confidence interval? This is after all a sample of 20 spiders. Our median is but a point estimate of the population median, a value we will never be able to directly measure. Using R we can simulate a large number of samples by re-sampling from the data and taking the median each time, like this:

nsims <- 1000 # number of simulations
m <- rep(NA,nsims) # empty vector to store medians
for (i in 1:nsims) {
    m[i] <- median(sample(x, replace=TRUE))
}

When we're done, we can use the quantile() function as follows to find the 2.5 and 97.5 percentiles and thus estimate a confidence interval:

quantile(m,c(0.025,0.975))
    2.5%    97.5% 
15.42500 17.00125 

That's very close to the given answer of (15.40, 17.05). This example would technically be filed under "bootstrap", but I think it captures the spirit of using simulation to find a confidence interval.

Buffon’s Needle Problem, or How to use Probability to Estimate Pi

I gave a presentation on Buffon’s Needle Problem in a job interview once. Here’s the presentation I gave in PDF format if you’re interested. If you’ve never heard of Buffon’s Needle Problem, you should open my little presentation and browse through it. It’s one of the damndest things I’ve ever learned.

Here’s the setup. Imagine you have a floor with parallel lines, kind of like hardwood floors, with all lines equally spaced from one another. Let’s say the distance between the lines is L:
[Figure: parallel lines on a floor, spaced a distance L apart]

Now imagine you drop some needles of length L on the floor and count the instances of where the needles cross the lines:
[Figure: needles of length L scattered across the lines, some crossing them]

Yes, I know, I have red needles! Cool, right? Anyway, here’s the crazy thing: if you drop a lot of needles (say 10,000) and count the number of needles crossing lines, you can use that information to estimate Pi! It turns out that if the distance between lines and the needle length are both 1, then \( \pi \approx \frac{2n}{k} \), where n = number of needles and k = number of needles crossing lines. I don’t know about you but I think that’s nuts!

Let’s fire up R and see this in action. Here’s slightly modified code from the presentation:

a <- 1 # length of needle
L <- 1 # distance between lines
n <- 100000 # number of dropped needles
hit <- 0
for(i in 1:n) {
    x <- runif(1,0,1)
    y <- runif(1,0,1)
    while(x^2 + y^2 > 1) { # no points outside of unit circle
        x <- runif(1,0,1)
        y <- runif(1,0,1)
    }
    theta <- atan(y/x) # the random angle
    d <- runif(1,0,(L/2)) # distance of needle midpoint to nearest line
    if(d <= (a/2)*sin(theta)) {
        hit <- hit + 1
    }
}
pi.est <- (n*2)/(hit)
pi.est

First I define the distance between the lines (L) and the needle length (a) to both be 1. They don't have to be equal, but the needle length does need to be less than or equal to the distance between the lines. It turns out that in general \( \pi \approx \frac{2na}{kL}\). In my example, I have \( a = L = 1\), so it simplifies to \( \pi \approx \frac{2n}{k}\). Next I define a variable called "hit" to count the number of needles crossing a line and then I dive into a loop to simulate dropping 100,000 needles.

The first 7 lines in the loop generate a random acute angle (less than 90 degrees or \( \frac{\pi}{2}\) radians) by way of the arctan function and x and y points that lie within the unit circle. The reason the points need to lie within the unit circle is to ensure all angles have an equal chance of being selected. The next line randomly generates a distance (d) from the midpoint of the needle to the nearest line. Using my random angle and random distance I then check to see if my needle crossed the line. If \( d \le \frac{a}{2}sin(\theta)\) then the needle crossed the line and I increment my hit counter. In my presentation I try to justify this mathematical reasoning using pretty pictures.

Finally I calculate my Pi estimate and spit it out. I ran it just now with n = 100,000 and got 3.136517. Not an accurate estimate but pretty close. When I tried it with n = 1,000,000 I got 3.142337. The more needles, the better your estimate.
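
As an aside (my own sketch, not from the presentation), the same simulation can be vectorized, which runs much faster than the loop. The one difference is that angles whose random point falls outside the unit circle are simply thrown away instead of redrawn, so the needle count is however many points survive the filter:

# vectorized version, with a = L = 1
n <- 100000
x <- runif(n); y <- runif(n)
keep <- x^2 + y^2 <= 1              # keep only points inside the unit circle
theta <- atan(y[keep]/x[keep])      # random acute angles
d <- runif(sum(keep), 0, 1/2)       # distance of needle midpoint to nearest line
hits <- sum(d <= (1/2)*sin(theta))  # needles that cross a line
2*sum(keep)/hits                    # estimate of pi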

Now is this practical? Of course not. I can Google Pi and get it to 11 decimal places. R even has a built-in Pi variable (pi). But the fact we can use probability to estimate Pi just never ceases to amaze me. Nice going Buffon!

Scraping Data off a Web Site

I’m taking the Data Analysis class through Coursera and one of the topics we’ve covered so far is how to “scrape” data off a web site. The idea is to programmatically go through the source code of a web page, pull out some data, and then clean it up so you can analyze it. This may seem like overkill at first glance. After all, why not just select the data with your mouse and copy-and-paste into a spreadsheet? Well, for one, there may be dozens (or hundreds) of pages to visit and copying-and-pasting from each one would be time-consuming and impractical. Second, rarely does a copy-and-paste off a web site produce data ready for analysis. You have to tidy it up, sometimes quite a bit. Clearly these are both tasks we would like to automate.

To put this idea to use, I decided to scrape some data from the box scores of Virginia Tech football games. I attended Tech and love watching their football team, so this seemed like a fun exercise. Here’s an example of one of their box scores. You’ll see it has everything but what songs the band played during halftime. I decided to start simple and just scrape the Virginia Tech Drive Summaries. This summarizes each drive, including things like number of plays, number of yards gained, and time of possession. Here’s the function I wrote in R, called vtFballData:

library(XML) # provides readHTMLTable()

vtFballData <- function(start,stop,season){
    dsf <- c()
    # read the source code
    for (i in start:stop){
    url <- paste("http://www.hokiesports.com/football/stats/showstats.html?",i,sep="")
    web_page <- readLines(url)

    # find where VT drive summary begins
    dsum <- web_page[(grep("Virginia Tech Drive Summary", web_page) - 2):
                         (grep("Virginia Tech Drive Summary", web_page) + 18)]
    dsum2 <- readHTMLTable(dsum)
    rn <- dim(dsum2[[1]])[1]
    cn <- dim(dsum2[[1]])[2]
    ds <- dsum2[[1]][4:rn,c(1,(cn-2):cn)]
    ds[,3] <- as.character(ds[,3]) # convert from factor to character
    py <- do.call(rbind,strsplit(sub("-"," ",ds[,3])," "))
    ds2 <- cbind(ds,py)
    ds2[,5] <- as.character(ds2[,5]) # convert from factor to character
    ds2[,6] <- as.character(ds2[,6]) # convert from factor to character
    ds2[,5] <- as.numeric(ds2[,5]) # convert from character to numeric
    ds2[,6] <- as.numeric(ds2[,6]) # convert from character to numeric
    ds2[,3] <- NULL # drop original pl-yds column

    names(ds2) <-c("quarter","result","top","plays","yards")
    # drop unused factor levels carried over from readlines
    ds2$quarter <- ds2$quarter[, drop=TRUE] 
    ds2$result <- ds2$result[, drop=TRUE]

    # convert TOP from factor to character
    ds2[,3] <- as.character(ds2[,3]) 
    # convert TOP from M:S to just seconds
    ds2$top <- sapply(strsplit(ds2$top,":"),
        function(x) {
            x <- as.numeric(x)
            x[1]*60 + x[2]})

    # need to add opponent
    opp <- web_page[grep("Drive Summary", web_page)]
    opp <- opp[grep("Virginia Tech", opp, invert=TRUE)] # not VT
    opp <- strsplit(opp,">")[[1]][2]
    opp <- sub(" Drive Summary</td","",opp)
    ds2 <- cbind(season,opp,ds2)
    dsf <- rbind(dsf,ds2)
    }
return(dsf)
}

I’m sure this is three times longer than it needs to be and could be written much more efficiently, but it works and I understand it. Let’s break it down.

My function takes three values: start, stop, and season. Start and stop are both numerical values needed to specify a range of URLs on hokiesports.com. Season is simply the year of the season. I could have scraped that as well but decided to enter it by hand since this function is intended to retrieve all drive summaries for a given season.

The first thing I do in the function is define an empty variable called “dsf” (“drive summaries final”) that will ultimately be what my function returns. Next I start a for loop that will start and end at numbers I feed the function via the “start” and “stop” parameters. For example, the box score of the 1st game of the 2012 season has a URL ending in 14871. The box score of the last regular season game ends in 14882. To hit every box score of the 2012 season, I need to cycle through this range of numbers. Each time through the loop I “paste” the number to the end of “http://www.hokiesports.com/football/stats/showstats.html?” and create my URL. I then feed this URL to the readLines() function which retrieves the code of the web page and I save it as “web_page”.

Let’s say we’re in the first iteration of our loop and we’re doing the 2012 season. We just retrieved the code of the box score web page for the Georgia Tech game. If you go to that page, right click on it and view source, you’ll see exactly what we have stored in our “web_page” object. You’ll notice it has a lot of stuff we don’t need. So the next part of my function zeros in on the Virginia Tech drive summary:

# find where VT drive summary begins
dsum <- web_page[(grep("Virginia Tech Drive Summary", web_page) - 2):
                 (grep("Virginia Tech Drive Summary", web_page) + 18)]

This took some trial and error to assemble. The grep() function tells me which line contains the phrase “Virginia Tech Drive Summary”. I subtract 2 from that line to get the line number where the HTML table for the VT drive summary begins (i.e., where the opening <table> tag appears). I need this for the upcoming function. I also add 18 to that line number to get the final line of the table code. I then use this range of line numbers to extract the drive summary table and store it as “dsum”. Now I feed “dsum” to the readHTMLTable() function, which converts an HTML table to a dataframe (in a list object), and save the result as “dsum2”. The readHTMLTable() function is part of the XML package, so you have to download and install that package first and call library(XML) before running this function.

At this point we have a pretty good looking table. But it has extra header rows at the top we need to get rid of. Plus I don’t want every column. I only want the first column (quarter) and last three columns (How lost, Pl-Yds, and TOP). This is a personal choice. I suppose I could have snagged every column, but decided to just get a few. To get what I want, I define two new variables, “rn” and “cn”. They stand for row number and column number, respectively. “dsum2” is a list object with the table in the first element, [[1]]. I reference that in the call to the dim() function. The first element returned is the number of rows, the second the number of columns. Using “rn” and “cn” I then index dsum2 to pull out a new table called “ds”. This is pretty much what I wanted. The rest of the function is mainly just formatting the data and giving names to the columns.

The next three lines of code serve to break up the “Pl-Yds” column into two separate columns: plays and yards. The following five lines change variable classes and remove the old “Pl-Yds” column. After that I assign names to the columns and drop unused factor levels. Next up I convert TOP into seconds. This allows me to do mathematical operations, such as summing and averaging.

The final chunk of code adds the opponent. This was harder than I thought it would be. I’m sure it can be done faster and easier than I did it, but what I have works. First I use the grep() function to identify the two lines that contain the phrase “Drive Summary”. One will always have Virginia Tech and the other their opponent. The next line uses the invert parameter of grep to pick the line that does not contain Virginia Tech. The selected line looks like this for the first box score of 2012: <td colspan="9">Georgia Tech Drive Summary</td>. Now I need to extract “Georgia Tech”. To do this I split the string by “>” and save the second element:

opp <- strsplit(opp,">")[[1]][2]

It looks like this after I do the split:

[[1]]
[1] "<td colspan=\"9\""              "Georgia Tech Drive Summary</td"

Hence the need to add the “[[1]][2]” reference. Finally I substitute " Drive Summary</td" with nothing and that leaves me with “Georgia Tech”. Then I add the season and opponent to the table and update the “dsf” object. The last line is necessary to allow me to add each game summary to the bottom of the previous table of game summaries.

Here’s how I used the function to scrape all VT drive summaries from the 2012 regular season:

dsData2012 <- vtFballData(14871,14882,2012)

To identify start and stop numbers I had to go to the VT 2012 stats page and hover over all the box score links to figure out the number sequence. Fortunately they go in order. (Thank you VT athletic dept!) The bowl game is out of sequence; its number is 15513. But I could get it by calling vtFballData(15513,15513,2012). After I call the function, which takes about 5 seconds to run, I get a data frame that looks like this:

 season          opp quarter result top plays yards
   2012 Georgia Tech       1   PUNT 161     6    24
   2012 Georgia Tech       1     TD 287    12    56
   2012 Georgia Tech       1  DOWNS 104     5    -6
   2012 Georgia Tech       2   PUNT 298     7    34
   2012 Georgia Tech       2   PUNT  68     4    10
   2012 Georgia Tech       2   PUNT  42     3     2

Now I’m ready to do some analysis! There are plenty of other variables I could have added, such as whether VT won the game, whether it was a home or away game, whether it was a noon, afternoon or night game, etc. But this was good enough as an exercise. Maybe in the future I’ll revisit this little function and beef it up.
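
Just to show the kind of thing the finished data frame makes easy, here's a quick sketch (my own, not part of the function) that averages plays, yards and time of possession per drive by opponent:

# average plays, yards and time of possession (in seconds) per drive, by opponent
aggregate(cbind(plays, yards, top) ~ opp, data = dsData2012, FUN = mean)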

A Combinatorial Simulation

I have started a new book called The Art of R Programming by Norman Matloff and I’m really digging it. I won’t blog about each chapter the way I did Machine Learning for Hackers, but I did come across something I thought made good blog material.

In Chapter 8 (Doing Math and Simulations in R), Matloff presents an example on combinatorial simulation. The motivation for the example is the following problem:

Three committees, of size 3, 4 and 5, are chosen from 20 people. What is the probability that persons A and B are chosen for the same committee?

Says Matloff: “This problem is not hard to solve analytically, but we may wish to check our solution using simulation…” I suppose it’s not that hard, but if you don’t think about these kinds of problems on a regular basis they can trip you up. He doesn’t give the analytic solution but does provide code for a neat simulation of it:

# 8.6.3 combinatorial simulation
sim <- function(nreps) {
  commdata <- list()
  commdata$countabsamecomm <- 0
  for (rep in 1:nreps) {
    commdata$whosleft <- 1:20
    commdata$numabchosen <- 0
    commdata <- choosecomm(commdata,5)
    if (commdata$numabchosen > 0) next
    commdata <- choosecomm(commdata,4)
    if (commdata$numabchosen > 0) next
    commdata <- choosecomm(commdata,3)
  }
  print(commdata$countabsamecomm/nreps)
}

choosecomm <- function(comdat,comsize) {
  committee <- sample(comdat$whosleft, comsize)
  comdat$numabchosen <- length(intersect(1:2,committee))
  if (comdat$numabchosen == 2)
    comdat$countabsamecomm <- comdat$countabsamecomm + 1
  comdat$whosleft <- setdiff(comdat$whosleft,committee)
  return(comdat)
}

That's quite a bit of code. The first block is the main function, called "sim". The second block is the other function, "choosecomm", that gets called in the first function. The whole thing makes creative use of a list and the intersect() and setdiff() functions. If you want to simulate selecting three committees of 3, 4 and 5 people 100,000 times and see how many times persons 1 and 2 are on the same committee, enter sim(100000) at the command line. You should get about 0.10.

Now how do we solve this using math and probability? First off, instead of 20 people, I like to think of 20 chips, 18 of which are white and 2 that are red. Imagine they're in a bag and we reach in, scoop out 12, and immediately and randomly divide them into three groups of 3, 4 and 5. What is the probability one of those groups has both red chips? To calculate this we need to enumerate the possible selections that include both red chips and divide that by all possible selections. First let's do it for the committee of 5. The number of possible combinations of 5 from 20 is \( \binom{20}{5}\). The number of combinations including both red chips is \( \binom{2}{2} \times \binom{18}{3}\). We can calculate this in R using the choose() function as follows:

c5 <- choose(18,3)/choose(20,5)

We then repeat this for the other two committees of 4 and 3, like so:

c4 <- choose(18,2)/choose(20,4)
c3 <- choose(18,1)/choose(20,3)
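
Adding the three pieces together in R (my own quick check) gives the same answer the simulation produced:

c3 + c4 + c5 # adds up to 0.1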

Summing this up we get c3 + c4 + c5 = 0.10. Now it may seem strange that we always set up our denominator as if we're drawing from 20. After all, in the simulation above, there's a specific order. First 5 are chosen. If the two people are not both on that committee, then we draw 4 and then 3. It seems that should be taken into account when you solve this mathematically. But you don't have to. That's why I like to imagine just scooping 12 people up and assigning instant membership to them on a random basis. While that's not how committee selection usually happens in "real life" it makes thinking about an analytic solution easier.

Incidentally we can simulate the scooping up and assigning of instant membership in R as well:

sim2 <- function(nreps) {
  cnt <- 0
  for (i in 1:nreps) {
    s <- sample(20,12)
    if (all(1:2 %in% s[1:3])) {cnt <- cnt + 1; next}
    if (all(1:2 %in% s[4:7])) {cnt <- cnt + 1; next}
    if (all(1:2 %in% s[8:12])) cnt <- cnt + 1
  }
  print(cnt/nreps)
}

I call my function "sim2", because I'm super creative like that. Feed my function the number of simulations you want to run and it does the following for each simulation:

1. samples 12 numbers from 1 through 20 and assigns to a vector called "s"
2. checks if 1 and 2 are in the first three elements of s (committee of 3)
3. checks if 1 and 2 are in the next four elements of s (committee of 4)
4. checks if 1 and 2 are in the last five elements of s (committee of 5)

Notice the "next" in the first two if statements to improve efficiency. (No need to check subsequent if statements if the current one is true.) Obviously the sample() function scoops up my 12 people. The indexing of the "s" vector provides my instant committee membership. The first 3 are committee 3. The next 4 committee of 4. The last 5 the committee of 5. The nice thing about my function is that it's way faster than the one in the book. When I timed both functions running 100,000 simulations I got the following results:

> system.time(sim(100000))
[1] 0.1016
 user system elapsed 
 7.74   0.00    7.74 
> system.time(sim2(100000))
[1] 0.10159
 user system elapsed 
 1.42   0.00    1.42

Much faster, by 6 seconds. However I must say that Matloff wasn't going for speed and says as much throughout the book. He's writing R code the reader can understand. Just want to mention that lest anyone think I'm trying to one-up the author.

Machine Learning for Hackers, Chapter 12

In the final chapter the authors introduce Support Vector Machines (SVM). SVMs generalize linear decision boundaries for classification. Like the other chapters, this one avoids math and simply shows you how to use SVM in R. It then analyzes the spam mail dataset from chapter 3 using SVM, kNN and regularized logistic regression and compares their performance.

The first section of the chapter demonstrates SVM using a toy dataset. The dataset has two predictors (X and Y) and a response (a classification of 0 or 1). The dataset looks like this plotted:

[Figure: scatterplot of the toy dataset, with class 0 and class 1 points in different colors]
I show this figure because the book is printed in black and white and you can’t tell class 0 from class 1 in the picture on page 276. That’s a problem that plagues this entire chapter. Anyway, what we have here is data that require a non-linear decision boundary. In other words we can’t draw a single straight line through the picture to indicate a boundary between the blue and red dots. This is where SVM comes in handy. Here’s the SVM code they use to analyze this dataset. As you’ll see they play around with the svm function from the e1071 package to show how changing certain parameters affects the quality of predictions. I didn’t get anything new out of this section that I felt compelled to blog about. It’s basically an extended example of how to use the svm function similar to what you might see on a help page. They do make use of the melt function from the reshape package, which is very cool and useful, but I already wrote about that in the Chapter 8 post.

The second section is “Comparing Algorithms”. They resurrect the spam dataset from chapter 3 and analyze it using SVM, kNN and regularized logistic regression. Recall the spam dataset was converted to a term document matrix (TDM) where the emails are the rows and the columns are terms. So email #1 would be row 1. In that row we would see the number of times the words in the column headers appeared. For example, the word “banner” appears 0 times in email #1. Column 1 of the spam dataset indicates whether or not the email is spam. They create testing and training sets and use those to create models and compare the different classification methods. Again, there’s nothing too fancy here where R programming is concerned, or at least nothing I haven’t already covered. They determine logistic regression works best for classifying spam email.

Now here’s something interesting worth noting. Instead of prepping the spam data as they did in Chapter 3, they skip directly to loading it into R as follows:

load('data/dtm.RData')

But how did they save that data as dtm.Rdata? By doing something like this:

save(dtm, file="dtm.RData")

…where dtm was a matrix. I say “matrix” because that’s the class of the dtm object once dtm.RData is loaded.

Another snippet worth noting is how they create their training and test sets. First they randomly sample row numbers of the spam matrix to serve as the indices for selecting the training data. Then they take whatever numbers were not selected and make those the indices for the test data:

training.indices <- sort(sample(1:nrow(dtm), round(0.5 * nrow(dtm))))
test.indices <- which(! 1:nrow(dtm) %in% training.indices)

The dtm dataset has 3249 rows. So the sample function selects (without replacement) from the numbers 1 through 3249. The second argument to the sample function says how many to sample, which in this case is \( 0.5 \times 3249 = 1624.5 \). That gets rounded to 1624 by the round function. So 1624 numbers are sampled without replacement from the numbers 1 through 3249. And then they get sorted in ascending order. Those are the indices for the training data. To get the test data, they use the value matching %in% operator to look for matches, but negate the matches to get all the numbers that were not selected with the sample function.

The code "1:nrow(dtm) %in% training.indices" returns a logical vector with 3249 elements. 1624 are TRUE, the rest are FALSE. The code "! 1:nrow(dtm) %in% training.indices" returns the opposite logical vector. Now 1624 are FALSE and the rest are TRUE. Finally, inserting that strip of code into the which function returns the indices of the vector that are FALSE. Therefore both training.indices and test.indices are vectors of numbers from 1:3249, with no overlap between them. For example:

> training.indices[1:10]
 [1] 1 2 6 9 11 12 15 16 18 19
> test.indices[1:10]
 [1] 3 4 5 7 8 10 13 14 17 24
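
To make the pattern easier to see, here's a tiny standalone version of the same %in%/which() trick (my own toy example, not from the book):

# toy version of the train/test index split
rows <- 1:10
train <- sort(sample(rows, 5))
test <- which(! rows %in% train)
train # five randomly chosen row numbers
test  # the other five

Every number from 1 to 10 lands in exactly one of the two vectors.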

These vectors of numbers are then used to select the training and test data as follows:

train.x <- dtm[training.indices, 3:ncol(dtm)]
train.y <- dtm[training.indices, 1]
test.x <- dtm[test.indices, 3:ncol(dtm)]
test.y <- dtm[test.indices, 1]

I thought that was good to go through and understand since all the functions in this section require data (or are capable of taking data) in this format. For example:

library('glmnet')
regularized.logit.fit <- glmnet(train.x, train.y, family = c('binomial'))
library('e1071')
linear.svm.fit <- svm(train.x, train.y, kernel = 'linear')
library('class')
knn.fit <- knn(train.x, test.x, train.y, k = 50)

And that does it for me! I'm ready to move on to another book.

How to run R in batch mode in Windows 7

I recently purchased The Art of R Programming with a gift card I got for Christmas. Three pages into chapter 1 it talks about running R in “batch mode”. I had never done this and thought I would give it a try. It seems very useful, especially for large programs. Here’s the example program the book provides:

pdf("xh.pdf")
hist(rnorm(100))
dev.off()

You type that into a text file and save it as z.R. This program produces a histogram of 100 values drawn from a standard Normal distribution and outputs the histogram to a PDF file called xh.pdf. Now we want to submit that program to R without actually opening R. That means using the command line. But first we have to update our path if we’re using Windows. The book doesn’t mention this, hence the reason for this post. Here’s how you do it if you’re running 64-bit R on Windows 7:

  1. go to Control Panel
  2. click User Accounts and Family Safety
  3. click User Accounts
  4. click “Change my environment variables” in the left menu
  5. Highlight Path and click Edit
  6. Add semicolon at the end and then add “c:\Program Files\R\R-2.15.1\bin\x64” (without the quotes; update version if different from 2.15.1).
  7. Click OK twice.
  8. Log out and log back in to have the new path settings take effect.

Now go to the command prompt. Just go to Start and type “cmd”. Next go to the directory where your R program is located. If it’s in Documents, type “cd Documents”. Finally submit the following command:

R CMD BATCH z.R

That runs the R program and creates a PDF file in your working directory.
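
As an aside (my addition, not from the book), R also ships with Rscript in that same bin directory. It runs a script and prints any output straight to the console, which I find handy for quick jobs:

Rscript z.R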

Machine Learning for Hackers, Chapter 11

Are you a heavy Twitter user? Would you like to build a custom recommendation system in R that will recommend new Twitter feeds for you to follow? If you answered “yes” to both questions then this is the chapter for you. Here’s the code. Have fun.

Now I don’t use Twitter, so this chapter was not very interesting for me. I should probably use Twitter and get with it and join the 2010’s. But at the moment I don’t, so building a Twitter recommendation system in R doesn’t do me any good. I guess I could have followed along and built the recommendation system in the book for one of the authors, but I just couldn’t motivate. It’s a lot of code that requires new packages. Plus there’s a section that requires using a program outside of R called Gephi. It’s an extremely specific case study. It’s not like some of the other chapters where they introduce a general machine learning method, like kNN or PCA, and show an example of how to use it. There is nothing general about this chapter. At one point they do make use of hierarchical clustering to find similarities between Twitter users. Clustering is a common machine learning tactic. But they only give it a couple of pages. I hate to sound down on the chapter. It’s well done and the authors put a lot of work into it. Personally, though, I just couldn’t get excited about it.

Having said that, I did see some interesting things in the R code that I wanted to file away and remember. The first was creating an empty vector. In a loop they run, they sometimes get a null result and want to store an empty vector for that pass. They do this as follows:

y <- c(integer(0), integer(0))

What caught my eye was the way rbind handles those vectors. It ignores them. I could see that being potentially useful. Here's a toy demonstration:

x <- c(5, 4)
y <- c(integer(0), integer(0))
z <- c(3, 2)
rbind(x,y,z)
   [,1] [,2]
x     5    4
z     3    2

Another snippet I wanted to note was how to identify unique values in a matrix. You use the unique function, which I was familiar with. What I didn't know was that you can do it for a matrix by calling the columns of the matrix, like this:

mat <- matrix(floor(runif(10,2,8)),5,2)
mat
           [,1] [,2]
[1,]        4      3
[2,]        2      7
[3,]        7      2
[4,]        3      7
[5,]        5      7
> unique(c(mat[,1],mat[,2]))
[1] 4 2 7 3 5
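
A slightly shorter way to get the same thing (my note, not the book's code) is to flatten the whole matrix at once:

unique(as.vector(mat))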

This chapter also used a base function that I had never seen before called duplicated. It returns a vector of TRUE/FALSE values that tells you whether or not each value has already occurred earlier in the vector. I was surprised I hadn't heard of it before. Here's a demo of duplicated:

x <- c(1:5,2:6)
x
[1] 1 2 3 4 5 2 3 4 5 6
duplicated(x)
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
!duplicated(x)
[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
x[duplicated(x)]
[1] 2 3 4 5
x[!duplicated(x)]
[1] 1 2 3 4 5 6

Finally they make use of sapply in one of their functions and I wanted to document its use. The suite of apply functions always gives me trouble for some reason. I get the gist of what they do, I just have a hard time remembering why I would want to use one over the other. Anyway, what they do is use sapply to run a function over each row of a matrix. That seems like a useful thing to remember. First let's create a 100 x 3 data frame of values:

a <- rnorm(100,2,4)
b <- rnorm(100,3,4)
c <- rnorm(100,5,4)
df <- data.frame(a,b,c)

Now, let's say I wanted to create an indicator vector (1s and 0s) flagging which rows contain a value less than 0. Here's how I can do that using sapply:

sapply(1:nrow(df), function(i) ifelse(any(df[i,]<0),1,0))

The authors did something similar to that and then used the resulting vector to subset another vector. Of course if I wanted to subset my example data frame above to show only rows that contain a value less than 0, then I could do this:

subset(df, a < 0 | b < 0 | c < 0)
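
And since this post keeps bumping into the apply family anyway, here's the same row check written with apply() instead of sapply() over row indices (my own variation):

# TRUE/FALSE for each row of df containing a value less than 0
apply(df, 1, function(row) any(row < 0))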

So that's the helpful R code I gleaned from Chapter 11. I feel like there should be more. It's a long chapter with a lot of code. But much of it either involves functions from specific packages or uses common R functions I've already documented or know about. Next up is chapter 12, the final chapter.

Machine Learning for Hackers, Chapter 10

k-nearest neighbors (kNN) is the topic of chapter 10. The authors say it’s the most intuitive of the machine learning algorithms covered in the book and I guess I would agree, provided you understand the Euclidean distance formula, understand what a distance matrix is, and know a little about how R works.

They provide a toy dataset (named “df”) to demonstrate how kNN works. It looks like this:

         X        Y Label
1 2.373546 5.398106     0
2 3.183643 4.387974     0
3 2.164371 5.341120     0
4 4.595281 3.870637     0
5 3.329508 6.433024     0
6 2.179532 6.980400     0

Think of the rows as points lying in an XY plane. Notice they also have labels. The dataset has 100 rows, with 50 labeled 0 and 50 labeled 1. To run the kNN algorithm, we need to create a distance matrix that stores the distance between all the points. So row 1 of the distance matrix would have the distances of all points from the first point (2.373546, 5.398106). Row 2 would have the distances of all points from (3.183643, 4.387974). And so on. The authors actually write a function to create the distance matrix, which is good for illustrating how the distance matrix is created, but I want to note you can use the built-in “dist” function for this:

distance <- as.matrix(dist(df, diag=TRUE, upper=TRUE))

I set the "diag" and "upper" parameters to TRUE in order to match the matrix they create. The result is this:

 distance[1:3,1:3]
          1        2         3
1 0.0000000 1.294845 0.2167983
2 1.2948454 0.000000 1.3954937
3 0.2167983 1.395494 0.0000000

That's just a portion of the matrix. The full matrix is 100 x 100. The first row (or first column) refers to point 1 (2.373546, 5.398106). distance[1,1] is 0 because that's how far point 1 is from itself. distance[1,2] is 1.294845 because that's how far point 2 (3.183643, 4.387974) is from point 1. The distance between two points \( A(x_{1},y_{1}) \) and \( B(x_{2},y_{2}) \) is calculated using the Euclidean distance formula:

$$ AB = \sqrt{(x_{1}-x_{2})^{2} + (y_{1}-y_{2})^{2}}$$

$$ 1.294845 = \sqrt{(2.373546-3.183643)^{2} + (5.398106-4.387974)^{2}}$$

Once we have our distance matrix, we can find the k-nearest neighbors for a particular point by sorting the row from smallest to greatest distance. We can use the order function to help us do this. For example, let's find the 5 nearest neighbors for the first point:

indices <- order(distance[1,])[2:6]
indices
[1] 46 3 36 13 29
distance[1,indices]
46        3         36        13        29 
0.1796931 0.2167983 0.2212696 0.2916801 0.3561144

The order function returns the index positions of the nearest neighbors in the distance matrix. The index position can also be thought of as the identifying point number. So the closest neighbor of point 1 is point 46. The second-closest neighbor of point 1 is point 3. The command "distance[1,indices]" shows us the 5 closest points and their distances. What are their labels?

df[indices,'Label']
[1] 0 0 0 0 0

They're all 0's. If we wanted to predict the label of point 1, we could use majority rule. In this case since the 5-nearest neighbors are all labeled "0", we would predict point 1 is also labeled "0". (In truth, it is.) The authors carry this out for the entire dataset of 100 points and correctly predict the label for 93 of them given their 5-nearest neighbors.
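
Here's one way that full pass could look using the distance matrix from above (my own sketch, not the book's exact code; it assumes the Label column is stored as 0/1 numbers):

# predict each point's label from its 5 nearest neighbors by majority vote
k <- 5
preds <- sapply(1:nrow(df), function(i) {
  nbrs <- order(distance[i,])[2:(k + 1)]      # skip the point itself
  ifelse(mean(df[nbrs, "Label"]) > 0.5, 1, 0) # at least 3 of the 5 must be 1s
})
sum(preds == df$Label) # the book reports 93 of 100 correct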

The case study in this chapter is recommending R packages. The dataset is a list of 2487 packages, listed 50 times over for 50 programmers, with an indicator of whether or not that package is installed for the programmer. Using kNN they build a list of 25 recommendations for each programmer. Here's the code.

Next up is Chapter 11, Analyzing Social Graphs. It looks like the case study will involve building an engine to recommend people to follow on Twitter. I guess I'm a poopy stick-in-the-mud because this doesn't appeal to me. I'm not really into the Twitter. But I'll keep an open mind, slog through it and learn as much as I can.

Machine Learning for Hackers, Chapter 9

In this chapter we learn a little about Multidimensional Scaling, or MDS for short. Like PCA in Chapter 8, MDS is a topic usually taught in multivariate statistics classes. I guess it’s classified as machine learning because of the computing required to do it, though the same can be said of almost any statistical method. The MDS that is presented is the classical type (as opposed to the non-metric type). The case study is exploring similarity in the US Senate based on votes. As it turns out, a proper MDS analysis shows Democrats and Republicans indeed vote differently and are quite polarized. Not surprising but better than anecdotal evidence from some political pundit.

Now if you’ve been following my series of posts on this book (and I’m pretty sure you haven’t) then you know I don’t summarize the chapter but rather highlight the R code that I found interesting and new. But for the first time I really don’t have much to highlight. It’s a short chapter and the data for the case study required very little preparation. I was actually able to follow all the R code without running to the help pages and Google. Still, though, I feel like I should document something from this chapter.

How about reading in a bunch of files. I feel like this is something I will forget how to do one day, so I should probably write about it to help form some memory. First they define a variable that stores a path and then they call the list.files function to create a vector of file names:

data.dir <- "data/roll_call/"
data.files <- list.files(data.dir)

Next they use the lapply function to create one big list object that contains the contents of all those files, which happen to be Stata data sets:

library(foreign) # read.dta() comes from the foreign package
rollcall.data <- lapply(data.files, function(f) 
    read.dta(paste(data.dir, f, sep=""), convert.factors=FALSE))

So the lapply function applies the read.dta function to each element in the data.files vector, each of which is a file name. But in the read.dta function you'll notice the paste function is called to create a full path with the file name. Just dandy. That never fails to impress.

If you want to inspect some of the data, you can't just do head(rollcall.data). Since it's a list object you have to do something like this:

head(rollcall.data[[1]])

That shows the first 6 rows of the data stored in the first list element.

Later on they use the strsplit function to split first and last names of senators by comma. I'm pretty sure I've posted about this before, but it doesn't hurt to mention it again. Running a vector through strsplit returns a list. For example:

 strsplit(congress$name[1], "[, ]")
[[1]]
[1] "SHELBY" ""       "RIC"

If I just want the last name, I can do this:

 strsplit(congress$name[1], "[, ]")[[1]][1]
[1] "SHELBY"

Now the authors use the strsplit function in an sapply function, like this:

congress.names <- sapply(congress$name, function(n) 
strsplit(n, "[, ]")[[1]][1])

That gives us this:

 head(congress.names)
SHELBY, RIC HEFLIN, HOW STEVENS, TH MURKOWSKI,  DECONCINI,  MCCAIN, JOH 
"SHELBY"    "HEFLIN"   "STEVENS" "MURKOWSKI" "DECONCINI"    "MCCAIN"

Notice each element in the vector has a "name". Ugly, I think, though it doesn't matter in the context of what they do. But what if you wanted a clean vector of last names without the "names"? You can do this:

congress.names <- sapply(congress$name, function(n) 
strsplit(n, "[, ]")[[1]][1],USE.NAMES = FALSE)
head(congress.names)
[1] "SHELBY"    "HEFLIN"    "STEVENS"   "MURKOWSKI" "DECONCINI" "MCCAIN"

Much better. Notice the USE.NAMES = FALSE setting. Pretty self-explanatory what it does.

If you want to see the rest of their code, including how they carry out the MDS analysis and the MDS clustering plots, see their GitHub page. For more on how exactly MDS works, see Chapter 14 of A Handbook of Statistical Analyses Using R.