k-nearest neighbors (kNN) is the topic of chapter 10. The authors say it’s the most intuitive of the machine learning algorithms covered in the book and I guess I would agree, provided you understand the euclidean distance formula, understand what a distance matrix is, and know a little about how R works.
They provide a toy dataset (named “df”) to demonstrate how kNN works. It looks like this:
X Y Label 1 2.373546 5.398106 0 2 3.183643 4.387974 0 3 2.164371 5.341120 0 4 4.595281 3.870637 0 5 3.329508 6.433024 0 6 2.179532 6.980400 0
Think of the rows as points lying in an XY plane. Notice they also have labels. The dataset has 100 rows, with 50 labeled 0 and 50 labeled 1. To run the kNN algorithm, we need to create a distance matrix that stores the distance between all the points. So row 1 of the distance matrix would have the distances of all points from the first point (2.373546, 5.398106). Row 2 would have the distances of all points from (3.183643, 4.387974). And so on. The authors actually write a function to create the distance matrix, which is good for illustrating how the distance matrix is created, but I want to note you can use the built-in “dist” function for this:
distance <- as.matrix(dist(df, diag=TRUE, upper=TRUE))
I set the "diag" and "upper" parameters to TRUE in order to match the matrix they create. The result is this:
distance[1:3,1:3] 1 2 3 1 0.0000000 1.294845 0.2167983 2 1.2948454 0.000000 1.3954937 3 0.2167983 1.395494 0.0000000
That's just a portion of the matrix. The full matrix is 100 x 100. The first row (or first column) refers to point 1 (2.373546, 5.398106). distance[1,1] is 0 because that's how far point 1 is from itself. distance[1,2] is 1.294845 because that's how far point 2 (3.183643, 4.387974) is from point 1. The distance between two points A() and B() is calculated using the Euclidean distance formula:
$$ AB = \sqrt{(x_{1}-x_{2})^{2} + (y_{1}-y_{2})^{2}}$$
$$ 1.294845 = \sqrt{(2.373546-3.183643)^{2} + (5.398106-4.387974)^{2}}$$
Once we have our distance matrix, we can find the k-nearest neighbors for a particular point by sorting the row from smallest to greatest distance. We can use the order function to help us do this. For example, let's find the 5 nearest neighbors for the first point:
indices <- order(distance[1,])[2:6] indices [1] 46 3 36 13 29 distance[1,indices] 46 3 36 13 29 0.1796931 0.2167983 0.2212696 0.2916801 0.3561144
The order function returns the index positions of the nearest neighbors in the distance matrix. The index position can also be thought of as the identifying point number. So the closest neighbor of point 1 is point 46. The second-closest neighbor of point 1 is point 3. The command "distance[1,indices]" shows us the 5 closest points and their distances. What are their labels?
df[indices,'Label'] [1] 0 0 0 0 0
They're all 0's. If we wanted to predict the label of point 1, we could use majority rule. In this case since the 5-nearest neighbors are all labeled "0", we would predict point 1 is also labeled "0". (In truth, it is.) The authors carry this out for the entire dataset of 100 points and correctly predict the label for 93 of them given their 5-nearest neighbors.
The case study in this chapter is recommending R packages. The dataset is a list of 2487 packages, listed 50 times over for 50 programmers, with an indicator of whether or not that package is installed for the programmer. Using kNN they build a list of 25 recommendations for each programmer. Here's the code.
Next up is Chapter 11, Analyzing Social Graphs. It looks like the case study will involve building an engine to recommend people to follow on Twitter. I guess I'm a poopy stick-in-the-mud because this doesn't appeal to me. I'm not really into the Twitter. But I'll keep an open mind, slog through it and learn as much as I can.