In this chapter you build a naïve Bayes classifier to classify an email as spam or not spam (or “ham” as they call it) using only the body of the message. The idea is to take a bunch of spam messages and build a Term Document Matrix (TDM). A TDM is a big grid where the row labels are the terms in the spam emails and the column headers are the spam emails. If row 21 is “viagra” and column 20 is the 20th spam email, then cell [21,20] holds the number of times the word “viagra” appeared in that email. Using the TDM they create a data frame which contains the terms and the percentage of documents in which each term occurs. For example, of the 500 spam emails, the term “html” appears in 33% of the messages. That gets treated as a conditional probability: the probability of seeing the term “html” in an email, given that the email is spam, is 0.33. The same process is carried out for emails that are not spam. So we end up with two data frames of terms (one for spam and one for not spam) and their estimated probabilities of occurrence.
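To make that step concrete, here’s a minimal sketch (my code, not the book’s) of building a TDM and the term/occurrence data frame, assuming the tm package; the toy message bodies and object names are made up.

```r
library(tm)

# toy stand-ins for the spam message bodies (made up for illustration)
spam.bodies <- c("buy viagra now",
                 "cheap viagra html offer",
                 "click the html link for free stuff")

# build the term-document matrix: rows are terms, columns are documents
spam.corpus <- Corpus(VectorSource(spam.bodies))
spam.tdm <- TermDocumentMatrix(spam.corpus)
spam.matrix <- as.matrix(spam.tdm)

# fraction of documents in which each term appears -- the "occurrence" probability
spam.occurrence <- rowSums(spam.matrix > 0) / ncol(spam.matrix)
spam.df <- data.frame(term = rownames(spam.matrix),
                      occurrence = spam.occurrence,
                      row.names = NULL,
                      stringsAsFactors = FALSE)
```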
To classify a new email as either spam or not spam we compare it to both data frames and see which terms in the email are in the respective data frame. If a term is in the data frame, we pull out its probability. If not, we assign it a low probability of 0.000001. We multiply the probabilities together along with the prior probability of the message being spam, which they set as 0.5 for starters.
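Here’s a minimal sketch of that scoring step in R. It follows the description above (a flat 0.000001 for unseen terms, a 0.5 prior), but the function and the named probability vectors are my own invention, not the book’s code.

```r
# score a message (a character vector of words) against one class
# term.probs: named numeric vector of occurrence probabilities for that class
classify.score <- function(words, term.probs, prior = 0.5, unknown = 0.000001) {
  p <- term.probs[words]   # NA for words not found in that class's data frame
  p[is.na(p)] <- unknown   # unseen words get the tiny default probability
  prior * prod(p)
}

# compare the spam score to the not-spam ("ham") score and take the larger
# (spam.probs and ham.probs are hypothetical vectors built from the two data frames)
# is.spam <- classify.score(words, spam.probs) > classify.score(words, ham.probs)
```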
For example, say we have a message with 5 words in the body. None of the words in the email appear in the spam data frame, so we calculate \( 0.5*0.000001^{5}\). However, say two of the words appear in the good email data frame, each with a probability of 0.05 of occurring. So we calculate \( 0.5*0.05^{2}*0.000001^{3}\). Since the latter calculation is greater than the former, we classify the email as not spam. That’s how you do naïve Bayes classification.
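Running those two products in R just to see the scale:

```r
0.5 * 0.000001^5            # spam score:     5e-31
0.5 * 0.05^2 * 0.000001^3   # not-spam score: 1.25e-21
```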
Now if you’re like me you’re probably thinking, where’s the denominator? I mean, I’m no Bayesian expert, but I do know Bayes’s theorem has a denominator. Well it turns out the denominator is the same in both calculations so they drop it. (I have to give a shout out to this guy’s explanation. That’s where I went to figure it out.) You’ll also notice we’re multiplying probabilities as if they’re independent. That’s the naïve assumption here.
Using the language of probability, here’s Bayes’s Theorem for two terms, where A = message is spam, B = some word, and C = some other word:
$$ P(A|B \cap C ) = \frac{ P(B|A \cap C ) P(C|A) P(A) }{ P(B|C) P(C) }$$
For not spam this is…
$$ P(A^{\prime} | B \cap C) = \frac{P(B|A^{\prime} \cap C)P(C|A^{\prime})P(A^{\prime})}{P(B|C)P(C)}$$
Notice the denominators are the same, so we can drop them when comparing the two. Since we’re also assuming the words are independent given the class, the numerators simplify to \( P(B|A)P(C|A)P(A)\) and \(P(B|A^{\prime})P(C|A^{\prime})P(A^{\prime})\), respectively.
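With those simplifications, the decision for this two-word example boils down to comparing

$$ P(B|A)P(C|A)P(A) \quad \text{vs.} \quad P(B|A^{\prime})P(C|A^{\prime})P(A^{\prime}) $$

and picking whichever class gives the larger product.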
The book does not explain the classification algorithm in the detail I just did. I guess since the book is “for hackers” they didn’t want to get bogged down in math formulas and explanations of Bayes’s Theorem. I get that. But still, I wanted to understand why it worked and not just how to do it in R.
The R code itself for this chapter is not too hard to understand. A few nuggets I got out of it are…
The readLines() function will return each line of text as a separate element of a character vector.

The dir() function will return a listing of all file names in a directory.

The intersect() function will return the values in common between two vectors.
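For instance (the message path below is hypothetical; point it at one of the actual files):

```r
# dir(): list the file names in a directory (here, the current working directory)
files <- dir(".")

# readLines(): each line of a text file becomes one element of a character vector
# (hypothetical path -- substitute a real message file)
msg <- readLines("data/spam/00001.msg")

# intersect(): the values two vectors have in common
intersect(c("html", "viagra", "free"), c("free", "meeting", "html"))
#> [1] "html" "free"
```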