A few months ago, Paul Graham lit a fire under the anti-spam community with his, A Plan for Spam web page.
Paul uses so-called "naive Bayes" to filter spam astoundingly well. Others have created such filters to cope with last year's exponential growth in spam traffic. Field experience is supporting strong claims of effectiveness for these types of filters.
For the past several months, I have been using Spam Assassin to filter email for a system that delivers email summaries to phones. Spam Assassin works reasonably well for a wide variety of email. I have used it on my own email and have a pretty good feel for its characteristics. But it suffers from some deficiencies for my system use:
"Naive Bayes" filters can be really fast. And they can be very, very simple. Too, they can be used to personalize email filtering. This latter capability is also important.
To gather first-hand knowledge about these Bayes filters, I wrote some Perl modules, the main program for which is named TZBayesClassifier.pm.
What follows is a summary of, and some discussion about, what I have found.
TZBayesClassifier.pm learns and classifies emails from Netscape/Unix mailbox files.
Unix:
From whoFrom@whereFrom.com Mon Feb 3 01:02:28 2003
Netscape:
From - Sat Sep 28 09:55:41 2002
The key thing is that the lines begin with "From " and contain a date.
TZBayesClassifier.pm uses EmailToke.pm to tokenize each email.
Which parts of the email does EmailToke.pm tokenize?
EmailToke.pm throws away Message-ID headers.
EmailToke.pm replicates certain header lines: subject, to, from, etc. This effectively ups the counts for whatever tokens are in these particular headers.
EmailToke.pm decodes URLs that appear to have URL-encoded characters. E.g. "&20" is converted to SPACE.
EmailToke.pm splits each token by dot/period characters and dash/hyphen characters. Both the whole token and the split-parts are included in the token list.
Notes:
Plain text and HTML are de-mimed, though - both quoted printable and base-64. The thinking behind doing so is that the spammer can too easily randomize the mime encoding. Also, tokenizing base-64 encoded text is a bit more difficult than tokenizing plain text.
The two text lines from attachments are grabbed in an attempt to find tokens that identify viruses, and to recognize little image files that some people attach to their emails as a kind of signature.
The punctuation characters are included in the tokens because they are likely to be in URLs and in certain HTML character sequences.
URL decoding is done because so many particular URLs scream, "SPAM!!!". Run your own email and spam through TZBayesClassifier and look for "http://66." in the token memory list. Can you say, "spam?" Of course, after a while, the 32 (or 64) possible encodings of this URL fragment would all mark spam, but ... why wait?
TZBayesClassifier.pm then puts the tokens through some more processing:
Tokens that are all-numeric are thrown away.
Tiny tokens (3 characters and under) are concatenated together with previous tiny tokens until the resulting token is long enough to be included in the token list.
Long tokens (over 32 characters) are truncated.
Notes:
I have not tested the effects of converting tokens to lower case any more than to know that it does not seem to have a significant effect on the 20,000+ email tests. But that result is not too surprising. With that many emails, unique tokens are easy to come by. This logic might have an effect upon early learning, though.
Concatenation of tiny tokens might help with "V I A G R A". In any case, it did not have much effect on my email/spam data.
Long tokens are truncated instead of being thrown away so that the virus/attachment, base-64 text, and long URLs are included.
If the email is being learned, TZBayesClassifier.pm bumps the count for the email's category for each instance of tokens contained in the email.
Or, if the email is to be classified, TZBayesClassifier.pm does the "Bayes" logic with the email's tokens.
TZBayesClassifier.pm takes the most "distinct" tokens in the email and calculates the probability that each token is in each known category. Then TZBayesClassifier.pm "sums" the probabilities to find how probable it is that the email is an instance of each known category.
What constitutes a "distinct" token?
More exactly, the tokens are sorted by their highest absolute probability for any category (e.g. spam or "ham"). If the highest probability is the same for two tokens, then the token with the highest raw count is considered to be more "distinct".
TZBayesClassifier.pm does some magic arithmetic to normalize the probabilities. But, of course, no matter what it does, the probabilities that it calculates are wrong.
Naturally, there are quite a number of settings and variants that can be made to this general logic. I tried quite a few. Some are mentioned below. The rest have notes about them in the source code.
I used the following test data:
The emails were actually sent to 2 or 3 of my "guises": work #1, work #2, personal. That sort of thing.
A lot of these emails are software order notifications and mail-list content. Such email is quite similar from email to email.
This spam was mostly sent to my tranzoa.com accounts. But some were to family members and some were to a couple of "work" accounts of much different ilk. (One, for instance, is a buttoned down, corporate kind of account. Another is a chat-room, Instant Message kind of account.)
By "validated", I mean that I had specifically designated each and every one of these emails as spam. Not surprisingly, these experiments with TZBayesClassifier have proven the imperfection of my own eyes and mind.
These "spams" include quite a lot of virus-containing emails. If you interact with a wide variety of people from all over the world (as I do in one of my work guises), you will often receive several virus emails (e.g. Klez) in a day.
All of my spam has had Spam Assassin headers and decoration removed.
Then, with various settings, I told TZBayesClassifier to learn one half of the spam and one half of the "ham". After it learned from these two halves, it classified the other halves.
It did not take long to find settings that yielded extremely good false positive (innocent victim) numbers. That is, I found settings that, for instance, considered zero to two of the 13,000+ "hams" to be spam. Given that quite a few of the "hams" are actually on the edge of spamminess, that's almost too good.
It took longer to find settings that brought spam filtering above 95%. Eventually, spam filtering got slightly above 95% at a .9 probability cutoff, and around 98% at a .5 probability cutoff. Both cutoffs did fine on the false positive front (e.g. zero emails at .9, and two emails and .5).
Now, these results are not a big surprise. 20,000 emails is quite a learning experience for anyone, even a simple computer program. So it's not surprising that the learning was effective.
I did learn a few things, myself, though:
Generally speaking, lower values tend to yield results that are the same no matter what the probability-of-spam cutoff. That is, either emails are considered strongly spam or not. Not many emails are considered ambiguous.
Higher values for this number of distinct tokens resulted in more emails being considered ambiguous.
This makes sense, intuitively. So it made for a nice bit of validation of the program logic.
To be more specific, a few tokens were used to classify emails a lot. Most were not used at all.
I tried a couple of other tricks to mitigate the importance of, say, a token with a spam count of 10 and a ham count of 0 being considered more distinct than, say, a token with a spam count of 700, and a ham count of 1.
The results for each of these tricks were disappointing.
But, some of the original spark for writing TZBayesClassifier was to find out if a "naive Bayes" filter could be used to pick a good token set for use by an email "signature" algorithm. So, this CRC trail remains to be followed.
So, I downloaded a sample of spam from Spam Archive.
And, I ran it through the email that had been taught my own spam and "ham".
Disaster!
In early runs, anywhere from 20% to 40% of the Spam Archive spam was classified as spam. The rest of those emails fooled the filter.
It did not take long to find out a reason for many of the false negatives: the Spam Archive spam contains many emails with spam filter headers and decoration. Since my own spam has all of that removed, and since some of my own "hams" contain such headers and decoration, ... Well, the results were what one would expect.
Did I strip such headers and decoration from the Spam Archive spam? Did I strip such emails from my own "ham"?
No and no.
Maybe someday. But, my computers were getting tired by then, and refused to do it.
Just kidding. It was I who didn't want to do it.
Heck, computers are here to do our work, not the other way around.
And, anyway, I didn't want to "get cute" with the data.
I extracted two random selections of 500 emails from my spam and ham. I split the Spam Archive spam into two randomly chosen parts of ~500 emails a part.
I use these data sets in this way:
Well, what happened? !!!!
This logic dumbly trigrams the email text, HTML, replicated headers, and 2-attachment-lines.
Of course, the unique token count explodes with n-gramming.
Turns out, n-gramming does pretty well all alone. I found that using the top 40 n-gram distinct tokens was best. Things degrade very slowly when more tokens than 40 are considered. And below 20, things get pretty bad.
For some of the "foreign" spam tests (either learning from my own spam or from the Spam Archive spam, and then classifying the other), normal tokens in combination with 3-grams worked best, 3-grams alone were a close 2nd best, and using just normal tokens was notably worse. Gut feeling: 3-grams helped with the spam-header problem. 3-grams that came from spam headers played a significant role in classification, for instance, but their effect seemed to be diluted.
Anyway, since n-gramming and Perl don't go together very well, these 3-gram tests were just a quick way to get a sense of how a poor-man's CRM114 might work. It's worth looking in to more, I think.
So what happens if the program is required to be a good filter right from the start?
I used the following logic (in TestIncTZBLearn.pl):
The result: An email or two are misclassified in the first 10 emails. After a 100 emails, as many as 5 emails have been misclassified - more or less evenly divided between spams or hams. After 200 emails there generally is another email that has been misclassified. And so on.
In other words, the logic is so good that the measurements are at the noise level.
Does it make a difference how many distinct tokens are used?
The essential thing is that the logic works well enough that "improvements" hinge on individual emails and cases - not on any global "smarts."
| Type | Spam Probability Cutoff | Goof Percentage | Goof Count | Total Emails |
|---|---|---|---|---|
| Ham | .50 | 01.00% | 5 | 500 |
| Ham | .90 | 01.00% | 5 | 500 |
| Spam | .50 | 02.20% | 11 | 500 |
| Spam | .90 | 00.60% | 3 | 500 |
| Spam Archive | .50 | 19.66% | 90 | 458 |
| Spam Archive | .90 | 38.65% | 177 | 458 |
Now comes the good part: those 5 false positives among the hams? 4 of them are ambiguous spams. I could have easily lived without any of those four. The one definite false positive is a very, very strange email from someone in France. Yes, yes. I know what you are saying: "That explains it!" But it was a real, solid, important email. So perfection is not currently the state of the art.
Those 177 Spam Archive "hams"? Well, a quick sort shows that the top distinct token for 87 of them is "x-spam-status:". 54 of the 90 "hams" had "x-spam-status:" as their top token. You get the picture. By a quick look at a few of the emails, it appears that almost all of the mistakes with the Spam Archive spams are because of spam headers and decoration!
Turns out that after the first 50 emails or so (of a random spam or ham selection), around 20% or 25% of the unique tokens in an email are new. After a few hundred emails, that percentage of new tokens can drop below 20% relatively often. After 2000 emails, the new token percentage is in the teens, and the percentage of emails that have fewer than 5% new tokens has gone up to the 5% to 10% range. After 4000 emails, that percentage of such emails that don't have 5% new tokens is up in the 12% to 18% range. That is, as far as new-token tracking is concerned, 12% to 18% of the emails do not add value. Many of these emails are probably "dupes," in fact. After 6000 emails, those been-there-done-that emails comprise somewhere from 25% to 35% of all emails. And the average percentage of new tokens for emails has dropped to 6% to 12%.
Interestingly, I found that my own spams became more "familiar" in this new-token sense than the hams. At the 8000 email mark, hams were still providing new tokens at the kinds of rates that spams were doing in the 4000's.
That surprised me because the hams contained so many emails that came from automated notification systems. Apparently, these emails just had enough unique information (transaction ID's, names, email addresses, etc.) that they continued to provide new, though useless, tokens.
Can a Bayes filter be used to fill a good email hash function token list?
Bayes filters quickly find really popular, but unexpected, tokens.
Sounds like a match made in heaven.
Everybody's favorite question: What about combining tokens? Or using other combinations of tokens than simple sequential appearance in the email?
Putting the filter under stress might reveal reasons to do so. Trying to classify "foreign" email is an example of stress, for instance.
Does the little trick work that TZBayesClassifier uses to recognize viruses?
Frankly, I've not looked to see if this works. ... Or is needed!
But, for a general purpose spam filter, filtering viruses has gotta be about number one in the list of good things to do.
Which tokens should be thrown away if tokens must be aged so that useless tokens can be flushed from memory to save space.
But, does it make more sense, for instance, to throw away tokens that have been found only in emails that didn't provide many new tokens to memory?
Or vice versa.
The danger to flushing logic is that, after some critical token has been flushed, at a later time, many emails from a different category (than that which the token was found in before it was flushed) will be learned.
My current thinking on the best thing to do about this is to use two token memorys. Starting out, the first memory would be used for classification and would learn, too. The second token memory would start learning for first time when the first memory is "half" full. Then, when the first memory is "full", the second memory would be used for classification, and the first would be wiped clean and start learning again. Ping, pong...
Perl source code for TZBayesClassifier:
http://www.tranzoa.com/Download/tzbayes.zip
| B. Alex Robinson | February 17, 2003 |