Monday, October 13, 2003

Spam Stuff


Spam... lovely stuff... I get around 2500 messages per month that are spam, another few hundred that are mailing list memberships, and of course, the dozens of daily work e-mails. (2200 messages during 2002... and around 1200 so far for 2003.)

To deal with the deluge, I work off of whitelist rules combined with Bayesian filtering.

My first layer of rules is designed to catch newsletters. Some newsletters end up in their own sub-folder (e.g. SAList, SourceOffSite, Declan McCullagh), others get lumped into a genre sub-folder (XML, TechNews, Database). Newsletters are usually easy to filter, either on a string that always appears in the subject line or on the e-mail address they always come from. The goal of the first layer of rules is to be as specific as possible.

The second layer of rules is my domain whitelist. For work, this means any e-mail from our (multiple) e-mail domains gets left in my inbox with a pop-up window notifying me that I have work e-mail.

The third layer is the personal whitelist - friends, associates, family all get added to this rule so that when e-mail comes from those addresses it gets left in my inbox.

Everything else gets shoved into a bulk mail folder, which at this point contains 90-99% spam (or messages that I'm not interested in). However, in order to avoid missing someone's e-mail, I have to periodically go through this folder and retrieve stray e-mails. It finally got to the point where I was just deleting the entire contents of the folder every few days rather than sorting through it.
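A rough sketch of how those layers stack up, written in Python rather than as actual Outlook rules. The subject strings, addresses, and domains are invented placeholders; only the folder names come from the description above. The point is simply that the rules are checked in order, most specific first, and anything that matches nothing falls through to bulk mail.

```python
# All rule data below is made up for illustration; only the folder
# names (SAList, TechNews, Inbox, Bulk Mail) come from the post.
NEWSLETTER_RULES = [
    # (field to check, substring to look for, destination folder)
    ("subject", "[salist]", "SAList"),
    ("sender", "news@technews.example", "TechNews"),
]
WORK_DOMAINS = {"work.example", "work-corp.example"}
PERSONAL_WHITELIST = {"friend@mail.example"}


def route(sender: str, subject: str) -> str:
    """Return the folder a message lands in, checking the layers in order."""
    sender = sender.strip().lower()
    fields = {"sender": sender, "subject": subject.lower()}

    # Layer 1: newsletters, matched as specifically as possible.
    for field, needle, folder in NEWSLETTER_RULES:
        if needle in fields[field]:
            return folder

    # Layer 2: anything from a work domain stays in the inbox.
    domain = sender.rpartition("@")[2]
    if domain in WORK_DOMAINS:
        return "Inbox"

    # Layer 3: the personal whitelist also stays in the inbox.
    if sender in PERSONAL_WHITELIST:
        return "Inbox"

    # Everything else is shoved into bulk mail.
    return "Bulk Mail"
```

For example, `route("stranger@spam.example", "Buy now")` falls through all three layers and returns `"Bulk Mail"`.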

Which is where Bayesian filtering comes into play... using SpamBayes, I run the messages in the bulk mail folder through a Bayesian filter to separate the real mail from the spam. Since Bayesian filtering is about 95% accurate and my bulk mail folder is already about 95% certain to be spam, I have pretty high confidence that what SpamBayes tags as spam is really spam.
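To put a rough number on that confidence, here is a back-of-the-envelope Bayes' rule calculation. It assumes "95% accurate" means the filter is right 95% of the time on spam and on legitimate mail alike, which is a simplification of mine and not a measured SpamBayes figure.

```python
def posterior_spam(prior: float, accuracy: float) -> float:
    """P(message is spam | filter tagged it spam).

    Assumes the filter is equally accurate on spam and on legitimate
    mail (an illustrative simplification, not a SpamBayes statistic).
    """
    true_pos = prior * accuracy               # spam, tagged as spam
    false_pos = (1 - prior) * (1 - accuracy)  # legitimate, tagged as spam
    return true_pos / (true_pos + false_pos)


# 95% of the bulk folder is spam, and the filter is 95% accurate.
print(round(posterior_spam(0.95, 0.95), 4))  # 0.9972
```

Under those assumptions, only about 3 in every 1000 messages tagged as spam would be real mail.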

The SpamBayes plug-in for MS Outlook works moderately well... except that I often have to manually tell it to filter messages because it doesn't notice when messages get dumped into the bulk mail folder.

posted by Wuphon's at 11:21 AM
