From: "Nigel Henry" <cave.dnb@xxxxxxxxxx>
I can't say I'm too clued up on the finer points of spam filtering, but am willing to learn. Ideally spam should be stopped at source, but I don't suppose there's much chance of that happening.
I can give a data dump. But I am not at the level of expertise of the BogoFilter authors or the SpamAssassin authors. There are some really fine writeups of the filtering techniques in SpamAssassin that have been done over the years. It has a state of the art Bayes filter and it supports rules for things that indicate spam.

I think I mentioned this before. But there are some MUAs that will accept a text/plain base64 encoding that is really a GIF image and display it. So an obvious "rule" comes to mind. GIF files have a common lead-in; otherwise they could not be decoded. So the first 6 or 8 BYTEs of the BASE64 are "obvious". If you can detect them anywhere in a message as the start of a BASE64 coding section, which is not exceptionally hard to do with SA, you can create a __RULE type rule that defines this detection. Then you can set up another __RULE type rule that detects the MIME type: it fires if the type is image/gif. (The __RULE type rules figure in META rules but not in the final score.) So we have __IS_GIF and __MIME_IS_GIF for a pair of rules. We then create a META rule and give it a perhaps hefty score:

    meta JD_MISPLACED_GIF (__IS_GIF && !__MIME_IS_GIF)

Bang, it's captured. There's no way a Bayes engine alone can do this sort of trick. This is why I ended up dismissing Bayes-only filters several years ago when I surveyed the situation.

Rules alone can be quite good. This is especially true since SA allows plugin code modules and it allows block list testing WITH SCORES. That latter is critically important. "SORBS" is quick; it is also very dirty. So when I use its general list I give it only a modest score, well below the spam threshold. The SURBL lists are slower to react and much better with respect to false positives. I use them with a higher score. A filter that uses only Bayes misses out on rule based flexibility and these block lists.
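The rule pair and the scored block lists above can be sketched in a few lines of Python. This is only an illustration of the logic, not SpamAssassin's rule syntax or its internals. The score values, the threshold of 5.0, the block list names in the table, and the function names are all assumptions made for the example, and block list hits are passed in by the caller rather than looked up over DNS.

```python
from email import message_from_string

# Base64 of b"GIF87a" and b"GIF89a": six magic bytes become eight base64 chars.
GIF_B64_PREFIXES = ("R0lGODdh", "R0lGODlh")

# Hypothetical scores and threshold, chosen only for illustration.
SCORES = {
    "JD_MISPLACED_GIF": 4.0,  # the meta rule gets a "perhaps hefty" score
    "SORBS_GENERAL": 1.0,     # quick but dirty list: modest score
    "SURBL": 3.0,             # fewer false positives: higher score
}
SPAM_THRESHOLD = 5.0


def leaf_parts(msg):
    """Return the non-multipart MIME parts of a parsed message."""
    return [part for part in msg.walk() if not part.is_multipart()]


def is_gif(msg):
    """Stand-in for __IS_GIF: a base64 section starts with the GIF lead-in."""
    for part in leaf_parts(msg):
        encoding = str(part.get("Content-Transfer-Encoding", "")).strip().lower()
        payload = part.get_payload()  # still base64 text, not decoded
        if encoding == "base64" and isinstance(payload, str):
            if payload.lstrip().startswith(GIF_B64_PREFIXES):
                return True
    return False


def mime_is_gif(msg):
    """Stand-in for __MIME_IS_GIF: some part is declared as image/gif."""
    return any(part.get_content_type() == "image/gif" for part in leaf_parts(msg))


def score_message(raw_message, blocklist_hits=()):
    """Sum the scores of the rules that fire; sub-rules carry no score."""
    msg = message_from_string(raw_message)
    fired = {name for name in blocklist_hits if name in SCORES}
    # Meta rule: GIF data is present but no part admits to being image/gif.
    if is_gif(msg) and not mime_is_gif(msg):
        fired.add("JD_MISPLACED_GIF")
    total = sum(SCORES[name] for name in fired)
    return total, total >= SPAM_THRESHOLD, sorted(fired)
```

With these made-up numbers, a SORBS hit alone stays under the threshold, while the same hit on a message carrying a disguised GIF pushes it over. That is the point of scoring block lists instead of treating them as a hard reject.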
To be fair, if you run your own SMTP server to receive email, this is somewhat moot if you can use greylisting with scores on the block lists. That is often used to take load off the SA filters. One modest ISP, which features one of the SARE ninjas on its staff, works ONLY based on rules. Others use global Bayes to make the picture a little tighter, although this runs into the "one man's 'poison' is another man's gourmet soup" problem. (Think cilantro. For some men it's intolerably bitter.) Very large ISPs may have to trim down more than medium sized or smaller ISPs in this regard.

The boutique, small office, or home "ISP" has the easy luxury of per user Bayes. If the users are "smart enough" to train the Bayes, the whole system can become quite well tuned and effective. If it is small enough and there is time to toss in special whitelist or blacklist rules per user, you can really do well. (And I have a devastatingly accurate rule that is based on a 'peculiarity' about Earthlink that chops out one whole set of spam messages based on the TLD. But it requires a custom rule per user.)

That's the small data dump. Wandering around the SpamAssassin wiki might be interesting. Hm, I need to tweak some of the people who have done nice writeups to place them in an easy to find place in the wiki.

This is Dr. Curtis Kret's presentation from BH 04 (slow to load):
http://www.blackhat.com/presentations/bh-usa-04/bh-us-04-kret.pdf

It is an excellent introduction to the concepts of rule based anti-spam. And with the SpamAssassin Rules Emporium ninjas on the job you could consider it almost a Bayesian filter for complex spam features.

These are his presentation slides from Toorcon 2004:
http://spamassassin.apache.org/presentations/2004-09-Toorcon/html

And this page on the wiki site has some excellent writeups about SA's Bayes filter:
http://wiki.apache.org/spamassassin/BayesAccuracy

The spamassassin users mailing list is also a world of help.
That's probably more dump than I should have done. But somebody asked. It's almost an interesting game trying to win against these turkeys. I just wish the ultimate anti-spam could be legislated - open hunting season for convicted spammers.... Yeahhhhh!

{^_-} Joanne