Fedora Users — Re: Spamassassin and Spambayes

From: "Aaron Konstam" <akonstam@xxxxxxxxxxxxx>

On Sun, 2006-06-25 at 16:51 -0400, Claude Jones wrote:
After the big discussion a month or so ago, I decided to give spamassassin another fair try - this prompted by my having created self-imposed problems getting spambayes to work on my home FC5 pc. I use kmail - supposedly, kmail detects spamassassin and gives you the option of configuring/using it - it provides training buttons on the menu-bar, so you can go through your inbox and separate the spam; I also trained on a number of ham messages that had been filtered into various folders, training on several from each folder. After initially terrible results, my spam detection crept up to about 80%. Since I get over 200 spam messages daily, that still would leave over 40 messages to sift through daily, not so great. Today, I finally got Spambayes running; after initial training on about 200+ messages, I'm already getting around 95% spam detection...


Puny results, kid. {^_-} With SpamAssassin, rules, and a carefully hand
fed Bayes I'm not kidding when I say I get about one spam in 1000 that
creeps through. (And for the most part those train at near Bayes 0.50.)
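
For scale, the arithmetic behind those percentages can be sketched as below. This is only a back-of-the-envelope illustration: the 200-per-day volume and the 80% and 95% rates come from the quoted message, while reading "one spam in 1000" as a 99.9% catch rate is an assumption. The function name is made up for the example.

```python
def missed_per_day(spam_per_day, detection_rate):
    """Spam messages expected to reach the inbox each day."""
    return round(spam_per_day * (1.0 - detection_rate), 2)

# 200 spam/day is the volume from the quoted message.
# 0.999 assumes "one spam in 1000" means a 99.9% catch rate.
for label, rate in [("80%", 0.80), ("95%", 0.95), ("99.9%", 0.999)]:
    print(label, missed_per_day(200, rate))
```

At that volume the gap is between sifting through dozens of messages a day and seeing a stray one every few days.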

Am I missing something here? Is there a better way to train spamassassin?

Some people find it helpful to change the BAYES_99 test score to be equal to the spam cutoff
or slightly below it. If the spam cutoff value is 5.0, set the BAYES_99 test
value to 5.0 or 4.9.


NOOOOOOOOOOOOOOOooooooooooooooo!

If you are going to automatically train Bayes, widen the automatic
thresholds from the stock settings, at least at first. Once you have
the weight of a working Bayes behind you the stock settings might
work OK. I studied how the automatic classification system was
supposed to work, thought about it a little while, and decided I
am a big girl and can spoon feed SpamAssassin. Over the years I've
been running Bayes (I forget when it appeared. I first hit SA at
2.43 I think it was - or maybe even 2.2 something.) I've trained on
less than 2000 hams and 2000 spams. Bayes 99 alone catches 85% of the
spam and hits almost no ham. Bayes 80 and 95 account for another
almost 6%. The rest comes from the various rule sets I have running.
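
The automatic training decision discussed above can be sketched roughly as follows. This is a simplified model, not SpamAssassin's actual code: the real auto-learner applies further conditions beyond the two thresholds. The stock values of 0.1 and 12.0 are the documented defaults for bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam. The "wider" pair is made up for illustration, and taking "widen" to mean pushing the two thresholds further apart is an assumption.

```python
def auto_learn_decision(score, ham_below=0.1, spam_above=12.0):
    """Return 'ham', 'spam', or None (leave the message for manual training).

    Simplified: only the two score thresholds are modeled.
    """
    if score < ham_below:
        return "ham"
    if score > spam_above:
        return "spam"
    return None

# Stock thresholds versus a hypothetical wider pair (values invented here).
for score in (-1.0, 0.05, 6.0, 13.0, 20.0):
    stock = auto_learn_decision(score)
    wide = auto_learn_decision(score, ham_below=-0.5, spam_above=15.0)
    print(score, stock, wide)
```

With the wider pair fewer borderline messages get learned automatically, which leaves more of the feeding to the person doing the training.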

I suppose I should feed the Bayes a little more. I've seen it doing
better. But at the scoring I have (Bayes 99 is 5.001) I see such good
results I am in the "if it ain't broke, don't fix it" mode. {^_-}
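
To see why the difference between 4.9 and 5.001 matters, here is a small sketch of threshold scoring. It assumes the usual model in which the scores of all rules that hit are summed and the message is tagged when the total is at or above the cutoff. The extra rule name and its score are invented for the example.

```python
def is_spam(hits, scores, cutoff=5.0):
    """Sum the scores of the rules that hit and compare with the cutoff."""
    total = sum(scores[name] for name in hits)
    return total >= cutoff

# SOME_OTHER_RULE and its 0.5 score are hypothetical.
low = {"BAYES_99": 4.9, "SOME_OTHER_RULE": 0.5}
high = {"BAYES_99": 5.001, "SOME_OTHER_RULE": 0.5}

print(is_spam(["BAYES_99"], low))                     # Bayes alone at 4.9
print(is_spam(["BAYES_99", "SOME_OTHER_RULE"], low))  # Bayes plus one more rule
print(is_spam(["BAYES_99"], high))                    # Bayes alone at 5.001
```

At 4.9 a BAYES_99 hit needs help from at least one other rule; at 5.001 it tags the message by itself.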

Another philosophy suggests periodically retraining spamassassin on spam
that has already been identified as such by spamassassin.

There are many other spamassassin tests you can fool with.

Look at: http://lwn.net/Articles/172491/
and you will see that, when tested, spamassassin beats spambayes in spam
identification. I don't want any flame wars on this but I stick to my
assertion that when used properly spamassassin is philosophically either
equal to or slightly better than spambayes in detecting spam. That is,
spambayes's approach is no better than spamassassin's, so I would expect
comparable results.


Bayes alone has a dead band where you really aren't sure what you
have. Adding a rules system to the picture helps a whole lot. The
basic SpamAssassin rules are <cough> a little weak. They are designed
such that they won't badly screw up in a general purpose ISP setup
with system wide Bayes, or so it seems. (It also requires some luck
on initial Bayes training. Ideally about 1000 known spam and 1000 known
ham as initial feed stock would materially help before letting any
automation get into the act.)
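
The dead band can be pictured with a small sketch. The band edges of 0.40 and 0.60 are an assumption chosen for illustration, as is the function name. The point is only that a Bayes probability near 0.50 says nothing either way, so whatever the rules contribute decides the outcome.

```python
def bayes_verdict(probability, band_low=0.40, band_high=0.60):
    """Classify a Bayes spam probability, leaving a neutral dead band."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability < band_low:
        return "ham"
    if probability > band_high:
        return "spam"
    return "unsure"

# Band edges above are illustrative, not taken from any real configuration.
for p in (0.02, 0.50, 0.99):
    print(p, bayes_verdict(p))
```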

Now, I do have an unfair advantage here over what an ISP might
experience. I only have to filter about a dozen accounts for two
developer level people. So we both have an idea about how to spoon
feed our Bayes filters. That helps. {^_-} Another thing that helps
is that we both write rules, although Loren is better at that than
I am. He routinely contributes to the SARE rule sets, which I have
found to make a dramatic difference in performance. (I use up to
the x2 level rules, eschewing the ones that show some false positives.
I also use over 40 sets of rules.)

{^_-}