Fedora Users — Re: spamassassin doesn't seem to be using bayes

From: <akonstam@xxxxxxxxxxx>

On Fri, Oct 21, 2005 at 10:50:27AM -0700, jdow wrote:

From: "Alexander Dalloz" <ad+lists@xxxxxxxxx>

>Did you set in local.cf something like following?

>use_bayes 1
^^^ good
>auto_learn 1
^^^ IMAO that is poison unless you also change the threshold
scores for bayes to 'way out there'. These lines will do that:
---8<---
bayes_auto_learn_threshold_spam 20.0
bayes_auto_learn_threshold_nonspam 0.1

---8<---

The above is interesting since I would think that the default value 12
is too high. Your line says that auto_learning should not happen
unless the score is greater than 20. Why do you think that is good?
-------------------------------------------

It avoids false white-listing and false training. The "cost" of repair
for false training and white-listing is disproportionately high compared
to normally expected levels of spamassassin maintenance. Once you have
operated for a significant period of time you should be able to reduce
the scores to "stock" levels safely. If you watch spam scores and note
levels that are questionable you may be able to set scores even tighter
than stock. However, it is not uncommon to see hams from SOME sources
that score into the 20s. In my case that happens with LKML from time
to time. Rules that generally work on normal mail go crazy with patches
and kernel debug reports.
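
To make the effect of the two thresholds concrete, here is a simplified
Python sketch of the decision they control. It is only an illustration:
the function name autolearn_decision is made up, the handling of scores
exactly on a threshold is an assumption, and real SpamAssassin applies
extra conditions before it auto-learns anything, so treat this as a model
of the idea rather than of the actual code.

```python
def autolearn_decision(score, spam_threshold=20.0, nonspam_threshold=0.1):
    """Return 'spam', 'ham', or None (leave Bayes alone) for a message score.

    Simplified model: only the two thresholds are considered. Treating a
    score exactly equal to a threshold as "do not learn" is an assumption.
    """
    if score > spam_threshold:
        return "spam"
    if score < nonspam_threshold:
        return "ham"
    return None  # the wide middle band is never auto-learned


if __name__ == "__main__":
    # Compare the raised threshold (20.0) with the stock value of 12.
    for s in (-1.2, 4.9, 12.5, 23.0):
        print(s, autolearn_decision(s), autolearn_decision(s, spam_threshold=12.0))
```

With the threshold at 20.0, a ham that happens to score 12.5 (a kernel
patch, say) is left alone, while with the stock value of 12 it would be
learned as spam.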

I note that I am not the only person suggesting this on the spamassassin
users mailing list. The maintainers are mum on the issue for the most
part.

I also note that retraining Bayes on messages that already have high
Bayes scores seems to be pointless based on my own results. I train
only on messages that score low numbers of points and have Bayes
scores below 99. I also grab periodic bundles of ham to feed my Bayes
system when it starts getting imbalanced between ham and spam. At the
moment I have trained with about 10% of the numbers D. D.'s -D results
indicated he had. And I've never had to go find the WIKI page to learn
how to correct an auto-whitelist (don't use it at all) or bayes
screwup. This makes life easier for me. (I've never had an expire
go awry, either. Um, I've never RUN an expire. {^_-})
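
The manual training rule described above (train only on mail with a low
total score whose Bayes result is below 99) could be sketched roughly like
this in Python. The helper names and the max_score cutoff of 8.0 are
hypothetical; the post does not say what counts as "low", so pick a value
that suits your own mail.

```python
import re


def bayes_bucket(tests):
    """Return NN from a BAYES_NN rule name in `tests`, or None if absent."""
    for name in tests:
        m = re.fullmatch(r"BAYES_(\d+)", name)
        if m:
            return int(m.group(1))
    return None


def worth_training(score, tests, max_score=8.0):
    """Hypothetical filter: low total score and Bayes result below 99.

    `score` is the message's total score and `tests` is the list of rule
    names that hit. `max_score` is an assumed cutoff, not from the post.
    """
    bucket = bayes_bucket(tests)
    if bucket is not None and bucket >= 99:
        return False  # Bayes is already sure about this one
    return score <= max_score
```

For example, worth_training(6.1, ["BAYES_50", "HTML_MESSAGE"]) returns
True, while the same message with BAYES_99 in its test list is skipped.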

Let's see, the general wisdom on the users list is that the nonspam
threshold should be at least slightly negative if you have any rules
that hit ham preferentially. But 0.1 is probably OK. I solve the
threshold problem by using meatware. Once a day I sort the spam folder
by score and check out the lowest few scores and make a quick scan
for keywords that might indicate an interesting LKML message that was
mismarked. (Although all I usually do with that list is scan subjects
for the current "buzz", like real time precision clocks appearing to
run backwards because 2 seconds is long enough to wrap a 32 bit counter.)
The scanning is of modest interest. Sometimes I go through the email to
see what new rules might be called for or admire the new all time high
score for the machine from one of Leo's postings. (Drug spams with
base64 encoded bodies coupled with odd DNS entries are among his
signatures. He's headed for number one on ROKSO it appears. Even though
Ralsky was busted and (temporarily) shut down by the FBI Leo is still
number two on the top 10 list. He's sort of cute and deadly smart with
DNS tricks. He seems to be into drugs and kinky sex. He's managed a
message that hit 72 rules, 8 of which are zero score rules used in
meta rules, that ran to over 105 points. This was with a remarkably
short base64 encoded drug spam. I didn't feed it to bayes. My bayes
already knows all about the V drug and E Dysfunct issues.)
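
The once-a-day check of the lowest-scoring messages in the spam folder
can be approximated with a short Python script. This is a sketch under
assumptions: it assumes the spam folder is an mbox file, that each message
carries an X-Spam-Status header containing "score=" (or "hits=" in older
versions), and the function names are invented for the example.

```python
import mailbox
import re
import sys

# Matches "score=12.3" or the older "hits=12.3" inside X-Spam-Status.
SCORE_RE = re.compile(r"\b(?:score|hits)=(-?\d+(?:\.\d+)?)")


def spam_score(msg):
    """Extract the numeric score from X-Spam-Status, or None if missing."""
    m = SCORE_RE.search(str(msg.get("X-Spam-Status", "")))
    return float(m.group(1)) if m else None


def lowest_scoring(messages, count=5):
    """Return (score, subject) pairs for the `count` lowest-scoring messages."""
    scored = [(spam_score(m), str(m.get("Subject", ""))) for m in messages]
    scored = [(s, subj) for s, subj in scored if s is not None]
    return sorted(scored, key=lambda pair: pair[0])[:count]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python lowest.py /path/to/spam.mbox
    for score, subject in lowest_scoring(mailbox.mbox(sys.argv[1])):
        print(f"{score:6.1f}  {subject}")
```

The subjects it prints are the ones worth a quick look for mismarked
list mail. Messages without a parsable score are simply skipped.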

I also don't use tools such as amavis or milters. Plain old procmail
is remarkably transparent about what it does. If I ask spamassassin for
a specific markup I get that markup and not what some other futility
dictates I'll get. (I can also do such perverted things as playing a
tune after procmail processes emails from customers. {^_-})

{^_^}   Joanne