Fedora Users — Re: How do I teach Spam Assassin? [LONG]

From: "Jeff Vian" <jvian10@xxxxxxxxxxx>

> jdow wrote:
>
> >http://wiki.spamassassin.org/ is an astonishingly good place to learn
> >about the ins and outs of SpamAssassin. It also mentions the home pages
> >of various custom rule sets like 99_TripWire, BigEvil, and many others.
> >
> >It is well worth the visit.
> >
> >(I have "progressed" to the point that SA filters my mail. I read my
> >mail via a secure pop3 connection. I maintain spam, oldspam, ham, and
> >oldham folders on a special account via IMAP to mbox files. I have a
> >futility that filters off the message 1 the imap tool insists must be
> >there so that a cron job every night runs "sa-learn" for me. This is
> >all rather handy when I am running this email tool. When I get time I
> >plan to revisit tossing the special folders into my main email account
> >"safely". I understand more now than when I started. {^_-})
> >
> >
> >
> I am interested in how you do this.
> I use fetchmail to get my mail by pop3 from the isp and put it in my
> account on the linux box.  I then need to get it into the maildirs
> (instead of the default mailbox) so I can teach SA, but am unsure of the
> mechanics of making that happen.  Any pointers on getting the maildirs
> working will be greatly appreciated.

The data path is fetchmail to sendmail to procmail to spamc to mbox
output. (Sendmail queues the mail in /var/spool/mqueue as fetchmail
feeds it in. Then it farms the mail back through procmail for delivery
to the mbox file.) This part is easy and should be a slam dunk for most
folks here.

You must get spamd running, though. I use the spamassassin.org rpm
for 2.63. So it may be different for Fedora. But the key lines are:

#        SPAMDOPTIONS="-d -c -a -m5 -H"
        SPAMDOPTIONS="-d -c -m10 -H"

The first is commented out. The second is only partly silly. (I see I
should change the -m option back for sanity. I have a slow machine plus
massive mail chunks coming in (chiefly from LKML) that lead to imap
becoming a little disoriented. I get batches of mail delivered twice
every once in a while.) I removed the -a option for AWL. It's something
I do not trust. And several people's reports on the spamassassin list
reinforce that view.

I fired up the imap and later the secure pop3 tool straight out of RH9.
It picked up the mbox mail box and presents it to me for reading on a
different machine in Outlook Express.

Then I decided it was too awkward to use a :0c: rule in procmail to save
a copy in mbox format of all unprocessed incoming email so that I could
yank out spam and train with it. (mail and its "s100 spam" ability. Yeah,
REAL primitive. It HAD to be improved.)

A throwaway comment by one of the other users on the spamassassin list
led me to the dual account setup. But I think he used the second account
for global training. I wanted per user training. (I also put in the
"allow_user_rules        1" line in /etc/mail/spamassassin/local.cf.)

<digression> Make no changes to /usr/share/spamassassin. I made that
mistake. I expect pain when I upgrade. Use the /etc/mail/spamassassin
folder for all your add-on rule sets. They appear to run alphabetically.
</digression>

I created the second account and fired up imap for that account. That is
where I created the spam and ham folders I used, briefly, for training.
SpamAssassin is friendly enough it can train on its own marked up spam
if you want. It can train from mbox format. It gets slower and slower
as your spam database grows.

On a slow machine getting slower and slower is a decided disadvantage.
So I created "oldspam" and "oldham" to save already processed messages
and diffidently copied all the already processed spam and ham to the
appropriate retraining folders. (If you ever blow away the Bayes
databases, all three files, this will allow relatively pain-free
retraining.)


sa-learn is (apparently) unfriendly enough to train on the message you
will find using "mail" as the permanent first message in the folder.
"This," decides me, "is not right!" So I figured to further automate
the whole process. Now, I speak C far better than Perl. Besides, for
the very limited parsing I had to perform, C is far faster than Perl.
The resultant futility, imapstrip, works in four steps. In step one it
opens the folder you want, the "old<name>" folder that goes with it
(append mode), and a "<name>_temp" file (write mode, to erase the old
one). It checks that the first item in the file is indeed the imap
header message. If not, it proceeds to step three directly. In step
two it parses to the first real "^From " line, saving the material in
between the start of the file and that From for later on. It completes
step two by writing out the rest of the buffer to both the appended
file and the new temp file. In step three it reads then writes to both
files in 64k chunks until the end of the file. Then in step four, if
the imap header was present, it closes the input file and rewrites it
with the header it captured in step two.
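
For anyone who wants the effect without writing the C, here is a rough
shell sketch of the steps above. It is not the author's imapstrip: it
uses awk instead of 64k buffered copies, does no mailbox locking, and
assumes the imap header message can be recognized by an "X-IMAP:" line
in the first message of the folder. The function name imapstrip_sketch
is made up for illustration.

```shell
#!/bin/bash
# Hypothetical stand-in for imapstrip (NOT the original C program).
# Usage: imapstrip_sketch /path/to/folder
# Writes <folder>_temp, appends to old<folder>, and leaves only the
# imap header message in <folder>. No locking is done, so do not run
# it while the imap server has the folder open.
imapstrip_sketch() {
    local folder=$1
    local dir name old temp header
    dir=$(dirname "$folder")
    name=$(basename "$folder")
    old="$dir/old$name"          # archive, opened in append mode
    temp="$dir/${name}_temp"     # learning file, rewritten every run
    header=$(mktemp) || return 1

    : > "$temp"                  # erase the previous temp file

    # Split the mbox: everything before the second "From " line goes
    # to $header, the remainder goes to $temp.
    awk -v hdr="$header" -v body="$temp" '
        /^From / { msg++ }
        msg <= 1 { print > hdr; next }
        { print > body }
    ' "$folder"

    if grep -q '^X-IMAP:' "$header"; then
        # Assumption: X-IMAP: marks the imap header message.
        cat "$temp" >> "$old"        # archive the real messages
        cat "$header" > "$folder"    # folder keeps only the header
    else
        # No header message: copy the whole folder to both outputs
        # and leave the folder itself untouched.
        cat "$folder" > "$temp"
        cat "$folder" >> "$old"
    fi
    rm -f "$header"
}
```

Something like "imapstrip_sketch ~/jdow_train/spam" would then leave
spam_temp ready for sa-learn, in the same way the satrain script
further down expects from the real tool.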

Voila, I have the spam archive updated, the spam_temp for learning, and
the spam folder emptied out all without my intervention.

Since these folders exist in "<name>_train" I had to cross connect
<name> and <name>_train accounts usefully. That means the ~/mail for
"<name>_train" is linked to "~/<name>_train" for the <name> account.
I also had to make <name> and <name>_train members of each other's
ad hoc RedHat groups. A little "satrain" script and .procmailrc
editing later and I'm in business up to today.
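
As a rough illustration of that cross connection, the sketch below only
prints the commands it would take; it does not run them. It assumes
home directories live under /home and that each account has a private
group named after the user, which is the usual Red Hat arrangement. The
function name print_crosslink_cmds is invented here, and the printed
commands would need to be run as root.

```shell
#!/bin/bash
# Dry run: print (do not execute) the commands that would link the
# training account's mail directory into the main account and put
# each user in the other's private group.
# Assumptions: homes under /home (override with $2), and a private
# group per user named after the user.
print_crosslink_cmds() {
    local name=$1
    local train="${name}_train"
    local home_base=${2:-/home}
    # ~<name>/<name>_train points at ~<name>_train/mail
    echo "ln -s $home_base/$train/mail $home_base/$name/$train"
    # add each user to the other's private group
    echo "gpasswd -a $name $train"
    echo "gpasswd -a $train $name"
}
```

Running "print_crosslink_cmds jdow" lists the three commands so they
can be reviewed before anything is changed on the system.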

I note that I am now convinced that the "<name>_train" accounts are
not really needed. But for the time being "it works so I am not messing
with it."

For the nonce the C coding is left as an exercise for the student. It
could be optimized, I suppose. "Done is good!" So I am leaving it alone
for a while. The rest of it is simple-minded.

--8<-- minimum .procmailrc for the <name> account
DROPPRIVS=yes
PROCMAILMATCH="X-Procmail: Matched on"
PROCMAILHEADER="X-Procmail: "

:0 fw: spamassassin.lock
* < 250000
* !^List-Id: .*(spamassassin-talk\.lists\.sourceforge\.net|spamassassin\.apache\.org)
| /usr/bin/spamc
--8<-- added lines for rewriting spamassassin headers for easy replyto
# Tag SA-talk list mail
:0 Efw
* ^List-Id: .*(spamassassin-talk\.lists\.sourceforge\.net|spamassassin\.apache\.org)
| formail -A "$PROCMAILHEADER SA-talk list mail not processed."

# NEW sa-talk list
:0 fw
* ^List-Id: .*spamassassin\.apache\.org
| formail -A "$PROCMAILMATCH SpamAssassin Talk list" -i "Reply-To: spamassassin-users@xxxxxxxxxxxxxxxxxxxx"
--8<-- satrain script for the user's directory
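#!/bin/bash
# Nightly per-user Bayes training; meant to be run from the user's crontab.
date
USER=$LOGNAME
USERTRAIN="${USER}_train"
echo "$USERTRAIN"
# Quit any running fetchmail daemon while training runs.
/usr/bin/fetchmail -q
# Log whether imapstrip is installed.
ls ~/bin/imapstrip
if [ -f ~/bin/imapstrip ]; then
        # imapstrip archives each folder and writes <name>_temp for sa-learn.
        echo "imapstrip training ham"
        ~/bin/imapstrip ~/"$USERTRAIN"/ham &&
                sa-learn --ham --showdots --mbox ~/"$USERTRAIN"/ham_temp
        echo "imapstrip training spam"
        ~/bin/imapstrip ~/"$USERTRAIN"/spam &&
                sa-learn --spam --showdots --mbox ~/"$USERTRAIN"/spam_temp
else
        # No imapstrip: train on the imap folders directly.
        sa-learn --ham --showdots --mbox ~/"$USERTRAIN"/ham
        sa-learn --spam --showdots --mbox ~/"$USERTRAIN"/spam
fi
# Restart fetchmail as a daemon polling every 120 seconds.
/usr/bin/fetchmail -d 120 --fetchmailrc ~/.fetchmailrc
date
echo "====================================================================="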
--8<--

Don't forget to set up fetchmail. But that's no big deal if you have
already done it. It looks remarkably like:
--8<--
# Configuration created sometime in 2003 by jdow
set syslog
set postmaster "jdow"
set no bouncemail
set no spambounce
set properties ""
#set daemon 60
#set logfile /var/log/fetchmail
#set syslog
# repeat these lines below for each email account fetched to the user's
# mailbox.
poll smtp.earthlink.net with proto APOP
       user 'jdow' there with password 'ZZYYZZYY' is 'jdow@mymachine' here
options pass8bits
 smtpaddress '      '
--8<--

Then I set up a user crontab entry to run the script at a unique per user
time in the late late night hours.
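
One way to get a distinct time per user, sketched here as an assumption
rather than what is actually installed on my box, is to derive the
minute from the numeric uid. The 3 a.m. hour, the satrain.log file, and
the function name satrain_cron_line are all placeholders for
illustration.

```shell
#!/bin/bash
# Print a crontab line that runs a training script once a night at a
# minute derived from the uid, so different users do not all start
# at the same moment.
# Assumptions: hour 3 and a log file named satrain.log in $HOME.
satrain_cron_line() {
    local uid=$1 script=$2
    local minute=$(( uid % 60 ))
    # $HOME is left unexpanded here; cron's shell expands it at run time.
    printf '%d 3 * * * %s >> "$HOME/satrain.log" 2>&1\n' "$minute" "$script"
}
```

The printed line can be pasted in with "crontab -e". For example,
satrain_cron_line "$(id -u)" "$HOME/satrain" produces the entry for
the current user.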

I hope that is enough to get you rolling. The C file is a little large
once I put in some basic bounds checking for me to include it here.

{^_^}