From: "Jeff Vian" <jvian10@xxxxxxxxxxx>

> jdow wrote:
>
> >http://wiki.spamassassin.org/ is an astonishingly good place to learn
> >about the ins and outs of SpamAssassin. It also mentions the home pages
> >of various custom rule sets like 99_TripWire, BigEvil, and many others.
> >
> >It is well worth the visit.
> >
> >(I have "progressed" to the point that SA filters my mail. I read my
> >mail via a secure pop2 connection. I maintain spam, oldspam, ham, and
> >oldham folders on a special account via IMAP to mbox files. I have a
> >futility that filters off the message 1 the imap tool insists must be
> >there so that a cron job every night runs "sa-learn" for me. This is
> >all rather handy when I am running this email tool. When I get time I
> >plan to revisit tossing the special folders into my main email account
> >"safely". I understand more now than when I started. {^_-})
>
> I am interested in how you do this.
> I use fetchmail to get my mail by pop3 from the isp and put it in my
> account on the linux box. I then need to get it into the maildirs
> (instead of the default mailbox) so I can teach SA, but am unsure of the
> mechanics of making that happen. Any pointers on getting the maildirs
> working will be greatly appreciated.

The data path is fetchmail to sendmail to procmail to spamc to mbox
output. (Sendmail queues the mail in /var/spool/mqueue as fetchmail
feeds it in. Then it farms the mail back through procmail for delivery
to the mbox file.) This part is easy and should be a slam dunk for most
folks here.

You must get spamd running, though. I use the spamassassin.org rpm for
2.63, so it may be different for Fedora. But the key lines are:

# SPAMDOPTIONS="-d -c -a -m5 -H"
SPAMDOPTIONS="-d -c -m10 -H"

The first is commented out. The second is only partly silly. (I see I
should change the -m option back for sanity. I have a slow machine plus
massive mail chunks coming in (chiefly from LKML) that lead to imap
becoming a little disoriented.
I get batches of mail delivered twice every once in a while.) I removed
the -a option for AWL. It's something I do not trust, and several
people's reports on the spamassassin list reinforce that view.

I fired up the imap and later the secure pop3 tool straight out of RH9.
It picked up the mbox mailbox and presents it to me for reading on a
different machine in Outlook Express.

Then I decided it was too awkward to use a :0c: rule in procmail to save
a copy in mbox format of all unprocessed incoming email so that I could
yank out spam and train with it. (mail and its "s100 spam" ability.
Yeah, REAL primitive. It HAD to be improved.) A throwaway comment by one
of the other users on the spamassassin list led me to the dual account
setup. But I think he used the second account for global training. I
wanted per user training. (I also put in the "allow_user_rules 1" line
in /etc/mail/spamassassin/local.cf.)

<digression>
Make no changes to /usr/share/spamassassin. I made that mistake. I
expect pain when I upgrade. Use the /etc/mail/spamassassin folder for
all your add on rule sets. They appear to run alphabetically.
</digression>

I created the second account and fired up imap for that account. That is
where I created the spam and ham folders I used, briefly, for training.
SpamAssassin is friendly enough it can train on its own marked up spam
if you want. It can train from mbox format. It gets slower and slower as
your spam database grows. On a slow machine getting slower and slower is
a decided disadvantage. So I created "oldspam" and "oldham" to save
already processed messages and diffidently copied all the already
processed spam and ham to the appropriate retraining folders. (If you
ever blow away the bayes databases, all three files, this will allow
relatively pain free retraining.)

sa-learn is (apparently) unfriendly enough to train on the message you
will find using "mail" as the permanent first message in the folder.
"This," decides me, "is not right!"
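
The "pain free retraining" from the oldspam and oldham archives mentioned
above could look roughly like the sketch below. It is only an
illustration: the function name retrain_from_archives is made up, and the
default directory assumes the archives sit next to the ham and spam
folders in ~/<name>_train, as the satrain script's paths suggest. The
sa-learn options are the same ones that script uses.

```shell
#!/bin/bash
# Hypothetical helper: relearn Bayes data from the archive mboxes.
# Argument 1 (optional) is the directory holding oldham and oldspam;
# the default is an assumption based on the satrain script's layout.
retrain_from_archives() {
    local train_dir=${1:-$HOME/${LOGNAME}_train}
    local kind
    for kind in ham spam; do
        if [ -s "$train_dir/old$kind" ]; then
            # --ham or --spam selects which class the mbox is learned as.
            sa-learn "--$kind" --showdots --mbox "$train_dir/old$kind" || return 1
        else
            echo "skipping $kind: $train_dir/old$kind is missing or empty" >&2
        fi
    done
}
```

Calling retrain_from_archives with no argument uses the default
directory. Expect it to be slow on a large archive, for the reasons given
above.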
So I figured to further automate the whole process. Now, I speak C far
better than perl. Besides, for the very limited parsing I had to
perform, C is far faster than perl. The resultant futility, imapstrip,
opens the folder you want, the "old<name>" folder that goes with it
(append mode), and a "<name>_temp" file (write mode to erase the old
one). It checks that the first item in the file is indeed the imap
header message. If not it proceeds to step three directly. In step two
it parses to the first real "^From ", saving the material in between the
start of the file and the From for later on. It completes step 2 by
writing out the rest of the buffer to both the appended file and the new
temp file. It then proceeds to read then write to both files in 64k
chunks until the end of the file. Then in step 4, if the imap header was
present, it closes the input file and rewrites it with the header it
captured in step 2.

Voila, I have the spam archive updated, the spam_temp for learning, and
the spam folder emptied out, all without my intervention.

Since these folders exist in "<name>_train" I had to cross connect
<name> and <name>_train accounts usefully. That means the ~/mail for
"<name>_train" is linked to "~/<name>_train" for the <name> account. I
also had to make <name> and <name>_train members of each other's adhoc
RedHat groups. A little "satrain" script and .procmailrc editing later
and I'm in business up to today.

I note that I am now convinced that the "<name>_train" accounts are not
really needed. But for the time being "it works so I am not messing with
it."

For the nonce the C coding is left as an exercise for the student. It
could be optimized, I suppose. "Done is good!" So I am leaving it alone
for a while.

The rest of it is simple minded.
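
For readers who want to try the idea before writing any C, the four
imapstrip steps described above can be approximated in shell. This is a
stand-in sketch, not the C program from this post. The function name
imapstrip_sketch is made up, it uses awk and cat instead of 64k chunked
copies, and it assumes the imap header message can be recognized by an
X-IMAP: line among the headers of the first message. Check that
assumption against what your imap server actually writes.

```shell
#!/bin/bash
# Hypothetical shell stand-in for imapstrip. Given FOLDER it:
#   - appends the real messages to old<name> in the same directory,
#   - rewrites <name>_temp with the same messages for sa-learn,
#   - leaves FOLDER holding only the imap header message (or empty).
imapstrip_sketch() {
    local folder=$1 dir name archive temp header status
    dir=$(dirname -- "$folder")
    name=$(basename -- "$folder")
    archive="$dir/old$name"
    temp="$dir/${name}_temp"
    header=$(mktemp) || return 1

    : > "$temp"    # write mode: erase the previous temp file

    # Look only at the headers of message 1 (stop at the first blank
    # line or at a second "From " line). ASSUMPTION: the imap header
    # message carries an X-IMAP: header.
    if awk '/^From /{n++} n>1 || /^$/{exit} {print}' "$folder" |
            grep -q '^X-IMAP:'; then
        # Message 1 goes to the saved header, everything after the
        # second "From " line goes to the temp file.
        awk -v hdr="$header" -v body="$temp" \
            '/^From /{n++} {print > (n<=1 ? hdr : body)}' "$folder"
    else
        cat -- "$folder" > "$temp"
    fi

    # Update the archive, then put back just the saved header.
    cat -- "$temp" >> "$archive" && cat -- "$header" > "$folder"
    status=$?
    rm -f -- "$header"
    return $status
}
```

Something like imapstrip_sketch ~/<name>_train/spam followed by sa-learn
--spam --mbox on the resulting spam_temp mirrors how the real utility is
used. Note that nothing here locks the folder, so run it when the imap
server is not writing to it.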

--8<-- minimum .procmailrc for the <name> account
DROPPRIVS=yes
PROCMAILMATCH="X-Procmail: Matched on"
PROCMAILHEADER="X-Procmail: "

# Run everything under 250000 bytes through spamc, except SA list mail
:0 fw: spamassassin.lock
* < 250000
* !^List-Id: .*(spamassassin-talk\.lists\.sourceforge\.net|spamassassin\.apache\.org)
| /usr/bin/spamc

--8<-- added lines for rewriting spamassassin headers for easy replyto
# Tag SA-talk list mail
:0 Efw
* ^List-Id: .*(spamassassin-talk\.lists\.sourceforge\.net|spamassassin\.apache\.org)
| formail -A "$PROCMAILHEADER SA-talk list mail not processed."

# NEW sa-talk list
:0 fw
* ^List-Id: .*spamassassin\.apache\.org
| formail -A "$PROCMAILMATCH SpamAssassin Talk list" -i "Reply-To: spamassassin-users@xxxxxxxxxxxxxxxxxxxx"

--8<-- satrain script for the user's directory
#!/bin/bash
date
USER=$LOGNAME
USERTRAIN="$USER"_train
echo $USERTRAIN
echo "$USERTRAIN"
# Stop the fetchmail daemon while training runs
/usr/bin/fetchmail -q
ls ~/bin/imapstrip
if [ -f ~/bin/imapstrip ]; then
    # Strip the imap header message first, then learn from the _temp copy
    echo "imapstrip training ham"
    ~/bin/imapstrip ~/"$USERTRAIN"/ham && sa-learn --ham --showdots --mbox ~/"$USERTRAIN"/ham_temp
    echo "imapstrip training spam"
    ~/bin/imapstrip ~/"$USERTRAIN"/spam && sa-learn --spam --showdots --mbox ~/"$USERTRAIN"/spam_temp
else
    sa-learn --ham --showdots --mbox ~/"$USERTRAIN"/ham
    sa-learn --spam --showdots --mbox ~/"$USERTRAIN"/spam
fi
# Restart the fetchmail daemon, polling every 120 seconds
/usr/bin/fetchmail -d 120 --fetchmailrc ~/.fetchmailrc
date
echo "====================================================================="
--8<--

Don't forget to set up fetchmail. But that's no big deal if you have
already done it. It looks remarkably like:

--8<--
# Configuration created sometime in 2003 by jdow
set syslog
set postmaster "jdow"
set no bouncemail
set no spambounce
set properties ""
#set daemon 60
#set logfile /var/log/fetchmail
#set syslog
# repeat these lines below for each email account fetched to the user's
# mailbox.
poll smtp.earthlink.net with proto APOP
    user 'jdow' there with password 'ZZYYZZYY' is 'jdow@mymachine' here
    options pass8bits smtpaddress ' '
--8<--

Then I set up a user crontab entry to run the script at a unique per
user time in the late late night hours.

I hope that is enough to get you rolling. The C file is a little large,
once I put in some basic bounds checking, for me to include it here.

{^_^}
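
One way to get the per user crontab entry mentioned above is sketched
here. Everything specific in it is an assumption for illustration: the
function name satrain_cron_line, the ~/bin/satrain location, the
satrain.log file, the 3 a.m. hour, and the trick of taking the minute
from the numeric user id so that different users land on different
minutes.

```shell
#!/bin/bash
# Hypothetical helper: print a crontab line that runs satrain once a
# night. The minute is the user id modulo 60, which spreads users across
# the hour (not strictly unique with more than 60 users).
satrain_cron_line() {
    local uid=${1:-$(id -u)}
    # $HOME is left unexpanded on purpose; cron sets it for the job.
    printf '%d 3 * * * $HOME/bin/satrain >> $HOME/satrain.log 2>&1\n' $(( uid % 60 ))
}
```

Run satrain_cron_line as the mail user and paste the printed line into
crontab -e.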