RH9.0 > Postfix > SpamAssassin > Anomy Sanitizer > Maildrop > Courier IMAP - How-I-did-it

Server based spam filtering, virus filtering, malware filtering, Spam Assassin . . . all with open-source software.
(Also, LANG UTF-8 Unicode Red Hat 9 Red Hat 8 problems with Perl and mc Midnight Commander and other terminal / xterm graphic character mangling problems.)

Robin Whittle rw@firstpr.com.au      2003 June 22  except for two updates:
Updated    on 2004 January 1 for SpamAssassin 2.61

Also, there is a well-regarded web-based management system for Postfix, with instructions on making Postfix work with MySQL, at: http://high5.net


This covers how I integrated virus and spam filtering into my Postfix, Courier Maildrop and Courier IMAP setup - on a Red Hat 9.0 system.  (With a few notes on how I then did the same things on a Red Hat 7.2 machine.)  That setup is described on a page you can reach from the parent directory.  Here I describe how I integrate SpamAssassin 2.55 and the Anomy Sanitizer 1.60 for automatically filtering spam and viruses (and other malware) in Maildrop, via its filtering language.  Maildrop does the local delivery for Postfix - delivering to the Inbox and other mailboxes, all of which are in Maildir format, rather than Mbox.

This documentation is verbose - because I find that a lot of problems are caused by overly terse doco which assumes you already know things that you don't.  Also, I think more people's problems will be solved if the description is full and search engines find this page more easily.

It is not a proper How-To - it is a How I Did It, simplified.  Quite a lot of the trouble I had can probably be avoided by correctly setting the LANG variable not to include ".UTF-8" - either for particular commands or for the whole system.  See the UTF-8 section below - if you do this, your installation will probably be a lot easier than mine and most of the messy stuff below will not apply.

By the time you are reading it, the latest versions of software may differ somewhat from what I am describing, but with luck, the same principles will hold.  If you find significant differences I should mention, please let me know!   

An earlier version of this page, from 2002, is here:  ../Postfix-SA-Anomy-Maildrop-2002/  This page includes a discussion of a totally different way of integrating Anomy Sanitizer and SpamAssassin with Postfix, with separate arrangements for incoming and outgoing mail.  It is the approach used by specialists in this field, Advosys of Canada, who document it fully and whose page I link to. This may be more suitable for some people - especially those with high-volume servers and who don't want to use Maildrop.  



Background

Initially I was going to cook up some simple Maildrop rules to detect a substantial proportion of spam and virus emails, without the trouble of finding and running extra programs for this purpose.  But the increasing volume of spam and viruses (and other malware, such as perhaps Javascript in HTML emails, though I have configured Mozilla / Netscape 7 not to run Javascript in emails) arriving by email, and the way spammers are playing cat-and-mouse with increasingly sophisticated methods of spam filtering, convinced me I needed to use proper detection systems. (That was in 2001 - now, June 2003, with spam increasing by 18% per month according to some estimates, it is essential to invest this effort to find, install and tweak two excellent programs to deal with spam and malware.  I am getting 35 spams a day.)

I am not currently using Postfix's ability to implement IP-address-based rejection of messages via a real time blacklist.  For an ISP or someone running a mail server which gets a lot of spam, this is a good way to reject messages which are assumed to be spam based on the IP address of the machine sending them (and/or perhaps on a few other things which can be determined from the headers ?) - but this rejection has some serious consequences in the event that it blocks a legitimate message:

  1. The intended recipient has no idea the message was sent or rejected.
  2. The sender usually knows it has been automatically rejected, but is given no humanly understandable reason why.  Usually the cause is a blacklist covering their ISP or IP address, based on past, present or imaginary spamming from that ISP or range of IP addresses.  In the event that there are good reasons to blacklist the IP address, this puts the sender in the position of pressuring their ISP to reduce the use of its servers by spammers, in order to get off the blacklist, so that eventually, in days or weeks, they will be able to send the message and have it accepted.
  3. Typically, the sender does not know how to contact the administrator of the server which is doing the blocking, to report their view that this blocking was unjustified.
  4. Even if they do know how to do this, they probably can't do it by email from their usual ISP etc. because it will be rejected too!  I have been in this position as a sender, and had to resort to a web-mail service like Hotmail to write to the system administrator to alert them to their server unreasonably blocking emails originating from the IP address of my server.

I think you need to be damn sure of what you are doing implementing any such system, because it can really disrupt communications.
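
For anyone who has not met real time blacklists before, the lookup itself is only a DNS query.  As a rough sketch, not a recipe: the sending machine's IPv4 address is reversed octet by octet and the blacklist's DNS zone is appended, and if the resulting name resolves, the address is listed.  The helper name dnsbl_name and the zone dnsbl.example.org are placeholders made up for this illustration, not a real list.

```shell
# dnsbl_name: build the DNS name that a blacklist lookup would query.
# Both arguments are illustrative: $1 is an IPv4 address, $2 is the
# blacklist's DNS zone (dnsbl.example.org below is only a placeholder).
dnsbl_name() {
    printf '%s\n' "$1" |
        awk -F. -v zone="$2" '{ print $4 "." $3 "." $2 "." $1 "." zone }'
}

dnsbl_name 192.0.2.99 dnsbl.example.org

# A real check would then try to resolve that name, for example with:
#   host "$(dnsbl_name 192.0.2.99 dnsbl.example.org)"
# An address record in the answer means "listed"; no such name means
# "not listed".
```

Everything about the rejection described above hangs on that one yes/no answer, which is why a wrongly listed address causes so much trouble.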

There are various server-based spam detection systems, but it seemed to me in October 2002 that of the open-source, Linux-compatible, approaches, SpamAssassin http://spamassassin.org was the most sophisticated, the most widely used, the most widely respected and the one under most intense development. Spam Assassin has a battery of modules and ways of detecting spam.  These include options to do real-time references to black-lists and spam signature recognition systems.   However, it doesn't specifically look for malware.   SpamAssassin is now very widely used and there are ways of running it separately from a server.  For instance it can be run on a user's Windows machine, and to pre-filter an IMAP mailbox before the user accesses it with their normal IMAP email client.  See the SpamAssassin site for all the details.  

I had thought that virus detection would probably also be a cat-and-mouse business with the need to pay to receive daily updates of virus definitions from a commercial provider.   However, I discovered an elegant alternative: Anomy Sanitizer http://mailtools.anomy.net .  Anomy deals with all executable attachments - primarily Windows executables but also shell scripts.  This should identify and defang all viruses, whatever the details of their nature.  It also defangs many nasty things which can be put into HTML emails to take advantage of weaknesses in various mail clients.  It can also defang web-bugs in HTML email.  Anomy can call a proper virus scanner, such as F-Prot Antivirus,  from Frisk Software.  Anomy is the work of Bjarni R. Einarsson and is sponsored by Frisk Software.  This is all from Iceland.  Salute!

Thanks to the SpamAssassin and Anomy Sanitizer people!!!

There are multiple MTAs (Message Transfer Agents  - AKA "mail servers") in use on Linux machines.   The most popular is old, bloated Sendmail.  To hell with that - one look at the Bat Book (a very thick O'Reilly book on how to configure it) was enough to put me off.  The next most popular, I think, is Qmail (forgive me capitalising the names of programs).  This is fast and widely respected.   However, a good friend recommended  Postfix, because it is elegant, easy to configure and very highly regarded in terms of security.  I have been entirely happy with Postfix (but don't let it try to deliver to a mailbox, or maybe write to a log file, larger than its inbuilt 50,000,000 byte limit . . . ), but it can be a headache getting clues on how to integrate other software with it because Sendmail and Qmail are more widely used.   I chose Courier IMAP because it was well respected and fast, with the use of Maildir mailboxes, and not as daunting or hard to set up as Cyrus IMAPD, which I understand is ideal for really big installations.  I chose the Courier Maildrop local delivery program because of its extensive mail filtering capabilities.  Its filtering language is tricky, but I understand it is more elegant and powerful than the more widely used, and older, Procmail.   In principle, I think, everything I document here can be done in much the same way with Procmail.   There's nothing in principle here which is Postfix-specific, so I figure it can be done with Sendmail and Qmail as well.

So I already run three programs (Postfix, Courier IMAP and Courier Maildrop), from two sources to make up my mail server - and I have modified Maildrop so it can deliver to the Inbox (or any other mailbox) with the message tagged for deletion.  This is really great for keeping an eye on dozens of mailing lists via the Inbox, and deleting all such copies of list messages with the IMAP expunge command - Alt-F F on Netscape 7.  

Here I integrate the three programs, in this order, to Postfix's local delivery command:

  1. Maildrop with my personalised filtering rules.  Like Anomy and SpamAssassin, Maildrop is server-wide too, with all users' messages being first processed according to a central config file /etc/maildroprc if it exists.  Then each user can have their own ~/.mailfilter file to control the filtering of all messages not dealt with by any /etc/maildroprc file.  (It is possible to make Anomy use a user's config file, rather than a system-wide config file - and I guess SpamAssassin can do the same.)  According to ~/.mailfilter, Maildrop decides whether to deliver the message to the Inbox and/or to other mailboxes, and/or to send it to other email accounts.  With the subjadd program I wrote (see parent directory) it can add things like "~~~[SPAM]" to the start of the subject line.  With my mods to Maildrop, it can also deliver the message to a local mailbox, tagged for deletion.  It can do lots of other things as well.  Each user's .mailfilter file can include one or more other files, so it is a very flexible and extendable arrangement.  As part of the mail filtering, for all messages which I have not dealt with as mailing list messages, via the user's .mailfilter file I call:
    1. SpamAssassin - first, because Anomy Sanitizer can change the message in ways which disrupt spam detection.
    2. Anomy Sanitizer.  Since SpamAssassin can be set up not to alter the message body, but just alter its added header lines and perhaps the Subject line, we don't miss out on any ability to detect malware.

The Postfix block diagram is here:  http://www.postfix.org/big-picture.html   Please also see my diagram in the RHxx Postfix etc. page in the parent directory.


Integrating Maildrop with Postfix

This is how I did it - but please see my "RHxx-Postfix . . . " page in the parent directory for the most up-to-date details.

In addition to other standard changes to /etc/postfix/main.cf to make Postfix work properly with my domain and particular IP addresses, I change some lines as shown below (the old lines are the commented-out ones, the new lines are uncommented), to make it use Courier Maildrop for delivering local messages:

# The mailbox_command parameter specifies the optional external
# command to use instead of mailbox delivery. The command is run as
# the recipient with proper HOME, SHELL and LOGNAME environment settings.
# Exception:  delivery for root is done as $default_user.
#
# Other environment variables of interest: USER (recipient username),
# EXTENSION (address extension), DOMAIN (domain part of address),
# and LOCAL (the address localpart).
#
# Unlike other Postfix configuration parameters, the mailbox_command
# parameter is not subjected to $parameter substitutions. This is to
# make it easier to specify shell syntax (see example below).
#
# Avoid shell meta characters because they will force Postfix to run
# an expensive shell process. Procmail alone is expensive enough.
#
# IF YOU USE THIS TO DELIVER MAIL SYSTEM-WIDE, YOU MUST SET UP AN
# ALIAS THAT FORWARDS MAIL FOR ROOT TO A REAL USER.
#
#mailbox_command = /some/where/procmail
#mailbox_command = /some/where/procmail -a "$EXTENSION"

mailbox_command = /usr/local/bin/maildrop

or, depending on where Maildrop is installed:

mailbox_command = /usr/bin/maildrop


. . .


# PARALLEL DELIVERY TO THE SAME DESTINATION
#
# How many parallel deliveries to the same user or domain? With local
# delivery, it does not make sense to do massively parallel delivery
# to the same user, because mailbox updates must happen sequentially,
# and expensive pipelines in .forward files can cause disasters when
# too many are run at the same time. With SMTP deliveries, 10
# simultaneous connections to the same domain could be sufficient to
# raise eyebrows.
#
# Each message delivery transport has its XXX_destination_concurrency_limit
# parameter.  The default is $default_destination_concurrency_limit for
# most delivery transports. For the local delivery agent the default is 2.

#local_destination_concurrency_limit = 2
#default_destination_concurrency_limit = 10

# RW ---------------------------------
#
# Set to 1 because Maildrop only delivers one message at a time.

local_destination_concurrency_limit = 1




Running SpamAssassin and Anomy Sanitizer from within Maildrop's .mailfilter file


Below, I describe how I installed these programs, and how I configured them.

Invoking SpamAssassin is easy.  Just add this to the start of .mailfilter - or for that matter, anywhere, such as within some conditional code if there are only some messages which should be scanned for spam:
xfilter "/usr/bin/spamassassin -x"
But it's not good enough to do a similar thing with Anomy, since it needs an environment variable ANOMY in order to run and we need to tell it about its config file.

The Anomy doco shows how to run it from Procmail:
ANOMY=/path/to/anomy/
:0 fw
|/path/to/anomy/bin/sanitizer.pl /path/to/sanitizer.cfg
After some experiment, I found this is what I need to filter messages through both programs:
xfilter "/usr/bin/spamassassin -x"

ANOMY=/usr/local/anomy/
xfilter "/usr/local/anomy/bin/sanitizer.pl /usr/local/anomy/anomy.conf"

This is pretty painless!

But my final arrangement is more elaborate, because I use Anomy's stderr log output and append it to the log file I am creating with Maildrop.  Please see the text below, taken from my .mailfilter file,  which shows exactly how I call both these programs.
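
Before trusting the simple two-filter chain with real deliveries, it can help to try it by hand on a saved message.  The following is only a sketch of that idea: filter_chain is a hypothetical helper written for this page, not part of Maildrop, and it merely imitates the way each xfilter line hands the message to a command on stdin and takes the command's stdout as the new message.  Stopping at the first failing filter is my own choice here.

```shell
# filter_chain: pass stdin through each command in turn, the way a run
# of successive xfilter lines would.  Hypothetical helper; it gives up
# with a non-zero status as soon as one filter fails.
filter_chain() (
    msg=$(mktemp) || exit 1
    out=$(mktemp) || { rm -f "$msg"; exit 1; }
    cat > "$msg"                      # save the incoming message
    status=0
    for cmd in "$@"; do
        if sh -c "$cmd" < "$msg" > "$out"; then
            cp "$out" "$msg"          # filter output becomes the message
        else
            status=1
            break
        fi
    done
    [ "$status" -eq 0 ] && cat "$msg"
    rm -f "$msg" "$out"
    exit "$status"
)

# With the real programs installed, the chain from this page could be
# tried on a saved message (test-message.txt is a placeholder name):
#   ANOMY=/usr/local/anomy/; export ANOMY
#   filter_chain "/usr/bin/spamassassin -x" \
#       "/usr/local/anomy/bin/sanitizer.pl /usr/local/anomy/anomy.conf" \
#       < test-message.txt
```

If the output still looks like a sane email, with SpamAssassin's extra headers added and any dangerous attachments defanged, the same two commands should behave the same way from .mailfilter.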

My overall filtering goals - and a section of my .mailfilter file

I put the above lines for Spam Assassin and Anomy Sanitizer after my main mail filtering lines for the various mailing lists I am on.   If spam or malware comes in on a mailing list, as it sometimes does, my filtering arrangement leaves it alone and still files the message to its ordinary mailbox, whilst also delivering it to the Inbox tagged for deletion.  

The incidence of viruses and spam on mailing lists seems to be pretty low.  If it was more of a problem I would need to be more careful - such as by sorting the mailing lists after I had weeded out the spam and defanged the virus emails.  But I suspect that spam filtering the list messages would be tricky, since:
  1. Some would be regarded as spam when they are not, or
  2. Some would be spam and I wouldn't really care, or would prefer it, if they were copied to their mailing list mailbox and probably shown in the Inbox (tagged for deletion like the other mailing list stuff) - if only because I want to see the traffic on the list just as other members see it.
I don't use crappy email clients like MS Outlook Express etc. so I am not too concerned about security problems from spam, or even virus emails.

My main aim is to identify spam and malware (virus) emails among all the non-list messages, shortening and defanging the malware messages - and to handle them so that I can see they have arrived, but don't have to manually select them for deletion etc.   I have Maildrop create a log file - and since Maildrop will fail if this ever gets to 50,000,000 bytes,  I have a weekly log rotation system for it, as documented in the "RHxx-Postfix . . ." page you can find in the parent directory ../ .
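
The weekly rotation itself is described on that other page.  Purely as a complementary sketch, a size check like the one below could guard against the log reaching the limit between rotations.  The helper name rotate_if_big, the ".old" suffix and the threshold used in the usage comment are all assumptions of mine, not part of my actual setup.

```shell
# rotate_if_big FILE LIMIT
# If FILE exists and is at least LIMIT bytes long, move it aside to
# FILE.old and start a fresh empty FILE.  Hypothetical helper.
rotate_if_big() {
    file=$1
    limit=$2
    [ -f "$file" ] || return 0
    size=$(wc -c < "$file" | tr -d ' ')
    if [ "$size" -ge "$limit" ]; then
        mv "$file" "$file.old"
        : > "$file"
    fi
}

# Possible use from the user's crontab, with a threshold chosen to sit
# comfortably below 50,000,000 bytes:
#   rotate_if_big "$HOME/mailfilter-log.txt" 40000000
```

Since each run only replaces one ".old" file, this keeps at most one previous log, which may or may not be what you want.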


Here is a representative section of the .mailfilter file for my account on my server.  I have a few other users, but their email addresses are not as well known as mine so spam and viruses are not a problem for them . . . yet.  

Initially, I had spam and virus emails delivered to the Inbox, tagged for deletion, in addition to sending them to their own spam and virus mailboxes.  After a week of running SpamAssassin 2.55 with Bayes and RBL (Realtime Blackhole List) checks, I found the system was so good that I turned off sending virus and spam messages to the Inbox.  In the first week, 259 spams were correctly detected by SpamAssassin, with no false positives - and 4 were missed.   (There were some other pesky emails I didn't count in these 4 - a report of a virus sent with my address as the sender, two highly targeted spams aimed at me and a very narrow subset of webmasters, and an out-of-office auto reply from someone on a mailing list I had written to.)  23 virus emails were correctly detected, with no false positives and none missed.  I still need to trawl through the spam pit from time to time to check for any non-spam which has been falsely detected, but with the new Bayes system, I think this is going to be rare - much less often than the one every month or two in the past with SpamAssassin 2.42.


# Don't use lines starting with "#" in the middle of continue lines!
#
# Uncomment to create a syntax error:
# if )

# Specify the log file for Maildrop - in the user's directory.
#
# Do not use "=".
#
# I use the same file name in the Anomy section.

logfile "mailfilter-log.txt"

log "========"

# The above line is written to the log file whenever a message comes in.
#
# Then, for each successful match, a long log line is written as well, before
# what is typically two log entries:
#
# 1 - For the cc operation.
# 2 - For the final "to" delivery to the Inbox.
#
# These long log lines serve to make the log file easier to read, and they
# also function as prominent labels in this file.
#
# Note that "." in the filter terms matches any single character, so it
# doesn't just match a literal ".".
#
# The search terms are generally copied from email headers.
# Generally, for mailing lists I find that the "Return-Path: " header
# is the most distinctive and stable for all the ways that messages can
# be sent to a mailing list.
#
# Yahoo Groups lists are different - the "Return-Path: " is customised for
# every message.  So the "Reply-To: " seems to be the best header.

Actually, it may be better to look for the header "List-Unsubscribe: ".


# If external filter commands are used, via "xfilter", be sure that this
# user account has a valid shell such as /bin/bash in its /etc/passwd entry.
#
# If Maildrop exits with an error code, then a line with:
# "Unable to filter message." will appear in /var/log/maillog .
#
# If an external program run via "xfilter" fails, then there may be an
# error message returned to Maildrop which Postfix will write to
# /var/log/maillog .  To debug such things, use "log We are now doing . . . "
# lines in this .mailfilter file, because Maildrop will not be writing
# to its log file if it fails in some way, and only by scrutinising the
# log lines can I determine how far it got.
#
# Postfix will try periodically to send deferred messages, and this can have
# undesired consequences if the mailfiltering system writes the message somewhere,
# or sends it somewhere, but still does not complete without errors.
#
# To see what is deferred, use the command "mailq".
#
# The "postsuper" command can manage messages in the queue.
#
# To force Postfix to try to process all the deferred messages:
#
#     postfix flush
#


# Define where I want copies (tagged for deletion) of sorted list messages to go
# in addition to them being sent to their own mailbox.  I normally want them
# to be sent to the Inbox ("Maildir") but if I am away from home, and so
# am accessing my email infrequently and probably by the Postman web-mail
# system, or for some other reason, don't want my Inbox cluttered with
# mailing list messages, I can send those copies to a mailbox I named
# "0-Inbox-lists".  This "0" puts it up the top of the mailboxes in the left
# pane of Netscape Messenger.

LMB="Maildir/.0-Inbox-lists"         
                             # Comment out the next line when I am away.
                     
LMB="Maildir"                            
 




####################################################################### Misc mailing list stuff.




if (    ( /.m9ndfukc/ )            \
     || ( /.www.god-emil.dk/ )     \
   )
{
    log "------------------------------------------------------------- Antiorp etc. "   
    cc "Maildir/.0-SPAM-etc.Antiorp-crap"
    DELTAG=1
    xfilter "subjadd [Antiorp]"
    to "$LMB"
}

This gets rid of messages from a pesky person or bot who writes gibberish to some lists - before I search for the list messages themselves.  The "to" command to Maildrop means this is the end of the filtering process, so if the if() statement is true, then the rest of the tests are irrelevant.


if ( /^Return-Path: <bugtraq-return/ )
{
    log "------------------------------------------------------------- Linux-bugtraq "
    cc "Maildir/.Lists.Linux-bugtraq"
    DELTAG=1
    xfilter "subjadd [BugTraq]"
    to "$LMB"
}



if ( /^X-Loop: anomy-list/ )
{
    log "------------------------------------------------------------- Mail-Anomy-Sanitizer "
    cc "Maildir/.Lists.Mail-Anomy"
    DELTAG=1
    to "$LMB"
}




if ( /^Return-Path: <courier-users-admin/ )
{
    log "------------------------------------------------------------- Mail-Courier-users "
    cc "Maildir/.Lists.Mail-Courier-users"
    DELTAG=1
    to "Maildir"
}


                                    # EPIC
                                    # <jw@bway.net> Censorware proj. http://www.spectacle.org/freespch/slaclist.html
                                    # <privacy@vortex.com> Privacy Forum


if (     ( /^Return-Path: <epic-news/ )           \
      || ( /^Return-Path: <jw@bway/ )             \
      || ( /^Return-Path: <privacy@vortex/ )      \
   )   
{
    log "------------------------------------------------------------- Privacy-various-lists"
    cc "Maildir/.Lists.Privacy-various-lists"
    DELTAG=1
    to "$LMB"
}



#------------------------------------------------------------------------------------------------------------------
#
#  Spam and Virus zone!


                                    # Copy to account on another machine, in case I
                                    # accidentally delete something or this server dies.
                                    #
                                    # The "!" means use this as an email address.
                                    #
                                    # I am not using this approach any more.  There is
                                    # another approach below.
                                                                          
# cc "! blahblah@blah.blah"

                                    # Also copy to a folder so I can see them easily.
                                    # But I don't intend to keep this folder's contents
                                    # for long.
                                   
cc "Maildir/.0-Inbox-Old.Post-Filter-Inbox"



if (    (      /info@firstpr.com.au/)\ # Watch the headers for any mention of one of the
     || ( /marketing@firstpr.com.au/)\ # generic email accounts which RFC 2142 says we
     || (     /sales@firstpr.com.au/)\ # should receive mail on, and which Postfix by
     || (   /support@firstpr.com.au/)\ # default aliases to root.
   )      
{
    log "------------------------------------------------------------- Spam generic address. "   
    cc "Maildir/.-SPAM"             #  Make this "cc" for copy or "to" to not send it to Inbox.
    DELTAG=1
    xfilter "subjadd ~~~[SPAM]"
    to "$LMB"
}

                                    # Look for messages addressed to an old account
                                    # at an ISP, where the messages are automatically
                                    # forwarded to here.  This is almost certainly
                                    # spam.  I will label it in the Inbox in a
                                    # distinctive way and toss it into its very
                                    # own spam pit.
                   
                   
if (    (/blahblah@ozemail.com.au/ ) \
   )
{
    log "------------------------------------------------------------- blahblah@ozemail.com.au probable spam "
    cc "Maildir/.0-SPAM-etc.00spam-ozemail"
    DELTAG=1
    xfilter "subjadd ~~~[SPAM-OzEm]"
    to "$LMB"
}



                                    # Look for messages from specific people which I don't want
                                    # to subject to spam filtering, because, for instance,
                                    # they have sent HTML emails or some other messages which
                                    # have been falsely identified as spam in the past.
                   
if (    (/friend-a/ )                \
     || (/friend-b/ )                \
     || (/business-associate-1/ )    \
     || (/business-associate-2/ )    \
   )
{
    log "------------------------------------------------------------- Saved from spam filtering "

                                    ##################################################################
                                    # This section copied from the main non-spam ordinary message
                                    # section below. 
                                    #
                                    # Copy to account on another machine, in case I accidentally delete
                                    # something from my Inbox, or mistakenly move it somewhere and then
                                    # not know where I put it.
    cc "! blahblah@blah.blah"
                                    # Also copy to a mailbox so I can see them easily.
                                    # But I don't intend to keep this mailbox's contents
                                    # for long.
                                   
    cc "Maildir/.0-Inbox-Old.Post-Filter-Inbox"

                                    # Deliver to the Inbox.
    to "Maildir"
                                    ##################################################################
}


                                    # Pipe through SpamAssassin and if it is deemed
                                    # to be spam, dump it in the spam pit and send a
                                    # copy, tagged for deletion, to the Inbox, with
                                    # [SPAM] added to the Subject line.
                                    #
                                    # If it is not deemed to be spam, then send it
                                    # through Anomy Sanitizer to defang it and
                                    # to drop any executable (ie. probably a virus)
                                    # attachments.

xfilter "/usr/bin/spamassassin -x"


if (    /^X-Spam-Flag: YES/                   \ # Watch out for header line added by Spamassassin.
   )      
{
    log "------------------------------------------------------------- Spam general. "   
                                             
                                              # To make it not go to the Inbox, change this "cc" into a "to"
                                              # and comment out the next three lines.

    cc "Maildir/.-SPAM"                       
    DELTAG=1
    xfilter "subjadd ~~~[SPAM]"
    to "$LMB"
}

                                              # Some debugging lines and a place to save things when
                                              # I am tweaking Anomy Sanitizer.
# cc "Maildir/.Debug"
# log "Send to Anomy"


                                              # Set up the environment variable ANOMY to keep Anomy happy.
                                              # 
                                              # Filter the message via stdin to Anomy, with the config
                                              # file specified, logging output being appended to the
                                              # Maildrop log file and then the output being piped to
                                              # cat so cat's stdout sends it back to Maildrop.  The use
                                              # of "2>>" for appending stderr with the log material means
                                              # we need the "| cat".
                                              #
                                              # If Anomy's conf file has:
                                              #
                                              #   feat_log_inline = 0
                                              #   feat_log_stderr = 1
                                              #
                                              # Then a report of Anomy's progress in working on the message
                                              # will be appended to the Maildrop log file.  This seems to
                                              # work fine, so presumably each "log" line for Maildrop
                                              # means it opens and closes the log file.

ANOMY=/usr/local/anomy/
xfilter "/usr/local/anomy/bin/sanitizer.pl /usr/local/anomy/anomy.conf 2>> ~/mailfilter-log.txt | cat"

# log "Anomy done."


if (   /^*** Attached file dropped ***/:b     \ # Watch out for text added to body by Anomy Sanitizer
                                              \ # when it *drops* a file, not just when it renames
                                              \ # or in some other way defangs one.  This is a part
                                              \ # of the drop message I added.
                                              \ # ":b" means look in the body, rather than the headers.
   )                  
{
    log "------------------------------------------------------------- VIRIII. " 
                                              # To make it not go to the Inbox, change this "cc" into a "to"
                                              # and comment out the next three lines.
 
    cc "Maildir/.-Viriii"                    
    DELTAG=1
    xfilter "subjadd ~~~[VIRIII]"
    to "Maildir"
}







    log "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - No match."
                           


                                    ##################################################################
                                    # This section is functionally the same as the section above
                                    # where I handle particular messages which I accept without
                                    # spam or virus filtering.
                                    #
                                    # Copy to account on another machine, in case I accidentally delete
                                    # something from my Inbox, or mistakenly move it somewhere and then
                                    # not know where I put it.
    cc "! blahblah@blah.blah"
                                    # Also copy to a mailbox so I can see them easily.
                                    # But I don't intend to keep this mailbox's contents
                                    # for long.
                                   
    cc "Maildir/.0-Inbox-Old.Post-Filter-Inbox"

                                    # Deliver to the Inbox.
    to "Maildir"
                                    ##################################################################
}
 



Problems with Red Hat 8.0 and Red Hat 9.0 Unicode UTF-8: Perl, terminal display mangling, mc Midnight Commander etc.

Here is a problem I discovered whilst installing SpamAssassin and the Perl modules it requires.  This is my rough understanding - please follow the references for more information.  See the section below, where I got rid of this UTF-8 stuff from my system, in order to test and presumably run Anomy Sanitizer.  If you do this now, perhaps you will have less trouble than I did!

There is an environment variable called LANG.  The command env lists all the environment variables, including LANG.  For me, in Australia, this returns:
LANG=en_AU.UTF-8
UTF-8 is a way of storing the 128 ordinary ASCII characters in the lower 7 bits of one byte, with many more possible characters being stored as two bytes - with bit 7 of the first byte set, and the lower order bits of the first byte together with bits of the second byte making up the character code.  This expands further to 3 and 4 byte representations of very large numbers of characters.  Thus, every conceivable character of any language can be encoded within the one system.  This, in principle, seems like a better approach than the user or the application needing to know the character set used in a particular file, which leads to errors and the impossibility of using characters from two or more languages in one file.  See: http://www.unicode.org .
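To make the byte counts concrete, here is a small sketch, assuming a POSIX shell with printf and wc.  The function name utf8_len and the three sample characters are chosen purely for illustration; the bytes are written as octal escapes so that the result does not depend on how the terminal itself is set up.

```shell
# Count the bytes that one character occupies when encoded as UTF-8.
# The argument is a printf format string containing octal byte escapes.
utf8_len() {
    printf "$1" | wc -c | tr -d ' '
}

ascii_len=$(utf8_len 'A')              # U+0041, plain ASCII
latin_len=$(utf8_len '\303\251')       # U+00E9, e with an acute accent
euro_len=$(utf8_len '\342\202\254')    # U+20AC, the euro sign

echo "A: $ascii_len  e-acute: $latin_len  euro: $euro_len"
```

The ASCII character takes one byte, the accented letter two and the euro sign three, which is the expansion described above.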

Various programs behave differently depending on this LANG environment variable.  Furthermore I understand that if you compile a Perl program (maybe others) with LANG including a "UTF-8" in it, such as the ".UTF-8" extension to the English Australia setting shown above, then the resulting program will be built to handle Unicode characters.  Maybe this is what you want - but I read somewhere (sorry, I can't remember where) that this may cause the program to run more slowly because Unicode handling is more complex. As noted below, UTF-8 breaks MakeMaker, which is for building and then installing many Perl modules.

It is possible to run a program, such as Midnight Commander, with a different setting for LANG, using the bash shell in ways such as these:
LANG=C      mc

LANG=       mc

LANG=en_AU  mc

Putting LANG=xyz in front of a command sets the environment variable for that command only - the shell's own setting is not changed.  The following seems to do the same thing:
env LANG=C mc
This is the way to set the environment variable for all subsequent commands in that shell:
export LANG=C
mc
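The difference between these forms can be checked without running mc.  This is only a sketch, assuming a POSIX shell; the variable names (before, one_off and so on) are illustrative and not part of any tool.

```shell
# Remember whatever LANG is right now (it may be empty or unset).
before=${LANG-}

# 1. Prefix form: only the one child command sees LANG=C.
one_off=$(LANG=C sh -c 'echo "$LANG"')

# The calling shell's own LANG is untouched by the prefix form.
after_prefix=${LANG-}

# 2. env form: same effect as the prefix form.
via_env=$(env LANG=C sh -c 'echo "$LANG"')

# 3. export form: changes LANG for every later command in this shell.
export LANG=C
after_export=$(sh -c 'echo "$LANG"')

echo "prefix=$one_off env=$via_env export=$after_export"
```

Only the export form changes what later commands see, so it is the one to use before a long session of building Perl modules.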

The LANG variable for user root, in Red Hat 9.0, is set up via /root/.bash_profile, which includes /root/.bashrc, which in turn includes /etc/bashrc .  But this page:  http://www.linuxmanagers.org/pipermail/linuxmanagers/2002-December/000885.html says the setting can be changed in the file /etc/sysconfig/i18n  ("i18n" - i, eighteen letters, n - means "internationalisation" . . .).

I don't know enough about Perl or locales (which is what LANG refers to) to advise any further on the principles at work here.  But here are some observations:

Installing Perl modules and SpamAssassin 

Short version:  
SpamAssassin installation was fine, but getting the required Perl modules was a real pain. There is a specific problem with Red Hat 9 (as noted in the previous section) which affects the ability to compile and install these modules, as well as SpamAssassin, and which should be corrected for with "LANG=C" or a similar setting.  Maybe this can be done before running the CPAN install program.  I found that CPAN did not work reliably for compiling (AKA "building") and installing the modules, so I had to drop out of it and issue the appropriate commands myself. Now I realize this was probably caused by MakeMaker not working, so I guess fixing this UTF-8 problem would have made things go more smoothly.

Both SpamAssassin and Anomy Sanitizer are written in Perl and are installed as Perl modules.   What are Perl modules?   Perl has evolved to the stage where there is a highly organised library of code, generally found via http://www.cpan.org and which can be installed on any particular machine which has a suitable Perl compiler/interpreter/whatever.  The programs are generally installed in some centralised manner, both in terms of where the resulting files live, and how they are installed . . . but it seems that the actual file locations vary from system to system.  

To this end, there is a Perl module called CPAN (run as perl -MCPAN, which is why I sometimes call it "MCPAN" on this page) which can get modules from cpan.org and install them automatically.  In doing so, it can satisfy dependencies and do all sorts of other magic.   I guess it is like Red Hat's rpm program, but it knows all about the central cpan.org site, and downloads from a chosen geographically close mirror site.  It can do MD5 checksums etc.  I had never heard of it, but it was part of my Red Hat 7.2 and 9.0 installations.

SpamAssassin can be installed in three ways, I think:
  1. Using CPAN - supposedly the easiest - but it involves compilation.
  2. Downloading a tarball and compiling it.
  3. With an RPM, but does this involve compilation?
I chose CPAN, but it didn't work out smoothly.  Nor would a straight tarball, or probably an RPM, due to the abovementioned UTF-8 problems. I was able to fix the problems by changing the LANG environment variable to "C" and doing the build and install steps manually, rather than leaving it to CPAN.

I had never run this CPAN thing before, so the first time I ran it, it took me through a bunch of configuration things, including:
  1. Choosing a nearby mirror site for receiving Perl archives from.
  2. Setting up any HTTP and FTP proxies I might want to use.
  3. Various other config things - such as locations of unzip programs.
  4. Setting by default something which Advosys suggested be explicitly set every time:
     o conf prerequisites_policy ask
     This means that CPAN will ask the user about getting any new or updated Perl modules which might be needed by whatever they are now installing.   So if your CPAN is not already set up for this, give that command each time before installing something.  (The SpamAssassin instructions include this specifically.)
What is this "MCPAN"?   There's no man page under that name.   Where does it live and where is its doco?   The answer was suggested by a correspondent: the module is called CPAN (the -M in perl -MCPAN is Perl's switch for loading a module), so:
man CPAN
It depends on your current Perl version, but I think the various Perl modules can be found under /usr/lib/perl5/ but exactly what is going on in there, I don't know.

I ran CPAN for the first time, as root, like this (as instructed by man CPAN):
perl -MCPAN -e shell
(You might want to consider putting LANG=C before this and all other invocations of perl and CPAN.)

It introduced itself and took me through some config stuff.  I hate it when a program calls itself "I".  I am an "I" - a program is not!  (Likewise dumbed-down consumer crap which displays "I blah blah . " when the intention of the program designers is to implant this thought directly into my mind, where the "I" means me.)

I hit Enter for all the questions except to tell it where I was located so it could find a nearby mirror.

What  is a "WAIT server"?  http://www.perldoc.com/perl5.6.1/lib/CPAN/WAIT.html "WAIT" is some kind of search system.  

I ended the CPAN session with "quit".

While SpamAssassin can be installed via a tarball, I chose this CPAN approach because it worked OK last time - though it was very convoluted due to side-trips into CPAN updating itself, installing modules required for that update, modules to support those modules etc.

To install SpamAssassin, I need to run CPAN and issue the following commands (as root), one at a time, as per the SpamAssassin instructions:  http://www.spamassassin.org/dist/INSTALL .  The second line is probably redundant, so I did not give it - that option was set to "ask" as part of the initial configuration.  (For those interested in how I installed the latest SpamAssassin on my Red Hat 7.2 system, jump ahead to here.)
perl -MCPAN -e shell
o conf prerequisites_policy ask
install Mail::SpamAssassin
quit
The first time I tried to install anything (SpamAssassin in this case), CPAN decided that it was itself an older version and that it would be a good idea to install a new version. (See below on how to update CPAN without using CPAN.)
 There's a new CPAN.pm version (v1.70) available!
  [Current version is v1.61]
  You might want to try
    install Bundle::CPAN
    reload cpan
  without quitting the current session. It should be a seamless upgrade
  while we are running...
I object to a computer program describing it and me as "we"!  

While I was thinking about whether to update the program, it started downloading something - I figured it was upgrading itself without my permission . . .  This Perl stuff is getting a bit uppity!  Minutes later, it turned out it had tried to install SpamAssassin 2.55, and produced the following errors:
Mail-SpamAssassin-2.55/INSTALL
Mail-SpamAssassin-2.55/config.h.in

  CPAN.pm: Going to build J/JM/JMASON/Mail-SpamAssassin-2.55.tar.gz

Checking if your kit is complete...
Looks good

Warning: I could not locate your pod2man program. Please make sure,
         your pod2man program is in your PATH before you execute 'make'

Writing Makefile for Mail::SpamAssassin
Makefile:92: *** missing separator.  Stop.
  /usr/bin/make  -- NOT OK
Running make test
  Can't test without successful make
Running make install
  make had returned bad status, install seems impossible
This is a bore.  Did this fail due to:
  1. Me not understanding and following instructions?
  2. Me not upgrading CPAN?
  3. Something funny about my current Red Hat 9.0 installation (I recently used up2date . . . ) (Later, I found, as documented above, that it was this Unicode UTF-8 problem - nothing to do with SpamAssassin.)
  4. Some problem with SpamAssassin?
I consult my good friend Google - for: "Makefile:92"  Nothing . . .   I try the SpamAssassin talk mailing list:
http://sourceforge.net/mailarchive/forum.php?forum=spamassassin-talk 
A message from 2003 May 15 includes:
Hi Thomas,
I was building a Snort sensor this afternoon on a RH 9 box and ran into the
same error while trying to build Net_SSLeay. I found this info which fixed
my problem

Try building it with the following command:

[root@mailtest1 Mail-SpamAssassin-2.54]# LANG=C perl Makefile.PL

HTH,
Matt
This is for building SpamAssassin from a tarball.  Here is what has happened:
The CPAN-instigated build couldn't find where the pod2man program was.  man pod2man reveals it is something to do with documentation format conversion.  Consequently there was a broken entry in the Makefile:
/root/.cpan/build/Mail-SpamAssassin-2.55/Makefile 
at line 92:
installhtml1dir=''
installhtml3dir=''
installman1
INSTALLSITEBIN = /usr
and so make failed.

What exactly does the fix:
LANG=C perl Makefile.PL
mean?  LANG is an environment variable, and I guess this will build SpamAssassin in a way that make doesn't go looking for this pod2man program, due to that form of documentation not being required, on account of the different state of the environment variable "LANG".

(I later found this problem with RH9.0 is a FAQ: http://spamassassin.taint.org/faq/index.cgi?req=show&file=faq04.014.htp .  The standard state of LANG in RH9 and apparently RH8.0 is some Unicode thang: utf8 .  For my system it is en_AU.UTF-8 .  Building perl programs with this apparently causes the resulting program to run really slowly, due to it including libraries for complex character handling.)

So I quit CPAN. With the bash shell, I can set an environment variable like this: "LANG=C".  The test is:
LANG=C env
which returns, in part:
LANG=C
but this only works for the environment of the command on that line.  

I try the Medicinal Inkantation from /root/.cpan/build/Mail-SpamAssassin-2.55/ :
LANG=C perl Makefile.PL
This created a different Makefile .

Now I need to follow the compilation instructions in the INSTALL file.  This is the first section for what I assume is root installing it for all users to use:
[unzip/untar the archive]
cd Mail-SpamAssassin-*
perl Makefile.PL
[option: add -DSPAMC_SSL to $CFLAGS to build an SSL-enabled spamc]
make
make install                            [as root]
So I try:
make
This seems OK.
make install
This worked fine for me.  It output all the places it installed things but my subsequent attempts to make it do this again and capture it failed.  Documentation about installation directories is in README, but I am always wary about how that could be out of date or not quite relevant to the actual installation.
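One way to keep a record of where make install puts things is to copy the output to a file while it is still being shown on screen.  This is a sketch only: run_logged is a made-up helper name and the log path shown in the comment is hypothetical.

```shell
# Run a command, showing its output and also saving it to a log file.
# Both stdout and stderr are captured.
# Usage: run_logged LOGFILE COMMAND [ARGS...]
run_logged() {
    logfile=$1
    shift
    "$@" 2>&1 | tee "$logfile"
}

# Intended use, from the unpacked source directory (not run here):
#   run_logged /root/sa-install.log make install
```

Doing this on the first run matters, since a later run may not print the same list again.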

In INSTALL there is a list of Perl modules which SpamAssassin needs to run - but not, so far, for the "build" process.  Should I install these with CPAN?  How many of them do I already have?  Here is the list:
  - File::Spec >= 0.8    (from CPAN, or included in Perl 5.6 and higher)

    The File::Spec module is required in version 0.8 (Mar 2000) or later.
    This is included in Perl versions 5.6 and later.

  - Pod::Usage           (from CPAN, or included in Perl 5.6 and higher)

    The Pod::Usage module is required.  This is included in Perl versions
    5.6 and later.

  - HTML::Parser >= 3.0  (from CPAN)

    HTML is used for an ever-increasing amount of email so this dependency
    is unavoidable.  Run "perldoc -q html" for additional information.

    If you use Debian, you can get HTML::Parser from the libhtml-parser-perl
    package.

  - Sys::Syslog    (from CPAN)

    This is a required module if you use spamd.  spamd logs information
    about scanned messages to syslog using this module.  
(On my Red Hat 7.2 system, when I used CPAN's install command to see whether I had the above modules installed, and if so whether they were up to date, the only way of getting Sys::Syslog was to install Perl 5.8.0 - I had 5.6 on that machine. I did this with a "force install Sys::Syslog" approach.  This took a long time . . . (10 Megs over a 56k modem - I should have configured a proxy to my cable modem machine.)

Then I quit CPAN, ran it again, and did "install" for these four modules and those mentioned below which SpamAssassin needs to run.  This caused some compilation to occur for Pod::Usage. When I did this for Sys::Syslog I got the same sort of message again . . . as if I had not installed Perl 5.8.0.  Indeed, there was no 5.8.0 directory . . . so what was going on?

I took the second option for getting this Sys::Syslog installed via CPAN: install J/JH/JHI/perl-5.8.0.tar.gz .  This generated an error immediately: "Running make for J/JH/JHI/perl-5.8.0.tar.gz  Use of uninitialized value in string eq at /usr/lib/perl5/5.6.1/CPAN.pm line 4424."

So I quit CPAN, downloaded the 10 Meg tarball  ftp://ftp.cpan.org/pub/CPAN/src/perl-5.8.0.tar.gz and started installing it manually.  While this was going, I did some Googling and found that Perl 5.8.0 installs its binary (the compiler/interpreter, I guess) in /usr/local/bin/ rather than the /usr/bin/ of the previous versions. I found two identical files in /usr/local/bin/ : perl and perl5.8.0 .  So it looks like CPAN did install 5.8.0 . . . but I am not sure.

Now, with too many questions in my mind to list . . . I completed the manual installation of Perl 5.8.0, which went without visible errors. Then I rebooted the machine!  I later noticed that the 5.8.0 directory was not in /usr/lib/perl5/ but in /usr/local/lib/perl5/ .  So I don't know whether CPAN installed 5.8.0 properly at all.

There were three identical length binaries from this new installation: /usr/local/bin/perl5.8.0 , /usr/local/bin/perl and /usr/bin/perl .  After this manual installation and reboot, CPAN behaved differently.

I then ran CPAN and tried to install the 6 modules I think SpamAssassin needs:
perl -MCPAN -e shell
install File::Spec
install Pod::Usage
install HTML::Parser
install Sys::Syslog
install DB_File
install Net::DNS
The first outcome was that it was clear that Perl 5.8.0 was running this: "/usr/local/lib/perl5/5.8.0/CPAN/Config.pm initialized." (This is how I found the different location!)  The second was that I had to go through a series of set-up questions before I could do anything. Then, here is what happened for the 6 modules listed above:
File::Spec    Up to date.

Pod::Usage    It got a new version from the Net and installed it.

HTML::Parser  It installed G/GA/GAAS/HTML-Parser-3.28.tar.gz.  I
              said no to the Unicode option.  It wanted to install
              HTML::Tagset, which I let it do.

Sys::Syslog   Up to date!!

DB_File       It got and installed a new version: DB_File-1.806.

Net::DNS      It got and installed a new version: Net-DNS-0.37,
              which involved also installing Digest::HMAC_MD5 but I
              accidentally stopped this.  Trying to install Net::DNS
              caused trouble due to this missing Digest::HMAC_MD5
              module so I stopped it, quit CPAN, started CPAN
              again, installed Digest::HMAC_MD5 - which involved
              getting and installing Digest::SHA1 - and then installed
              Net::DNS.

So I believe I am ready to install SpamAssassin 2.55 on RH7.2!
perl -MCPAN -e shell
install Mail::SpamAssassin
quit
This worked perfectly - so it's time to configure it.)
Back on my Red Hat 9.0 system, I note from the existence of /usr/lib/perl5/5.8.0/ that I must have version 5.8.0 of Perl.  Scrutinising the sub-directories there, I determine which of the required modules I already have:
File::Spec     Yes.
Pod::Usage     Yes.
HTML::Parser   No - there's no HTML directory. 
Sys::Syslog    No - there's no Sys directory.
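Looking for directories can mislead, since some modules may be installed under an architecture-specific sub-directory rather than directly under 5.8.0/ .  A more direct check is to ask perl to load each module.  This is a sketch: the helper name have_perl_module is an assumption for illustration, while perl -MModule -e 1 is standard Perl usage.

```shell
# Report whether a Perl module can be loaded by the perl found in PATH.
# Succeeds (exit status 0) if the module loads, fails otherwise.
# Usage: have_perl_module Module::Name
have_perl_module() {
    perl -M"$1" -e 1 2>/dev/null
}

# Check the four modules from the INSTALL list.
for m in File::Spec Pod::Usage HTML::Parser Sys::Syslog; do
    if have_perl_module "$m"; then
        echo "$m    Yes"
    else
        echo "$m    No"
    fi
done
```

This tests the same perl that will later run SpamAssassin, which matters on a machine with more than one Perl installed.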
Following the SpamAssassin example, I try this:
perl -MCPAN -e shell
install HTML::Parser
(You might have a straightforward installation if you put LANG=C at the start of the first line above.)

I said no to an option about unicode encoding.

There was a series of tests, three of which failed:
Failed Test    Stat Wstat Total Fail  Failed  List of Failed
-------------------------------------------------------------------------------
t/entities.t                 11    2  18.18%  2 9
t/headparser.t                4    1  25.00%  3
1 test skipped.
Failed 2/40 test scripts, 95.00% okay. 3/226 subtests failed, 98.67% okay.
make: *** [test_dynamic] Error 29
  /usr/bin/make test -- NOT OK
Running make install
  make test had returned bad status, won't install without force
What is this crap?  I just want to install SpamAssassin.

I said "bye" to CPAN and made this my current directory:
/root/.cpan/build/HTML-Parser-3.28
What do I do?  Force installation of something I don't understand which tells me it is partially broken?  Go back and install it with the "experimental unicode options"?  This exemplifies how one IT task multiplies into an indeterminate number of sub-tasks which cannot be foreseen, each of which has the same form as the original task: unpredictable fanning out into further tasks and unpredictable times involved in completing each of them, if indeed they can be completed successfully.  Or is it because I don't have the latest CPAN????

I delete the directory and try again, this time saying Yes to unicode . . . but this fails 8 tests!!

So I go back and do it again, deleting the directory, no unicode . . .

After some mucking around, in /root/.cpan/build/HTML-Parser-3.28 , I find that I can just give the commands:
make
make install
and it seems to be happy.  But how do I know I have done the right thing?  I am installing some software I have never heard of.

Now for the second module I think I need:
perl -MCPAN -e shell
install Sys::Syslog
Hmm, it says: Sys::Syslog is up to date.

I also want two more modules mentioned in http://www.spamassassin.org/dist/INSTALL :

  - DB_File        (from CPAN)

    Used to store data on-disk, for the Bayes-style logic and
    auto-whitelist. *Much* more efficient than the other standard Perl
    database packages. Strongly recommended.

  - Net::DNS       (from CPAN)

    Used to check the RBL, RSS, DUL etc. and perform MX checks.
    Recommended.

So:
perl -MCPAN -e shell
install Net::DNS
This needed some more modules which I allowed it to get and install:
Digest::HMAC_MD5
Digest::SHA1
This chews away for several minutes, including probing remote nameservers for test purposes and eventually winds up with:
make[1]: Leaving directory `/root/.cpan/build/Net-DNS-0.35'
You have a working compiler.

You appear to be directly connected to the Internet.  I have some tests
that try to query live nameservers.

  /usr/bin/make  -- OK
Running make test
cc   test.o   -o test
  /usr/bin/make test -- OK
Running make install
make: *** No rule to make target `install'.  Stop.
  /usr/bin/make install  -- NOT OK
This is just shit.  The process just fails.  I exit CPAN and find that there is no Makefile at all in /root/.cpan/build/Net-DNS-0.35/ .

I try following the instructions there, with the Majick Inkantation which I used earlier to get me out of strife:
LANG=C perl Makefile.PL
I typed "y" to its request to do some tests over the Net.  This resulted in a Makefile being created.  Maybe the problem would be solved if I got a new CPAN . . .


Upgrading to the latest CPAN

Maybe I should get rid of this old CPAN and start again . . .  This page: 
http://www.cynistar.net/~apthorpe/code/configuring_cpan.html
says to download a tarball from cpan.org and install it (perl Makefile.PL; make; make test; make install).  I find it at:
http://www.cpan.org/modules/01modules.index.html   
CPAN ANDK CPAN-1.70.tar.gz 129k  04 Mar 2003
This leads to problems with pod2man not being found, so I give these commands:
LANG=C perl Makefile.PL
make
make test
make install
This works fine.  So I have the latest CPAN.


I delete the three directories in /root/.cpan/build/  ( Net-DNS-0.35 , Digest-HMAC-1.01 and Digest-SHA1-2.02) and try again:
perl -MCPAN -e shell
install Net::DNS
This makes absolutely no difference.  make fails because there is no Makefile.  (Now we know why: UTF-8!)

So we are back to this, in  /root/.cpan/build/Net-DNS-0.35 :
LANG=C perl Makefile.PL
OK - there is a Makefile.
make
make test
make install
This seems to be OK.  So even the latest CPAN cannot be trusted to do its job.  (Because of Red Hat's LANG="xxx.UTF-8".)
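The manual steps used above can be wrapped in a small shell function.  This is a sketch, not the method CPAN itself uses: build_in_c_locale and the DRY_RUN switch are hypothetical names, and the directory is simply the one from the text.

```shell
# Build and install an already-unpacked Perl module with LANG forced to C,
# so that Makefile.PL (MakeMaker) does not run under a UTF-8 locale.
# Usage: build_in_c_locale DIRECTORY
# Set DRY_RUN=1 to print the commands instead of running them.
build_in_c_locale() {
    dir=$1
    for step in "perl Makefile.PL" "make" "make test" "make install"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "cd $dir && LANG=C $step"
        else
            # Stop at the first step that fails.
            (cd "$dir" && LANG=C $step) || return 1
        fi
    done
}

# Show what would be run for the Net::DNS build directory.
DRY_RUN=1 build_in_c_locale /root/.cpan/build/Net-DNS-0.35
```

Running it without DRY_RUN=1, as root, performs the four steps for real and stops at the first failure.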

Now for the last module I think I need:
perl -MCPAN -e shell
install DB_File
This was already installed.

I feel like a fish out of water here, but I guess it's time to configure SpamAssassin.

Configuring SpamAssassin

In README it says:
These are the configuration files installed by SpamAssassin.  The commands
that can be used therein are listed in the POD documentation for the
Mail::SpamAssassin::Conf class (run the following command to read it:
"perldoc Mail::SpamAssassin::Conf"). 
Urgg ... documentation in a form I can't easily turn into text.  (man/info perldoc is no use.)  But all the doco is in HTML at: http://spamassassin.org/doc.html .  The page with all the details of settings is http://spamassassin.org/doc/Mail_SpamAssassin_Conf.html .

Be sure to read over the list of tests too:  http://spamassassin.org/tests.html .

Here are some parts of README:
Note: The following directories are
the standard defaults that people use.  There is an explanation of all the
default locations that SpamAssassin will look at the end.
  - /usr/share/spamassassin/*.cf:

    Distributed configuration files, with all defaults.  Do not modify
    these, as they are overwritten when you upgrade.

  - /etc/mail/spamassassin/*.cf:

      Site config files, for system admins to create, modify, and
    add local rules and scores to.  Modifications here will be
    appended to the config loaded from the above directory.
To start with, there's only one file there:
# This is the right place to customize your installation of SpamAssassin.
# See 'perldoc Mail::SpamAssassin::Conf' for details of what can be
# tweaked.
#
###########################################################################
#
#rewrite_subject 0
#report_safe 1


  - /usr/share/spamassassin/user_prefs.template:

    Distributed default user preferences. Do not modify this, as it is
    overwritten when you upgrade.

  - /etc/mail/spamassassin/user_prefs.template:

    Default user preferences, for system admins to create, modify, and
    set defaults for users' preferences files.  Takes precedence over
    the above prefs file, if it exists.

    Do not put system-wide settings in here; put them in the
    /etc/mail/spamassassin directory.  This file is just a template,
    which will be copied to a user's home directory for them to
    change.

  - $USER_HOME/.spamassassin:

      User state directory.  Used to hold spamassassin state, such
    as a per-user automatic whitelist, and the user's preferences
    file.

  - $USER_HOME/.spamassassin/user_prefs:

      User preferences file.  If it does not exist, one of the
    default prefs file from above will be copied here for the
    user to edit later, if they wish.

    Unless you're using spamd, there is no difference in
    interpretation between the rules file and the preferences file, so
    users can add new rules for their own use in the
    "~/.spamassassin/user_prefs" file, if they like.  (spamd disables
    this for security and increased speed.)

OK. It looks like we should create a directory .spamassassin in every user directory, and in /etc/skel/ too.
  - $USER_HOME/.spamassassin/bayes*

    Statistics databases used for Bayesian filtering.  If they do
    not exist, they will be created by SpamAssassin.

    Spamd users may wish to create a shared set of bayes databases;
    the "bayes_path" and "bayes_file_mode" configuration settings
    can be used to do this.

    See "perldoc sa-learn" for more documentation on how
    to train this.

File Locations:
SpamAssassin will look in a number of areas to find the default
configuration files that are used.  The "__*__" text are variables which
get set during installation.  You can see their values by viewing the
first several lines of the "spamassassin" or "spamd" scripts.

  - Distributed Configuration Files
        '__def_rules_dir__'
        '__prefix__/share/spamassassin'
        '/usr/local/share/spamassassin'
        '/usr/share/spamassassin'

  - Site Configuration Files
        '__local_rules_dir__'
        '__prefix__/etc/mail/spamassassin'
        '__prefix__/etc/spamassassin'
        '/usr/local/etc/spamassassin'
        '/usr/pkg/etc/spamassassin'
        '/usr/etc/spamassassin'
        '/etc/mail/spamassassin'
        '/etc/spamassassin'

  - Default User Preferences File
        '__local_rules_dir__/user_prefs.template'
        '__prefix__/etc/mail/spamassassin/user_prefs.template'
        '__prefix__/share/spamassassin/user_prefs.template'
        '/etc/spamassassin/user_prefs.template'
        '/etc/mail/spamassassin/user_prefs.template'
        '/usr/local/share/spamassassin/user_prefs.template'
        '/usr/share/spamassassin/user_prefs.template'


After installation, try "perldoc Mail::SpamAssassin::Conf" to see what
can be set. Common first-time tweaks include:

  - required_hits

    Set this higher to make SpamAssassin less sensitive.
        If you are installing SpamAssassin system-wide, this is
        **strongly** recommended!

        Statistics on how many false positives to expect at various
        different thresholds are available in the "STATISTICS.txt" file in
        the "rules" directory.

  - subject_tag

        When rewrite_subject is on, the subject stamp is *****SPAM*****.
        This can be used to change it.

  - ok_locales

    If you expect to receive mail in non-ISO-8859 character sets (ie.
    Chinese, Cyrillic, Japanese, Korean, or Thai) then set this.


Learning
--------

Since SpamAssassin now includes a Bayesian learning filter (in version
2.50 on), it is worthwhile training SpamAssassin with your collection of
non-spam and spam, if possible.  This will make it more accurate for your
incoming mail.  Do this using the "sa-learn" tools, like so:

    sa-learn --spam ~/Mail/saved-spam-folder
    sa-learn --nonspam ~/Mail/inbox
    sa-learn --nonspam ~/Mail/other-nonspam-folder

Use as many mailboxes as you like.  Note that SpamAssassin will remember
what mails it's learnt from, so you can re-run this as often as you like.
Bayesian filtering means I can tell it to look at two mailboxes, one full of spam and the other full of non-spam.  I investigate later how this works with Maildir mailboxes.

For now I want to do some basic configuration, make it run from my Maildrop .mailfilter file and check that I can detect with Maildrop which messages SpamAssassin deems to be spam.

Be sure to read this stuff - there's lots there on how it can be configured to work with remote services for real-time black-hole approaches (identifying spam by the IP address of the machine which sent it) and by comparing an analysis of the spam with constantly updated databases of spam, some of which can be fed with manually or automatically detected spams to keep the database up to date.  

There is also a config file generator: http://www.yrex.com/spam/spamconfig.php .  I gave it a spin.  I think it is a good idea!  I turned off the subject tag and Bayes auto-learning.  I selected English as the language I am expecting non-spam messages in.  Here is the file this site generated:
# SpamAssassin config file for version 2.5x
# generated by http://www.yrex.com/spam/spamconfig.php (version 1.01)

# How many hits before a message is considered spam.
required_hits           5.0

# Whether to change the subject of suspected spam
rewrite_subject         0

# Text to prepend to subject if rewrite_subject is used
subject_tag             *****SPAM*****

# Encapsulate spam in an attachment
report_safe             1

# Use terse version of the spam report
use_terse_report        0

# Enable the Bayes system
use_bayes               1

# Enable Bayes auto-learning
auto_learn              0

# Enable or disable network checks
skip_rbl_checks         0
# Note: the next three should probably be 1!  See below.
use_razor2              0
use_dcc                 0
use_pyzor               0

# Mail using languages used in these country codes will not be marked
# as being possibly spam in a foreign language.
# - english
ok_languages            en

# Mail using locales used in these country codes will not be marked
# as being possibly spam in a foreign language.
ok_locales              en
I copy this to /etc/mail/spamassassin/local.cf . (Actually, the next day I found I hadn't, so the results reported below were for the default config file, which had no active lines.)  Later I added some lines to alter the scores for some of the Bayes outcomes.
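Since an empty or comment-only local.cf silently leaves the defaults in force, it can be worth checking that the file in place really has active settings.  A sketch, assuming a POSIX grep; count_active_lines is a made-up helper name.

```shell
# Count the active lines of a config file, i.e. those which are neither
# blank nor comments.  A result of 0 means the file changes nothing.
# Usage: count_active_lines FILE
count_active_lines() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1" || true
}

# Intended use (not run here):
#   count_active_lines /etc/mail/spamassassin/local.cf
```

For the default file shown earlier, where every line is a comment, this prints 0.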

SA returns a new message, with the spam as an attachment

When the message is not deemed to be spam, some headers are added to it, briefly indicating the tests it matched.  These include an indication of the probability the Bayes system gave that it was spam, as a percentage rounded to 10%.  This is the original message, with extra headers.

There is a change from SA 2.42 to 2.55.  By default (or maybe it had no other way of working) SA 2.42, for those messages it decided were spam, would produce as its output the original message, with all its headers, but with new headers about spam status and optionally a detailed report of the tests etc. in the headers or the body.  This meant that just by looking at the message, you could see all the original headers where we expect them in any message.  But this has some disadvantages:
  1. This alters the headers of the spam, potentially confusing any person or program which scrutinises them.
  2. Depending on what changes SA made to the body of the message, the message may still be active in terms of Javascript etc. and instantly viewed and activated by looking at the message in an email client.
  3. If the message is passed through another spam detection system it may be recognised as spam again.
SA 2.55 has a setting report_safe { 0 | 1 | 2 } (default: 1) (See:  http://spamassassin.org/doc/Mail_SpamAssassin_Conf.html ).  In the default mode, the message it produces is a fresh one, but with the original Subject: and Sender:  Apart from this, there is nothing in the headers of the new message which has anything to do with the original message.  But the entire original message is reproduced as an attachment in the new message.  Settings 1 and 2 are two different ways of making the attachment.  Setting 0 is not to do this - to revert to what I described above.  This report_safe 1 arrangement means that the resulting message is "safe to handle" and that all the headers of the original mail are preserved within it, unchanged.

I am using SpamAssassin as a standalone program, to be run, under a Perl interpreter, every time it is needed, which means once for every message.  It is also possible to run it as a daemon, and launch only a small client for each message.  This is much more efficient and would be good for busy mail servers.  The daemon approach involves some restrictions on per-user options.

I think I am ready to try SpamAssassin.

I copy my subjadd program (see page from parent directory on my mods to Maildrop for the source of subjadd ) to /usr/bin .

I create a mailbox "-SPAM".  (This can be done with an email client, or Courier IMAP's maildirmake program.  This mailbox should be added to /etc/skel/ too.)

I create the following test ~/.mailfilter file for a user account - to test Maildrop calling SpamAssassin and subjadd.
logfile "mailfilter-log.txt"

  LMB="Maildir"

xfilter "/usr/bin/spamassassin -x"

                                # Watch out for header line added by Spamassassin.
                                # Don't allow any blank lines after the
                                # if statement!
if (  /^X-Spam-Flag: YES/  )
{
  log "------------------------------------------------------------- Spam general. "

  cc "Maildir/.-SPAM"               #  Make this "cc" for copy or "to" to not send it to Inbox.
  DELTAG=1
  xfilter "subjadd ~~~[SPAM]"
  to "$LMB"
}

                                # The Anomy Sanitizer stuff goes in here when we are
                                # ready to test it too.

log "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - No match."

                                    # Deliver to the Inbox.
to "Maildir"


The .mailfilter file must have its owner and group set to the ID of the account, and it must have its permissions set to: rw- --- --- (octal 600).
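As an illustration only (this is not part of the original setup), the permission requirement above can be checked with a short Python sketch.  The function name mailfilter_problems and its messages are made up for this example.  The only things taken from the text are the owner requirement and the rw- --- --- (octal 600) mode.  The group is not checked here.

```python
import os
import stat


def mailfilter_problems(path, expected_uid=None):
    """Return a list of problems with a .mailfilter file (empty if none).

    Hypothetical helper: it checks only what the text requires, namely
    permissions rw- --- --- (octal 600) and, optionally, the owner.
    """
    info = os.stat(path)
    problems = []

    mode = stat.S_IMODE(info.st_mode)  # keep the permission bits only
    if mode != 0o600:
        problems.append("permissions are %03o, expected 600" % mode)

    if expected_uid is not None and info.st_uid != expected_uid:
        problems.append(
            "owner uid is %d, expected %d" % (info.st_uid, expected_uid)
        )

    return problems


if __name__ == "__main__":
    path = os.path.expanduser("~/.mailfilter")
    if os.path.exists(path):
        for problem in mailfilter_problems(path, expected_uid=os.getuid()):
            print(problem)
```

An empty list means the file passes both checks.  It says nothing about syntax errors inside the file, which still have to be found via /var/log/maillog.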

I send some ordinary mail to the account and check it gets through.  If nothing arrives, I look at /var/log/maillog to see what was wrong.  It is very easy to make syntax errors in .mailfilter .    My  most common mistake is to have a blank line after an if(   ) statement.   The second most common mistake is to forget to use \ to continue a statement over multiple lines.

A simple text message comes through with SpamAssassin adding header lines:
X-Spam-Status: No, hits=-0.1 required=5.0
        tests=USER_AGENT_MOZILLA_UA,X_ACCEPT_LANG
        version=2.55
X-Spam-Level:
X-Spam-Checker-Version: SpamAssassin 2.55 (1.174.2.19-2003-05-19-exp)
The .mailfilter if( ) statement does not match this, so this message passes straight through to the Inbox.  The header line "X-Spam-Flag: YES" only appears for messages which cross whatever threshold SpamAssassin is configured with.
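For anyone who wants to inspect these headers with a script rather than by eye, here is a minimal Python sketch which pulls the verdict, score, threshold and test names out of an X-Spam-Status value in the SA 2.55 format shown above.  The function name parse_spam_status is invented for this example, and the pattern is based only on the two header samples on this page, so other SpamAssassin versions may need a different one.

```python
import re


def parse_spam_status(value):
    """Parse an X-Spam-Status header value as written by SA 2.55.

    Returns (is_spam, hits, required, tests).  The value may still
    contain the line breaks of a folded header.
    """
    # Join folded continuation lines into one line with single spaces.
    flat = " ".join(value.split())

    m = re.match(r"(Yes|No),\s*hits=(-?[\d.]+)\s+required=(-?[\d.]+)", flat)
    if m is None:
        raise ValueError("unrecognised X-Spam-Status value: %r" % value)

    is_spam = m.group(1) == "Yes"
    hits = float(m.group(2))
    required = float(m.group(3))

    # The test names sit between "tests=" and " version=" (or the end).
    tests = []
    t = re.search(r"tests=(.*?)(?:\s+version=|$)", flat)
    if t:
        tests = [name.strip() for name in t.group(1).split(",") if name.strip()]

    return is_spam, hits, required, tests


if __name__ == "__main__":
    sample = (
        "No, hits=-0.1 required=5.0\n"
        "        tests=USER_AGENT_MOZILLA_UA,X_ACCEPT_LANG\n"
        "        version=2.55"
    )
    print(parse_spam_status(sample))
```

The .mailfilter test on "X-Spam-Flag: YES" remains the simplest way to act on the result.  This sketch is only for looking at the numbers afterwards.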

I needed to make a spammy test message . . .   I looked at the tests SpamAssassin does by default:
http://au.spamassassin.org/tests.html
and came up with a message:
Subject: BUY TEST SPAM NIGERIA VIAGRA SEX FREE

NIGERIA VIAGRA SEX ENLAGEMENT FREE REMOVE SALE PENIS $$$ BUY GUARANTEE
MASS EMAIL OPPORTUNITY HERBAL DIPLOMA WORK AT HOME STOCK AUCTION

This scores 10.0 - exactly twice the threshold!

The headers SpamAssassin added were:
X-Spam-Flag: YES
X-Spam-Status: Yes, hits=10.0 required=5.0
        tests=CASHCASHCASH,GUARANTEE,OPPORTUNITY,SUBJ_ALL_CAPS,SUBJ_BUY,
        SUBJ_FREE_CAP,SUBJ_VIAGRA,USER_AGENT_MOZILLA_UA,
        WORK_AT_HOME,X_ACCEPT_LANG
        version=2.55
X-Spam-Level: **********
X-Spam-Checker-Version: SpamAssassin 2.55 (1.174.2.19-2003-05-19-exp)

SpamAssassin put the original message in an attachment which appears OK on Netscape 7.  Before that attachment is a text report by SpamAssassin listing all the message's crimes and misdemeanours:
This mail is probably spam.  The original message has been attached
along with this report, so you can recognize or block similar unwanted
mail in future. See http://spamassassin.org/tag/ for more details.

Content preview: NIGERIA VIAGRA SEX ENLAGEMENT FREE REMOVE SALE PENIS
$$$ BUY GUARANTEE MASS EMAIL OPPORTUNITY HERBAL DIPLOMA WORK AT HOME
STOCK AUCTION [...]

Content analysis details: (10.00 points, 5 required)
USER_AGENT_MOZILLA_UA (0.0 points) User-Agent header indicates a non-spam MUA (Mozilla)
X_ACCEPT_LANG (-0.1 points) Has a X-Accept-Language header
SUBJ_VIAGRA (2.6 points) Subject includes "viagra"
SUBJ_BUY (1.9 points) 'Subject' starts with Buy, Buying
SUBJ_FREE_CAP (0.7 points) Subject contains "FREE" in CAPS
GUARANTEE (1.8 points) BODY: Contains word 'guarantee' in all-caps
WORK_AT_HOME (0.5 points) BODY: Information on how to work at home (1)
OPPORTUNITY (1.5 points) BODY: Gives information about an opportunity
SUBJ_ALL_CAPS (1.1 points) Subject is all capitals
CASHCASHCASH (0.0 points) Contains at least 3 dollar signs in a row
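The per-test points in this report add up to the total in the header.  As a quick check, the following Python sketch parses the report lines above and sums them.  The names REPORT, rule_points and total_points are invented for this example, and the line pattern is assumed from this one report only.

```python
import re

# The "Content analysis details" lines from the report above.
REPORT = """\
USER_AGENT_MOZILLA_UA (0.0 points) User-Agent header indicates a non-spam MUA (Mozilla)
X_ACCEPT_LANG (-0.1 points) Has a X-Accept-Language header
SUBJ_VIAGRA (2.6 points) Subject includes "viagra"
SUBJ_BUY (1.9 points) 'Subject' starts with Buy, Buying
SUBJ_FREE_CAP (0.7 points) Subject contains "FREE" in CAPS
GUARANTEE (1.8 points) BODY: Contains word 'guarantee' in all-caps
WORK_AT_HOME (0.5 points) BODY: Information on how to work at home (1)
OPPORTUNITY (1.5 points) BODY: Gives information about an opportunity
SUBJ_ALL_CAPS (1.1 points) Subject is all capitals
CASHCASHCASH (0.0 points) Contains at least 3 dollar signs in a row
"""

# A test name, then "(N points)" where N may be negative.
POINTS_RE = re.compile(r"^(\w+) \((-?\d+(?:\.\d+)?) points\)")


def rule_points(report_text):
    """Return a list of (test_name, points) pairs found in the report."""
    pairs = []
    for line in report_text.splitlines():
        m = POINTS_RE.match(line.strip())
        if m:
            pairs.append((m.group(1), float(m.group(2))))
    return pairs


def total_points(report_text):
    """Sum the points, rounded to hide floating point noise."""
    return round(sum(points for _, points in rule_points(report_text)), 2)


if __name__ == "__main__":
    print(total_points(REPORT))  # compare with hits=10.0 in the header
```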
I was already forwarding my real incoming spam-infested email to this test account, and before long, an "Unlock your career potential" spam  arrived, was awarded 9.50 points by SpamAssassin and promptly turfed into the spam pit!

So on this basic level SpamAssassin is working!

There are no doubt some real performance issues with SpamAssassin, especially running RBL checks and doing what I imagine is the really complex work of Bayesian analysis of the spam according to a large database.

Now to how to train its Bayesian heuristics . . .

Bayesian Training of SpamAssassin from Maildir mailboxes

There are various finer points of choosing spam and non-spam messages to train SpamAssassin on. I won't try to discuss them here.  This page:
http://au.spamassassin.org/doc/sa-learn.html
suggests several thousand emails are needed of each type.  1000 of each is good - there's no point going beyond 5000.  I have turned off auto-learning.  While I can imagine a system teaching itself to better recognise whatever it already scores in a simple numeric way, this sounds like a dodgy, error-prone approach.

Be careful that you train from the same source -- for example, if you train on old spam, but new ham mail, then the classifier will think that a mail with an old date stamp is likely to be spam.

Indeed, "supervised learning" is the way to go!

This page discusses using messages from an old and currently unused (by real messages) account as a handy source of spam - a "spam trap".  I have such an account and I already filter it into a mailbox of its own.  15 spams a day for two months - no proper messages.  Maybe I should close it.  

To make sure a mailbox of suspected spam contains no legitimate emails, I looked at the mailbox with Netscape 7, sorting on Subject and looking at the Subject and Sender.  I viewed the bodies of just a few messages which were not obviously spam from the Subject and Sender.  

I found 1832 messages from my Inbox over 6 months, after removing those over 200 kbytes, and carefully removing a few spams which still remained there.  These are largely personal and business emails, but also a number of less regular list things, which I haven't bothered to sort automatically.  There are also invoices - all sorts of stuff.  Whatever it is, this stuff is not spam to me.   I did include mailing list messages, since SpamAssassin is filtering my mail after the mailing list messages have been dealt with. I put them in mailbox SA-OK.

My 100% pure recent spam collection at the end of May 2003 is as follows.  The note at the end of each item ("I used ...") indicates the time period I selected from it, so that combining them gives a single corpus (I like this medical term for a stinking mass such as this) of about 1800 spam messages, about the same number as the non-spam messages:
  1. 943 - 2 months worth of the old email address, which is all over an old website.  I used 3 weeks.
  2. 5664 - 8 months worth - detected by SpamAssassin 2.42 with a threshold of 5.  I used 3 weeks.
  3. 1065 - 7 months worth - not detected by SpamAssassin 2.42 and manually turfed into a separate pit.  These are the ones which SA scored too low for the threshold of 5.  These messages are the ones I want the Bayesian filtering to take special note of, because that will reduce the number of messages I need to manually turf.  They also include a recent (in May 2003) rash of short (<1k) messages which say very little and are not selling anything - I guess they are the product of some virus.  I used 5 months.
I combined these into a mailbox SA-Spam .

The aim of this selection was a similar number of spam messages to non-spam messages, covering about the same time, or maybe somewhat shorter (to make it more focused on current spam trends).
I found that in order to get sa-learn to work I had to create this directory and empty file in the user account:
~/.spamassassin/user_prefs
This needs to go in /etc/skel/ too.

As the user, rather than root, I gave the command from the home directory:
sa-learn --ham  --showdots --dir ~/Maildir/.SA-OK/cur
sa-learn doesn't specifically know about a Maildir - all we need to do is tell it which subdirectory in the maildir the message files can be found in.

The /cur/ directory in a Maildir is where all the messages are after an IMAP client has been told they exist.  Those not yet seen by a client are in /new/ .  So in order to make any user-driven command more robust, we might want to do the command for both of these directories.  
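As a sketch of that idea, the Python below builds one sa-learn command for each of the cur and new subdirectories which actually exist in a mailbox.  It only prints the commands and does not run them.  The function names are made up for this example.  The sa-learn options are the same ones used on this page (--ham, --spam, --showdots, --dir), and the mailbox path is just the SA-OK mailbox from above.

```python
import os


def maildir_message_dirs(folder):
    """Return the cur/ and new/ subdirectories of a Maildir folder
    which exist, in that order."""
    dirs = []
    for sub in ("cur", "new"):
        path = os.path.join(folder, sub)
        if os.path.isdir(path):
            dirs.append(path)
    return dirs


def sa_learn_commands(folder, kind):
    """Build one sa-learn argument list per message directory.

    kind is "ham" or "spam".  Nothing is executed here.
    """
    if kind not in ("ham", "spam"):
        raise ValueError("kind must be 'ham' or 'spam'")
    return [
        ["sa-learn", "--" + kind, "--showdots", "--dir", path]
        for path in maildir_message_dirs(folder)
    ]


if __name__ == "__main__":
    mailbox = os.path.expanduser("~/Maildir/.SA-OK")
    for command in sa_learn_commands(mailbox, "ham"):
        print(" ".join(command))
```

The printed commands can be pasted into a shell as the user, or passed to subprocess.run() once they look right.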

Activate Bayesian filtering, resetting of the database etc. via email commands?

One interesting use of Maildrop filtering would be to have a special, secret, set of words to put in the Subject or body of a message to cause .mailfilter to activate the sa-learn program on the spam and non-spam mailboxes.  This would require that some process be launched outside Maildrop, so as not to cause Maildrop to wait for this to complete.  Maildrop can call shell scripts, so this should be pretty easy.  The last time I wanted to do this I used a line in .mailfilter of the form:
xfilter "/home/blah/xyz.sh& cat"
to call a shell script and return immediately.  I can't remember why I used the "cat".

sa-learn took a long time to run.  The test machine is a Pentium Pro 233 (RH9.0 - and later with RH7.2 on a Pentium-III "Celeron" 824 MHz machine) and sa-learn ran with about 80% CPU on this 15 Meg mailbox . . . it slowly used more RAM, up to about 16 Megs.  It took about 20 minutes on a PPro 223 MHz and 9 min 30 secs on a Celeron 824 MHz.  Showing dots is helpful!  It seems that the file ~/.spamassassin/bayes_msgcount has a length equal to the number of messages read.

I think that running sa-learn would be a headache on any shared mail server in a busy office.  Maybe prefixing it with a nice command would help.

Next I trained it on the spam mailbox:
sa-learn --spam --showdots --dir ~/Maildir/.SA-Spam/cur
The memory usage quickly got to about 16 Megs and after maybe 150 messages shot up to about 29 Megs or so.  It took about 22 minutes on a PPro 223 MHz and ~12 min 40 secs on a Celeron 824 MHz.

After this, the main database file ~/.spamassassin/bayes_toks was 5.2 Megabytes.


It seems we are now in an arms race.  Spammers' attacks force us to devote some of our precious cogitations to write, install, configure and train SpamAssassin.  This involves the application of significant CPU power - in order that we may retain our own personal space which the spammers are constantly trying to invade.  Still, I would much rather have software and silicon to keep the barbarians at bay than have to deal with them myself physically.  In our ancestral past the major problem was the Hun appearing at sparrows-fart with clubs, spears and burning torches, bent on murder, pillage and rape.  So I guess it is progress to only have to worry about spammers!  I think we can truly say that computer power is cheap now.  In May 2003, I see 2 GHz CPUs for AUD$150 and 512 Meg DDR RAM for AUD$90.

It will take some experience to see how well this works.  Without Bayesian heuristics, the older SpamAssassin was correctly labelling about 84% of spams (with a threshold of 5) and missing about 16%, 5 a day or so.   It falsely labelled maybe 4 emails in 7 months.  (A week later, with Bayes and the standard scoring for Bayes - which I later changed - SA 2.55 found 259 spams and missed 4.  This is a 98.5% success rate!!!)
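The 98.5% figure is simply the spams found divided by all the spams which arrived.  A tiny Python sketch of the arithmetic, with a made-up function name:

```python
def detection_rate(caught, missed):
    """Percentage of spams caught, out of all spams which arrived."""
    total = caught + missed
    if total == 0:
        raise ValueError("no spam messages counted")
    return 100.0 * caught / total


if __name__ == "__main__":
    # 259 spams found and 4 missed by SA 2.55 with Bayes, as above.
    print("%.1f%%" % detection_rate(259, 4))
```

This measures only missed spam.  Legitimate mail falsely labelled as spam needs to be counted separately.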

. . . The next morning . . .   I am running my incoming mail through the older SpamAssassin 2.42 on one machine and the new 2.55 version on another.  Four spams which were undetected by 2.42 were successfully detected by 2.55.  Another two spams undetected by 2.42 arrived while I was writing up the first four - and one was not detected by 2.55. Here are some details:

Subject: Over 1 Billion Movies, Music, and Videos to Download Now (5k)
From: We Share
SA 2.42
X-Spam-Status: No, hits=4.4 required=5.0
tests=CARRIAGE_RETURNS, CLICK_BELOW, LARGE_HEX, MSG_ID_ADDED_BY_MTA_3, REMOVE_PAGE, SPAM_PHRASE_03_05, USER_AGENT_OE, WEB_BUGS

SA 2.55

Content analysis details:   (9.00 points, 5 required)
LARGE_HEX (1.4 points) BODY: Contains a large block of hexadecimal code
HTML_LINK_CLICK_HERE (0.1 points) BODY: HTML link text says "click here"
HTML_60_70 (0.1 poin