Procmail recipe condition lines and regular expressions

There are a few things about conditions in procmail recipes that are not obvious from the procmailrc manpage, but that you should know if you want to understand how they work.

By default, the regular expressions in recipe condition lines are matched against the header part of an email. If you look at the raw source of an email, you'll see something like this:
From bob@example.com Tue Sep 04 10:58:53 2007
Return-path: <bob@example.com>
Envelope-to: alice@example2.com
Delivery-date: Tue, 04 Sep 2007 10:58:53 +0200
Received: from mail.example3.com ([192.168.0.1])
        by mail.example2.com with esmtp (Exim 3.36 #1 (Debian))
        id 1ISoS0-0001Ex-00
        for <alice@example2.com>; Wed, 05 Sep 2007 10:58:53 +0200
Received: from mail.example.com ([192.168.0.2]) by mail.example3.com with Microsoft SMTPSVC(6.0.3790.3959);
         Tue, 4 Sep 2007 10:58:52 +0200
Received: from [127.0.0.1] ([192.168.0.3]) by mail.example.com over TLS secured channel with Microsoft SMTPSVC(6.0.3790.3959);
         Tue, 4 Sep 2007 10:58:52 +0200
Message-ID: <46DD1E79.90004@example.com>
Date: Tue, 04 Sep 2007 10:58:49 +0200
From: joe@example.com
User-Agent: Thunderbird 1.5.0.13 (X11/20070824)
MIME-Version: 1.0
To: john@example2.com
Subject: teszt
Content-Type: text/plain; charset=ISO-8859-2; format=flowed
Content-Transfer-Encoding: quoted-printable

This is the body of the email.

You might have noticed that some headers span multiple lines (e.g. the "Received" headers in the example above). Procmail merges these continuation lines before it runs any recipes on the header, so the first Received: header looks like this as the input of a recipe (the following is a single line; it is only wrapped by the website's content rendering engine):
Received: from mail.example3.com ([192.168.0.1])         by mail.example2.com with esmtp (Exim 3.36 #1 (Debian))         id 1ISoS0-0001Ex-00         for <alice@example2.com>; Wed, 05 Sep 2007 10:58:53 +0200

There's another detail that you should be aware of: during the merge, the newline at the end of each line is not simply removed but replaced by a space character. In addition, each continuation line of the Received: header in the raw email started with a tab character, and the merge leaves that tab unchanged!
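The merge described above can be imitated outside procmail to see exactly which whitespace ends up in the joined line. The following Python sketch is only a simulation of the behaviour described here, not procmail's actual code. The function name `unfold_header` and the shortened sample header are made up for illustration, and the continuation line is assumed to start with a tab.

```python
def unfold_header(raw: str) -> str:
    """Join a folded header field: every newline becomes one space,
    and the whitespace that began the continuation line is kept."""
    return raw.replace("\n", " ")


# Shortened, hypothetical version of the first Received: header.
raw = (
    "Received: from mail.example3.com ([192.168.0.1])\n"
    "\tby mail.example2.com with esmtp"
)

merged = unfold_header(raw)
print(repr(merged))
# 'Received: from mail.example3.com ([192.168.0.1]) \tby mail.example2.com with esmtp'
```

Note the " \t" between the IP address and "by": one space that replaced the newline, followed by the original tab.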

So if you're going to do precise matching in regexps (by which I mean not just putting .* between the fixed strings of the regexp, but using a more exact whitespace pattern), do not forget to account for the tab character too. Unfortunately, procmail regular expressions support neither character classes (e.g. [:space:]) nor some common escape sequences (e.g. \t for a tab). You have to use the literal characters, so a tab must be the actual character with ASCII code 0x09.
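Because the tab has to be a real 0x09 character, an editor that silently expands tabs to spaces will break such a pattern without any visible difference. A small Python sketch for checking a pattern string outside procmail; the helper name `describe_whitespace` is invented for this example.

```python
def describe_whitespace(text: str) -> list:
    """Name every whitespace character found in text, in order."""
    names = {" ": "space", "\t": "tab"}
    return [names.get(ch, hex(ord(ch))) for ch in text if ch.isspace()]


# A bracket expression meant to match "space or tab":
print(describe_whitespace("[ \t]"))   # ['space', 'tab']

# The same expression after an editor expanded the tab to spaces:
print(describe_whitespace("[    ]"))  # ['space', 'space', 'space', 'space']
```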

The procmail-lib Debian package has a huge number of very advanced recipes that you can use as learning material. It contains a pm-javar.rc file with a lot of predefined variables that you can use in place of character classes.

For example, whitespace (both space and tab characters) can be matched in recipes like this:
WSPC = " 	"        # one space followed by one literal tab (0x09)
SPC = "[$WSPC]"
s = "$SPC"

:0
*$ ^Received:$s*from$s+mail\.example3\.com$s*\(\[192\.168\.0\.1\]\)$s*by$s+mail\.example2\.com
some_mailbox

The example recipe above matches because we used the proper whitespace regexp everywhere. Had we used " *" instead of "$s*" after the IP address part, the regexp would not match because of the tab in the merged header line.
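The difference between " *" and "$s*" can be reproduced outside procmail. This is a minimal Python sketch, assuming that Python's `re` module treats these simple constructs (literals, bracket expressions, `*`, `+`, `^`) the same way procmail does; the two engines are not identical in general. The merged header is a shortened, hypothetical version of the one above.

```python
import re

# Merged header: the newline became a space, the tab that began the
# continuation line is still there.
merged = "Received: from mail.example3.com ([192.168.0.1]) \tby mail.example2.com"

s = "[ \t]"  # plays the role of $s: exactly one space or one tab

# Whitespace handled with s everywhere, like the recipe above.
good = rf"^Received:{s}*from{s}+mail\.example3\.com{s}*\(\[192\.168\.0\.1\]\){s}*by{s}+mail\.example2\.com"

# Same pattern, but only spaces are allowed after the IP address.
bad = rf"^Received:{s}*from{s}+mail\.example3\.com{s}*\(\[192\.168\.0\.1\]\) *by{s}+mail\.example2\.com"

matches_good = re.search(good, merged) is not None
matches_bad = re.search(bad, merged) is not None
print(matches_good, matches_bad)  # True False
```

The second pattern fails because " *" consumes the space but then meets the tab where it expects "by".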

Of course, a condition on Received: headers is not very common, but Subject:, To: and Cc: headers can easily grow long enough to be split into multiple lines.

Comments

SpamAssassin whitelisting by "Received" headers

The point of my experiments with procmail recipes that have conditions on "Received" lines was to ensure that incoming mail is not filtered through SpamAssassin at all if it came from my address and was sent from localhost. I wanted this because the AWL (auto-whitelist) feature of SpamAssassin pushes the score of my own mail (sent by me to myself) very high and thus marks it as spam. The reason is that I receive a lot of real spam with my own address forged as the sender.

Actually, SpamAssassin already has a config option that does just what I did with the procmail recipe. It's called whitelist_from_rcvd. You have to put it into your user config ($HOME/.spamassassin/user_prefs) and specify the mail address and a pattern that has to match the relay server's reverse DNS. Thus, if your own mail always arrives through localhost, you could whitelist it with this config:
whitelist_from_rcvd your_address@example.com localhost

In some cases localhost may be identified as localhost.localdomain, so you might want to add that too. This works pretty well if you use e.g. webmail on the same server that your email is delivered to.
