2023-12-01; Version 11

How I Filter Spam

Tom Van Vleck

I wrote one of the earliest e-mail programs in 1965. As the story says, even back then we worried that someone might spam, though we didn't call it that (the term wasn't applied to messages until 1993, based on the 1970 Monty Python Spam skit).

Spam is not just an annoyance: it is often used to send "phishing" messages that try to take over computers. Hundreds of spam messages are sent to me every day, but only one or two a week get through. The tools I have evolved eliminate almost all spam, but I have to pay attention and update my rules when spammers' techniques change.

My Spam Filters

Mail to several domains is sent to my mail account at an ISP. I process each message with custom-made filters. The result of this process is one of:

The logic for deciding which action to take is complex and uses several tools.

Processing Incoming Mail

The first prerequisite for spam filtering is to hook my filters into the mail processing pipeline. My mail is hosted at Pair Networks Inc, a large, programmer-friendly, efficient hosting provider. Pair provides several spam filtering features that operate in its mail handling pipeline: virus rejection, black hole list rejection, and bad address rejection.

Next, Pair runs SpamAssassin on the remaining mail to mark suspected spam. This is a free tool that looks for spam characteristics.

Pair adds an X-Delivered-To header to each mail message describing the envelope recipient, and then delivers mail to addresses on my account, or bounces mail to invalid addresses.

In my Pair email account configuration, I specify that my mail should be processed by the procmail utility.

Quick Checks in Procmail

procmail is a fine but quirky free program for Unix and Linux for processing mail. It interprets its own language that specifies what to do when mail arrives. My mail handling is embodied in a .procmailrc program I wrote myself starting in 1997. It calls various other little helper programs I wrote in Perl.

First, for efficiency, I filter out obvious repeat spammers early: I discard mail from senders on a "quick discard" list. The list contains a few hundred sender names whose mail I discard right away. I add senders to the list when I notice big bursts of traffic from the same spammer that other filters did not catch.
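As an illustration only, here is a minimal Python sketch of the quick discard idea. The real filter is a procmail recipe with Perl helpers; the addresses, the set name, and the function name below are hypothetical.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical entries; the real list holds a few hundred sender names.
QUICK_DISCARD = {"bulk@spammer.example", "offers@junk.example"}

def should_quick_discard(raw_message, discard_list=QUICK_DISCARD):
    """Return True if the From address is on the quick discard list."""
    msg = message_from_string(raw_message)
    # parseaddr splits 'Name <addr>' into (name, addr).
    _, address = parseaddr(msg.get("From", ""))
    return address.lower() in discard_list
```

A set lookup keeps this check cheap, which is the point of running it before the slower filters.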

Forwarding with Procmail

With the ISP's permission, I run a low-volume forwarding service for a few people listed on the Multicians list. The forwarding is done by a simple procmail include file that checks the X-Delivered-To header (the envelope address, which may be different from the To field) and forwards it if there is a match. This include file is generated automatically from the Multicians database when it is updated.

Multicians therefore get the worst of the spam filtered out, but I avoid the chance of a false positive. I don't want to do any filtering that would involve me in hanging on to mail for others, or reading their mail to see if it is spam, or guaranteeing that I will check things every day. Their mail does get the SpamAssassin markings, so they can filter spam when they receive it.
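The forwarding match described above can be sketched in Python as follows. This is an assumption-laden illustration, not the generated procmail include file: the table contents and the function name are hypothetical, and only the use of the X-Delivered-To header comes from the text.

```python
from email import message_from_string

# Hypothetical table; the real one is generated from the Multicians database.
FORWARDS = {"alice@lists.example": "alice@home.example"}

def forward_target(raw_message, forwards=FORWARDS):
    """Return the forwarding address for the envelope recipient, or None."""
    msg = message_from_string(raw_message)
    # Match on the envelope recipient, which may differ from the To field.
    envelope = msg.get("X-Delivered-To", "").strip().lower()
    return forwards.get(envelope)
```

Keying on the envelope address means that mail sent via Bcc or a mailing list is still forwarded to the right person.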

More Quick Checks in Procmail

Next, I block mail from senders whose top-level domain is not on my list. Hundreds of new TLDs have recently been approved, such as .CLICK and .TIRES. My experience so far is that these domains are used for nothing but spam. Similarly, most mail from addresses in certain countries has never been anything but spam for me. I have a whitelist of friends who use uncommon addresses: all other mail is saved in a mailbox I can glance at before I empty it. Blocking mail from unwanted TLDs eliminates 40% of my incoming mail.
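A rough Python sketch of the top-level domain check, under stated assumptions: the allowed TLDs, the whitelist entry, and the function name are all made up for illustration, and the real check is done in procmail.

```python
from email.utils import parseaddr

# Hypothetical lists for illustration.
ALLOWED_TLDS = {"com", "org", "net", "edu", "gov"}
WHITELIST = {"friend@shop.click"}

def blocked_by_tld(from_header):
    """Return True if the sender's TLD is not allowed and the sender is not whitelisted."""
    _, address = parseaddr(from_header)
    address = address.lower()
    if address in WHITELIST:
        return False
    # Everything after the last dot is treated as the top-level domain.
    tld = address.rpartition(".")[2]
    return tld not in ALLOWED_TLDS
```

In this sketch a blocked message would go to a mailbox for later review rather than being deleted, as the text describes.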

The next step I take (still in procmail) is to examine the mail headers and look for the delivery header added by Pair's server, so I can find the sending mail server that delivered to it, called the "last hop." A large amount of spam is sent from mail servers whose name either can't be looked up, or looks like a dialup or cable modem address. I check the sending server against a list of patterns that catch many unacceptable servers, by name or IP range. I also check against a list of valid mail servers and senders, and block mail with forged "From" addresses. This check is similar to what the "Sender Permitted From" (SPF) proposal would do, if it is implemented: instead of waiting for senders to publish their records, I built my own small database of the results this facility would provide.
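To make the "last hop" idea concrete, here is a hedged Python sketch. It assumes a Received header of the common shape "from HELO (RDNS [IP]) by ...", which is not guaranteed, and the suspect-name patterns and function names are hypothetical stand-ins for the real pattern list.

```python
import re

# Assumed header shape: "from helo.name (rdns.name [192.0.2.1]) by ..."
RECEIVED_RE = re.compile(
    r"from\s+(\S+)\s+\((\S*)\s*\[(\d{1,3}(?:\.\d{1,3}){3})\]\)")

# Hypothetical patterns for names that look like dialup or cable modem hosts.
SUSPECT_NAME_RE = re.compile(r"dialup|dsl|cable|dhcp|dynamic|pool", re.IGNORECASE)

def last_hop(received_header):
    """Extract HELO name, reverse DNS name, and IP of the delivering server."""
    m = RECEIVED_RE.search(received_header)
    if not m:
        return None
    helo, rdns, ip = m.groups()
    return {"helo": helo, "rdns": rdns or None, "ip": ip}

def last_hop_is_suspect(received_header):
    hop = last_hop(received_header)
    if hop is None:
        return True  # assumption: treat an unparseable header as suspect
    if hop["rdns"] is None or hop["rdns"].lower() == "unknown":
        return True  # server name could not be looked up
    return bool(SUSPECT_NAME_RE.search(hop["rdns"]))
```

A real list would also match IP ranges and consult the table of known-good servers before deciding.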

The geographic location of the sending mail server is also an indicator of whether to trust a message. I use the free geolocation database from MaxMind to determine the country code associated with sending servers. Mail from some countries is likely spam, so I can mark it "possible spam." Same for messages in character encodings for languages I can't read, and messages with certain phrases in the subject.
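These "possible spam" markers can be sketched in Python like this. The country code is assumed to have been looked up already (for example from the geolocation database) and is simply passed in; the charset names, subject phrases, placeholder country code, and function name are hypothetical.

```python
from email import message_from_string

# Hypothetical values for illustration only.
SUSPECT_COUNTRIES = {"XX"}  # placeholder code, not a real country
UNREADABLE_CHARSETS = {"gb2312", "big5", "koi8-r", "euc-kr"}
SUBJECT_PHRASES = ("act now", "you have won")

def possible_spam_reasons(raw_message, country_code=None):
    """Return the list of reasons a message should be marked 'possible spam'."""
    msg = message_from_string(raw_message)
    reasons = []
    if country_code and country_code.upper() in SUSPECT_COUNTRIES:
        reasons.append("country")
    charset = msg.get_content_charset()  # lowercased charset, or None
    if charset and charset in UNREADABLE_CHARSETS:
        reasons.append("charset")
    subject = msg.get("Subject", "").lower()
    if any(phrase in subject for phrase in SUBJECT_PHRASES):
        reasons.append("subject")
    return reasons
```

Returning the reasons rather than a bare yes/no makes it easier to see later which rule caught a message.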

Classifying With SpamBouncer

The next filtering step I perform is to run The SpamBouncer to classify each message. SpamBouncer is a big set of procmail files that I include from my main procmail script. SpamBouncer is a free tool created by Catherine Jefferson about 2001 that recognizes spam by using a variety of patterns. (Catherine has begun updating SpamBouncer again as of 2015. Most recent update was version 3.0 in 2017. Version 2.3 still works OK for me with only a few patches, but I will look into the new one. It's slow though, which is one reason to do quick checks first.)

SpamBouncer does the following:

The result of running SpamBouncer is some headers added to the mail message, and some flag values I can test in the procmail program.

Deliver or Hold in Procmail

If none of the classification steps above have tagged a message as virus, spam, possible spam, bad sender, etc, procmail delivers it to the appropriate mailbox. If it was tagged, the message is delivered to "hold" mailboxes configured in my procmail script.
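As a simple illustration of this final decision, the following Python sketch maps classification tags to a mailbox. The mailbox names, the priority order, and the function name are assumptions, not the contents of the actual procmail script.

```python
def choose_mailbox(tags, default_mailbox="inbox"):
    """Pick a destination mailbox from the set of tags earlier steps attached."""
    # Checked in order, so the most serious tag decides; names are hypothetical.
    hold_boxes = [
        ("virus", "hold-virus"),
        ("spam", "hold-spam"),
        ("possible spam", "hold-possible"),
        ("bad sender", "hold-badsender"),
    ]
    for tag, box in hold_boxes:
        if tag in tags:
            return box
    # Untagged mail is delivered normally.
    return default_mailbox
```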

Watching the Logs

The system works without much oversight. I check the contents of the "hold" mailboxes, usually every day, to see that everything is running well. The "hold" mailboxes sometimes contain false positives; I rescue those messages and add the correspondent to the whitelist. I also look at the logs to spot obvious spammers and add them to the quick discard list, in order to speed up mail processing and avoid forwarding their spam to my colleagues.

Results

In August 2004, I got over 542,000 mail messages, 99% spam or virus. Much of this mail was sent to nonexistent addresses by "dictionary attacks." In September 2004, I changed processing to reject mail to invalid addresses before all other filters (SMTP reject), and got only 23,240 messages, 62% spam or virus. (See "Bouncing Spam" below.) Pair Networks instituted further improvements in August 2005, cutting the remaining spam by a third. Here is a graph of about 20 years' traffic.

Stacked bar chart of mail classification for the past year

Stacked bar chart of mail classification by month by type

Understand that the bars aren't exactly comparable, since both the spammers' behavior and the spam filters evolved during this time. The filters got better, and the spam tactics got more aggressive. Much of the forwarded mail was spam, as I mentioned above. The sharp jump in June, July, and August 2004 is due to spammers' adoption of dictionary attacks, and the drop in September came from adopting SMTP reject. Here is a graph of last month's spam intercepts:

Pie chart of spam by type

Every day, I get some spam that is caught by only one filter. I get a few false negatives a month, i.e. spam not caught by any filter. Usually there is a way I can teach my system to reject future mails from such a source. I see a few pieces of mail a month that get false positives, i.e. good mail classified as spam. Personal mail in this category is usually the result of the reverse DNS trap; commercial mail that gets falsely identified as spam is usually caught by SpamBouncer's content heuristics.

Pair used to employ "greylisting" in its mail processing. Its mail servers would keep a list of senders accepted in the past. If a new sender was discovered, the mail server would send a reject message and add the sender to its list. Normally functioning sending servers would get the reject message and retry sending the mail. At one time, many spammers wouldn't bother to retry, so this was a cheap filter for spam. (The cost was that mail sent from a new sender would be delayed for some time, usually 15 minutes or so.) Over time, this tactic stopped being effective: spammers caught on. So Pair removed the greylisting logic.
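For readers unfamiliar with greylisting, here is a minimal Python sketch of the general technique. It is not Pair's implementation: the class name, the minimum delay, and keying on the sender alone are assumptions (production greylisting often keys on sending IP, sender, and recipient together).

```python
import time

class Greylist:
    """Temporarily reject mail from new senders; accept once they retry later."""

    def __init__(self, min_delay_seconds=900, clock=time.time):
        self.min_delay = min_delay_seconds  # assumed retry window
        self.clock = clock                  # injectable for testing
        self.first_seen = {}

    def check(self, sender):
        now = self.clock()
        # Record the first time we saw this sender, if it is new.
        first = self.first_seen.setdefault(sender, now)
        if now - first >= self.min_delay:
            return "250 accept"
        # 451 is a temporary failure, so a well-behaved server will retry.
        return "451 try again later"
```

A legitimate server retries after its own queue interval and gets through; a spammer that never retries is filtered at almost no cost.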

I get a lot of lame mail messages, for example mail with a subject in Spanish sent from a .de address (Germany), via a mail server in Russia.

Adapting My Approach

Suppose you want to do something similar, but you don't have the ability to install filters at your mail provider, or alter the mail server's behavior? You may be able to set up adequate filtering that runs all on your own machine, at the cost of downloading a lot of garbage bytes that are then discarded by your filters.

The simplest thing to do is to use the filtering available in your mail client. The Mail.app mail client on Macintosh OS X incorporates some spam filtering. (I find that it indicates false positives on mail not directly addressed to me, such as newsletters.) Other mail clients are also available for OS X. If you

you'll probably get spam down to a tolerable level.

On a Mac OS X, Linux, or FreeBSD machine, you should be able to use fetchmail to pull your mail from your ISP(s) and pass it to procmail using the -m argument. procmail could then filter using SpamAssassin, procmail rules, and SpamBouncer just as I do, and deliver to local mailboxes. I haven't tried this though.

Using Other Services

Some users prefer to use spam handling services provided by others.

Approaches That Won't Work

There is no central solution to the spam problem. Only you know what kind of mail you don't like, or whether you know a certain mail sender. The Internet uses "end to end" protocols, meaning that the network just moves the bits, and the endpoints apply the interpretation. That's why it works so well and grew so quickly. The downside is that the endpoints have to have the intelligence.

Laws

Laws have been passed against spamming. Most are unenforceable: spammers send mail from places where it's legal, or they forge their headers to hide their tracks. Furthermore, the lawmakers have to struggle with how to define spam, and privacy and freedom advocates worry that some government agency may gain the power to censor messages it doesn't like.

Some big companies profit from spam and try to prevent or cripple laws against spam. In 2005, Ryan Hamlin, head of Microsoft's Technology Care and Safety Group, spoke out against New Zealand's proposed anti-spam legislation, warning that it could impinge on 'the amazing vehicle of e-mail marketing'.

Death Threats

Some people are so frustrated by spam that they advocate the murder of spammers. In Senate hearings in May 2003, FTC Commissioner Orson Swindle said, "What we need are a couple of good hangings."

I don't think this scales right. After the killers finish with the spammers, they'll go after the people with car alarms, and then ugly shoes, bad haircuts, wrong kind of ammunition. Sooner or later they'll find something they object to about everybody.

(If I had known, back in 1965, that writing the MAIL command for CTSS would lead to death threats and DDoS attacks, I think I would have written some other program.)

Bouncing Spam

Some folks send a bounce message, even if they think a message is spam. There are several reasons why discarding such messages is better.

Far better than bouncing or discarding is to do an SMTP reject. To do this, your mail server has to know which envelope addresses are accepted, and reject incoming mail to invalid addresses rather than accepting it. Usually, this means that your ISP has to provide this feature. The sender's mail server never even gets a chance to send the contents of the mail, so this method saves lots of bandwidth. Regular users who accidentally mistype an address get a failure message, but it's generated by their sending server. Spammers get the reject and go on to the next victim. As of September 1, 2004, I switched over to this method for mail not addressed to any user at my site, and spam flow and processing load have decreased dramatically.
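The decision a mail server makes for an SMTP reject can be sketched in Python as below. This is an illustration of the idea only: the recipient list and function name are hypothetical, and in practice the check lives inside the ISP's mail server rather than in user code.

```python
# Hypothetical set of envelope addresses that the site accepts.
VALID_RECIPIENTS = {"tom@example.org", "webmaster@example.org"}

def rcpt_to_response(recipient, valid=VALID_RECIPIENTS):
    """Answer an SMTP RCPT TO command before any message data is sent."""
    if recipient.strip().lower() in valid:
        return (250, "OK")
    # 550 is a permanent failure: the sending server must report the error
    # to its own user, and the message body is never transferred.
    return (550, "No such user here")
```

Because the refusal happens at the RCPT TO stage, a dictionary attack costs only a few bytes per attempt instead of a full message.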

References