Date: 01/16/10; Version 8
I wrote one of the earliest e-mail programs in 1965. As the story says, even back then we worried that someone might spam, though we didn't call it that.
Spam has become an increasing annoyance; I get over a thousand spam messages a day. The approach I have evolved eliminates almost all of them, but requires some attention on my part, and needs to be updated as the spammers' techniques change.
Mail to several domains is sent to my mail account. The mail provider uses qmail to deliver mail. I use a feature of qmail to process each message with procmail filters including SpamBouncer. The result of this process is one of:
The logic for deciding which action to take is complex and uses several tools.
The first prerequisite for spam filtering is to introduce the filter into the mail processing pipeline. My mail is hosted at Pair Networks Inc, a large, programmer-friendly, efficient hosting provider. Pair provides several spam filtering features that operate in its mail server: greylisting, virus rejection, black hole list rejection, and bad address rejection.
Next, Pair runs SpamAssassin on the remaining mail to mark suspected spam. This is a free tool that looks for spam characteristics.
Pair then uses qmail to deliver mail to me or bounce mail to invalid addresses. The .qmail files that control delivery contain the equivalent of:
(Preline is a qmail-ism that adds a Delivered-To header describing the envelope recipient.)
For an ISP that uses sendmail instead of qmail, the initial hook would use a .forward file with similar contents. If I were processing mail from an account where I could not insert the hook, I would investigate using fetchmail to pull the mail down to my machine, and then processing it locally. (fetchmail is open source software available for Linux, Unix, and Mac OS X.)
The rest of my mail processing is done with procmail, a fine but quirky free program for Unix and Linux for processing mail. It interprets its own language that specifies what to do when mail arrives. My mail handling is embodied in a .procmailrc program I wrote myself.
First, I discard mail from senders on a "quick discard" list, so that obvious repeat spammers are filtered early and cheaply. The list contains a few hundred sender names whose mail I discard right away. I add senders to the list when I notice big bursts of traffic from the same spammer not caught by other filters.
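The quick discard test can be sketched in a few lines of Python. This is an illustration only, not the procmail recipe itself; the addresses in QUICK_DISCARD and the function name should_quick_discard are hypothetical, and the sketch assumes the list is matched against the From header.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical entries; the real list holds a few hundred sender names.
QUICK_DISCARD = {"bulk@spam.example", "offers@junk.example"}

def should_quick_discard(raw_message, discard_list=QUICK_DISCARD):
    """Return True if the From address is on the quick discard list."""
    msg = message_from_string(raw_message)
    # parseaddr splits "Name <addr>" into (name, addr).
    _, address = parseaddr(msg.get("From", ""))
    return address.lower() in discard_list
```

Because the lookup is a set membership test, it costs almost nothing per message, which is the point of running it before the slower filters.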
With the ISP's permission, I run a low-volume forwarding service for a few people listed on the Multicians list. The forwarding is done by a simple procmail include file that checks the X-Delivered-To header (the envelope address, which may be different from the To field) and forwards it if there is a match. This include file is generated automatically from the Multicians database when it is updated.
Multicians therefore get the worst of the spam filtered out, but I avoid the chance of a false positive. I don't want to do any filtering that would involve me in hanging on to mail for others, or reading their mail to see if it is spam, or guaranteeing that I will check things every day. Their mail does get the SpamAssassin markings, so they can filter spam when they receive it.
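The forwarding check described above amounts to a table lookup on the envelope recipient. Here is a minimal Python sketch of the idea, assuming the X-Delivered-To header holds a single address; the FORWARDS table and the function name forward_target are hypothetical stand-ins for the generated include file.

```python
from email import message_from_string

# Hypothetical forwarding table; the real one is generated from a database.
FORWARDS = {"alice@lists.example": "alice@home.example"}

def forward_target(raw_message, forwards=FORWARDS):
    """Return the forwarding address for the envelope recipient, or None."""
    msg = message_from_string(raw_message)
    # Use the envelope address, which may differ from the To field.
    recipient = (msg.get("X-Delivered-To") or "").strip().lower()
    return forwards.get(recipient)
```

Note that the To header is never consulted, so mail that reaches the address through a blind copy or a list is still forwarded.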
The next step I take (still in procmail) is to examine the mail headers and look for the delivery header added by Pair's server, so I can find the mail server that delivered to it, called the "last hop." Pair's server has already looked up the DNS name of the sending server based on the IP address of the mail server that contacted it. A large amount of spam these days is sent from mail servers whose name either can't be looked up, or looks like a dialup or cable modem address. I use a list of patterns that catch most unacceptable sending server names. Recently, I have seen a lot of spam sent from the same server IP, with different names, so I have added the ability to blacklist a spammer's mail server by IP range. I also check against a list of valid mail servers and senders, and block mail with forged "From" addresses. This check is similar to what the "Sender Permitted From" (SPF) proposal will do, if it is implemented: instead of waiting for senders to publish their records, I have built my own small database of the results this facility would provide.
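To make the last-hop test concrete, here is a rough Python sketch. It assumes the host name and IP address have already been extracted from the delivery header, and that a failed reverse lookup shows up as an empty name or the word "unknown". The patterns, the blocked network, and the name last_hop_verdict are all hypothetical; real pattern lists are longer and tuned by hand.

```python
import ipaddress
import re

# Hypothetical patterns for names that look like dialup or cable modem hosts.
BAD_NAME_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (
    r"\b(dialup|dsl|cable|dhcp|ppp)\b",
    r"\d+[.-]\d+[.-]\d+[.-]\d+",  # an IP address embedded in the host name
)]

# Hypothetical blocked range (a reserved documentation network).
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]

def last_hop_verdict(hostname, ip):
    """Classify the last-hop server as 'blocked', 'suspect', or 'ok'."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return "blocked"
    if not hostname or hostname.lower() == "unknown":
        return "suspect"  # reverse DNS lookup failed
    if any(p.search(hostname) for p in BAD_NAME_PATTERNS):
        return "suspect"
    return "ok"
```

Checking the IP range first means a spammer who rotates host names on one server is still caught.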
The next filtering step I perform is to run The SpamBouncer to classify each message. SpamBouncer is a free tool distributed by Catherine Hampton that recognizes spam by using a variety of patterns. SpamBouncer is a big set of procmail files, so I include it from my main procmail script. (Catherine has not updated SpamBouncer in years, but it still works OK for me with only a few patches.)
SpamBouncer does the following:
The result of running SpamBouncer is some headers added to the mail message, and some flag values I can test in the procmail program.
If none of the classification steps above has tagged a message as virus, spam, possible spam, bad sender, etc., procmail delivers it to the appropriate mailbox. If it was tagged, the message is delivered to one of the "hold" mailboxes that I check occasionally for false positives.
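The final delivery decision can be pictured as a small lookup. In this Python sketch the tag names, mailbox names, and the function choose_mailbox are hypothetical, and the priority order is an assumption; the real logic lives in procmail flags.

```python
# Hypothetical tag-to-mailbox table, checked in priority order.
HOLD_MAILBOXES = (
    ("virus", "hold-virus"),
    ("spam", "hold-spam"),
    ("possible-spam", "hold-maybe"),
    ("bad-sender", "hold-sender"),
)

def choose_mailbox(tags, default="inbox"):
    """Pick a mailbox from the set of tags the earlier filters attached."""
    for tag, mailbox in HOLD_MAILBOXES:
        if tag in tags:
            return mailbox  # the first matching tag decides
    return default  # untagged mail goes to its normal mailbox
```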
The system, as described, works without much oversight. I can check the contents of the "hold" mailboxes to see that everything is running well. The "hold" mailboxes sometimes contain false positives; I rescue these and add the correspondent to the whitelist. I also look at the logs to spot obvious spammers and add them to the quick discard list, in order to speed up mail processing and avoid forwarding their mail to my colleagues.
In August 2004, I got over 542,000 mail messages, 99% spam or virus. Much of this mail was sent to nonexistent addresses by "dictionary attacks." In September 2004, I changed processing to reject mail to invalid addresses before all other filters, and got only 23,240 messages, 62% spam or virus. (See "Bouncing Spam" below.) Pair Networks instituted further improvements in August 2005, cutting the remaining spam by a third. Here is a graph of about 10 years' traffic.
Understand that the bars aren't exactly comparable, since both the spammers' behavior and the spam filters evolved during this time. The filters got better, and the spam tactics got more aggressive. Much of the forwarded mail was spam, as I mentioned above. The sharp jump in June, July, and August 2004 is due to spammers' adoption of dictionary attacks, and the drop in September came from adopting SMTP reject. Here is a graph of last month's spam intercepts:
Every day, I get some spam that is caught by only one filter. I get a few false negatives a month, i.e. spam not caught by any filter. Usually there is a way I can teach my system to reject future mails from such a source. I see a few pieces of mail a month that get false positives, i.e. good mail classified as spam. Personal mail in this category is usually the result of the reverse DNS trap; commercial mail that gets falsely identified as spam is usually caught by SpamBouncer's content heuristics.
Suppose you want to do something similar, but you don't have the ability to install filters at your mail provider, or alter the mail server's behavior? You may be able to set up adequate filtering that runs all on your own machine, at the cost of downloading a lot of garbage bytes that are then discarded by your filters.
The simplest thing to do is to use the filtering available in your mail client. Macintosh OS X comes with the Mail.app mail client and incorporates some kind of spam filtering. (I find that it indicates false positives on mail not directly addressed to me, such as newsletters.) Eudora provides Bayes spam filtering; Sylpheed, which I used on Linux boxes, has an option to check messages with SpamAssassin. If you
you'll probably get spam down to a tolerable level.
On a Mac OS X, Linux, or FreeBSD machine, you should be able to use fetchmail to pull your mail from your ISP(s) and pass it to procmail using the -m argument. procmail could then filter using SpamAssassin, procmail rules, and SpamBouncer just as I do, and deliver to local mailboxes. I haven't tried this, though.
There is no central solution to the spam problem. Only you know what kind of mail you don't like, or whether you know a certain mail sender. The Internet uses "end to end" protocols, meaning that the network just moves the bits, and the endpoints apply the interpretation. That's why it works so well and grew so quickly. The downside is that the endpoints have to have the intelligence.
Laws have been passed against spamming. Most are unenforceable: spammers send mail from places where it's legal, or they forge their headers to hide their tracks. Furthermore, the lawmakers have to struggle with how to define spam, and privacy and freedom advocates worry that some government agency may gain the power to censor messages it doesn't like.
Some big companies profit from spam and try to prevent or cripple laws against spam. In 2005, Ryan Hamlin, head of Microsoft's Technology Care and Safety Group, spoke out against New Zealand's proposed anti-spam legislation, warning that it could impinge on 'the amazing vehicle of e-mail marketing'.
Some people are so frustrated by spam that they advocate the murder of spammers. Robert Bruce Thompson, in his weblog at http://www.ttgnet.com/daynotes/2003/2003-35.html#Wednesday says:
The only effective way I can see to kill the spam is to kill the spammers, or at least enough of them that the others take notice and stop their behavior.
Similarly, in Senate hearings in May 2003, FTC Commissioner Orson Swindle said, "What we need are a couple of good hangings."
I don't think this scales right. After the killers finish with the spammers, they'll go after the people with car alarms, and then ugly shoes, bad haircuts, wrong kind of ammunition. Sooner or later they'll find something they object to about everybody.
(If I had known, back in 1965, that writing the MAIL command for CTSS would lead to death threats and DDoS attacks, I think I would have written some other program.)
Some folks send a bounce message, even if they think a message is spam. There are several reasons why discarding such messages is better.
Far better than bouncing or discarding is to do an SMTP reject. To do this, your mail server has to know which envelope addresses are accepted, and reject incoming mail to invalid addresses rather than accepting it. Usually, this means that your ISP has to provide this feature. The sender's mail server never even gets a chance to send the contents of the mail, so this method saves lots of bandwidth. Regular users who accidentally mistype an address get a failure message, but it's generated by their sending server. Spammers get the reject and go on to the next victim. As of September 1, 2004, I switched over to this method for mail not addressed to any user at my site, and spam flow and processing load have decreased dramatically.
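The decision the mail server makes during an SMTP reject can be sketched as follows. This is a simplified Python illustration, not any real server's code: the address set and the function name rcpt_response are hypothetical, and the exact reply text varies between servers.

```python
# Hypothetical set of envelope addresses the site accepts.
VALID_RECIPIENTS = {"postmaster@example.org", "user@example.org"}

def rcpt_response(envelope_recipient, valid=VALID_RECIPIENTS):
    """Return the SMTP reply to a RCPT TO command."""
    if envelope_recipient.strip().lower() in valid:
        return "250 OK"
    # A 5xx reply is a permanent failure: the sender never gets to
    # transmit the message body, and its own server reports the error.
    return "550 5.1.1 No such user"
```

Since the refusal happens before the DATA phase, a dictionary attack costs the receiving site only a few bytes per guessed address.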
Copyright (c) 2003-2013 by Tom Van Vleck