From Adam Katz


Ending spam

Anti-spam work (or "spam fighting") falls into two categories: traditional filtering on the receiving side (the "gatekeeper"), to which this page is devoted, and actively pursuing spammers to stop spam at its source. I know how to stop spammers and end the war on spam. Blue Security almost had it. See my proposal for Ending spam on its own page.

Most of my work has been in filtering spam as a gatekeeper, which itself breaks into two categories: client-side and server-side. I think filtering is best handled at the MX (server-side), and this page is mostly devoted to server-side filtering. If you are interested in client-side filtering, I recommend the latest learning algorithms and a few weeks' worth of patience as you train them.


SpamAssassin is a best-of-breed spam detector that checks for known patterns against a very strong database, supplemented by probabilistic Bayesian spam filtering and online indices (e.g. DNSBLs like Spamhaus and hash filters like DCC) populated by spam reports and spam traps. With these tests, SpamAssassin assigns a score representing how much the message looks like spam.

SpamAssassin is king, especially with Bayes trained and turned up. My other primary line of defense within SA is the RBLs. I report to SpamCop and KnujOn, which makes SpamCop's realtime blacklist (RBL) far more accurate for my data, while both groups chase spammers at the source so that spam is never sent in the first place. I have increased the RBL scores in SA as well. I deploy SA at two thresholds: mark and delete. Mail scoring under 5 points is delivered normally. From 5 to 8 points it is marked and defanged (forced to safe character sets, disclaimer prepended to body, subject altered to reflect spamminess). At 8+ points it is rejected at SMTP time, so the sending relay will send a bounce to the true sender, rather than my MX sending a bounce email to a forged sender. A third threshold at 12 points exists for reporting spam automatically.
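The thresholds above can be sketched as a small decision function. This is only an illustration of the policy described here, not code from my deployment; the function name and the action labels are hypothetical.

```python
def disposition(score, mark_at=5.0, reject_at=8.0, report_at=12.0):
    """Map a SpamAssassin score to the actions described above.

    The action labels are illustrative names, not SpamAssassin settings.
    """
    if score < mark_at:
        return ["deliver"]
    if score < reject_at:
        # Marked mail is still delivered, but defanged first.
        return ["mark", "defang", "deliver"]
    actions = ["reject_at_smtp"]
    if score >= report_at:
        # Third threshold: confident enough to report automatically.
        actions.append("auto_report")
    return actions


print(disposition(3.2))   # ['deliver']
print(disposition(6.0))   # ['mark', 'defang', 'deliver']
print(disposition(13.1))  # ['reject_at_smtp', 'auto_report']
```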

Custom SA hacks

My SA hacks are now mostly contained in my sa-update channels (easy installation, and updates are built in!)

sa-update channels

My sa-update channels are signed with a dedicated channel key, E8B493D6. I also suggest using a few SARE channels and SpamAssassin lead designer Justin Mason's sought channel. Run the following on the SpamAssassin system:

wget -qO - \ |sudo sa-update --import -

You can get a list of imported keys with the command sudo gpg --keyring /etc/spamassassin/sa-update-keys/pubring.gpg --list-public-keys (the sudo may not be needed e.g. if you are already root).

Put the following in your /etc/spamassassin/sa-update-keys.txt file (the directory may differ):


Now put the following in your /etc/spamassassin/sa-update-channels.txt file:

As a side note, the following UNSUPPORTED channels are mildly useful (while also mildly dangerous due to their age), especially if your SpamAssassin is out of date. They are mostly useless in 3.2.0 or higher. The only supported SARE channel is 2tld (above). Any channel not listed here ranges (in my opinion) from somewhat dangerous to tremendously dangerous.

Do NOT use SARE's 70_sc_top200 or the openprotect syndication of SARE's rules. They are dangerously out of date (by several years!) and will unfairly block legitimate senders. khop-sc-neighbors provides the same functionality (and then some) of 70_sc_top200 but is up to date.

This goes in your nightly (or hourly) cronjob (I intend to patch Debian's /etc/cron.daily/spamassassin to better facilitate this):

cd /etc/spamassassin
sa-update --channelfile sa-update-channels.txt --gpgkeyfile sa-update-keys.txt

My channels follow (note that khop-lists is not included above):

(a syndication of all khop-* channels, including the risky khop-lists)

a collection of extremely effective blocklists that don't come with SpamAssassin (yet), plus a few rules to ensure that the total score from blocklists doesn't single-handedly take out a message (many blocklists share data or allow user submissions, so this overlap can be unjust). This is easily the most effective (and safest!) addition you can have to your SpamAssassin configuration. Some of the lists added to SpamAssassin 3.2.5 have since been published with 3.3.0 proper. This channel still offers them for older versions.

negative rules for encryption and signed messages. Note that this does nothing to actually verify anything. If spammers start abusing this, we can consider wasting CPU cycles on verification (perhaps for the first message per sender/IP/key triplet, or a random 10%, etc.). Until then, this is good enough. khop-blessed rule performance is published with regular SpamAssassin Rule QA reports, with accidentally blessed spam kept to a reasonable minimum.

a collection of dynamic host detectors in the spirit of the old BOTNET plugin et al. khop-dynamic rule performance is published with the regular SpamAssassin Rule QA reports, with false positives well below 0.100% (very good). It will eventually be folded into an official release.

my general rules, mostly a collection of back-ports from newer versions and from trunk and minor rules.

tests that fire on list/newsletter/subscription-based mail, which could be molded into the bulk flag for my proposed #Solicited Bulk Realtime List. NOT FOR GENERAL DEPLOYMENT -- this channel will assign nontrivial points to ANY automated message, solicited or not!

rules that penalize mail originating in network blocks which issue large volumes of spam. This groups blocks by CIDRs /8, /16, /24, and /32 (a.k.a. Class A, B, C, and the famed SpamCop Top 200 list), examining each block by reported spam volume and number of reported spamming IP addresses. While initially named "sc" for SpamCop, it now uses several data sources to compile its rules. khop-sc-neighbors rule performance is published with the regular SpamAssassin Rule QA reports, with false positives all well below 0.100% (very good). I have a similar mechanism for increased greylisting times in milter-greylist that is a bit more dated.

(coming soon) a syndication of Malware Patrol's Block List, reformatted for a degree of efficiency (and more importantly, the ability to change the score in one line rather than in 1300 lines). I also downgrade the default score of 3.5 to 2.0. Caveat: the offending data is no longer available (though I don't think it's terribly important). My fully operational fetching script is called sa-malware, and it can go straight into /etc/cron.hourly on any Debian system.

SA utility scripts

Scripts and the like that I've written:

Other SpamAssassin tools

I also use the iXhash plugin with the following extra rule (which you can add to your file) to bump the scores up from 0.1 each. One further note: the plugin's code is a bit buggy and error-prone on smaller messages and is in need of major updates, but it is still quite worthwhile.

# Requires third-party plugin iXhash, see
rawbody __SIZE_UNDER_200 /^(?=.{0,200}$)/s # small bodies create false positives
# Use the union rather than tweaking each one and possibly going overboard.
describe IXHASH_CHECK BODY: MD5 checksum matches known spam
score IXHASH_CHECK    0 2 0 2

Other extremely useful plugins (which come with SpamAssassin but aren't enabled by default) include DCC (another hash-based checking system) and TextCat (a language detector).

Since the lastexternal check is limited to DNSBL lookups, this next rule requires knowledge of your MX records (which should be in internal_networks, by the way), and you should verify that you don't have any internal servers (or clients with outgoing servers) that are stuck in the stone age. Be sure to create a regular expression with your MX records' internal names (as placed in Received: headers).

The score came from a sampling of incoming spam and nonspam. Of 3811 nonspam messages evaluated, 49 (1.3%) used SMTP instead of ESMTP, the highest of which scored 3.4. I chose 0.75 as the score because it moves that message only about halfway from 3.4 to the 5.0 marking threshold. A sampling of 204 spams (139 of which were marked at 5.0+ points) consisted of 20% non-ESMTP transfers. Adding 0.75 to them would have marked an additional 4 (2%) and rejected (8.0+ points) another 8 (4%). 20% spam hits and 6% new flags are very large percentages for a single rule.

header KHOP_NO_ESMTP     Received =~ /\sby (?:regex for your MX records) ?\(.* with SMTP id/
describe KHOP_NO_ESMTP   Sending relay used SMTP instead of ESMTP
score KHOP_NO_ESMTP      0.75
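Before deploying the rule, it can help to try the pattern against a couple of Received: headers. The following is a minimal Python sketch, assuming a hypothetical MX name of the form mx1.example.com and made-up, single-line (unfolded) sample headers; substitute the pattern for your own MX records.

```python
import re

# Hypothetical MX host pattern; replace with one matching your own MX records.
MX_PATTERN = r"mx\d\.example\.com"

# Same shape as the KHOP_NO_ESMTP rule above, with the placeholder filled in.
NO_ESMTP = re.compile(r"\sby (?:" + MX_PATTERN + r") ?\(.* with SMTP id")


def used_plain_smtp(received_header):
    """Return True if our MX logged the hand-off as SMTP rather than ESMTP."""
    return NO_ESMTP.search(received_header) is not None


# Made-up sample headers for illustration only.
plain = ("from bot.example.net (bot.example.net [192.0.2.10]) "
         "by mx1.example.com (8.14.3/8.14.3) with SMTP id n1AB2cd3")
extended = ("from relay.example.org (relay.example.org [198.51.100.5]) "
            "by mx1.example.com (8.14.3/8.14.3) with ESMTP id n1AB2cd4")

print(used_plain_smtp(plain))     # True
print(used_plain_smtp(extended))  # False
```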

Spam reporting

I report spam to authorities that log, track, and shut down spammers. Their reporting goes as far as contacting network-level providers, other abuse teams, governments, and ICANN (who can revoke internet access), in addition to taking independent legal action on occasion. These include SpamCop, KnujOn, and the token indices Razor and DCC, which use loose checksums/tokens to identify spam in a manner somewhat similar to the learning algorithms discussed below. All trained spam (save that with confidential information, which should be all but non-existent) should be reported.

This has two primary effects, the less direct of which is that sometimes the upstream network administrator will shut off the spammer (this works wonderfully for zombie networks, misconfigured lists, and spammers that don't cover their tracks well). For example, KnujOn nails the top ten spammed registrars. KnujOn is now an official ICANN advisor, so lists like that one will result in the revocation of spammer-friendly registrars' IP blocks. The more direct effect is that SpamCop, DCC, and Razor all publish blacklists of the relays they receive reports for. These blacklists are significantly more accurate for those who report spam than for those who do not.


I highly value the DNS-based black and white lists for relays and URIs. SpamAssassin undervalues them, SpamCop's in particular, which is more accurate for me because I regularly report to it. I also use a few nondefault RBLs, and there are some DNS-based lists proposed below for bulk mailers (SBRL) and greylist compliance (DNSGL). It should be noted that no RBL is fault-proof, so no RBL should single-handedly (or even collectively) block an email, but points in SpamAssassin or use as a prior in non-naive Bayesian logic are great avenues.


Greylisting is a fantastically useful first line of defense, but its delays are quite annoying. Lists of problematic but legitimate relays are used to exempt them from greylisting. Its two merits are:

  1. Mail must be re-sent, so mail from non-persistent relays cannot be delivered. Such mail is exclusively spam, and this nips 80+% of my company's incoming mail without any need to allocate resources to fighting it (though if greylisting becomes more common, the spam bots will simply start resending).
  2. The delay from requiring a re-send gives time for spamtraps and manual reports to trickle in, bumping the scores on online databases such as SpamCop and other RBLs, DCC and other body content indices, and local Bayes databases. When the message is finally checked for spam, real spam is more likely to be flagged as such.

I use milter-greylist.

For the two merits (required resend, delayed scanning) there are two faults. The obvious fault is the delayed delivery. The second fault is that a small number of legitimate mail relay implementations are not greylisting-compliant, including at least one major player: Gmail. This is typically worked around by using the problematic relays' SPF entries to bypass greylisting altogether, as well as a well-maintained bypass list (I prefer "bypass list" to "white list" because the latter term has other applications). The big problem here is that there is no centralized place to access such a list.
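The core greylisting decision, including a bypass list, can be sketched in a few lines. This is a toy in-memory model under my own assumptions (keying on the relay/sender/recipient triplet, a 300 second delay, no expiry); it does not describe how milter-greylist is implemented.

```python
import time


class Greylist:
    """Toy greylist: tempfail the first attempt, accept a retry after `delay`."""

    def __init__(self, delay=300, bypass=()):
        self.delay = delay            # seconds a new triplet must wait (assumed value)
        self.bypass = set(bypass)     # relays known to be good but non-compliant
        self.first_seen = {}          # triplet -> time of first attempt

    def check(self, relay_ip, sender, recipient, now=None):
        now = time.time() if now is None else now
        if relay_ip in self.bypass:
            return "accept"
        key = (relay_ip, sender, recipient)
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.delay:
            return "accept"
        return "tempfail"             # SMTP 4xx: a compliant relay will retry later


g = Greylist(delay=300, bypass={"192.0.2.1"})
print(g.check("198.51.100.7", "a@example.net", "b@example.com", now=0))    # tempfail
print(g.check("198.51.100.7", "a@example.net", "b@example.com", now=400))  # accept
print(g.check("192.0.2.1", "c@example.net", "b@example.com", now=0))       # accept
```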

Discriminate against Windows

A minuscule percentage of SMTP relays in the world run Windows, but an overwhelming majority of zombie botnets do, and botnets overwhelmingly represent the non-greylisting-compliant spam-sending relays. For example, the Free Software Foundation uses greylisting on their incoming servers, but they only greylist Windows relays.

Enter p0f, a very quick and efficient passive OS fingerprinting system. I would like to use p0f in three capacities. First, as a greylisting mechanism: Windows relays get a longer delay by default. Second, as a SpamAssassin test: Windows relays get a few extra points, e.g. 0.7, and messages from Windows relays that do not also come from an MS Exchange-compatible client (e.g. MS Outlook) get a few more, e.g. 2.3. Third, an automated and maintained database of acceptable Windows servers is needed to prevent this from getting into trouble.

DNS Greylisting Bypass List

To address the issue of discriminating against legitimate mail relays that are non-greylisting-compliant, running Windows, or both, there should be an index. Following a DNSBL methodology (check data through a DNS lookup, response of notfound means no list entry and a response within means there is data, spelled out in the DNSBL's syntax and the IP details), we can create such a list and share it with others. There is no reason to have different lists for these two items, and we might as well do other things like test for open-relay status while we're here.
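For readers unfamiliar with the DNSBL methodology mentioned above, here is a minimal sketch of how a client would query such a list: the relay's IPv4 octets are reversed and prepended to the list's zone, and the name is resolved as an ordinary A record. The zone dnsgl.example.org is a placeholder, since no such list exists yet.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone="dnsgl.example.org"):
    """Build the DNS name to look up for an IPv4 relay (zone is a placeholder)."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def dnsbl_lookup(ip, zone="dnsgl.example.org"):
    """Return the A record answer as a string, or None if the relay is not listed."""
    try:
        return socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return None   # not found: no list entry


print(dnsbl_query_name("192.0.2.25"))  # 25.2.0.192.dnsgl.example.org
```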

We'll call this a DNS Greylisting [Bypass] List (DNSGL). It would have return codes like 127.a.b.c where a is block at 0-7 (0=open-relay, 4=zombie, 6=other-offense) and bypass at 8-11 and greylisting-compatible at 12-15. b is 0=non-windows, 1=windows, 2=sometimes-windows. c and the unspecified numbers in and out of the above ranges remain to be defined (c is always 1 for now).

  • = known open relay. block. (/24 is only non-windows)
  • = known open relay running windows. block.
  • = known open relay sometimes running windows. block.
  • = known zombie. block. (/24 is only non-windows)
  • = known zombie running windows. block.
  • = known zombie sometimes running windows. block.
  • = other known blockable offense. block. (/24 is only non-windows)
  • = other known blockable offense running windows. block.
  • = other known blockable offense sometimes running windows. block.
  • = known zombie or open relay or other problems on any OS. block.
  • ---
  • = relay is good, but NOT greylisting-compliant. bypass greylisting
  • = as above but windows-only. (/24 above is only non-windows)
  • = as above but variable OS.
  • = relay is good (and greylisting is okay). greylist as needed.
  • = relay is good, greylisting okay, server is windows.
  • = relay is good, greylisting okay, server has variable OS.
  • = relay is good, greylisting okay, server is windows or variable.
  • = relay is good (no information on greylisting or OS).
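A client could decode the proposed 127.a.b.c answers roughly as follows. This sketch uses only the ranges spelled out in the paragraph before the list (a: 0-7 block, 8-11 bypass, 12-15 greylisting-compatible; b: OS hint) and treats everything else as undefined; the function name and the dictionary layout are my own illustration.

```python
def decode_dnsgl(answer):
    """Decode a proposed DNSGL answer such as '127.4.1.1' into its parts."""
    parts = [int(p) for p in answer.split(".")]
    if len(parts) != 4 or parts[0] != 127:
        raise ValueError("not a DNSGL answer: %r" % answer)
    _, a, b, c = parts

    if 0 <= a <= 7:
        action = "block"
    elif 8 <= a <= 11:
        action = "bypass"      # good relay, but not greylisting-compliant
    elif 12 <= a <= 15:
        action = "greylist"    # good relay, greylist as needed
    else:
        action = "undefined"

    # Only the block reasons named in the proposal are mapped.
    reasons = {0: "open-relay", 4: "zombie", 6: "other-offense"}
    reason = reasons.get(a) if action == "block" else None
    os_hint = {0: "non-windows", 1: "windows", 2: "sometimes-windows"}.get(b, "undefined")
    return {"action": action, "reason": reason, "os": os_hint, "c": c}


print(decode_dnsgl("127.4.1.1"))
# {'action': 'block', 'reason': 'zombie', 'os': 'windows', 'c': 1}
```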

Servers within get confirmed every month or two, and others get updated daily. If a server is up and running a non-Windows OS that returns for three months, it is removed from the list.

This would involve p0f, an open relay detection script, some sort of zombie detection (perhaps you can assume a zombie based on it not serving on TCP ports 25, 587, 465, 143, 110, 993, and 995, and having no reverse DNS (or an obvious DHCP rDNS)?), an SMTP server with greylisting, a way to provoke email to test greylisting handling, and (of course) DNSBL software like spamikaze.


Nolisting is a poor man's greylisting. It requires nothing by way of serving software, and in its most basic form it requires only an IP address that you own but that will never be used (like the control address ending your IP range). The idea is very similar to greylisting (and can act in its stead or in conjunction with it): non-compliant mail servers (spam bots) may give up after trying just one MX record (and they often target the lowest priority MX record because its spam filters might be less up-to-date). The ideal nolisting implementation has a responsive server with port 25 closed (not filtered!) as the highest priority MX record (this makes failed connections very fast) and a non-existent server as the lowest priority MX record (which makes failed connections very slow). Slashdot had an article on nolisting back in 2007.


This builds on the secondary benefit of greylisting without the downside of delayed mail.

I have determined that the longer the delay before the scan, the more accurate the scan, especially with a global bayesian database and regular reporting to authorities like SpamCop and DCC. This becomes a convenience issue if you're preventing delivery. The solution: deliver the message, but perform those delayed scans from within the mailbox.

A modest proposal: spamass-imap, a script running as an IMAP client that re-scans unread mail and replaces it with updated results within each user's inbox. (Note, there is an access/password/privacy issue here.)

This means marked mail can become un-marked, un-marked spam can become marked, or either can actually be deleted (though the re-scan deletion threshold should be larger than 8 ... maybe 9.5, or maybe merely a move to the user's Junk folder). spamass-imap would run every fifteen minutes, each time re-scanning all unread messages in the inbox. Perhaps a message scoring negative points that is older than 30 minutes won't be rescanned, and messages over 24 hours old don't get rescanned at all.
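The rescan eligibility rules suggested above might look like this in practice. spamass-imap does not exist yet, so this is purely a sketch of the proposed policy; the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def should_rescan(unread, received_at, last_score, now):
    """Decide whether a delivered message is worth re-scanning (proposed policy)."""
    if not unread:
        return False                      # only unread mail is re-scanned
    age = now - received_at
    if age > timedelta(hours=24):
        return False                      # too old: never rescanned
    if last_score < 0 and age > timedelta(minutes=30):
        return False                      # confidently ham and past 30 minutes
    return True


now = datetime(2024, 1, 1, 12, 0)
print(should_rescan(True, now - timedelta(minutes=10), -1.0, now))  # True
print(should_rescan(True, now - timedelta(minutes=45), -1.0, now))  # False
print(should_rescan(True, now - timedelta(hours=3), 2.0, now))      # True
```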

Sadly, re-scanning of already-delivered mail is not as effective as it should be. This is mostly because nobody reports their spam. As more and more people report their spam, the blocklists and hash indices will become more and more well-populated, and then we can try implementing this. A large-scale operation like Gmail can actually use this concept in-house using their own users' data ... in fact, I'd be surprised if they don't already do this.

Solicited Bulk Realtime List

Another proposal, requiring some heavy-duty web crawling code on the server: the Solicited Bulk [DNS] Realtime [Black/White] List (SBRL) would determine the fairness of bulk messages regardless of whether they were solicited. Specifically: does unsubscribe work, is it confirmed opt-in, is the relay dedicated to bulk email, and at what frequency is mail sent? As a secondary (and free) feature, it would also know whether autosubscribed sister lists respect those unsubscribe requests and confirmed opt-in (as such things are assumed to be part of the primary list).

I propose a piece of software (SA plugin or not) that examines a message for a few key characteristics that indicate it is a bulk email (newsletter, announcement, automated notice, etc.), specifically excluding vacation messages, over-quota messages, and their ilk. These characteristics get summed up on the client side in a hash, which is then checked as a domain name in the same manner as a typical DNSBL. The real work is done by the SBRL server's web crawling bots and solicited-spam traps. The hash would also include the SMTP relay, and the hash itself would be optional, though far more informative.

The SBRL server software would be a pimped spamtrap based on user inputted entries. You submit a solicited bulk message by sending a sign-up link, and the bots determine whether the listing should be white or black.

DNS answers for known relays/hashes would come back as 127.a.b.c where a=0/1 if unsubscribe fails/succeeds, b=0/1 if confirmed-opt-in fails/succeeds, c=frequency where 1=24+/d, 2=15+/wk, 3=5+/wk ("daily"), 4=2+/wk, 5=3+/mo ("weekly"), 6=2+/mo, 7=1/mo ("monthly"), 8=1/7+wk ("bi-monthly"), 9=rarer. Add 20 to c if the relay also sends nonbulk email (anywhere!). Relay-only requests would have to abstract the information to the worst-case data held for it, and there should be a response for not-enough-data.
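Decoding such an answer on the client side could look like the sketch below. It follows the a/b/c definitions in the paragraph above literally (a=1 and b=1 mean success, 20 is added to c for relays that also send non-bulk mail) and treats any other frequency code as "not enough data"; the function name and output layout are illustrative assumptions.

```python
FREQUENCY = {
    1: "24+/day", 2: "15+/week", 3: "5+/week (daily)", 4: "2+/week",
    5: "3+/month (weekly)", 6: "2+/month", 7: "1/month (monthly)",
    8: "1 per 7+ weeks (bi-monthly)", 9: "rarer",
}


def decode_sbrl(answer):
    """Decode a proposed SBRL answer such as '127.1.1.25'."""
    parts = [int(p) for p in answer.split(".")]
    if len(parts) != 4 or parts[0] != 127:
        raise ValueError("not an SBRL answer: %r" % answer)
    _, a, b, c = parts
    also_nonbulk = c >= 20                 # 20 is added for mixed-use relays
    freq_code = c - 20 if also_nonbulk else c
    return {
        "unsubscribe_ok": a == 1,
        "confirmed_opt_in": b == 1,
        "also_sends_nonbulk": also_nonbulk,
        "frequency": FREQUENCY.get(freq_code, "not enough data"),
    }


print(decode_sbrl("127.1.1.25"))
# {'unsubscribe_ok': True, 'confirmed_opt_in': True,
#  'also_sends_nonbulk': True, 'frequency': '3+/month (weekly)'}
```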

Upon somebody entering the registration information on the SBRL's web site, a bot then crawls to the site and registers three unique (and unguessable) email addresses (we'll call them X, Y, and Z) for the service (if it fails, a trusted human should do this step).

X: the "welcome" message that comes back should be confirmed-opt-in, meaning it should require user action to complete the registration. X ignores this message. If X *ever* gets another message from *anybody*, the confirmed opt-in test has failed (so, per the definition of b above, the DNSBL returns 127.a.0.c).

Y confirms the welcome message (any hiccups should be verified by a trusted human) and starts examining the incoming mail. This feeds the f in 127.a.b.def, ranging from 1 (very frequent) to 9 (very infrequent) and starts at 0 until there is enough data. After a certain amount of time (a year?), Y should probably bounce all mail.

Z operates just like Y, but as soon as the first subscribed message comes in, it activates an autoresponder that answers messages from the list with "please unsubscribe this account or your message will be reported as spam. see this url: $URL" and answers all other mail with a random human-like unsubscribe request that does not cite the SBRL in any way. Z also has a bot that parses that first message for an unsubscribe link and tries its best to crawl through the unsubscription process online or via email as directed (this is in addition to the auto-reply). Anything that still arrives more than 48 business hours after the unsubscribe attempts is spam. Just one such message will result in the SBRL reporting an unsuccessful unsubscribe mechanism, so, per the definition of a above, a=0 (127.0.b.c).

  • = unsubscribe fails. spam.
  • = unsubscribe fails but confirmed opt-in worked (huh?). spam.
  • = unsubscribe is okay and can be trusted
  • = no confirmed opt-in, but a legit unsubscribable list
  • = legit list
  • = legit list, dedicated to bulk solicited mail
  • = legit list, also sends non-bulk mail

With a properly maintained SBRL, filters like spamassassin can skip processing on anything that returns since it is a legitimate bulk message. A response of (in conjunction with indicators that the message is in fact bulk) can be treated like your standard DNSBL, as can and (without bulk indicators).

Global Online Token Database

A Global Online Token Database (GOTDB) may become essential as we further need to address the issue of poorly-trained learning algorithms. Spam is spam, and so long as we discard headers, spam from differing sources is largely the same. This would largely look like the DCC, or perhaps offer just a base on top of which you should provide your own training data.

Learning algorithms

There are several learning algorithms out there. As originally outlined by Paul Graham's A Plan for Spam, today's spam filters (including Spamassassin) use Naive Bayesian detection on a bag-of-words model (often called "Idiot Bayes" by statisticians for its simplicity). No priors are used, and the word order is not taken into account (so e.g. "to prick a finger" can't be differentiated from "a larger prick").
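To make the bag-of-words point concrete, here is a deliberately tiny naive Bayes sketch. It is not Graham's exact formula or SpamAssassin's implementation: the whitespace tokenizer, the add-one smoothing, and the even starting odds (no prior) are simplifying assumptions of mine. Note that shuffling the words of a message cannot change its score.

```python
import math
from collections import Counter


def train(messages):
    """messages: iterable of (text, is_spam). Count messages containing each token."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        totals[is_spam] += 1
        counts[is_spam].update(set(text.lower().split()))
    return counts, totals


def spam_probability(text, counts, totals):
    """Combine per-token evidence, treating tokens as independent (the naive part)."""
    log_odds = 0.0  # no prior: start from even odds
    for token in sorted(set(text.lower().split())):  # a bag of words: order is discarded
        p_spam = (counts[True][token] + 1) / (totals[True] + 2)   # add-one smoothing
        p_ham = (counts[False][token] + 1) / (totals[False] + 2)
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))


counts, totals = train([
    ("cheap pills now", True),
    ("meeting notes attached", False),
])
print(spam_probability("cheap pills", counts, totals) > 0.5)    # True
print(spam_probability("meeting notes", counts, totals) < 0.5)  # True
```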

Several other implementations have gained some traction, including non-naive Bayes, Bayes with different tokenizing schemes, Markov chain-style analysis via CRM114 (which has some clever tokenizing that accounts for some lexical structure), and even the heavyweight support vector machines. While all of these will produce results superior to those of naive Bayes, this improvement must be weighed against their steeper learning curve and increased computational needs.

It has been my experience that most environments cannot rely upon learning algorithms because users are not universally vigilant enough. This is partially mitigated by using a global database for entire deployments (so that a few vigilant users might benefit the entire user base), but high levels of accuracy cannot be assumed. Because of this, a 2-5% accuracy boost from moving from naive Bayes to SVM or whatever other algorithm you prefer is largely useless (even assuming a smaller learning curve with higher tolerance). The values generated by learning algorithms help, but they cannot be trusted.

If training issues aren't present, the question becomes how to merge these in the best possible manner. One option is to use online indices like DNSBLs, DCC, Razor, and Pyzor as priors on a non-naive Bayesian algorithm that determines a ham/spam probability (0=ham, 50=unknown, 100=spam) and to draw points based on the logarithmic distance from 50 (this is what SpamAssassin does with naive Bayes), then similarly run the message through a Markov method and an SVM method, adding up all the points. 100% on all three should net the most points, which in SpamAssassin should be around 6-8 (of a needed 5 to mark and 8 to delete) if the confidence in the training is very high, 4-6 if high, or 2-4 if moderate.

A best-of-breed implementation

My proposed best-of-breed:

  1. Nolisting (closed port 25 on highest priority MX, no server on IP for lowest priority MX)
  2. Greylisting with the following times:
    1. No greylisting for noncompliant servers by whitelist (DNSGL)
    2. 30m for relays listed in spamcop's top offending /8 network blocks
    3. 45m for a broad list of DNSBLs (including some over-aggressive ones)
    4. No greylisting for compliant servers by whitelist (DNSGL)
    5. 30m for any other Windows server (p0f)
    6. 10s (yes, seconds) for anybody else, ideally noting success to DNSGL if Spamassassin reports a low score
  3. Spamassassin (round 1).
    1. First check if bulk email
    2. If bulk email, check SBRL. short-circuits SA (as ham).
    3. DNSBLs and URIBLs have increased scores; custom DNSBLs (including DNSGL and SBRL) are present too.
    4. DNSWLs and their kind have increased negative scores, including custom lists like SBRL (note that hits have already exited SA).
    5. Best-of-breed machine-learning algorithms (non-naive bayes + crm114 + svm)
    6. After scan, if 8+ points, reject (still at SMTP time).
  4. ClamAV virus -> reject (still at SMTP time).
  5. Delivery to mailbox (IMAP, SMTP time is over).
  6. Spamass-imap rescans (and re-marks) unread mail in mailboxes every 15m
  7. Users have three or four special folders: teach-spam, teach-notspam, teach-pickup (teach-pickup is read-only), and teach-me. Messages in teach-spam and teach-notspam are learned by the machine learning algorithms as spam or ham. teach-pickup is populated by un-marked learned ham messages, which are deleted 48 business hours after being placed there. teach-me is an optional folder where spamass-imap may put mail newly classified as between 8-10 points.
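The greylisting times in step 2 can be read as an ordered rule list. The sketch below assumes the first matching rule wins, which the list above implies but does not state outright; the flag names are hypothetical stand-ins for lookups against DNSGL, SpamCop data, other DNSBLs, and p0f.

```python
def greylist_delay_seconds(relay):
    """Pick a greylisting delay for a relay described by boolean flags.

    Rules are checked in the order listed in step 2; first match wins (assumed).
    """
    if relay.get("dnsgl_noncompliant_whitelisted"):
        return 0                 # 2.1: known good but cannot handle greylisting
    if relay.get("in_spamcop_top_slash8"):
        return 30 * 60           # 2.2: top offending /8 network blocks
    if relay.get("on_broad_dnsbl"):
        return 45 * 60           # 2.3: any of a broad list of DNSBLs
    if relay.get("dnsgl_compliant_whitelisted"):
        return 0                 # 2.4: known good and greylisting-compliant
    if relay.get("windows_p0f"):
        return 30 * 60           # 2.5: any other Windows server
    return 10                    # 2.6: everybody else (yes, seconds)


print(greylist_delay_seconds({"on_broad_dnsbl": True}))  # 2700
print(greylist_delay_seconds({"windows_p0f": True}))     # 1800
print(greylist_delay_seconds({}))                        # 10
```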

The SA (round 1) and ClamAV steps could be swapped if it is determined that SA uses more CPU cycles on a day's worth of viruses (only) than ClamAV would use on all spam (only). This should theoretically only happen in cases where there is a really clever email virus out in force, which is to say, too rarely to matter.

Other tools

SPF, SAV (Sender Address Verification), and SMTP server tweaks can all help too.

I especially like the idea of Sendmail's 3s delay for each incoming message. It is no big deal for regular mail, but spammers can't just slam out millions of simultaneous emails if each one has a 3s delay. If you are number one thousand in line, you've had up to 3000s (50 minutes!) of delay, which provides the secondary benefit of greylisting noted above (more time for the online lists to note the spam, so an increased possibility of its proper detection).