Content inspection is a poor way to recognise spam, and the recent proliferation of image spam drives this home. However, if one must use these unreliable techniques, one should bring mathematical rigour to the procedure. Tools like SpamAssassin combine content inspection results with other tests, and tune their rule-sets to give acceptable rates of false positives (mistaking genuine emails for spam), which is how they end up assigning suitable weights to the different content rules.
If one is going to use these approaches to filtering spam, and some see it as inevitable, one had better know one’s statistics, or trust the folk who write SpamAssassin to have good default rules. Most people are not good at statistics, so guess what they do?
The default rules in SpamAssassin carry a lot of weight in the world of spam. Spammers have long known this, and try to craft their emails to score as low as possible in order to bypass the leading spam filters.
However, senders of solicited email, bulk or otherwise, rarely give these rules a second thought; most of us blithely accept that “false positives” are something that happens. To an extent this is true, but these rules must be easier for genuine email senders to pass than for spammers. Otherwise the rules would have no skill, and the authors of SpamAssassin would not be getting us to check them.
Good email administrators typically ensure their systems score low against SpamAssassin’s rule base, not by conscious effort to minimize the risk of “false positives” within SpamAssassin, but by following best common practice, adhering to RFCs (Internet standards), plus a healthy dose of experience (also known as “trial and error”, much of which involves losing genuine emails, or trying not to).
One might expect big email providers to make an explicit effort to minimize the score that email sent from their servers attains in common spam filters like SpamAssassin. By doing so they would ensure that more of their clients’ email is delivered, and make it easier for recipients to separate their good email from the unwanted junk received from elsewhere.
One of the easiest SpamAssassin rule-sets to comply with is the RFC Ignorant rule-set. The rules can be summarised as:
1) Email sent to the “abuse@” address for the domain must not bounce, and must give the impression a human might read it. Ideally a human would read it, but they don’t have an automated way to test that yet (maybe DRM has a purpose after all).
2) The MX records should be correctly formed.
3) Mail from “<>” should be accepted (DSNs).
4) Email sent to the “postmaster@” address for the domain should be handled as per (1) above.
5) The whois data for the domain should be valid, and usable.
The domain is the domain name used in the “Envelope from” field, which in a lot of cases (but not all) is the same as the sender’s domain.
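As a rough illustration of which domain is meant, the sketch below pulls the envelope sender domain out of a delivered message. It assumes the receiving mail server recorded the envelope sender in a Return-Path header, which is common but not guaranteed; the function name envelope_domain and the example addresses are made up for this illustration.

```python
from email import message_from_string
from email.utils import parseaddr


def envelope_domain(raw_message):
    """Return the envelope sender's domain, or None if it cannot be found.

    Assumption: the delivering server stored the envelope sender in a
    Return-Path header. The visible From: header is deliberately ignored.
    """
    msg = message_from_string(raw_message)
    _, address = parseaddr(msg.get("Return-Path", ""))
    if "@" not in address:
        # Null sender "<>" (used for DSNs) or no Return-Path header at all.
        return None
    return address.rsplit("@", 1)[1].lower()


# Hypothetical message: the envelope domain differs from the From: domain.
sample = (
    "Return-Path: <bounces@lists.example.org>\n"
    "From: Alice <alice@example.com>\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
print(envelope_domain(sample))
```

In this made-up message the rules would be checked against lists.example.org, not example.com, which is the "not all" case mentioned above.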
The scoring for these in SpamAssassin 3.1.7 is below. Each rule has four scores, one per combination of Bayesian and network tests being enabled. The number we’ll focus on is the rightmost score, as that is the penalty applied when SpamAssassin is using both Bayesian and network tests.
score DNS_FROM_RFC_ABUSE 0 0.479 0 0.200
score DNS_FROM_RFC_BOGUSMX 0 2.034 0 1.945
score DNS_FROM_RFC_DSN 0 2.872 0 2.597
score DNS_FROM_RFC_POST 0 1.440 0 1.708
score DNS_FROM_RFC_WHOIS 0 0.879 0 1.447
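To make the four columns concrete, here is a minimal sketch that parses the score lines above and picks out one column. It relies on SpamAssassin's usual score-set order (no Bayes and no network tests, network only, Bayes only, both), which also explains the zeros: these are network tests, so they score nothing when network tests are off. The names parse_scores and BAYES_AND_NET are illustrative, not part of SpamAssassin.

```python
# Score lines copied from the listing above.
SCORE_LINES = """\
score DNS_FROM_RFC_ABUSE 0 0.479 0 0.200
score DNS_FROM_RFC_BOGUSMX 0 2.034 0 1.945
score DNS_FROM_RFC_DSN 0 2.872 0 2.597
score DNS_FROM_RFC_POST 0 1.440 0 1.708
score DNS_FROM_RFC_WHOIS 0 0.879 0 1.447
"""

# Column index of the score set used when both Bayesian and
# network tests are enabled (the rightmost column).
BAYES_AND_NET = 3


def parse_scores(text, score_set=BAYES_AND_NET):
    """Map rule name -> score for the chosen score set (column 0..3)."""
    scores = {}
    for line in text.splitlines():
        parts = line.split()
        # Expect: "score", rule name, then exactly four numbers.
        if len(parts) != 6 or parts[0] != "score":
            continue
        scores[parts[1]] = float(parts[2 + score_set])
    return scores


RFC_SCORES = parse_scores(SCORE_LINES)
for rule, penalty in RFC_SCORES.items():
    print(f"{rule}: {penalty}")
```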
Now SpamAssassin is a statistical scoring system: an email can score badly on some rules and still be accepted because of other rules. But clearly, removing an unnecessary penalty from an email will reduce the chance that it is flagged as spam. The RFC-Ignorant rule-set applies to all emails sent from a specific domain, so a conscientious email administrator can remove all these penalties, from all their outgoing emails, with minimal effort.
Below are, to the best of my understanding, the (main) envelope sender domains for various common senders of general interest in my personal email. The number is the “default score” any email from that domain will get in SpamAssassin simply by dint of the RFC Ignorant rule-set. Note that any email collecting a score of 5 or more will be flagged as spam.
GNU: gnu.org: 0
ISC: isc.org: 0
AOL: aol.com: 0.2
GMAIL: gmail.com: 0.2
NTL: ntlworld.com: 1.708
HOTMAIL: hotmail.com: 1.908
YAHOO: yahoo.com: 3.355
VIRGIN: virgin.net: 3.355
Now it is clear why so much email from Yahoo is incorrectly labeled as spam by SpamAssassin. Correspondents need only trip over one of the content-specific filters and they are over the limit, needing to score well on one of the other tests if their email is ever to be accepted and read. Any borderline email would be more likely to be received if static penalties unrelated to its content were minimized.
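The arithmetic can be sketched as follows, using only the baseline figures from the table above and the threshold of 5. The function names headroom and is_flagged are invented for this illustration, and the content score of 2.0 is an arbitrary example, not a measured value.

```python
# Threshold at which SpamAssassin flags a message as spam (from the text).
SPAM_THRESHOLD = 5.0

# Baseline RFC Ignorant penalties per envelope sender domain,
# copied from the table above.
BASELINE = {
    "gnu.org": 0.0,
    "isc.org": 0.0,
    "aol.com": 0.2,
    "gmail.com": 0.2,
    "ntlworld.com": 1.708,
    "hotmail.com": 1.908,
    "yahoo.com": 3.355,
    "virgin.net": 3.355,
}


def headroom(domain):
    """Points the message content may still score before being flagged."""
    return round(SPAM_THRESHOLD - BASELINE[domain], 3)


def is_flagged(domain, content_score):
    """True if the baseline plus the content score reaches the threshold."""
    return BASELINE[domain] + content_score >= SPAM_THRESHOLD


for domain in BASELINE:
    # 2.0 is an arbitrary, hypothetical content score for comparison.
    print(domain, headroom(domain), is_flagged(domain, 2.0))
```

With the same hypothetical content score, a message from yahoo.com or virgin.net is flagged while one from gmail.com or gnu.org is not, purely because of the static penalty.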
This is only a table based on one rule-set, and one would expect the email admins at big email providers would address a much wider range of rule-sets, and even other spam filters, given the advantages of scale that they have. But clearly even basic steps to avoid their email servers being considered suspect haven’t been taken.
No doubt Google and AOL would argue that they can’t maintain an abuse address. But the rules don’t dictate that they should; they merely require that the email is accepted. Of course it would be nice to think they would have a script that at least allows them to address the top few issues that occur repeatedly in their abuse accounts, but even the vaguest promise that someone might some day look at the email is enough.
The true irony is that gnu.org is the largest source of unwanted email to my personal account (because the spam protection for my personal email doesn’t catch email forwarded from genuine email servers, like GNU’s, and their own spam filtering is very weak), but because they adhere to the norms and standards of good email practice, their email is perceived by SpamAssassin to be much better a priori than email from Yahoo. But this just emphasises my point, that adhering to these guidelines would make it easier to identify spam.
Crafting your email to get past spam filters is not just for spammers.