One of my pet peeves is the headline “n% of email is spam”. It is inherently misleading and conveys no useful data. I suppose that is exactly what makes for great newspaper headlines!
Watching one email address on our servers for 4 hours, we saw 208 attempted SMTP connections referring to that address.
2 resulted in delivery of genuine email.
1 resulted in delivery of a phishing email.
1 was unsolicited bulk email.
1 was (presumed) solicited bulk email.
172 were rejected due to the presence of the remote client on the SBL-XBL blacklist.
6 were rejected because the envelope sender domain didn’t exist.
25 were rejected by greylisting.
So clearly only 40% of our delivered email was “spam”; or was it that 99% of our email was spam, given that only 2 of the 208 connections delivered genuine email? If I signed this email address up to a high-volume mailing list, I could get the delivered percentage of spam down to less than 1% with almost no effort. Does that solve the problem?
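For what it’s worth, a quick sketch of the arithmetic behind those two competing headline figures, using the counts from the sample above:

    # A quick check of the two competing figures, using the counts from the 4-hour sample.
    connections = 208        # total attempted SMTP connections in 4 hours
    delivered = 5            # 2 genuine + 1 phishing + 1 unsolicited bulk + 1 solicited bulk
    delivered_spam = 2       # the phishing email and the unsolicited bulk email

    # Spam as a share of delivered mail
    print(f"{delivered_spam / delivered:.0%}")     # -> 40%

    # Spam as a share of all connection attempts
    spam_attempts = connections - 3                # all but the 2 genuine and 1 solicited bulk
    print(f"{spam_attempts / connections:.0%}")    # -> 99%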
One can’t measure spam in relation to the amount of genuine email, because the amount of genuine email is not connected to the amount of spam in any sensible fashion (except as it might relate to how well publicized your email address is).
One could measure the number of attempts at sending spam per user:
205 in 4 hours
The number of successful attempts at delivering spam:
2 in 4 hours
One can use that to assess how effective our filtering techniques were (more than 99% of spam was stopped).
One can even assess what proportion of the remaining rogue connections each spam-filtering method stopped, given that they are applied in a fixed order: SBL-XBL 84%, envelope sender domain validation 18%, greylisting 93%. Obviously, applied in a different order, the results might vary.
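A minimal sketch of that per-stage calculation, assuming each filter sees only the rogue connections that survived the stage before it (which is consistent with the rounded figures above):

    # Each filter sees only the rogue connections that survived the previous stage,
    # so the per-stage percentage is taken against a shrinking denominator.
    rogue = 205                                    # 208 connections minus the 3 legitimate emails
    stages = [("SBL-XBL blacklist", 172),
              ("envelope sender domain check", 6),
              ("greylisting", 25)]

    remaining = rogue
    for name, rejected in stages:
        print(f"{name}: stopped {rejected / remaining:.0%} of remaining rogue connections")
        remaining -= rejected                      # 33 after SBL-XBL, 27 after the domain check, 2 at the end

    print(f"Overall: {(rogue - remaining) / rogue:.1%} of spam was stopped")   # -> 99.0%

Reordering the stages in this sketch changes the per-stage percentages; in practice the rejection counts themselves would also shift with the order, which is exactly the caveat above.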
But one can’t meaningfully relate this to the number of genuine emails (three), because during the same period my own email address received many genuine emails, and that had no impact on the amount of spam sent to it.
The results are not statistically valid due to the small sample size, and the address used appears in whois data, so it is probably spammed more heavily than less public email addresses. However, the principle is quite clear: the proportion of email that is spam depends on the amount of genuine email you get, and that “independent variable” has nothing to do with the spammers.
Lies, damned lies, and poorly chosen metrics.
While I agree in large part with the sentiment of this article, the reasoning is at least a little specious. Any large spam-filtering service provider (whether specialist, or part of a general email service) is in a good position to report what percentage of email it classified as spam for a given time frame. The aggregate figure won’t be so prone to the statistical ambiguity described, and although it remains dependent on a significant number of things (no individual provider is likely to be fully representative of the network as a whole), the figure does give us an idea as to the magnitude of the whole problem at any given time.
I suppose the lesson is the same as ever with regard to such reporting: don’t lend any greater significance to the result than is warranted by the data and methods. Where the data and methods are mentioned only in passing, treat it as advertising or anecdote, not research.
Whilst I accept that aggregation may make the figures more meaningful for a given supplier, in that they can be compared against past or future figures, the result is too dependent on the nature of their client base, and on the effectiveness of their spam filtering, to mean much to others.
Worse still, our email volumes (which perhaps aren’t large enough for good statistical analysis) still show artefacts caused by the actions of individual spammers. For example, we recently saw three lots of backscatter appear within a few days of each other, all of which disappeared on the same day. Whether this was some random disruption of a botnet, or just the bots being retasked, I can’t tell, but I’m pretty sure it is not just coincidence.
Indeed, about the only spam consistently making it through our defences in any noticeable volume is the image-based pump-and-dump scam, and those messages appear to come from a very limited number of original sources (or pieces of software).
On the upside, it does suggest that the spam business is still small enough that individual spammers make a measurable difference. But perhaps statistics aren’t the right tool at all.