In this post, I discuss a new paper that will appear at PETS 2018, authored by myself, Jeffrey Han, and Arvind Narayanan.
What happens when you open an email and allow it to display embedded images and pixels? You may expect the sender to learn that you’ve read the email, and which device you used to read it. But in a new paper we find that privacy risks of email tracking extend far beyond senders knowing when emails are viewed. Opening an email can trigger requests to tens of third parties, and many of these requests contain your email address. This allows those third parties to track you across the web and connect your online activities to your email address, rather than just to a pseudonymous cookie.
Illustrative example. Consider an email from the deals website LivingSocial (see details of the example email). When the email is opened, the client will make requests to 24 third parties across 29 third-party domains. [1] A total of 10 third parties receive an MD5 hash of the user’s email address, including major data brokers Datalogix and Acxiom. Nearly all of the third parties (22 of the 24) set or receive cookies with their requests. In a webmail client the cookies are the same browser cookies used to track users on the web, and indeed many major web trackers (including domains belonging to Google, comScore, Adobe, and AOL) are loaded when the email is opened. While this example email has a large number of trackers relative to the average email in our corpus, the majority of emails (70%) embed at least one tracker.
How it works. Email tracking is possible because modern graphical email clients allow rendering a subset of HTML. JavaScript is invariably stripped, but embedded images and stylesheets are allowed. These are downloaded and rendered by the email client when the user views the email. [2] Crucially, many email clients, and almost all web browsers, in the case of webmail, send third-party cookies with these requests. The email address is leaked by being encoded as a parameter into these third-party URLs.
When the user opens the email, a tracking pixel from “tracker.com” is loaded. The user’s email address is included as a parameter within the pixel’s URL. The email client here is a web browser, so it automatically sends the tracking cookies for “tracker.com” along with the request. This allows the tracker to create a link between the user’s cookie and her email address. Later, when the user browses a news website, the browser sends the same cookie, and thus the new activity can be connected back to the email address. Email addresses are generally unique and persistent identifiers. So email-based tracking can be used for targeting online ads based on offline activity (say, to shoppers who used a loyalty card linked to an email address) and for linking different devices belonging to the same user.
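The linkage described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the query parameter name `eh`, the `Tracker` class, and the lowercase-and-trim normalization before hashing are all hypothetical, and a real tracker's URL format and bookkeeping will differ.

```python
import hashlib
from urllib.parse import urlencode, urlparse, parse_qs


def md5_of_email(email: str) -> str:
    # Assumption: the address is trimmed and lowercased before hashing.
    return hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()


def pixel_url(email: str) -> str:
    """URL a sender might embed in an <img> tag (hypothetical format)."""
    return "https://tracker.com/pixel.gif?" + urlencode({"eh": md5_of_email(email)})


class Tracker:
    """Toy stand-in for the bookkeeping on the tracker's side."""

    def __init__(self):
        self.cookie_to_email_hash = {}  # cookie value -> email hash
        self.history = {}               # cookie value -> pages seen

    def on_pixel_request(self, url: str, cookie: str) -> None:
        # The email client (a browser, for webmail) sends the cookie with the
        # pixel request, so the cookie and the email hash arrive together.
        email_hash = parse_qs(urlparse(url).query)["eh"][0]
        self.cookie_to_email_hash[cookie] = email_hash

    def on_web_request(self, page: str, cookie: str) -> None:
        # Later browsing sends the same cookie, with no email address at all.
        self.history.setdefault(cookie, []).append(page)

    def profile(self, email_hash: str) -> list:
        # All browsing recorded under cookies linked to this email hash.
        pages = []
        for cookie, seen in self.history.items():
            if self.cookie_to_email_hash.get(cookie) == email_hash:
                pages.extend(seen)
        return pages
```

Calling `on_pixel_request(pixel_url("alice@example.com"), "cookie-1")` and then `on_web_request("news.example/story", "cookie-1")` is enough for `profile(md5_of_email("alice@example.com"))` to return the news page, even though the news site never saw the address.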
Measuring email tracking at scale. To understand the privacy implications of viewing and interacting with emails we assembled a collection of messages from mailing lists on the top sites. [3] Using OpenWPM, a web measurement platform developed at Princeton, we simulated a user opening each email and clicking links from within a webmail client that loads remote content. We found that 85% of emails in our corpus contain embedded third-party content, and 70% contain resources categorized as trackers by popular tracking-protection lists. Many of these third parties, including 7 of the top 10, also have a large web presence.
When “anonymous” web tracking isn’t. About 29% of emails leak the user’s email address to at least one third party when the email is opened, and about 19% of senders sent at least one email that had such a leak. The majority of these leaks (62%) are intentional.[4] If the leaked email address is associated with a tracking cookie, as it would be in many webmail clients, the privacy risk to users is greatly amplified. Since a tracking cookie can be shared with traditional web trackers, the email address can allow those trackers to link tracking profiles from before and after a user clears their cookies. If a user reads their email on multiple devices, trackers can use that address as an identifier to link tracking data across devices.
Most of the top leak recipients, including LiveIntent, Acxiom, Conversant Media, and Neustar, are involved in “people-based” marketing. Each of these third parties receives leaked email addresses from between 24 and 68 of the 902 email senders studied. People-based marketing is defined by Acxiom as “the ability to perform targeting and measurement at the level of real people, not just devices, by resolving identity across digital and offline channels.” In other words, it is a term used to describe a set of services which allow marketers to use tracking data collected across any of a user’s devices, as well as offline data, to target that user on any of their devices. As discussed above, this could include offline data such as purchases made using a loyalty card at a grocery store, if that data is associated with the purchaser’s email address (or a hash of it).
While our data does not let us measure how the companies use leaked email addresses they receive when a user views an email, we can get some insight into potential uses by examining their product pages. The marketing materials and privacy policies of the four companies mentioned above detail their use of email addresses for cross-device targeting and/or data onboarding products. [5]
Are leaks of hashed email addresses less of a privacy concern? In many cases, the leaked email address is hashed; in fact, 68% of all leaks which occur while viewing emails are hashed, one-third of which also include the domain portion of the email address in plaintext. Hashed email is considered by some leak recipients to not be personally identifying information. [6]
From a computer science perspective, the claim that a hashed email address is not personally identifying is patently false. When user records in a database are keyed by hashed email address, looking up the record for a given email address is trivial: simply hash it first and look it up (indeed, this is the whole point of storing hashed email addresses at all). What if you have data associated with a hash of an unknown email address and want to recover the original address? It’s surprisingly easy: you can rent a multi-GPU virtual machine for $14.40 an hour [7], which gives you 73 billion MD5 hash computations per second based on published benchmarks. Modern methods have gotten really good at enumerating plausible sequences of characters and numbers in passwords, and we believe these methods will extend to email addresses. If they do, it would mean that email address hashes can be broken much more efficiently than through brute forcing (i.e., trying all possible combinations of characters). We posit that with a trillion guesses (a cost of about 6 US cents) it should be possible to enumerate the majority of email addresses in use.
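The arithmetic behind that cost estimate, and the idea of recovering an address by enumerating plausible candidates, can be sketched as follows. The price and hash rate are the figures quoted in this post; the candidate lists and the lowercase normalization are assumptions made purely for illustration, and real guessing strategies would be far more sophisticated than a Cartesian product of two short lists.

```python
import hashlib
from itertools import product

HOURLY_PRICE_USD = 14.40   # rental price quoted in the post
HASHES_PER_SECOND = 73e9   # MD5 rate quoted in the post


def cost_usd(guesses: float) -> float:
    """Rental cost of computing `guesses` MD5 hashes at the quoted rate."""
    seconds = guesses / HASHES_PER_SECOND
    return seconds * HOURLY_PRICE_USD / 3600


def md5_hex(email: str) -> str:
    # Assumption: addresses are lowercased before hashing.
    return hashlib.md5(email.lower().encode("utf-8")).hexdigest()


def recover(target_hash: str, local_parts, domains):
    """Try every local-part/domain pair; return the matching address or None."""
    for local, domain in product(local_parts, domains):
        candidate = f"{local}@{domain}"
        if md5_hex(candidate) == target_hash:
            return candidate
    return None


if __name__ == "__main__":
    # A trillion guesses takes under 14 seconds at the quoted rate.
    print(f"{cost_usd(1e12):.4f} USD")
    # Illustrative candidate lists only.
    leaked = md5_hex("jane.doe@example.com")
    print(recover(leaked, ["john.doe", "jane.doe"], ["example.org", "example.com"]))
```

At the quoted figures, a trillion guesses comes to roughly five and a half cents, which is where the estimate of about 6 US cents comes from.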
Additional leaks occur when users click on links in emails. When an email link is clicked, the URL is typically handed over to the user’s browser, or to a new tab in the user’s browser, in the case of webmail. Email addresses and other identifiers may be embedded in these links, and may ultimately cause the user’s email address to leak to third-parties on the web. We found that about 11% of links contain requests that leak the user’s email address to a third-party and about 12% of all emails contain such a link. The largest recipients of these leaks are Google, Facebook, and Twitter, and the top recipients overall are very similar to the top third-party trackers on the web.
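To make the notion of a leak concrete, here is a minimal sketch of how one might check whether a request URL carries a known email address, either in plaintext or under a few common encodings. It is a simplification written for this post, not the detection methodology from the paper: the function names, the set of encodings checked, and the lowercase normalization are all assumptions.

```python
import base64
import hashlib
from urllib.parse import unquote


def email_tokens(email: str) -> dict:
    """Forms in which an address might appear in a URL (illustrative subset)."""
    raw = email.strip().lower().encode("utf-8")
    return {
        "plaintext": raw.decode("utf-8"),
        "md5": hashlib.md5(raw).hexdigest(),
        "sha1": hashlib.sha1(raw).hexdigest(),
        "sha256": hashlib.sha256(raw).hexdigest(),
        "base64": base64.b64encode(raw).decode("ascii"),
    }


def find_leaks(url: str, email: str) -> list:
    """Return the names of the encodings of `email` found in `url`."""
    decoded = unquote(url)       # undo percent-encoding such as %40 for "@"
    lowered = decoded.lower()
    found = []
    for name, token in email_tokens(email).items():
        # Base64 is case-sensitive; the other forms are compared in lowercase.
        haystack = decoded if name == "base64" else lowered
        if token in haystack:
            found.append(name)
    return sorted(found)
```

For example, `find_leaks("https://shop.example/landing?u=user%40example.com", "user@example.com")` reports a plaintext leak, while a URL carrying only the MD5 digest of the address reports `md5`.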
Leaks in link clicks can also allow email trackers to work around privacy protections in email clients that strip cookies from remote resources (like Apple Mail) or in those that proxy remote resources (like Gmail). Since the clicked link is opened in the user’s browser, the tracker can make the explicit link between the user’s cookie and the leaked email address while the resulting page is loaded.
What can users do? All of the privacy risks discussed in our paper stem from remote resources, so users can use mail clients which support blocking images by default to completely avoid the problem. However, that can often result in emails which are unreadable; this is particularly true for marketing emails.
In Section 6.2 of the paper we survey 16 mail clients and find that a patchwork of privacy features is employed, but that no setup offers complete protection from the threats we identify. Mail clients that block cookies by default, like Apple Mail, offer some level of protection. In these clients it’s more difficult for a tracker to track users across mailing lists, since the mail client doesn’t provide a persistent identifier. The same is true for webmail clients which proxy images, like Gmail and Yandex. Content proxying has the added benefit of preventing a tracker from being able to link the browser’s cookies to any identifiers received when an email is opened.
Even with the defenses employed by the clients we studied, trackers which receive the user’s leaked email address will continue to be able to track and target users in these clients and on the web. As an example, LiveIntent’s marketing material reassures clients that it will continue to work in Gmail since “targeting is primarily based around the e-mail address’s [sic] MD5 hash”. Regardless of the defenses deployed by the client, control of tracking is handed off to the user’s browser when email links are clicked.
We found that the tracking protection lists EasyList and EasyPrivacy reduce the number of email leaks that occur when an email is viewed by 87%. Perhaps the best option for privacy-conscious users today is to use webmail and install tracking protection tools, such as uBlock Origin or Ghostery. Users who want to use a standalone client must find one which supports privacy extensions; of the clients we studied, the only one that supports such extensions is Thunderbird. Having tracking protection tools installed in the browser will also provide protection when email links are clicked. In Section 7 of the paper, we prototyped a server-side filtering feature which uses the tracking protection lists to filter the HTML body of emails before they reach the user. We found it to be nearly as effective as a tracking blocker running in the user’s browser.
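The idea of filtering an email’s HTML body on the server can be sketched roughly as below. This is not the prototype from the paper: it assumes a plain set of blocked domains (`BLOCKED_DOMAINS`, a hypothetical stand-in) rather than parsing real EasyList or EasyPrivacy rule syntax, and it only drops `<img>` and `<link>` tags that point at those domains.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical stand-in for a tracking-protection list.
BLOCKED_DOMAINS = {"tracker.com"}


def is_blocked(url: str) -> bool:
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


class EmailFilter(HTMLParser):
    """Re-emits the HTML it parses, minus remote resources on blocked domains."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ("img", "link"):
            for name, value in attrs:
                if name in ("src", "href") and value and is_blocked(value):
                    return  # drop the whole tag
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags: the start-tag text already includes the "/>".
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def filter_email_html(html: str) -> str:
    parser = EmailFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because the filtering happens before delivery, a sketch like this would protect the user regardless of which mail client they read the message in, although it does nothing about identifiers embedded in links the user later clicks.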
Data, code, and paper release
You can read the paper here. We are also releasing the code and data publicly, including all of the raw and parsed email bodies and crawls of all HTML emails. We hope that this dataset will spur additional research in this area.
Interested in hearing more from me? Follow me on Twitter @s_englehardt.
Thanks to Arvind Narayanan and Gunes Acar for their helpful comments on this blog post.
[1] The full list of third parties embedded in the LivingSocial example email given above is as follows:
Parties receiving an MD5 hash of the user’s email address: American List Counsel (alcmpn.com), LiveIntent (liadm.com), Datalogix (nexac.com), Acxiom (rlcdn.com, pippio.com, acxiom-online.com), Criteo (criteo.com, emailretargeting.com), Conversant Media (dotomi.com), V12 Data (v12group.com), VideoAmp (videoamp.com), Neustar (agkn.com), and alocdn.com. With the exception of emailretargeting.com and agkn.com all of the previous domains also set or receive cookies.
Additional parties setting or receiving cookies: MediaMath (mathtag.com), TapAd (tapad.com), IPONWEB (bidswitch.net), AOL (advertising.com), Centro (sitescout.com), The Trade Desk (adsrvr.org), Adobe (demdex.net), OpenX (openx.net), comScore (scorecardresearch.com, voicefive.com), Oracle (bluekai.com), Google (doubleclick.net), Realtime Targeting Aps (mojn.com).
Third-party domains requested without cookies or email hash: LiveIntent (licasd.com), Google (2mdn.net), Akamai (akamai.net).
[2] Unless they are proxied by the user’s email server; of the providers we studied (Section 6.2 in the paper), only Gmail and Yandex do so.
[3] Our email corpus was compiled by automatically signing up for mailing lists on the top 14,700 of the Alexa top 1 million sites, in addition to the Alexa top 500 shopping and top 500 news sites. In total, we received 12,618 emails from 902 senders.
[4] We classify the intentionality of leaks using the methodology detailed in Section 4.1 of the paper.
[5] LiveIntent’s marketing material touts the benefits of email-address-based tracking over cookies. In particular they highlight that email hash allows “Communication with clients across all screens and devices: Unlike the cookie, which represents an anonymous user, the email address represents a known customer. It’s unique to that individual, and remains persistent across all devices, apps and browsers.” Similarly, LiveIntent also explains how targeting users with hashed email addresses allows them to continue to serve targeted ads in Gmail despite Gmail’s image proxy.
Neustar’s privacy policy states: “[The onboarding process] allows advertisers to use their offline information about customer preferences (CRM data) ... in the online environment. ... We use de-identified information such as a hashed email address provided by our advertising client, to create a link between that de-identified CRM data and a Cookie ID, Mobile Advertising ID, or other persistent identifier assigned to a unique but de-identified user. That information can then be used to deliver targeted advertising…” and “We also create and store linkages between and among household or individual level identifiers such as Cookie IDs, Mobile Advertising IDs, hashed email addresses and/or other persistent IDs that have been assigned to a unique but de-identified user. This process is sometimes called ‘cross device linking’.”
Acxiom’s Data Service API supports data queries on an MD5 or SHA1 hash of an email address.
Conversant Media’s marketing material implies that they use email address, in addition to purchase data, to match user data across devices.
[6] For example, LiveIntent’s privacy policy states: “We may collect identifiers that are used by our advertising partners to identify a specific individual ... To de-identify this information, either we or our business partners perform a mathematical process (commonly known as hashing) to convert the information into a code.”
[7] A GPU is a type of processor optimized for highly parallel tasks and is typically used for graphics processing. GPUs can also very efficiently compute hashes. In this post, we provide price quotes for Amazon’s `p2.16xlarge` EC2 cloud instance.
Image assets from the Noun Project used in this post: “Browser” by Designify.me, “Database” by Aybige, “Image” by Alfa Design, “HTML File” by Burak Kucukparmaksiz, “Computer Tower” by Melvin.
Nice article and I think it clearly demonstrates, once more, that email marketing is unlawful unless the receivers are actual *direct* customers of the sender.
For the past few years I’ve been carefully following all the spam that we get in our network, and I have observed that the senders use illegal databases and do not follow the law at all.
In general, at least in the EU, registering personal data (such as email addresses) requires the authorization of the owner, even if this data has been made public on the Internet (exactly the same as sending postal mail if you haven’t provided your postal address to the sender). Furthermore, sending “commercial” emails (so SPAM in most cases) requires PREVIOUS and EXPLICIT authorization from the email owner (for every sender). How many times has any of us provided that authorization? I never did!
How many times have you provided authorization to a specific provider (your online fruit provider), only for the same database to be rented out to send you spam about shoes? Thousands!
Your article shows how all this is used to break privacy.
In the last 3 years I’ve contacted more than 1000 email marketing companies (or direct sellers) that have sent spam to me, asking them for access to my data, including how and when it was registered, as EU law mandates. Only ONE of them has actually demonstrated that I provided my data. All the others responded with fake registrations of my email (even with invalid IPs, or from countries such as Ukraine where I have never been), and NEVER provided a demonstration of my previous and explicit consent.
Yes, Data Protection Agencies in the EU can fine those companies up to 600.000 Euros, but most of the time the databases are created on the fly for a campaign and then disappear, or the authorities are so slow that they never put together several customers’ claims, or, even worse, customers don’t file claims, even though it is possible to do so online. And because those databases are rented by the email marketing companies, they are not liable for the damages and unlawful activities.
So we need an EU law that:
1) Mandates the use of double-opt-in to avoid fake registrations
2) Requires owner authorization specific to each database and forbids the use of that data by other senders, whether by renting or selling it
3) Requires previous and explicit consent
4) Makes sure that there is a dedicated prosecutor who automatically follows those cases
5) Enforces that all the parties involved in the email marketing (advertiser, provider/owner of the SMTP servers, owner of the database, marketing companies involved in the campaign) are equally liable for the fines and damages
6) Ensures damages are automatically paid to the email owners, without the need to start a lawsuit, which costs money
7) In addition to fines, ensures that company staff involved in those campaigns face prison if they repeat the unlawful actions more than 3 times
8) Fines and prison should be proportional to the size of the database being used, something like, if you caused 1 minute of time lost for n users, you get n*10 minutes of prison
So there is a need for new laws regarding SPAM. Very strict and aggressive laws. They should consider your article, because users have the right not to have to block email content just because third parties abuse their privacy. Every citizen has the right to not use any kind of blocker and to be able to see the legitimate emails they receive without further work.
Spam needs to be considered a crime at the same level as a hijacking. It steals users’ time, the same as if you were kidnapped for a few minutes every week. Put all that together over a year, and you will discover how much time it amounts to and, in the case of corporate emails, how much it costs.