|
Before the Holidays, Yahoo got a flurry of good press for the announcement that it would (as the LA Times puts it) “purge user data after 90 days.” My eagle-eyed friend Julian Sanchez noticed that the “purge” was less complete than privacy advocates might have hoped. It turns out that Yahoo won’t be deleting the contents of its search logs. Rather, it will merely be zeroing out the last 8 bits of users’ IP addresses. Julian is not impressed:
...dropping the last byte of an IP address just means you’ve narrowed your search space down to (at most) 256 possibilities rather than a unique machine. By that standard, this post is anonymous, because I guarantee there are more than 255 other guys out there with the name “Julian Sanchez.”
The first three bytes, in the majority of cases, are still going to be enough to give you a service provider and a rough location. Assuming every address in the range is in use, dropping the least-significant byte just obscures which of the 256 users at that particular provider is behind each query. In practice, though, the search space is going to be smaller than that, because people are creatures of habit: You’re really working with the pool of users in that range who perform searches on Yahoo. If your not-yet-anonymized logs show, say, 45 IP addreses that match those first three bytes making routine searches on Yahoo (17.6% of the search market x 256 = 45) you can probably safely assume that an “anonymized” IP with the same three leading bytes is one of those 45. If different users tend to exhibit different usage patterns in search time, clustering of queries, expertise with Boolean operators, or preferred natural language, you can narrow it down further.
I think this isn’t quite fair to Yahoo. Dropping the last eight bits of the IP address certainly doesn’t protect privacy as much as deleting log entries entirely, but it’s far from useless. To start with, there’s often not a one-to-one correspondence between IP addresses and Internet users. Often a single user has multiple IPs. For example, when I connect to the Princeton wireless network, I’m dynamically assigned an IP address that may not be the same as the IP address I used the last time I logged on. I also access the web from my iPhone and from hotels and coffee shops when I travel. Conversely, several users on a given network may be sharing a single IP address using a technology called network address translation. So even if you know the IP address of the user who performed a particular search, that may simply tell you that the user works for a particular company or connected from a particular coffee shop. Hence, tracking a particular user’s online activities is already something of a challenge, and it becomes that much harder if several dozen users’ online activities are scrambled together in Yahoo!‘s logs.
Now, whether this is “enough” privacy depends a lot on what kind of privacy problem you’re worried about. It seems to me that there are three broad categories of privacy concerns:
So based on the small amount of information Yahoo has made available, it seems that the new policy is a real, if small, improvement in users’ privacy. However, it’s hard to draw any definite conclusions without more specific information about what information Yahoo! is saving. Because anonymizing data is a lot harder than people think. AOL learned this the hard way in 2006 when “anonymized” search results were released to researchers. People quickly noticed that you could figure out who various users were by looking at the contents of their searches. The data wasn’t so anonymous after all.
One reason AOL’s data wasn’t so anonymous is that AOL had “anonymized” the data set by assigning each user a unique ID. That meant people could look at all searches made by a single user and find searches that gave clues to the user’s identity. Had AOL instead stripped off the user information without replacing it, it would have been much harder to de-anonymize the data because there would be no way to match up different searches by the same user. If Yahoo’s logs include information linking each user’s various searches together, then even deleting the IP address entirely probably won’t be enough to safeguard user privacy. On the other hand, if the only user-identifying information is the IP address, then stripping off the low byte of the IP address is a real, if modest, privacy enhancement.
Sponsored byWhoisXML API
Sponsored byVerisign
Sponsored byDNIB.com
Sponsored byRadix
Sponsored byVerisign
Sponsored byIPv4.Global
Sponsored byCSC