A Serious Bug in the Similarity Check

	By Werner Staub
	January 21, 2013

A week ago, ICANN announced the latest delay in the New gTLD Program: the so-called “contention sets” will not be published until March 1, 2013. The original deadline was July 2012, and it has since been postponed repeatedly, two months at a time. The gTLD program is lost in confusing similarity. What went wrong?

In order to determine which TLD applications are in contention, it is necessary to say which TLD strings are confusingly similar to one another. That was supposed to be done in two steps: first, a list to be compiled by ICANN, and then, the String Similarity Objection procedure.

With the new dates, the ICANN list will be published just 13 calendar days before the end of the objection period. ICANN has entrusted the job to an external provider. There is a draft list of confusingly similar strings, but ICANN does not think it fit for publication.

Impossible marching orders

A closer look shows that ICANN staff has been given a task it cannot possibly perform correctly. The list expected from ICANN lacks a revision procedure. The String Similarity Check is a secret tribunal pronouncing sentences without appeal.

Roughly 1200 gTLD strings must be compared to one another and to 625 theoretically possible ISO-3166 2-letter codes. In theory, that makes about 2,200,000 one-to-one comparisons. That looks like a high number. In practice, the problem lies elsewhere.
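
The arithmetic behind that figure can be checked with a short sketch. It takes the numbers quoted above (1200 strings, 625 two-letter codes) and assumes that each pair of applied-for strings is counted in both directions, which is the reading that comes closest to the stated total. The function name `comparison_count` and its parameters are illustrative only.

```python
def comparison_count(n_strings: int, n_codes: int, ordered: bool = True) -> int:
    """Count one-to-one comparisons among applied-for strings
    and between each string and each two-letter code."""
    # Every string against every other string.
    pairs = n_strings * (n_strings - 1)
    if not ordered:
        # Count "A vs B" and "B vs A" as a single comparison.
        pairs //= 2
    # Every string against every two-letter code.
    return pairs + n_strings * n_codes


N_STRINGS = 1200  # rough figure quoted in the article
N_CODES = 625     # figure quoted in the article

total_ordered = comparison_count(N_STRINGS, N_CODES)
total_unordered = comparison_count(N_STRINGS, N_CODES, ordered=False)

print(f"{total_ordered:,}")    # 2,188,800
print(f"{total_unordered:,}")  # 1,469,400
```

Counting each pair only once gives roughly 1.5 million comparisons instead. A 26-letter alphabet yields 26 × 26 = 676 two-letter combinations; with that value the ordered total is 2,250,000, still close to the figure above. Either way, the volume is well within reach of automation, which is why the difficulty lies elsewhere.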

Consistency vs Purpose

According to ICANN staff, the problem is to achieve consistency.

That is a fallacy. The fact is that there can never be such a thing as a “consistent” list of confusingly similar TLD strings. It is impossible because similarity depends on context, culture, language, use, and above all, policies applied by registries.

Each similarity comparison involves criteria that may appear only once. It is hard to see how consistency could be achieved across such diversity.

The purpose of the string similarity check is not to achieve any form of consistency, but to minimize the risk of harm where it can arise from confusing similarity of TLD strings.

The risk of confusion can often be eliminated by registry policies. On the other hand, in the absence of prudential policies, even moderate similarity can be used for malicious conduct.

The TLD string is meaningless in isolation

The second fallacy lies in the idea of comparing TLD strings alone. The risk of confusion can only be measured if we include the respective TLD policies and other contexts.

Let us look at the case of “.ubs” and “.ups”. Both the bank and the courier company want a TLD for their own exclusive use. The sounds “b” and “p” are difficult to tell apart in many languages. The two strings are phonetically similar and visually close. Yet there is no danger of confusion for Internet users: neither UBS nor UPS will allow third parties to register domains under .ubs and .ups. The two TLDs would be very much like the currently existing ubs.com and ups.com, neither of which is known as a trap for user confusion.

Now compare “.sport” and “.sports”. There are three applications, two for “.sport” and one for “.sports”. All are intended to allow registration by third parties. Phonetically and visually, there is some distinction, but “sports” is the plural of “sport” in English, French and other languages. What is more, when used as a label or title, the string “sport” tends to have the same meaning as “sports”. Therefore, “.sport” and “.sports” cannot coexist.
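
A small sketch shows why a string-only comparison cannot separate these two cases. It uses Levenshtein edit distance as a stand-in for a string metric; the article does not say which algorithm ICANN's provider applies, so this is an assumption. The `open_registration` flag and the `needs_review` rule are hypothetical, introduced only to show how registry policy changes the outcome.

```python
from dataclasses import dataclass


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


@dataclass(frozen=True)
class Tld:
    label: str
    open_registration: bool  # hypothetical: True if third parties may register


def needs_review(x: Tld, y: Tld, max_distance: int = 1) -> bool:
    """Hypothetical rule: close strings matter only if at least one
    of the two TLDs accepts registrations from third parties."""
    close = edit_distance(x.label, y.label) <= max_distance
    return close and (x.open_registration or y.open_registration)


# The string metric alone scores both pairs identically.
print(edit_distance("ubs", "ups"))       # 1
print(edit_distance("sport", "sports"))  # 1

# Adding the policy context separates them.
print(needs_review(Tld("ubs", False), Tld("ups", False)))       # False
print(needs_review(Tld("sport", True), Tld("sports", True)))    # True
```

Both pairs are one edit apart, so any threshold on the string metric treats them the same. Only the policy information distinguishes the harmless pair from the problematic one.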

Learning from mistakes of the IDN ccTLD similarity check

The IDN ccTLD Fast Track similarity check was handled confidentially. No explanations were given. There was no revision mechanism. The objective of consistency across incomparable contexts turned out to be self-defeating and led to misguided decisions: examples are the unjustified rejections of “.ελ” for Greece and “.бг” for Bulgaria, as well as interminable dithering over “.ευ” (Greek) and “.ею” (Cyrillic) for .eu.

The same bugs are present in the gTLD program. As a matter of fact, one of the lessons officially learned from the Fast Track is the need for a revision procedure. The design of such a revision procedure is now one of the deliverables of the IDN ccTLD Policy Development Process.

Relying on the diversity and knowledge inside the ICANN community

ICANN staff and consultants must not be asked to decide in isolation. No single committee of experts has enough cultural and linguistic knowledge. Any initial list of confusingly similar TLD strings must be published as preliminary and subject to change. There must be public debate and revision procedures.

The ICANN Board must recognize the problem and remove the pointless pressure on both ICANN staff and gTLD applicants.

I suggest the following:

1) Restore the feasibility of the String Similarity Objection. It is defined in the Applicant Guidebook (AGB), but 13 calendar days are not enough to file an objection. (March 13 is the current deadline for objections.) The best solution is to allow potential objectors to request per-TLD extensions of the objection period.

2) Introduce the option of confusion avoidance policy protocols between TLD registries. On this basis, ICANN can allow the delegation of TLDs where the policy protocol reduces risks to a level below the threshold of concern. It can take the form of prudential procedures to which affected TLD operators commit. A simple mechanism, for instance, is mutual checking between protocol partner registries before creating a given second-level domain. If the second-level domain already exists in the other protocol partner registry, then the domain creation is put on hold pending analysis by staff of both registries. The procedures can and should be adapted to the specific types of TLD in question. Similarity objection and the negotiation of a confusion avoidance protocol can easily be combined.

3) Require declaration of exclusive-use TLDs. ICANN defined the contractual mechanism a year ago: on the grounds of exclusive use, registry operators can request exemption from Specification 8 of the Registry Agreement. If the exemption is granted, the registry operator must take responsibility for all content on all domain names of the TLD. In exchange, the registry operator is no longer barred from registering domains in its own name, nor does it have to commit to treating all registrars equally. The only step missing is explicit declaration. This should be done now, through TAS. This fundamental piece of information is useful in many areas, including string similarity.

4) Publish the preliminary list without delay. If there is a revision process and the option of confusion avoidance protocols, then a preliminary list can be published immediately. In essence, the ability to correct the list makes it less dangerous, hence viable for publication. March 1 should be kept as the deadline for a list with a “semi-final” status: final before objections and negotiation of confusion avoidance protocols.

5) Conduct a community public comment process for the preliminary string similarity results.
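
The mutual checking described in suggestion 2 can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not a description of any real registry system or protocol: the `Registry` class, its `partners` list, the `link` helper, and the `CreateResult` values are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class CreateResult(Enum):
    CREATED = "created"
    ON_HOLD = "on hold pending analysis by staff of both registries"
    ALREADY_EXISTS = "already exists in this registry"


@dataclass
class Registry:
    tld: str
    domains: set[str] = field(default_factory=set)
    on_hold: set[str] = field(default_factory=set)
    # Registries that signed a confusion avoidance protocol with this one.
    partners: list["Registry"] = field(
        default_factory=list, repr=False, compare=False
    )

    def create(self, label: str) -> CreateResult:
        """Create a second-level domain after checking protocol partners."""
        label = label.lower()
        if label in self.domains:
            return CreateResult.ALREADY_EXISTS
        # Mutual check: does the same label exist under a partner TLD?
        if any(label in partner.domains for partner in self.partners):
            self.on_hold.add(label)
            return CreateResult.ON_HOLD
        self.domains.add(label)
        return CreateResult.CREATED


def link(a: Registry, b: Registry) -> None:
    """Make two registries protocol partners of each other."""
    a.partners.append(b)
    b.partners.append(a)


sport = Registry("sport")
sports = Registry("sports")
link(sport, sports)

first = sport.create("tennis")    # no conflict, created
second = sports.create("tennis")  # exists under .sport, put on hold
third = sports.create("golf")     # no conflict, created
print(first.name, second.name, third.name)  # CREATED ON_HOLD CREATED
```

What happens to a name on hold is left open here on purpose: the proposal leaves that decision to staff of both registries, and the procedure would be adapted to the type of TLD.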

We are simply dealing with process design bugs. The existence of those bugs is no shame, so long as we correct them. The Applicant Guidebook clearly allows for changes to be made: let us use that route, for the similarity bug and others, rather than risk a collapse of the gTLD program.

The confusion in the Similarity Check is more dangerous than it seems. The purely string-based check, with its lack of solidity, was made a cornerstone. It will crumble under the pressure. No wonder ICANN staff is hesitant to publish it. It is one of several dangerously weak spots. If we fail to correct the architecture, the entire edifice can break down.

The Similarity Check should not be a cornerstone. That position should be given to well-documented policy commitments by registries.

Comments

Philip Sheppard – Jan 23, 2013 9:17 AM

Werner, you answer your own question when you say:
“The purpose of the string similarity check is ...to minimize the risk of harm where it can arise from confusing similarity”.

It is not about perfection. It is not harm elimination. It is harm reduction.
The Trademark world has done it for decades - with some success.

Benoit Fallenius – Jan 23, 2013 5:26 PM

Thanks for a very good summary of the similarity challenge. Similarity is, as you say, contextual. In our experience, most cases of similarity that are actually confusing depend on phonetics (ubs/ups) and the understanding of generic words (sport/sports).

We at Markify currently process 10M new strings a year against our database of 100M+ strings (trademarks and domain names). We find 99% of potential confusions.

For us to process another 1200 against themselves would not be a big job. We could do it for free, to help the community. I think this kind of list would help the process you are suggesting.
