How Do You Turn a Typesetting Language Into an Identifier System? (Not Easily)

	By John Levine Author, Consultant & Speaker
	January 30, 2018

Unicode’s goal, which it meets quite well, is that whatever text you want to represent in whatever language, dead or alive, Unicode can represent the characters or symbols it uses. Any computer with a set of Unicode typefaces and suitable layout software can display that text. In effect, Unicode is primarily a typesetting language.

Over in the domain name system, we also use Unicode to represent non-ASCII identifiers. That turns out to be a problem because an identifier needs a unique form, something that doesn’t matter for typesetting.

For a name in the DNS, and for most other kinds of identifiers, if a user sees an identifier in use somewhere, she needs to be able to type or otherwise enter that identifier so that what she typed produces the same bits as the stored identifier. In some cases (see mailboxes, below) the rule is slightly relaxed so that given two strings, the computer can decide whether they identify the same thing.

Unicode is full of homoglyphs, characters or groups of characters that look the same but have different internal forms. We (mostly meaning the Unicode consortium, IETF, and ICANN) have come up with three ways to minimize the homoglyph problem and try to limit Unicode internationalized domain names (IDNs) so that two IDNs that look the same actually are the same.

Homoglyphs are nothing new. Those of us old enough to remember manual typewriters remember that they often had only the digits 2 through 9, and we used lowercase letter l and uppercase O for 1 and 0. It didn’t matter because the meaning was obvious from context. But when used as identifiers in the DNS, there’s no context, and a name like “operator” is not the same as “0perator” or “0perat0r”.

Normalization

In some cases Unicode offers multiple ways to write the exact same character, such as á, which can be written as two characters, “a” followed by “combining acute accent”, or as a single precomposed character, “a with acute accent”. Unicode defines several normalization forms, one of which consists of characters that are as composed as possible, known as Normalization Form C (NFC). The IETF’s Internationalized Domain Names for Applications (IDNA) requires that all IDNs be in NFC, and that input Unicode be converted to NFC before being used as a domain name. This only handles composition where the two forms appear exactly the same, not cases where two forms look similar but not identical.
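As a minimal sketch of what that conversion does, here is the á example using Python's standard unicodedata module. The helper name to_nfc is mine, and this shows only the normalization step, not the rest of IDNA processing.

```python
import unicodedata

decomposed = "a\u0301"   # "a" followed by COMBINING ACUTE ACCENT
precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE

# Both strings display as the same character but are stored differently.
print(decomposed == precomposed)          # False
print(len(decomposed), len(precomposed))  # 2 1


def to_nfc(label: str) -> str:
    """Return the label in Normalization Form C (as composed as possible)."""
    return unicodedata.normalize("NFC", label)


# After normalization the two spellings are the same sequence of code points.
print(to_nfc(decomposed) == to_nfc(precomposed))  # True
```

Normalizing both the stored name and whatever the user typed is what lets the two be compared as plain strings.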

Scripts

A related but different issue is different scripts. Unicode defines a script as a set of characters used to write one or more languages. Familiar scripts include Latin (used to write most European languages), Cyrillic, Greek, and Arabic. Different scripts often have characters that look the same, e.g. the Latin letter “o”, Cyrillic “o”, and Greek omicron.
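To make the three look-alike letters concrete, this short sketch prints their code points and Unicode names with Python's unicodedata module. It also checks that NFC leaves them alone, since normalization only deals with different spellings of the same character.

```python
import unicodedata

# Latin o, Cyrillic o, and Greek omicron: one shape, three characters.
lookalikes = ["\u006f", "\u043e", "\u03bf"]

for ch in lookalikes:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# NFC does not fold characters from different scripts together.
normalized = {unicodedata.normalize("NFC", ch) for ch in lookalikes}
print(len(normalized))  # 3
```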

Most domain registries have a list of scripts in which they will accept registrations, and each registered name usually has to be in a single script. In some cases, names are restricted to a single language in a single script (e.g., French or Portuguese, which use different accents), or a mixture of compatible scripts, notably Japanese names, which allow Katakana, Hiragana, Han (Kanji ideographs), and Latin. This largely deters homograph attacks at the registration level, other than some arcane examples where people have constructed what look like English names entirely from homographs in Cyrillic or Greek.
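A rough sketch of a single-script check follows. Python's standard library does not expose the Unicode Script property, so this guesses the script from the first word of each character's name. That heuristic, and the function names rough_script, scripts_in, and is_single_script, are my own assumptions for illustration and not how any registry actually implements its rules.

```python
import unicodedata


def rough_script(ch: str) -> str:
    """Guess a script from the first word of the character's Unicode name.

    Assumption: a heuristic only; a real implementation would use the
    Unicode Script property, which the standard library does not provide.
    """
    if ch.isdigit() or ch == "-":
        return "COMMON"  # digits and hyphen are shared by every script
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def scripts_in(label: str) -> set:
    """Return the set of guessed scripts used in a label, ignoring COMMON."""
    return {rough_script(ch) for ch in label} - {"COMMON"}


def is_single_script(label: str) -> bool:
    return len(scripts_in(label)) <= 1


print(is_single_script("paypal"))        # True, all Latin
print(is_single_script("p\u0430ypal"))   # False, Cyrillic "a" mixed into Latin
# A whole-script spoof: four Cyrillic letters that look like Latin "cope".
print(is_single_script("\u0441\u043e\u0440\u0435"))  # True
```

The last line is the arcane case: a name built entirely from one script's look-alikes passes a single-script rule, so that rule alone cannot catch it.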

All ICANN contracted registries are supposed to file their tables of permitted characters for each language in an IANA repository, and many have.

Registry script rules are generally only enforced for the name directly registered, and not for anything below it, so you can see names like <mixture>.something.com.

Language generation rules and bundling

The last level of confusion is among characters that don’t necessarily look the same but in some sense mean the same thing. Examples include traditional and simplified Chinese characters, and in some European languages, vowels with and without accents. In script tables, one character can be listed as a variant of another, and registries have rules about them. Some forbid registration of names that differ only in characters that are variants, while others “bundle” names so that a registrant can get some or all variants of a name.

Variants have their limits: they can’t express equivalences between character sequences of different lengths, such as the German ö and ß, which are usually treated as equivalent to “oe” and “ss”. But they avoid a lot of problems, particularly in Chinese and Japanese.
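Here is a small sketch of how a variant table might drive a "differs only in variants" check. The table contents and the names VARIANTS, bundle_key, and conflicts are hypothetical, made up for illustration; real registry tables are much larger and are defined per language or script.

```python
# Hypothetical variant table: each variant maps to one representative character.
VARIANTS = {
    "\u570b": "\u56fd",  # traditional Chinese "country" -> simplified form
    "\u00e9": "e",       # e with acute -> e
    "\u00e8": "e",       # e with grave -> e
}


def bundle_key(label: str) -> str:
    """Replace every character by its representative, one for one."""
    return "".join(VARIANTS.get(ch, ch) for ch in label)


def conflicts(new_label: str, registered: list) -> bool:
    """True if new_label differs from a registered name only in variants."""
    key = bundle_key(new_label)
    return any(bundle_key(name) == key for name in registered)


registered = ["caf\u00e9", "\u4e2d\u56fd"]
print(conflicts("cafe", registered))           # True
print(conflicts("\u4e2d\u570b", registered))   # True
# A one-for-one table cannot say that o-umlaut matches the two letters "oe".
print(bundle_key("\u00f6") == bundle_key("oe"))  # False
```

A registry that blocks variants would refuse the conflicting names, while one that bundles would offer them to the existing registrant.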

So who cares?

The reason I went through all this is twofold. One is related to the DNS: there are good reasons that the characters in Unicode DNS labels are limited, and you can’t use, to revisit a recent argument, emoji. If you want to use emoji in text messages or other contexts that are like typesetting, that’s fine. But they make dreadful identifiers since there are lots and lots of emoji that look almost the same, frequently deliberately so. For most emoji that look like people, you can add modifier characters for any of five skin tones, and for male or female gender. You can make several emoji display as a super-emoji group, say man and heart and woman as 💑, which looks cute but is a challenge to type since it’s a sequence of six code points that have to be entered in the right order: woman, combine, heart, alternate-version, combine, man.
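To see what has to be typed, this sketch spells out that six-part sequence and the five skin tone modifiers using Python's unicodedata module. The "combine" step is the zero width joiner and "alternate-version" is a variation selector. The waving hand is just an example base emoji I picked.

```python
import unicodedata

# woman, joiner, heart, variation selector, joiner, man
couple = "\U0001F469\u200D\u2764\uFE0F\u200D\U0001F468"

for ch in couple:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
print(len(couple))  # 6

# The five skin tone modifiers are U+1F3FB through U+1F3FF.
wave = "\U0001F44B"
skin_tone_variants = [wave + chr(cp) for cp in range(0x1F3FB, 0x1F400)]
print(len(skin_tone_variants))  # 5 distinct strings for one gesture
```

Every one of those strings is a different identifier as far as a computer is concerned, however similar they look on screen.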

If the emoji for slightly frowning face 🙁 and frowning face with open mouth 😦 look nearly identical, it makes no difference in a text message, but it makes them terrible identifiers. Imagine you registered one, built a website around it, and then a competitor registered the other. How can you explain to your customers which is the real one?

To avoid this problem, in principle people could create an emoji script table that groups together similar-looking emoji as variants, and otherwise limits the allowable emoji to ones that look different enough that people could reliably recognize them if they saw them in an ad on the side of a bus. But nobody will. It’s not worth anyone’s time since emoji DNS names are at most a gimmick.

The other reason is that DNS labels are not the only place on the Internet where we have text identifiers. Two other familiar ones are the path in URLs, the part after the domain name, and the mailbox in an e-mail address. Mailboxes, in particular, are a challenge, since only the system hosting the mailbox knows the meaning of an address and although every mail system does some kind of fuzzy match, the fuzz varies a lot. For ASCII mailboxes, everyone does upper/lower case folding, some ignore dots, some trim off suffixes after hyphens or plus signs, some do other things, but it entirely depends on the mail system. Systems with Unicode addresses will do similar things, but it’s a lot harder since the details of case folding are highly language specific (even among languages written in Latin characters), and there are a lot of things that might be considered to be like dots or hyphens.
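As a sketch of how much the fuzz can vary, here is a toy matcher for the local part of an address with the rules as switches. The function names and the particular options are assumptions for illustration; no real mail system is being described, and each one picks its own rules.

```python
def canonical_mailbox(local_part: str, *, ignore_dots: bool = False,
                      strip_plus: bool = False) -> str:
    """Reduce a local part to a comparison key under one system's rules."""
    key = local_part.casefold()          # case folding, not locale-aware
    if strip_plus:
        key = key.split("+", 1)[0]       # drop a "+suffix" if present
    if ignore_dots:
        key = key.replace(".", "")
    return key


def same_mailbox(a: str, b: str, **rules) -> bool:
    return canonical_mailbox(a, **rules) == canonical_mailbox(b, **rules)


# Same two strings, different answers depending on the hosting system's rules.
print(same_mailbox("John.Smith+lists", "johnsmith"))  # False
print(same_mailbox("John.Smith+lists", "johnsmith",
                   ignore_dots=True, strip_plus=True))  # True

# Unicode case folding can change length: sharp s folds to "ss".
print(canonical_mailbox("Stra\u00dfe"))  # strasse
```

Python's casefold applies the default Unicode folding and knows nothing about the user's language, which is exactly where Unicode mailboxes get harder than ASCII ones.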

While the DNS character rules can be a useful guide to designing rules for other applications, it’s unlikely they can be applied directly (e.g., DNS names never ignore dots, mailboxes sometimes do). We still have a lot to learn about what’s a usable identifier in what contexts.
