Current Difficulties With Displaying Internationalized Top-Level Domains

Earlier this week, we inserted eleven new top-level domains in the DNS root zone. These represent the term “test” translated into ten languages, in ten different scripts (Chinese is represented in two different scripts, and Arabic script is used by two different languages).

This blog post is not about that. (If you’re interested in it, read our report on the delegations.)

What I would like to talk about is some of the difficulties we face today in expressing scripts in a consistent way over the Internet. The fact is, whilst we are at the best time in history for having computers represent many different languages clearly and consistently, we are still a long way from the level of support needed to give us strong confidence that people can always see what we intend them to see.

To illustrate, I will list all eleven new top-level domains. On the left is the version your web browser wants to present to you, and on the right is how it should actually look.

إختبار
آزمایشی
测试
測試
испытание
परीक्षा
δοκιμή
테스트
טעסט
テスト
பரிட்சை

If you find some of the versions don’t match, you would be in the majority of Internet users. The fact is most people cannot see these labels properly and consistently.

The most likely problem you will face is that there will be some labels that you simply cannot see, because your computer does not have any font that can express the characters. When the correct font cannot be found, it will usually display something like the following:

Computers never come with the complete set of fonts that would allow them to show every possible IDN in the world. The primary concern is to supply fonts that allow the language used on the computer to work, and the rest are optional. Often this is fixed by downloading additional language packs for the missing languages, or by specifically finding and installing fonts that support the wanted languages.
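One rough way to see which scripts a label needs fonts for is to look at the Unicode name of each character. The following is a minimal sketch using Python's standard `unicodedata` module. The helper name `scripts_in` is hypothetical, and taking the first word of each character name as the script is a simplification for illustration, not an official script lookup.

```python
import unicodedata

def scripts_in(label):
    """Return the first word of each character's Unicode name.

    Assumption: the first word of the name (e.g. 'CYRILLIC', 'TAMIL')
    serves as a rough stand-in for the script the character belongs to.
    """
    return {unicodedata.name(ch).split()[0] for ch in label}

# Three of the test labels from the list above.
for label in ["испытание", "テスト", "பரிட்சை"]:
    print(label, sorted(scripts_in(label)))
```

A machine that lacks a font covering any one of the reported scripts would be expected to show placeholder glyphs for that label.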

Finding fonts is sometimes only half the battle. English, on the scale of languages, is one of the simplest to represent by computer. It has 26 letters, and they always look the same and are presented the same no matter what order they are in. Sure, there may be stylistic variants, but in terms of composing letters it is very simple.

Take a look at this:

On the left is the correct way to present this, but those of you who do have Arabic fonts may find that you see the version on the right. This is because Arabic has more complex rules on how letterforms should be connected and formed. Some software is more accurate than others in how it does this.

The same issue may present itself in Devanagari script:

Again, on the right you can see the composing is not working correctly.
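To make the shaping problem concrete: Unicode text normally stores one code point per Arabic letter, and the rendering software is expected to pick the right shape depending on the letter's neighbours. Unicode also carries a compatibility block of "presentation forms" that spells these shapes out, which makes them easy to inspect. The sketch below uses Python's standard `unicodedata` module; the helper name `describe_form` is hypothetical, and the letter BEH is chosen only as an example.

```python
import unicodedata

# The four positional presentation forms of ARABIC LETTER BEH (U+0628).
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]

def describe_form(ch):
    """Return (position, base letter) from a compatibility decomposition.

    The decomposition string looks like '<initial> 0628'.
    """
    tag, base = unicodedata.decomposition(ch).split()
    return tag.strip("<>"), chr(int(base, 16))

for ch in BEH_FORMS:
    position, base = describe_form(ch)
    print(f"U+{ord(ch):04X}: {position} form of {unicodedata.name(base)}")
```

Software that always draws the isolated form, instead of choosing among these shapes, produces the disconnected letters described above.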

If you’re really unlucky, for the Arabic version you may be seeing this:

This comes about because Arabic is written right-to-left. English, on the other hand, is written left-to-right. However, this corrupted example of Arabic has been written left-to-right - .siht ekil etorw I if sa

Ordering problems may also arise when full domain names are used. Imagine a domain like maps.google.com. Now imagine it showing up as com.google.maps. That’s confusing, but imagine the confusion of google.com.maps, or worst of all, google.com.spam. These are some of the variants that have shown up when right-to-left ordering issues appear due to software problems. (More on this issue is in this presentation from the Israel ccTLD registry.)
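Part of the reason ordering goes wrong is that every character carries a directionality class, and the dots between labels are not strongly directional themselves. The sketch below inspects those classes with Python's standard `unicodedata` module. The helper `base_direction` is hypothetical and only looks for the first strongly directional character; it is a simplification, not a full implementation of the Unicode bidirectional algorithm.

```python
import unicodedata

def base_direction(text):
    """Guess the direction of text from its first strong character.

    'L' is a strong left-to-right class; 'R' and 'AL' are strong
    right-to-left classes (Hebrew and Arabic letters respectively).
    """
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return "ltr"
        if cls in ("R", "AL"):
            return "rtl"
    return "neutral"  # no strong character found

for text in ["maps.google.com", "טעסט", "إختبار"]:
    print(text, base_direction(text))

# The label separator is a weak 'common separator', so its placement
# depends on the surrounding characters rather than on the dot itself.
print(unicodedata.bidirectional("."))
```

Because the dot takes its direction from context, a name that mixes right-to-left and left-to-right labels can have its labels displayed in an order the author never intended.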

Apart from the visual display issue, there can also be issues simply in transmitting these domains in communications. The DNS has been carefully upgraded to support these new domains, but that doesn’t mean you will get a consistent experience in other areas. In a discussion on these new test domains on an Internet mailing list, one person found they were showing like this:

This is because the encoding in the email is incorrect. Generally speaking, to fully express all the possible IDNs you need to use an encoding like UTF-8. However, ISO 8859-1 is often the default in mail programs for users of English and other Western European languages. Viewing UTF-8 encoded labels as ISO 8859-1 produces the undecipherable letter soup you see above. If you’ve ever received foreign spam that just looked like a list of random letters, this is probably why.
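The effect is easy to reproduce. The following sketch, in Python, encodes one of the test labels as UTF-8 and then decodes the same bytes as ISO 8859-1, which is what a mail program does when it assumes the wrong encoding. The variable names are illustrative only.

```python
label = "испытание"

# What the sender transmits: the label as UTF-8 bytes (two bytes per
# Cyrillic letter).
raw = label.encode("utf-8")

# What a reader assuming ISO 8859-1 sees: every byte becomes its own
# character, so the label doubles in length and turns into letter soup.
garbled = raw.decode("iso-8859-1")
print(len(label), len(garbled))
print(repr(garbled))

# The bytes themselves are intact, so reversing the wrong step
# recovers the original label.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == label)
```

Nothing is lost in transit in this case; the damage is purely one of interpretation, which is why declaring the correct encoding in the message fixes it.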

This just touches on the problems that can arise when dealing with the world’s languages and scripts. The release of the evaluative top-level domains provides an additional opportunity to identify these types of problems, and to work with software vendors and other parties to improve their applications so that these issues no longer occur.

By Kim Davies, Manager, Root Zone Services

Comments

Jaap Akkerhuis  –  Oct 16, 2007 1:38 PM

As a datapoint on how subtle the differences might be, I read your article on a Macintosh running Mac OS X 10.4.10 with Safari (Version 3.0.3 (522.12.1)) and Firefox (2.0.0.7). Limiting myself to the list of labels, Firefox couldn’t display the Devanagari.

Firefox (2.0.0.7) on my FreeBSD box (6.2-STABLE) had problems with the Greek and other symbols, and couldn’t handle the Tamil script at all (it displays the Unicode code points in small squares). I assume this has to do with the (lack of) proper font files on the machine.

Slim Amamou  –  Oct 19, 2007 8:42 PM

even if you have a perfect setup there are still issues with right to left scripts (Arabic for instance) :

http://????.??????/????/

look at this URI. for me, it looks a bit awkward. it looks like :

http://path/example.test/

preserving the left to right directionality of the path (the slashes order), is essential. everybody seems to forget that a URI represents also a hierarchy, not just an identifier.

ahir man  –  Apr 2, 2008 4:10 PM

Thanks Kim Davies for this article but how about solutions?

I recently got this problem of disconnected Arabic letters that were arranged from left to right. Probably after I installed Firefox 2.0.0.12 en-us. The font problem was also present in MSIE.

Is it because of corrupted fonts? Will the problem be solved if some or all of the fonts were deleted from c:\windows\fonts then replaced with new ones?
