Choosing Internationalized Email Addresses

Recently I’ve been working on Email Address Internationalization (EAI), looking at what software is available (Gmail and Outlook/Hotmail both handle it now) and what work remains to be done. A surprisingly tricky part is assigning EAI addresses to users.

In traditional ASCII mail, the local part of the address, what goes before the @ sign, can be any printable ASCII characters. Although an address like %i()/;[email protected] is valid, and mail systems will handle it, users don’t want addresses like that. A good address is one that is easy to remember, easy to tell someone over the phone, and easy to type.

Most mail systems give senders some help when interpreting addresses. If an address is [email protected], they’ll accept [email protected] or [email protected]. If the address is [email protected], they’ll accept [email protected] and often variations in punctuation like [email protected] without the dots.

The flip side of this is that you don’t assign different addresses that are too similar. While it is technically possible that [email protected] and [email protected] could deliver to different mailboxes, nobody does that. Similarly, nobody makes [email protected] and [email protected] different. (They may not both work, but if they do, they’re the same mailbox.)
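As a rough illustration of the two rules above (interpret liberally, assign conservatively), here is a minimal Python sketch. The function names and the sample mailbox names are hypothetical, and real mail systems differ in which variations they fold together; this one assumes only case and dots are ignored.

```python
def ascii_match_key(local_part: str) -> str:
    """Fold case and drop dots so close variants share one key."""
    return local_part.lower().replace(".", "")


def collides(new_local: str, existing: list[str]) -> bool:
    """True if new_local is too similar to a mailbox that already exists."""
    new_key = ascii_match_key(new_local)
    return any(ascii_match_key(name) == new_key for name in existing)


# Hypothetical mailbox names, for illustration only.
existing_mailboxes = ["bob.smith", "alice"]
print(collides("BobSmith", existing_mailboxes))  # True: same key as bob.smith
print(collides("carol", existing_mailboxes))     # False: safe to assign
```

Incoming mail would be looked up by the same key, so every variant a sender might type reaches the one mailbox that owns that key.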

The domain (the part of the address after the @ sign) has to follow the DNS rules, which don’t allow any fuzzy matching other than ASCII upper and lower case.

How does all this extend into EAI mail?

EAI extends ASCII addresses in a straightforward way: in addition to any printable ASCII characters, the local part can contain any printable Unicode characters encoded as UTF-8, and the domain can contain UTF-8 U-labels. As before, users will have an easier time if mail systems assign addresses conservatively and interpret addresses on incoming mail liberally.

The PRECIS working group at the IETF defined string classes for different applications. Its IdentifierClass works well for mailbox names: it allows codepoints that are (roughly) letters and digits in various languages.

It also provided rules to prepare UTF-8 strings for use. Unicode often provides multiple ways to represent exactly the same character, e.g., a single codepoint for an accented character é or separate e and accent codepoints. It often also has variant characters that look different but mean approximately or exactly the same thing, such as full-width and half-width versions of characters, Latin digits 12345 and Arabic digits ١٢٣٤٥, or traditional and simplified Chinese characters. To prepare a string, software maps variant codepoints into preferred ones, usually precomposed characters such as é. Mail systems should assign mailbox names in prepared form, but they can and should accept addresses in the incoming mail in any form since they can prepare them as they receive them. (This is different from the DNS where DNS servers only do exact matches, so the client has to do any preparation.)
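To make the preparation step concrete, here is a small Python sketch using only the standard `unicodedata` module. It is not a full PRECIS implementation: it assumes just two steps, mapping full-width and half-width characters to their ordinary forms and then composing with NFC, and it does not check that the result is valid for any PRECIS class. The function names are hypothetical.

```python
import unicodedata


def map_width(text: str) -> str:
    """Replace full-width and half-width characters with their ordinary forms."""
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)  # e.g. "<wide> 0061"
        if decomp.startswith(("<wide>", "<narrow>")):
            codepoints = decomp.split()[1:]
            out.append("".join(chr(int(cp, 16)) for cp in codepoints))
        else:
            out.append(ch)
    return "".join(out)


def prepare_local_part(local_part: str) -> str:
    """Sketch of preparation: width mapping, then NFC composition."""
    return unicodedata.normalize("NFC", map_width(local_part))


# "e" followed by a combining acute accent becomes the single codepoint é.
decomposed = "jose\u0301"
print(prepare_local_part(decomposed) == "jos\u00e9")  # True

# Full-width Latin letters map to plain ASCII letters.
print(prepare_local_part("\uff41\uff42\uff43"))  # abc
```

A mail system following the advice above would store mailbox names in the prepared form and run the same function on each incoming address before looking it up.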

There’s no reason that a mail system’s fuzzy matching has to stop where PRECIS and ASCII addresses did. The Latin and Arabic digits aren’t the same for PRECIS, but it’s easy enough for a mail system to map them together and to ensure that it doesn’t issue two mailboxes with digits that collide. In Latin-script languages with accented or multiple forms of characters (such as the Turkish dotless ı), a conservative mail system would avoid assigning addresses that differ only in the form of a letter, and would accept all versions of the letter, even ones that aren’t valid or equivalent in the user’s language. For example, even though Turkish speakers wouldn’t write i for ı, correspondents who don’t speak Turkish might, and it’s easier all around if the slightly misspelled address works. Similarly, in Scandinavian languages the letters O Ø Ö are different, but it’d be a good idea to accept the wrong versions in incoming addresses.

Mail systems have only recently started to assign EAI addresses, and I’m not yet aware of any of them doing fuzzy matching on incoming addresses. But for the same reason we have found it a good idea to allow [email protected] for [email protected] in ASCII mail, EAI mail systems will have to figure out how to adapt to however their correspondents type EAI addresses.

By John Levine, Author, Consultant & Speaker
