
Internationalized Domain Names (IDN) in .museum - Orthographic issues

(Note: This text assumes that the reader is acquainted with basic IDN concepts. Unfamiliar terms appearing here can, however, safely be glossed over. Clicking through the illustrations requires an IDN-compliant browsing environment, which is otherwise not a requirement for comprehensibility. Detailed introductory information and relevant policy documents are available at http://about.museum/idn/.)

Background

The characters and code points that may appear in an IDN are taken from the Unicode Code Chart version 3.2. The IDNA protocol places some explicit limitations on the repertoire that is actually available, but assumes that significant further restrictions will be imposed by registry policies. The default state in a gTLD such as .museum is that no IDN code points are permitted unless they are explicitly declared as available. A single label is further restricted to code points taken from the same script. This is intended to avoid confusion that can easily result from (or be deliberately caused by) the gratuitous intermingling of characters taken from different scripts. The selection of available scripts and characters is based on the orthographic requirements of the languages that the registry wishes to support. There is a reasonably standardized practice for many languages, with variance in usage in different locales also being relatively easy to codify. A registry may, nonetheless, need or wish to modify or extend this when setting its IDN policies.
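The single-script restriction on a label can be sketched in a few lines of Python. This is only an illustration and not the registry's actual procedure: the standard library does not expose the Unicode script property, so the sketch assumes that the first word of each character's Unicode name ("LATIN", "CYRILLIC", "GREEK", and so on) is an adequate stand-in for its script. The function names and the sample labels are hypothetical.

```python
import unicodedata

# Digits and the hyphen may accompany any script in a label.
COMMON = set("0123456789-")

def scripts_in_label(label: str) -> set[str]:
    """Return the approximate set of scripts used in a label.

    Assumption: the first word of the Unicode character name stands in
    for the script property, which the standard library does not provide.
    """
    return {unicodedata.name(ch).split()[0] for ch in label if ch not in COMMON}

def is_single_script(label: str) -> bool:
    """True if all letters in the label appear to come from one script."""
    return len(scripts_in_label(label)) <= 1

print(is_single_script("museum"))       # every letter is Latin
print(is_single_script("mus\u0435um"))  # U+0435 is CYRILLIC SMALL LETTER IE
```

A production check would rely on the script data published by the Unicode Consortium rather than on character names.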

The representation of loanwords may, for example, require the use of characters that do not belong to the native repertoire of the receiving language (as in the English résumé, naïve, jalapeño, garçon, fräulein, crêpe, crèche, etc.). In the .museum case, a name holder might also wish to use non-contemporary orthography or an extinct language to label a resource associated with a corresponding historical niche, or use what might otherwise be seen as a foreign language to identify material originating in a separate cultural context. As long as this doesn't violate any of the IDNA protocol requirements or the other terms in the ICANN Guidelines for the Implementation of Internationalized Domain Names, there should be little difficulty in accommodating such needs.

A converse state exists where IDNA excludes the use of some characters that are required by accepted orthographic convention, either by outright prohibition or by mapping them to idiosyncratic alternatives. Situations have been noted in which constraints are imposed on an adequate Unicode repertoire, with results that may be a source of confusion to users and name holders. Three such cases are described immediately below, with further sections discussing general issues including the deceptive exploitation of graphically similar characters. This issues list will likely grow as IDN support for additional scripts and languages is introduced.

Latin script

Of the languages using Latin script currently appearing in IDN contexts, German is probably the most widely supported. There is some discussion about policies for dealing with orthographically equivalent appearances of the umlauted <ä ö ü> and the two-letter alternatives <ae oe ue>, but all forms are available for inclusion in registered names. However, the fourth of the German letters external to the basic Latin <a-z> alphabet, the 'sharp s' <ß>, is remapped by IDNA to <ss> and is therefore not available for inclusion in a registered name. (There is no upper case sharp s in Unicode, and the normalized form is <SS ss>.) This is orthographically acceptable in all contexts and is the least likely of the cases described here to cause confusion. If the display of the <ß> is important, it can be used literally without difficulty. Although applications software may convert an entered <ß> into <ss>, the strict representation can always be used in text entry and initial display contexts.
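The remapping of <ß> can be observed with Python's standard library, whose idna codec implements the original IDNA specification together with its Nameprep mapping step. The label used here is an illustrative German word and not a registered name.

```python
from encodings.idna import nameprep

label = "stra\u00dfe"  # "straße", an illustrative label only

# Nameprep is the IDNA step that maps the sharp s to <ss>.
print(nameprep(label))

# After the mapping the label is pure ASCII, so no "xn--" form is needed.
print(label.encode("idna"))
```

Because the mapped result contains only basic Latin letters, the registered name cannot be told apart from one entered with <ss> in the first place.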

Greek script

The Greek small letter sigma exists in two forms, each at its own Unicode code point. The initial and medial form is <σ>, and the final form is <ς>. The IDNA specification maps the latter into the former. (Here as well, this is based on case, where the upper case sigma only exists in a single form and the normalization is to <Σ σ>.) A native Greek speaker would, however, expect to see the final form at the end of a domain label and not regard the other form as an acceptable alternative. Here again, it is possible to use the correct orthography in text entry and display contexts. The transformation of a final sigma to the initial/medial form by applications software would, however, not be as readily tolerated as the preceding situation with <ß ss>. This may be illustrated with the IDN identifier provided by the International Council of Museums (ICOM) to its Cypriot National Committee, http://κυπρος.icom.museum. The label κυπρος can appear correctly in the body of an HTML document and a browser's address line, but may appear as κυπροσ in the status line, or in the way the target site identifies itself. This is the name of a country and its forced misspelling is unlikely to be taken invariably as a casual matter.
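The effect on the Cypriot label can be reproduced with the idna codec in Python's standard library, which follows the original IDNA specification. The sketch converts the label to its ASCII-compatible form and back again, which is roughly what happens when software displays a name recovered from the DNS rather than the text that was originally entered.

```python
FINAL_SIGMA = "\u03c2"  # ς, used at the end of a word
SIGMA = "\u03c3"        # σ, the initial and medial form

label = "κυπρο" + FINAL_SIGMA

ace = label.encode("idna")          # ASCII-compatible encoding of the label
round_tripped = ace.decode("idna")  # what software recovers from that form

print(ace)
print(round_tripped)
print(round_tripped == label)       # the final sigma does not survive
```

The information that the label ended in a final sigma is lost at encoding time, so no later processing step can restore the expected spelling on its own.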

Other scripts

ICOM provides all of its national committees with native language domain names. Most of the committees use the organization's English acronym in Latin script, but there are some instances of it being represented in other scripts. The Russian National Committee, for example, uses the orthographically straightforward http://россия.иком.museum. None of the committees represent the ICOM acronym in Greek script but, were they to do so, it could easily appear as a second-level label in the corresponding domain name. Similarly, it is expected that committees in Arabic speaking countries, as well as other countries with languages using the Arabic script, will prepare resources in those languages. Regardless of preferences in the representation of the ICOM acronym, it illustrates concerns with scripts that are written right-to-left as described below, and may introduce additional orthographic issues requiring special consideration.

There is only one additional ICOM committee site currently being operated where two IDN labels are appropriate — that of the Israeli National Committee. This uses what appears to be an obvious http://איקו״ם.ישראל.museum. On closer examination, however, this illustrates several problems with IDN that have yet to be fully resolved. The fifth character (counting from the right) in the label איקו״ם is listed in the Unicode code chart as a punctuation mark, not a letter. It serves the necessary function of indicating that this sequence of letters is an acronym, rather than a word. Although the inclusion of punctuation marks in domain names is restricted, this sign is a semantic extension to the Hebrew alphabet and arguably not punctuation in the sense used in the DNS specification. Permitting its appearance in Hebrew IDN appears justified, in any case, if not absolutely necessary.

The 'punctuation gershayim' <״> appears in the penultimate position in a sequence of Hebrew letters that is not to be read as a word, and this function cannot be dispensed with without risking confusion. The problem is compounded by the fact that the standard Hebrew keyboard does not include this sign, obliging users to substitute a quotation mark <"> for it. Although the correct Unicode code point can easily be placed in a displayed IDN label, a keyboarded transcription of that label is likely to fail without the reason being apparent to the non-specialist user. Acquiescing to the inclusion of a quotation mark in a domain name is not a viable work-around. Although a quotation mark is theoretically permissible in the DNS protocol, the applications software that supports the actual operation of the DNS will normally reject a name containing one.
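The distinction between the two signs can be inspected with Python's unicodedata module. The sketch prints the code point, general category, and name of each sign, and then compares the intended label with a keyboarded transcription of it. The Hebrew letters are written as escapes only to keep the listing readable in left-to-right editors.

```python
import unicodedata

GERSHAYIM = "\u05f4"       # the sign used in the ICOM acronym
QUOTATION_MARK = "\u0022"  # what a standard keyboard produces instead

for ch in (GERSHAYIM, QUOTATION_MARK):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# alef, yod, qof, vav, <sign>, final mem
intended = "\u05d0\u05d9\u05e7\u05d5" + GERSHAYIM + "\u05dd"
keyboarded = "\u05d0\u05d9\u05e7\u05d5" + QUOTATION_MARK + "\u05dd"

print(intended == keyboarded)  # two different labels that look nearly alike
```

Both signs fall in the same general category of punctuation, which is why the argument for admitting the gershayim rests on its orthographic function and not on its Unicode classification.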

Bidirectionality

Every domain name contains a top-level label written using the left-to-right Latin alphabet. If the other labels in that name are in a script that is written right-to-left, there is likely to be some ambiguity about the order in which the individual labels should be read. In the specific case of the listing of ICOM committee names, the icom designator will always be expected to appear next to the museum label. This forces the committee name into the position to the left of the icom label, regardless of the direction in which the language is normally read. The problem is that two adjacent labels in a script with right-to-left properties will always appear in right-to-left order. The sequence thirdlevel.secondlevel.toplevel will thus appear as secondlevel.thirdlevel.toplevel if the second- and third-level labels are both in right-to-left scripts. The two Hebrew labels in http://איקו״ם.ישראל.museum therefore need to be registered in reverse order for this name both to appear and function as expected.
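The reordering of adjacent right-to-left labels can be modelled with a deliberately simplified sketch. It is not an implementation of the Unicode bidirectional algorithm. It assumes only that a run of consecutive right-to-left labels is laid out from right to left as a unit, while left-to-right labels keep their positions. The function names are hypothetical, and the Hebrew labels are written as escapes to keep the listing readable.

```python
import unicodedata

def is_rtl(label: str) -> bool:
    """True if the label contains characters with right-to-left direction."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in label)

def visual_label_order(labels: list[str]) -> list[str]:
    """Return labels in the left-to-right order in which they are displayed.

    Simplifying assumption: each run of adjacent right-to-left labels is
    reversed as a whole; everything else stays where it is.
    """
    visual, run = [], []
    for label in labels:
        if is_rtl(label):
            run.append(label)
        else:
            visual.extend(reversed(run))
            run = []
            visual.append(label)
    visual.extend(reversed(run))
    return visual

ICOM = "\u05d0\u05d9\u05e7\u05d5\u05f4\u05dd"  # the Hebrew ICOM acronym
ISRAEL = "\u05d9\u05e9\u05e8\u05d0\u05dc"      # the Hebrew name of the country

# Stored (logical) order with ICOM first and ISRAEL second:
print(visual_label_order([ICOM, ISRAEL, "museum"]))
```

Under this model the label stored second is the one displayed furthest to the left, which is why the stored order has to be the reverse of the order that is meant to be seen.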

This has the arcane benefit of moving the policy issues about the inclusion of what may be seen as a punctuation mark, from the second level to the third level (where constraints can be less severe). There is, nonetheless, an obvious lack of clarity that still needs to be addressed. It is difficult to envision the resolution of this problem without establishing means for the representation of TLD labels in right-to-left scripts giving, for example, http://ישראל.איקו״ם.מוזיאון (with few applications likely to be troubled if the residual left-to-right <http://> were excluded from display).

Localized top-level labels

It can safely be expected that the Russian-speaking community would be well served by the availability of http://россия.иком.музей, despite the lack of directional ambiguity in the currently available representation of this name. Numerous additional communities would clearly benefit if their native languages and scripts could be used on all levels in a domain name, and the need for top-level IDN has already become a matter of keen interest. It is, however, fraught with a range of daunting technical, political, and policy concerns. Nor is it by any means certain that these will all ultimately be resolved without more or less serious disruptive side effects. However varied the means for addressing the individual issues may be, those emanating from the orthographic requirements of specific languages are unlikely to move forward, much less be satisfactorily treated, without the energetic participation of the corresponding language communities.

Visually confusable characters

Many of the letters and characters that may be included in a domain name have similar, if not identical, appearances on a computer screen. Examples of this in the pre-IDN repertoire are <I 1 l> and <0 O>. Any increase in the available repertoire will also increase the number of confusable characters, with risk for the deliberate exploitation of these similarities for deceptive purposes. This was recognized early in the course of the development of IDNA. Some of the restrictions that were required to offset the risk of deceptive exploitation were prescribed directly in the protocol, but the largest control was expected to be implemented through registry policies.

The adequacy of that approach was recently called into question through a demonstration exploit based on the visual similarity between the grapheme <a> representing the first letter in the Latin alphabet and the separate grapheme <а> representing the first letter in the Cyrillic alphabet. Longer sequences of characters from one of these scripts can similarly appear indistinguishable from sequences in the other. An example of this is the Latin <apsyeoxic> and the Cyrillic <арѕуеохіс>. A given sequence of Latin letters can obviously only be registered once in any parent domain. In a registry supporting both the Cyrillic and Latin alphabets it might, however, be possible for both <арѕуеохіс> and <apsyeoxic> to be registered separately. It is precisely such situations that individual registry policies are expected to address. Considering the large number of additional examples that can be found in these two scripts alone, and the existence of other scripts with rich potential for graphic confusion, it has nonetheless become clear that more detailed basic regulatory instruments are needed, with correspondingly clearer enforcement mechanisms.
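The difference between the two sequences becomes apparent as soon as their code points are listed. In the following Python sketch, the Cyrillic string is spelled out with escapes, on the assumption that these are the code points used in the example above. The idna codec in the standard library then shows that the two labels would be stored in the DNS as entirely different names.

```python
import unicodedata

latin = "apsyeoxic"
# Assumed code points of the Cyrillic look-alike sequence.
cyrillic = "\u0430\u0440\u0455\u0443\u0435\u043e\u0445\u0456\u0441"

print(latin == cyrillic)  # the strings differ despite their appearance

# List each pair of look-alike characters with code point and name.
for a, b in zip(latin, cyrillic):
    print(f"U+{ord(a):04X} {unicodedata.name(a):<24}"
          f"U+{ord(b):04X} {unicodedata.name(b)}")

# The DNS sees two unrelated labels.
print(latin.encode("idna"))
print(cyrillic.encode("idna"))
```

Nothing in the protocol itself relates the two labels to each other, so any linkage between them has to come from registry policy.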

Action toward that end has already been initiated by the organizations responsible for the various facets of IDN. These include the IAB, ICANN, the TLD registries, and the Unicode Consortium. Pending the formal establishment of a suitably revised normative and policy framework, some software developers have begun imposing their own constraints on the use and appearance of IDN, and are also active participants in the broader discussion of the issues that need to be resolved. Two useful points of entry into the documentation of this process are provided in ICANN's new IDN information area and, in greater technical detail, at the unofficial and independent http://nameprep.org.

Terminology — "homographs"

The situation described under the previous heading is frequently referred to as the 'homograph problem' with IDN. The definition that normally appears in dictionaries and linguistic texts states that homographs are different words which are spelled identically (for example, the adjective 'brief' meaning short, the noun 'brief' meaning a document, and the verb 'brief' meaning to inform). By definition, letters in two different alphabets are not the same. This means that sequences of letters from two different scripts that appear to be identical on a computer display cannot be homographs in the accepted sense, even if they are both words in the dictionary of some language. Assuming that there is a language written with Cyrillic script in which 'сар' is a word, regardless of what it might mean, it is not a homograph of the Latin-script English word 'cap'.

When the security implications of visually confusable characters were brought to the forefront earlier this year, the term homograph was used to designate any instance of graphic similarity, even when comparing individual characters. (Nothing suggests that this was a well-reasoned neologism. A handy term seems simply to have been co-opted in the belief either that it was being used correctly, or that the extension of the definition had no potentially negative consequence.) Given the rapidity with which the new sense has become entrenched, it might not appear worth much time calling it into question. Some documents relating to IDN avoid the issue by the straightforward device of simply not using the term at all. However, an essential attribute of textbook homography is that it applies to words. It is therefore not a generally useful term for describing domain name labels.

Relaxing the definition, nonetheless, can ease communication with the segments of the technical community that have adopted it. It can, however, as easily impede communication with members of other communities who either won’t recognize the new sense or will react negatively to its having been coined at all, given that there are more precise pre-existing alternatives — and even if all were deemed unsuitable, a more focused neologism would have been possible.

Another problem with extending the definition of homograph is that it obscures the distinction between diacritical marks that combine with base characters, those that occupy their own space next to the character to which they apply, and auxiliary characters that can serve a variety of functions with varying degrees of orthographic essentiality. Differing considerations apply to each of these cases and lumping them together under the single heading of 'homographic confusion' does not make it easier to formulate differentiated IDN policies where they are truly essential.

In illustration of one of the more trivial situations, many languages use diacritical marks to distinguish between what would otherwise be homographs (as with the English words resume and résumé). It is certainly reasonable to consider whether such disambiguation is adequate for the purposes of secure IDN registration, and it is here that the term homograph can legitimately appear in the discussion. Recognizing situations where diacritics can be applied to what would otherwise be homographs allows the undecorated shared base form to be used as a normalized reference for policies that deal with equivalent character variants.
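One way of deriving such an undecorated base form is to decompose each character and discard the combining marks. The Python sketch below does this with the standard unicodedata module. It is an assumption about how a policy might compute its reference form and not a description of any registry's practice, and the function name is hypothetical.

```python
import unicodedata

def base_form(label: str) -> str:
    """Strip combining diacritical marks to obtain a shared base form."""
    decomposed = unicodedata.normalize("NFD", label)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(base_form("r\u00e9sum\u00e9"))  # "résumé"
print(base_form("m\u00fcller"))       # "müller"
```

Letters that have no canonical decomposition, such as <ø>, pass through this function unchanged, so a complete policy would need explicit variant tables for them.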

Permitting the gratuitous use of diacritical marks in IDNs without linguistic justification is, of course, a good way to ask for trouble. It would be reasonable for an IDN policy to state, for example, that 'the diacritical embellishment of a character for decorative visual effect must not be permitted under any circumstances.' On the other hand, the Müller Corporation cannot be prevented from registering müller.tld just because the Muller Widget Works holds muller.tld, without very clearly stated justification.

This is an extremely intricate situation. From the anglophone perspective 'u' and 'ü' may appear to be unacceptably and confusingly similar, whereas from the perspectives of language communities in which they exist side by side, they may be regarded as entirely distinct. Determining the circumstances under which the one perspective may override the other without vulnerability to accusations of cultural bias is a particularly delicate challenge.

The Cyrillic and Latin labels presented above are straightforward illustrations of pairs of character strings that appear to be identical but are not. Their visual confusability can, however, be exploited for deceptive purposes with obvious ease. If the term 'visually confusable' isn’t fully adequate to describe this, the term 'pseudo-homograph' might be. This is also suggested by the W3C in a commentary on Unicode Technical Report #36 at http://www.w3.org/International/reviews/utr36/.

Seemingly identical characters from different scripts are correctly termed 'homoglyphs', but many people reject the term because it 'sounds ugly'. Without suggesting that it needs to be adopted nonetheless, reconciling this discussion to the aesthetic deficiencies of the alternative terminologies might remove an impediment to the more rapid resolution of the core issues. As a final resort, terms may be coined that are needed but can't be found elsewhere. Appropriating a term that has an established but different meaning in relevant adjacent contexts should not be the first alternative.


Latest update: 2005-12-30