The initial instalment of my recent series of articles on domain name discovery¹ considered the use of phonotactic analysis (that is, a measure of the similarity of a string to the ‘corpus’ of other words in a language) to identify available, unregistered candidate domains which may be of interest for their potential brandability. This filtering process is necessary because of the large universe of domains which must be assessed. Considering just 5-character alphabetic .com domains, for example, there are approximately 9 million unregistered combinations of characters (as the SLD, or second-level domain name, i.e. the part to the left of the dot), out of the ‘pool’ of around 12 million possible names (26^5 = 11,881,376, from aaaaa.com to zzzzz.com).
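As a quick check on the size of that pool, the count of purely alphabetic SLDs can be reproduced in a few lines. This is only a sketch: the function name `pool_size` is illustrative, and it assumes the pool is restricted to the 26 lowercase letters (no digits or hyphens), as implied by the aaaaa.com to zzzzz.com range.

```python
import string

# Assumption: the pool covers only the 26 letters a-z (no digits or hyphens).
ALPHABET = string.ascii_lowercase


def pool_size(length: int) -> int:
    """Number of distinct purely alphabetic SLDs of the given length."""
    return len(ALPHABET) ** length


total = pool_size(5)
print(f"{total:,} possible 5-letter SLDs")  # 11,881,376 possible 5-letter SLDs
```

The result, just under 12 million, matches the rounded figure quoted above.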

Phonotactic analysis is a powerful tool, but it does have some shortcomings. First, the phonotactic ‘violation score’ for a string of characters is computationally slow to calculate. Second, any given score window typically still retains large numbers of candidate domains. Third, the ‘mapping’ of score to brandable desirability is not always ‘clean’ (i.e. many domains which are (subjectively) attractive do not generate low violation scores).

In this follow-up, I start to explore additional frameworks for filtering the large sets of candidate domain names, considering the inherent structure of the SLD strings themselves. This type of approach would allow would-be brand owners to specify a preference as to the ‘type’ of brand name they may be looking to use, based on analogy with other brand names, words or strings (and potentially also allows for further filtering based on factors such as preferred initial letters, etc.), but without having to specify an exact string or keyword which they would like the brand name to resemble (as in the methodology proposed for ‘variant string’ domain names in another recent study²).

Framework

As was the case for the phonotactic method, the framework considered in this initial study classifies domain names according to their high-level phonetic characteristics, but it uses a much simplified approach (with negligible computational overhead) in which the constituent characters (consonants / vowels) are categorised into groups.

The groupings are based on the standard classifications for (English) consonant phonemes (i.e. unique sounds), in which they are classified according to their place of articulation (position in the vocal tract) and manner of articulation in speech³,⁴.

For simplicity, I consider one of the original datasets utilised in the initial study: the set of unregistered (as of the time of the original analysis) 5-character .com domain names with SLDs of the form CVCVC (C = consonant, V = vowel, noting that a ‘y’ is also accepted where it appears in a ‘vowel’ position). In general, there is no one-to-one mapping between individual characters and phonemes, due to factors such as variability in pronunciation and the existence of character combinations which represent single phonemes (such as ‘ch’, ‘sh’, ‘ng’, etc.). However, the use of the CVCVC pattern means that a ‘cleaner’ mapping can be assigned (since, for example, no consonant pairs will arise), and the utilised groupings are shown in Table 1. The overall classification of any given SLD string is then based just on the consonants present within the string (which are deemed to determine the high-level ‘structure’ of the word).
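To make the idea concrete, the CVCVC test and the consonant-only classification might be sketched as follows. The groupings used here are NOT those of Table 1: they are a hypothetical, simplified mapping by manner of articulation, chosen only for illustration (the assignments of ‘c’, ‘q’ and ‘x’ in particular are assumptions). The names `is_cvcvc` and `consonant_signature`, and the sample strings, are likewise invented for this example, and ‘y’ is assumed to be excluded from the consonant positions.

```python
from typing import Tuple

# 'y' is accepted in vowel positions, following the CVCVC definition above.
VOWELS = set("aeiouy")

# Hypothetical, illustrative groupings (NOT the article's Table 1).
# Assumptions: 'c' and 'q' treated as plosives, 'x' as a fricative,
# and 'y' not treated as a consonant.
GROUPS = {
    "plosive": "bcdgkpqt",
    "fricative": "fhsvxz",
    "affricate": "j",
    "nasal": "mn",
    "approximant": "lrw",
}
LETTER_TO_GROUP = {ch: name for name, letters in GROUPS.items() for ch in letters}


def is_cvcvc(sld: str) -> bool:
    """True if the SLD is five letters alternating consonant/vowel, starting with a consonant."""
    s = sld.lower()
    return len(s) == 5 and all(
        (ch in VOWELS) if i % 2 else (ch in LETTER_TO_GROUP)
        for i, ch in enumerate(s)
    )


def consonant_signature(sld: str) -> Tuple[str, ...]:
    """Classify a CVCVC string by the groups of its three consonants only."""
    if not is_cvcvc(sld):
        raise ValueError(f"{sld!r} is not of the form CVCVC")
    # Positions 0, 2 and 4 hold the consonants; vowels are ignored.
    return tuple(LETTER_TO_GROUP[ch] for ch in sld.lower()[::2])


print(consonant_signature("ravon"))  # ('approximant', 'fricative', 'nasal')
```

Because each signature is a simple per-character lookup, strings can be bucketed by signature in a single pass over the candidate list, which is where the negligible computational cost comes from.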