Unregistered Gems: Identifying Brandable Domain Names Using Phonotactic Analysis


	By David Barnett, Brand Protection Strategist at Stobbs
	September 03, 2024

Conventional wisdom within the domain-sales industry states that the stock of unregistered domain names is ‘running out,’ with limited or no availability of short, desirable domain names across popular extensions (TLDs). This presents problems for would-be brand owners looking for a brand name (and accompanying suitable website presence) to utilize for newly-launched companies, producing a push towards the selection of longer, unusual or novel terms for new brand names and/or increased adoption of new TLDs or those which were historically less popular—with the alternative generally being a requirement to purchase a pre-existing name at a premium price.

The initial statement made above is true up to a point. For .com, for example, which is still the most popular and trusted domain-name extension by a significant margin, there are no unregistered (Latin) alphabetic domain names with a second-level domain (SLD) name¹ of four characters or fewer. For longer domain names, the proportion which are not currently registered increases rapidly, because the number of possible names (equal to 26ⁿ, where n is the length of the name in characters) rises exponentially with length. For 5-character .com domains, there are around 9 million unregistered strings (out of a ‘pool’ of 12 million), and for 6 characters, there are 303 million unregistered combinations (out of a possible 309 million)²,³. However, it is generally accepted that the vast majority of dictionary words are already registered.
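The pool sizes quoted above follow directly from the 26ⁿ formula. As a quick check, the short Python sketch below recomputes them and the corresponding unregistered proportions; the unregistered counts are the rounded figures from the text, and the function names are purely illustrative.

```python
# Approximate unregistered counts, as quoted in the text (rounded figures).
UNREGISTERED = {5: 9_000_000, 6: 303_000_000}


def pool_size(n: int, alphabet_size: int = 26) -> int:
    """Number of possible alphabetic strings of length n (26**n by default)."""
    return alphabet_size ** n


def unregistered_share(n: int) -> float:
    """Fraction of the length-n pool that is unregistered, using the rounded counts."""
    return UNREGISTERED[n] / pool_size(n)


for n in (5, 6):
    print(f"{n} characters: pool = {pool_size(n):,}, "
          f"unregistered share = {unregistered_share(n):.0%}")
```

On these rounded figures, roughly three-quarters of the 5-character pool and about 98% of the 6-character pool is unregistered.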

These determinations can easily be made through analysis of domain zone files, the data files maintained by registry organisations containing comprehensive lists of all registered domains across the TLD in question. It is important to note, however, that the absence of a candidate domain from the zone file does not necessarily mean that the domain is available to register; domains can be absent from the file for other reasons (such as having been put ‘on hold’, or having no associated nameservers) and may be reserved or otherwise unavailable, so a round of ‘post-processing’ checks is required in order to confirm that any specific domain missing from the zone file is actually registrable.
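A minimal sketch of this kind of zone-file comparison is shown below. It is not the tooling used in the study: it assumes that each record line begins with a fully qualified owner name (e.g. `example.com.`), which may not match the layout of every registry's files, and the function names are hypothetical. Names it yields are only absent from the zone file; they would still need the post-processing checks described above.

```python
from itertools import product
from string import ascii_lowercase
from typing import Iterable, Iterator, Set


def registered_slds(zone_lines: Iterable[str], tld: str = "com") -> Set[str]:
    """Collect SLD labels that appear as record owners in zone-file lines.

    Assumption: the first whitespace-separated field of each record is a
    fully qualified owner name such as 'example.com.'.
    """
    suffix = "." + tld
    slds = set()
    for line in zone_lines:
        fields = line.split()
        if not fields or fields[0].startswith(";"):
            continue  # skip blank lines and comments
        owner = fields[0].rstrip(".").lower()
        if owner.endswith(suffix):
            label = owner[: -len(suffix)]
            if "." not in label:  # ignore deeper names such as ns1.example.com
                slds.add(label)
    return slds


def absent_from_zone(registered: Set[str], length: int,
                     alphabet: str = ascii_lowercase) -> Iterator[str]:
    """Yield every alphabetic string of the given length not found in the zone."""
    for chars in product(alphabet, repeat=length):
        name = "".join(chars)
        if name not in registered:
            yield name
```

Generating the candidates lazily avoids holding the full set of strings in memory, which matters at six characters, where the pool runs to roughly 309 million names.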

Methodology and analysis

The large set of unregistered domain names—even on .com and at lengths as low as five characters—does mean that there are still plenty of options available for the launches of new brands; the difficulty is in finding the appropriate candidate names.

The lack of availability of dictionary words as registrable domain names does mean that, aside from cases of acquisition of pre-registered domains, recent years have seen increased use of new styles of brand names, such as word combinations, neologisms containing familiar roots and fragments (so-called ‘transmutations’), and ‘wacky’ alternative spellings of familiar words (often with one character replaced by another with a similar pronunciation, or with vowels removed)⁴,⁵, in addition to a wide range of abstract terms⁶. The current popularity of names of this type can be seen in the selections of names, billed as ‘brandable’ options, available for sale by domain brokers and on domain marketplaces (Figure 1).

**Figure 1:** Examples of domain names offered for sale on the Atom.com marketplace (as of 27-Aug-2024)

Another attractive aspect of adopting an ‘invented’ term as a brand name (particularly if the domain has never been registered before⁷) is that pre-existing intellectual property rights, name collisions or confusion are less likely, meaning it is likely to be more straightforward to protect and defend the new name.

One viewpoint is that there is nothing inherently ‘special’ about the names typically on offer for sale through such sources, apart from the fact that they have been identified as being available and have been deemed ‘brandable’; potentially there may be (many?) other equally attractive options available within the unregistered ‘pool’, with the issue being the complexity of identifying them amongst the enormous numbers of other random character strings—i.e. how is it possible to filter down the set of candidate names into a more manageable number?

A key analysis technique involves the use of so-called phonotactics—essentially, a measure of the potential readability, or similarity to other existing words (or brand names!) present in the corpus of a language, of the candidate strings. (This analysis focuses on ‘English-like’ terms, but similar principles can also be applied to other languages.) In this study, the analysis is carried out just on the SLD string of each domain name. Of course, an objective determination of phonotactic ‘acceptability’ does not necessarily mean that a candidate domain name will be an attractive brandable option, so it is generally also necessary to conduct a subsequent manual review of the (much smaller) filtered set of names, using a more subjective assessment of branding potential (based on the intrinsic understanding and ‘feel’ of language and marketing that only a human reviewer can impart). Some of the basic ideas of brandability are well summarised by Nick Kolenda’s overview of the subject⁸.

Phonotactics – basics

An example of a phonotactic calculator is that produced by UCI⁹, though there are a number of other implementations of similar tools. The algorithm used in this analysis is the BLICK model (Hayes, 2012)¹⁰ which, for an arbitrary string of phonemes (i.e. a candidate word), outputs a score providing a measure of the extent of phonotactic ‘violation’—i.e. a lower score denoting a more credible potential name; in the words of the original study:

“[The model] predicts that ket [K EH1 T] should [be] a completely perfect word of English (penalty score zero), that doit [D OY1 T] should be a somewhat peculiar word of English (score 3.094), and that nguhyee [NG AH H Y IY0] should be a pretty horrible word of English (score 12.295).”

The implementation utilised in this analysis requires that each string is first converted to its phonetic representation using ARPABET syntax, in which each phonetic element is represented as a series of Latin characters (and, in some cases, a trailing digit)¹¹,¹².
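To make the ARPABET notation concrete, the sketch below splits a transcription into its phonetic elements and separates any trailing digit. It only parses transcriptions that already exist, such as the examples quoted above; it does not convert spellings to phonemes, and the `Phoneme` type and `parse_arpabet` function are illustrative names rather than part of any particular tool.

```python
from typing import List, NamedTuple, Optional


class Phoneme(NamedTuple):
    symbol: str            # the Latin-character part, e.g. "EH"
    digit: Optional[int]   # the trailing digit, if one is present


def parse_arpabet(transcription: str) -> List[Phoneme]:
    """Split a space-separated ARPABET string into (symbol, digit) pairs."""
    phonemes = []
    for token in transcription.split():
        if token[-1].isdigit():
            phonemes.append(Phoneme(token[:-1], int(token[-1])))
        else:
            phonemes.append(Phoneme(token, None))
    return phonemes


# Transcriptions taken from the quotation above.
for word in ("K EH1 T", "D OY1 T", "NG AH H Y IY0"):
    print(parse_arpabet(word))
```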

Previous brandable domain sales

As a case study, it is informative to consider a set of previous domains (all .com) sold (or on sale) as brandable examples, for which sale prices are available from a range of sources. The relationship between the sale price and phonotactic violation score for the 5-character SLD names is shown in Figure 2 (noting that, for some strings, the phonotactic calculation algorithm will fail, in which cases, the score is assigned a default value of -1).

**Figure 2:** Relationship between sale price and phonotactics violation score for 5-character brandable domain names sold (or on sale) through a range of sources

Within this specific dataset, there is no strong relationship between cost and phonotactic score (even if the results from Novanym.com, for which all domain sales were for a constant price, are excluded). However, it is significant that all domains in the set have relatively low scores (compared with the distribution within the ‘universe’ of unregistered names), with the vast majority at scores of 6 or lower.

A deep-dive into the universe of unregistered 5- and 6-character .com domain names

Based on analysis of the .com zone file carried out in mid-August 2024, 9,284,133 of the possible alphabetic 5-character names, and 303,531,886 of the possible 6-character names, are absent from the zone file and so potentially available for registration.

It is informative to consider the distribution of phonotactic violation scores (for the SLDs) across an unfiltered set, to gain an indication of the variations across the dataset and how these reflect the ‘readability’ of the corresponding names. Because the calculation algorithm is relatively slow, this analysis considers only the 5-letter names beginning with ‘a’ and ‘b’ (one vowel and one consonant). This gives a dataset of 478,369 (candidate) domain names. Of these, 8,893 (1.9%) are assigned a score of zero, but with a wide range of scores observed, up to a maximum of just under 68 (Figure 3). The five SLD strings with the highest scores (i.e. the least credible brandable candidates) are awbzp (59.43), bctko (65.17), anwjf (65.94), apgdj (67.26) and bchji (67.92).
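A summary of this kind can be produced from a simple mapping of SLD strings to scores. The sketch below is one possible approach, not the code used for the analysis: `summarise_scores` is a hypothetical helper, and the sample data are just the five highest-scoring strings quoted above.

```python
from typing import Dict, List, Tuple


def summarise_scores(scores: Dict[str, float],
                     top_n: int = 5) -> Tuple[float, List[Tuple[str, float]]]:
    """Return the share of names scoring zero and the top_n highest scorers.

    The highest scorers are returned in ascending order of score.
    """
    if not scores:
        raise ValueError("no scores supplied")
    zero_share = sum(1 for s in scores.values() if s == 0) / len(scores)
    ranked = sorted(scores.items(), key=lambda item: item[1])
    return zero_share, ranked[-top_n:]


# The five highest-scoring SLD strings quoted in the text.
quoted = {"awbzp": 59.43, "bctko": 65.17, "anwjf": 65.94,
          "apgdj": 67.26, "bchji": 67.92}
share, worst = summarise_scores(quoted, top_n=2)
print(share, worst)

# Share of zero-score names reported in the text: 8,893 of 478,369.
print(f"{8_893 / 478_369:.1%}")
```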

**Figure 3:** Distribution of phonotactics violation scores for all unregistered 5-character alphabetic domains with names beginning with ‘a’ or ‘b’

The above analysis suggests that the phonotactic violation score provides one reasonable basis for filtering down large datasets of candidate domain names into smaller subsets (i.e., by selecting a score threshold and then retaining domains with SLD scores below that value), which can then be reviewed for suitable brandable candidates. Realistically, this criterion will need to be combined with others in order to obtain datasets of manageable size (particularly given that many readable / brandable names are assigned scores of up to 15 or greater), and to drill down into subsets that meet particular conditions (e.g. contain product-related keywords or ‘fragments’, or where some rough guidelines on the preferred brand name are available). Other possible filtering criteria might be to exclude any names containing no vowels, or names with more than two consecutive repeated characters. Of course, such criteria may not always be appropriate, depending on branding preferences, but are likely to be the sort of conditions that will ‘fit well’ with the use of phonotactic analysis (i.e. where brand names resembling classical readable words are being sought). This type of analysis will also likely not be appropriate for generating non-readable names (e.g., those intended to be used as acronyms/initialisations), but this is deemed to be a separate problem (since a brand owner seeking to use an acronym will likely already have an established (multi-word) name to which that acronym will apply).
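The combination of a score threshold with simple string-based criteria might be sketched as follows. This is an illustration under stated assumptions rather than the filter used in the study: scores are taken as precomputed, a negative score is treated as ‘no score assigned’, ‘y’ is counted as a vowel, and the repeated-character rule is read as excluding runs of three or more identical characters. All names here are hypothetical.

```python
import re
from typing import Dict, List

VOWELS = set("aeiouy")  # assumption: 'y' counts as a vowel
TRIPLE_REPEAT = re.compile(r"(.)\1\1")  # the same character three or more times in a row


def passes_filters(sld: str, score: float, threshold: float,
                   keep_unscored: bool = False) -> bool:
    """Apply the score threshold plus the vowel and repeated-character rules."""
    if score < 0:  # assumption: a negative score means scoring failed
        if not keep_unscored:
            return False
    elif score >= threshold:  # retain only scores below the threshold
        return False
    if not VOWELS & set(sld):
        return False  # no vowels at all
    if TRIPLE_REPEAT.search(sld):
        return False  # more than two consecutive repeated characters
    return True


def filter_candidates(scores: Dict[str, float], threshold: float,
                      keep_unscored: bool = False) -> List[str]:
    """Return the SLD strings that survive all filters, in input order."""
    return [sld for sld, score in scores.items()
            if passes_filters(sld, score, threshold, keep_unscored)]
```

Whether unscored names are kept or dropped is left as a parameter, since discarding them outright could remove otherwise usable candidates; the threshold itself is a judgment call that trades the size of the review set against the risk of missing readable names with higher scores.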

Review

So, does this approach yield meaningful results? The short answer is that it does seem to do so. An initial set of searches within the set of more than 312 million unregistered 5- and 6-character .com domain names, combined with test domain purchases, has allowed the identification of at least some domains which, when submitted to the Atom.com domain marketplace, have been deemed attractive enough from a brandability point of view to be assigned ‘premium’ status, with suggested values in excess of $2,000:

axidy.com (phonotactic violation score: 0.90)—suggested price $2,299
gyble.com (0.90)—$2,299
kyppy.com (0.90)—$2,299
ebeya.com (no score assigned)—$2,399
byskit.com (0.12)—$2,499
fybric.com (0.49)—$2,499
qaxxy.com (0.90)—$2,499
duklet.com (0.93)—$2,599
tyckl.com (0.00)—$2,699
ozogy.com (no score assigned)—$2,799

This set of 10 premium domains is from an initial test-set of 48 candidates, all deemed to be attractive on the basis of a manual review of the filtered dataset, and submitted to Atom.com for consideration for sale suitability (i.e. a ‘hit rate’ of 21%—more than one in five). Many of the other identified domains are, however, also appealing from the point of view of potential brandability, with 25 of the group of 48 being assessed by an alternative AI-based domain valuation tool as being worth $100 or more.

Conclusions

The use of phonotactic analysis, combined with other appropriate filtering criteria and a subsequent process of manual review and assessment, does appear to provide a valid basis for identifying brandable domain names—i.e. candidates of potential interest for individuals seeking an attractive name for a new business—amongst the very large dataset of unregistered potential names; a dataset which would, by virtue of its size, not be reasonably manually reviewable without the application of these criteria. These ideas are key to the identification of attractive available names—the ‘unregistered gems’ within the extensive bedrock of domain-name noise—which has obvious applications in the provision of brand-name recommendations and branding consultancy.

Acknowledgements

Thanks must go to Sten Lillieström for his introduction to this subject, and his calm and measured enthusiasm in all subsequent discussions.
