Exploring the Domain of Subdomain Discovery

Domain name monitoring—that is, the detection of domains with names containing a brand-term (or other string) of interest—is a very well-established element of brand protection services. Branded domain names are of key importance to brand owners (as the basis for business-critical infrastructure (i.e. ‘core’ domain names), and as part of a ‘tactical’ portfolio of strategic and defensive registrations), but also to infringers, who can utilise domains as a means of impersonation, passing off, claimed affiliation, or traffic direction and monetisation. These types of third-party registrations are often of great concern by virtue of factors such as their explicit abuse of IP, and their high potential visibility in search engines. However, they are (up to a point) relatively straightforward to detect, through methods such as domain zone-file analysis, which most brand protection service providers will utilise as a standard methodology.

A more complex world is the ecosystem of subdomain names. A subdomain is the part of the URL prior to the dot preceding the domain name (e.g. ‘play’ in play.google.com). The owner of a domain name can create whatever hierarchy of subdomain names they wish, and can (for example) configure each distinct hostname (i.e. a subdomain plus domain-name combination) to resolve to a different IP address and webpage content. Additionally, some Internet service providers (‘private subdomain registries’) offer the sale of subdomains of one or more domains under their ownership, as a business model. Subdomains can be used legitimately for a range of different purposes, including the creation of subject- or region-specific microsites, but can be abused by infringers in many of the same ways as domain names[1][2][3][4]. This issue is made more concerning by the fact that there is, in general, no comprehensive way of detecting potentially relevant subdomains (akin to the zone-file methods used for domain names themselves), which is one of the great unsolved issues in brand monitoring[5].

Definitions

Terminologically, ‘subdomain monitoring’ as a service description is often used in two distinct ways in the context of brand protection and cybersecurity. The first—most usually carried out by the registrar responsible for the management of a brand owner’s official domain portfolio, and therefore with a full overview of their domain and subdomain infrastructure—refers to the monitoring of subdomains of the brand owner’s official domains, with a view to identifying potential cybersecurity issues. These can take the form of ‘dangling’ DNS records—i.e. subdomains which are no longer used and which are susceptible to hijacking—or the third-party creation of new subdomains through DNS compromise (i.e. domain ‘shadowing’). The second definition—i.e. the identification of relevant subdomains on an arbitrary third-party domain name, a process which may be termed subdomain ‘discovery’—is a much more complex prospect. Generally it involves the application of a combination of methods (which even together are not comprehensive), such as analysis of domain-name zone configuration information (e.g. passive DNS analysis), certificate transparency (CT) analysis, or the use of explicit queries for specific subdomain names. This issue of subdomain discovery is the focus of the remainder of this article.

A case study of subdomain discovery—top 50 popular websites

a. Introduction

As a case study, we explore an approach allowing the identification of (as many as possible) subdomains of each of the top 50 most popular website domain names (as of March 2024), according to Similarweb[6], using a combination of monitoring and discovery scripts[7][8][9], open-source databases, and search queries.

i. Methodology

In general, a comprehensive overview of the subdomains of a particular domain name is only possible via inspection of the full DNS zone record, which is generally only accessible by the managing registrar (as for a (true) subdomain monitoring service). However, partial coverage, from a discovery point of view, can be achieved through a combination[10] of:

  • Queries to search engines
  • Queries to public databases of DNS or SSL information, data from Internet scans, or certificate transparency logs (i.e. information pertaining to the issue of digital certificates)
  • Brute-force searches (i.e. generating possible ‘candidate’ subdomains from large lists of keywords, and testing to determine which ones resolve)
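
The brute-force approach in the final bullet can be sketched as follows. This is a minimal illustration rather than the scripts used in the study: the function names are hypothetical, the keyword list is supplied by the caller, and the resolver defaults to the standard library's `socket.gethostbyname` but can be swapped for any callable (which also allows the logic to be exercised without network access).

```python
import socket
from typing import Callable, Iterable, List


def candidate_hostnames(domain: str, keywords: Iterable[str]) -> List[str]:
    """Build one candidate hostname per keyword, e.g. 'mail' -> 'mail.site.com'."""
    return [f"{kw.strip().lower()}.{domain}" for kw in keywords if kw.strip()]


def resolving_hostnames(
    candidates: Iterable[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> List[str]:
    """Return only those candidates for which the resolver returns an address."""
    found = []
    for host in candidates:
        try:
            resolve(host)
        except (OSError, UnicodeError):
            # socket.gaierror (a subclass of OSError) signals a failed lookup;
            # UnicodeError can be raised for labels that cannot be encoded.
            continue
        found.append(host)
    return found
```

One caveat with this style of test is that a domain configured with wildcard DNS will appear to resolve for every candidate, so in practice the results would need to be filtered (for example, by first testing a random string that is unlikely to exist).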

ii. Terminology

In the description of the identified subdomains, the following terminology is used (in reference to test.mail.site.com as an example):

  • site.com is the domain name (‘.com’ is the top-level domain (TLD); ‘site’ is the second-level domain (SLD))
  • test.mail.site.com is the full hostname
  • The full string preceding the domain name (i.e. ‘test.mail’ in this case) is the subdomain name string. The number of distinct subdomain ‘elements’ is referred to as the number of ‘levels’ (i.e. two in this case, with the elements being ‘test’ and ‘mail’); the total length of this string (in characters) is the sum of the lengths of the individual elements, plus the separating dots (‘.’)
  • The element preceding the domain name (i.e. ‘mail’ in this case) is the third-level domain
  • The first element in the subdomain name string (i.e. ‘test’ in this case) is the lowest-level name
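
The terminology above can be expressed as a short parsing sketch. The class and function names are hypothetical, and the code assumes that the domain name consists of the final two labels of the hostname (as in site.com); this would not hold for multi-part extensions such as .co.uk, for which a suffix list would be needed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ParsedHostname:
    domain: str          # e.g. 'site.com'
    subdomain: str       # e.g. 'test.mail' (the subdomain name string)
    elements: List[str]  # e.g. ['test', 'mail']

    @property
    def levels(self) -> int:
        """Number of distinct subdomain elements."""
        return len(self.elements)

    @property
    def length(self) -> int:
        """Sum of the element lengths plus the separating dots."""
        return len(self.subdomain)

    @property
    def third_level(self) -> str:
        """Element immediately preceding the domain name."""
        return self.elements[-1] if self.elements else ""

    @property
    def lowest_level(self) -> str:
        """First element in the subdomain name string."""
        return self.elements[0] if self.elements else ""


def parse_hostname(hostname: str, domain_labels: int = 2) -> ParsedHostname:
    """Split a hostname, assuming the last `domain_labels` labels are the domain."""
    labels = hostname.lower().rstrip(".").split(".")
    if len(labels) < domain_labels:
        raise ValueError(f"not a full domain name: {hostname!r}")
    elements = labels[:-domain_labels]
    domain = ".".join(labels[-domain_labels:])
    return ParsedHostname(domain, ".".join(elements), elements)
```

For test.mail.site.com, this gives a domain of site.com, a subdomain name string of test.mail (two levels, nine characters), a third-level domain of ‘mail’ and a lowest-level name of ‘test’.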

b. Findings

Using the range of approaches discussed above, over 640,000 unique subdomains were identified, across just the 50 domain names under consideration (Figure 1).
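
Because the results are drawn from several overlapping sources, arriving at a count of unique subdomains implies some form of normalisation and de-duplication. The sketch below shows one plausible way of doing this; it is not the study's actual pipeline, and the function names and normalisation rules (lower-casing, removal of trailing dots, and stripping of a leading wildcard ‘*.’, which can appear in certificate data) are assumptions.

```python
from typing import Iterable, Set


def normalise(hostname: str) -> str:
    """Lower-case a hostname, drop any trailing dot and any leading '*.'."""
    host = hostname.strip().lower().rstrip(".")
    if host.startswith("*."):
        host = host[2:]
    return host


def unique_subdomains(domain: str, *sources: Iterable[str]) -> Set[str]:
    """Merge hostnames from several sources, keeping only subdomains of `domain`."""
    suffix = "." + domain.lower()
    seen: Set[str] = set()
    for source in sources:
        for raw in source:
            host = normalise(raw)
            # The bare domain itself and unrelated domains are excluded.
            if host.endswith(suffix):
                seen.add(host)
    return seen
```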

Figure 1: Numbers of identified subdomains for each of the top twenty domains (by number identified)

The subdomain names range in length and number of levels, up to 231 characters and 28 levels (respectively), with the longest subdomain in the dataset (by both measures) found to be:

news.xinhuanet.comwww.zalando.dewww.google.comhyperboleandahalf.blogspot.comchannel.pixnet.netwww.youtube.comhistory.gmw.cnvk.comwww.bing.comsd.360.cnmarketplace.asos.comstock.sohu.com2kindsofpeople.tumblr.comimgur.comgithub.comwww.xvideos.com
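
The stated figures can be checked directly from this hostname. The snippet below is a simple verification, assuming that the final two labels (xvideos.com) constitute the domain name and that everything preceding them is the subdomain name string.

```python
# The hostname is split across several string literals purely for readability.
hostname = (
    "news.xinhuanet.comwww.zalando.dewww.google."
    "comhyperboleandahalf.blogspot.comchannel.pixnet.netwww.youtube."
    "comhistory.gmw.cnvk.comwww.bing.comsd.360.cnmarketplace.asos."
    "comstock.sohu.com2kindsofpeople.tumblr.comimgur.comgithub."
    "comwww.xvideos.com"
)

labels = hostname.split(".")
elements = labels[:-2]           # assumption: the last two labels are the domain
subdomain = ".".join(elements)   # the subdomain name string

print(len(elements), len(subdomain))  # prints "28 231"
```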

The distribution of lengths (up to 50 characters) and numbers of levels (up to 10) across the whole dataset is shown in Figure 2.

Figure 2: Distribution of subdomain lengths and numbers of levels, across the dataset, by number of instances

For one-level subdomains, there is a peak in number of instances at a length of 7 characters. For two-level subdomains, there is a peak at 15 characters (i.e. a mean of 7.0 characters per element, once the one separating dot is excluded), and for three-level subdomains the peak occurs at length 21 (a mean of 6.3 characters per element, excluding the two separating dots).
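
The per-element means quoted above follow from subtracting the separating dots from the total length before dividing by the number of levels. A short sketch of this calculation, together with one way the underlying length and level distribution might be tabulated, is given below (the function names are hypothetical).

```python
from collections import Counter
from typing import Iterable


def length_level_distribution(subdomains: Iterable[str]) -> Counter:
    """Count instances for each (number of levels, total length) pair."""
    dist: Counter = Counter()
    for sub in subdomains:
        dist[(sub.count(".") + 1, len(sub))] += 1
    return dist


def mean_element_length(total_length: int, levels: int) -> float:
    """Mean characters per element, excluding the (levels - 1) separating dots."""
    return (total_length - (levels - 1)) / levels


print(mean_element_length(15, 2))            # prints 7.0
print(round(mean_element_length(21, 3), 1))  # prints 6.3
```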

From the overall dataset, it is possible to calculate the statistics for the most frequently occurring subdomain elements, regardless of the level in the subdomain hierarchy at which they appear. This information is shown in Table 1.

Table 1: Top ten subdomain elements (at any level) by total number of instances
Subdomain element    No. instances
mail                 25,111
ne1                  18,229
gq1                  17,753
bf1                  16,241
ghs                  15,801
aa-rt                15,037
qzone                13,855
corp                 12,552
afd                  12,436
clump                11,669

Other key terms appearing in the top 100 include ‘www’ (5,707 instances), ‘teams’ (3,897), ‘dns’ (1,879), ‘shop’ (1,547), ‘cloud’ (1,477), ‘dev’ (1,331), ‘extranet’ (1,261), ‘test’ (1,145), ‘sandbox’ (1,084), ‘search’ (960) and ‘media’ (873).
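
A count of this kind can be produced by splitting each subdomain name string on its dots and tallying the resulting elements. The following is a minimal sketch with a hypothetical function name; it assumes that every occurrence is counted, including cases where the same element appears more than once within a single subdomain.

```python
from collections import Counter
from typing import Iterable


def element_counts(subdomains: Iterable[str]) -> Counter:
    """Tally every element of every subdomain name string, at any level."""
    counts: Counter = Counter()
    for sub in subdomains:
        counts.update(sub.lower().split("."))
    return counts
```

Calling `element_counts(...).most_common(10)` on the full set of subdomain name strings would then give a ranking of the form shown in Table 1.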

It is also possible to calculate more granular statistics for elements appearing at key locations in the subdomain strings. Tables 2 and 3 show the top third-level domain strings (i.e. the element immediately preceding the domain name) and lowest-level domain strings (i.e. the element at the start of the subdomain name string) identified across the dataset, by total numbers of instances (noting that, for subdomains with one level, the third-level string will—by definition—also be the lowest level).
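
In terms of implementation, the two positions differ only in which end of the subdomain name string is inspected. A minimal sketch (with a hypothetical function name, operating on subdomain name strings from which the domain name has already been removed) might be:

```python
from collections import Counter
from typing import Iterable, Tuple


def positional_counts(subdomains: Iterable[str]) -> Tuple[Counter, Counter]:
    """Return (third-level counts, lowest-level counts) for the given strings."""
    third: Counter = Counter()
    lowest: Counter = Counter()
    for sub in subdomains:
        elements = sub.lower().split(".")
        third[elements[-1]] += 1   # element immediately preceding the domain
        lowest[elements[0]] += 1   # element at the start of the string
    return third, lowest
```

For a one-level subdomain, the same element is counted in both tallies, consistent with the note above.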

Table 2: Top ten third-level domain strings by total number of instances
Third-level domain   No. instances
ne1                  18,225
gq1                  17,745
bf1                  16,188
ghs                  15,785
aa-rt                15,037
qzone                13,853
ynwp                 7,307
corp                 6,910
sg3                  6,870
spaces               6,185

Table 3: Top ten lowest-level domain strings by total number of instances
Lowest-level domain  No. instances
lo0                  11,536
www                  4,086
ha1                  993
ha2                  903
m                    776
api                  753
o-o                  698
crawl                661
vl-120               522
a                    418

Certain classes of subdomain names also tend to have special use-cases—two-character names, for example, are often used to denote country codes (e.g. for regional subsites) or may have other special meanings (e.g. ‘go’, ‘my’ or ‘ai’). The top 20 two-character subdomain elements across the whole dataset are shown in Table 4.

Table 4: Top 20 two-character subdomain elements (at any level) by total number of instances
Subdomain element    No. instances
98                   3,512
a1                   2,352
ke                   1,916
10                   1,817
qa                   1,750
bb                   1,517
dv                   1,206
sc                   1,112
ny                   1,020
tc                   841
a0                   690
a2                   685
in                   652
hk                   568
mp                   561
cp                   524
fp                   523
qq                   522
my                   502
db                   491

Various other common abbreviations also appear highly in the dataset, including the (potential) country codes de (430 instances), ru (343), fr (324), us (252), cn (246), es (246), it (243), kr (243), uk (223), au (200), jp (191), and other terms such as go (393).

It is worth noting, however, that the above statistics may be dominated by the naming style used across just a small number of sites. For example, all of the ‘ne1’ third-level domains were identified on the yahoo.com site. Potentially a more meaningful insight into the style of names used across the subdomain landscape generally can be gained by determining the numbers of unique sites (within the dataset of 50) across which a specific name string was identified. These statistics—for the features shown in Tables 2 and 3—are shown in Tables 5 and 6.
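
The difference between the two measures is that each site contributes at most once per name string, however many hostnames on that site use it. A minimal sketch of the per-site count is shown below; the function name is hypothetical, and the code again assumes that the domain name comprises the final two labels of each hostname.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def site_counts(
    hostnames: Iterable[str], position: int = -1, domain_labels: int = 2
) -> Dict[str, int]:
    """Count the distinct domains on which each element appears.

    position=-1 selects the third-level domain; position=0 the lowest level.
    """
    sites: Dict[str, Set[str]] = defaultdict(set)
    for host in hostnames:
        labels = host.lower().rstrip(".").split(".")
        elements = labels[:-domain_labels]
        if not elements:
            continue  # bare domain name, no subdomain present
        domain = ".".join(labels[-domain_labels:])
        sites[elements[position]].add(domain)
    return {name: len(domains) for name, domains in sites.items()}
```

Under this measure, a string such as ‘ne1’ that occurs many thousands of times on a single site would score only 1.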

Table 5: Top ten third-level domain strings by number of unique sites (in the set of 50) on which the name was identified
Third-level domain   No. sites (/ 50)
www                  49
m                    43
api                  40
support              36
blog                 35
mail                 34
help                 33
dev                  31
news                 31
email                30

Table 6: Top ten lowest-level domain strings by number of unique sites (in the set of 50) on which the name was identified
Lowest-level domain  No. sites (/ 50)
www                  49
m                    44
api                  43
dev                  40
mail                 39
support              39
blog                 37
help                 36
test                 35
app                  34

Several of these terms have clear use-cases, and appear to be used consistently across multiple popular sites (e.g. the use of ‘m.’ for the mobile-compatible version of a website).

Many of these trends mirror those from previous studies. For example, a 2021 analysis[11] of the most popular subdomain (‘element’) strings overall found that the top three were ‘www’, ‘mail’ and ‘forum’. Whilst ‘www’ does not appear in the list of top ten most frequently occurring subdomain elements across the 50 sites considered in this analysis (Table 1), it does appear more than 5,700 times across the dataset. Furthermore, the dataset contains almost 1,000 distinct variants of ‘www’ being used as subdomain elements, with the list topped by ‘www’ itself (5,707 instances), followed by ‘comwww’ (146), ‘www2’ (42), ‘www1’ (31) and ‘ww’ (24).

An additional study[12], looking at the (analogous) use of second-level domain names in conjunction with dot-brand extensions, also found extensive use of many of the strings featured in this study, including ‘mail’ and ‘api’.

It was noted previously that subdomain-related brand abuse can be a popular way of creating infringements or deceptive content. As a proxy for the infringement landscape, we can consider just those examples from the 50-site dataset in which the name of the Apple brand (the most valuable brand in 2024[13], but whose website does not appear in the list of 50 considered) appears anywhere in the subdomain name string. This will represent just a tiny proportion of the potential subdomain infringement landscape, since we are focusing just on a single brand, are considering only those instances where a textual mention appears in the subdomain name, and are focusing only on subdomains on the top 50 sites (where, one might hope, the infringement landscape may be much less pronounced than across the Internet generally, given that these sites are controlled by large corporations, in some cases with IP protection programmes in place). In addition, the searches carried out for this study did not include any explicit brand-related searches; in a formal landscape sweep for (say) Apple, it would be advantageous to include additional search queries of the form: site:[site.com]+apple.

Nevertheless, the study dataset includes 139 examples in which ‘apple’ is referenced somewhere in the subdomain name, including a small number of live examples of potential infringements (Figure 3).
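
A filter of this kind, together with the generation of brand-specific search queries of the form described above, can be sketched as follows. The function names are hypothetical, the domain name is assumed to comprise the final two labels of each hostname, and the match is a simple substring test, so it will also return coincidental matches (e.g. ‘pineapple’) which would need manual review.

```python
from typing import Iterable, List


def brand_mentions(
    hostnames: Iterable[str], brand: str, domain_labels: int = 2
) -> List[str]:
    """Return hostnames whose subdomain name string contains the brand term."""
    term = brand.lower()
    hits = []
    for host in hostnames:
        labels = host.lower().rstrip(".").split(".")
        subdomain = ".".join(labels[:-domain_labels])
        if term in subdomain:
            hits.append(host)
    return hits


def site_queries(domains: Iterable[str], brand: str) -> List[str]:
    """Build one search query per domain, in the form site:site.com+brand."""
    return [f"site:{domain}+{brand}" for domain in domains]
```

Note that a hostname such as www.apple.com would not be returned by `brand_mentions`, since the term appears there in the domain name rather than in the subdomain name string.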

Figure 3: Examples of subdomain-based Apple potential infringements from within the dataset

Conclusions

Aside from the specific trends observed in the set of subdomains of the top 50 most popular websites, a significant take-away from this analysis is the effectiveness of the use of a range of discovery techniques to identify relevant content. Using a combination of search-engine queries, information from DNS, SSL and certificate transparency databases, and brute-force keyword-based searches, it has proven possible to identify almost two-thirds of a million subdomains of the 50 websites in question.

Given the risks associated with subdomain-based infringements, monitoring of this space as part of a comprehensive brand protection solution is of key importance, but has always proven difficult to achieve. This initial analysis shows that the range of available approaches can, when used together, provide a successful means of detecting potential threats. Whilst completely comprehensive subdomain detection is unlikely to be possible, these methods certainly provide a significant step in the right direction.

By David Barnett, Brand Protection Strategist at Stobbs
