What are the most popular top-level domains (TLDs), or at least, which ones show up most often on pages indexed in Google?
I wondered this yesterday after seeing a news article stating that registrations of .cn (China) top-level domain names topped 1 million for the first time by the end of 2005. The seed for my wonderment was probably planted when EGOL, at Cre8asite Forums, asked about using a ‘.info’ top-level domain earlier that day.
So I decided to check to see which were the most popular in Google, since that was the easiest place to get some statistics.
I found a couple of lists of top-level domains (generic TLDs and country code TLDs), and searched for the number of results that appeared in Google, using the advanced “site:” search operator and my TLD lists. For example, a search for “site:.com” (without the quotation marks) might show approximately how many pages in Google’s index are on sites using a “.com” top-level domain.
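For anyone curious how such a tally might be automated, here is a rough sketch in Python. The TLD list is abbreviated, the regular expression that pulls the estimated count out of Google’s results page is an assumption about markup that varies and changes frequently, and Google may block or throttle scripted queries, so treat this as an illustration of the method rather than a reliable tool.

```python
import re
import time
from typing import Optional

import requests  # third-party: pip install requests

# Abbreviated TLD list for illustration; the actual tally used full gTLD/ccTLD lists.
TLDS = [".com", ".org", ".net", ".uk", ".de", ".cn", ".info"]

# Assumed pattern for the "of about N results" text on the results page;
# Google's markup differs between data centers and changes over time.
COUNT_PATTERN = re.compile(r"of about <b>([\d,]+)</b>")

def estimated_count(tld: str) -> Optional[int]:
    """Run a site: query and pull the estimated result count out of the HTML."""
    resp = requests.get(
        "https://www.google.com/search",
        params={"q": f"site:{tld}"},
        headers={"User-Agent": "Mozilla/5.0"},  # assumption; scripted queries may be blocked
        timeout=10,
    )
    match = COUNT_PATTERN.search(resp.text)
    return int(match.group(1).replace(",", "")) if match else None

if __name__ == "__main__":
    for tld in TLDS:
        print(tld, estimated_count(tld))
        time.sleep(2)  # pause between requests to stay polite
```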
I have listed the 20 most popular, then the 20 least popular, and finally the whole list.
William, this is some good research.
In the bottom 20 were some extensions that aren’t delegated to anyone to manage, like .ax, .eh, .kp, and .cs, and the most interesting thing is that there was anything at all in .ap. .ap is one of ten ISO 3166 list additions that were set up for WIPO and for which no TLDs exist.
Interesting—.uk has the most pages indexed of any ccTLD, nearly three times as many as the next-highest ranked ccTLD (.ca). However, .de has nearly double the number of domain names registered as .uk. I wonder if this reflects a Google bias (perhaps inadvertent, or at least benign) towards indexing English-language web pages?
I really wonder why there are about 120 search results for .kp (Korea, Democratic People’s Republic). As Jothan mentioned, it was never delegated.
One more interesting thing is that there are only about 13 million search results for .kr (Korea, Republic of). This shows that Google’s search results for non-English web pages are still not very reliable.
Nonexistent addresses seem to be able to get into the Google index somehow; maybe Google adds any URL that anybody links to, whether it really exists or not. The search results for .iq include some obvious “joke” entries like phrases ending in “low.iq”. None of them actually work when you try to follow the links from the Google results.
Thanks for the comments. I wasn’t sure what I would find when I originally started collecting this information, but have received a few comments here, and on the blog that have made me think some more about what I’m seeing.
Looking at the sites that are actually listed in some of those ccTLDs with very small numbers of results, the data does seem to tell us more about Google than it does about top-level domains.
Google will index URLs that it finds on pages even though the site isn’t available. That I knew. Sometimes pages aren’t available for one reason or another.
But, it also looks like Google will index URLs that shouldn’t even exist. That surprised me a little. I just conducted a search for “site:.xxx” (without the quotation marks) and received 852 results. A search for “site:.xyz” came back with another 98 results.
The URLs returned all shared some characteristics that I also see on some of the other ccTLDs that don’t have many results, because they may not have been delegated, or have expired. These are:
Use of the URL as the snippet title.
Lack of any snippets of text from the sites themselves.
Lack of a link to a cache of the page.
Those characteristics are usually good indicators that Google has found a link to a page, but had problems visiting the page. I’m surprised that they would include URLs with TLDs that are nonexistent, like the “.xyz” one I mentioned above.
While the numbers of results are small, I would have expected the search engine to filter out some of these results, or even more likely, to ignore them at an earlier stage, when it is crawling pages and collecting URLs for indexing. It’s possible that the effort involved in doing that isn’t worth the processing power, and it’s probable that having these results in the index doesn’t affect the relevancy of too many searches.
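If a crawler did want to ignore such URLs early on, a minimal sketch of that kind of filter might look like the following. The hard-coded TLD set here is a tiny stand-in for the full IANA list, which a real crawler would load from an authoritative, up-to-date source; nothing here describes how Google actually does it.

```python
from urllib.parse import urlsplit

# Tiny stand-in for the full IANA TLD list; a real crawler would load the
# current list from an authoritative source rather than hard-code it.
VALID_TLDS = {"com", "org", "net", "uk", "de", "dk", "kr", "info"}

def has_valid_tld(url: str) -> bool:
    """Keep only URLs whose host ends in a TLD that has actually been delegated."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return tld in VALID_TLDS

candidates = [
    "http://example.com/page",
    "http://example.xyz/",  # .xyz had not been delegated at the time of writing
    "http://example.xxx/",  # neither had .xxx
]
frontier = [u for u in candidates if has_valid_tld(u)]
print(frontier)  # only the .com URL would be queued for crawling
```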
Geoffrey’s comparison of indexed pages to registered domains is interesting. I don’t know if we can conclude search engine bias in the indexing of sites based upon language from the data that I collected.
There are a number of sites that generate pages dynamically, calling them up from a database, and depending upon how the pages are set up, they could be said to have an almost infinite number of pages to index. For instance, if a site can serve pages that display multiple data variables in the URLs to those pages, a page could be included multiple times based upon different orderings of those variables. Or, if a site uses session IDs, and the search engine picks up those session IDs in the URLs of pages, it could index many pages more than once.
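As an aside, one common way a crawler might blunt that kind of duplication is to canonicalize URLs before queuing them: sort the query parameters and strip session identifiers. This is a minimal sketch; the session-ID parameter names are assumptions for illustration, not anything Google is known to use.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed session-identifier parameter names; real sites use many variants.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

def canonicalize(url: str) -> str:
    """Sort query parameters and drop session IDs so duplicate URLs collapse."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    params = [
        (k, v) for k, v in parse_qsl(query, keep_blank_values=True)
        if k.lower() not in SESSION_PARAMS
    ]
    params.sort()
    return urlunsplit((scheme, netloc.lower(), path, urlencode(params), ""))

# Two URLs for the same page, differing only in parameter order and session ID:
a = "http://example.com/list?page=2&sort=price&PHPSESSID=abc123"
b = "http://example.com/list?sort=price&page=2&PHPSESSID=def456"
assert canonicalize(a) == canonicalize(b)
print(canonicalize(a))
```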
But, when a search engine crawler collects pages to index, it follows a number of different importance metrics that tell it which pages are important and which to visit next. Those should keep it from trying to grab too many pages from a site that has one of the problems I mentioned above.
I’m really not sure how useful this data is, but I’ll probably look at these numbers again in a few months to see if they have changed dramatically.
I tried site:.dk using google.com, which gave the 19m you got, too. Then I tried the same query using google.dk and got 42m. Then I just pressed reload a few times in the google.dk window, and it turns out it went back and forth between 19m and 42m. I guess Google has a little problem there.
Unfortunately this could mean that the data you collected isn’t totally reliable.
Hi Robert,
It sounds like you were receiving result amounts from different data centers, which Google will do while load balancing.
One of the interesting things about collecting data like this, and sharing it with others is that you get a good number of views, conclusions, opinions, and sometimes even conflicting information, like yours.
A couple of other folks have pointed out to me that they are seeing very different numbers from other data centers. Not every data center has the same information within it, and not every Google data center uses exactly the same algorithm to serve results to searchers. I knew that going into this.
What I find interesting is that some of the numbers vary drastically from one data center to another. Someone reported in a comment on the blog post that they were seeing 7 billion results for a “site:.com” search. So, I tried using a number of different data centers to see how much difference I could spot, and my results varied by more than a billion results from one data center to another.
So yes, my collection of data might tell us a little more about the reflections of the web that Google holds in its index, and not as much about the actual web. We know that there are large parts of the web that aren’t indexed at all, and that there are inaccuracies in what the search engines have indexed.
I’d love to see some accurate numbers about the actual page counts in the different TLDs. The closest I could come to an approximation was to look at one of the tools that we use to search the web. It’s been helpful to define some of the limitations of that approach, like this issue with different data centers that you raise. Thanks.