The Missing Data: Measuring ISP User Populations

	By Geoff Huston Author & Chief Scientist at APNIC
	November 11, 2024

In our physical world, census information is used to inform the planning processes behind the provision of infrastructure, such as schools, hospitals, housing, and similar. It can be used to assess the impact of natural disasters or to understand a society’s needs in terms of food and energy security. Demographic data is also used to inform investment and business decisions.

You’d think that the Internet itself would be awash with similar information. After all, much of the Internet’s economy is based on the aggregation of user profile data, which is then repackaged and sold to advertisers in the form of ad placement capabilities. So, it’s likely that similar census-related data will be continually gathered on the Internet. However, this data is a key commercial asset owned by the corporate entities that gather the data. There is very little public data of a similar nature that relates to the market positioning of Internet Service Providers (ISPs) in terms of the number of users of their services.

In our measurement work at APNIC Labs, we have tried to relate our measurement data, which is based on a sampled subset of users, to the larger picture of user populations. If we knew the number of users of each Internet Service Provider (ISP), then it would be possible to infer the level of adoption of a particular technology, such as IPv6 or DNS security mechanisms, from the sampled measurements.

This data would also be extremely useful in a number of other areas. When a major ISP experiences a service failure, how many users are affected by the disruption? For example, Optus, a major ISP in Australia, experienced an 8-hour service outage on the 8th of November, 2023. Optus is the second-largest provider in the Australian ISP market, with an estimated 4 million users, so the outage was a major incident.

The data would also be extremely useful in the area of public policy. How open is the market for the provision of Internet services within each country? How many users are served by each ISP? What’s their respective market share?

Such information can also inform policy issues related to national security and resilience: How many local users are reliant on the services provided via a foreign platform?

Our response to this missing data set was to generate, on a daily ongoing basis, our estimate of the number of users per ISP for every ISP that we see on the Internet through the ad-based measurement platform. This report is published at the URL: https://stats.labs.apnic.net/aspop. As far as we are aware, this is the only public data set that encompasses the entirety of the public Internet.

Here, I would like to explain how we calculate this data and provide some responses to a recent presentation at the RIPE 89 meeting on this data set.

Data Generation

The process starts with the estimated current population in each country. The data we use is sourced from the United Nations Population Division. We use the mid-year population estimate from 2023 and apply the 2022-2023 growth rate to the period from mid-2023 to the present day to get an estimate of the current population of each country for this day.
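The extrapolation step can be sketched as follows. This is a minimal illustration, not the code used at APNIC Labs: the function name, the choice of 1 July as the mid-year date, and the use of compound (rather than linear) growth are all assumptions, and the population figures are made up.

```python
from datetime import date

def estimate_population(mid_year_pop, annual_growth_rate, on_day,
                        base_day=date(2023, 7, 1)):
    """Extrapolate a mid-year population estimate to a given day.

    Assumes compound growth at the most recent annual rate; the article
    does not say whether growth is applied linearly or compounded.
    """
    years_elapsed = (on_day - base_day).days / 365.25
    return mid_year_pop * (1.0 + annual_growth_rate) ** years_elapsed

# Hypothetical country: 10 million people at mid-2023, growing 1% per year.
print(round(estimate_population(10_000_000, 0.01, date(2024, 11, 11))))
```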

The second data set we use is the proportion of the population of each country that is classed as Internet users. There are three possible sources for this data: the World Bank, the International Telecommunication Union (ITU) and the CIA World Factbook. We use the ITU data by preference, but the three data sets are well correlated in any case.

The combination of this data gives us an estimate of the current Internet user population per country. It should be noted that this is not the number of “subscriptions” to a service, as it attempts to include the number of users behind each subscription. It is also intended to avoid “double counting,” so a user who has both a broadband service and a mobile service is still counted only once as an “Internet user.”
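As a rough sketch, the per-country calculation is a single multiplication once a source for the Internet-user share has been chosen. The function and dictionary keys below are hypothetical, and the fallback order after the ITU is an assumption (the text only says that ITU data is preferred); the figures are illustrative.

```python
def internet_users(population, share_by_source,
                   preference=("itu", "world_bank", "cia")):
    """Estimate Internet users as population x share of Internet users.

    share_by_source maps a source name to a fraction in [0, 1], or None
    when that source has no figure for the country.
    """
    for source in preference:
        share = share_by_source.get(source)
        if share is not None:
            return population * share
    raise ValueError("no Internet-user share available for this country")

# Hypothetical country: 10.1 million people, 85% of them Internet users (ITU).
print(internet_users(10_100_000, {"itu": 0.85, "world_bank": 0.83}))
```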

The third component of the data is the ad presentation data of the APNIC measurement program. We use Google Ads to deliver some 25M individual ad impressions per day. We use the MaxMind geolocation database to map each user who received an ad impression to a country, and a local default-free BGP routing table to map each user to their “home” network. At this point, we have assembled a set of “home” networks (origin AS numbers) and the geo-located country for each presented ad.
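Mapping an address to its “home” network amounts to a longest-prefix match against the routing table. The sketch below uses only Python's standard `ipaddress` module and a tiny hand-written table; the prefixes and AS numbers are taken from the ranges reserved for documentation, and a linear scan is used for clarity where a real system with a full BGP table would use a trie or similar structure.

```python
import ipaddress

# Hypothetical routing table: prefix -> origin AS (documentation ranges only).
ROUTES = {
    "192.0.2.0/24": 64496,
    "203.0.113.0/24": 64497,
    "203.0.113.128/25": 64498,   # more specific route inside the /24 above
}

def origin_as(ip, routes):
    """Return the origin AS of the most specific prefix covering ip, or None."""
    addr = ipaddress.ip_address(ip)
    best_len, best_asn = -1, None
    for prefix, asn in routes.items():
        net = ipaddress.ip_network(prefix)
        if net.version == addr.version and addr in net and net.prefixlen > best_len:
            best_len, best_asn = net.prefixlen, asn
    return best_asn

print(origin_as("203.0.113.200", ROUTES))  # matched by the /25
```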

Assumptions

Here, we make two major assumptions. Both assumptions are somewhat questionable, but we’ve been forced to make them in the absence of generally available data.

The first assumption is that Google’s ad placement algorithms apply to all users within a given country uniformly. In defining the ad campaigns, we attempt to make the placement definitions as generic as possible so that within each country, the ad placements are roughly equivalent to a random sampling drawn from all users in that country. The implication of this assumption is that if an ISP has twice the number of users as another ISP in the same country, then its users will receive twice the number of ad impressions. This could be stated as: The distribution of ad placement, and the distribution of users across ISPs are assumed to correlate.
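Under this assumption, the per-ISP estimate is a proportional allocation: each origin AS is assigned a share of the country's Internet users equal to its share of the country's ad impressions. A minimal sketch, with hypothetical names and made-up figures:

```python
from collections import Counter

def users_per_as(impressions, country_users):
    """Allocate each country's users to origin ASes by impression share.

    impressions: iterable of (country_code, origin_asn), one per ad impression.
    country_users: dict of country_code -> estimated Internet users.
    """
    per_as = Counter(impressions)
    per_country = Counter()
    for (cc, _asn), n in per_as.items():
        per_country[cc] += n
    return {
        (cc, asn): country_users[cc] * n / per_country[cc]
        for (cc, asn), n in per_as.items()
        if cc in country_users
    }

# Hypothetical country "XX" with 3 million users and two ISPs.
sample = [("XX", 64496)] * 200 + [("XX", 64497)] * 100
print(users_per_as(sample, {"XX": 3_000_000}))
```

In this example the first AS receives twice the impressions of the second, and so is assigned twice the users.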

The second assumption is that each user uses a single ISP for Internet access. This is not necessarily the case. For example, a user may use a local mobile service provider for their mobile Internet access and Starlink for their broadband access. Similarly, a user may use their workplace’s ISP while at work and a consumer ISP when they are at home. We are not able to account for such situations, and in uniquely assigning each user to a single ISP in a country, we tend to underestimate the user count for each ISP as a consequence.

Due to the uncertainties that follow from these assumptions, the results we generate will have an inevitable level of uncertainty. Some isolated comparisons of this data against other sources where we have access to ISP market share data in individual countries point to an overall level of uncertainty of around 20% or so in our estimates of users per ISP. Large consumer ISPs are still reported as having a large user population in the generated data, but the data for small networks is very uncertain.

The assumption of uniform distribution of ad placements across all ISPs within each country tends to fail where the number of placed ads in relation to the per-country user population is low. The best current example of this can be seen with the Russian Federation, where ad placement in this country has plummeted since February 2023 (a consequence of the hostilities between the Russian Federation and Ukraine and the associated Western sanctions placed on Russia).
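One way to see why low ad volumes hurt is to look at sampling noise alone. If impressions are modelled as independent random draws (a simplifying assumption made here for illustration, and not a description of how Google places ads), the relative standard error of an ISP's estimated share follows the binomial formula below. It ignores any systematic bias in placement.

```python
import math

def relative_std_error(share, samples):
    """Relative standard error of a share estimated from independent samples."""
    return math.sqrt((1.0 - share) / (samples * share))

# An ISP holding 1% of a country's users, at two hypothetical ad volumes.
for n in (1_000_000, 1_000):
    print(n, round(relative_std_error(0.01, n), 3))
```

With a million impressions in the country, the noise on a 1% share is about 1% of its value; with only a thousand impressions, it exceeds 30%.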

The data for Norway highlights another assumption, namely that browsers do not use proxies. This does not hold for Opera, which performs many of the fetches from its own servers on behalf of Opera users. The result is that the system assumes that AS39832, the Opera AS, is the largest ISP in Norway, some four times the size of the next largest ISP, Telenor. (This Opera result is, of course, completely wrong, and I should remove Opera’s AS from this data set!)

There is another assumption around the day of the week and holidays. The analysis assumes that every day is much the same, whereas the rate of ad presentation into work-related ISPs is far higher on business days than it is for the same ISPs on weekends and holidays.

As this is a measurement based on the placement of ads, the use of so-called “ad-blockers” can disrupt this measurement. Our assumption here is that, like the ads themselves, the use of ad-blockers is also relatively uniformly distributed across all users in the country.

Conclusions

It’s frustrating that this information is not generally collected in annual filings to national regulatory agencies and not collated internationally by the ITU-T, and this frustration has motivated us to use our measurement data to push out our estimates as a public data set. The conclusion from the recent RIPE presentation is that this method of estimating the number of users for each ISP works well in countries with sufficient Google Ads presentations, a conclusion that matches our own experience in running this measurement for many years.

On the other hand, the generation of this data is based on a number of sweeping assumptions, which I’ve noted here, and the numbers should be treated with some level of caution.

By Geoff Huston, Author & Chief Scientist at APNIC — (The above views do not necessarily represent the views of the Asia Pacific Network Information Centre.)