Benford’s Law in China

0. INTRODUCTION
About a week ago, Dan Silver, a management consultant in Taipei who has been studying China for 25 years, emailed me to say that he enjoyed my piece on the application of Benford’s Law to Australian population data.  Dan asked if I’d be willing to run the same tests on population data from China if he supplied the numbers.  It sounded like an interesting exercise, so I agreed.

What I found intrigued me.  The Chinese population data did not conform to Benford’s Law.  I stress that this departure from what was expected should not be taken as any kind of proof of fraudulent data entry (Benford’s Law often applies, but not always).  However, it was a surprising result.

1. BENFORD’S LAW
Tally the number of times each first digit, 1 through to 9, occurs in any “real world” data source such as lengths of rivers or heights of buildings (the unit of measurement doesn’t matter).  Intuitively, you would expect the digits to be uniformly distributed, with each one being observed 1/9th (about 11%) of the time.  Under Benford’s Law, however, a first digit of 1 tends to appear with a frequency of about 30%, a leading digit of 2 tends to occur about 18% of the time, a 3 about 13%, and so on in a logarithmically decreasing pattern, with a leading digit of 9 often being observed less than 5% of the time.
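
For reference, the percentages quoted above follow from the usual statement of Benford’s Law, under which the probability of leading digit d is log10(1 + 1/d).  A minimal sketch in R (the variable names `digits` and `benford` are chosen here purely for illustration):

```r
# Benford's Law: P(d) = log10(1 + 1/d) for leading digits d = 1..9
digits  <- 1:9
benford <- log10(1 + 1 / digits)
names(benford) <- digits

# Expected percentages, rounded to one decimal place
round(100 * benford, 1)
```

The probabilities sum to one because the terms telescope to log10(10).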

2. DATA
Dan provided land area (in square kilometres), along with two series of population data: “actual” (常住人口); and official/registered (戶籍人口) for each of the more than 350 administrative areas in China.  These data can be viewed here.  Dan notes that at the bottom of the spreadsheet are four cities which the Chinese government counts as provinces.  These are included to bring the population to the official total for China.  If you scan through the columns of figures, you’ll see that some administrative areas don’t have a land area recorded next to them, or an “actual” population.  However, they all have a registered population.  These gaps mean that the totals presented in the tables below won’t match.  The data are as at the end of 2007.

3.1 BENFORD ANALYSIS – CHINESE LAND AREA
So on to the analysis.  Let’s start by tallying the occurrences of first digits in the column of figures holding the land area data.  Table 1 summarises these frequencies (number and percentage) that were observed and expected under Benford’s Law.

Table 1: Chinese administrative land areas (sq kms)

Leading digit    No. times observed    No. times expected
1                133 (39.8%)           101 (30.1%)
2                57 (17.1%)            59 (17.6%)
3                28 (8.4%)             42 (12.5%)
4                22 (6.6%)             32 (9.7%)
5                21 (6.3%)             26 (7.9%)
6                12 (3.6%)             22 (6.7%)
7                22 (6.6%)             19 (5.8%)
8                20 (6.0%)             17 (5.1%)
9                19 (5.7%)             15 (4.6%)
TOTAL            334 (100%)            334 (100%)

I think the data are easier to digest when presented graphically.

Figure 1: Chinese administrative land areas (sq kms)


The leading digits of Chinese land area data by administrative area do decrease roughly logarithmically, although they don’t technically follow a Benford distribution when subjected to a Chi-square goodness-of-fit test (χ² = 26.05 on 8 d.f.; p = 0.00103).  The data are overweight in figures starting with a “1” and underweight in figures leading with digits “3” to “6”.  But as Benford’s Law predicts, “1” is the most common leading digit, followed by “2”, with the remaining leading digits bringing up the rear.
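
The test statistic quoted above can be reproduced from the observed counts in Table 1.  The sketch below assumes the exact Benford probabilities, log10(1 + 1/d), were used as the expected distribution; the names `land_area`, `benford` and `land_test` are illustrative only.

```r
# Observed leading-digit counts (digits 1..9) for land area, from Table 1
land_area <- c(133, 57, 28, 22, 21, 12, 22, 20, 19)

# Exact Benford probabilities for digits 1..9
benford <- log10(1 + 1 / (1:9))

# Chi-squared goodness-of-fit test of the counts against Benford's Law
land_test <- chisq.test(land_area, p = benford)
land_test
```

Under that assumption the statistic should land very close to the 26.05 on 8 degrees of freedom reported above.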

3.2 BENFORD ANALYSIS – CHINESE ACTUAL POPULATION
Similarly, let’s look at the leading digits of “actual” populations by administrative areas.

Table 2: Chinese administrative area “actual” populations

Leading digit    No. times observed    No. times expected
1                39 (14.8%)            79 (30.1%)
2                53 (20.2%)            46 (17.6%)
3                44 (16.7%)            33 (12.5%)
4                39 (14.8%)            25 (9.7%)
5                28 (10.6%)            21 (7.9%)
6                25 (9.5%)             18 (6.7%)
7                15 (5.7%)             15 (5.8%)
8                13 (4.9%)             13 (5.1%)
9                7 (2.7%)              12 (4.6%)
TOTAL            263 (100%)            263 (100%)

When plotted, the difference between frequencies observed and what was expected under Benford’s Law is, I think, striking.

Figure 2: Chinese administrative area “actual” populations


Not even close to a Benford distribution.  The leading digit of “1” actually occurs less often than a leading digit of “2” or “3”, although from “2” onwards the frequencies do decrease steadily.

3.3 BENFORD ANALYSIS – CHINESE REGISTERED POPULATION
Dan asked that I focus on the registered/official population figures (the highlighted column of raw data in the spreadsheet linked above).  Leading digits from the registered, official population figures by administrative areas are presented in Table 3 and Figure 3 below.

Table 3: Chinese administrative area “registered” populations

Leading digit    No. times observed    No. times expected
1                71 (19.6%)            109 (30.1%)
2                67 (18.5%)            64 (17.6%)
3                64 (17.6%)            45 (12.5%)
4                46 (12.7%)            35 (9.7%)
5                40 (11.0%)            29 (7.9%)
6                25 (6.9%)             24 (6.7%)
7                24 (6.6%)             21 (5.8%)
8                14 (3.9%)             19 (5.1%)
9                12 (3.3%)             17 (4.6%)
TOTAL            363 (100%)            363 (100%)


Figure 3: Chinese administrative area “registered” populations

Things look a bit better than the “actual” population data, but registered populations certainly don’t follow Benford’s Law either.  The leading digit of “1” is severely underweight compared to its predicted frequency and, in fact, digits “1” through to “3” from registered population data occur with almost equal frequency.
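
No formal test result is quoted for the two population series, so here is a sketch of how the same Chi-square goodness-of-fit test used for land area could be applied to the counts in Tables 2 and 3.  The variable names are illustrative, and the expected distribution is assumed to be the exact Benford probabilities.

```r
# Observed leading-digit counts (digits 1..9), copied from the tables above
actual     <- c(39, 53, 44, 39, 28, 25, 15, 13, 7)    # Table 2
registered <- c(71, 67, 64, 46, 40, 25, 24, 14, 12)   # Table 3

# Exact Benford probabilities for digits 1..9
benford <- log10(1 + 1 / (1:9))

# One goodness-of-fit test per population series
actual_test     <- chisq.test(actual, p = benford)
registered_test <- chisq.test(registered, p = benford)

actual_test
registered_test
```

Given how far the observed counts for “1” fall below the expected counts, both tests should reject the Benford distribution comfortably.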

4. SUMMARY
Chinese land area data by Chinese government administrative area follow a logarithmically decreasing pattern, although technically not a Benford distribution.  Chinese population data (“actual” and registered) by the same areas don’t follow a logarithmically decreasing pattern at all, Benford or otherwise.

That these Chinese government data don’t conform to Benford’s Law should not be taken as any kind of proof or insinuation that something untoward is going on.  It’s something that I would say warrants further investigation, but I expect there’ll be a rational explanation for it.  It’s probably more indicative of my own gaps in understanding than anything else.  Having said that, it was a very interesting exercise and I’d like to thank Dan Silver for the opportunity to write about it.

——

Benford’s Law and Census data

Would you like to conduct a simple statistical experiment?

Grab a list of measurements from a real-life source of data, such as heights of buildings, or lengths of rivers.  Now look at just the first digit of each of the measurements and record the frequency of the numbers 1 through to 9.  Intuitively, you’d expect that each leading digit occurs roughly 11% of the time (i.e. 1 chance in 9).  However, it is an interesting observation that the leading digit from such sources is often not uniformly distributed.  Surprisingly, a first digit of 1 tends to appear with a probability of about 30%, a leading digit of 2 tends to occur about 18% of the time, a 3 about 13%, and so on in a logarithmically decreasing pattern, with a leading digit of 9 often being observed less than 5% of the time.

Welcome to the bizarre world* of Benford’s Law [*not actually a bizarre world].

This phenomenon was first noted by the astronomer and mathematician Simon Newcomb in 1881.  The physicist Frank Benford re-stated the observation in 1938 and, in an odd twist of fate, it is after him that the law is named.  It is referred to as a “Law” but of course it won’t apply to all kinds of real-world lists of numbers.  Lottery results (they’re entirely random) or counts of fingers (since we’re talking about digits), by way of example, should be, I hope, somewhat more uniformly distributed.

I thought it would be a fun exercise to test Benford’s Law using population data available from the Australian Bureau of Statistics’ website.  I downloaded the latest Census population counts across the 129 Statistical Local Areas (SLAs) in South Australia, and tallied the frequency of the leading digits.  The results of observed versus expected frequencies are summarised in Table 1 below.  For example, there were 3 SLAs that recorded a population with a leading digit of 9 (they were 934, 9205, and 9015).  If Benford’s Law holds there should be 6 such SLAs.  Note that one SLA had a population of zero, so was not included.

Table 1: Frequency of leading digit – observed vs. expected under Benford’s Law

Leading digit    No. times observed    No. times expected*
1                47 (36.7%)            39 (30.1%)
2                36 (28.1%)            23 (17.6%)
3                11 (8.6%)             16 (12.5%)
4                10 (7.8%)             12 (9.7%)
5                1 (0.8%)              10 (7.9%)
6                6 (4.7%)              9 (6.7%)
7                7 (5.5%)              7 (5.8%)
8                7 (5.5%)              7 (5.1%)
9                3 (2.3%)              6 (4.6%)
Total SLAs       128 (100%)            128 (100%)

*Expected numbers are rounded to nearest integer based on exact percentages predicted under Benford’s Law.
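
The tallying step behind Table 1 can be sketched as follows.  The helper `leading_digit` is hypothetical (it is not a built-in R function), and the input is just the three SLA populations quoted in the text plus a zero, not the full Census file.

```r
# Hypothetical helper: first significant digit of each positive number
leading_digit <- function(x) {
  x <- x[!is.na(x) & x > 0]   # drop missing values and zero populations
  # In scientific notation the first character is the first significant digit
  as.integer(substr(sprintf("%.10e", x), 1, 1))
}

# The three SLA populations quoted above, plus one zero that gets dropped
populations  <- c(934, 9205, 9015, 0)
first_digits <- leading_digit(populations)

# Tally how often each digit 1..9 occurs
digit_counts <- tabulate(first_digits, nbins = 9)
names(digit_counts) <- 1:9
digit_counts
```

Applied to the full list of SLA populations, the same tally would give the “observed” column of the table.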

First, the similarities.  The observed leading digits of the population counts do decrease in a logarithmic pattern, as Benford’s Law predicts.  However, on closer inspection, discrepancies start to emerge.  The leading digits of 1 and 2, for example, are over-represented in the dataset (47 observed vs. 39 expected and 36 vs. 23, respectively).  Conversely, the leading digit of 5 is severely under-represented, with just a single occurrence compared to ten expected.

The actual versus expected distributions are plotted below in Figure 1.

Figure 1: Frequency of leading digit – observed vs. expected under Benford’s Law

So do the population data conform to Benford’s Law?  I decided to dig deeper with an appropriate statistical test.  Under the null hypothesis, the observed leading digits are drawn from the Benford (logarithmic) distribution.  Under the alternative hypothesis, they are drawn from some other distribution.

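Firing up the trusty statistical analysis package “R”, I entered the two columns of counts from the table above and ran a Chi-squared Goodness of Fit test:

# Observed leading-digit counts (digits 1 to 9) from Table 1
census <- c(47, 36, 11, 10, 1, 6, 7, 7, 3)
# Expected counts under Benford's Law, rounded as in Table 1
benford <- c(39, 23, 16, 12, 10, 9, 7, 7, 6)

# rescale.p = TRUE converts the expected counts to proportions summing to 1
chisq.test(census, p = benford, rescale.p = TRUE)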

yielding the following output:

Chi-squared test for given probabilities
data:  census
X-squared = 21.6447, df = 8, p-value = 0.005618

With p = 0.005618, the result is statistically significant at the 1% level, so we can reject the null hypothesis in favour of the alternative and conclude that the observed frequencies do not follow a Benford distribution.
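
The test above uses the rounded expected counts as its reference distribution.  As a cross-check, here is a sketch of the same test using the exact Benford probabilities, log10(1 + 1/d), with the statistic also computed by hand; the names `expected`, `statistic`, `p_value` and `exact_test` are illustrative.

```r
# Observed leading-digit counts (digits 1 to 9) from Table 1
census  <- c(47, 36, 11, 10, 1, 6, 7, 7, 3)

# Exact Benford probabilities rather than rounded expected counts
benford  <- log10(1 + 1 / (1:9))
expected <- sum(census) * benford

# Pearson statistic by hand: sum of (observed - expected)^2 / expected
statistic <- sum((census - expected)^2 / expected)
p_value   <- pchisq(statistic, df = length(census) - 1, lower.tail = FALSE)

# The built-in test should agree with the hand calculation
exact_test <- chisq.test(census, p = benford)
c(statistic = statistic, p_value = p_value)
```

The statistic differs slightly from the value reported above because the expected counts are no longer rounded, but the p-value should stay below 0.01, so the conclusion is unchanged.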

Benford’s Law analysis is often used for fraud detection (for example, insurance claims, forensic accounting, and even the recent Iranian Presidential election).  That’s not to suggest that there’s anything fraudulent going on here with the Australian Bureau of Statistics’ Census data.  I’d be interested to know why there’s such an undercount in SLA populations beginning with the digit “5” for example, but I’m totally confident that there’s a legitimate, rational explanation for it.  Two likely explanations are that just 128 data points are not enough to draw any kind of sensible conclusion; or perhaps something to do with the way the Australian Bureau of Statistics defines what a Statistical Local Area actually is.  Or maybe I’ve just made a hash of the whole thing.

Still, an interesting analysis.
