Benford’s Law in China

About a week ago, Dan Silver, a management consultant in Taipei who has been studying China for 25 years, emailed me to say that he enjoyed my piece on the application of Benford’s Law to Australian population data.  Dan asked if I’d be willing to run the same tests on population data from China if he supplied the numbers.  It sounded like an interesting exercise, so I agreed.

What I found intrigued me.  The Chinese population data did not conform to Benford’s Law.  I stress that this departure from what was expected should not be taken as any kind of proof of fraudulent data entry (Benford’s Law often applies, but not always).  However, it was a surprising result.

For readers unfamiliar with Benford’s Law, here is the idea.  Tally the number of times each first digit, 1 through 9, occurs in a “real world” data source such as lengths of rivers or heights of buildings (the unit of measurement doesn’t matter).  Intuitively, you would expect the digits to be uniformly distributed, with each one observed 1/9th (about 11%) of the time.  Under Benford’s Law, however, a leading digit of 1 tends to appear about 30% of the time, a 2 about 18%, a 3 about 12.5%, and so on in a logarithmically decreasing pattern, with a leading digit of 9 typically observed less than 5% of the time.
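The percentages quoted above come from a simple formula: the expected share of leading digit d is log10(1 + 1/d).  Below is a minimal Python sketch of that formula and of the tallying step.  The function names and the sample values are my own illustrative assumptions; they are not taken from Dan’s spreadsheet.

```python
import math
from collections import Counter
from decimal import Decimal

def benford_probability(digit):
    """Expected share of values whose first significant digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)

def leading_digit(value):
    """First non-zero digit of a number, ignoring sign and scale."""
    # Decimal stores only the significant digits, so 0.042 becomes (4, 2).
    for d in Decimal(str(value)).as_tuple().digits:
        if d != 0:
            return d
    raise ValueError("zero has no leading digit")

def tally_leading_digits(values):
    """Count how often each digit 1-9 leads in `values` (zeros are skipped)."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    return {d: counts.get(d, 0) for d in range(1, 10)}

# Hypothetical sample values, for illustration only.
sample = [1234, 18.5, 0.042, 207, 9650, 1.1, 3300]
print(tally_leading_digits(sample))
for d in range(1, 10):
    print(d, f"{benford_probability(d):.1%}")
```

Going through `Decimal` rather than floating-point logarithms is a design choice to avoid rounding surprises for values sitting right on a power of ten.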

Dan provided land area (in square kilometres), along with two series of population data: “actual” (常住人口); and official/registered (戶籍人口) for each of the more than 350 administrative areas in China.  These data can be viewed here.  Dan notes that at the bottom of the spreadsheet are four cities which the Chinese government counts as provinces.  These are included to bring the population to the official total for China.  If you scan through the columns of figures, you’ll see that some administrative areas don’t have a land area recorded next to them, or an “actual” population.  However, they all have a registered population.  Because of these gaps, the three tables below are based on different numbers of areas, so their totals won’t match.  The data are as at the end of 2007.

So on to the analysis.  Let’s start by tallying the occurrences of first digits in the column of figures holding the land area data.  Table 1 summarises the observed frequencies (count and percentage) alongside those expected under Benford’s Law.

Table 1: Chinese administrative land areas (sq kms)

Leading digit   No. times observed   No. times expected
1               133 (39.8%)          101 (30.1%)
2               57 (17.1%)           59 (17.6%)
3               28 (8.4%)            42 (12.5%)
4               22 (6.6%)            32 (9.7%)
5               21 (6.3%)            26 (7.9%)
6               12 (3.6%)            22 (6.7%)
7               22 (6.6%)            19 (5.8%)
8               20 (6.0%)            17 (5.1%)
9               19 (5.7%)            15 (4.6%)
Total           334 (100%)           334 (100%)
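As a worked check on the “expected” column, here is a short Python sketch that multiplies each Benford share by the 334 areas with a recorded land area.  The variable names are my own, and rounding to whole counts for display is an assumption about how the column was produced.

```python
import math

N_LAND_AREAS = 334  # administrative areas with a recorded land area (Table 1)

# Benford share of each leading digit d: log10(1 + 1/d)
shares = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Expected count = share x number of observations, rounded for display
expected = {d: round(N_LAND_AREAS * share) for d, share in shares.items()}

for d in range(1, 10):
    print(f"{d}: {expected[d]:>3} ({shares[d]:.1%})")
print("sum of rounded counts:", sum(expected.values()))
```

Because each count is rounded separately, the rounded expected counts sum to 333 rather than 334.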

I think the data are easier to digest when presented graphically.

Figure 1: Chinese administrative land areas (sq kms)

The leading digits of Chinese land area data by administrative area do decrease roughly logarithmically, although they don’t technically follow a Benford distribution when subjected to a chi-square goodness-of-fit test (χ² = 26.05 on 8 d.f.; p = 0.00103).  The data are overweight in figures starting with a “1” and underweight in figures leading with digits “3” to “6”.  But as Benford’s Law predicts, “1” is the most common leading digit, followed by “2”, with the remaining leading digits bringing up the rear.
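For anyone who wants to retrace the goodness-of-fit test, here is a minimal Python sketch using only the standard library.  It assumes the test compares the observed counts in Table 1 with unrounded Benford expectations, and it uses the closed-form tail probability that exists for a chi-square distribution with an even number of degrees of freedom.  The function names are my own.

```python
import math

observed = [133, 57, 28, 22, 21, 12, 22, 20, 19]  # Table 1, digits 1-9

def benford_chi_square(observed_counts):
    """Pearson chi-square statistic against unrounded Benford expectations."""
    n = sum(observed_counts)
    stat = 0.0
    for digit, obs in enumerate(observed_counts, start=1):
        exp = n * math.log10(1 + 1 / digit)
        stat += (obs - exp) ** 2 / exp
    return stat

def chi_square_p_value_even_df(stat, df):
    """Upper-tail probability of a chi-square variable; valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = stat / 2
    # P(X > x) = exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
    return math.exp(-half) * sum(
        half ** i / math.factorial(i) for i in range(df // 2)
    )

stat = benford_chi_square(observed)
p_value = chi_square_p_value_even_df(stat, df=8)  # 9 digit classes - 1
print(f"chi-square = {stat:.2f}, p = {p_value:.5f}")
```

Under those assumptions the output should come out close to the statistic and p-value quoted above.  A statistics package would handle odd degrees of freedom as well; the closed form is used here only to keep the sketch dependency-free.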

Similarly, let’s look at the leading digits of “actual” populations by administrative areas.

Table 2: Chinese administrative area “actual” populations

Leading digit   No. times observed   No. times expected
1               39 (14.8%)           79 (30.1%)
2               53 (20.2%)           46 (17.6%)
3               44 (16.7%)           33 (12.5%)
4               39 (14.8%)           25 (9.7%)
5               28 (10.6%)           21 (7.9%)
6               25 (9.5%)            18 (6.7%)
7               15 (5.7%)            15 (5.8%)
8               13 (4.9%)            13 (5.1%)
9               7 (2.7%)             12 (4.6%)
Total           263 (100%)           263 (100%)

When plotted, the difference between frequencies observed and what was expected under Benford’s Law is, I think, striking.

Figure 2: Chinese administrative area “actual” populations

Not even close to a Benford distribution.  The leading digit of “1” actually occurs less often than a leading digit of “2” or “3”, and no more often than “4”, although from “2” onwards the frequencies do decline steadily.

Dan asked that I focus on the registered/official population figures (the highlighted column of raw data in the spreadsheet linked above).  Leading digits from the registered, official population figures by administrative area are presented in Table 3 and Figure 3 below.

Table 3: Chinese administrative area “registered” populations

Leading digit   No. times observed   No. times expected
1               71 (19.6%)           109 (30.1%)
2               67 (18.5%)           64 (17.6%)
3               64 (17.6%)           45 (12.5%)
4               46 (12.7%)           35 (9.7%)
5               40 (11.0%)           29 (7.9%)
6               25 (6.9%)            24 (6.7%)
7               24 (6.6%)            21 (5.8%)
8               14 (3.9%)            19 (5.1%)
9               12 (3.3%)            17 (4.6%)
Total           363 (100%)           363 (100%)

Figure 3: Chinese administrative area “registered” populations

Things look a bit better than for the “actual” population data, but registered populations certainly don’t follow Benford’s Law either.  The leading digit of “1” is severely underweight compared to its predicted frequency and, in fact, digits “1” through to “3” occur with almost equal frequency in the registered population data.
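To put a number on how far each population series departs from Benford’s Law, here is a Python sketch that computes per-digit Pearson residuals, (observed - expected) / sqrt(expected), for Tables 2 and 3 and sums their squares into a chi-square statistic.  This is an illustrative extension, not a calculation reported in the analysis above; the function name is my own, and the expectations are left unrounded.

```python
import math

tables = {
    "actual (Table 2)": [39, 53, 44, 39, 28, 25, 15, 13, 7],
    "registered (Table 3)": [71, 67, 64, 46, 40, 25, 24, 14, 12],
}

def benford_residuals(observed_counts):
    """Pearson residual (observed - expected) / sqrt(expected) per digit 1-9."""
    n = sum(observed_counts)
    residuals = {}
    for digit, obs in enumerate(observed_counts, start=1):
        exp = n * math.log10(1 + 1 / digit)
        residuals[digit] = (obs - exp) / math.sqrt(exp)
    return residuals

results = {}
for name, counts in tables.items():
    residuals = benford_residuals(counts)
    # The chi-square statistic is the sum of squared Pearson residuals.
    chi_square = sum(r ** 2 for r in residuals.values())
    results[name] = (chi_square, residuals)
    print(f"{name}: chi-square = {chi_square:.2f} on 8 d.f.")
    for digit, r in residuals.items():
        print(f"  digit {digit}: residual {r:+.2f}")
```

For reference, the 5% critical value of a chi-square distribution with 8 degrees of freedom is about 15.5.  Residuals beyond roughly ±2 point to the digits that contribute most to the statistic, and in both series the most negative residual belongs to the leading digit “1”.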

In summary, Chinese land area data by government administrative area follow a roughly logarithmically decreasing pattern, although technically not a Benford distribution.  Chinese population data (“actual” and registered) by the same areas don’t follow a logarithmically decreasing pattern at all, Benford or otherwise.

That these Chinese government data don’t conform to Benford’s Law should not be taken as any kind of proof or insinuation that something untoward is going on.  It’s something that I would say warrants further investigation, but I expect there’ll be a rational explanation for it.  It’s probably more indicative of my own gaps in understanding than anything else.  Having said that, it was a very interesting exercise and I’d like to thank Dan Silver for the opportunity to write about it.