## Benford’s Law in China

0. INTRODUCTION
About a week ago, Dan Silver, a management consultant in Taipei who has been studying China for 25 years, emailed me to say that he enjoyed my piece on the application of Benford’s Law to Australian population data.  Dan asked if I’d be willing to run the same tests on population data from China if he supplied the numbers.  It sounded like an interesting exercise, so I agreed.

What I found intrigued me.  The Chinese population data did not conform to Benford’s Law.  I stress that this departure from what was expected should not be taken as any kind of proof of fraudulent data entry (Benford’s Law often applies, but not always).  However, it was a surprising result.

1. BENFORD’S LAW
Tally the number of times each first digit, 1 through 9, occurs in any “real world” data source such as lengths of rivers or heights of buildings (the unit of measurement doesn’t matter).  Intuitively you would expect the digits to be uniformly distributed, with each one being observed one-ninth (about 11%) of the time.  Under Benford’s Law, however, a first digit of 1 tends to appear with a frequency of about 30%, a leading digit of 2 tends to occur about 18% of the time, a 3 about 13%, and so on in a logarithmically decreasing pattern, with a leading digit of 9 often being observed less than 5% of the time.
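
The percentages quoted above are consistent with the usual statement of Benford’s Law, in which a leading digit d is expected with probability log10(1 + 1/d).  The short Python sketch below is only an illustration (the name `benford_proportions` is made up for this example) and prints the expected percentage for each digit.

```python
import math

def benford_proportions():
    """Expected share of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

proportions = benford_proportions()
for digit, p in proportions.items():
    # One line per digit, as a percentage to one decimal place
    print(f"{digit}: {100 * p:.1f}%")
```

The nine proportions sum to 1, and multiplying each by the number of records gives the “expected” counts used in the tables further down.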

2. DATA
Dan provided land area (in square kilometres), along with two series of population data: “actual” (常住人口); and official/registered (戶籍人口) for each of the more than 350 administrative areas in China.  These data can be viewed here.  Dan notes that at the bottom of the spreadsheet are four cities which the Chinese government counts as provinces.  These are included to bring the population to the official total for China.  If you scan through the columns of figures, you’ll see that some administrative areas don’t have a land area recorded next to them, or an “actual” population.  However, they all have a registered population.  These gaps mean that the totals presented in the tables below won’t match.  The data are as at the end of 2007.

3.1 BENFORD ANALYSIS – CHINESE LAND AREA
So on to the analysis.  Let’s start by tallying the occurrences of first digits in the column of figures holding the land area data.  Table 1 summarises the observed frequencies (number and percentage) alongside those expected under Benford’s Law.

Table 1: Chinese administrative land areas (sq kms)

| Leading digit | No. times leading digit was observed | No. times leading digit was expected |
|---|---|---|
| 1 | 133 (39.8%) | 101 (30.1%) |
| 2 | 57 (17.1%) | 59 (17.6%) |
| 3 | 28 (8.4%) | 42 (12.5%) |
| 4 | 22 (6.6%) | 32 (9.7%) |
| 5 | 21 (6.3%) | 26 (7.9%) |
| 6 | 12 (3.6%) | 22 (6.7%) |
| 7 | 22 (6.6%) | 19 (5.8%) |
| 8 | 20 (6.0%) | 17 (5.1%) |
| 9 | 19 (5.7%) | 15 (4.6%) |
| TOTAL | 334 (100%) | 334 (100%) |
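
For anyone wanting to repeat the tally on their own column of figures, here is a minimal Python sketch of the counting step.  It is not the code used to build Table 1; the function names and the `sample_areas` values are invented for illustration, and missing entries (like the gaps in the spreadsheet) are assumed to be represented as `None`.

```python
from collections import Counter

def leading_digit(value):
    """Return the first non-zero digit of a number, or None if it is zero."""
    for ch in str(abs(value)):
        if ch in "123456789":
            return int(ch)
    return None

def tally_leading_digits(values):
    """Count how often each digit 1-9 leads, skipping missing and zero values."""
    counts = Counter(leading_digit(v) for v in values if v is not None)
    counts.pop(None, None)  # drop zeros, which have no leading digit
    return {d: counts.get(d, 0) for d in range(1, 10)}

# Hypothetical land areas in sq km, NOT taken from the spreadsheet
sample_areas = [16410.5, 11760, 6340, None, 187700, 1183000, 0, 945.2]
tally = tally_leading_digits(sample_areas)
print(tally)
```

Working from the string form of each number means the unit of measurement and the position of the decimal point don’t matter, only the first significant digit.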

I think the data are easier to digest when presented graphically.

Figure 1: Chinese administrative land areas (sq kms)

The leading digits of Chinese land area data by administrative area do decrease roughly logarithmically, although they don’t technically follow a Benford distribution when subjected to a Chi-square goodness of fit test (χ²=26.05 on 8 d.f.; p=0.00103).  The data are overweight in figures starting with a “1” and underweight in those leading with digits “3” to “6”.  But as Benford’s Law predicts, “1” is the most common leading digit, followed by “2”, with the remaining leading digits bringing up the rear.
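
The goodness of fit calculation can be sketched in a few lines of Python using the observed counts from Table 1.  This is an illustrative reconstruction rather than the original analysis: it uses unrounded expected counts, and it relies on the closed form for the chi-square upper tail that holds when the degrees of freedom are even (here 8).  Variable names are my own.

```python
import math

# Observed leading-digit counts for land area, digits 1-9 (Table 1)
observed = [133, 57, 28, 22, 21, 12, 22, 20, 19]
n = sum(observed)

# Unrounded expected counts under Benford's Law
expected = [n * math.log10(1 + 1 / d) for d in range(1, 10)]

# Pearson chi-square statistic: sum of (O - E)^2 / E over the nine digits
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1  # 8 degrees of freedom

# Upper-tail p-value; for even df = 2k it equals
# exp(-x/2) * sum_{i<k} (x/2)^i / i!
half = chi_square / 2
p_value = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

print(f"chi-square = {chi_square:.2f} on {df} d.f., p = {p_value:.5f}")
```

The result should land very close to the figures quoted above.  The same few lines can be pointed at the counts in Tables 2 and 3 by swapping the `observed` list.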

3.2 BENFORD ANALYSIS – CHINESE ACTUAL POPULATION
Similarly, let’s look at the leading digits of “actual” populations by administrative areas.

Table 2: Chinese administrative area “actual” populations

| Leading digit | No. times leading digit was observed | No. times leading digit was expected |
|---|---|---|
| 1 | 39 (14.8%) | 79 (30.1%) |
| 2 | 53 (20.2%) | 46 (17.6%) |
| 3 | 44 (16.7%) | 33 (12.5%) |
| 4 | 39 (14.8%) | 25 (9.7%) |
| 5 | 28 (10.6%) | 21 (7.9%) |
| 6 | 25 (9.5%) | 18 (6.7%) |
| 7 | 15 (5.7%) | 15 (5.8%) |
| 8 | 13 (4.9%) | 13 (5.1%) |
| 9 | 7 (2.7%) | 12 (4.6%) |
| TOTAL | 263 (100%) | 263 (100%) |

When plotted, the difference between frequencies observed and what was expected under Benford’s Law is, I think, striking.

Figure 2: Chinese administrative area “actual” populations

Not even close to a Benford distribution.  The leading digit of “1” actually occurs less often than a leading digit of “2”, although after that things do decrease logarithmically.

3.3 BENFORD ANALYSIS – CHINESE REGISTERED POPULATION
Dan asked that I focus on the registered/official population figures (the highlighted column of raw data in the spreadsheet linked above).  Leading digits from the registered, official population figures by administrative area are presented in Table 3 and Figure 3 below.

Table 3: Chinese administrative area “registered” populations

| Leading digit | No. times leading digit was observed | No. times leading digit was expected |
|---|---|---|
| 1 | 71 (19.6%) | 109 (30.1%) |
| 2 | 67 (18.5%) | 64 (17.6%) |
| 3 | 64 (17.6%) | 45 (12.5%) |
| 4 | 46 (12.7%) | 35 (9.7%) |
| 5 | 40 (11.0%) | 29 (7.9%) |
| 6 | 25 (6.9%) | 24 (6.7%) |
| 7 | 24 (6.6%) | 21 (5.8%) |
| 8 | 14 (3.9%) | 19 (5.1%) |
| 9 | 12 (3.3%) | 17 (4.6%) |
| TOTAL | 363 (100%) | 363 (100%) |

Figure 3: Chinese administrative area “registered” populations

Things look a bit better than the “actual” population data, but registered populations certainly don’t follow Benford’s Law either.  The leading digit of “1” is severely underweight compared to its predicted frequency and, in fact, digits “1” through to “3” in the registered population data occur with almost equal frequency.

4. SUMMARY
Chinese land area data by Chinese government administrative area follow a logarithmically decreasing pattern, although technically not a Benford distribution.  Chinese population data (“actual” and registered) by the same areas don’t follow a logarithmically decreasing pattern at all, Benford or otherwise.

That these Chinese government data don’t conform to Benford’s Law should not be taken as any kind of proof or insinuation that something untoward is going on.  It’s something that I would say warrants further investigation, but I expect there’ll be a rational explanation for it.  It’s probably more indicative of my own gaps in understanding than anything else.  Having said that, it was a very interesting exercise and I’d like to thank Dan Silver for the opportunity to write about it.

——

### 2 Responses

1. Hi,
   1. For area, the law closely holds true.
   2. For population, it needs deeper thought. I think there is a hidden bias that your analysis does not account for. This bias could be referred to as curtailed tails of the distribution. Consider the average population size of the province and clustering effects (for example, a majority of sizes clustered just below the 1,000,000 mark will give a different picture of leading digits). A simple decile distribution of the province sizes could throw more light on this point, etc.
   3. I don’t think there is anything wrong with the data. Just a case of curtailed distribution.
   4. Interesting analysis though.

   Shashi

2. It’s probably something to do with planning of Chinese administrative divisions, as well as the definition of administrative areas.
   E.g. There may be a deliberate grouping of districts of population 1-2 million, 2-3 million, 3-4 million (there are between 54 and 60 instances of each of these groups).
   There is certainly a great deal of control and planning of the administrative divisions in China. E.g. see: