Lucky 8?

My state government’s lottery administration, SA Lotteries, makes the results from its various games available online, including tables of how frequently the various lottery numbers were drawn.

For example, you can see here how frequently the numbers 1 through 45 have been drawn in the Saturday X Lotto. At the time of writing this, Number 8 was the most frequently drawn, recorded as occurring a total of 289 times between Draws 351 and 3265. Note that the Saturday X Lotto draws are odd-numbered, so Draws 351 to 3265 actually consist of 1458 (i.e. (3265-351)/2+1) weekly games.

Of course there will be random variation in how frequently balls are drawn over time, just as there’s random variation in heads and tails in the toss of a coin. But is it particularly unusual for the Number 8 ball to be drawn 289 times in 1458 games of Saturday X Lotto?

Is South Australia’s Saturday X Lotto biased towards the number 8?

Now before we can determine if the Number 8 being drawn 289 times in 1458 games of X Lotto is an extraordinary event, it helps if we first work out how many times we expected it to happen. In X Lotto, a total of eight balls (6 balls for the main prize and then 2 supplementary balls) are selected without replacement from a spinning barrel of 45 balls. The probability of any single number of interest being selected in eight attempts without replacement from a pool of 45 can be calculated using a Hypergeometric Calculator as P(X=1)=0.17778 (i.e. just under 18%). Therefore we expect the Number 8 (or any other number for that matter) to be drawn 0.17778 x 1458 = 259 times in 1458 games.
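If you want to check this in R (which I’ll use again below), the hypergeometric probability and the expected count take two lines:

dhyper(1, m = 1, n = 44, k = 8)   # P(the Number 8 is among 8 balls drawn from 45) = 0.1777778
0.1777778 * 1458                  # expected occurrences in 1458 games: ~259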

So observing 289 occurrences when we were only expecting 259 certainly seems unusual, but is it extraordinary?

To answer this, I’ll employ the Binomial test to evaluate the null hypothesis, H0:

H0: Observed frequency of the Number 8 being drawn in SA Lotteries’ Saturday X Lotto is within the expected limits attributable to chance (i.e. the lotto is a fair draw)

vs. the alternative hypothesis, H1:

H1: Observed frequency is higher than would be expected from chance alone (i.e. the lotto is not a fair draw)

The statistics package, R, can be used to run the Binomial test:

> binom.test(289, 1458, 0.17778, alternative = "greater")

        Exact binomial test

data:  289 and 1458
number of successes = 289, number of trials = 1458, p-value = 0.02353
alternative hypothesis: true probability of success is greater than 0.17778

So we can reject the null hypothesis of a fair draw at the alpha=0.05 level of significance.  The p-value is small enough to conclude that the true probability of Number 8 being drawn is higher than expected based on chance alone.

However, please note that I am definitely not suggesting that anything untoward is going on at SA Lotteries, or that you’ll improve the odds of winning the lottery by including the number 8 in your selection. For a start, rejection of the null hypothesis of a fair system occurs at the standard, but fairly conservative, alpha=0.05 level. What if I had decided to use alpha=0.01 instead? The null hypothesis of a fair system would be retained. Things are all rather arbitrary in the world of statistics.

Still, a curious result that utilised several statistical concepts that I thought would be interesting to blog about.

——

Which Australian State Is the Most Charitable?

The Australian Taxation Office (ATO) recently released Taxation Statistics 2009-10, a broad collection of data compiled from income tax returns for that financial year. If taxation statistics are the kind of thing that gets your juices flowing then check it out – the report and associated tables contain a veritable wealth of information.

Amongst other things, it shows deductions claimed by taxpayers for gifts or donations to charities — including welfare agencies, hospitals, research institutes, environmental groups, and arts organisations. From these data I thought it might be interesting to see which of the Australian States and Territories is the most relatively charitable — using tax deductions claimed as a proxy for actual donations made.

I used the Selected items, by sex and State/Territory of residence, 2009-10 income year data to compare taxpayers’ deductions claimed for gifts or donations to charities, relative to their total incomes. The average income per taxable individual in Australia was $66,502 per annum in 2009-10, of which an average $216 (proportionally, 0.32% of income) was claimed for charity. The States and Territories are summarised in Table 1 below.

Table 1: Charity as a proportion of income, States & Territories, 2009-10

The data in the table are graphed below. The red horizontal line is the Australian average against which we can compare the individual States and Territories.

Figure 1: Charity as a proportion of income, States & Territories, 2009-10

New South Welshpersons and Australian Capital Territorians really pull their weight when it comes to donating to charity. They have some of the highest average individual incomes in the country ($69,431 p.a. and $72,007 p.a., respectively), but then really come to the party with the highest proportion of that income going to charitable organisations (0.40% or $277, and 0.39% or $279, respectively).

Western Australians on the other hand really need to lift their game. Despite an average annual individual income of $71,690 (second highest in the country), they were only half as charitable as their eastern cousins mentioned above — only 0.22% ($161) of their income went to charity in the 2009-10 financial year. The Northern Territory recorded the lowest proportion of income to charity with 0.19% of $64,527 ($124) claimed as gifts and donations.
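The proportions quoted above are just deductions claimed over average income; a quick check in R using the ATO figures:

income  <- c(NSW = 69431, ACT = 72007, WA = 71690, NT = 64527, AUS = 66502)
charity <- c(NSW =   277, ACT =   279, WA =   161, NT =   124, AUS =   216)
round(100 * charity / income, 2)   # charity as a percentage of income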

Or perhaps they actually gave a lot to charity, but then didn’t claim it as a deduction come tax-time. Statistics can be a bit dodgy like that.

Interesting.

——

Queuing Theory and iiNet, Part II

It’s an interesting read, but the author makes a lot of basic errors. Unfortunately, customers refuse to line up and call at regular intervals and spend the average amount of time on the call. The reality is obviously more bursty than that and needs non-linear modelling.

- Michael Malone, CEO of iiNet

The feedback from Michael Malone above was in response to my previous blog post on Applying Queuing Theory to iiNet Call Centre Data. I don’t accept that I made “a lot of basic errors”, but I did make a lot of assumptions. Or perhaps the statistician George E. P. Box said it better, “Essentially, all models are wrong, but some are useful.”

But Michael is correct – customers don’t line up and call at regular intervals, and the reality is more “bursty” (i.e. Poisson). My model is inadequate because it doesn’t take into account all the natural variation in the system.

One way of incorporating this random variation into the model is to apply Monte Carlo methods.

Take the iiNet Support Phone Call Waiting Statistics for 6 February 2012, specifically for the hour 11am to noon. I chose this time block because the values are relatively easy to read off the graph’s scale – (a bit over) 664 calls and an average time in the queue of 24 minutes.

Now if we assume Average Handling Time (AHT), including time on the call itself followed by off-phone wrap-up  time, was 12 minutes, then my model says there were 664*(12/60) / (24/60 +1) = 95 iiNet Customer Service Officers (CSOs) actually taking support calls between 11am and noon on 6 February 2012. That’s an estimate of average number of CSOs actually on the phones and taking calls during that hour, excluding those on a break, performing other tasks, and so on. Just those handling calls.

But there will be a lot of variation in conditions amongst those 664 calls. I constructed a little Monte Carlo simulation and ran 20,000 iterations of the model with random variation in call arrival rates, AHT, and queue wait times.

Assumptions:

  • Little’s Law applies
  • 664 calls were received that hour (at a steady pace)
  • Average time in the queue of 24 minutes
  • AHT (time on the actual call itself plus off-call wrap-up) of 12 minutes

then the result of the 20,000 Monte Carlo runs is a new estimate of 135 iiNet CSOs taking support calls between 11am and noon on 6 February 2012.
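For the curious, below is a minimal sketch of this kind of simulation in R. The distributional choices (Poisson call volumes, exponential handling and wait times) are my own assumptions – the post doesn’t record which were actually used – so treat it as an illustration of the approach rather than a recipe for reproducing the 135 figure exactly:

set.seed(42)
n_runs <- 20000

calls <- rpois(n_runs, 664)            # calls arriving in the hour
aht   <- rexp(n_runs, rate = 1 / 12)   # average handling time (minutes), mean 12
wait  <- rexp(n_runs, rate = 1 / 24)   # average queue wait (minutes), mean 24

# Little's Law rearranged as before: N = lambda * T / (W + 1), times in hours
csos <- calls * (aht / 60) / (wait / 60 + 1)
mean(csos)   # average number of CSOs on the phones across the runs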

I ran a few more simulations, plugging in different values for number of CSOs handling calls (all else remaining equal – i.e. 664 calls an hour; AHT=12 minutes) to see what it did for average time in the queue. The results are summarised in the table below:

Modelling suggests that if iiNet wanted to bring the average time in the phone call support queue down to a sub-5 minute level during that particular hour of interest, an additional 85% in active phone support resourcing would need to be applied.

The table of results is graphically presented below (y-axis is time in queue, x-axis is CSOs)

Looks nice and non-linear to me :) You can see a law-of-diminishing-returns effect start to take hold at around the point of the graph corresponding to 160 CSOs / a 16.5-minute average queue wait time.

——

Applying Queuing Theory to iiNet Call Centre Data

In previous posts I’ve talked about queuing theory, and the application of Little’s Law in particular, to Internet Service Provider (ISP) customer support call centre wait times. We can define Little’s Law, as it applies to a call centre, as:

The long-term average number of support staff (N) in a stable system is equal to the total number of customers in the queue (Q) multiplied by the average time support staff spend resolving a customer’s technical problem (T), divided by the total time waited in the queue (W); or expressed algebraically: N=QT/W.

Thinking things through a bit more, the total number of customers in the queue (Q) at a point in time in a stable system should be equal to the rate at which people joined the queue (λ), minus the rate at which the support desk dealt with technical problems (i.e. N/T) over the period of observation. Obviously Q>=0.

So N=QT/W and Q=λ-N/T which, after substituting and rearranging (NW = λT - N, hence N(W+1) = λT), all comes out in the wash as:

N=λT/(W+1)

I thought it might be a bit of fun to see if this could be applied to the customer support call centre waiting statistics published by one of Australia’s largest ISPs, iiNet.

iiNet make some support centre data available via their customer toolbox page. Below is a screenshot of call activity and wait times graphed each hour by iiNet on 10 January 2012. The green line (in conjunction with the scale on the left-hand side of the graph) represents the average time (in minutes) it took to speak to a customer service representative (CSR), including call-backs. The grey bars (in conjunction with the right-hand scale) represent the total number of incoming phone calls to iiNet’s support desk.

It may be possible to use the formula derived above to estimate how many CSRs iiNet had on the support desk handling calls that day. For example, during the observed peak period of 8am to 1pm on Tuesday, 10 January 2012, the iiNet support desk was getting around 732 calls per hour on average. The expected wait time in the queue over the same period was around 11 minutes.

If we assume that the average time taken for a CSR to resolve a technical problem is, let’s say, 12.5 minutes, then we can estimate that the number of CSRs answering calls in a typical peak-hour between 8am to 1pm on 10 January 2012 as:

732*(12.5/60) / (11/60 + 1)

= 129 CSRs actively handling calls.

Sounds sort of reasonable for a customer service-focussed ISP the size of iiNet. But if iiNet wanted to bring the average time in the queue down even more – to a more reasonable 3 minutes, for example – they’d need 145 CSRs (all else remaining equal) answering calls during a typical peak hour.
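Wrapped up as a little R helper, the same formula gives both of those estimates:

# N = lambda * T / (W + 1), with handling time T and queue wait W in hours
csrs_handling <- function(calls_per_hour, aht_min, wait_min) {
  calls_per_hour * (aht_min / 60) / (wait_min / 60 + 1)
}

csrs_handling(732, 12.5, 11)   # ~129 CSRs at the observed 11-minute wait
csrs_handling(732, 12.5, 3)    # ~145 CSRs to hit a 3-minute wait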

————

Has the Australian Stock Market…

… ever seen anything quite like this?

I updated my “Stock Market Seismometer” (click on the separate tab above for details) for the first time in many months. I have to say, the results shocked me. Over the course of 2011 the Australian stock market slid down even further into “oversold” territory. As we head into 2012 things have never looked so bleak. Or maybe 2012 will be the year of the rebound? I expect at some point growth will move back to its long term trend, but when that will start to happen is anyone’s guess.

——

How poker machines vacuum up your money

Poker machines are unique in the gambling world. They are the only form of gambling that has been designed and crafted for the purpose of making money, and where there is absolutely no chance of influencing the outcome.

- Tom Cummings, Poker machine maths, 27 May 2011

The article, linked above, is an excellent insight into the mathematics of poker machine gaming and how, despite a high guaranteed return-to-player percentage, a punter who plays long enough ends up with nothing. The vital point that Tom Cummings makes is this:

It’s common knowledge that poker machines have a return-to-player percentage of anywhere between 85 per cent and 90 per cent, depending on where you live and what kind of establishment you’re playing in. For the sake of the story, let’s assume that Gladys’ poker machine was set to 90 per cent. That means that over a long period of time, the machine will return 90 per cent of money gambled to players, and keep 10 per cent as profit.

But wait, I hear you cry. Gladys didn’t lose 10 per cent; she lost it all! Well, no… not according to poker machine mathematics. The rule is that the poker machine has to return 90 per cent of money gambled… not money inserted. And there’s a huge difference.

The way it works, mathematically, is, I think, very interesting. Take the example used in the article of a 2-cent gaming machine, at $1 per play, with guaranteed return of 90%. When you load that dollar in the slot, sometimes you win, and sometimes you lose, but the long-term average dictates that at each “turn” the dollar is losing 10% of its value. $1 turns into 90c, turns into 81c, turns into 72.9c, … , turns into 2c at which point the game ends.

More generally, at “turn” n, an amount A1 initially fed into a slot machine with return R is worth:

A_{n}=A_{1}R^{n-1}

or solving for n:

n=\frac{\displaystyle ln(A_{n}) - ln(A_{1})}{\displaystyle ln(R)} + 1

What does this mean? It means that it takes, on average

n = [ (ln(0.02) - ln(1.00)) / ln(0.90) ] + 1 ≈ 38

iterations to turn $1.00 into 2 cents on a 90% return poker machine. More importantly, the total amount gambled isn’t $1.00, it’s actually (because you’re re-investing your winnings and following your losses until the game ends): $1.00 + 90c + 81c + 72.9c + … + 2c. Or more generally,

A_{1}\sum_{i=1}^{n} R^{i-1}

which you’ll remember is a geometric series summing to:

A_{1} \frac{\displaystyle 1-R^{n}}{\displaystyle 1-R}

So a single dollar coin will generate, on average (on a 2-cent, 90% return machine):

[ 1-0.90^{38} ] / [ 1-0.90 ] ≈ $9.80 in total bets.

At $1.00 per game, at 10 games per minute, that’s about 1 minute of play for every dollar put in, or 5 hours to burn through Gladys’ $300.
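Here’s a minimal R sketch of that expected-value decay, following a single dollar down to the 2-cent floor; it reproduces the ~38 turns and roughly $9.80 of turnover:

R <- 0.90        # return-to-player
a <- 1.00        # current value of the dollar fed in
turnover <- 0    # running total of all bets placed
n <- 0           # number of turns
while (a >= 0.02) {
  turnover <- turnover + a
  a <- a * R
  n <- n + 1
}
n          # 38 turns
turnover   # ~$9.80 in total bets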

Those who defend poker machines often point to the high rate of return as one of the reasons that pokies are just “good, clean fun” for most people. The reality is that every poker machine can meet this “rate of return” requirement while still leaving the gambler broke. That’s poker machine mathematics.

——

Further reading:

ABC Hungry Beast, The Beast File: Pokies

——

The NBN, CVC and burst capacity

Late last year, NBN Co (the body responsible for rolling out Australia’s National Broadband Network) released more detail on its wholesale products and pricing. You can download their Product and Pricing Overview here. The pricing component that I wanted to analyse in this post is NBN Co’s additional charge for “Connectivity Virtual Circuit” (CVC) capacity.

CVC is bandwidth that ISPs will need to purchase from NBN Co, charged at the rate of $20 (ex-GST) per Mbps per month. Note that this CVC is on top of the backhaul and international transit required to pipe all those interwebs into your home. And just like backhaul and international transit, if an ISP doesn’t buy enough CVC from NBN Co to cover peak utilisation, its customers will experience a congested service.

The problem with the CVC tax, priced as it is by NBN Co, is that it punishes small players. By my calculations, an ISP of (say) 1000 subscribers will need to spend proportionally a lot more on CVC than an ISP of 1,000,000 subscribers if they want to provide a service that delivers the speeds it promises.

Here come the statistics.

Consider NBN Co’s theoretical 12 megabit service with 10GB of monthly quota example that they use in the document I linked to above. 10GB per month, at 12Mbps gives you 6,827 seconds (a bit under 2 hours) at full speed before you’re throttled off. There’s 2,592,000 seconds in a 30-day month, so if I choose a random moment in time there is a 6827/2592000 = 0.263% chance that I’ll find you downloading at full speed.
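The arithmetic, in R (note it takes the 1024 MB-to-the-GB convention to land on 6,827 seconds):

(10 * 8 * 1024) / 12    # seconds of full-speed downloading in a 10GB quota: ~6827
30 * 24 * 60 * 60       # seconds in a 30-day month: 2,592,000
6827 / 2592000          # chance of catching a subscriber at full speed: ~0.263%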

That’s on average. The probability would naturally be higher during peak times. But let’s assume in this example that our theoretical ISP has a perfectly balanced network profile (no peak or off-peak periods). It doesn’t affect the point I’ll labour to make.

A mid-sized ISP with (let’s say) 100,000 subscribers can expect, on average, to have 100,000*0.263% = only 263 of those customers downloading at full speed simultaneously at any particular second. However, the Binomial distribution tells us that there’s a small but non-negligible probability (about 5%, the conventional alpha=0.05 threshold) that 290 or more customers could be downloading at the same time.

So a quality ISP of 100,000 subscribers will plan to buy enough CVC bandwidth to service 263 customers at any one time. But a statistician would advise the ISP to buy enough CVC bandwidth to service 290 subscribers, an additional (290-263)/263 ≈ 10%, or find itself with a congested service about one day in every 20.

This additional “burst headroom”, as a percentage, increases as the size of the ISP decreases. From above, an ISP of 10,000 subscribers can expect to have 26 customers downloading simultaneously at any random moment in time. But there’s a similar (roughly 5%) chance this could be 35 or more, requiring it to buy an additional (35-26)/26 ≈ 33% in CVC over and above what was expected, to cover peak bursts.
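Those burst figures are just upper percentiles of the Binomial distribution; a short R sketch using the same 95% threshold:

p <- 6827 / 2592000                 # chance a random subscriber is downloading
subs <- c(1000, 10000, 100000, 1000000)

expected <- subs * p                # average simultaneous downloaders
burst    <- qbinom(0.95, subs, p)   # 95th percentile of simultaneous downloaders
headroom <- 100 * (burst - expected) / expected   # extra CVC needed (%)

round(data.frame(subs, expected, burst, headroom), 1)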

The table below summarises, for ISPs of various sizes, how much additional CVC would need to be purchased over and above the expected amount, to provide an uncontended service 95%+ of the time.



Graphically it looks a bit like this…

As you can see, things only really start to settle down for ISPs larger than 100,000 subscribers. Any smaller than that and your relative cost of CVC per subscriber per month is disproportionately large.


Further reading:

Rebalancing NBNCo Pricing Model

NBN Pricing – Background & Examples, Part 1

——

Australian ISP Market Share, 2009-2010

The market research firm, Roy Morgan, has released its latest ISP satisfaction data, with an overwhelmingly positive result recorded for Internode and iiNet.

According to the latest Roy Morgan Internet Satisfaction data, Internode (93.4%) is still the top performer for customer satisfaction while iiNet (89.9%) appears to be closing the gap from 5.6% points in the 6 months to April 2010 to 3.5% points in the 6 months to May 2010.

Scrolling further down the Roy Morgan press release page, you’ll find individual ISP customer profiles available for purchase.  At the bottom of each report’s synopsis, you’ll see that a sample size has been included [for example, the Internode customer profile is based on a sample of 305 customers].  Combining these sample sizes across the ISPs’ profiles could, I think, provide a reasonable basis for estimating market share.

My results/estimates are presented in the table below.  Market share is ALL business, government and home subscribers with ANY kind of internet access including dialup, DSL, cable, fibre, satellite and wireless [fixed & mobile].  The Roy Morgan samples were taken between April 2009 and May 2010.
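The estimates themselves are nothing fancier than each ISP’s sample divided by the pooled total; for a few of the providers, in R:

samples <- c(Internode = 305, iiNet = 509, Optus = 2099, Telstra = 5710)  # from the table below
round(100 * samples / 12163, 1)    # 12,163 = total pooled sample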

Table 1: Estimated Australian ISP market share, 2009-2010

Internet Service Provider    Roy Morgan sample (no.)    Est. market share 2009-2010 (%)
3 Internet                        322                         2.6%
AAPT                              342                         2.8%
Adam                              134                         1.1%
Chariot                           134                         1.1%
Dodo                              321                         2.6%
Exetel                            206                         1.7%
iiNet                             509                         4.2%
Internode                         305                         2.5%
iPrimus                           284                         2.3%
Netspace                          166                         1.4%
Optus                           2,099                        17.3%
Primus-AOL                        153                         1.3%
TADAust                           119                         1.0%
Telstra                         5,710                        46.9%
TPG                               539                         4.4%
Unwired                           175                         1.4%
Virgin                            158                         1.3%
Vodafone                          103                         0.8%
Westnet                           384                         3.2%
TOTAL                          12,163*                      100.0%
  • AAPT, Netspace and Westnet are owned by iiNet
  • Chariot is owned by TPG
  • Adam offers residential internet access in South Australia & Northern Territory only

——

ISP Customer Service in 2009

A few weeks ago Whirlpool’s Australian Broadband Survey 2009 Report was released.  Last year I used the 2008 report to analyse the survey results specifically as they pertained to ISP customer service; so I thought it would be good idea to update my analysis, and see just how much the ISP customer service landscape has changed over the 12 month period.

My objective, as it was last year, was to take results from these three survey questions related to customer service

  1. When calling customer support, how long did you have to wait on the phone (or talk to an operator) before you spoke to the right person?
  2. How quickly have technical support issues typically taken to resolve?
  3. How would you rate their customer service?

and distil them down to a single score that can be used to rank providers.  I arbitrarily set the benchmark score across the whole industry to be 1000, with each individual ISP’s customer service ranked relative to that benchmark.  So an ISP score higher than 1000 is above the industry average.  Lower than 1000 is below average.

The methodology employed was exactly the same as last year, so no need to go into the details.  Without further ado, here are the updated results:

Stan’s Top Five ISPs for Customer Service in 2009 [2008 rank in brackets]

  1. Adam Internet [3]
  2. Westnet [1]
  3. Amnet [2]
  4. Internode [4]
  5. iiNet [6]

Congratulations go to local Adelaide-based outfit, Adam Internet.  Number 1 with a bullet in 2009.  Westnet (purchased by iiNet in 2008) has always prided itself on providing subscribers with a premium customer service experience, so it was very surprising to see them knocked off their coveted number 1 spot.  Also surprising to see aaNet slip out of the Top Five altogether, replaced by iiNet.

The overall results from the three customer service questions (equally weighted) are as follows:

Table 1: Australian ISP customer service scores
<1000: below average  |  1000: average  |  >1000: above average

ISP              1. Time in queue   2. Speed of resolution   3. Rating of service   OVERALL CUSTOMER SERVICE SCORE
Telstra Cable          513                  556                     731                       586
Telstra DSL            488                  523                     724                       562
Optus Cable            605                  886                     812                       748
Optus DSL              669                  668                     809                       710
iiNet                 1389                 1410                    1101                      1284
Internode             1636                 1803                    1176                      1488
TPG                    757                  665                     812                       739
Westnet               2249                 2601                    1220                      1820
Adam                  2796                 2544                    1139                      1842
Exetel                1125                  954                     919                       991
Netspace               559                  958                     973                       777
aaNet                  903                  863                     904                       889
iPrimus                950                 1273                    1025                      1066
Amnet                 2792                 2242                    1128                      1774
Telstra NextG          389                  365                     645                       437
AAPT                   879                  604                     846                       755
Other ISPs             844                  725                     938                       826
TOTAL                 1000                 1000                    1000                      1000

It’s important to keep in mind that Whirlpool’s Australian Broadband Survey isn’t scientific.  Although it gets tens of thousands of responses, it only reflects the opinions of those who are aware of the Whirlpool site and motivated to express an opinion.  It is a self-select survey and, as such, the respondents’ attitudes may not be statistically representative of the ISP’s customer base.   In other words, take with a grain of salt.

Ranked from highest to lowest the results are as follows:
ISP (2009 score) (2008 score):

  • Adam Internet (1842) (1727)
  • Westnet (1820) (2132)
  • Amnet (1774) (1735)
  • Internode (1488) (1348)
  • iiNet (1284) (1081)
    ——
  • iPrimus (1066) (903)
  • Exetel (991) (992)
    ——
  • aaNet (889) (1204)
  • Netspace (777) (912)
  • AAPT (755) (808)
  • Optus Cable (748) (736)
  • TPG (739) (963)
  • Optus DSL (710) (672)
  • Telstra Cable (586) (711)
  • Telstra DSL (562) (676)
  • Telstra NextG (437) (562)
    ——
  • Other (826) (959)
    ——
  • AVERAGE (1000)

——

Benford’s Law in China

0. INTRODUCTION
About a week ago, Dan Silver, a management consultant in Taipei who has been studying China for 25 years, emailed me to say that he enjoyed my piece on the application of Benford’s Law to Australian population data.  Dan asked if I’d be willing to run the same tests on population data from China if he supplied the numbers.  It sounded like an interesting exercise, so I agreed.

What I found intrigued me.  The Chinese population data did not conform to Benford’s Law.  I stress that this departure from what was expected should not be taken as any kind of proof of fraudulent data entry (Benford’s Law often applies, but not always).  However, it was a surprising result.

1. BENFORD’S LAW
Tally the number of times the first digit 1 through to 9 occurs in any “real world” data source such as lengths of rivers or heights of buildings (unit of measurement doesn’t matter).  Intuitively you would expect the digits should be uniformly distributed with each one being observed 1/9th (about 11%) of the time.  Under Benford’s Law, however, a first digit of 1 tends to appear with a frequency of about 30%, a leading digit of 2 tends to occur about 18% of the time, a 3 about 13%, and so on in a logarithmically decreasing pattern, with a leading digit of 9 often being observed less than 5% of the time.
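Those expected frequencies come straight from Benford’s formula, P(d) = log10(1 + 1/d); in R:

round(100 * log10(1 + 1 / (1:9)), 1)   # expected % for leading digits 1 to 9
# 30.1 17.6 12.5  9.7  7.9  6.7  5.8  5.1  4.6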

2. DATA
Dan provided land area (in square kilometres), along with two series of population data: “actual” (常住人口); and official/registered (戶籍人口) for each of the more than 350 administrative areas in China.  These data can be viewed here.  Dan notes that at the bottom of the spreadsheet are four cities which the Chinese government counts as provinces.  These are included to bring the population to the official total for China.  If you scan through the columns of figures, you’ll see that some administrative areas don’t have a land area recorded next to them, or an “actual” population.  However, they all have a registered population.  These gaps mean that the totals presented in the tables below won’t match.  The data are as at the end of 2007.

3.1 BENFORD ANALYSIS – CHINESE LAND AREA
So on to the analysis.  Let’s start by tallying the occurrences of first digits in the column of figures holding the land area data.  Table 1 summarises these frequencies (number and percentage) that were observed and expected under Benford’s Law.

Table 1: Chinese administrative land areas (sq kms)

Leading digit    No. times observed    No. times expected (Benford)
1                133 (39.8%)           101 (30.1%)
2                 57 (17.1%)            59 (17.6%)
3                 28 (8.4%)             42 (12.5%)
4                 22 (6.6%)             32 (9.7%)
5                 21 (6.3%)             26 (7.9%)
6                 12 (3.6%)             22 (6.7%)
7                 22 (6.6%)             19 (5.8%)
8                 20 (6.0%)             17 (5.1%)
9                 19 (5.7%)             15 (4.6%)
TOTAL            334 (100%)            334 (100%)

I think the data are easier to digest when presented graphically.

Figure 1: Chinese administrative land areas (sq kms)


The leading digits of Chinese land area data by administrative area do decrease roughly logarithmically, although they don’t technically follow a Benford distribution when subjected to a Chi-square goodness-of-fit test (X2=26.05 on 8 d.f.; p=0.00103).  The data are overweight in figures starting with a “1” and underweight in areas with leading digits “3” to “6”.  But as Benford’s Law predicts, “1” is the most common leading digit, followed by “2”, with the remaining leading digits bringing up the rear.
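The goodness-of-fit test is one line in R, and reproduces the figures quoted above:

observed <- c(133, 57, 28, 22, 21, 12, 22, 20, 19)   # land-area leading digits, Table 1
benford  <- log10(1 + 1 / (1:9))                     # Benford probabilities (sum to 1)
chisq.test(observed, p = benford)
# X-squared = 26.05, df = 8, p-value ~ 0.001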

3.2 BENFORD ANALYSIS – CHINESE ACTUAL POPULATION
Similarly, let’s look at the leading digits of “actual” populations by administrative areas.

Table 2: Chinese administrative area “actual” populations

Leading digit    No. times observed    No. times expected (Benford)
1                 39 (14.8%)            79 (30.1%)
2                 53 (20.2%)            46 (17.6%)
3                 44 (16.7%)            33 (12.5%)
4                 39 (14.8%)            25 (9.7%)
5                 28 (10.6%)            21 (7.9%)
6                 25 (9.5%)             18 (6.7%)
7                 15 (5.7%)             15 (5.8%)
8                 13 (4.9%)             13 (5.1%)
9                  7 (2.7%)             12 (4.6%)
TOTAL            263 (100%)            263 (100%)

When plotted, the difference between frequencies observed and what was expected under Benford’s Law is, I think, striking.

Figure 2: Chinese administrative area “actual” populations


Not even close to a Benford distribution.  The leading digit of “1” actually occurs less often than a leading digit of “2”, although after that things do decrease logarithmically.

3.3 BENFORD ANALYSIS – CHINESE REGISTERED POPULATION
Dan asked that I focus on the registered/official population figures (the highlighted column of raw data in the spreadsheet linked above).  Leading digits from the registered, official population figures by administrative areas are presented in Table 3 and Figure 3 below.

Table 3: Chinese administrative area “registered” populations

Leading digit    No. times observed    No. times expected (Benford)
1                 71 (19.6%)           109 (30.1%)
2                 67 (18.5%)            64 (17.6%)
3                 64 (17.6%)            45 (12.5%)
4                 46 (12.7%)            35 (9.7%)
5                 40 (11.0%)            29 (7.9%)
6                 25 (6.9%)             24 (6.7%)
7                 24 (6.6%)             21 (5.8%)
8                 14 (3.9%)             19 (5.1%)
9                 12 (3.3%)             17 (4.6%)
TOTAL            363 (100%)            363 (100%)


Figure 3: Chinese administrative area “registered” populations

Things look a bit better than the “actual” population data, but registered populations certainly don’t follow Benford’s Law either.  The leading digit of “1” is severely under-weight compared to its predicted frequency and, in fact, digits “1” through to “3” from registered population data occur with almost equal frequency.

4. SUMMARY
Chinese land area data by Chinese government administrative area follow a logarithmically decreasing pattern, although technically not a Benford distribution.  Chinese population data (“actual” and registered) by the same areas don’t follow a logarithmically decreasing pattern at all, Benford or otherwise.

That these Chinese government data don’t conform to Benford’s Law should not be taken as any kind of proof or insinuation that something untoward is going on.  It’s something that I would say warrants further investigation, but I expect there’ll be a rational explanation for it.  It’s probably more indicative of my own gaps in understanding than anything else.  Having said that, it was a very interesting exercise and I’d like to thank Dan Silver for the opportunity to write about it.

——
