Would you like to conduct a simple statistical experiment?
Grab a list of measurements from a real-life source of data, such as heights of buildings, or lengths of rivers. Now look at just the first digit of each of the measurements and record the frequency of the numbers 1 through to 9. Intuitively, you’d expect that each leading digit occurs roughly 11% of the time (i.e. 1 chance in 9). However, it is an interesting observation that the leading digit from such sources is often not uniformly distributed. Surprisingly, a first digit of 1 tends to appear with a probability of about 30%, a leading digit of 2 tends to occur about 18% of the time, a 3 about 13%, and so on in a logarithmically decreasing pattern, with a leading digit of 9 often being observed less than 5% of the time.
Welcome to the bizarre world* of Benford’s Law [*not actually a bizarre world].
This phenomenon was first noted by the astronomer and mathematician Simon Newcomb in 1881. The physicist Frank Benford re-stated the observation in 1938 and, in an odd twist of fate, it is after him that the law is named. It is referred to as a “Law” but of course it won’t apply to all kinds of real-world lists of numbers. Lottery results (they’re entirely random) or counts of fingers (since we’re talking about digits), by way of example, should be, I hope, somewhat more uniformly distributed.
I thought it would be a fun exercise to test Benford’s Law using population data available from the Australian Bureau of Statistics’ website. I downloaded the latest Census population counts across the 129 Statistical Local Areas (SLAs) in South Australia, and tallied the frequency of the leading digits. The results of observed versus expected frequencies are summarised in Table 1 below. For example, there were 3 SLAs that recorded a population with a leading digit of 9 (they were 934, 9205, and 9015). If Benford’s Law holds there should be 6 such SLAs. Note that one SLA had a population of zero, so was not included.
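For anyone wanting to repeat the tally, here is a minimal sketch in R of how the leading digits might be counted. The helper name `leading_digit` and the short `populations` vector are illustrative assumptions, not the code or data used for the analysis: the vector holds only the three populations quoted above plus a zero, to show the zero being dropped.

```r
# Hypothetical helper (not from the original analysis): returns the first
# digit of each positive whole number in x.
leading_digit <- function(x) {
  x <- x[x > 0]                               # a zero count has no leading digit
  as.integer(substr(as.character(x), 1, 1))   # first character of each number
}

# Illustrative input only: the three populations quoted above, plus a zero.
populations <- c(934, 9205, 9015, 0)

digits <- leading_digit(populations)

# factor() with levels 1:9 keeps digits that never occur as zero counts.
observed <- table(factor(digits, levels = 1:9))
print(observed)
```

Run over the full list of SLA populations, the same two steps would produce the observed column of the table.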
Table 1: Frequency of leading digit – observed vs. expected under Benford’s Law
|Leading digit|No. times leading digit observed|No. times leading digit expected*|
|---|---|---|
|1|47 (36.7%)|39 (30.1%)|
|2|36 (28.1%)|23 (17.6%)|
|3|11 (8.6%)|16 (12.5%)|
|4|10 (7.8%)|12 (9.7%)|
|5|1 (0.8%)|10 (7.9%)|
|6|6 (4.7%)|9 (6.7%)|
|7|7 (5.5%)|7 (5.8%)|
|8|7 (5.5%)|7 (5.1%)|
|9|3 (2.3%)|6 (4.6%)|
|Total|128 (100%)|128 (100%)|
*Expected counts are rounded to the nearest integer from the exact percentages predicted by Benford’s Law, so the individual rounded counts add up to 129 rather than 128.
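The expected column can be reproduced from the standard statement of Benford’s Law, under which a leading digit d occurs with probability log10(1 + 1/d). The sketch below assumes n = 128 (the SLAs with a non-zero population); the variable names are my own.

```r
n <- 128                        # SLAs with a non-zero population
d <- 1:9
benford_p <- log10(1 + 1 / d)   # Benford probability of leading digit d

expected <- data.frame(
  digit   = d,
  percent = round(100 * benford_p, 1),
  count   = round(n * benford_p)
)
print(expected)

# Rounding each row separately need not preserve the total of 128.
sum(expected$count)
```

Because each row is rounded on its own, the rounded counts need not sum back to exactly 128.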
First, the similarities. The frequencies of the observed leading digits do broadly decrease from 1 to 9, as Benford’s Law predicts. However, on closer inspection, discrepancies start to emerge. The leading digits of 1 and 2, for example, are over-represented in the dataset (47 observed vs. 39 expected and 36 vs. 23, respectively). Conversely, the leading digit of 5 is severely under-represented, with just a single occurrence compared to ten expected.
The actual versus expected distributions are plotted below in Figure 1.
Figure 1: Frequency of leading digit – observed vs. expected under Benford’s Law
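A chart along these lines can be drawn with base R graphics. This is only a sketch of one way to do it, using the counts from Table 1; the colours, labels and layout are my own choices and not necessarily those of the original figure.

```r
observed <- c(47, 36, 11, 10, 1, 6, 7, 7, 3)    # observed counts, Table 1
expected <- c(39, 23, 16, 12, 10, 9, 7, 7, 6)   # rounded Benford counts, Table 1

# One row per series, so barplot() draws the bars side by side for each digit.
counts <- rbind(Observed = observed, Expected = expected)

bp <- barplot(
  counts,
  beside      = TRUE,
  names.arg   = 1:9,
  col         = c("steelblue", "grey70"),
  legend.text = rownames(counts),
  xlab        = "Leading digit",
  ylab        = "Number of SLAs",
  main        = "Leading digit: observed vs. expected under Benford's Law"
)
```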
So does the population data conform to Benford’s Law? I decided to dig deeper with an appropriate statistical test. Under the null hypothesis, the observed leading digits follow the Benford (logarithmic) distribution. Under the alternative hypothesis, they follow some other distribution.
Firing up the trusty statistical analysis package “R”, I entered the observed and expected counts from Table 1 and ran a Chi-squared Goodness of Fit test:
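> census <- c(47,36,11,10,1,6,7,7,3)
> benford <- c(39,23,16,12,10,9,7,7,6)
> # rescale.p=TRUE turns the expected counts into probabilities that sum to 1
> chisq.test(census, p=benford, rescale.p=TRUE)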
yielding the following output:
Chi-squared test for given probabilities
X-squared = 21.6447, df = 8, p-value = 0.005618
A p-value of 0.005618 is well below the conventional 0.05 threshold, so we can reject the null hypothesis in favour of the alternative and conclude that the observed frequencies do not follow a Benford distribution.
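One refinement worth trying is to give the test the exact Benford probabilities rather than expected counts that have already been rounded to whole numbers. The sketch below does that with the observed counts from Table 1; it is a variation on the test above rather than the original call, and the variable names are assumptions.

```r
census <- c(47, 36, 11, 10, 1, 6, 7, 7, 3)   # observed counts from Table 1
benford_p <- log10(1 + 1 / (1:9))            # exact Benford probabilities

# Goodness-of-fit test against the exact probabilities (they sum to 1).
result <- chisq.test(census, p = benford_p)
print(result)

# Expected counts implied by the exact probabilities, before any rounding.
round(result$expected, 2)
```

The statistic should come out close to the one reported above, still on 8 degrees of freedom, so the conclusion does not hinge on the rounding.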
Benford’s Law analysis is often used for fraud detection (for example, insurance claims, forensic accounting, and even the recent Iranian Presidential election). That’s not to suggest that there’s anything fraudulent going on here with the Australian Bureau of Statistics’ Census data. I’d be interested to know why there’s such an undercount in SLA populations beginning with the digit “5”, for example, but I’m totally confident that there’s a legitimate, rational explanation for it. Two possible explanations are that 128 data points are simply not enough to draw any kind of sensible conclusion, or that it has something to do with the way the Australian Bureau of Statistics defines what a Statistical Local Area actually is. Or maybe I’ve just made a hash of the whole thing.
Still, an interesting analysis.