The other day I came across this article over on *Wired*: “Data-Mining for Terrorists Not ‘Feasible,’ DHS-Funded Study Finds”.

The (United States) government should not be building predictive data-mining programs that attempt to figure out who among millions is a terrorist, a privacy and terrorism commission funded by Homeland Security reported Tuesday. The commission found that the technology would not work and the inevitable mistakes would be un-American.

…

“Even in well-managed programs, such tools are likely to return significant rates of false positives, especially if the tools are highly automated.”

I haven’t read the full report, but Wired’s article piqued my interest. Of course the issue of privacy is a very important one. But I was particularly interested, from a probability/statistics point of view, in the reference to “significant rates of false positives”. It is a seemingly paradoxical consequence of a particular law of probability known as *Bayes’ Law* that the rarer the event we are testing for, the greater the proportion of false positives, no matter how accurate the test.

Let me explain by way of example.

Let us assume that the United States’ Department of Homeland Security (DHS) has developed an uber “predictive data-mining program” that is 99.9% sensitive. That is, if you are a terrorist, the algorithm will correctly identify you as one 99.9% of the time. The bad guys have only a 1 in 1000 shot at slipping under the radar. However, the chance that any person picked at random in the population *is* actually a terrorist would have to be fantastically slim. Say for argument’s sake that the rate of terrorists in the population is 1 in 15 million. That’s a probability of just 0.00000667%. As I’ll show below, it is this tiny probability that, conversely, brings the DHS testing unstuck.

Let “T” be the event of being a terrorist and “N” be the complementary event of not being a terrorist. Similarly, let “+” be the event of the DHS algorithm flagging you as a terrorist and “-” be the complementary event of not being identified. Expressed as probabilities we have the following:

P(you are a terrorist) = P(T) = 0.0000000667

P(you are *not* a terrorist) = P(N) = 1-P(T) = 0.9999999333

P(identified as a terrorist, given that you are a terrorist) = P(+|T) = 0.999

P(identified as a terrorist, given that you are *not* a terrorist) = P(+|N) = 0.001

(Note that this last figure is a separate assumption, namely that the algorithm is also 99.9% *specific*, clearing innocent people 99.9% of the time. It does not follow automatically from the sensitivity figure above; I’m simply granting the DHS an equally impressive error rate in both directions.)

So far everything looks good for the DHS. The chance that any given person is a terrorist is very, very small, and the chance of the algorithm catching a real terrorist is very, very high. However, Bayes had other ideas when he began toying with conditional probabilities. Using Bayes’ Law, the probability of a false positive can be calculated as:

P(you are *not* a terrorist, *given that you were identified as one*)

= P(N|+)

= P(+|N) x P(N) / P(+)

= [ P(+|N) x P(N) ] / [ P(+|T)xP(T) + P(+|N)xP(N) ]

= [ 0.001 x 0.9999999333 ] / [ 0.999 x 0.0000000667 + 0.001 x 0.9999999333 ]

= 0.9999333711 (i.e. more than 99.9%)
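For anyone who wants to check the arithmetic, the calculation above can be sketched in a few lines of Python. The numbers are the assumptions from this example, not real DHS figures:

```python
# Assumptions from the example above -- not real DHS figures.
p_t = 0.0000000667        # P(T): prior chance a person is a terrorist (~1 in 15 million)
p_n = 1 - p_t             # P(N): chance a person is not a terrorist
p_pos_given_t = 0.999     # P(+|T): sensitivity of the algorithm
p_pos_given_n = 0.001     # P(+|N): false-positive rate (assumed 99.9% specificity)

# Law of total probability: P(+) = P(+|T)P(T) + P(+|N)P(N)
p_pos = p_pos_given_t * p_t + p_pos_given_n * p_n

# Bayes' Law: P(N|+) = P(+|N)P(N) / P(+)
p_n_given_pos = p_pos_given_n * p_n / p_pos

print(p_n_given_pos)  # ~0.9999333711, matching the figure above
```

Notice that the tiny prior P(T) appears only in the first term of the denominator, which is why it gets swamped by the false positives from the overwhelmingly innocent population.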

In other words, even if the accuracy of the DHS program were extremely high, all it would really do is *falsely* accuse everyone except your grandma and the neighbour’s cat of being a terrorist.

And there would be some suspicions about the cat.
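To see what those proportions mean in head-counts, here is a rough sketch assuming a hypothetical screened population of 300 million (an illustrative figure of mine, not one from the article or the report):

```python
# Hypothetical head-count illustration. The 300 million population figure
# is an illustrative assumption, not from the article or the DHS report.
population = 300_000_000
terrorist_rate = 1 / 15_000_000   # 1 in 15 million, as assumed above
sensitivity = 0.999               # P(+|T)
false_positive_rate = 0.001       # P(+|N), assuming 99.9% specificity

terrorists = population * terrorist_rate            # 20 actual terrorists
innocents = population - terrorists

true_positives = terrorists * sensitivity           # roughly 20 flagged correctly
false_positives = innocents * false_positive_rate   # roughly 300,000 innocents flagged

print(round(true_positives), round(false_positives))
```

So for every real terrorist the program flagged, roughly fifteen thousand innocent people would be flagged alongside them.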
