The other day I came across this article over on Wired: Data-Mining for Terrorists Not ‘Feasible,’ DHS-Funded Study Finds.
The (United States) government should not be building predictive data-mining programs that attempt to figure out who among millions is a terrorist, a privacy and terrorism commission funded by Homeland Security reported Tuesday. The commission found that the technology would not work and the inevitable mistakes would be un-American.
“Even in well-managed programs, such tools are likely to return significant rates of false positives, especially if the tools are highly automated.”
I haven’t read the full report, but Wired’s article piqued my interest. Of course the issue of privacy is a very important one, but I was particularly interested, from a probability/statistics point of view, in the reference to “significant rates of false positives”. It is a seemingly paradoxical consequence of a law of probability known as Bayes’ Law that, no matter how accurate the test, the rarer the event being tested for, the greater the proportion of positive results that are false.
Let me explain by way of example.
Let us assume that the United States’ Department of Homeland Security (DHS) has developed an uber “predictive data-mining program” that is 99.9% sensitive. That is, if you are a terrorist, the algorithm will correctly identify you as one 99.9% of the time; the bad guys have only a 1 in 1,000 shot at slipping under the radar. However, the chance that any person picked at random in the population is actually a terrorist would have to be fantastically slim. Say for argument’s sake that the rate of terrorists in the population is 1 in 15 million. That’s a probability of just 0.00000667%. As I’ll show below, it is this tiny probability that, perversely, brings the DHS’s testing unstuck.
Let “T” be the event of being a terrorist and “N” be the complementary event of not being a terrorist. Similarly, let “+” be the event of the DHS algorithm flagging you as a terrorist and “-” be the complementary event of not being identified. Expressed as probabilities we have the following:
P(you are a terrorist) = P(T) = 0.0000000667
P(you are not a terrorist) = P(N) = 1-P(T) = 0.9999999333
P(identified as a terrorist, given that you are a terrorist) = P(+|T) = 0.999
P(identified as a terrorist, given that you are not a terrorist) = P(+|N) = 0.001 (assuming the algorithm is equally good at clearing the innocent, i.e. it is also 99.9% specific, so it falsely flags an innocent person only 0.1% of the time)
So far everything looks good for the DHS. The chance that any given person is a terrorist is very, very small and the chance of the algorithm catching an actual terrorist is very, very high. However, Bayes had other ideas when he began toying with conditional probabilities. Using Bayes’ Law, the probability that someone flagged by the program is in fact innocent (a false positive) can be calculated as:
P(you are not a terrorist, given that you were identified as one)
= P(+|N) x P(N) / P(+)
= [ P(+|N) x P(N) ] / [ P(+|T)xP(T) + P(+|N)xP(N) ]
= [ 0.001 x 0.9999999333 ] / [ 0.999 x 0.0000000667 + 0.001 x 0.9999999333 ]
= 0.9999333711 (i.e. more than 99.9%)
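The working above is easy to verify with a few lines of Python. The rates below are the hypothetical ones from the example, not real DHS figures:

```python
# Hypothetical rates from the example above.
p_T = 1 / 15_000_000         # P(T): prior probability of being a terrorist
p_N = 1 - p_T                # P(N): prior probability of being innocent
p_pos_given_T = 0.999        # P(+|T): sensitivity of the algorithm
p_pos_given_N = 0.001        # P(+|N): false-positive rate (assumed 99.9% specificity)

# Law of total probability: P(+) = P(+|T)P(T) + P(+|N)P(N)
p_pos = p_pos_given_T * p_T + p_pos_given_N * p_N

# Bayes' Law: P(N|+) = P(+|N)P(N) / P(+)
p_N_given_pos = p_pos_given_N * p_N / p_pos

print(f"P(N|+) = {p_N_given_pos:.10f}")  # more than 99.99% of flags are innocent people
```

Running this reproduces the figure in the derivation (to four decimal places, P(N|+) ≈ 0.9999).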
In other words, even if the accuracy of the DHS program were extremely high, all it would really do is falsely accuse everyone except your grandma and the neighbour’s cat of being a terrorist.
And there would be some suspicions about the cat.