I’m number one, baby, so why try harder?

The other day Whirlpool Discussion Forums member “billabong” posted that ISP aaNet are running a promotion using the flyer pictured below.  Readers will see that aaNet have sourced two customer satisfaction surveys (Whirlpool Australian Broadband Survey 2008 and Australian PC Authority Best ISP Award 2008), making it clear that aaNet finished number “1” in those polls.

Or did they?

As billabong points out, on the question of “Would you recommend your ISP to other people?” in the Whirlpool survey, aaNet actually finished 7th overall with 88.4% “Yes”.  In the “Value for Money” category of the PC Authority survey, aaNet also finished 7th with 4 out of 6 stars.  And they dropped to only 3 out of 6 stars in the “Overall” class.

So not quite “Number 1”.

I don’t actually have a real problem with aaNet cherry-picking survey results, putting themselves in the best possible light for marketing purposes.  It’s just par for the course in advertising.  I expect all companies do it to some degree.  Consumers should always be wary of the various shenanigans that go on when it comes to marketing departments and data.  That said, there can still be a certain elegance to it.  The way aaNet have crudely slapped a blue “Number 1” ribbon over results they actually finished 7th in leaves a sour taste in my mouth.

Poor form, aaNet.  Poor form.

[Image: aaNet flyer]

——

The poll as a political weapon

When applied correctly, statistics is an elegant tool that can help us put a random and uncertain world into context.  When abused, it can help dark and mysterious powers further their own nefarious agendas.  In its most brutal form, statistics can be used as a weapon to club the thick-witted over the head.

No game is quite so brutal as politics.

It’s been an interesting few weeks of local politics here in South Australia.  South Australia has a fixed four-year electoral cycle and our next election is due in March 2010.  The Liberal opposition party isn’t making any real headway against the Labor incumbents who dominate the political landscape.  To make matters worse, the State Liberal leader, Martin Hamilton-Smith (MHS), recently entangled himself in a “dodgy documents” scandal.  He tried to embarrass the government with some “leaked” emails that turned out to be forgeries.  If the Liberal party are to present any kind of alternative government to the people, they need to quickly put this controversy behind them, present a united front, and build positive momentum over the remaining nine months leading into the formal campaign.  Leadership rumblings in the face of a looming election would be too much of a distraction.  In short, MHS’ position became untenable.  However, a trigger was needed to effect his removal.  That trigger was, of all things, a little statistic.

Last week, Mike Rann, the current Labor Premier of South Australia, fired a warning shot of what was to come on the social networking site Twitter:

Some of the polling to be dribbled out over next few weeks will be of dubious provenance but Lib plotters hope it will spook/stampede MPs.

-Mike Rann, Premier of South Australia, on Twitter, 5:43 PM, 27 June 2009

Sure enough, the very next day, polling of dubious provenance dribbled out of our local propaganda rag, the Sunday Mail.

A Sunday Mail poll of 483 Adelaide metropolitan voters put Labor on 64 per cent to the Liberals’ 36 percent on a two party-preferred basis.

-AdelaideNow, 28 June 2009

Bear in mind this “poll” had fewer than 500 respondents.  Even if everything had been above board, that would put the margin of error at about 4-5 percentage points.  Reputable polling companies typically canvass a larger sample (usually about 1200) to reduce the margin of error.  However, the poll contained multiple flaws.  For a start, it covered metropolitan Adelaide only, not the whole State.  Further, the Sunday Mail didn’t bother to explain which electorates were included, how the polling was conducted, or by whom.  They couldn’t even be bothered to present the results in a tabled summary for detailed scrutiny.  It’s what Darrell Huff would have described as a phoney statistic.  It’s what I would describe as horse shit.  The poll was biased, politically motivated and compromised from the outset.
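For the curious, here’s a quick back-of-envelope check of those margin-of-error figures.  It’s a minimal Python sketch assuming a simple random sample and the usual 95% confidence level (assumptions on my part, since the Sunday Mail supplied no methodology):

    import math

    def margin_of_error(n, p=0.5, z=1.96):
        """95% margin of error for a sample proportion (worst case, p = 0.5)."""
        return z * math.sqrt(p * (1 - p) / n)

    print(f"{margin_of_error(483):.1%}")   # ~4.5% for the Sunday Mail's sample
    print(f"{margin_of_error(1200):.1%}")  # ~2.8% for a typical 1200-respondent poll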

But it had its desired result.  There was a leadership spill.  Just as the “Lib plotters” had hoped, the MPs were spooked by the dodgy poll.  Support for MHS melted away.  Although he scraped back in by the narrowest of margins (11-10, with one abstaining), it was a Pyrrhic victory.  Realising that the margin was too slender to lead effectively, MHS promptly quit.  There’ll be another leadership ballot tomorrow, which MHS won’t be contesting.  So that’s it, he’s gone.

All thanks to a little statistic.

Funnily enough, the powerbrokers have apparently decided that this whole dodgy polling strategy is a winner.  Just as it can be used to tear one leader down, it can be used to build up the next.  Mike Rann went on to post on Twitter:

After Lib leader settled we’ll see campaign by one media to promote whoever wins. We might see odd dodgy poll plus Press Club ‘vision’ etc

So not the last we’ll see of the poll as a political weapon.  Interesting times ahead.

——

Disclaimer: I am not associated with any political party.  I simply have a keen interest in statistics and its application in the world.

Pornography, NewScientist and the Ecological Fallacy

Some time ago I came across an article in NewScientist: Porn in the USA: Conservatives are biggest consumers.

Americans may paint themselves in increasingly bright shades of red and blue, but new research finds one thing that varies little across the nation: the liking for online pornography.

A new nationwide study of anonymised credit-card receipts from a major online adult entertainment provider finds little variation in consumption between states.

“When it comes to adult entertainment, it seems people are more the same than different,” says Benjamin Edelman at Harvard Business School.

However, there are some trends to be seen in the data. Those states that do consume the most porn tend to be more conservative and religious than states with lower levels of consumption, the study finds.

“Some of the people who are most outraged turn out to be consumers of the very things they claimed to be outraged by,” Edelman says.

As I waded into the paper, things began to get murky.  Firstly, the study relied on only a single source of data, a “top-10 seller of adult entertainment”.  In fairness, Edelman does admit that “… it is difficult to confirm rigorously that this seller is representative” but explains it away with “… the seller runs literally hundreds of sites offering a broad range of adult entertainment.”  Crikey.  That old chestnut.  If your sample is large enough, it cancels out any bias?  This is the statistical equivalent of “if lots of people tell you something it must be true”.  I believe it’s worth raising a proverbial sceptical eyebrow here at the source of the data.

Secondly, a little sleight of hand.  “Little variation in consumption between states” segues into “there are some trends to be seen in the data.  Those states that do consume the most porn tend to be more conservative and religious than states with lower levels of consumption.”  To me, the statements look mutually exclusive.  Either there was little variation or there were some trends to be seen.  Which is it?  I do feel for the researcher who slaves away for many hours on a project, only to yield a null result.  It’s never as exciting as finding that diamond in the mud.  But a null result is still a null result, and it’s important to stay true to your findings.  Hell, just ask Michelson and Morley.

But what concerned me the most was the decoupling of the conclusion from the results.  Edelman concludes that “some of the people who are most outraged turn out to be consumers of the very things they claimed to be outraged by.”  In other words, conservatives are hypocrites because they actually buy more porn than those Godless, unbathed liberals.

This is an example of an Ecological Fallacy.

An ecological fallacy, often called an ecological inference fallacy, is an error in the interpretation of statistical data in an ecological study, whereby inferences about the nature of specific individuals are based solely upon aggregate statistics collected for the group to which those individuals belong.

Put simply, let’s say I went to a statistics professor with these results:

State X, 1000 people, 90% conservative, 6 porn subscribers.

State Y, 1000 people, 90% liberal, 3 porn subscribers.

If my conclusion from those data was that conservative people buy more porn than liberal people, the statistics professor would simply laugh at me and walk away.  Or maybe, if the professor were kind, s/he would mention in passing that I don’t know the political leanings of the nine porn subscribers because I haven’t directly analysed them.  I fell for the Ecological Fallacy.  I incorrectly inferred that the subscribers are defined by the area in which they live.
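To make the fallacy concrete, here’s a minimal Python sketch using the invented figures above.  Two scenarios reproduce exactly the same state-level aggregates, yet imply opposite conclusions about individuals:

    # Invented figures: State X is 90% conservative with 6 subscribers;
    # State Y is 90% liberal with 3 subscribers (1000 people in each state).
    conservatives = 900 + 100  # total conservatives across both states
    liberals = 100 + 900       # total liberals across both states

    # Scenario A: all 9 subscribers happen to be conservative.
    # Scenario B: all 9 subscribers happen to be liberal.
    # Both scenarios are perfectly consistent with the state aggregates.
    for label, con_subs, lib_subs in [("A", 9, 0), ("B", 0, 9)]:
        print(f"Scenario {label}: conservatives {con_subs / conservatives:.2%}, "
              f"liberals {lib_subs / liberals:.2%}")

The aggregates alone can’t tell these two worlds apart, which is exactly why inferring the behaviour of individuals from them is a fallacy.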

An important distinction that researchers need to be mindful of.

How to talk back to a statistic

In my previous blog entry I briefly reviewed Darrell Huff’s excellent book, How to Lie with Statistics.  In the closing chapter Huff summarises the lessons by explaining How to Talk Back to a Statistic.  Or, in Huff’s own words, “how to look a phoney statistic in the eye and face it down”.

Not all the statistical information that you may come upon can be tested with the sureness of chemical analysis or of what goes on in an assayer’s laboratory.  But you can prod the stuff with five simple questions, and by finding the answers avoid learning a remarkable lot that isn’t so.
- How to Lie with Statistics, Chapter 10, p110

The five simple questions are:

  1. Who Says So?
  2. How Does He Know?
  3. What’s Missing?
  4. Did Somebody Change the Subject?
  5. Does It Make Sense?

Armed with these five questions I thought it might be interesting to examine a real world example.

You might have seen recently the news report that “75% of ex-Bush officials are still unemployed”.  The source of the story was the Wall Street Journal article of 21 Feb 2009: Jobs Still Elude Some Bush Ex-Officials

The jobless rate is hanging high for many of the roughly 3,000 political appointees who served President George W. Bush.  Finding work has proved a far tougher task than those appointees expected …

Only 25% to 30% of ex-Bush officials seeking full-time jobs have succeeded … much, much worse than when Ronald Reagan, George H.W. Bush and Bill Clinton left the White House …

Let’s put this 75% unemployment rate of ex-Bush officials to the Huff Test.

Who Says So?

The first sleight of hand you notice about the statistic is that it hides behind what Huff calls an “OK Name”.  In this case the “OK Name” is the Wall Street Journal, a well known and reputable news source.  But it’s not the WSJ who actually “says so”.  It’s not a piece of their own independent investigative journalism.  In this case the WSJ is merely reporting on a statistic prepared by somebody else.  It’s therefore worthwhile considering if this third party actually has any expertise in the areas of data collection and statistical analysis.  Are they impartial?  Could they be biased?  Do they have a hidden agenda or ulterior motive behind presenting these figures?

I don’t want to drink from a poisoned well.  So I’m going to approach this source with a healthy dose of scepticism.

How Does He Know?

How did the researchers arrive at their “estimate”?  Via robust statistical sampling?  Rumour mill?  Reading tea leaves?  On the face of it, the data look anecdotal at best.  The WSJ article doesn’t go into any details.  This is enough to raise a second doubt about the statistic.

What’s Missing?

What kind of error margin is there in the estimates?  If the estimate was based on a sample, how big was it?  How was it selected?  Is it representative?  When comparing ex-Bush officials with previous administrations are they comparing apples with apples in terms of such things as ages and career ambitions?  Were ex-Bush officials more likely to be heading into retirement or satisfied with a bit of part time work?  Not to mention the vastly different employment situation that exists right now as the U.S., and indeed the world, enters the Second Great Depression.

Did Somebody Change the Subject?

Huff warns that “when assaying a statistic, watch out for a switch somewhere between the raw figure and the conclusion.  One thing is all too often reported as another.”  Although it doesn’t explicitly say so, the implication behind the WSJ article is that ex-Bush officials are having a hard time finding employment because they’re ex-Bush officials.  It’s fair to say that George W. Bush is regarded as one of the worst U.S. presidents in history.  Certainly he left office with some of the lowest approval ratings of all time.  So it’s only natural that nobody from his administration could ever find gainful employment again.  They’re hopeless and everybody hates them, right?  Well, maybe.  But the truth is probably far more complex.  There are so many external variables in play that such a conclusion represents a leap of faith.

Does It Make Sense?

For any statistic to “make sense” it needs context.  A comparison of yourself to a group, or your suburb to the country, or a trend over time are examples of data context.  In my opinion any kind of real, meaningful context is missing from the statistic reported by the WSJ.

All in all I believe the figure of “75% of ex-Bush officials are unemployed”, as reported by the WSJ, fails Huff’s basic five-point “how to talk back to a statistic” test.  This particular statistic is like a bad smell in an elevator.  Source and purpose unknown, it hangs in the air requiring our attention.  But any kind of meaningful comment is impossible.  The only sensible course of action is to ignore it.

How to Lie with Statistics

Statistics is hard.  Let’s all go to the beach.
-Barbie

You know, I really enjoy being an information analyst.  Statistics has been a very rewarding career choice.  Over time I’ve learnt to swim through data like a fish through water.  In fact, remove me from statistics and I’d probably flap around gasping for breath just like a landed fish.  But after many years I’ve come to accept that the vast majority of the population simply don’t “trust” statistics.  I admit, not without good reason.  On the one hand we’re bombarded with statistics every day, mostly from the media (both “as reported” and in advertising).  On the other hand, statistics are too often twisted, corrupted, misrepresented, biased, misused, falsified, misreported or sometimes simply ignored (not by me, of course, heh).  No wonder some people throw up their hands and declare it’s all too hard.  Why bother paying attention anyway when 83.7% of all statistics are simply made up on the spot?

With that in mind I’ve just finished reading How to Lie with Statistics by Darrell Huff.

I understand that How to Lie with Statistics is one of the highest-selling books on statistics ever written, if not the highest.  An extraordinary achievement, especially considering Huff had no formal training in statistics.  The concepts are all too familiar to me, but of course How to Lie with Statistics is not aimed at the professional.  It’s very much an introductory text aimed squarely at a non-technical audience.  My copy was a mere 124 pages long, meaning How to Lie with Statistics can be digested in just a couple of hours.  First published in 1954, it’s striking how, even though some of the language has dated terribly (“Negro”? “Mongolism”?), the basic ideas expressed inside are timeless.  Warnings to beware of such things as hidden bias, inappropriate sampling, “conveniently” omitted details, and inappropriate measures (e.g. using the mean when the median is more appropriate) remain as relevant in 2009 as in 1954.  They’ll still be relevant in 2059.
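As a quick illustration of that last point, here’s a toy Python example with invented salary figures: a single large value drags the mean well away from the “typical” observation, while the median stays put.

    # Hypothetical salaries at a small firm; one big earner skews the mean.
    salaries = [30_000, 32_000, 35_000, 38_000, 40_000, 45_000, 250_000]

    mean = sum(salaries) / len(salaries)
    median = sorted(salaries)[len(salaries) // 2]

    print(f"mean:   {mean:,.0f}")    # 67,143 -- the flattering "average wage"
    print(f"median: {median:,.0f}")  # 38,000 -- closer to what most actually earn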

Huff certainly writes entertainingly and with good humour throughout, making How to Lie with Statistics a very accessible and enjoyable read.  More than 50 years after it was first published, many of the statistical “sins” highlighted by Huff are still being committed today.  By way of example: correlation being used to imply causation, graph scales used to exaggerate minor differences, and “OK names” being used to mask dodgy sources.  In conclusion, How to Lie with Statistics will help the average reader identify the various statistical “sharks” that can lurk in these waters.

Safe swimming.

In future blog entries I’d like to expand further on some of the concepts that Huff wrote about in How to Lie with Statistics, hopefully using some real world examples.
