## 2010-06-08

### This Is The Dumbest Goddamn Thing You Can Say About Statistics

"A large population size must require a larger sample size."
This -- or any iteration thereof -- is the dumbest goddamn thing you can say about statistics. While it's a clear demonstration that someone's missed the whole point of inferential statistics, it's also one of the most common things you'll hear about them. (Often in the form of "That sample is only a small proportion of the population.") Here are some of the varieties of this statement that I've encountered over time:
> How do they project statistics like that? I'm trying to imagine what kind of sample size you'd need to represent, well, everything in the universe. [In regard to matter/anti-matter ratio in the universe as researched at Fermilab; comment posted at Slashdot]

> Adobe claims that its Flash platform reaches '99% of internet viewers,' but a closer look at those statistics suggests it's not exactly all-encompassing... the number of Flash users is based on a questionable internet survey of just 4,600 people — around 0.0005% of the suggested 956,000,000 total. [News summary at Slashdot]

> That poll doesn't convince me of 4e's success or lack thereof. Also, there's only 904 total votes while ENWorld has over 74,000 members, so that's only a small fraction of forum members (admittedly many of those 74,000 are probably inactive). [In regard to the popularity of the D&D game's 4th Edition; comment posted at ENWorld]
You get the idea. To save some writing time here, I'll use n to indicate the sample size and N to indicate the population size. For any statistical inference, if n=50 is an acceptable sample for N=1,000, then it's also acceptable for N=10,000, N=1 billion, or N=infinity. In particular, one thing that never really matters is the ratio of sample to population.

Brief illustration: Let's say that you're using a sample mean to estimate a population mean (much like in a scientific opinion survey, etc.). As long as you have a sample size of at least 30 or so, you automatically know what the shape of all possible sample mean results is: a normal curve, as per the (mathematically proven) Central Limit Theorem. And then you can use that curve (via some integral calculus, or a resulting table or spreadsheet formula) to calculate the probability that your observed sample mean is any given distance from the population mean. Does the size of the population have any bearing on this sampling distribution shape? No. Does the CLT make any reference to the size of the population? No, not whatsoever. You have a moderate-sized sample (30+), you know the shape of all possible sample means, you calculate your probability from that (or some equivalent process), done.
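That claim is easy to check numerically. Here's a minimal simulation sketch (the skewed toy population and the trial counts are my own arbitrary choices, not anything from a real survey): draw many size-30 samples from a small population and from one a hundred times larger, and compare the spread of the resulting sample means.

```python
import random
import statistics

random.seed(0)

def sample_mean_spread(pop_size, n=30, trials=2000):
    # A deliberately skewed population: values 1..10, heavy on the low end.
    values = [1, 1, 1, 2, 2, 3, 5, 8, 9, 10]
    population = [random.choice(values) for _ in range(pop_size)]
    # Standard deviation of `trials` sample means, each from a sample of size n.
    means = [statistics.mean(random.sample(population, n)) for _ in range(trials)]
    return statistics.stdev(means)

small = sample_mean_spread(1_000)      # N = 1,000
large = sample_mean_spread(100_000)    # N = 100,000
print(small, large)  # nearly identical spreads: N is irrelevant
```

Both runs give essentially the same spread of sample means (about σ/√30), exactly as the CLT predicts with no reference to N.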

Exception: In calculating sampling distribution probabilities, you'll use something like the fact that its standard deviation is σ/√n. (Here the σ indicates the standard deviation of the whole population.) Now, if the population size happens to be exceptionally small (like, N≤20n), and you're sampling without replacement, then you can improve the estimate a bit by instead using the correction formula √((N-n)/(N-1)) * σ/√n. But why bother? (a) You're almost never in that situation, (b) it rarely makes that much difference, and (c) you're just making extra number-crunching work for yourself. So you're actually better off assuming that the population is really huge or even infinite (as is actually done), thereby saving yourself calculation effort by way of the simpler formula. For any N>20n, the difference is negligible anyway (which is to say: lim N→∞ √((N-n)/(N-1)) * σ/√n = σ/√n). Run some numerical examples (pick any σ you like) and you'll see how little difference it makes.
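To see how little the correction buys you, here's a quick sketch comparing the plain formula against the corrected one (the particular σ and n are arbitrary example values):

```python
import math

def plain_se(sigma, n):
    # Standard deviation of the sample mean, infinite-population formula.
    return sigma / math.sqrt(n)

def corrected_se(sigma, n, N):
    # Same, with the finite population correction factor sqrt((N-n)/(N-1)).
    return math.sqrt((N - n) / (N - 1)) * sigma / math.sqrt(n)

sigma, n = 10, 50   # arbitrary example values
for N in (1_000, 10_000, 1_000_000):
    ratio = corrected_se(sigma, n, N) / plain_se(sigma, n)
    print(N, round(ratio, 4))  # ratio approaches 1 as N grows
```

Even at N = 1,000 (N = 20n, the edge of the "exceptionally small" zone) the correction shaves the standard error by under 3%; by N = 10,000 it's a rounding error.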

Even more absurd exception: One requirement that the Central Limit Theorem does have is that the population standard deviation must be nonzero, i.e., σ>0, which does rule out having a population size of just one. But, c'mon, if that were the case then what you're doing isn't really sampling or inferential statistics in the first place, now is it?

In summary: If anything, a larger population size makes the statistics easier, and the math is simplest when you assume an infinite population size in the first place. Other than that, population size has no bearing on the math behind your estimation or surveying procedure.

One final, really simple observation: If an opinion poll is performed at the standard 95% confidence level, then its margin of error can be basically calculated by: E = 1/√n. (Compare to the formula for standard deviation above; the σ disappears due to a particular very convenient substitution and cancellation.) Does the population size N appear anywhere in this formula? Nope -- it's fundamentally irrelevant to the process.
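Plugging in the sample sizes from the quotes above makes the point: at n = 4,600 (the Flash survey) the margin of error is about ±1.5 percentage points, and at n = 904 (the ENWorld poll) about ±3.3 points -- regardless of whether N is 74,000 or 956 million. A quick sketch:

```python
import math

def margin_of_error(n):
    # 95% margin of error for a proportion: E = z * sqrt(p(1-p)/n).
    # With z ~= 1.96 ~= 2 and worst-case p = 0.5, this collapses to 1/sqrt(n).
    return 1 / math.sqrt(n)

for n in (400, 904, 4_600):
    print(n, round(margin_of_error(n), 4))
```

Note that n is the only input; there is no place to even put N.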

(I've written about this before, but I wanted a version that was a bit more -- ahem -- direct, for posterity's sake.)

1. You're right that they're wrong, but I don't think it's dumb. It's such a natural mistake. (You're also right that technical writers should get it. I expect commenters to get it wrong; I want to expect the technical writers to get it right.)

You don't explain in this post why the ratio doesn't matter. For me, math is all about why. I think this is a hard concept for many people, and I'd love more really clear illustrations of why it must be true.

Here's what I tell my students:
Imagine a silo full of corn, very well mixed (so a random sample will be easy to get). Now pull out one big scoop. How many kernels does it have? Hundreds, maybe a thousand.

Suppose you want to measure the moisture content of all the corn in the silo. If you find that it doesn't vary too much among these kernels, then this one scoop seems like enough to judge the silo by.

Does it matter if the silo doubles in height? Not as long as we mix the corn up really well.

They like that example, but I think plenty of them would still get your final exam question on this wrong.

2. Of course, I'm trying to be a bit intentionally attention-grabbing here. :) Re: the title, I actually said that out loud in my college class a week ago, and it got an enormously positive and memorable response (not advised for most instructors), so I figured I'd iterate on it here a bit.

I've described a similar illustration in my class in the past (being game-oriented, I usually go to a deck of cards: if you sample for the average rank, does it matter if there are 1 or 4 or 100 suits?)

However, I actually feel that it distracts a bit (not a good use of time for me) from the key insight of inferential statistics, which is to specify the sampling distribution of the estimating statistic. That's why, if I can bring my explanation back to "The shape of all possible sample means is what allows you to calculate probabilities -- and that's normal by the CLT, regardless of population size", I'd rather do that, and have it feed back into the one critical "hard insight" of the subject.

3. I think the confusion comes about because in real life, samples are not "random".
So for example, in a company of 1000, I may sample 50 people and conclude that 80% of them have iPads. I then may make some statistical inference about the proportion of the company population having iPads.

The problem is, I only surveyed co-workers, who all work in IT. In this case I may have a good sample for my department but not for the company as a whole. And in this case it may be easy to identify why my sample may be biased with respect to the population of the company.

So that leads people to conclude that a bigger sample will give more accurate results, because it will likely reduce bias in the sample. We have this uneasy feeling that maybe our sample is not representative, and many times we don't know why.

If I ask everyone in my company if they own an iPad, I know the mean of the population itself. If I ask 90% of the population, I am unlikely to be off the population mean by a lot, even if my sample is biased.

However, if I only ask 20%, I can be off by quite a bit because I have a biased sample. If I have a sample, how do I know it is biased? I can't tell just from the sample itself.

1. Well, obviously biased samples are bad, but that's a distinct issue from sample size. We can, for example, commit to being on the lookout for instances where we've asked people from only one department.

I would argue that if someone were in a situation where they were asking 90% (or even 20%, really) of a population some question, then they've effectively lost all the cost savings of taking a sample in the first place. So: we need to get people on board with accepting minuscule fractions of a percent as expected and desirable behavior for sampling.
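The bias-versus-sample-size distinction can be made concrete with a toy simulation (all the numbers below are invented for illustration): a hypothetical 1,000-person company where iPad ownership differs sharply by department. A sample drawn only from IT converges on the IT rate, not the company rate -- and no increase in sample size within that frame fixes it.

```python
import random
import statistics

random.seed(1)

# Invented numbers for illustration: a company of 1,000 with a 100-person IT
# department. IT owns iPads at 80%; everyone else at 20%.
it_dept = [1] * 80 + [0] * 20
everyone_else = [1] * 180 + [0] * 720
company = it_dept + everyone_else          # true ownership rate: 260/1000 = 26%

def survey(frame, n):
    # Proportion of iPad owners in a random sample of n people from `frame`.
    return statistics.mean(random.sample(frame, n))

biased = survey(it_dept, 50)     # sampling frame = IT only: lands near 0.80
honest = survey(company, 50)     # random sample of the whole company: near 0.26
print(biased, honest)
```

Surveying the entire IT department (n = 100) would still report exactly 0.80 -- the error comes from the sampling frame, not from n, which is the point of the reply above.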