2010-06-08

This Is The Dumbest Goddamn Thing You Can Say About Statistics

"A large population size must require a larger sample size."
This -- or any iteration thereof -- is the dumbest goddamn thing you can say about statistics. While it's a clear demonstration that someone's missed the whole point of inferential statistics, it's also one of the most common things you'll hear about them. (Often in the form of "That sample is only a small proportion of the population.") Here are some of the varieties of this statement that I've encountered over time:
How do they project statistics like that? I'm trying to imagine what kind of sample size you'd need to represent, well, everything in the universe. [In regard to matter/anti-matter ratio in the universe as researched at Fermilab; comment posted at Slashdot]
Adobe claims that its Flash platform reaches '99% of internet viewers,' but a closer look at those statistics suggests it's not exactly all-encompassing... the number of Flash users is based on a questionable internet survey of just 4,600 people — around 0.0005% of the suggested 956,000,000 total. [News summary at Slashdot]
That poll doesn't convince me of 4e's success or lack thereof. Also, there's only 904 total votes while ENWorld has over 74,000 members, so that's only a small fraction of forum members (addmittedly many of those 74,000 are probably inactive). [In regard to the popularity of the D&D game's 4th Edition; comment posted at ENWorld]
You get the idea. To save some writing time here, I'll use n to indicate the sample size and N to indicate the population size. For any statistical inference, if n=50 is an acceptable sample for N=1,000, then it's also acceptable for N=10,000, N=1 billion, or N=infinity. In particular, one thing that never really matters is the ratio of sample to population.

Brief illustration: Let's say that you're using a sample mean to estimate a population mean (much like in a scientific opinion survey, etc.). As long as you have a sample size of at least 30 or so, you automatically know what the shape of all possible sample mean results is: a normal curve, as per the (mathematically proven) Central Limit Theorem. And then you can use that curve (via some integral calculus, or a resulting table or spreadsheet formula) to calculate the probability that your observed sample mean is any given distance from the population mean. Does the size of the population have any bearing on this sampling distribution shape? No. Does the CLT make any reference to the size of the population? No, not whatsoever. You have a moderate-sized sample (30+), you know the shape of all possible sample means, you calculate your probability from that (or some equivalent process), done.
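If you'd rather see that than take it on faith, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the skewed (exponential) population, the seed, n=50, the 2,000 trials, and the helper name sample_mean_spread. It draws repeated samples without replacement from a small and a large population, and compares the observed spread of the sample means to σ/√n.

```python
import random
import statistics

def sample_mean_spread(population, n, trials, rng):
    """Standard deviation of `trials` sample means, each from n draws without replacement."""
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(trials)]
    return statistics.stdev(means)

rng = random.Random(2010)  # fixed seed so the run is repeatable
n, trials = 50, 2000       # illustrative choices, not from any real survey
results = {}
for N in (1_000, 100_000):
    # A deliberately skewed population (exponential, mean and sd both near 1).
    population = [rng.expovariate(1.0) for _ in range(N)]
    sigma = statistics.pstdev(population)
    observed = sample_mean_spread(population, n, trials, rng)
    predicted = sigma / n ** 0.5
    results[N] = (observed, predicted)
    print(f"N={N:>7,}  observed spread={observed:.4f}  sigma/sqrt(n)={predicted:.4f}")
```

Both rows should land close to σ/√n even though one population is a hundred times the size of the other; the leftover wiggle is simulation noise plus, for the small population, a slight finite-population effect.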

Exception: In calculating sampling distribution probabilities, you'll use something like the fact that its standard deviation is σ/√n. (Here the σ indicates the standard deviation of the whole population.) Now, if the population size happens to be exceptionally small (like, N≤20n), and you're sampling without replacement, then you can improve the estimate a bit by instead using the correction formula √((N-n)/(N-1)) * σ/√n. But why bother? (a) You're almost never in that situation, (b) it rarely makes that much difference, and (c) you're just making extra number-crunching work for yourself. So you're actually better off assuming that the population is really huge or even infinite (as is actually done), thereby saving yourself calculation effort by way of the simpler formula. For any N>20n, the difference is negligible anyway (which is to say: lim N→∞ √((N-n)/(N-1)) * σ/√n = σ/√n). Run some numerical examples (pick any σ you like) and you'll see how little difference it makes.
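For the numerical examples suggested above, a short sketch along these lines will do. The values σ=10 and n=50 are arbitrary picks, and the function names are made up for this illustration.

```python
import math

def standard_error(sigma, n):
    """Standard deviation of the sample mean, infinite-population formula."""
    return sigma / math.sqrt(n)

def standard_error_fpc(sigma, n, N):
    """Same quantity with the finite population correction (sampling without replacement)."""
    return math.sqrt((N - n) / (N - 1)) * sigma / math.sqrt(n)

sigma, n = 10.0, 50  # arbitrary illustrative values
for N in (1_000, 10_000, 1_000_000, 1_000_000_000):
    plain = standard_error(sigma, n)
    corrected = standard_error_fpc(sigma, n, N)
    print(f"N={N:>13,}  plain={plain:.4f}  corrected={corrected:.4f}  "
          f"ratio={corrected / plain:.6f}")
```

At N = 20n (the first row) the correction shaves roughly 2.5% off the standard deviation, and the ratio heads toward 1 as N grows.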

Even more absurd exception: One requirement that the Central Limit Theorem does have is that the population standard deviation must be nonzero, i.e., σ>0, which does rule out having a population size of just one. But, c'mon, if that were the case then what you're doing isn't really sampling or inferential statistics in the first place, now is it?

In summary: If anything, a larger population size makes the statistics easier, and the math is simplest when you assume an infinite population size in the first place. Other than that, population size has no bearing on the math behind your estimation or surveying procedure.

One final, really simple observation: If an opinion poll is performed at the standard 95% confidence level, then its margin of error can basically be calculated by E = 1/√n. (Compare to the formula for standard deviation above; the σ disappears due to a particular very convenient substitution and cancellation.) Does the population size N appear anywhere in this formula? Nope -- it's fundamentally irrelevant to the process.
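To put numbers on that, here is a small sketch that applies the rule of thumb to two of the sample sizes quoted at the top (904 and 4,600). The "exact" version assumes the usual worst case for a proportion, p = 0.5 with z = 1.96, which is presumably the substitution alluded to above: σ = √(p(1-p)) = 0.5 and z ≈ 2, so 2 × 0.5/√n = 1/√n. The function names are hypothetical.

```python
import math

def margin_of_error_exact(n, p=0.5, z=1.96):
    """95% margin of error for a proportion, worst case p = 0.5 by default."""
    return z * math.sqrt(p * (1 - p) / n)

def margin_of_error_rule(n):
    """Rule-of-thumb version: E = 1/sqrt(n)."""
    return 1 / math.sqrt(n)

for n in (904, 4_600):
    print(f"n={n:>5,}  rule of thumb={margin_of_error_rule(n):.4f}  "
          f"exact={margin_of_error_exact(n):.4f}")
```

That works out to about 3.3% for the 904-vote poll and about 1.5% for the 4,600-person survey, and neither function takes N as an argument.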

(I've written about this before, but I wanted a version that was a bit more -- ahem -- direct, for posterity's sake.)

2010-06-03

Stuff that Shouldn't Work

Making practice or test exercises is harder than you might think before becoming a teacher. If you ever make a problem up on the fly while lecturing, it's highly probable that you'll create something with hideous fractions, irrational or imaginary numbers, extraneous solutions, etc., that you didn't want, which winds up sidetracking you from the point you were trying to make.

Another pitfall is creating problems that are singularities, i.e., the correct answer can be produced by some completely incorrect process, one that won't work for any other problem of the same nature. For remedial math students, this is almost a nightmare scenario, since their capacity to correctly generalize from the specific to the abstract is already shaky and confused as it is.

As just one example, here's one of my favorites from the algebra workbook we use at my school (custom edition produced by other teachers in the same department):
If 1.05x = 22.05, then x = ?
Now, the correct process is to divide both sides by 1.05, and see that x = 22.05/1.05 = 21. But horribly, if a student mistakenly subtracts 1.05, then they also get the same answer! Say x = 22.05 - 1.05 = 21. Thus, this exercise allows a student to "submarine" a totally broken process (answers are multiple-choice in the book), giving them apparent confirmation that they're doing the right thing when they're absolutely not. (Note that this particular exercise was changed in a newer edition after I pointed it out.)
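A quick way to screen a problem of the form ax = b for this trap is sketched below. The two answers collide exactly when b/a = b - a, i.e., when b = a²/(a-1) for a ≠ 1. The function names are made up for this sketch, and exact fractions are used so that decimal rounding doesn't muddy the comparison.

```python
from fractions import Fraction

def divide_answer(a, b):
    """Correct process for a*x = b: divide both sides by a."""
    return b / a

def subtract_answer(a, b):
    """Broken process: subtract a from b instead."""
    return b - a

def collides(a, b):
    """True if the broken process happens to give the correct answer."""
    return a != 0 and divide_answer(a, b) == subtract_answer(a, b)

def colliding_rhs(a):
    """The one right-hand side b (for a not 0 or 1) where both processes agree."""
    return a * a / (a - 1)

a = Fraction("1.05")
print(collides(a, Fraction("22.05")))  # the workbook problem
print(collides(a, Fraction("23.10")))  # same coefficient, different right side
print(colliding_rhs(a))                # the right side to avoid for this coefficient
```

The first check comes out True and the second False: with 1.05 as the coefficient, 22.05 is the single right-hand side that hides the mistake.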

Enough prelude. The thing I'm trying to get around to is that last night I saw the "crown jewel" for this kind of problem, as part of a set of practice problems for the ACT Compass Test in Algebra. (You can actually see it here: "Sample Math Test Questions: Numerical Skills/Pre-Algebra and Algebra", Algebra item #14). Something like this:
For x ≠ 3, reduce (x² - 9)/(x-3).
Now, the point of this exercise is to practice factoring (in this case, the top is a "difference of squares") and then cancel like factors on the top & bottom. Write: (x² - 9)/(x-3) = (x+3)(x-3)/(x-3) = x+3.

But last night my students got all weirded out when I was writing that much (there's an additional wrinkle in the Compass problem, but it's not germane to my point) and said they got the right answer with a lot less work. They explained: "Divide x² on top by x on bottom and get x. Divide -9 on top by -3 on bottom and get +3. There's the answer, x+3."

Now obviously this is a horribly mutilated process (and not uncommon!), thinking that you can divide individual terms in a rational expression piecemeal. (My best explanation, not that it gets fantastic traction, is always "Division distributes across addition, so if you divide by x, you have to divide every term by x.") But the really crazy, unique thing about this problem is that the broken process actually works for every possible problem of this format!

Consider all possible ways of constructing a "difference of squares" on top, and one of its canceling factors on the bottom.
Case 1: Say you're reducing (a²-b²)/(a+b). Correct process: (a²-b²)/(a+b) = (a+b)(a-b)/(a+b) = a-b. Incorrect process: (a²-b²)/(a+b) = a²/a - b²/b = a-b (same answer).
Case 2: Say you're reducing (a²-b²)/(a-b). Correct process: (a²-b²)/(a-b) = (a+b)(a-b)/(a-b) = a+b. Incorrect process: (a²-b²)/(a-b) = a²/a - b²/(-b) = a+b (also the same answer).
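The two cases can also be spot-checked numerically. The sketch below plugs in small integer values for a and b (a brute-force check, not a proof), and then tries one expression that is not a difference of squares to show the piecemeal process failing. The function names and the counterexample (x² - 8)/(x - 2) at x = 5 are chosen purely for illustration.

```python
from fractions import Fraction

def correct_quotient(top1, top2, bot1, bot2):
    """(top1 + top2) / (bot1 + bot2), computed properly."""
    return Fraction(top1 + top2, bot1 + bot2)

def piecemeal_quotient(top1, top2, bot1, bot2):
    """The broken process: top1/bot1 + top2/bot2, term by term."""
    return Fraction(top1, bot1) + Fraction(top2, bot2)

mismatches = []
for a in range(1, 10):
    for b in range(1, 10):
        for sign in (+1, -1):  # +1 is Case 1 (a+b below), -1 is Case 2 (a-b below)
            if a + sign * b == 0:
                continue  # zero denominator, excluded like x ≠ 3 in the exercise
            args = (a * a, -b * b, a, sign * b)
            if correct_quotient(*args) != piecemeal_quotient(*args):
                mismatches.append((a, b, sign))
print(mismatches)  # prints []

# Not a difference of squares: (x² - 8)/(x - 2) at x = 5.
counter = (25, -8, 5, -2)
print(correct_quotient(*counter), piecemeal_quotient(*counter))  # prints 17/3 9
```

So the broken process never disagrees on a difference of squares, but it falls apart as soon as the constant on top isn't the square of the one on the bottom.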

So not only does the "broken" process work for all permutations of this kind of problem, it even manages to get all the signs correct regardless of how those have been set up. Arrrghh!!!

(Silver lining: At least two of my students had the courage to tell me that's what they'd done, and I had the presence of mind to listen to it last night. I've used this practice test for about 5 years without anyone pointing out that they were doing it like that.)