Thursday, June 7, 2012

Paucity of P-Values

How Effect Size is Immensely More Important than P-Value; Small Effect Sizes for Educational Techniques Explainable by the Pygmalion Effect; Reporting Bias; Positive Outlier in Hostos Study

One of the very nice things about the recent conference papers was getting a chance to dig into statistical research in a field that I'm actually knowledgeable about. I've taught college statistics for approximately 7 years at this point. In the last year I started assigning a research project to report on and interpret a medical article of the student's choice from JAMA -- which is a fair bit of work for me, since I'm usually unfamiliar with the medical issues and terminology involved (and as an accidental side effect of this assignment, I've gotten a rapid-immersion pickup of far more medical information than I ever expected). So here's an opportunity to see how our statistics apply to actual math education issues.

In my statistics course, P-value statements are among the "crown jewels" of the course (assessing the reasonableness of a hypothesis such as "do average test scores increase with this technique?"), and frequently the last thing we do in the course. It takes the whole semester as prep-work, and then about a week of lectures on the subject. It's an interesting and clever piece of math which can frequently establish whether results in the population are, on average, improved. For example: A JAMA medical journal article might say, "Infection was... increased with number of sexual partners (P < .001 for trend) and cigarettes smoked per day (P < .001 for trend)." [Gillison, 2012, "Prevalence of Oral HPV Infection in the United States, 2009-2010"] As our textbook would say, this indicates extremely strong evidence that the claim is true for the population in general (lower P-values being better; loosely, the P-value is the chance of seeing results this extreme by luck alone if no real effect existed).

The thunderbolt realization I got from the math-education articles I've been reading is that suddenly, I kind of don't give a crap about the P-values. What I really care about is the effect size: how much did scores go up (if at all)? We want to make a cost-benefit analysis on completely overhauling our classes; is it worthwhile? Granted that some increase exists -- is it useful, or negligibly small?

A textbook will usually discuss this briefly ("statistical significance is not the same as practical significance"), but until now I didn't realize how immensely critical that is. Several of the papers I'm looking at take some delight in spending several paragraphs explaining what a P-value is, and how it can establish overwhelming likelihood of (some, possibly negligible?) increase in average test scores. Others get sloppy and say "these findings are very significant!" without specifying statistical significance -- which is to say, possibly not actually significant at all. P-values for some change are somewhat interesting, and in the JAMA article I think they're worth the 3-or-so words expended on them ("P < .001 for trend"), but not any more than that. Most of our math-instruction papers gloss over effect size entirely, even though, as Wikipedia puts it:
Reporting effect sizes is considered good practice when presenting empirical research findings in many fields. The reporting of effect sizes facilitates the interpretation of the substantive, as opposed to the statistical, significance of a research result. Effect sizes are particularly prominent in social and medical research. (Wikipedia, "Effect Size")
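To see concretely why statistical significance is not the same as practical significance, here's a small sketch (with made-up passing rates and sample sizes, purely for illustration) showing that the exact same tiny 0.5% improvement in passing rates goes from "nowhere near significant" to "highly significant" just by inflating the sample size -- the P-value measures detectability, not importance:

```python
from math import erf, sqrt

def two_prop_z_test(x1, n1, x2, n2):
    """Two-sided two-proportion z-test with pooled standard error
    (the standard procedure, e.g. Weiss Sec 12.3).
    x = number passing, n = group size."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    # Normal CDF via the error function: Phi(z) = (1 + erf(z/sqrt(2))) / 2
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Same tiny effect (60.0% vs 60.5% passing), two very different sample sizes:
_, p_small = two_prop_z_test(120, 200, 121, 200)                 # 200 per group
_, p_big = two_prop_z_test(240_000, 400_000, 242_000, 400_000)   # 400,000 per group
print(f"small sample: P = {p_small:.3f}")   # nowhere near significant
print(f"huge sample:  P = {p_big:.2g}")     # "highly significant" -- same 0.5% effect
```

With a big enough study, any nonzero effect, no matter how negligible, will eventually clear the P < .05 bar; the effect size is what tells you whether anyone should care.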

This perhaps brings to mind a joke:
"There are many kinds of intelligence: practical, emotional. And then there’s actual intelligence, which is what I’m talking about." (Jack Donaghy, 30 Rock)

So the truth is, most of the recent conference papers demonstrate fairly small effect sizes from any of the various techniques tried; for those that showed any significance at all, it's something like a 2, 5, or 7% increase in final exam scores or overall passing rates (as one said, "about two-thirds of a letter grade"). Is that worth the effort of entirely overhauling an educational program? It seems to me that effect sizes of this amount are quite likely to be accounted for by one or more of the following well-known testing phenomena:
  • Novelty Effect -- Subjects tend to perform better for a limited time after any kind of change in the environment.
  • Hawthorne Effect -- Subjects in an experiment may increase performance because they know they're being studied.
  • Pygmalion Effect -- If a teacher simply expects students to perform better, then students do in fact perform better.
Let's think about that last one a bit more. It seems pretty well-known that if researchers are invested in the outcome of their research in an educational setting, results tend to track their expectations and incentives (perhaps they put more energy than usual into the new technique, etc. -- something that won't scale to other instructors in general). In medicine, that's analogous to the reason you want double-blind trials. Robert Rosenthal presented findings (in a meta-analysis of hundreds of experiments) that interpersonal expectancy by the teacher has a mean effect size on learning and ability of r = 0.26 -- that is, r^2 ≈ 0.07, such that it alone explains about 7% of the variation seen in outcomes (Rosenthal, Dec-1994, "Interpersonal Expectations: A 30-Year Perspective", in Current Directions in Psychological Science).
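The arithmetic behind that conversion is just squaring the correlation (r^2 is the proportion of variance explained):

```python
# Rosenthal's mean interpersonal-expectancy effect size, as cited above
r = 0.26
variance_explained = r ** 2   # 0.0676, i.e. about 7% of outcome variation
print(f"r = {r}, r^2 = {variance_explained:.4f}")
```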

So now let's momentarily limit ourselves to considering the studies from this conference that did not have the primary investigators specially teaching the experimental groups (my hypothesis being: those studies will show reduced effect sizes with respect to overall passing rates). There are 3 of the 10 studies where this can be established:
  • BMCC (Algebra), which had randomized instructor assignments -- control pass rate 32.8% vs. treatment pass rate 36.4% (sample effect +3.6%; P = 0.3013, not statistically significant).
  • LaGuardia (Algebra), which only made extra outside tutors available (no change to in-class teaching) -- control pass rate 56.6% vs. treatment pass rate 58.9% (sample effect +2.3%; P = 0.471, not statistically significant).
  • Brooklyn College (Precalculus), which scrupulously avoided having investigators teach. They report fail/withdraw rates and a somewhat questionable statistical procedure (it treats the control group as the population, generating P = 0.0869), which I'll re-do myself here. Control pass (non-F/W) rate was 67.46%, treatment pass rate 77.55% (sample effect +10.09%; running a two-proportion z-test [per Weiss Sec 12.3] gives me P = 0.0749 -- only moderately strong evidence, not strong or very strong, for the trend according to my book). A little more info on that last one: a two-proportion z-interval calculation for the improvement in population passing rate gives (95% C.I.: -2.42% to +22.60%), which is to say, the demonstrated effect size is less than the margin of error.
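For anyone who wants to check interval calculations like that last one, here's a sketch of the standard two-proportion z-interval (Wald interval with unpooled standard error). The exact Brooklyn College group sizes aren't reproduced here, so the counts 85/126 and 38/49 below are placeholders that merely match the reported percentages (67.46% and 77.55%), not the study's actual data:

```python
from math import sqrt

def two_prop_ci(x1, n1, x2, n2, z_crit=1.96):
    """95% confidence interval for the difference in population
    proportions p2 - p1 (Wald interval, unpooled standard error).
    x = number passing, n = group size."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p2 - p1
    return diff - z_crit * se, diff + z_crit * se

# Placeholder counts matching the reported pass rates (not the actual study n's):
lo, hi = two_prop_ci(85, 126, 38, 49)
print(f"95% CI for the passing-rate improvement: {lo:+.2%} to {hi:+.2%}")
```

Like the interval quoted in the text, this one straddles zero: the data are consistent with anything from a small decline to a large improvement, which is exactly why the +10% sample effect can't be taken at face value.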

The other interesting lesson here is that uniformly, there are always "reasons why more research is called for" (or some such). If a technique did not show improvement, then there's a paragraph explaining extenuating or confounding reasons that could be fixed in a future round of research (but there's never a parallel explanation for why a significant result might be accidental or one-time-only). Out of the 10 research papers I'm looking at, no one ever said, "The results of this study show no evidence for this technique improving scores, and therefore we recommend not pursuing it in the future." (I guess that can be called "reporting bias".) Likewise, the introduction from the university goes on about the many "positive effects" of the various studies, but again, I'm not seeing effect sizes that are tremendously useful.

There is one outlier in all this. The study from Hostos claims a near-doubling of remedial class passing rates (from 24% to 43%) when online software is used for homework assignments (specifically, MathXL). It's a short report, a bit unclear on the study setup and on whether these results are just for Arithmetic or also for Algebra classes. (Note that the following report from City College, on using Maple TA software for Precalculus classes, showed no statistical difference in performance.) I trail-blazed using MathXL myself at another college about 10 years ago, but didn't find as much improvement as I expected at the time (plus lots of technical complaints, can't-get-online excuses, inaccessibility to vision-impaired students, etc.). I'd like to see clearer information on exactly how this was achieved.

(Consider how this relates to recent troubling revelations that most published medical results cannot be reproduced.)


  1. Follow-up: The tremendous positive results shown at Hostos were in fact only for Arithmetic. There was no statistically significant effect shown for Algebra.

  2. Another follow-up: For the Hostos Algebra classes in the study above, in fact, the experimental groups did less well than the old control-group way of presenting homework. Pass rate of Experiment group (MathXL online homework and multiple tutors) was 31%; Control-1 group (MathXL homework and single tutor) 38%; Control-2 group (paper homework and single tutor) 43%. Thanks to Alice Cunningham for providing me with the full paper.