Monday, March 28, 2011

Lindley's Paradox

The Wikipedia description of Lindley's Paradox presents an example in which the Frequentist approach and the Bayesian approach give opposite hypothesis-testing results.

The example is one of testing a certain town for the proportion of births that are boys. The thing that violently strikes me here is the choice of the Bayesian prior: P(theta = 0.5) = 0.5, i.e., the advance assumption that it's 50% likely for the proportion to be exactly equal to 0.5 (with the other 50% chance spread uniformly over all points from 0 to 1).

I mean: What? Why would I conceivably assume that? If I broadly picture real numbers as being continuous, then my instinct would be to assume that it's almost impossible for any given number to be exactly the parameter value, i.e., I'd assume P(theta = 0.5) = 0. Even if I didn't reason that way, I otherwise have copious evidence that human births aren't really 50/50; there are very clearly more boys born than girls -- so if anything I'd put the most likely prior value somewhat above 0.5.

Is that really how Bayesians are supposed to choose their prior? (It seems atrocious!) Or is this just a fantastically mangled example at Wikipedia?
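For anyone who wants to check the arithmetic, here is a minimal sketch of the two calculations in Python. The birth counts (49,581 boys out of 98,451 births) are the figures I recall from the Wikipedia example and should be treated as an assumption, and the function names are my own. The frequentist side uses a normal approximation to the binomial; the Bayesian side uses the prior described above (half the mass on theta = 0.5, half spread uniformly over 0 to 1).

```python
import math

def two_sided_p_value(k, n):
    """Two-sided p-value for H0: theta = 0.5, via the normal approximation."""
    z = (k - n / 2) / math.sqrt(n / 4)
    return math.erfc(abs(z) / math.sqrt(2))

def posterior_h0(k, n, prior_h0=0.5):
    """Posterior P(theta = 0.5 | k) under a point mass plus a uniform prior."""
    # Marginal likelihood under H0: Binomial(n, 0.5) pmf at k, computed in logs.
    log_m0 = (math.lgamma(n + 1) - math.lgamma(k + 1)
              - math.lgamma(n - k + 1) + n * math.log(0.5))
    m0 = math.exp(log_m0)
    # Marginal likelihood under H1: the binomial pmf integrated over a
    # uniform prior on theta, which is exactly 1 / (n + 1).
    m1 = 1.0 / (n + 1)
    return prior_h0 * m0 / (prior_h0 * m0 + (1 - prior_h0) * m1)

# Assumed counts (recalled from the Wikipedia example, not from this post).
boys, births = 49581, 98451

p = two_sided_p_value(boys, births)
post = posterior_h0(boys, births)
print(f"two-sided p-value: {p:.4f}")
print(f"P(H0 | data):      {post:.4f}")
```

With those assumed counts, the p-value lands below 0.05 (reject H0) while the posterior probability of H0 comes out well above one half (favor H0), which is the disagreement the article is describing.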

6 comments:

  1. It's just a contrived example to illustrate the paradox. The paradox still works if you choose to have a prior that's a narrow Gaussian at 0.5 on top of a much broader distribution (flat, or anything wide).

    The point is that the paradox most often rears its head when the prior is broad with a high narrow region in addition, and a flat prior with a delta function is just the simplest in many respects.
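The narrow-Gaussian version can be sketched numerically as follows. This is only an illustration under stated assumptions: the counts are the ones I recall from the Wikipedia example, the Gaussian width `sigma` and the 50/50 mixture weight are arbitrary choices of mine, and the integral is a plain midpoint rule rather than anything clever.

```python
import math

def log_binom_pmf(k, n, theta):
    """Log of the Binomial(n, theta) pmf at k."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return log_choose + k * math.log(theta) + (n - k) * math.log1p(-theta)

def posterior_narrow(k, n, sigma, weight=0.5, steps=4000):
    """Posterior weight of the narrow Gaussian component centred at 0.5.

    Prior: weight * Normal(0.5, sigma) + (1 - weight) * Uniform(0, 1).
    Assumes sigma is small enough that 0.5 +/- 8*sigma stays inside (0, 1).
    """
    lo, hi = 0.5 - 8 * sigma, 0.5 + 8 * sigma
    h = (hi - lo) / steps
    m_narrow = 0.0
    for i in range(steps):
        theta = lo + (i + 0.5) * h  # midpoint of the i-th slice
        density = (math.exp(-0.5 * ((theta - 0.5) / sigma) ** 2)
                   / (sigma * math.sqrt(2 * math.pi)))
        m_narrow += math.exp(log_binom_pmf(k, n, theta)) * density * h
    m_broad = 1.0 / (n + 1)  # exact marginal under the flat component
    return weight * m_narrow / (weight * m_narrow + (1 - weight) * m_broad)

# Assumed counts and an arbitrary illustrative width.
boys, births = 49581, 98451
for sigma in (0.001, 0.0001, 0.00001):
    print(sigma, round(posterior_narrow(boys, births, sigma), 4))
```

As `sigma` shrinks, the narrow component behaves more and more like the delta function, so the result should approach the point-mass answer.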

  2. That makes a little more sense, but I'm having trouble wrapping my head around the article's general-description statements "a prior distribution that favors H0 weakly" (either example seems like favoring it strongly) and "It is a result of the prior having a sharp feature at H0" (whereas your example seems to me to have a non-sharp feature).

    Thanks for addressing this -- do you have a link or citation to a better presentation/example?

  3. There's been some recent discussion of this:
    http://www.science20.com/quantum_diaries_survivor/jeffreyslindley_paradox-87184
    http://andrewgelman.com/2012/02/untangling-the-jeffreys-lindley-paradox/

    Also, I edited the Wikipedia article to remove "weakly", since that's obviously not the case, and to add a more rational comparison of the Bayesian and Frequentist approaches, in which they both give the same conclusion.

    And no, speaking as a Bayesian, this is not how one would usually choose a prior (at least, without a great deal of previous experience/evidence).

  4. ^ Very informative, thanks for posting that! Glad to know I'm not totally alone in my intuition that this seems like a bungled example/prior.

  5. I am rather surprised that this is being discussed among statisticians.

    As I see it, the Bayesian case is a natural consequence of extremely flawed prior probabilities. Bayesian logic is a tool, and as with any other tool, you should understand what it does.

    Setting a nonzero prior probability on a point value makes the probability density infinite at that point. Compare that to the finite density in an infinitesimal interval I = <0.5, 0.5+epsilon>, infinitesimally close to the point value theta = 0.5.

    It is like saying: "In this murder case, I already have strong evidence that Mr. X could be the perpetrator. There are millions of other potential perpetrators, and the total probability of these others is not insignificant. But now that we have this new evidence about who was close to the site at the time, the few that were near add up to next to nothing, and the rest of the potential suspects were far from the site, so their probability, weighted by their distance, remains very low even when added over the millions of individuals."

    This is sound logic - but only if the premises are true. Do you really have prior evidence that implicates Mr. X millions of times more strongly than his nearby neighbors?

    Even if the single-point null hypothesis is replaced with a very narrow range, the prior probability density becomes huge inside that range, while the density just outside is abysmally small by comparison. Does that reflect the prior knowledge about the question being studied? If that were the case, it would have to affect your reasoning, and Bayesian logic takes that into account. But in the example, you just don't have any such knowledge. Garbage in, garbage out.
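To put a rough number on how much the conclusion rides on that prior mass, here is a small sketch that holds the data fixed and varies only the prior probability placed on theta = 0.5. The counts are the ones I recall from the Wikipedia example (an assumption, not something stated in this thread), the alternative is taken to be uniform on 0 to 1, and the function names are hypothetical.

```python
import math

def bayes_factor_point_vs_uniform(k, n):
    """Bayes factor for H0: theta = 0.5 against H1: theta ~ Uniform(0, 1)."""
    log_m0 = (math.lgamma(n + 1) - math.lgamma(k + 1)
              - math.lgamma(n - k + 1) + n * math.log(0.5))
    m1 = 1.0 / (n + 1)  # binomial pmf integrated over the uniform prior
    return math.exp(log_m0) / m1

def posterior_from_bayes_factor(bf, prior_h0):
    """Turn a Bayes factor and a prior P(H0) into a posterior P(H0 | data)."""
    odds = bf * prior_h0 / (1 - prior_h0)
    return odds / (1 + odds)

# Assumed counts (recalled from the Wikipedia example).
boys, births = 49581, 98451
bf = bayes_factor_point_vs_uniform(boys, births)
print(f"Bayes factor in favour of H0: {bf:.1f}")

# Same data, different amounts of prior mass on the single point 0.5.
for prior_h0 in (0.5, 0.1, 0.01, 0.001):
    post = posterior_from_bayes_factor(bf, prior_h0)
    print(f"prior P(H0) = {prior_h0:<6} -> posterior P(H0) = {post:.3f}")
```

The data never change in that loop; only the prior mass on the point does, and the posterior for H0 drops below one half once that mass is small enough.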

    Replies
    1. Everything you said here makes sense to me. Not being trained in Bayesian statistics, I had to throw up my hands at that example, which seemed simply ludicrous to me. Thanks for writing up your observations and helping convince me I'm not crazy; I appreciate it!
