When Dice Fail

Some of the more popular posts on my gaming blog have been about how to check for balanced dice, using Pearson's chi-square test (testing a balanced die, testing balanced dice, testing balanced dice power). One of the observations in the last of those posts was that "chi-square is a test of rather lower power" (quoting Richard Lowry of Vassar College), to the extent that none of the dice I've checked has ever actually failed the test.

Until now. Here's the situation: A while back my partner Isabelle, preparing entertainment for a long trip, picked up a box of cheap dice at the dollar store around the corner from us. These dice are in the Asian-style arrangement, with the "1" and "4" pip sides colored red (I believe because the lucky color red is meant to offset those unlucky numbers):

A few weeks ago, it occurred to me that these dice are just the right size for an experiment I run early in the semester with my statistics students: namely, rolling a moderately large number of dice in handful batches and comparing convergence to the theoretically-predicted proportion of successes. In particular, the plan is customarily to roll 80 dice and see how many times we get a 5 or 6 (mentally, I'm thinking in my Book of War game, how many times can we score hits against opponents in medium armor -- but I don't say that in class).
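For anyone who wants to try the classroom experiment without a box of dice, here is a minimal simulation sketch. It assumes perfectly fair dice and handfuls of 10 at a time (the batch size and the seed are arbitrary choices for illustration, and `roll_batch` is a hypothetical helper name). It prints the running proportion of 5's and 6's next to the predicted 1/3.

```python
import math
import random

TOTAL_DICE = 80      # number of dice rolled in the class experiment
BATCH_SIZE = 10      # assumed handful size; not specified in the post
HIT_FACES = (5, 6)   # a "success" is a 5 or 6

def roll_batch(rng, size):
    """Roll `size` fair six-sided dice and count how many show a hit face."""
    return sum(1 for _ in range(size) if rng.randint(1, 6) in HIT_FACES)

rng = random.Random(2024)  # arbitrary fixed seed so reruns give the same rolls

p = len(HIT_FACES) / 6                     # theoretical success chance, 1/3
expected = TOTAL_DICE * p                  # about 26.7 hits in 80 dice
sd = math.sqrt(TOTAL_DICE * p * (1 - p))   # about 4.2 hits of natural spread

total_hits = 0
rolled = 0
while rolled < TOTAL_DICE:
    total_hits += roll_batch(rng, BATCH_SIZE)
    rolled += BATCH_SIZE
    print(f"after {rolled:2d} dice: proportion = {total_hits / rolled:.3f} "
          f"(predicted {p:.3f})")

print(f"hits: {total_hits}, expected about {expected:.1f} give or take {sd:.1f}")
```

With fair dice, the count should usually land within about two standard deviations of the expected value, roughly 18 to 35 hits out of 80.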

So when we did that in class last week, it seemed like the number of 5's and 6's was significantly lower than predicted, to the extent that it actually threw the whole lesson under a shadow of suspicion and confusion. I decided that when I got a chance I'd better test these dice before using them in class again. Following the findings of the prior post on the "low power" issue, I knew that I had to get on the order of 500 individual die-rolls in order to get a halfway decent test; in this case with a boxful of 15 dice, it seemed convenient to make 30 batched rolls for 15 × 30 = 450 total die rolls... although somewhere along the way I lost count of the batches and wound up actually making 480 die rolls. Here are the results of my hand-tally sheet:

As you can see at the bottom of that sheet, this box of dice actually does fail the chi-square test, as the \(SSE = 1112\) is in fact greater than the critical value of \(X^2_{\text{crit}} \cdot E = 11.070 \cdot 80 = 885.6\) (at the 5% significance level, with an expected count of \(E = 480/6 = 80\) per face). Or in other words, with a chi-square value of \(X^2 = SSE/E = 1112/80 = 13.9\) and degrees of freedom \(df = 5\), we get a P-value of \(P = 0.016\) for this test of the null hypothesis that the dice are balanced; that is, if the dice really were balanced, there would be less than a 2% chance of getting an SSE value this high by natural variation alone.
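The arithmetic above can be double-checked with a short script. This is only a sketch: the per-face tallies live on the hand-tally sheet and are not reproduced here, so it starts from the totals quoted in the text (480 rolls, SSE = 1112). The helper `chi2_sf` is a hypothetical name; it uses the standard closed-form upper-tail probability for integer degrees of freedom, so nothing beyond Python's standard library is assumed.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X^2 >= x) for a chi-square variable, integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^j / j!
        term = math.exp(-half)
        total = term
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return total
    # Odd df: erfc term plus a finite sum of x^(j - 1/2) / (1 * 3 * ... * (2j - 1))
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return total

rolls = 480
faces = 6
expected = rolls / faces          # E = 80 expected rolls per face
sse = 1112                        # sum of squared errors from the tally sheet
df = faces - 1                    # 5 degrees of freedom

chi_sq = sse / expected           # X^2 = SSE / E
critical_sse = 11.070 * expected  # 5% critical value for df = 5, scaled by E
p_value = chi2_sf(chi_sq, df)

print(f"X^2 = {chi_sq:.1f}, critical SSE = {critical_sse:.1f}")
print(f"P-value = {p_value:.3f}")  # about 0.016
print("reject balance at 5%:", sse > critical_sse)
```

Dividing SSE by E works here only because every face has the same expected count; with unequal expected counts, each squared error would need to be divided by its own expected value before summing.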

In retrospect, it's easy to see what the manufacturing problem is here: note in the frequency table that it's specifically the "1"s and the "4"s, the specially red-colored faces, that are appearing in a preponderance of the rolls. In particular, the "1" face on each die is drilled like an enormous crater compared to the other pips; it's about 3 mm wide and about 2 mm deep (whereas other pips are only about 1 mm in both dimensions). So a die with the "6" up is top-heavy: the heavier "6" side, opposite the hollowed-out "1", tends to roll down to the bottom, leaving the "1" on top more than anything else. Also, the corners of the die are very rounded, making it easier for them to turn over freely or even get spinning by accident.

Perhaps if the experiment in class had been to count 4's, 5's, and 6's (that is: hits against light armor in my wargame), I never would have noticed the dice being unbalanced (because together those faces come up about as often as the 1's, 2's, and 3's together)? On the one hand my inclination is to throw these dice out so they never get used again in our house by accident; but on the other hand maybe I should keep them around as the only example that the chi-square test has managed to reject to date.