Underpowered: why we won’t claim skill yet — and when we would.
This page exists because the honest answer to “is your forecasting any good?” is, for now: nobody can know — including us. That isn’t modesty. It’s arithmetic. This is the arithmetic, applied to our own record first.
1 · The coin-flip problem
Suppose a forecaster has called two resolvable questions and both resolved their way. Impressive? A coin does that 25% of the time. Five for five? Still ~3%, about the odds of rolling double sixes (1 in 36). With a handful of resolutions, perfect performance is statistically indistinguishable from luck, and any “hit rate” carries a confidence interval so wide it covers both “oracle” and “coin.” Our own hit-rate CI at n=7 decisive calls spans roughly 25% to 84%, which is to say it tells you almost nothing yet. We print it anyway, with the interval, because that’s what the interval is for.
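The arithmetic above is easy to check. A minimal sketch, assuming a fair coin as the null and a Wilson score interval for the hit rate. This page does not say which interval method we use or how the 7 decisive calls split, so `wilson_interval` and the 4-of-7 split are illustrative assumptions; that split happens to give a range close to the quoted 25% to 84%.

```python
import math

def luck_probability(n_calls: int, p_chance: float = 0.5) -> float:
    """Chance that pure guessing gets every one of n_calls right."""
    return p_chance ** n_calls

def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a hit rate (z = 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = hits / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

two_for_two = luck_probability(2)      # 0.5 ** 2
five_for_five = luck_probability(5)    # 0.5 ** 5
# Hypothetical split: 4 hits out of 7 decisive calls (an assumption).
low, high = wilson_interval(hits=4, n=7)
print(f"2/2 by luck: {two_for_two:.1%}, 5/5 by luck: {five_for_five:.1%}")
print(f"hit-rate interval at n=7: {low:.0%} to {high:.0%}")
```

The interval narrows only with the square root of n, which is why a handful more resolutions barely moves it.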
2 · The honest yardstick is the market, and beating it needs ~200
Accuracy alone is the wrong test: most resolvable questions are easy, and a thermometer “predicts” most of them. The test that matters is Brier skill versus the market price at the moment of the call: did the forecaster add information the crowd didn’t have? Per-question Brier differences are noisy (typical spreads ~0.1–0.2), so detecting a realistic skill margin (a few points of Brier, i.e. a few hundredths) at 95% confidence needs on the order of 200 independent resolved questions; ~30 is roughly where a trend first separates from noise. Below ~30, the only honest label is the one we stamp on every surface: underpowered.
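As one concrete reading of “Brier skill versus the market,” the sketch below scores a forecaster and the market price on the same resolved questions. It assumes binary outcomes (0 or 1), probabilities in [0, 1], and the common skill-score form 1 − BS_forecaster / BS_market. This page does not spell out the exact formula, and the function names and sample numbers are made up for illustration.

```python
from statistics import mean

def brier(prob: float, outcome: int) -> float:
    """Squared error of a probability against a 0/1 outcome."""
    return (prob - outcome) ** 2

def brier_skill_vs_market(forecasts, market_prices, outcomes) -> float:
    """Positive: beat the market price at call time. Zero or below: did not."""
    if not outcomes or not (len(forecasts) == len(market_prices) == len(outcomes)):
        raise ValueError("need equal-length, non-empty inputs")
    bs_forecaster = mean(brier(f, o) for f, o in zip(forecasts, outcomes))
    bs_market = mean(brier(m, o) for m, o in zip(market_prices, outcomes))
    if bs_market == 0:
        raise ValueError("market was perfect; skill score is undefined")
    return 1 - bs_forecaster / bs_market

# Made-up example: three resolved questions.
forecasts = [0.8, 0.3, 0.6]      # forecaster's probabilities at call time
market_prices = [0.7, 0.4, 0.5]  # market price at the same moment
outcomes = [1, 0, 1]             # how each question resolved
skill = brier_skill_vs_market(forecasts, market_prices, outcomes)
print(f"Brier skill vs market: {skill:+.2f}")
```

A positive number on three questions is exactly the kind of result the rest of this section says not to trust.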
| resolved questions | what it can support |
|---|---|
| < 30 | nothing about skill — record-keeping only (where we are) |
| ~30 | a visible trend; wide error bars; no claims |
| ~200 | a defensible market-relative skill estimate with a CI |
| any n, no pre-registration | nothing — selection decides the record, not skill |
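The ~200 figure is consistent with a textbook sample-size calculation for a mean of paired differences. A rough sketch, assuming a normal approximation, a two-sided 5% test, 80% power, a per-question spread of 0.15 and a true edge of 0.03 Brier. The power level and the specific inputs are assumptions picked from within the ranges quoted above, not figures this page states.

```python
import math
from statistics import NormalDist

def questions_needed(sd_diff: float, edge: float,
                     alpha: float = 0.05, power: float = 0.80) -> int:
    """Resolved questions needed to detect a mean per-question Brier edge.

    sd_diff: standard deviation of per-question Brier differences
    edge:    true mean Brier improvement over the market
    """
    if sd_diff <= 0 or edge <= 0:
        raise ValueError("sd_diff and edge must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) * sd_diff / edge) ** 2)

n_mid = questions_needed(sd_diff=0.15, edge=0.03)
# Sensitivity to the assumed spread, holding the edge at 0.03.
for sd in (0.10, 0.15, 0.20):
    print(f"sd={sd:.2f}: n >= {questions_needed(sd, 0.03)}")
```

With these inputs the answer lands just under 200. Halving the edge quadruples the requirement, so “on the order of 200” is an order of magnitude, not a finish line.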
3 · Why pre-registration is half the battle
A track record is only as honest as its denominator. Grade yourself after the fact and you will — humanly, inevitably — count the wins and narrativize the misses. That’s why every number we publish is hash-anchored the night it’s made, before the outcome is knowable, and why the receipts grade everything that resolves: wins, losses, and the embarrassing ones. No n is meaningful without this.
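One way to picture hash-anchoring: serialize the forecast deterministically and publish its digest before the outcome is knowable. This is a generic sketch using SHA-256 over canonical JSON, not a description of our actual anchoring scheme, which this page does not specify. The record fields and function names are hypothetical.

```python
import hashlib
import json

def forecast_digest(record: dict) -> str:
    """SHA-256 hex digest of a canonical JSON encoding of the record."""
    # Sorted keys and fixed separators make the bytes independent of key order.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def verify(record: dict, published_digest: str) -> bool:
    """True if the record still hashes to the digest published earlier."""
    return forecast_digest(record) == published_digest

# Hypothetical forecast record (all field names are assumptions).
record = {
    "question_id": "example-001",
    "probability": 0.62,
    "market_price": 0.55,
    "made_at": "2025-01-01T23:00:00Z",
}
digest = forecast_digest(record)
tampered = dict(record, probability=0.90)  # an after-the-fact edit
print(verify(record, digest), verify(tampered, digest))
```

The digest only proves the content was not changed. Proving when it was made still requires publishing the digest somewhere a third party can date.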
4 · How we get from n=2 to powered — without waiting years
Slow geopolitics questions resolve in quarters. So the ledger fills on two clocks: marquee geopolitical theses (the draw), and fast-clock macro and markets calls — Fed, CPI, oil — that resolve in days (see Resolution Week). Meanwhile two interim instruments run that don’t need resolutions at all: the pre-registered convergence study (does the market move toward our number after we diverge? — primary result so far: null, published anyway), and external benchmarks that grade hundreds of questions per season on someone else’s scoreboard.
underpowered, machine-readably: "claims_demonstrated_edge": false in /api/receipt.json.
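A consumer could check that flag before repeating any accuracy claim. A minimal sketch that parses a receipt payload already fetched from /api/receipt.json. Only the claims_demonstrated_edge key comes from this page; the rest of the payload shape and the function name are assumptions.

```python
import json

def edge_is_claimed(receipt_json: str) -> bool:
    """True only if the receipt explicitly sets claims_demonstrated_edge to true."""
    receipt = json.loads(receipt_json)
    # A missing key, or any value other than JSON true, counts as "no claim".
    return isinstance(receipt, dict) and receipt.get("claims_demonstrated_edge") is True

sample = '{"claims_demonstrated_edge": false}'  # the state described above
print(edge_is_claimed(sample))  # False
```

Treating a missing key as “no claim” is a deliberate choice here: the default should be the underpowered label, not the absence of one.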
5 · How to use this on anyone (please do)
Ask three questions of any forecaster, pundit, or AI tool claiming accuracy: (1) Where is the pre-registered denominator — every call, timestamped before resolution? (2) What is n, and is the claim labeled underpowered below ~30? (3) Is the comparison against the market price at call time, or against a strawman? Anyone selling “98% accuracy” without those three answers is selling the absence of a denominator. This page is the standard we accept being held to.