Reference · how to read a track record

Underpowered: why we won’t claim skill yet — and when we would.

This page exists because the honest answer to “is your forecasting any good?” is, for now: nobody can know — including us. That isn’t modesty. It’s arithmetic. This is the arithmetic, applied to our own record first.

1 · The coin-flip problem

Suppose a forecaster has called two resolvable questions and both resolved their way. Impressive? A fair coin does that 25% of the time. Five for five? Still about 3% (0.5^5 ≈ 3.1%), roughly the odds of rolling double sixes with two dice (1 in 36, about 2.8%). With a handful of resolutions, perfect performance is statistically indistinguishable from luck, and any “hit rate” carries a confidence interval so wide it covers both “oracle” and “coin.” Our own hit-rate CI at n=7 decisive calls spans roughly 25% to 84%, which is to say it tells you almost nothing yet. We print it anyway, with the interval, because that’s what the interval is for.
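The arithmetic in this section can be checked in a few lines. The sketch below is illustrative, not the site's actual code: it assumes a Wilson score interval for the hit rate (the page does not say which interval it uses), and the split of 4 hits in 7 calls is an assumption chosen only because it gives a range close to the one quoted above.

```python
from math import sqrt


def all_hits_by_luck(n: int, p: float = 0.5) -> float:
    """Chance that a pure coin-flipper gets all n binary calls right."""
    return p ** n


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial hit rate (z = 1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = hits / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


print(all_hits_by_luck(2))  # 0.25
print(all_hits_by_luck(5))  # 0.03125

# ASSUMPTION: 4 hits out of 7 is a hypothetical split, not a figure from the page.
low, high = wilson_interval(4, 7)
print(f"{low:.2f} to {high:.2f}")  # 0.25 to 0.84
```

Other interval methods (Clopper-Pearson, for example) give somewhat different endpoints at n=7; the conclusion that the interval is too wide to mean anything does not change.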

2 · The honest yardstick is the market, and beating it needs ~200

Accuracy alone is the wrong test: most resolvable questions are easy, and a thermometer “predicts” most of them. The test that matters is Brier skill versus the market price at the moment of the call: did the forecaster add information the crowd didn’t have? Per-question Brier differences are noisy (typical spreads of about 0.1–0.2), so detecting a realistic skill margin (a few hundredths of Brier score) at 95% confidence needs on the order of 200 independent resolved questions; around 30 is roughly where a trend first separates from noise. Below about 30, the only honest label is the one we stamp on every surface: underpowered.
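To make "Brier skill versus the market" concrete, here is a minimal sketch of the per-question comparison. The four calls are invented purely for illustration (they are not from the ledger), and the sign convention (positive means the forecaster beat the market) is an assumption of this sketch.

```python
def brier(prob: float, outcome: int) -> float:
    """Squared error of a probability against a 0/1 outcome (lower is better)."""
    return (prob - outcome) ** 2


# HYPOTHETICAL calls: (forecaster probability, market price at call time, outcome)
calls = [
    (0.80, 0.60, 1),
    (0.30, 0.45, 0),
    (0.70, 0.75, 1),
    (0.55, 0.40, 0),
]

# Market Brier minus forecaster Brier: positive means the forecaster
# was closer to the truth than the market on that question.
diffs = [brier(market, y) - brier(forecast, y) for forecast, market, y in calls]
mean_edge = sum(diffs) / len(diffs)

for d in diffs:
    print(f"{d:+.4f}")
print(f"mean edge over {len(diffs)} questions: {mean_edge:+.4f}")
```

In this toy set the per-question differences swing between roughly +0.12 and -0.14 while the mean is under 0.02, which is the noise problem the paragraph describes: the signal is small relative to the spread.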

resolved questions         | what it can support
< 30                       | nothing about skill; record-keeping only (where we are)
~30                        | a visible trend; wide error bars; no claims
~200                       | a defensible market-relative skill estimate with a CI
any n, no pre-registration | nothing; selection decides the record, not skill
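The ~200 figure in this section can be sanity-checked with the standard sample-size formula for detecting a nonzero mean difference. This is a rough sketch under stated assumptions, not the site's own power calculation: it uses a normal approximation, treats the "spread" as the standard deviation of per-question Brier differences, and assumes 80% power, which the page does not specify.

```python
from math import ceil
from statistics import NormalDist


def required_n(sd_diff: float, margin: float,
               alpha: float = 0.05, power: float = 0.80) -> int:
    """Questions needed to detect a mean Brier edge of `margin`
    (two-sided test at level alpha, normal approximation).

    sd_diff: standard deviation of per-question Brier differences.
    power:   ASSUMED target power; not stated in the source text.
    """
    if sd_diff <= 0 or margin <= 0:
        raise ValueError("sd_diff and margin must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(((z_alpha + z_power) * sd_diff / margin) ** 2)


# Spread of 0.15 with a 0.03 edge lands near 200 questions;
# a smaller edge of 0.02 needs more than twice as many.
print(required_n(0.15, 0.03))
print(required_n(0.15, 0.02))
```

The required n scales with the square of spread over margin, so halving the detectable edge quadruples the number of resolved questions needed. The formula also assumes independent questions; correlated calls (several questions on the same event) carry less information than their count suggests.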

3 · Why pre-registration is half the battle

A track record is only as honest as its denominator. Grade yourself after the fact and you will — humanly, inevitably — count the wins and narrativize the misses. That’s why every number we publish is hash-anchored the night it’s made, before the outcome is knowable, and why the receipts grade everything that resolves: wins, losses, and the embarrassing ones. No n is meaningful without this.
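One common way to anchor a forecast before its outcome is knowable is to publish a cryptographic digest of the record at call time. The sketch below illustrates that general idea only; the page does not describe its own scheme, and the record fields, the canonical JSON encoding, and the function names are all hypothetical.

```python
import hashlib
import json


def commitment(record: dict) -> str:
    """SHA-256 digest of a canonical JSON encoding of a forecast record.

    Sorting keys and fixing separators makes the digest reproducible
    regardless of the order in which the fields were written.
    """
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(record: dict, published_digest: str) -> bool:
    """Check that a revealed record matches the digest published earlier."""
    return commitment(record) == published_digest


# HYPOTHETICAL record; field names are illustrative, not the site's schema.
call = {
    "question_id": "example-001",
    "probability": 0.62,
    "market_price_at_call": 0.55,
    "made_at": "2025-01-01T23:00:00Z",
}

digest = commitment(call)
print(digest)

# Any after-the-fact edit, however small, changes the digest.
tampered = dict(call, probability=0.80)
print(verify(call, digest), verify(tampered, digest))  # True False
```

A digest only proves the record existed unchanged if the digest itself was published somewhere the author cannot rewrite later, with a timestamp others can check. If the forecast is kept private until resolution, a random nonce would also need to go into the record, since a bare probability is easy to guess by brute force.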

4 · How we get from n=2 to powered — without waiting years

Slow geopolitics questions resolve in quarters. So the ledger fills on two clocks: marquee geopolitical theses (the draw), and fast-clock macro and markets calls — Fed, CPI, oil — that resolve in days (see Resolution Week). Meanwhile two interim instruments run that don’t need resolutions at all: the pre-registered convergence study (does the market move toward our number after we diverge? — primary result so far: null, published anyway), and external benchmarks that grade hundreds of questions per season on someone else’s scoreboard.

The standing rule we’ve pre-committed to: no skill claim on any surface — headline, badge, API field — until the distinct-resolved count clears the power floor and the market-relative Brier CI excludes zero. Until then every surface says underpowered, machine-readably: "claims_demonstrated_edge": false in /api/receipt.json.
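The standing rule reads naturally as a two-condition gate. The following is a minimal sketch of that logic, not the actual code behind /api/receipt.json: only the `claims_demonstrated_edge` field name comes from the text above, while the other field names, the power floor of 200, and the normal-approximation confidence interval are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

POWER_FLOOR = 200  # ASSUMED value, taken from the "~200" rule of thumb above


def receipt_fields(diffs: list[float], power_floor: int = POWER_FLOOR,
                   confidence: float = 0.95) -> dict:
    """Decide whether a skill claim is allowed.

    diffs: per-question (market Brier - forecaster Brier) for distinct
           resolved questions; positive means the forecaster beat the market.
    """
    n = len(diffs)
    powered = n >= power_floor
    if n < 2:
        # Not enough data to compute a spread, so no interval and no claim.
        return {"n_resolved": n, "underpowered": True,
                "brier_edge_ci": None, "claims_demonstrated_edge": False}
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    centre = mean(diffs)
    half = z * stdev(diffs) / sqrt(n)
    low, high = centre - half, centre + half
    return {
        "n_resolved": n,
        "underpowered": not powered,
        "brier_edge_ci": [low, high],
        # Both conditions must hold: enough questions AND the whole
        # interval sits above zero (an edge, not just a difference).
        "claims_demonstrated_edge": powered and low > 0,
    }


# A positive-looking edge on 20 questions still yields no claim.
print(receipt_fields([0.05, 0.01] * 10)["claims_demonstrated_edge"])  # False
```

Requiring the lower bound to be above zero, rather than merely requiring the interval to exclude zero, is a design choice of this sketch: an interval entirely below zero also excludes zero, but it demonstrates the opposite of an edge.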

5 · How to use this on anyone (please do)

Ask three questions of any forecaster, pundit, or AI tool claiming accuracy: (1) Where is the pre-registered denominator — every call, timestamped before resolution? (2) What is n, and is the claim labeled underpowered below ~30? (3) Is the comparison against the market price at call time, or against a strawman? Anyone selling “98% accuracy” without those three answers is selling the absence of a denominator. This page is the standard we accept being held to.
