Paired records of (intel context at time T, model reasoning trace, probability, market price, eventual outcome). Built from a live AI agent trading geopolitical prediction markets in public — every signal registered to a tamper-evident chain before resolution. The only forecasting dataset where you can cryptographically prove the model did not see the answer.
Loading live counts…
—
| ForecastBench / GJP / IARPA | Raw Polymarket / Kalshi | Trench Eval Dataset | |
|---|---|---|---|
| Per-sample reasoning trace | — | — | Full Claude text, ~500–2000 chars |
| Model probability + market price paired | aggregate only | market only | both, timestamped |
| Pre-event hash anchor (no hindsight) | — | — | SHA-256 chain + Wayback Machine |
| Input intel snapshot (RSS, native-lang, indicators) | — | — | replay bundle per record |
| Verified outcome from a real market | — | yes | yes (Polymarket / Kalshi settlement) |
Existing forecasting benchmarks publish aggregate forecaster scores. Trench publishes the full reasoning trace per prediction, paired with the exact market price the model saw, with a cryptographic anchor that lets a third party prove the prediction predated the resolution. That last property is what makes the dataset rare — every other forecasting-eval dataset has known hindsight-leakage problems.
Live view of the public sample. Click any row to expand the full
reasoning trace and anchor block. The JSON returned by
/api/dataset/records is byte-identical to the sample JSONL.
The example below is rendered from the most recent record in the public sample — not a fabricated illustration. The full schema (every field, every type) lives at /dataset/schema.
Loading example record…
Every record is independently verifiable in four steps. No trust in TrenchSignals required — the proof relies only on SHA-256 and the Internet Archive.
1. curl https://trenchsignals.io/api/replay-bundle/{replay_bundle_id} 2. sha256sum -- compare to anchor.bundle_sha256 3. curl {anchor.registry_url} # confirm the hash is in that day's chain 4. curl {anchor.wayback_url} # confirm IA capture predates resolution
Full schema and verification protocol: trenchsignals.io/dataset/schema (also at docs/DATASET_EVAL.md). Scoring library that produces the Brier values: trench-core on PyPI (MIT, anyone can recompute byte-identical).
import json from urllib.request import urlopen url = "https://trenchsignals.io/dataset-files/sample/records.jsonl" records = [json.loads(line) for line in urlopen(url) if line.strip()] print(len(records), "records") # Score your model against Claude's reasoning + outcome. for r in records[:3]: print(r["question"]["text"]) print(" Claude p(YES) =", r["prediction"]["probability_yes"]) print(" Outcome =", r["resolution"]["outcome"]) print(" Brier =", r["resolution"]["brier_score"])
import pandas as pd df = pd.read_csv("https://trenchsignals.io/api/dataset/sample.csv") print(df[["prediction_probability_yes", "resolution_outcome", "resolution_brier_score"]].describe()) # Mean Brier by theater: print(df.groupby("question_theater")["resolution_brier_score"].mean())
from datasets import load_dataset ds = load_dataset( "json", data_files="https://trenchsignals.io/dataset-files/sample/records.jsonl", split="train", ) print(ds.features) print(ds[0]["prediction"]["reasoning"][:300]) # Pipe to your own scorer: def score(ex): p = ex["prediction"]["probability_yes"] o = 1 if ex["resolution"]["outcome"] == "YES" else 0 return {"brier": (p - o) ** 2} ds = ds.map(score)
# Sample JSONL (CC-BY-4.0, all 11 sample records): curl -sSL https://trenchsignals.io/dataset-files/sample/records.jsonl \ | head -1 | jq . # Live stats: curl -sSL https://trenchsignals.io/api/dataset/stats | jq . # One anchored record (returns a 404 in the v0.1.0 sample — # anchors begin at v0.2 / records made 2026-05-07 onward): BUNDLE_ID=$(curl -sSL https://trenchsignals.io/api/dataset/records?anchored=true \ | jq -r '.records[0].context.replay_bundle_id') curl -sSL "https://trenchsignals.io/api/replay-bundle/${BUNDLE_ID}" | jq .
v0.1.0 (current public sample) — Iran-theater
binary-resolution markets with Claude's full reasoning trace. The
first 56 records predate the replay-bundle pipeline and carry
anchor: null; from 2026-05-07 onward
every record is fully hash-anchored.
v0.2 (next) — anchored records for Russia/Ukraine, Taiwan/China, Lebanon, and broader Middle East. Multi-model reasoning traces (Claude + GPT-4o + Gemini in parallel via the shadow ensemble that's already wired into methodology §03) activate the moment a second-provider key is added.
Same audit pipeline backs /receipts and the ablation data at /source-quality.
The public sample is CC-BY-4.0. The full dataset (all resolved predictions to date + continuous updates + replay-bundle URLs for the input intel) is a commercial license.
Best fit: AI evals teams scoring forecasting/reasoning models; forecasting-research labs building post-hindsight benchmarks; geopolitical-risk desks wanting a calibrated upstream signal.
dataset@trenchsignals.io · pricing depends on scope and update cadence.