For AI labs & forecasting-eval teams

A reasoning-eval dataset where predictions are hash-anchored before the market resolves. v0.1.0

Paired records of (intel context at time T, model reasoning trace, probability, market price, eventual outcome). Built from a live AI agent trading geopolitical prediction markets in public, with signals registered to a tamper-evident chain before resolution. To our knowledge, the only forecasting dataset where a third party can cryptographically verify that the prediction was recorded before the outcome was known.


Live counters on this page: resolved predictions, hash-anchored records, theaters tagged.

Source feeds: 119+, in 5 native languages.

About the anchor coverage above: the replay-bundle pipeline (which makes hash-anchoring possible) went live on 2026-05-07. Records made before that date carry the full reasoning + market + outcome triple but show anchor: null. From v0.2 onward every record is anchored.
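Because pre-pipeline records carry anchor: null, an eval that depends on the no-hindsight guarantee may want to separate the two groups first. A minimal sketch, assuming each record is a dict with a top-level "anchor" key (the actual nesting is defined in the schema; split_by_anchor is a hypothetical helper name, and the demo rows are toy data):

```python
def split_by_anchor(records):
    """Partition records into (anchored, unanchored).

    Assumes a top-level "anchor" key that is either None (JSON null)
    or an object; adjust the lookup if the schema nests it elsewhere.
    """
    anchored, unanchored = [], []
    for rec in records:
        (anchored if rec.get("anchor") else unanchored).append(rec)
    return anchored, unanchored


# Toy records for illustration only, not real dataset rows.
demo = [
    {"id": "a", "anchor": None},
    {"id": "b", "anchor": {"bundle_sha256": "00" * 32}},
]
anchored, unanchored = split_by_anchor(demo)
print(len(anchored), "anchored,", len(unanchored), "unanchored")
```

Reporting scores for both groups separately keeps the anchored subset clean while still using the older records for reasoning-quality analysis.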

What you get that nothing else has

Feature	ForecastBench / GJP / IARPA	Raw Polymarket / Kalshi	Trench Eval Dataset
Per-sample reasoning trace	no	no	full Claude text, ~500–2000 chars
Model probability + market price paired	aggregate only	market only	both, timestamped
Pre-event hash anchor (no hindsight)	no	no	SHA-256 chain + Wayback Machine
Input intel snapshot (RSS, native-lang, indicators)	no	no	replay bundle per record
Verified outcome from a real market	no	yes	yes (Polymarket / Kalshi settlement)

Existing forecasting benchmarks publish aggregate forecaster scores. Trench publishes the full reasoning trace per prediction, paired with the exact market price the model saw, with a cryptographic anchor that lets a third party prove the prediction predated the resolution. That last property is what makes the dataset rare: a forecasting-eval dataset assembled after the fact is exposed to hindsight leakage, because nothing in it proves when a prediction was actually made.
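To make the "tamper-evident chain" idea concrete, here is a generic hash-chain sketch. It is not Trench's registry format (the link layout, the GENESIS constant, and the function names are all assumptions for illustration); it only shows why editing one anchored digest after the fact invalidates every later link.

```python
import hashlib

GENESIS = "0" * 64  # placeholder "previous hash" for the first link (assumption)


def link_hash(prev_hash: str, bundle_sha256: str) -> str:
    """Hash of one link: commits to the previous link and this bundle's digest."""
    return hashlib.sha256(f"{prev_hash}:{bundle_sha256}".encode("utf-8")).hexdigest()


def build_chain(bundle_digests):
    """Return the list of link hashes for a sequence of bundle digests."""
    chain, prev = [], GENESIS
    for digest in bundle_digests:
        prev = link_hash(prev, digest)
        chain.append(prev)
    return chain


def verify_chain(bundle_digests, chain):
    """Recompute every link and compare with the published chain."""
    return build_chain(bundle_digests) == chain


digests = [hashlib.sha256(b).hexdigest() for b in (b"bundle-1", b"bundle-2", b"bundle-3")]
chain = build_chain(digests)
print(verify_chain(digests, chain))    # True

tampered = digests.copy()
tampered[1] = hashlib.sha256(b"edited after the fact").hexdigest()
print(verify_chain(tampered, chain))   # False
```

A chain alone shows internal consistency; the independent timestamp comes from the external Wayback Machine capture of the published chain.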

Browse the records

The live page shows the public sample as a table; each row expands to the full reasoning trace and anchor block. The JSON returned by /api/dataset/records is byte-identical to the sample JSONL.

One record, fully expanded

On the live page, this section renders the most recent record in the public sample rather than a fabricated illustration. The full schema (every field, every type) lives at /dataset/schema.

Verifying a record

Every anchored record is independently verifiable in four steps. No trust in Trench Signals is required: the proof relies only on SHA-256 and the Internet Archive.

1. curl -sSL https://trenchsignals.io/api/replay-bundle/{replay_bundle_id} -o bundle.json
2. sha256sum bundle.json             # compare to anchor.bundle_sha256
3. curl -sSL {anchor.registry_url}   # confirm the hash is in that day's chain
4. curl -sSL {anchor.wayback_url}    # confirm IA capture predates resolution
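Steps 2 and 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: that anchor.bundle_sha256 is the SHA-256 of the raw bytes returned by the replay-bundle endpoint (the schema may specify a canonical form instead), and that anchor.wayback_url follows the standard Wayback Machine pattern with a 14-digit UTC timestamp after /web/. The function names and the demo values are hypothetical.

```python
import hashlib
import re
from datetime import datetime, timezone

# Wayback URLs look like https://web.archive.org/web/YYYYMMDDhhmmss/<original-url>
WAYBACK_TS = re.compile(r"/web/(\d{14})")


def bundle_matches_anchor(bundle_bytes: bytes, anchor: dict) -> bool:
    """Step 2: SHA-256 of the downloaded bundle vs. anchor["bundle_sha256"]."""
    digest = hashlib.sha256(bundle_bytes).hexdigest()
    return digest == anchor["bundle_sha256"].lower()


def wayback_capture_time(wayback_url: str) -> datetime:
    """Parse the UTC capture timestamp embedded in a Wayback Machine URL."""
    match = WAYBACK_TS.search(wayback_url)
    if match is None:
        raise ValueError(f"no Wayback timestamp in {wayback_url!r}")
    parsed = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def captured_before(wayback_url: str, resolved_at: datetime) -> bool:
    """Step 4: the capture must be strictly earlier than the resolution time."""
    return wayback_capture_time(wayback_url) < resolved_at


# Synthetic demo values, not a real record.
bundle = b'{"demo": true}'
anchor = {
    "bundle_sha256": hashlib.sha256(bundle).hexdigest(),
    "wayback_url": "https://web.archive.org/web/20260508120000/https://example.org/registry",
}
resolved_at = datetime(2026, 6, 1, tzinfo=timezone.utc)
print(bundle_matches_anchor(bundle, anchor))
print(captured_before(anchor["wayback_url"], resolved_at))
```

Parsing the URL only reads the timestamp the link claims; a full check still fetches the capture and confirms the hash appears in it. Step 3 is left out because the registry response format is not described here.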

Full schema and verification protocol: trenchsignals.io/dataset/schema (also at docs/DATASET_EVAL.md). Scoring library that produces the Brier values: trench-core on PyPI (MIT-licensed; anyone can recompute byte-identical scores).

Load it into your eval pipeline

import json
from urllib.request import urlopen

url = "https://trenchsignals.io/dataset-files/sample/records.jsonl"

# Stream the JSONL file: one JSON object per non-empty line.
with urlopen(url) as resp:
    records = [json.loads(line) for line in resp if line.strip()]
print(len(records), "records")

# Inspect Claude's probability, the outcome, and the stored Brier score;
# substitute your own model's probability to score it on the same questions.
for r in records[:3]:
    print(r["question"]["text"])
    print("  Claude p(YES) =", r["prediction"]["probability_yes"])
    print("  Outcome       =", r["resolution"]["outcome"])
    print("  Brier         =", r["resolution"]["brier_score"])

import pandas as pd
df = pd.read_csv("https://trenchsignals.io/api/dataset/sample.csv")
print(df[["prediction_probability_yes",
          "resolution_outcome",
          "resolution_brier_score"]].describe())

# Mean Brier by theater:
print(df.groupby("question_theater")["resolution_brier_score"].mean())

from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files="https://trenchsignals.io/dataset-files/sample/records.jsonl",
    split="train",
)
print(ds.features)
print(ds[0]["prediction"]["reasoning"][:300])

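# Pipe to your own scorer. Brier = (p - o)^2, with o = 1 for a YES outcome
# and 0 otherwise (any value other than "YES" is scored as NO here).
def score(ex):
    p = ex["prediction"]["probability_yes"]
    o = 1 if ex["resolution"]["outcome"] == "YES" else 0
    return {"brier": (p - o) ** 2}
ds = ds.map(score)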

# Sample JSONL (CC-BY-4.0, all 11 sample records):
curl -sSL https://trenchsignals.io/dataset-files/sample/records.jsonl \
  | head -1 | jq .

# Live stats:
curl -sSL https://trenchsignals.io/api/dataset/stats | jq .

# One anchored record (returns a 404 in the v0.1.0 sample;
# anchors begin at v0.2 / records made 2026-05-07 onward).
# The URL is quoted so the shell does not treat "?" as a glob character.
BUNDLE_ID=$(curl -sSL "https://trenchsignals.io/api/dataset/records?anchored=true" \
  | jq -r '.records[0].context.replay_bundle_id')
curl -sSL "https://trenchsignals.io/api/replay-bundle/${BUNDLE_ID}" | jq .
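Since each record pairs the model probability with the market price, one natural use is to compare the model's Brier score against the market-implied baseline. A minimal sketch, assuming you have already extracted (model probability, market price, outcome) tuples from the records; the field that holds the market price is defined in the schema and is deliberately not guessed here, and the function names and toy rows are hypothetical:

```python
def brier(p: float, outcome_yes: bool) -> float:
    """Squared error between a YES probability and the binary outcome."""
    return (p - (1.0 if outcome_yes else 0.0)) ** 2


def compare_to_market(rows):
    """rows: iterable of (model_p, market_p, outcome_yes) tuples.

    Returns mean Brier for model and market, plus the model's edge
    (positive edge means the model scored better than the market price).
    """
    rows = list(rows)
    if not rows:
        raise ValueError("need at least one resolved record")
    n = len(rows)
    model = sum(brier(m, o) for m, _, o in rows) / n
    market = sum(brier(k, o) for _, k, o in rows) / n
    return {"model": model, "market": market, "edge": market - model}


# Toy rows for illustration only, not real dataset values.
toy = [(0.80, 0.60, True), (0.30, 0.50, False), (0.60, 0.70, True)]
result = compare_to_market(toy)
print(result)
```

With only a handful of resolved records, differences this small are well within noise, so treat the edge as descriptive rather than significant.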

What's in v0.1.0 — and what's coming

v0.1.0 (current public sample): Iran-theater binary-resolution markets with Claude's full reasoning trace. The first 56 records predate the replay-bundle pipeline and carry anchor: null; from 2026-05-07 onward every record is fully hash-anchored.

v0.2 (next): anchored records for Russia/Ukraine, Taiwan/China, Lebanon, and the broader Middle East. Multi-model reasoning traces (Claude, GPT-4o, and Gemini in parallel, via the shadow ensemble already wired in per methodology §03) switch on as soon as a second-provider key is added.

The same audit pipeline backs /receipts and the ablation data at /source-quality.

License the full dataset

The public sample is CC-BY-4.0. The full dataset (all resolved predictions to date, continuous updates, and replay-bundle URLs for the input intel) is available under a commercial license.

Best fit: AI evals teams scoring forecasting/reasoning models; forecasting-research labs building post-hindsight benchmarks; geopolitical-risk desks wanting a calibrated upstream signal.

dataset@trenchsignals.io · pricing depends on scope and update cadence.