feat(aether-arena): benchmark-first scorer + witness chain + repeatability (M2/M5/M7)

Per direction "remove the initial number, optimize for benchmark first" + "include witness chain capabilities for proof and repeatability analysis":

- Empty board, no seeded numbers: ledger seeds to genesis only. Every result is a real scoring-pipeline witness; RuView gets no hand-entered baseline.
- Real model scoring: aa_score_runner now loads predictions + an eval split (--split/--pred) and scores them through the real ruview_metrics pose harness, not just a synthetic fixture. Committed public smoke split (fixtures/smoke_*.json).
- Witness chain: each score emits a witness = inputs_sha256 (binds it to the exact inputs) + proof_sha256 (cross-platform-stable score hash) + harness_version.
- Repeatability analysis: --repeat N runs the harness N× and fails if it ever yields >=2 distinct proof hashes (16/16 identical locally).
- Witness ledger (ledger/ledger_tools.py): append-only, hash-chained, tamper-evident (seed/append/verify); editing any past row breaks the chain.
- CI gate extended: determinism + repeatability(16) + real-scoring smoke + ledger chain verify on every PR.

Co-Authored-By: claude-flow <ruv@ruv.net>
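
For readers unfamiliar with the pattern, the witness, repeatability check, and hash-chained ledger described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in `aa_score_runner` or `ledger/ledger_tools.py`: the function names (`make_witness`, `check_repeatable`, `append_row`, `verify_chain`), the row fields `prev_sha256` and `row_sha256`, the all-zero genesis hash, and the 1e-6 quantisation step are all assumptions made for the example. Only `inputs_sha256`, `proof_sha256`, and `harness_version` come from the commit.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed genesis marker; the real ledger may use another value


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def canonical(obj) -> bytes:
    # Sorted keys and fixed separators yield identical bytes for identical content.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def make_witness(inputs: bytes, score: float, harness_version: str) -> dict:
    # Quantise to an integer before hashing so that float formatting
    # differences cannot move the proof hash (assumed step: 1e-6).
    quantised = int(round(score * 1_000_000))
    return {
        "inputs_sha256": sha256_hex(inputs),
        "proof_sha256": sha256_hex(canonical({"score_q": quantised})),
        "harness_version": harness_version,
    }


def check_repeatable(score_fn, inputs: bytes, n: int, harness_version: str) -> str:
    # Run the scorer n times and collect the distinct proof hashes.
    proofs = {
        make_witness(inputs, score_fn(inputs), harness_version)["proof_sha256"]
        for _ in range(n)
    }
    if len(proofs) != 1:
        raise RuntimeError(f"expected 1 proof hash, got {len(proofs)}")
    return proofs.pop()


def append_row(ledger: list, witness: dict) -> None:
    # Each row commits to the previous row's hash, forming the chain.
    prev_hash = ledger[-1]["row_sha256"] if ledger else GENESIS_HASH
    body = {"prev_sha256": prev_hash, "witness": witness}
    ledger.append({**body, "row_sha256": sha256_hex(canonical(body))})


def verify_chain(ledger: list) -> bool:
    # Recompute every row hash and check that each row links to its predecessor.
    prev_hash = GENESIS_HASH
    for row in ledger:
        body = {"prev_sha256": row["prev_sha256"], "witness": row["witness"]}
        if row["prev_sha256"] != prev_hash:
            return False
        if row["row_sha256"] != sha256_hex(canonical(body)):
            return False
        prev_hash = row["row_sha256"]
    return True
```

In this sketch, editing a past witness makes that row's recomputed hash disagree with its stored `row_sha256`, and re-hashing the edited row instead breaks the next row's `prev_sha256` link, so `verify_chain` returns `False` either way.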
2026-07-27 18:11:43 +00:00 · 2026-05-30 16:59:11 -04:00
parent a6808568a2
commit 483bfa4660
10 changed files with 373 additions and 87 deletions
@@ -146,8 +146,9 @@ The leaderboard is only credible if its failure modes cannot be hidden. Explicit
 | Model exfiltrates / phones home the eval data | Scorer container runs with **no network, read-only eval FS, resource caps** (sandboxed) |
 | Submitter overfits the public split | **Private held-out split** — never published; scoring runs on data the submitter has never seen |
 | Model fingerprints / detects the eval set | **Seasonal rotation** of a fraction of the held-out split (mirrors ADR-120 hash rotation) |
-| Maintainer silently edits a score / rank | **Signed, append-only** Parquet results ledger — rows are immutable and verifiable |
-| Scorer version drift changes ranks invisibly | **`harness_version` pinned per row**; a scorer change forces a re-eval, not a silent re-rank |
+| Maintainer silently edits a score / rank | **Witness chain**: append-only, hash-chained ledger (`ledger/ledger_tools.py`) — each row references the prior row's hash, so any edit breaks every subsequent link and `verify` fails |
+| A score can't be reproduced / hides nondeterminism | **Witness + repeatability analysis**: each score is a witness (`inputs_sha256` binding it to the exact inputs + `proof_sha256` of the quantised result + `harness_version`); `aa_score_runner --repeat N` runs the harness N× and fails if it ever produces ≥2 distinct proof hashes |
+| Scorer version drift changes ranks invisibly | **`harness_version` pinned per witness**; a scorer change moves the proof hash and fails the CI determinism gate until regenerated + reviewed |
 | Slow model brute-forces accuracy | **Latency is a ranked axis** (p50/p95/p99) with hard caps + the `latency_factor` in `arena_score` |
 | "Gold accuracy, leaks identity" win | **Privacy is a (gated) axis**; once active, `privacy_factor` penalizes leakage in `arena_score` |
 | Malicious model artifact (RCE in the scorer) | Untrusted artifact loaded in the sandboxed container only; pinned, minimal runtime; no host mounts |