feat(aether-arena): benchmark-first scorer + witness chain + repeatability (M2/M5/M7)

Per direction "remove the initial number, optimize for benchmark first" + "include
witness chain capabilities for proof and repeatability analysis":

- Empty board, no seeded numbers: ledger seeds to genesis only. Every result is a
  real scoring-pipeline witness; RuView gets no hand-entered baseline.
- Real model scoring: aa_score_runner now loads predictions + an eval split
  (--split/--pred) and scores them through the real ruview_metrics pose harness —
  not just a synthetic fixture. Committed public smoke split (fixtures/smoke_*.json).
- Witness chain: each score emits a witness = inputs_sha256 (binds it to the exact
  inputs) + proof_sha256 (cross-platform-stable score hash) + harness_version.
- Repeatability analysis: --repeat N runs the harness N times and fails if the runs
  yield more than one distinct proof hash (16/16 identical locally).
- Witness ledger: ledger/ledger_tools.py — append-only, hash-chained, tamper-
  evident (seed/append/verify); editing any past row breaks the chain.
- CI gate extended: determinism + repeatability(16) + real-scoring smoke + ledger
  chain verify on every PR.
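
The append-only, hash-chained ledger described above can be sketched as follows. This is an illustrative std-only Rust model, not the contents of ledger/ledger_tools.py, which this diff does not show. The `Row` layout, the function names, and the FNV-1a hash are all assumptions; FNV-1a is a non-cryptographic stand-in chosen only to avoid external crates, and a real ledger would want a digest such as SHA-256.

```rust
// Illustrative model of a hash-chained, append-only ledger.
// ASSUMPTIONS: `Row`, `seed`, `append`, `verify` are hypothetical names, and
// FNV-1a is a non-cryptographic stand-in for a real digest such as SHA-256.

/// 64-bit FNV-1a over a byte slice (stand-in hash, not collision-resistant).
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

#[derive(Clone, Debug, PartialEq)]
struct Row {
    prev_hash: u64,  // hash of the previous row (0 for genesis)
    payload: String, // e.g. one witness JSON line
    row_hash: u64,   // hash over prev_hash followed by payload
}

fn row_hash(prev_hash: u64, payload: &str) -> u64 {
    let mut buf = prev_hash.to_le_bytes().to_vec();
    buf.extend_from_slice(payload.as_bytes());
    fnv1a(&buf)
}

/// Start a ledger that contains only the genesis row.
fn seed() -> Vec<Row> {
    let payload = "genesis".to_string();
    let h = row_hash(0, &payload);
    vec![Row { prev_hash: 0, payload, row_hash: h }]
}

/// Append a row that commits to the hash of the current last row.
fn append(ledger: &mut Vec<Row>, payload: &str) {
    let prev = ledger.last().map(|r| r.row_hash).unwrap_or(0);
    ledger.push(Row {
        prev_hash: prev,
        payload: payload.to_string(),
        row_hash: row_hash(prev, payload),
    });
}

/// Recompute every link; an edited row no longer matches its stored hash.
fn verify(ledger: &[Row]) -> bool {
    let mut prev = 0u64;
    for r in ledger {
        if r.prev_hash != prev || r.row_hash != row_hash(r.prev_hash, &r.payload) {
            return false;
        }
        prev = r.row_hash;
    }
    true
}
```

In this model, calling `seed()`, appending two payloads, and then overwriting the payload of the middle row makes `verify` return false, which is the tamper-evidence property the bullet describes.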

Co-Authored-By: claude-flow <ruv@ruv.net>
ruv
2026-05-30 16:59:11 -04:00
parent a6808568a2
commit 483bfa4660
10 changed files with 373 additions and 87 deletions
@@ -1,74 +1,95 @@
//! AetherArena ("AA") Deterministic Score Runner (ADR-149).
//! AetherArena ("AA") Score Runner + Witness Chain (ADR-149).
//!
//! The CI-runnable entry point behind the AA harness gate: it runs the **real**
//! `wifi-densepose-train::ruview_metrics` pose-acceptance harness against a
//! fixed, committed synthetic fixture (seed = 42) and emits:
//! 1. the pose metrics (PCK@0.2 all/torso, OKS, jitter, p95 error),
//! 2. the v0 `RuViewTier`-style pose verdict, and
//! 3. a cross-platform-stable SHA-256 **proof hash** of the quantised result.
//! Benchmark-first scorer for the official Spatial-Intelligence Benchmark. It runs
//! the **real** `wifi-densepose-train::ruview_metrics` pose-acceptance harness and
//! emits a **witness record** for proof + repeatability analysis:
//!
//! This is the `determinism_gate` substrate from ADR-149 §2.5: the same fixture
//! + same harness version must always produce the same hash. A PR that changes
//! the scoring maths moves the hash and fails the gate (the `expected_score.sha256`
//! must be regenerated and reviewed), so scorer drift can never land silently.
//! witness = { inputs_sha256, harness_version, metrics, tier, proof_sha256 }
//!
//! Cross-platform portability (lesson from `calibration_proof_runner.rs`):
//! PCK/OKS use `sqrt` (libm-sensitive: glibc/MSVC/Apple differ by ~1e-7). We
//! never hash raw f32 — we quantise each metric to coarse fixed-point (1e-3 /
//! 1e-4) so a 1e-7 libm wobble is invisible while a real algorithm change
//! (>1e-3) breaks the hash. No sort, no truncation.
//! The `proof_sha256` is a cross-platform-stable hash of the quantised score; the
//! `inputs_sha256` binds the witness to the exact inputs it scored. Together with
//! the append-only hash-chained ledger (`aether-arena/ledger`), every published
//! rank traces back to a reproducible witness — the witness chain.
//!
//! Usage:
//! # verify against the committed expected hash (CI gate default):
//! Modes:
//! # 1. Determinism self-test on the committed fixture (CI gate default):
//! cargo run -p wifi-densepose-train --bin aa_score_runner --no-default-features
//!
//! # emit the score as JSON (for the leaderboard ledger row):
//! cargo run -p wifi-densepose-train --bin aa_score_runner --no-default-features -- --json
//! # 2. Repeatability analysis — run K times, confirm identical proof hash:
//! cargo run ... --bin aa_score_runner --no-default-features -- --repeat 8
//!
//! # regenerate the expected hash (after an intentional scorer change):
//! cargo run -p wifi-densepose-train --bin aa_score_runner --no-default-features -- --generate-hash \
//! > ../aether-arena/fixtures/expected_score.sha256
//! # 3. Real model scoring — score predictions against an eval split:
//! cargo run ... --bin aa_score_runner --no-default-features -- \
//! --split eval.json --pred predictions.json --json
//!
//! # 4. Regenerate the fixture's expected hash (after an intentional change):
//! cargo run ... --bin aa_score_runner --no-default-features -- --generate-hash \
//! > ../aether-arena/fixtures/expected_score.sha256
//!
//! Input JSON (split = private ground truth; pred = the submitted model's output):
//! split.json : {"frames":[{"gt":[[x,y]*17],"vis":[v*17],"scale":1.0}, ...]}
//! pred.json : {"frames":[{"pred":[[x,y]*17]}, ...]} (index-aligned with split)
//!
//! Determinism discipline (lesson from calibration_proof_runner.rs): PCK/OKS use
//! libm `sqrt` which differs ~1e-7 across glibc/MSVC/Apple — so we hash only the
//! quantised metrics (1e-3 / 1e-4), never raw f32. No sort, no truncation.
use std::env;
use std::process::ExitCode;
use ndarray::{Array1, Array2};
use serde::Deserialize;
use sha2::{Digest, Sha256};
use wifi_densepose_train::ruview_metrics::{
evaluate_joint_error, JointErrorResult, JointErrorThresholds,
};
/// Bump when the fixture or canonical hash form changes on purpose. Pinned into
/// the proof so a `harness_version` change forces a re-score (ADR-149 §2.4).
const AA_HARNESS_VERSION: u32 = 1;
/// Bump on a purposeful fixture/canonical-form change. Pinned into every witness
/// so a `harness_version` change forces a re-score (ADR-149 §2.4).
const AA_HARNESS_VERSION: u32 = 2;
/// Fixture size — fixed so the hash is stable.
const N_FRAMES: usize = 120;
const N_KPTS: usize = 17;
/// Deterministic, libm-free LCG (Numerical Recipes constants) → u32 → f32 in [0,1).
// ── input schema ────────────────────────────────────────────────────────────
#[derive(Deserialize)]
struct SplitFile {
frames: Vec<SplitFrame>,
}
#[derive(Deserialize)]
struct SplitFrame {
gt: Vec<[f32; 2]>,
vis: Vec<f32>,
#[serde(default = "one")]
scale: f32,
}
#[derive(Deserialize)]
struct PredFile {
frames: Vec<PredFrame>,
}
#[derive(Deserialize)]
struct PredFrame {
pred: Vec<[f32; 2]>,
}
fn one() -> f32 {
1.0
}
// ── deterministic fixture (libm-free LCG) ─────────────────────────────────────
struct Lcg(u64);
impl Lcg {
fn next_u32(&mut self) -> u32 {
self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
(self.0 >> 32) as u32
}
/// Uniform f32 in [0,1) at 1e-6 granularity — no float math in the generator.
fn unit(&mut self) -> f32 {
(self.next_u32() % 1_000_000) as f32 / 1_000_000.0
}
}
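
A std-only copy of the generator above can be used outside the crate to check its two key properties: the same seed always yields the same sequence, and `unit()` stays in [0,1). The `first_units` helper is hypothetical and exists only for this sketch.

```rust
// Std-only copy of the fixture generator, for experimenting outside the crate.
struct Lcg(u64);

impl Lcg {
    fn next_u32(&mut self) -> u32 {
        // 64-bit LCG step; the high 32 bits are returned as the sample.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 32) as u32
    }

    /// f32 in [0,1) at 1e-6 granularity (integer modulo, then one division).
    fn unit(&mut self) -> f32 {
        (self.next_u32() % 1_000_000) as f32 / 1_000_000.0
    }
}

/// Hypothetical helper: the first `n` unit samples for a given seed.
fn first_units(seed: u64, n: usize) -> Vec<f32> {
    let mut rng = Lcg(seed);
    (0..n).map(|_| rng.unit()).collect()
}
```

Two calls to `first_units(42, n)` return identical vectors, which is what lets the fixture hash stay fixed across runs.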
/// Build the canonical fixture: ground-truth keypoints in [0.2,0.8] and
/// predictions = GT + a small, deterministic offset, so PCK/OKS land in a
/// stable mid-high band (not trivially 0 or 1). Identical on every platform.
fn build_fixture() -> (Vec<Array2<f32>>, Vec<Array2<f32>>, Vec<Array1<f32>>, Vec<f32>) {
let mut rng = Lcg(42);
let mut gt = Vec::with_capacity(N_FRAMES);
let mut pred = Vec::with_capacity(N_FRAMES);
let mut vis = Vec::with_capacity(N_FRAMES);
let mut scale = Vec::with_capacity(N_FRAMES);
let (mut pred, mut gt, mut vis, mut scale) = (vec![], vec![], vec![], vec![]);
for _ in 0..N_FRAMES {
let mut g = Array2::<f32>::zeros((N_KPTS, 2));
let mut p = Array2::<f32>::zeros((N_KPTS, 2));
@@ -76,15 +97,12 @@ fn build_fixture() -> (Vec<Array2<f32>>, Vec<Array2<f32>>, Vec<Array1<f32>>, Vec
for k in 0..N_KPTS {
let gx = 0.2 + 0.6 * rng.unit();
let gy = 0.2 + 0.6 * rng.unit();
// Deterministic prediction offset: small for most kpts, larger for a
// few, so PCK is a believable fraction (~0.6-0.8) rather than 1.0.
let ox = (rng.unit() - 0.5) * 0.06;
let oy = (rng.unit() - 0.5) * 0.06;
g[[k, 0]] = gx;
g[[k, 1]] = gy;
p[[k, 0]] = (gx + ox).clamp(0.0, 1.0);
p[[k, 1]] = (gy + oy).clamp(0.0, 1.0);
// Occlude ~10% deterministically.
if rng.next_u32() % 10 == 0 {
v[k] = 0.0;
}
@@ -97,13 +115,53 @@ fn build_fixture() -> (Vec<Array2<f32>>, Vec<Array2<f32>>, Vec<Array1<f32>>, Vec
(pred, gt, vis, scale)
}
/// Canonical, libm-stable byte form of the result for hashing.
/// Each metric → coarse fixed-point so ~1e-7 platform noise can't flip the hash.
/// Load (pred, gt, vis, scale) from index-aligned split + prediction files.
fn load_inputs(
split_path: &str,
pred_path: &str,
) -> Result<(Vec<Array2<f32>>, Vec<Array2<f32>>, Vec<Array1<f32>>, Vec<f32>), String> {
let split: SplitFile = serde_json::from_str(
&std::fs::read_to_string(split_path).map_err(|e| format!("read split: {e}"))?,
)
.map_err(|e| format!("parse split: {e}"))?;
let pred: PredFile = serde_json::from_str(
&std::fs::read_to_string(pred_path).map_err(|e| format!("read pred: {e}"))?,
)
.map_err(|e| format!("parse pred: {e}"))?;
if split.frames.len() != pred.frames.len() {
return Err(format!(
"frame count mismatch: split={} pred={}",
split.frames.len(),
pred.frames.len()
));
}
let (mut gt, mut pr, mut vis, mut scale) = (vec![], vec![], vec![], vec![]);
for (i, (s, p)) in split.frames.iter().zip(pred.frames.iter()).enumerate() {
let to_arr = |kps: &[[f32; 2]]| -> Result<Array2<f32>, String> {
if kps.len() != N_KPTS {
return Err(format!("frame {i}: expected {N_KPTS} keypoints, got {}", kps.len()));
}
let mut a = Array2::<f32>::zeros((N_KPTS, 2));
for (k, xy) in kps.iter().enumerate() {
a[[k, 0]] = xy[0];
a[[k, 1]] = xy[1];
}
Ok(a)
};
gt.push(to_arr(&s.gt)?);
pr.push(to_arr(&p.pred)?);
vis.push(Array1::from(s.vis.clone()));
scale.push(s.scale);
}
Ok((pr, gt, vis, scale))
}
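
`load_inputs` checks the frame count and the keypoint count of `gt` and `pred`. As far as this diff shows, the length of `vis` is not checked, while `inputs_hash` indexes `vis[f][k]` for every keypoint. A guard along these lines would reject a short `vis` array with an error instead of an out-of-bounds panic. `check_vis_len` is a hypothetical helper, not part of the commit.

```rust
const N_KPTS: usize = 17;

/// Hypothetical guard: every frame's visibility vector must hold N_KPTS entries.
fn check_vis_len(frames_vis: &[Vec<f32>]) -> Result<(), String> {
    for (i, v) in frames_vis.iter().enumerate() {
        if v.len() != N_KPTS {
            return Err(format!(
                "frame {i}: expected {N_KPTS} visibility flags, got {}",
                v.len()
            ));
        }
    }
    Ok(())
}
```

The error string follows the same "frame {i}: expected ..." shape that `to_arr` already uses for keypoints, so the two failures would read consistently.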
/// Canonical, libm-stable byte form of the score for the proof hash.
fn canonical_bytes(r: &JointErrorResult) -> Vec<u8> {
let mut b = Vec::new();
b.extend_from_slice(b"AA-SCORE-v0");
b.extend_from_slice(&AA_HARNESS_VERSION.to_le_bytes());
let q = |x: f32, scale: f32| -> u32 { (x.max(0.0) * scale).round() as u32 };
let q = |x: f32, s: f32| -> u32 { (x.max(0.0) * s).round() as u32 };
b.extend_from_slice(&q(r.pck_all, 1e3).to_le_bytes());
b.extend_from_slice(&q(r.pck_torso, 1e3).to_le_bytes());
b.extend_from_slice(&q(r.oks, 1e3).to_le_bytes());
@@ -113,56 +171,132 @@ fn canonical_bytes(r: &JointErrorResult) -> Vec<u8> {
b
}
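
The quantiser inside `canonical_bytes` can be tried in isolation to see why a ~1e-7 platform difference leaves the hash alone while a change of a few 1e-3 does not. The free function `q` below mirrors the closure above; the sample values are made up for illustration.

```rust
/// Same shape as the closure in `canonical_bytes`: clamp at 0, scale, round.
fn q(x: f32, s: f32) -> u32 {
    (x.max(0.0) * s).round() as u32
}

/// Hypothetical helper: do two metric values land in the same 1e-3 bucket?
fn same_bucket(a: f32, b: f32) -> bool {
    q(a, 1e3) == q(b, 1e3)
}
```

One caveat of any round-to-nearest scheme: a value that sits almost exactly on a bucket boundary (for example x * 1e3 close to n + 0.5) can still round differently under tiny noise, so the quantisation makes hash flips unlikely rather than impossible.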
fn hash_hex(bytes: &[u8]) -> String {
fn sha256_hex(bytes: &[u8]) -> String {
let mut h = Sha256::new();
h.update(bytes);
h.finalize().iter().map(|x| format!("{x:02x}")).collect()
}
/// Bind the witness to its exact inputs: hash the quantised gt+pred+vis bytes.
fn inputs_hash(
pred: &[Array2<f32>],
gt: &[Array2<f32>],
vis: &[Array1<f32>],
) -> String {
let mut h = Sha256::new();
h.update(b"AA-INPUTS-v0");
h.update((pred.len() as u32).to_le_bytes());
let q = |x: f32| -> i32 { (x * 1e4).round() as i32 };
for f in 0..gt.len() {
for k in 0..N_KPTS {
h.update(q(gt[f][[k, 0]]).to_le_bytes());
h.update(q(gt[f][[k, 1]]).to_le_bytes());
h.update(q(pred[f][[k, 0]]).to_le_bytes());
h.update(q(pred[f][[k, 1]]).to_le_bytes());
h.update([(vis[f][k] >= 0.5) as u8]);
}
}
h.finalize().iter().map(|x| format!("{x:02x}")).collect()
}
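
As written, `inputs_hash` covers `gt`, `pred`, and the thresholded `vis` flag. The per-frame `scale` is not passed in, so two inputs that differ only in `scale` would share an `inputs_sha256`. The sketch below shows one way a digest could also fold in the scale. It is an assumption-laden illustration: `inputs_digest` and `q4` are hypothetical names, the inputs are flattened to plain slices, and FNV-1a stands in for SHA-256 to keep the example free of external crates.

```rust
// ASSUMPTION: FNV-1a stands in for SHA-256 so the sketch needs no external crates.
fn fnv1a_update(mut h: u64, bytes: &[u8]) -> u64 {
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Quantise to 1e-4 fixed point, as `inputs_hash` does for coordinates.
fn q4(x: f32) -> i32 {
    (x * 1e4).round() as i32
}

/// Hypothetical variant: digest the coordinates and the per-frame scale.
fn inputs_digest(coords: &[f32], scale: &[f32]) -> u64 {
    // Domain-separation tag first, then quantised values in a fixed order.
    let mut h = fnv1a_update(0xcbf2_9ce4_8422_2325, b"AA-INPUTS-sketch");
    for &c in coords {
        h = fnv1a_update(h, &q4(c).to_le_bytes());
    }
    for &s in scale {
        h = fnv1a_update(h, &q4(s).to_le_bytes());
    }
    h
}
```

With the scale folded in, changing only `scale` from 1.0 to 2.0 changes the digest, whereas a digest over coordinates alone would not notice.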
struct Witness {
inputs_sha256: String,
proof_sha256: String,
result: JointErrorResult,
}
fn score(
pred: &[Array2<f32>],
gt: &[Array2<f32>],
vis: &[Array1<f32>],
scale: &[f32],
) -> Witness {
let result = evaluate_joint_error(pred, gt, vis, scale, &JointErrorThresholds::default());
Witness {
inputs_sha256: inputs_hash(pred, gt, vis),
proof_sha256: sha256_hex(&canonical_bytes(&result)),
result,
}
}
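
The repeatability check built on `score` reduces to "collect the proof hash of each run into a set and require exactly one element". A generic std-only sketch of that idea follows; `distinct_outputs` and `is_repeatable` are hypothetical names, and the closure stands in for a call that returns `proof_sha256`.

```rust
use std::collections::BTreeSet;

/// Run `f` `runs` times and collect the distinct outputs it produced.
fn distinct_outputs<F: FnMut() -> String>(runs: usize, mut f: F) -> BTreeSet<String> {
    (0..runs).map(|_| f()).collect()
}

/// Repeatable means exactly one distinct output across all runs.
fn is_repeatable<F: FnMut() -> String>(runs: usize, f: F) -> bool {
    distinct_outputs(runs, f).len() == 1
}
```

Note that `is_repeatable(0, ...)` is false because the set is empty; the runner sidesteps this by treating a repeat count of 0 as "mode disabled". Re-running in one process on the same in-memory inputs exercises in-process determinism only, so the cross-platform claim still rests on the quantised hash.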
fn witness_json(w: &Witness) -> String {
format!(
"{{\"category\":\"pose\",\"harness_version\":{},\"inputs_sha256\":\"{}\",\"proof_sha256\":\"{}\",\"pck_all\":{:.4},\"pck_torso\":{:.4},\"oks\":{:.4},\"jitter_rms_m\":{:.5},\"max_error_p95_m\":{:.5},\"pose_passes\":{}}}",
AA_HARNESS_VERSION, w.inputs_sha256, w.proof_sha256,
w.result.pck_all, w.result.pck_torso, w.result.oks,
w.result.jitter_rms_m, w.result.max_error_p95_m, w.result.passes
)
}
fn arg_val<'a>(args: &'a [String], key: &str) -> Option<&'a str> {
args.iter().position(|a| a == key).and_then(|i| args.get(i + 1)).map(|s| s.as_str())
}
fn main() -> ExitCode {
let args: Vec<String> = env::args().collect();
let mode_json = args.iter().any(|a| a == "--json");
let mode_gen = args.iter().any(|a| a == "--generate-hash");
let repeat: usize = arg_val(&args, "--repeat").and_then(|v| v.parse().ok()).unwrap_or(0);
let (pred, gt, vis, scale) = build_fixture();
let result = evaluate_joint_error(&pred, &gt, &vis, &scale, &JointErrorThresholds::default());
let proof = hash_hex(&canonical_bytes(&result));
// Inputs: real split+pred if provided, else the deterministic fixture.
let (pred, gt, vis, scale) = match (arg_val(&args, "--split"), arg_val(&args, "--pred")) {
(Some(s), Some(p)) => match load_inputs(s, p) {
Ok(v) => v,
Err(e) => {
eprintln!("input error: {e}");
return ExitCode::FAILURE;
}
},
_ => build_fixture(),
};
let w = score(&pred, &gt, &vis, &scale);
// ── Repeatability analysis: run K times, confirm an identical proof hash ──
if repeat > 0 {
let mut hashes = std::collections::BTreeSet::new();
for _ in 0..repeat {
let wi = score(&pred, &gt, &vis, &scale);
hashes.insert(wi.proof_sha256);
}
let repeatable = hashes.len() == 1;
println!(
"{{\"repeatability\":{{\"runs\":{},\"unique_proof_hashes\":{},\"repeatable\":{},\"proof_sha256\":\"{}\"}}}}",
repeat, hashes.len(), repeatable, w.proof_sha256
);
return if repeatable { ExitCode::SUCCESS } else {
eprintln!("REPEATABILITY FAIL: {} distinct hashes across {} runs (nondeterminism)", hashes.len(), repeat);
ExitCode::FAILURE
};
}
if mode_gen {
// Emit just the hash (stdout) for redirection into expected_score.sha256.
println!("{proof}");
println!("{}", w.proof_sha256);
return ExitCode::SUCCESS;
}
if mode_json {
// One leaderboard-ledger-shaped row (ADR-149 §2.2).
println!(
"{{\"category\":\"pose\",\"harness_version\":{},\"pck_all\":{:.4},\"pck_torso\":{:.4},\"oks\":{:.4},\"jitter_rms_m\":{:.5},\"max_error_p95_m\":{:.5},\"pose_passes\":{},\"proof_sha256\":\"{}\"}}",
AA_HARNESS_VERSION,
result.pck_all, result.pck_torso, result.oks,
result.jitter_rms_m, result.max_error_p95_m, result.passes, proof
);
println!("{}", witness_json(&w));
return ExitCode::SUCCESS;
}
// Default: verify against the committed expected hash (CI gate).
// Default: determinism gate against the committed expected hash (CI).
println!(
"AA pose witness: PCK_all={:.4} PCK_torso={:.4} OKS={:.4} jitter={:.5}m p95={:.5}m passes={}",
w.result.pck_all, w.result.pck_torso, w.result.oks,
w.result.jitter_rms_m, w.result.max_error_p95_m, w.result.passes
);
println!("AA inputs_sha256: {}", w.inputs_sha256);
println!("AA proof_sha256: {}", w.proof_sha256);
let expected_path = concat!(env!("CARGO_MANIFEST_DIR"), "/../../../aether-arena/fixtures/expected_score.sha256");
let expected = std::fs::read_to_string(expected_path)
.ok()
.map(|s| s.trim().to_string());
println!("AA pose score: PCK_all={:.4} PCK_torso={:.4} OKS={:.4} jitter={:.5}m p95={:.5}m passes={}",
result.pck_all, result.pck_torso, result.oks, result.jitter_rms_m, result.max_error_p95_m, result.passes);
println!("AA proof sha256: {proof}");
match expected {
Some(exp) if exp == proof => {
match std::fs::read_to_string(expected_path).ok().map(|s| s.trim().to_string()) {
Some(exp) if exp == w.proof_sha256 => {
println!("VERDICT: PASS (determinism hash matches expected)");
ExitCode::SUCCESS
}
Some(exp) => {
eprintln!("VERDICT: FAIL — scorer drift detected.\n expected: {exp}\n actual: {proof}");
eprintln!("If this change to the scoring maths is intentional, regenerate with --generate-hash and review the diff.");
eprintln!("VERDICT: FAIL — scorer drift.\n expected: {exp}\n actual: {}", w.proof_sha256);
eprintln!("If intentional, regenerate with --generate-hash and review the diff.");
ExitCode::FAILURE
}
None => {