ruvnet--RuView/aether-arena/VERIFY.md
ruv a6808568a2 feat(aether-arena): ADR-149 spatial-intelligence benchmark — scorer + CI harness gate (M1-M4)
AetherArena ("AA") — the official, project-agnostic Spatial-Intelligence Benchmark
(ADR-149, Accepted). Iteration 1 of the long-horizon build:

- ADR-149 accepted: name locked (ruvnet/aether-arena), v0 metrics locked
  (pose/presence/latency/determinism), dataset legality resolved (MM-Fi CC BY-NC
  only; Wi-Pose excluded). Adds four-part framing, threat model, arena_score
  formula, submission state machine, neutrality/governance, and the §7 acceptance test.
- aa_score_runner: deterministic scorer bin reusing the real ruview_metrics pose
  harness on a fixed seed=42 fixture → RuViewTier-style verdict + cross-platform
  SHA-256 proof hash. Builds --no-default-features (no torch/GPU). VERDICT: PASS.
- CI harness gate: .github/workflows/aether-arena-harness.yml runs the scorer on
  every PR — the "PR that runs the harness as part of the build" requirement.
- Scaffold: aether-arena/{README,VERIFY,STATUS}.md + schema/aa-submission.toml.
- Horizon record persisted (.claude-flow/horizons/aether-arena-aa.json).

Infra = the deliverable; model SOTA (MM-Fi PCK@20) is a separate effort blocked on
ADR-079 data collection, tracked as a stretch goal, not an infra exit.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-30 16:47:22 -04:00


Verifying AetherArena (you don't have to trust us)

AA's credibility rests on a stranger being able to reproduce a score and see that the rules are fair. This is the launch gate (ADR-149 §7): v0 does not ship until all five checks below pass for someone with no insider access.

The open scorer

The scoring engine is a pure-Rust, GPU-free binary: aa_score_runner in wifi-densepose-train. It runs the real ruview_metrics pose-acceptance harness on a fixed fixture and emits a cross-platform-stable SHA-256 determinism proof.

Reproduce the determinism hash locally

cd v2
# Verify the committed expected hash still matches (this is the CI gate):
cargo run -q -p wifi-densepose-train --bin aa_score_runner --no-default-features
# → prints the score, the proof sha256, and "VERDICT: PASS"

# See the leaderboard-ledger row as JSON:
cargo run -q -p wifi-densepose-train --bin aa_score_runner --no-default-features -- --json

The expected hash is committed at fixtures/expected_score.sha256. Same harness version + same fixture → same hash on glibc / MSVC / Apple. If your local run prints VERDICT: PASS, you have reproduced the scorer.
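
To make the comparison explicit rather than relying on the verdict line alone, a minimal sketch follows. The `hashes_match` helper is hypothetical (it is not part of the repository), and the sketch assumes that `--generate-hash` writes exactly the content of the committed file to stdout, as the regeneration command in this document suggests.

```shell
#!/bin/sh
# hashes_match EXPECTED_FILE ACTUAL_FILE
# Hypothetical helper, not part of the AetherArena repo.
# Succeeds only when both files hold the same non-empty content
# once all whitespace (including trailing newlines) is stripped.
hashes_match() {
    expected=$(tr -d '[:space:]' < "$1")
    actual=$(tr -d '[:space:]' < "$2")
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Usage sketch (adjust the fixture path to your checkout layout):
#   cargo run -q -p wifi-densepose-train --bin aa_score_runner \
#     --no-default-features -- --generate-hash > /tmp/actual.sha256
#   hashes_match aether-arena/fixtures/expected_score.sha256 /tmp/actual.sha256 \
#     && echo "hash reproduced"
```

Stripping whitespace keeps the check robust to a trailing newline or CRLF line endings, which can differ between platforms even when the digest itself is identical.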

What happens if the scoring maths changes

Any edit to ruview_metrics.rs, ablation.rs, or aa_score_runner.rs moves the hash and fails the CI gate (.github/workflows/aether-arena-harness.yml) until a maintainer regenerates the expected hash and the change is reviewed:

cargo run -p wifi-densepose-train --bin aa_score_runner --no-default-features -- --generate-hash \
  > aether-arena/fixtures/expected_score.sha256

So a scorer change is always a reviewed, public diff — never silent. That's harness_version pinning + determinism_gate in action (ADR-149 §2.4–§2.5).
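
The reason a change cannot slip through unnoticed is a general property of SHA-256: any difference in the hashed bytes produces a different digest. The sketch below illustrates that property only. It assumes the proof hash is taken over the scorer's output; the `pck20=...` strings are made-up placeholders, not real scorer output, and `sha256_hex` is a hypothetical helper.

```shell
#!/bin/sh
# sha256_hex: hypothetical helper that prints the hex SHA-256 of stdin.
# Uses sha256sum (GNU coreutils) when present, else shasum (macOS/Perl).
sha256_hex() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum | cut -d' ' -f1
    else
        shasum -a 256 | cut -d' ' -f1
    fi
}

# Placeholder score lines that differ only in the last digit.
before=$(printf '%s' 'pck20=0.812300' | sha256_hex)
after=$(printf '%s' 'pck20=0.812301' | sha256_hex)

# A one-digit change yields a different digest, so the gate trips.
if [ "$before" != "$after" ]; then
    echo "hash moved"
fi
```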

The five-step acceptance test (v0 launch gate)

A stranger must be able to:

  1. Submit a model (artifact + schema/aa-submission.toml) with no insider help.
  2. Get a deterministic score — same model + same harness_version → same numbers.
  3. See the signed row appended to the public results ledger.
  4. Rerun the scorer locally on the public smoke split and reproduce the logic (the command above).
  5. Understand why the rank is fair — private split, open scorer, pinned version, proof hash — from these docs alone.

If any step fails, v0 is not ready.
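
Step 2 can be spot-checked locally by running the same command twice and comparing the output. This is a sketch under assumptions: `is_deterministic` is a hypothetical helper, and it only makes sense for output that contains no timestamps or timing data, which is why the usage note applies it to the `--generate-hash` output rather than the JSON row.

```shell
#!/bin/sh
# is_deterministic CMD [ARGS...]
# Hypothetical helper, not part of the AetherArena repo.
# Runs the command twice and compares the captured stdout.
# Returns 0 if both runs match, 1 if they differ, 2 if a run fails.
is_deterministic() {
    first=$("$@") || return 2
    second=$("$@") || return 2
    [ "$first" = "$second" ]
}

# Usage sketch:
#   is_deterministic cargo run -q -p wifi-densepose-train \
#     --bin aa_score_runner --no-default-features -- --generate-hash \
#     && echo "two runs, same hash"
```

Two matching runs on one machine are a weaker check than the committed cross-platform hash, so treat this as a quick sanity test rather than a substitute for the CI gate.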

Current status

  • Step 4 (rerun the open scorer locally, reproduce the hash) — works today via aa_score_runner.
  • CI harness gate runs the scorer on every PR.
  • Steps 1–3 and 5 (HF Space submission flow + signed ledger): in progress; they require the HF Space deploy (needs an HF token / maintainer authorization).