mirror of
https://github.com/ruvnet/RuView
synced 2026-06-09 10:13:17 +00:00
0fbdd15955
# Verifying AetherArena (you don't have to trust us)

AetherArena's (AA's) credibility rests on a stranger being able to reproduce a score and see that the rules are fair. This is the **launch gate** (ADR-149 §7): v0 does not ship until all five checks below pass for someone with no insider access.

> **Wider context:** this page covers the *leaderboard scorer*. For the whole-platform answer to
> "is this real / does it actually work?" (including the deterministic pipeline proof, the
> published models + public-benchmark numbers, and the built-in-public development trail), see
> [`docs/proof-of-capabilities.md`](../docs/proof-of-capabilities.md).

## The open scorer

The scoring engine is a pure-Rust, GPU-free binary: `aa_score_runner` in `wifi-densepose-train`. It runs the real `ruview_metrics` pose-acceptance harness on a fixed fixture and emits a cross-platform-stable SHA-256 **determinism proof**.

### Reproduce the determinism hash locally

```bash
cd v2
# Verify the committed expected hash still matches (this is the CI gate):
cargo run -q -p wifi-densepose-train --bin aa_score_runner --no-default-features
# → prints the witness (inputs_sha256 + proof_sha256) and "VERDICT: PASS"

# See the witness row as JSON:
cargo run -q -p wifi-densepose-train --bin aa_score_runner --no-default-features -- --json
```

### Witness chain: proof + repeatability analysis

Every score is a **witness**: `inputs_sha256` (binds it to the exact inputs scored),
`proof_sha256` (cross-platform-stable hash of the quantised score), and `harness_version`.
Witnesses are recorded in an **append-only, hash-chained ledger** (each row references
the previous row's hash), so a silent edit to any past row breaks the chain.

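To make the three witness fields concrete, here is a minimal Python sketch of how such a record could be assembled. The field names come from this page; everything else (the `QUANT_SCALE` precision, the JSON canonicalisation, the placeholder `HARNESS_VERSION`, and the helper names) is a hypothetical stand-in and not necessarily how `aa_score_runner` does it. It also illustrates why hashing a *quantised* score helps stability: two floating-point results that differ only in the last bits still map to the same proof hash.

```python
import hashlib
import json

HARNESS_VERSION = "0.0.0-example"  # hypothetical placeholder, not the real version
QUANT_SCALE = 1_000_000            # hypothetical: keep 6 decimal places


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def canonical(obj) -> bytes:
    # Sorted keys + fixed separators give one byte string per logical value.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def make_witness(inputs: bytes, scores: dict) -> dict:
    # Quantise to integers before hashing so last-bit float noise is absorbed.
    quantised = {name: round(value * QUANT_SCALE) for name, value in scores.items()}
    return {
        "inputs_sha256": sha256_hex(inputs),
        "proof_sha256": sha256_hex(canonical(quantised)),
        "harness_version": HARNESS_VERSION,
    }


inputs = b"example split + predictions"
raw_a = (0.1 + 0.2) + 0.3  # same maths, different summation order
raw_b = (0.3 + 0.2) + 0.1
witness_a = make_witness(inputs, {"score": raw_a})
witness_b = make_witness(inputs, {"score": raw_b})

print(raw_a == raw_b)                                         # False
print(witness_a["proof_sha256"] == witness_b["proof_sha256"])  # True
```

The raw floats disagree in their final digit, but the quantised proof hashes match, which is the property a cross-platform determinism check relies on.
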
```bash
# Repeatability: run the scorer K times, confirm ONE identical proof hash:
cd v2
cargo run -q -p wifi-densepose-train --bin aa_score_runner --no-default-features -- --repeat 16
# → {"repeatability":{"runs":16,"unique_proof_hashes":1,"repeatable":true,...}}

# Real model scoring (score predictions against an eval split):
cargo run -q -p wifi-densepose-train --bin aa_score_runner --no-default-features -- \
  --split ../aether-arena/fixtures/smoke_split.json \
  --pred ../aether-arena/fixtures/smoke_pred.json --json

# Verify the witness ledger chain is intact (tamper-evident):
cd ../aether-arena/ledger && python3 ledger_tools.py verify
# → "OK: N rows, chain intact" (edit any row and it reports the broken link)
```

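The tamper-evidence property of a hash-chained ledger can be sketched in a few lines of Python. This is an illustration of the general technique only: the row layout (`prev_hash`, `witness`, `row_hash`), the all-zero `GENESIS` value, and the function names are assumptions, and the actual format used by `ledger_tools.py` may differ.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical "previous hash" for the first row


def row_hash(row: dict) -> str:
    # Hash everything in the row except the stored hash itself.
    body = {k: v for k, v in row.items() if k != "row_hash"}
    payload = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append(ledger: list, witness: dict) -> None:
    # Each new row commits to the hash of the row before it.
    prev = ledger[-1]["row_hash"] if ledger else GENESIS
    row = {"prev_hash": prev, "witness": witness}
    row["row_hash"] = row_hash(row)
    ledger.append(row)


def verify(ledger: list) -> str:
    prev = GENESIS
    for index, row in enumerate(ledger):
        # A row is valid only if it links to its predecessor AND its own
        # contents still hash to the value recorded when it was appended.
        if row["prev_hash"] != prev or row["row_hash"] != row_hash(row):
            return f"BROKEN at row {index}"
        prev = row["row_hash"]
    return f"OK: {len(ledger)} rows, chain intact"


ledger = []
for n in range(3):
    append(ledger, {"proof_sha256": f"example-proof-{n}", "harness_version": "example"})

status_before = verify(ledger)
ledger[1]["witness"]["proof_sha256"] = "silently-edited"  # tamper with a past row
status_after = verify(ledger)

print(status_before)  # OK: 3 rows, chain intact
print(status_after)   # BROKEN at row 1
```

Repairing the edited row's own hash would not hide the change either, because the next row's `prev_hash` would then stop matching.
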
The expected hash is committed at [`fixtures/expected_score.sha256`](fixtures/expected_score.sha256). Same harness version + same fixture → same hash on glibc / MSVC / Apple. If your local run prints `VERDICT: PASS`, you have reproduced the scorer.

### What happens if the scoring maths changes

Any edit to `ruview_metrics.rs`, `ablation.rs`, or `aa_score_runner.rs` moves the hash and **fails the CI gate** (`.github/workflows/aether-arena-harness.yml`) until the maintainer regenerates and reviews:

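```bash
# The redirect path below is relative to the repo root; if you run this from v2/
# like the commands above, use ../aether-arena/fixtures/expected_score.sha256 instead.
cargo run -p wifi-densepose-train --bin aa_score_runner --no-default-features -- --generate-hash \
  > aether-arena/fixtures/expected_score.sha256
```
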
So a scorer change is always a reviewed, public diff, never a silent one. That's `harness_version` pinning + `determinism_gate` in action (ADR-149 §2.4–§2.5).

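The gate logic described above amounts to "recompute the proof hash, compare it with the committed one". A minimal Python sketch follows, under stated assumptions: the toy scorers, the fixture values, the 6-decimal quantisation, and the `VERDICT: FAIL` string are all hypothetical (this page only documents `VERDICT: PASS`), and `expected` stands in for the contents of `fixtures/expected_score.sha256`.

```python
import hashlib

FIXTURE = [0.25, 0.5, 0.75]  # hypothetical fixed inputs


def scorer_v1(x: float) -> float:
    return x * x


def scorer_v2(x: float) -> float:
    return x * x + 0.001  # a small change to the scoring maths


def proof_hash(scorer, fixture) -> str:
    # Quantise each score, then hash a fixed textual encoding of the result.
    quantised = [round(scorer(x) * 1_000_000) for x in fixture]
    return hashlib.sha256(",".join(map(str, quantised)).encode("utf-8")).hexdigest()


def determinism_gate(scorer, fixture, expected: str) -> str:
    actual = proof_hash(scorer, fixture)
    return "VERDICT: PASS" if actual == expected.strip() else "VERDICT: FAIL"


# Stand-in for the committed expected hash, generated from the reviewed scorer.
expected = proof_hash(scorer_v1, FIXTURE)

verdict_unchanged = determinism_gate(scorer_v1, FIXTURE, expected)
verdict_changed = determinism_gate(scorer_v2, FIXTURE, expected)

print(verdict_unchanged)  # VERDICT: PASS
print(verdict_changed)    # VERDICT: FAIL
```

In this sketch the only way to get back to a passing verdict after changing the scorer is to regenerate `expected`, and that regeneration is the step that shows up as a reviewable diff.
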
## The five-step acceptance test (v0 launch gate)

A stranger must be able to:

1. **Submit** a model (artifact + `schema/aa-submission.toml`) with no insider help.
2. **Get a deterministic score**: same model + same `harness_version` → same numbers.
3. **See the signed row** appended to the public results ledger.
4. **Rerun the scorer locally** on the public smoke split and reproduce the logic (the `--split` / `--pred` command above).
5. **Understand why the rank is fair** (private split, open scorer, pinned version, proof hash) from these docs alone.

If any step fails, v0 is not ready.

## Current status

- ✅ Step 4 (rerun the open scorer locally, reproduce the hash): **works today** via `aa_score_runner`.
- ✅ CI harness gate runs the scorer on every PR.
- ⏳ Steps 1–3, 5 (HF Space submission flow + signed ledger): in progress; they require the HF Space deploy (needs an HF token / maintainer authorization).