ci(bench): wire v2 criterion benches into CI as a compile-verify regression gate

Sub-deliverable 8.3 of the benchmark/optimization milestone (needs ADR slot 174).

The v2/ workspace ships 26 criterion benches across 18 crates, but benches are
not part of `cargo test`, so nothing in CI compiled them and they silently rot
when a public API they call changes.

Add `.github/workflows/bench-regression.yml`:
  - bench-compile (HARD GATE): `cargo bench --workspace --no-default-features
    --no-run` compiles + links every bench that is not feature-gated (no
    measurement); a second `--features cir` step compiles the cir-gated
    cir_bench. Together they are a real, deterministic regression guard
    against bench bit-rot.
  - bench-fast-run (INFORMATIONAL, continue-on-error, never gates): runs a
    curated pure-CPU subset (nvsim, ruvector sketch/fusion) in criterion
    quick-mode and uploads logs as an artifact.

No timing-regression gate, by design: wall-clock on shared GitHub runners varies
2-3x run-to-run, so a hard threshold or cross-runner `criterion --baseline`
compare would manufacture false failures. The honest scope is compile-verify +
informational-run; the workflow header documents the self-hosted-runner
condition under which true timing-gating becomes honest. The crv-gated crv_bench
is excluded because its crates.io dep ruvector-crv 0.1.1 fails to build upstream.
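
As a rough illustration of the trade-off above (a sketch only: the function
name, the 100 µs cost, and the 2.5x slowdown are hypothetical values picked
inside the 2-3x range, not measurements from this repo), a ratio gate on a
noisy runner either false-fails unchanged code or is too loose to catch a
real regression:

```rust
/// Naive timing gate: flags a regression when the candidate is more than
/// `max_ratio` times slower than the baseline. Hypothetical helper, not part
/// of the workflow.
fn gate_flags_regression(baseline_ns: f64, candidate_ns: f64, max_ratio: f64) -> bool {
    candidate_ns / baseline_ns > max_ratio
}

fn main() {
    // Assumed true cost of an unchanged bench, in nanoseconds.
    let true_cost_ns = 100_000.0;
    // Assumed slowdown of the runner the candidate happened to land on.
    let runner_slowdown = 2.5;

    let baseline = true_cost_ns;
    let unchanged_on_slow_runner = true_cost_ns * runner_slowdown;
    let real_regression_same_runner = true_cost_ns * 1.5;

    // A tight 10% threshold fails code that did not change at all.
    println!(
        "tight gate, unchanged code flagged: {}",
        gate_flags_regression(baseline, unchanged_on_slow_runner, 1.10)
    );
    // A threshold loose enough to absorb the noise misses a real 50% regression.
    println!(
        "loose gate, real regression flagged: {}",
        gate_flags_regression(baseline, real_regression_same_runner, 3.0)
    );
}
```

Neither threshold gives a trustworthy verdict on a shared runner, which is
why the timing job stays informational.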

Running the gate immediately caught one already-bit-rotted bench:
wifi-densepose-mat/detection_bench failed to compile (E0063: missing field
last_rssi in SensorPosition). Fixed (last_rssi: None) and re-verified.
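
A minimal sketch of this failure mode, using simplified stand-ins for the
real types (the `x`/`y` fields, the single-variant enum, and the
`Option<f64>` type of `last_rssi` are assumptions for illustration; only the
field names visible in the diff come from the source):

```rust
// Hypothetical, simplified mirror of the wifi-densepose-mat types.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum SensorType {
    Transceiver,
}

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct SensorPosition {
    x: f64,
    y: f64,
    z: f64,
    sensor_type: SensorType,
    is_operational: bool,
    // Field added after the bench was written; the payload type is assumed.
    last_rssi: Option<f64>,
}

// Struct literals must name every field, so a bench that builds the struct
// this way stops compiling (E0063) as soon as the struct gains a field.
fn create_test_sensors(count: usize) -> Vec<SensorPosition> {
    (0..count)
        .map(|i| SensorPosition {
            x: i as f64,
            y: 0.0,
            z: 1.5,
            sensor_type: SensorType::Transceiver,
            is_operational: true,
            // Deleting this line reproduces the E0063 "missing field" error.
            last_rssi: None,
        })
        .collect()
}

fn main() {
    let sensors = create_test_sensors(3);
    println!("{} sensors, first: {:?}", sensors.len(), sensors[0]);
}
```

Because `cargo test` never builds bench targets, only a `--no-run` bench
compile (or an actual bench run) surfaces this class of error.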

Validation (MEASURED): mat detection_bench + cir_bench + nvsim + ruvector +
vitals + swarm benches compile under --no-default-features; fast subset runs;
`cargo test -p wifi-densepose-mat --no-default-features` 174 passed / 0 failed;
Python proof PASS, hash f8e76f21...46f7a unchanged.

Co-Authored-By: claude-flow <ruv@ruv.net>
ruv
2026-06-15 07:38:26 -04:00
parent 90a88ada9a
commit c4c59e0859
3 changed files with 203 additions and 0 deletions
@@ -0,0 +1,199 @@
name: Bench Regression Guard
# Sub-deliverable 8.3 of the benchmark/optimization milestone.
#
# HONEST SCOPE (read this before assuming this gates on timing):
# * The `bench-compile` job is a REAL, HARD-FAILING regression gate. It runs
# `cargo bench --workspace --no-default-features --no-run`, which type-checks
# and links every criterion bench in the v2/ workspace that is not
# feature-gated, without running a single measurement (feature-gated
# benches are handled, or explicitly excluded, in the job comments below).
# Benches are not part of `cargo test`, so they silently bit-rot when a
# public API they call changes; this job catches that the moment it
# happens. This is the part of this workflow that can fail a PR.
#
# * The `bench-fast-run` job runs a small, curated subset of pure-CPU benches
# in criterion "quick mode" (short warm-up / measurement / 10 samples) and
# is INFORMATIONAL ONLY (`continue-on-error: true`). It does NOT gate on
# timing. Wall-clock timings on shared GitHub-hosted runners vary by
# 2-3x run-to-run (noisy neighbours, CPU throttling, no pinned frequency),
# so a hard ">X ms" threshold here would flake constantly and teach
# everyone to ignore it. We deliberately do not pretend to do timing
# regression-gating we cannot deliver reliably. The numbers are surfaced in
# the job log + uploaded as an artifact for humans to eyeball trends.
#
# WHY NO criterion --baseline COMPARE GATE:
# criterion's `--save-baseline` / `--baseline` compare is the textbook
# regression mechanism, but it only produces a trustworthy verdict when the
# baseline and the candidate were measured on the SAME hardware under the SAME
# conditions. GitHub-hosted runners give neither (the baseline commit and the
# PR commit land on different physical machines). Committing a baseline JSON
# measured on one runner and comparing a different runner against it would
# manufacture false regressions. If/when these benches run on a dedicated,
# frequency-pinned self-hosted runner, a `--baseline` compare with a generous
# (>2x) noise floor becomes honest and can be added then. Until then,
# compile-verify + informational-run is the honest gate.
on:
  push:
    branches: [ main, develop, 'feat/*' ]
    paths:
      - 'v2/crates/**/benches/**'
      - 'v2/crates/**/Cargo.toml'
      - 'v2/crates/**/src/**'
      - 'v2/Cargo.toml'
      - 'v2/Cargo.lock'
      - '.github/workflows/bench-regression.yml'
  pull_request:
    paths:
      - 'v2/crates/**/benches/**'
      - 'v2/crates/**/Cargo.toml'
      - 'v2/crates/**/src/**'
      - 'v2/Cargo.toml'
      - 'v2/Cargo.lock'
      - '.github/workflows/bench-regression.yml'
  workflow_dispatch:
permissions:
  contents: read
env:
  CARGO_TERM_COLOR: always
  # Debuginfo is useless in CI and the 38-crate workspace target dir otherwise
  # exhausts the runner disk (mirrors ci.yml's rust-tests job). The bench
  # profile inherits release + debug = true (v2/Cargo.toml [profile.bench]);
  # force it off so the link step does not run out of space.
  CARGO_PROFILE_BENCH_DEBUG: "0"
  CARGO_PROFILE_RELEASE_DEBUG: "0"
jobs:
  # ── HARD GATE: every bench must still compile + link ─────────────────────
  bench-compile:
    name: bench compile-verify (--no-run)
    runs-on: ubuntu-latest
    steps:
      - name: Checkout (recursive — wifi-densepose-rufield path-deps vendor/rufield)
        uses: actions/checkout@v4
        with:
          # The workspace includes `wifi-densepose-rufield`, which path-deps the
          # `vendor/rufield` submodule crates. Without a recursive checkout the
          # whole workspace fails to resolve before any bench is built.
          submodules: recursive
      # The workspace pulls in `wifi-densepose-desktop` (Tauri v2) whose -sys
      # crates need the GTK/WebKit/serial dev libraries via pkg-config, exactly
      # as ci.yml's rust-tests job documents. A `--workspace` bench build links
      # the whole graph, so these are required here too.
      - name: Install Tauri / GTK / serial system dev libraries
        run: |
          sudo apt-get update
          sudo apt-get install -y --no-install-recommends \
            libglib2.0-dev \
            libgtk-3-dev \
            libsoup-3.0-dev \
            libjavascriptcoregtk-4.1-dev \
            libwebkit2gtk-4.1-dev \
            libayatana-appindicator3-dev \
            librsvg2-dev \
            libxdo-dev \
            libudev-dev \
            libdbus-1-dev \
            libssl-dev \
            pkg-config
      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable
      - name: Cache cargo (Swatinem/rust-cache)
        uses: Swatinem/rust-cache@v2
        with:
          workspaces: v2
          # Distinct cache scope from ci.yml's rust-tests so the bench profile
          # artifacts (release+opt) do not evict the test profile cache.
          key: bench-regression
      # The core regression guard. `--no-run` compiles + links every bench
      # target in the workspace that is not feature-gated but runs no
      # measurement, so it is deterministic and fast-ish (build only). A bench
      # that no longer compiles (because a type/signature it calls changed and
      # nobody updated the bench) fails the build here. `--no-default-features`
      # is the workspace's standard gate flag (openblas/tch/ort/onnx stay
      # opted out).
      - name: Compile all workspace benches (default features)
        working-directory: v2
        run: cargo bench --workspace --no-default-features --no-run
      # Feature-gated benches are skipped by the default build above because
      # their `[[bench]]` entries carry `required-features`. Compile the ones we
      # can guard so they are also covered against bit-rot.
      #   * cir → wifi-densepose-signal/benches/cir_bench.rs (ADR-134). The
      #     `cir` feature is pure-Rust (`cir = []`), so it builds on the stock
      #     runner and is a real, hard-failing guard like the step above.
      #
      # NOT guarded here (honest scope):
      #   * crv → wifi-densepose-ruvector/benches/crv_bench.rs. The `crv` feature
      #     pulls the crates.io dependency `ruvector-crv 0.1.1`, which currently
      #     FAILS to compile on stable (E0308 type mismatch in its own
      #     `stage_iii.rs` — an UPSTREAM bug, unrelated to bench bit-rot).
      #     Adding a hard `--features crv` compile step would make this workflow
      #     red for a reason this gate is not meant to police. Re-add this step
      #     once `ruvector-crv` ships a fixed release. (mqtt/onnx benches are
      #     likewise left to their own crate workflows.)
      - name: Compile feature-gated benches (cir)
        working-directory: v2
        run: cargo bench -p wifi-densepose-signal --no-default-features --features cir --bench cir_bench --no-run
  # ── INFORMATIONAL: run a curated fast subset (never gates) ───────────────
  bench-fast-run:
    name: bench fast-run (informational, non-gating)
    runs-on: ubuntu-latest
    # NEVER fail the workflow on this job — timings are noise-prone on shared
    # runners (see header). It exists to surface trends for humans, not to gate.
    continue-on-error: true
    needs: [bench-compile]
    steps:
      - name: Checkout (recursive)
        uses: actions/checkout@v4
        with:
          submodules: recursive
      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable
      - name: Cache cargo (Swatinem/rust-cache)
        uses: Swatinem/rust-cache@v2
        with:
          workspaces: v2
          key: bench-regression
      # Curated subset = pure-CPU, fast, dependency-light criterion benches that
      # finish in seconds under quick-mode flags. Each is targeted by `--bench`
      # (NOT a bare `cargo bench -p`) because the crates' lib targets use the
      # libtest harness, which rejects criterion's CLI flags (--warm-up-time
      # etc.) and aborts the run. Quick-mode: 1s warm-up, 2s measure, 10 samples.
      # Each step sets `pipefail`: the default `run` shell does not, so `tee`
      # would otherwise mask a failing `cargo bench` exit status in the log.
      - name: nvsim pipeline_throughput (quick)
        working-directory: v2
        run: |
          set -o pipefail
          mkdir -p ../bench-out
          cargo bench -p nvsim --no-default-features --bench pipeline_throughput -- \
            --warm-up-time 1 --measurement-time 2 --sample-size 10 \
            | tee ../bench-out/nvsim_pipeline_throughput.txt
      - name: ruvector sketch_bench (quick)
        working-directory: v2
        run: |
          set -o pipefail
          cargo bench -p wifi-densepose-ruvector --no-default-features --bench sketch_bench -- \
            --warm-up-time 1 --measurement-time 2 --sample-size 10 \
            | tee ../bench-out/ruvector_sketch_bench.txt
      - name: ruvector fusion_bench (quick)
        working-directory: v2
        run: |
          set -o pipefail
          cargo bench -p wifi-densepose-ruvector --no-default-features --bench fusion_bench -- \
            --warm-up-time 1 --measurement-time 2 --sample-size 10 \
            | tee ../bench-out/ruvector_fusion_bench.txt
      - name: Upload informational bench logs
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: bench-fast-run-logs
          path: bench-out/
          if-no-files-found: warn
@@ -15,6 +15,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added
- **Metric-locked PCK/MPJPE accuracy harness — resolves the PCK-definition ambiguity (`wifi-densepose-train`, needs ADR slot 173).** The SOTA brief (`docs/research/sota-nn-train-benchmark-brief.md` §1, §3.1, §4) found the single biggest threat to any "beyond-SOTA" claim is **metric ambiguity**: three PCK@20 figures (96.09% WiFlow-STD image-normalized, 81.63% AetherArena torso-PCK, 61.1% GraphPose-Fi standard PCK) cannot be lined up because each silently uses a different normalization — the project was retracted twice over this (a withdrawn "92.9%" used *absolute* pixels, not torso). New `src/accuracy.rs` makes the normalizer **explicit, selectable, and carried with every reported number**: a `PckNormalization` enum (`TorsoDiameter` = standard MM-Fi/GraphPose-Fi hip↔hip; `BoundingBoxDiagonal` = looser WiFlow-STD image-normalized; `AbsolutePixels(threshold)` = the retracted convention, included so historical numbers are reproducible and clearly labeled non-comparable); one canonical `pck_at(pred, gt, vis, k, normalization)` reusing the `metrics_core` geometric primitives (hip distance, bbox diagonal — no duplicate kernel); `mpjpe(pred, gt, vis)` (2D/3D, mm); and a self-describing `PoseAccuracy { pck_at: BTreeMap<u8,f32>, mpjpe, normalization, n_keypoints, n_frames }` returned by `accuracy_report(frames, ks, normalization)` so an **unlabeled PCK number is structurally impossible**. **17 hand-computed deterministic tests** (no GPU, no datasets) prove the harness arithmetic: perfect→PCK=1.0/MPJPE=0; all-just-outside→0.0; half-in-half-out→0.5; the **key proof** that identical predictions score 0.50 (torso) / 1.00 (bbox) / 0.75 (abs) under the three normalizations (the ambiguity is real and the definitions are distinct); MPJPE 2D/3D fixtures; and graceful degenerate handling (zero torso, empty frames, NaN coords — no panic, never a false-perfect). 
**This is measurement infrastructure, not an accuracy claim** — the tests prove the harness is correct, not that any model is good. `wifi-densepose-train` lib 191→206, `test_metrics` 12→14, 0 failed. Python deterministic proof unchanged (off the signal proof path).
- **CI bench-regression guard (`.github/workflows/bench-regression.yml`) — wires the v2/ criterion benches into CI as a real, hard-failing COMPILE-VERIFY gate + an informational fast-run; caught + fixed one already-bit-rotted bench (benchmark/optimization milestone sub-deliverable 8.3; needs ADR slot 174).** The v2/ workspace ships **26 criterion benches across 18 crates** (e.g. `nvsim/pipeline_throughput`, `wifi-densepose-ruvector/{ann,sketch,fusion}_bench`, `wifi-densepose-signal/{signal,dsp_perf,features,calibration,aether_prefilter,cir}_bench`, `wifi-densepose-mat/detection_bench`, `wifi-densepose-nn/{inference,native_conv,onnx}_bench`, `wifi-densepose-engine/engine_cycle`, …) but, because benches are **not** part of `cargo test`, nothing in CI compiled them — so they silently rot when a public API they call changes. **Proof this matters (MEASURED):** running the new gate on the current tree immediately caught `wifi-densepose-mat/detection_bench` failing to compile (`E0063: missing field last_rssi in initializer of SensorPosition` — the struct gained a field, the bench was never updated); fixed in this change (`last_rssi: None`, the simulated-zone convention) and re-verified (`cargo bench -p wifi-densepose-mat --no-default-features --bench detection_bench --no-run` → `Finished`, executable produced). 
**HONEST SCOPE — what gates vs what is informational:** (1) `bench-compile` (HARD GATE) runs `cargo bench --workspace --no-default-features --no-run` (compile + link every default-feature bench, no measurement) plus a `--features cir` compile of the gated `cir_bench` — a deterministic, real regression guard against bench bit-rot; (2) `bench-fast-run` (INFORMATIONAL, `continue-on-error: true`, NEVER gates) runs a curated pure-CPU subset (`nvsim/pipeline_throughput`, `ruvector/{sketch,fusion}_bench`) in criterion quick-mode (1s warm-up / 2s measure / 10 samples), targeted per-`--bench` (the crates' libtest lib targets reject criterion flags), and uploads the logs as an artifact. **No timing-regression gate, by design and stated in the workflow header:** wall-clock on shared GitHub runners varies 2-3x run-to-run, so a hard threshold or a cross-runner `criterion --baseline` compare would manufacture false failures; that becomes honest only on a frequency-pinned self-hosted runner (documented as the re-add condition). The `crv`-gated `ruvector/crv_bench` is deliberately NOT compiled by the gate because its crates.io dep `ruvector-crv 0.1.1` currently fails to build on stable (upstream E0308 in its own `stage_iii.rs`) — noted in-workflow with the re-add condition. Checkout is `submodules: recursive` (the workspace path-deps `vendor/rufield`) and installs the Tauri/GTK dev libs like `ci.yml`'s rust-tests job (a `--workspace` bench link pulls the whole graph). **MEASURED locally (Windows, `--no-default-features`):** `nvsim`, `wifi-densepose-ruvector` (sketch/fusion/ann), `wifi-densepose-signal/cir_bench`, `wifi-densepose-mat/detection_bench` (post-fix), `wifi-densepose-vitals/vitals_bench`, and `ruview-swarm/swarm_bench` all compile + the fast subset runs (sample baseline: `nvsim pipeline_run/d1/256` ≈ 55 µs, `d16/1024` ≈ 315 µs; `ruvector sketch_hamming` ≈ 3-7 ns vs `float_l2` ≈ 63-371 ns). 
The full `--workspace` `--no-run` could **not** be fully validated on Windows (Tauri-`desktop` needs GTK, `candle-core` fails on MSVC, `swarm_bench` LTO-links OOM under parallel pressure) — those are Windows-env artifacts that build in the Linux CI runner (each affected bench was confirmed to compile standalone here). No baseline JSON is committed (a cross-runner baseline would be dishonest). Python deterministic proof unchanged (`f8e76f21…46f7a`, bit-exact — off the signal proof path).
- **RuField `rufield-viewer` live-ingest mode — closes the RuView↔RuField visual loop (ADR-262 surfaces).** The dashboard gains `--source live --upstream <RuView-URL>`: it consumes RuView's `/ws/field` SSE (falling back to polling `/api/field`), **verifies every event's ed25519 provenance receipt on ingest** (`is_fusable`) — forged/tampered events are flagged ✗ and **never fused** into trusted inferences — and renders real RuView `FieldEvent`s through the same room-state/privacy-badge/fusion-graph/receipt path the synthetic mode uses (wire-compatible by construction: both sides use `rufield_core::FieldEvent` serde). **Strict banner honesty:** a single `BannerState` shows `SYNTHETIC` / `LIVE — <upstream>` / `DISCONNECTED — <upstream> unreachable`, mutually exclusive — never SYNTHETIC while showing live data or vice versa; live mode returns **409** on `/api/run` rather than fabricate a synthetic run, and starts DISCONNECTED until first verified contact. Default stays synthetic. 26 tests / 0 failed. `ruvnet/rufield` `crates/rufield-viewer`; `vendor/rufield` submodule bumped.
- **ADR-262 P3 — live RuField surface: RuView's running sensing-server now speaks RuField on `/api/field` + `/ws/field`.** Wires the P1 `wifi-densepose-rufield` bridge into the live `wifi-densepose-sensing-server` (the bridge is the only added coupling, ADR-262 §5.4). A new `src/rufield_surface.rs` module (kept out of the 8k-line `main.rs`) holds a `FieldSurface` with a **dedicated ed25519 `Signer`**, a bounded ring buffer of recent signed events (`FIELD_RING_CAPACITY = 64`), and the `/ws/field` broadcast topic; it exposes `GET /api/field` (latest signed `FieldEvent`s + signer pubkey + a `dev_signing_key` flag) and `GET /ws/field` (per-cycle stream, mirroring `/ws/sensing`), plus a standalone `router()` for isolated testing. **Tap:** at the ESP32 governed-trust cycle (`main.rs` `observe_cycle` ~`:5886` / `SensingUpdate` build ~`:5938`), `emit_rufield_event` joins the cycle's real `SensingUpdate` (features/classification/signal_field) with the engine's recorded `effective_class`/`demoted` trust state into a `SensingSnapshot` and surfaces a signed `FieldEvent`; **existing endpoints (`/ws/sensing` etc.) are unchanged; this is purely additive.** **Signer (defers the P2 key decision, §8 Q1):** a **standalone dev/sensing key** from `WDP_RUFIELD_SIGNING_SEED` (64-hex or ≥32-byte value), else a deterministic dev default with a logged `WARN` — reusing the `cog-ha-matter` Ed25519 key is the deferred P2 call, so P3 does not pre-empt it. **Egress privacy (fail-closed):** `network_egress_allowed` is *stricter* than `DefaultPrivacyGuard` for an unattended live surface — only **P1/P2** leave the box; P0 (raw) and P3/P4/P5 are held edge-local, so a `Derived → P4/P5` cycle **never** surfaces; no-presence cycles emit **no phantom event**. 
**P3 acceptance gates (`tests/rufield_surface_test.rs`, 4 integration via `tower::oneshot` + 4 module unit, 0 failed):** a well-formed **signed** event (`Modality::WifiCsi`, P2 not P1, `is_fusable` ed25519-verified, real timestamp); empty cycle → no phantom; **privacy-safety** — an injected `Derived` trust never surfaces; a mixed stream surfaces only egress-safe events. **Honest scope (ADR-262 §0/§6):** real plumbing on a **live endpoint**, **NOT accuracy** — single-link CSI with its existing caveats (no validated room-coordinate accuracy — `field_localize`), a dedicated dev signing key pending the P2 ownership decision, no accuracy claim. The win is narrowly: "RuView's live sensing now speaks RuField on `/ws/field`."
- **ADR-262 P1 — `wifi-densepose-rufield` anti-corruption bridge: RuView WiFi-CSI sensing → signed RuField `FieldEvent`s.** A new v2 workspace member (the *single coupling point* between RuView and the standalone RuField MFS spec, ADR-262 §5.4) that **path-deps** the `vendor/rufield` submodule crates (`rufield-core`/`-provenance`/`-privacy`/`-fusion` — pure-Rust, `--no-default-features`-buildable: serde/sha2/ed25519/toml only, no tch/openblas/ndarray/candle) and **no** RuView internal crate. The bridge takes owned primitives — `SensingSnapshot` mirrors the `/ws/sensing` `SensingUpdate` (features + classification + signal_field) joined with the `TrustedOutput` trust state (`trust_class`/`demoted`/`identity_bound`) — and `snapshot_to_field_event()` emits one **signed** `FieldEvent` (`Modality::WifiCsi`, axis `[Frequency]`): a real `FieldTensor` from the feature scalars with the real `timestamp_ns`; an `Observation` whose `range_m`/`motion_vector`/`space_cell` are derived from the strongest **signal-field peak** when present (else `None` — coordinates are **never fabricated**, per the `field_localize` caveat) and `confidence` from the classification; a real `ProvenanceRef` (sha256 over the tensor bytes, `synthetic=false`) **ed25519-signed** so `rufield_provenance::is_fusable` passes. **The §3.3 privacy mapping is the critical correctness item**, implemented as `map_privacy()` mapping RuView's class onto RuField P0–P5 **by information content, NEVER by byte value** and **fail-closed**: RuView `Derived` (byte `1`, which sorts *below* `Anonymous` byte `2`) carries an identity embedding → maps to **P4** (or **P5** if identity-bound), **never P1** (the single most dangerous mapping mistake); `Raw → P0`, `Anonymous → P2`, `Restricted → P2`; a governed-engine `demoted` cycle floors the egress class to ≥ P2 with raw suppressed. 
**P1 acceptance gates (15 tests / 0 failed — 5 unit + 9 integration + 1 doc):** round-trip (`SensingSnapshot → FieldEvent →` serde `→` equal), `is_fusable` (verified ed25519 receipt), `RuFieldFusion::ingest` accept + `infer()` runs, **privacy-safety** (`gate_privacy_safety_derived_never_maps_to_low_privacy`: `Derived → P4/P5`, never P1; a table test over every RuView class; fail-closed demotion), and determinism (same snapshot + same signer seed → byte-identical event). **Honest scope:** this is **P1 plumbing** — a tested conversion + a safe privacy mapping. It is **not** wired into the live server (that is P3) and makes **no accuracy claim** (RuField v0.1 is synthetic; RuView's single-link CSI carries its own caveats). CI: the `rust-tests` workflow checkout gains `submodules: recursive` so the path-deps resolve. Python deterministic proof unchanged (off the signal proof path).
@@ -220,6 +220,9 @@ fn create_test_sensors(count: usize) -> Vec<SensorPosition> {
z: 1.5,
sensor_type: SensorType::Transceiver,
is_operational: true,
// No live RSSI plumbed for synthetic bench sensors (simulated
// zone) — localization must not fabricate one.
last_rssi: None,
}
})
.collect()