ci(bench): wire v2 criterion benches into CI as a compile-verify regression gate

Sub-deliverable 8.3 of the benchmark/optimization milestone (needs ADR slot 174).

The v2/ workspace ships 26 criterion benches across 18 crates, but benches are not part of `cargo test`, so nothing in CI compiled them and they silently rot when a public API they call changes.

Add `.github/workflows/bench-regression.yml`:

- bench-compile (HARD GATE): `cargo bench --workspace --no-default-features --no-run` compiles + links every bench that is not feature-gated (no measurement), plus the cir-gated cir_bench. This is a real, deterministic regression guard against bench bit-rot.
- bench-fast-run (INFORMATIONAL, continue-on-error, never gates): runs a curated pure-CPU subset (nvsim, ruvector sketch/fusion) in criterion quick-mode and uploads logs as an artifact.

No timing-regression gate, by design: wall-clock on shared GitHub runners varies 2-3x run-to-run, so a hard threshold or cross-runner `criterion --baseline` compare would manufacture false failures. The honest scope is compile-verify + informational-run; the workflow header documents the self-hosted-runner condition under which true timing-gating becomes honest.

The crv-gated crv_bench is excluded because its crates.io dep ruvector-crv 0.1.1 fails to build upstream.

Running the gate immediately caught one already-bit-rotted bench: wifi-densepose-mat/detection_bench failed to compile (E0063: missing field last_rssi in SensorPosition). Fixed (last_rssi: None) and re-verified.

Validation (MEASURED): mat detection_bench + cir_bench + nvsim + ruvector + vitals + swarm benches compile under --no-default-features; fast subset runs; `cargo test -p wifi-densepose-mat --no-default-features` 174 passed / 0 failed; Python proof PASS, hash f8e76f21...46f7a unchanged.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-06-16 11:23:19 +00:00 · 2026-06-15 07:38:26 -04:00
parent 90a88ada9a
commit c4c59e0859
3 changed files with 203 additions and 0 deletions
@@ -0,0 +1,199 @@
+name: Bench Regression Guard
+
+# Sub-deliverable 8.3 of the benchmark/optimization milestone.
+#
+# HONEST SCOPE (read this before assuming this gates on timing):
+#   * The `bench-compile` job is a REAL, HARD-FAILING regression gate. It runs
+#     `cargo bench --workspace --no-default-features --no-run`, which compiles
+#     and links every non-feature-gated criterion bench in v2/ (plus cir_bench)
+#     without running a measurement. Benches are not part of `cargo test`, so
+#     they silently bit-rot when a public API they call changes; this job
+#     catches that at once. It is the part of this workflow that can fail a PR.
+#
+#   * The `bench-fast-run` job runs a small, curated subset of pure-CPU benches
+#     in criterion "quick mode" (short warm-up / measurement / 10 samples) and
+#     is INFORMATIONAL ONLY (`continue-on-error: true`). It does NOT gate on
+#     timing. Wall-clock timings on shared GitHub-hosted runners vary by
+#     2-3x run-to-run (noisy neighbours, CPU throttling, no pinned frequency),
+#     so a hard ">X ms" threshold here would flake constantly and teach
+#     everyone to ignore it. We deliberately do not pretend to do timing
+#     regression-gating we cannot deliver reliably. The numbers are surfaced in
+#     the job log + uploaded as an artifact for humans to eyeball trends.
+#
+# WHY NO criterion --baseline COMPARE GATE:
+#   criterion's `--save-baseline` / `--baseline` compare is the textbook
+#   regression mechanism, but it only produces a trustworthy verdict when the
+#   baseline and the candidate were measured on the SAME hardware under the SAME
+#   conditions. GitHub-hosted runners give neither (the baseline commit and the
+#   PR commit land on different physical machines). Committing a baseline JSON
+#   measured on one runner and comparing a different runner against it would
+#   manufacture false regressions. If/when these benches run on a dedicated,
+#   frequency-pinned self-hosted runner, a `--baseline` compare with a generous
+#   (>2x) noise floor becomes honest and can be added then. Until then,
+#   compile-verify + informational-run is the honest gate.
+
+on:
+  push:
+    branches: [ main, develop, 'feat/*' ]
+    paths:
+      - 'v2/crates/**/benches/**'
+      - 'v2/crates/**/Cargo.toml'
+      - 'v2/crates/**/src/**'
+      - 'v2/Cargo.toml'
+      - 'v2/Cargo.lock'
+      - '.github/workflows/bench-regression.yml'
+  pull_request:
+    paths:
+      - 'v2/crates/**/benches/**'
+      - 'v2/crates/**/Cargo.toml'
+      - 'v2/crates/**/src/**'
+      - 'v2/Cargo.toml'
+      - 'v2/Cargo.lock'
+      - '.github/workflows/bench-regression.yml'
+  workflow_dispatch:
+
+permissions:
+  contents: read
+
+env:
+  CARGO_TERM_COLOR: always
+  # Debuginfo is useless in CI and the 38-crate workspace target dir otherwise
+  # exhausts the runner disk (mirrors ci.yml's rust-tests job). The bench
+  # profile inherits release + debug = true (v2/Cargo.toml [profile.bench]);
+  # force it off so the link step does not run out of space.
+  CARGO_PROFILE_BENCH_DEBUG: "0"
+  CARGO_PROFILE_RELEASE_DEBUG: "0"
+
+jobs:
+  # ── HARD GATE: every bench must still compile + link ─────────────────────
+  bench-compile:
+    name: bench compile-verify (--no-run)
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout (recursive — wifi-densepose-rufield path-deps vendor/rufield)
+        uses: actions/checkout@v4
+        with:
+          # The workspace includes `wifi-densepose-rufield`, which path-deps the
+          # `vendor/rufield` submodule crates. Without a recursive checkout the
+          # whole workspace fails to resolve before any bench is built.
+          submodules: recursive
+
+      # The workspace pulls in `wifi-densepose-desktop` (Tauri v2) whose -sys
+      # crates need the GTK/WebKit/serial dev libraries via pkg-config, exactly
+      # as ci.yml's rust-tests job documents. A `--workspace` bench build links
+      # the whole graph, so these are required here too.
+      - name: Install Tauri / GTK / serial system dev libraries
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y --no-install-recommends \
+            libglib2.0-dev \
+            libgtk-3-dev \
+            libsoup-3.0-dev \
+            libjavascriptcoregtk-4.1-dev \
+            libwebkit2gtk-4.1-dev \
+            libayatana-appindicator3-dev \
+            librsvg2-dev \
+            libxdo-dev \
+            libudev-dev \
+            libdbus-1-dev \
+            libssl-dev \
+            pkg-config
+
+      - name: Install Rust toolchain
+        uses: dtolnay/rust-toolchain@stable
+
+      - name: Cache cargo (Swatinem/rust-cache)
+        uses: Swatinem/rust-cache@v2
+        with:
+          workspaces: v2
+          # Distinct cache scope from ci.yml's rust-tests so the bench profile
+          # artifacts (release+opt) do not evict the test profile cache.
+          key: bench-regression
+
+      # The core regression guard. `--no-run` compiles + links every bench
+      # target that builds WITHOUT optional features but runs no measurement,
+      # so it is deterministic and fast-ish (build only). A bench that no longer
+      # compiles, because a type/signature it calls changed and nobody updated
+      # the bench, fails the build here. `--no-default-features` is the
+      # workspace's standard gate flag (openblas/tch/ort/onnx stay disabled).
+      - name: Compile all workspace benches (no default features)
+        working-directory: v2
+        run: cargo bench --workspace --no-default-features --no-run
+
+      # Feature-gated benches are skipped by the default build above because
+      # their `[[bench]]` entries carry `required-features`. Compile the ones we
+      # can guard so they are also covered against bit-rot.
+      #   * cir → wifi-densepose-signal/benches/cir_bench.rs (ADR-134). The
+      #     `cir` feature is pure-Rust (`cir = []`), so it builds on the stock
+      #     runner and is a real, hard-failing guard like the step above.
+      #
+      # NOT guarded here (honest scope):
+      #   * crv → wifi-densepose-ruvector/benches/crv_bench.rs. The `crv` feature
+      #     pulls the crates.io dependency `ruvector-crv 0.1.1`, which currently
+      #     FAILS to compile on stable (E0308 type mismatch in its own
+      #     `stage_iii.rs` — an UPSTREAM bug, unrelated to bench bit-rot).
+      #     Adding a hard `--features crv` compile step would make this workflow
+      #     red for a reason this gate is not meant to police. Re-add this step
+      #     once `ruvector-crv` ships a fixed release. (mqtt/onnx benches are
+      #     likewise left to their own crate workflows.)
+      - name: Compile feature-gated benches (cir)
+        working-directory: v2
+        run: cargo bench -p wifi-densepose-signal --no-default-features --features cir --bench cir_bench --no-run
+
+  # ── INFORMATIONAL: run a curated fast subset (never gates) ───────────────
+  bench-fast-run:
+    name: bench fast-run (informational, non-gating)
+    runs-on: ubuntu-latest
+    # NEVER fail the workflow on this job — timings are noise-prone on shared
+    # runners (see header). It exists to surface trends for humans, not to gate.
+    continue-on-error: true
+    needs: [bench-compile]
+    steps:
+      - name: Checkout (recursive)
+        uses: actions/checkout@v4
+        with:
+          submodules: recursive
+
+      - name: Install Rust toolchain
+        uses: dtolnay/rust-toolchain@stable
+
+      - name: Cache cargo (Swatinem/rust-cache)
+        uses: Swatinem/rust-cache@v2
+        with:
+          workspaces: v2
+          key: bench-regression
+
+      # Curated subset = pure-CPU, fast, dependency-light criterion benches that
+      # finish in seconds under quick-mode flags. Each is targeted by `--bench`
+      # (NOT a bare `cargo bench -p`) because the crates' lib targets use the
+      # libtest harness, which rejects criterion's CLI flags (--warm-up-time
+      # etc.) and aborts the run. Quick-mode: 1s warm-up, 2s measure, 10 samples.
+      - name: nvsim pipeline_throughput (quick)
+        working-directory: v2
+        run: |
+          set -o pipefail  # otherwise tee's exit status hides a cargo failure
+          mkdir -p ../bench-out
+          cargo bench -p nvsim --no-default-features --bench pipeline_throughput -- \
+            --warm-up-time 1 --measurement-time 2 --sample-size 10 | tee ../bench-out/nvsim_pipeline_throughput.txt
+
+      - name: ruvector sketch_bench (quick)
+        working-directory: v2
+        run: |
+          set -o pipefail
+          cargo bench -p wifi-densepose-ruvector --no-default-features --bench sketch_bench -- \
+            --warm-up-time 1 --measurement-time 2 --sample-size 10 | tee ../bench-out/ruvector_sketch_bench.txt
+
+      - name: ruvector fusion_bench (quick)
+        working-directory: v2
+        run: |
+          set -o pipefail
+          cargo bench -p wifi-densepose-ruvector --no-default-features --bench fusion_bench -- \
+            --warm-up-time 1 --measurement-time 2 --sample-size 10 | tee ../bench-out/ruvector_fusion_bench.txt
+
+      - name: Upload informational bench logs
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: bench-fast-run-logs
+          path: bench-out/
+          if-no-files-found: warn
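The workflow header's case against a hard timing threshold can be made concrete with a little arithmetic. The sketch below is illustrative only: `would_fail_gate`, the 1.25x and 3.0x ratios, and every timing are hypothetical values chosen for the example, not taken from this workflow, from criterion, or from any measured run. It shows that when unchanged code can land anywhere in a 2-3x band, a tight ratio gate fails on noise alone, while a gate loose enough to absorb that noise also lets a genuine 2x slowdown through.

```rust
/// Hypothetical ratio gate: fail when the candidate timing exceeds
/// `baseline_ns * max_ratio`. Not part of criterion or of this workflow.
fn would_fail_gate(baseline_ns: f64, candidate_ns: f64, max_ratio: f64) -> bool {
    candidate_ns > baseline_ns * max_ratio
}

fn main() {
    // Made-up numbers for illustration only.
    let baseline_ns = 100_000.0; // baseline commit, measured on a fast runner
    let noise_only_ns = baseline_ns * 2.5; // SAME code, slower runner
    let real_regression_ns = baseline_ns * 2.0; // true 2x slowdown, fast runner

    // A tight gate reports a regression even though the code did not change.
    let tight = 1.25;
    println!(
        "tight gate, noise only      -> fail = {}",
        would_fail_gate(baseline_ns, noise_only_ns, tight)
    );

    // A gate loose enough to tolerate the noise band...
    let loose = 3.0;
    println!(
        "loose gate, noise only      -> fail = {}",
        would_fail_gate(baseline_ns, noise_only_ns, loose)
    );
    // ...also waves a genuine 2x regression through.
    println!(
        "loose gate, real regression -> fail = {}",
        would_fail_gate(baseline_ns, real_regression_ns, loose)
    );
}
```

Under these assumed numbers no single ratio separates noise from a real regression, which is why the workflow limits its hard gate to compilation.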
@@ -15,6 +15,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ### Added
 - **Metric-locked PCK/MPJPE accuracy harness — resolves the PCK-definition ambiguity (`wifi-densepose-train`, needs ADR slot 173).** The SOTA brief (`docs/research/sota-nn-train-benchmark-brief.md` §1, §3.1, §4) found the single biggest threat to any "beyond-SOTA" claim is **metric ambiguity**: three PCK@20 figures (96.09% WiFlow-STD image-normalized, 81.63% AetherArena torso-PCK, 61.1% GraphPose-Fi standard PCK) cannot be lined up because each silently uses a different normalization — the project was retracted twice over this (a withdrawn "92.9%" used *absolute* pixels, not torso). New `src/accuracy.rs` makes the normalizer **explicit, selectable, and carried with every reported number**: a `PckNormalization` enum (`TorsoDiameter` = standard MM-Fi/GraphPose-Fi hip↔hip; `BoundingBoxDiagonal` = looser WiFlow-STD image-normalized; `AbsolutePixels(threshold)` = the retracted convention, included so historical numbers are reproducible and clearly labeled non-comparable); one canonical `pck_at(pred, gt, vis, k, normalization)` reusing the `metrics_core` geometric primitives (hip distance, bbox diagonal — no duplicate kernel); `mpjpe(pred, gt, vis)` (2D/3D, mm); and a self-describing `PoseAccuracy { pck_at: BTreeMap<u8,f32>, mpjpe, normalization, n_keypoints, n_frames }` returned by `accuracy_report(frames, ks, normalization)` so an **unlabeled PCK number is structurally impossible**. **17 hand-computed deterministic tests** (no GPU, no datasets) prove the harness arithmetic: perfect→PCK=1.0/MPJPE=0; all-just-outside→0.0; half-in-half-out→0.5; the **key proof** that identical predictions score 0.50 (torso) / 1.00 (bbox) / 0.75 (abs) under the three normalizations (the ambiguity is real and the definitions are distinct); MPJPE 2D/3D fixtures; and graceful degenerate handling (zero torso, empty frames, NaN coords — no panic, never a false-perfect). 
**This is measurement infrastructure, not an accuracy claim** — the tests prove the harness is correct, not that any model is good. `wifi-densepose-train` lib 191→206, `test_metrics` 12→14, 0 failed. Python deterministic proof unchanged (off the signal proof path).
+- **CI bench-regression guard (`.github/workflows/bench-regression.yml`) — wires the v2/ criterion benches into CI as a real, hard-failing COMPILE-VERIFY gate + an informational fast-run; caught + fixed one already-bit-rotted bench (benchmark/optimization milestone sub-deliverable 8.3; needs ADR slot 174).** The v2/ workspace ships **26 criterion benches across 18 crates** (e.g. `nvsim/pipeline_throughput`, `wifi-densepose-ruvector/{ann,sketch,fusion}_bench`, `wifi-densepose-signal/{signal,dsp_perf,features,calibration,aether_prefilter,cir}_bench`, `wifi-densepose-mat/detection_bench`, `wifi-densepose-nn/{inference,native_conv,onnx}_bench`, `wifi-densepose-engine/engine_cycle`, …) but, because benches are **not** part of `cargo test`, nothing in CI compiled them — so they silently rot when a public API they call changes. **Proof this matters (MEASURED):** running the new gate on the current tree immediately caught `wifi-densepose-mat/detection_bench` failing to compile (`E0063: missing field last_rssi in initializer of SensorPosition` — the struct gained a field, the bench was never updated); fixed in this change (`last_rssi: None`, the simulated-zone convention) and re-verified (`cargo bench -p wifi-densepose-mat --no-default-features --bench detection_bench --no-run` → `Finished`, Executable produced). 
**HONEST SCOPE: what gates vs what is informational.** (1) `bench-compile` (HARD GATE) runs `cargo bench --workspace --no-default-features --no-run` (compile + link every bench that is not feature-gated, no measurement) plus a `--features cir` compile of the gated `cir_bench`: a deterministic, real regression guard against bench bit-rot; (2) `bench-fast-run` (INFORMATIONAL, `continue-on-error: true`, NEVER gates) runs a curated pure-CPU subset (`nvsim/pipeline_throughput`, `ruvector/{sketch,fusion}_bench`) in criterion quick-mode (1s warm-up / 2s measure / 10 samples), targeted per-`--bench` (the crates' libtest lib targets reject criterion flags), and uploads the logs as an artifact. **No timing-regression gate, by design and stated in the workflow header:** wall-clock on shared GitHub runners varies 2-3x run-to-run, so a hard threshold or a cross-runner `criterion --baseline` compare would manufacture false failures; that becomes honest only on a frequency-pinned self-hosted runner (documented as the re-add condition). The `crv`-gated `ruvector/crv_bench` is deliberately NOT compiled by the gate because its crates.io dep `ruvector-crv 0.1.1` currently fails to build on stable (upstream E0308 in its own `stage_iii.rs`); this is noted in-workflow with the re-add condition. Checkout is `submodules: recursive` (the workspace path-deps `vendor/rufield`), and `bench-compile` installs the Tauri/GTK dev libs like `ci.yml`'s rust-tests job (a `--workspace` bench link pulls the whole graph). **MEASURED locally (Windows, `--no-default-features`):** `nvsim`, `wifi-densepose-ruvector` (sketch/fusion/ann), `wifi-densepose-signal/cir_bench`, `wifi-densepose-mat/detection_bench` (post-fix), `wifi-densepose-vitals/vitals_bench`, and `ruview-swarm/swarm_bench` all compile + the fast subset runs (sample timings: `nvsim pipeline_run/d1/256` ≈ 55 µs, `d16/1024` ≈ 315 µs; `ruvector sketch_hamming` ≈ 3-7 ns vs `float_l2` ≈ 63-371 ns). 
The full `--workspace` `--no-run` could **not** be fully validated on Windows (Tauri-`desktop` needs GTK, `candle-core` fails on MSVC, `swarm_bench` LTO-links OOM under parallel pressure) — those are Windows-env artifacts that build in the Linux CI runner (each affected bench was confirmed to compile standalone here). No baseline JSON is committed (a cross-runner baseline would be dishonest). Python deterministic proof unchanged (`f8e76f21…46f7a`, bit-exact — off the signal proof path).
 - **RuField `rufield-viewer` live-ingest mode — closes the RuView↔RuField visual loop (ADR-262 surfaces).** The dashboard gains `--source live --upstream <RuView-URL>`: it consumes RuView's `/ws/field` SSE (falling back to polling `/api/field`), **verifies every event's ed25519 provenance receipt on ingest** (`is_fusable`) — forged/tampered events are flagged ✗ and **never fused** into trusted inferences — and renders real RuView `FieldEvent`s through the same room-state/privacy-badge/fusion-graph/receipt path the synthetic mode uses (wire-compatible by construction: both sides use `rufield_core::FieldEvent` serde). **Strict banner honesty:** a single `BannerState` shows `SYNTHETIC` / `LIVE — <upstream>` / `DISCONNECTED — <upstream> unreachable`, mutually exclusive — never SYNTHETIC while showing live data or vice versa; live mode returns **409** on `/api/run` rather than fabricate a synthetic run, and starts DISCONNECTED until first verified contact. Default stays synthetic. 26 tests / 0 failed. `ruvnet/rufield` `crates/rufield-viewer`; `vendor/rufield` submodule bumped.
 - **ADR-262 P3 — live RuField surface: RuView's running sensing-server now speaks RuField on `/api/field` + `/ws/field`.** Wires the P1 `wifi-densepose-rufield` bridge into the live `wifi-densepose-sensing-server` (the bridge is the only added coupling, ADR-262 §5.4). A new `src/rufield_surface.rs` module (kept out of the 8k-line `main.rs`) holds a `FieldSurface` with a **dedicated ed25519 `Signer`**, a bounded ring buffer of recent signed events (`FIELD_RING_CAPACITY = 64`), and the `/ws/field` broadcast topic; it exposes `GET /api/field` (latest signed `FieldEvent`s + signer pubkey + a `dev_signing_key` flag) and `GET /ws/field` (per-cycle stream, mirroring `/ws/sensing`), plus a standalone `router()` for isolated testing. **Tap:** at the ESP32 governed-trust cycle (`main.rs` `observe_cycle` ~`:5886` / `SensingUpdate` build ~`:5938`), `emit_rufield_event` joins the cycle's real `SensingUpdate` (features/classification/signal_field) with the engine's recorded `effective_class`/`demoted` trust state into a `SensingSnapshot` and surfaces a signed `FieldEvent` — **existing endpoints (`/ws/sensing` etc.) are unchanged; this is purely additive.** **Signer (defers the P2 key decision, §8 Q1):** a **standalone dev/sensing key** from `WDP_RUFIELD_SIGNING_SEED` (64-hex or ≥32-byte value), else a deterministic dev default with a logged `WARN` — reusing the `cog-ha-matter` Ed25519 key is the deferred P2 call, so P3 does not pre-empt it. **Egress privacy (fail-closed):** `network_egress_allowed` is *stricter* than `DefaultPrivacyGuard` for an unattended live surface — only **P1/P2** leave the box; P0 (raw) and P3/P4/P5 are held edge-local, so a `Derived → P4/P5` cycle **never** surfaces; no-presence cycles emit **no phantom event**. 
**P3 acceptance gates (`tests/rufield_surface_test.rs`, 4 integration via `tower::oneshot` + 4 module unit, 0 failed):** a well-formed **signed** event (`Modality::WifiCsi`, P2 not P1, `is_fusable` ed25519-verified, real timestamp); empty cycle → no phantom; **privacy-safety** — an injected `Derived` trust never surfaces; a mixed stream surfaces only egress-safe events. **Honest scope (ADR-262 §0/§6):** real plumbing on a **live endpoint**, **NOT accuracy** — single-link CSI with its existing caveats (no validated room-coordinate accuracy — `field_localize`), a dedicated dev signing key pending the P2 ownership decision, no accuracy claim. The win is narrowly: "RuView's live sensing now speaks RuField on `/ws/field`."
 - **ADR-262 P1 — `wifi-densepose-rufield` anti-corruption bridge: RuView WiFi-CSI sensing → signed RuField `FieldEvent`s.** A new v2 workspace member (the *single coupling point* between RuView and the standalone RuField MFS spec, ADR-262 §5.4) that **path-deps** the `vendor/rufield` submodule crates (`rufield-core`/`-provenance`/`-privacy`/`-fusion` — pure-Rust, `--no-default-features`-buildable: serde/sha2/ed25519/toml only, no tch/openblas/ndarray/candle) and **no** RuView internal crate. The bridge takes owned primitives — `SensingSnapshot` mirrors the `/ws/sensing` `SensingUpdate` (features + classification + signal_field) joined with the `TrustedOutput` trust state (`trust_class`/`demoted`/`identity_bound`) — and `snapshot_to_field_event()` emits one **signed** `FieldEvent` (`Modality::WifiCsi`, axis `[Frequency]`): a real `FieldTensor` from the feature scalars with the real `timestamp_ns`; an `Observation` whose `range_m`/`motion_vector`/`space_cell` are derived from the strongest **signal-field peak** when present (else `None` — coordinates are **never fabricated**, per the `field_localize` caveat) and `confidence` from the classification; a real `ProvenanceRef` (sha256 over the tensor bytes, `synthetic=false`) **ed25519-signed** so `rufield_provenance::is_fusable` passes. **The §3.3 privacy mapping is the critical correctness item**, implemented as `map_privacy()` mapping RuView's class onto RuField P0–P5 **by information content, NEVER by byte value** and **fail-closed**: RuView `Derived` (byte `1`, which sorts *below* `Anonymous` byte `2`) carries an identity embedding → maps to **P4** (or **P5** if identity-bound), **never P1** (the single most dangerous mapping mistake); `Raw → P0`, `Anonymous → P2`, `Restricted → P2`; a governed-engine `demoted` cycle floors the egress class to ≥ P2 with raw suppressed. 
**P1 acceptance gates (15 tests / 0 failed — 5 unit + 9 integration + 1 doc):** round-trip (`SensingSnapshot → FieldEvent →` serde `→` equal), `is_fusable` (verified ed25519 receipt), `RuFieldFusion::ingest` accept + `infer()` runs, **privacy-safety** (`gate_privacy_safety_derived_never_maps_to_low_privacy` — `Derived → P4/P5`, never P1; a table test over every RuView class; fail-closed demotion), and determinism (same snapshot + same signer seed → byte-identical event). **Honest scope:** this is **P1 plumbing** — a tested conversion + a safe privacy mapping. It is **not** wired into the live server (that is P3) and makes **no accuracy claim** (RuField v0.1 is synthetic; RuView's single-link CSI carries its own caveats). CI: the `rust-tests` workflow checkout gains `submodules: recursive` so the path-deps resolve. Python deterministic proof unchanged (off the signal proof path).
@@ -220,6 +220,9 @@ fn create_test_sensors(count: usize) -> Vec<SensorPosition> {
                z: 1.5,
                sensor_type: SensorType::Transceiver,
                is_operational: true,
+                // No live RSSI plumbed for synthetic bench sensors (simulated
+                // zone) — localization must not fabricate one.
+                last_rssi: None,
            }
        })
        .collect()
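The one-line fix in this hunk follows from Rust's exhaustive struct initialization: once a struct gains a field, every struct literal that omits it stops compiling with E0063. The sketch below reproduces that shape in isolation. The `SensorPosition` here is a simplified stand-in: its field list, the `f64` coordinates, and the `Option<f64>` type of `last_rssi` are assumptions for illustration, not the real `wifi-densepose-mat` definition, and `make_bench_sensor` is a hypothetical helper.

```rust
// Simplified stand-in for the real struct; field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct SensorPosition {
    x: f64,
    y: f64,
    z: f64,
    is_operational: bool,
    // Newly added field. Any struct literal that omits it is rejected with
    // error[E0063]: missing field `last_rssi` in initializer.
    last_rssi: Option<f64>,
}

// Hypothetical helper mirroring a bench's synthetic sensor setup.
fn make_bench_sensor(index: usize) -> SensorPosition {
    SensorPosition {
        x: index as f64,
        y: 0.0,
        z: 1.5,
        is_operational: true,
        // Synthetic sensors have no live reading, so record "unknown"
        // instead of inventing a value.
        last_rssi: None,
    }
}

fn main() {
    let sensors: Vec<SensorPosition> = (0..3).map(make_bench_sensor).collect();
    let last = &sensors[2];
    println!(
        "{} sensors; last at ({}, {}, {}), operational: {}, rssi: {:?}",
        sensors.len(),
        last.x,
        last.y,
        last.z,
        last.is_operational,
        last.last_rssi
    );
}
```

Deleting the `last_rssi: None` line from `make_bench_sensor` reproduces the compile error. Because bench targets are only built by `cargo bench`, a `cargo test` run would not surface it.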