mirror of https://github.com/ruvnet/RuView synced 2026-06-30 13:43:18 +00:00


rUv 29de574e63 Beyond-SOTA engine/signal/train improvements: mesh partition guard, FFT CIR solver, canonical frame decoder, falsifiable occupancy benchmark, governed streaming, adapter provenance (#1018)

* docs(research): add RuView beyond-SOTA system review (00)

First document of the beyond-SOTA research series: capability audit of
the current RuView engine with role-to-crate maturity matrix, ruvsense
module inventory, gap analysis, and risk register.

https://claude.ai/code/session_01MjBucx95K4BuUxZi8NWwRH

* docs(research): add beyond-SOTA architecture design (02, in progress)

https://claude.ai/code/session_01MjBucx95K4BuUxZi8NWwRH

* docs(research): finalize beyond-SOTA architecture (02)

https://claude.ai/code/session_01MjBucx95K4BuUxZi8NWwRH

* docs(research): add benchmark/validation methodology snapshot (03)

https://claude.ai/code/session_01MjBucx95K4BuUxZi8NWwRH

* docs(research): add beyond-SOTA series index with validation results; changelog

README index ties the 5 research docs together with the session's
measured validation evidence: 2,797 workspace tests / 0 failed, Python
proof PASS (bit-exact), and paired pre/post criterion CIR benchmarks.

https://claude.ai/code/session_01MjBucx95K4BuUxZi8NWwRH

* perf(signal): precompute CIR warm-start system; hoist tomography solver allocs

Exact, determinism-safe optimizations (bit-identical float results):

- cir.rs: diag(PhiH Phi)+lambda*I and its CSR matrix depend only on Phi
  and lambda (fixed at CirEstimator::new) but were rebuilt every frame
  (O(K*G) pass + CSR allocation). Now built once in new() via
  build_warm_start_system; summation order unchanged.
- tomography.rs: ISTA gradient buffer hoisted out of the 100-iteration
  loop (fill(0.0) reset) and the Frobenius Lipschitz bound moved from
  per-reconstruct to construction.

Verified: signal 456 tests green; engine 11/11 green including
cycle_is_deterministic and witness-stability tests. Criterion paired
pre/post: cir_estimate/he40 -3.9% (p<0.01), multiband -1.2/-1.4%.

https://claude.ai/code/session_01MjBucx95K4BuUxZi8NWwRH

* fix(worldgraph): bound SemanticState growth with deterministic retention

StreamingEngine::process_cycle appended one SemanticState belief per cycle
with no eviction — ~1.7M nodes/day at 20 Hz (beyond-SOTA roadmap finding #6).

Add WorldGraph::prune_semantic_states(max): deterministic eviction of the
oldest beliefs by (valid_from_unix_ms, id); structural nodes (rooms, zones,
sensors, anchors, tracks, events) are never eligible. Wire it into the
engine after each belief append (DEFAULT_SEMANTIC_RETENTION = 7,200, ~6 min
at 20 Hz; set_semantic_retention to tune). The WorldGraph holds current
beliefs; durable history is the recorder's job, so no audit data is lost.

3 new tests: end-to-end bounded growth, oldest-only eviction, deterministic
equal-timestamp tie-break. Workspace gate: 2,865 passed, 0 failed.

https://claude.ai/code/session_01MjBucx95K4BuUxZi8NWwRH

* feat(sensing-server): route live frames through the governed StreamingEngine

Closes the live-trust-path gap (ADR-136 section 8, beyond-SOTA system review):
the running server fused live CSI with the bare MultistaticFuser, while the
privacy/provenance/witness control plane (ADR-135..146) only ever ran on
synthetic in-test frames. The privacy control plane was therefore bypassable
on the real path.

New engine_bridge module drives StreamingEngine::process_cycle from the
server's live NodeState map, reusing the existing NodeState -> MultiBandCsiFrame
conversion. It lazily wires each contributing node as a WorldGraph sensor
(idempotent), bounds belief growth via the retention cap, and forwards explicit
timestamps/calibration ids so the path stays deterministic and replayable.

Wired additively into both live ESP32/WiFi fusion sites in main.rs via a
split-borrow off the write guard, so person-count behavior is unchanged; the
latest BLAKE3 witness is stored on AppState. Every published belief now carries
evidence + model + calibration + privacy decision and a deterministic witness.

Adds wifi-densepose-engine/-worldgraph/-bfld/-geo deps. 6 new bridge tests
(witnessed belief with full provenance, cross-run determinism, idempotent node
registration, retention bound, privacy-mode propagation). sensing-server suite
430+128 green; workspace gate 2,904 passed / 0 failed.

https://claude.ai/code/session_01MjBucx95K4BuUxZi8NWwRH

* feat(train): falsifiable occupancy benchmark with anti-overfitting gate

Makes the presence/person-count "beyond SOTA" claim falsifiable in code
instead of aspirational (the unfalsifiability gap from the beyond-SOTA system
review). occupancy_bench grades predictions vs ground truth and gates a SOTA
claim behind one claim_allowed invariant requiring ALL of:

- DataProvenance::Measured — synthetic/mock data is scorable for regression
  but never claimable (anti-mock-contamination; the CLAUDE.md Kconfig-bug
  lesson made structural).
- A leak-free EvalSplit — validate() refuses any split where a subject OR
  environment id appears in both train and test (subject leakage /
  per-environment overfitting).
- n_test >= min_test_samples (small-N guard).
- Presence F1 whose bootstrap-CI lower bound (deterministic seeded splitmix64)
  clears the threshold — not the point estimate.
- Count MAE within threshold.

The claim string is unreadable except through the gate (NO_CLAIM otherwise),
same discipline as the ruview-gamma acceptance gate. What remains is data, not
method: a frozen, SHA-pinned, subject/environment-disjoint measured replay set
turns the claim into a passing/failing test.

Lives in wifi-densepose-train (the eval bounded context, alongside ablation/
eval/metrics). 10 tests cover each refusal path; warning-clean under the
crate's missing_docs lint. Workspace gate 2,914 passed / 0 failed. Doc 03
updated.

https://claude.ai/code/session_01MjBucx95K4BuUxZi8NWwRH

* feat(engine): per-room adapter provenance + drift-to-recalibration advisor

Closes the trust-chain gap where an ~11 KB per-room LoRA adapter (ADR-150
section 3.4) could silently change inference without the witness noticing:
provenance carried only "rfenc-v<N>" with no notion of adapter identity.

- StreamingEngine::set_room_adapter(AdapterInfo): pins the adapter's
  content-derived id into provenance model_version
  ("rfenc-v1+adapter:<id>") — and therefore into the BLAKE3 witness — so
  swapping or clearing adapter weights always shifts the witness. Engine test
  proves base -> adapter -> other-adapter -> cleared all witness differently
  and cleared == base.
- RecalibrationAdvisor: recommends re-running the ADR-135 empty-room baseline
  / refitting the room adapter on sustained low fusion coherence (streak
  threshold, default 60 cycles ~ 3 s at 20 Hz) or an ADR-142 change-point.
  Surfaced as TrustedOutput::recalibration_recommended, stored on the
  sensing-server AppState alongside the witness at both live fusion sites.
- Bridge plumbing: EngineBridge::{set_room_adapter, clear_room_adapter} +
  live-path test that the adapter id flows into the live witness.

Scope note (honest): this is the deployable provenance/trigger half of the
"retrained model" roadmap item. Fitting the adapter itself runs in the
existing external calibration service (aether-arena/calibration/); a trained
RF-encoder checkpoint still does not exist in-tree.

Engine 15 tests, bridge 7 tests. Workspace gate: 2,918 passed / 0 failed.

https://claude.ai/code/session_01MjBucx95K4BuUxZi8NWwRH

* fix(mat): gate api module behind its feature — standalone no-default-features builds

pub mod api was unconditional while its only dependency, serde, is optional
behind the 'api' feature, so any build without default features failed with
101 unresolved-serde errors (masked in --workspace runs by feature
unification). The api module and its create_router/AppState re-export are now
cfg(feature = "api")-gated with docsrs annotations.

All combos compile: bare --no-default-features (was 101 errors, now 0),
--no-default-features --features api, and full default (177 tests pass).
Workspace gate: 2,918 passed / 0 failed.

https://claude.ai/code/session_01MjBucx95K4BuUxZi8NWwRH

* perf(signal): opt-in FFT operator for the CIR ISTA solver (8-14x measured)

Phi is a sub-DFT, so each ISTA mat-vec can run as one length-G FFT
(O(G log G)) instead of a dense O(K*G) product — the dominant-latency-hazard
finding from the beyond-SOTA optimization roadmap.

New CirConfig::fft_operator, default FALSE: the dense path stays the
bit-exact witness default. The FFT evaluates the same sums in a different
order, so enabling it shifts float results in the last bits and requires
regenerating any pinned witness — strictly opt-in per deployment.

FftOperator (rustfft, planned once at CirEstimator::new, scratch buffers
reused across the ISTA loop) dispatches inside ista_solve:
  Phi x   = scale * forward-FFT(x) sampled at bins (k_idx mod G)
  Phi^H v = scale * unnormalised inverse-FFT of v scattered into those bins
Warm-start and Lipschitz estimation stay dense at construction.

Measured (criterion, same run, same machine):
  ht20: 2.22 ms -> 265 us  (8.4x)
  ht40: 10.26 ms -> 717 us (14.3x)
The real HE40 grid (K=484, G=1452) scales further per the O(K*G)/O(G log G)
ratio.

3 new tests: FFT<->dense matvec equivalence to float tolerance on ht20 and
he40 grids; end-to-end dominant-tap agreement on a single-path frame; all
default configs keep FFT off. New cir_estimate_fft bench group.

Workspace gate: 2,921 passed / 0 failed (default path bit-exact, witnesses
unchanged).

https://claude.ai/code/session_01MjBucx95K4BuUxZi8NWwRH

* feat(core): canonical frame decoder — capture-to-claim replay (ADR-136)

The encode half of the ADR-136 frame contract existed (ComplexSample,
to_canonical_bytes, witness_hash) but there was no decoder: a captured
canonical frame could be witnessed but never reconstructed, blocking
replay-from-capture.

CsiFrame::from_canonical_bytes is the exact inverse: same id, metadata,
complex payload, and witness hash (tested as the round-trip law AC7 — the
replayed frame re-encodes byte-identically). Amplitude/phase are recomputed
from the payload (projections, not independent state). Every malformed-input
class fails closed (AC8): header truncation -> Truncated, payload truncation
-> PayloadMismatch, unknown discriminants, non-UTF-8 device id, trailing
bytes. Nil calibration uuid decodes as None per the documented encoding.

Core: 36 tests pass. Workspace gate: 2,937 passed / 0 failed.

https://claude.ai/code/session_01MjBucx95K4BuUxZi8NWwRH

* feat(engine): dynamic min-cut mesh partition guard (ruvector-mincut)

Maintains an exact min-cut over the live mesh coupling graph — nodes are
sensing nodes, coupling is the product of fusion attention weights — and
surfaces per cycle, as TrustedOutput::mesh:

- cut value: the global "how close is the array to partitioning" number,
  a structural measure per-node heuristics miss;
- weak side: which specific nodes would split off (failure/jamming triage,
  feeds ADR-032 posture);
- at-risk flag: counts as a structural event for the drift->recalibration
  advisor (alongside ADR-142 change-points).

Degenerate cases fail toward risk: a node with zero coupling is reported as
already partitioned (cut 0, that node as the weak side).

Measured cost policy (criterion, 12-node mesh — the honest part):
- weights quantized (1/64) + change-gated: steady-state cycles do ZERO graph
  work and reuse the cached cut (~7.3 us, ~23x cheaper than building);
- on any real change a full exact rebuild (~171 us) is used, because ONE
  DynamicMinCut delete+insert measured ~240 us — the subpolynomial machinery
  amortizes on much larger graphs, so rebuild-on-change is the measured
  optimum at mesh scale (one-edge case -28% after switching policy);
- full process_cycle with the guard: ~33 us for 4 nodes vs the 50 ms budget.

9 mesh_guard tests (weak-node detection, steady-state zero updates,
sub-quantum gating, join/drop rebuild, determinism, disconnection) + an
engine-level wiring test (down-weighted node -> weak side -> recalibration).
Engine 24 tests; workspace gate 2,946 passed / 0 failed.

https://claude.ai/code/session_01MjBucx95K4BuUxZi8NWwRH

* feat(engine): mesh partition risk demotes privacy + enters the witness (ADR-032)

Completes the mesh-guard integration: its at_risk signal was advisory-only
(fed the recalibration advisor). It now also contributes to the ADR-141
privacy demotion alongside fusion- and array-level contradictions — a mesh
close to partitioning makes the fused belief less trustworthy, so the cycle
emits at a more restricted class (monotonic; information only removed).

Because effective_class feeds the BLAKE3 witness, a fragmenting array now
shifts the witness: partition risk is auditable, not just logged. The mesh
computation moved ahead of the demotion step in process_cycle; mesh_guard_mut
exposes risk-threshold tuning.

Test: a forced-risk 3-node cycle demotes PrivateHome Anonymous->Restricted
and shifts the witness vs a clean baseline. Engine 25 tests; workspace gate
2,947 passed / 0 failed.

https://claude.ai/code/session_01MjBucx95K4BuUxZi8NWwRH

* fix: public-PR review findings — privacy-path honesty, gate holes, mesh-guard cliff

- sensing-server: engine errors logged+counted (no silent swallow), trust
  state exposed via status surface, privacy-demotion claims aligned with
  the actual parallel-audit-path behavior
- occupancy_bench: vacuous-F1 hole closed (degenerate test sets fail with
  their own criterion); CI-lower-bound test made probative
- mesh_guard: quantization scaled to observed coupling range — >=65-node
  balanced meshes no longer permanently at_risk (regression test)
- engine: both wiring tests made probative (same-topology witness compare,
  deterministic risk-crossing fixture)
- mat: axum/tokio optional behind api; real serde feature (api enables it)
- core: canonical decoder strict (non-zero reserved bytes and nil UUID
  rejected — injective on accepted domain, forged-bytes tests)
- CHANGELOG: un-spliced the FFT/adapter bullet mangle

Co-Authored-By: claude-flow <ruv@ruv.net>

* chore: strip private-track references for public PR

Reword the occupancy-benchmark changelog bullet to drop a cross-reference
to the private research track, and restore the WorldGraph retention bullet
header that was glued onto the preceding MAT bullet.

Co-Authored-By: claude-flow <ruv@ruv.net>

* chore: lockfile refresh for cherry-picked feature set

Co-Authored-By: claude-flow <ruv@ruv.net>

---------

Co-authored-by: Claude <noreply@anthropic.com>

2026-06-11 16:08:54 -04:00


RuView Beyond-SOTA — 04: Performance Review & Optimization Roadmap

Scope: the streaming sensing pipeline (CSI ingest → multistatic fusion → CIR gate → pose publish) in v2/, hot-path crates wifi-densepose-signal (ruvsense), wifi-densepose-engine, wifi-densepose-ruvector, plus build-profile and edge-target (Pi 5-class, WASM) considerations.

Hard constraint (non-negotiable): the witness chain (ADR-028, ADR-136 §2.5 replay contract, ADR-137 §2.7 BLAKE3 witness in v2/crates/wifi-densepose-engine/src/lib.rs:437-448) requires bit-exact deterministic float output. Every recommendation below is tagged with its determinism risk. Anything that reorders float additions, enables FMA contraction, fast-math, or parallel reduction changes the witness hash and requires a coordinated proof-hash regeneration (verify.py --generate-hash) plus witness-bundle re-issue.
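Why a reordered summation changes the witness can be seen with three f64 values. The following is a minimal sketch; the function names are illustrative and do not come from the RuView codebase. It sums the same values in two orders and shows that the resulting bit patterns differ, which is exactly what a hash over float output would observe.

```rust
/// Sums the slice front to back: ((v0 + v1) + v2) + ...
fn sum_forward(values: &[f64]) -> f64 {
    let mut acc = 0.0;
    for &v in values {
        acc += v;
    }
    acc
}

/// Sums the slice back to front: same values, different association order.
fn sum_reverse(values: &[f64]) -> f64 {
    let mut acc = 0.0;
    for &v in values.iter().rev() {
        acc += v;
    }
    acc
}

/// True when two floats are bit-for-bit identical (what a witness hash sees).
fn bit_identical(a: f64, b: f64) -> bool {
    a.to_bits() == b.to_bits()
}
```

For the input `[0.1, 0.2, 0.3]`, `bit_identical(sum_forward(..), sum_reverse(..))` is false even though both results are mathematically 0.6. Any optimization that changes accumulation order therefore needs the hash-regeneration path described above.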

1. What we actually have measured (and what we don't)

/home/user/RuView/benchmark_baseline.json is a signal-quality soak baseline, not a latency benchmark: 1,566 samples (ticks 51131–52395) of variance / motion / presence / confidence / est_persons / kp_spread / rssi, with a summary block (confidence_mean: 0.643, presence_ratio: 0.934, kp_spread_mean: 86.7, person_count_changes: 10). It contains zero timing data. It is the accuracy guardrail for any optimization (post-change soak must reproduce these distributions), not a latency baseline.

Latency benchmarks exist but no committed results were found in the repo:

Bench	File	What it measures
`process_cycle_4nodes_56sc`	`v2/crates/wifi-densepose-engine/benches/engine_cycle.rs:34-48`	One full engine cycle, 4 nodes × 56 subcarriers, vs. the documented 50 ms budget (`engine_cycle.rs:3-6`)
`cir_bench`	`v2/crates/wifi-densepose-signal/benches/cir_bench.rs`	`CirEstimator::estimate()` per tier (HT20/HT40/HE20/HE40) + 12-link amortization
`sketch_bench`	`v2/crates/wifi-densepose-ruvector/benches/sketch_bench.rs:86-175`	Hamming sketch vs. float L2/cosine compare; top-K over 1,024-sketch bank
`signal_bench`, `calibration_bench`, `aether_prefilter_bench`	`v2/crates/wifi-densepose-signal/benches/`	Signal-path and ADR-135 calibration throughput

Action zero of the roadmap is to run these on a Pi 5 and commit the criterion baselines. All impact classes below are derived from operation counts read out of the code (cited), not invented measurements.

2. Latency budget model — streaming pipeline

Two clock domains exist and must not be conflated:

TDMA sensing cycle: 20 Hz / 50 ms — the architecture's own budget (v2/crates/wifi-densepose-signal/src/ruvsense/mod.rs:5, RuvSenseConfig::target_hz = 20.0 at mod.rs:258, and the bench doc engine_cycle.rs:3).
CSI ingest: 100 Hz per node — raw frames arrive ~5× faster than the fused output rate; per-frame ingest work (parse, normalize, calibrate, window) must therefore fit a 10 ms per-frame envelope while the fused path fits < 50 ms end-to-end.

Proposed per-stage budget for the 50 ms end-to-end target (4 nodes, HT20 / 56 subcarriers — the configuration the engine bench encodes):

#	Stage	Code	Budget	Risk (from code reading)
1	Ingest + hardware normalize (per 100 Hz frame)	`hardware_norm`, `multiband.rs`	2 ms	Low — vector ops on 56 floats
2	Calibration apply (ADR-135)	`ruvsense/calibration.rs`	2 ms	Low — Welford lookups
3	Phase alignment	`phase_align.rs:117-152`	1 ms	Low — ≤ 20 iterations over ≤ 17 static subcarriers (`config.max_iterations: 20`, `phase_align.rs:57`); allocation churn only (§3)
4	Multistatic fusion (attention + softmax)	`multistatic.rs:512-598`	2 ms	Low — O(nodes × 56); but does duplicate work in `fuse_scored` (§3, F2)
5	CIR gate (ISTA L1)	`multistatic.rs:440-475` → `cir.rs:601-654`	15 ms	HIGH — dominant cost, scales badly with PHY tier (below)
6	Coherence score + gate decision	`coherence.rs`, `coherence_gate.rs`	2 ms	Low — z-scores over 56 subcarriers
7	Tomography (ADR-030 tier 2, when enabled)	`tomography.rs:236-323`	8 ms	Medium — per-iteration allocation + loose step size (§3, F8/F9)
8	Pose tracker (17-kp Kalman + re-ID)	`pose_tracker.rs`	8 ms	Medium — sketch prefilter (ADR-084) already mitigates the re-ID scan
9	Engine: quality score, privacy gate, WorldGraph node, BLAKE3 witness	`engine/src/lib.rs:304-368`	5 ms	Low per cycle, but unbounded memory growth (§4)
10	Publish (WS/serde)	sensing-server	5 ms	Low
	Total		50 ms

Why stage 5 is the at-risk stage — operation counts from the code

ista_solve (cir.rs:601-654) runs two dense complex mat-vecs per iteration (matvec_phi at cir.rs:717-726, matvec_phi_h at cir.rs:730-745), each O(K·G) complex MACs (≈ 8 FLOPs each), up to max_iters: 100 (cir.rs:176). Per CirConfig (cir.rs:164-233):

Tier	K (active)	G (taps)	FLOPs/iter (2·K·G·8)	FLOPs @100 iters
HT20	52	156	≈ 0.13 M	≈ 13 M
HT40	114	342	≈ 0.62 M	≈ 62 M
HE20	242	726	≈ 2.8 M	≈ 0.28 G
HE40	484	1,452	≈ 11.2 M	≈ 1.1 G
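The table entries follow directly from the formula in its header. As a hedged sketch (the function and constant names are hypothetical, and the 8-FLOP figure per complex MAC is the document's own counting convention), the arithmetic can be reproduced as:

```rust
/// Real FLOPs counted per complex multiply-accumulate (4 multiplies + 4 adds).
const FLOPS_PER_COMPLEX_MAC: u64 = 8;

/// FLOPs for one ISTA iteration: two dense K x G complex mat-vecs
/// (one with Phi, one with its adjoint).
fn ista_flops_per_iter(k_active: u64, g_taps: u64) -> u64 {
    2 * k_active * g_taps * FLOPS_PER_COMPLEX_MAC
}

/// Worst-case FLOPs for a full solve at the given iteration cap.
fn ista_flops_total(k_active: u64, g_taps: u64, max_iters: u64) -> u64 {
    ista_flops_per_iter(k_active, g_taps) * max_iters
}
```

Calling `ista_flops_total(484, 1452, 100)` gives the HE40 worst case; multiplying by the link count gives the cost of the 12-link pattern.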

HT20 fits the 15 ms budget comfortably on a Pi 5; HE40 at worst-case iteration count is ~1.1 GFLOP of scalar, cache-unfriendly work per estimate and will not fit any 50 ms budget without structural change (F4 below). Today the gate runs once per cycle on the first link only (multistatic.rs:452-463), which contains the damage; the 12-link amortization pattern in cir_bench.rs shows the intended scale-up, which multiplies this cost ×12.
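The structural change referred to here rests on one identity: a sub-sampled DFT applied to a vector equals a full DFT of that vector, read out at the selected bins. The sketch below illustrates the identity only and is not the RuView implementation. It assumes a forward DFT with a negative exponent and no scale factor, uses a hand-written radix-2 FFT that requires a power-of-two length (the tap counts in the table are not powers of two, so production code would need a mixed-radix FFT library), and all type and function names are hypothetical.

```rust
use std::f64::consts::PI;

/// Minimal complex number; stands in for a crate type such as Complex64.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Cplx { re: f64, im: f64 }

impl Cplx {
    fn new(re: f64, im: f64) -> Self { Cplx { re, im } }
    fn add(self, o: Cplx) -> Cplx { Cplx::new(self.re + o.re, self.im + o.im) }
    fn sub(self, o: Cplx) -> Cplx { Cplx::new(self.re - o.re, self.im - o.im) }
    fn mul(self, o: Cplx) -> Cplx {
        Cplx::new(self.re * o.re - self.im * o.im, self.re * o.im + self.im * o.re)
    }
    /// exp(i * angle)
    fn unit(angle: f64) -> Cplx { Cplx::new(angle.cos(), angle.sin()) }
}

/// Dense sub-DFT product, O(K*G): y[j] = sum_g exp(-2*pi*i*bins[j]*g/G) * x[g].
fn dense_sub_dft(bins: &[usize], x: &[Cplx]) -> Vec<Cplx> {
    let g_len = x.len();
    bins.iter()
        .map(|&k| {
            let mut acc = Cplx::new(0.0, 0.0);
            for (g, &xg) in x.iter().enumerate() {
                let angle = -2.0 * PI * ((k * g) % g_len) as f64 / g_len as f64;
                acc = acc.add(Cplx::unit(angle).mul(xg));
            }
            acc
        })
        .collect()
}

/// Recursive radix-2 FFT, O(G log G). Input length must be a power of two.
fn fft(x: &[Cplx]) -> Vec<Cplx> {
    let n = x.len();
    assert!(n.is_power_of_two(), "radix-2 sketch needs a power-of-two length");
    if n == 1 {
        return vec![x[0]];
    }
    let even: Vec<Cplx> = x.iter().step_by(2).copied().collect();
    let odd: Vec<Cplx> = x.iter().skip(1).step_by(2).copied().collect();
    let (e, o) = (fft(&even), fft(&odd));
    let mut out = vec![Cplx::new(0.0, 0.0); n];
    for k in 0..n / 2 {
        let twiddled = Cplx::unit(-2.0 * PI * k as f64 / n as f64).mul(o[k]);
        out[k] = e[k].add(twiddled);
        out[k + n / 2] = e[k].sub(twiddled);
    }
    out
}

/// Same product via one FFT, then sampling the requested bins.
fn fft_sub_dft(bins: &[usize], x: &[Cplx]) -> Vec<Cplx> {
    let spectrum = fft(x);
    bins.iter().map(|&k| spectrum[k % x.len()]).collect()
}
```

The two functions agree to floating-point tolerance but are not guaranteed to agree bit for bit, because the FFT accumulates the same terms in a different order. That is why this route carries the CHANGES-FLOATS tag.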

3. Findings table — optimization opportunities

Impact: relative cycle-time/memory effect at the 4-node HT20 operating point unless noted. Determinism: EXACT = bit-identical output guaranteed; TIE = only tie-breaking/ordering may differ; CHANGES-FLOATS = output bits change, witness/proof hash must be regenerated.

ID	Finding (file:line)	Impact	Effort	Determinism
F1	`FusedSensingFrame` deep-copies every input frame each cycle: `node_frames: node_frames.to_vec()` (`multistatic.rs:282`) — clones all per-node amplitude+phase vectors per 50 ms cycle even when downstream geometry consumers don't need them	Med	Low (Arc/Cow or borrow)	EXACT
F2	`fuse_scored` re-derives the per-node amplitude views and recomputes `node_attention_weights` after `fuse` already computed them inside `attention_weighted_fusion` (`multistatic.rs:311-321` duplicating `multistatic.rs:520`) — full cosine-sim + softmax done twice per cycle	Low-Med	Low (return weights from `fuse`)	EXACT (same math, computed once)
F3	CIR gate rebuilds a heap `CsiFrame` per cycle: `build_csi_frame_from_channel` allocates an `Array2<Complex64>` and converts amplitude/phase via `from_polar` per subcarrier (`multistatic.rs:488-506`, called from `multistatic.rs:462`), then `extract_csi_vector` converts back to `Complex32` (`cir.rs:505-530`) — f32→f64→f32 round-trip plus two allocations purely as glue	Med	Med (give `CirEstimator` a slice-based entry point)	EXACT if conversions reproduce exactly (f32→f64 is lossless; `from_polar` in f64 then truncate ≠ f32 polar — keep the f64 intermediate to stay exact, or accept CHANGES-FLOATS and regenerate hashes)
F4	ISTA inner loop uses dense O(K·G) mat-vecs (`cir.rs:717-745`) although Φ is a sub-sampled DFT (`cir.rs:539-558`) — the products Φx and Φᴴr are computable via an FFT of length G in O(G log G), an ~8–40× FLOP cut at HE20/HE40 (table §2)	High (the only path to HE40 real-time)	High	CHANGES-FLOATS (different summation order than the sequential dot product) — must ship behind a feature flag, A/B against `cir_proof_runner`, regenerate `expected_features.sha256` + witness bundle
F5	`neumann_warm_start` recomputes the diagonal of ΦᴴΦ with a full K×G pass per frame (`cir.rs:676-681`), rebuilds the COO→CSR diagonal matrix per frame (`cir.rs:683-685`), and collects `rhs_re`/`rhs_im` Vecs per frame (`cir.rs:689-690`) — yet `diag` depends only on Φ, which is fixed at `CirEstimator::new`	Med	Low (precompute diag+CSR in `new()`)	EXACT (same values, computed once)
F6	`phase_variance` collects a `Vec<f32>` of phases per call (`cir.rs:792`) — replaceable by a two-pass loop with zero allocation	Low	Low	EXACT
F7	Φ and Φᴴ are both stored densely (`cir.rs:546-547`): 2·K·G·8 bytes — Φᴴ entries are just conjugates of Φ (`cir.rs:555`), so a transposed-iteration kernel over Φ alone halves the footprint (HE40: 11.2 MB → 5.6 MB)	Low (latency) / Med (memory §4)	Med	EXACT (conjugation is exact; keep identical accumulation order in the transposed kernel)
F8	Tomography allocates the gradient vector inside the solver iteration loop: `let mut gradient = vec![0.0_f64; self.n_voxels]` (`tomography.rs:266`) — one heap alloc + zeroing per iteration, up to `max_iterations: 100` (`tomography.rs:75`); hoist and `fill(0.0)`	Med (for tier-2 deployments)	Low	EXACT
F9	Tomography step size uses the Frobenius-norm upper bound for the Lipschitz constant (`tomography.rs:253-259`, comment admits `‖WᵀW‖ ≤ ‖W‖_F²`) — a bound loose by up to the matrix rank, forcing proportionally more ISTA iterations than the power-method estimate used in `cir.rs:566-590`	Med	Low (reuse the cir.rs power-method pattern)	CHANGES-FLOATS (different step ⇒ different iterate path)
F10	`apply_phase_correction` clones the amplitude vector and allocates a fresh corrected-phase Vec per channel per cycle (`phase_align.rs:258-268`, `frame.amplitude.clone()` at `phase_align.rs:264`); `align` additionally `frames.to_vec()`s on the single-channel path (`phase_align.rs:128`) — an in-place `align_mut` avoids all of it	Low-Med	Low	EXACT
F11	Static-subcarrier selection fully sorts all subcarriers by variance (`phase_align.rs:180`) where `select_nth_unstable_by` suffices — trivial at 56 subcarriers, relevant at HE tiers (242–484)	Low	Low	TIE (equal-variance ties may select a different subcarrier set; pin a stable tie-break on index to stay EXACT)
F12	Engine clones each node's amplitude vector for the array coordinator every cycle: `cf.amplitude.clone()` (`engine/src/lib.rs:385`); also allocates a `Vec<Option<CalibrationId>>` per cycle (`lib.rs:293`) and `format!("{e:?}")` strings for every evidence ref (`lib.rs:337`)	Low	Low	EXACT
F13	`fuse_scored_calibrated` computes the modal calibration id in O(n²) (`multistatic.rs:404-410`) — harmless at n ≤ 15 nodes, noted for swarm-scale reuse (ADR-148)	Low	Low	EXACT
F14	No `rayon` and no SIMD feature exists anywhere in the hot crates (grep over `crates/*/Cargo.toml`: zero hits for rayon/simd/target-feature outside wasm-opt flags). The 12-link CIR pattern (`cir_bench.rs:4-5`) and the per-node ingest path are embarrassingly parallel across independent links/nodes	High (multi-link tiers)	Med	EXACT if and only if parallelism stays at link/node granularity with results collected in deterministic (index) order and no shared float accumulator; intra-link parallel reductions are CHANGES-FLOATS and are banned
F15	`Cir::top_k_taps` clones and fully sorts all G taps (`cir.rs:322-332`) — O(G log G) with a G-sized clone; a k-heap (the exact pattern already written in `sketch.rs:546-563`) is O(G log k)	Low	Low	TIE (equal-magnitude ordering; pin index tie-break)
F16	Core `CsiFrame` carries `Complex64` while the entire ruvsense DSP path computes in f32 (conversion at `cir.rs:525`) — 2× memory and bandwidth on every ingest for precision the pipeline immediately discards	Med (memory/bandwidth)	High (core type change ripples everywhere)	CHANGES-FLOATS at the boundary; defer until a major version
F17	Sketch path is already well-optimized: heap-based top-K with n ≤ k fast path (`sketch.rs:536-569`), 28-byte wire format (`sketch.rs:303`). Remaining win is build-level: `count_ones()` only lowers to POPCNT/NEON-vcnt when the target CPU enables it (see §5)	Low	Low	EXACT (integer ops)

4. Memory-footprint analysis (Pi 5-class and WASM; ESP32 aggregation out of scope)

Static, per-process (from struct definitions):

Component	Sizing source	Footprint
`CirEstimator` HT20 (Φ + Φᴴ, `Complex32`)	`cir.rs:546-547`, K=52 G=156	2 · 52 · 156 · 8 B ≈ 130 KB
`CirEstimator` HE20	K=242 G=726	≈ 2.8 MB
`CirEstimator` HE40	K=484 G=1452	≈ 11.2 MB (halvable via F7)
Tomography weight matrix	`tomography.rs:214-217`, sparse per-link (voxel,weight) pairs; default grid 8×8×4 = 256 voxels (`tomography.rs:70-73`)	tens of KB at default grid
Sketch bank, 1,024 × 128-d	`sketch.rs` 1 bit/dim	1,024 · 16 B ≈ 16 KB (vs 512 KB float)
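The `CirEstimator` rows use the same sizing rule, which can be written down as a small helper. This is a sketch with hypothetical names; it assumes 8 bytes per `Complex32` entry and the WebAssembly page size of 64 KiB.

```rust
/// Bytes per Complex32 entry (two f32 components).
const BYTES_PER_COMPLEX32: usize = 8;
/// WebAssembly linear-memory page size.
const WASM_PAGE_BYTES: usize = 64 * 1024;

/// Footprint of the dense sensing matrix. With `store_adjoint` set, both Phi
/// and its conjugate transpose are kept (the current layout); without it only
/// Phi is kept (the F7 layout).
fn sensing_matrix_bytes(k_active: usize, g_taps: usize, store_adjoint: bool) -> usize {
    let copies = if store_adjoint { 2 } else { 1 };
    copies * k_active * g_taps * BYTES_PER_COMPLEX32
}

/// Number of WASM pages needed to hold `bytes`, rounded up.
fn wasm_pages(bytes: usize) -> usize {
    (bytes + WASM_PAGE_BYTES - 1) / WASM_PAGE_BYTES
}
```

`sensing_matrix_bytes(52, 156, true)` reproduces the HT20 row, and passing the result to `wasm_pages` gives the linear-memory cost on a WASM target.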

A Pi 5 (4–8 GB) absorbs all of this trivially. The real memory risks are dynamic:

1. Unbounded WorldGraph growth (the one genuine leak-class issue). Every process_cycle appends a SemanticState node plus a DerivedFrom edge (engine/src/lib.rs:346-352), and change-points append Event nodes (lib.rs:422-428). At 20 Hz that is 1.73 M nodes/day with no eviction anywhere in the engine. snapshot_json (lib.rs:191-193) then serializes the whole graph. Required: a retention/compaction policy (ring buffer or time-windowed rollup of SemanticStates). Determinism caveat: eviction changes snapshot contents (a product decision), not float math, so the per-cycle witness (lib.rs:437-448) is unaffected.
2. Per-cycle allocation churn (F1, F3, F5, F8, F10, F12): dozens of short-lived heap allocations per cycle, 20 times per second. On a Pi 5 this is allocator pressure and cache pollution rather than RSS growth; on WASM (bump-ish dlmalloc, no MADV_FREE) it inflates the linear memory high-water mark, which is never returned to the host.
3. WASM targets. wifi-densepose-wasm is a browser binding crate (JS interop, serde, chrono; see crates/wifi-densepose-wasm/Cargo.toml) and pulls wifi-densepose-mat optionally; it relies on wasm-opt -O4 (Cargo.toml [package.metadata.wasm-pack]). wifi-densepose-wasm-edge is the disciplined one: no_std + libm, its own profile opt-level = "s", lto, cgu=1 (crates/wifi-densepose-wasm-edge/Cargo.toml). Neither enables +simd128 (§5). If the CIR estimator is ever compiled to wasm-edge, HE40's 11.2 MB of sensing matrix alone is ~172 pages (64 KiB each) of linear memory; restrict edge WASM to HT20 (130 KB) or ship F4/F7 first.
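One way to bound the WorldGraph growth described above is a capped store with deterministic oldest-first eviction. The sketch below is an assumption-laden illustration, not the engine's API: `BeliefStore`, its fields, and the `(valid_from_unix_ms, id)` ordering key are hypothetical stand-ins for whatever the real graph uses to identify SemanticState nodes.

```rust
use std::collections::BTreeSet;

/// Hypothetical bounded store of belief keys, ordered by (timestamp, id).
struct BeliefStore {
    max: usize,
    beliefs: BTreeSet<(u64, u64)>,
}

impl BeliefStore {
    fn new(max: usize) -> Self {
        BeliefStore { max, beliefs: BTreeSet::new() }
    }

    /// Appends one belief, then evicts down to the cap.
    /// Returns how many beliefs were evicted.
    fn append(&mut self, valid_from_unix_ms: u64, id: u64) -> usize {
        self.beliefs.insert((valid_from_unix_ms, id));
        self.prune()
    }

    /// Evicts the oldest beliefs first; equal timestamps fall back to the
    /// lower id, so two runs over the same input evict the same nodes.
    fn prune(&mut self) -> usize {
        let mut evicted = 0;
        while self.beliefs.len() > self.max {
            self.beliefs.pop_first();
            evicted += 1;
        }
        evicted
    }

    /// Remaining keys in ascending (timestamp, id) order.
    fn keys(&self) -> Vec<(u64, u64)> {
        self.beliefs.iter().copied().collect()
    }
}
```

The ordered set keeps eviction at O(log n) per cycle and makes the tie-break explicit. A plain ring buffer is cheaper still if beliefs are guaranteed to arrive in timestamp order.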

5. Build-profile review & recommendations

Current release profile (v2/Cargo.toml:213-218) is already aggressive and correct: opt-level = 3, lto = true (fat), codegen-units = 1, panic = "abort", strip = true; bench inherits release with debug symbols (v2/Cargo.toml:225-227). There is nothing wrong to fix here — the gains left are target- and feedback-driven:

1. Per-target CPU tuning (EXACT, do first). No target-cpu is set anywhere. For Pi 5 fleet builds: RUSTFLAGS="-C target-cpu=cortex-a76", which enables NEON scheduling and vcnt for the sketch path (F17) without changing IEEE semantics. LLVM does not reassociate float reductions or contract to FMA without explicit fast-math/contract flags, so scalar float results stay bit-exact. Verify with the existing proof runners (cir_proof_runner, calibration_proof_runner, signal/Cargo.toml) as the acceptance gate; that is exactly what they exist for.
2. WASM SIMD. Add -C target-feature=+simd128 for wifi-densepose-wasm builds and keep a non-SIMD artifact for older runtimes. Same determinism note as above; gate with the proof runners compiled to wasm where feasible.
3. PGO: feasible and determinism-safe. PGO changes inlining/layout, never FP semantics. The repo already has ideal deterministic training workloads: the proof runner binaries plus engine_cycle / cir_bench. Pipeline: cargo pgo build → run proof runners + benches → cargo pgo optimize. Expect mid-single-digit to ~15% on branchy paths (gate decisions, tracker lifecycle); the dense ISTA loop will see little. Cost: CI complexity. Verdict: do it after F1–F12, not before.
4. Do not enable fast-math equivalents (the fadd_fast family in core::intrinsics, -C llvm-args=-fp-contract=fast) anywhere in the witness path. This must be a stated rule in CONTRIBUTING/ADR, not tribal knowledge.
5. BOLT / opt-level experiments are not worth it ahead of F4; the pipeline is FLOP-bound in one loop, not front-end bound.

6. Prioritized 90-day plan

Phase 0 — Measure (days 1–10)

Run and commit criterion baselines on a Pi 5 and an x86 dev box: engine_cycle, cir_bench (all four tiers), sketch_bench, signal_bench, calibration_bench. The 50 ms claim in engine_cycle.rs:3 becomes a measured number.
Add a lightweight per-stage timing histogram (feature-gated, off in witness builds) at the §2 stage boundaries; wire a CI perf-regression gate (±10%) on the committed baselines.
Re-run the soak that produced benchmark_baseline.json and pin it as the accuracy guardrail for everything below.
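The per-stage histogram and the ±10% gate from the list above can both be very small. The following is a sketch under stated assumptions: the bucket edges, the type name `StageHistogram`, and the function `within_gate` are all hypothetical, and in a real build the recording call sites would sit behind a Cargo feature so that witness builds compile them out.

```rust
use std::time::Duration;

/// Upper bucket edges in microseconds; one extra bucket catches everything above.
const BUCKET_EDGES_US: [u64; 5] = [100, 1_000, 5_000, 15_000, 50_000];

/// Fixed-size latency histogram for one pipeline stage (no heap allocation).
#[derive(Default)]
struct StageHistogram {
    counts: [u64; 6],
    total_us: u64,
}

impl StageHistogram {
    /// Records one measured stage duration.
    fn record(&mut self, elapsed: Duration) {
        let us = elapsed.as_micros() as u64;
        let bucket = BUCKET_EDGES_US
            .iter()
            .position(|&edge| us <= edge)
            .unwrap_or(BUCKET_EDGES_US.len());
        self.counts[bucket] += 1;
        self.total_us += us;
    }

    /// Number of samples recorded so far.
    fn samples(&self) -> u64 {
        self.counts.iter().sum()
    }
}

/// Perf-regression gate: true when `measured_us` lies within
/// +/- `tolerance` (e.g. 0.10) of the committed `baseline_us`.
fn within_gate(baseline_us: f64, measured_us: f64, tolerance: f64) -> bool {
    (measured_us - baseline_us).abs() <= baseline_us * tolerance
}
```

At a stage boundary the call site would wrap the stage in `std::time::Instant::now()` and pass `elapsed()` to `record`. Timing only observes the pipeline, so it does not touch float output.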

Phase 1 — Exact, zero-risk wins (days 10–35)

All EXACT findings; no witness impact; each lands with proof-runner verification:

F5 (precompute warm-start diag/CSR in CirEstimator::new) — biggest exact CIR win.
F8 (hoist tomography gradient buffer), F6, F10, F12, F1, F2 (allocation/duplication removal), F15 + F11 with pinned index tie-breaks.
WorldGraph retention policy (the §4.1 unbounded-growth fix) — design ADR + ring-buffer implementation.
Expected outcome: measurable cycle-time reduction and flat memory under 24 h soak; identical witness hashes.
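The allocation-hoisting items in this phase share one shape: move the buffer out of the loop and reset it with `fill(0.0)`. Below is a minimal sketch of that refactor on a toy dense ISTA solver for 0.5·‖Wx − y‖² + λ‖x‖₁. It is not the tomography code; the names and the dense `Vec<Vec<f64>>` layout are assumptions made for illustration.

```rust
/// Soft-thresholding operator used by ISTA.
fn soft_threshold(v: f64, t: f64) -> f64 {
    if v > t { v - t } else if v < -t { v + t } else { 0.0 }
}

/// One ISTA step. `grad` must be all zeros on entry; it receives W^T (W x - y).
fn ista_step(w: &[Vec<f64>], y: &[f64], x: &mut [f64], grad: &mut [f64], step: f64, lambda: f64) {
    for (row, &yi) in w.iter().zip(y) {
        let resid = row.iter().zip(x.iter()).map(|(a, b)| a * b).sum::<f64>() - yi;
        for (g, &a) in grad.iter_mut().zip(row) {
            *g += a * resid;
        }
    }
    for (xj, &g) in x.iter_mut().zip(grad.iter()) {
        *xj = soft_threshold(*xj - step * g, step * lambda);
    }
}

/// Before: a fresh gradient vector is allocated on every iteration.
fn solve_alloc_per_iter(w: &[Vec<f64>], y: &[f64], n: usize, iters: usize, step: f64, lambda: f64) -> Vec<f64> {
    let mut x = vec![0.0; n];
    for _ in 0..iters {
        let mut grad = vec![0.0; n];
        ista_step(w, y, &mut x, &mut grad, step, lambda);
    }
    x
}

/// After: one buffer allocated up front and reset with fill(0.0).
fn solve_hoisted(w: &[Vec<f64>], y: &[f64], n: usize, iters: usize, step: f64, lambda: f64) -> Vec<f64> {
    let mut x = vec![0.0; n];
    let mut grad = vec![0.0; n];
    for _ in 0..iters {
        grad.fill(0.0);
        ista_step(w, y, &mut x, &mut grad, step, lambda);
    }
    x
}
```

Both solvers execute the same float operations in the same order, so their outputs are bit-identical; only the allocator traffic changes. That is the property the EXACT tag asserts and the proof runners check.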

Phase 2 — Determinism-managed structural wins (days 35–70)

Each behind a feature flag, A/B'd against the legacy path (the use_cir_gate A/B switch at multistatic.rs:103 is the template), with proof-hash regeneration as an explicit, witnessed release event:

F4: FFT-based Φ/Φᴴ application in ISTA — the headline item; the only route to HE20/HE40 real-time and the 12-link pattern. Acceptance: cir_bench speedup ≥ 5× at HE20, soak metrics within guardrail, new expected_features.sha256 published in a fresh witness bundle.
F9 (power-method Lipschitz in tomography) riding the same hash-regen train.
F3 (slice-based CIR entry point), choosing the exact-f64-intermediate variant if the hash train slips.
F14: feature-gated rayon across links/nodes only, deterministic index-ordered collection; CI must run the determinism test (engine/src/lib.rs:535-548 cycle_is_deterministic) with the feature on.
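The F14 rule (parallel across links, never inside one) can be illustrated without any external crate. The sketch below uses `std::thread::scope` in place of rayon purely to stay self-contained; `link_energy` is a hypothetical stand-in for per-link work such as a CIR estimate. Each link keeps its own sequential accumulator, and results are joined in index order.

```rust
use std::thread;

/// Stand-in for per-link work: a strictly sequential float reduction.
fn link_energy(samples: &[f32]) -> f32 {
    let mut acc = 0.0_f32;
    for &s in samples {
        acc += s * s;
    }
    acc
}

/// Reference path: links processed one after another.
fn per_link_sequential(links: &[Vec<f32>]) -> Vec<f32> {
    links.iter().map(|l| link_energy(l)).collect()
}

/// Parallel path: one worker per link, no shared accumulator.
/// Handles are joined in spawn order, so output index i always
/// belongs to link i regardless of which worker finishes first.
fn per_link_parallel(links: &[Vec<f32>]) -> Vec<f32> {
    thread::scope(|s| {
        let handles: Vec<_> = links
            .iter()
            .map(|l| s.spawn(move || link_energy(l)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("link worker panicked"))
            .collect()
    })
}
```

Because the arithmetic inside each link is untouched and collection order is fixed, the parallel output matches the sequential output bit for bit. Splitting the loop inside `link_energy` across threads would break that guarantee.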

Phase 3 — Platform & toolchain (days 70–90)

Pi 5 target-cpu=cortex-a76 fleet builds + proof-runner verification (§5.1).
+simd128 WASM artifact + size budget check for wasm-edge (§5.2, §4.3).
PGO pilot in CI using proof runners as the training corpus (§5.3).
Re-baseline: new criterion numbers, a refreshed witness bundle, and this document's §1 updated with real measured latencies.

Out of 90-day scope, flagged for the architecture backlog: F16 (Complex64→Complex32 in core), F7 (single-matrix Φ kernel — bundle with F4), and HE40-on-edge (blocked on F4+F7).

7. Summary

The pipeline's only structural latency hazard is the dense ISTA CIR solver (cir.rs:601-654 + cir.rs:717-745): fine at HT20, ~1.1 GFLOP worst-case per estimate at HE40, and slated to run per-link (×12). Everything else is allocation churn and duplicated work, most of which can be removed with bit-exact refactors (the EXACT-tagged findings in F1–F12), plus one genuine memory bug-class issue: unbounded WorldGraph growth at 20 Hz (engine/src/lib.rs:346-352). The build profile is already aggressive and correct; remaining toolchain gains (target-cpu, wasm simd128, PGO) are determinism-safe and cheap. The determinism constraint is workable because the repo already owns the right tools (deterministic proof runners, an A/B gate pattern, and a per-cycle witness), so float-changing optimizations become scheduled, witnessed hash-regeneration events rather than risks.
