rUv 42dcf49f4d fix(adr): resolve duplicate ADR numbers + close ADR-080 security + ADR-154 M1 signal backlog (#1051)
* fix(signal): circular phase variance for ghost-tap guard (ADR-154 §7.4 #1)

`phase_variance` computed a LINEAR sample variance over phase angles that
wrap at ±π, so a tightly-clustered set straddling the branch cut reported
spuriously HIGH dispersion — false-tripping the `> TAU` ghost-tap guard on
real, tightly-clustered CIR taps.

Replace with Mardia's circular variance V = 1 − R̄, bounded [0,1] and
invariant to where the cluster sits on the circle. Re-derive the guard
against the bounded metric via a named const
`GHOST_TAP_CIRCULAR_VARIANCE_MAX` (the old TAU-scaled threshold is
meaningless on [0,1]).

Grade: metric fix MEASURED; threshold value DATA-GATED — a clean single-path
ramp also sweeps the circle, so V alone cannot separate clean from
unsanitized without labelled frames. Conservative default (0.99) errs toward
never false-rejecting, strictly more permissive at the wrap boundary than the
buggy linear guard.

Fails-on-old test: `phase_variance_circular_not_fooled_by_branch_cut` —
inlines the old linear variance to show it exceeds TAU on wrap-straddling
phases while circular V≈0 and the guard no longer trips. Plus
`phase_variance_circular_is_bounded_and_extremal` (V∈[0,1], V≈0 identical,
V≈1 uniform).
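
The branch-cut failure described above can be reproduced in a few lines. This is a minimal Python sketch, not the Rust `phase_variance` code from the commit; the function names and sample phases are illustrative assumptions.

```python
import math

def linear_sample_variance(phases):
    """Old behaviour: treats wrapped angles as ordinary real numbers."""
    n = len(phases)
    mean = sum(phases) / n
    return sum((p - mean) ** 2 for p in phases) / (n - 1)

def circular_variance(phases):
    """Mardia's circular variance V = 1 - R_bar, bounded to [0, 1]."""
    n = len(phases)
    c = sum(math.cos(p) for p in phases) / n
    s = sum(math.sin(p) for p in phases) / n
    r_bar = math.hypot(c, s)  # mean resultant length
    # Clamp to absorb floating-point round-off at the extremes.
    return min(1.0, max(0.0, 1.0 - r_bar))

# Hypothetical tightly clustered phases straddling the +/- pi branch cut.
straddling = [math.pi - 0.05, -math.pi + 0.05, math.pi - 0.02, -math.pi + 0.02]
uniform = [2.0 * math.pi * k / 8 for k in range(8)]
identical = [1.0] * 8

print(linear_sample_variance(straddling) > math.tau)  # linear metric exceeds TAU
print(circular_variance(straddling))                  # close to 0
print(circular_variance(uniform))                     # close to 1
```

The linear variance is large only because half the samples sit near +π and half near −π; on the circle they are neighbours, which the resultant length captures.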

cargo test -p wifi-densepose-signal --no-default-features --features cir --lib
→ 432 passed, 0 failed.

Co-Authored-By: claude-flow <ruv@ruv.net>

* fix(signal): pin Welford n=0/n=1 finiteness guard (ADR-154 §7.4 #10)

The shared `WelfordStats` (field_model.rs, used by longitudinal.rs and others)
relies on `count < 2` guards in `variance`/`sample_variance`/`std_dev`/
`z_score` to stay finite at the boundaries. The guards existed but the n=0
boundary was UNTESTED — exactly the §4 divide-by-(n−1) family the ADR groups
this with.

Add `welford_finite_at_n0_and_n1` asserting every statistic is finite and
returns the documented sentinel (0.0) at n=0 and n=1, plus load-bearing doc
comments on the two guards.

Fails-on-old proof: with the `sample_variance` guard removed, the test FAILS
with "attempt to subtract with overflow" at the `(self.count - 1)` underflow
(0usize − 1); `variance` would similarly yield 0.0/0.0 = NaN. The guard is
restored; the test pins it so a future regression is caught.
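
For illustration, the guard pattern can be sketched in Python. This is not the Rust `WelfordStats` from field_model.rs: the field names, the choice of population variance inside `std_dev`, and the zero-deviation handling in `z_score` are assumptions. Python integers do not underflow, so only the sentinel behaviour at n=0 and n=1 is shown.

```python
import math

class WelfordStats:
    """Streaming mean/variance with count < 2 guards (illustrative sketch)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def variance(self):
        if self.count < 2:  # guard: avoids 0.0 / 0.0 at n=0
            return 0.0
        return self.m2 / self.count

    def sample_variance(self):
        if self.count < 2:  # guard: avoids dividing by (n - 1) <= 0
            return 0.0
        return self.m2 / (self.count - 1)

    def std_dev(self):
        return math.sqrt(self.variance())

    def z_score(self, x):
        sd = self.std_dev()
        if self.count < 2 or sd == 0.0:  # assumed sentinel for a degenerate spread
            return 0.0
        return (x - self.mean) / sd

stats = WelfordStats()
print(stats.variance(), stats.sample_variance(), stats.z_score(5.0))  # sentinels at n=0
```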

Grade: MEASURED (boundary finiteness is asserted; the guard is the §4-family
fix made testable).

cargo test -p wifi-densepose-signal --no-default-features --lib field_model
→ 22 passed, 0 failed.

Co-Authored-By: claude-flow <ruv@ruv.net>

* refactor(signal): de-magic adversarial thresholds + boundary tests (ADR-154 §7.4 #13)

Lift the bare numeric literals buried in `check`/`check_consistency` into
named, documented module consts (FIELD_MODEL_GINI_VIOLATION=0.8,
ENERGY_RATIO_HIGH_VIOLATION=2.0, ENERGY_RATIO_LOW_VIOLATION=0.1,
CONSISTENCY_ACTIVE_FRACTION_OF_MEAN=0.1, SCORE_W_* weights). VALUES UNCHANGED —
each const equals the original literal; only names + pinning tests are new.

Grade: DATA-GATED. The operating values stay empirical (defensible values need
labelled spoofed/clean CSI — Wi-Spoof, §6.2/§7.3). The de-magicking +
characterization tests are MEASURED: `tuning_consts_unchanged_from_literals`,
`energy_ratio_high_boundary`, `energy_ratio_low_boundary`,
`field_model_gini_boundary`, `consistency_active_fraction_boundary` pin the
decision boundaries at/just-below/just-above each threshold, so a future
data-driven retune is a visible, tested change.

Fails-on-change proof: bumping ENERGY_RATIO_HIGH_VIOLATION 2.0→3.0 makes
`energy_ratio_high_boundary` FAIL (restored). Operating values explicitly
NOT changed.

cargo test -p wifi-densepose-signal --no-default-features --lib ruvsense::adversarial
→ 20 passed, 0 failed.

Co-Authored-By: claude-flow <ruv@ruv.net>

* refactor(signal): de-magic coherence drift/gate thresholds (ADR-154 §7.4 #9)

Lift the bare detection literals in `coherence.rs::classify_drift`
(DRIFT_STABLE_SCORE=0.85, DRIFT_STEP_CHANGE_MAX_STALE=10) and the
`coherence_gate.rs` Default impl (DEFAULT_ACCEPT_THRESHOLD=0.85,
DEFAULT_REJECT_THRESHOLD=0.5, DEFAULT_MAX_STALE_FRAMES=200,
DEFAULT_PREDICT_ONLY_NOISE=3.0) into named, documented consts. VALUES
UNCHANGED. The gate already exposed these via GatePolicyConfig (config seam);
this names + pins the defaults.

Grade: DATA-GATED. Operating values stay empirical (defensible Z-score
thresholds need labelled stable/drifting coherence traces). De-magicking +
boundary tests are MEASURED: `classify_drift_stable_score_boundary`,
`classify_drift_stale_count_boundary` pin the at/just-below/just-above
decisions; `drift_consts_unchanged_from_literals` /
`gate_default_consts_unchanged_from_literals` pin the values. Operating values
explicitly NOT changed.

cargo test -p wifi-densepose-signal --no-default-features --lib ruvsense::coherence
→ 40 passed, 0 failed.

Co-Authored-By: claude-flow <ruv@ruv.net>

* docs(adr-154): mark §7.4 P1 backlog cleared — Milestone-1 (#1,#10 RESOLVED; #9,#13 DATA-GATED)

Update ADR-154 §7.4 backlog rows #1, #9, #10, #13 with commit refs + grades,
the §7.4 intro count (four P1 items cleared, ~41 P2/P3 remain), the
Horizon-ledger one-liner (Milestone-1 DONE), and the §8 honest-limits #1 line
(metric now correct; threshold still DATA-GATED). Add CHANGELOG [Unreleased]
entry.

Grades: #1 RESOLVED (MEASURED metric / DATA-GATED threshold), #10 RESOLVED
(MEASURED), #9 & #13 RESOLVED-PARTIAL (DATA-GATED — de-magicked + boundary
tested, operating values unchanged).

Validation: cargo test --workspace --no-default-features → 2057 passed, 0
failed; wifi-densepose-signal lib → 442 passed (no-default + --features cir);
python archive/v1/data/proof/verify.py → VERDICT: PASS, hash f8e76f21…46f7a
UNCHANGED (CIR ghost-tap guard is not on the deterministic proof path).

Co-Authored-By: claude-flow <ruv@ruv.net>

* fix(sensing-server): stop leaking internal errors in HTTP responses (ADR-080 #2)

Six handlers in `main.rs` serialized the internal error `Display` straight
into the JSON response body, leaking server internals to any client (ADR-080
finding #2, CWE-209; reframed onto the Rust boundary by ADR-164 G11):

  - edge_registry_endpoint: a panicked spawn_blocking `JoinError`
    ("task … panicked") in a 500, and the raw upstream error in a 503
  - delete_model / delete_recording / start_recording: std::io::Error
    strings carrying OS detail / filesystem paths
  - calibration_start / calibration_stop: the FieldModel error chain

New `error_response` module: `internal_error` / `internal_error_json` /
`upstream_unavailable` log the full detail server-side only (tagged with a
correlation id) and return a generic body
(`{"error":"internal_error","correlation_id":…}`) — no `panicked`, no file
paths, no Debug chain. The correlation id lets an operator join a client
report to the exact server log line without ever shipping the detail.
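
The log-detail-server-side, return-generic-body pattern might look like the following Python sketch. It is not the Rust `error_response` module; the logger name, the use of a UUID as correlation id, and the return shape are assumptions made for illustration.

```python
import json
import logging
import uuid
from typing import Tuple

logger = logging.getLogger("error_response_sketch")  # hypothetical logger name

def internal_error(detail: str) -> Tuple[int, str]:
    """Log the full detail server-side; return only a generic JSON body."""
    correlation_id = uuid.uuid4().hex  # assumed id format
    logger.error("internal error [%s]: %s", correlation_id, detail)
    body = json.dumps({"error": "internal_error", "correlation_id": correlation_id})
    return 500, body

status, body = internal_error("task 7 panicked: /var/data/model.bin: os error 2")
print(status, body)  # the body carries the id, never the detail
```

An operator given the `correlation_id` from a client report can search the server log for the same id to recover the detail.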

Pinned by 5 error_response tests, incl. a leak-substring guard
(internal_error_body_does_not_leak_detail) verified to FAIL on the reverted
old body (returns the panic message / path / "os error"). The HOMECORE sweep
(ADR-161) covered homecore-server, not this crate.

Co-Authored-By: claude-flow <ruv@ruv.net>

* test(sensing-server): pin XFF-immunity + no-query-token (ADR-080 #1, #3)

Findings #1 (XFF-spoofing bypass) and #3 (JWT-in-URL, CWE-598) were logged
against the Python v1 API but are VERIFIED ABSENT on the current Rust
sensing-server, so they get regression tests rather than redundant fixes:

  - #1 XFF: there is no IP-based rate-limiter or IP-allowlist to bypass, and
    neither security middleware reads a forwarded header. Added
    bearer_auth::xff_header_never_affects_auth_decision (spoofed
    X-Forwarded-For never flips a 401<->200 decision) and
    host_validation::forwarded_headers_never_bypass_host_allowlist (spoofed
    X-Forwarded-Host: localhost never lets Host: evil.com past the allowlist).

  - #3 JWT-in-URL: require_bearer reads the token only from the Authorization
    header; WS handlers take no query token; the sole Query extractor
    (EdgeRegistryParams) is a non-secret refresh flag. Added
    bearer_auth::query_string_token_is_never_accepted — ?token= / ?access_token=
    in the URL never authenticates (stays 401) while the header path still 200s.
    Verified to FAIL when a query-token path is injected into require_bearer.
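
The two pinned properties can be sketched in Python as a header-only check. This is an illustrative stand-in for `require_bearer`, not its Rust implementation; the function signature and dict-based request model are assumptions.

```python
import hmac

def require_bearer(headers: dict, query: dict, expected_token: str) -> int:
    """Return 200 only when the Authorization header carries the expected token.

    `query` and any forwarded headers are deliberately never consulted.
    """
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    ok = scheme.lower() == "bearer" and hmac.compare_digest(
        token.encode(), expected_token.encode()
    )
    return 200 if ok else 401

SECRET = "s3cret"  # hypothetical token for the example
print(require_bearer({"Authorization": "Bearer s3cret"}, {}, SECRET))        # header path
print(require_bearer({}, {"token": "s3cret"}, SECRET))                       # query token ignored
print(require_bearer({"X-Forwarded-For": "127.0.0.1"}, {}, SECRET))          # XFF ignored
```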

Co-Authored-By: claude-flow <ruv@ruv.net>

* docs(adr-080): mark P0 security findings #1-#3 RESOLVED; close ADR-164 G11

- ADR-080: Status note + per-finding closure (#1 XFF and #3 JWT-in-URL
  verified absent + regression-pinned; #2 leaked errors fixed via the
  error_response module). Records the v1-vs-Rust boundary distinction
  explicitly: v1 paths remain archived; this closure governs the shipped
  Rust sensing-server.
- ADR-164: Gap Register G11 and the Open/Gated Backlog entry marked
  RESOLVED with the fix + branch reference.
- CHANGELOG: [Unreleased] -> ### Security entry covering all three findings.

Co-Authored-By: claude-flow <ruv@ruv.net>

* docs(adr): renumber 6 displaced ADRs to resolve duplicate-number collisions (ADR-164 G1)

Resolves the 5 duplicate ADR numbers (6 displaced files) flagged by ADR-164
Gap Register item G1. Canonical keeper per number = first file committed at
that number (date tie-broken by inbound cross-reference count / parent-appendix
relationship). Displaced files renumbered to the next free numbers (166-171):

  050 keeps provisioning-tool-enhancements (5 refs vs 1)
    -> ADR-166-quality-engineering-security-hardening
  052 keeps tauri-desktop-frontend (parent ADR)
    -> ADR-167-ddd-bounded-contexts (its appendix)
  147 keeps nvidia-cosmos/OccWorld (the actual ADR, has Status header)
    -> ADR-168-benchmark-proof (proof companion, no Status)
    -> ADR-169-adam-mode-light-theme (was untracked)
  148 keeps drone-swarm-control-system (committed #862)
    -> ADR-170-yoga-mode-pose-system (was untracked)
  149 keeps public-community-leaderboard-huggingface (committed 16:47 vs 17:38)
    -> ADR-171-swarm-benchmarking-evaluation-methodology

Updates in-file `# ADR-NNN` headers and intra-file self-references (yoga-modes

* docs(adr): repoint inbound cross-references to renumbered ADRs (166-171)

Follow-up to the ADR renumbering (ADR-164 G1). Updates every inbound reference
that pointed at a displaced ADR, disambiguating shared numbers by title/slug so
only references to the DISPLACED topic move and keeper references stay put.

ADR-168 (was 147 benchmark-proof): README, CHANGELOG, user-guide,
  proof-of-capabilities, research docs 00/03 — all path/label refs updated.
ADR-169 (was 147 adam-mode) / ADR-170 (was 148 yoga-mode): docs/adr/README index.
ADR-171 (was 149 swarm-benchmarking): all ruview-swarm eval code+docs
  (Cargo.toml, evals/, eval_swarm.rs, metrics/mod/report/runner.rs), research
  doc 03 (every §-ref matched ADR-171 sections, not AetherArena), 00-system-review,
  series README, CHANGELOG, and ADR-148's forward/"open issues" pointers.
ADR-166 (was 050 quality-engineering / security-hardening): disambiguated from the
  ADR-050 provisioning KEEPER by topic. The HMAC/secure_tdm, directory-traversal,
  bind-address, and OTA-PSK-auth references in code comments
  (wifi-densepose-hardware Cargo.toml + secure_tdm.rs, sensing-server main.rs) and
  in ADR-052-tauri / ADR-167 all describe the security-hardening ADR -> ADR-166.
ADR-167 (was 052 ddd-appendix): inbound appendix references.

Index/registry updates: docs/adr/README.md, gap-analysis/census.md (rows +
header count), gap-analysis/lens-findings.md (collision table marked RESOLVED),
and ADR-164 Gap Register G1 marked RESOLVED with the full renumber map.

Keeper references deliberately untouched: all ADR-147 OccWorld code, all ADR-148
drone-swarm code/docs, all ADR-149 AetherArena refs (incl. ADR-150's SSL/resampling
refs, which ADR-150 explicitly binds to the AetherArena benchmark), ADR-050
provisioning refs, ADR-052 tauri refs. The frozen GitHub blob URLs in
docs/adr/.issue-177-body.md (pinned to an old branch) are left as historical.

Comment-only code edits; no behavior change. wifi-densepose-hardware compiles
clean; the sensing-server build's sole blocker is the pre-existing upstream
midstreamer-temporal-compare@0.2.1 registry crate, unrelated to these edits.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-06-13 14:31:38 -04:00


ADR-168 Benchmark Proof — OccWorld on RTX 5080

Date: 2026-05-29
Hardware: NVIDIA GeForce RTX 5080 (15.47 GB VRAM), CUDA 12.8
Model: OccWorld TransVQVAE (random weights; pre-domain-fine-tuning baseline)
PyTorch: 2.10.0+cu128
mmengine: 0.10.7
Python env: /home/ruvultra/ml-env

Context

This document proves that the OccWorld TransVQVAE model builds, loads, and runs end-to-end on the local RTX 5080 at acceptable latency before any domain fine-tuning on RuView CSI/occupancy data. All numbers are measured from a cold Python process; no weights were loaded from a checkpoint (the config references out/occworld/epoch_125.pth which is absent — random initialisation is used throughout). Prediction quality numbers are therefore a baseline-without-domain-fine-tuning reading, not a target metric.


1. Model Metrics

| Metric | Value |
|---|---|
| Architecture | TransVQVAE (VAE-ResNet2D encoder/decoder + autoregressive transformer) |
| Total parameters | 72.39 M |
| Trainable parameters | 72.39 M |
| Weight initialisation | Random (no checkpoint; epoch_125.pth absent) |
| Model in-memory size | 276.1 MB (float32) |
| Sub-module: VAE | 14.17 M params |
| Sub-module: Transformer (PlanUAutoRegTransformer) | 58.18 M params |
| Sub-module: PoseEncoder | 0.02 M params |
| Sub-module: PoseDecoder | 0.02 M params |
| Input tensor | (1, 16, 200, 200, 16) int64; batch × frames × X × Y × Z |
| Input semantics | 18-class occupancy labels (nuScenes schema); 17 = empty |
| Output: sem_pred | (1, 15, 200, 200, 16) int64; 15 predicted future frames |
| Output: pose_decoded | (1, 3, 1, 2) float32; 3-mode ego-motion predictions |
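
The parameter and size rows can be cross-checked with a little arithmetic. The sketch below assumes 4 bytes per float32 parameter and that the "MB" figure is in binary units (2^20 bytes), which is what reproduces 276.1.

```python
# Per-sub-module parameter counts from the table above.
params = {
    "VAE": 14.17e6,
    "Transformer": 58.18e6,
    "PoseEncoder": 0.02e6,
    "PoseDecoder": 0.02e6,
}

total_params = sum(params.values())
bytes_per_param = 4                          # float32
size_mib = total_params * bytes_per_param / 2**20
size_gib = size_mib / 1024

print(f"total parameters: {total_params / 1e6:.2f} M")
print(f"in-memory size:   {size_mib:.1f} MiB ({size_gib:.3f} GiB)")
```

The GiB figure also lines up with the idle footprint reported in the VRAM profile.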

2. Inference Latency (batch=1, 10 runs, post-3-run warmup)

| Metric | Value (ms unless stated) |
|---|---|
| Run 1 (cold JIT) | 231.7 |
| Run 2 | 227.6 |
| Run 3 | 208.9 |
| Run 4 | 208.8 |
| Run 5 | 209.0 |
| Run 6 | 208.7 |
| Run 7 | 208.8 |
| Run 8 | 208.7 |
| Run 9 | 209.0 |
| Run 10 | 208.9 |
| Mean | 213.0 |
| P50 | 208.9 |
| P90 | 228.0 |
| P99 | 231.3 |
| Min | 208.7 |
| Max | 231.7 |
| Throughput (15 frames predicted per inference) | 70.4 predicted frames/sec |
| Per-frame latency | 14.2 ms/predicted-frame |

Notes:

  • Runs 1–2 are ~22 ms slower than steady-state (CUDA kernel compilation).
  • Steady-state (runs 3–10) is remarkably stable: 208.7–209.0 ms (0.3 ms spread).
  • The P99–mean spread of 18 ms is entirely from the first two JIT runs.
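
The summary rows follow from the ten per-run timings. The sketch below recomputes them with the standard library only; the linear-interpolation percentile is an assumption about how the original script computed P90/P99, chosen because it reproduces the reported values.

```python
import statistics

# Per-run latencies from the table above, in milliseconds.
runs_ms = [231.7, 227.6, 208.9, 208.8, 209.0, 208.7, 208.8, 208.7, 209.0, 208.9]
FRAMES_PER_INFERENCE = 15

def percentile(values, q):
    """Percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

mean_ms = statistics.mean(runs_ms)
throughput_fps = FRAMES_PER_INFERENCE / (mean_ms / 1000.0)
per_frame_ms = mean_ms / FRAMES_PER_INFERENCE

print(f"mean {mean_ms:.1f}  p50 {percentile(runs_ms, 50):.1f}  "
      f"p90 {percentile(runs_ms, 90):.1f}  p99 {percentile(runs_ms, 99):.1f}")
print(f"throughput {throughput_fps:.1f} frames/s, {per_frame_ms:.1f} ms/frame")
```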

3. VRAM Profile

| Stage | GB (allocated) | Notes |
|---|---|---|
| Baseline (before model load) | 0.000 | Clean process, CUDA context not yet created |
| After model load (idle) | 0.270 | Weights resident, no activations |
| During inference (peak allocated) | 3.368 | Forward pass activations + VAE codebook lookup |
| After inference (retained) | 2.095 | KV-cache / activation buffers not freed |
| Peak reserved (PyTorch allocator) | 6.543 | PyTorch memory pool; returned to OS on empty_cache() |
| Total VRAM on device | 15.47 | |
| Headroom at inference peak | 12.10 | Available for larger batches or multi-model co-location |

VRAM budget analysis:

  • Idle footprint (0.27 GB) is small enough to co-locate with a RuView CSI inference pipeline on the same GPU without contention.
  • Peak inference (3.37 GB allocated / 6.54 GB reserved) leaves 12.1 GB free by the allocated figure and about 8.9 GB free by the reserved figure, room for a batched training run alongside real-time inference.
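
A profile of this shape can be collected with PyTorch's CUDA memory counters. The following is a minimal sketch, not the benchmark script itself: `profile_vram` and its `run_inference` callback are hypothetical names, and reporting in units of 2^30 bytes is an assumption about the "GB" figures above. The `torch` import is kept inside the function so the arithmetic helper runs without a GPU.

```python
def headroom_gb(total_gb, peak_gb):
    """Free VRAM remaining at a given peak."""
    return total_gb - peak_gb

def profile_vram(run_inference):
    """Measure idle, peak, and retained VRAM around one inference call.

    `run_inference` is a hypothetical zero-argument callable that performs
    a single forward pass on the current CUDA device.
    """
    import torch  # imported lazily; requires a CUDA-enabled PyTorch build

    gib = 2**30
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    idle = torch.cuda.memory_allocated() / gib
    run_inference()
    torch.cuda.synchronize()  # wait for queued kernels before reading counters
    return {
        "idle": idle,
        "peak_allocated": torch.cuda.max_memory_allocated() / gib,
        "retained": torch.cuda.memory_allocated() / gib,
        "peak_reserved": torch.cuda.max_memory_reserved() / gib,
        "total": torch.cuda.get_device_properties(0).total_memory / gib,
    }

# Headroom recomputed from the table's own figures.
print(round(headroom_gb(15.47, 3.368), 2))  # against peak allocated
print(round(headroom_gb(15.47, 6.543), 2))  # against peak reserved
```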

4. Prediction Quality (Synthetic Linear Walk)

Setup: synthetic 200×200×16 occupancy grid; a single pedestrian (class 8) placed at voxel (100, 100, 8) and moved +2 voxels/frame eastward (≈2 m/s at nuScenes 0.5 m/voxel, 2 Hz: 2 voxels × 0.5 m × 2 frames/s). Fifteen past frames fed as context; 15 future frames compared against linear ground truth.

| Metric | Value | Notes |
|---|---|---|
| Voxel resolution | 0.5 m/voxel | nuScenes standard |
| Frame rate | 2 Hz | 0.5 s per frame |
| Person speed (ground truth) | 2.0 m/s east | 2 vox/frame (1 m per 0.5 s frame) |
| MDE (mean displacement error) | 18.98 vox / 9.49 m | averaged over 15 future frames |
| FDE (final displacement error) | 32.46 vox / 16.23 m | at frame 15 (7.5 s horizon) |
| Pedestrian voxels predicted (total, 15 frames) | 1,604,019 | model over-predicts occupancy with random weights |

Frame-by-frame comparison (first 5 of 15):

| Frame | GT centroid (X,Y) | Predicted centroid (X,Y) | Displacement (vox) |
|---|---|---|---|
| 1 | (102, 100) | (97.0, 96.3) | 6.3 |
| 2 | (104, 100) | (97.5, 97.1) | 7.1 |
| 3 | (106, 100) | (97.3, 96.6) | 9.4 |
| 4 | (108, 100) | (97.4, 97.2) | 10.9 |
| 5 | (110, 100) | (97.7, 96.2) | 12.9 |

Interpretation: with random weights the transformer predicts a near-static pseudo-centroid biased toward grid centre rather than tracking the moving target. This is the expected behaviour of an uninitialised network and establishes the pre-training MDE baseline. After domain fine-tuning on annotated CSI-derived occupancy sequences the MDE target is ≤2.0 vox (≤1.0 m) at 5-frame horizon per ADR-147 §5.
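
The displacement metrics can be sketched as follows, assuming displacement is the Euclidean distance between ground-truth and predicted centroids in the X–Y plane. Only the five tabulated frames are used, so the mean here is a 5-frame figure, not the 15-frame MDE of 18.98 vox; because the tabulated centroids are rounded, the recomputed displacements agree with the table only to within about 0.1 vox.

```python
import math

VOXEL_M = 0.5  # metres per voxel (nuScenes grid)

def gt_centroid(frame):
    """Linear ground truth: start at X=100 and move +2 voxels per frame."""
    return (100.0 + 2.0 * frame, 100.0)

# Predicted centroids for the first five future frames, from the table above.
predicted = {
    1: (97.0, 96.3),
    2: (97.5, 97.1),
    3: (97.3, 96.6),
    4: (97.4, 97.2),
    5: (97.7, 96.2),
}

displacements = [math.dist(gt_centroid(k), predicted[k]) for k in sorted(predicted)]
mde_vox = sum(displacements) / len(displacements)  # mean over these 5 frames only
fde_vox = displacements[-1]                        # displacement at the last frame used

for k, d in zip(sorted(predicted), displacements):
    print(f"frame {k}: {d:.2f} vox")
print(f"5-frame MDE: {mde_vox:.2f} vox = {mde_vox * VOXEL_M:.2f} m")
```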


5. IPC Round-trip

The OccWorld server (configured port 25095) was not running during this benchmark session. IPC round-trip measurement was therefore skipped.

| Port | Status |
|---|---|
| 25095 (OccWorld config) | closed; server not running |
| 8080 (other service) | open (unrelated) |

To measure IPC latency: start the serving process configured in config/occworld.py (port = 25095), then re-run the benchmark. Expected IPC overhead is negligible (<1 ms localhost TCP) compared to the 213 ms inference latency.
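
Before re-running, a quick probe can confirm whether the port is accepting connections. This is a generic TCP check sketched with the standard library; it measures connection setup only and assumes nothing about the OccWorld server's wire protocol, which this document does not describe.

```python
import socket
import time

def port_is_open(host, port, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def tcp_connect_ms(host, port, timeout=0.5):
    """Time a single TCP connect in milliseconds (setup cost only)."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0

if __name__ == "__main__":
    status = "open" if port_is_open("127.0.0.1", 25095) else "closed"
    print(f"port 25095: {status}")
```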


6. Verdict

PASS: every structural check that ran passed; the IPC test was skipped.

| Check | Result |
|---|---|
| Model builds from config without error | PASS |
| Model loads to CUDA in <500 ms | PASS (281 ms) |
| Forward pass completes without error | PASS |
| Steady-state latency ≤500 ms at batch=1 | PASS (208.9 ms P50) |
| Peak VRAM ≤ 8 GB | PASS (3.37 GB peak allocated) |
| Output shape correct (1,15,200,200,16) | PASS |
| Pedestrian voxels present in output | PASS (1.6 M voxels) |
| Pre-training MDE documented | PASS (18.98 vox baseline recorded) |
| IPC test | SKIP (server not running) |

Summary: OccWorld TransVQVAE runs end-to-end on the RTX 5080 at 213 ms mean latency with a 3.37 GB VRAM peak. The model is ready for domain fine-tuning on RuView CSI-derived occupancy data. Prediction quality numbers (MDE 9.49 m) confirm that the random-weight baseline is far from target and that domain fine-tuning is a prerequisite before any deployment evaluation. The VRAM headroom (12.1 GB free at inference peak) is sufficient to run training and inference concurrently on the same device.


7. Real CSI Data Benchmark (no mocks)

Run date: 2026-05-29
Data source: archive/v1/data/proof/ — deterministic real-hardware-parameter CSI (seed=42, 3 RX antennas, 56 subcarriers, 100 Hz, 10 s = 1000 frames)
Pipeline: CSI amplitude → variance-threshold presence → antenna-power-differential ENU position → snapshot_to_voxels() → OccWorld inference

| Metric | Value |
|---|---|
| CSI frames | 1000 @ 100 Hz (10 s recording) |
| Antennas / Subcarriers | 3 RX / 56 SC |
| Breathing frequency | 0.300 Hz |
| Walking frequency | 1.200 Hz |
| Active frames (40th-pct threshold) | 400/1000 (40%) |
| Inference windows (stride 50) | 20 |

Latency (20 real-CSI windows, RTX 5080)

| Metric | ms |
|---|---|
| mean | 212.47 |
| median | 208.45 |
| p95 | 226.01 |
| min | 207.81 |
| max | 226.11 |
| stdev | 7.39 |

VRAM (real-CSI pipeline)

| Stage | GB |
|---|---|
| Peak allocated | 3.977 |
| Retained after inference | 2.686 |
| Free headroom (RTX 5080) | 11.49 |

Output occupancy (15 predicted future frames)

| Metric | Value |
|---|---|
| Person-class voxels / inference (mean) | 48,504 |
| Person-class voxels (range) | 48,306–48,668 |

Note: a high voxel count is expected with random weights (no domain fine-tuning). After fine-tuning on RuView CSI data, person voxels are expected to cluster tightly around the predicted person positions.

Throughput

| Metric | Value |
|---|---|
| Predicted frames / sec | 72.0 |
| Inferences / sec | 4.80 |
| CSI → prediction end-to-end | ~210 ms |
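
The window count and throughput figures can be rederived from the numbers in this section. The sketch assumes one inference window starts at every 50th frame and that throughput was computed from the median latency; both assumptions reproduce the reported values.

```python
CSI_FRAMES = 1000
STRIDE = 50
MEDIAN_LATENCY_MS = 208.45
FRAMES_PER_INFERENCE = 15

# One window start every STRIDE frames across the recording.
window_starts = list(range(0, CSI_FRAMES, STRIDE))

inferences_per_sec = 1000.0 / MEDIAN_LATENCY_MS
predicted_frames_per_sec = FRAMES_PER_INFERENCE * inferences_per_sec

print(f"windows: {len(window_starts)}")
print(f"inferences/sec: {inferences_per_sec:.2f}")
print(f"predicted frames/sec: {predicted_frames_per_sec:.1f}")
```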

Verdict: PASS

The real CSI pipeline runs cleanly end-to-end. Median latency (208 ms) matches the synthetic baseline (208.9 ms P50), consistent with inference cost at batch=1 being independent of input content. Peak VRAM is about 0.6 GB higher than in the synthetic run (3.98 GB vs 3.37 GB allocated), leaving 11.5 GB of headroom.