fix(sensing-server): make multistatic guard interval operator-configurable (#1049)

Two ESP32-S3 nodes on WiFi/ESP-NOW sync drift 10-150 ms (~70 ms typ.), exceeding the 60 ms default guard → permanent trust demotion to Restricted, all pose output suppressed, 200k+ errors, no escape but a container restart.

Add a direct WDP_GUARD_INTERVAL_US override (+ optional WDP_SOFT_GUARD_US) to multistatic_guard_config_from_env. Precedence (most-specific wins): direct override > WDP_TDM_SLOTS+WDP_TDM_SLOT_US schedule-derived > 60 ms/20 ms default. Soft band always clamped strictly below hard; malformed/zero ignored (falls back, never breaks fusion). Effective guard logged at startup.

Pinned by 6 tests (multistatic_guard_config_tests). sensing-server bin tests 449 -> 455, 0 failed. Python proof PASS, hash unchanged (off signal path).

Closes #1049.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-06-16 11:23:19 +00:00 · 2026-06-15 13:41:43 -04:00
2 changed files with 129 additions and 16 deletions
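
For readers who want to check the override arithmetic without building the server, here is a minimal sketch of the direct-override step. It is not the crate's code: `GuardConfig`, `DEFAULT_GUARD`, `parse_positive_us`, and `apply_overrides` are hypothetical stand-ins, the 60 ms / 20 ms default is taken from the commit message, and the TDM-derived base is left out because `MultistaticConfig::for_tdm_schedule` is not shown in this diff.

```rust
/// Hypothetical stand-in for `MultistaticConfig`; only the two guard fields
/// touched by this commit are modelled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct GuardConfig {
    pub guard_interval_us: u64,
    pub soft_guard_us: u64,
}

/// Default quoted in the commit message: 60 ms hard / 20 ms soft.
pub const DEFAULT_GUARD: GuardConfig = GuardConfig {
    guard_interval_us: 60_000,
    soft_guard_us: 20_000,
};

/// Parse a positive integer number of microseconds. Unset, empty, zero,
/// negative, and fractional inputs all yield `None`, so the caller keeps
/// whatever base it already selected.
pub fn parse_positive_us(raw: Option<&str>) -> Option<u64> {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&v| v >= 1)
}

/// Apply the direct hard and soft overrides on top of an already-selected base
/// (the default here; a schedule-derived config would slot in the same way).
pub fn apply_overrides(
    base: GuardConfig,
    guard_us: Option<&str>,
    soft_us: Option<&str>,
) -> GuardConfig {
    let mut cfg = base;

    if let Some(g) = parse_positive_us(guard_us) {
        cfg.guard_interval_us = g;
        // Pull the soft band below a lowered hard guard, with a floor of 1 µs.
        if cfg.soft_guard_us >= g {
            cfg.soft_guard_us = g.saturating_sub(1).max(1);
        }
    }

    if let Some(s) = parse_positive_us(soft_us) {
        // An explicit soft value is capped at hard - 1 (again floored at 1 µs).
        cfg.soft_guard_us = s.min(cfg.guard_interval_us.saturating_sub(1).max(1));
    }

    cfg
}
```

Under these assumptions, `apply_overrides(DEFAULT_GUARD, Some("200000"), None)` gives 200000 µs hard with the 20000 µs soft band untouched, `Some("10000")` alone pulls soft down to 9999 µs, and a soft request of `"999999"` against a 50000 µs hard guard is capped at 49999 µs. One edge case is worth knowing: because of the 1 µs floor, a hard guard of exactly 1 µs leaves soft equal to hard rather than strictly below it.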
@@ -7,6 +7,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

+### Fixed
+- **Multistatic fusion guard interval is now operator-configurable — fixes permanent trust demotion with WiFi-synced ESP32 nodes (#1049).** Two independently-clocked ESP32-S3 boards on ESP-NOW sync drift 10–150 ms (typ. ~70 ms) — the 100 ms beacon + WiFi-MAC jitter cannot hold them within the published 60 ms default guard, so the governed-trust cycle permanently demoted to `Restricted`, suppressed all pose output, and spun the error counter to 200k+ with **no escape hatch but a container restart**. Added a **direct `WDP_GUARD_INTERVAL_US` override** (+ optional `WDP_SOFT_GUARD_US`) to `multistatic_guard_config_from_env`, so a deployment can lift the hard guard past its measured spread (e.g. `WDP_GUARD_INTERVAL_US=200000`) without having to know its exact TDM schedule. Precedence is most-specific-wins: a direct override beats the existing `WDP_TDM_SLOTS`+`WDP_TDM_SLOT_US` schedule-derived guard, which beats the 60 ms/20 ms default; the override is applied on top of whichever base is selected, the soft band is always clamped strictly below the hard guard, and a malformed/zero value is ignored (falls back to the base rather than breaking fusion). The effective guard is now logged at startup. Pinned by 6 new tests (`multistatic_guard_config_tests`): direct-override-wins / beats-TDM-derived / soft-clamped-below-hard / lowering-hard-pulls-soft-down / malformed-or-zero-falls-back / default-when-unset. `wifi-densepose-sensing-server` bin tests **449 → 455**, 0 failed; Python proof VERDICT PASS, hash unchanged (off the signal proof path).
+
 ### Security
 - **`wifi-densepose-occworld-candle` — beyond-SOTA security + correctness review (Milestone #9, crate 4/4).** (1) **HIGH (MEASURED) — checkpoint-load crash on any int32 tensor** (`model.rs::safetensor_dtype_to_candle`). `safetensors::Dtype::I32` was mapped to `candle_core::DType::I64` and the raw int32 byte buffer (4 bytes/elem) was then handed to `Tensor::from_raw_buffer(.., I64, shape, ..)`. Candle derives `elem_count = data.len() / dtype.size_in_bytes()`, so the I64 path halved the element count while keeping the *original* shape — yielding a tensor whose declared shape claims twice as many elements as its backing storage holds. Reading it **panics** (`range end index 6 out of range for slice of length 3` — slice OOB inside candle-core) on any attacker-supplied or PyTorch-exported checkpoint containing an int32 tensor (common: index/buffer tensors). Fixed by mapping `I32 → DType::I32` (and `I16 → DType::I16`), both first-class candle dtypes. Reproduction recorded on old code; pinned by `tests/checkpoint_loading.rs::int32_tensor_loads_with_consistent_shape_and_values` (panics on old, passes on new) plus F32/I64/corrupt-file control cases. (2) **LOW (MEASURED) — `predict()` lacked frame/batch validation at the input boundary** (`inference.rs`). It validated H/W/D but not the externally-supplied frame count; an `f_in > num_frames*2` over-indexed the temporal positional embedding deep in the transformer and surfaced as a cryptic candle "gather" `InvalidIndex` (returned error, not a panic — candle bounds-checks), and a zero frame/batch dim fed a zero-element tensor into the pipeline. Now rejected at the boundary with a clear `ShapeMismatch`. Pinned by `predict_rejects_zero_frames` / `predict_rejects_too_many_frames` / `predict_accepts_frame_count_at_capacity`. (3) **LOW (MEASURED) — divide-by-zero panic on a degenerate input to the public `VQCodebook::encode`** (`vqvae.rs`): a rank-0 / empty-last-dim tensor made `last == 0` and panicked on `elem_count() / last`. 
Now fails closed with a clear error. Pinned by `encode_rejects_scalar_without_panicking`. **Dimensions confirmed CLEAN with evidence:** panic surface — zero `unwrap()`/`expect()`/`panic!`/`unreachable!` in production code paths (grep evidence; all error handling via `?`/`map_err`); NaN-state-poisoning — N/A (engine is stateless between `predict` calls, input is `u8` class indices so non-finite input is structurally impossible, no persistent world-model buffer to latch into); unbounded-alloc / shape-data mismatch from malformed weights — defended upstream by `safetensors::validate()` (overflow-checked `nelements*dtype.size()` vs declared byte range, rejected before reaching candle); secrets — none (grep clean, only `token_h`/`token_w` config fields match). `unsafe_code = forbid` in the crate manifest. **Build/validation status (MEASURED on Windows):** crate builds and tests under `cargo test -p wifi-densepose-occworld-candle --no-default-features` — **29/29 pass** (20 unit + 4 checkpoint_loading + 3 predict_honesty + 2 doc) after fixes; `cargo test --workspace --no-default-features` = 0 failed across all crates (lone `wifi-densepose-desktop` `api_integration` failure was a Windows "Access is denied (os error 5)" file-lock flake — re-ran in isolation **21/21 pass**); Python proof VERDICT PASS, hash `f8e76f21…446f7a` unchanged. *Warrants ADR slot 179 (parent to author).*
 - **`wifi-densepose-wasm-edge` beyond-SOTA closing review — boundary NaN-state-poisoning guard + clean-with-evidence attestation (ADR-040 edge crate, ~70 modules).** Closing pass of the security campaign over the last untouched sizeable crate. **One real finding fixed (LOW / source-analysis + reproduced):** the two WASM↔host frame boundaries (`lib.rs::on_frame`/`on_timer` and `bin/ghost_hunter.rs::on_frame`) read raw IEEE-754 `f32` from the `csi_get_phase`/`csi_get_amplitude`/`csi_get_variance`/`csi_get_motion_energy` host imports **without any finiteness check** — the entire crate had **zero** `is_finite`/`is_nan` guards, and the in-crate `clamp` helpers propagate NaN (`NaN < lo` and `NaN > hi` are both false). A single non-finite value (firmware DSP bug, uninitialised buffer, or hostile host) latches NaN into the long-lived per-module accumulators (EMA, Welford, phasor sums, anomaly baselines); once latched, every downstream comparison evaluates `false`, so detectors fail **degraded** (stuck gate state, silently-disabled anomaly checks) — silent corruption, not a crash (WASM `panic=abort` is *not* tripped: no indexing/`unwrap` on the poisoned value). Threat model is a **semi-trusted** boundary (the Tier-2 DSP firmware supplies the imports, not direct network/JS), hence LOW severity / defense-in-depth. **Fix:** added `sanitize_host_f32()` (maps non-finite→`0.0`, `core`-only so it holds in `no_std`) applied at every `host_get_*` float read — a single chokepoint covering all ~70 downstream modules, mirroring the existing M-01 negative-`n_subcarriers` boundary clamp. **Pinned by** `boundary_tests::{sanitize_passes_finite_values_through, sanitize_maps_non_finite_to_zero, coherence_monitor_nan_latches_without_sanitize_but_not_with}` — the last asserts on the *current* `CoherenceMonitor` that a raw NaN frame latches the smoothed score (documents the hazard) while the boundary-sanitized path stays finite. 
**Dimensions attested CLEAN with evidence (source-analysis):** (a) **panic-on-input** — every non-test `unwrap()`/`expect()` is either `#[cfg(test)]` or in the `std`-gated RVF *builder* host tool writing to an in-memory `Vec` (infallible); no `panic!`/`unreachable!`/`todo!`/`get_unchecked` in any hot path. (b) **shape/bounds** — all frame-buffer access is `min()`-clamped (`MAX_SC=32`, `DTW_MAX_LEN`, `LCS_WINDOW`, `PATTERN_LEN`), all index-by-cast sites (`feature_id as usize`, `conclusion_id`, `minute_counter`, `plan_step`) are either compile-time-const-bounded or `if idx <`/`%`-guarded; negative `n_subcarriers` already mapped to 0 (M-01). (c) **memory/leak** — no `move ||` closures, no `mem::forget`/`Box::leak`/`.leak()`; the only `Box::new` is in the `std`-gated `skill_registry` (one-time init, bounded). (d) **secrets** — none (grep clean). **MEASURED build/test evidence:** host `cargo test --features std,medical-experimental` = **672 passed / 0 failed** (was 669 pre-fix; +3 new tests); the real deployment artifacts all build clean on the actual target — `cargo build --target wasm32-unknown-unknown --release` (no_std/panic=abort default lib), `--bin ghost_hunter --no-default-features --features standalone-bin`, and `--features medical-experimental` (toolchain 1.89 per `rust-toolchain.toml`). No ADR slot needed — a single LOW defense-in-depth boundary fix; CHANGELOG attestation suffices.
@@ -6391,32 +6391,71 @@ fn vitals_snapshots_from_sensing_json(
    }
 }

-/// Build the multistatic guard config, optionally derived from the TDM schedule
-/// declared in the environment (#1031).
+/// Build the multistatic guard config from the environment (#1031, #1049).
 ///
-/// When both `WDP_TDM_SLOTS` and `WDP_TDM_SLOT_US` parse as positive integers,
-/// the guard is derived via [`MultistaticConfig::for_tdm_schedule`] so a
-/// deployment can match its exact schedule. Otherwise the published default
-/// (60 ms hard / 20 ms soft) is returned. `min_nodes` is *not* set here — the
-/// caller overrides it for single-node passthrough.
+/// Three precedence layers, most-specific wins:
+/// 1. `WDP_GUARD_INTERVAL_US` (+ optional `WDP_SOFT_GUARD_US`) — a **direct**
+///    hard-guard override. This is the #1049 escape hatch: WiFi/ESP-NOW-synced
+///    ESP32 nodes drift 10–150 ms (the 100 ms beacon + WiFi-MAC jitter cannot
+///    hold two independently-clocked boards within the published default), so a
+///    deployment can simply lift the guard past its measured spread (e.g.
+///    `WDP_GUARD_INTERVAL_US=200000`) without knowing its exact TDM schedule.
+/// 2. `WDP_TDM_SLOTS` + `WDP_TDM_SLOT_US` (both positive) — derive the guard
+///    from the declared schedule via [`MultistaticConfig::for_tdm_schedule`].
+/// 3. Otherwise the published default (60 ms hard / 20 ms soft).
+///
+/// The direct override (1) is applied **on top of** whichever base (2 or 3) is
+/// selected, so `WDP_GUARD_INTERVAL_US` always wins for the hard guard while a
+/// TDM-derived soft band is preserved unless it would exceed the new hard guard.
+/// `min_nodes` is *not* set here — the caller overrides it for single-node
+/// passthrough.
 fn multistatic_guard_config_from_env() -> MultistaticConfig {
    multistatic_guard_config_from(
        std::env::var("WDP_TDM_SLOTS").ok().as_deref(),
        std::env::var("WDP_TDM_SLOT_US").ok().as_deref(),
+        std::env::var("WDP_GUARD_INTERVAL_US").ok().as_deref(),
+        std::env::var("WDP_SOFT_GUARD_US").ok().as_deref(),
    )
 }

 /// Pure core of [`multistatic_guard_config_from_env`] for testability.
-fn multistatic_guard_config_from(slots: Option<&str>, slot_us: Option<&str>) -> MultistaticConfig {
-    match (
+fn multistatic_guard_config_from(
+    slots: Option<&str>,
+    slot_us: Option<&str>,
+    guard_us: Option<&str>,
+    soft_us: Option<&str>,
+) -> MultistaticConfig {
+    // Base: TDM-schedule-derived when both slot params are valid, else default.
+    let mut cfg = match (
        slots.and_then(|s| s.trim().parse::<usize>().ok()),
        slot_us.and_then(|s| s.trim().parse::<u64>().ok()),
    ) {
-        (Some(n), Some(us)) if n >= 1 && us >= 1 => {
-            MultistaticConfig::for_tdm_schedule(n, us)
-        }
+        (Some(n), Some(us)) if n >= 1 && us >= 1 => MultistaticConfig::for_tdm_schedule(n, us),
        _ => MultistaticConfig::default(),
+    };
+
+    // Direct hard-guard override (#1049). Ignored when unset/zero/unparseable so
+    // a malformed env var falls back to the base rather than breaking fusion.
+    if let Some(g) = guard_us
+        .and_then(|s| s.trim().parse::<u64>().ok())
+        .filter(|&g| g >= 1)
+    {
+        cfg.guard_interval_us = g;
+        // Keep the soft band below the (possibly lowered) hard guard; equal only when g == 1.
+        if cfg.soft_guard_us >= g {
+            cfg.soft_guard_us = g.saturating_sub(1).max(1);
+        }
    }
+
+    // Optional explicit soft-guard override, always clamped strictly below hard.
+    if let Some(s) = soft_us
+        .and_then(|s| s.trim().parse::<u64>().ok())
+        .filter(|&s| s >= 1)
+    {
+        cfg.soft_guard_us = s.min(cfg.guard_interval_us.saturating_sub(1).max(1));
+    }
+
+    cfg
 }

 /// Turn a `ProgressiveLoader::new` failure into an actionable diagnostic (#894).
@@ -7485,11 +7524,16 @@ async fn main() {
        pose_tracker: PoseTracker::new(),
        last_tracker_instant: None,
        multistatic_fuser: {
-            // #1031: the default guard (60 ms hard / 20 ms soft) accommodates a
-            // real TDM slot offset. A deployment can override it to match its
-            // own schedule via WDP_TDM_SLOTS + WDP_TDM_SLOT_US (both set ⇒ derive
-            // from the schedule), else the published default is used.
+            // #1031/#1049: the default guard (60 ms hard / 20 ms soft)
+            // accommodates a real TDM slot offset. A deployment overrides it via
+            // WDP_GUARD_INTERVAL_US (direct, e.g. 200000 for WiFi/ESP-NOW sync —
+            // #1049) or WDP_TDM_SLOTS + WDP_TDM_SLOT_US (derive from schedule).
            let cfg = multistatic_guard_config_from_env();
+            info!(
+                "Multistatic fusion guard: {} µs hard / {} µs soft (override via \
+                 WDP_GUARD_INTERVAL_US / WDP_SOFT_GUARD_US, or WDP_TDM_SLOTS+WDP_TDM_SLOT_US)",
+                cfg.guard_interval_us, cfg.soft_guard_us
+            );
            let mut fuser = MultistaticFuser::with_config(MultistaticConfig {
                min_nodes: 1, // single-node passthrough
                ..cfg
@@ -7797,6 +7841,72 @@ async fn main() {
    info!("Server shut down cleanly");
 }

+#[cfg(test)]
+mod multistatic_guard_config_tests {
+    //! #1049 — the multistatic guard interval must be operator-configurable so a
+    //! WiFi/ESP-NOW deployment (10–150 ms inter-node clock drift) can lift the
+    //! guard past its measured timestamp spread instead of being permanently
+    //! demoted to Restricted with no escape hatch.
+    use super::*;
+
+    #[test]
+    fn default_guard_when_nothing_set() {
+        let cfg = multistatic_guard_config_from(None, None, None, None);
+        assert_eq!(cfg.guard_interval_us, MultistaticConfig::default().guard_interval_us);
+        assert_eq!(cfg.soft_guard_us, MultistaticConfig::default().soft_guard_us);
+    }
+
+    #[test]
+    fn direct_guard_override_wins_and_unblocks_wifi_spread() {
+        // The #1049 reporter's measured ~70 ms spread exceeds the 60 ms default
+        // → permanent demotion. A direct 200 ms override accepts it.
+        let cfg = multistatic_guard_config_from(None, None, Some("200000"), None);
+        assert_eq!(cfg.guard_interval_us, 200_000);
+        assert!(cfg.soft_guard_us < cfg.guard_interval_us);
+        // 70 ms spread now sits inside the guard.
+        assert!(70_000 < cfg.guard_interval_us);
+    }
+
+    #[test]
+    fn direct_guard_override_beats_tdm_derived() {
+        // Both TDM params AND a direct override set → the direct hard guard wins,
+        // the TDM-derived soft band is preserved (still strictly below hard).
+        let cfg = multistatic_guard_config_from(Some("2"), Some("18000"), Some("200000"), None);
+        assert_eq!(cfg.guard_interval_us, 200_000);
+        assert!(cfg.soft_guard_us < cfg.guard_interval_us);
+        assert!(cfg.soft_guard_us >= 1);
+    }
+
+    #[test]
+    fn soft_override_is_clamped_strictly_below_hard() {
+        // A soft guard ≥ hard would be nonsensical → clamped below the hard guard.
+        let cfg = multistatic_guard_config_from(None, None, Some("50000"), Some("999999"));
+        assert_eq!(cfg.guard_interval_us, 50_000);
+        assert!(cfg.soft_guard_us < 50_000);
+    }
+
+    #[test]
+    fn lowering_hard_below_default_soft_pulls_soft_down() {
+        // Override hard to 10 ms (< default 20 ms soft) → soft drops below it.
+        let cfg = multistatic_guard_config_from(None, None, Some("10000"), None);
+        assert_eq!(cfg.guard_interval_us, 10_000);
+        assert!(cfg.soft_guard_us < 10_000);
+    }
+
+    #[test]
+    fn malformed_or_zero_override_falls_back_to_base() {
+        // Garbage / zero must not break fusion — fall back to the base config.
+        for bad in ["", "abc", "0", "-5", "12.5"] {
+            let cfg = multistatic_guard_config_from(None, None, Some(bad), None);
+            assert_eq!(
+                cfg.guard_interval_us,
+                MultistaticConfig::default().guard_interval_us,
+                "override {bad:?} should be ignored"
+            );
+        }
+    }
+}
+
 #[cfg(test)]
 mod node_sync_snapshot_serialization_tests {
    //! ADR-110 iter 24 — JSON public-API contract for the iter 23