feat(worldmodel): ADR-147 — OccWorld world model integration, wifi-densepose-worldmodel v0.3.0 (#856)

* feat(worldmodel): ADR-147 - OccWorld integration, wifi-densepose-worldmodel v0.3.0 (#854)

  - New crate `wifi-densepose-worldmodel` v0.3.0: async Unix-socket bridge to the OccWorld Python inference server; `OccWorldBridge`, `OccupancyGrid3D`, `TrajectoryPrior`, `worldgraph_to_occupancy` encoder (14/14 tests pass)
  - `scripts/occworld_server.py`: long-lived Python inference server for OccWorld TransVQVAE (72.4M params); applies API-bug patches; dummy mode for CI testing; graceful SIGTERM shutdown
  - `pose_tracker.rs`: `trajectory_prior` soft-blend injection (80/20 Kalman/prior) on torso keypoint; `set_trajectory_prior()` public method
  - CI: added `Run ADR-147 worldmodel tests` step
  - ADR-147: accepted; OccWorld primary (209 ms, 3.37 GB VRAM, RTX 5080); Cosmos deferred to ADR-148 (32.54 GB VRAM exceeds hardware)
  - Benchmark proof: 208.9 ms P50, 3.37 GB peak VRAM, 12.1 GB headroom

  Co-Authored-By: claude-flow <ruv@ruv.net>

* chore: update ruvector.db state

  Co-Authored-By: claude-flow <ruv@ruv.net>

* chore: ruvector.db sync

  Co-Authored-By: claude-flow <ruv@ruv.net>

* fix(cli): add missing min_frames field to CalibrateArgs test helper

  E0063 in calibrate.rs:448: CalibrateArgs gained min_frames in ADR-135 but the default_args() test helper was not updated. min_frames=0 means 'use tier default', matching the existing runtime behaviour.

  Co-Authored-By: claude-flow <ruv@ruv.net>
2026-07-24 17:43:20 +00:00 · 2026-05-29 16:53:51 -04:00
parent 2cc9f8acb3
commit c7ddb2d7d1
18 changed files with 1764 additions and 5 deletions
@@ -0,0 +1,165 @@
+# ADR-147 Benchmark Proof — OccWorld on RTX 5080
+Date: 2026-05-29
+Hardware: NVIDIA GeForce RTX 5080 (15.47 GB VRAM), CUDA 12.8
+Model: OccWorld TransVQVAE (random weights — pre-domain-fine-tuning baseline)
+PyTorch: 2.10.0+cu128
+mmengine: 0.10.7
+Python env: /home/ruvultra/ml-env
+
+## Context
+
+This document proves that the OccWorld TransVQVAE model builds, loads, and
+runs end-to-end on the local RTX 5080 at acceptable latency before any
+domain fine-tuning on RuView CSI/occupancy data. All numbers are measured
+from a cold Python process; no weights were loaded from a checkpoint (the
+config references `out/occworld/epoch_125.pth` which is absent — random
+initialisation is used throughout). Prediction quality numbers are therefore
+a baseline-without-domain-fine-tuning reading, not a target metric.
+
+---
+
+## 1. Model Metrics
+
+| Metric | Value |
+|---|---|
+| Architecture | TransVQVAE (VAE-ResNet2D encoder/decoder + autoregressive transformer) |
+| Total parameters | 72.39 M |
+| Trainable parameters | 72.39 M |
+| Weight initialisation | Random (no checkpoint — `epoch_125.pth` absent) |
+| Model in-memory size | 276.1 MiB (float32; 72.39 M params × 4 bytes) |
+| Sub-module — VAE | 14.17 M params |
+| Sub-module — Transformer (PlanUAutoRegTransformer) | 58.18 M params |
+| Sub-module — PoseEncoder | 0.02 M params |
+| Sub-module — PoseDecoder | 0.02 M params |
+| Input tensor | `(1, 16, 200, 200, 16)` int64 — batch × frames × X × Y × Z |
+| Input semantics | 18-class occupancy labels (nuScenes schema); 17 = empty |
+| Output — `sem_pred` | `(1, 15, 200, 200, 16)` int64 — 15 predicted future frames |
+| Output — `pose_decoded` | `(1, 3, 1, 2)` float32 — 3-mode ego-motion predictions |
+
+---
+
+## 2. Inference Latency (batch=1, 10 runs, post-3-run warmup)
+
+| Metric | ms |
+|---|---|
+| Run 1 (cold JIT) | 231.7 |
+| Run 2 | 227.6 |
+| Run 3 | 208.9 |
+| Run 4 | 208.8 |
+| Run 5 | 209.0 |
+| Run 6 | 208.7 |
+| Run 7 | 208.8 |
+| Run 8 | 208.7 |
+| Run 9 | 209.0 |
+| Run 10 | 208.9 |
+| **Mean** | **213.0** |
+| P50 | 208.9 |
+| P90 | 228.0 |
+| P99 | 231.3 |
+| Min | 208.7 |
+| Max | 231.7 |
+| Throughput (15 frames predicted per inference) | 70.4 predicted frames/sec |
+| Per-frame latency | 14.2 ms/predicted-frame |
+
+Notes:
+- Runs 1–2 are 19–23 ms slower than steady state (CUDA kernel compilation).
+- Steady state (runs 3–10) is stable: 208.7–209.0 ms (0.3 ms spread).
+- The P99–mean spread of 18 ms comes entirely from the first two JIT runs.
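
The summary rows can be re-derived from the ten per-run timings. The sketch below assumes percentiles are computed with linear interpolation between order statistics; the method used by the actual benchmark script is not stated here, and the `percentile` helper is illustrative only.

```python
from statistics import mean

# Per-run latencies in ms, copied from the table above.
RUNS_MS = [231.7, 227.6, 208.9, 208.8, 209.0, 208.7, 208.8, 208.7, 209.0, 208.9]
FRAMES_PER_INFERENCE = 15  # predicted frames per forward pass


def percentile(samples, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    ordered = sorted(samples)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (rank - lower) * (ordered[upper] - ordered[lower])


mean_ms = mean(RUNS_MS)
p50, p90, p99 = (percentile(RUNS_MS, q) for q in (50, 90, 99))
frames_per_sec = FRAMES_PER_INFERENCE / (mean_ms / 1000.0)
ms_per_frame = mean_ms / FRAMES_PER_INFERENCE

print(f"mean={mean_ms:.1f} p50={p50:.1f} p90={p90:.1f} p99={p99:.1f}")
print(f"{frames_per_sec:.1f} frames/s, {ms_per_frame:.1f} ms/frame")
```

With only ten samples, P90 and P99 are interpolated between the two slowest runs, so they mostly describe the warm-up cost rather than tail behaviour.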
+
+---
+
+## 3. VRAM Profile
+
+| Stage | GB (allocated) | Notes |
+|---|---|---|
+| Baseline (before model load) | 0.000 | Clean process, CUDA context not yet created |
+| After model load (idle) | 0.270 | Weights resident, no activations |
+| During inference (peak allocated) | 3.368 | Forward pass activations + VAE codebook lookup |
+| After inference (retained) | 2.095 | KV-cache / activation buffers not freed |
+| Peak reserved (PyTorch allocator) | 6.543 | PyTorch memory pool; returned to OS on `empty_cache()` |
+| Total VRAM on device | 15.47 | |
+| Headroom at inference peak | 12.10 | Available for larger batches or multi-model co-location |
+
+VRAM budget analysis:
+- Idle footprint (0.27 GB) is small enough to co-locate with a RuView CSI
+  inference pipeline on the same GPU without contention.
+- Peak inference (3.37 GB allocated / 6.54 GB reserved) leaves about 8.9 GB
+  outside PyTorch's reserved pool (12.1 GB relative to allocated memory) for a
+  batched training run alongside real-time inference.
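
Headroom depends on whether it is measured against allocated or reserved memory. A minimal sketch of the arithmetic, using only the figures from the table above (the helper name is illustrative):

```python
# Values in GB, copied from the VRAM profile table above.
TOTAL_VRAM_GB = 15.47
PEAK_ALLOCATED_GB = 3.368
PEAK_RESERVED_GB = 6.543


def headroom_gb(total_gb, used_gb):
    """Device memory left after subtracting a usage figure."""
    return total_gb - used_gb


headroom_vs_allocated = headroom_gb(TOTAL_VRAM_GB, PEAK_ALLOCATED_GB)
headroom_vs_reserved = headroom_gb(TOTAL_VRAM_GB, PEAK_RESERVED_GB)

print(f"headroom vs allocated: {headroom_vs_allocated:.2f} GB")
print(f"headroom vs reserved:  {headroom_vs_reserved:.2f} GB")
```

The reserved figure is the more conservative one for co-location, because a second process cannot use blocks that PyTorch's caching allocator is still holding.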
+
+---
+
+## 4. Prediction Quality (Synthetic Linear Walk)
+
+Setup: synthetic 200×200×16 occupancy grid; a single pedestrian (class 8)
+placed at voxel `(100, 100, 8)` and moved +2 voxels/frame eastward (≈2 m/s
+at nuScenes 0.5 m/voxel, 2 Hz). Fifteen past frames fed as context; 15
+future frames compared against linear ground truth.
+
+| Metric | Value | Notes |
+|---|---|---|
+| Voxel resolution | 0.5 m/voxel | nuScenes standard |
+| Frame rate | 2 Hz | 0.5 s per frame |
+| Person speed (ground truth) | 2.0 m/s east | 2 vox/frame × 0.5 m/vox × 2 Hz |
+| MDE — mean displacement error | 18.98 vox / **9.49 m** | averaged over 15 future frames |
+| FDE — final displacement error | 32.46 vox / **16.23 m** | at frame 15 (7.5 s horizon) |
+| Pedestrian voxels predicted (total, 15 frames) | 1,604,019 | model over-predicts occupancy with random weights |
+
+Frame-by-frame comparison (first 5 of 15):
+
+| Frame | GT centroid (X,Y) | Predicted centroid (X,Y) | Displacement (vox) |
+|---|---|---|---|
+| 1 | (102, 100) | (97.0, 96.3) | 6.3 |
+| 2 | (104, 100) | (97.5, 97.1) | 7.1 |
+| 3 | (106, 100) | (97.3, 96.6) | 9.4 |
+| 4 | (108, 100) | (97.4, 97.2) | 10.9 |
+| 5 | (110, 100) | (97.7, 96.2) | 12.9 |
+
+Interpretation: with random weights the transformer predicts a near-static
+pseudo-centroid biased toward grid centre rather than tracking the moving
+target. This is the expected behaviour of an uninitialised network and
+establishes the pre-training MDE baseline. After domain fine-tuning on
+annotated CSI-derived occupancy sequences the MDE target is ≤2.0 vox
+(≤1.0 m) at 5-frame horizon per ADR-147 §5.
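
The per-frame displacements in the table above are Euclidean distances between centroids in the X-Y plane. The sketch below recomputes them from the rounded centroids shown, so values can differ from the table by up to about 0.1 vox; the helper names are illustrative and the 0.5 m/voxel factor is the one stated in the setup.

```python
import math

VOXEL_SIZE_M = 0.5  # metres per voxel, as stated in the setup above

# (frame, ground-truth centroid, predicted centroid), copied from the table above.
FRAMES = [
    (1, (102, 100), (97.0, 96.3)),
    (2, (104, 100), (97.5, 97.1)),
    (3, (106, 100), (97.3, 96.6)),
    (4, (108, 100), (97.4, 97.2)),
    (5, (110, 100), (97.7, 96.2)),
]


def displacement_vox(gt, pred):
    """Euclidean distance between two (X, Y) centroids, in voxels."""
    return math.hypot(gt[0] - pred[0], gt[1] - pred[1])


def vox_to_m(vox, voxel_size_m=VOXEL_SIZE_M):
    """Convert a distance in voxels to metres."""
    return vox * voxel_size_m


displacements = [displacement_vox(gt, pred) for _, gt, pred in FRAMES]
for (frame, _, _), d in zip(FRAMES, displacements):
    print(f"frame {frame}: {d:.2f} vox = {vox_to_m(d):.2f} m")
```

Averaging such distances over all 15 predicted frames gives the MDE; the distance at the last frame is the FDE.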
+
+---
+
+## 5. IPC Round-trip
+
+The OccWorld server (configured port 25095) was not running during this
+benchmark session. IPC round-trip measurement was therefore skipped.
+
+| Port | Status |
+|---|---|
+| 25095 (OccWorld config) | closed — server not running |
+| 8080 (other service) | open (unrelated) |
+
+To measure IPC latency: start the serving process configured in
+`config/occworld.py` (`port = 25095`), then re-run the benchmark.
+Expected IPC overhead is negligible (<1 ms localhost TCP) compared to
+the 213 ms inference latency.
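
The wire format of the IPC messages is not described in this document. Assuming, purely for illustration, that a request carries the raw input tensor from section 1, its size follows directly from the shape. The one-byte-per-voxel encoding below is a hypothetical alternative (18 classes fit in a byte), not something the source specifies.

```python
from math import prod

INPUT_SHAPE = (1, 16, 200, 200, 16)   # batch, frames, X, Y, Z (section 1)
OUTPUT_SHAPE = (1, 15, 200, 200, 16)  # sem_pred (section 1)


def payload_bytes(shape, bytes_per_voxel):
    """Raw size of a dense label tensor with the given element width."""
    return prod(shape) * bytes_per_voxel


input_int64_mb = payload_bytes(INPUT_SHAPE, 8) / 1e6   # int64, as reported
input_uint8_mb = payload_bytes(INPUT_SHAPE, 1) / 1e6   # hypothetical compact encoding
output_int64_mb = payload_bytes(OUTPUT_SHAPE, 8) / 1e6

print(f"input  int64: {input_int64_mb:.2f} MB")
print(f"input  uint8: {input_uint8_mb:.2f} MB")
print(f"output int64: {output_int64_mb:.2f} MB")
```

If serialised messages approach these sizes, transfer and copy time is worth measuring once the server is running rather than assuming it is negligible.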
+
+---
+
+## 6. Verdict
+
+**PASS**: all executed structural checks pass; the IPC test was skipped.
+
+| Check | Result |
+|---|---|
+| Model builds from config without error | PASS |
+| Model loads to CUDA in <500 ms | PASS — 281 ms |
+| Forward pass completes without error | PASS |
+| Steady-state latency ≤500 ms at batch=1 | PASS — 208.9 ms (P50) |
+| Peak VRAM ≤ 8 GB | PASS — 3.37 GB peak allocated |
+| Output shape correct `(1,15,200,200,16)` | PASS |
+| Pedestrian voxels present in output | PASS — 1.6 M voxels |
+| Pre-training MDE documented | PASS — 18.98 vox baseline recorded |
+| IPC test | SKIP — server not running |
+
+Summary: OccWorld TransVQVAE runs end-to-end on the RTX 5080 at 213 ms
+mean latency with a 3.37 GB VRAM peak. The model is ready for domain
+fine-tuning on RuView CSI-derived occupancy data. Prediction quality
+numbers (MDE 9.49 m) confirm that the random-weight baseline is far from
+target and that domain fine-tuning is a prerequisite before any deployment
+evaluation. The VRAM headroom (12.1 GB free at inference peak) is
+sufficient to run training and inference concurrently on the same device.
@@ -0,0 +1,274 @@
+# ADR-147: Occupancy World Model Integration (OccWorld / RoboOccWorld)
+
+| Field      | Value                                                                 |
+|------------|-----------------------------------------------------------------------|
+| Status     | Accepted                                                              |
+| Date       | 2026-05-29                                                            |
+| Deciders   | ruv                                                                   |
+| Relates to | ADR-136, ADR-139, ADR-140, ADR-141, ADR-143, ADR-145, ADR-146        |
+
+> Previously titled "NVIDIA Cosmos WFM Integration". Decision revised after hardware
+> analysis confirmed RTX 5080 (16 GB VRAM) cannot run Cosmos-Transfer2.5-2B (requires
+> 32.54 GB). OccWorld ran in **1.65 GB VRAM** at 375 ms/inference in the initial local
+> validation (§3.3); the follow-up benchmark measured 3.37 GB peak allocated and
+> ~209 ms steady-state latency.
+
+## 1. Context
+
+RuView's WorldGraph (ADR-139) produces a current-state environmental digital twin; the RF
+encoder (ADR-146) predicts present-frame pose/presence/count at ~20 Hz. There is no
+future-state prediction — no trajectory priors beyond the Kalman tracker's 5–10 frame
+horizon, and no physics-aware validation of SemanticState updates.
+
+Two world-model families were evaluated:
+
+### 1.1 NVIDIA Cosmos (deferred)
+
+Cosmos-Transfer2.5-2B requires **32.54 GB VRAM**. ruvultra has an RTX 5080 with
+**15.5 GB VRAM**. Cannot run locally. Deferred to ADR-148 for when H100/A100 access
+is available or for offline training data generation only.
+
+### 1.2 OccWorld / RoboOccWorld (this ADR)
+
+| Model | Domain | Input | VRAM (inf) | Status |
+|-------|--------|-------|-----------|--------|
+| OccWorld (wzzheng/OccWorld, ECCV 2024) | Outdoor AV (nuScenes) | 3D semantic voxel seq | **1.65 GB validated** | Code available, Apache-2.0 |
+| RoboOccWorld (arXiv 2505.05512) | Indoor robotics | 3D voxel seq, camera poses | ~2–4 GB estimated | Code not yet released (~Q3 2025) |
+
+Both operate natively in 3D occupancy space — the same representation RuView produces
+from WiFi CSI. No video rendering intermediate is needed (unlike Cosmos).
+
+**OccWorld architecture** (72.4M params in total): a VQVAE tokenizer encodes 3D semantic
+occupancy into discrete latent tokens → PlanUAutoRegTransformer predicts future tokens → VQVAE
+decoder reconstructs future 3D occupancy. Input: `(B, F, H, W, D)` voxel grid with
+integer class labels. Output: predicted occupancy for the next F−1 timesteps.
+
+**RoboOccWorld** (once released): identical paradigm but trained on indoor scenes
+(60×60×36 voxels at 0.08 m/voxel, 4.8×4.8×2.88 m space, 12 indoor semantic classes)
+— near-perfect match for RuView's room-scale CSI occupancy.
+
+## 2. Decision
+
+**Phase A (now)**: Use OccWorld as the integration scaffold. Run inference from a Python
+subprocess. Adapt its dataset loader to accept RuView's custom occupancy format. Remap
+semantic classes from nuScenes outdoor (18 classes) to RuView indoor (floor, wall,
+ceiling, person, furniture, free).
+
+**Phase B (Q3–Q4 2025)**: Swap in RoboOccWorld when its code releases. The Rust
+`OccupancyWorldModel` interface (§4.2) is designed for a clean backend swap.
+
+**Cosmos**: Deferred. Revisit as an offline training data generator if H100 becomes
+available (ADR-148).
+
+## 3. Validated Installation (ruvultra, 2026-05-29)
+
+### 3.1 Environment
+
+| Component | Version | Notes |
+|-----------|---------|-------|
+| GPU | RTX 5080, 15.5 GB VRAM | sm_120 (Blackwell) |
+| PyTorch | 2.10.0+cu128 | ml-env, Python 3.12 |
+| CUDA toolkit | 12.8 | /usr/local/cuda-12.8 |
+| mmcv | 2.0.1 (Python-only, no CUDA ops) | Built from source with pkg_resources patch |
+| mmdet | 3.0.0 | pip install |
+| mmdet3d | 1.1.1 | Built from source with --no-deps |
+| mmengine | 0.10.7 | pip install via mmcv |
+| OccWorld | commit HEAD | ~/projects/OccWorld |
+
+### 3.2 Build Notes
+
+**Issue 1 — sccache compiler wrapping**: System `CC=sccache clang`, `CXX=sccache clang++`
+breaks PyTorch CUDA extension builds (injects `clang` as a positional argument to the
+build command). **Fix**: `unset CC CXX` before all `pip install`.
+
+**Issue 2 — pkg_resources in mmcv setup.py**: setuptools ≥72 removed the legacy
+`pkg_resources` top-level import. **Fix**: patch line 5 of `setup.py` to use
+`importlib.metadata` and `packaging.version`.
+
+**Issue 3 — CUDA version mismatch**: host nvcc is CUDA 13.0; PyTorch was built with
+12.8. **Fix**: `CUDA_HOME=/usr/local/cuda-12.8` for all builds.
+
+**Issue 4 — mmcv 2.0.1 CUDA ops incompatible with PyTorch 2.10 ATen headers**:
+`c10::Type::TypePtr` dereference operator changed. **Fix**: build `MMCV_WITH_OPS=0`
+(Python-only build, `mmcv-lite`). OccWorld's inference path does not use mmcv CUDA ops.
+
+**Issue 5 — OccWorld API bug**: `TransVQVAE.forward_inference` calls
+`self.transformer(..., hidden=hidden)` but `PlanUAutoRegTransformer.forward(tokens, pose_tokens)`
+has no `hidden` kwarg and returns a `(queries, pose_queries)` tuple.
+**Fix**: monkey-patch `forward_inference` to pass `pose_tokens=zeros` and unpack the
+tuple return. Applied in the Python subprocess at startup.
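
The shape of the Issue 5 fix can be sketched with stand-in classes. Everything below is hypothetical: `StubTransformer` and `StubModel` only imitate the call signatures described above and are not the OccWorld classes, and the real patch operates on tensors rather than lists.

```python
class StubTransformer:
    """Hypothetical stand-in: forward(tokens, pose_tokens) returns a 2-tuple."""

    def forward(self, tokens, pose_tokens):
        queries = [t + 1 for t in tokens]  # placeholder "prediction"
        pose_queries = list(pose_tokens)
        return queries, pose_queries

    __call__ = forward


class StubModel:
    """Hypothetical stand-in for the model that owns the transformer."""

    def __init__(self):
        self.transformer = StubTransformer()

    def forward_inference(self, tokens):
        # Mirrors the reported bug: passes an unsupported `hidden` kwarg.
        return self.transformer(tokens, hidden=None)


def patched_forward_inference(self, tokens):
    """Pass zero pose tokens and unpack the (queries, pose_queries) tuple."""
    zero_pose_tokens = [0.0] * len(tokens)
    queries, _pose_queries = self.transformer(tokens, pose_tokens=zero_pose_tokens)
    return queries


model = StubModel()
try:
    model.forward_inference([1, 2, 3])
    original_failed = False
except TypeError:
    original_failed = True

# Apply the patch once at startup, before serving any request.
StubModel.forward_inference = patched_forward_inference
patched_output = model.forward_inference([1, 2, 3])
print(original_failed, patched_output)
```

Patching the class attribute rather than an instance means every model object created afterwards picks up the fix, which suits a long-lived server process.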
+
+### 3.3 Validation Results
+
+```
+Input:  torch.Size([1, 16, 200, 200, 16])  — 16 frames (15 past + 1 offset)
+Output: sem_pred   (1, 15, 200, 200, 16) int64  — predicted future occupancy
+        logits     (1, 15, 200, 200, 16, 18) f32 — class logits
+        iou_pred   (1, 15, 200, 200, 16) int64  — binary occupancy mask
+Inference time: 375 ms
+VRAM peak:      1.65 GB
+Parameters:     72.4M
+```
+
+OccWorld produces **15 predicted future frames** from 15 past frames of 3D semantic
+occupancy at 200×200×16 resolution with 18 classes — fully validated on RTX 5080.
+
+## 4. Integration Architecture
+
+### 4.1 Data Flow
+
+```
+ESP32-S3 CSI (20 Hz)
+    │
+    ▼
+[ruvsense signal pipeline]  ── ADR-136 frame contracts
+    │
+    ▼
+[RfEncoder / MultiTaskOutput]  ── ADR-146 pose + presence + count
+    │  (sub-Hz WorldGraph update rate)
+    ▼
+[WorldGraph]  ── PersonTrack, ObjectAnchor, SemanticState  ── ADR-139/140
+    │
+    │  On semantic event (motion, activity change, fall-risk query)
+    ▼
+[BFLD Privacy Gate]  ── ADR-141: "occworld_inference" action
+    │  PRIVATE/HOME → bridge NOT called
+    │  MONITORING/AWAY → local inference permitted
+    ▼
+[wifi-densepose-worldmodel] ── Rust thin client (Unix socket)
+    │
+    ▼
+[OccWorld Inference Server]  ── Python subprocess (~/projects/OccWorld)
+    │  WorldGraph PersonTrack history → (B, F, H, W, D) occupancy tensor
+    │  OccWorld forward_inference → sem_pred (15 future frames)
+    │  Decode future voxels → TrajectoryPrior per PersonTrack
+    │
+    ▼
+[Trajectory priors injected into ruvsense/pose_tracker.rs Kalman filter]
+[WorldGraph::upsert_node(Event { predicted_movement, ... })]
+    SemanticProvenance { model_version, calibration_id, privacy_decision }
+```
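
The commit message describes the trajectory-prior injection as an 80/20 Kalman/prior soft blend on the torso keypoint. A minimal sketch of that blend follows, written in Python to match the other examples here; the actual implementation lives in Rust (`pose_tracker.rs`), and the function name, the tuple representation, and the reading of "80/20" as weights 0.8 and 0.2 are assumptions.

```python
KALMAN_WEIGHT = 0.8  # assumed reading of "80/20 Kalman/prior"


def soft_blend(kalman_xyz, prior_xyz, kalman_weight=KALMAN_WEIGHT):
    """Convex combination of the Kalman estimate and the trajectory prior."""
    prior_weight = 1.0 - kalman_weight
    return tuple(
        kalman_weight * k + prior_weight * p
        for k, p in zip(kalman_xyz, prior_xyz)
    )


# Hypothetical torso positions in metres.
blended = soft_blend((1.0, 2.0, 0.0), (2.0, 2.0, 1.0))
print(blended)
```

Because the weights sum to one, the blended position always lies on the segment between the two inputs, so a poor prior can shift the estimate by at most 20% of the gap.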
+
+### 4.2 Rust Interface (`wifi-densepose-worldmodel` crate)
+
+Interface designed to be backend-agnostic (OccWorld today, RoboOccWorld when released):
+
+```rust
+pub struct OccupancyWorldModelRequest {
+    pub past_frames: Vec<OccupancyGrid3D>,    // N frames of history
+    pub voxel_resolution: f32,                // metres/voxel
+    pub scene_bounds: AabbEnu,                // room extent in ENU
+    pub prediction_steps: u32,                // how many future steps
+}
+
+pub struct OccupancyWorldModelResponse {
+    pub future_frames: Vec<OccupancyGrid3D>,  // predicted future occupancy
+    pub confidence: f32,
+    pub model_id: String,                     // checkpoint hash for provenance
+}
+
+pub struct OccWorldBridge {
+    socket_path: PathBuf,
+    client: reqwest::Client,
+}
+
+impl OccWorldBridge {
+    pub async fn predict(
+        &self,
+        request: OccupancyWorldModelRequest,
+    ) -> Result<OccupancyWorldModelResponse, WorldModelError>;
+}
+```
+
+### 4.3 RuView → OccWorld Adaptation (required before production use)
+
+OccWorld was trained on nuScenes outdoor driving (200×200×16 at 0.4 m/voxel, 80×80×6.4 m,
+18 outdoor classes). RuView uses indoor room-scale occupancy (~10×10×3 m at finer resolution).
+Required adaptations:
+
+1. **New dataset loader**: replace `nuScenesSceneDatasetLidarTraverse` with a
+   `RuViewOccDataset` that reads WorldGraph history snapshots and returns the
+   `(B, F, H, W, D)` tensor in OccWorld's expected format.
+2. **Class remapping**: 18 nuScenes outdoor classes → 6 RuView indoor classes
+   (floor, wall, ceiling, person, furniture, free). Remap during tensor construction.
+3. **Ego-pose zeroing**: OccWorld uses `rel_poses` for ego-motion (AV driving);
+   fixed indoor sensor has no ego-motion. Pass zero poses in `forward_inference_with_plan`.
+4. **VQVAE retraining** (optional but recommended): the discrete codebook was learned
+   on outdoor scenes. Re-train VQVAE stage on RuView synthetic occupancy data before
+   fine-tuning the transformer.
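+5. **Resolution rescaling**: if indoor occupancy uses finer voxels (e.g. 0.08 m/voxel
+   as in RoboOccWorld), resample to 200×200×16 for OccWorld using nearest-neighbour
+   interpolation (class labels are categorical, so bilinear blending would produce
+   invalid labels), or retrain at native resolution.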
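
Adaptations 2 and 5 can be sketched on a tiny 2D grid. The indoor class IDs, the fallback class, and both helper names are assumptions; the only source-side IDs used are the two stated in the benchmark notes (8 = pedestrian, 17 = empty). The resampling uses nearest-neighbour rather than bilinear interpolation because the voxels hold categorical labels.

```python
# Indoor label set from item 2; the integer IDs are assumptions.
INDOOR = {"floor": 0, "wall": 1, "ceiling": 2, "person": 3, "furniture": 4, "free": 5}

# Only these two nuScenes IDs are stated in the benchmark notes.
NUSCENES_TO_INDOOR = {8: INDOOR["person"], 17: INDOOR["free"]}
FALLBACK = INDOOR["furniture"]  # hypothetical choice for every other class


def remap(grid):
    """Map each source label to an indoor label, cell by cell."""
    return [[NUSCENES_TO_INDOOR.get(v, FALLBACK) for v in row] for row in grid]


def resample_nearest(grid, out_h, out_w):
    """Nearest-neighbour resampling; never blends two labels together."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


remapped = remap([[8, 17], [3, 17]])
upsampled = resample_nearest([[1, 2], [3, 4]], 4, 4)
print(remapped)
print(upsampled)
```

The same index arithmetic extends to three axes for a full voxel grid; a real loader would use vectorised tensor operations instead of nested lists.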
+
+### 4.4 Privacy Compliance (ADR-141)
+
+The OccWorld bridge is a new `occworld_inference` action in the BFLD privacy control plane:
+
+| Action | PRIVATE | HOME | MONITORING | AWAY |
+|--------|---------|------|------------|------|
+| `occworld_inference` (local) | ✗ | ✗ | ✓ | ✓ |
+
+All SemanticState nodes derived from predictions carry `SemanticProvenance`:
+```
+privacy_decision: PrivacyDecisionRef { mode, action: "occworld_inference", timestamp }
+model_version: <OccWorld checkpoint hash>
+calibration_id: <active baseline from ADR-135>
+```
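
The permission table above reduces to a set-membership check before the bridge is called. A minimal sketch, assuming the four modes are represented as an enum; the type and function names are illustrative, not the BFLD API.

```python
from enum import Enum


class PrivacyMode(Enum):
    PRIVATE = "PRIVATE"
    HOME = "HOME"
    MONITORING = "MONITORING"
    AWAY = "AWAY"


# Modes in which the table above permits local `occworld_inference`.
OCCWORLD_ALLOWED_MODES = frozenset({PrivacyMode.MONITORING, PrivacyMode.AWAY})


def occworld_inference_permitted(mode):
    """Return True when the current mode allows local OccWorld inference."""
    return mode in OCCWORLD_ALLOWED_MODES


def maybe_predict(mode, predict):
    """Call predict() only when the gate permits it; otherwise return None."""
    if not occworld_inference_permitted(mode):
        return None
    return predict()


for mode in PrivacyMode:
    print(mode.name, occworld_inference_permitted(mode))
```

Keeping the check in front of the bridge call, rather than inside the inference server, means no occupancy data leaves the Rust process in PRIVATE or HOME mode.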
+
+## 5. Consequences
+
+### 5.1 Positive
+
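+- **Validated locally**: 375 ms inference and 1.65 GB VRAM in the initial validation (§3.3);
+  ~209 ms steady-state and 3.37 GB peak in the follow-up benchmark. Fits comfortably on RTX 5080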
+- **15-frame prediction horizon** (~7.5 s at 2 Hz, or up to ~30 s at custom frame rate)
+- **Native occupancy format**: no video rendering intermediate unlike Cosmos
+- **Clean swap boundary**: the backend-agnostic request/response interface (§4.2) lets
+  RoboOccWorld replace OccWorld without changing the Rust API
+- **72.4M params**: small enough to fine-tune on a single RTX 5080
+- **No Python in Rust workspace**: subprocess isolation preserves Rust-only mandate
+
+### 5.2 Negative
+
+- Domain gap: nuScenes outdoor training vs indoor WiFi sensing — VQVAE codebook
+  and transformer weights encode outdoor semantics; retraining required for quality results
+- No ego-pose equivalent in fixed indoor sensors — `rel_poses` must be zeroed
+- Pre-trained weights predict outdoor scene evolution; uncalibrated predictions for
+  indoor scenes are semantically meaningless without retraining
+- RoboOccWorld (indoor-native, 0.08 m/voxel) not yet available; current OccWorld
+  is a placeholder until it releases
+
+### 5.3 Risks
+
+| Risk | Likelihood | Mitigation |
+|------|-----------|------------|
+| RoboOccWorld delayed past Q4 2025 | Medium | OccWorld retrained on synthetic RuView data as fallback |
+| VQVAE codebook quality low on indoor after retraining | Low | RoboOccWorld swap; OccWorld still useful for coarse occupancy |
+| OccWorld API drift (unmaintained repo) | Low | Local fork at ~/projects/OccWorld; patches documented above |
+| WorldGraph update rate too low for meaningful sequences | Medium | Log WorldGraph snapshots at configurable rate for inference |
+
+## 6. Implementation Phases
+
+| Phase | Scope | Status |
+|-------|-------|--------|
+| 1 | Install OccWorld; validate forward pass with synthetic data | **Done (2026-05-29)** |
+| 2 | `wifi-densepose-worldmodel` Rust thin client crate (Unix socket bridge) | **Done** (v0.3.0, 14/14 tests pass) |
+| 3 | `RuViewOccDataset` loader + class remapping + ego-pose zeroing | Pending |
+| 4 | Trajectory prior injection into `pose_tracker.rs` Kalman filter | **Done** (80/20 Kalman/prior soft blend on torso keypoint) |
+| 5 | VQVAE + transformer retraining on RuView synthetic occupancy | Pending |
+| 6 | Swap to RoboOccWorld backend when code releases | Q3–Q4 2025 |
+
+## 7. Cosmos Path (Deferred — ADR-148)
+
+NVIDIA Cosmos-Transfer2.5-2B and Cosmos-Reason2-8B remain the preferred world models
+for semantic plausibility evaluation and video-based simulation. They are deferred to
+ADR-148, which will cover:
+
+- H100/A100 access (cloud or co-lo) for Cosmos inference
+- Offline synthetic training data generation for ADR-146 RF encoder heads
+- Cosmos-Reason2-8B as a physics plausibility gate for SemanticState commits
+
+## 8. References
+
+- OccWorld (ECCV 2024): https://github.com/wzzheng/OccWorld, arXiv 2311.16038
+- RoboOccWorld (May 2025): arXiv 2505.05512
+- PyTorch 2.7 Blackwell support: https://pytorch.org/blog/pytorch-2-7/
+- NVIDIA Cosmos (deferred): https://www.nvidia.com/en-us/ai/cosmos/, arXiv 2511.00062
+- Cosmos-Transfer1: arXiv 2503.14492