feat(worldmodel): ADR-147 — OccWorld world model integration, wifi-densepose-worldmodel v0.3.0 (#856)

* feat(worldmodel): ADR-147 — OccWorld integration, wifi-densepose-worldmodel v0.3.0 (#854)

- New crate `wifi-densepose-worldmodel` v0.3.0: async Unix-socket bridge
  to OccWorld Python inference server; `OccWorldBridge`, `OccupancyGrid3D`,
  `TrajectoryPrior`, `worldgraph_to_occupancy` encoder (14/14 tests pass)
- `scripts/occworld_server.py`: long-lived Python inference server for
  OccWorld TransVQVAE (72.4M params); applies API-bug patches; dummy mode
  for CI testing; graceful SIGTERM shutdown
- `pose_tracker.rs`: `trajectory_prior` soft-blend injection (80/20
  Kalman/prior) on torso keypoint; `set_trajectory_prior()` public method
- CI: added `Run ADR-147 worldmodel tests` step
- ADR-147: accepted — OccWorld primary (209 ms, 3.37 GB VRAM, RTX 5080);
  Cosmos deferred to ADR-148 (32.54 GB VRAM exceeds hardware)
- Benchmark proof: 208.9 ms P50 (208.7 ms min), 3.37 GB peak VRAM, 12.1 GB headroom

Co-Authored-By: claude-flow <ruv@ruv.net>

* chore: update ruvector.db state

Co-Authored-By: claude-flow <ruv@ruv.net>

* chore: ruvector.db sync

Co-Authored-By: claude-flow <ruv@ruv.net>

* fix(cli): add missing min_frames field to CalibrateArgs test helper

E0063 in calibrate.rs:448 — CalibrateArgs gained min_frames in ADR-135
but the default_args() test helper was not updated. min_frames=0 means
'use tier default', matching the existing runtime behaviour.

Co-Authored-By: claude-flow <ruv@ruv.net>
rUv
2026-05-29 16:53:51 -04:00
committed by GitHub
parent 2cc9f8acb3
commit c7ddb2d7d1
18 changed files with 1764 additions and 5 deletions
@@ -0,0 +1,165 @@
# ADR-147 Benchmark Proof — OccWorld on RTX 5080
Date: 2026-05-29
Hardware: NVIDIA GeForce RTX 5080 (15.47 GB VRAM), CUDA 12.8
Model: OccWorld TransVQVAE (random weights — pre-domain-fine-tuning baseline)
PyTorch: 2.10.0+cu128
mmengine: 0.10.7
Python env: /home/ruvultra/ml-env
## Context
This document proves that the OccWorld TransVQVAE model builds, loads, and
runs end-to-end on the local RTX 5080 at acceptable latency before any
domain fine-tuning on RuView CSI/occupancy data. All numbers are measured
from a cold Python process; no weights were loaded from a checkpoint (the
config references `out/occworld/epoch_125.pth` which is absent — random
initialisation is used throughout). Prediction quality numbers are therefore
a baseline-without-domain-fine-tuning reading, not a target metric.
---
## 1. Model Metrics
| Metric | Value |
|---|---|
| Architecture | TransVQVAE (VAE-ResNet2D encoder/decoder + autoregressive transformer) |
| Total parameters | 72.39 M |
| Trainable parameters | 72.39 M |
| Weight initialisation | Random (no checkpoint — `epoch_125.pth` absent) |
| Model in-memory size | 276.1 MiB (float32, 4 bytes per parameter) |
| Sub-module — VAE | 14.17 M params |
| Sub-module — Transformer (PlanUAutoRegTransformer) | 58.18 M params |
| Sub-module — PoseEncoder | 0.02 M params |
| Sub-module — PoseDecoder | 0.02 M params |
| Input tensor | `(1, 16, 200, 200, 16)` int64 — batch × frames × X × Y × Z |
| Input semantics | 18-class occupancy labels (nuScenes schema); 17 = empty |
| Output — `sem_pred` | `(1, 15, 200, 200, 16)` int64 — 15 predicted future frames |
| Output — `pose_decoded` | `(1, 3, 1, 2)` float32 — 3-mode ego-motion predictions |
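As a rough cross-check of the size figures in the table, the float32 footprint follows from the parameter count at 4 bytes per parameter. The sketch below uses only numbers copied from the table; the helper name `float32_size_mib` is hypothetical and not part of the benchmark script. It suggests the 276.1 figure is in mebibytes (2^20 bytes).

```python
# Cross-check of the Model Metrics table (values copied from the table).
BYTES_PER_FLOAT32 = 4

# Parameter counts in millions, per sub-module.
SUBMODULE_PARAMS_M = {
    "VAE": 14.17,
    "Transformer": 58.18,
    "PoseEncoder": 0.02,
    "PoseDecoder": 0.02,
}


def float32_size_mib(params_millions: float) -> float:
    """Size in MiB of `params_millions` million float32 parameters."""
    return params_millions * 1e6 * BYTES_PER_FLOAT32 / 2**20


total_params_m = sum(SUBMODULE_PARAMS_M.values())
print(f"total params: {total_params_m:.2f} M")
print(f"float32 size: {float32_size_mib(total_params_m):.1f} MiB")
```

The sub-module counts sum to the reported total, and the weights alone account for the idle footprint reported after model load.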
---
## 2. Inference Latency (batch=1, 10 runs, post-3-run warmup)
| Metric | ms |
|---|---|
| Run 1 (cold JIT) | 231.7 |
| Run 2 | 227.6 |
| Run 3 | 208.9 |
| Run 4 | 208.8 |
| Run 5 | 209.0 |
| Run 6 | 208.7 |
| Run 7 | 208.8 |
| Run 8 | 208.7 |
| Run 9 | 209.0 |
| Run 10 | 208.9 |
| **Mean** | **213.0** |
| P50 | 208.9 |
| P90 | 228.0 |
| P99 | 231.3 |
| Min | 208.7 |
| Max | 231.7 |
| Throughput (15 frames predicted per inference) | 70.4 predicted frames/sec |
| Per-frame latency | 14.2 ms/predicted-frame |
Notes:
- Runs 1–2 are roughly 19–23 ms slower than steady state (CUDA kernel compilation).
- Steady state (runs 3–10) is very stable: 208.7–209.0 ms (0.3 ms min-to-max spread).
- The P99–mean spread of 18 ms comes entirely from the first two JIT runs.
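The summary rows can be recomputed from the ten per-run values. The sketch below is a minimal reconstruction, not the original benchmark script: it assumes percentiles were taken with linear interpolation between closest ranks (the source does not say which method was used, but this one reproduces the table), and the names `RUNS_MS` and `percentile` are hypothetical.

```python
import math

# Per-run latencies in ms, copied from the table.
RUNS_MS = [231.7, 227.6, 208.9, 208.8, 209.0, 208.7, 208.8, 208.7, 209.0, 208.9]
FRAMES_PER_INFERENCE = 15  # predicted future frames per forward pass


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (q / 100) * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


mean_ms = sum(RUNS_MS) / len(RUNS_MS)
summary = {
    "mean_ms": mean_ms,
    "p50_ms": percentile(RUNS_MS, 50),
    "p90_ms": percentile(RUNS_MS, 90),
    "p99_ms": percentile(RUNS_MS, 99),
    "min_ms": min(RUNS_MS),
    "max_ms": max(RUNS_MS),
    # Throughput and per-frame latency are derived from the mean.
    "frames_per_sec": FRAMES_PER_INFERENCE / (mean_ms / 1000),
    "ms_per_frame": mean_ms / FRAMES_PER_INFERENCE,
}

for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

With only ten samples, P90 and P99 are interpolated between the two slowest runs, so they mostly describe the warm-up cost rather than steady-state tail latency.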
---
## 3. VRAM Profile
| Stage | GB (allocated) | Notes |
|---|---|---|
| Baseline (before model load) | 0.000 | Clean process, CUDA context not yet created |
| After model load (idle) | 0.270 | Weights resident, no activations |
| During inference (peak allocated) | 3.368 | Forward pass activations + VAE codebook lookup |
| After inference (retained) | 2.095 | KV-cache / activation buffers not freed |
| Peak reserved (PyTorch allocator) | 6.543 | PyTorch memory pool; returned to OS on `empty_cache()` |
| Total VRAM on device | 15.47 | |
| Headroom at inference peak | 12.10 | Available for larger batches or multi-model co-location |
VRAM budget analysis:
- Idle footprint (0.27 GB) is small enough to co-locate with a RuView CSI
inference pipeline on the same GPU without contention.
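- Peak inference (3.37 GB allocated / 6.54 GB reserved) leaves about 12.1 GB
free by allocated memory, or about 8.9 GB once the allocator's reserved pool
is counted, which should leave room for a batched training run alongside
real-time inference.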
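The headroom figures depend on whether peak allocated or peak reserved memory is subtracted from the device total. The arithmetic can be sketched as follows, using only values from the table; `headroom_gb` is a hypothetical helper. The reserved figure is the more conservative one for co-location, since memory held by PyTorch's caching allocator is unavailable to other processes until it is released.

```python
# VRAM figures in GB, copied from the table.
TOTAL_GB = 15.47
PEAK_ALLOCATED_GB = 3.368
PEAK_RESERVED_GB = 6.543


def headroom_gb(total_gb: float, used_gb: float) -> float:
    """Device memory left over once `used_gb` is taken."""
    return total_gb - used_gb


by_allocated = headroom_gb(TOTAL_GB, PEAK_ALLOCATED_GB)
by_reserved = headroom_gb(TOTAL_GB, PEAK_RESERVED_GB)

print(f"headroom by peak allocated: {by_allocated:.2f} GB")
print(f"headroom by peak reserved:  {by_reserved:.2f} GB")
```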
---
## 4. Prediction Quality (Synthetic Linear Walk)
Setup: synthetic 200×200×16 occupancy grid; a single pedestrian (class 8)
placed at voxel `(100, 100, 8)` and moved +2 voxels/frame eastward (2 m/s
at nuScenes 0.5 m/voxel, 2 Hz). Fifteen past frames fed as context; 15
future frames compared against linear ground truth.
| Metric | Value | Notes |
|---|---|---|
| Voxel resolution | 0.5 m/voxel | nuScenes standard |
| Frame rate | 2 Hz | 0.5 s per frame |
| Person speed (ground truth) | 2.0 m/s east | 2 vox/frame × 0.5 m/vox × 2 Hz |
| MDE — mean displacement error | 18.98 vox / **9.49 m** | averaged over 15 future frames |
| FDE — final displacement error | 32.46 vox / **16.23 m** | at frame 15 (7.5 s horizon) |
| Pedestrian voxels predicted (total, 15 frames) | 1,604,019 | model over-predicts occupancy with random weights |
Frame-by-frame comparison (first 5 of 15):
| Frame | GT centroid (X,Y) | Predicted centroid (X,Y) | Displacement (vox) |
|---|---|---|---|
| 1 | (102, 100) | (97.0, 96.3) | 6.3 |
| 2 | (104, 100) | (97.5, 97.1) | 7.1 |
| 3 | (106, 100) | (97.3, 96.6) | 9.4 |
| 4 | (108, 100) | (97.4, 97.2) | 10.9 |
| 5 | (110, 100) | (97.7, 96.2) | 12.9 |
Interpretation: with random weights the transformer predicts a near-static
pseudo-centroid biased toward grid centre rather than tracking the moving
target. This is the expected behaviour of a randomly initialised, untrained
network and establishes the pre-training MDE baseline. After domain fine-tuning on
annotated CSI-derived occupancy sequences the MDE target is ≤2.0 vox
(≤1.0 m) at 5-frame horizon per ADR-147 §5.
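The displacement column can be approximated from the centroids in the frame-by-frame table. The sketch below is an illustration under stated assumptions, not the benchmark's own code: it takes displacement to be the Euclidean distance in the X-Y plane, and all names are hypothetical. Because the tabulated centroids are rounded to 0.1 vox, recomputed values can differ from the table by up to about 0.1 vox. The mean here covers only the five listed frames, so it is not the 15-frame MDE.

```python
import math

VOXEL_SIZE_M = 0.5        # m per voxel, from the metrics table
FRAME_RATE_HZ = 2.0       # frames per second, from the metrics table
STEP_VOX_PER_FRAME = 2    # eastward motion of the synthetic pedestrian
START_XY = (100, 100)     # initial voxel position (X, Y)

# Predicted centroids for frames 1-5, copied from the table.
PREDICTED_XY = [(97.0, 96.3), (97.5, 97.1), (97.3, 96.6), (97.4, 97.2), (97.7, 96.2)]


def ground_truth_xy(frame: int) -> tuple:
    """Linear-walk ground truth: +2 voxels in X per frame, Y fixed."""
    return (START_XY[0] + STEP_VOX_PER_FRAME * frame, START_XY[1])


def vox_to_m(distance_vox: float) -> float:
    """Convert a distance in voxels to metres."""
    return distance_vox * VOXEL_SIZE_M


displacements = [
    math.dist(ground_truth_xy(frame), predicted)
    for frame, predicted in enumerate(PREDICTED_XY, start=1)
]
mde_5_vox = sum(displacements) / len(displacements)  # mean over 5 frames only
fde_5_vox = displacements[-1]                        # error at frame 5

# Ground-truth speed implied by the setup parameters.
speed_m_per_s = STEP_VOX_PER_FRAME * VOXEL_SIZE_M * FRAME_RATE_HZ

for frame, d in enumerate(displacements, start=1):
    print(f"frame {frame}: {d:.1f} vox ({vox_to_m(d):.2f} m)")
print(f"5-frame mean: {mde_5_vox:.2f} vox, ground-truth speed: {speed_m_per_s} m/s")
```

The error grows by roughly the ground-truth step each frame, which is what a near-static prediction against a moving target would produce.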
---
## 5. IPC Round-trip
The OccWorld server (configured port 25095) was not running during this
benchmark session. IPC round-trip measurement was therefore skipped.
| Port | Status |
|---|---|
| 25095 (OccWorld config) | closed — server not running |
| 8080 (other service) | open (unrelated) |
To measure IPC latency: start the serving process configured in
`config/occworld.py` (`port = 25095`), then re-run the benchmark.
Expected IPC overhead is negligible (<1 ms localhost TCP) compared to
the 213 ms inference latency.
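Before re-running the benchmark, the port status in the table can be checked with a short probe. This is a minimal sketch that assumes the server listens on localhost TCP at the port named in this section; `tcp_connect_ms` is a hypothetical helper. It measures connection setup time only, since the server's request/response format is not described here, so it is a lower bound on IPC overhead rather than a full round-trip measurement.

```python
import socket
import time

OCCWORLD_HOST = "127.0.0.1"  # assumption: server runs on the local machine
OCCWORLD_PORT = 25095        # port named in this section


def tcp_connect_ms(host, port, timeout_s=1.0):
    """Return TCP connect time in ms, or None if the port is closed or unreachable."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return None


if __name__ == "__main__":
    elapsed = tcp_connect_ms(OCCWORLD_HOST, OCCWORLD_PORT)
    if elapsed is None:
        print(f"port {OCCWORLD_PORT}: closed (server not running)")
    else:
        print(f"port {OCCWORLD_PORT}: open, connect took {elapsed:.3f} ms")
```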
---
## 6. Verdict
**PASS** — all structural benchmarks pass.
| Check | Result |
|---|---|
| Model builds from config without error | PASS |
| Model loads to CUDA in <500 ms | PASS — 281 ms |
| Forward pass completes without error | PASS |
| Steady-state latency ≤500 ms at batch=1 | PASS — 208.9 ms (P50) |
| Peak VRAM ≤ 8 GB | PASS — 3.37 GB peak allocated |
| Output shape correct `(1,15,200,200,16)` | PASS |
| Pedestrian voxels present in output | PASS — 1.6 M voxels |
| Pre-training MDE documented | PASS — 18.98 vox baseline recorded |
| IPC test | SKIP — server not running |
Summary: OccWorld TransVQVAE runs end-to-end on the RTX 5080 at 213 ms
mean latency with a 3.37 GB VRAM peak. The model is ready for domain
fine-tuning on RuView CSI-derived occupancy data. Prediction quality
numbers (MDE 9.49 m) confirm that the random-weight baseline is far from
target and that domain fine-tuning is a prerequisite before any deployment
evaluation. The VRAM headroom (12.1 GB free at inference peak, by allocated
memory) should be sufficient to run training and inference concurrently on
the same device, although concurrent operation was not measured here.