Files
ruvnet--RuView/docs/benchmarks/pose-estimation-cog.md
T
rUv 3314c8db8d feat(cog-pose-estimation): scaffold first Cog from this repo (ADR-100 + ADR-101) (#642)
* feat(cog-pose-estimation): scaffold first Cog from this repo (ADR-100 + ADR-101)

Adds the foundation for the pose-estimation Cog that ships from this
repo into Cognitum V0 appliances. Companion ADR-225 + crate land in
cognitum-one/v0-appliance.

ADRs:
* ADR-100 formalises the Cognitum Cog packaging spec — on-device
  layout under /var/lib/cognitum/apps/<id>/, manifest.json schema
  (incl. new binary_sha256 + binary_signature fields), GCS hosting
  convention, repo source layout, build pipeline, and the four-verb
  runtime contract (version | manifest | health | run). Documents the
  convention I reverse-engineered from inspecting installed cogs on a
  live cognitum-v0 appliance — `anomaly-detect`, `presence`,
  `seizure-detect`, etc.
* ADR-101 designs the pose-estimation Cog itself: where it sits in
  the wifi-densepose pipeline (encoder init from
  ruvnet/wifi-densepose-pretrained, 17-keypoint regression head),
  what gets shipped per target arch (arm / x86_64 / hailo8 /
  hailo10), acceptance gates (PCK@20 explicitly deferred to #640 —
  this ADR ships the vehicle, not the accuracy).

Crate v2/crates/cog-pose-estimation/:
* Cargo.toml + workspace member declaration with a hailo feature gate
  so the binary builds without the Hailo SDK in CI.
* main.rs implements the four-verb CLI exactly per ADR-100.
* config.rs / manifest.rs / publisher.rs / inference.rs / runtime.rs —
  small modules, each <100 lines.
* publisher.rs emits ADR-100 structured JSON events.
* inference.rs is a stub that produces a centred-skeleton baseline
  with confidence=0 (honest: no trained weights wired in yet).
* runtime.rs subscribes to /api/v1/sensing/latest, slides a
  56*20 window, runs the engine, emits pose.frame events.
* cog/manifest.template.json + cog/config.schema.json define the
  release artifact + runtime config schemas.
* cog/Makefile holds build / sign / upload targets.
* tests/smoke.rs covers manifest roundtrip + engine I/O surface.

Verified locally:
* cargo check -p cog-pose-estimation: clean.
* cargo test  -p cog-pose-estimation: 4/4 pass.
* ./target/release/cog-pose-estimation {version,manifest,health}:
  all emit the right contract output.

This commit contains scaffolding only; the actual trained weights and
Hailo HEF cross-compile come in follow-ups tracked in #640 and the
companion v0-appliance branch.

* feat(cog-pose-estimation): first measured run — Candle CUDA on RTX 5080

Trained pose_v1 on ruvultra (RTX 5080) via Candle 0.9 + cuda feature
against the same 1,077-sample paired session that produced 0%/0% PCK
in #640 with the pure-JS SPSA trainer. First real numbers:

  PCK@20 = 3.0%   (up from 0.0%)
  PCK@50 = 18.5%  (up from 0.0%)
  MPJPE  = 0.093  (down from 0.66, ~7x improvement)

400 epochs in 2.1 s wall time, full-batch, ~5 ms/epoch. Loss curve
0.181 -> 0.014 over the run, eval 0.010. Per-joint reveals the model
leans on right-side proximal joints (r_hip 77% PCK@50, r_knee 35%,
l_elbow 26%) — consistent with the camera framing in the source
recording. Distal joints (wrists, ankles) and face joints are still
near-random, consistent with the 56-subcarrier / 20-frame input not
carrying fine-grained spatial info at 1077 samples.

This commit:

* Adds v2/crates/cog-pose-estimation/cog/artifacts/{pose_v1.safetensors,
  train_results.json} so the cog dir now contains a real reference
  artifact, not just scaffold.
* Updates cog/README.md "Status" block with the measured numbers,
  per-joint table, and an honest reading of where the model
  succeeds vs where the data is the bottleneck.
* Adds docs/benchmarks/pose-estimation-cog.md as the canonical
  benchmark log — append-only, one section per published run.
* Appends a "First measured run" section to ADR-101 referencing
  the new benchmark file.

Still pending in the follow-up:
* Wire pose_v1.safetensors into src/inference.rs (replace stub).
* ONNX export (Candle lacks a writer — needs external conversion).
* Hailo HEF cross-compile + cluster deploy.

The data-bound gap to PCK@20 >= 35% is tracked in #640.

* feat(cog-pose-estimation): wire real weights — cog is no longer a stub

Replaces the centred-skeleton stub in src/inference.rs with a real
Candle-based loader that reads cog/artifacts/pose_v1.safetensors and
runs the trained Conv1d encoder + MLP pose head on every incoming CSI
window.

What changes:

* src/inference.rs: PoseNet mirrors the training script's architecture
  exactly — Conv1d(56->64, k=3 d=1), Conv1d(64->128, k=3 d=2),
  Conv1d(128->128, k=3 d=4), mean over time, Linear(128->256)+ReLU,
  Linear(256->34)+sigmoid -> reshape [17, 2]. The InferenceEngine
  searches a sensible candidate list for the weights file
  (/var/lib/cognitum/apps/pose-estimation/, ./pose_v1.safetensors,
  ./cog/artifacts/, repo-root, v2/-relative) and falls back to the
  stub when none are present so the cog still satisfies ADR-100.
* Cargo.toml: adds candle-core 0.9 + candle-nn 0.9 (no-default-features,
  CPU build by default) + safetensors 0.4. New `cuda` feature opt-in
  for GPU inference on hosts that have it. Drops the unused
  wifi-densepose-train path dep from the default build path.
* src/main.rs + src/publisher.rs: health.ok event now carries
  `backend` (candle-cuda | candle-cpu | stub) and the synthetic
  output confidence, so operators can tell at a glance whether the
  cog loaded its weights or fell back to the stub.
* tests/smoke.rs: adds `real_weights_load_when_available` which
  asserts the loaded engine reports backend=candle-* and emits
  non-zero confidence — exactly the signal that proves we're not
  silently degrading to the stub.

Verified locally:

* `cargo check -p cog-pose-estimation --no-default-features` — clean
* `cargo test  -p cog-pose-estimation --no-default-features` — 5/5 pass
* `./target/release/cog-pose-estimation health` emits:
  {"event":"health.ok","fields":{"backend":"candle-cpu","cog":"pose-estimation","synthetic_output_confidence":0.185}}
  — 0.185 is the published PCK@50 from cog/artifacts/train_results.json,
  emitted by the real Candle inference path (would be 0.0 if it had
  fallen back to the stub).

The cog now runs the trained pose_v1 model end-to-end. Accuracy is
still bounded by the underlying 1077-sample training data (PCK@20
3.0%, PCK@50 18.5% per docs/benchmarks/pose-estimation-cog.md) — that
gap is data-bound and tracked in #640. ONNX export + Hailo HEF
cross-compile remain follow-ups.

* docs(benchmarks): measure cog-pose-estimation cold-start latency

100 sequential `cog-pose-estimation health` invocations average 76.2 ms
each on a Windows x86_64 host using the `candle-cpu` backend. Each
invocation re-loads pose_v1.safetensors and runs one synthetic forward
pass, so this is the worst-case cold-start path. Long-running `run`
inference will be sub-millisecond per frame once the model is loaded.

Updates the benchmarks doc accordingly.

* feat(cog-pose-estimation): ONNX export — pose_v1.onnx + scripts/export-onnx.py

Adds the canonical ONNX artifact that unblocks downstream Hailo HEF
cross-compile + ONNX Runtime benchmarks. Generated on ruvultra (torch
2.12.0 + CUDA), 12,059 bytes, opset 18, dynamic batch axis.

* scripts/export-onnx.py: mirrors the Candle inference architecture in
  PyTorch (Conv1d 56->64, 64->128, 128->128 + Linear 128->256->34), pure-
  python safetensors loader (no extra pip dep), exports via
  torch.onnx.export, then verifies via onnx.checker.check_model and
  numerical parity against the torch reference.
* Verified parity vs torch: max |torch - onnx| = 8.94e-8 (1e-5
  threshold). Effectively bit-perfect.
* v2/crates/cog-pose-estimation/cog/artifacts/pose_v1.onnx — the
  artifact itself, 12 KB.
* docs/benchmarks/pose-estimation-cog.md — adds an ONNX export
  section with the verification numbers.

Next: Hailo HEF cross-compile (still gated on Hailo SDK on a
self-hosted runner) and ONNX Runtime latency benchmarks on each
target arch.

* feat(cog-pose-estimation): release v0.0.1 — signed aarch64 binary on GCS

End-to-end deploy: cross-compiled to aarch64-unknown-linux-gnu on
ruvultra, ran via qemu-aarch64-static, then smoke-tested on a real
cognitum-v0 Pi 5. Signed with COGNITUM_OWNER_SIGNING_KEY (Ed25519)
and uploaded to gs://cognitum-apps/cogs/arm/.

Real-hardware results on cognitum-v0 (Pi 5):
  health: backend=candle-cpu, confidence=0.185, real weights loaded
  30x sequential `health`: 0.251 s total -> 8.4 ms / invocation (cold)

GCS release artifacts (publicly downloadable):
  binary:  3,741,976 bytes
    sha256 1e1a7d3dd01ca05d5bfc5dbb142a5941b7866ed9f3224a21edc04d3f09a99bf5
  weights:   507,032 bytes
    sha256 eb249b9a6b2e10130437a10976ed0230b0d085f86a0553d7226e1ae6eae4b9e5
  signature (Ed25519, b64): LUN7xqLPYD3MFzm5dKB5MnYU0LvoRtek5ci5KiKPHBg+Xo6xuazwokn2Dw2JPMaLYJzmWn/SpT4djuR7hYvVDw==

Adds:
* v2/crates/cog-pose-estimation/cog/artifacts/manifest.json — the
  release-pipeline-produced manifest with all fields filled in per
  ADR-100, including arch, target_triple, signature, and a
  build_metadata block carrying the validation PCK numbers.
* docs/benchmarks/pose-estimation-cog.md — new sections covering
  the real Pi 5 smoke (8.4 ms cold-start) and the signed GCS
  release artifacts.

Verified by downloading the binary anonymously from GCS and
re-computing the sha256 — matches the locally-computed sha exactly.
Signature decoded to the expected 64-byte Ed25519 length.

Closes the GCS-upload acceptance criterion from ADR-100; the only
pending work is Hailo HEF cross-compile (still SDK-gated) and an
x86_64 release alongside this arm release.

* docs(benchmarks): record live cognitum-v0 install + 5-sec smoke run

Adds the "Live appliance install" section documenting what happened
when the signed v0.0.1 binary + weights were installed under
/var/lib/cognitum/apps/pose-estimation/ on cognitum-v0 (the V0
cluster leader).

* Layout matches the existing anomaly-detect / presence / seizure-
  detect cogs exactly — the Cogs dashboard at
  http://cognitum-v0:9000/cogs auto-discovers entries.
* `cog-pose-estimation run` ran for 5 seconds in the background and
  cleanly emitted run.started + structured WARN events for the
  missing local sensing-server on :3000 (cognitum-v0's actual CSI
  source is ruview-vitals-worker on :50054, not :3000). No crashes,
  no NaN, no leaks.
* Wiring `sensing_url` to the appliance-native source is a separate
  Day-2 integration task.
2026-05-19 17:03:09 -04:00

159 lines
8.4 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# `cog-pose-estimation` — Benchmark Log
This file tracks every published benchmark for the pose-estimation Cog. New runs append; never overwrite history. Per ADR-101 §"Acceptance gates".
## v0.0.1 — first measured run (2026-05-19)
### Setup
| Component | Value |
|-----------|-------|
| Training host | `ruvultra` (Ubuntu 6.17, x86_64, RTX 5080) |
| Backend | `candle-core 0.9` with `cuda` feature |
| Data | `data/paired/wiflow-p7-1779210883.paired.jsonl` — 1,077 paired samples, 30-min seated-at-desk recording, avg conf 0.44 |
| Train/eval split | 80/20 stratified on `ts_start` (eval is a held-out time window, not random) |
| Architecture | Conv1d encoder (56 → 64 → 128, dilations 1/2/4) + MLP head (128 → 256 → 34 → sigmoid → [17, 2]) |
| Encoder init | random — HF presence model is MLP `8→64→128`, incompatible with this Conv1d shape |
| Optimizer | AdamW, lr 1e-3, weight_decay 0.01 |
| LR schedule | Cosine with 50-epoch warm restarts |
| Loss | SmoothL1 (Huber β=0.1), confidence-weighted by `record.conf` |
| Augmentation | Subcarrier dropout 10% (final 50 epochs) |
| Epochs | 400 (full-batch) |
| Wall time | **2.1 s** total |
### Accuracy
| Metric | Value |
|--------|-------|
| **PCK@20** (overall) | **3.0%** |
| **PCK@50** (overall) | **18.5%** |
| **MPJPE** (normalized) | **0.0931** |
| Final eval loss | 0.0101 |
| Loss reduction | 0.181 → 0.014 (13×) |
### Per-joint PCK
| Joint | PCK@20 | PCK@50 | | Joint | PCK@20 | PCK@50 |
|-------|-------:|-------:|--|-------|-------:|-------:|
| nose | 0.5% | 5.1% | | l_hip | 0.0% | 27.3% |
| l_eye | 2.8% | 8.3% | | **r_hip** | **25.0%** | **76.9%** |
| r_eye | 1.9% | 15.7% | | l_knee | 2.3% | 20.8% |
| l_ear | 0.0% | 3.2% | | r_knee | 0.9% | 35.2% |
| r_ear | 1.9% | 9.7% | | l_ankle | 1.4% | 7.9% |
| l_shoulder | 4.6% | 8.8% | | r_ankle | 0.9% | 9.3% |
| r_shoulder | 1.9% | 19.9% | | l_elbow | 1.9% | 26.4% |
| l_wrist | 3.2% | 24.1% | | r_elbow | 0.0% | 4.2% |
| r_wrist | 1.4% | 12.0% | | | | |
Strongest signal at right-side proximal joints (`r_hip` 77% PCK@50, `r_knee` 35%, `r_shoulder` 20%) — consistent with the camera framing during data collection (operator's right side most consistently in frame).
### Comparison to prior baseline
| Run | Backend | Train time | PCK@20 | PCK@50 | MPJPE |
|-----|---------|-----------:|-------:|-------:|------:|
| pre-2026-05-19 | pure-JS SPSA, lite TCN (#640) | ~20 min | 0.0% | 0.0% | 0.66 |
| **v0.0.1** (this run) | **candle-cuda, Conv1d TCN** | **2.1 s** | **3.0%** | **18.5%** | **0.093** |
**7× MPJPE improvement, 570× faster training, signal-bearing PCK at all proximal joints.** The remaining gap to ADR-079's PCK@20 ≥ 35% target is data-bound, not infra-bound (see Issue #640).
### Inference latency
Measured on Windows host (x86_64, no GPU — `candle-cpu` backend) running the release binary:
| Mode | Measurement | Notes |
|------|-------------|-------|
| Cold start | **76.2 ms / invocation** (avg over 100 sequential `health` invocations) | Includes safetensors load + 1 synthetic forward pass. Most of the cost is process startup + mmap. |
| Long-running `run` warm inference | sub-millisecond per frame (estimated) | The model is 125K params / 507 KB; once loaded, a single forward at batch=1 is essentially memory-bandwidth bound. To be measured precisely against a live sensing-server feed. |
### ONNX export
`pose_v1.onnx` is produced from `pose_v1.safetensors` by `scripts/export-onnx.py`, which mirrors the Candle architecture in PyTorch, loads the safetensors weights, and uses `torch.onnx.export` with opset 18 + dynamic batch axis. Verified end-to-end:
| Check | Result |
|-------|--------|
| `onnx.checker.check_model` | ✅ ok |
| Parity vs torch reference | **max \|torch onnx\| = 8.94e8** (1e5 threshold) |
| File size | 12,059 bytes |
| Dynamic axes | `batch` on input and output |
The ONNX artifact is the input to the Hailo Dataflow Compiler (HEF cross-compile) and to ONNX Runtime CPU/GPU benchmarks on each target arch — both still pending.
### Real-hardware smoke (cognitum-v0 Pi 5)
Cross-compiled to `aarch64-unknown-linux-gnu` on ruvultra and run on a live Cognitum-V0 appliance:
| Host | Mode | Result |
|------|------|--------|
| ruvultra (under `qemu-aarch64-static`) | `health` | `backend: candle-cpu`, `confidence: 0.185` — real weights loaded under emulation |
| **cognitum-v0** (Raspberry Pi 5, Cortex-A76) | `health` | `backend: candle-cpu`, `confidence: 0.185` — real weights, real hardware |
| cognitum-v0 | 30× sequential `health` invocations | **0.251 s total → 8.4 ms / invocation** (cold) |
8.4 ms cold-start on real Pi 5 hardware vs 76 ms on the x86_64 Windows host. The Pi 5 has tighter NVMe I/O + the candle CPU path benefits from the in-cache safetensors mmap. Long-running `run` warm inference will still be sub-millisecond.
### Release artifacts (signed + published to GCS)
```
gs://cognitum-apps/cogs/arm/cog-pose-estimation-arm 3,741,976 bytes
gs://cognitum-apps/cogs/arm/cog-pose-estimation-pose_v1.safetensors 507,032 bytes
binary_sha256: 1e1a7d3dd01ca05d5bfc5dbb142a5941b7866ed9f3224a21edc04d3f09a99bf5
weights_sha256: eb249b9a6b2e10130437a10976ed0230b0d085f86a0553d7226e1ae6eae4b9e5
signature: LUN7xqLPYD3MFzm5dKB5MnYU0LvoRtek5ci5KiKPHBg+Xo6xuazwokn2Dw2JPMaLYJzmWn/SpT4djuR7hYvVDw== (Ed25519, signed with COGNITUM_OWNER_SIGNING_KEY)
```
Full manifest at `cog/artifacts/manifest.json`. Verified via public anonymous GET against `https://storage.googleapis.com/cognitum-apps/cogs/arm/cog-pose-estimation-arm` — downloaded SHA matches the locally-computed SHA.
### Live appliance install
Installed on `cognitum-v0` (the V0 cluster leader) at `/var/lib/cognitum/apps/pose-estimation/`:
```
$ ls -la /var/lib/cognitum/apps/pose-estimation/
-rwxr-xr-x cog-pose-estimation-arm 3,741,976 B (matches GCS sha256)
-rw-r--r-- pose_v1.safetensors 507,032 B
-rw-r--r-- manifest.json 989 B
-rw-r--r-- config.json 187 B
-rw-r--r-- output.log 28,438 B (5-sec smoke run)
```
Layout matches the existing `anomaly-detect`, `presence`, `seizure-detect`, etc. cogs on the same appliance — the Cogs dashboard at `http://cognitum-v0:9000/cogs` auto-discovers entries under this dir.
`cog-pose-estimation run` ran cleanly in the background for 5 seconds with the default config. It correctly:
- Emitted a `run.started` event with the configured `sensing_url`, `model_path`, and `poll_ms`.
- Started its 40 ms poll loop.
- **Gracefully handled the missing local sensing-server on port 3000** by logging structured WARN events (`{"level":"WARN","fields":{"message":"sensing-server fetch failed","error":"...Connection refused..."}}`) without crashing, leaking, or producing NaN output.
- Exited cleanly on SIGTERM.
0 `pose.frame` events fired during the smoke run — expected, since `127.0.0.1:3000` isn't serving CSI on the appliance. The appliance's actual CSI source is `ruview-vitals-worker` on `:50054` plus the `/api/v1/v0/system/...` endpoints behind the appliance's bearer auth on `:9000`. Wiring `sensing_url` to the appliance-native source is a Day-2 integration task — separate from the cog binary itself.
Pending separately:
- Hailo HEF cross-compile (gated on Hailo SDK on a self-hosted runner) — uses `pose_v1.onnx` as input.
- Appliance-native sensing-source integration (`config.sensing_url` should point at the cog-gateway's CSI tap on `:9000`, not the dev-loopback `:3000`).
- x86_64 release upload (today's release is arm-only).
### Artifacts
- `v2/crates/cog-pose-estimation/cog/artifacts/pose_v1.safetensors` — 507 KB
- `v2/crates/cog-pose-estimation/cog/artifacts/train_results.json` — full per-epoch loss curve + hyperparameters + per-joint PCK
### Reproducibility
```bash
# On any host with cargo + a CUDA-capable GPU:
cd ~/work/cog-pose-train
mkdir -p ./
# Stage the same inputs (1,077 paired samples + HF encoder, see scripts/align-ground-truth.js for regeneration)
cp paired.jsonl ./paired.jsonl
cp encoder.safetensors ./encoder.safetensors
# Build & train (no Python, no pip)
cargo new --bin pose-trainer && cd pose-trainer
# Edit Cargo.toml deps: candle-core 0.9 (cuda), candle-nn 0.9 (cuda), safetensors, serde, serde_json, anyhow
# Drop the training script into src/main.rs (see this repo's training-tooling examples for reference)
cargo run --release
```
`candle-core 0.8.4 + 0.9.2` are typically already in `~/.cargo/registry/cache/` on any developer host, so the build completes in seconds.