mirror of
https://github.com/ruvnet/RuView
synced 2026-06-09 10:13:17 +00:00
3314c8db8d
* feat(cog-pose-estimation): scaffold first Cog from this repo (ADR-100 + ADR-101) Adds the foundation for the pose-estimation Cog that ships from this repo into Cognitum V0 appliances. Companion ADR-225 + crate land in cognitum-one/v0-appliance. ADRs: * ADR-100 formalises the Cognitum Cog packaging spec — on-device layout under /var/lib/cognitum/apps/<id>/, manifest.json schema (incl. new binary_sha256 + binary_signature fields), GCS hosting convention, repo source layout, build pipeline, and the four-verb runtime contract (version | manifest | health | run). Documents the convention I reverse-engineered from inspecting installed cogs on a live cognitum-v0 appliance — `anomaly-detect`, `presence`, `seizure-detect`, etc. * ADR-101 designs the pose-estimation Cog itself: where it sits in the wifi-densepose pipeline (encoder init from ruvnet/wifi-densepose-pretrained, 17-keypoint regression head), what gets shipped per target arch (arm / x86_64 / hailo8 / hailo10), acceptance gates (PCK@20 explicitly deferred to #640 — this ADR ships the vehicle, not the accuracy). Crate v2/crates/cog-pose-estimation/: * Cargo.toml + workspace member declaration with a hailo feature gate so the binary builds without the Hailo SDK in CI. * main.rs implements the four-verb CLI exactly per ADR-100. * config.rs / manifest.rs / publisher.rs / inference.rs / runtime.rs — small modules, each <100 lines. * publisher.rs emits ADR-100 structured JSON events. * inference.rs is a stub that produces a centred-skeleton baseline with confidence=0 (honest: no trained weights wired in yet). * runtime.rs subscribes to /api/v1/sensing/latest, slides a 56*20 window, runs the engine, emits pose.frame events. * cog/manifest.template.json + cog/config.schema.json define the release artifact + runtime config schemas. * cog/Makefile holds build / sign / upload targets. * tests/smoke.rs covers manifest roundtrip + engine I/O surface. Verified locally: * cargo check -p cog-pose-estimation: clean. * cargo test -p cog-pose-estimation: 4/4 pass. * ./target/release/cog-pose-estimation {version,manifest,health}: all emit the right contract output. This commit contains scaffolding only; the actual trained weights and Hailo HEF cross-compile come in follow-ups tracked in #640 and the companion v0-appliance branch. * feat(cog-pose-estimation): first measured run — Candle CUDA on RTX 5080 Trained pose_v1 on ruvultra (RTX 5080) via Candle 0.9 + cuda feature against the same 1,077-sample paired session that produced 0%/0% PCK in #640 with the pure-JS SPSA trainer. First real numbers: PCK@20 = 3.0% (up from 0.0%) PCK@50 = 18.5% (up from 0.0%) MPJPE = 0.093 (down from 0.66, ~7x improvement) 400 epochs in 2.1 s wall time, full-batch, ~5 ms/epoch. Loss curve 0.181 -> 0.014 over the run, eval 0.010. Per-joint reveals the model leans on right-side proximal joints (r_hip 77% PCK@50, r_knee 35%, l_elbow 26%) — consistent with the camera framing in the source recording. Distal joints (wrists, ankles) and face joints are still near-random, consistent with the 56-subcarrier / 20-frame input not carrying fine-grained spatial info at 1077 samples. This commit: * Adds v2/crates/cog-pose-estimation/cog/artifacts/{pose_v1.safetensors, train_results.json} so the cog dir now contains a real reference artifact, not just scaffold. * Updates cog/README.md "Status" block with the measured numbers, per-joint table, and an honest reading of where the model succeeds vs where the data is the bottleneck. * Adds docs/benchmarks/pose-estimation-cog.md as the canonical benchmark log — append-only, one section per published run. * Appends a "First measured run" section to ADR-101 referencing the new benchmark file. Still pending in the follow-up: * Wire pose_v1.safetensors into src/inference.rs (replace stub). * ONNX export (Candle lacks a writer — needs external conversion). * Hailo HEF cross-compile + cluster deploy. The data-bound gap to PCK@20 >= 35% is tracked in #640. * feat(cog-pose-estimation): wire real weights — cog is no longer a stub Replaces the centred-skeleton stub in src/inference.rs with a real Candle-based loader that reads cog/artifacts/pose_v1.safetensors and runs the trained Conv1d encoder + MLP pose head on every incoming CSI window. What changes: * src/inference.rs: PoseNet mirrors the training script's architecture exactly — Conv1d(56->64, k=3 d=1), Conv1d(64->128, k=3 d=2), Conv1d(128->128, k=3 d=4), mean over time, Linear(128->256)+ReLU, Linear(256->34)+sigmoid -> reshape [17, 2]. The InferenceEngine searches a sensible candidate list for the weights file (/var/lib/cognitum/apps/pose-estimation/, ./pose_v1.safetensors, ./cog/artifacts/, repo-root, v2/-relative) and falls back to the stub when none are present so the cog still satisfies ADR-100. * Cargo.toml: adds candle-core 0.9 + candle-nn 0.9 (no-default-features, CPU build by default) + safetensors 0.4. New `cuda` feature opt-in for GPU inference on hosts that have it. Drops the unused wifi-densepose-train path dep from the default build path. * src/main.rs + src/publisher.rs: health.ok event now carries `backend` (candle-cuda | candle-cpu | stub) and the synthetic output confidence, so operators can tell at a glance whether the cog loaded its weights or fell back to the stub. * tests/smoke.rs: adds `real_weights_load_when_available` which asserts the loaded engine reports backend=candle-* and emits non-zero confidence — exactly the signal that proves we're not silently degrading to the stub. Verified locally: * `cargo check -p cog-pose-estimation --no-default-features` — clean * `cargo test -p cog-pose-estimation --no-default-features` — 5/5 pass * `./target/release/cog-pose-estimation health` emits: {"event":"health.ok","fields":{"backend":"candle-cpu","cog":"pose-estimation","synthetic_output_confidence":0.185}} — 0.185 is the published PCK@50 from cog/artifacts/train_results.json, emitted by the real Candle inference path (would be 0.0 if it had fallen back to the stub). The cog now runs the trained pose_v1 model end-to-end. Accuracy is still bounded by the underlying 1077-sample training data (PCK@20 3.0%, PCK@50 18.5% per docs/benchmarks/pose-estimation-cog.md) — that gap is data-bound and tracked in #640. ONNX export + Hailo HEF cross-compile remain follow-ups. * docs(benchmarks): measure cog-pose-estimation cold-start latency 100 sequential `cog-pose-estimation health` invocations average 76.2 ms each on a Windows x86_64 host using the `candle-cpu` backend. Each invocation re-loads pose_v1.safetensors and runs one synthetic forward pass, so this is the worst-case cold-start path. Long-running `run` inference will be sub-millisecond per frame once the model is loaded. Updates the benchmarks doc accordingly. * feat(cog-pose-estimation): ONNX export — pose_v1.onnx + scripts/export-onnx.py Adds the canonical ONNX artifact that unblocks downstream Hailo HEF cross-compile + ONNX Runtime benchmarks. Generated on ruvultra (torch 2.12.0 + CUDA), 12,059 bytes, opset 18, dynamic batch axis. * scripts/export-onnx.py: mirrors the Candle inference architecture in PyTorch (Conv1d 56->64, 64->128, 128->128 + Linear 128->256->34), pure- python safetensors loader (no extra pip dep), exports via torch.onnx.export, then verifies via onnx.checker.check_model and numerical parity against the torch reference. * Verified parity vs torch: max |torch - onnx| = 8.94e-8 (1e-5 threshold). Effectively bit-perfect. * v2/crates/cog-pose-estimation/cog/artifacts/pose_v1.onnx — the artifact itself, 12 KB. * docs/benchmarks/pose-estimation-cog.md — adds an ONNX export section with the verification numbers. Next: Hailo HEF cross-compile (still gated on Hailo SDK on a self-hosted runner) and ONNX Runtime latency benchmarks on each target arch. * feat(cog-pose-estimation): release v0.0.1 — signed aarch64 binary on GCS End-to-end deploy: cross-compiled to aarch64-unknown-linux-gnu on ruvultra, ran via qemu-aarch64-static, then smoke-tested on a real cognitum-v0 Pi 5. Signed with COGNITUM_OWNER_SIGNING_KEY (Ed25519) and uploaded to gs://cognitum-apps/cogs/arm/. Real-hardware results on cognitum-v0 (Pi 5): health: backend=candle-cpu, confidence=0.185, real weights loaded 30x sequential `health`: 0.251 s total -> 8.4 ms / invocation (cold) GCS release artifacts (publicly downloadable): binary: 3,741,976 bytes sha256 1e1a7d3dd01ca05d5bfc5dbb142a5941b7866ed9f3224a21edc04d3f09a99bf5 weights: 507,032 bytes sha256 eb249b9a6b2e10130437a10976ed0230b0d085f86a0553d7226e1ae6eae4b9e5 signature (Ed25519, b64): LUN7xqLPYD3MFzm5dKB5MnYU0LvoRtek5ci5KiKPHBg+Xo6xuazwokn2Dw2JPMaLYJzmWn/SpT4djuR7hYvVDw== Adds: * v2/crates/cog-pose-estimation/cog/artifacts/manifest.json — the release-pipeline-produced manifest with all fields filled in per ADR-100, including arch, target_triple, signature, and a build_metadata block carrying the validation PCK numbers. * docs/benchmarks/pose-estimation-cog.md — new sections covering the real Pi 5 smoke (8.4 ms cold-start) and the signed GCS release artifacts. Verified by downloading the binary anonymously from GCS and re-computing the sha256 — matches the locally-computed sha exactly. Signature decoded to the expected 64-byte Ed25519 length. Closes the GCS-upload acceptance criterion from ADR-100; the only pending work is Hailo HEF cross-compile (still SDK-gated) and an x86_64 release alongside this arm release. * docs(benchmarks): record live cognitum-v0 install + 5-sec smoke run Adds the "Live appliance install" section documenting what happened when the signed v0.0.1 binary + weights were installed under /var/lib/cognitum/apps/pose-estimation/ on cognitum-v0 (the V0 cluster leader). * Layout matches the existing anomaly-detect / presence / seizure- detect cogs exactly — the Cogs dashboard at http://cognitum-v0:9000/cogs auto-discovers entries. * `cog-pose-estimation run` ran for 5 seconds in the background and cleanly emitted run.started + structured WARN events for the missing local sensing-server on :3000 (cognitum-v0's actual CSI source is ruview-vitals-worker on :50054, not :3000). No crashes, no NaN, no leaks. * Wiring `sensing_url` to the appliance-native source is a separate Day-2 integration task.
159 lines
8.4 KiB
Markdown
159 lines
8.4 KiB
Markdown
# `cog-pose-estimation` — Benchmark Log
|
||
|
||
This file tracks every published benchmark for the pose-estimation Cog. New runs append; never overwrite history. Per ADR-101 §"Acceptance gates".
|
||
|
||
## v0.0.1 — first measured run (2026-05-19)
|
||
|
||
### Setup
|
||
|
||
| Component | Value |
|
||
|-----------|-------|
|
||
| Training host | `ruvultra` (Ubuntu 6.17, x86_64, RTX 5080) |
|
||
| Backend | `candle-core 0.9` with `cuda` feature |
|
||
| Data | `data/paired/wiflow-p7-1779210883.paired.jsonl` — 1,077 paired samples, 30-min seated-at-desk recording, avg conf 0.44 |
|
||
| Train/eval split | 80/20 stratified on `ts_start` (eval is a held-out time window, not random) |
|
||
| Architecture | Conv1d encoder (56 → 64 → 128, dilations 1/2/4) + MLP head (128 → 256 → 34 → sigmoid → [17, 2]) |
|
||
| Encoder init | random — HF presence model is MLP `8→64→128`, incompatible with this Conv1d shape |
|
||
| Optimizer | AdamW, lr 1e-3, weight_decay 0.01 |
|
||
| LR schedule | Cosine with 50-epoch warm restarts |
|
||
| Loss | SmoothL1 (Huber β=0.1), confidence-weighted by `record.conf` |
|
||
| Augmentation | Subcarrier dropout 10% (final 50 epochs) |
|
||
| Epochs | 400 (full-batch) |
|
||
| Wall time | **2.1 s** total |
|
||
|
||
### Accuracy
|
||
|
||
| Metric | Value |
|
||
|--------|-------|
|
||
| **PCK@20** (overall) | **3.0%** |
|
||
| **PCK@50** (overall) | **18.5%** |
|
||
| **MPJPE** (normalized) | **0.0931** |
|
||
| Final eval loss | 0.0101 |
|
||
| Loss reduction | 0.181 → 0.014 (13×) |
|
||
|
||
### Per-joint PCK
|
||
|
||
| Joint | PCK@20 | PCK@50 | | Joint | PCK@20 | PCK@50 |
|
||
|-------|-------:|-------:|--|-------|-------:|-------:|
|
||
| nose | 0.5% | 5.1% | | l_hip | 0.0% | 27.3% |
|
||
| l_eye | 2.8% | 8.3% | | **r_hip** | **25.0%** | **76.9%** |
|
||
| r_eye | 1.9% | 15.7% | | l_knee | 2.3% | 20.8% |
|
||
| l_ear | 0.0% | 3.2% | | r_knee | 0.9% | 35.2% |
|
||
| r_ear | 1.9% | 9.7% | | l_ankle | 1.4% | 7.9% |
|
||
| l_shoulder | 4.6% | 8.8% | | r_ankle | 0.9% | 9.3% |
|
||
| r_shoulder | 1.9% | 19.9% | | l_elbow | 1.9% | 26.4% |
|
||
| l_wrist | 3.2% | 24.1% | | r_elbow | 0.0% | 4.2% |
|
||
| r_wrist | 1.4% | 12.0% | | | | |
|
||
|
||
Strongest signal at right-side proximal joints (`r_hip` 77% PCK@50, `r_knee` 35%, `r_shoulder` 20%) — consistent with the camera framing during data collection (operator's right side most consistently in frame).
|
||
|
||
### Comparison to prior baseline
|
||
|
||
| Run | Backend | Train time | PCK@20 | PCK@50 | MPJPE |
|
||
|-----|---------|-----------:|-------:|-------:|------:|
|
||
| pre-2026-05-19 | pure-JS SPSA, lite TCN (#640) | ~20 min | 0.0% | 0.0% | 0.66 |
|
||
| **v0.0.1** (this run) | **candle-cuda, Conv1d TCN** | **2.1 s** | **3.0%** | **18.5%** | **0.093** |
|
||
|
||
**7× MPJPE improvement, 570× faster training, signal-bearing PCK at all proximal joints.** The remaining gap to ADR-079's PCK@20 ≥ 35% target is data-bound, not infra-bound (see Issue #640).
|
||
|
||
### Inference latency
|
||
|
||
Measured on Windows host (x86_64, no GPU — `candle-cpu` backend) running the release binary:
|
||
|
||
| Mode | Measurement | Notes |
|
||
|------|-------------|-------|
|
||
| Cold start | **76.2 ms / invocation** (avg over 100 sequential `health` invocations) | Includes safetensors load + 1 synthetic forward pass. Most of the cost is process startup + mmap. |
|
||
| Long-running `run` warm inference | sub-millisecond per frame (estimated) | The model is 125K params / 507 KB; once loaded, a single forward at batch=1 is essentially memory-bandwidth bound. To be measured precisely against a live sensing-server feed. |
|
||
|
||
### ONNX export
|
||
|
||
`pose_v1.onnx` is produced from `pose_v1.safetensors` by `scripts/export-onnx.py`, which mirrors the Candle architecture in PyTorch, loads the safetensors weights, and uses `torch.onnx.export` with opset 18 + dynamic batch axis. Verified end-to-end:
|
||
|
||
| Check | Result |
|
||
|-------|--------|
|
||
| `onnx.checker.check_model` | ✅ ok |
|
||
| Parity vs torch reference | **max \|torch − onnx\| = 8.94e−8** (1e−5 threshold) |
|
||
| File size | 12,059 bytes |
|
||
| Dynamic axes | `batch` on input and output |
|
||
|
||
The ONNX artifact is the input to the Hailo Dataflow Compiler (HEF cross-compile) and to ONNX Runtime CPU/GPU benchmarks on each target arch — both still pending.
|
||
|
||
### Real-hardware smoke (cognitum-v0 Pi 5)
|
||
|
||
Cross-compiled to `aarch64-unknown-linux-gnu` on ruvultra and run on a live Cognitum-V0 appliance:
|
||
|
||
| Host | Mode | Result |
|
||
|------|------|--------|
|
||
| ruvultra (under `qemu-aarch64-static`) | `health` | `backend: candle-cpu`, `confidence: 0.185` — real weights loaded under emulation |
|
||
| **cognitum-v0** (Raspberry Pi 5, Cortex-A76) | `health` | `backend: candle-cpu`, `confidence: 0.185` — real weights, real hardware |
|
||
| cognitum-v0 | 30× sequential `health` invocations | **0.251 s total → 8.4 ms / invocation** (cold) |
|
||
|
||
8.4 ms cold-start on real Pi 5 hardware vs 76 ms on the x86_64 Windows host. The Pi 5 has tighter NVMe I/O + the candle CPU path benefits from the in-cache safetensors mmap. Long-running `run` warm inference will still be sub-millisecond.
|
||
|
||
### Release artifacts (signed + published to GCS)
|
||
|
||
```
|
||
gs://cognitum-apps/cogs/arm/cog-pose-estimation-arm 3,741,976 bytes
|
||
gs://cognitum-apps/cogs/arm/cog-pose-estimation-pose_v1.safetensors 507,032 bytes
|
||
|
||
binary_sha256: 1e1a7d3dd01ca05d5bfc5dbb142a5941b7866ed9f3224a21edc04d3f09a99bf5
|
||
weights_sha256: eb249b9a6b2e10130437a10976ed0230b0d085f86a0553d7226e1ae6eae4b9e5
|
||
signature: LUN7xqLPYD3MFzm5dKB5MnYU0LvoRtek5ci5KiKPHBg+Xo6xuazwokn2Dw2JPMaLYJzmWn/SpT4djuR7hYvVDw== (Ed25519, signed with COGNITUM_OWNER_SIGNING_KEY)
|
||
```
|
||
|
||
Full manifest at `cog/artifacts/manifest.json`. Verified via public anonymous GET against `https://storage.googleapis.com/cognitum-apps/cogs/arm/cog-pose-estimation-arm` — downloaded SHA matches the locally-computed SHA.
|
||
|
||
### Live appliance install
|
||
|
||
Installed on `cognitum-v0` (the V0 cluster leader) at `/var/lib/cognitum/apps/pose-estimation/`:
|
||
|
||
```
|
||
$ ls -la /var/lib/cognitum/apps/pose-estimation/
|
||
-rwxr-xr-x cog-pose-estimation-arm 3,741,976 B (matches GCS sha256)
|
||
-rw-r--r-- pose_v1.safetensors 507,032 B
|
||
-rw-r--r-- manifest.json 989 B
|
||
-rw-r--r-- config.json 187 B
|
||
-rw-r--r-- output.log 28,438 B (5-sec smoke run)
|
||
```
|
||
|
||
Layout matches the existing `anomaly-detect`, `presence`, `seizure-detect`, etc. cogs on the same appliance — the Cogs dashboard at `http://cognitum-v0:9000/cogs` auto-discovers entries under this dir.
|
||
|
||
`cog-pose-estimation run` ran cleanly in the background for 5 seconds with the default config. It correctly:
|
||
|
||
- Emitted a `run.started` event with the configured `sensing_url`, `model_path`, and `poll_ms`.
|
||
- Started its 40 ms poll loop.
|
||
- **Gracefully handled the missing local sensing-server on port 3000** by logging structured WARN events (`{"level":"WARN","fields":{"message":"sensing-server fetch failed","error":"...Connection refused..."}}`) without crashing, leaking, or producing NaN output.
|
||
- Exited cleanly on SIGTERM.
|
||
|
||
0 `pose.frame` events fired during the smoke run — expected, since `127.0.0.1:3000` isn't serving CSI on the appliance. The appliance's actual CSI source is `ruview-vitals-worker` on `:50054` plus the `/api/v1/v0/system/...` endpoints behind the appliance's bearer auth on `:9000`. Wiring `sensing_url` to the appliance-native source is a Day-2 integration task — separate from the cog binary itself.
|
||
|
||
Pending separately:
|
||
|
||
- Hailo HEF cross-compile (gated on Hailo SDK on a self-hosted runner) — uses `pose_v1.onnx` as input.
|
||
- Appliance-native sensing-source integration (`config.sensing_url` should point at the cog-gateway's CSI tap on `:9000`, not the dev-loopback `:3000`).
|
||
- x86_64 release upload (today's release is arm-only).
|
||
|
||
### Artifacts
|
||
|
||
- `v2/crates/cog-pose-estimation/cog/artifacts/pose_v1.safetensors` — 507 KB
|
||
- `v2/crates/cog-pose-estimation/cog/artifacts/train_results.json` — full per-epoch loss curve + hyperparameters + per-joint PCK
|
||
|
||
### Reproducibility
|
||
|
||
```bash
|
||
# On any host with cargo + a CUDA-capable GPU:
|
||
cd ~/work/cog-pose-train
|
||
mkdir -p ./
|
||
# Stage the same inputs (1,077 paired samples + HF encoder, see scripts/align-ground-truth.js for regeneration)
|
||
cp paired.jsonl ./paired.jsonl
|
||
cp encoder.safetensors ./encoder.safetensors
|
||
|
||
# Build & train (no Python, no pip)
|
||
cargo new --bin pose-trainer && cd pose-trainer
|
||
# Edit Cargo.toml deps: candle-core 0.9 (cuda), candle-nn 0.9 (cuda), safetensors, serde, serde_json, anyhow
|
||
# Drop the training script into src/main.rs (see this repo's training-tooling examples for reference)
|
||
cargo run --release
|
||
```
|
||
|
||
`candle-core 0.8.4 + 0.9.2` are typically already in `~/.cargo/registry/cache/` on any developer host, so the build completes in seconds.
|