ruvnet--RuView/docs/adr/ADR-100-cog-packaging-specification.md
rUv 3314c8db8d feat(cog-pose-estimation): scaffold first Cog from this repo (ADR-100 + ADR-101) (#642)
* feat(cog-pose-estimation): scaffold first Cog from this repo (ADR-100 + ADR-101)

Adds the foundation for the pose-estimation Cog that ships from this
repo into Cognitum V0 appliances. Companion ADR-225 + crate land in
cognitum-one/v0-appliance.

ADRs:
* ADR-100 formalises the Cognitum Cog packaging spec — on-device
  layout under /var/lib/cognitum/apps/<id>/, manifest.json schema
  (incl. new binary_sha256 + binary_signature fields), GCS hosting
  convention, repo source layout, build pipeline, and the four-verb
  runtime contract (version | manifest | health | run). Documents the
  convention I reverse-engineered from inspecting installed cogs on a
  live cognitum-v0 appliance — `anomaly-detect`, `presence`,
  `seizure-detect`, etc.
* ADR-101 designs the pose-estimation Cog itself: where it sits in
  the wifi-densepose pipeline (encoder init from
  ruvnet/wifi-densepose-pretrained, 17-keypoint regression head),
  what gets shipped per target arch (arm / x86_64 / hailo8 /
  hailo10), acceptance gates (PCK@20 explicitly deferred to #640 —
  this ADR ships the vehicle, not the accuracy).

Crate v2/crates/cog-pose-estimation/:
* Cargo.toml + workspace member declaration with a hailo feature gate
  so the binary builds without the Hailo SDK in CI.
* main.rs implements the four-verb CLI exactly per ADR-100.
* config.rs / manifest.rs / publisher.rs / inference.rs / runtime.rs —
  small modules, each <100 lines.
* publisher.rs emits ADR-100 structured JSON events.
* inference.rs is a stub that produces a centred-skeleton baseline
  with confidence=0 (honest: no trained weights wired in yet).
* runtime.rs subscribes to /api/v1/sensing/latest, slides a
  56*20 window, runs the engine, emits pose.frame events.
* cog/manifest.template.json + cog/config.schema.json define the
  release artifact + runtime config schemas.
* cog/Makefile holds build / sign / upload targets.
* tests/smoke.rs covers manifest roundtrip + engine I/O surface.

Verified locally:
* cargo check -p cog-pose-estimation: clean.
* cargo test  -p cog-pose-estimation: 4/4 pass.
* ./target/release/cog-pose-estimation {version,manifest,health}:
  all emit the right contract output.

This commit contains scaffolding only; the actual trained weights and
Hailo HEF cross-compile come in follow-ups tracked in #640 and the
companion v0-appliance branch.

* feat(cog-pose-estimation): first measured run — Candle CUDA on RTX 5080

Trained pose_v1 on ruvultra (RTX 5080) via Candle 0.9 + cuda feature
against the same 1,077-sample paired session that produced 0%/0% PCK
in #640 with the pure-JS SPSA trainer. First real numbers:

  PCK@20 = 3.0%   (up from 0.0%)
  PCK@50 = 18.5%  (up from 0.0%)
  MPJPE  = 0.093  (down from 0.66, ~7x improvement)

400 epochs in 2.1 s wall time, full-batch, ~5 ms/epoch. Loss curve
0.181 -> 0.014 over the run, eval 0.010. Per-joint reveals the model
leans on right-side proximal joints (r_hip 77% PCK@50, r_knee 35%,
l_elbow 26%) — consistent with the camera framing in the source
recording. Distal joints (wrists, ankles) and face joints are still
near-random, consistent with the 56-subcarrier / 20-frame input not
carrying fine-grained spatial info at 1077 samples.

This commit:

* Adds v2/crates/cog-pose-estimation/cog/artifacts/{pose_v1.safetensors,
  train_results.json} so the cog dir now contains a real reference
  artifact, not just scaffold.
* Updates cog/README.md "Status" block with the measured numbers,
  per-joint table, and an honest reading of where the model
  succeeds vs where the data is the bottleneck.
* Adds docs/benchmarks/pose-estimation-cog.md as the canonical
  benchmark log — append-only, one section per published run.
* Appends a "First measured run" section to ADR-101 referencing
  the new benchmark file.

Still pending in the follow-up:
* Wire pose_v1.safetensors into src/inference.rs (replace stub).
* ONNX export (Candle lacks a writer — needs external conversion).
* Hailo HEF cross-compile + cluster deploy.

The data-bound gap to PCK@20 >= 35% is tracked in #640.

* feat(cog-pose-estimation): wire real weights — cog is no longer a stub

Replaces the centred-skeleton stub in src/inference.rs with a real
Candle-based loader that reads cog/artifacts/pose_v1.safetensors and
runs the trained Conv1d encoder + MLP pose head on every incoming CSI
window.

What changes:

* src/inference.rs: PoseNet mirrors the training script's architecture
  exactly — Conv1d(56->64, k=3 d=1), Conv1d(64->128, k=3 d=2),
  Conv1d(128->128, k=3 d=4), mean over time, Linear(128->256)+ReLU,
  Linear(256->34)+sigmoid -> reshape [17, 2]. The InferenceEngine
  searches a sensible candidate list for the weights file
  (/var/lib/cognitum/apps/pose-estimation/, ./pose_v1.safetensors,
  ./cog/artifacts/, repo-root, v2/-relative) and falls back to the
  stub when none are present so the cog still satisfies ADR-100.
* Cargo.toml: adds candle-core 0.9 + candle-nn 0.9 (no-default-features,
  CPU build by default) + safetensors 0.4. New `cuda` feature opt-in
  for GPU inference on hosts that have it. Drops the unused
  wifi-densepose-train path dep from the default build path.
* src/main.rs + src/publisher.rs: health.ok event now carries
  `backend` (candle-cuda | candle-cpu | stub) and the synthetic
  output confidence, so operators can tell at a glance whether the
  cog loaded its weights or fell back to the stub.
* tests/smoke.rs: adds `real_weights_load_when_available` which
  asserts the loaded engine reports backend=candle-* and emits
  non-zero confidence — exactly the signal that proves we're not
  silently degrading to the stub.

Verified locally:

* `cargo check -p cog-pose-estimation --no-default-features` — clean
* `cargo test  -p cog-pose-estimation --no-default-features` — 5/5 pass
* `./target/release/cog-pose-estimation health` emits:
  {"event":"health.ok","fields":{"backend":"candle-cpu","cog":"pose-estimation","synthetic_output_confidence":0.185}}
  — 0.185 is the published PCK@50 from cog/artifacts/train_results.json,
  emitted by the real Candle inference path (would be 0.0 if it had
  fallen back to the stub).

The cog now runs the trained pose_v1 model end-to-end. Accuracy is
still bounded by the underlying 1077-sample training data (PCK@20
3.0%, PCK@50 18.5% per docs/benchmarks/pose-estimation-cog.md) — that
gap is data-bound and tracked in #640. ONNX export + Hailo HEF
cross-compile remain follow-ups.

* docs(benchmarks): measure cog-pose-estimation cold-start latency

100 sequential `cog-pose-estimation health` invocations average 76.2 ms
each on a Windows x86_64 host using the `candle-cpu` backend. Each
invocation re-loads pose_v1.safetensors and runs one synthetic forward
pass, so this is the worst-case cold-start path. Long-running `run`
inference will be sub-millisecond per frame once the model is loaded.

Updates the benchmarks doc accordingly.

* feat(cog-pose-estimation): ONNX export — pose_v1.onnx + scripts/export-onnx.py

Adds the canonical ONNX artifact that unblocks downstream Hailo HEF
cross-compile + ONNX Runtime benchmarks. Generated on ruvultra (torch
2.12.0 + CUDA), 12,059 bytes, opset 18, dynamic batch axis.

* scripts/export-onnx.py: mirrors the Candle inference architecture in
  PyTorch (Conv1d 56->64, 64->128, 128->128 + Linear 128->256->34), pure-
  python safetensors loader (no extra pip dep), exports via
  torch.onnx.export, then verifies via onnx.checker.check_model and
  numerical parity against the torch reference.
* Verified parity vs torch: max |torch - onnx| = 8.94e-8 (1e-5
  threshold). Effectively bit-perfect.
* v2/crates/cog-pose-estimation/cog/artifacts/pose_v1.onnx — the
  artifact itself, 12 KB.
* docs/benchmarks/pose-estimation-cog.md — adds an ONNX export
  section with the verification numbers.

Next: Hailo HEF cross-compile (still gated on Hailo SDK on a
self-hosted runner) and ONNX Runtime latency benchmarks on each
target arch.

* feat(cog-pose-estimation): release v0.0.1 — signed aarch64 binary on GCS

End-to-end deploy: cross-compiled to aarch64-unknown-linux-gnu on
ruvultra, ran via qemu-aarch64-static, then smoke-tested on a real
cognitum-v0 Pi 5. Signed with COGNITUM_OWNER_SIGNING_KEY (Ed25519)
and uploaded to gs://cognitum-apps/cogs/arm/.

Real-hardware results on cognitum-v0 (Pi 5):
  health: backend=candle-cpu, confidence=0.185, real weights loaded
  30x sequential `health`: 0.251 s total -> 8.4 ms / invocation (cold)

GCS release artifacts (publicly downloadable):
  binary:  3,741,976 bytes
    sha256 1e1a7d3dd01ca05d5bfc5dbb142a5941b7866ed9f3224a21edc04d3f09a99bf5
  weights:   507,032 bytes
    sha256 eb249b9a6b2e10130437a10976ed0230b0d085f86a0553d7226e1ae6eae4b9e5
  signature (Ed25519, b64): LUN7xqLPYD3MFzm5dKB5MnYU0LvoRtek5ci5KiKPHBg+Xo6xuazwokn2Dw2JPMaLYJzmWn/SpT4djuR7hYvVDw==

Adds:
* v2/crates/cog-pose-estimation/cog/artifacts/manifest.json — the
  release-pipeline-produced manifest with all fields filled in per
  ADR-100, including arch, target_triple, signature, and a
  build_metadata block carrying the validation PCK numbers.
* docs/benchmarks/pose-estimation-cog.md — new sections covering
  the real Pi 5 smoke (8.4 ms cold-start) and the signed GCS
  release artifacts.

Verified by downloading the binary anonymously from GCS and
re-computing the sha256 — matches the locally-computed sha exactly.
Signature decoded to the expected 64-byte Ed25519 length.

Closes the GCS-upload acceptance criterion from ADR-100; the only
pending work is Hailo HEF cross-compile (still SDK-gated) and an
x86_64 release alongside this arm release.

* docs(benchmarks): record live cognitum-v0 install + 5-sec smoke run

Adds the "Live appliance install" section documenting what happened
when the signed v0.0.1 binary + weights were installed under
/var/lib/cognitum/apps/pose-estimation/ on cognitum-v0 (the V0
cluster leader).

* Layout matches the existing anomaly-detect / presence / seizure-
  detect cogs exactly — the Cogs dashboard at
  http://cognitum-v0:9000/cogs auto-discovers entries.
* `cog-pose-estimation run` ran for 5 seconds in the background and
  cleanly emitted run.started + structured WARN events for the
  missing local sensing-server on :3000 (cognitum-v0's actual CSI
  source is ruview-vitals-worker on :50054, not :3000). No crashes,
  no NaN, no leaks.
* Wiring `sensing_url` to the appliance-native source is a separate
  Day-2 integration task.
2026-05-19 17:03:09 -04:00


ADR-100: Cognitum Cog Packaging Specification

  • Status: Accepted (formalises existing convention)
  • Date: 2026-05-19
  • Deciders: ruv

Context

The Cognitum V0 Appliance deploys discrete units called Cogs, installed under /var/lib/cognitum/apps/. They appear in the Appliance dashboard (http://cognitum-v0:9000/cogs) under an app-store UI (Today / Apps / Categories / Search / Updates). Until this ADR, the packaging convention has been implicit, derived from inspecting installed cogs (anomaly-detect, presence, seizure-detect, etc.) on a live appliance. Bringing a new Cog to the platform required reverse-engineering the layout each time.

This ADR formalises the layout so:

  1. A repo crate can be built into a Cog with a deterministic Makefile / CI pipeline.
  2. Cog binaries can be cross-compiled for every supported architecture from a single source.
  3. The appliance's installer (cognitum-cog-gateway) can verify manifests without bespoke per-cog adapters.
  4. Future Cogs in this repo (starting with cog-pose-estimation — see ADR-101) follow a single rule.

Decision

On-device layout

Each installed Cog lives at:

/var/lib/cognitum/apps/<cog-id>/
├── cog-<cog-id>-<arch>          # single self-contained executable
├── manifest.json                # immutable; signed by the publisher
├── config.json                  # mutable; runtime config, owned by the appliance
├── pid                          # current PID when running; absent when stopped
├── output.log                   # stdout (truncated on rotation)
└── error.log                    # stderr (truncated on rotation)

<cog-id> is kebab-case, ASCII, [a-z0-9-]{2,32}. <arch> is one of:

| arch | target triple | hardware |
| --- | --- | --- |
| arm | aarch64-unknown-linux-gnu | Raspberry Pi 5 (cognitum-v0, cluster Pis) |
| x86_64 | x86_64-unknown-linux-gnu | ruvultra, generic Linux dev |
| hailo8 | aarch64-unknown-linux-gnu + Hailo HEF sidecar | Pi + Hailo-8 hat (26 TOPS) |
| hailo10 | aarch64-unknown-linux-gnu + Hailo HEF sidecar | Pi + Hailo-10 hat (40 TOPS) |
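
The naming rules above are mechanical enough to check in code. The following is a minimal sketch, assuming a std-only helper module; the names `Arch`, `is_valid_cog_id`, and `binary_path` are illustrative and are not taken from the crate.

```rust
use std::path::PathBuf;

/// The four <arch> values from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Arch {
    Arm,
    X86_64,
    Hailo8,
    Hailo10,
}

impl Arch {
    /// Parse the short name used in file names and bucket paths.
    pub fn parse(s: &str) -> Option<Arch> {
        match s {
            "arm" => Some(Arch::Arm),
            "x86_64" => Some(Arch::X86_64),
            "hailo8" => Some(Arch::Hailo8),
            "hailo10" => Some(Arch::Hailo10),
            _ => None,
        }
    }

    /// The short name, as it appears in `cog-<cog-id>-<arch>`.
    pub fn as_str(self) -> &'static str {
        match self {
            Arch::Arm => "arm",
            Arch::X86_64 => "x86_64",
            Arch::Hailo8 => "hailo8",
            Arch::Hailo10 => "hailo10",
        }
    }

    /// Rust target triple. Both Hailo variants run an aarch64 host binary.
    pub fn target_triple(self) -> &'static str {
        match self {
            Arch::X86_64 => "x86_64-unknown-linux-gnu",
            Arch::Arm | Arch::Hailo8 | Arch::Hailo10 => "aarch64-unknown-linux-gnu",
        }
    }

    /// Whether a Hailo HEF sidecar file accompanies the binary.
    pub fn needs_hef(self) -> bool {
        matches!(self, Arch::Hailo8 | Arch::Hailo10)
    }
}

/// True if `id` matches [a-z0-9-]{2,32}.
pub fn is_valid_cog_id(id: &str) -> bool {
    (2..=32).contains(&id.len())
        && id
            .bytes()
            .all(|b| b.is_ascii_lowercase() || b.is_ascii_digit() || b == b'-')
}

/// On-device path of the executable, or None if the id is invalid.
pub fn binary_path(id: &str, arch: Arch) -> Option<PathBuf> {
    if !is_valid_cog_id(id) {
        return None;
    }
    Some(
        PathBuf::from("/var/lib/cognitum/apps")
            .join(id)
            .join(format!("cog-{}-{}", id, arch.as_str())),
    )
}
```

For example, `binary_path("anomaly-detect", Arch::Arm)` yields `/var/lib/cognitum/apps/anomaly-detect/cog-anomaly-detect-arm`, while an id such as `Pose` is rejected because of the uppercase letter.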

manifest.json schema

{
  "id": "anomaly-detect",
  "version": "0.1.0",
  "binary_url": "https://storage.googleapis.com/cognitum-apps/cogs/arm/cog-anomaly-detect-arm",
  "binary_bytes": 461904,
  "binary_sha256": "<hex>",
  "binary_signature": "<base64 Ed25519 sig over binary_sha256, signed with COGNITUM_OWNER_SIGNING_KEY>",
  "installed_at": 1778772536,
  "status": "installed"
}

Fields:

  • id, version, binary_url, binary_bytes, installed_at, status — already implemented and observed in production manifests (e.g. anomaly-detect@0.0.0). Documented here without change.
  • binary_sha256, binary_signature: new, REQUIRED for any Cog shipped from this repo. Backwards-compatible with existing manifests: the appliance gateway treats both fields as optional today, but MUST verify them when present. ADR-103 (witness chain) covers the trust model in more detail.
  • status values: "installed", "running", "stopped", "failed", "updating".
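
To make the "optional today, verified when present" rule concrete, here is a minimal sketch of the decision an installer could take before any cryptographic check. `SignaturePolicy`, `Action`, and `signature_action` are hypothetical names, and rejecting a manifest that carries only one of the two fields is an assumption of this sketch rather than something this ADR states.

```rust
use std::str::FromStr;

/// The five status values a manifest may carry.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Installed,
    Running,
    Stopped,
    Failed,
    Updating,
}

impl FromStr for Status {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "installed" => Ok(Status::Installed),
            "running" => Ok(Status::Running),
            "stopped" => Ok(Status::Stopped),
            "failed" => Ok(Status::Failed),
            "updating" => Ok(Status::Updating),
            other => Err(format!("unknown status: {other}")),
        }
    }
}

/// Hypothetical gateway setting: today's rule vs. the stricter rule
/// planned for the final migration step.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SignaturePolicy {
    VerifyWhenPresent,
    Require,
}

/// What the installer should do with a manifest (illustrative).
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Verify,
    SkipUnsigned,
    Reject(&'static str),
}

pub fn signature_action(
    binary_sha256: Option<&str>,
    binary_signature: Option<&str>,
    policy: SignaturePolicy,
) -> Action {
    match (binary_sha256, binary_signature, policy) {
        // Both fields present: always verify, under either policy.
        (Some(_), Some(_), _) => Action::Verify,
        // Legacy unsigned manifest: tolerated only under today's rule.
        (None, None, SignaturePolicy::VerifyWhenPresent) => Action::SkipUnsigned,
        (None, None, SignaturePolicy::Require) => Action::Reject("unsigned manifest"),
        // Assumption: a half-signed manifest is treated as malformed.
        _ => Action::Reject("binary_sha256 and binary_signature must appear together"),
    }
}
```

Keeping the policy as an explicit parameter means the later switch to mandatory signatures changes one value at the call site rather than the decision logic.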

Binary hosting

Cog binaries live in Google Cloud Storage, public-read, at:

gs://cognitum-apps/cogs/<arch>/cog-<id>-<arch>

The HTTPS form is https://storage.googleapis.com/cognitum-apps/cogs/<arch>/cog-<id>-<arch> (no trailing extension; the URL is the canonical artifact). For Hailo variants, the HEF model file is sibling: cog-<id>-<arch>.hef.

Bucket conventions:

  • Bucket is public-read; write requires roles/storage.objectAdmin in project cognitum-20260110.
  • Per-version artifacts must be content-addressed: cogs/<arch>/cog-<id>-<arch>@<sha256-prefix> is the immutable copy; the un-suffixed name acts like a symlink (in practice an object copy, since GCS has no native symlinks) that is overwritten on release.
  • COGNITUM_OWNER_SIGNING_KEY (GCP Secret Manager) signs every binary before upload.
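
The hosting convention reduces to a few string templates. The sketch below derives them, assuming `arch` is passed as its short name; the function names are illustrative, and because this ADR does not fix the length of `<sha256-prefix>`, it is left as a caller-supplied parameter.

```rust
const BUCKET: &str = "cognitum-apps";

/// Object name inside the bucket: cogs/<arch>/cog-<id>-<arch>.
pub fn object_name(id: &str, arch: &str) -> String {
    format!("cogs/{}/cog-{}-{}", arch, id, arch)
}

/// gs:// form, as used by the upload step.
pub fn gs_uri(id: &str, arch: &str) -> String {
    format!("gs://{}/{}", BUCKET, object_name(id, arch))
}

/// Public HTTPS form, as stored in `binary_url`.
pub fn https_url(id: &str, arch: &str) -> String {
    format!(
        "https://storage.googleapis.com/{}/{}",
        BUCKET,
        object_name(id, arch)
    )
}

/// Sibling HEF model file for Hailo variants.
pub fn hef_object_name(id: &str, arch: &str) -> String {
    format!("{}.hef", object_name(id, arch))
}

/// Immutable, content-addressed copy: <object>@<sha256-prefix>.
/// `prefix_len` is a free parameter here (assumption: the spec leaves it open).
/// Returns None for a malformed digest or an out-of-range prefix length.
pub fn immutable_object_name(
    id: &str,
    arch: &str,
    sha256_hex: &str,
    prefix_len: usize,
) -> Option<String> {
    let is_hex =
        sha256_hex.len() == 64 && sha256_hex.bytes().all(|b| b.is_ascii_hexdigit());
    if !is_hex || prefix_len == 0 {
        return None;
    }
    let prefix = sha256_hex.get(..prefix_len)?;
    Some(format!("{}@{}", object_name(id, arch), prefix))
}
```

With `id = "anomaly-detect"` and `arch = "arm"`, `https_url` reproduces the `binary_url` shown in the manifest example.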

Source-tree layout (this repo)

Each Cog lives under v2/crates/cog-<id>/:

v2/crates/cog-<id>/
├── Cargo.toml                # crate name = cog-<id>; binary = cog-<id>
├── src/
│   ├── main.rs               # CLI: cog-<id> version | manifest | health | run
│   ├── lib.rs
│   └── inference.rs          # the actual work
├── cog/
│   ├── manifest.template.json
│   ├── config.schema.json    # JSON schema for runtime config
│   ├── README.md             # consumer-facing description (used by the App Store UI)
│   ├── icon.svg              # 1024×1024 icon (used by App Store hero)
│   └── Makefile              # build / sign / upload targets
└── tests/
    ├── smoke.rs
    └── manifest_signature.rs

Build pipeline

cd v2/crates/cog-<id>
make build-arm           # cross-compile to aarch64-unknown-linux-gnu
make build-x86_64        # x86_64 Linux build
make build-hailo8        # arm + HEF compilation (requires Hailo Dataflow Compiler)
make build-hailo10       # arm + HEF compilation
make sign                # produce binary_sha256 + binary_signature
make upload              # gsutil cp to gs://cognitum-apps/cogs/<arch>/
make manifest            # emit manifest.json with all fields filled

CI (GitHub Actions) MUST run make build-arm + make build-x86_64 on every PR touching v2/crates/cog-*/. Hailo HEF compilation requires the proprietary Hailo SDK and runs only on the Hailo-capable runners (currently a labelled self-hosted runner on the Pi cluster — TBD, separate ADR).
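
The "every PR touching v2/crates/cog-*/" rule can be sketched as a path filter. In practice this would more likely be a path filter in the workflow definition; the helper below, with the hypothetical names `touched_cogs` and `required_make_targets`, only illustrates the matching logic.

```rust
use std::collections::BTreeSet;

/// Collect the ids of Cog crates touched by a list of changed file paths.
/// A path counts only if it lies inside a `v2/crates/cog-<id>/` directory.
pub fn touched_cogs<'a, I>(changed_paths: I) -> BTreeSet<String>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut ids = BTreeSet::new();
    for path in changed_paths {
        if let Some(rest) = path.strip_prefix("v2/crates/cog-") {
            // Require a '/' after the id so that a sibling file such as
            // `v2/crates/cog-notes.md` does not count as a crate.
            if let Some((id, _)) = rest.split_once('/') {
                if !id.is_empty() {
                    ids.insert(id.to_string());
                }
            }
        }
    }
    ids
}

/// Make targets CI must run for a PR: both non-Hailo builds when any Cog
/// crate is touched, none otherwise.
pub fn required_make_targets(touched: &BTreeSet<String>) -> Vec<&'static str> {
    if touched.is_empty() {
        Vec::new()
    } else {
        vec!["build-arm", "build-x86_64"]
    }
}
```

Hailo targets are deliberately absent from `required_make_targets`, since those builds run only on the SDK-equipped runner.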

Runtime contract

A Cog binary MUST implement:

| Subcommand | Behaviour |
| --- | --- |
| cog-<id> version | Print <id> <version> and exit 0. |
| cog-<id> manifest | Print the embedded manifest JSON and exit 0. |
| cog-<id> run --config /path/to/config.json | Long-running. Writes structured JSON logs to stdout (parsed by cognitum-cog-gateway). Exit code 0 on graceful shutdown, non-zero on fatal error. |
| cog-<id> health | One-shot. Exit 0 if the cog could come up healthy; non-zero with diagnostic on stderr. Called by the gateway before run. |
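
A minimal sketch of how a Cog's entry point might map its arguments onto the four verbs, using only the standard library. The `Verb` enum, `parse_verb`, and `version_line` are illustrative names and do not reflect the actual `main.rs`.

```rust
/// The four subcommands every Cog binary must implement.
#[derive(Debug, PartialEq, Eq)]
pub enum Verb {
    Version,
    Manifest,
    Health,
    Run { config: String },
}

/// Parse the arguments that follow the program name.
pub fn parse_verb(args: &[&str]) -> Result<Verb, String> {
    match args {
        ["version"] => Ok(Verb::Version),
        ["manifest"] => Ok(Verb::Manifest),
        ["health"] => Ok(Verb::Health),
        ["run", "--config", path] => Ok(Verb::Run {
            config: path.to_string(),
        }),
        // `run` with missing or extra arguments is a usage error.
        ["run", ..] => Err("usage: run --config /path/to/config.json".to_string()),
        [] => Err("missing subcommand".to_string()),
        [other, ..] => Err(format!("unknown subcommand: {other}")),
    }
}

/// Output of the `version` verb: "<id> <version>".
pub fn version_line(id: &str, version: &str) -> String {
    format!("{} {}", id, version)
}
```

A real `main` would collect `std::env::args().skip(1)`, borrow the items as `&str`, call `parse_verb`, and exit non-zero with the error on stderr when parsing fails.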

stdout JSON line format (one event per line):

{"ts": 1779210883.444, "level": "info", "event": "<event-name>", "fields": { ... }}
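
As a sketch of an emitter for this line format, the helper below builds one event per call without any JSON crate. `Value`, `escape`, and `event_line` are hypothetical names; a production Cog would more plausibly use a serialization library. Non-finite numbers are written as `null`, which is an assumption made here to keep every line valid JSON.

```rust
/// A field value: only strings and numbers, to keep the sketch small.
pub enum Value<'a> {
    Str(&'a str),
    Num(f64),
}

/// Escape a string for use inside a JSON string literal.
pub fn escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \u00XX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Render one event as a single line (no trailing newline).
/// `ts` is seconds since the Unix epoch, printed with millisecond precision.
pub fn event_line(ts: f64, level: &str, event: &str, fields: &[(&str, Value<'_>)]) -> String {
    let body: Vec<String> = fields
        .iter()
        .map(|(key, value)| match value {
            Value::Str(s) => format!("\"{}\":\"{}\"", escape(key), escape(s)),
            // JSON has no NaN or infinity, so emit null for those.
            Value::Num(n) if !n.is_finite() => format!("\"{}\":null", escape(key)),
            Value::Num(n) => format!("\"{}\":{}", escape(key), n),
        })
        .collect();
    format!(
        "{{\"ts\":{:.3},\"level\":\"{}\",\"event\":\"{}\",\"fields\":{{{}}}}}",
        ts,
        escape(level),
        escape(event),
        body.join(",")
    )
}
```

Because newlines inside strings are escaped, each event stays on one physical line, which is what a line-oriented parser on the gateway side needs.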

Consequences

Positive

  • New Cogs can be added without reverse-engineering the layout each time.
  • CI can verify the manifest schema before merge.
  • Signed binaries close a real supply-chain gap — current installed cogs (anomaly-detect@0.0.0) have no signature, and a compromised GCS object could push malicious code to every appliance.
  • The runtime contract (run | health | version | manifest) is uniform across cogs, so cognitum-cog-gateway can stop carrying per-cog adapters.

Negative

  • Existing installed cogs must be re-published with signatures within one minor release of the gateway adopting the verify-when-present rule.
  • Hailo HEF cross-compile is gated on a self-hosted runner; we accept that PRs touching Hailo variants will be slower to land.

Risks

  • Signing key rotation: COGNITUM_OWNER_SIGNING_KEY (Ed25519) is a single root-of-trust today. ADR-103 (witness chain) describes the rotation/recovery path; this ADR depends on that.
  • GCS bucket misconfiguration: a public-read bucket with versioning-off could allow rollback attacks. Bucket MUST have Object Versioning enabled + 90-day non-current-version retention.

Migration

  1. Land this ADR.
  2. Land ADR-101 (cog-pose-estimation — first Cog built to this spec).
  3. After two clean releases of cog-pose-estimation, re-publish the existing cogs (anomaly-detect, presence, etc.) with binary_sha256 + binary_signature. Track in a follow-up issue.
  4. Flip cognitum-cog-gateway from "verify when present" to "require signature" — separate ADR, separate review.

See also

  • ADR-101: Pose Estimation Cog (first Cog built to this spec).
  • ADR-103: Witness chain trust model (signing key rotation, future ADR).
  • docs/adr/ADR-079-camera-ground-truth-training.md — the training pipeline behind cog-pose-estimation.
  • CLAUDE.local.md § "Fleet Infrastructure (Tailscale)" — appliance layout this ADR describes.