mirror of https://github.com/ruvnet/RuView synced 2026-06-09 10:13:17 +00:00

Files

T

rUv 6b4994e105 feat(cog-person-count): train count_v1.safetensors — honest v0.0.1 (ADR-103) (#695 )

Phase 2 of ADR-103: trained count head on the existing 1,077 paired
samples (the same data that produced pose_v1 yesterday).

Honest result: 65.1% eval accuracy / 100% within ±1 / MAE 0.349 on
the held-out time-window. Per-class: 100% on "empty room" / 0% on
"1 person". The model overfit by epoch 100 (train_acc → 1.0,
eval_loss climbed 0.67 → 7.8) and the "best" checkpoint is the
snapshot that happened to predict the eval window's class
distribution (140/215 = 65.1%, matches eval_acc exactly). Confidence
head Spearman = 0.023 ⇒ uncalibrated. Same data-bound failure mode
as pose_v1 (#645), bounded by single-session training data; same
fix path (multi-room).

What v0.0.1 still validates end-to-end:
* PyTorch → safetensors → Candle Rust loads cleanly on first try.
  `cog-person-count health` reports `backend: candle-cpu` and emits
  real per-frame predictions instead of the stub backend's hard-coded
  {1 person, 0 confidence}. Architecture parity between train-count.py
  and src/inference.rs::CountNet is bit-exact.
* ONNX export bit-clean (16 KB, opset 18, dynamic batch axis).
* Training wall time: 5.6 s for 400 epochs on RTX 5080.
* Binary size unchanged (2.36 MB stripped), model loads via mmap at
  runtime.

This commit ships:

* scripts/align-ground-truth.js: extended to emit n_persons_mode +
  n_persons_max per window so the training pipeline has count
  labels. Backwards-compatible (additive fields).
* scripts/train-count.py: new — mirrors CountNet architecture
  exactly, loads paired.jsonl, trains 400 epochs with
  CE+BCE+Brier loss, exports safetensors + ONNX + per-epoch JSON.
* v2/.../cog/artifacts/{count_v1.safetensors,count_v1.onnx,
  count_train_results.json}: the trained artifacts.
* v2/.../cog/README.md: Status table updated with the v0.0.1 numbers
  + an Honest Caveat section explaining the data-bound result.
* docs/benchmarks/person-count-cog.md: new — full v0.0.1 benchmark
  log mirroring the format docs/benchmarks/pose-estimation-cog.md
  established. Includes comparison to ADR-103 v0.1.0 acceptance
  gates and per-class breakdown.

Still pending:
* `run` subcommand wiring (long-running polling loop, same as pose)
* Cross-compile + sign + GCS upload (mirror of pose cog pipeline)
* Live install on cognitum-v0
* v0.2.0: re-train on multi-room data, LoRA per-room adapters,
  Stoer-Wagner min-cut clip in fusion stage

2026-05-21 18:56:52 -04:00

4.8 KiB

Raw Blame History

`cog-person-count` — Benchmark Log

Append-only log of every published count_v1 training run per ADR-103. New runs add a section; never overwrite history.

v0.0.1 — first measured run (2026-05-21)

Setup

Component	Value
Training host	`ruvultra` (Ubuntu, x86_64, RTX 5080)
Backend	PyTorch 2.12 + CUDA
Data	`data/paired/wiflow-p7-1779210883.paired.jsonl` — 1,077 paired samples, single 30-min session, label distribution `{0: 533, 1: 544}`
Train/eval split	80/20 stratified on `ts_start` (held-out tail of the recording)
Architecture	Conv1d encoder (56→64→128→128, dilations 1/2/4) + Linear(128→64→8) count head + Linear(128→32→1) confidence head — bit-identical to `v2/crates/cog-person-count/src/inference.rs::CountNet`
Loss	`cross_entropy(count) + 0.3·BCE(conf) + 0.1·Brier(conf)` with per-class weighting
Optimizer	AdamW, lr 1e-3, cosine warm restarts (T_0=50)
Z-score normalisation	per-subcarrier on train statistics, applied to eval
Epochs	400
Wall time	5.6 s

Accuracy (held-out 215-sample tail of the 30-min recording)

Metric	Value
Best eval accuracy	65.1%
Final eval accuracy	65.1%
Within ±1	100% (labels are all in `{0, 1}`, predictions trivially within ±1)
MAE	0.349 persons
Class 0 ("empty") accuracy	100% (140 samples)
Class 1 ("1 person") accuracy	0% (75 samples)
Confidence↔correctness Spearman	0.023

Honest read

The model overfit hard. By epoch 100 train_acc reached 1.0 and eval_loss climbed from 0.67 → 7.8. The "best" checkpoint (epoch ~2-3) is the snapshot that happened to predict mostly class-0 across eval, which matches the held-out window's class distribution (140/215 = 65.1%) — i.e. it learned the distribution of the tail of the recording, not a real empty-vs-occupied classifier.

Why: the training data is one continuous 30-minute solo recording. The held-out tail captures a stretch where the operator stepped away from the desk for stretches at a time, so the eval set is class-0-heavy and the model finds a degenerate "always predict 0" minimum that gets the eval distribution exactly right. Class 1 accuracy = 0 is the smoking gun.

Same data-bound failure mode as pose_v1 (#645). Same fix path: multi-room paired recordings.

What v0.0.1 still validates

Pipeline correctness end-to-end. The Rust cog loaded the PyTorch-trained safetensors successfully on first try (backend: candle-cpu reported by cog-person-count health), confirming the architecture in src/inference.rs is byte-compatible with train-count.py.
ONNX parity. 16 KB ONNX, exports cleanly under opset 18 with dynamic batch axis.
Fast iteration loop. 5.6 s end-to-end training means we can sweep hyperparameters or retrain on new data in seconds, not hours.
Cog binary size. Same 2.36 MB stripped release binary (no change — model loads at runtime via mmap'd safetensors).

Comparison to ADR-103 v0.1.0 targets

Gate	Target	Today	Status
Day-0 same-room accuracy within ±1	≥ 80%	100% (trivially — labels span {0,1})	met
Cross-room accuracy within ±1	≥ 60%	Not measured (no cross-room data)	deferred to v0.2.0
MAE	≤ 0.6	0.349	met
Per-frame confidence reflects accuracy (Spearman)	r ≥ 0.5	0.023	NOT MET
Inference latency on Pi 5	< 5 ms / frame	Not yet measured (cross-compile pending)	deferred
Binary size on GCS	≤ 4 MB	2.36 MB	met

The accuracy ones look "met" only because the labels collapse to {0, 1} and "within ±1" with 8 classes is trivially satisfied. The confidence calibration is the real failure for v0.0.1 — Spearman 0.023 means the confidence head is essentially random noise. That's also bounded by data scarcity; multi-session training should sharpen it.

Artifacts

v2/crates/cog-person-count/cog/artifacts/count_v1.safetensors — 392 KB
v2/crates/cog-person-count/cog/artifacts/count_v1.onnx — 16 KB
v2/crates/cog-person-count/cog/artifacts/count_train_results.json — full per-epoch loss curve + hyperparameters + per-class breakdown

Reproducibility

# On any host with PyTorch + CUDA (cargo path not needed for training):
scp data/paired/wiflow-p7-1779210883.paired.jsonl <host>:/tmp/
scp scripts/train-count.py <host>:/tmp/
ssh <host> "cd /tmp && python3 train-count.py --paired wiflow-p7-1779210883.paired.jsonl --epochs 400"

Loads in the Rust cog with no translation step (safetensors layout matches cog-person-count::inference::CountNet exactly):

cp count_v1.safetensors v2/crates/cog-person-count/cog/artifacts/
cargo run -p cog-person-count --release -- health
# → {"backend":"candle-cpu", "synthetic_count": <int>, "synthetic_confidence": <float>, ...}

4.8 KiB Raw Blame History

cog-person-count — Benchmark Log