ruvnet--RuView

mirror of https://github.com/ruvnet/RuView synced 2026-06-09 10:13:17 +00:00

Commit Graph

Author

SHA1

Message

Date

rUv

b3a5012dbd

feat(cog-person-count): v0.0.2 — K-fold + label-smoothing + temperature-calibrated (#699)

* chore: stage v0.0.2 artifacts + temperature scalar for build pipeline

Stages count_v1.{safetensors,onnx,temperature,train_results.json}
ahead of the build/sign/upload step. This commit is a momentary
side-effect — the next commit will refresh the per-arch manifests
with the new binary SHAs once ruvultra finishes the cross-build.

The .temperature file holds the calibration scalar from LBFGS over the
held-out conf logits. The Rust cog will read it post-load and divide
conf_logits by it before sigmoid, exactly matching the Python eval.
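As a rough illustration of the post-load step described above, the following Python sketch divides a confidence logit by the temperature and then applies a sigmoid. The function names are hypothetical, and the value 0.9262 is the scalar quoted later in this commit message; the actual Rust and Python code may be organised differently.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def calibrated_confidence(conf_logit: float, temperature: float) -> float:
    """Divide the logit by T, then squash to (0, 1).

    Hypothetical helper: only the order of operations (divide, then
    sigmoid) comes from the commit message.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return sigmoid(conf_logit / temperature)


# T < 1 pushes confidences slightly away from 0.5; T > 1 pulls them toward it.
T = 0.9262  # scalar reported for count_v1.temperature
example = calibrated_confidence(2.0, T)
```

A logit of exactly 0 stays at 0.5 for any positive T, so the scalar only changes how sharply confidence moves away from the midpoint.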

* feat(cog-person-count): v0.0.2 — K-fold validated, label smoothing + early stop + temp scale

The v0.0.1 "65.1% but class-1=0%" result was an unlucky temporal split
that let a degenerate "always predict 0" classifier hit eval acc =
class-0 fraction. 5-fold stratified random CV proved the architecture
actually learns ~57.1% class-1 accuracy under fair splits — a real,
modestly useful signal.

v0.0.2 ships a retrained model that:

* **Splits randomly (seed=42) 80/20** instead of temporally — eliminates
  the trailing-window-class-imbalance cheat.
* **Class-balanced sampler** (multinomial with replacement, weighted by
  inverse class frequency) — per-batch expected counts are equal
  regardless of dataset distribution.
* **Label smoothing 0.1** on the cross-entropy — reduces confidence
  saturation that drove v0.0.1's all-or-nothing predictions.
* **Early stopping** with patience=20 — stops at epoch 29 instead of
  overfitting through 400.
* **Temperature scaling** of the conf head — LBFGS fits a scalar T on
  held-out conf logits; ships as a count_v1.temperature sidecar so the
  Rust cog can divide conf_logits by T before sigmoid.
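The first four items in the list above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code in train-count.py: every name is hypothetical, the smoothing convention (spread eps uniformly over all classes) is assumed to follow the common PyTorch one, and early stopping is assumed to monitor eval loss, which the commit message does not specify.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-sample weight 1 / count(class), so each class has equal total mass."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


def balanced_batch(labels, batch_size, rng):
    """Draw sample indices with replacement, weighted by inverse class frequency."""
    weights = inverse_frequency_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=batch_size)


def smoothed_target(label, num_classes, eps=0.1):
    """Label-smoothed target: (1 - eps) on the true class plus eps / K everywhere."""
    target = [eps / num_classes] * num_classes
    target[label] += 1.0 - eps
    return target


class EarlyStopper:
    """Stop once the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, eval_loss):
        if eval_loss < self.best:
            self.best = eval_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up imbalanced label set: 90 "empty" windows, 10 "1 person" windows.
labels = [0] * 90 + [1] * 10
batch = balanced_batch(labels, batch_size=10000, rng=random.Random(42))
class1_fraction = sum(labels[i] for i in batch) / len(batch)
```

With these weights the expected share of each class in a batch is equal, even though class 1 makes up only 10% of the made-up dataset.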

Numbers on the same data:

  | Metric           | v0.0.1 | v0.0.2 | K-fold (5x100) |
  |------------------|--------|--------|----------------|
  | Overall acc      | 65.1%  | 62.3%  | 62.2% ± 1.9%   |
  | Class 0 acc      | 100%   | 86.2%  | 67.4%          |
  | Class 1 acc      |  0%    | 34.3%  | 57.1% ✓        |
  | MAE              | 0.349  | 0.377  | 0.378          |
  | Spearman         | 0.023  | 0.013  | 0.160          |

Class-1 accuracy 0 → 34.3% is the headline win. Overall accuracy drops
slightly (65.1% → 62.3%) because we stopped cheating on class 0. K-fold's
57.1% class-1 accuracy says there's headroom remaining; reaching it needs
more independent splits (== more data), not more training tricks.

Confidence calibration didn't improve (Spearman 0.023 → 0.013). Temperature scaling alone can't fix
a confidence head trained against a noisy argmax==truth indicator over
a 62%-accurate classifier — the head's training signal is the issue,
not its post-hoc transform. The honest fix is multi-room data (#645),
not another calibration knob.
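One way to see why a single scalar cannot repair a rank-based metric: for any T > 0, sigmoid(z / T) is strictly increasing in z, so the ordering of confidences is unchanged. The sketch below checks this on made-up logits. It assumes the Spearman figure is computed on the conf head's outputs, which the commit message implies but does not spell out.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def rank_order(values):
    """Indices of `values` sorted from smallest to largest."""
    return sorted(range(len(values)), key=lambda i: values[i])


# Made-up conf logits; T is the scalar reported for count_v1.temperature.
conf_logits = [-2.0, -0.3, 0.1, 0.8, 3.5, -1.1]
T = 0.9262

raw = [sigmoid(z) for z in conf_logits]
scaled = [sigmoid(z / T) for z in conf_logits]

# The values change, but the ranking (and therefore Spearman) does not.
same_order = rank_order(raw) == rank_order(scaled)
```

Temperature scaling can still change how large the confidences are, which affects metrics such as Brier score or expected calibration error, but it cannot make a poorly ranked confidence head rank better.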

Live on cognitum-v0 at /var/lib/cognitum/apps/person-count/ — health
reports candle-cpu backend, count = 1 (was 0 in v0.0.1) on synthetic
zero input.

Files changed:
* scripts/train-count.py — adds --k-fold (no sklearn dep, hand-rolled
  stratified splits with deterministic shuffle) and --v2 paths.
* v2/.../cog/artifacts/count_v1.safetensors (392 KB, new sha
  32996433…) + count_v1.onnx (16 KB) + count_v1.temperature (0.9262
  scalar) + count_train_results.json (full epoch trace).
* v2/.../cog/artifacts/manifests/{arm,x86_64}/manifest.json bumped to
  version 0.0.2 with the new weights_sha256 + caveats.
* docs/benchmarks/person-count-cog.md — appends a v0.0.2 section
  with the K-fold diagnostic table and honest-read paragraph.
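A hand-rolled stratified split with a deterministic shuffle, as mentioned for --k-fold above, might look roughly like this. It is a sketch only: the function name, the round-robin assignment, and the seed default are assumptions, not the contents of train-count.py.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=42):
    """Return k (train_indices, eval_indices) pairs with per-class balance.

    Each class is shuffled with a seeded RNG and dealt round-robin into
    k folds, so every fold gets roughly the same class proportions.
    """
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    rng = random.Random(seed)  # fixed seed makes the shuffle reproducible
    folds = [[] for _ in range(k)]
    for y in sorted(by_class):  # sorted for a stable class order
        indices = by_class[y]
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    splits = []
    for i in range(k):
        eval_idx = sorted(folds[i])
        train_idx = sorted(
            idx for f, fold in enumerate(folds) if f != i for idx in fold
        )
        splits.append((train_idx, eval_idx))
    return splits


# Made-up labels: 70 windows of class 0 and 30 of class 1.
labels = [0] * 70 + [1] * 30
splits = stratified_k_fold(labels, k=5, seed=42)
```

Because only the standard library is used, this avoids the sklearn dependency while keeping each fold's class mix close to the dataset's.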

GCS:
  gs://cognitum-apps/cogs/arm/cog-person-count-count_v1.safetensors
    refreshed (binaries unchanged — load weights via mmap at runtime).

2026-05-21 19:47:04 -04:00

rUv

6b4994e105

feat(cog-person-count): train count_v1.safetensors — honest v0.0.1 (ADR-103) (#695)

Phase 2 of ADR-103: trained count head on the existing 1,077 paired
samples (the same data that produced pose_v1 yesterday).

Honest result: 65.1% eval accuracy / 100% within ±1 / MAE 0.349 on
the held-out time-window. Per-class: 100% on "empty room" / 0% on
"1 person". The model overfit by epoch 100 (train_acc → 1.0,
eval_loss climbed 0.67 → 7.8) and the "best" checkpoint is the
snapshot that happened to predict the eval window's class
distribution (140/215 = 65.1%, matches eval_acc exactly). Confidence
head Spearman = 0.023 ⇒ uncalibrated. Same data-bound failure mode
as pose_v1 (#645), bounded by single-session training data; same
fix path (multi-room).
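The arithmetic behind the degenerate checkpoint can be reproduced with a constant predictor. The sketch below assumes the 215-window eval set contains 140 "empty room" windows and that the other 75 are all "1 person"; the first number is from the message above, while the second is an inference from the per-class and within-±1 figures.

```python
# Assumed eval-window labels: 140 empty-room windows, 75 one-person windows.
labels = [0] * 140 + [1] * 75

# A classifier that ignores its input and always predicts "0 people".
preds = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
mae = sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)

class1 = [(p, y) for p, y in zip(preds, labels) if y == 1]
class1_accuracy = sum(p == y for p, y in class1) / len(class1)

print(f"accuracy={accuracy:.3f} mae={mae:.3f} class1={class1_accuracy:.1%}")
```

Under that assumption the constant predictor lands on the same 65.1% accuracy and 0.349 MAE as the reported checkpoint, which is why those headline numbers carry no evidence of learning.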

What v0.0.1 still validates end-to-end:
* PyTorch → safetensors → Candle Rust loads cleanly on first try.
  `cog-person-count health` reports `backend: candle-cpu` and emits
  real per-frame predictions instead of the stub backend's hard-coded
  {1 person, 0 confidence}. Architecture parity between train-count.py
  and src/inference.rs::CountNet is bit-exact.
* ONNX export bit-clean (16 KB, opset 18, dynamic batch axis).
* Training wall time: 5.6 s for 400 epochs on RTX 5080.
* Binary size unchanged (2.36 MB stripped), model loads via mmap at
  runtime.

This commit ships:

* scripts/align-ground-truth.js: extended to emit n_persons_mode +
  n_persons_max per window so the training pipeline has count
  labels. Backwards-compatible (additive fields).
* scripts/train-count.py: new — mirrors CountNet architecture
  exactly, loads paired.jsonl, trains 400 epochs with
  CE+BCE+Brier loss, exports safetensors + ONNX + per-epoch JSON.
* v2/.../cog/artifacts/{count_v1.safetensors,count_v1.onnx,
  count_train_results.json}: the trained artifacts.
* v2/.../cog/README.md: Status table updated with the v0.0.1 numbers
  + an Honest Caveat section explaining the data-bound result.
* docs/benchmarks/person-count-cog.md: new — full v0.0.1 benchmark
  log mirroring the format docs/benchmarks/pose-estimation-cog.md
  established. Includes comparison to ADR-103 v0.1.0 acceptance
  gates and per-class breakdown.
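For readers unfamiliar with a combined CE+BCE+Brier objective, a per-sample version can be sketched as below. Several details are assumptions rather than facts from this commit: the three terms are summed with equal weight, the confidence target is the argmax==truth indicator mentioned in the v0.0.2 message, and the Brier term is applied to the confidence output. All names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def combined_loss(count_logits, conf_logit, label):
    """CE on the count head plus BCE and Brier on the confidence head."""
    probs = softmax(count_logits)
    ce = -math.log(probs[label])

    # Confidence target: 1 if the predicted count is right, else 0 (assumed).
    predicted = max(range(len(probs)), key=probs.__getitem__)
    correct = 1.0 if predicted == label else 0.0

    p_conf = sigmoid(conf_logit)
    eps = 1e-12  # guards against log(0)
    bce = -(correct * math.log(p_conf + eps)
            + (1.0 - correct) * math.log(1.0 - p_conf + eps))
    brier = (p_conf - correct) ** 2

    return ce + bce + brier  # equal weighting is an assumption


# Confident and right versus confident and wrong, on made-up logits.
loss_right = combined_loss([3.0, 0.0], conf_logit=2.0, label=0)
loss_wrong = combined_loss([3.0, 0.0], conf_logit=2.0, label=1)
```

Because the confidence target depends on whether the count head is right, a count head that is often wrong gives the confidence head a noisy training signal.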

Still pending:
* `run` subcommand wiring (long-running polling loop, same as pose)
* Cross-compile + sign + GCS upload (mirror of pose cog pipeline)
* Live install on cognitum-v0
* v0.2.0: re-train on multi-room data, LoRA per-room adapters,
  Stoer-Wagner min-cut clip in fusion stage

2026-05-21 18:56:52 -04:00

2 Commits