Files
ruvnet--RuView/benchmarks/wiflow-std/tiny_edge_bench.py
T
rUv 17471e93ff ADR-152: WiFi-Pose SOTA 2026 intake — WiFlow-STD benchmark, Rust integrations, ADR-153 802.11bf layer, efficiency frontier (#1008)
* feat(calibration): NodeGeometry transceiver-geometry recording (ADR-152 §2.1.1)

PerceptAlign-motivated geometry capture at enrollment: per-node optional
records (position, antenna orientation, inter-node distances, acquisition
method) — recorded when known, never required. Event-sourced via
EnrollmentEvent::GeometryRecorded (latest recording wins); persisted on
SpecialistBank with serde defaults so pre-ADR-152 bank JSON loads cleanly
(fixture-proven, and geometry-free banks serialize byte-shape-identical
to the old schema); threaded through MultiNodeMixture as data only — the
learned geometry embeddings and algorithmic fusion use are §2.1.2,
deliberately deferred until the ADR-151 P6 LoRA heads exist.

Geometry recorded from now on means banks captured today remain usable
for layout-conditioned training later — you can't retroactively add
geometry to data you didn't record.

8 new tests (3 geometry, 2 anchor, 2 bank, 1 multistatic) + full-loop
extension (2-node geometry, one tape-measured + one unknown, surviving
the bank JSON round-trip the runtime loads from). 50/50 calibration
(both feature configs) + 23 CLI tests green.

Co-Authored-By: RuFlo <ruv@ruv.net>

* feat(training): two-checkerboard camera↔room calibration for ADR-079 labels (ADR-152 §2.1.3)

Defends the camera-supervised pipeline against PerceptAlign's
"coordinate overfitting": MediaPipe keypoints were emitted in raw camera
coordinates with no shared frame and no transceiver-geometry metadata —
the exact label shape that memorizes deployment layout and collapses
cross-layout.

- scripts/calibrate-camera-room.py + calibration_lib.py: OpenCV
  two-checkerboard calibration → versioned bundle JSON (intrinsics,
  camera→room extrinsics, checkerboard spec, transceiver geometry,
  sha256 calibration_id). Intrinsics resolve from file > cache >
  multi-view computation > loud-warning 2-view fallback.
- collect-ground-truth.py --calibration <bundle>: every sample gains
  keypoints_room (unit bearing rays from the camera center in the room
  frame — documented projective alignment; raw image coords preserved
  so training chooses), camera_origin_room, calibration_id, and the
  transceiver geometry stamp. Without the flag, output is byte-identical
  to before (tested) + a one-line ADR-152 warning.

Design finding (recorded for ADR-152): a single planar checkerboard's
corner grid is centrosymmetric — the reversed corner ordering fits a
ghost camera pose with IDENTICAL reprojection error, so per-board flip
disambiguation is mathematically ill-posed. solve_two_board_extrinsics
solves the joint wall+floor set over all 4 flip combinations, where the
minimum is unique — an independent reason the TWO-checkerboard method is
required, beyond what PerceptAlign states.

15 headless pytest tests green (synthetic corners: extrinsics recovery
incl. ghost resolution, bundle round-trip + hash stability, ray
transforms w/ distortion + cross-resolution, no-calibration byte
identity).

Co-Authored-By: RuFlo <ruv@ruv.net>

* feat(benchmarks): WiFlow-STD reproduction harness + measurement (a) results (ADR-152 §2.2)

Shipped checkpoint REFUTED (0.08% PCK@20, wrong keypoint normalization);
6 reproducibility defects documented (broken imports, corrupted dataset
tail with float32-max garbage that NaN-poisons fp16 BatchNorm, unreachable
test phase). After repairs, retraining with upstream defaults reproduces
96.09% PCK@20 full-test / 96.61% corruption-free (published 97.25%) on
RTX 5080. Claims graded MEASURED-EQUIVALENT; 2.23M params + ~0.055 GFLOPs
verified. Third-party code/weights/data stay out of tree (gitignored).

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat: ADR-152 Rust integrations + ADR-153 802.11bf protocol model

- calibration: GeometryEmbedding — 32-slot permutation-invariant NodeGeometry
  featurization for future LoRA-head conditioning (ADR-152 §2.1.2); derived
  SpecialistBank::geometry_embedding() accessor; 59 tests
- train: MaePretrainConfig + patchify/random-mask with UNSW measured recipe
  (80% masking, (30,3) patches; ADR-152 §2.3, arXiv 2511.18792); strict
  no-truncate/no-NaN policy; proptest properties
- train: WiFlowStdModel — tch-gated port of the verified ~96%-PCK@20
  WiFlow-STD architecture (ADR-152 §2.2 beyond-SOTA); ungated param formula
  pinned to 2,225,042; 15/17-keypoint support; 239 crate tests
- hardware: ieee80211bf forward-compatibility protocol model (ADR-153):
  SpecProfile gates, SensingCapabilities negotiation, required ConsentMode,
  session FSM, SensingTransport + SimTransport + OpportunisticCsiBridge;
  full acceptance checklist covered; 156+4 tests
- deps: ruvector bumps per ADR-152 §2.6 survey (mincut/solver 2.0.6,
  attention 2.1.0, gnn 2.2.0); vendor/ruvector synced to a083bd77f
- docs: ADR-153 accepted; ADR-152 §2.2 status, §2.4 amendment, §2.6 added

Workspace: 162 test suites green (--no-default-features); Python proof PASS.
Known pre-existing flake: homecore-api env_empty_falls_back_to_defaults
(unserialized env-var mutation) — untouched, follow-up.

Co-Authored-By: claude-flow <ruv@ruv.net>

* docs: CHANGELOG + CLAUDE.md entries for ADR-152 integrations and ADR-153

Co-Authored-By: claude-flow <ruv@ruv.net>

* fix(train): repair tch-backend bit-rot — gated path compiles and tests run again

Mechanical API refresh against current tch: Vec::from(Tensor) -> try_from
(+ explicit flatten), numel() usize cast, Rem/div ops -> remainder() /
divide_scalar_mode(floor) — the latter fixed a silent true-division bug in
heatmap argmax decoding; clamp(1.0, f64::MAX) -> clamp_min (torch 2.x scalar
overflow panic); petgraph EdgeRef import; missing EvalMetrics and
verify_checkpoint_dir APIs that tests documented. wiflow_std roundtrip test
uses safetensors (.pt _save_parameters roundtrip broken in torch 2.11
Windows). Gated: 349 passed (incl. all 20 wiflow_std); ungated: unchanged.
Known pre-existing: gaussian-heatmap convention mismatch (2 tests), proof
seed race under parallel threads — documented, deliberate follow-ups.

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(train): WiFlow-STD PyTorch->tch weight import + numerical parity proof

export_to_safetensors.py maps the retrained checkpoint (295 tensors -> 248
mapped, param sum exactly 2,225,042; num_batches_tracked dropped) into a
tch-loadable safetensors plus a deterministic parity fixture. Gated #[ignore]
integration test loads it strictly and asserts forward-pass agreement:
max abs diff 1.192e-7 on the seed-42 fixture. dump_variable_names test makes
the tch name layout authoritative. Zero architecture discrepancies found.

Co-Authored-By: claude-flow <ruv@ruv.net>

* fix: workflow-review findings — BN gamma init, ThresholdParams serde, init docs

Concurrent validation workflow (2 review lanes + adversarial verification,
13 agents): 5 confirmed findings, 3 refuted. Fixes:
- wiflow_std: pin BatchNorm gamma to 1.0 (tch default draws Uniform(0,1) —
  silently halves activations in from-scratch training; loaded checkpoints
  unaffected, parity re-verified after the change)
- wiflow_std: document the conv-init divergences vs the reference's
  effective kaiming_normal(fan_out) re-init (from-scratch dynamics only)
- ieee80211bf: ThresholdParams deserialization validates via try_from so
  the <=100 invariant holds for untrusted payloads (+ rejection test)

Benchmarks (release, ruvzen): GeometryEmbedding 1.84us/call (542k/s),
MAE tokenization 7.38us/window (135k/s), 802.11bf FSM 8.9M events/s —
nothing suspicious.

Co-Authored-By: claude-flow <ruv@ruv.net>

* docs(adr): ADR-152 §2.1.4 gate resolved — PerceptAlign repo MIT, dataset on HF

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(benchmarks): edge optimization measured + measurement (b) blocked + 92.9% retraction

Edge optimization (ADR-152 optimize track): ONNX Runtime fp32 is the CPU
latency win (3.2 ms/window, ~3.4x faster than torch, parity 2.4e-7); ORT
dynamic int8 reaches 2.44 MB (paper's ~2.2 MB claim plausible only via
conv-capable toolchains; -0.16pt PCK@20, +18% MPJPE, 2x slower); torch
dynamic quant converts 0% of this conv-only model; fp16 halves storage free
but is slower on CPU.

Measurement (b) BLOCKED-ON-DATA: only 1,077 paired ESP32 windows exist
(stop rule <2k). Forensic recheck of the surviving April holdout RETRACTS
the ADR-079 '92.9% PCK@20' figure: constant-output model, absolute (not
torso) threshold, 69 near-static frames — mean predictor scores 100% under
that protocol; torso-PCK@20 is 19.1%. Corroborates PR #535. Stale citations
removed from user-guide, readme-details, ADR-152 §2.1.3; no-citation rule
extended to ADR-079 accuracy claims. Unblock: >=2k-window multi-pose paired
session + torso-PCK re-baseline.

Co-Authored-By: claude-flow <ruv@ruv.net>

* docs(user-guide): corrected camera-supervised collection tutorial

Step 0 CSI-rate check + session-length math (window yield = frames/20 —
the May session's 8x under-delivery was a ~12 Hz CSI rate, not an aligner
bug); two-checkerboard calibration step (ADR-152 §2.1.3); pose-variety and
confidence guidance; torso-normalized PCK + temporal-split + pred-variance
eval protocol (lessons from the 92.9% retraction); scale presets re-keyed
to realistic window counts.

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(benchmarks): static PTQ int8 (calibrated) results + overnight capture script

Conv-only static QDQ beats dynamic int8 on accuracy (PCK@20 96.61-96.63%
vs 96.52%, MPJPE +10% vs +18% over fp32) at ~equal size/latency; all-ops
QDQ strictly worse (int8 activations through attention glue). Entropy
calibration verified bit-identical to MinMax on this data. Deployment:
ONNX fp32 for speed (3.2ms), static conv-only QDQ for smallest (2.53MB).

Also: scripts/overnight-empty-capture.py — segmented UDP CSI recorder for
empty-room baselines (no glob collisions, detach-safe).

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(benchmarks): measurement (b) MEASURED — optimization transfer only, mean-pose baseline wins

WiFlow-STD fine-tuned on 2,046 fresh single-room ESP32 paired windows
(temporal 70/15/15, 70->540 adapter, K=17): pretrained-init 65% PCK@20 vs
scratch 0% (optimization transfer) but frozen-trunk ~0% (no feature
transfer), and NOTHING beats the mean-pose baseline (95.9% PCK@20 —
single subject, near-static normalized coords). Honesty gates held: pred
std 0.0113 (non-constant model) but mean-baseline dominance means no
citable CSI->pose capability from this data. ADR-152 open question 1
answered partially; definitive answer needs multi-subject/position data.

Two new aligner findings: heterogeneous csi_shape with silent zero-padding
(~20%), and extractCsiMatrix's transposed shape label (frame-major data,
[nSc, nFrames] label) — fixes pending.

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(benchmarks): efficiency sweep MEASURED — half model dominates full reference

Compact WiFlow-STD variants on the same data/split/protocol: half (843,834
params, 0.38x) strictly dominates the 2.23M reference (PCK@20 96.62 vs
96.61, PCK@50 99.47 vs 99.11, MPJPE 0.00898 vs 0.0094) — the published
architecture is over-parameterized for its own benchmark. quarter (338k)
96.05%; tiny (56,290 params, 1/39.5) holds 94.11% — a ~220KB fp32 edge
candidate. In-domain caveats recorded; cross-domain untested.

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(train): compact WiFlow-STD presets in Rust + tiny edge artifact (ADR-152)

WiFlowStdConfig gains half()/quarter()/tiny() mirroring the overnight sweep
exactly: TcnGroupsMode (Fixed/Gcd/Depthwise), input_pw_groups, derived
stride schedule and decoder-mid (all default to upstream behavior; legacy
serde JSON unaffected). Param formulas pin to trained ground truth first
try: 843,834 / 338,600 / 56,290; default 2,225,042 pin and 1.192e-7 parity
unchanged. 248 tests green.

Tiny edge artifact (tiny_edge_bench.py): ONNX fp32 = 295 KB, 0.66 ms/win
(~1,500/s CPU), 94.11% PCK@20 (matches sweep clean-test exactly; parity
1.49e-7). Static int8 is a bad trade at this scale (-1.43pt, +19% MPJPE,
-16% size, slower) — recorded as negative result. Export note: width-16
breaks AdaptiveAvgPool((15,1)) TorchScript export; replaced by exact
mean+matmul equivalent, proven by parity.

Co-Authored-By: claude-flow <ruv@ruv.net>

* fix: resolve all 10 confirmed code-review findings (7-angle review, 20/20 verified)

wiflow_std: min_feature_width (default 15) replaces the keypoints->stride
coupling — for_keypoints(17) now provably builds the trained [2,2,2,2]
graph and pools 15->17, matching the validated Python protocol (pinned by
tests); param_count() total on invalid configs; random_mask returns Result
and rejects non-finite/out-of-range ratios; trainer checkpoints switched
to safetensors (.pt VarStore roundtrip broken on Windows torch 2.11).

ieee80211bf: SBP proxy now re-triggers instances and relays reports via
Action::RelaySbpReport -> SensingFrame::SbpReport (clients consume via
their existing path); missed_instances reset on success = consecutive
semantics; SessionTable gains a guarded SBP entry point + unknown-id drop
counter; initiator-role sessions reject inbound setup/SBP requests
(RejectedNotSupported) closing the idle hijack; StartSetup/StartSbp
outside Idle return InvalidStateForCommand; SBP validation unified
through evaluate_setup with a 1:1 SetupStatus->SbpStatus mapping.
events.rs split out to honor the 500-line cap.

calibration/cli: enrollment geometry now actually reaches trained banks —
both production call sites attach .with_geometry; --geometry flag on
train-room and POST /enroll/geometry + train-body geometry on
calibrate-serve give production a recording surface; geometry-free banks
log the ADR-152 §2.1.2 note.

benchmarks: corruption masks committed as ground truth (unregenerable
after in-place cleaning; verified bit-identical regeneration from the
pristine copy) + generate_corruption_masks.py producer; _bench_common.py
dedups the 5x-copied shim/evaluate/seed/remap (post-refactor PCK@20
re-verified equal to the last digit); remote scripts get the mmap patch;
tiny_edge --calib validated multiple-of-64; onnx_bench --help no longer
executes (and overwrote) the export — artifact restored byte-exact.

Workspace: 2,963 tests passed, 0 failed; Python proof PASS.

Co-Authored-By: claude-flow <ruv@ruv.net>

* ci: build workspace tests without debuginfo — runner disk exhaustion

The combined 38-crate debug target exceeds the GitHub runner's disk
('final link failed: No space left on device'); the same tree measured
151GB locally with full debuginfo. CARGO_PROFILE_{DEV,TEST}_DEBUG=0
shrinks the target ~5-10x; debuginfo serves no purpose in CI test runs.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-06-11 17:02:23 -04:00

314 lines
14 KiB
Python

"""ADR-152 efficiency-sweep follow-up: edge pipeline for the TINY compact
WiFlow-STD variant (56,290 params, results/tiny_best.pth, trained overnight
2026-06-10/11 -- see RESULTS.md "Efficiency sweep").
Headline question: what does the smallest deployable WiFlow-class model look
like (KB + ms + PCK)? Reuses the onnx_bench.py / static_ptq_bench.py
machinery on the tiny checkpoint:
1. Load tiny_best.pth with remote/sweep/model_compact.py
(depthwise TCN groups, input_pw_groups=4, conv [2,4,8,16], attn groups 2).
2. Export ONNX: dynamic batch, opset 17, TorchScript exporter (dynamo=False)
-- same recipe that worked for the full model; verified at batch 1/2/64.
One forced deviation: tiny's stride schedule [2,1,1,1] leaves final_width
16, and the TorchScript exporter cannot export AdaptiveAvgPool2d((15,1))
when 15 is not a factor of the input height (the full model never hit
this -- its width was exactly 15). The adaptive pool over a fixed-size
feature map is a fixed linear map, so the export wrapper replaces it with
an exact matmul equivalent (PyTorch adaptive-pool bin semantics:
bin i averages rows floor(i*H/K)..ceil((i+1)*H/K)); the W axis (20->1,
a factor) becomes mean(-1). Exactness is proven by the parity check
below, which compares against the ORIGINAL torch model with the real
AdaptiveAvgPool2d.
3. Torch-vs-ORT parity on the stored fixture input
(results/parity_fixture.npz, batch 2, seed 42 -- same 540x20 input layout;
reference output recomputed with the tiny torch model). PASS < 1e-4.
4. Static QDQ conv-only int8 (quant_pre_process + quantize_static,
per-channel QInt8 weights+activations, Percentile(99.99) calibration on
512 corruption-free TRAIN-split windows -- the winning recipe and
calibration count from static_ptq_bench.py. 512, not "about 500":
ORT 1.26's histogram collector np.asarray()'s the per-batch maxima, so
the calibration count must be a multiple of the batch size 64 or the
ragged last batch crashes it).
5. Disk size + CPU latency b1/b64 (3 interleaved reps, median ms/window)
for tiny fp32 + tiny int8, with the full-model ONNX fp32 + static-int8
sessions interleaved as same-session references.
6. Accuracy (PCK@20/50 + MPJPE) on the identical 10k-window seed-42
corruption-free test subset for tiny fp32 + tiny int8.
Usage:
PYTHONUTF8=1 .venv/Scripts/python.exe tiny_edge_bench.py \
[--data-dir <preprocessed_csi_data>] [--subset 10000] [--calib 512]
(--calib must be a multiple of 64; see step 4 above)
Writes/merges into results/edge_optimization.json under key "tiny_variant".
"""
import argparse
import json
import os
import platform
import sys
import time
import numpy as np
import torch
HERE = os.path.dirname(os.path.abspath(__file__))
RESULTS = os.path.join(HERE, "results")
sys.path.insert(0, HERE)
sys.path.insert(0, os.path.join(HERE, "remote", "sweep"))
# quantize_bench sets up upstream imports + the np.load mmap patch
from quantize_bench import build_test_subset # noqa: E402
from eval_ort_accuracy import evaluate_ort # noqa: E402
from static_ptq_bench import ( # noqa: E402
build_calibration_windows,
interleaved_latency,
make_reader,
ort_session,
)
from model_compact import CompactWiFlowPoseModel, describe # noqa: E402
TINY_CKPT = os.path.join(RESULTS, "tiny_best.pth")
TINY_FP32_ONNX = os.path.join(RESULTS, "tiny_fp32_dynamic.onnx")
TINY_PREPROC_ONNX = os.path.join(RESULTS, "tiny_fp32_preproc.onnx")
TINY_INT8_ONNX = os.path.join(RESULTS, "tiny_int8_static_percentile_conv.onnx")
FULL_FP32_ONNX = os.path.join(RESULTS, "retrained_fp32_dynamic.onnx")
FULL_INT8_ONNX = os.path.join(RESULTS, "retrained_int8_static_percentile_conv.onnx")
# Exact tiny config from remote/sweep/run_sweep.py VARIANTS (measured 56,290
# params, clean-test PCK@20 94.11% -- results/efficiency_sweep.jsonl).
TINY = dict(tcn=[68, 56, 44, 32], conv=[2, 4, 8, 16], attn_groups=2,
groups_mode="depthwise", input_pw_groups=4)
def load_tiny_model():
model = CompactWiFlowPoseModel(
tcn_channels=TINY["tcn"], conv_channels=TINY["conv"],
attn_groups=TINY["attn_groups"], groups_mode=TINY["groups_mode"],
input_pw_groups=TINY["input_pw_groups"], dropout=0.5)
state = torch.load(TINY_CKPT, map_location="cpu", weights_only=True)
model.load_state_dict(state, strict=True)
model.eval()
return model
def adaptive_pool_matrix(h_in, h_out):
"""Exact AdaptiveAvgPool1d as a (h_out, h_in) averaging matrix, using
PyTorch's bin rule: bin i covers rows floor(i*h_in/h_out) ..
ceil((i+1)*h_in/h_out)."""
w = torch.zeros(h_out, h_in)
for i in range(h_out):
s = (i * h_in) // h_out
e = -((-(i + 1) * h_in) // h_out) # ceil division
w[i, s:e] = 1.0 / (e - s)
return w
class ExportWrapper(torch.nn.Module):
"""CompactWiFlowPoseModel forward with the AdaptiveAvgPool2d((K,1))
replaced by an exact fixed linear map (mean over the factor W axis, then
a constant averaging matmul over the non-factor H axis) so the
TorchScript ONNX exporter accepts it. Bit-equivalent up to float
round-off; proven by the parity check against the original model."""
def __init__(self, m, num_keypoints=15):
super().__init__()
self.m = m
self.register_buffer(
"pool_w_t", adaptive_pool_matrix(m.final_width, num_keypoints).t())
def forward(self, x):
m = self.m
x = m.tcn(x)
x = x.transpose(1, 2).unsqueeze(1)
x = m.up(x)
for block in m.residual_blocks:
x = block(x)
x = x.permute(0, 1, 3, 2)
x = m.attention(x)
x = m.decoder(x) # [B, 2, H=final_width, T=20]
x = x.mean(-1) # W-axis pool (20 -> 1, a factor)
x = x.matmul(self.pool_w_t) # exact adaptive H pool: [B, 2, K]
return x.transpose(1, 2) # [B, K, 2]
def export_onnx(model):
"""Dynamic-batch TorchScript export (the recipe that worked for the full
model in onnx_bench.py), verified at batch 1/2/64. Uses ExportWrapper
(see docstring) because final_width 16 is not a multiple of 15."""
wrapper = ExportWrapper(model).eval()
x = torch.rand(2, 540, 20)
with torch.no_grad():
torch.onnx.export(
wrapper, (x,), TINY_FP32_ONNX, opset_version=17,
input_names=["input"], output_names=["output"], dynamo=False,
dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}})
sess = ort_session(TINY_FP32_ONNX)
inp = sess.get_inputs()[0].name
for b in (1, 2, 64):
y = sess.run(None, {inp: np.zeros((b, 540, 20), dtype=np.float32)})[0]
assert y.shape == (b, 15, 2), y.shape
return {
"mode": "dynamic-batch", "exporter": "torchscript", "opset": 17,
"file": os.path.basename(TINY_FP32_ONNX),
"size_bytes": os.path.getsize(TINY_FP32_ONNX),
"size_mb": os.path.getsize(TINY_FP32_ONNX) / 1e6,
"verified_batches": [1, 2, 64],
"note": "AdaptiveAvgPool2d((15,1)) replaced at export by an exact "
"mean(-1) + constant averaging matmul (final_width 16 is not "
"a multiple of 15, which the TorchScript exporter rejects); "
"exactness proven by the parity check vs the original torch "
"model",
}
def quantize_tiny(calib_windows):
"""quant_pre_process + static QDQ conv-only Percentile(99.99) int8 --
the winning recipe from static_ptq_bench.py."""
from onnxruntime.quantization import (CalibrationMethod, QuantFormat,
QuantType, quantize_static)
from onnxruntime.quantization.shape_inference import quant_pre_process
quant_pre_process(TINY_FP32_ONNX, TINY_PREPROC_ONNX)
t0 = time.time()
quantize_static(
TINY_PREPROC_ONNX, TINY_INT8_ONNX, make_reader(calib_windows),
quant_format=QuantFormat.QDQ,
op_types_to_quantize=["Conv"],
per_channel=True,
activation_type=QuantType.QInt8,
weight_type=QuantType.QInt8,
calibrate_method=CalibrationMethod.Percentile,
extra_options={"CalibPercentile": 99.99},
)
return {
"file": os.path.basename(TINY_INT8_ONNX),
"size_bytes": os.path.getsize(TINY_INT8_ONNX),
"size_mb": os.path.getsize(TINY_INT8_ONNX) / 1e6,
"calibration": {"method": "percentile", "percentile": 99.99,
"windows": int(len(calib_windows)),
"scope": "conv-only TRAIN-split corruption-free",
"seconds": time.time() - t0},
"per_channel": True,
"activation_type": "QInt8",
"weight_type": "QInt8",
}
def main():
import onnxruntime
parser = argparse.ArgumentParser()
parser.add_argument("--data-dir", default=os.path.join(
os.path.expanduser("~"), ".cache", "kagglehub", "datasets", "kaka2434",
"wiflow-dataset", "versions", "1", "preprocessed_csi_data"))
parser.add_argument("--subset", type=int, default=10000)
parser.add_argument("--calib", type=int, default=512,
help="calibration windows; must be a multiple of the "
"64-window calibration batch (ORT histogram "
"collector rejects ragged batches)")
parser.add_argument("--skip-accuracy", action="store_true")
parser.add_argument("--out", default=os.path.join(RESULTS, "edge_optimization.json"))
args = parser.parse_args()
if args.calib % 64 != 0:
parser.error(
f"--calib must be a multiple of 64 (got {args.calib}): ORT 1.26's "
f"histogram calibration collector np.asarray()'s the per-batch "
f"maxima and crashes on a ragged final batch (calibration batch "
f"size is 64)")
model = load_tiny_model()
info = describe(model)
print(f"tiny model: {info['params']:,} params, tcn_groups={info['tcn_groups_per_block']}, "
f"strides={info['conv_strides']}, final_width={info['final_width']}")
assert info["params"] == 56290, info["params"]
results = {
"env": {
"torch": torch.__version__,
"onnxruntime": onnxruntime.__version__,
"platform": platform.platform(),
"num_threads": torch.get_num_threads(),
"checkpoint": os.path.relpath(TINY_CKPT, HERE),
"checkpoint_size_bytes": os.path.getsize(TINY_CKPT),
"params": info["params"],
"variant_config": TINY,
},
}
# ---- export + parity ----------------------------------------------------
print("\n=== ONNX export (dynamic batch, opset 17, torchscript) ===")
results["export"] = export_onnx(model)
print(f" {results['export']['size_mb']:.3f} MB, batches {results['export']['verified_batches']} OK")
fixture = np.load(os.path.join(RESULTS, "parity_fixture.npz"))
fx = fixture["input"] # (2, 540, 20), seed 42 -- same input layout as full model
sess_fp32 = ort_session(TINY_FP32_ONNX)
y_ort = sess_fp32.run(None, {sess_fp32.get_inputs()[0].name: fx})[0]
with torch.no_grad():
y_torch = model(torch.from_numpy(fx)).numpy()
results["parity"] = {
"fixture": "results/parity_fixture.npz input (batch 2, seed 42); "
"reference output recomputed with the tiny torch model",
"max_abs_diff_vs_torch": float(np.abs(y_ort - y_torch).max()),
"pass_lt_1e-4": bool(np.abs(y_ort - y_torch).max() < 1e-4),
}
print("parity:", json.dumps(results["parity"], indent=2))
assert results["parity"]["pass_lt_1e-4"], "torch-vs-ORT parity FAILED"
# ---- static PTQ int8 ------------------------------------------------------
print(f"\n=== static QDQ int8 (Percentile conv-only, {args.calib} calib windows) ===")
calib = build_calibration_windows(args.data_dir, args.calib)
results["int8_static_percentile_conv"] = quantize_tiny(calib)
print(f" {results['int8_static_percentile_conv']['size_mb']:.3f} MB")
sess_int8 = ort_session(TINY_INT8_ONNX)
yq = sess_int8.run(None, {sess_int8.get_inputs()[0].name: fx})[0]
results["int8_static_percentile_conv"]["max_abs_diff_vs_fp32_fixture"] = float(
np.abs(yq - y_torch).max())
# ---- latency (3 interleaved reps, full-model sessions as references) -----
print("\n=== latency (3 interleaved reps) ===")
lat_sessions = {
"tiny_onnx_fp32": sess_fp32,
"tiny_onnx_int8_static_percentile_conv": sess_int8,
"full_onnx_fp32_reference": ort_session(FULL_FP32_ONNX),
"full_onnx_int8_static_percentile_conv_reference": ort_session(FULL_INT8_ONNX),
}
results["latency"] = {
"note": "3 interleaved repetitions per variant, median ms/window; "
"full-model sessions are same-session references",
**interleaved_latency(lat_sessions),
}
# ---- accuracy on the standard 10k corruption-free test subset ------------
if not args.skip_accuracy:
loader, n_clean = build_test_subset(args.data_dir, args.subset)
results["accuracy_subset"] = {
"description": "seed-42 file-level 70/15/15 test split, corrupted "
"windows excluded, seed-42 random subset (same as "
"quantize_bench/eval_ort_accuracy/static_ptq_bench)",
"subset_size": min(args.subset, n_clean) if args.subset else n_clean,
}
results["accuracy"] = {}
for name, sess in (("tiny_onnx_fp32", sess_fp32),
("tiny_onnx_int8_static_percentile_conv", sess_int8)):
print(f"\n=== accuracy: {name} ===")
results["accuracy"][name] = evaluate_ort(sess, loader, name)
print(json.dumps(results["accuracy"][name], indent=2))
# ---- merge into edge_optimization.json -----------------------------------
merged = {}
if os.path.exists(args.out):
with open(args.out) as f:
merged = json.load(f)
merged["tiny_variant"] = results
with open(args.out, "w") as f:
json.dump(merged, f, indent=2)
print(f"\nwrote {args.out}")
if __name__ == "__main__":
main()