mirror of
https://github.com/ruvnet/RuView
synced 2026-06-19 11:53:19 +00:00
17471e93ff
* feat(calibration): NodeGeometry transceiver-geometry recording (ADR-152 §2.1.1) PerceptAlign-motivated geometry capture at enrollment: per-node optional records (position, antenna orientation, inter-node distances, acquisition method) — recorded when known, never required. Event-sourced via EnrollmentEvent::GeometryRecorded (latest recording wins); persisted on SpecialistBank with serde defaults so pre-ADR-152 bank JSON loads cleanly (fixture-proven, and geometry-free banks serialize byte-shape-identical to the old schema); threaded through MultiNodeMixture as data only — the learned geometry embeddings and algorithmic fusion use are §2.1.2, deliberately deferred until the ADR-151 P6 LoRA heads exist. Geometry recorded from now on means banks captured today remain usable for layout-conditioned training later — you can't retroactively add geometry to data you didn't record. 8 new tests (3 geometry, 2 anchor, 2 bank, 1 multistatic) + full-loop extension (2-node geometry, one tape-measured + one unknown, surviving the bank JSON round-trip the runtime loads from). 50/50 calibration (both feature configs) + 23 CLI tests green. Co-Authored-By: RuFlo <ruv@ruv.net> * feat(training): two-checkerboard camera↔room calibration for ADR-079 labels (ADR-152 §2.1.3) Defends the camera-supervised pipeline against PerceptAlign's "coordinate overfitting": MediaPipe keypoints were emitted in raw camera coordinates with no shared frame and no transceiver-geometry metadata — the exact label shape that memorizes deployment layout and collapses cross-layout. - scripts/calibrate-camera-room.py + calibration_lib.py: OpenCV two-checkerboard calibration → versioned bundle JSON (intrinsics, camera→room extrinsics, checkerboard spec, transceiver geometry, sha256 calibration_id). Intrinsics resolve from file > cache > multi-view computation > loud-warning 2-view fallback. 
- collect-ground-truth.py --calibration <bundle>: every sample gains keypoints_room (unit bearing rays from the camera center in the room frame — documented projective alignment; raw image coords preserved so training chooses), camera_origin_room, calibration_id, and the transceiver geometry stamp. Without the flag, output is byte-identical to before (tested) + a one-line ADR-152 warning. Design finding (recorded for ADR-152): a single planar checkerboard's corner grid is centrosymmetric — the reversed corner ordering fits a ghost camera pose with IDENTICAL reprojection error, so per-board flip disambiguation is mathematically ill-posed. solve_two_board_extrinsics solves the joint wall+floor set over all 4 flip combinations, where the minimum is unique — an independent reason the TWO-checkerboard method is required, beyond what PerceptAlign states. 15 headless pytest tests green (synthetic corners: extrinsics recovery incl. ghost resolution, bundle round-trip + hash stability, ray transforms w/ distortion + cross-resolution, no-calibration byte identity). Co-Authored-By: RuFlo <ruv@ruv.net> * feat(benchmarks): WiFlow-STD reproduction harness + measurement (a) results (ADR-152 §2.2) Shipped checkpoint REFUTED (0.08% PCK@20, wrong keypoint normalization); 6 reproducibility defects documented (broken imports, corrupted dataset tail with float32-max garbage that NaN-poisons fp16 BatchNorm, unreachable test phase). After repairs, retraining with upstream defaults reproduces 96.09% PCK@20 full-test / 96.61% corruption-free (published 97.25%) on RTX 5080. Claims graded MEASURED-EQUIVALENT; 2.23M params + ~0.055 GFLOPs verified. Third-party code/weights/data stay out of tree (gitignored). 
Co-Authored-By: claude-flow <ruv@ruv.net> * feat: ADR-152 Rust integrations + ADR-153 802.11bf protocol model - calibration: GeometryEmbedding — 32-slot permutation-invariant NodeGeometry featurization for future LoRA-head conditioning (ADR-152 §2.1.2); derived SpecialistBank::geometry_embedding() accessor; 59 tests - train: MaePretrainConfig + patchify/random-mask with UNSW measured recipe (80% masking, (30,3) patches; ADR-152 §2.3, arXiv 2511.18792); strict no-truncate/no-NaN policy; proptest properties - train: WiFlowStdModel — tch-gated port of the verified ~96%-PCK@20 WiFlow-STD architecture (ADR-152 §2.2 beyond-SOTA); ungated param formula pinned to 2,225,042; 15/17-keypoint support; 239 crate tests - hardware: ieee80211bf forward-compatibility protocol model (ADR-153): SpecProfile gates, SensingCapabilities negotiation, required ConsentMode, session FSM, SensingTransport + SimTransport + OpportunisticCsiBridge; full acceptance checklist covered; 156+4 tests - deps: ruvector bumps per ADR-152 §2.6 survey (mincut/solver 2.0.6, attention 2.1.0, gnn 2.2.0); vendor/ruvector synced to a083bd77f - docs: ADR-153 accepted; ADR-152 §2.2 status, §2.4 amendment, §2.6 added Workspace: 162 test suites green (--no-default-features); Python proof PASS. Known pre-existing flake: homecore-api env_empty_falls_back_to_defaults (unserialized env-var mutation) — untouched, follow-up. 
Co-Authored-By: claude-flow <ruv@ruv.net> * docs: CHANGELOG + CLAUDE.md entries for ADR-152 integrations and ADR-153 Co-Authored-By: claude-flow <ruv@ruv.net> * fix(train): repair tch-backend bit-rot — gated path compiles and tests run again Mechanical API refresh against current tch: Vec::from(Tensor) -> try_from (+ explicit flatten), numel() usize cast, Rem/div ops -> remainder() / divide_scalar_mode(floor) — the latter fixed a silent true-division bug in heatmap argmax decoding; clamp(1.0, f64::MAX) -> clamp_min (torch 2.x scalar overflow panic); petgraph EdgeRef import; missing EvalMetrics and verify_checkpoint_dir APIs that tests documented. wiflow_std roundtrip test uses safetensors (.pt _save_parameters roundtrip broken in torch 2.11 Windows). Gated: 349 passed (incl. all 20 wiflow_std); ungated: unchanged. Known pre-existing: gaussian-heatmap convention mismatch (2 tests), proof seed race under parallel threads — documented, deliberate follow-ups. Co-Authored-By: claude-flow <ruv@ruv.net> * feat(train): WiFlow-STD PyTorch->tch weight import + numerical parity proof export_to_safetensors.py maps the retrained checkpoint (295 tensors -> 248 mapped, param sum exactly 2,225,042; num_batches_tracked dropped) into a tch-loadable safetensors plus a deterministic parity fixture. Gated #[ignore] integration test loads it strictly and asserts forward-pass agreement: max abs diff 1.192e-7 on the seed-42 fixture. dump_variable_names test makes the tch name layout authoritative. Zero architecture discrepancies found. Co-Authored-By: claude-flow <ruv@ruv.net> * fix: workflow-review findings — BN gamma init, ThresholdParams serde, init docs Concurrent validation workflow (2 review lanes + adversarial verification, 13 agents): 5 confirmed findings, 3 refuted. 
Fixes: - wiflow_std: pin BatchNorm gamma to 1.0 (tch default draws Uniform(0,1) — silently halves activations in from-scratch training; loaded checkpoints unaffected, parity re-verified after the change) - wiflow_std: document the conv-init divergences vs the reference's effective kaiming_normal(fan_out) re-init (from-scratch dynamics only) - ieee80211bf: ThresholdParams deserialization validates via try_from so the <=100 invariant holds for untrusted payloads (+ rejection test) Benchmarks (release, ruvzen): GeometryEmbedding 1.84us/call (542k/s), MAE tokenization 7.38us/window (135k/s), 802.11bf FSM 8.9M events/s — nothing suspicious. Co-Authored-By: claude-flow <ruv@ruv.net> * docs(adr): ADR-152 §2.1.4 gate resolved — PerceptAlign repo MIT, dataset on HF Co-Authored-By: claude-flow <ruv@ruv.net> * feat(benchmarks): edge optimization measured + measurement (b) blocked + 92.9% retraction Edge optimization (ADR-152 optimize track): ONNX Runtime fp32 is the CPU latency win (3.2 ms/window, ~3.4x faster than torch, parity 2.4e-7); ORT dynamic int8 reaches 2.44 MB (paper's ~2.2 MB claim plausible only via conv-capable toolchains; -0.16pt PCK@20, +18% MPJPE, 2x slower); torch dynamic quant converts 0% of this conv-only model; fp16 halves storage free but is slower on CPU. Measurement (b) BLOCKED-ON-DATA: only 1,077 paired ESP32 windows exist (stop rule <2k). Forensic recheck of the surviving April holdout RETRACTS the ADR-079 '92.9% PCK@20' figure: constant-output model, absolute (not torso) threshold, 69 near-static frames — mean predictor scores 100% under that protocol; torso-PCK@20 is 19.1%. Corroborates PR #535. Stale citations removed from user-guide, readme-details, ADR-152 §2.1.3; no-citation rule extended to ADR-079 accuracy claims. Unblock: >=2k-window multi-pose paired session + torso-PCK re-baseline. 
Co-Authored-By: claude-flow <ruv@ruv.net> * docs(user-guide): corrected camera-supervised collection tutorial Step 0 CSI-rate check + session-length math (window yield = frames/20 — the May session's 8x under-delivery was a ~12 Hz CSI rate, not an aligner bug); two-checkerboard calibration step (ADR-152 §2.1.3); pose-variety and confidence guidance; torso-normalized PCK + temporal-split + pred-variance eval protocol (lessons from the 92.9% retraction); scale presets re-keyed to realistic window counts. Co-Authored-By: claude-flow <ruv@ruv.net> * feat(benchmarks): static PTQ int8 (calibrated) results + overnight capture script Conv-only static QDQ beats dynamic int8 on accuracy (PCK@20 96.61-96.63% vs 96.52%, MPJPE +10% vs +18% over fp32) at ~equal size/latency; all-ops QDQ strictly worse (int8 activations through attention glue). Entropy calibration verified bit-identical to MinMax on this data. Deployment: ONNX fp32 for speed (3.2ms), static conv-only QDQ for smallest (2.53MB). Also: scripts/overnight-empty-capture.py — segmented UDP CSI recorder for empty-room baselines (no glob collisions, detach-safe). Co-Authored-By: claude-flow <ruv@ruv.net> * feat(benchmarks): measurement (b) MEASURED — optimization transfer only, mean-pose baseline wins WiFlow-STD fine-tuned on 2,046 fresh single-room ESP32 paired windows (temporal 70/15/15, 70->540 adapter, K=17): pretrained-init 65% PCK@20 vs scratch 0% (optimization transfer) but frozen-trunk ~0% (no feature transfer), and NOTHING beats the mean-pose baseline (95.9% PCK@20 — single subject, near-static normalized coords). Honesty gates held: pred std 0.0113 (non-constant model) but mean-baseline dominance means no citable CSI->pose capability from this data. ADR-152 open question 1 answered partially; definitive answer needs multi-subject/position data. 
Two new aligner findings: heterogeneous csi_shape with silent zero-padding (~20%), and extractCsiMatrix's transposed shape label (frame-major data, [nSc, nFrames] label) — fixes pending. Co-Authored-By: claude-flow <ruv@ruv.net> * feat(benchmarks): efficiency sweep MEASURED — half model dominates full reference Compact WiFlow-STD variants on the same data/split/protocol: half (843,834 params, 0.38x) strictly dominates the 2.23M reference (PCK@20 96.62 vs 96.61, PCK@50 99.47 vs 99.11, MPJPE 0.00898 vs 0.0094) — the published architecture is over-parameterized for its own benchmark. quarter (338k) 96.05%; tiny (56,290 params, 1/39.5) holds 94.11% — a ~220KB fp32 edge candidate. In-domain caveats recorded; cross-domain untested. Co-Authored-By: claude-flow <ruv@ruv.net> * feat(train): compact WiFlow-STD presets in Rust + tiny edge artifact (ADR-152) WiFlowStdConfig gains half()/quarter()/tiny() mirroring the overnight sweep exactly: TcnGroupsMode (Fixed/Gcd/Depthwise), input_pw_groups, derived stride schedule and decoder-mid (all default to upstream behavior; legacy serde JSON unaffected). Param formulas pin to trained ground truth first try: 843,834 / 338,600 / 56,290; default 2,225,042 pin and 1.192e-7 parity unchanged. 248 tests green. Tiny edge artifact (tiny_edge_bench.py): ONNX fp32 = 295 KB, 0.66 ms/win (~1,500/s CPU), 94.11% PCK@20 (matches sweep clean-test exactly; parity 1.49e-7). Static int8 is a bad trade at this scale (-1.43pt, +19% MPJPE, -16% size, slower) — recorded as negative result. Export note: width-16 breaks AdaptiveAvgPool((15,1)) TorchScript export; replaced by exact mean+matmul equivalent, proven by parity. 
Co-Authored-By: claude-flow <ruv@ruv.net> * fix: resolve all 10 confirmed code-review findings (7-angle review, 20/20 verified) wiflow_std: min_feature_width (default 15) replaces the keypoints->stride coupling — for_keypoints(17) now provably builds the trained [2,2,2,2] graph and pools 15->17, matching the validated Python protocol (pinned by tests); param_count() total on invalid configs; random_mask returns Result and rejects non-finite/out-of-range ratios; trainer checkpoints switched to safetensors (.pt VarStore roundtrip broken on Windows torch 2.11). ieee80211bf: SBP proxy now re-triggers instances and relays reports via Action::RelaySbpReport -> SensingFrame::SbpReport (clients consume via their existing path); missed_instances reset on success = consecutive semantics; SessionTable gains a guarded SBP entry point + unknown-id drop counter; initiator-role sessions reject inbound setup/SBP requests (RejectedNotSupported) closing the idle hijack; StartSetup/StartSbp outside Idle return InvalidStateForCommand; SBP validation unified through evaluate_setup with a 1:1 SetupStatus->SbpStatus mapping. events.rs split out to honor the 500-line cap. calibration/cli: enrollment geometry now actually reaches trained banks — both production call sites attach .with_geometry; --geometry flag on train-room and POST /enroll/geometry + train-body geometry on calibrate-serve give production a recording surface; geometry-free banks log the ADR-152 §2.1.2 note. benchmarks: corruption masks committed as ground truth (unregenerable after in-place cleaning; verified bit-identical regeneration from the pristine copy) + generate_corruption_masks.py producer; _bench_common.py dedups the 5x-copied shim/evaluate/seed/remap (post-refactor PCK@20 re-verified equal to the last digit); remote scripts get the mmap patch; tiny_edge --calib validated multiple-of-64; onnx_bench --help no longer executes (and overwrote) the export — artifact restored byte-exact. 
Workspace: 2,963 tests passed, 0 failed; Python proof PASS. Co-Authored-By: claude-flow <ruv@ruv.net> * ci: build workspace tests without debuginfo — runner disk exhaustion The combined 38-crate debug target exceeds the GitHub runner's disk ('final link failed: No space left on device'); the same tree measured 151GB locally with full debuginfo. CARGO_PROFILE_{DEV,TEST}_DEBUG=0 shrinks the target ~5-10x; debuginfo serves no purpose in CI test runs. Co-Authored-By: claude-flow <ruv@ruv.net>
334 lines
14 KiB
Python
"""ADR-152 edge optimization follow-up: ONNX Runtime STATIC post-training
quantization (calibration-based QDQ) of the retrained WiFlow-STD model, to
improve on the dynamic-int8 result (2.44 MB, PCK@20 96.52%, 6.5 ms/win b1).

Static PTQ pre-computes activation ranges from calibration data, so inference
uses QLinearConv/QDQ kernels instead of dynamic ConvInteger -- typically both
faster and (with good calibration) closer to fp32 accuracy.

Method:
  - Calibration set: corruption-free windows drawn ONLY from the seed-42
    file-level TRAINING split (same split as eval_repro.py; corrupted windows
    excluded via results/nan_windows_mask.npy | big_windows_mask.npy), chosen
    with np.random.default_rng(42). Never test windows.
  - quantize_static, QuantFormat.QDQ, per-channel int8 weights, int8
    activations; calibration methods MinMax / Entropy / Percentile(99.99);
    scopes "all" (ORT default op set) vs "conv" (op_types_to_quantize=
    ["Conv"] -- leaves the attention path, which exports as Einsum/Softmax
    and elementwise ops, in fp32).
  - Model is pre-processed first (quant_pre_process: symbolic shape
    inference + ORT graph optimization, folds BatchNormalization into Conv).
  - Accuracy: identical protocol to eval_ort_accuracy.py -- the 10,000-window
    seed-42 subset of the corruption-free test split (PCK@20/50, MPJPE).
  - Latency: median ms/window at batch 1 (100 runs) and batch 64 (30 runs),
    3 interleaved repetitions across all variants (fp32 and dynamic-int8
    sessions included as same-session reference points).

Usage:
  PYTHONUTF8=1 .venv/Scripts/python.exe static_ptq_bench.py \
      [--data-dir <preprocessed_csi_data>] [--subset 10000]
      [--calib-minmax 1000] [--calib-hist 512] [--skip-accuracy]

Writes/merges into results/edge_optimization.json under key "onnx_static_ptq".
"""

import argparse
import collections
import json
import os
import platform
import statistics
import sys
import time

import numpy as np
import torch

HERE = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(0, HERE)

from _bench_common import RESULTS  # noqa: E402
# quantize_bench sets up upstream imports + the np.load mmap patch
# (both via _bench_common.import_upstream)
from quantize_bench import build_test_subset  # noqa: E402
import quantize_bench as qb  # noqa: E402
from eval_ort_accuracy import evaluate_ort  # noqa: E402

FP32_ONNX = os.path.join(RESULTS, "retrained_fp32_dynamic.onnx")
DYN_INT8_ONNX = os.path.join(RESULTS, "retrained_int8_ort_dynamic.onnx")
PREPROC_ONNX = os.path.join(RESULTS, "retrained_fp32_preproc.onnx")

# ---------------------------------------------------------------------------
# calibration data: corruption-free TRAINING-split windows only
# ---------------------------------------------------------------------------

def build_calibration_windows(data_dir, n_windows):
    """Seed-42 file-level 70/15/15 TRAIN split (exactly as eval_repro.py),
    minus corrupted windows, then a seed-42 random draw of n_windows."""
    dataset = qb.PreprocessedCSIKeypointsDataset(
        data_dir=data_dir, keypoint_scale=1000.0, enable_temporal_clean=True)
    train_loader, _va, _te = qb.create_preprocessed_train_val_test_loaders(
        dataset=dataset, batch_size=64, num_workers=0, random_seed=42)
    train_indices = np.asarray(train_loader.dataset.indices)

    corrupted = (np.load(os.path.join(RESULTS, "nan_windows_mask.npy"))
                 | np.load(os.path.join(RESULTS, "big_windows_mask.npy")))
    clean = train_indices[~corrupted[train_indices]]
    print(f"train split: {len(train_indices)} windows, "
          f"{len(train_indices) - len(clean)} corrupted excluded, "
          f"{len(clean)} clean")

    rng = np.random.default_rng(42)
    sel = np.sort(rng.choice(clean, size=n_windows, replace=False))
    xs = np.stack([dataset[int(i)][0].numpy() for i in sel]).astype(np.float32)
    print(f"calibration tensor: {xs.shape} from {n_windows} clean TRAIN windows")
    return xs

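# ---------------------------------------------------------------------------
# Illustrative sketch only (hypothetical helper, not called by this script).
# It replays the selection logic of build_calibration_windows on plain arrays:
# OR two boolean corruption masks, drop flagged windows from the split's
# indices, then take a seeded draw without replacement and sort it. The name
# _demo_clean_draw and its arguments are assumptions made up for illustration.
# ---------------------------------------------------------------------------

def _demo_clean_draw(split_indices, nan_mask, big_mask, n_windows, seed=42):
    import numpy as np  # local import keeps the sketch self-contained

    split_indices = np.asarray(split_indices)
    # masks are indexed by global window id, exactly like the .npy masks above
    corrupted = np.asarray(nan_mask, dtype=bool) | np.asarray(big_mask, dtype=bool)
    clean = split_indices[~corrupted[split_indices]]
    rng = np.random.default_rng(seed)
    # raises ValueError if n_windows exceeds the number of clean windows
    return np.sort(rng.choice(clean, size=n_windows, replace=False))
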
def make_reader(windows, batch_size=64):
    from onnxruntime.quantization import CalibrationDataReader

    class WindowReader(CalibrationDataReader):
        def __init__(self):
            self._batches = [windows[i:i + batch_size]
                             for i in range(0, len(windows), batch_size)]
            self._it = iter(self._batches)

        def get_next(self):
            b = next(self._it, None)
            return None if b is None else {"input": b}

        def rewind(self):
            self._it = iter(self._batches)

        def __len__(self):
            return len(self._batches)

    return WindowReader()

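# ---------------------------------------------------------------------------
# Illustrative sketch only (hypothetical class, not used by this script).
# It shows the reader contract WindowReader relies on, without importing
# onnxruntime: get_next() hands out one feed dict per batch and returns None
# once the data is exhausted, and rewind() restarts the iteration. The input
# name "input" is copied from WindowReader; _DemoBatchReader is an assumption.
# ---------------------------------------------------------------------------

class _DemoBatchReader:
    def __init__(self, items, batch_size):
        # the last batch may be shorter than batch_size
        self._batches = [items[i:i + batch_size]
                         for i in range(0, len(items), batch_size)]
        self._it = iter(self._batches)

    def get_next(self):
        b = next(self._it, None)
        return None if b is None else {"input": b}

    def rewind(self):
        self._it = iter(self._batches)
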
# ---------------------------------------------------------------------------
# quantization variants
# ---------------------------------------------------------------------------

def preprocess_model():
    from onnxruntime.quantization.shape_inference import quant_pre_process
    quant_pre_process(FP32_ONNX, PREPROC_ONNX)
    return PREPROC_ONNX

def quantize_variant(src, dst, method, scope, calib_windows):
    from onnxruntime.quantization import (CalibrationMethod, QuantFormat,
                                          QuantType, quantize_static)
    methods = {
        "minmax": CalibrationMethod.MinMax,
        "entropy": CalibrationMethod.Entropy,
        "percentile": CalibrationMethod.Percentile,
    }
    # NB: do NOT pass CalibMaxIntermediateOutputs -- in ORT 1.26 the MinMax
    # calibrater clears its buffer every N batches and then raises
    # "No data is collected" if the batch count is divisible by N.
    extra = {}
    if method == "percentile":
        extra["CalibPercentile"] = 99.99
    op_types = ["Conv"] if scope == "conv" else None

    t0 = time.time()
    quantize_static(
        src, dst, make_reader(calib_windows),
        quant_format=QuantFormat.QDQ,
        op_types_to_quantize=op_types,
        per_channel=True,
        activation_type=QuantType.QInt8,
        weight_type=QuantType.QInt8,
        calibrate_method=methods[method],
        extra_options=extra,
    )
    secs = time.time() - t0

    import onnx
    ops = collections.Counter(n.op_type for n in onnx.load(dst).graph.node)
    return {
        "file": os.path.basename(dst),
        "size_bytes": os.path.getsize(dst),
        "size_mb": os.path.getsize(dst) / 1e6,
        "calibration": {"method": method,
                        "windows": int(len(calib_windows)),
                        "percentile": extra.get("CalibPercentile"),
                        "seconds": secs},
        "scope": scope,
        "per_channel": True,
        "activation_type": "QInt8",
        "weight_type": "QInt8",
        "node_counts": {k: v for k, v in sorted(ops.items())},
    }

# ---------------------------------------------------------------------------
# latency (3 interleaved reps, like the latency_controlled_rerun)
# ---------------------------------------------------------------------------

def ort_session(path):
    import onnxruntime as ort
    return ort.InferenceSession(path, providers=["CPUExecutionProvider"])


def bench_ort(sess, batch, n_runs):
    rng = np.random.default_rng(123)
    x = rng.random((batch, 540, 20), dtype=np.float32)
    inp = sess.get_inputs()[0].name
    for _ in range(max(5, n_runs // 10)):
        sess.run(None, {inp: x})
    times = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        sess.run(None, {inp: x})
        times.append(time.perf_counter() - t0)
    return statistics.median(times) * 1e3 / batch  # ms/window

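# ---------------------------------------------------------------------------
# Illustrative sketch only (hypothetical helper, not called by this script).
# It isolates the timing pattern of bench_ort for any zero-argument callable:
# a few untimed warm-up calls, then n_runs timed calls, reported as the median
# in milliseconds so a single slow outlier does not move the result. The name
# _demo_median_ms is an assumption made up for illustration.
# ---------------------------------------------------------------------------

def _demo_median_ms(fn, n_runs=20):
    import statistics
    import time

    for _ in range(max(5, n_runs // 10)):  # warm-up, same rule as bench_ort
        fn()
    times = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return statistics.median(times) * 1e3  # ms per call
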
def interleaved_latency(sessions, reps=3, runs_b1=100, runs_b64=30):
    lat = {name: {"batch1_reps": [], "batch64_reps": []} for name in sessions}
    for rep in range(reps):
        for name, sess in sessions.items():
            lat[name]["batch1_reps"].append(bench_ort(sess, 1, runs_b1))
            lat[name]["batch64_reps"].append(bench_ort(sess, 64, runs_b64))
            print(f" rep {rep + 1}/{reps} {name}: "
                  f"b1={lat[name]['batch1_reps'][-1]:.2f} "
                  f"b64={lat[name]['batch64_reps'][-1]:.3f} ms/win", flush=True)
    for name in lat:
        lat[name]["batch1_ms_per_window_median"] = statistics.median(
            lat[name]["batch1_reps"])
        lat[name]["batch64_ms_per_window_median"] = statistics.median(
            lat[name]["batch64_reps"])
    return lat

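# ---------------------------------------------------------------------------
# Illustrative sketch only (hypothetical helper, not called by this script).
# It mirrors the nested merge that main() performs before writing the results
# JSON: when the section already exists, "env" is overwritten while "variants"
# and "latency" are updated key by key, so a partial --methods rerun keeps the
# variants it did not touch. The name _demo_merge_section is an assumption.
# ---------------------------------------------------------------------------

def _demo_merge_section(merged, key, results):
    prev = merged.get(key)
    if prev:
        prev["env"] = results["env"]
        prev["variants"].update(results["variants"])
        prev.setdefault("latency", {}).update(results["latency"])
    else:
        merged[key] = results
    return merged
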
# ---------------------------------------------------------------------------

def main():
    import onnxruntime
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-dir", default=os.path.join(
        os.path.expanduser("~"), ".cache", "kagglehub", "datasets", "kaka2434",
        "wiflow-dataset", "versions", "1", "preprocessed_csi_data"))
    parser.add_argument("--subset", type=int, default=10000)
    parser.add_argument("--calib-minmax", type=int, default=1000)
    parser.add_argument("--calib-hist", type=int, default=512,
                        help="calibration windows for Entropy/Percentile "
                             "(histogram calibraters hold all intermediate "
                             "activations in RAM)")
    parser.add_argument("--skip-accuracy", action="store_true")
    parser.add_argument("--methods", default="minmax,entropy,percentile",
                        help="comma list of calibration methods to (re)run; "
                             "results merge into existing onnx_static_ptq")
    parser.add_argument("--out", default=os.path.join(RESULTS, "edge_optimization.json"))
    args = parser.parse_args()

    results = {
        "env": {
            "onnxruntime": onnxruntime.__version__,
            "torch": torch.__version__,
            "platform": platform.platform(),
            "source_model": os.path.basename(FP32_ONNX),
        },
        "variants": {},
    }

    # ---- calibration data (TRAIN split only) -------------------------------
    calib_mm = build_calibration_windows(args.data_dir, args.calib_minmax)
    # Entropy/Percentile reuse a prefix of the MinMax draw. The draw is sorted
    # by dataset index, so this prefix is its lowest-index slice, not a fresh
    # random sample.
    calib_hist = calib_mm[:args.calib_hist]

    # ---- preprocess + quantize ---------------------------------------------
    print("\n=== quant_pre_process (shape inference + graph optimization) ===")
    src = preprocess_model()
    results["env"]["preprocessed_model"] = {
        "file": os.path.basename(src),
        "size_mb": os.path.getsize(src) / 1e6,
    }

    matrix = [(m, s) for m in args.methods.split(",")
              for s in ("all", "conv")]
    for method, scope in matrix:
        name = f"{method}_{scope}"
        dst = os.path.join(RESULTS, f"retrained_int8_static_{name}.onnx")
        calib = calib_mm if method == "minmax" else calib_hist
        print(f"\n=== quantize_static: {name} "
              f"({len(calib)} calib windows) ===", flush=True)
        try:
            results["variants"][name] = quantize_variant(
                src, dst, method, scope, calib)
            print(f" {results['variants'][name]['size_mb']:.3f} MB")
        except Exception as e:  # noqa: BLE001
            results["variants"][name] = {"error": f"{type(e).__name__}: {e}"}
            print(f" FAILED: {e}")

    # ---- fixture parity (sanity, batch 2) ----------------------------------
    fixture = np.load(os.path.join(RESULTS, "parity_fixture.npz"))
    fx, fy = fixture["input"], fixture["output"]
    sessions = {}
    for name, info in results["variants"].items():
        if "error" in info:
            continue
        path = os.path.join(RESULTS, info["file"])
        try:
            sess = ort_session(path)
            yq = sess.run(None, {sess.get_inputs()[0].name: fx})[0]
            info["max_abs_diff_vs_fp32_fixture"] = float(np.abs(yq - fy).max())
            sessions[name] = sess
        except Exception as e:  # noqa: BLE001
            info["run_error"] = f"{type(e).__name__}: {e}"
    print("\nfixture max-abs-diff vs fp32:",
          {n: round(results["variants"][n].get("max_abs_diff_vs_fp32_fixture",
                                               float("nan")), 5)
           for n in results["variants"]})

    # ---- latency: 3 interleaved reps incl. fp32 + dynamic-int8 reference ----
    print("\n=== latency (3 interleaved reps) ===")
    lat_sessions = {"onnx_fp32": ort_session(FP32_ONNX),
                    "onnx_int8_ort_dynamic": ort_session(DYN_INT8_ONNX)}
    lat_sessions.update(sessions)
    results["latency"] = {
        "note": "3 interleaved repetitions per variant, median ms/window; "
                "onnx_fp32 / onnx_int8_ort_dynamic are same-session references",
        **interleaved_latency(lat_sessions),
    }

    # ---- accuracy on the standard 10k corruption-free test subset ----------
    if not args.skip_accuracy:
        loader, n_clean = build_test_subset(args.data_dir, args.subset)
        results["accuracy_subset"] = {
            "description": "seed-42 file-level 70/15/15 test split, corrupted "
                           "windows excluded, seed-42 random subset (same as "
                           "quantize_bench/eval_ort_accuracy)",
            "subset_size": min(args.subset, n_clean) if args.subset else n_clean,
        }
        for name, sess in sessions.items():
            print(f"\n=== accuracy: {name} ===")
            results["variants"][name]["accuracy"] = evaluate_ort(
                sess, loader, name)
            print(json.dumps(results["variants"][name]["accuracy"], indent=2))

    # ---- merge into edge_optimization.json ----------------------------------
    merged = {}
    if os.path.exists(args.out):
        with open(args.out) as f:
            merged = json.load(f)
    prev = merged.get("onnx_static_ptq")
    if prev:  # nested merge so partial --methods reruns don't clobber
        prev["env"] = results["env"]
        prev["variants"].update(results["variants"])
        prev.setdefault("latency", {}).update(results["latency"])
        if "accuracy_subset" in results:
            prev["accuracy_subset"] = results["accuracy_subset"]
    else:
        merged["onnx_static_ptq"] = results
    with open(args.out, "w") as f:
        json.dump(merged, f, indent=2)
    print(f"\nwrote {args.out}")


if __name__ == "__main__":
    main()