feat(train): AetherTemporalAggregator — wire wifi-densepose-temporal into the tch graph (#513 )

ADR-096 train integration. Additive — does NOT modify model.rs. The existing WiFiDensePoseModel forward stays bit-equivalent for back-compat. New code lives in temporal_aether.rs behind the `aether-sparse-temporal` feature flag (which itself requires `tch-backend`). Architecture: tch::Tensor [T, in_dim] ──── tch nn::Linear (q/k/v projections) ↓ [T, q_heads*head_dim] etc ↓ tch_to_tensor3 (CPU, f32, 1× copy) ↓ ruvllm_sparse_attention::Tensor3 ↓ AetherTemporalHead::forward() ↓ Tensor3 [T, q_heads, head_dim] ↓ tensor3_to_tch (1× copy) ↓ tch::Tensor [T, q_heads*head_dim] ↓ tch nn::Linear (output projection) ↓ tch::Tensor [T, in_dim] Why additive rather than swapping `apply_antenna_attention` / `apply_spatial_attention` in model.rs: those are over antenna and spatial axes, not temporal — ADR-096 §8.1 was right that AETHER doesn't currently HAVE a temporal-axis attention. This commit adds that path without disturbing the others, so the §5 validation gate can A/B the two options before flipping the production default. Scope notes: - B=1 prefill only this version. Multi-batch lands when §5 turns green and we need to take perf seriously. The forward expects `[T, in_dim]` not `[B, T, in_dim]`; documented in the file. - Streaming step() bridge deferred — KvCache lifecycle ties to PoseTrack per ADR-096 §8.5, which is signal-side not train-side. - Two CPU memory copies per call (in + out). For training-rate forwards (~100/sec at batch 16) this is negligible vs the actual attention work; for inference-rate streaming it'd be the bottleneck and a zero-copy path is the natural follow-up. Build verification: - Source compiles cleanly with cargo check on the host crate (`-p wifi-densepose-temporal`, 21/21 tests still passing). - The train crate's tch-backend build is environmentally blocked on this Windows machine — torch-sys fails to link against the system PyTorch 2.11 + MSVC 14.50 toolchain. This predates this commit and affects all tch-bound code paths in the workspace. CI runners with working libtorch will verify the new module builds; the source follows the same nn::Linear / Module patterns the existing model.rs uses. Feature gating ensures default builds are byte-equivalent. Off by default; enable with `--features aether-sparse-temporal`. Co-Authored-By: claude-flow <ruv@ruv.net>
test(temporal): dense-vs-sparse numerical A/B baseline (ADR-096 §5, #513 )
2026-06-09 10:13:17 +00:00 · 2026-05-08 12:42:41 -04:00 · 2026-05-08 12:16:05 -04:00 · 2026-05-08 12:12:17 -04:00 · 2026-05-08 12:06:26 -04:00 · 2026-05-08 12:02:36 -04:00
40 changed files with 4412 additions and 4 deletions
@@ -0,0 +1,369 @@
+# ADR-095: On-ESP32-S3 Temporal Modeling at the Edge via `ruvllm_sparse_attention` (no_std)
+
+| Field       | Value                                                                                                  |
+|-------------|--------------------------------------------------------------------------------------------------------|
+| **Status**  | Proposed (2026-05-07)                                                                                  |
+| **Date**    | 2026-05-07                                                                                             |
+| **Authors** | ruvnet, claude-flow                                                                                    |
+| **Related** | ADR-018, ADR-024, ADR-039, ADR-040, ADR-061, ADR-081, ADR-091; upstream ADR-189, ADR-190, ADR-192      |
+| **Branch**  | `feat/ruvllm-sparse-attention-edge`                                                                    |
+| **Tracking**| #513                                                                                                   |
+
+---
+
+## 1. Context
+
+Today the ESP32-S3 firmware in `firmware/esp32-csi-node/main/` does
+**physics-only** sensing on-device. The pipeline in `edge_processing.c`
+runs on Core 1 and produces:
+
+- Adaptive presence detection (`presence_score`).
+- Breathing-band (0.1–0.5 Hz) and heart-rate-band (0.8–2.0 Hz) biquad
+  IIR bandpass + zero-crossing BPM estimators.
+- A motion / fall flag (`flags` bits 0–2 in `edge_vitals_pkt_t` magic
+  `0xC5110002`, plus fused mmWave variant `0xC5110004` per ADR-063).
+- ADR-081 `rv_feature_state_t` (60 B at magic `0xC5110006`) emitted at
+  1–10 Hz from the adaptive controller's fast loop.
+
+There is **no learned model of any kind on the MCU**. The closest things
+are: ADR-039 Tier-1 compressed-CSI emission, ADR-040 WASM modules
+(Tier-3, but used by the user for ad-hoc DSP, not transformer
+inference), and the Rust-side AETHER embeddings (ADR-024) which run
+on the host, not the node. Anomaly detection that needs *temporal
+context* — "is this fall pattern consistent with a fall, or just a
+sit-down?" — is structurally absent. The fall debounce in v0.6.x
+(3-frame consecutive + 5 s cooldown, raised threshold 2.0 → 15.0 rad/s²)
+is a hand-tuned heuristic exactly because the firmware has nothing
+better to reason with.
+
+A second pressure point: the Tmr Svc / FreeRTOS stack is already
+sensitive. `edge_processing.c` lines 47–48 explicitly note that
+`process_frame + update_multi_person_vitals` combined used ~6.5–7.5 KB
+of the 8 KB task stack and that **scratch buffers were moved to static
+storage to avoid stack overflow.** Any new heavyweight workload — and
+a transformer forward pass is heavyweight — must therefore live in
+**its own FreeRTOS task with its own task stack**, not piggyback on
+the existing edge DSP task.
+
+The vendored crate `ruvllm_sparse_attention` v0.1.1 (released 2026-05-07,
+synced today at `vendor/ruvector/crates/ruvllm_sparse_attention/`)
+removes the previously-blocking `std` requirement. Per upstream
+**ADR-192**, the crate now compiles cleanly to
+`xtensa-esp32s3-none-elf` via `espup`, with a measured **376 KB
+release rlib**, zero runtime dependencies beyond `libm`, and was
+validated on a real ESP32-S3 (rev v0.2, 16 MB flash). It exposes
+`SubquadraticSparseAttention`, `KvCache` / `KvCacheF16`, `FastGrnnGate`,
+`IncrementalLandmarks`, `RuvLlmSparseBlock`, and a `Tensor3` value
+type. The kernel is O(N log N) by default and near-linear O(N) when
+the FastGRNN salience gate is enabled.
+
+This is the first time we have had a credible path to **on-device
+transformer inference for CSI** without a Python runtime, without
+TFLite, and without a coprocessor. It is also the right moment to
+decide *whether* we want it before code starts to land.
+
+---
+
+## 2. Decision
+
+Add a learned **temporal head** to the ESP32-S3 firmware running on
+the node itself, using `ruvllm_sparse_attention` compiled
+`--no-default-features` (no_std + alloc, optionally `+fp16`), driven
+by a small Rust component integrated into the ESP-IDF build. The
+temporal head runs **alongside** the existing physics-only pipeline,
+not as a replacement — physics gives us breathing/heart-rate/presence,
+the temporal head gives us classification and sequence-aware reasoning.
+
+Concretely:
+
+1. The temporal head consumes a rolling window of feature vectors
+   (initially the same `rv_feature_state_t` floats already produced
+   by ADR-081, plus optionally a small projection of recent CSI
+   amplitude statistics), length `N` ∈ [100, 500] frames, sampled at
+   the controller's fast-loop rate.
+2. It outputs a small set of **class logits** for the active
+   detection task. The first three deployable tasks are listed in
+   §4.
+3. It runs in its own FreeRTOS task on Core 1 (or pinned to whichever
+   core the WiFi driver is *not* on), at a cadence slower than the
+   fast loop — initially 1 Hz, classification-on-demand.
+4. The kernel is invoked through a thin C ABI (`ruv_temporal_init`,
+   `ruv_temporal_push_frame`, `ruv_temporal_classify`) exported from
+   a Rust static library linked into the ESP-IDF build the same way
+   the existing Tier-3 components are linked.
+5. Weights are stored as a flat `f32` (or `f16` with the `fp16`
+   feature) blob in the ESP32-S3 flash, loadable from either an
+   embedded `EMBED_FILES` resource (compile-time bake-in) or NVS
+   (post-flash provisioning, mirroring ADR-040's WASM-upload path).
+6. The temporal head is gated behind a Kconfig option
+   `CONFIG_CSI_TEMPORAL_HEAD_ENABLED`, **default off**, and is only
+   compiled into the 8 MB build profile until the flash math in §6
+   demonstrates 4 MB headroom.
+
+This ADR authorizes the architecture; it does **not** ship any of
+the firmware-side or training-side changes. Implementation lands in
+follow-up issues per the roadmap in §7.
+
+---
+
+## 3. Approach
+
+### 3.1 Build integration
+
+ESP-IDF v5.4 already supports Rust components via the
+`rust-esp32`-style template (a CMake `idf_component_register` shim
+that runs `cargo build --target xtensa-esp32s3-none-elf` and links
+the resulting static library). The new component lives at
+`firmware/esp32-csi-node/components/ruv_temporal/`:
+
+```
+ruv_temporal/
+  CMakeLists.txt          # component manifest, Rust build invocation
+  Cargo.toml              # crate config: no_std, deps on ruvllm_sparse_attention
+  build.rs                # generates the C header from #[no_mangle] exports
+  src/lib.rs              # public C ABI: init/push/classify/teardown
+  src/window.rs           # rolling frame buffer
+  src/weights.rs          # NVS / EMBED_FILES weight loader
+  include/ruv_temporal.h  # generated; consumed by edge_processing.c
+```
+
+Cargo features compiled in: `["fp16"]`. **Not** `parallel` (rayon
+needs threads, breaks no_std). **Not** `std`.
+
+### 3.2 Interface
+
+The C ABI is intentionally narrow. It does not expose `Tensor3`,
+attention configs, or any Rust types — only `float*` buffers and
+opaque handles:
+
+```c
+typedef struct ruv_temporal_ctx ruv_temporal_ctx_t;
+
+esp_err_t ruv_temporal_init(const uint8_t *weights, size_t wlen,
+                            uint32_t input_dim, uint32_t window,
+                            ruv_temporal_ctx_t **out_ctx);
+esp_err_t ruv_temporal_push(ruv_temporal_ctx_t *ctx, const float *frame);
+esp_err_t ruv_temporal_classify(ruv_temporal_ctx_t *ctx,
+                                float *logits, uint32_t n_classes);
+void      ruv_temporal_destroy(ruv_temporal_ctx_t *ctx);
+```
+
+`push` is the hot path and must be cheap (it just writes into a
+ring buffer in PSRAM if available, IRAM/DRAM otherwise). `classify`
+runs the actual sparse attention forward and is the budget-heavy
+call.
+
+### 3.3 Task topology
+
+A new task `ruv_temporal_task` with its own 16 KB stack, pinned to
+the same core as the edge DSP task (Core 1), fed via a FreeRTOS
+queue from the adaptive controller's fast loop. We do **not** call
+the kernel from the existing edge task — the edge stack is already
+near-full per the comment at `edge_processing.c:47-48` and recent
+fall-debounce / Tmr-Svc-stack work.
+
+### 3.4 Memory budget (per inference)
+
+With `N = 256` (window), `d_model = 32`, `n_heads = 4`, `head_dim = 8`,
+1–2 `RuvLlmSparseBlock` layers, `block_size = 64`, `window = 64`:
+
+- Weights: ~5–15 KB (single block, INT8 quant deferred to a later
+  ADR; FP16 default).
+- KV cache (FP16, full window): `2 * 256 * 4 * 8 * 2 B ≈ 16 KB`.
+- Activations (peak, with `forward_flash` tiling): ≈ 2 KB.
+- Working set: < 64 KB. Comfortable in PSRAM, possible in ISR-safe
+  internal SRAM.
+
+These are first-pass estimates; the precise numbers come out of the
+`forward_flash` benchmark on real hardware, which is exit criterion
+in §7.
+
+### 3.5 Compatibility with ADR-081 / ADR-039 / ADR-018
+
+The temporal head is a **consumer** of the same feature stream
+already flowing in the firmware. It does not alter:
+
+- ADR-018 raw CSI frame layout (`0xC5110001`).
+- ADR-039 Tier-1 compressed CSI (`0xC5110005`) or vitals
+  (`0xC5110002`).
+- ADR-063 fused vitals (`0xC5110004`).
+- ADR-081 `rv_feature_state_t` (`0xC5110006`) — this is the primary
+  input we tap.
+
+If the temporal head fires a classification, the result rides on a
+new `0xC5110007` packet (small: class id, confidence, monotonic seq,
+ts_us, CRC32). Allocation of that magic is deferred to the
+implementation PR — this ADR reserves the *concept*, not the byte
+layout.
+
+---
+
+## 4. Use cases that motivate this
+
+| Task | Why temporal context matters | Window | Class count |
+|------|------------------------------|--------|-------------|
+| **Gesture recognition** (wave / point / clap / kick) | Single-frame CSI snapshots can't disambiguate gestures from random motion. ~100-frame windows capture full gesture trajectories. | 100 frames @ 50 Hz = 2 s | 4–8 |
+| **Fall classification with sequence context** | The current heuristic ("> 15 rad/s² for 3 consecutive frames + 5 s cooldown") was raised to suppress false positives. A learned temporal head can distinguish a fall (rapid descent then stillness) from a sit-down (descent then sustained micro-motion) using the same input window. | 200 frames @ 50 Hz = 4 s | 3 (fall / sit / nothing) |
+| **Breathing-quality scoring** | Today's pipeline emits a BPM and a confidence float. A temporal head trained on labeled apnea / shallow / paradoxical / normal sequences can output a 4-class quality label that downstream consumers can render in one glance. | 500 frames @ 50 Hz = 10 s | 4 |
+| **"Is this normal for this room/time" anomaly detection** | Per-room SONA profiles (ADR-005) capture environment statistics, but anomaly *temporal shape* is currently checked host-side via embedding distance (ADR-024 §2.4 `temporal_baseline` index). A small on-device classifier can flag ahead of host roundtrip. | 300 frames | 2 (normal / anomalous) |
+
+These four cover the visible product gaps in the v0.6.x line.
+Gesture recognition is the headline; fall classification is the
+highest-impact for the eldercare scenarios v0.5.4 was tuned for.
+
+---
+
+## 5. Alternatives considered
+
+| Option | Why rejected |
+|--------|--------------|
+| **TFLite Micro** | Heavier runtime (~150 KB code + interpreter), pulls in C++ STL surface, no Rust-native API. Does not benefit from sparse attention specifically. We'd be re-paying the cost of a full inference framework when we only need one kernel. |
+| **Run all classifiers server-side** | Costs a full Tier-1 CSI uplink (~50–70 KB/s/node per ADR-039) just to feed a remote classifier, then a roundtrip back. Defeats the point of ADR-081's compact feature stream and makes the system worthless when the backhaul is down. Also leaks raw CSI to the network for purposes the user did not opt into. |
+| **Stay physics-only forever** | Cleanest from a maintenance standpoint, but loses gesture, structurally, and the fall-debounce hack will keep accreting per-deployment knobs. The product space already has commodity physics-only firmware (Bosch presence sensors, etc.); on-device transformer inference for CSI is what would *differentiate* RuView. |
+| **Use `ruvector-attention` (already in workspace) on-device** | `ruvector-attention` is `std`-bound today; doesn't compile to `xtensa-esp32s3-none-elf` without a port comparable in scope to upstream ADR-192. Even if ported, it doesn't give us GQA + streaming KV cache, which is the structural capability the new crate adds. |
+| **Wait for IEEE 802.11bf** | Different problem (standardised CSI exposure across vendors). Doesn't address whether the model runs on-device or off. |
+
+---
+
+## 6. Consequences
+
+### Positive
+
+- **Genuinely novel.** No competing CSI-sensing project ships
+  transformer inference on the MCU itself. The closest peers
+  (Espressif's ESP-DL, Edge Impulse) are non-attention CNN/RNN
+  pipelines.
+- **Latency.** Classification result is local — no backhaul,
+  no host roundtrip, sub-100 ms gesture-to-action.
+- **Privacy.** Raw CSI never leaves the node for these tasks.
+- **Reuses the ADR-081 feature stream** — the temporal head is a
+  consumer of the existing 60 B `rv_feature_state_t`, not a new
+  uplink format.
+- **Validated kernel.** Per upstream ADR-192, the no_std build was
+  validated on real ESP32-S3 hardware (MAC `ac:a7:04:e2:66:24`).
+  We are not betting on a paper crate.
+
+### Negative / tradeoffs
+
+- **Flash budget pressure on 4 MB boards.** Per `partitions_4mb.csv`,
+  each OTA slot is 1.875 MB (`0x1D0000`). The current build is
+  ~853 KiB. Adding a 376 KB rlib plus weights brings us to ~1.3 MB —
+  still under the slot ceiling but with little headroom for other
+  growth. **Decision: temporal head is 8 MB-only initially**, gated
+  behind `CONFIG_CSI_TEMPORAL_HEAD_ENABLED`. 4 MB enablement is a
+  separate ADR after we measure the actual incremental link size
+  (the 376 KB upstream number is for the rlib in isolation; the
+  linked-and-stripped final binary delta will be smaller).
+- **Rust toolchain dependency.** The ESP-IDF build now needs
+  `espup` + `cargo +esp` to be present on every developer machine
+  and CI runner. This is a real hurdle on Windows — see
+  `CLAUDE.local.md` for the existing Python-subprocess wrapper
+  required to run ESP-IDF cleanly. CI will need a parallel
+  Rust-toolchain step.
+- **One more thing to test.** QEMU (ADR-061) does not run the
+  ESP32-S3 Xtensa Rust binary today. The QEMU validator pipeline
+  will need a build matrix entry for "Rust component compiled but
+  classifier disabled" at minimum.
+- **Stack overflow risk.** Same hazard the v0.6.4 work just
+  navigated. Mitigated by §3.3 (own task, own stack); this needs
+  to be a code-review checklist item.
+- **Weights provenance.** Once we ship a model, we need a story
+  for *which model*, signed by *whom*, retrained *how often*. See
+  Open Questions §8.
+
+### Neutral
+
+- ADR-040's WASM Tier-3 path is **not** superseded. WASM remains
+  the right choice for user-uploaded modules. The temporal head is
+  a first-party signed-by-us component, with a different deploy
+  story.
+- The host-side ADR-024 AETHER pipeline is unchanged by this ADR.
+  ADR-096 covers the host-side use of the same crate.
+
+---
+
+## 7. Roadmap
+
+| Phase | Scope | Gating |
+|-------|-------|--------|
+| 0 | This ADR + ADR-096 land. No code. | Maintainer review of #513. |
+| 1 | New crate `wifi-densepose-temporal` (host-side only): defines the temporal-head architecture, training script, weight serialization format. | Phase 0 accepted. |
+| 2 | `ruv_temporal` ESP-IDF component scaffolding — empty kernel, just the C ABI and ring buffer. Compiles cleanly into 8 MB firmware. Adds ~5 KB to binary. | Phase 1 produces a serialised set of weights. |
+| 3 | Wire `ruvllm_sparse_attention` `forward` (not yet `forward_gated`) into the component. First on-target classification benchmark on COM7. Gate: end-to-end inference ≤ 50 ms with `N = 256`, no stack overflow under 24 h soak. | Phase 2 ABI stable. |
+| 4 | First trained classifier (gesture or fall, whichever has labelled data first). Hardware A/B: temporal-head decision vs current heuristic on a held-out set. Promotion criterion: temporal head matches or beats heuristic on F1 *and* false-positive rate. | Phase 3 latency gate met. |
+| 5 | 4 MB profile gating — measure actual binary delta, decide whether to enable on SuperMini. | Phase 4 in production on 8 MB. |
+| 6 | `forward_gated_with_fastgrnn` for long-window tasks (breathing-quality at N = 500). | Phase 4 stable. |
+
+---
+
+## 8. Open questions
+
+1. **Who trains the temporal heads?** Two options:
+   (a) host-side training on captured `rv_feature_state_t` traces
+   labelled in-app, then export to flat-buffer weights;
+   (b) teacher-distillation from the larger AETHER model (ADR-024)
+   running off-device, using soft labels. Option (b) is more
+   data-efficient but couples this ADR's ship date to ADR-024's
+   training-pipeline maturity. Open.
+2. **How are weights flashed?** Three options, in increasing
+   capability: NVS blob (small, safe, 4–8 KB ceiling per key),
+   `EMBED_FILES` baked into the firmware image (no runtime update),
+   OTA-updateable partition (mirrors ADR-040 RVF upload path,
+   biggest engineering cost). Phase 2/3 will pick one; my prior is
+   `EMBED_FILES` for the first model, OTA partition once we have
+   more than one.
+3. **Does the 376 KB rlib figure scale?** Upstream measured
+   376 KB for the kernel + the embedding/projection
+   weights for *their* test config. Adding 1–2
+   `RuvLlmSparseBlock` layers with embedding/projection weights
+   sized to actual CSI feature dimension may push this. Phase 2
+   will measure the on-target stripped-binary delta directly; if
+   the delta exceeds 600 KB we revisit the 4 MB story sooner.
+4. **What window length is right for fall classification?**
+   200 frames at 50 Hz = 4 s feels right based on the v0.6.4
+   debounce numbers (3-frame consecutive + 5 s cooldown is
+   essentially a 4-second decision window already). Empirical, not
+   architectural — set in Phase 4.
+5. **Quantisation.** First model ships FP16 (KV cache feature flag
+   already supports this). INT8 for both weights and activations
+   is a follow-up; the current crate has no INT8 path so it would
+   be a separate kernel.
+6. **What happens when the controller is in `RV_PROFILE_PASSIVE_LOW_RATE`?**
+   The fast loop slows down, so the input frame rate to the
+   temporal head drops. Either the head needs to handle variable
+   sample rate (resample at push time) or it stops emitting until
+   the controller goes back to active. Phase 1 design call.
+
+---
+
+## 9. Acceptance criteria
+
+This ADR is **Accepted** once:
+
+1. Maintainer review on #513 confirms the architecture.
+2. The follow-up implementation issue is filed and references this
+   ADR plus ADR-096 by number.
+3. ADR index in `docs/adr/README.md` (if present) has an ADR-095
+   row.
+
+This ADR is **Implemented** once:
+
+1. Phase 3 is in `main` with the gating Kconfig off by default.
+2. A Phase-4 hardware A/B has been published (witness-bundle
+   compatible per ADR-028).
+3. The QEMU validator (ADR-061) has at minimum a "compiles, doesn't
+   run" check for the Rust component.
+
+---
+
+## 10. Related
+
+ADR-018 (binary CSI frame), ADR-024 (AETHER contrastive embedding —
+host-side counterpart, see ADR-096), ADR-039 (edge intelligence
+tiers), ADR-040 (WASM Tier-3 modules — the *other* extensibility
+path), ADR-061 (QEMU CI), ADR-081 (adaptive controller, mesh plane,
+`rv_feature_state_t`), ADR-091 (stand-off radar tier — adjacent
+edge-intelligence ADR), upstream ADR-189 (KV cache incremental
+decode), upstream ADR-190 (GQA/MQA), upstream ADR-192 (no_std +
+alloc on ESP32-S3 — the structural unblock that makes this ADR
+possible).
@@ -0,0 +1,389 @@
+# ADR-096: AETHER Temporal Head via `ruvllm_sparse_attention::forward_gqa` + Streaming KV Cache
+
+| Field       | Value                                                                                 |
+|-------------|---------------------------------------------------------------------------------------|
+| **Status**  | Proposed (2026-05-07)                                                                 |
+| **Date**    | 2026-05-07                                                                            |
+| **Authors** | ruvnet, claude-flow                                                                   |
+| **Related** | ADR-014, ADR-016, ADR-024, ADR-095; upstream ADR-189, ADR-190, ADR-192                |
+| **Branch**  | `feat/ruvllm-sparse-attention-edge`                                                   |
+| **Tracking**| #513                                                                                  |
+
+---
+
+## 1. Context
+
+ADR-024 ("Project AETHER") specifies a contrastive CSI embedding
+model on top of the existing `CsiToPoseTransformer` backbone. It
+adds a 2-layer projection head to the per-keypoint features and
+trains it with InfoNCE + VICReg + (optional) cross-modal alignment.
+The **temporal aggregation** that turns per-frame backbone features
+into a window-level representation is described at the level of
+"a transformer encoder over the CSI window" — but ADR-024 does not
+pin a specific attention kernel. In the current code:
+
+- `v2/crates/wifi-densepose-train/src/model.rs` uses
+  `ruvector_attention::ScaledDotProductAttention` (line 34) and
+  applies `apply_antenna_attention` over the antenna-path dimension
+  and `apply_spatial_attention` over the spatial location dimension.
+  Both are dense.
+- The training-side temporal pooling currently runs at
+  `window_frames = 100` by default (`config.rs:165`), with
+  `proof.rs` and `trainer.rs` using shorter test windows of 4 and 2
+  respectively.
+- `v2/crates/wifi-densepose-signal/src/ruvsense/pose_tracker.rs`
+  consumes a 128-dim AETHER re-ID embedding (line 22, 263) but does
+  not perform the temporal aggregation itself — that happens
+  upstream.
+
+So the temporal head is a real seam in the codebase, but its
+specific attention kernel is *currently dense* and *currently not a
+named architectural decision*. This ADR makes that decision.
+
+The vendored `ruvllm_sparse_attention` v0.1.1 (synced today,
+released 2026-05-07) provides a different kind of temporal kernel:
+
+- **Subquadratic O(N log N)** sparse attention (`forward`,
+  `forward_flash`).
+- **Grouped-Query / Multi-Query Attention** (`forward_gqa`,
+  `forward_gqa_flash`) — shares K/V across query heads, the
+  pattern Mistral-7B and Llama-3 use.
+- **Streaming KV cache** (`KvCache`, `KvCacheF16`) with H2O
+  heavy-hitter eviction, allowing token-by-token decode in
+  **O(log T)** per step against an accumulated cache. See upstream
+  ADR-189.
+- **FastGRNN salience gate** for **near-linear O(N)** when the
+  log-stride candidate set can be pruned.
+
+These capabilities are qualitatively different from
+`ruvector-attention` 2.0.4, which is what the workspace uses today
+for spatial / antenna attention.
+
+---
+
+## 2. Decision
+
+The AETHER temporal head will be implemented with
+`ruvllm_sparse_attention::SubquadraticSparseAttention::forward_gqa`
+for prefill, and `decode_step` against a `KvCache` (with the `fp16`
+feature enabled) for streaming inference paths (online re-ID,
+incremental embedding extraction during a tracked session).
+
+Concretely:
+
+1. `wifi-densepose-train` adds `ruvllm_sparse_attention` as a
+   workspace dependency, **path-vendored** against
+   `vendor/ruvector/crates/ruvllm_sparse_attention` so the workspace
+   does not gain a crates.io publish dependency.
+2. The AETHER block factory takes a feature flag
+   (`temporal_head = "dense" | "sparse_gqa"`) selecting between the
+   current dense MHA path and the new sparse-GQA path. The default
+   for new training runs is `sparse_gqa`. Existing checkpoints
+   continue to load on `dense`.
+3. Signal-side consumers (the streaming embedding extraction used
+   by `pose_tracker.rs` for re-ID updates) call `decode_step` rather
+   than re-running prefill on every new frame — this is the
+   structural win that dense MHA cannot provide.
+4. We add an A/B benchmark gate (§5) before flipping the production
+   default. The default *training* config can move first; the
+   default *inference* config waits for the gate.
+
+This ADR sanctions the swap. It does not perform the swap; that
+lands in a follow-up implementation issue once both ADR-095 and
+ADR-096 are accepted.
+
+---
+
+## 3. Quantitative argument
+
+### 3.1 Edge-evaluation count
+
+For a single attention layer over `N` frames:
+
+| Path | Edge evaluations | At `N = 100` (today's default) | At `N = 1000` (10 s @ 100 Hz) | At `N = 8192` |
+|------|------------------|--------------------------------|-------------------------------|---------------|
+| Dense MHA | `N²` | 1.0 × 10⁴ | 1.0 × 10⁶ | 6.7 × 10⁷ |
+| Sparse `forward` (window + log-stride + landmarks) | ~`N · (W + log N + N/B)` | 1.4 × 10⁴ | 1.4 × 10⁴ | 1.1 × 10⁶ |
+| Sparse + FastGRNN | ~`N · (W + globals + K)` | constant in `N` | constant in `N` | constant in `N` |
+
+Numbers for the sparse rows are taken from upstream's measured
+table (`README.md:230-237`, "sparse-edge reduction vs causal dense
+attention"): 8192 → 29.3× edge reduction, 16384 → 57.5×, 32768 →
+113.2×.
+
+**The honest framing:** at the *current* AETHER default of
+`window_frames = 100`, dense MHA is essentially free and the
+sparse machinery has overhead — the per-token cost in upstream's
+benchmark is ~2.4 µs at `N = 256` and ~2.1 µs at `N = 128`. The
+sparse path probably *loses* below `N ≈ 128`. It starts winning at
+the 1 s + windows we'd realistically use for activity classification
+(`N = 200` at 50 Hz, `N = 500` for breathing-quality), and pulls
+ahead by 30–100× at the 10 s windows that long-context re-ID
+benefits from.
+
+### 3.2 Streaming decode
+
+Where dense MHA structurally cannot follow is incremental decode.
+Re-ID over a long-tracked person (a 5-minute session at 50 Hz =
+15,000 frames) with dense MHA requires recomputing attention from
+scratch every time the window slides. With `decode_step` against a
+`KvCache`:
+
+| Operation | Dense MHA | Sparse GQA + KV cache |
+|-----------|-----------|-----------------------|
+| Append one new frame to the embedding context | O(N²) | **O(log T)** |
+| Memory growth | O(N · d) per recompute | O(T · d_kv) cached, evicted by H2O heavy-hitter |
+| FP16 KV cache | n/a | available via `fp16` feature, halves memory |
+
+This is the qualitative capability dense MHA lacks. Even at small
+`N` where dense MHA is competitive on prefill, decode is structurally
+different: amortised O(1) per new frame vs O(N²) recompute.
+
+---
+
+## 4. Approach
+
+### 4.1 Workspace dependency
+
+Add to `v2/Cargo.toml`:
+
+```toml
+[workspace.dependencies]
+ruvllm_sparse_attention = {
+    path = "../vendor/ruvector/crates/ruvllm_sparse_attention",
+    default-features = false,
+    features = ["fp16"]
+}
+```
+
+`default-features = false` mirrors the rest of the workspace's
+`--no-default-features` posture (and matches what ADR-095 does on
+the firmware side, so both consumers have the same feature set).
+We **do not** pull `parallel` here — rayon doesn't help with
+inference-shaped batches at the sequence lengths we run, and it
+breaks ADR-095's no_std build if the dependency leaks.
+
+### 4.2 Crate placement
+
+Two viable homes for the AETHER temporal head:
+
+| Option | Tradeoffs |
+|--------|-----------|
+| **A. New `wifi-densepose-temporal` crate** | Cleanest. Unique import surface, easy to feature-gate. But: one more crate in the publishing order (CLAUDE.md crate table grows to 16). |
+| **B. Add to `wifi-densepose-train`** | Co-located with the model; no new crate; simpler workspace graph. But: `wifi-densepose-train` is heavyweight (`tch`, full training stack), and signal-side consumers would have to depend on the whole training crate just to run inference. |
+
+**Recommendation: A.** The temporal head is consumed by both
+`wifi-densepose-train` (training) and `wifi-densepose-signal`
+(inference, re-ID). Pulling those toward a shared third crate keeps
+the dependency arrows clean. Also matches ADR-095's
+`wifi-densepose-temporal` host-side training crate name —
+deliberate convergence.
+
+### 4.3 API sketch
+
+```rust
+pub struct AetherTemporalHead {
+    backend: TemporalBackend,
+    cache: Option<KvCache>,           // populated for streaming inference
+}
+
+pub enum TemporalBackend {
+    Dense(DenseMha),                  // current ruvector-attention path
+    SparseGqa(SubquadraticSparseAttention),
+}
+
+impl AetherTemporalHead {
+    pub fn new(cfg: &TemporalHeadConfig) -> Self;
+
+    /// Window-level prefill. Returns pooled [d_model] embedding.
+    pub fn forward(&self, frames: &Tensor3) -> Vec<f32>;
+
+    /// Incremental decode for streaming re-ID. Updates internal
+    /// cache and returns pooled embedding given a single new frame.
+    /// SparseGqa backend only.
+    pub fn step(&mut self, frame: &Tensor3) -> Result<Vec<f32>, TemporalError>;
+}
+```
+
+### 4.4 Selection rule
+
+In `forward_auto`'s spirit, the head selects the path based on
+`(window, n_q_heads, n_kv_heads)` of the model:
+
+- `window ≤ 64` and dense MHA is in the checkpoint: use dense path.
+- `n_q_heads != n_kv_heads`: use `forward_gqa`.
+- `n_q_heads == n_kv_heads` and `window > 64`: use `forward`.
+- Streaming (per-frame) inference: always `decode_step`.
+
+---
+
+## 5. Validation gate before flipping the inference default
+
+We do not flip the production inference default until *all four*
+of these pass on the most recent AETHER checkpoint:
+
+1. **Contrastive loss within 1%** of the dense baseline at the same
+   training budget (so the kernel substitution doesn't silently
+   regress the loss surface).
+2. **Re-ID rank-1 accuracy within 1 percentage point** of the dense
+   baseline on the held-out test split.
+3. **Spearman rank correlation ≥ 0.95** between dense-MHA and
+   sparse-GQA top-50 nearest-neighbour orderings on the
+   `env_fingerprint` and `person_track` HNSW indices (matches the
+   ADR-024 §2.5.3 quantisation-rank-preservation criterion).
+4. **Latency improvement ≥ 5×** at the deployed window length.
+
+Any of (1)–(3) failing rolls back the default; the kernel can stay
+in the codebase as opt-in, but is not what new training runs use.
+
+---
+
+## 6. Alternatives considered
+
+| Option | Why rejected |
+|--------|--------------|
+| **Keep dense MHA, period** | Simple, but caps the practical window length. The 10 s + windows that long-context re-ID and breathing-quality scoring want are exactly where dense MHA hurts. We'd be locking in a ceiling for no reason. |
+| **Use `ruvector-attention` 2.0.4 (already in workspace)** | It's what we use today for antenna and spatial attention. But it lacks GQA, lacks streaming KV cache, and its dependency story upstream is messy (`ruvector-attn-mincut` is stuck at 2.0.4 per the issue). It works, but it's not the right tool for *temporal* attention specifically. |
+| **Wait for `ruvector-attention 2.x` to add GQA + KV cache** | Speculative; no published roadmap. Meanwhile `ruvllm_sparse_attention` shipped real artifacts on 2026-05-07 and is path-vendorable today. |
+| **Use a non-attention temporal pooler (TCN / S4 / Mamba)** | All three are real options for time-series sensing; some research gives them a slight edge on long-horizon dependencies. But (a) we already have AETHER specified around attention in ADR-024, (b) the contrastive recipe is attention-tuned, (c) we'd be re-running the entire ADR-024 training story to swap to a different family. Switching to *sparse* attention preserves the ADR-024 mathematical apparatus exactly. |
+| **`forward_gated_with_fastgrnn` immediately** | Tempting because it's the O(N) path. But the gate adds approximation error on top of the sparsity-induced approximation error. Phase the introductions: prove sparse-GQA matches dense first, then layer the gate on top in a follow-up. |
+
+---
+
+## 7. Consequences
+
+### Positive
+
+- **Long windows are no longer scary.** `window_frames = 1000` for
+  10 s sessions becomes practical, not aspirational.
+- **Streaming re-ID gets a structural speedup.** Per-frame decode
+  cost goes from O(N²) to O(log T). Pose tracker cost is a real
+  budget today; this shrinks it.
+- **GQA fits the AETHER backbone better.** AETHER's per-keypoint
+  cross-attention already has a query/key shape mismatch (17
+  keypoint queries vs N CSI keys). GQA was designed for exactly
+  this asymmetry.
+- **Path-vendored, not crates.io-coupled.** No bind-time risk —
+  the crate ships from the vendored copy of upstream, and the
+  vendor was synced today (`e38347601`).
+- **Same kernel, two consumers.** ADR-095 wants this on the MCU;
+  this ADR wants it on the host. Path-vendoring once keeps the
+  versions in lockstep.
+- **Approximation error is bounded** by the local window +
+  log-stride + landmark pattern. Upstream's measurement (`README.md`
+  §FAQ) is "<1% perplexity on standard benchmarks" for the
+  causal case; we measure ours via §5's gate.
+
+### Negative
+
+- **Adds a workspace dependency** the team has to know about.
+  Mitigated by path-vendoring (no version-resolution risk).
+- **Approximation error is not zero.** For high-precision re-ID
+  this needs measurement. §5's gate is the safety net; if rank
+  correlation drops below 0.95 we don't flip the default.
+- **More moving parts in the temporal head.** Dense MHA has one
+  knob (number of heads). Sparse GQA has window, log-stride,
+  landmark block size, KV head count, and (later) gate top-K. We
+  pay this in default-config tuning effort.
+- **`KvCache` introduces session state** in a place that didn't
+  have it. Code that previously called a stateless `forward(...)`
+  now has to think about cache lifetime per tracked person. The
+  pose tracker (`pose_tracker.rs`) already has per-track state, so
+  the natural place for the cache is inside `PoseTrack`; needs a
+  small lifecycle review.
+- **Training and inference paths diverge slightly.** Training
+  always uses `forward` (full window prefill). Inference uses
+  `decode_step` for streaming. The two paths must be tested
+  separately; upstream's `forward` and `decode_step` are unit-test
+  parity-checked, but our wrapper has its own surface.
+
+### Neutral
+
+- ADR-024 is **not superseded.** The contrastive loss, the
+  augmentation strategy, the projection head, the HNSW indices —
+  all unchanged. This ADR makes a single architectural choice
+  inside ADR-024's "temporal aggregation" black box.
+- ADR-016 (RuVector training pipeline integration) is unaffected.
+  The other RuVector crates (`mincut`, `attn-mincut`,
+  `temporal-tensor`, `solver`, `attention`) keep their existing
+  roles in `model.rs`.
+
+---
+
+## 8. Open questions
+
+1. **What is the AETHER temporal head's actual current
+   architecture in code?** ADR-024 specifies the projection head
+   precisely (Linear → BN → ReLU → Linear → L2-norm) but the
+   *temporal aggregation* before that is not pinned. The closest
+   thing in `model.rs` today is `apply_antenna_attention` and
+   `apply_spatial_attention`, which are over antenna and spatial
+   axes, not the temporal axis. So this ADR is, in practice,
+   choosing the temporal kernel for the *first time* — not
+   replacing one. Worth confirming with the maintainer before the
+   implementation PR uses language like "swap" rather than "add".
+2. **What window length is the deployed AETHER tracker using
+   today?** The training default is 100 frames (`config.rs:165`),
+   but `proof.rs` uses 4 and `trainer.rs` uses 2. The realistic
+   deployment number determines how much of the §3.1 quantitative
+   argument is *currently* operative versus *future-state*. If the
+   answer is "we run AETHER on 4-frame windows", sparse pays
+   nothing today, and the case for this ADR rests entirely on the
+   long-window roadmap. If 100 or more, sparse already pays.
+3. **Is `FastGrnnGate` worth enabling for re-ID specifically?**
+   Probably not — re-ID benefits from full-sequence visibility,
+   and the gate's job is to *prune* long-range candidates. Save
+   the gate for activity classification (where transient movement
+   is the signal of interest, and saliency-based pruning matches
+   the use case). Confirm via §5's accuracy gate when we get there.
+4. **Does the cross-modal alignment loss (ADR-024 §2.2.4) need
+   any change?** The cross-modal loss operates on pooled
+   `z_csi` (already temporally aggregated) and pooled `z_pose`. As
+   long as the temporal aggregator returns a comparable pooled
+   vector, the loss is kernel-agnostic. Likely no change, but
+   worth a smoke test.
+5. **Where does the KV cache live for re-ID?** Per `pose_tracker.rs`,
+   each `PoseTrack` already has lifecycle (create / update /
+   evict). The natural place is `PoseTrack::kv_cache:
+   Option<KvCache>`, populated when the track first emits an
+   embedding. Eviction policy ties to `track.last_seen` — when
+   the track is dropped, drop the cache. Spec-level sanity check
+   only; needs a real design pass in the implementation PR.
+
+---
+
+## 9. Acceptance criteria
+
+This ADR is **Accepted** once:
+
+1. Maintainer review on #513 confirms the architecture and resolves
+   §8.1 (the "first-time choice vs replacement" framing).
+2. Open question §8.2 has a concrete answer (ideally a one-line
+   pointer to the production training config).
+3. The follow-up implementation issue is filed.
+
+This ADR is **Implemented** once:
+
+1. `wifi-densepose-temporal` (or equivalent) ships in the workspace
+   with a default-off feature flag exposing both dense and
+   sparse-GQA backends.
+2. §5's four-gate validation has run on the most recent AETHER
+   checkpoint and the result is published (witness-bundle
+   compatible per ADR-028 if the run is reproducible).
+3. The default for new training runs is `sparse_gqa`, with `dense`
+   still selectable for back-compat.
+
+---
+
+## 10. Related
+
+ADR-014 (signal SOTA), ADR-016 (RuVector training pipeline
+integration), ADR-024 (AETHER contrastive CSI embedding — this
+ADR fills in its temporal-aggregation black box), ADR-095
+(on-ESP32-S3 temporal modeling — same crate, different consumer),
+upstream ADR-189 (KV cache incremental decode — the basis for
+streaming re-ID), upstream ADR-190 (GQA / MQA — what AETHER's 17
+keypoint queries × N CSI keys asymmetry naturally maps onto),
+upstream ADR-192 (no_std + alloc support — the structural change
+that means the *same* kernel runs both on the host here and on
+the MCU under ADR-095).
@@ -0,0 +1,10 @@
+# Per-component cargo config so `cargo build` picks the xtensa target
+# without the caller having to remember `--target xtensa-esp32s3-none-elf`.
+# CMakeLists.txt still passes --target explicitly for clarity.
+
+[build]
+target = "xtensa-esp32s3-none-elf"
+
+# The esp toolchain ships precompiled core and alloc for
+# xtensa-esp32s3-none-elf, so build-std is unnecessary and (as of the
+# 2025-09-16 esp nightly) actively broken on portable_simd.
@@ -0,0 +1,49 @@
+# ESP-IDF component manifest for the ruv_temporal Rust staticlib (ADR-095).
+#
+# Build flow:
+#   - When CONFIG_CSI_TEMPORAL_HEAD_ENABLED is OFF (default): register an
+#     empty stub. main/temporal_task.c compiles the no-op shim path, no
+#     cargo, no Rust toolchain dependency. Default firmware build is
+#     unaffected.
+#   - When CONFIG_CSI_TEMPORAL_HEAD_ENABLED is ON: invoke
+#     `cargo +esp build --release --target xtensa-esp32s3-none-elf`,
+#     register the resulting libruv_temporal.a, and expose include/.
+#
+# add_custom_command is intentionally placed AFTER idf_component_register
+# because ESP-IDF runs every component's CMakeLists.txt twice — once in
+# script mode for dependency discovery (where add_custom_command is
+# forbidden), and once for the actual build.
+
+if(NOT CONFIG_CSI_TEMPORAL_HEAD_ENABLED)
+    # Feature disabled — register an empty component so the directory's
+    # mere existence doesn't break the build, but do NOT invoke cargo
+    # or pull include/ onto consumers' include paths (the C ABI header
+    # would advertise capabilities we cannot honour).
+    idf_component_register()
+    return()
+endif()
+
+set(RUV_TEMPORAL_DIR "${CMAKE_CURRENT_SOURCE_DIR}")
+set(RUV_TEMPORAL_TARGET "xtensa-esp32s3-none-elf")
+set(RUV_TEMPORAL_PROFILE "release")
+set(RUV_TEMPORAL_LIB
+    "${RUV_TEMPORAL_DIR}/target/${RUV_TEMPORAL_TARGET}/${RUV_TEMPORAL_PROFILE}/libruv_temporal.a")
+
+idf_component_register(
+    SRCS "shim.c"
+    INCLUDE_DIRS "include"
+    PRIV_REQUIRES "esp_common"
+)
+
+# Custom command + target run only at build time, not in script mode.
+add_custom_command(
+    OUTPUT "${RUV_TEMPORAL_LIB}"
+    WORKING_DIRECTORY "${RUV_TEMPORAL_DIR}"
+    COMMAND cargo +esp build --release --target ${RUV_TEMPORAL_TARGET}
+    COMMENT "Building ruv_temporal Rust staticlib for ${RUV_TEMPORAL_TARGET}"
+    VERBATIM
+)
+add_custom_target(ruv_temporal_rust_build ALL DEPENDS "${RUV_TEMPORAL_LIB}")
+
+add_dependencies(${COMPONENT_LIB} ruv_temporal_rust_build)
+target_link_libraries(${COMPONENT_LIB} INTERFACE "${RUV_TEMPORAL_LIB}")
@@ -0,0 +1,218 @@
+# This file is automatically @generated by Cargo.
+# It is not intended for manual editing.
+version = 4
+
+[[package]]
+name = "allocator-api2"
+version = "0.3.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "c583acf993cf4245c4acb0a2cc2ab1f9cc097de73411bb6d3647ff6af2b1013d"
+
+[[package]]
+name = "cfg-if"
+version = "1.0.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9330f8b2ff13f34540b44e946ef35111825727b38d33286ef986142615121801"
+
+[[package]]
+name = "critical-section"
+version = "1.2.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "790eea4361631c5e7d22598ecd5723ff611904e3344ce8720784c93e3d83d40b"
+
+[[package]]
+name = "crunchy"
+version = "0.2.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "460fbee9c2c2f33933d720630a6a0bac33ba7053db5344fac858d4b8952d77d5"
+
+[[package]]
+name = "darling"
+version = "0.21.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9cdf337090841a411e2a7f3deb9187445851f91b309c0c0a29e05f74a00a48c0"
+dependencies = [
+ "darling_core",
+ "darling_macro",
+]
+
+[[package]]
+name = "darling_core"
+version = "0.21.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "1247195ecd7e3c85f83c8d2a366e4210d588e802133e1e355180a9870b517ea4"
+dependencies = [
+ "fnv",
+ "ident_case",
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
+[[package]]
+name = "darling_macro"
+version = "0.21.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "d38308df82d1080de0afee5d069fa14b0326a88c14f15c5ccda35b4a6c414c81"
+dependencies = [
+ "darling_core",
+ "quote",
+ "syn",
+]
+
+[[package]]
+name = "document-features"
+version = "0.2.12"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "d4b8a88685455ed29a21542a33abd9cb6510b6b129abadabdcef0f4c55bc8f61"
+dependencies = [
+ "litrs",
+]
+
+[[package]]
+name = "enumset"
+version = "1.1.12"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "7f96a4a12fe60ac746ae295a1a4ecb5bb02debc20856506c8635288065f142de"
+dependencies = [
+ "enumset_derive",
+]
+
+[[package]]
+name = "enumset_derive"
+version = "0.15.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "4bd536557b58c682b217b8fb199afdff47cd3eff260623f19e77074eb073d63a"
+dependencies = [
+ "darling",
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
+[[package]]
+name = "esp-alloc"
+version = "0.8.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "7e95f1de57ce5a6600368f3d3c931b0dfe00501661e96f5ab83bc5cdee031784"
+dependencies = [
+ "allocator-api2",
+ "cfg-if",
+ "critical-section",
+ "document-features",
+ "enumset",
+ "linked_list_allocator",
+]
+
+[[package]]
+name = "fnv"
+version = "1.0.7"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "3f9eec918d3f24069decb9af1554cad7c880e2da24a9afd88aca000531ab82c1"
+
+[[package]]
+name = "half"
+version = "2.7.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "6ea2d84b969582b4b1864a92dc5d27cd2b77b622a8d79306834f1be5ba20d84b"
+dependencies = [
+ "cfg-if",
+ "crunchy",
+ "zerocopy",
+]
+
+[[package]]
+name = "ident_case"
+version = "1.0.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b9e0384b61958566e926dc50660321d12159025e767c18e043daf26b70104c39"
+
+[[package]]
+name = "libm"
+version = "0.2.16"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b6d2cec3eae94f9f509c767b45932f1ada8350c4bdb85af2fcab4a3c14807981"
+
+[[package]]
+name = "linked_list_allocator"
+version = "0.10.6"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "2b23ac50abb8261cb38c6e2a7192d3302e0836dac1628f6a93b82b4fad185897"
+
+[[package]]
+name = "litrs"
+version = "1.0.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "11d3d7f243d5c5a8b9bb5d6dd2b1602c0cb0b9db1621bafc7ed66e35ff9fe092"
+
+[[package]]
+name = "proc-macro2"
+version = "1.0.106"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "8fd00f0bb2e90d81d1044c2b32617f68fcb9fa3bb7640c23e9c748e53fb30934"
+dependencies = [
+ "unicode-ident",
+]
+
+[[package]]
+name = "quote"
+version = "1.0.45"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "41f2619966050689382d2b44f664f4bc593e129785a36d6ee376ddf37259b924"
+dependencies = [
+ "proc-macro2",
+]
+
+[[package]]
+name = "ruv_temporal"
+version = "0.1.0"
+dependencies = [
+ "critical-section",
+ "esp-alloc",
+ "ruvllm_sparse_attention",
+]
+
+[[package]]
+name = "ruvllm_sparse_attention"
+version = "0.1.1"
+dependencies = [
+ "half",
+ "libm",
+]
+
+[[package]]
+name = "syn"
+version = "2.0.117"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e665b8803e7b1d2a727f4023456bbbbe74da67099c585258af0ad9c5013b9b99"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "unicode-ident",
+]
+
+[[package]]
+name = "unicode-ident"
+version = "1.0.24"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e6e4313cd5fcd3dad5cafa179702e2b244f760991f45397d14d4ebf38247da75"
+
+[[package]]
+name = "zerocopy"
+version = "0.8.48"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "eed437bf9d6692032087e337407a86f04cd8d6a16a37199ed57949d415bd68e9"
+dependencies = [
+ "zerocopy-derive",
+]
+
+[[package]]
+name = "zerocopy-derive"
+version = "0.8.48"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "70e3cd084b1788766f53af483dd21f93881ff30d7320490ec3ef7526d203bad4"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
@@ -0,0 +1,35 @@
+[package]
+name = "ruv_temporal"
+version = "0.1.0"
+edition = "2021"
+license = "MIT"
+description = "ESP32-S3 on-device temporal head for WiFi-DensePose (ADR-095, #513)"
+publish = false
+
+[lib]
+crate-type = ["staticlib"]
+name = "ruv_temporal"
+
+# Don't get pulled into the v2 workspace — this crate cross-compiles to
+# xtensa-esp32s3-none-elf, the workspace targets host x86_64.
+[workspace]
+
+[dependencies]
+ruvllm_sparse_attention = { path = "../../../../vendor/ruvector/crates/ruvllm_sparse_attention", default-features = false, features = ["fp16"] }
+
+# Minimal no_std + alloc plumbing. esp-alloc supplies a GlobalAlloc that
+# punches through to ESP-IDF's heap_caps_malloc; critical-section provides
+# the lock primitive linked_list_allocator wants on no_std targets.
+esp-alloc = "0.8"
+critical-section = "1"
+
+[profile.release]
+opt-level = "s"
+lto = true
+codegen-units = 1
+panic = "abort"
+strip = true
+
+[profile.dev]
+opt-level = 1
+panic = "abort"
@@ -0,0 +1,86 @@
+# `ruv_temporal` — ESP32-S3 on-device temporal head
+
+ESP-IDF component implementing ADR-095 (#513). The Rust staticlib at
+`src/lib.rs` wraps `ruvllm_sparse_attention` (vendored at
+`vendor/ruvector/crates/ruvllm_sparse_attention`) and exposes a narrow
+C ABI declared in `include/ruv_temporal.h`.
+
+## Status
+
+| Phase | Scope | State |
+|-------|-------|-------|
+| 4 — Scaffold | Cargo.toml, src/{lib.rs,window.rs,weights.rs}, include/ruv_temporal.h, CMakeLists.txt, .cargo/config.toml | **Done.** |
+| 5 — Cross-compile | `cargo +esp build --release --target xtensa-esp32s3-none-elf` produces `libruv_temporal.a`. | **Blocked** — see below. |
+| 6 — Wire from edge_processing.c | FreeRTOS task on Core 1, queue from adaptive_controller fast loop, push() in fast tick, classify() at 1 Hz, emit `0xC5110007` packet. | **Done** in `main/temporal_task.c` (no-op shim path verified by 8MB firmware build with feature off). |
+| 7 — COM8 validation | Flash 8MB build with `CONFIG_CSI_TEMPORAL_HEAD_ENABLED=y`, soak ≥5 min, check no Tmr Svc / task_wdt overflow. | Pending board reattach. |
+
+## Module map
+
+| File | Purpose |
+|------|---------|
+| `src/lib.rs` | C ABI: `ruv_temporal_init / push / classify / destroy / kernel_self_test` |
+| `src/window.rs` | `FrameRing` rolling buffer used by `ruv_temporal_push` |
+| `src/weights.rs` | Loader-side mirror of host `wifi_densepose_temporal::weights`. Parses the `.rvne` blob format (magic `RVNE`, version 1, FP32/FP16, CRC32-IEEE). Bit-exact with the host crate; a blob produced by the host's `WeightBlob::serialize()` parses here byte-for-byte. |
+| `include/ruv_temporal.h` | Public C header consumed by `main/temporal_task.c` |
+| `shim.c` | Empty C shim for `idf_component_register` |
+
+## Phase 5 blocker — esp toolchain rust-src bug
+
+The system esp toolchain at `C:\Users\ruv\.rustup\toolchains\esp` has
+no precompiled `core` for `xtensa-esp32s3-none-elf`. It requires
+`-Z build-std=core,alloc`, but the bundled rust-src snapshot
+(`esp` channel, nightly 2025-09-16) hits two known bugs when build-std
+compiles `core`:
+
+1. `library/portable-simd/crates/core_simd/src/simd/ptr/mut_ptr.rs` —
+   `Copy` trait and `size_of` not in scope, ~16,000 errors.
+2. `library/core` itself — "cannot resolve a prelude import",
+   "attributes starting with `rustc` are reserved", `concat!` macro
+   not found.
+
+These are upstream Rust nightly snapshot regressions, not anything
+this component is doing wrong. The fix is to refresh the esp toolchain
+to a newer nightly:
+
+```powershell
+C:/Users/ruv/.cargo/bin/espup.exe install
+# (re-source export-esp.ps1 / export-esp.sh after install)
+```
+
+`espup install` pulls the latest pinned esp Rust + LLVM. It is a
+~1.5 GB download and ~5-10 min install. That step lands in the next
+loop iteration of #513 implementation work.
+
+## Build (once Phase 5 unblocks)
+
+From this directory:
+
+```bash
+cargo +esp build --release --target xtensa-esp32s3-none-elf
+```
+
+Output:
+`target/xtensa-esp32s3-none-elf/release/libruv_temporal.a`.
+
+ESP-IDF's `idf.py build` will pick this up via `CMakeLists.txt` —
+`add_custom_command` runs the cargo build before
+`idf_component_register` consumes the static library.
+
+## C ABI summary
+
+```c
+esp_err_t ruv_temporal_init(const uint8_t *weights, size_t wlen,
+                            uint32_t input_dim, uint32_t window_len,
+                            uint32_t n_classes,
+                            ruv_temporal_ctx_t **out_ctx);
+esp_err_t ruv_temporal_push(ruv_temporal_ctx_t *ctx, const float *frame);
+esp_err_t ruv_temporal_classify(ruv_temporal_ctx_t *ctx,
+                                float *logits, uint32_t n_classes);
+void      ruv_temporal_destroy(ruv_temporal_ctx_t *ctx);
+esp_err_t ruv_temporal_kernel_self_test(void);
+```
+
+Threading: caller is responsible. Per ADR-095 §3.3, the firmware will
+spawn a single dedicated FreeRTOS task that owns the context and
+serialises all calls — push() and classify() are not internally
+synchronised.
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: MIT
+ *
+ * ESP32-S3 on-device temporal head — public C ABI (ADR-095, #513).
+ *
+ * Consumed by edge_processing.c / adaptive_controller.c. Backed by a
+ * Rust staticlib that wraps `ruvllm_sparse_attention`. See
+ * components/ruv_temporal/src/lib.rs for the implementation.
+ *
+ * Threading: NOT internally synchronised. Per ADR-095 §3.3 callers run
+ * a single dedicated FreeRTOS task that owns the context and
+ * serialises push() and classify(). init() and destroy() are NOT safe
+ * against concurrent push/classify on the same handle.
+ */
+
+#pragma once
+
+#include <stddef.h>
+#include <stdint.h>
+#include "esp_err.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+typedef struct RuvTemporalCtx ruv_temporal_ctx_t;
+
+/* Allocate a temporal-head context.
+ *
+ * weights      — flat-buffer of model weights (Phase 5 wires the format),
+ *                may be NULL during Phase 4 scaffolding.
+ * weights_len  — bytes of `weights`, 0 if weights is NULL.
+ * input_dim    — feature dimension per frame (e.g. 60 for rv_feature_state_t).
+ * window_len   — number of frames in the rolling window (e.g. 256).
+ * n_classes    — output logit count (e.g. 4 for gesture, 3 for fall).
+ * out_ctx      — receives the new context pointer on ESP_OK.
+ *
+ * Returns ESP_OK on success, ESP_ERR_INVALID_ARG for null/zero inputs,
+ * ESP_ERR_NO_MEM if buffer allocation fails.
+ */
+esp_err_t ruv_temporal_init(const uint8_t *weights,
+                            size_t weights_len,
+                            uint32_t input_dim,
+                            uint32_t window_len,
+                            uint32_t n_classes,
+                            ruv_temporal_ctx_t **out_ctx);
+
+/* Push one feature frame into the rolling window. Hot path — cheap,
+ * no allocation. `frame` must point to at least `input_dim` floats.
+ */
+esp_err_t ruv_temporal_push(ruv_temporal_ctx_t *ctx, const float *frame);
+
+/* Run the temporal-head forward and write `n_classes` class logits
+ * into the caller-owned `logits` buffer (must be at least n_classes
+ * floats). `n_classes` must match the value passed to init().
+ */
+esp_err_t ruv_temporal_classify(ruv_temporal_ctx_t *ctx,
+                                float *logits,
+                                uint32_t n_classes);
+
+/* Release a context allocated by ruv_temporal_init. Safe on NULL. */
+void ruv_temporal_destroy(ruv_temporal_ctx_t *ctx);
+
+/* Self-test — proves the upstream sparse-attention kernel links and
+ * runs. Returns ESP_OK on success. Useful as a smoke check on first
+ * boot before allocating a real context.
+ */
+esp_err_t ruv_temporal_kernel_self_test(void);
+
+#ifdef __cplusplus
+}
+#endif
@@ -0,0 +1,6 @@
+# Pin to the esp toolchain so casual `cargo build` (without +esp) lands
+# on the xtensa-capable rustc/cargo. Per ADR-095, espup must be
+# installed on every developer machine and CI runner.
+
+[toolchain]
+channel = "esp"
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: MIT
+ *
+ * Minimal C shim so ESP-IDF's idf_component_register has a SRCS file.
+ * The real C ABI lives in src/lib.rs (Rust staticlib) and is exposed
+ * through include/ruv_temporal.h.
+ *
+ * Intentionally empty — do not put logic here.
+ */
+
+#include "ruv_temporal.h"
@@ -0,0 +1,242 @@
+// On-ESP32-S3 temporal head — C ABI for the ESP-IDF firmware (ADR-095, #513).
+//
+// This crate is `staticlib` no_std + alloc. It is compiled to
+// `xtensa-esp32s3-none-elf` and linked into the firmware via the ESP-IDF
+// component glue in CMakeLists.txt. The host-side analog
+// (`wifi-densepose-temporal`) tracks ADR-096; the two crates intentionally
+// share the same `ruvllm_sparse_attention` kernel so behaviour is identical
+// across host and node.
+//
+// Status (Phase 4 of #513): C ABI surface + ring buffer scaffold.
+//   - `ruv_temporal_init`        ✓ scaffolded
+//   - `ruv_temporal_push`        ✓ scaffolded (writes to ring buffer)
+//   - `ruv_temporal_classify`    ✓ scaffolded (kernel forward stub)
+//   - `ruv_temporal_destroy`     ✓ scaffolded
+//
+// Phase 5 wires real weights, panic_handler, and the global allocator to
+// ESP-IDF's heap. Phase 6 wires the ABI calls from edge_processing.c into
+// a dedicated FreeRTOS task per ADR-095 §3.3.
+
+#![no_std]
+#![no_main]
+extern crate alloc;
+
+use alloc::boxed::Box;
+use core::ffi::c_void;
+
+mod weights;
+mod window;
+use weights::{WeightBlobView, WeightLoadError};
+use window::FrameRing;
+
+// ---- ESP-IDF compatible error codes ---------------------------------------
+//
+// Matches the `esp_err_t` typedef in `esp_err.h`. We don't need the full
+// set — these four cover the contract advertised in ruv_temporal.h.
+
+const ESP_OK: i32 = 0;
+const ESP_FAIL: i32 = -1;
+const ESP_ERR_INVALID_ARG: i32 = 0x102;
+const ESP_ERR_NO_MEM: i32 = 0x101;
+
+// ---- Allocator ------------------------------------------------------------
+//
+// esp-alloc punches through to ESP-IDF's heap_caps_malloc. The ESP-IDF
+// runtime calls `esp_alloc::HEAP.add_region(...)` from C startup before
+// the first Rust allocation; without that wiring we'd hit OOM on the
+// first Vec push. That wiring lands in Phase 5 along with the rest of
+// the firmware-side glue.
+#[global_allocator]
+static ALLOCATOR: esp_alloc::EspHeap = esp_alloc::EspHeap::empty();
+
+// ---- Panic handler --------------------------------------------------------
+//
+// Production firmware would route to ESP-IDF's `esp_system_abort` so the
+// crash shows up in core dumps. For Phase 4 scaffolding we simply halt —
+// keeps the staticlib self-contained without dragging in `esp-idf-sys`.
+
+#[panic_handler]
+fn on_panic(_info: &core::panic::PanicInfo) -> ! {
+    loop {
+        // wait-for-interrupt would be nicer; this is fine until Phase 5
+        // hooks into esp_system_abort.
+    }
+}
+
+// ---- Context object (opaque to C callers) ---------------------------------
+
+pub struct RuvTemporalCtx {
+    input_dim: u32,
+    window_len: u32,
+    n_classes: u32,
+    ring: FrameRing,
+}
+
+// ---- Public C ABI ---------------------------------------------------------
+
+/// Initialise a temporal-head context. Allocates and returns an opaque
+/// pointer through `out_ctx`. Returns ESP_OK on success, an esp_err_t on
+/// failure. Caller must release with `ruv_temporal_destroy`.
+#[no_mangle]
+pub extern "C" fn ruv_temporal_init(
+    weights: *const u8,
+    weights_len: usize,
+    input_dim: u32,
+    window_len: u32,
+    n_classes: u32,
+    out_ctx: *mut *mut RuvTemporalCtx,
+) -> i32 {
+    if out_ctx.is_null() || input_dim == 0 || window_len == 0 || n_classes == 0 {
+        return ESP_ERR_INVALID_ARG;
+    }
+
+    // Optional weights blob: when caller passes a non-NULL pointer,
+    // parse and validate it. Caller can pass NULL during the Phase 4/5
+    // bring-up window when the kernel forward isn't actually consuming
+    // weights yet — we just want the parse path itself proven on the
+    // device. Once Phase 5 unblocks and the kernel is wired, Phase 6
+    // makes a non-NULL weights argument required.
+    if !weights.is_null() && weights_len > 0 {
+        // SAFETY: caller asserts the buffer covers `weights_len` bytes
+        // and outlives this call. Borrowed-slice parse — no copy.
+        let buf = unsafe { core::slice::from_raw_parts(weights, weights_len) };
+        match WeightBlobView::parse(buf) {
+            Ok(view) => {
+                // Sanity-check that the blob's declared shape matches
+                // the runtime arguments. A blob with input_dim=32 in
+                // a context configured for input_dim=16 is a deploy bug
+                // we want to catch at init() not at first classify().
+                if view.header.input_dim as u32 != input_dim
+                    || view.header.n_classes as u32 != n_classes
+                {
+                    return ESP_ERR_INVALID_ARG;
+                }
+                // Phase 5+: stash view into the context for the kernel
+                // to consume. For now the parse itself is the proof
+                // that the format crossed the host/firmware boundary.
+            }
+            Err(e) => return weights::weight_load_err_to_esp(&e),
+        }
+    }
+
+    let ring = match FrameRing::new(window_len as usize, input_dim as usize) {
+        Some(r) => r,
+        None => return ESP_ERR_NO_MEM,
+    };
+
+    let ctx = Box::new(RuvTemporalCtx {
+        input_dim,
+        window_len,
+        n_classes,
+        ring,
+    });
+    unsafe { *out_ctx = Box::into_raw(ctx) };
+    ESP_OK
+}
+
+/// Push one feature frame into the rolling window. Hot path — must stay
+/// cheap (no allocation, no kernel work).
+#[no_mangle]
+pub extern "C" fn ruv_temporal_push(ctx: *mut RuvTemporalCtx, frame: *const f32) -> i32 {
+    if ctx.is_null() || frame.is_null() {
+        return ESP_ERR_INVALID_ARG;
+    }
+    let ctx = unsafe { &mut *ctx };
+    let slice = unsafe { core::slice::from_raw_parts(frame, ctx.input_dim as usize) };
+    ctx.ring.push(slice);
+    ESP_OK
+}
+
+/// Run the temporal-head forward and write `n_classes` logits into the
+/// caller-owned `logits` buffer. Returns ESP_OK on success.
+///
+/// Phase 4 stub: writes a zero-vector. Phase 5 wires the real
+/// `SubquadraticSparseAttention::forward_gqa` over the ring buffer
+/// contents. The signature is what edge_processing.c will call — that
+/// part of the contract is stable now.
+#[no_mangle]
+pub extern "C" fn ruv_temporal_classify(
+    ctx: *mut RuvTemporalCtx,
+    logits: *mut f32,
+    n_classes: u32,
+) -> i32 {
+    if ctx.is_null() || logits.is_null() {
+        return ESP_ERR_INVALID_ARG;
+    }
+    let ctx = unsafe { &*ctx };
+    if n_classes != ctx.n_classes {
+        return ESP_ERR_INVALID_ARG;
+    }
+    let out = unsafe { core::slice::from_raw_parts_mut(logits, n_classes as usize) };
+    for slot in out.iter_mut() {
+        *slot = 0.0;
+    }
+    let _ = ctx.window_len; // future: feed ring -> attention -> classifier head
+    ESP_OK
+}
+
+/// Release a context allocated by `ruv_temporal_init`.
+#[no_mangle]
+pub extern "C" fn ruv_temporal_destroy(ctx: *mut RuvTemporalCtx) {
+    if ctx.is_null() {
+        return;
+    }
+    unsafe {
+        drop(Box::from_raw(ctx));
+    }
+}
+
+// ---- Static guard ---------------------------------------------------------
+//
+// Force a *use* of the upstream crate so the link line proves the crate is
+// reachable from the staticlib. Without this the compiler may strip the
+// dependency entirely in Phase 4 since classify() doesn't yet call into it.
+#[doc(hidden)]
+#[no_mangle]
+pub extern "C" fn ruv_temporal_kernel_self_test() -> i32 {
+    use ruvllm_sparse_attention::{SparseAttentionConfig, SubquadraticSparseAttention, Tensor3};
+    let cfg = SparseAttentionConfig {
+        window: 4,
+        block_size: 2,
+        global_tokens: alloc::vec![0],
+        causal: true,
+        use_log_stride: true,
+        use_landmarks: true,
+        sort_candidates: false,
+    };
+    if SubquadraticSparseAttention::new(cfg).is_err() {
+        return ESP_FAIL;
+    }
+    let _ = Tensor3::zeros(0, 1, 1);
+    ESP_OK
+}
+
+// Prevent dead-code drop of the C ABI when the linker is aggressive.
+#[used]
+static _ABI_KEEPALIVE: [extern "C" fn(); 5] = [
+    keepalive_init,
+    keepalive_push,
+    keepalive_classify,
+    keepalive_destroy,
+    keepalive_self_test,
+];
+
+extern "C" fn keepalive_init() {
+    let _ = ruv_temporal_init;
+}
+extern "C" fn keepalive_push() {
+    let _ = ruv_temporal_push;
+}
+extern "C" fn keepalive_classify() {
+    let _ = ruv_temporal_classify;
+}
+extern "C" fn keepalive_destroy() {
+    let _ = ruv_temporal_destroy;
+}
+extern "C" fn keepalive_self_test() {
+    let _ = ruv_temporal_kernel_self_test;
+}
+
+// Avoid "unused" warnings on the c_void import while the actual handle
+// type is what callers receive.
+const _: Option<*const c_void> = None;
@@ -0,0 +1,194 @@
+// Firmware-side mirror of `wifi-densepose-temporal::weights`. Same wire
+// format, same magic, same CRC polynomial — a blob produced by the
+// host's `WeightBlob::serialize()` parses here byte-for-byte.
+//
+// no_std + alloc. The host side keeps weights as `Vec<u8>` because it
+// owns the buffer; the firmware loader takes a borrowed `&[u8]` slice
+// (the blob lives in flash via EMBED_FILES, or a heap mmap from NVS,
+// neither of which the loader should re-allocate).
+//
+// Stays *byte-exact* in lockstep with `v2/crates/wifi-densepose-temporal/src/weights.rs`.
+// When the host format changes, this file changes in the same commit
+// and bumps `BLOB_VERSION`; mismatched versions refuse to load.
+
+use core::convert::TryInto;
+use core::fmt;
+
+pub const BLOB_MAGIC: u32 = 0x5256_4E45; // "RVNE"
+pub const BLOB_VERSION: u16 = 1;
+pub const BLOB_HEADER_LEN: usize = 24;
+pub const BLOB_FOOTER_LEN: usize = 4;
+
+#[derive(Clone, Copy, Debug, Eq, PartialEq)]
+pub enum WeightDtype {
+    F32,
+    F16,
+}
+
+#[derive(Clone, Copy, Debug)]
+pub struct WeightBlobHeader {
+    pub dtype: WeightDtype,
+    pub input_dim: u16,
+    pub n_q_heads: u16,
+    pub n_kv_heads: u16,
+    pub head_dim: u16,
+    pub n_layers: u16,
+    pub n_classes: u16,
+}
+
+impl WeightBlobHeader {
+    pub fn elem_bytes(&self) -> usize {
+        match self.dtype {
+            WeightDtype::F32 => 4,
+            WeightDtype::F16 => 2,
+        }
+    }
+
+    fn validate(&self) -> Result<(), WeightLoadError> {
+        if self.input_dim == 0
+            || self.n_q_heads == 0
+            || self.n_kv_heads == 0
+            || self.head_dim == 0
+        {
+            return Err(WeightLoadError::ZeroDim);
+        }
+        if self.n_q_heads % self.n_kv_heads != 0 {
+            return Err(WeightLoadError::InvalidGqaRatio);
+        }
+        if self.n_layers == 0 || self.n_classes < 2 {
+            return Err(WeightLoadError::DegenerateShape);
+        }
+        Ok(())
+    }
+}
+
+/// A parsed view into a weights blob. Holds borrowed slices into the
+/// caller-owned buffer — no allocation, no copy. The firmware's
+/// kernel reads weights directly from this view.
+#[derive(Clone, Copy)]
+pub struct WeightBlobView<'a> {
+    pub header: WeightBlobHeader,
+    pub weights: &'a [u8],
+}
+
+impl<'a> WeightBlobView<'a> {
+    /// Parse a blob, validating magic / version / size / CRC. Returns
+    /// a borrowed view; the input `buf` must outlive the view.
+    pub fn parse(buf: &'a [u8]) -> Result<Self, WeightLoadError> {
+        if buf.len() < BLOB_HEADER_LEN + BLOB_FOOTER_LEN {
+            return Err(WeightLoadError::TooShort);
+        }
+
+        let magic = u32::from_le_bytes(buf[0..4].try_into().unwrap());
+        if magic != BLOB_MAGIC {
+            return Err(WeightLoadError::BadMagic);
+        }
+        let version = u16::from_le_bytes(buf[4..6].try_into().unwrap());
+        if version != BLOB_VERSION {
+            return Err(WeightLoadError::WrongVersion(version));
+        }
+        let flags = buf[6];
+        let dtype = match flags & 0x01 {
+            0 => WeightDtype::F32,
+            _ => WeightDtype::F16,
+        };
+
+        let input_dim = u16::from_le_bytes(buf[8..10].try_into().unwrap());
+        let n_q_heads = u16::from_le_bytes(buf[10..12].try_into().unwrap());
+        let n_kv_heads = u16::from_le_bytes(buf[12..14].try_into().unwrap());
+        let head_dim = u16::from_le_bytes(buf[14..16].try_into().unwrap());
+        let n_layers = u16::from_le_bytes(buf[16..18].try_into().unwrap());
+        let n_classes = u16::from_le_bytes(buf[18..20].try_into().unwrap());
+        let weights_len = u32::from_le_bytes(buf[20..24].try_into().unwrap()) as usize;
+
+        let expected = BLOB_HEADER_LEN + weights_len + BLOB_FOOTER_LEN;
+        if buf.len() != expected {
+            return Err(WeightLoadError::SizeMismatch);
+        }
+
+        let stored_crc = u32::from_le_bytes(buf[buf.len() - 4..].try_into().unwrap());
+        let computed = crc32_ieee(&buf[..buf.len() - 4]);
+        if stored_crc != computed {
+            return Err(WeightLoadError::CrcMismatch);
+        }
+
+        let header = WeightBlobHeader {
+            dtype,
+            input_dim,
+            n_q_heads,
+            n_kv_heads,
+            head_dim,
+            n_layers,
+            n_classes,
+        };
+        header.validate()?;
+
+        let weights_start = BLOB_HEADER_LEN;
+        let weights_end = weights_start + weights_len;
+        Ok(Self {
+            header,
+            weights: &buf[weights_start..weights_end],
+        })
+    }
+}
+
+/// Loader-side error. Distinct from the host-side `TemporalError` so
+/// the firmware can map specific cases to specific `esp_err_t` codes.
+#[derive(Clone, Copy, Debug, Eq, PartialEq)]
+pub enum WeightLoadError {
+    TooShort,
+    BadMagic,
+    WrongVersion(u16),
+    SizeMismatch,
+    CrcMismatch,
+    ZeroDim,
+    InvalidGqaRatio,
+    DegenerateShape,
+}
+
+impl fmt::Display for WeightLoadError {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        match self {
+            Self::TooShort => write!(f, "weight blob too short"),
+            Self::BadMagic => write!(f, "weight blob: bad magic"),
+            Self::WrongVersion(v) => write!(f, "weight blob: unsupported version {}", v),
+            Self::SizeMismatch => write!(f, "weight blob: declared length doesn't match buffer"),
+            Self::CrcMismatch => write!(f, "weight blob: CRC32 mismatch"),
+            Self::ZeroDim => write!(f, "weight blob: zero-valued dimension(s)"),
+            Self::InvalidGqaRatio => write!(f, "weight blob: n_q_heads not divisible by n_kv_heads"),
+            Self::DegenerateShape => write!(f, "weight blob: n_layers=0 or n_classes<2"),
+        }
+    }
+}
+
+/// Map loader errors to esp_err_t-style codes for the C ABI. Defined
+/// here rather than in lib.rs so the mapping stays adjacent to the
+/// error type and can't drift.
+pub const fn weight_load_err_to_esp(err: &WeightLoadError) -> i32 {
+    match err {
+        WeightLoadError::TooShort
+        | WeightLoadError::BadMagic
+        | WeightLoadError::WrongVersion(_)
+        | WeightLoadError::SizeMismatch => 0x102, // ESP_ERR_INVALID_ARG
+        WeightLoadError::CrcMismatch => 0x10C, // ESP_ERR_INVALID_CRC
+        WeightLoadError::ZeroDim
+        | WeightLoadError::InvalidGqaRatio
+        | WeightLoadError::DegenerateShape => 0x103, // ESP_ERR_INVALID_SIZE
+    }
+}
+
+/// Same polynomial as `temporal_task.c::crc32_ieee` and the host-side
+/// `wifi_densepose_temporal::weights::crc32_ieee`. The whole point of
+/// keeping it bit-for-bit identical across all three sites is so a
+/// blob round-trips without re-computing.
+fn crc32_ieee(data: &[u8]) -> u32 {
+    let mut crc = 0xFFFF_FFFFu32;
+    for &b in data {
+        crc ^= b as u32;
+        for _ in 0..8 {
+            let mask = 0u32.wrapping_sub(crc & 1);
+            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
+        }
+    }
+    !crc
+}
@@ -0,0 +1,74 @@
+// Rolling frame buffer for the temporal head input window (ADR-095 §3.2).
+//
+// The hot path (`ruv_temporal_push`) writes one frame per call. The
+// buffer is sized at `init` time; pushes wrap. `classify` reads the
+// most-recent `window_len` frames in chronological order, oldest-first.
+//
+// Allocation policy: one `Vec<f32>` of size `window_len * input_dim`,
+// owned by the context. No per-push allocation — we just memcpy into
+// the next slot.
+
+use alloc::vec;
+use alloc::vec::Vec;
+
+pub struct FrameRing {
+    buf: Vec<f32>,
+    window_len: usize,
+    input_dim: usize,
+    next_write: usize,
+    filled: usize,
+}
+
+impl FrameRing {
+    pub fn new(window_len: usize, input_dim: usize) -> Option<Self> {
+        if window_len == 0 || input_dim == 0 {
+            return None;
+        }
+        let total = window_len.checked_mul(input_dim)?;
+        Some(Self {
+            buf: vec![0.0; total],
+            window_len,
+            input_dim,
+            next_write: 0,
+            filled: 0,
+        })
+    }
+
+    pub fn push(&mut self, frame: &[f32]) {
+        let n = core::cmp::min(frame.len(), self.input_dim);
+        let off = self.next_write * self.input_dim;
+        self.buf[off..off + n].copy_from_slice(&frame[..n]);
+        // Zero-pad tail when the caller's frame is shorter than input_dim.
+        for s in &mut self.buf[off + n..off + self.input_dim] {
+            *s = 0.0;
+        }
+        self.next_write = (self.next_write + 1) % self.window_len;
+        if self.filled < self.window_len {
+            self.filled += 1;
+        }
+    }
+
+    /// Iterate over the buffer in chronological order, oldest-first.
+    /// Yields one slice of `input_dim` floats per call. Used by
+    /// `ruv_temporal_classify` to flatten into the kernel input.
+    pub fn iter_chronological(&self) -> impl Iterator<Item = &[f32]> + '_ {
+        let start = if self.filled < self.window_len {
+            0
+        } else {
+            self.next_write
+        };
+        (0..self.filled).map(move |i| {
+            let row = (start + i) % self.window_len;
+            let off = row * self.input_dim;
+            &self.buf[off..off + self.input_dim]
+        })
+    }
+
+    pub fn len(&self) -> usize {
+        self.filled
+    }
+
+    pub fn capacity(&self) -> usize {
+        self.window_len
+    }
+}
@@ -9,10 +9,19 @@ set(SRCS
    "rv_feature_state.c"
    "rv_mesh.c"
    "adaptive_controller.c"
+    # ADR-095 / #513 — on-device temporal head (no-op shims when CONFIG_CSI_TEMPORAL_HEAD_ENABLED off)
+    "temporal_task.c"
 )

 set(REQUIRES "")

+# ADR-095: link the Rust ruv_temporal staticlib only when the feature is on,
+# so the default firmware build doesn't depend on the (currently blocked)
+# esp Rust toolchain.
+if(CONFIG_CSI_TEMPORAL_HEAD_ENABLED)
+    list(APPEND REQUIRES ruv_temporal)
+endif()
+
 # ADR-061: Mock CSI generator for QEMU testing + ADR-081 mock radio binding
 if(CONFIG_CSI_MOCK_ENABLED)
    list(APPEND SRCS "mock_csi.c" "rv_radio_ops_mock.c")
@@ -323,3 +323,56 @@ menu "Mock CSI (QEMU Testing)"
        depends on CSI_MOCK_ENABLED
        default n
 endmenu
+
+menu "On-device temporal head (ADR-095, #513)"
+
+    config CSI_TEMPORAL_HEAD_ENABLED
+        bool "Enable on-device temporal-head classification"
+        default n
+        help
+            Compiles the ruv_temporal FreeRTOS task that runs a learned
+            transformer-style temporal head over the rv_feature_state
+            stream. Backed by the Rust ruvllm_sparse_attention staticlib
+            in components/ruv_temporal/. Default off — the Rust component
+            requires the esp Rust toolchain (see component README) and
+            adds ~376 KB to the firmware image. Off-board (8 MB) only
+            until the binary delta is measured on real hardware.
+
+    config TEMPORAL_INPUT_DIM
+        int "Input feature dimension"
+        depends on CSI_TEMPORAL_HEAD_ENABLED
+        default 16
+        range 1 256
+        help
+            Per-frame feature dimension fed into the temporal head.
+            16 matches a small projection of rv_feature_state_t; bump
+            after the host-side training crate fixes the model schema.
+
+    config TEMPORAL_WINDOW_LEN
+        int "Rolling window length (frames)"
+        depends on CSI_TEMPORAL_HEAD_ENABLED
+        default 256
+        range 32 1024
+        help
+            Number of feature frames the temporal head reasons over.
+            256 frames at the controller's 5 Hz fast-loop rate is ~50 s.
+
+    config TEMPORAL_N_CLASSES
+        int "Number of output classes"
+        depends on CSI_TEMPORAL_HEAD_ENABLED
+        default 4
+        range 2 16
+        help
+            Number of classification logits the model produces. Must be
+            ≤ TEMPORAL_MAX_LOGITS in temporal_task.c (16).
+
+    config TEMPORAL_CLASSIFY_PERIOD_MS
+        int "Classification cadence (ms)"
+        depends on CSI_TEMPORAL_HEAD_ENABLED
+        default 1000
+        range 100 60000
+        help
+            How often the temporal task runs ruv_temporal_classify and
+            emits a 0xC5110007 packet. Default 1 s.
+
+endmenu
@@ -19,6 +19,7 @@
 #include "edge_processing.h"
 #include "stream_sender.h"
 #include "csi_collector.h"
+#include "temporal_task.h"  /* ADR-095 / #513: on-device temporal head */

 #include <string.h>
 #include "freertos/FreeRTOS.h"
@@ -314,6 +315,18 @@ static void emit_feature_state(void)
    if (sent < 0) {
        ESP_LOGW(TAG, "feature_state emit failed");
    }
+
+    /* ADR-095 / #513: feed the same 9 feature floats into the on-device
+     * temporal head if it is enabled. Non-blocking — drops are logged
+     * by temporal_task itself, never by us. With CONFIG_CSI_TEMPORAL_HEAD_ENABLED
+     * off, this resolves to a single ESP_ERR_NOT_SUPPORTED return. */
+    const float feat[9] = {
+        pkt.motion_score, pkt.presence_score,
+        pkt.respiration_bpm, pkt.respiration_conf,
+        pkt.heartbeat_bpm, pkt.heartbeat_conf,
+        pkt.anomaly_score, pkt.env_shift_score, pkt.node_coherence,
+    };
+    (void)temporal_task_push_frame(feat, 9);
 }

 static void slow_loop_cb(TimerHandle_t t)
@@ -21,6 +21,7 @@

 #include "csi_collector.h"
 #include "stream_sender.h"
+#include "temporal_task.h"  /* ADR-095 / #513 */
 #include "nvs_config.h"
 #include "edge_processing.h"
 #include "ota_update.h"
@@ -310,6 +311,22 @@ void app_main(void)
                 esp_err_to_name(adapt_ret));
    }

+    /* ADR-095 / #513: spin up the on-device temporal head. Returns
+     * ESP_ERR_NOT_SUPPORTED when CONFIG_CSI_TEMPORAL_HEAD_ENABLED is
+     * off — that is the default and not an error. The fast loop
+     * pushes feature frames; the task runs classify at a slower
+     * cadence and emits 0xC5110007 packets. */
+#ifdef CONFIG_CSI_TEMPORAL_HEAD_ENABLED
+    esp_err_t tmp_ret = temporal_task_start(
+        (uint32_t)CONFIG_TEMPORAL_INPUT_DIM,
+        (uint32_t)CONFIG_TEMPORAL_WINDOW_LEN,
+        (uint32_t)CONFIG_TEMPORAL_N_CLASSES);
+    if (tmp_ret != ESP_OK) {
+        ESP_LOGW(TAG, "temporal task init failed: %s",
+                 esp_err_to_name(tmp_ret));
+    }
+#endif
+
    /* Initialize power management. */
    power_mgmt_init(g_nvs_config.power_duty);

@@ -0,0 +1,304 @@
+/**
+ * @file temporal_task.c
+ * @brief ADR-095 / #513 — On-device temporal head FreeRTOS task.
+ *
+ * Owns the only `ruv_temporal_ctx_t` in the firmware. Receives feature
+ * frames from the adaptive_controller fast loop via a FreeRTOS queue,
+ * pushes them into the rolling window, and at ~1 Hz runs a
+ * classification forward through the Rust `ruvllm_sparse_attention`
+ * staticlib (when built — see CONFIG_CSI_TEMPORAL_HEAD_ENABLED).
+ *
+ * The whole file compiles down to no-op shims when the feature is off,
+ * so adaptive_controller.c can call `temporal_task_push_frame()`
+ * unconditionally — the function returns ESP_ERR_NOT_SUPPORTED and
+ * costs one nullable check.
+ */
+
+#include "temporal_task.h"
+
+#include <string.h>
+#include "esp_log.h"
+#include "esp_timer.h"
+#include "sdkconfig.h"
+
+static const char *TAG = "temporal";
+
+#ifdef CONFIG_CSI_TEMPORAL_HEAD_ENABLED
+
+#include "freertos/FreeRTOS.h"
+#include "freertos/queue.h"
+#include "freertos/task.h"
+
+#include "csi_collector.h"  /* node_id */
+#include "stream_sender.h"
+#include "ruv_temporal.h"   /* C ABI from components/ruv_temporal */
+
+/* Queue depth — picked so that the adaptive controller's fast loop
+ * (default 5 Hz) can't overrun the temporal task even if classify()
+ * stalls for ~6 s. Drops beyond that are logged. */
+#define TEMPORAL_QUEUE_DEPTH 32
+
+/* Stack sized per ADR-095 §3.3. The kernel forward + intermediate
+ * tensors are bounded by `forward_flash` tiling, but rv_feature_state
+ * marshalling, logging, and stream_sender_send all share this stack. */
+#define TEMPORAL_TASK_STACK 16384
+
+/* Pinned to Core 1, like edge_dsp. WiFi runs on Core 0 — keep them
+ * apart so the temporal forward doesn't compete with CSI capture. */
+#define TEMPORAL_TASK_CORE 1
+
+/* Classification cadence in milliseconds. 1 Hz is the ADR-095 §3 default. */
+#ifndef CONFIG_TEMPORAL_CLASSIFY_PERIOD_MS
+#define CONFIG_TEMPORAL_CLASSIFY_PERIOD_MS 1000
+#endif
+
+/* Maximum logits buffer — sized to the largest n_classes any of the
+ * ADR-095 §4 use cases needs (anomaly = 2, fall = 3, gesture = 8). */
+#define TEMPORAL_MAX_LOGITS 16
+
+/* ---- Module state ----------------------------------------------------- */
+
+typedef struct {
+    float frame[TEMPORAL_MAX_LOGITS * 8]; /* generous; trimmed via input_dim */
+    uint32_t frame_len;
+} temporal_msg_t;
+
+static QueueHandle_t        s_queue;
+static TaskHandle_t         s_task;
+static ruv_temporal_ctx_t  *s_ctx;
+static uint32_t             s_input_dim;
+static uint32_t             s_window_len;
+static uint32_t             s_n_classes;
+static uint32_t             s_seq;
+static uint32_t             s_drop_count;
+static uint64_t             s_last_drop_log_us;
+
+/* Lightweight CRC32 (IEEE 802.3 polynomial 0xEDB88320), table-free.
+ * Used only for the 36-byte classification packet — speed isn't
+ * critical. Existing firmware has its own CRC32 in csi_collector.c
+ * but we don't link against it from here to keep coupling narrow. */
+static uint32_t crc32_ieee(const uint8_t *data, size_t len)
+{
+    uint32_t crc = 0xFFFFFFFFu;
+    for (size_t i = 0; i < len; i++) {
+        crc ^= data[i];
+        for (int b = 0; b < 8; b++) {
+            uint32_t mask = -(int32_t)(crc & 1u);
+            crc = (crc >> 1) ^ (0xEDB88320u & mask);
+        }
+    }
+    return ~crc;
+}
+
+static void emit_classification(const float *logits, uint32_t n)
+{
+    /* Find argmax + margin in one pass. */
+    uint32_t argmax = 0;
+    float top1 = logits[0];
+    float top2 = -1e30f;
+    for (uint32_t i = 1; i < n; i++) {
+        float v = logits[i];
+        if (v > top1) {
+            top2 = top1;
+            top1 = v;
+            argmax = i;
+        } else if (v > top2) {
+            top2 = v;
+        }
+    }
+
+    rv_temporal_pkt_t pkt;
+    memset(&pkt, 0, sizeof(pkt));
+    pkt.magic = RV_TEMPORAL_PKT_MAGIC;
+    pkt.version = 1;
+    pkt.n_classes = (uint16_t)n;
+    pkt.node_id = csi_collector_get_node_id();
+    pkt.ts_us = (uint64_t)esp_timer_get_time();
+    pkt.seq = ++s_seq;
+    pkt.argmax = (uint8_t)argmax;
+    pkt.top_logit = top1;
+    pkt.top1_minus_top2 = top1 - top2;
+    pkt.crc32 = crc32_ieee((const uint8_t *)&pkt, sizeof(pkt) - sizeof(pkt.crc32));
+
+    int sent = stream_sender_send((const uint8_t *)&pkt, sizeof(pkt));
+    if (sent < 0) {
+        ESP_LOGW(TAG, "classification emit failed");
+    }
+}
+
+static void temporal_task_loop(void *arg)
+{
+    (void)arg;
+    ESP_LOGI(TAG, "temporal task online (window=%u dim=%u classes=%u core=%d)",
+             (unsigned)s_window_len, (unsigned)s_input_dim,
+             (unsigned)s_n_classes, TEMPORAL_TASK_CORE);
+
+    /* Self-test the kernel link before touching real frames. */
+    if (ruv_temporal_kernel_self_test() != ESP_OK) {
+        ESP_LOGE(TAG, "ruv_temporal_kernel_self_test FAILED — temporal head disabled");
+        s_ctx = NULL;
+        vTaskDelete(NULL);
+        return;
+    }
+
+    uint64_t next_classify_us = esp_timer_get_time()
+                              + (uint64_t)CONFIG_TEMPORAL_CLASSIFY_PERIOD_MS * 1000ull;
+    float logits[TEMPORAL_MAX_LOGITS];
+
+    for (;;) {
+        temporal_msg_t msg;
+        /* Block up to 100 ms for a frame, then check if it's time to
+         * classify. This double-poll keeps the cadence honest even
+         * during long quiet periods. */
+        if (xQueueReceive(s_queue, &msg, pdMS_TO_TICKS(100)) == pdTRUE) {
+            if (s_ctx != NULL) {
+                (void)ruv_temporal_push(s_ctx, msg.frame);
+            }
+        }
+
+        uint64_t now_us = esp_timer_get_time();
+        if (now_us >= next_classify_us && s_ctx != NULL) {
+            esp_err_t cret = ruv_temporal_classify(s_ctx, logits, s_n_classes);
+            if (cret == ESP_OK) {
+                emit_classification(logits, s_n_classes);
+            } else {
+                ESP_LOGW(TAG, "classify returned 0x%x", (unsigned)cret);
+            }
+            next_classify_us = now_us
+                             + (uint64_t)CONFIG_TEMPORAL_CLASSIFY_PERIOD_MS * 1000ull;
+        }
+
+        /* Coalesce drop-count logs to once per second so a backlog
+         * doesn't flood the serial console. */
+        if (s_drop_count > 0 && now_us - s_last_drop_log_us > 1000000ull) {
+            ESP_LOGW(TAG, "queue full — dropped %u feature frames",
+                     (unsigned)s_drop_count);
+            s_drop_count = 0;
+            s_last_drop_log_us = now_us;
+        }
+    }
+}
+
+esp_err_t temporal_task_start(uint32_t input_dim,
+                              uint32_t window_len,
+                              uint32_t n_classes)
+{
+    if (s_task != NULL) {
+        return ESP_OK;  /* idempotent */
+    }
+    if (input_dim == 0 || window_len == 0 || n_classes == 0) {
+        return ESP_ERR_INVALID_ARG;
+    }
+    if (n_classes > TEMPORAL_MAX_LOGITS) {
+        ESP_LOGE(TAG, "n_classes=%u exceeds TEMPORAL_MAX_LOGITS=%d",
+                 (unsigned)n_classes, TEMPORAL_MAX_LOGITS);
+        return ESP_ERR_INVALID_SIZE;
+    }
+
+    /* Allocate the kernel context. Phase 4 stub returns ESP_OK without
+     * weights; Phase 5b will accept a real weights blob. */
+    esp_err_t ret = ruv_temporal_init(NULL, 0, input_dim, window_len, n_classes,
+                                      &s_ctx);
+    if (ret != ESP_OK) {
+        ESP_LOGE(TAG, "ruv_temporal_init failed: 0x%x", (unsigned)ret);
+        return ret;
+    }
+
+    s_input_dim = input_dim;
+    s_window_len = window_len;
+    s_n_classes = n_classes;
+    s_seq = 0;
+    s_drop_count = 0;
+    s_last_drop_log_us = 0;
+
+    s_queue = xQueueCreate(TEMPORAL_QUEUE_DEPTH, sizeof(temporal_msg_t));
+    if (s_queue == NULL) {
+        ESP_LOGE(TAG, "queue create failed");
+        ruv_temporal_destroy(s_ctx);
+        s_ctx = NULL;
+        return ESP_ERR_NO_MEM;
+    }
+
+    BaseType_t ok = xTaskCreatePinnedToCore(
+        temporal_task_loop, "ruv_temporal", TEMPORAL_TASK_STACK,
+        NULL, 4 /* priority, below edge_dsp */,
+        &s_task, TEMPORAL_TASK_CORE);
+    if (ok != pdPASS) {
+        ESP_LOGE(TAG, "task create failed");
+        vQueueDelete(s_queue);
+        s_queue = NULL;
+        ruv_temporal_destroy(s_ctx);
+        s_ctx = NULL;
+        return ESP_ERR_NO_MEM;
+    }
+    return ESP_OK;
+}
+
+esp_err_t temporal_task_push_frame(const float *frame, uint32_t frame_len)
+{
+    if (frame == NULL || frame_len == 0) {
+        return ESP_ERR_INVALID_ARG;
+    }
+    if (s_queue == NULL) {
+        return ESP_ERR_NOT_FOUND;
+    }
+    temporal_msg_t msg;
+    uint32_t cap = (uint32_t)(sizeof(msg.frame) / sizeof(msg.frame[0]));
+    uint32_t n = (frame_len < cap) ? frame_len : cap;
+    if (n < s_input_dim) {
+        /* Pad short frames with zeros so the rolling window stays
+         * dimension-stable from the kernel's perspective. */
+        memcpy(msg.frame, frame, n * sizeof(float));
+        memset(&msg.frame[n], 0, (s_input_dim - n) * sizeof(float));
+        msg.frame_len = s_input_dim;
+    } else {
+        memcpy(msg.frame, frame, s_input_dim * sizeof(float));
+        msg.frame_len = s_input_dim;
+    }
+
+    /* Non-blocking — temporal head is best-effort. */
+    if (xQueueSend(s_queue, &msg, 0) != pdPASS) {
+        s_drop_count++;
+        return ESP_ERR_TIMEOUT;
+    }
+    return ESP_OK;
+}
+
+void temporal_task_stop(void)
+{
+    if (s_task != NULL) {
+        vTaskDelete(s_task);
+        s_task = NULL;
+    }
+    if (s_queue != NULL) {
+        vQueueDelete(s_queue);
+        s_queue = NULL;
+    }
+    if (s_ctx != NULL) {
+        ruv_temporal_destroy(s_ctx);
+        s_ctx = NULL;
+    }
+}
+
+#else  /* !CONFIG_CSI_TEMPORAL_HEAD_ENABLED */
+
+esp_err_t temporal_task_start(uint32_t input_dim,
+                              uint32_t window_len,
+                              uint32_t n_classes)
+{
+    (void)input_dim;
+    (void)window_len;
+    (void)n_classes;
+    return ESP_ERR_NOT_SUPPORTED;
+}
+
+esp_err_t temporal_task_push_frame(const float *frame, uint32_t frame_len)
+{
+    (void)frame;
+    (void)frame_len;
+    return ESP_ERR_NOT_SUPPORTED;
+}
+
+void temporal_task_stop(void) {}
+
+#endif /* CONFIG_CSI_TEMPORAL_HEAD_ENABLED */
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: MIT
+ *
+ * temporal_task.h — On-device temporal head FreeRTOS task (ADR-095, #513).
+ *
+ * Owns the lifecycle of the `ruv_temporal_ctx_t` from
+ * components/ruv_temporal/include/ruv_temporal.h. Exposes:
+ *
+ *   1. `temporal_task_start()` — spawn the task with its own 16 KB stack
+ *      pinned to Core 1, allocate a feed queue. Caller (main.c) ignores
+ *      ESP_ERR_NOT_SUPPORTED when CONFIG_CSI_TEMPORAL_HEAD_ENABLED is off.
+ *   2. `temporal_task_push_frame()` — non-blocking enqueue from the
+ *      adaptive_controller fast loop. Drops on full queue (logs once
+ *      per second) — the temporal head is best-effort, the physics-only
+ *      path keeps producing vitals regardless.
+ *   3. `temporal_task_stop()` — cleanly tear down (currently used only
+ *      for tests; production firmware never calls this).
+ *
+ * Thread safety: per ADR-095 §3.3 the temporal task itself is the
+ * single owner of the underlying `ruv_temporal_ctx_t`. Callers
+ * communicate exclusively via the FreeRTOS queue.
+ *
+ * Output: every ~1 s the task runs `ruv_temporal_classify` and emits a
+ * `0xC5110007 RV_TEMPORAL_CLASSIFICATION` packet via stream_sender.
+ */
+
+#pragma once
+
+#include <stdint.h>
+#include "esp_err.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* Magic for the classification packet (ADR-095 §3.5). 0xC5110001..0006
+ * are taken; 0007 is the next free slot. */
+#define RV_TEMPORAL_PKT_MAGIC 0xC5110007u
+
+/* On-the-wire packet for one classification result. Little-endian.
+ * Size: 40 bytes. CRC covers everything before it.
+ *
+ * Field layout (bytes):
+ *   [00..04)  magic              4
+ *   [04..06)  version            2
+ *   [06..08)  n_classes          2
+ *   [08..09)  node_id            1
+ *   [09..0C)  reserved           3
+ *   [0C..14)  ts_us              8
+ *   [14..18)  seq                4
+ *   [18..19)  argmax             1
+ *   [19..1C)  reserved2          3
+ *   [1C..20)  top_logit          4
+ *   [20..24)  top1_minus_top2    4
+ *   [24..28)  crc32              4
+ *   total: 40
+ */
+typedef struct __attribute__((packed)) {
+    uint32_t magic;          /* 0xC5110007 */
+    uint16_t version;        /* 1 */
+    uint16_t n_classes;      /* matches init() value */
+    uint8_t  node_id;        /* csi_collector_get_node_id() */
+    uint8_t  reserved[3];
+    uint64_t ts_us;          /* esp_timer_get_time() at classify */
+    uint32_t seq;             /* monotonic, increments per emit */
+    uint8_t  argmax;          /* highest-logit class */
+    uint8_t  reserved2[3];
+    float    top_logit;       /* logits[argmax] */
+    float    top1_minus_top2; /* margin — useful for downstream gating */
+    uint32_t crc32;
+} rv_temporal_pkt_t;
+
+/* Build-time guard so the wire format never silently changes. */
+_Static_assert(sizeof(rv_temporal_pkt_t) == 40,
+               "rv_temporal_pkt_t must be 40 bytes (ADR-095 §3.5)");
+
+/* Start the temporal task. Returns ESP_ERR_NOT_SUPPORTED when the
+ * feature is compiled out — caller should treat that as a non-error
+ * and continue. Returns ESP_OK on success.
+ *
+ * input_dim   : feature dimension per frame (e.g. 60 for rv_feature_state_t)
+ * window_len  : rolling window in frames (e.g. 256)
+ * n_classes   : number of output logits the model produces (e.g. 4)
+ */
+esp_err_t temporal_task_start(uint32_t input_dim,
+                              uint32_t window_len,
+                              uint32_t n_classes);
+
+/* Non-blocking push from the adaptive_controller fast loop. Returns
+ * ESP_OK on enqueue, ESP_ERR_NOT_FOUND if the task isn't running,
+ * ESP_ERR_TIMEOUT if the queue was full. Never blocks the caller. */
+esp_err_t temporal_task_push_frame(const float *frame, uint32_t frame_len);
+
+/* Optional teardown — currently unit-test only. */
+void temporal_task_stop(void);
+
+#ifdef __cplusplus
+}
+#endif
@@ -231,6 +231,18 @@ dependencies = [
 "wait-timeout",
 ]

+[[package]]
+name = "async-compression"
+version = "0.4.42"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e79b3f8a79cccc2898f31920fc69f304859b3bd567490f75ebf51ae1c792a9ac"
+dependencies = [
+ "compression-codecs",
+ "compression-core",
+ "pin-project-lite",
+ "tokio",
+]
+
 [[package]]
 name = "async-trait"
 version = "0.1.89"
@@ -318,7 +330,7 @@ dependencies = [
 "sync_wrapper 1.0.2",
 "tokio",
 "tokio-tungstenite",
- "tower",
+ "tower 0.5.3",
 "tower-layer",
 "tower-service",
 "tracing",
@@ -871,6 +883,23 @@ dependencies = [
 "memchr",
 ]

+[[package]]
+name = "compression-codecs"
+version = "0.4.38"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ce2548391e9c1929c21bf6aa2680af86fe4c1b33e6cea9ac1cfeec0bd11218cf"
+dependencies = [
+ "compression-core",
+ "flate2",
+ "memchr",
+]
+
+[[package]]
+name = "compression-core"
+version = "0.4.32"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "cc14f565cf027a105f7a44ccf9e5b424348421a1d8952a8fc9d499d313107789"
+
 [[package]]
 name = "concurrent-queue"
 version = "2.5.0"
@@ -2371,6 +2400,16 @@ version = "0.16.1"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "841d1cc9bed7f9236f321df977030373f4a4163ae1a7dbfe1a51a2c1a51d9100"

+[[package]]
+name = "hdrhistogram"
+version = "7.5.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "765c9198f173dd59ce26ff9f95ef0aafd0a0fe01fb9d72841bc5066a4c06511d"
+dependencies = [
+ "byteorder",
+ "num-traits",
+]
+
 [[package]]
 name = "heapless"
 version = "0.6.1"
@@ -3892,13 +3931,35 @@ name = "nvsim"
 version = "0.3.0"
 dependencies = [
 "approx 0.5.1",
+ "criterion",
+ "js-sys",
 "rand 0.8.5",
 "rand_chacha 0.3.1",
 "serde",
+ "serde-wasm-bindgen",
 "serde_json",
 "sha2",
 "thiserror 1.0.69",
 "tracing",
+ "wasm-bindgen",
+]
+
+[[package]]
+name = "nvsim-server"
+version = "0.3.0"
+dependencies = [
+ "axum",
+ "clap",
+ "futures-util",
+ "nvsim",
+ "serde",
+ "serde_json",
+ "thiserror 1.0.69",
+ "tokio",
+ "tower 0.4.13",
+ "tower-http 0.5.2",
+ "tracing",
+ "tracing-subscriber",
 ]

 [[package]]
@@ -4487,6 +4548,26 @@ dependencies = [
 "siphasher 1.0.2",
 ]

+[[package]]
+name = "pin-project"
+version = "1.1.11"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "f1749c7ed4bcaf4c3d0a3efc28538844fb29bcdd7d2b67b2be7e20ba861ff517"
+dependencies = [
+ "pin-project-internal",
+]
+
+[[package]]
+name = "pin-project-internal"
+version = "1.1.11"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "d9b20ed30f105399776b9c883e68e536ef602a16ae6f596d2c473591d6ad64c6"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn 2.0.117",
+]
+
 [[package]]
 name = "pin-project-lite"
 version = "0.2.17"
@@ -5278,7 +5359,7 @@ dependencies = [
 "sync_wrapper 1.0.2",
 "tokio",
 "tokio-native-tls",
- "tower",
+ "tower 0.5.3",
 "tower-http 0.6.8",
 "tower-service",
 "url",
@@ -5311,7 +5392,7 @@ dependencies = [
 "sync_wrapper 1.0.2",
 "tokio",
 "tokio-util",
- "tower",
+ "tower 0.5.3",
 "tower-http 0.6.8",
 "tower-service",
 "url",
@@ -5798,6 +5879,14 @@ version = "2.0.4"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "178f93f84a4a72c582026a45d9b8710acf188df4a22a25434c5dbba1df6c4cac"

+[[package]]
+name = "ruvllm_sparse_attention"
+version = "0.1.1"
+dependencies = [
+ "half",
+ "libm",
+]
+
 [[package]]
 name = "ryu"
 version = "1.0.23"
@@ -7379,6 +7468,27 @@ dependencies = [
 "zip 0.6.6",
 ]

+[[package]]
+name = "tower"
+version = "0.4.13"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b8fa9be0de6cf49e536ce1851f987bd21a43b771b09473c3549a6c853db37c1c"
+dependencies = [
+ "futures-core",
+ "futures-util",
+ "hdrhistogram",
+ "indexmap 1.9.3",
+ "pin-project",
+ "pin-project-lite",
+ "rand 0.8.5",
+ "slab",
+ "tokio",
+ "tokio-util",
+ "tower-layer",
+ "tower-service",
+ "tracing",
+]
+
 [[package]]
 name = "tower"
 version = "0.5.3"
@@ -7401,8 +7511,10 @@ version = "0.5.2"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "1e9cd434a998747dd2c4276bc96ee2e0c7a2eadf3cae88e52be55a05fa9053f5"
 dependencies = [
+ "async-compression",
 "bitflags 2.11.0",
 "bytes",
+ "futures-core",
 "futures-util",
 "http 1.4.0",
 "http-body 1.0.1",
@@ -7433,7 +7545,7 @@ dependencies = [
 "http-body 1.0.1",
 "iri-string",
 "pin-project-lite",
- "tower",
+ "tower 0.5.3",
 "tower-layer",
 "tower-service",
 ]
@@ -8385,6 +8497,7 @@ dependencies = [
 "serde",
 "serde_json",
 "tokio",
+ "tower-http 0.5.2",
 ]

 [[package]]
@@ -8452,6 +8565,15 @@ dependencies = [
 "wifi-densepose-ruvector",
 ]

+[[package]]
+name = "wifi-densepose-temporal"
+version = "0.1.0"
+dependencies = [
+ "approx 0.5.1",
+ "ruvllm_sparse_attention",
+ "thiserror 1.0.69",
+]
+
 [[package]]
 name = "wifi-densepose-train"
 version = "0.3.0"
@@ -16,6 +16,7 @@ members = [
    "crates/wifi-densepose-wifiscan",
    "crates/wifi-densepose-vitals",
    "crates/wifi-densepose-ruvector",
+    "crates/wifi-densepose-temporal",
    "crates/wifi-densepose-desktop",
    "crates/wifi-densepose-pointcloud",
    "crates/wifi-densepose-geo",
@@ -131,6 +132,11 @@ ruvector-attention = "2.0.4"
 ruvector-crv = "0.1.1"
 ruvector-gnn = { version = "2.0.5", default-features = false }

+# ruvllm sparse attention (path-vendored per ADR-095/096)
+# Default-features=false keeps the kernel no_std-clean so the same workspace
+# version is consumable by the upcoming ESP-IDF Rust component (ADR-095).
+ruvllm_sparse_attention = { path = "../vendor/ruvector/crates/ruvllm_sparse_attention", default-features = false, features = ["fp16"] }
+

 # Internal crates
 wifi-densepose-core = { version = "0.3.0", path = "crates/wifi-densepose-core" }
@@ -143,6 +149,7 @@ wifi-densepose-hardware = { version = "0.3.0", path = "crates/wifi-densepose-har
 wifi-densepose-wasm = { version = "0.3.0", path = "crates/wifi-densepose-wasm" }
 wifi-densepose-mat = { version = "0.3.0", path = "crates/wifi-densepose-mat" }
 wifi-densepose-ruvector = { version = "0.3.0", path = "crates/wifi-densepose-ruvector" }
+wifi-densepose-temporal = { version = "0.1.0", path = "crates/wifi-densepose-temporal" }

 [profile.release]
 lto = true
@@ -0,0 +1,27 @@
+[package]
+name = "wifi-densepose-temporal"
+version = "0.1.0"
+edition = "2021"
+license = "MIT"
+description = "AETHER temporal head for WiFi-DensePose — sparse-GQA attention over CSI feature windows (ADR-096)"
+repository = "https://github.com/ruvnet/RuView"
+
+[dependencies]
+ruvllm_sparse_attention = { workspace = true }
+thiserror = "1"
+
+[dev-dependencies]
+approx = "0.5"
+
+[features]
+default = []
+# Enable FP16 KV cache path (mirrors the firmware-side ADR-095 build).
+fp16 = []
+
+[[example]]
+name = "init_random_blob"
+path = "examples/init_random_blob.rs"
+
+[[example]]
+name = "bench_speedup"
+path = "examples/bench_speedup.rs"
@@ -0,0 +1,146 @@
+# `wifi-densepose-temporal`
+
+AETHER temporal head over CSI feature windows. Sparse-GQA attention via
+`ruvllm_sparse_attention`, with a streaming `KvCache` decode path for
+online re-ID and incremental classification.
+
+Implements the host side of [ADR-096](../../../docs/adr/ADR-096-aether-temporal-head-sparse-gqa.md);
+mirrored on the firmware side at
+[`firmware/esp32-csi-node/components/ruv_temporal/`](../../../firmware/esp32-csi-node/components/ruv_temporal/).
+
+## Quick start
+
+```rust
+use wifi_densepose_temporal::{AetherTemporalHead, TemporalHeadConfig, Tensor3};
+
+// Default config matches AETHER's MQA shape:
+//   q_heads=4, kv_heads=1, head_dim=32, window=32, block_size=16, causal=true
+let cfg = TemporalHeadConfig::default_aether();
+let head = AetherTemporalHead::new(&cfg)?;
+
+// Prefill: full window forward
+let out = head.forward(&q, &k, &v)?;        // shape: (window, q_heads, head_dim)
+
+// Streaming: O(log T) per new frame against an accumulated cache
+let mut cache = head.make_cache(/* capacity */ 1024)?;
+for new_frame in stream {
+    let (q1, k1, v1) = project(&new_frame); // each seq=1
+    let attn_out = head.step(&q1, &k1, &v1, &mut cache)?;
+    // pool, run classifier head, etc
+}
+```
+
+## Backends
+
+`TemporalBackendKind` selects between two paths (ADR-096 §4.4):
+
+| Backend | When | Cost |
+|---|---|---|
+| `SparseGqa` | New training runs (default) | O(N log N) prefill, O(log T) decode |
+| `Dense` | Reserved for back-compat | Returns `TemporalError::DenseBackendNotImplemented` for now (ADR-096 §4.4 follow-up) |
+
+The `SparseGqa` backend dispatches at `forward()` time:
+
+- `q_heads == kv_heads` → `forward()` (sparse MHA)
+- `q_heads != kv_heads` → `forward_gqa()` (GQA / MQA)
+
+## Streaming semantics
+
+`step()` is the structural advantage over dense MHA — append `(k, v)` to the
+caller-owned cache and decode the new `q` in O(log T) per token.
+
+- `q`/`k`/`v` must each have `seq == 1` (multi-token q is the prefill path).
+- `KvCache` lifetime is the caller's. Per ADR-096 §8.5 the natural lifetime
+  is per-`PoseTrack` (re-ID) or per-session (online classification). When
+  the track drops, drop the cache.
+- Cache fills are the caller's problem. Upstream H2O heavy-hitter eviction
+  is opt-in; this crate's wrapper doesn't pre-pick a policy.
+
+Headline correctness test: `streaming_step_matches_forward_at_last_position`
+proves token-by-token `step()` produces the same output as a single-shot
+`forward()` at position `N-1`, max_abs_err < 1e-3.
+
+## Weight blob format (`.rvne`)
+
+Wire format for transferring trained weights to the firmware.
+[`weights.rs`](src/weights.rs) defines the host side; the firmware mirror
+at [`components/ruv_temporal/src/weights.rs`](../../../firmware/esp32-csi-node/components/ruv_temporal/src/weights.rs)
+parses it bit-for-bit.
+
+| Section | Bytes | Contents |
+|---|---|---|
+| Header | 24 | magic `RVNE` / version 1 / dtype flag (FP32 \| FP16) / dims |
+| Weights | variable | flat per-layer arrays, dtype as flagged |
+| Footer | 4 | CRC32-IEEE over everything before |
+
+Hard-break versioning: bumping `version` means firmware refuses to load.
+Adding fields goes behind reserved flag bits, never by reorder.
+
+```rust
+let blob = WeightBlob::new(header, weights)?;
+let bytes = blob.serialize();          // host
+// ...
+let view = WeightBlobView::parse(&bytes)?;  // firmware (no_std, borrowed slice)
+```
+
+## Examples
+
+| Example | Run |
+|---|---|
+| `init_random_blob` | `cargo run -p wifi-densepose-temporal --example init_random_blob -- model.rvne` — emits a 41 KB AETHER-shaped `.rvne` |
+| `bench_speedup` | `cargo run -p wifi-densepose-temporal --example bench_speedup --release` — sparse-vs-dense speedup curve |
+
+Captured benchmark results: [`benches_results.md`](benches_results.md).
+
+## Tests
+
+```
+cargo test -p wifi-densepose-temporal
+```
+
+| Suite | Tests | What |
+|---|---|---|
+| `tests/smoke.rs` | 5 | Forward at AETHER default, MHA dispatch, GQA dispatch, dense-rejected, invalid-GQA-rejected, N=1000 long window |
+| `tests/weight_blob.rs` | 8 | Roundtrip FP32 + FP16, bad magic / version / size / CRC / GQA, layout anchor |
+| `tests/blob_e2e.rs` | 2 | Realistic 25 KB+ filesystem roundtrip, deterministic seed reproducibility |
+| `tests/streaming.rs` | 3 | step()-matches-forward at last position, multi-token-q rejected, make_cache shape |
+
+**18/18 passing as of commit `247794a2c`.**
+
+## Status of ADR-096 claims
+
+| Claim | Status | Evidence |
+|---|---|---|
+| O(N log N) sparse vs O(N²) dense | **Empirically confirmed** | `bench_speedup` measures 21.21× at N=1024; cost-growth ratios match theory (dense 274×, sparse 24× for 16× more tokens) |
+| `step()` matches `forward()` at last position | **Proven** | `streaming_step_matches_forward_at_last_position` test |
+| Wire format consistent host↔firmware | **Proven** | 3 sites with same magic/version/CRC, 41-KB blob roundtrips through filesystem in tests |
+| Path-vendored, no crates.io coupling | **Confirmed** | Workspace dep is `path = "../vendor/ruvector/crates/ruvllm_sparse_attention"` |
+| 30–100× at long windows | **Partial** | 21.21× measured at N=1024 in single-run wall-clock; higher N + criterion would push closer to the 30× lower bound |
+
+## Status of ADR-095 surface (firmware)
+
+`AetherTemporalHead` is the host-side analog of the firmware on-device path.
+The firmware Rust component scaffold and C-side wiring are complete; the
+Rust component cross-compile is currently blocked by an upstream esp-rs
+nightly-bundle inconsistency. See
+[`components/ruv_temporal/README.md`](../../../firmware/esp32-csi-node/components/ruv_temporal/README.md)
+for details.
+
+When the toolchain unblocks, no changes to this crate are needed —
+`weights.rs` is already mirrored, `Tensor3` and `KvCache` cross the
+boundary unchanged, and the C ABI consumed by `temporal_task.c` is stable.
+
+## Open questions (still applicable from ADR-096 §8)
+
+- The deployed AETHER tracker's actual window length is what determines
+  whether sparse pays off in production. At training default of 100 frames,
+  sparse begins to win (5–6× at N=128–256). At the 1000-frame roadmap
+  target, the speedup is much larger (21× measured).
+- Streaming GQA decode is an upstream roadmap item; the current
+  `decode_step` is wired for the MHA branch. When upstream ships GQA
+  decode (post-ADR-189/190), `AetherTemporalHead.step` gets a GQA dispatch
+  branch added without any public API change.
+
+## License
+
+MIT.
@@ -0,0 +1,72 @@
+# Bench results — sparse vs dense prefill
+
+Output of `cargo run -p wifi-densepose-temporal --example bench_speedup --release`
+on a Windows 11 / x86_64 dev box, 2026-05-08. Single-run wall-clock,
+pure-Rust vs pure-Rust (no SIMD/threads on either side). Reproduce by
+running the example yourself; results vary 2–3× between machines and
+power states, but the **trends across N** are what matter.
+
+## Sparse-vs-dense prefill speedup
+
+Config: `q_heads=4, kv_heads=4, head_dim=32, window=16, block_size=32, causal=true`.
+
+| N      | Dense (ms)   | Sparse (ms)  | Speedup |
+|--------|-------------:|-------------:|--------:|
+|     64 |        0.262 |        0.141 |    1.86× |
+|    128 |        1.120 |        0.335 |    3.34× |
+|    256 |        4.129 |        0.711 |    5.81× |
+|    512 |       19.230 |        2.356 |    8.16× |
+|   1024 |       71.904 |        3.389 | **21.21×** |
+
+## Asymptotic check
+
+ADR-096 §3.1 claimed dense scales as O(N²) and sparse as O(N log N).
+The measured 64→1024 cost growth (16× more tokens) is:
+
+| Path   | 64 ms | 1024 ms | Growth | Theory |
+|--------|------:|--------:|-------:|-------:|
+| Dense  | 0.262 |  71.904 |   274× | 256× = 16² |
+| Sparse | 0.141 |   3.389 |    24× | ~27× = 16 · log(1024)/log(64) |
+
+Dense's 274× growth matches `N²` cleanly. Sparse's 24× growth matches
+`N log N` to within measurement noise. **The asymptotic complexity
+claim is empirically supported on this hardware.**
+
+## Why N=64 is only 1.86× and not faster
+
+ADR-096 §3.1 already called this out: at the AETHER training default
+of `window_frames = 100`, dense MHA is essentially free and the sparse
+machinery has overhead — the per-token candidate-set construction,
+landmark indexing, and global-token bookkeeping are constant-factor
+costs that only amortize past N ≈ 200. The speedup-vs-N curve
+inflects sharply between N=128 and N=256 because that's where dense's
+N² term starts dominating its constants.
+
+If a downstream consumer is using AETHER on 4-frame windows
+(`proof.rs`, `trainer.rs`), this ADR pays nothing. The case rests
+entirely on the long-window roadmap.
+
+## What this benchmark doesn't measure
+
+- **Decode-step latency.** `streaming_step_matches_forward_at_last_position`
+  proves correctness; this bench doesn't measure how fast `decode_step`
+  runs vs a hypothetical dense-MHA decode (which would be O(N²) recompute
+  every step — structurally not even comparable).
+- **Memory.** KvCache + FP16 halves the K/V footprint vs FP32, which
+  matters more on the firmware than on x86_64 host. Phase 5 unblocking
+  is the prerequisite for measuring this on real hardware.
+- **GQA dispatch.** This config uses `q_heads == kv_heads` to force
+  the MHA branch, so dense and sparse operate on the same shape.
+  Real AETHER will probably want `kv_heads=1` (MQA) which halves
+  the KV memory and is what the default head config picks.
+
+## How to run
+
+```
+cargo run -p wifi-densepose-temporal --example bench_speedup --release
+```
+
+Release mode is mandatory. Debug builds run sparse 5–10× slower than
+release because the candidate-set construction has tight inner loops
+that benefit hard from `-O3`. Don't draw conclusions from `cargo run`
+without `--release`.
@@ -0,0 +1,151 @@
+// Measure sparse-GQA prefill cost vs dense MHA at N = {64, 128, 256, 512, 1024}.
+// ADR-096 §3.1 claimed 30–100× edge-evaluation reduction at long windows;
+// this is the empirical check.
+//
+// Run with:   cargo run -p wifi-densepose-temporal --example bench_speedup --release
+//
+// Caveat: single-run wall-clock on one machine — not a rigorous benchmark.
+// Trends across N matter more than the absolute numbers, and results vary
+// 2–3× between machines / power states. The point is to confirm the
+// magnitude of the speedup is what the ADR claimed, not a perf-engineering
+// dashboard. For that, use criterion + a dedicated machine.
+
+use std::time::Instant;
+
+use ruvllm_sparse_attention::{dense_attention, AttentionBackend, SparseAttentionConfig, SubquadraticSparseAttention, Tensor3};
+use wifi_densepose_temporal::{TemporalBackendKind, TemporalHeadConfig, AetherTemporalHead};
+
+fn make_qkv(seq: usize, heads: usize, dim: usize) -> (Tensor3, Tensor3, Tensor3) {
+    // Simple deterministic init — content doesn't matter for timing,
+    // but we want each benchmark run to use the same numbers.
+    let mut q = Tensor3::zeros(seq, heads, dim);
+    let mut k = Tensor3::zeros(seq, heads, dim);
+    let mut v = Tensor3::zeros(seq, heads, dim);
+    for s in 0..seq {
+        for h in 0..heads {
+            for d in 0..dim {
+                let qv = ((s * 31 + h * 7 + d) as f32).sin() * 0.1;
+                let kv = (((s * 17 + h * 3 + d) as f32).cos()) * 0.1;
+                q.set(s, h, d, qv);
+                k.set(s, h, d, kv);
+                v.set(s, h, d, kv * 0.5);
+            }
+        }
+    }
+    (q, k, v)
+}
+
+fn time_run<F: FnMut()>(label: &str, runs: usize, mut f: F) -> f64 {
+    // 1 warmup + `runs` measurements. Wall clock; release-mode only is
+    // meaningful (debug builds run sparse 5–10× slower than release).
+    f();
+    let start = Instant::now();
+    for _ in 0..runs {
+        f();
+    }
+    let total_ms = start.elapsed().as_secs_f64() * 1000.0;
+    let avg_ms = total_ms / runs as f64;
+    println!("  {label:<36}  {avg_ms:>8.3} ms/run  ({runs} runs)");
+    avg_ms
+}
+
+fn bench_at(seq: usize) -> (f64, f64, f64) {
+    println!();
+    println!("=== seq = {seq} ===");
+
+    // MHA shape (q_heads == kv_heads) so dense_attention and the sparse
+    // forward path operate on the same tensor shape — direct timing
+    // comparison without GQA bookkeeping confounding the result.
+    let heads = 4;
+    let dim = 32;
+    let (q, k, v) = make_qkv(seq, heads, dim);
+
+    // Dense reference. dense_attention is the upstream's naive O(N²)
+    // pure-Rust kernel — same scale, same shape, no SIMD acceleration —
+    // a fair head-to-head against the equally-pure-Rust sparse path.
+    let runs_dense = if seq <= 128 { 50 } else if seq <= 512 { 10 } else { 3 };
+    let dense_ms = time_run(
+        &format!("dense_attention (causal=true)"),
+        runs_dense,
+        || {
+            let _ = dense_attention(&q, &k, &v, true).expect("dense forward");
+        },
+    );
+
+    // Sparse via the AETHER head wrapper — same code path the production
+    // training/inference would use, not the lower-level SubquadraticSparseAttention.
+    // Window/block_size kept small so the sparse pattern actually drops
+    // candidates at all benchmark lengths (otherwise at N=64 with default
+    // config we'd touch the entire sequence and look the same as dense).
+    let cfg = TemporalHeadConfig {
+        backend: TemporalBackendKind::SparseGqa,
+        q_heads: heads,
+        kv_heads: heads, // MHA — match dense
+        head_dim: dim,
+        window: 16,
+        block_size: 32,
+        causal: true,
+    };
+    let head = AetherTemporalHead::new(&cfg).expect("construct head");
+    let runs_sparse = if seq <= 128 { 50 } else if seq <= 512 { 30 } else { 10 };
+    let sparse_ms = time_run(
+        "AetherTemporalHead.forward (sparse)",
+        runs_sparse,
+        || {
+            let _ = head.forward(&q, &k, &v).expect("sparse forward");
+        },
+    );
+
+    // Also measure SubquadraticSparseAttention directly — bypasses our
+    // wrapper, useful for confirming the wrapper isn't introducing
+    // measurable overhead.
+    let attn = SubquadraticSparseAttention::new(SparseAttentionConfig {
+        window: 16,
+        block_size: 32,
+        global_tokens: vec![0],
+        causal: true,
+        use_log_stride: true,
+        use_landmarks: true,
+        sort_candidates: false,
+    })
+    .expect("construct attn");
+    let raw_ms = time_run(
+        "Subquadratic.forward (raw, no wrapper)",
+        runs_sparse,
+        || {
+            let _ = attn.forward(&q, &k, &v).expect("raw sparse forward");
+        },
+    );
+
+    let speedup = dense_ms / sparse_ms;
+    println!("  -> sparse/dense speedup           {speedup:>6.2}×");
+
+    (dense_ms, sparse_ms, speedup)
+}
+
+fn main() {
+    println!("ADR-096 §3.1 empirical speedup check");
+    println!("====================================");
+    println!("Pure-Rust vs pure-Rust, no SIMD/threads, single-run wall-clock.");
+    println!("Trends across N matter more than absolute numbers.");
+
+    let lengths = [64, 128, 256, 512, 1024];
+    let mut rows: Vec<(usize, f64, f64, f64)> = Vec::new();
+    for &n in &lengths {
+        let (dense_ms, sparse_ms, speedup) = bench_at(n);
+        rows.push((n, dense_ms, sparse_ms, speedup));
+    }
+
+    println!();
+    println!("Summary");
+    println!("  N      dense (ms)   sparse (ms)   speedup");
+    println!("  ----   ----------   -----------   -------");
+    for (n, d, s, sp) in &rows {
+        println!("  {n:<5}  {d:>10.3}   {s:>11.3}   {sp:>5.2}×");
+    }
+    println!();
+    println!("ADR-096 §3.1 claim: ~30× edge reduction at N=8192,");
+    println!("growing roughly N/log(N). At N=1024 the claim is ~5–10×;");
+    println!("at N=64 the sparse machinery is overhead-bound (sparse may");
+    println!("lose, see ADR-096 §3.1 'honest framing' paragraph).");
+}
@@ -0,0 +1,142 @@
+// Emit a deterministic-seeded random weight blob in the .rvne format
+// (ADR-095 / #513 Phase 1 of the training-side roadmap).
+//
+// This is a *demo*, not a trained model — the weights are PRNG output.
+// Its purpose is to:
+//   1. Document end-to-end how the host produces a blob (i.e. the
+//      example IS the recipe a real trainer follows: build a header,
+//      fill the weights buffer, call WeightBlob::new + .serialize(),
+//      write to disk).
+//   2. Provide a reproducible test fixture the firmware loader can
+//      consume once the toolchain unblocks (ADR-095 Phase 5).
+//   3. Anchor the byte-level format so refactors that change the
+//      output silently are caught by the byte-count assertion at
+//      the bottom.
+//
+// Usage:
+//   cargo run -p wifi-densepose-temporal --example init_random_blob
+//   cargo run -p wifi-densepose-temporal --example init_random_blob -- /tmp/model.rvne
+
+use std::env;
+use std::fs;
+use std::path::PathBuf;
+
+use wifi_densepose_temporal::{WeightBlob, WeightBlobHeader, WeightDtype};
+
+/// Match the AETHER default head shape from
+/// `TemporalHeadConfig::default_aether()` — staying coherent with the
+/// crate's other defaults means a real trainer can drop this example
+/// in as the starting point with one search-and-replace.
+fn aether_default_header() -> WeightBlobHeader {
+    WeightBlobHeader {
+        dtype: WeightDtype::F32,
+        input_dim: 16,
+        n_q_heads: 4,
+        n_kv_heads: 1, // MQA — one shared K/V across the 4 query heads
+        head_dim: 32,
+        n_layers: 2,
+        n_classes: 4, // gesture-class default; firmware Kconfig matches
+    }
+}
+
+/// Compute the raw byte count for one transformer block at the given
+/// shape. This is the *intent-of-the-format* number, kept here so
+/// changes to it (and to the kernel's expectation) stay in sync.
+///
+/// Per-layer weights consist of:
+///   - input projection : input_dim × (n_q_heads × head_dim)   = Wq
+///   - K projection      : input_dim × (n_kv_heads × head_dim)  = Wk
+///   - V projection      : input_dim × (n_kv_heads × head_dim)  = Wv
+///   - O projection      : (n_q_heads × head_dim) × input_dim    = Wo
+fn per_layer_floats(h: &WeightBlobHeader) -> usize {
+    let id = h.input_dim as usize;
+    let q_total = h.n_q_heads as usize * h.head_dim as usize;
+    let kv_total = h.n_kv_heads as usize * h.head_dim as usize;
+    id * q_total          // Wq
+        + id * kv_total   // Wk
+        + id * kv_total   // Wv
+        + q_total * id    // Wo
+}
+
+/// Plus a final classifier head: input_dim × n_classes.
+fn classifier_floats(h: &WeightBlobHeader) -> usize {
+    h.input_dim as usize * h.n_classes as usize
+}
+
+/// xorshift64* — tiny deterministic PRNG. Don't use for crypto;
+/// this is a fixed-seed init so two runs of the example produce
+/// byte-identical blobs.
+fn xorshift_step(state: &mut u64) -> u64 {
+    let mut x = *state;
+    x ^= x << 13;
+    x ^= x >> 7;
+    x ^= x << 17;
+    *state = x;
+    x.wrapping_mul(2685821657736338717u64)
+}
+
+/// Map the high 32 bits of a u64 to a small symmetric float in
+/// [-0.1, 0.1). Tight bound so the resulting model produces sensible
+/// pre-softmax logits even though it's untrained.
+fn next_init_f32(state: &mut u64) -> f32 {
+    let bits = (xorshift_step(state) >> 32) as u32;
+    // Map to [0, 1) then scale to [-0.1, 0.1)
+    let unit = (bits as f32) / (u32::MAX as f32);
+    (unit - 0.5) * 0.2
+}
+
+fn build_random_weights(header: &WeightBlobHeader, seed: u64) -> Vec<u8> {
+    let total_floats =
+        per_layer_floats(header) * header.n_layers as usize + classifier_floats(header);
+    let mut out = Vec::with_capacity(total_floats * 4);
+    let mut state = seed;
+    for _ in 0..total_floats {
+        let f = next_init_f32(&mut state);
+        out.extend_from_slice(&f.to_le_bytes());
+    }
+    out
+}
+
+fn main() -> Result<(), Box<dyn std::error::Error>> {
+    let path = env::args()
+        .nth(1)
+        .map(PathBuf::from)
+        .unwrap_or_else(|| PathBuf::from("model_init.rvne"));
+
+    let header = aether_default_header();
+    let weights = build_random_weights(&header, 0xC511_0007_DEAD_BEEFu64);
+    let weights_len = weights.len();
+
+    let blob = WeightBlob::new(header.clone(), weights)?;
+    let bytes = blob.serialize();
+    let serialized_len = bytes.len();
+
+    fs::write(&path, &bytes)?;
+
+    // Re-parse to prove the artifact we just wrote is loadable. Same
+    // path the firmware loader will follow once the toolchain unblocks.
+    let parsed = WeightBlob::parse(&fs::read(&path)?)?;
+
+    println!("wrote   : {}", path.display());
+    println!("dtype   : {:?}", parsed.header.dtype);
+    println!(
+        "shape   : input_dim={}, q_heads={}, kv_heads={}, head_dim={}, layers={}, classes={}",
+        parsed.header.input_dim,
+        parsed.header.n_q_heads,
+        parsed.header.n_kv_heads,
+        parsed.header.head_dim,
+        parsed.header.n_layers,
+        parsed.header.n_classes,
+    );
+    println!(
+        "weights : {} bytes ({} f32 elements)",
+        weights_len,
+        weights_len / 4
+    );
+    println!(
+        "total   : {} bytes (header 24 + weights {} + crc 4)",
+        serialized_len, weights_len
+    );
+
+    Ok(())
+}
@@ -0,0 +1,70 @@
+use crate::TemporalError;
+
+/// Backend choice per ADR-096 §4.4.
+///
+/// * `Dense` — back-compat path against `ruvector-attention`. Reserved;
+///   not yet implemented in this crate (returns a typed error so callers
+///   can fail loudly during config validation rather than at forward()).
+/// * `SparseGqa` — `ruvllm_sparse_attention` `forward_gqa` for prefill,
+///   `decode_step` against `KvCache` for streaming inference.
+#[derive(Clone, Copy, Debug, Eq, PartialEq)]
+pub enum TemporalBackendKind {
+    Dense,
+    SparseGqa,
+}
+
+#[derive(Clone, Debug)]
+pub struct TemporalHeadConfig {
+    pub backend: TemporalBackendKind,
+
+    /// Number of query heads. For pure MHA, equals `kv_heads`.
+    pub q_heads: usize,
+    /// Number of key/value heads. Must divide `q_heads`. GQA group size
+    /// is `q_heads / kv_heads`.
+    pub kv_heads: usize,
+    /// Per-head feature dimension.
+    pub head_dim: usize,
+
+    /// Local attention window radius (sparse pattern primitive #1, ADR-096 §3).
+    pub window: usize,
+    /// Landmark block size (sparse pattern primitive #3).
+    pub block_size: usize,
+    /// Whether the attention is causal. AETHER temporal aggregation is
+    /// causal (cannot peek at future CSI frames during streaming re-ID).
+    pub causal: bool,
+}
+
+impl TemporalHeadConfig {
+    /// Default config sized for the AETHER training default
+    /// (`window_frames = 100`) but with the sparse machinery wired up
+    /// so the long-window roadmap (10 s / 1000 frames) only requires
+    /// changing `window` at the call site, not re-architecting.
+    pub fn default_aether() -> Self {
+        Self {
+            backend: TemporalBackendKind::SparseGqa,
+            q_heads: 4,
+            kv_heads: 1, // MQA — collapses to one shared K/V across query heads
+            head_dim: 32,
+            window: 32,
+            block_size: 16,
+            causal: true,
+        }
+    }
+
+    pub fn validate(&self) -> Result<(), TemporalError> {
+        if self.q_heads == 0 || self.kv_heads == 0 || self.head_dim == 0 {
+            return Err(TemporalError::InvalidConfig(
+                "q_heads, kv_heads, head_dim must all be > 0",
+            ));
+        }
+        if self.q_heads % self.kv_heads != 0 {
+            return Err(TemporalError::InvalidConfig(
+                "q_heads must be divisible by kv_heads (GQA constraint)",
+            ));
+        }
+        if self.block_size == 0 {
+            return Err(TemporalError::InvalidConfig("block_size must be > 0"));
+        }
+        Ok(())
+    }
+}
@@ -0,0 +1,44 @@
+use ruvllm_sparse_attention::{dense_attention, Tensor3};
+
+use crate::{TemporalError, TemporalHeadConfig};
+
+/// Dense MHA backend (ADR-096 §5 A/B baseline).
+///
+/// Wraps upstream `dense_attention` — the naive O(N²) reference kernel.
+/// Same approximation surface as classical scaled-dot-product attention,
+/// no log-stride / landmarks / windowing. Exists primarily as the
+/// reference path for the §5 validation gate (rank correlation,
+/// contrastive-loss parity, latency baseline).
+///
+/// Has no streaming counterpart: dense MHA structurally cannot do
+/// O(log T) decode — every new token requires recomputing the full
+/// attention matrix. Callers that want streaming must use SparseGqa.
+pub struct DenseHead {
+    causal: bool,
+    cfg: TemporalHeadConfig,
+}
+
+impl DenseHead {
+    pub fn new(cfg: &TemporalHeadConfig) -> Result<Self, TemporalError> {
+        cfg.validate()?;
+        Ok(Self {
+            causal: cfg.causal,
+            cfg: cfg.clone(),
+        })
+    }
+
+    pub fn cfg(&self) -> &TemporalHeadConfig {
+        &self.cfg
+    }
+
+    /// Naive O(N²) prefill. Q/K/V must share the same head count
+    /// (no GQA) — `dense_attention` upstream enforces it.
+    pub fn forward(
+        &self,
+        q: &Tensor3,
+        k: &Tensor3,
+        v: &Tensor3,
+    ) -> Result<Tensor3, TemporalError> {
+        Ok(dense_attention(q, k, v, self.causal)?)
+    }
+}
@@ -0,0 +1,28 @@
+use thiserror::Error;
+
+#[derive(Debug, Error)]
+pub enum TemporalError {
+    #[error("temporal head config invalid: {0}")]
+    InvalidConfig(&'static str),
+
+    /// Retained for back-compat with v0.1 callers; superseded by the
+    /// per-operation errors below now that Dense is implemented.
+    #[error("dense MHA backend not implemented yet (ADR-096 §4.4 follow-up)")]
+    DenseBackendNotImplemented,
+
+    /// Dense MHA has no notion of an accumulated KV cache — every
+    /// new frame requires recomputing the full N² attention matrix
+    /// (the structural gap ADR-096 §3.2 flagged). Callers that want
+    /// streaming decode must use the SparseGqa backend.
+    #[error("dense backend does not support streaming step(); use SparseGqa for online decode")]
+    BackendDoesNotSupportStreaming,
+
+    #[error("sparse attention kernel error: {0}")]
+    Kernel(String),
+}
+
+impl From<ruvllm_sparse_attention::AttentionError> for TemporalError {
+    fn from(e: ruvllm_sparse_attention::AttentionError) -> Self {
+        TemporalError::Kernel(format!("{e}"))
+    }
+}
@@ -0,0 +1,105 @@
+// AETHER temporal head over CSI feature windows (ADR-096).
+//
+// Wraps `ruvllm_sparse_attention::SubquadraticSparseAttention` so AETHER
+// callers in `wifi-densepose-train` and `wifi-densepose-signal` can swap
+// dense MHA for sparse-GQA without touching the contrastive recipe.
+//
+// Status: scaffolding for ADR-096 §4.3. Sparse backend is functional;
+// the dense back-compat backend is a follow-up (Phase 2 of the roadmap
+// in #513). Streaming `step()` lands once the per-track KvCache lifecycle
+// (ADR-096 §8.5) is finalized.
+
+pub mod config;
+pub mod dense;
+pub mod error;
+pub mod sparse;
+pub mod weights;
+
+pub use config::{TemporalBackendKind, TemporalHeadConfig};
+pub use dense::DenseHead;
+pub use error::TemporalError;
+pub use sparse::SparseGqaHead;
+pub use weights::{
+    WeightBlob, WeightBlobHeader, WeightDtype, WEIGHT_BLOB_HEADER_LEN, WEIGHT_BLOB_MAGIC,
+    WEIGHT_BLOB_VERSION,
+};
+
+// Re-export the upstream Tensor3 + KvCache so callers don't need a
+// direct `ruvllm_sparse_attention` dep.
+pub use ruvllm_sparse_attention::{KvCache, Tensor3};
+
+/// Thin facade so callers can pick a backend by name.
+///
+/// Both backends implement `forward()` for prefill. Only `SparseGqa`
+/// implements `step()` (streaming O(log T) decode against KvCache);
+/// dense MHA structurally lacks a streaming counterpart and returns
+/// `TemporalError::BackendDoesNotSupportStreaming` on `step()`.
+pub enum AetherTemporalHead {
+    SparseGqa(SparseGqaHead),
+    Dense(DenseHead),
+}
+
+impl AetherTemporalHead {
+    pub fn new(cfg: &TemporalHeadConfig) -> Result<Self, TemporalError> {
+        match cfg.backend {
+            TemporalBackendKind::SparseGqa => {
+                Ok(AetherTemporalHead::SparseGqa(SparseGqaHead::new(cfg)?))
+            }
+            TemporalBackendKind::Dense => Ok(AetherTemporalHead::Dense(DenseHead::new(cfg)?)),
+        }
+    }
+
+    /// Window-level prefill. Returns the per-token attention output as
+    /// a Tensor3 of shape (window, q_heads, head_dim). Pooling to a
+    /// single embedding is the caller's responsibility — different
+    /// AETHER consumers use different pool ops (mean for re-ID,
+    /// last-token for streaming).
+    pub fn forward(
+        &self,
+        q: &Tensor3,
+        k: &Tensor3,
+        v: &Tensor3,
+    ) -> Result<Tensor3, TemporalError> {
+        match self {
+            AetherTemporalHead::SparseGqa(h) => h.forward(q, k, v),
+            AetherTemporalHead::Dense(h) => h.forward(q, k, v),
+        }
+    }
+
+    /// Streaming decode (ADR-096 §3.2). Caller owns the `cache`; the
+    /// natural lifetime is per-tracked-person (one cache per
+    /// `PoseTrack`, dropped when the track evicts).
+    ///
+    /// Returns the attention output for the single new token. Caller
+    /// is responsible for downstream pooling / classifier head.
+    ///
+    /// Dense backend returns `BackendDoesNotSupportStreaming` — no
+    /// dense-MHA-with-KV-cache equivalent exists, by design.
+    pub fn step(
+        &self,
+        q_new: &Tensor3,
+        k_new: &Tensor3,
+        v_new: &Tensor3,
+        cache: &mut KvCache,
+    ) -> Result<Tensor3, TemporalError> {
+        match self {
+            AetherTemporalHead::SparseGqa(h) => h.step(q_new, k_new, v_new, cache),
+            AetherTemporalHead::Dense(_) => {
+                Err(TemporalError::BackendDoesNotSupportStreaming)
+            }
+        }
+    }
+
+    /// Allocate a `KvCache` sized correctly for this head. Convenience
+    /// wrapper so AETHER's `pose_tracker.rs` doesn't need to import
+    /// the upstream crate.
+    ///
+    /// Dense backend returns `BackendDoesNotSupportStreaming` — there
+    /// is no cache to size for a dense kernel.
+    pub fn make_cache(&self, capacity: usize) -> Result<KvCache, TemporalError> {
+        match self {
+            AetherTemporalHead::SparseGqa(h) => Ok(h.make_cache(capacity)),
+            AetherTemporalHead::Dense(_) => Err(TemporalError::BackendDoesNotSupportStreaming),
+        }
+    }
+}
@@ -0,0 +1,112 @@
+use ruvllm_sparse_attention::{
+    AttentionBackend, KvCache, SparseAttentionConfig, SubquadraticSparseAttention, Tensor3,
+};
+
+use crate::{TemporalError, TemporalHeadConfig};
+
+/// AETHER temporal head implemented with `ruvllm_sparse_attention`.
+///
+/// The selection rule from ADR-096 §4.4 is enforced at `forward()`
+/// time: when `q_heads == kv_heads` we use `forward()` (plain MHA
+/// over the sparse pattern); when they differ we use `forward_gqa()`.
+/// The streaming `step()` path is staged behind a follow-up — KvCache
+/// lifecycle ties to `PoseTrack` per ADR-096 §8.5 and lives on the
+/// caller, not here.
+pub struct SparseGqaHead {
+    cfg: TemporalHeadConfig,
+    attn: SubquadraticSparseAttention,
+}
+
+impl SparseGqaHead {
+    pub fn new(cfg: &TemporalHeadConfig) -> Result<Self, TemporalError> {
+        cfg.validate()?;
+
+        let attn_cfg = SparseAttentionConfig {
+            window: cfg.window,
+            block_size: cfg.block_size,
+            global_tokens: alloc_first_token(),
+            causal: cfg.causal,
+            use_log_stride: true,
+            use_landmarks: true,
+            sort_candidates: false,
+        };
+
+        let attn = SubquadraticSparseAttention::new(attn_cfg)?;
+        Ok(Self {
+            cfg: cfg.clone(),
+            attn,
+        })
+    }
+
+    pub fn cfg(&self) -> &TemporalHeadConfig {
+        &self.cfg
+    }
+
+    pub fn forward(
+        &self,
+        q: &Tensor3,
+        k: &Tensor3,
+        v: &Tensor3,
+    ) -> Result<Tensor3, TemporalError> {
+        // ADR-096 §4.4: dispatch by GQA shape.
+        if self.cfg.q_heads == self.cfg.kv_heads {
+            // Pure MHA — sparse `forward` is the right path.
+            Ok(self.attn.forward(q, k, v)?)
+        } else {
+            // GQA / MQA — kv_heads < q_heads, group share factor = q/kv.
+            Ok(self.attn.forward_gqa(q, k, v)?)
+        }
+    }
+
+    /// Streaming decode for re-ID and online classification (ADR-096 §3.2).
+    ///
+    /// Given one new token's q/k/v, append (k, v) to `cache` and return
+    /// the attention output for that one position against the full
+    /// accumulated history. Cost is O(log T) per step against a cache
+    /// of capacity T — the structural advantage over dense MHA's O(N²)
+    /// recompute that ADR-096 specifically calls out as the
+    /// dense-MHA-cannot-follow path.
+    ///
+    /// Cache lifetime is owned by the caller. Per ADR-096 §8.5 the
+    /// natural place is one cache per `PoseTrack` (re-ID) or one cache
+    /// per active session (online classification). When the track is
+    /// dropped, drop the cache.
+    pub fn step(
+        &self,
+        q_new: &Tensor3,
+        k_new: &Tensor3,
+        v_new: &Tensor3,
+        cache: &mut KvCache,
+    ) -> Result<Tensor3, TemporalError> {
+        if q_new.seq != 1 || k_new.seq != 1 || v_new.seq != 1 {
+            return Err(TemporalError::InvalidConfig(
+                "step() requires single-token q/k/v (seq == 1 each)",
+            ));
+        }
+        // Append must succeed before decode_step sees the cache; if
+        // the cache fills, the caller is responsible for eviction or
+        // resetting per ADR-096 §3.2 (H2O heavy-hitter eviction is
+        // available upstream but kept opt-in).
+        cache.try_append(k_new, v_new)?;
+        Ok(self.attn.decode_step(q_new, cache)?)
+    }
+
+    /// Construct a KvCache sized for this head's shape. Convenience
+    /// so callers don't need to import the upstream crate directly.
+    pub fn make_cache(&self, capacity: usize) -> KvCache {
+        KvCache::new(
+            capacity,
+            self.cfg.kv_heads,
+            self.cfg.head_dim,
+            self.cfg.block_size,
+        )
+    }
+}
+
+/// Always treat token 0 as a global anchor — AETHER's contrastive
+/// recipe (ADR-024) gives the first token a special role as the
+/// "session start" reference embedding, and global tokens in the
+/// sparse pattern preserve full visibility for that one position.
+fn alloc_first_token() -> Vec<usize> {
+    vec![0]
+}
@@ -0,0 +1,231 @@
+// Wire format for the temporal-head weights blob.
+//
+// One blob describes one model. Both ends speak it:
+//   - Host-side (this crate): training emits a blob via `WeightBlob::serialize`.
+//   - Firmware-side (`firmware/esp32-csi-node/components/ruv_temporal`):
+//     loads it via a mirrored parser. The blob is the *only* thing
+//     that crosses the host/firmware boundary at deploy time, so the
+//     format must be stable, self-describing, and version-gated.
+//
+// Layout (little-endian throughout):
+//
+//   header  16 B
+//     [0x00..0x04)  magic         u32  = 0x52564E45 ("RVNE" — RuVector Neural Edge)
+//     [0x04..0x06)  version       u16  = 1
+//     [0x06..0x07)  flags         u8   bit 0 = 0:fp32 / 1:fp16 weights
+//     [0x07..0x08)  reserved      u8
+//     [0x08..0x0A)  input_dim     u16  per-frame feature dim
+//     [0x0A..0x0C)  n_q_heads     u16  query head count
+//     [0x0C..0x0E)  n_kv_heads    u16  key/value head count (≤ n_q_heads, divides it)
+//     [0x0E..0x10)  head_dim      u16  per-head feature dim
+//
+//   body                          variable
+//     [0x10..0x12)  n_layers      u16
+//     [0x12..0x14)  n_classes     u16
+//     [0x14..0x18)  weights_len   u32  bytes of weights payload (after this header)
+//     [0x18..end-4) weights       weights_len bytes — flat per-layer arrays
+//                                 in the order the kernel reads them
+//   footer                        4 B
+//     [end-4..end)  crc32         u32  IEEE 802.3, covers everything before
+//
+// Total size = 16 (header) + 2+2+4 (body header) + weights_len + 4 (crc) = 28 + weights_len
+//
+// Versioning: bumping `version` is a hard break — firmware refuses to
+// load a blob whose version it doesn't know. Adding a *new* field is
+// done by reserving a new flag bit and treating the field as
+// post-weights when the bit is set; never reorder existing fields.
+
+use crate::error::TemporalError;
+
+pub const WEIGHT_BLOB_MAGIC: u32 = 0x5256_4E45; // "RVNE"
+pub const WEIGHT_BLOB_VERSION: u16 = 1;
+pub const WEIGHT_BLOB_HEADER_LEN: usize = 16 + 2 + 2 + 4; // 24
+pub const WEIGHT_BLOB_FOOTER_LEN: usize = 4;
+
+#[derive(Clone, Copy, Debug, Eq, PartialEq)]
+pub enum WeightDtype {
+    F32,
+    F16,
+}
+
+#[derive(Clone, Debug)]
+pub struct WeightBlobHeader {
+    pub dtype: WeightDtype,
+    pub input_dim: u16,
+    pub n_q_heads: u16,
+    pub n_kv_heads: u16,
+    pub head_dim: u16,
+    pub n_layers: u16,
+    pub n_classes: u16,
+}
+
+impl WeightBlobHeader {
+    /// Element size in bytes for the configured dtype.
+    pub fn elem_bytes(&self) -> usize {
+        match self.dtype {
+            WeightDtype::F32 => 4,
+            WeightDtype::F16 => 2,
+        }
+    }
+
+    /// Validate that the structural numbers make sense — caught here
+    /// rather than at first kernel call so the host-side training
+    /// tool can't accidentally emit a blob the firmware will reject
+    /// at boot.
+    pub fn validate(&self) -> Result<(), TemporalError> {
+        if self.input_dim == 0
+            || self.n_q_heads == 0
+            || self.n_kv_heads == 0
+            || self.head_dim == 0
+        {
+            return Err(TemporalError::InvalidConfig(
+                "header: zero-valued dimension(s)",
+            ));
+        }
+        if self.n_q_heads % self.n_kv_heads != 0 {
+            return Err(TemporalError::InvalidConfig(
+                "header: n_q_heads must be divisible by n_kv_heads (GQA)",
+            ));
+        }
+        if self.n_layers == 0 || self.n_classes < 2 {
+            return Err(TemporalError::InvalidConfig(
+                "header: n_layers must be ≥ 1 and n_classes ≥ 2",
+            ));
+        }
+        Ok(())
+    }
+}
+
+/// A complete weight blob: header + raw weights bytes.
+///
+/// Weights are kept as `Vec<u8>` rather than `Vec<f32>` / `Vec<f16>` so
+/// the firmware loader (which is no_std and may not have the `half`
+/// crate) can `mmap` the body and read either dtype directly.
+pub struct WeightBlob {
+    pub header: WeightBlobHeader,
+    pub weights: Vec<u8>,
+}
+
+impl WeightBlob {
+    pub fn new(header: WeightBlobHeader, weights: Vec<u8>) -> Result<Self, TemporalError> {
+        header.validate()?;
+        let elem = header.elem_bytes();
+        if weights.len() % elem != 0 {
+            return Err(TemporalError::InvalidConfig(
+                "weights length is not a multiple of dtype element size",
+            ));
+        }
+        Ok(Self { header, weights })
+    }
+
+    /// Serialize to the wire format. Stable across rebuilds — this is
+    /// the contract the firmware loader speaks.
+    pub fn serialize(&self) -> Vec<u8> {
+        let total = WEIGHT_BLOB_HEADER_LEN + self.weights.len() + WEIGHT_BLOB_FOOTER_LEN;
+        let mut out = Vec::with_capacity(total);
+
+        // header
+        out.extend_from_slice(&WEIGHT_BLOB_MAGIC.to_le_bytes());
+        out.extend_from_slice(&WEIGHT_BLOB_VERSION.to_le_bytes());
+        let flags: u8 = match self.header.dtype {
+            WeightDtype::F32 => 0,
+            WeightDtype::F16 => 1,
+        };
+        out.push(flags);
+        out.push(0); // reserved
+        out.extend_from_slice(&self.header.input_dim.to_le_bytes());
+        out.extend_from_slice(&self.header.n_q_heads.to_le_bytes());
+        out.extend_from_slice(&self.header.n_kv_heads.to_le_bytes());
+        out.extend_from_slice(&self.header.head_dim.to_le_bytes());
+
+        // body header
+        out.extend_from_slice(&self.header.n_layers.to_le_bytes());
+        out.extend_from_slice(&self.header.n_classes.to_le_bytes());
+        out.extend_from_slice(&(self.weights.len() as u32).to_le_bytes());
+
+        // weights
+        out.extend_from_slice(&self.weights);
+
+        // footer: crc32 over everything written so far
+        let crc = crc32_ieee(&out);
+        out.extend_from_slice(&crc.to_le_bytes());
+        out
+    }
+
+    /// Parse a blob, validating magic / version / size / CRC.
+    pub fn parse(buf: &[u8]) -> Result<Self, TemporalError> {
+        if buf.len() < WEIGHT_BLOB_HEADER_LEN + WEIGHT_BLOB_FOOTER_LEN {
+            return Err(TemporalError::InvalidConfig("blob too short"));
+        }
+
+        let magic = u32::from_le_bytes(buf[0..4].try_into().unwrap());
+        if magic != WEIGHT_BLOB_MAGIC {
+            return Err(TemporalError::InvalidConfig("bad magic"));
+        }
+        let version = u16::from_le_bytes(buf[4..6].try_into().unwrap());
+        if version != WEIGHT_BLOB_VERSION {
+            return Err(TemporalError::InvalidConfig("unsupported blob version"));
+        }
+        let flags = buf[6];
+        let dtype = match flags & 0x01 {
+            0 => WeightDtype::F32,
+            _ => WeightDtype::F16,
+        };
+
+        let input_dim = u16::from_le_bytes(buf[8..10].try_into().unwrap());
+        let n_q_heads = u16::from_le_bytes(buf[10..12].try_into().unwrap());
+        let n_kv_heads = u16::from_le_bytes(buf[12..14].try_into().unwrap());
+        let head_dim = u16::from_le_bytes(buf[14..16].try_into().unwrap());
+
+        let n_layers = u16::from_le_bytes(buf[16..18].try_into().unwrap());
+        let n_classes = u16::from_le_bytes(buf[18..20].try_into().unwrap());
+        let weights_len = u32::from_le_bytes(buf[20..24].try_into().unwrap()) as usize;
+
+        // sanity-check size before slicing
+        let expected = WEIGHT_BLOB_HEADER_LEN + weights_len + WEIGHT_BLOB_FOOTER_LEN;
+        if buf.len() != expected {
+            return Err(TemporalError::InvalidConfig(
+                "blob length doesn't match weights_len in header",
+            ));
+        }
+
+        // CRC check: cover everything before the trailing 4-byte CRC
+        let stored_crc = u32::from_le_bytes(buf[buf.len() - 4..].try_into().unwrap());
+        let computed = crc32_ieee(&buf[..buf.len() - 4]);
+        if stored_crc != computed {
+            return Err(TemporalError::InvalidConfig("blob CRC mismatch"));
+        }
+
+        let header = WeightBlobHeader {
+            dtype,
+            input_dim,
+            n_q_heads,
+            n_kv_heads,
+            head_dim,
+            n_layers,
+            n_classes,
+        };
+        header.validate()?;
+
+        let weights_start = WEIGHT_BLOB_HEADER_LEN;
+        let weights_end = weights_start + weights_len;
+        let weights = buf[weights_start..weights_end].to_vec();
+
+        Ok(Self { header, weights })
+    }
+}
+
+/// IEEE 802.3 CRC32 (poly 0xEDB88320), table-free. Same polynomial
+/// the firmware-side loader uses (`temporal_task.c::crc32_ieee`) so a
+/// blob produced here parses there.
+pub(crate) fn crc32_ieee(data: &[u8]) -> u32 {
+    let mut crc = 0xFFFF_FFFFu32;
+    for &b in data {
+        crc ^= b as u32;
+        for _ in 0..8 {
+            let mask = 0u32.wrapping_sub(crc & 1);
+            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
+        }
+    }
+    !crc
+}
@@ -0,0 +1,114 @@
+//! End-to-end test: write a deterministic-seeded weight blob to disk,
+//! read it back, parse it. Mirrors what the host-side training tool
+//! does (training run finishes → emit .rvne) and what the firmware
+//! loader will do once the toolchain unblocks (boot → mmap NVS or
+//! EMBED_FILES blob → parse → run kernel).
+//!
+//! Sized realistically (~26 KB for the AETHER default shape) so the
+//! perf and CRC paths see a meaningful payload.
+
+use std::fs;
+
+use wifi_densepose_temporal::{WeightBlob, WeightBlobHeader, WeightDtype};
+
+fn aether_default_header() -> WeightBlobHeader {
+    WeightBlobHeader {
+        dtype: WeightDtype::F32,
+        input_dim: 16,
+        n_q_heads: 4,
+        n_kv_heads: 1,
+        head_dim: 32,
+        n_layers: 2,
+        n_classes: 4,
+    }
+}
+
+fn xorshift_step(state: &mut u64) -> u64 {
+    let mut x = *state;
+    x ^= x << 13;
+    x ^= x >> 7;
+    x ^= x << 17;
+    *state = x;
+    x.wrapping_mul(2685821657736338717u64)
+}
+
+fn deterministic_weights(byte_len: usize, seed: u64) -> Vec<u8> {
+    let mut out = Vec::with_capacity(byte_len);
+    let mut state = seed;
+    while out.len() < byte_len {
+        let bits = xorshift_step(&mut state) >> 32;
+        let unit = (bits as u32 as f32) / (u32::MAX as f32);
+        let f = (unit - 0.5) * 0.2;
+        out.extend_from_slice(&f.to_le_bytes());
+    }
+    out.truncate(byte_len);
+    out
+}
+
+#[test]
+fn realistic_blob_roundtrips_through_filesystem() {
+    // AETHER default + 2 layers + classifier head: enough to exercise
+    // a non-trivial weights region without making the test slow.
+    let header = aether_default_header();
+
+    // Per-layer floats: input_dim*(q_heads*head_dim) for Wq, twice
+    // input_dim*(kv_heads*head_dim) for Wk and Wv, q_heads*head_dim*input_dim
+    // for Wo. Plus classifier head input_dim*n_classes.
+    let per_layer = (header.input_dim as usize)
+        * (header.n_q_heads as usize * header.head_dim as usize)
+        + 2 * (header.input_dim as usize)
+            * (header.n_kv_heads as usize * header.head_dim as usize)
+        + (header.n_q_heads as usize * header.head_dim as usize)
+            * (header.input_dim as usize);
+    let total_floats = per_layer * header.n_layers as usize
+        + header.input_dim as usize * header.n_classes as usize;
+    let weights_bytes = total_floats * 4;
+    assert!(weights_bytes > 25_000);
+
+    let weights = deterministic_weights(weights_bytes, 0xC511_0007_DEAD_BEEFu64);
+    let blob = WeightBlob::new(header, weights).expect("construct");
+    let serialized = blob.serialize();
+
+    // Filesystem leg — the realistic firmware loader path mmap or
+    // streaming-reads from NVS / EMBED_FILES. We use a temp file
+    // per platform; on Windows std::env::temp_dir() works fine.
+    let mut tmp = std::env::temp_dir();
+    tmp.push("wifi-densepose-temporal-e2e.rvne");
+    fs::write(&tmp, &serialized).expect("write");
+    let read_back = fs::read(&tmp).expect("read");
+    assert_eq!(read_back, serialized, "filesystem corrupted bytes");
+
+    let parsed = WeightBlob::parse(&read_back).expect("parse");
+    assert_eq!(parsed.header.input_dim, 16);
+    assert_eq!(parsed.header.n_q_heads, 4);
+    assert_eq!(parsed.header.n_kv_heads, 1);
+    assert_eq!(parsed.header.head_dim, 32);
+    assert_eq!(parsed.header.n_layers, 2);
+    assert_eq!(parsed.header.n_classes, 4);
+    assert_eq!(parsed.weights.len(), weights_bytes);
+
+    // Cleanup — best-effort, don't fail the test on Windows file lock.
+    let _ = fs::remove_file(&tmp);
+}
+
+#[test]
+fn deterministic_seed_produces_byte_identical_blobs() {
+    // The training script needs reproducibility — given the same
+    // config and seed, two runs must produce byte-identical output.
+    // This is what makes a witness-bundle (ADR-028) over the trained
+    // weights meaningful.
+    let header = aether_default_header();
+    let bytes = 4096;
+
+    let w1 = deterministic_weights(bytes, 0x1234u64);
+    let w2 = deterministic_weights(bytes, 0x1234u64);
+    assert_eq!(w1, w2, "PRNG not deterministic at fixed seed");
+
+    let blob1 = WeightBlob::new(header.clone(), w1).expect("ok");
+    let blob2 = WeightBlob::new(header, w2).expect("ok");
+    assert_eq!(
+        blob1.serialize(),
+        blob2.serialize(),
+        "serialization not deterministic"
+    );
+}
@@ -0,0 +1,184 @@
+//! Numerical A/B test for ADR-096 §5: do Dense and SparseGqa produce
+//! comparable outputs on the same input?
+//!
+//! Background. Sparse attention is *structurally* an approximation —
+//! it skips edges that the local window + log-stride + landmark
+//! pattern decided wouldn't matter. The §5 validation gate cares
+//! about whether that approximation degrades downstream metrics
+//! (contrastive loss, rank-1 accuracy, Spearman correlation), not
+//! whether outputs are bit-equal. This file establishes the *direct*
+//! output-level error envelope so the gate can be calibrated against
+//! it.
+//!
+//! Two regimes:
+//!
+//! 1. **Sparse pattern is dense.** When window ≥ N AND block_size ≥ N
+//!    AND every position is global, sparse and dense visit the same
+//!    edge set. Output divergence then reflects only floating-point
+//!    accumulation order, which is a tight bound (~1e-5 for f32 sums
+//!    of ~100 terms at 0.1 magnitude).
+//!
+//! 2. **Sparse pattern is sparse.** Default config drops most edges
+//!    at long N. Output divergence here is the *real* approximation
+//!    error — and the §5 gate's tolerances apply downstream of it.
+
+use ruvllm_sparse_attention::Tensor3;
+use wifi_densepose_temporal::{
+    AetherTemporalHead, TemporalBackendKind, TemporalHeadConfig,
+};
+
+fn make_qkv(seq: usize, heads: usize, dim: usize) -> (Tensor3, Tensor3, Tensor3) {
+    let mut q = Tensor3::zeros(seq, heads, dim);
+    let mut k = Tensor3::zeros(seq, heads, dim);
+    let mut v = Tensor3::zeros(seq, heads, dim);
+    for s in 0..seq {
+        for h in 0..heads {
+            for d in 0..dim {
+                let qv = ((s * 31 + h * 7 + d) as f32).sin() * 0.1;
+                let kv = (((s * 17 + h * 3 + d) as f32).cos()) * 0.1;
+                q.set(s, h, d, qv);
+                k.set(s, h, d, kv);
+                v.set(s, h, d, kv * 0.5);
+            }
+        }
+    }
+    (q, k, v)
+}
+
+fn max_abs_err(a: &Tensor3, b: &Tensor3) -> f32 {
+    let (s, h, d) = a.shape();
+    assert_eq!((s, h, d), b.shape(), "shape mismatch");
+    let mut max_err = 0.0f32;
+    for ti in 0..s {
+        for hi in 0..h {
+            for di in 0..d {
+                let e = (a.get(ti, hi, di) - b.get(ti, hi, di)).abs();
+                if e > max_err {
+                    max_err = e;
+                }
+            }
+        }
+    }
+    max_err
+}
+
+fn mean_abs_err(a: &Tensor3, b: &Tensor3) -> f32 {
+    let (s, h, d) = a.shape();
+    let mut sum = 0.0f32;
+    let mut n = 0usize;
+    for ti in 0..s {
+        for hi in 0..h {
+            for di in 0..d {
+                sum += (a.get(ti, hi, di) - b.get(ti, hi, di)).abs();
+                n += 1;
+            }
+        }
+    }
+    sum / n.max(1) as f32
+}
+
+#[test]
+fn dense_and_sparse_agree_when_sparse_pattern_is_dense() {
+    // Saturate the sparse pattern: window ≥ N means the local-window
+    // primitive includes every causal predecessor, so the attention
+    // edge set is identical to dense MHA's. The remaining gap is
+    // floating-point accumulation order (sparse goes
+    // window-then-stride-then-landmark, dense goes naive 0..i).
+    let seq = 32;
+    let heads = 4;
+    let dim = 16;
+    let (q, k, v) = make_qkv(seq, heads, dim);
+
+    let dense_cfg = TemporalHeadConfig {
+        backend: TemporalBackendKind::Dense,
+        q_heads: heads,
+        kv_heads: heads,
+        head_dim: dim,
+        window: seq, // saturate
+        block_size: seq,
+        causal: true,
+    };
+    let sparse_cfg = TemporalHeadConfig {
+        backend: TemporalBackendKind::SparseGqa,
+        ..dense_cfg.clone()
+    };
+
+    let dense = AetherTemporalHead::new(&dense_cfg).expect("dense");
+    let sparse = AetherTemporalHead::new(&sparse_cfg).expect("sparse");
+
+    let d = dense.forward(&q, &k, &v).expect("dense forward");
+    let s = sparse.forward(&q, &k, &v).expect("sparse forward");
+
+    let max_err = max_abs_err(&d, &s);
+    let mean_err = mean_abs_err(&d, &s);
+
+    // 1e-4 covers a generous f32-summation-order envelope at 0.1
+    // input magnitude. If this ever blows up, either the saturation
+    // assumption is wrong (window/block_size no longer covers
+    // everything) or the kernel changed semantics.
+    assert!(
+        max_err < 1.0e-4,
+        "saturated-pattern max_abs_err exceeds 1e-4: max={max_err} mean={mean_err}"
+    );
+}
+
+#[test]
+fn dense_and_sparse_diverge_predictably_at_long_n() {
+    // The interesting case: real sparse pattern (window << N), real
+    // approximation. We don't assert a specific error bound here —
+    // that's what ADR-096 §5's validation gate calibrates. We only
+    // check the numbers come out finite and plausible (per-position
+    // outputs stay within a few × the input magnitude after
+    // attention-weighted averaging — softmax can't blow them up).
+    let seq = 256;
+    let heads = 4;
+    let dim = 16;
+    let (q, k, v) = make_qkv(seq, heads, dim);
+
+    let dense_cfg = TemporalHeadConfig {
+        backend: TemporalBackendKind::Dense,
+        q_heads: heads,
+        kv_heads: heads,
+        head_dim: dim,
+        window: seq, // dense — placeholder; ignored by Dense backend
+        block_size: seq,
+        causal: true,
+    };
+    let sparse_cfg = TemporalHeadConfig {
+        backend: TemporalBackendKind::SparseGqa,
+        q_heads: heads,
+        kv_heads: heads,
+        head_dim: dim,
+        window: 16,    // realistic sparse window
+        block_size: 32,
+        causal: true,
+    };
+
+    let dense = AetherTemporalHead::new(&dense_cfg).expect("dense");
+    let sparse = AetherTemporalHead::new(&sparse_cfg).expect("sparse");
+
+    let d = dense.forward(&q, &k, &v).expect("dense forward");
+    let s = sparse.forward(&q, &k, &v).expect("sparse forward");
+
+    let max_err = max_abs_err(&d, &s);
+    let mean_err = mean_abs_err(&d, &s);
+
+    // Sanity bounds. Inputs are scaled to 0.1, attention is a softmax
+    // average so outputs stay in roughly [-0.1, 0.1]. If max_err > 1.0
+    // something is structurally broken (NaN, underflow, etc).
+    assert!(
+        max_err.is_finite() && mean_err.is_finite(),
+        "non-finite error: max={max_err} mean={mean_err}"
+    );
+    assert!(
+        max_err < 1.0,
+        "implausibly large divergence: max={max_err} mean={mean_err}"
+    );
+
+    // Print the numbers so they're visible when running `cargo test --
+    // --nocapture`. These are what ADR-096 §5's gate would calibrate
+    // against on real AETHER inputs.
+    eprintln!(
+        "dense_vs_sparse @ N={seq}, window=16, block=32: max_abs_err={max_err:.6e}, mean_abs_err={mean_err:.6e}"
+    );
+}
@@ -0,0 +1,133 @@
+//! Smoke tests for the AETHER sparse-GQA temporal head (ADR-096 §5 gate is
+//! a separate accuracy benchmark; this file just proves the wiring works).
+
+use wifi_densepose_temporal::{
+    AetherTemporalHead, TemporalBackendKind, TemporalHeadConfig, TemporalError, Tensor3,
+};
+
+fn make_qkv(seq: usize, q_heads: usize, kv_heads: usize, dim: usize) -> (Tensor3, Tensor3, Tensor3) {
+    // Deterministic synthetic CSI-like activations so the test is
+    // reproducible across machines without bringing in `rand`.
+    let mut q = Tensor3::zeros(seq, q_heads, dim);
+    for s in 0..seq {
+        for h in 0..q_heads {
+            for d in 0..dim {
+                let v = ((s * 31 + h * 7 + d) as f32).sin() * 0.1;
+                q.set(s, h, d, v);
+            }
+        }
+    }
+    let mut k = Tensor3::zeros(seq, kv_heads, dim);
+    let mut v = Tensor3::zeros(seq, kv_heads, dim);
+    for s in 0..seq {
+        for h in 0..kv_heads {
+            for d in 0..dim {
+                let kv = (((s * 17 + h * 3 + d) as f32).cos()) * 0.1;
+                k.set(s, h, d, kv);
+                v.set(s, h, d, kv * 0.5);
+            }
+        }
+    }
+    (q, k, v)
+}
+
+#[test]
+fn sparse_gqa_forward_runs_at_aether_default() {
+    let cfg = TemporalHeadConfig::default_aether();
+    let head = AetherTemporalHead::new(&cfg).expect("construct");
+
+    let (q, k, vt) = make_qkv(64, cfg.q_heads, cfg.kv_heads, cfg.head_dim);
+    let out = head.forward(&q, &k, &vt).expect("forward");
+    let (oseq, oh, od) = out.shape();
+    assert_eq!(oseq, 64);
+    assert_eq!(oh, cfg.q_heads);
+    assert_eq!(od, cfg.head_dim);
+}
+
+#[test]
+fn sparse_mha_path_runs_when_qkv_heads_match() {
+    // q_heads == kv_heads forces the `forward` (non-GQA) branch.
+    let cfg = TemporalHeadConfig {
+        backend: TemporalBackendKind::SparseGqa,
+        q_heads: 2,
+        kv_heads: 2,
+        head_dim: 16,
+        window: 8,
+        block_size: 4,
+        causal: true,
+    };
+    let head = AetherTemporalHead::new(&cfg).expect("construct");
+    let (q, k, vt) = make_qkv(32, 2, 2, 16);
+    let out = head.forward(&q, &k, &vt).expect("forward");
+    assert_eq!(out.shape(), (32, 2, 16));
+}
+
+#[test]
+fn dense_backend_forward_runs_with_matching_shape() {
+    // Dense_attention upstream requires q_heads == kv_heads (no GQA).
+    // Use MHA shape; n_classes/n_layers don't matter for forward-only.
+    let cfg = TemporalHeadConfig {
+        backend: TemporalBackendKind::Dense,
+        q_heads: 4,
+        kv_heads: 4,
+        head_dim: 16,
+        window: 8,
+        block_size: 4,
+        causal: true,
+    };
+    let head = AetherTemporalHead::new(&cfg).expect("construct dense");
+    let (q, k, v) = make_qkv(32, 4, 4, 16);
+    let out = head.forward(&q, &k, &v).expect("dense forward");
+    assert_eq!(out.shape(), (32, 4, 16));
+}
+
+#[test]
+fn dense_backend_step_returns_streaming_error() {
+    let cfg = TemporalHeadConfig {
+        backend: TemporalBackendKind::Dense,
+        q_heads: 4,
+        kv_heads: 4,
+        head_dim: 16,
+        window: 8,
+        block_size: 4,
+        causal: true,
+    };
+    let head = AetherTemporalHead::new(&cfg).expect("construct dense");
+    let cache_err = head.make_cache(32).err().expect("no cache for dense");
+    matches!(cache_err, TemporalError::BackendDoesNotSupportStreaming);
+}
+
+#[test]
+fn invalid_gqa_ratio_rejected_at_construction() {
+    let cfg = TemporalHeadConfig {
+        backend: TemporalBackendKind::SparseGqa,
+        q_heads: 5,
+        kv_heads: 2, // 5 % 2 != 0
+        head_dim: 16,
+        window: 8,
+        block_size: 4,
+        causal: true,
+    };
+    let err = AetherTemporalHead::new(&cfg).err().expect("rejected");
+    matches!(err, TemporalError::InvalidConfig(_));
+}
+
+#[test]
+fn long_window_at_aether_roadmap_target() {
+    // ADR-096 §3.1 roadmap target: 10 s @ 100 Hz = 1000 frames. Verify
+    // the kernel actually runs at this length so the long-window claim
+    // is more than aspirational.
+    let cfg = TemporalHeadConfig {
+        backend: TemporalBackendKind::SparseGqa,
+        q_heads: 4,
+        kv_heads: 1,
+        head_dim: 16,
+        window: 64,
+        block_size: 32,
+        causal: true,
+    };
+    let head = AetherTemporalHead::new(&cfg).expect("construct");
+    let (q, k, vt) = make_qkv(1000, 4, 1, 16);
+    let out = head.forward(&q, &k, &vt).expect("forward at N=1000");
+    assert_eq!(out.shape(), (1000, 4, 16));
+}
@@ -0,0 +1,139 @@
+//! ADR-096 §3.2 streaming-decode test: token-by-token `step()` against
+//! a `KvCache` should match a single-shot `forward()` over the same
+//! Q/K/V at the final position. This is the structural advantage
+//! dense MHA can't follow — proving it stays correct under streaming
+//! is what the §5 validation gate would care about most.
+
+use wifi_densepose_temporal::{
+    AetherTemporalHead, TemporalBackendKind, TemporalHeadConfig, Tensor3,
+};
+
+fn make_qkv(seq: usize, q_heads: usize, kv_heads: usize, dim: usize) -> (Tensor3, Tensor3, Tensor3) {
+    let mut q = Tensor3::zeros(seq, q_heads, dim);
+    let mut k = Tensor3::zeros(seq, kv_heads, dim);
+    let mut v = Tensor3::zeros(seq, kv_heads, dim);
+    for s in 0..seq {
+        for h in 0..q_heads {
+            for d in 0..dim {
+                let val = ((s * 31 + h * 7 + d) as f32).sin() * 0.1;
+                q.set(s, h, d, val);
+            }
+        }
+        for h in 0..kv_heads {
+            for d in 0..dim {
+                let val = (((s * 17 + h * 3 + d) as f32).cos()) * 0.1;
+                k.set(s, h, d, val);
+                v.set(s, h, d, val * 0.5);
+            }
+        }
+    }
+    (q, k, v)
+}
+
+fn slice_token(t: &Tensor3, idx: usize) -> Tensor3 {
+    let (_, heads, dim) = t.shape();
+    let mut out = Tensor3::zeros(1, heads, dim);
+    for h in 0..heads {
+        for d in 0..dim {
+            out.set(0, h, d, t.get(idx, h, d));
+        }
+    }
+    out
+}
+
+fn config_mha_small() -> TemporalHeadConfig {
+    // Equal q/k heads forces the `forward` MHA branch — `decode_step`
+    // upstream is wired to this branch, not the GQA branch (which has
+    // its own decode path coming in upstream's roadmap).
+    TemporalHeadConfig {
+        backend: TemporalBackendKind::SparseGqa,
+        q_heads: 2,
+        kv_heads: 2,
+        head_dim: 16,
+        window: 8,
+        block_size: 4,
+        causal: true,
+    }
+}
+
+#[test]
+fn streaming_step_matches_forward_at_last_position() {
+    let cfg = config_mha_small();
+    let head = AetherTemporalHead::new(&cfg).expect("construct");
+
+    let seq = 16usize;
+    let (q, k, v) = make_qkv(seq, cfg.q_heads, cfg.kv_heads, cfg.head_dim);
+
+    // Reference: single-shot forward over the full sequence.
+    let reference = head.forward(&q, &k, &v).expect("forward");
+
+    // Streaming: append k/v one token at a time, decode the new q.
+    let mut cache = head.make_cache(seq).expect("cache");
+    let mut last_out: Option<Tensor3> = None;
+    for t in 0..seq {
+        let qt = slice_token(&q, t);
+        let kt = slice_token(&k, t);
+        let vt = slice_token(&v, t);
+        last_out = Some(head.step(&qt, &kt, &vt, &mut cache).expect("step"));
+    }
+    let streamed = last_out.expect("at least one step");
+
+    // Compare the streamed last-token output to the reference's
+    // last-token output. Tolerance is generous because numerical
+    // accumulation differs between the two paths even at exact
+    // mathematical equivalence.
+    let (s_seq, s_heads, s_dim) = streamed.shape();
+    assert_eq!((s_seq, s_heads, s_dim), (1, cfg.q_heads, cfg.head_dim));
+    let mut max_abs_err: f32 = 0.0;
+    for h in 0..cfg.q_heads {
+        for d in 0..cfg.head_dim {
+            let a = streamed.get(0, h, d);
+            let b = reference.get(seq - 1, h, d);
+            let err = (a - b).abs();
+            if err > max_abs_err {
+                max_abs_err = err;
+            }
+        }
+    }
+    // 1e-3 absolute is a comfortable bound for activations of this
+    // magnitude (~0.1 input scale). Tighten if the kernel ever
+    // promises closer match.
+    assert!(
+        max_abs_err < 1.0e-3,
+        "streaming/forward divergence at last token exceeds 1e-3: max_abs_err = {max_abs_err}"
+    );
+}
+
+#[test]
+fn step_rejects_multi_token_q() {
+    let cfg = config_mha_small();
+    let head = AetherTemporalHead::new(&cfg).expect("construct");
+    let mut cache = head.make_cache(8).expect("cache");
+
+    // Build a 2-token Q/K/V — `step` must reject (its contract is
+    // single-token decode).
+    let (q, k, v) = make_qkv(2, cfg.q_heads, cfg.kv_heads, cfg.head_dim);
+    let err = head.step(&q, &k, &v, &mut cache).err().expect("rejected");
+    let s = format!("{err}");
+    assert!(
+        s.contains("single-token") || s.to_lowercase().contains("seq"),
+        "expected single-token rejection, got: {s}"
+    );
+}
+
+#[test]
+fn make_cache_returns_kvcache_with_correct_shape() {
+    // Smoke test that the convenience wrapper plumbs the right dims
+    // into KvCache::new — the upstream constructor takes
+    // (capacity, kv_heads, dim, block_size) and we want to make sure
+    // we're not transposing any of those.
+    let cfg = config_mha_small();
+    let head = AetherTemporalHead::new(&cfg).expect("construct");
+    let mut cache = head.make_cache(32).expect("cache");
+
+    // Append one token shaped for kv_heads × head_dim — should not error.
+    let (_, k, v) = make_qkv(1, cfg.q_heads, cfg.kv_heads, cfg.head_dim);
+    let kt = slice_token(&k, 0);
+    let vt = slice_token(&v, 0);
+    cache.try_append(&kt, &vt).expect("append shape ok");
+}
@@ -0,0 +1,140 @@
+//! Roundtrip + corruption-detection tests for the temporal head's
+//! weight-blob wire format. The format is the contract between
+//! host-side training and firmware-side inference — when this test
+//! file changes, both ends update in lockstep.
+
+use wifi_densepose_temporal::{
+    WeightBlob, WeightBlobHeader, WeightDtype, WEIGHT_BLOB_HEADER_LEN, WEIGHT_BLOB_MAGIC,
+    WEIGHT_BLOB_VERSION,
+};
+
+fn header_default() -> WeightBlobHeader {
+    WeightBlobHeader {
+        dtype: WeightDtype::F32,
+        input_dim: 16,
+        n_q_heads: 4,
+        n_kv_heads: 1,
+        head_dim: 32,
+        n_layers: 2,
+        n_classes: 4,
+    }
+}
+
+#[test]
+fn roundtrip_fp32() {
+    let header = header_default();
+    let weights: Vec<u8> = (0..1024).map(|i| (i & 0xFF) as u8).collect();
+    let blob = WeightBlob::new(header, weights).expect("ok");
+    let serialized = blob.serialize();
+    let parsed = WeightBlob::parse(&serialized).expect("parse");
+    assert_eq!(parsed.header.input_dim, 16);
+    assert_eq!(parsed.header.n_q_heads, 4);
+    assert_eq!(parsed.header.n_kv_heads, 1);
+    assert_eq!(parsed.header.head_dim, 32);
+    assert_eq!(parsed.header.n_layers, 2);
+    assert_eq!(parsed.header.n_classes, 4);
+    assert_eq!(parsed.header.dtype, WeightDtype::F32);
+    assert_eq!(parsed.weights.len(), 1024);
+}
+
+#[test]
+fn roundtrip_fp16() {
+    let header = WeightBlobHeader {
+        dtype: WeightDtype::F16,
+        ..header_default()
+    };
+    // FP16 means 2 bytes per element — 512 bytes = 256 elements.
+    let weights: Vec<u8> = (0..512).map(|i| (i & 0xFF) as u8).collect();
+    let blob = WeightBlob::new(header, weights).expect("ok");
+    let serialized = blob.serialize();
+    let parsed = WeightBlob::parse(&serialized).expect("parse");
+    assert_eq!(parsed.header.dtype, WeightDtype::F16);
+    assert_eq!(parsed.weights.len(), 512);
+}
+
+#[test]
+fn parse_rejects_bad_magic() {
+    let header = header_default();
+    let blob = WeightBlob::new(header, vec![0u8; 16]).expect("ok");
+    let mut bytes = blob.serialize();
+    bytes[0] = 0xFF; // corrupt magic
+    let err = WeightBlob::parse(&bytes).err().expect("rejected");
+    assert!(format!("{err}").contains("magic"));
+}
+
+#[test]
+fn parse_rejects_wrong_version() {
+    let header = header_default();
+    let blob = WeightBlob::new(header, vec![0u8; 16]).expect("ok");
+    let mut bytes = blob.serialize();
+    bytes[4] = 99; // bump version
+    bytes[5] = 0;
+    let err = WeightBlob::parse(&bytes).err().expect("rejected");
+    assert!(format!("{err}").contains("version"));
+}
+
+#[test]
+fn parse_rejects_size_mismatch() {
+    let header = header_default();
+    let blob = WeightBlob::new(header, vec![0u8; 64]).expect("ok");
+    let mut bytes = blob.serialize();
+    // truncate the weights region by 4 bytes — total length now
+    // doesn't match the weights_len field.
+    bytes.drain(WEIGHT_BLOB_HEADER_LEN..WEIGHT_BLOB_HEADER_LEN + 4);
+    let err = WeightBlob::parse(&bytes).err().expect("rejected");
+    assert!(format!("{err}").contains("length") || format!("{err}").contains("CRC"));
+}
+
+#[test]
+fn parse_rejects_crc_corruption() {
+    let header = header_default();
+    let blob = WeightBlob::new(header, vec![0xAAu8; 32]).expect("ok");
+    let mut bytes = blob.serialize();
+    // flip a bit in the middle of the weights region
+    let mid = WEIGHT_BLOB_HEADER_LEN + 5;
+    bytes[mid] ^= 0x01;
+    let err = WeightBlob::parse(&bytes).err().expect("rejected");
+    assert!(format!("{err}").contains("CRC"));
+}
+
+#[test]
+fn parse_rejects_invalid_gqa_ratio_in_header() {
+    // Manually craft bytes where n_q_heads % n_kv_heads != 0 to ensure
+    // header.validate() fires from inside parse(). Easiest: build a
+    // valid blob then patch the n_kv_heads field.
+    let header = header_default();
+    let blob = WeightBlob::new(header, vec![0u8; 16]).expect("ok");
+    let mut bytes = blob.serialize();
+    // n_kv_heads is at offset 12..14; set it to 3 so 4 % 3 != 0.
+    bytes[12] = 3;
+    bytes[13] = 0;
+    // Re-CRC so we can be sure the validator (not the CRC) is what
+    // rejects this case.
+    let new_crc = crc32_ieee(&bytes[..bytes.len() - 4]);
+    let crc_off = bytes.len() - 4;
+    bytes[crc_off..].copy_from_slice(&new_crc.to_le_bytes());
+    let err = WeightBlob::parse(&bytes).err().expect("rejected");
+    assert!(format!("{err}").to_lowercase().contains("gqa"));
+}
+
+#[test]
+fn header_constants_match_wire_layout() {
+    // Anchor the public constants so they can't drift silently.
+    assert_eq!(WEIGHT_BLOB_MAGIC, 0x5256_4E45);
+    assert_eq!(WEIGHT_BLOB_VERSION, 1);
+    assert_eq!(WEIGHT_BLOB_HEADER_LEN, 24);
+}
+
+// Mirror of the production CRC32 so the size-mismatch / GQA tests can
+// re-CRC after their patch. Kept out of the public API.
+fn crc32_ieee(data: &[u8]) -> u32 {
+    let mut crc = 0xFFFF_FFFFu32;
+    for &b in data {
+        crc ^= b as u32;
+        for _ in 0..8 {
+            let mask = 0u32.wrapping_sub(crc & 1);
+            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
+        }
+    }
+    !crc
+}
@@ -24,6 +24,11 @@ required-features = ["tch-backend"]
 default = []
 tch-backend = ["tch"]
 cuda = ["tch-backend"]
+# ADR-096 sparse-GQA temporal head. Pulls wifi-densepose-temporal in
+# alongside tch — the new path is additive, doesn't touch the existing
+# model.rs code paths, and stays opt-in until the §5 validation gate
+# clears.
+aether-sparse-temporal = ["tch-backend", "dep:wifi-densepose-temporal"]

 [dependencies]
 # Internal crates
@@ -54,6 +59,10 @@ ruvector-temporal-tensor = { workspace = true }
 ruvector-solver = { workspace = true }
 ruvector-attention = { workspace = true }

+# AETHER temporal head (ADR-096). Optional + tch-gated — only meaningful
+# alongside the existing tch-bound model graph.
+wifi-densepose-temporal = { workspace = true, optional = true }
+
 # Data loading
 ndarray-npy.workspace = true
 memmap2 = "0.9"
@@ -69,6 +69,13 @@ pub mod proof;
 #[cfg(feature = "tch-backend")]
 pub mod trainer;

+// ADR-096 AETHER temporal head — additive integration. Pulled in via
+// the `aether-sparse-temporal` feature, which itself requires
+// `tch-backend`. Kept under its own cfg so the existing build with
+// just `tch-backend` is byte-equivalent to before.
+#[cfg(feature = "aether-sparse-temporal")]
+pub mod temporal_aether;
+
 // Convenient re-exports at the crate root.
 pub use config::TrainingConfig;
 pub use dataset::{CsiDataset, CsiSample, DataLoader, MmFiDataset, SyntheticCsiDataset, SyntheticConfig};
@@ -0,0 +1,178 @@
+//! ADR-096 AETHER temporal head — `tch::nn` bridge.
+//!
+//! Additive integration: wires `wifi-densepose-temporal` (sparse-GQA
+//! attention + streaming KvCache) into the train crate's tch graph.
+//! Does NOT modify the existing `WiFiDensePoseModel` forward in
+//! `model.rs` — that path stays bit-equivalent for back-compat. Use
+//! this aggregator alongside the existing model when you want a
+//! temporal-axis pooling on top of per-frame backbone features.
+//!
+//! Bridge boundary:
+//!   tch::Tensor [T, in_dim]  →  Tensor3 (seq=T, heads, dim)  →  attention
+//!                            ←  Tensor3                       ←  forward()
+//!   tch::Tensor [in_dim] (pooled embedding)
+//!
+//! Memory pattern: tch.copy_data → Vec<f32> → Tensor3::from_vec on the
+//! way in; Tensor3 raw → Tensor::of_slice on the way out. Two host
+//! copies per call. For training-rate forwards (~100 calls/sec at
+//! batch 16) this is negligible vs the actual attention work; for
+//! inference-rate streaming it'd be the bottleneck and a
+//! zero-copy path is the natural Phase 2.
+//!
+//! Only the B=1 prefill path is implemented in this commit. Multi-batch
+//! and the streaming `step()` bridge land when the §5 validation gate
+//! turns green and we need to take the perf hit seriously.
+//!
+//! Feature-gated: `aether-sparse-temporal` (also requires `tch-backend`).
+
+use tch::{
+    nn::{self, Module},
+    Device, Kind, Tensor,
+};
+
+use wifi_densepose_temporal::{
+    AetherTemporalHead, TemporalBackendKind, TemporalError, TemporalHeadConfig, Tensor3,
+};
+
+/// Aggregator: tch-side projections + the pure-Rust sparse attention
+/// kernel + a tch-side output projection. The projection layers are
+/// `nn::Linear` so they participate in the tch VarStore the same way
+/// the rest of the model does — gradients, save/load, etc.
+pub struct AetherTemporalAggregator {
+    cfg: TemporalHeadConfig,
+    in_dim: i64,
+
+    // tch-side learnable projections.
+    q_proj: nn::Linear,
+    k_proj: nn::Linear,
+    v_proj: nn::Linear,
+    o_proj: nn::Linear,
+
+    // The kernel itself is configuration-only; no weights live inside
+    // because the sparse attention forward is purely a function of
+    // q/k/v + the SparseAttentionConfig.
+    head: AetherTemporalHead,
+}
+
+impl AetherTemporalAggregator {
+    /// Build the aggregator. `vs` is the tch namespace under which
+    /// the four projection layers register. `in_dim` is the input
+    /// feature dimension per frame (e.g. backbone output dim).
+    pub fn new(vs: nn::Path, in_dim: i64, cfg: TemporalHeadConfig) -> Result<Self, TemporalError> {
+        cfg.validate()?;
+        // Backend has to be Sparse — Dense projections would still
+        // work, but the whole point of this integration is the new
+        // sparse-GQA path. If a caller wants dense, they can keep
+        // using `apply_antenna_attention` / `apply_spatial_attention`
+        // from model.rs.
+        if !matches!(cfg.backend, TemporalBackendKind::SparseGqa) {
+            return Err(TemporalError::InvalidConfig(
+                "aggregator only wires SparseGqa; use existing model.rs paths for dense",
+            ));
+        }
+
+        let total_q = (cfg.q_heads * cfg.head_dim) as i64;
+        let total_kv = (cfg.kv_heads * cfg.head_dim) as i64;
+
+        let q_proj = nn::linear(&vs / "q_proj", in_dim, total_q, Default::default());
+        let k_proj = nn::linear(&vs / "k_proj", in_dim, total_kv, Default::default());
+        let v_proj = nn::linear(&vs / "v_proj", in_dim, total_kv, Default::default());
+        let o_proj = nn::linear(&vs / "o_proj", total_q, in_dim, Default::default());
+
+        let head = AetherTemporalHead::new(&cfg)?;
+
+        Ok(Self {
+            cfg,
+            in_dim,
+            q_proj,
+            k_proj,
+            v_proj,
+            o_proj,
+            head,
+        })
+    }
+
+    /// Forward over a single sequence of frames. Input shape:
+    /// `[T, in_dim]` (NB: B=1 only this version — see file header).
+    /// Returns the per-token attention output passed through the
+    /// output projection: `[T, in_dim]`.
+    ///
+    /// Pooling (mean over T, last-token, attention-pool, etc.) is the
+    /// caller's job — different downstream consumers want different
+    /// pools and we don't want to bake one in.
+    pub fn forward(&self, frames: &Tensor) -> Result<Tensor, TemporalError> {
+        let dims = frames.size();
+        if dims.len() != 2 || dims[1] != self.in_dim {
+            return Err(TemporalError::InvalidConfig(
+                "aggregator.forward expects [T, in_dim] tch::Tensor",
+            ));
+        }
+        let t = dims[0] as usize;
+        let device = frames.device();
+
+        // ── Project to Q/K/V on the tch side ──────────────────────
+        let q_th = self.q_proj.forward(frames); // [T, q_heads*head_dim]
+        let k_th = self.k_proj.forward(frames); // [T, kv_heads*head_dim]
+        let v_th = self.v_proj.forward(frames); // [T, kv_heads*head_dim]
+
+        // ── Bridge to Tensor3 (CPU, f32) ──────────────────────────
+        let q_t3 = tch_to_tensor3(&q_th, t, self.cfg.q_heads, self.cfg.head_dim)?;
+        let k_t3 = tch_to_tensor3(&k_th, t, self.cfg.kv_heads, self.cfg.head_dim)?;
+        let v_t3 = tch_to_tensor3(&v_th, t, self.cfg.kv_heads, self.cfg.head_dim)?;
+
+        // ── Sparse attention forward (pure-Rust path) ────────────
+        let attn_out = self.head.forward(&q_t3, &k_t3, &v_t3)?;
+
+        // ── Bridge back to tch ───────────────────────────────────
+        let attn_th = tensor3_to_tch(&attn_out, device);
+        // attn_th shape is [T, q_heads*head_dim].
+
+        // ── Output projection on tch side ────────────────────────
+        let out = self.o_proj.forward(&attn_th); // [T, in_dim]
+        Ok(out)
+    }
+}
+
+/// Reshape a `[T, heads*head_dim]` tch::Tensor on (any device, any
+/// kind) into a CPU `Tensor3(seq=T, heads, head_dim)`. Forces f32 +
+/// CPU + contiguous memory; copies once.
+fn tch_to_tensor3(
+    th: &Tensor,
+    seq: usize,
+    heads: usize,
+    head_dim: usize,
+) -> Result<Tensor3, TemporalError> {
+    let dims = th.size();
+    if dims.len() != 2 || dims[0] as usize != seq || dims[1] as usize != heads * head_dim {
+        return Err(TemporalError::InvalidConfig(
+            "tch_to_tensor3 shape mismatch",
+        ));
+    }
+    let cpu = th.to_kind(Kind::Float).to_device(Device::Cpu).contiguous();
+    let total = seq * heads * head_dim;
+    let mut buf = vec![0.0f32; total];
+    cpu.copy_data(&mut buf, total);
+    // tch row-major flatten gives [seq][heads*head_dim]. Tensor3
+    // expects [seq][heads][dim] in the same row-major order, so the
+    // contiguous bytes are layout-compatible — no per-element
+    // transpose required.
+    Tensor3::from_vec(buf, seq, heads, head_dim)
+        .map_err(|e| TemporalError::InvalidConfig(Box::leak(format!("from_vec: {e}").into_boxed_str())))
+}
+
+/// Inverse of `tch_to_tensor3`: take a `Tensor3(seq, heads, dim)` and
+/// produce a `[seq, heads*dim]` tch::Tensor on the requested device.
+fn tensor3_to_tch(t3: &Tensor3, device: Device) -> Tensor {
+    let (seq, heads, dim) = t3.shape();
+    // Tensor3 stores seq×heads×dim contiguously; flatten heads/dim
+    // by reading the row at each (seq, head) and concatenating.
+    let mut flat = Vec::with_capacity(seq * heads * dim);
+    for s in 0..seq {
+        for h in 0..heads {
+            flat.extend_from_slice(t3.row(s, h));
+        }
+    }
+    Tensor::from_slice(&flat)
+        .reshape([seq as i64, (heads * dim) as i64])
+        .to_device(device)
+}