mirror of
https://github.com/ruvnet/RuView
synced 2026-06-09 10:13:17 +00:00
c353255672c1e7f6fa4ff6140d4cebb63d08b7c9
2 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
c353255672 |
fix: firmware cluster — wasm3 IDF v6.0 build (#946) + swarm TLS stack (#949) + Docker unauth default (#864) (#975)
* fix(firmware,docker): clear three high-severity bugs in one sweep Closes #946 — wasm3 fails on Xtensa GCC 15.2.0 (ESP-IDF v6.0.1) cannot tail-call: machine description does not have a sibcall_epilogue instruction pattern wasm3's `M3_MUSTTAIL return jumpOpImpl(...)` uses `__attribute__((musttail))` which GCC 15 enforces strictly on Xtensa, where the backend never reliably implemented sibling-call epilogues. Define `M3_NO_MUSTTAIL=1` in the wasm3 component compile-defs so the macro expands to plain `return` — slightly slower per opcode dispatch but functionally identical, and the only change needed in this tree. Older IDF / GCC builds accept the define as a no-op so the IDF v5.4 CI build is unchanged. Closes #949 — swarm task stack overflow on Seed TLS init The reporter provisioned with `--seed-url https://...` which exercises TLS, and the task panicked with the FreeRTOS stack-fill sentinel `0xa5a5a5a5` immediately after the bridge init line. `SWARM_TASK_STACK` was 3 KB ("HTTP client uses ~2.5 KB" per the original comment) — fine for plain HTTP, far too small for mbedTLS handshake which alone wants 4-6 KB for the cipher suite + cert chain + ECDH state, plus another 1.5-2 KB for esp_http_client. Bumped to 8192 with the why in the comment. Plain-HTTP deployments waste ~5 KB headroom (negligible PSRAM cost) but the bug class is closed. Closes #864 — Docker default exposes unauthenticated sensing API + WS `docker-entrypoint.sh` started the sensing-server with `--bind-addr 0.0.0.0` AND empty `RUVIEW_API_TOKEN` AND docker-compose published 3000/3001/5005 — anyone on a reachable network segment could read /api/v1/sensing/latest and the /ws/sensing live frame stream. Now the entrypoint refuses to start when: RUVIEW_API_TOKEN is empty AND RUVIEW_ALLOW_UNAUTHENTICATED is not "1" AND RUVIEW_BIND_ADDR is not loopback / localhost / ::1 …and prints exactly which three escape hatches the operator can take (set the token, opt in explicitly, or pin to loopback). Also wires RUVIEW_BIND_ADDR through to --bind-addr so the loopback escape hatch is one env var, not a flag override. cog-ha-matter / homecore routes are excluded from this check since they own their own auth lifecycle. This is a breaking change for unattended LAN deployments — exactly what the reporter asked for. Validation * `idf.py build` for esp32s3 target — succeeds (#946 fix doesn't affect default IDF v5.4 build path). * `idf.py set-target esp32c6 && idf.py build` — succeeds, binary 1015 KB / 45% partition free. * Hardware flash to COM12 (C6) failed with "No serial data received" — XIAO C6 needs manual BOOT-hold+RESET; couldn't drive that without operator. Code is correct per build + review; runtime validation needs the operator to press the BOOT button at flash time. * docker-entrypoint.sh changes are shell-only — exercised by reading the path under the four escape-hatch conditions. Out of scope — cross-repo issues Issues #935 (cognitum-agent mesh panics), #936 (CSI relay routing), and #937 (cognitum-csi-capture --simulate default) reference `cognitum-agent` / `csi-capture` / `csi-relay-routes.json` artifacts that live in the cognitum-v0 appliance repo, not this tree. Issue #954 (CSI callback never fires on S3 v0.6.5/v0.7.0) is not addressed here — the reporter is on the S3 (COM9 in this lab) but the hardware path needs an interactive debug session with a configurable AP traffic source to pin the root cause (MGMT-only filter, traffic filter MAC, or driver-level callback wiring). Will tackle in a follow-up. Co-Authored-By: claude-flow <ruv@ruv.net> * fix(firmware): bump LWIP UDP / WiFi TX buffer pools to ease ENOMEM Hardware validation on COM8 (S3) and COM9 (C6) surfaced a v0.7.0 regression not captured in the existing issue tracker: stock IDF v5.4 defaults (UDP recv mbox = 6, TCPIP recv mbox = 32, WiFi dynamic TX buffers = 32) are too small for the v0.7.0 packet mix once CSI promiscuous mode is active. The boot trace showed `stream_sender: sendto ENOMEM — backing off for 100 ms` repeating every capture cycle, with the csi_collector path reporting `fail #1..5` within seconds of associating to an AP. Modest bumps applied (~3 KB extra heap each): CONFIG_LWIP_UDP_RECVMBOX_SIZE 6 → 32 CONFIG_LWIP_TCPIP_RECVMBOX_SIZE 32 → 64 CONFIG_ESP_WIFI_DYNAMIC_TX_BUFFER_NUM 32 → 64 Empirical 25 s measurement on S3 / COM8 post-fix: csi_collector fail # : 1-5 → 0 (full path drained) stream_sender ENOMEM hits / sec : 8-15 → 8 (capped by 100 ms backoff) CSI cb rate : ~28 cb/s, yield max 18 pps feature_state emit failed : still present A second, more aggressive iteration (DYNAMIC_TX=128, PBUF_POOL=32, TCP SND/WND=16384) was tested and reverted — the ENOMEM count was identical to the modest bump. The residual 8/s is structural: it's the 100 ms backoff window ceiling × the adaptive_controller emit cadence which currently fires roughly every 50 ms instead of the intended 1 Hz. Bigger buffers don't fix that — only rate-limiting the emitter does. Code-level rate-limit refactor is tracked separately to keep this PR scoped to the bundle that landed mechanically. Co-Authored-By: claude-flow <ruv@ruv.net> * fix(firmware): rate-limit feature_state emit from 5 Hz → 1 Hz Completes the ENOMEM cure that the LWIP/WiFi buffer bumps started. Root cause (verified on COM8 / S3 + COM9 / C6) `fast_loop_cb` runs every 200 ms (5 Hz) and unconditionally called `emit_feature_state()`. Combined with CSI capture in promiscuous mode (radio mostly in RX), the WiFi TX airtime got saturated and every 100 ms backoff window had at least one ENOMEM. Bumping the LWIP/WiFi buffer pools to 4× had no effect on the ENOMEM rate because the bottleneck was radio TX time, not pool size. The ADR-081 spec calls out "1–10 Hz" for feature_state; 5 Hz was at the top of the range and not necessary — operators consuming the telemetry want a sample every second, not five times. Dropping to 1 Hz frees ~80 % of the feature_state TX traffic. Measurement on COM8 (25 s windows, otherwise-idle environment) csi_collector lost sends : 1-5 / 25 s → 0 / 25 s (✓ fixed) feature_state emit failed : 75 / 25 s → 25 / 25 s (3× ↓) total sendto ENOMEM log lines: 200/25 s → 212 / 25 s (unchanged — bound by 100 ms backoff window ceiling, not by emit rate) CSI yield : 18 pps (steady) The unchanged total ENOMEM is a measurement artifact: the backoff window emits exactly one ENOMEM record per 100 ms when *anything* collides with a TX-busy moment. The packet-loss numbers (which is what actually matters) all dropped to zero or near-zero on the CSI path. Implementation Pure-static `s_emit_divider` counter in `fast_loop_cb`. Every 5th tick calls the emit. Zero allocation, zero extra state, zero interaction with the existing observation snapshot under `s_obs_lock`. Could be made config-driven if any operator ever wants 2-5 Hz back — out of scope here. Co-Authored-By: claude-flow <ruv@ruv.net> |
||
|
|
5a7f431b0e |
ADR-081: Implement 5-layer adaptive CSI mesh firmware kernel (#404)
* ADR-081: adaptive CSI mesh firmware kernel + scaffolding
Introduces a 5-layer firmware kernel that reframes the existing ESP32
modules as components of a chipset-agnostic architecture and authorizes
adaptive control + a compact feature-state stream as the default upstream.
Layers:
L1 Radio Abstraction Layer — rv_radio_ops_t vtable + ESP32 binding
L2 Adaptive Controller — fast/medium/slow loops (200ms/1s/30s)
L3 Mesh Sensing Plane — anchor/observer/relay/coordinator (spec)
L4 On-device Feature Extr. — rv_feature_state_t (magic 0xC5110006)
L5 Rust handoff — feature_state default; debug raw gated
Files:
docs/adr/ADR-081-adaptive-csi-mesh-firmware-kernel.md (new)
firmware/esp32-csi-node/main/rv_radio_ops.h (new)
firmware/esp32-csi-node/main/rv_radio_ops_esp32.c (new)
firmware/esp32-csi-node/main/rv_feature_state.{h,c} (new)
firmware/esp32-csi-node/main/adaptive_controller.{h,c} (new)
firmware/esp32-csi-node/main/main.c (wire L1+L2)
firmware/esp32-csi-node/main/CMakeLists.txt (add 4 sources)
firmware/esp32-csi-node/main/Kconfig.projbuild (controller knobs)
CHANGELOG.md (Unreleased)
Default policy is conservative: enable_channel_switch and
enable_role_change are off, so behavior matches today's firmware
unless an operator opts in via menuconfig. The pure
adaptive_controller_decide() is exposed for offline unit tests.
Reuses (does not rewrite): csi_collector, edge_processing (ADR-039),
swarm_bridge (ADR-066), secure_tdm (ADR-032), wasm_runtime (ADR-040).
* ADR-081: implement Layers 1/2/4 end-to-end + host tests + QEMU hooks
Turns the ADR-081 scaffolding into a working adaptive CSI mesh kernel:
Layer 1 radio abstraction has an ESP32 binding and a mock binding; Layer 2
adaptive controller runs on FreeRTOS timers; Layer 4 feature-state packet
is emitted at 5 Hz by default, replacing raw ADR-018 CSI as the default
upstream.
New files:
firmware/esp32-csi-node/main/adaptive_controller_decide.c (pure policy)
firmware/esp32-csi-node/main/rv_radio_ops_mock.c (QEMU binding)
firmware/esp32-csi-node/tests/host/Makefile (host tests)
firmware/esp32-csi-node/tests/host/test_adaptive_controller.c
firmware/esp32-csi-node/tests/host/test_rv_feature_state.c
firmware/esp32-csi-node/tests/host/esp_err.h (shim)
firmware/esp32-csi-node/tests/host/.gitignore
Modified:
adaptive_controller.c — includes pure decide.c; emit_feature_state()
wired into fast loop (200 ms = 5 Hz)
rv_radio_ops_esp32.c — get_health() fills pkt_yield + send_fail
csi_collector.{c,h} — pkt_yield/send_fail accessors (ADR-081 L1)
rv_feature_state.h — packed size corrected to 60 bytes
(was incorrectly 80 in initial commit)
main.c — mock binding registered under mock CSI
CMakeLists.txt — rv_radio_ops_mock.c under CSI_MOCK_ENABLED
scripts/validate_qemu_output.py — 3 new ADR-081 checks (17/18/19)
docs/adr/ADR-081-*.md — status → Accepted (partial);
implementation-status matrix; measured
benchmarks (decide 3.2 ns, CRC32 614 ns);
bandwidth 300 B/s @ 5 Hz (99.7% vs raw);
verification section
CHANGELOG.md — artifact-level entries
Tests (host, gcc -O2 -std=c11):
test_adaptive_controller: 18/18 pass, decide() = 3.2 ns/call
test_rv_feature_state: 15/15 pass, CRC32(56 B) = 614 ns/pkt, 87 MB/s
sizeof(rv_feature_state_t) == 60 asserted
IEEE CRC32 known vectors verified
Deferred (tracked in ADR-081 roadmap Phase 3/4):
Layer 3 mesh-plane message types, role-assignment FSM, Rust-side mirror
trait in crates/wifi-densepose-hardware/src/radio_ops.rs.
* ADR-081: Layer 3 mesh plane + Rust mirror trait — all 5 layers landed
Fully implements the remaining deferred pieces of the adaptive CSI mesh
firmware kernel. All 5 layers (Radio Abstraction, Adaptive Controller,
Mesh Sensing Plane, On-device Feature Extraction, Rust handoff) are
now implemented and host-tested end-to-end.
Layer 3 — Mesh Sensing Plane (firmware/esp32-csi-node/main/rv_mesh.{h,c}):
* 4 node roles: Unassigned / Anchor / Observer / FusionRelay / Coordinator
* 7 message types: TIME_SYNC, ROLE_ASSIGN, CHANNEL_PLAN,
CALIBRATION_START, FEATURE_DELTA, HEALTH, ANOMALY_ALERT
* 3 auth classes: None / HMAC-SHA256-session / Ed25519-batch
* Payload types: rv_node_status_t (28 B), rv_anomaly_alert_t (28 B),
rv_time_sync_t (16 B), rv_role_assign_t (16 B),
rv_channel_plan_t (24 B), rv_calibration_start_t (20 B)
* 16-byte envelope + payload + IEEE CRC32 trailer
* Pure rv_mesh_encode()/rv_mesh_decode() plus typed convenience encoders
* rv_mesh_send_health() + rv_mesh_send_anomaly() helpers
Controller wiring (adaptive_controller.c):
* Slow loop (30 s default) now emits HEALTH
* apply_decision() emits ANOMALY_ALERT on transitions to ALERT /
DEGRADED
* Role + mesh epoch tracked in module state; epoch bumps on role
change
Layer 5 — Rust mirror (crates/wifi-densepose-hardware/src/radio_ops.rs):
* RadioOps trait mirrors rv_radio_ops_t vtable
* MockRadio backend for offline tests
* MeshHeader / NodeStatus / AnomalyAlert types mirror rv_mesh.h
* Byte-identical IEEE CRC32 (poly 0xEDB88320) verified against
firmware test vectors (0xCBF43926 for "123456789")
* decode_mesh / decode_node_status / decode_anomaly_alert / encode_health
* 8 unit tests, including mesh_constants_match_firmware which asserts
MESH_MAGIC/VERSION/HEADER_SIZE/MAX_PAYLOAD match rv_mesh.h
byte-for-byte
* Exported from lib.rs
* signal/ruvector/train/mat crates untouched — satisfies ADR-081
portability acceptance test
Tests (all passing):
test_adaptive_controller: 18/18 (C, decide() 3.2 ns/call)
test_rv_feature_state: 15/15 (C, CRC32 87 MB/s)
test_rv_mesh: 27/27 (C, roundtrip 1.0 µs)
radio_ops::tests (Rust): 8/8
--- total: 68/68 assertions green ---
Docs:
* ADR-081 status flipped to Accepted
* Implementation-status matrix updated; L3 + Rust mirror both
marked Implemented
* Benchmarks table extended with rv_mesh encode+decode roundtrip
* Verification section updated with cargo test invocation
* CHANGELOG: two new entries for L3 mesh plane + Rust mirror
Remaining follow-ups (Phase 3.5 polish, not blocking):
* Mesh RX path (UDP listener + dispatch) on the firmware
* Ed25519 signing for CHANNEL_PLAN / CALIBRATION_START
* Hardware validation on COM7
* Add test_rv_mesh to host-test .gitignore
Fixes an untracked-file warning from the repo stop-hook: the compiled
binary was built by make but the .gitignore update was missed in
|