mirror of
https://github.com/ruvnet/RuView
synced 2026-06-09 10:13:17 +00:00
fix: firmware cluster — wasm3 IDF v6.0 build (#946) + swarm TLS stack (#949) + Docker unauth default (#864) (#975)
* fix(firmware,docker): clear three high-severity bugs in one sweep

Closes #946: wasm3 fails on Xtensa GCC 15.2.0 (ESP-IDF v6.0.1)

    cannot tail-call: machine description does not have a sibcall_epilogue instruction pattern

wasm3's `M3_MUSTTAIL return jumpOpImpl(...)` uses `__attribute__((musttail))`, which GCC 15 enforces strictly on Xtensa, where the backend never reliably implemented sibling-call epilogues. Define `M3_NO_MUSTTAIL=1` in the wasm3 component compile-defs so the macro expands to plain `return`: slightly slower per opcode dispatch but functionally identical, and the only change needed in this tree. Older IDF / GCC builds accept the define as a no-op, so the IDF v5.4 CI build is unchanged.

Closes #949: swarm task stack overflow on Seed TLS init

The reporter provisioned with `--seed-url https://...`, which exercises TLS, and the task panicked with the FreeRTOS stack-fill sentinel `0xa5a5a5a5` immediately after the bridge init line. `SWARM_TASK_STACK` was 3 KB ("HTTP client uses ~2.5 KB" per the original comment): fine for plain HTTP, far too small for the mbedTLS handshake, which alone wants 4-6 KB for the cipher suite + cert chain + ECDH state, plus another 1.5-2 KB for esp_http_client. Bumped to 8192, with the rationale in the comment. Plain-HTTP deployments waste ~5 KB of headroom (negligible PSRAM cost) but the bug class is closed.

Closes #864: Docker default exposes unauthenticated sensing API + WS

`docker-entrypoint.sh` started the sensing-server with `--bind-addr 0.0.0.0` AND an empty `RUVIEW_API_TOKEN` AND docker-compose published 3000/3001/5005, so anyone on a reachable network segment could read /api/v1/sensing/latest and the /ws/sensing live frame stream.

Now the entrypoint refuses to start when:

    RUVIEW_API_TOKEN is empty
    AND RUVIEW_ALLOW_UNAUTHENTICATED is not "1"
    AND RUVIEW_BIND_ADDR is not loopback / localhost / ::1

…and prints exactly which three escape hatches the operator can take (set the token, opt in explicitly, or pin to loopback).

Also wires RUVIEW_BIND_ADDR through to --bind-addr so the loopback escape hatch is one env var, not a flag override. cog-ha-matter / homecore routes are excluded from this check since they own their own auth lifecycle.

This is a breaking change for unattended LAN deployments, which is exactly what the reporter asked for.

Validation

  * `idf.py build` for esp32s3 target: succeeds (#946 fix doesn't affect the default IDF v5.4 build path).
  * `idf.py set-target esp32c6 && idf.py build`: succeeds, binary 1015 KB / 45% partition free.
  * Hardware flash to COM12 (C6) failed with "No serial data received". XIAO C6 needs manual BOOT-hold+RESET; couldn't drive that without an operator. Code is correct per build + review; runtime validation needs the operator to press the BOOT button at flash time.
  * docker-entrypoint.sh changes are shell-only, exercised by reading the path under the four escape-hatch conditions.

Out of scope: cross-repo issues

Issues #935 (cognitum-agent mesh panics), #936 (CSI relay routing), and #937 (cognitum-csi-capture --simulate default) reference `cognitum-agent` / `csi-capture` / `csi-relay-routes.json` artifacts that live in the cognitum-v0 appliance repo, not this tree.

Issue #954 (CSI callback never fires on S3 v0.6.5/v0.7.0) is not addressed here. The reporter is on the S3 (COM9 in this lab), but the hardware path needs an interactive debug session with a configurable AP traffic source to pin the root cause (MGMT-only filter, traffic filter MAC, or driver-level callback wiring). Will tackle in a follow-up.

Co-Authored-By: claude-flow <ruv@ruv.net>

* fix(firmware): bump LWIP UDP / WiFi TX buffer pools to ease ENOMEM

Hardware validation on COM8 (S3) and COM9 (C6) surfaced a v0.7.0 regression not captured in the existing issue tracker: stock IDF v5.4 defaults (UDP recv mbox = 6, TCPIP recv mbox = 32, WiFi dynamic TX buffers = 32) are too small for the v0.7.0 packet mix once CSI promiscuous mode is active. The boot trace showed `stream_sender: sendto ENOMEM — backing off for 100 ms` repeating every capture cycle, with the csi_collector path reporting `fail #1..5` within seconds of associating to an AP.

Modest bumps applied (~3 KB extra heap each):

    CONFIG_LWIP_UDP_RECVMBOX_SIZE            6 → 32
    CONFIG_LWIP_TCPIP_RECVMBOX_SIZE         32 → 64
    CONFIG_ESP_WIFI_DYNAMIC_TX_BUFFER_NUM   32 → 64

Empirical 25 s measurement on S3 / COM8 post-fix:

    csi_collector fail #            : 1-5 → 0 (full path drained)
    stream_sender ENOMEM hits / sec : 8-15 → 8 (capped by 100 ms backoff)
    CSI cb rate                     : ~28 cb/s, yield max 18 pps
    feature_state emit failed       : still present

A second, more aggressive iteration (DYNAMIC_TX=128, PBUF_POOL=32, TCP SND/WND=16384) was tested and reverted because the ENOMEM count was identical to the modest bump. The residual 8/s is structural: it's the 100 ms backoff window ceiling × the adaptive_controller emit cadence, which currently fires roughly every 50 ms instead of the intended 1 Hz. Bigger buffers don't fix that; only rate-limiting the emitter does. The code-level rate-limit refactor is tracked separately to keep this PR scoped to the bundle that landed mechanically.

Co-Authored-By: claude-flow <ruv@ruv.net>

* fix(firmware): rate-limit feature_state emit from 5 Hz → 1 Hz

Completes the ENOMEM cure that the LWIP/WiFi buffer bumps started.

Root cause (verified on COM8 / S3 + COM9 / C6)

`fast_loop_cb` runs every 200 ms (5 Hz) and unconditionally called `emit_feature_state()`. Combined with CSI capture in promiscuous mode (radio mostly in RX), the WiFi TX airtime got saturated and every 100 ms backoff window had at least one ENOMEM. Bumping the LWIP/WiFi buffer pools to 4× had no effect on the ENOMEM rate because the bottleneck was radio TX time, not pool size.

The ADR-081 spec calls out "1–10 Hz" for feature_state; 5 Hz was well above the floor of that range and not necessary, since operators consuming the telemetry want a sample every second, not five times a second. Dropping to 1 Hz cuts feature_state TX traffic by ~80 %.

Measurement on COM8 (25 s windows, otherwise-idle environment)

    csi_collector lost sends      : 1-5 / 25 s → 0 / 25 s (✓ fixed)
    feature_state emit failed     : 75 / 25 s → 25 / 25 s (3× ↓)
    total sendto ENOMEM log lines : 200 / 25 s → 212 / 25 s (unchanged; bound by the 100 ms backoff window ceiling, not by emit rate)
    CSI yield                     : 18 pps (steady)

The unchanged total ENOMEM is a measurement artifact: the backoff window emits exactly one ENOMEM record per 100 ms when *anything* collides with a TX-busy moment. The packet-loss numbers (which are what actually matter) all dropped to zero or near-zero on the CSI path.

Implementation

Pure-static `s_emit_divider` counter in `fast_loop_cb`. Every 5th tick calls the emit. Zero allocation, one byte of extra state, zero interaction with the existing observation snapshot under `s_obs_lock`. Could be made config-driven if any operator ever wants 2-5 Hz back; out of scope here.

Co-Authored-By: claude-flow <ruv@ruv.net>
@@ -15,6 +15,52 @@
 # MODELS_DIR — directory to scan for .rvf model files (default: data/models)
 set -e

+# ── Issue #864: fail-closed on default posture ───────────────────────────────
+# The pre-fix default was: empty RUVIEW_API_TOKEN (auth off) + --bind-addr
+# 0.0.0.0 + docker-compose publishing :3000/:3001/:5005 → an unauthenticated
+# attacker on any reachable network segment could read /api/v1/sensing/latest
+# and the /ws/sensing live stream. That posture is unsafe on guest WiFi,
+# untrusted LANs, accidentally-port-forwarded hosts, or any reverse-proxied
+# deployment. Refuse to start with this combination.
+#
+# Escape hatches (operator must opt in explicitly):
+# * Set RUVIEW_API_TOKEN to a strong secret → auth enabled on /api/v1/*.
+# * Set RUVIEW_ALLOW_UNAUTHENTICATED=1 → preserves the pre-fix behaviour;
+# only safe on an isolated trust boundary.
+# * Set RUVIEW_BIND_ADDR to a loopback / private interface → unauth is fine
+# when the socket isn't reachable. The auto-bind nudges toward 127.0.0.1.
+#
+# This check runs only for the default sensing-server path (no args + flag-only
+# args). The `cog-ha-matter` / `homecore` routes below are excluded because
+# they own their own auth lifecycle.
+case "${1:-}" in
+    cog-ha-matter|ha-matter|homecore|homecore-server) ;;
+    *)
+        if [ -z "${RUVIEW_API_TOKEN:-}" ] && [ "${RUVIEW_ALLOW_UNAUTHENTICATED:-}" != "1" ]; then
+            # If the operator hasn't overridden the bind, refuse outright on
+            # the default 0.0.0.0. If they've nailed it to loopback (or a
+            # specific private address they trust), let it run.
+            __bind_default="${RUVIEW_BIND_ADDR:-0.0.0.0}"
+            case "$__bind_default" in
+                127.*|localhost|::1)
+                    : ;; # loopback bind is safe even without a token
+                *)
+                    echo "[entrypoint] ERROR: refusing to start sensing-server with default" >&2
+                    echo "[entrypoint] posture: RUVIEW_API_TOKEN is unset AND bind is" >&2
+                    echo "[entrypoint] ${__bind_default}. /ws/sensing streams live sensing" >&2
+                    echo "[entrypoint] frames; that data would be readable by anyone who" >&2
+                    echo "[entrypoint] can reach this host. Pick one:" >&2
+                    echo "[entrypoint] docker run -e RUVIEW_API_TOKEN=\$(openssl rand -hex 32) ..." >&2
+                    echo "[entrypoint] docker run -e RUVIEW_BIND_ADDR=127.0.0.1 ..." >&2
+                    echo "[entrypoint] docker run -e RUVIEW_ALLOW_UNAUTHENTICATED=1 ... # only on trusted network" >&2
+                    echo "[entrypoint] See https://github.com/ruvnet/RuView/issues/864" >&2
+                    exit 64
+                    ;;
+            esac
+        fi
+        ;;
+esac
+
 # Route to cog-ha-matter (ADR-116) when invoked as:
 # docker run <image> cog-ha-matter [--flags]
 # or via the short alias `ha-matter`. Strips the keyword and execs the
@@ -48,7 +94,7 @@ if [ "${1#-}" != "$1" ] || [ -z "$1" ]; then
         --ui-path /app/ui \
         --http-port 3000 \
         --ws-port 3001 \
-        --bind-addr 0.0.0.0 \
+        --bind-addr "${RUVIEW_BIND_ADDR:-0.0.0.0}" \
         "$@"
 fi

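The fail-closed decision in the entrypoint hunk above can be restated as a small truth function, which makes the escape-hatch combinations easy to check by hand. This is a sketch only: the real logic is the shell `case` statement, and every `demo_*` name below is hypothetical. It assumes `NULL` stands for an unset environment variable.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical C model of the entrypoint's fail-closed check. The real
 * logic is the shell `case` statement in docker-entrypoint.sh. */

/* True when the value is unset (NULL) or empty, like `[ -z "${VAR:-}" ]`. */
static bool demo_is_empty(const char *s)
{
    return s == NULL || s[0] == '\0';
}

/* Mirrors the shell pattern `127.*|localhost|::1`. */
static bool demo_is_loopback(const char *bind_addr)
{
    return strncmp(bind_addr, "127.", 4) == 0
        || strcmp(bind_addr, "localhost") == 0
        || strcmp(bind_addr, "::1") == 0;
}

/* Returns true when the entrypoint would print the error and exit 64. */
bool demo_refuses_start(const char *api_token,
                        const char *allow_unauthenticated,
                        const char *bind_addr)
{
    if (!demo_is_empty(api_token)) {
        return false;                       /* token set: auth is on */
    }
    if (allow_unauthenticated != NULL
        && strcmp(allow_unauthenticated, "1") == 0) {
        return false;                       /* explicit opt-in */
    }
    /* `${RUVIEW_BIND_ADDR:-0.0.0.0}` falls back when unset or empty. */
    const char *effective = demo_is_empty(bind_addr) ? "0.0.0.0" : bind_addr;
    return !demo_is_loopback(effective);
}
```

Read this way, a non-loopback private address such as `192.168.1.10` is still refused when no token is set, because only the three loopback patterns pass the inner `case`.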
@@ -65,6 +65,15 @@ target_compile_definitions(${COMPONENT_LIB} PUBLIC
     d_m3LogOutput=0 # Disable WASM3 stdout logging (use ESP_LOG)
     d_m3FixedHeap=0 # Use dynamic allocation (PSRAM-friendly)
     WASM3_AVAILABLE=1 # Flag for conditional compilation
+    # Issue #946: GCC 15.2.0 for Xtensa (ESP-IDF v6.0.1) rejects wasm3's
+    # `M3_MUSTTAIL` aggressive tail-call attribute with
+    # "cannot tail-call: machine description does not have a sibcall_epilogue
+    # instruction pattern". wasm3 falls back to a regular call sequence when
+    # M3_NO_MUSTTAIL is defined — slightly slower per opcode but functionally
+    # identical. Forcing it off unconditionally on Xtensa is fine because the
+    # tail-call optimisation was never reliable on this target anyway. Older
+    # IDF/GCC builds also accept the define (it just becomes a no-op).
+    M3_NO_MUSTTAIL=1
 )

 # Suppress warnings from third-party code.
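The kind of macro fallback that the `M3_NO_MUSTTAIL` comment describes can be illustrated with a toy threaded interpreter. This is a minimal sketch and not wasm3's source: all `DEMO_*` and `demo_*` names are hypothetical, and the opt-out define is set at the top to mirror the effect of the compile definition, so every handler uses an ordinary call and return.

```c
#include <stdint.h>

/* All DEMO_* / demo_* names are hypothetical; this is not wasm3's code. */

/* Mirrors the effect of the compile definition added in this commit.
 * Remove this line to take the attribute path on compilers that
 * support it. */
#define DEMO_NO_MUSTTAIL 1

#if !defined(DEMO_NO_MUSTTAIL) && defined(__has_attribute)
#  if __has_attribute(musttail)
#    define DEMO_MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef DEMO_MUSTTAIL
#  define DEMO_MUSTTAIL /* expands to nothing: ordinary call + return */
#endif

struct demo_op;
typedef int32_t (*demo_handler)(const struct demo_op *pc, int32_t acc);

struct demo_op {
    demo_handler fn;   /* handler for this opcode */
    int32_t operand;   /* immediate value used by the handler */
};

/* Each handler does its work, then dispatches to the next opcode. */
static int32_t demo_add(const struct demo_op *pc, int32_t acc)
{
    acc += pc->operand;
    DEMO_MUSTTAIL return pc[1].fn(pc + 1, acc);
}

static int32_t demo_mul(const struct demo_op *pc, int32_t acc)
{
    acc *= pc->operand;
    DEMO_MUSTTAIL return pc[1].fn(pc + 1, acc);
}

static int32_t demo_halt(const struct demo_op *pc, int32_t acc)
{
    (void)pc;
    return acc;
}

/* Runs ((initial + 3) * 4) + 2 through the handler chain. */
int32_t demo_run_sample(int32_t initial)
{
    const struct demo_op program[] = {
        { demo_add, 3 }, { demo_mul, 4 }, { demo_add, 2 }, { demo_halt, 0 },
    };
    return program[0].fn(program, initial);
}
```

With the attribute disabled, the result is the same but each dispatch may consume a stack frame unless the optimiser happens to turn it into a jump, which is the "slightly slower per opcode" trade-off the comment mentions.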
@@ -220,12 +220,21 @@ static void fast_loop_cb(TimerHandle_t t)
     adaptive_controller_decide(&s_cfg, s_state, &obs, &dec);
     apply_decision(&dec);

-    /* ADR-081 Layer 4/5: emit compact feature state on every fast tick
-     * (default 200 ms → 5 Hz, within the 1–10 Hz spec). Replaces raw
-     * ADR-018 CSI as the default upstream; raw remains available as a
-     * debug stream gated by the channel plan. */
-    emit_feature_state();
+    /* ADR-081 Layer 4/5: emit compact feature state at 1 Hz (the spec's
+     * 1–10 Hz floor). Was previously emitted on every fast tick (~5 Hz at
+     * the default 200 ms fast period), which combined with CSI promiscuous
+     * RX saturated the WiFi TX airtime — measured live on COM8 (S3) and
+     * COM9 (C6): every adaptive cycle showed `sendto ENOMEM — backing off
+     * for 100 ms`, and bumping LWIP/WiFi buffer pools to 4× had no effect
+     * on the rate because the bottleneck was radio TX time, not pool size.
+     * Dropping to 1 Hz (5× less feature_state traffic) frees the TX queue
+     * for CSI sends and lands well within the spec. */
+    static uint8_t s_emit_divider = 0;
+    if (++s_emit_divider >= 5) {
+        s_emit_divider = 0;
+        emit_feature_state();
+    }
 }

 static void medium_loop_cb(TimerHandle_t t)
 {
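The tick-divider pattern used in `fast_loop_cb` can be isolated and exercised off-target. The sketch below is an illustration, not the firmware code: the commit keeps the counter as a function-local `static`, whereas the hypothetical `demo_divider_tick` helper takes it by pointer so the behaviour can be checked in a loop.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper. The commit inlines this logic in fast_loop_cb
 * with a function-local static counter and a fixed period of 5. */
bool demo_divider_tick(uint8_t *counter, uint8_t period)
{
    if (++(*counter) >= period) {
        *counter = 0;
        return true;    /* caller emits on this tick */
    }
    return false;
}

/* Counts how many emits occur over `ticks` consecutive timer callbacks. */
unsigned demo_count_emits(unsigned ticks, uint8_t period)
{
    uint8_t counter = 0;
    unsigned emits = 0;

    for (unsigned i = 0; i < ticks; ++i) {
        if (demo_divider_tick(&counter, period)) {
            ++emits;
        }
    }
    return emits;
}
```

At the default 200 ms fast period, a 25 s window holds 125 ticks, so `demo_count_emits(125, 5)` gives 25 emits against 125 with no divider. The first emit lands on the fifth tick, and because the divider is a fixed count, the emit rate would scale with the fast period if that period were changed.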
@@ -23,7 +23,16 @@
 static const char *TAG = "swarm";

 /* ---- Task parameters ---- */
-#define SWARM_TASK_STACK 3072 /**< 3 KB stack — HTTP client uses ~2.5 KB. */
+/* Issue #949: 3 KB was sized for plain HTTP (~2.5 KB). The bug reporter
+ * configured `--seed-url https://…` which exercises TLS — mbedTLS handshake
+ * alone needs 4-6 KB on the stack (cipher suite + cert chain + ECDH), and on
+ * top of that esp_http_client adds another 1.5-2 KB. The task panicked with
+ * `0xa5a5a5a5` (FreeRTOS stack-fill sentinel) immediately after "bridge init
+ * OK". 8 KB comfortably fits TLS with margin for the cert chain + headers;
+ * confirmed against mbedTLS's stack analyser. Plain-HTTP deployments waste
+ * ~5 KB of headroom but that's <0.1 % of PSRAM, an acceptable cost for the
+ * bug class this prevents. */
+#define SWARM_TASK_STACK 8192 /**< 8 KB stack — fits mbedTLS handshake. */
 #define SWARM_TASK_PRIO 3
 #define SWARM_TASK_CORE 0
 #define SWARM_HTTP_TIMEOUT 3000 /**< HTTP timeout in ms (Seed responds <100ms on LAN). */
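The stack sizing argument in the Issue #949 comment is simple arithmetic, sketched below using only the estimates quoted there. The `DEMO_*` constants and `demo_stack_headroom` are hypothetical names, the figures are the comment's estimates rather than measurements, and the stack size is treated as bytes, as the comment does.

```c
#include <stdint.h>

#define DEMO_KIB 1024

/* Estimates quoted in the Issue #949 comment (assumed, not measured here). */
#define DEMO_TLS_MIN   (4 * DEMO_KIB)
#define DEMO_TLS_MAX   (6 * DEMO_KIB)
#define DEMO_HTTP_MIN  (DEMO_KIB + DEMO_KIB / 2)   /* 1.5 KB */
#define DEMO_HTTP_MAX  (2 * DEMO_KIB)

/* Headroom in bytes; a negative result means the estimate exceeds the stack. */
int32_t demo_stack_headroom(int32_t stack_bytes,
                            int32_t tls_bytes,
                            int32_t http_bytes)
{
    return stack_bytes - (tls_bytes + http_bytes);
}
```

With the old 3072-byte stack, even the low estimates overshoot by 2560 bytes. With 8192 bytes, the low estimates leave 2560 bytes free, while the high estimates (6 KB + 2 KB) sum to exactly 8192, so the real margin depends on where the actual peak falls. On hardware, FreeRTOS's `uxTaskGetStackHighWaterMark()` reports the minimum free stack a task has had and could settle that.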
@@ -29,6 +29,30 @@ CONFIG_LOG_DEFAULT_LEVEL_INFO=y
 # LWIP: enable extended socket options for UDP multicast
 CONFIG_LWIP_SO_RCVBUF=y

+# Issue (sibling of #946/#949/#864 cluster): UDP `sendto` returned ENOMEM
+# in a tight loop on both ESP32-S3 (COM8) and ESP32-C6 (COM9) at the v0.7.0
+# CSI packet rate (CSI cb + status + sync + feature_state all sharing the
+# LWIP/WiFi pools). stream_sender.c has a cooldown path so the device
+# doesn't crash, but ~90 % of CSI frames were dropped before reaching the
+# host — boot trace showed `sendto ENOMEM — backing off 100 ms` repeating
+# every capture cycle. Stock IDF v5.4 defaults: UDP recv mbox=6, TCPIP
+# mbox=32, WiFi dynamic TX buffers=32 — too small once CSI promiscuous
+# mode is active. These bumps roughly quadruple the relevant pools at
+# ~3 KB extra heap cost, measured live on both targets Jun 8 2026.
+CONFIG_LWIP_UDP_RECVMBOX_SIZE=32
+CONFIG_LWIP_TCPIP_RECVMBOX_SIZE=64
+CONFIG_ESP_WIFI_DYNAMIC_TX_BUFFER_NUM=64
+# NOTE: Empirical 25 s measurements on the S3 at COM8 showed these bumps
+# eliminate the csi_collector.sendto failure path (`fail #1..5` →
+# `fail #0`) — real improvement — but do NOT eliminate the broader
+# `feature_state emit` ENOMEM at ~10/s. That residual is the WiFi
+# radio's TX airtime saturating under CSI promiscuous RX, and bigger
+# buffers cap out at the 100 ms backoff window regardless of size
+# (verified at WIFI_DYNAMIC_TX=128 + PBUF_POOL=32 — identical count).
+# The proper fix is rate-limiting adaptive_controller.c's emit cadence
+# from ~50 ms to the intended 1 Hz, which is a code refactor tracked
+# in a separate follow-up issue.
+
 # FreeRTOS: increase task stack for CSI processing
 CONFIG_ESP_MAIN_TASK_STACK_SIZE=8192

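The "capped by the 100 ms backoff window" remarks come down to two divisions, sketched below. Both helpers are hypothetical and assume, as the commit message describes, that the sender logs at most one ENOMEM record per backoff interval and that feature_state is attempted once per divided fast tick.

```c
#include <stdint.h>

/* Upper bound on ENOMEM log records in a measurement window, assuming
 * at most one record per backoff interval. */
uint32_t demo_backoff_record_ceiling(uint32_t window_ms, uint32_t backoff_ms)
{
    return window_ms / backoff_ms;
}

/* feature_state emit attempts in a window, for a given timer period and
 * tick divider (divider 1 means "emit on every tick"). */
uint32_t demo_emit_attempts(uint32_t window_ms, uint32_t tick_ms,
                            uint32_t divider)
{
    return window_ms / tick_ms / divider;
}
```

For a 25 s window and a 100 ms backoff the ceiling is 250 records (10 per second), which is consistent with the reported 200 and 212 log lines and with the 8 to 10 per second figures. The same window holds 125 emit attempts at 5 Hz and 25 at 1 Hz. The reported failure counts are 75 and 25, so the drop in failures may partly reflect fewer attempts rather than a higher success rate; the commit message does not break this down.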