mirror of
https://github.com/ruvnet/RuView
synced 2026-06-09 10:13:17 +00:00
# ADR-081: Gesture-Controlled Data Visualization

- **Status**: Proposed
- **Date**: 2026-04-07
- **Deciders**: ruv
- **Relates to**: ADR-079 (Camera Ground-Truth Training), ADR-029 (RuvSense Gesture Recognition), ADR-072 (WiFlow Architecture), ADR-076 (CNN Spectrogram Embeddings)

## Context

RuView can now track 17 COCO keypoints at 92.9% PCK@20 (ADR-079) and detect gestures
via DTW template matching (ADR-029). These capabilities exist independently: pose
estimation produces skeleton coordinates, and the UI displays static charts. There is no
system that connects hand/arm movements to interactive data exploration.

Gesture-controlled visualization would let users manipulate charts and graphs by waving
their hands in front of the ESP32 sensing zone, with no mouse, touchscreen, or wearable.
This is particularly valuable for:

- **Lab/cleanroom**: gloved hands can't use touchscreens
- **Kitchen/workshop**: dirty or wet hands
- **Presentations**: stand back and gesture at projected dashboards
- **Accessibility**: motor impairments that make mouse use difficult
- **Digital signage**: public displays without touch hardware

### Why Camera + CSI Fusion

Camera alone can do gesture control (e.g., Leap Motion, MediaPipe Hands). CSI alone can
detect coarse gestures (ADR-029). The fusion provides:

| Modality | Strengths | Weaknesses |
|----------|-----------|-----------|
| Camera (MediaPipe Hands) | 21 hand landmarks, finger-level precision, 30fps | Requires line of sight, lighting dependent, privacy concern |
| CSI (ESP32) | Through-wall, works in dark, privacy-preserving, $9 | Coarse spatial resolution, no finger tracking |
| **Fusion** | **Finger precision near camera + coarse tracking everywhere** | Requires both sensors during training |

The fusion model trains on camera + CSI pairs (like ADR-079), then deploys in two modes:

1. **Camera-assisted**: full precision when camera is available
2. **CSI-only**: reduced but functional gesture control without camera

## Decision

Build a gesture-to-visualization control system that maps hand/arm movements to chart
interactions using fused camera + CSI input.

### Gesture Vocabulary

#### Navigation Gestures (arm-level, CSI-detectable)

| Gesture | Motion | Chart Action | CSI Feasibility |
|---------|--------|-------------|-----------------|
| **Swipe left** | Open hand sweeps left | Pan chart left / previous dataset | High (clear directional motion) |
| **Swipe right** | Open hand sweeps right | Pan chart right / next dataset | High |
| **Swipe up** | Open hand sweeps up | Scroll up / zoom out | High |
| **Swipe down** | Open hand sweeps down | Scroll down / zoom in | High |
| **Push forward** | Palm pushes toward screen | Select / drill into data point | Medium (depth motion is harder) |
| **Pull back** | Hand pulls away from screen | Back / zoom out | Medium |
| **Circular CW** | Hand circles clockwise | Increase value / rotate view | Medium (temporal pattern) |
| **Circular CCW** | Hand circles counter-clockwise | Decrease value / rotate back | Medium |
| **Hold still** | Hand stationary 2+ seconds | Hover / show tooltip | High (absence of motion) |
| **Both hands apart** | Arms spread outward | Expand / zoom into selection | High (bilateral motion) |
| **Both hands together** | Arms move inward | Collapse / zoom out | High |

#### Precision Gestures (finger-level, camera-required)

| Gesture | Motion | Chart Action | Sensor |
|---------|--------|-------------|--------|
| **Pinch zoom** | Thumb + index spread/close | Continuous zoom | Camera only |
| **Point** | Index finger extended | Cursor position on chart | Camera only |
| **Grab** | Close fist | Grab and drag data point | Camera only |
| **Thumb up** | Thumbs up | Confirm / approve | Camera only |
| **Thumb down** | Thumbs down | Reject / undo | Camera only |
| **Two-finger rotate** | Two fingers twist | Rotate 3D visualization | Camera only |
| **Finger slider** | Index finger moves along axis | Adjust parameter value | Camera only |

### Architecture

```
┌──────────────────────────────────────────────────────────────────┐
│ Input Layer │
│ │
│ ESP32 CSI (UDP 5005) ──→ CSI Gesture Detector (DTW + WiFlow) │
│ ↓ │
│ Webcam (MediaPipe Hands) ──→ Hand Landmark Tracker (21 joints) │
│ ↓ │
│ Gesture Fusion Engine │
│ ├── CSI coarse: swipe/circle/hold │
│ ├── Camera fine: pinch/point/grab │
│ └── Confidence weighting by modality │
└──────────────────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────────┐
│ Gesture Interpreter │
│ │
│ Raw gestures ──→ State Machine ──→ Chart Commands │
│ │
│ States: │
│ IDLE ──(motion detected)──→ TRACKING │
│ TRACKING ──(gesture matched)──→ ACTING │
│ ACTING ──(gesture complete)──→ COOLDOWN │
│ COOLDOWN ──(500ms)──→ IDLE │
│ │
│ Debounce: 200ms minimum gesture duration │
│ Cooldown: 500ms between consecutive gestures │
│ Confidence threshold: 0.7 for CSI, 0.9 for camera │
└──────────────────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────────┐
│ Visualization Controller │
│ │
│ Chart Commands ──→ WebSocket ──→ UI │
│ │
│ Commands: │
│ { type: "pan", dx: -0.1, dy: 0 } │
│ { type: "zoom", factor: 1.2, center: [0.5, 0.5] } │
│ { type: "select", x: 0.45, y: 0.62 } │
│ { type: "rotate", angle: 15 } │
│ { type: "slider", axis: "x", value: 0.73 } │
│ { type: "hover", x: 0.45, y: 0.62 } │
│ { type: "back" } │
│ { type: "confirm" } │
│ { type: "reject" } │
└──────────────────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────────┐
│ Visualization UI │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Line Chart │ │ Bar Chart │ │ 3D Scatter │ │
│ │ (time │ │ (category │ │ (spatial │ │
│ │ series) │ │ compare) │ │ data) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Heatmap │ │ Gauge │ │ Spectrogram │ │
│ │ (CSI grid) │ │ (vitals) │ │ (frequency) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ Visual feedback: gesture cursor overlay + action indicator │
│ Framework: D3.js / Observable Plot in existing UI │
└──────────────────────────────────────────────────────────────────┘
```

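The Gesture Interpreter's state machine can be sketched in a few lines. This is a minimal illustration, not the planned implementation: the timing and threshold values are the ones listed in the diagram, while the class name, the `update` method, and the shape of the `sample` object are assumptions made for the example. It also treats "motion detected" as a boolean flag, which means a gesture such as "Hold still" would need separate handling.

```javascript
// Values taken from the Gesture Interpreter box; all names here are illustrative.
const INTERPRETER_CONFIG = {
  debounceMs: 200,                           // minimum gesture duration
  cooldownMs: 500,                           // pause between consecutive gestures
  minConfidence: { csi: 0.7, camera: 0.9 },  // per-modality confidence gate
};

class GestureInterpreter {
  constructor(config = INTERPRETER_CONFIG) {
    this.config = config;
    this.state = 'IDLE';
    this.candidate = null;    // gesture currently being debounced or acted on
    this.candidateSince = 0;  // timestamp (ms) when the candidate first appeared
    this.cooldownUntil = 0;
  }

  reset() {
    this.state = 'IDLE';
    this.candidate = null;
  }

  // Feed one detector sample: { t, motion, gesture, confidence, source }.
  // Returns the gesture name at the moment an action fires, otherwise null.
  update({ t, motion, gesture, confidence, source }) {
    switch (this.state) {
      case 'IDLE':
        if (motion) this.state = 'TRACKING';
        return null;

      case 'TRACKING': {
        if (!motion) { this.reset(); return null; }
        const passes = gesture !== null
          && confidence >= this.config.minConfidence[source];
        if (!passes) { this.candidate = null; return null; }
        if (this.candidate !== gesture) {
          // New candidate: start the debounce timer.
          this.candidate = gesture;
          this.candidateSince = t;
          return null;
        }
        if (t - this.candidateSince >= this.config.debounceMs) {
          this.state = 'ACTING';
          return gesture;
        }
        return null;
      }

      case 'ACTING':
        // The gesture is complete once the detector stops reporting it.
        if (gesture !== this.candidate) {
          this.state = 'COOLDOWN';
          this.cooldownUntil = t + this.config.cooldownMs;
        }
        return null;

      case 'COOLDOWN':
        if (t >= this.cooldownUntil) this.reset();
        return null;

      default:
        return null;
    }
  }
}
```

Keeping the interpreter free of timers (it only compares the timestamps it is given) makes it easy to replay recorded sessions through it when tuning the debounce and cooldown values.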
### Gesture Detection Pipeline

#### CSI Gesture Detection (arm-level)

Extends the existing DTW gesture classifier (ADR-029) with WiFlow pose input:

```
CSI [35, 20] ──→ WiFlow lite ──→ 17 keypoints ──→ Extract arm features:
  - Wrist velocity (dx/dt, dy/dt)
  - Elbow angle (shoulder-elbow-wrist)
  - Bilateral symmetry (left vs right)
  - Motion energy (frame differencing)
        ↓
DTW template matching:
  - 11 gesture templates
  - Sliding window (1s)
  - Top match + confidence
```

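To make the CSI path concrete, the sketch below computes two of the arm features (elbow angle and wrist velocity) and runs textbook dynamic time warping against a set of templates. It is an illustration under stated assumptions rather than the ADR-029 classifier: keypoints are taken to be `[x, y]` pairs, the function names are invented for this example, and the mapping from DTW distance to a confidence score is an arbitrary choice.

```javascript
// Interior elbow angle in radians from shoulder, elbow, and wrist points [x, y].
function elbowAngle(shoulder, elbow, wrist) {
  const ux = shoulder[0] - elbow[0], uy = shoulder[1] - elbow[1];
  const vx = wrist[0] - elbow[0], vy = wrist[1] - elbow[1];
  const denom = Math.hypot(ux, uy) * Math.hypot(vx, vy);
  if (denom === 0) return 0; // degenerate pose: joints coincide
  const cos = Math.max(-1, Math.min(1, (ux * vx + uy * vy) / denom));
  return Math.acos(cos);
}

// Wrist velocity (dx/dt, dy/dt) by finite differences over a track of [x, y] points.
function wristVelocity(track, fps) {
  const v = [];
  for (let i = 1; i < track.length; i++) {
    v.push([(track[i][0] - track[i - 1][0]) * fps,
            (track[i][1] - track[i - 1][1]) * fps]);
  }
  return v;
}

// Classic O(n*m) DTW distance between two sequences of equal-width feature vectors.
function dtwDistance(a, b) {
  const dist = (p, q) => Math.hypot(...p.map((x, i) => x - q[i]));
  let prev = new Array(b.length + 1).fill(Infinity);
  prev[0] = 0; // empty prefix aligned with empty prefix costs nothing
  for (let i = 1; i <= a.length; i++) {
    const cur = new Array(b.length + 1).fill(Infinity);
    for (let j = 1; j <= b.length; j++) {
      const step = Math.min(prev[j], cur[j - 1], prev[j - 1]);
      cur[j] = dist(a[i - 1], b[j - 1]) + step;
    }
    prev = cur;
  }
  return prev[b.length];
}

// Return the closest template. templates: { gestureName: [[f0, f1, ...], ...] }
function matchTemplate(frames, templates) {
  let best = { gesture: null, distance: Infinity };
  for (const [gesture, template] of Object.entries(templates)) {
    const distance = dtwDistance(frames, template);
    if (distance < best.distance) best = { gesture, distance };
  }
  // Assumption: squash the length-normalized distance into a (0, 1] score.
  const confidence = 1 / (1 + best.distance / frames.length);
  return { ...best, confidence };
}
```

Because DTW aligns sequences of different lengths, a slow swipe recorded over more frames can still match a shorter template at zero cost, which is the property the sliding 1-second window relies on.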
#### Camera Gesture Detection (finger-level)

Uses MediaPipe Hands (21 landmarks per hand, 30fps):

```
Webcam ──→ MediaPipe Hands ──→ 21 landmarks × 2 hands ──→ Extract:
  - Finger states (extended/curled)
  - Pinch distance (thumb-index)
  - Grab state (all fingers curled)
  - Point direction (index ray)
  - Hand center velocity
        ↓
Rule-based classifier:
  - Pinch: thumb-index < 0.05
  - Point: only index extended
  - Grab: all fingers curled
  - Thumbs up/down: thumb angle
```

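A rule-based classifier over the 21 landmarks could look like the following sketch. The landmark indices follow the MediaPipe Hands numbering (0 = wrist, 4 = thumb tip, 8 = index tip, and so on). Everything else is an assumption for illustration: landmarks are taken to be `{ x, y }` objects in normalized image coordinates, a finger counts as extended when its tip is farther from the wrist than its PIP joint, the thumb is ignored for the grab rule, and the thumbs up/down rule is left out.

```javascript
// MediaPipe Hands landmark indices.
const WRIST = 0;
const THUMB_TIP = 4;
const FINGERS = {            // [PIP joint, fingertip]
  index: [6, 8],
  middle: [10, 12],
  ring: [14, 16],
  pinky: [18, 20],
};

const distance2d = (a, b) => Math.hypot(a.x - b.x, a.y - b.y);

// Heuristic: a finger is extended when its tip is farther from the wrist than its PIP joint.
function fingerStates(landmarks) {
  const states = {};
  for (const [name, [pip, tip]] of Object.entries(FINGERS)) {
    states[name] =
      distance2d(landmarks[tip], landmarks[WRIST]) >
      distance2d(landmarks[pip], landmarks[WRIST]);
  }
  return states;
}

// Returns 'grab', 'pinch_zoom', 'point', or null when no rule matches.
function classifyHand(landmarks, pinchThreshold = 0.05) {
  const states = fingerStates(landmarks);
  const extendedCount = Object.values(states).filter(Boolean).length;

  // Grab is checked first so that a closed fist is not mistaken for a pinch.
  if (extendedCount === 0) return 'grab';

  const pinch = distance2d(landmarks[THUMB_TIP], landmarks[FINGERS.index[1]]);
  if (pinch < pinchThreshold) return 'pinch_zoom';

  if (states.index && extendedCount === 1) return 'point';
  return null;
}
```

The rule order matters: with the pinch test first, a fist whose thumb rests on the curled index finger would be reported as a pinch.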
#### Fusion Strategy

```
CSI confidence ──┐
                 ├──→ Weighted fusion ──→ Final gesture + confidence
Camera conf    ──┘

Rules:
- If both agree: confidence = min(1.0, max(csi_conf, cam_conf) + 0.1 * min(csi_conf, cam_conf))
- If only CSI: use CSI gesture, confidence *= 0.8
- If only camera: use camera gesture, confidence *= 0.95
- If conflict: prefer camera for fine gestures, CSI for coarse gestures
- Minimum confidence for action: 0.6
```

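The fusion rules translate almost directly into code. The sketch below is one possible reading, with assumptions called out: the function name and the `{ gesture, confidence }` input shape are invented here, the agreement bonus is clamped to 1 so the result stays in the 0-1 range used by the event protocol, and a conflict is resolved by taking the camera's gesture when it is a finger-level gesture and the CSI gesture otherwise.

```javascript
// Finger-level gestures, for which the camera is preferred on conflict.
const FINE_GESTURES = new Set([
  'pinch_zoom', 'point', 'grab', 'thumb_up', 'thumb_down',
]);

// csi / cam: { gesture, confidence } or null when that modality saw nothing.
// Returns { gesture, confidence, source } or null if below the action threshold.
function fuseDetections(csi, cam, minConfidence = 0.6) {
  let result = null;

  if (csi && cam) {
    if (csi.gesture === cam.gesture) {
      // Agreement: the stronger score plus a small bonus from the weaker one, clamped to 1.
      const boosted = Math.max(csi.confidence, cam.confidence)
        + 0.1 * Math.min(csi.confidence, cam.confidence);
      result = { gesture: csi.gesture, confidence: Math.min(1, boosted), source: 'fusion' };
    } else if (FINE_GESTURES.has(cam.gesture)) {
      // Conflict, camera reports a fine gesture: trust the camera.
      result = { gesture: cam.gesture, confidence: cam.confidence, source: 'camera' };
    } else {
      // Conflict over a coarse gesture: trust CSI.
      result = { gesture: csi.gesture, confidence: csi.confidence, source: 'csi' };
    }
  } else if (csi) {
    result = { gesture: csi.gesture, confidence: csi.confidence * 0.8, source: 'csi' };
  } else if (cam) {
    result = { gesture: cam.gesture, confidence: cam.confidence * 0.95, source: 'camera' };
  }

  return result && result.confidence >= minConfidence ? result : null;
}
```

Note the effect of the single-modality penalties: a CSI-only detection needs a raw confidence of at least 0.75 to clear the 0.6 action threshold after the 0.8 multiplier.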
### Chart Interaction Mapping

#### Line Chart (Time Series)

| Gesture | Action | Parameters |
|---------|--------|-----------|
| Swipe left/right | Pan time axis | dx proportional to swipe speed |
| Pinch zoom | Zoom time axis | Continuous, centered on hand position |
| Both hands apart/together | Zoom (CSI-only alternative) | Binary zoom in/out |
| Point | Show tooltip at nearest data point | x from index finger position |
| Hold still | Sticky tooltip | Duration-based activation |
| Swipe up/down | Switch dataset / Y-axis scale | Discrete steps |

#### Bar Chart (Category Comparison)

| Gesture | Action | Parameters |
|---------|--------|-----------|
| Swipe left/right | Navigate categories | One category per swipe |
| Point | Highlight bar | Nearest bar to finger X position |
| Push forward | Select bar for drill-down | Depth gesture |
| Grab + drag | Reorder bars | Camera-only |
| Circular | Sort ascending/descending | Direction determines order |

#### 3D Scatter Plot

| Gesture | Action | Parameters |
|---------|--------|-----------|
| Swipe left/right | Rotate Y axis | Angle proportional to speed |
| Swipe up/down | Rotate X axis | Angle proportional to speed |
| Two-finger rotate | Rotate Z axis | Camera-only |
| Pinch zoom | Zoom | Camera-only |
| Both hands apart | Zoom in (CSI alternative) | Binary |
| Point | Highlight nearest point | Ray-cast from finger direction |

#### Heatmap (CSI Grid)

| Gesture | Action | Parameters |
|---------|--------|-----------|
| Swipe | Pan view | dx, dy |
| Pinch | Zoom region | Center + scale |
| Hold | Show cell value | Position-based |
| Circular | Adjust color scale range | CW = expand, CCW = contract |

#### Gauge (Vital Signs)

| Gesture | Action | Parameters |
|---------|--------|-----------|
| Swipe left/right | Switch vital (HR → BR → SpO2) | Discrete |
| Circular CW | Set high alert threshold | Continuous |
| Circular CCW | Set low alert threshold | Continuous |
| Thumb up | Acknowledge alert | Binary |

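As an illustration of how one of these tables becomes code, the sketch below maps gesture events to the chart commands listed under the Visualization Controller, using the line-chart rules. The function name, the `panGain` and `zoomStep` options, and the decision to treat `param` as a ready-made zoom factor for pinch gestures are all assumptions for the example, not part of the specification.

```javascript
// Translate a gesture event into a line-chart command, or null if unmapped.
// ev: { gesture, position?: [x, y], velocity?: [vx, vy], param?: number }
function toLineChartCommand(ev, { panGain = 0.1, zoomStep = 1.2 } = {}) {
  const [x, y] = ev.position ?? [0.5, 0.5];

  switch (ev.gesture) {
    case 'swipe_left':
    case 'swipe_right': {
      // dx proportional to swipe speed; fall back to a unit step if no velocity is given.
      const fallback = ev.gesture === 'swipe_left' ? -1 : 1;
      const vx = ev.velocity ? ev.velocity[0] : fallback;
      return { type: 'pan', dx: vx * panGain, dy: 0 };
    }
    case 'pinch_zoom':
      // Assumption: param already carries the zoom factor derived from pinch distance.
      return { type: 'zoom', factor: ev.param ?? 1, center: [x, y] };
    case 'spread':
      // CSI-only alternative: binary zoom in.
      return { type: 'zoom', factor: zoomStep, center: [0.5, 0.5] };
    case 'contract':
      // CSI-only alternative: binary zoom out.
      return { type: 'zoom', factor: 1 / zoomStep, center: [0.5, 0.5] };
    case 'point':
    case 'hold':
      // Tooltip at the hand position; the UI resolves the nearest data point.
      return { type: 'hover', x, y };
    default:
      return null;
  }
}
```

A separate mapper per chart type keeps the tables above independent: the same `swipe_left` event can pan a time axis in one chart and step through categories in another.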
### Visual Feedback: AR Camera Overlay

The primary view is the **live camera feed with AR overlays**: the person is visible
with charts, skeleton, and data rendered on top. This creates a "Minority Report" style
interface where you see yourself manipulating data in real-time.

```
┌──────────────────────────────────────────────────────────────┐
│ │
│ ╔══════════════════════════════════════════════════════════╗ │
│ ║ ║ │
│ ║ [Live Camera Feed, person visible] ║ │
│ ║ ║ │
│ ║ ╭─────╮ ║ │
│ ║ │ │ ← skeleton overlay (17 keypoints) ║ │
│ ║ ╰──┬──╯ ║ │
│ ║ ╱ ╲ ║ │
│ ║ ╱ ╲ ┌──────────────────────┐ ║ │
│ ║ │ │ │ CSI Amplitude Chart │ ║ │
│ ║ │ 🖐→ │ │ ┌─╮ ╭─╮ ╭──╮ │ ║ │
│ ║ │ │ │ │ ╰─╯ ╰───╯ │ │ ║ │
│ ║ ╲ ╱ │ │ │ │ ║ │
│ ║ ╲ ╱ └──────────────────────┘ ║ │
│ ║ │ │ ↑ chart follows hand position ║ │
│ ║ ╱ ╲ ║ │
│ ║ ╱ ╲ ║ │
│ ║ ║ │
│ ╚══════════════════════════════════════════════════════════╝ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ LOWER THIRD │ │
│ │ ┌────┐ │ │
│ │ │ pi │ RuView Sensing HR: 72 BPM BR: 16 BPM │ │
│ │ │ │ v0.7.0 Presence: 1 Motion: 0.23 │ │
│ │ └────┘ │ │
│ │ [logo] [gesture: Swipe Right] [CSI ●] [CAM ●] [28fps]│ │
│ └──────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
```

#### AR Overlay Layers (bottom to top)

| Layer | Content | Opacity | Update Rate |
|-------|---------|---------|-------------|
| 0 | Live camera feed (full frame) | 100% | 30fps |
| 1 | Skeleton overlay (17 keypoints + bones) | 70% | 30fps |
| 2 | Gesture cursor (hand position + state) | 90% | 30fps |
| 3 | Floating chart (anchored to hand/body region) | 85% | 30fps |
| 4 | Data labels + tooltips | 95% | On gesture |
| 5 | Lower third (RuView branding + vitals + status) | 95% | 1fps |

#### Floating Chart Placement

Charts are **anchored to the person's body** and follow movement:

```
Placement rules:
- Default: chart floats to the right of the person's dominant hand
- If hand moves left: chart slides to left side
- Chart stays within frame bounds (never clips off-screen)
- Multiple charts: stack vertically with 10% gap
- Inactive charts: shrink to thumbnail and anchor near shoulder

Chart anchor point = wrist_position + offset(0.15, -0.1) // right and slightly above hand
Chart size: 30% of frame width × 20% of frame height
```

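The anchor formula and the frame-bounds rule can be combined into one small function. In this sketch all coordinates are normalized to the frame (0 to 1, with y increasing downward), and the anchor is assumed to be the chart's top-left corner; that interpretation, the function name, and the option names are assumptions. The left/right flip, stacking, and thumbnail rules are not covered.

```javascript
// Compute the chart rectangle for a wrist position [x, y] in normalized frame coordinates.
function placeChart(wrist, { offset = [0.15, -0.1], size = [0.3, 0.2] } = {}) {
  const clamp = (v, lo, hi) => Math.min(hi, Math.max(lo, v));
  const [width, height] = size;

  // Anchor = wrist + offset: to the right of and slightly above the hand.
  const anchorX = wrist[0] + offset[0];
  const anchorY = wrist[1] + offset[1];

  // Keep the whole rectangle inside the frame so the chart never clips off-screen.
  return {
    x: clamp(anchorX, 0, 1 - width),
    y: clamp(anchorY, 0, 1 - height),
    width,
    height,
  };
}
```

For example, a wrist near the right edge at `[0.9, 0.05]` would put the anchor off-screen, so the rectangle is pulled back to `x = 0.7, y = 0`.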
#### Lower Third Design

The lower third bar provides persistent status in broadcast-style framing:

```
┌──────────────────────────────────────────────────────────────┐
│ ┌──────┐ │
│ │ pi │ RuView Sensing v0.7.0 │
│ │ │ ────────────────────────────────────────────── │
│ │ logo │ HR: 72 BPM | BR: 16 BPM | Persons: 1 │
│ └──────┘ Motion: Low | Gesture: Swipe Right | 28fps │
│ [CSI ●] [CAM ●] [FUSE] PCK@20: 92.9% │
└──────────────────────────────────────────────────────────────┘

Design:
- Background: semi-transparent dark (#1a1a2e, 80% opacity)
- Logo: RuView "pi" icon (32x32px), left-aligned
- Text: white (#ffffff) primary, gray (#a0a0a0) secondary
- Accent: teal (#00d4aa) for active indicators
- Height: 15% of frame
- Font: system monospace for data, sans-serif for labels
- Divider: thin teal line separating logo from data
```

#### RuView Logo Placement

```
The "pi" logo appears in two contexts:

1. Lower third (persistent):
   - Position: bottom-left corner, 12px padding
   - Size: 32x32px
   - Style: white outline on dark background
   - Always visible during gesture mode

2. Watermark (optional):
   - Position: top-right corner, 8px padding
   - Size: 24x24px, 30% opacity
   - Style: subtle, doesn't interfere with data
```

#### Skeleton Rendering Style

```
Keypoint rendering:
- Detected joints: teal circles (#00d4aa), radius 6px
- Low-confidence joints: gray circles (#666), radius 4px
- Active hand (gesturing): yellow highlight (#ffcc00), radius 8px, glow effect

Bone rendering:
- Normal bones: teal lines (#00d4aa), 2px stroke
- Active arm (gesturing): yellow lines (#ffcc00), 3px stroke, glow
- Torso: slightly thicker (3px) to anchor the skeleton visually

Style: dark-theme friendly, high contrast against camera feed
```

**Cursor types:**
- **Open hand**: teal ring around wrist, rays extending from fingers
- **Pointing**: teal ray from index finger toward chart
- **Grabbing**: yellow fist icon, chart border highlights
- **Pinching**: two teal dots (thumb + index) with distance line
- **Ghost cursor**: CSI-only mode; larger, more diffuse circle (no finger detail)

### Data Flow Protocol

WebSocket messages from gesture engine to UI:

```typescript
interface GestureEvent {
  type: 'gesture';
  // Covers the full vocabulary: 11 navigation gestures + 7 precision gestures
  gesture: 'swipe_left' | 'swipe_right' | 'swipe_up' | 'swipe_down'
    | 'pinch_zoom' | 'point' | 'grab' | 'hold' | 'circle_cw'
    | 'circle_ccw' | 'push' | 'pull' | 'spread' | 'contract'
    | 'thumb_up' | 'thumb_down' | 'two_finger_rotate' | 'finger_slider';
  confidence: number; // 0-1
  source: 'csi' | 'camera' | 'fusion';
  position?: [number, number]; // Normalized [0,1] hand position
  velocity?: [number, number]; // Hand velocity for proportional control
  param?: number; // Gesture-specific parameter (pinch distance, rotation angle)
}

interface CursorEvent {
  type: 'cursor';
  x: number; // 0-1 normalized
  y: number; // 0-1 normalized
  state: 'tracking' | 'pointing' | 'grabbing' | 'pinching' | 'idle';
  hands: number; // 0, 1, or 2
}

interface StatusEvent {
  type: 'status';
  csi_active: boolean;
  camera_active: boolean;
  mode: 'fusion' | 'csi_only' | 'camera_only';
  fps: number;
  gesture_count: number; // Total gestures detected this session
}
```

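On the UI side, incoming messages should be validated before they drive a chart. The sketch below parses a raw WebSocket payload, checks the fields the three interfaces share with the rest of the design (the `type` discriminator, 0-1 ranges, and the enumerated `source` and `mode` values), and dispatches to a handler per event type. The function names and the handler-map shape are assumptions; gesture names are deliberately not checked against a fixed list so the sketch does not hard-code the vocabulary.

```javascript
const SOURCES = new Set(['csi', 'camera', 'fusion']);
const MODES = new Set(['fusion', 'csi_only', 'camera_only']);

// True for numbers in the normalized range [0, 1].
const isUnit = (v) => typeof v === 'number' && v >= 0 && v <= 1;

// Parse and validate one raw message; returns the event object or null if malformed.
function parseEvent(raw) {
  let msg;
  try {
    msg = JSON.parse(raw);
  } catch {
    return null; // not JSON
  }
  if (msg === null || typeof msg !== 'object') return null;

  switch (msg.type) {
    case 'gesture':
      return typeof msg.gesture === 'string'
        && isUnit(msg.confidence)
        && SOURCES.has(msg.source) ? msg : null;
    case 'cursor':
      return isUnit(msg.x) && isUnit(msg.y)
        && [0, 1, 2].includes(msg.hands) ? msg : null;
    case 'status':
      return typeof msg.csi_active === 'boolean'
        && typeof msg.camera_active === 'boolean'
        && MODES.has(msg.mode) ? msg : null;
    default:
      return null; // unknown event type
  }
}

// Build a dispatcher from a map such as { gesture: fn, cursor: fn, status: fn }.
// The returned function reports whether the message was valid.
function createDispatcher(handlers) {
  return (raw) => {
    const ev = parseEvent(raw);
    if (ev && typeof handlers[ev.type] === 'function') handlers[ev.type](ev);
    return ev !== null;
  };
}
```

In a browser client this would typically be wired up as `socket.onmessage = (m) => dispatch(m.data)`, where `dispatch` is the function returned by `createDispatcher`.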
### Training the CSI Gesture Model

Extends ADR-079's camera ground-truth pipeline:

```bash
# 1. Collect gesture training data (camera + CSI, 10 min)
#    Perform each gesture 20+ times with natural variation
python scripts/collect-gesture-gt.py --duration 600 --gestures all --preview

# 2. Label gesture segments (auto-detected from camera)
node scripts/label-gestures.js \
  --gt data/ground-truth/gestures-*.jsonl \
  --csi data/recordings/csi-*.jsonl

# 3. Train gesture classifier
node scripts/train-gesture-model.js \
  --data data/gestures/labeled-*.jsonl \
  --scale lite

# 4. Deploy
#    CSI-only mode: gestures detected from WiFlow keypoint motion
#    Fusion mode: camera adds finger-level precision
```

**Training data:** ~20 examples per gesture × 11 CSI-detectable gestures = 220 labeled samples.
With augmentation (time warp, amplitude noise): ~1,000 effective samples.

### Optimization: ruvector-cnn Spectrogram Gesture Classification

Replace DTW template matching with a CNN operating on CSI spectrograms via the
`ruvector-cnn` WASM package (ADR-076). This treats each gesture as an image
classification problem on the CSI time-frequency representation.

#### Why CNN Over DTW

| Aspect | DTW (current, ADR-029) | CNN Spectrogram (proposed) |
|---|---|---|
| Input | 1D keypoint trajectories | 2D CSI spectrogram image |
| Features | Hand-crafted (wrist velocity, elbow angle) | Learned end-to-end |
| Robustness | Sensitive to speed variation | Tolerant of moderate time warping (pooling layers) |
| Multi-scale | Single scale | Hierarchical (dilated convolutions) |
| Training | Template recording + DTW distance | Supervised from camera labels |
| New gestures | Record new template | Retrain (or few-shot with embedding) |
| Accuracy | ~85% (DTW literature) | ~95%+ (CNN on spectrograms, literature) |

#### Pipeline

```
CSI [N_subcarriers, T=30] (1-second window)
        ↓
Spectrogram transform: STFT per subcarrier
  → [N_sub, F_bins, T_bins] ≈ [35, 16, 15]
        ↓
Reshape to grayscale image: [35×16, 15] = [560, 15]
  → Resize to [64, 64] (bilinear)
        ↓
ruvector-cnn CnnEmbedder (WASM-accelerated)
  → 128-dim gesture embedding
        ↓
Classifier head: Linear(128 → 18 gestures) + softmax
  → gesture_id + confidence
```

#### ruvector-cnn Integration

The `@ruvector/cnn` WASM package provides:

```javascript
const { init, CnnEmbedder, InfoNCELoss } = require('@ruvector/cnn');

// csiToSpectrogram, classifyGesture, and gestureTemplates are project helpers defined elsewhere.
async function detectGesture(csiWindow) {
  // CommonJS modules have no top-level await, so initialize inside an async function
  await init();

  // Create embedder for 64x64 CSI spectrogram "images"
  const embedder = new CnnEmbedder({
    inputSize: 64,
    embeddingDim: 128,
    normalize: true,
  });

  // Extract embedding from CSI spectrogram
  const spectrogram = csiToSpectrogram(csiWindow); // [64, 64] Uint8Array
  const embedding = embedder.extract(spectrogram, 64, 64);

  // Classify gesture via nearest-neighbor to trained templates
  return classifyGesture(embedding, gestureTemplates);
}
```

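The integration snippet calls a `classifyGesture` helper without showing it. A minimal nearest-neighbor version might look like the sketch below. It is hypothetical: the template shape (`{ gesture, embedding }`), the use of cosine similarity, and the returned fields are assumptions, and it does not depend on `@ruvector/cnn` at all since it only compares plain numeric vectors.

```javascript
// Cosine similarity between two equal-length numeric vectors (arrays or typed arrays).
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Nearest-neighbor lookup over stored gesture templates.
// templates: [{ gesture: 'swipe_left', embedding: [...] }, ...]
// Returns { gesture, similarity }; gesture is null when templates is empty.
function classifyGesture(embedding, templates) {
  let best = { gesture: null, similarity: -1 };
  for (const template of templates) {
    const similarity = cosineSimilarity(embedding, template.embedding);
    if (similarity > best.similarity) {
      best = { gesture: template.gesture, similarity };
    }
  }
  return best;
}
```

With `normalize: true` the embeddings are presumably unit length already, in which case cosine similarity reduces to a dot product; the general form is kept here so the sketch also works on unnormalized vectors.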
#### Training with Contrastive + Classification

Two-phase training using ruvector-cnn's built-in losses:

**Phase 1: Contrastive embedding (unsupervised)**
```javascript
const loss = new InfoNCELoss(0.07);
// Same gesture performed at different speeds → positive pairs
// Different gestures → negative pairs
// Train CnnEmbedder to cluster same-gesture spectrograms
```

**Phase 2: Gesture classification (supervised)**
```javascript
// Linear classifier on frozen embeddings
// 18 gestures × 20 examples each = 360 labeled samples
// Camera auto-labels: MediaPipe Hands detects gesture type
```

#### Dual-Path Architecture

Run both CNN and DTW in parallel for maximum robustness:

```
CSI input ──┬──→ WiFlow → keypoints → DTW templates → gesture_A (conf_A)
            │
            └──→ Spectrogram → ruvector-cnn → embedding → classifier → gesture_B (conf_B)

Fusion: if gesture_A == gesture_B → conf = min(1.0, max(conf_A, conf_B) + 0.15)
        if conflict → pick higher confidence
        if only one detects → use it at 0.8× confidence
```

This dual-path approach provides:
- **DTW** catches gestures the CNN might miss (novel variations)
- **CNN** provides higher accuracy for trained gesture types
- **Fusion** reduces false positives (both must agree for high-confidence)

### Optimization: Temporal Gesture Encoding

Alternative lightweight path for when ruvector-cnn WASM overhead matters
(e.g., ESP32 edge deployment):

```
Keypoint sequence [T=30 frames, 1 second]:
  wrist_x[0..29], wrist_y[0..29],
  elbow_angle[0..29],
  hand_velocity[0..29]
        ↓
1D CNN (k=5, d=[1,2,4]) → 64-dim gesture embedding
        ↓
Nearest-neighbor to gesture templates (cosine distance)
        ↓
Top gesture + confidence
```

This is lighter than DTW for real-time use and can be trained end-to-end with
the WiFlow backbone (shared TCN features).

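The choice of kernel size 5 with dilations 1, 2, and 4 can be checked with a little arithmetic. Assuming one stride-1 convolution per dilation (an assumption, since the layer layout is not spelled out), the receptive field is 1 + (k - 1) × (1 + 2 + 4) = 29 frames, which covers almost the whole 30-frame window. The sketch below computes that figure and shows a single-channel dilated convolution with zero padding; both function names are illustrative.

```javascript
// Receptive field (in frames) of stacked stride-1 dilated convolutions.
function receptiveField(kernelSize, dilations) {
  return dilations.reduce((rf, d) => rf + (kernelSize - 1) * d, 1);
}

// Single-channel dilated 1D convolution with zero "same" padding.
// out[t] = sum_j kernel[j] * input[t + (j - center) * dilation]
function conv1dDilated(input, kernel, dilation) {
  const center = Math.floor(kernel.length / 2);
  return input.map((_, t) => {
    let sum = 0;
    for (let j = 0; j < kernel.length; j++) {
      const idx = t + (j - center) * dilation;
      // Taps that fall outside the sequence contribute zero.
      if (idx >= 0 && idx < input.length) sum += kernel[j] * input[idx];
    }
    return sum;
  });
}
```

Calling `receptiveField(5, [1, 2, 4])` gives 29, so the top layer sees 29 of the 30 frames in a one-second window without any pooling.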
## File Structure

```
scripts/
  collect-gesture-gt.py     # Camera + CSI gesture data collection
  label-gestures.js         # Auto-label gesture segments from camera
  train-gesture-model.js    # Train CSI gesture classifier
  gesture-server.js         # WebSocket gesture detection server

ui/
  components/
    GestureOverlay.js       # Cursor + feedback overlay
    GestureChart.js         # Gesture-controlled chart wrapper
    GestureStatus.js        # Sensor health bar
  services/
    gesture.service.js      # WebSocket client for gesture events
```

## Consequences

### Positive

- **Hands-free data exploration**: manipulate charts without touching anything
- **Works in dark/dirty/gloved conditions**: CSI-only mode needs no camera
- **Natural interaction**: swipe, pinch, point are intuitive
- **Builds on existing infrastructure**: WiFlow + DTW + MediaPipe all exist
- **Dual-mode deployment**: degrade gracefully from fusion to CSI-only
- **Low latency**: WiFlow inference is 0.79ms, gesture detection adds ~5ms

### Negative

- **Learning curve**: users must learn gesture vocabulary
- **False positives**: normal movement may trigger gestures (mitigated by state machine + cooldown)
- **CSI-only precision**: coarse gestures only without camera
- **Single-user**: multi-user gesture disambiguation is hard

### Risks

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Gesture false positives from normal movement | Medium | High | State machine with IDLE→TRACKING threshold, 200ms debounce, 0.7 confidence gate |
| CSI gestures too coarse for chart control | Medium | Medium | Camera fallback for precision; CSI handles navigation-level gestures only |
| Latency > 100ms feels unresponsive | Low | High | WiFlow 0.79ms + gesture 5ms + WebSocket <10ms = ~16ms total |
| User fatigue ("gorilla arm") | Medium | Medium | Support seated gestures; small wrist movements, not full arm sweeps |
| MediaPipe Hands not detecting in low light | Medium | Low | CSI-only fallback; works in complete darkness |

## Implementation Plan

| Phase | Task | Effort | Dependencies |
|-------|------|--------|-------------|
| P1 | `gesture-server.js`: WebSocket server with camera hand tracking | 3 hrs | MediaPipe Hands model |
| P2 | Camera gesture classifier (rule-based from hand landmarks) | 2 hrs | P1 |
| P3 | CSI gesture classifier (WiFlow keypoints → DTW templates) | 3 hrs | WiFlow model (ADR-079) |
| P4 | Fusion engine (confidence-weighted merge) | 2 hrs | P2 + P3 |
| P5 | `GestureOverlay.js`: cursor + feedback UI component | 2 hrs | P1 |
| P6 | `GestureChart.js`: gesture-controlled D3 chart wrapper | 4 hrs | P4 + P5 |
| P7 | Gesture training data collection + model training | 2 hrs | P3 |
| P8 | Integration with existing sensing UI | 2 hrs | P6 |
| **Total** | | **~20 hrs** | |

## References

- MediaPipe Hands: Google's 21-landmark hand tracking (30fps, CPU)
- ADR-029: RuvSense DTW gesture recognition
- ADR-079: Camera ground-truth training pipeline (92.9% PCK@20)
- Leap Motion: commercial gesture controller (comparison point)
- SolidJS/D3 gesture interaction patterns
- "GestureWiFi" (IEEE 2023): WiFi gesture recognition survey