Agent guide — voice-chat

Orientation for AI agents making changes in this repo. Keep edits scoped; this pipeline has several sharp edges that a naive refactor will hit.

What this project is

A local real-time voice (and optionally video-avatar) chat server. FastAPI + WebSocket on the backend, a small vanilla-JS UI on the frontend. All ML runs locally on a single NVIDIA GPU; there are no cloud API calls at runtime.

Two-tier architecture

Audio pipeline (always on): mic PCM → vad.py → asr.py → llm.py → tts.py → PCM back to the browser. Orchestrated by pipeline.py via ConversationSession. Supports barge-in (user speaks → cancel in-flight reply).
Video pipeline (optional, gated by config.video.enabled): per assistant turn, video.py's VideoEngine calls video_models/wan22.py (LightX2V Wan2.2-I2V) to produce base frames, then video_models/musetalk.py to lip-sync them to the TTS audio, then video_models/muxer.py to produce an MP4. The MP4 is sent over the same /ws/chat WebSocket as a speaking_clip message.

The audio path must keep working when video.enabled is false. Don't make video models load-bearing for the audio pipeline.
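A minimal sketch of the guard this implies, assuming the video branch is reached through ModelManager.video_engine. FakeVideoEngine, the simplified ModelManager, finish_turn and the "audio" message type are hypothetical stand-ins for illustration, not the real classes in server/models.py or server/pipeline.py; only the speaking_clip message name comes from this guide.

```python
from dataclasses import dataclass
from typing import Optional


class FakeVideoEngine:
    """Hypothetical stand-in for VideoEngine; the real interface may differ."""

    def generate(self, audio: bytes) -> bytes:
        # Pretend to produce an MP4 from the reply audio.
        return b"mp4:" + audio


@dataclass
class ModelManager:
    # Simplified stand-in: video_engine is None when video is disabled.
    video_engine: Optional[FakeVideoEngine] = None


def finish_turn(models: ModelManager, reply_audio: bytes) -> list[dict]:
    """Build the messages for one assistant turn (hypothetical helper)."""
    # The audio message is produced unconditionally.
    messages = [{"type": "audio", "data": reply_audio}]
    # The video branch is strictly optional and never required for audio.
    if models.video_engine is not None:
        clip = models.video_engine.generate(reply_audio)
        messages.append({"type": "speaking_clip", "data": clip})
    return messages
```

With video disabled, finish_turn(ModelManager(), b"pcm") returns only the audio message; nothing on that path touches a video model.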

Key files

server/main.py — FastAPI app, WebSocket, video/avatar HTTP endpoints
server/models.py — lifetime of all models; ModelManager.video_engine is None when video is disabled
server/pipeline.py — per-session audio pipeline + video branch
server/config.py — parses config.yml
server/video.py — VideoConfig, LoRASpec, VideoEngine (library vs reflective modes)
server/video_models/wan22.py — LightX2V wrapper; fp8 + GGUF loading; Blackwell patches
configs/lightx2v/ — LightX2V inference config templates; must match wan22_dit_quant_scheme
tests/unit/ — GPU-free tests, runnable on Windows host
tests/component/ — end-to-end tests, must run inside the Docker container

Conventions

Config: single source of truth is config.yml → server/config.py dataclasses. Don't read env vars for runtime behaviour; if you need a new knob, add it to the dataclass and document it in config.yml.
Logging: log = logging.getLogger(__name__) at module top; log level is set once in server/main.py. INFO for lifecycle, DEBUG for per-chunk chatter.
Async: WebSocket handlers and endpoints are async, but heavy model work is sync — wrap via asyncio.to_thread(...) at the call site (see set_avatar in main.py).
Concurrency: VideoEngine serialises generation with self._lock. Don't call model methods from another thread unless you hold that lock.
Tests: every non-trivial logic change in server/video.py or server/pipeline.py should have a corresponding tests/unit/ test. GPU-dependent behaviour goes in tests/component/.
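The Async and Concurrency conventions combine roughly as follows. This is a sketch under assumptions: SketchEngine and handle_request are hypothetical names, and the real VideoEngine may use a different lock type or method names. It shows sync model work offloaded with asyncio.to_thread while a lock inside the engine keeps concurrent callers from overlapping.

```python
import asyncio
import threading
import time


class SketchEngine:
    """Hypothetical stand-in for a lock-guarded engine such as VideoEngine."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.active = 0       # callers currently inside generate()
        self.max_active = 0   # highest overlap observed

    def generate(self, prompt: str) -> str:
        # Blocking, sync work; the lock serialises concurrent callers.
        with self._lock:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
            time.sleep(0.01)  # stands in for GPU work
            self.active -= 1
            return prompt.upper()


async def handle_request(engine: SketchEngine, prompt: str) -> str:
    # Offload the sync call so the event loop keeps servicing WebSockets.
    return await asyncio.to_thread(engine.generate, prompt)


async def main() -> list[str]:
    engine = SketchEngine()
    results = await asyncio.gather(
        *(handle_request(engine, p) for p in ("a", "b", "c"))
    )
    # Three requests ran in worker threads, but never more than one at a time.
    assert engine.max_active == 1
    return list(results)
```

Taking the lock inside the engine method, rather than at each call site, means a new caller cannot forget it.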

Gotchas

GPU architecture. The project is tuned for RTX 5090 / Blackwell (SM120) with PyTorch 2.8 + Triton 3.4. Several upstream kernels (flashinfer, flash_attn3, sgl_kernel fp8 matmul, Triton-fused scale/shift) are broken or unavailable on that platform. See server/video_models/AGENT.md before touching the Wan2.2 wrapper.
First launch is slow. Hugging Face downloads land in the huggingface-cache Docker volume; a cold run pulls >20 GB.
Wan-AI/Wan2.2-I2V-A14B ships bf16 DIT shards we don't want — BASE_REPO_IGNORE_PATTERNS in wan22.py excludes them. Keep that list in sync if the repo layout changes.
LoRA targets matter. Wan2.2 is a MoE (high_noise + low_noise sub-models). A LoRA with the wrong target loads silently and produces subtly wrong output.
Don't mix audio+video state. The audio pipeline must not block on video generation; video is produced for a turn after the full reply audio is available, and sent as a separate message.
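One way to keep the audio path from blocking on video is to send the reply audio first and run the video branch as a background task. The sketch below assumes that shape; send_audio, render_clip, video_branch and run_turn are hypothetical helpers, the outbox list stands in for the WebSocket, and the actual flow in server/pipeline.py may differ.

```python
import asyncio
import time


async def send_audio(outbox: list, audio: bytes) -> None:
    # Stand-in for pushing PCM to the browser.
    outbox.append(("audio", audio))


def render_clip(audio: bytes) -> bytes:
    # Sync and slow in practice; sleep stands in for video generation.
    time.sleep(0.05)
    return b"mp4:" + audio


async def video_branch(outbox: list, audio: bytes) -> None:
    # Heavy work runs in a worker thread, then goes out as its own message.
    clip = await asyncio.to_thread(render_clip, audio)
    outbox.append(("speaking_clip", clip))


async def run_turn(outbox: list, audio: bytes) -> asyncio.Task:
    # Audio is delivered before any video work starts.
    await send_audio(outbox, audio)
    # Start the video branch without awaiting it; return the task so the
    # caller keeps a reference and can cancel it (for example on barge-in).
    return asyncio.create_task(video_branch(outbox, audio))


async def demo() -> tuple[list, list]:
    outbox: list = []
    task = await run_turn(outbox, b"pcm")
    sent_before_video = list(outbox)  # snapshot taken while video is pending
    await task
    return sent_before_video, outbox
```

In demo(), the snapshot taken right after run_turn returns contains only the audio message; the speaking_clip message arrives later, once the background task finishes.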

When in doubt

Run python -m pytest tests/unit -v — it's fast and catches most regressions.
For GPU changes, run the lowest-numbered relevant component test first; they're ordered to isolate failure to a single stage.
Check memory (auto-loaded) for prior-session findings that aren't in the code.
