Agent guide — voice-chat
Orientation for AI agents making changes in this repo. Keep edits scoped; this pipeline has several sharp edges that a naive refactor will hit.
What this project is
A local real-time voice (and optionally video-avatar) chat server. FastAPI + WebSocket on the backend, a small vanilla-JS UI on the frontend. All ML runs locally on a single NVIDIA GPU; there are no cloud API calls at runtime.
Two-tier architecture
- Audio pipeline (always on): mic PCM → vad.py → asr.py → llm.py → tts.py → PCM back to the browser. Orchestrated by pipeline.py via ConversationSession. Supports barge-in (user speaks → cancel in-flight reply).
- Video pipeline (optional, gated by config.video.enabled): per assistant turn, video.py's VideoEngine calls video_models/wan22.py (LightX2V Wan2.2-I2V) to produce base frames, then video_models/musetalk.py to lip-sync them to the TTS audio, then video_models/muxer.py to produce an MP4. The MP4 is sent over the same /ws/chat WebSocket as a speaking_clip message.
The audio path must keep working when video.enabled is false. Don't make video models load-bearing for the audio pipeline.
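A minimal sketch of that rule, not the repo's actual code: the audio message goes out first, and the video branch only runs as a background task when an engine exists. FakeVideoEngine, generate_clip, finish_turn, and the "audio" message type are hypothetical names for illustration; only speaking_clip comes from the description above.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class FakeVideoEngine:
    """Hypothetical stand-in for VideoEngine; the real interface may differ."""

    def generate_clip(self, audio: bytes) -> bytes:
        # Placeholder for blocking GPU work; here it only tags the input.
        return b"mp4:" + audio


async def finish_turn(
    audio: bytes,
    video_engine: Optional[FakeVideoEngine],
    send: Callable[[dict], Awaitable[None]],
) -> Optional[asyncio.Task]:
    # Audio always goes out first and never waits on video.
    await send({"type": "audio", "data": audio})

    if video_engine is None:
        # Video disabled: the turn is audio-only.
        return None

    async def _video_branch() -> None:
        # Run the blocking call on a worker thread, then send a separate message.
        clip = await asyncio.to_thread(video_engine.generate_clip, audio)
        await send({"type": "speaking_clip", "data": clip})

    return asyncio.create_task(_video_branch())
```

With video_engine set to None the function returns immediately after sending audio, which is the behaviour the audio path has to keep.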
Key files
- server/main.py: FastAPI app, WebSocket, video/avatar HTTP endpoints
- server/models.py: lifetime of all models; ModelManager.video_engine is None when video is disabled
- server/pipeline.py: per-session audio pipeline + video branch
- server/config.py: parses config.yml
- server/video.py: VideoConfig, LoRASpec, VideoEngine (library vs reflective modes)
- server/video_models/wan22.py: LightX2V wrapper; fp8 + GGUF loading; Blackwell patches
- configs/lightx2v/: LightX2V inference config templates; must match wan22_dit_quant_scheme
- tests/unit/: GPU-free tests, runnable on Windows host
- tests/component/: end-to-end tests, must run inside the Docker container
Conventions
- Config: single source of truth is config.yml → server/config.py dataclasses. Don't read env vars for runtime behaviour; if you need a new knob, add it to the dataclass and document it in config.yml.
- Logging: log = logging.getLogger(__name__) at module top; log level is set once in server/main.py. INFO for lifecycle, DEBUG for per-chunk chatter.
- Async: WebSocket handlers and endpoints are async, but heavy model work is sync; wrap it via asyncio.to_thread(...) at the call site (see set_avatar in main.py).
- Concurrency: VideoEngine serialises generation with self._lock. Don't call model methods from another thread without holding that lock.
- Tests: every non-trivial logic change in server/video.py or server/pipeline.py should have a corresponding tests/unit/ test. GPU-dependent behaviour goes in tests/component/.
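The Logging, Async, and Concurrency conventions above can be sketched together as follows. This is an illustrative sketch under assumptions, not the real VideoEngine: SerialisedEngine, generate, and handle_request are hypothetical names, and the uppercase call stands in for model inference.

```python
import asyncio
import logging
import threading

log = logging.getLogger(__name__)  # module-top logger, per the Logging convention


class SerialisedEngine:
    """Hypothetical engine that serialises generation with a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.active = 0      # callers currently inside generate()
        self.max_active = 0  # high-water mark, used to check serialisation

    def generate(self, prompt: str) -> str:
        # Sync, heavy work: only one thread may be inside at a time.
        with self._lock:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
            log.debug("generating for %r", prompt)
            result = prompt.upper()  # placeholder for model inference
            self.active -= 1
            return result


async def handle_request(engine: SerialisedEngine, prompt: str) -> str:
    # Async call site: push the blocking call onto a worker thread
    # so the event loop stays free for other WebSocket traffic.
    return await asyncio.to_thread(engine.generate, prompt)
```

Taking the lock inside the sync method means every caller is serialised regardless of which thread it arrives on, so call sites only need the asyncio.to_thread wrapper.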
Gotchas
- GPU architecture. This is tuned for RTX 5090 / Blackwell (SM120) with PyTorch 2.8 + Triton 3.4. Several upstream kernels (flashinfer, flash_attn3, sgl_kernel fp8 matmul, Triton-fused scale/shift) are broken or unavailable there. See server/video_models/AGENT.md before touching the Wan2.2 wrapper.
- First launch is slow. Hugging Face downloads land in the huggingface-cache Docker volume; a cold run pulls >20 GB.
- Wan-AI/Wan2.2-I2V-A14B ships bf16 DIT shards we don't want; BASE_REPO_IGNORE_PATTERNS in wan22.py excludes them. Keep that list in sync if the repo layout changes.
- LoRA targets matter. Wan2.2 is a MoE (high_noise + low_noise sub-models). A LoRA with the wrong target loads silently and produces subtly wrong output.
- Don't mix audio+video state. The audio pipeline must not block on video generation; video is produced for a turn after the full reply audio is available, and sent as a separate message.
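Because a wrong LoRA target fails silently, one possible safeguard is to reject unknown targets when the spec is built. The sketch below is an assumption about how that could look, not the repo's LoRASpec: LoRASpecSketch, its fields, and the idea that only high_noise and low_noise are valid targets are all hypothetical.

```python
from dataclasses import dataclass

# Sub-model names taken from the MoE description above; whether the real
# code accepts other values (for example a "both" option) is not known here.
VALID_TARGETS = frozenset({"high_noise", "low_noise"})


@dataclass(frozen=True)
class LoRASpecSketch:
    """Hypothetical shape; the real LoRASpec in server/video.py may differ."""

    path: str
    target: str
    strength: float = 1.0

    def __post_init__(self) -> None:
        # Fail loudly at construction instead of loading silently.
        if self.target not in VALID_TARGETS:
            raise ValueError(
                f"unknown LoRA target {self.target!r}; "
                f"expected one of {sorted(VALID_TARGETS)}"
            )
```

A check like this is GPU-free, so a matching test would fit under tests/unit/.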
When in doubt
- Run python -m pytest tests/unit -v. It's fast and catches most regressions.
- For GPU changes, run the lowest-numbered relevant component test first; they're ordered to isolate failure to a single stage.
- Check memory (auto-loaded) for prior-session findings that aren't in the code.