2026-04-16 10:00:37 -04:00

Agent guide — voice-chat

Orientation for AI agents making changes in this repo. Keep edits scoped; this pipeline has several sharp edges that a naive refactor will hit.

What this project is

A local real-time voice (and optionally video-avatar) chat server. FastAPI + WebSocket on the backend, a small vanilla-JS UI on the frontend. All ML runs locally on a single NVIDIA GPU; there are no cloud API calls at runtime.

Two-tier architecture

  1. Audio pipeline (always on): mic PCM → vad.py → asr.py → llm.py → tts.py → PCM back to the browser. Orchestrated by pipeline.py via ConversationSession. Supports barge-in (user speaks → cancel in-flight reply).

  2. Video pipeline (optional, gated by config.video.enabled): per assistant turn, video.py's VideoEngine calls video_models/wan22.py (LightX2V Wan2.2-I2V) to produce base frames, then video_models/musetalk.py to lip-sync them to the TTS audio, then video_models/muxer.py to produce an MP4. The MP4 is sent over the same /ws/chat WebSocket as a speaking_clip message.
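
Barge-in, as described in the audio pipeline above, amounts to cancelling the task that is producing the reply. The sketch below illustrates that idea with plain asyncio; speak_reply and demo_barge_in are hypothetical names, not functions from pipeline.py, and the real ConversationSession may structure this differently.

```python
import asyncio


async def speak_reply(chunks: list[str], sent: list[str],
                      first_chunk_sent: asyncio.Event) -> None:
    """Hypothetical reply path: emit one chunk at a time until cancelled."""
    for chunk in chunks:
        sent.append(chunk)
        first_chunk_sent.set()
        await asyncio.sleep(0.05)  # placeholder for LLM/TTS latency


async def demo_barge_in() -> tuple[list[str], bool]:
    sent: list[str] = []
    first_chunk_sent = asyncio.Event()
    reply = asyncio.create_task(
        speak_reply(["a", "b", "c", "d", "e"], sent, first_chunk_sent)
    )
    # Simulate the user starting to speak once the reply is under way.
    await first_chunk_sent.wait()
    reply.cancel()  # barge-in: drop the in-flight reply
    try:
        await reply
    except asyncio.CancelledError:
        pass  # expected: the reply was cancelled on purpose
    return sent, reply.cancelled()
```

Running asyncio.run(demo_barge_in()) returns the chunks that were emitted before the interruption and a flag confirming the task ended by cancellation rather than by finishing.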

The audio path must keep working when video.enabled is false. Don't make video models load-bearing for the audio pipeline.
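
One way to keep video from becoming load-bearing is to treat it as an optional extra message appended after the audio for a turn. This is a minimal sketch under assumptions: VideoConfig, AppConfig, finish_turn, and the "audio" message type are hypothetical stand-ins; only config.video.enabled and the speaking_clip message name come from this guide.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class VideoConfig:
    enabled: bool = False  # mirrors config.video.enabled


@dataclass
class AppConfig:
    video: VideoConfig = field(default_factory=VideoConfig)


def finish_turn(
    config: AppConfig,
    reply_pcm: bytes,
    make_clip: Optional[Callable[[bytes], bytes]] = None,
) -> list[dict]:
    """Build the messages for one assistant turn; audio never depends on video."""
    # "audio" is a hypothetical message type used only for illustration.
    messages = [{"type": "audio", "pcm": reply_pcm}]
    if config.video.enabled and make_clip is not None:
        # Video is an optional, separate message produced from the finished audio.
        messages.append({"type": "speaking_clip", "mp4": make_clip(reply_pcm)})
    return messages
```

With the default AppConfig(), finish_turn returns only the audio message, so nothing on the audio path needs a video model to be present.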

Key files

Conventions

  • Config: single source of truth is config.yml → server/config.py dataclasses. Don't read env vars for runtime behaviour; if you need a new knob, add it to the dataclass and document it in config.yml.
  • Logging: log = logging.getLogger(__name__) at module top; log level is set once in server/main.py. INFO for lifecycle, DEBUG for per-chunk chatter.
  • Async: WebSocket handlers and endpoints are async, but heavy model work is sync — wrap via asyncio.to_thread(...) at the call site (see set_avatar in main.py).
  • Concurrency: VideoEngine serialises generation with self._lock. Don't call model methods from another thread without holding it.
  • Tests: every non-trivial logic change in server/video.py or server/pipeline.py should have a corresponding tests/unit/ test. GPU-dependent behaviour goes in tests/component/.
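
The logging, async, and concurrency conventions above combine roughly as follows. FakeVideoEngine and generate_many are hypothetical stand-ins (the real VideoEngine is not reproduced here); the sketch only shows sync work guarded by a lock and dispatched with asyncio.to_thread.

```python
import asyncio
import logging
import threading
import time

log = logging.getLogger(__name__)


class FakeVideoEngine:
    """Hypothetical stand-in for VideoEngine; only the locking pattern matters."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.active = 0
        self.max_active = 0  # highest number of concurrent generations seen

    def generate(self, turn_id: int) -> str:
        # Sync, blocking work: hold the lock for the whole model call.
        with self._lock:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
            log.debug("generating clip for turn %d", turn_id)
            time.sleep(0.01)  # placeholder for GPU work
            self.active -= 1
            return f"clip-{turn_id}"


async def generate_many(engine: FakeVideoEngine, turn_ids: list[int]) -> list[str]:
    # Wrap the sync call at the call site so the event loop stays responsive.
    jobs = (asyncio.to_thread(engine.generate, t) for t in turn_ids)
    return await asyncio.gather(*jobs)
```

Even though the three calls in asyncio.run(generate_many(engine, [1, 2, 3])) are dispatched to worker threads at once, the lock ensures only one generation runs at a time, which max_active records.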

Gotchas

  • GPU architecture. This is tuned for RTX 5090 / Blackwell (SM120) with PyTorch 2.8 + Triton 3.4. Several upstream kernels (flashinfer, flash_attn3, sgl_kernel fp8 matmul, Triton-fused scale/shift) are broken or unavailable there. See server/video_models/AGENT.md before touching the Wan2.2 wrapper.
  • First launch is slow. Hugging Face downloads land in the huggingface-cache Docker volume; a cold run pulls >20 GB.
  • Wan-AI/Wan2.2-I2V-A14B ships bf16 DIT shards we don't want — BASE_REPO_IGNORE_PATTERNS in wan22.py excludes them. Keep that list in sync if the repo layout changes.
  • LoRA targets matter. Wan2.2 is a MoE (high_noise + low_noise sub-models). A LoRA with the wrong target loads silently and produces subtly wrong output.
  • Don't mix audio+video state. The audio pipeline must not block on video generation; video is produced for a turn after the full reply audio is available, and sent as a separate message.
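
To keep an ignore list such as BASE_REPO_IGNORE_PATTERNS in sync with a repo layout, it can help to check which patterns no longer match any file. The sketch below assumes fnmatch-style glob patterns; the file names and patterns are made up for illustration and are not the actual contents of wan22.py or the Wan2.2 repo.

```python
from fnmatch import fnmatchcase


def split_by_patterns(files: list[str], ignore_patterns: list[str]) -> tuple[list[str], list[str]]:
    """Return (kept, skipped) file lists for the given glob patterns."""
    skipped = [f for f in files if any(fnmatchcase(f, p) for p in ignore_patterns)]
    kept = [f for f in files if f not in skipped]
    return kept, skipped


def stale_patterns(files: list[str], ignore_patterns: list[str]) -> list[str]:
    """Patterns that match nothing: a hint that the repo layout has changed."""
    return [p for p in ignore_patterns if not any(fnmatchcase(f, p) for f in files)]


# Hypothetical repo listing and patterns, for illustration only.
EXAMPLE_FILES = [
    "high_noise_model/shard-00001.safetensors",
    "low_noise_model/shard-00001.safetensors",
    "config.json",
]
EXAMPLE_PATTERNS = [
    "high_noise_model/*.safetensors",
    "low_noise_model/*.safetensors",
    "legacy_dit/*",
]
```

Here stale_patterns(EXAMPLE_FILES, EXAMPLE_PATTERNS) flags "legacy_dit/*" because no listed file matches it, which is the kind of drift worth catching before a cold download.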

When in doubt

  • Run python -m pytest tests/unit -v — it's fast and catches most regressions.
  • For GPU changes, run the lowest-numbered relevant component test first; they're ordered to isolate failure to a single stage.
  • Check memory (auto-loaded) for prior-session findings that aren't in the code.