# Voice Chat

A real-time voice conversation app powered by local AI models. Speak into your mic and get spoken responses back — and optionally a lip-synced talking-head video of a chosen avatar — all running on your own GPU with no cloud APIs.

## Pipeline

**Mic input** → **VAD** (Silero ONNX) → **ASR** (Qwen3-ASR-0.6B) → **LLM** (Qwen3.5-0.8B) → **TTS** (Kokoro) → **Speaker output**

When the optional video stack is enabled, each assistant turn also produces an MP4 via:

**TTS audio + avatar image** → **Wan2.2-Lightning I2V** (LightX2V, fp8 or GGUF) → **MuseTalk lip-sync** → **ffmpeg mux** → **`speaking_clip` WebSocket message**

- **VAD** — Silero VAD via ONNX Runtime, detects speech/silence boundaries on CPU
- **ASR** — Qwen3-ASR-0.6B, bfloat16 on CUDA
- **LLM** — Qwen3.5-0.8B (local) or any model served by LM Studio
- **TTS** — Kokoro, streams sentence-by-sentence audio at 24 kHz
- **Barge-in** — interrupt the assistant mid-response by speaking
- **Video (optional)** — Wan2.2 I2V 14B MoE with LightX2V distill LoRAs, fp8 or GGUF-quantised DIT, lip-synced by MuseTalk. Two modes:
  - `library` — pre-bakes a small set of speaking base clips per avatar, picks round-robin per turn, lip-syncs on the fly
  - `reflective` — generates a fresh Wan2.2 clip per turn from a prompt derived from the assistant's reply

## Requirements

- NVIDIA GPU with CUDA 12.8 support (tested on RTX 5090 / SM120 Blackwell)
- Docker + Docker Compose with the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
- ~24 GB VRAM recommended when video is enabled (fp8); ~16 GB with `gguf-Q4_K_M`

## Quick Start

```bash
docker compose up --build
```

Then open [http://localhost:8000](http://localhost:8000) in your browser.

Models are downloaded from Hugging Face on first launch and cached in a Docker volume (`huggingface-cache`) so they persist across rebuilds.
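Because models are downloaded on first launch, the UI may take a while to become reachable. The following is a minimal sketch of a readiness probe that polls `GET /` until it answers. The helper name `wait_for_server` and its default timeouts are illustrative assumptions and are not part of the project.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str = "http://localhost:8000/",
                    timeout_s: float = 600.0,
                    interval_s: float = 5.0) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout_s` seconds elapse.

    Hypothetical helper: the defaults are guesses, not project settings.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, HTTP error, or timeout: not ready yet.
            pass
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(interval_s, remaining))
```

Calling `wait_for_server()` after `docker compose up --build` returns `True` once the browser UI responds, or `False` if the timeout expires first.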
## Configuration

All runtime behaviour is driven by [config.yml](config.yml):

- `llm.backend` — `local` (in-process transformers) or `lmstudio` (talks to a local LM Studio server)
- `llm.system_prompt`, `llm.max_cache_tokens` — prompt and KV-cache limit per session
- `video.enabled` — master toggle for the avatar video stack. When `false`, no video models load and the app behaves exactly like the audio-only pipeline
- `video.mode` — `library` or `reflective`
- `video.models.wan22_dit_quant_scheme` — `gguf-Q8_0` (default) or `gguf-Q4_K_M` for lower VRAM; any GGUF level LightX2V supports (dense 5B Turbo is GGUF-only)
- `video.loras` — list of LoRA adapters applied to the dense Wan2.2-TI2V-5B DIT at load time. Each entry has `path`, `weight`, `target` (always `both` — the 5B DIT is not MoE; legacy `high_noise`/`low_noise` values are coerced), and optional `name`. User LoRAs are mounted from `./loras/` into the container at `/cache/loras/`

## Local Development (without Docker)

```bash
# Install PyTorch with CUDA 12.8
pip install torch --index-url https://download.pytorch.org/whl/cu128

# Install auto-gptq
pip install "auto-gptq>=0.7.1" --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu128/

# Install dependencies
pip install -r requirements.txt

# Run
python run.py
```

The server starts on port 8000.
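When editing `config.yml` locally, the `target` coercion described for `video.loras` under Configuration can be pictured with a small sketch. The names `LoraEntry` and `parse_lora_entry`, the default weight of `1.0`, and the choice to reject unknown targets are assumptions made for illustration; the project's own `LoRASpec` may behave differently.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

# Legacy MoE-era values that the README says are coerced to "both".
LEGACY_TARGETS = {"high_noise", "low_noise"}


@dataclass(frozen=True)
class LoraEntry:
    """Hypothetical stand-in for one item of `video.loras`."""
    path: str
    weight: float
    target: str = "both"
    name: Optional[str] = None


def parse_lora_entry(raw: Mapping[str, Any]) -> LoraEntry:
    """Normalise one raw config entry (a sketch, not the project's parser)."""
    if "path" not in raw:
        raise ValueError("LoRA entry needs a 'path'")
    target = raw.get("target", "both")
    if target in LEGACY_TARGETS:
        target = "both"  # coerce legacy values
    elif target != "both":
        # Assumption: anything else is treated as a config error.
        raise ValueError(f"unknown LoRA target: {target!r}")
    return LoraEntry(
        path=str(raw["path"]),
        weight=float(raw.get("weight", 1.0)),  # assumed default
        target=target,
        name=raw.get("name"),
    )
```

For example, an entry with `target: high_noise` and a path under `/cache/loras/` would come back with `target == "both"`.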
## API

- `GET /` — browser UI
- `WS /ws/chat` — bidirectional audio + control WebSocket
- `POST /api/set-voice` — switch the Kokoro voice
- `POST /api/set-avatar` *(video only)* — upload an avatar image; (re)generates idle + library clips
- `GET /api/idle-clip` *(video only)* — cached idle loop MP4
- `POST /api/set-video-mode` *(video only)* — switch between `off` / `library` / `reflective`
- `POST /api/reload-loras` *(video only)* — hot-swap the LoRA stack; regenerates the idle clip

## Project Structure

```
server/
  main.py          — FastAPI app, WebSocket + video endpoints
  models.py        — Model loading and management (audio + optional video)
  pipeline.py      — VAD -> ASR -> LLM -> TTS orchestration, video branch
  config.py        — config.yml parsing
  vad.py           — Silero VAD (ONNX) streaming wrapper
  asr.py           — Speech recognition engine
  llm.py           — Language model engine (local + LM Studio backends)
  tts.py           — Kokoro TTS engine
  audio_utils.py   — PCM/float32 conversion helpers
  video.py         — VideoEngine orchestrator + VideoConfig + LoRASpec
  video_models/
    wan22.py       — LightX2V Wan2.2 I2V pipeline wrapper (fp8 + GGUF)
    musetalk.py    — MuseTalk lip-sync wrapper
    muxer.py       — ffmpeg helpers: frames -> MP4, frames+audio -> MP4
configs/
  lightx2v/        — LightX2V inference config templates (fp8 + GGUF variants)
static/            — Browser UI (HTML/JS/CSS)
avatars/           — uploaded avatar images (gitignored)
loras/             — user-supplied Wan2.2 LoRAs, mounted into the container
reference_audio/   — Kokoro voice reference samples
tests/             — unit + component tests (see tests/README.md)
```
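`audio_utils.py` is listed above as holding PCM/float32 conversion helpers. As a rough illustration of what such helpers do, here is a standard-library sketch that assumes 16-bit little-endian PCM. The function names are hypothetical, and Python floats stand in for float32; the real module is not shown in this README and may differ (for instance by using NumPy arrays).

```python
import struct
from typing import Iterable, List


def pcm16_to_float(pcm: bytes) -> List[float]:
    """Decode little-endian signed 16-bit PCM into floats in [-1.0, 1.0)."""
    count = len(pcm) // 2  # ignore a trailing odd byte, if any
    ints = struct.unpack(f"<{count}h", pcm[: count * 2])
    return [s / 32768.0 for s in ints]


def float_to_pcm16(samples: Iterable[float]) -> bytes:
    """Encode floats as little-endian signed 16-bit PCM, clipping to [-1, 1]."""
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767.0)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)
```

Clipping before scaling keeps out-of-range samples from wrapping around, which would otherwise be audible as loud clicks.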