Voice Chat
A real-time voice conversation app powered by local AI models. Speak into your mic and get spoken responses back — and optionally a lip-synced talking-head video of a chosen avatar — all running on your own GPU with no cloud APIs.
Pipeline
Mic input → VAD (Silero ONNX) → ASR (Qwen3-ASR-0.6B) → LLM (Qwen3.5-0.8B) → TTS (Kokoro) → Speaker output
When the optional video stack is enabled, each assistant turn also produces an MP4 via:
TTS audio + avatar image → Wan2.2-Lightning I2V (LightX2V, fp8 or GGUF) → MuseTalk lip-sync → ffmpeg mux → speaking_clip WebSocket message
- VAD — Silero VAD via ONNX Runtime, detects speech/silence boundaries on CPU
- ASR — Qwen3-ASR-0.6B, bfloat16 on CUDA
- LLM — Qwen3.5-0.8B (local) or any model served by LM Studio
- TTS — Kokoro, streams sentence-by-sentence audio at 24 kHz
- Barge-in — interrupt the assistant mid-response by speaking
- Video (optional) — Wan2.2 I2V 14B MoE with LightX2V distill LoRAs, fp8 or GGUF-quantised DIT, lip-synced by MuseTalk. Two modes:
  - library: pre-bakes a small set of speaking base clips per avatar, picks round-robin per turn, lip-syncs on the fly
  - reflective: generates a fresh Wan2.2 clip per turn from a prompt derived from the assistant's reply
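The round-robin selection described for library mode can be sketched as follows. This is a minimal illustration, not the project's code: the `LibraryClipPicker` class, its method names, and the clip file names are all hypothetical.

```python
from itertools import cycle


class LibraryClipPicker:
    """Hands out pre-baked base clips in round-robin order, one per turn.

    Hypothetical helper: the real VideoEngine may organise this differently.
    """

    def __init__(self, clips_by_avatar):
        # One endless iterator per avatar, so each avatar rotates independently.
        self._cycles = {
            avatar: cycle(list(clips))
            for avatar, clips in clips_by_avatar.items()
            if clips
        }

    def next_clip(self, avatar):
        if avatar not in self._cycles:
            raise KeyError(f"no base clips baked for avatar {avatar!r}")
        return next(self._cycles[avatar])


# Example with made-up clip names: the fourth turn wraps back to the first clip.
picker = LibraryClipPicker({"alice": ["alice_0.mp4", "alice_1.mp4", "alice_2.mp4"]})
turns = [picker.next_clip("alice") for _ in range(4)]
```

Keeping a separate iterator per avatar means switching avatars mid-session does not disturb the rotation of the others.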
Requirements
- NVIDIA GPU with CUDA 12.8 support (tested on RTX 5090 / SM120 Blackwell)
- Docker + Docker Compose with the NVIDIA Container Toolkit
- ~24 GB VRAM recommended when video is enabled (fp8); ~16 GB with gguf-Q4_K_M
Quick Start
docker compose up --build
Then open http://localhost:8000 in your browser.
Models are downloaded from Hugging Face on first launch and cached in a Docker volume (huggingface-cache) so they persist across rebuilds.
Configuration
All runtime behaviour is driven by config.yml:
- llm.backend: local (in-process transformers) or lmstudio (talks to a local LM Studio server)
- llm.system_prompt, llm.max_cache_tokens: prompt and KV-cache limit per session
- video.enabled: master toggle for the avatar video stack. When false, no video models load and the app behaves exactly like the audio-only pipeline
- video.mode: library or reflective
- video.models.wan22_dit_quant_scheme: gguf-Q8_0 (default) or gguf-Q4_K_M for lower VRAM; any GGUF level LightX2V supports (dense 5B Turbo is GGUF-only)
- video.loras: list of LoRA adapters applied to the dense Wan2.2-TI2V-5B DIT at load time. Each entry has path, weight, target (always both, since the 5B DIT is not MoE; legacy high_noise / low_noise values are coerced), and optional name. User LoRAs are mounted from ./loras/ into the container at /cache/loras/
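The coercion of legacy target values in video.loras entries could look roughly like the sketch below. It is an assumption-laden illustration rather than the actual LoRASpec from server/video.py: the `LoRAEntry` and `parse_lora_entry` names, the default weight of 1.0, and the example file name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Legacy values from MoE-era configs; the README says these are coerced to "both".
LEGACY_TARGETS = {"high_noise", "low_noise"}


@dataclass
class LoRAEntry:
    path: str
    weight: float = 1.0  # assumed default, not confirmed by the README
    target: str = "both"
    name: Optional[str] = None


def parse_lora_entry(raw: dict) -> LoRAEntry:
    """Build one entry from a parsed config.yml mapping, coercing legacy targets."""
    target = raw.get("target", "both")
    if target in LEGACY_TARGETS:
        target = "both"
    elif target != "both":
        raise ValueError(f"unsupported LoRA target: {target!r}")
    return LoRAEntry(
        path=str(raw["path"]),
        weight=float(raw.get("weight", 1.0)),
        target=target,
        name=raw.get("name"),
    )


# Example: a legacy entry (hypothetical file name) is accepted and normalised.
entry = parse_lora_entry(
    {"path": "/cache/loras/style.safetensors", "weight": 0.8, "target": "high_noise"}
)
```

Rejecting unknown targets outright, instead of silently coercing them too, makes typos in config.yml visible at load time.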
Local Development (without Docker)
# Install PyTorch with CUDA 12.8
pip install torch --index-url https://download.pytorch.org/whl/cu128
# Install auto-gptq
pip install "auto-gptq>=0.7.1" --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu128/
# Install dependencies
pip install -r requirements.txt
# Run
python run.py
The server starts on port 8000.
API
- GET /: browser UI
- WS /ws/chat: bidirectional audio + control WebSocket
- POST /api/set-voice: switch the Kokoro voice
- POST /api/set-avatar (video only): upload an avatar image; (re)generates idle + library clips
- GET /api/idle-clip (video only): cached idle loop MP4
- POST /api/set-video-mode (video only): switch between off / library / reflective
- POST /api/reload-loras (video only): hot-swap the LoRA stack; regenerates the idle clip
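As a small usage sketch, the idle clip can be fetched with nothing but the Python standard library. The helper names `endpoint` and `fetch_idle_clip` and the output file name are illustrative. The sketch assumes the server is running on the Quick Start address with video enabled, and that the endpoint returns the MP4 bytes directly. The POST endpoints are left out because their request bodies are not documented here.

```python
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # address from the Quick Start section


def endpoint(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an API path without doubling or dropping slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def fetch_idle_clip(dest: str = "idle.mp4", base_url: str = BASE_URL) -> int:
    """Download the cached idle loop and return the number of bytes written.

    Needs a running server with the video stack enabled.
    """
    with urlopen(endpoint("/api/idle-clip", base_url), timeout=30) as resp:
        data = resp.read()
    with open(dest, "wb") as fh:
        fh.write(data)
    return len(data)
```

Calling `fetch_idle_clip()` while the app is up should leave an `idle.mp4` in the working directory.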
Project Structure
server/
main.py — FastAPI app, WebSocket + video endpoints
models.py — Model loading and management (audio + optional video)
pipeline.py — VAD -> ASR -> LLM -> TTS orchestration, video branch
config.py — config.yml parsing
vad.py — Silero VAD (ONNX) streaming wrapper
asr.py — Speech recognition engine
llm.py — Language model engine (local + LM Studio backends)
tts.py — Kokoro TTS engine
audio_utils.py — PCM/float32 conversion helpers
video.py — VideoEngine orchestrator + VideoConfig + LoRASpec
video_models/
wan22.py — LightX2V Wan2.2 I2V pipeline wrapper (fp8 + GGUF)
musetalk.py — MuseTalk lip-sync wrapper
muxer.py — ffmpeg helpers: frames -> MP4, frames+audio -> MP4
configs/
lightx2v/ — LightX2V inference config templates (fp8 + GGUF variants)
static/ — Browser UI (HTML/JS/CSS)
avatars/ — uploaded avatar images (gitignored)
loras/ — user-supplied Wan2.2 LoRAs, mounted into the container
reference_audio/ — Kokoro voice reference samples
tests/ — unit + component tests (see tests/README.md)
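For readers unfamiliar with the kind of helper that audio_utils.py is described as holding, here is a standard-library sketch of 16-bit PCM to float conversion and back. It is not the project's implementation, which may well use NumPy. The function names, and the assumption of little-endian mono int16 samples, are mine.

```python
import struct


def pcm16_to_float(pcm: bytes) -> list[float]:
    """Decode little-endian int16 PCM into floats in [-1.0, 1.0)."""
    count = len(pcm) // 2  # ignore a trailing odd byte, if any
    ints = struct.unpack(f"<{count}h", pcm[: count * 2])
    return [i / 32768.0 for i in ints]


def float_to_pcm16(samples) -> bytes:
    """Encode floats as little-endian int16 PCM, clipping out-of-range values."""
    ints = [max(-32768, min(32767, round(s * 32768.0))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


# Round trip: silence, half scale, and both extremes of the int16 range.
raw = struct.pack("<4h", 0, 16384, -32768, 32767)
floats = pcm16_to_float(raw)
restored = float_to_pcm16(floats)
```

Scaling by 32768 in both directions keeps the round trip exact for every int16 value, at the cost of clipping a float of exactly 1.0 down to 32767.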