2026-04-16 10:00:37 -04:00

Voice Chat

A real-time voice conversation app powered by local AI models. Speak into your mic and get spoken responses back — and optionally a lip-synced talking-head video of a chosen avatar — all running on your own GPU with no cloud APIs.

Pipeline

Mic input → VAD (Silero ONNX) → ASR (Qwen3-ASR-0.6B) → LLM (Qwen3.5-0.8B) → TTS (Kokoro) → Speaker output

When the optional video stack is enabled, each assistant turn also produces an MP4 via:

TTS audio + avatar image → Wan2.2-Lightning I2V (LightX2V, fp8 or GGUF) → MuseTalk lip-sync → ffmpeg mux → speaking_clip WebSocket message
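
The final mux step of this video chain could look roughly like the sketch below. It only builds an ffmpeg argument list that combines a silent lip-synced video with the TTS audio. The function name build_mux_command and the file names are hypothetical, and the project's own muxer.py may well use different flags.

```python
from pathlib import Path


def build_mux_command(video: Path, audio: Path, out: Path) -> list[str]:
    """Build an ffmpeg command that muxes a silent video with an audio track.

    Hypothetical helper for illustration; not taken from the project's muxer.py.
    """
    return [
        "ffmpeg",
        "-y",               # overwrite the output file if it already exists
        "-i", str(video),   # input 0: lip-synced video without audio
        "-i", str(audio),   # input 1: TTS audio
        "-map", "0:v:0",    # take the video stream from input 0
        "-map", "1:a:0",    # take the audio stream from input 1
        "-c:v", "copy",     # keep the video stream as-is (no re-encode)
        "-c:a", "aac",      # encode the audio as AAC for the MP4 container
        "-shortest",        # stop at the end of the shorter stream
        str(out),
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where ffmpeg is on the PATH.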

  • VAD — Silero VAD via ONNX Runtime, detects speech/silence boundaries on CPU
  • ASR — Qwen3-ASR-0.6B, bfloat16 on CUDA
  • LLM — Qwen3.5-0.8B (local) or any model served by LM Studio
  • TTS — Kokoro, streams sentence-by-sentence audio at 24 kHz
  • Barge-in — interrupt the assistant mid-response by speaking
  • Video (optional) — Wan2.2 I2V 14B MoE with LightX2V distill LoRAs, fp8 or GGUF-quantised DIT, lip-synced by MuseTalk. Two modes:
    • library — pre-bakes a small set of speaking base clips per avatar, picks round-robin per turn, lip-syncs on the fly
    • reflective — generates a fresh Wan2.2 clip per turn from a prompt derived from the assistant's reply
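
One way to picture the sentence-by-sentence TTS streaming listed above is a small chunker that sits between the LLM token stream and the TTS engine. This is a minimal sketch under assumptions: the name iter_sentences and the punctuation-based splitting rule are illustrative and are not taken from the project's pipeline.py or tts.py.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at '.', '!' or '?' followed by whitespace (simple heuristic).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def iter_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last one is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Hi", " there.", " How", " are you?", " Fine"]
    for sentence in iter_sentences(stream):
        print(sentence)
```

Each yielded sentence could be handed to the TTS engine while the LLM is still generating, which is what keeps the spoken reply latency low. The heuristic will also split after abbreviations such as "e.g.", so a real implementation would likely need a more careful rule.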

Requirements

  • NVIDIA GPU with CUDA 12.8 support (tested on RTX 5090 / SM120 Blackwell)
  • Docker + Docker Compose with the NVIDIA Container Toolkit
  • ~24 GB VRAM recommended when video is enabled (fp8); ~16 GB with gguf-Q4_K_M

Quick Start

docker compose up --build

Then open http://localhost:8000 in your browser.

Models are downloaded from Hugging Face on first launch and cached in a Docker volume (huggingface-cache) so they persist across rebuilds.

Configuration

All runtime behaviour is driven by config.yml:

  • llm.backend — local (in-process transformers) or lmstudio (talks to a local LM Studio server)
  • llm.system_prompt, llm.max_cache_tokens — prompt and KV-cache limit per session
  • video.enabled — master toggle for the avatar video stack. When false, no video models load and the app behaves exactly like the audio-only pipeline
  • video.mode — library or reflective
  • video.models.wan22_dit_quant_scheme — gguf-Q8_0 (default) or gguf-Q4_K_M for lower VRAM; any other GGUF level that LightX2V supports is also accepted (dense 5B Turbo is GGUF-only)
  • video.loras — list of LoRA adapters applied to the dense Wan2.2-TI2V-5B DIT at load time. Each entry has path, weight, target, and an optional name. target is always both because the 5B DIT is not MoE; legacy high_noise/low_noise values are coerced to both. User LoRAs are mounted from ./loras/ into the container at /cache/loras/
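
The coercion rule for video.loras entries can be illustrated with a small parser. This is a sketch, not the project's LoRASpec: the class name LoRAEntry, the function parse_lora, the default weight of 1.0, and the example file name are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Legacy MoE targets that are mapped to "both" for the dense 5B DIT.
_LEGACY_TARGETS = {"high_noise", "low_noise"}


@dataclass(frozen=True)
class LoRAEntry:
    """One entry of video.loras (hypothetical shape)."""
    path: str
    weight: float = 1.0          # assumed default, not stated in the README
    target: str = "both"
    name: Optional[str] = None


def parse_lora(entry: dict[str, Any]) -> LoRAEntry:
    """Validate one config entry and coerce legacy target values."""
    target = entry.get("target", "both")
    if target in _LEGACY_TARGETS:
        target = "both"
    if target != "both":
        raise ValueError(f"unsupported LoRA target: {target!r}")
    return LoRAEntry(
        path=str(entry["path"]),
        weight=float(entry.get("weight", 1.0)),
        target=target,
        name=entry.get("name"),
    )


if __name__ == "__main__":
    # The file name below is made up; only the /cache/loras/ mount is from the README.
    raw = {"path": "/cache/loras/style.safetensors", "weight": 0.8, "target": "high_noise"}
    print(parse_lora(raw))
```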

Local Development (without Docker)

# Install PyTorch with CUDA 12.8
pip install torch --index-url https://download.pytorch.org/whl/cu128

# Install auto-gptq
pip install "auto-gptq>=0.7.1" --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu128/

# Install dependencies
pip install -r requirements.txt

# Run
python run.py

The server starts on port 8000.

API

  • GET / — browser UI
  • WS /ws/chat — bidirectional audio + control WebSocket
  • POST /api/set-voice — switch the Kokoro voice
  • POST /api/set-avatar (video only) — upload an avatar image; (re)generates idle + library clips
  • GET /api/idle-clip (video only) — cached idle loop MP4
  • POST /api/set-video-mode (video only) — switch between off / library / reflective
  • POST /api/reload-loras (video only) — hot-swap the LoRA stack; regenerates the idle clip

Project Structure

server/
  main.py          — FastAPI app, WebSocket + video endpoints
  models.py        — Model loading and management (audio + optional video)
  pipeline.py      — VAD -> ASR -> LLM -> TTS orchestration, video branch
  config.py        — config.yml parsing
  vad.py           — Silero VAD (ONNX) streaming wrapper
  asr.py           — Speech recognition engine
  llm.py           — Language model engine (local + LM Studio backends)
  tts.py           — Kokoro TTS engine
  audio_utils.py   — PCM/float32 conversion helpers
  video.py         — VideoEngine orchestrator + VideoConfig + LoRASpec
  video_models/
    wan22.py       — LightX2V Wan2.2 I2V pipeline wrapper (fp8 + GGUF)
    musetalk.py    — MuseTalk lip-sync wrapper
    muxer.py       — ffmpeg helpers: frames -> MP4, frames+audio -> MP4

configs/
  lightx2v/        — LightX2V inference config templates (fp8 + GGUF variants)

static/            — Browser UI (HTML/JS/CSS)
avatars/           — Uploaded avatar images (gitignored)
loras/             — User-supplied Wan2.2 LoRAs, mounted into the container
reference_audio/   — Kokoro voice reference samples
tests/             — Unit + component tests (see tests/README.md)
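
To give a feel for the PCM/float32 helpers listed for audio_utils.py, here is a minimal standard-library sketch of converting between 16-bit little-endian PCM bytes and normalised float samples. The function names are hypothetical, and plain Python floats stand in for float32 here; the real helpers presumably operate on array types and may scale or clip differently.

```python
import struct
from typing import Iterable


def pcm16_to_float(pcm: bytes) -> list[float]:
    """Decode little-endian int16 PCM into floats in [-1.0, 1.0)."""
    count = len(pcm) // 2  # ignore a trailing odd byte, if any
    ints = struct.unpack(f"<{count}h", pcm[: count * 2])
    return [sample / 32768.0 for sample in ints]


def float_to_pcm16(samples: Iterable[float]) -> bytes:
    """Encode floats as little-endian int16 PCM, clipping to [-1.0, 1.0]."""
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)
```

Clipping before scaling prevents out-of-range samples from overflowing the int16 range when packing.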