# Voice Chat
A real-time voice conversation app powered by local AI models. Speak into your mic and get spoken responses back — and optionally a lip-synced talking-head video of a chosen avatar — all running on your own GPU with no cloud APIs.
## Pipeline
**Mic input** → **VAD** (Silero ONNX) → **ASR** (Qwen3-ASR-0.6B) → **LLM** (Qwen3.5-0.8B) → **TTS** (Kokoro) → **Speaker output**

When the optional video stack is enabled, each assistant turn also produces an MP4 via:

**TTS audio + avatar image** → **Wan2.2-Lightning I2V** (LightX2V, fp8 or GGUF) → **MuseTalk lip-sync** → **ffmpeg mux** → **`speaking_clip` WebSocket message**

- **VAD** — Silero VAD via ONNX Runtime, detects speech/silence boundaries on CPU
- **ASR** — Qwen3-ASR-0.6B, bfloat16 on CUDA
- **LLM** — Qwen3.5-0.8B (local) or any model served by LM Studio
- **TTS** — Kokoro, streams sentence-by-sentence audio at 24 kHz
- **Barge-in** — interrupt the assistant mid-response by speaking
- **Video (optional)** — Wan2.2 I2V 14B MoE with LightX2V distill LoRAs, fp8 or GGUF-quantised DIT, lip-synced by MuseTalk. Two modes:
  - `library` — pre-bakes a small set of speaking base clips per avatar, picks round-robin per turn, lip-syncs on the fly
  - `reflective` — generates a fresh Wan2.2 clip per turn from a prompt derived from the assistant's reply
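
The TTS bullet above says audio is streamed sentence by sentence. One way to get that behaviour is to buffer the LLM's token stream and emit each sentence as soon as its boundary is seen. The sketch below is only an illustration of that idea, not the project's code: the function name `stream_sentences` and the boundary rule (`.`, `!` or `?` followed by whitespace) are assumptions.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def stream_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last one is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    llm_tokens = ["Hello", " there", ".", " How", " are", " you", "?", " Fine"]
    for sentence in stream_sentences(llm_tokens):
        print(sentence)  # each sentence could be handed to the TTS engine here
```

A rule this simple also splits after abbreviations such as "e.g.", so a real pipeline would likely need a more careful splitter.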
## Requirements
- NVIDIA GPU with CUDA 12.8 support (tested on RTX 5090 / SM120 Blackwell)
- Docker + Docker Compose with the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
- ~24 GB VRAM recommended when video is enabled (fp8); ~16 GB with `gguf-Q4_K_M`
## Quick Start
```bash
docker compose up --build
```
Then open [http://localhost:8000](http://localhost:8000) in your browser.
Models are downloaded from Hugging Face on first launch and cached in a Docker volume (`huggingface-cache`) so they persist across rebuilds.
## Configuration
All runtime behaviour is driven by [config.yml](config.yml):
- `llm.backend` — `local` (in-process transformers) or `lmstudio` (talks to a local LM Studio server)
- `llm.system_prompt`, `llm.max_cache_tokens` — prompt and KV-cache limit per session
- `video.enabled` — master toggle for the avatar video stack. When `false`, no video models load and the app behaves exactly like the audio-only pipeline
- `video.mode` — `library` or `reflective`
- `video.models.wan22_dit_quant_scheme` — `gguf-Q8_0` (default) or `gguf-Q4_K_M` for lower VRAM; any GGUF quantisation level that LightX2V supports is accepted (the dense 5B Turbo model is GGUF-only)
- `video.loras` — list of LoRA adapters applied to the dense Wan2.2-TI2V-5B DIT at load time. Each entry has `path`, `weight`, `target` (always `both` — the 5B DIT is not MoE; legacy `high_noise`/`low_noise` values are coerced to `both`), and optional `name`. User LoRAs are mounted from `./loras/` into the container at `/cache/loras/`
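
To make the `video.loras` rules concrete, here is a minimal sketch of how one entry might be normalised after `config.yml` has been parsed into plain dictionaries. It is not the project's actual `LoRASpec`: the field defaults (a `weight` of `1.0`), the helper name `parse_lora_entry`, and the example file name are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

# Legacy MoE-era targets that the README says are coerced to "both".
LEGACY_TARGETS = {"high_noise", "low_noise"}


@dataclass(frozen=True)
class LoRASpec:
    """Illustrative stand-in for one `video.loras` entry."""
    path: str
    weight: float = 1.0          # assumed default, not confirmed by the README
    target: str = "both"
    name: Optional[str] = None


def parse_lora_entry(entry: Mapping[str, Any]) -> LoRASpec:
    """Validate one parsed config entry and coerce legacy targets."""
    target = entry.get("target", "both")
    if target in LEGACY_TARGETS:
        target = "both"
    if target != "both":
        raise ValueError(f"unsupported LoRA target: {target!r}")
    return LoRASpec(
        path=str(entry["path"]),
        weight=float(entry.get("weight", 1.0)),
        target=target,
        name=entry.get("name"),
    )


if __name__ == "__main__":
    # Hypothetical entry; the file name is made up for the example.
    raw = {"path": "/cache/loras/example.safetensors", "weight": 0.8, "target": "high_noise"}
    print(parse_lora_entry(raw))
```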
## Local Development (without Docker)
```bash
# Install PyTorch with CUDA 12.8
pip install torch --index-url https://download.pytorch.org/whl/cu128
# Install auto-gptq
pip install "auto-gptq>=0.7.1" --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu128/
# Install dependencies
pip install -r requirements.txt
# Run
python run.py
```
The server starts on port 8000.
## API
- `GET /` — browser UI
- `WS /ws/chat` — bidirectional audio + control WebSocket
- `POST /api/set-voice` — switch the Kokoro voice
- `POST /api/set-avatar` *(video only)* — upload an avatar image; (re)generates idle + library clips
- `GET /api/idle-clip` *(video only)* — cached idle loop MP4
- `POST /api/set-video-mode` *(video only)* — switch between `off` / `library` / `reflective`
- `POST /api/reload-loras` *(video only)* — hot-swap the LoRA stack; regenerates the idle clip
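
As a rough illustration of calling one of these endpoints from a script, the sketch below builds a `POST /api/set-video-mode` request with the standard library. The request body is not documented in this README, so the JSON field `mode` is an assumption, as is the helper name `build_set_video_mode_request`; check `server/main.py` for the real payload shape.

```python
import json
from urllib.request import Request

# Modes listed for this endpoint in the API section.
VIDEO_MODES = ("off", "library", "reflective")


def build_set_video_mode_request(mode: str, base_url: str = "http://localhost:8000") -> Request:
    """Build (but do not send) a request that switches the video mode."""
    if mode not in VIDEO_MODES:
        raise ValueError(f"mode must be one of {VIDEO_MODES}, got {mode!r}")
    # Assumed payload shape: {"mode": "<mode>"}.
    body = json.dumps({"mode": mode}).encode("utf-8")
    return Request(
        f"{base_url}/api/set-video-mode",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_set_video_mode_request("library")
    print(req.get_method(), req.full_url)
```

With the server running, the request can be sent with `urllib.request.urlopen(req)`.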
## Project Structure
```
server/
  main.py — FastAPI app, WebSocket + video endpoints
  models.py — Model loading and management (audio + optional video)
  pipeline.py — VAD -> ASR -> LLM -> TTS orchestration, video branch
  config.py — config.yml parsing
  vad.py — Silero VAD (ONNX) streaming wrapper
  asr.py — Speech recognition engine
  llm.py — Language model engine (local + LM Studio backends)
  tts.py — Kokoro TTS engine
  audio_utils.py — PCM/float32 conversion helpers
  video.py — VideoEngine orchestrator + VideoConfig + LoRASpec
  video_models/
    wan22.py — LightX2V Wan2.2 I2V pipeline wrapper (fp8 + GGUF)
    musetalk.py — MuseTalk lip-sync wrapper
    muxer.py — ffmpeg helpers: frames -> MP4, frames+audio -> MP4
configs/
  lightx2v/ — LightX2V inference config templates (fp8 + GGUF variants)
static/ — Browser UI (HTML/JS/CSS)
avatars/ — uploaded avatar images (gitignored)
loras/ — user-supplied Wan2.2 LoRAs, mounted into the container
reference_audio/ — Kokoro voice reference samples
tests/ — unit + component tests (see tests/README.md)
```
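The `audio_utils.py` entry above mentions PCM/float conversion helpers. For readers unfamiliar with the idea, here is a standard-library sketch of converting between 16-bit little-endian PCM bytes and floats in the range [-1.0, 1.0]. It is not the project's implementation, which may well use NumPy `float32` arrays; the function names and the scaling constants (32768 for decoding, 32767 for encoding) are assumptions.

```python
import struct
from typing import Iterable, List


def pcm16_to_float(pcm: bytes) -> List[float]:
    """Decode little-endian 16-bit PCM bytes into floats in [-1.0, 1.0)."""
    count = len(pcm) // 2  # ignore a trailing odd byte, if any
    ints = struct.unpack(f"<{count}h", pcm[: count * 2])
    return [i / 32768.0 for i in ints]


def float_to_pcm16(samples: Iterable[float]) -> bytes:
    """Encode floats as little-endian 16-bit PCM, clipping to [-1.0, 1.0]."""
    ints = [round(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


if __name__ == "__main__":
    pcm = float_to_pcm16([0.0, 0.5, -0.5, 2.0])  # 2.0 is clipped to 1.0
    print(pcm16_to_float(pcm))
```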