working ok

2026-04-16 10:00:37 -04:00
parent 9debc56137
commit 129df7d1fa
24 changed files with 674 additions and 539 deletions
@@ -1,21 +1,29 @@
 # Voice Chat

-A real-time voice conversation app powered by local AI models. Speak into your mic and get spoken responses back — all running on your own GPU with no cloud APIs.
+A real-time voice conversation app powered by local AI models. Speak into your mic and get spoken responses back, optionally with a lip-synced talking-head video of a chosen avatar, all running on your own GPU with no cloud APIs.

 ## Pipeline

 **Mic input** &rarr; **VAD** (Silero ONNX) &rarr; **ASR** (Qwen3-ASR-0.6B) &rarr; **LLM** (Qwen3.5-0.8B) &rarr; **TTS** (Kokoro) &rarr; **Speaker output**

+When the optional video stack is enabled, each assistant turn also produces an MP4 via:
+
+**TTS audio + avatar image** &rarr; **Wan2.2-Lightning I2V** (LightX2V, fp8 or GGUF) &rarr; **MuseTalk lip-sync** &rarr; **ffmpeg mux** &rarr; **`speaking_clip` WebSocket message**
+
 - **VAD** — Silero VAD via ONNX Runtime, detects speech/silence boundaries on CPU
 - **ASR** — Qwen3-ASR-0.6B, bfloat16 on CUDA
- **LLM** — Qwen3.5-0.8B, loaded via transformers
+- **LLM** — Qwen3.5-0.8B (local) or any model served by LM Studio
 - **TTS** — Kokoro, streams sentence-by-sentence audio at 24 kHz
 - **Barge-in** — interrupt the assistant mid-response by speaking
+- **Video (optional)** — Wan2.2 I2V 14B MoE with LightX2V distill LoRAs, fp8 or GGUF-quantised DIT, lip-synced by MuseTalk. Two modes:
+  - `library` — pre-bakes a small set of speaking base clips per avatar, picks one round-robin each turn, and lip-syncs it on the fly
+  - `reflective` — generates a fresh Wan2.2 clip per turn from a prompt derived from the assistant's reply

 ## Requirements

- NVIDIA GPU with CUDA 12.8 support
+- NVIDIA GPU with CUDA 12.8 support (tested on RTX 5090 / SM120 Blackwell)
 - Docker + Docker Compose with the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
+- ~24 GB VRAM recommended when video is enabled (fp8); ~16 GB with `gguf-Q4_K_M`

 ## Quick Start

@@ -27,6 +35,17 @@ Then open [http://localhost:8000](http://localhost:8000) in your browser.

 Models are downloaded from Hugging Face on first launch and cached in a Docker volume (`huggingface-cache`) so they persist across rebuilds.

+## Configuration
+
+All runtime behaviour is driven by [config.yml](config.yml):
+
+- `llm.backend` — `local` (in-process transformers) or `lmstudio` (talks to a local LM Studio server)
+- `llm.system_prompt`, `llm.max_cache_tokens` — prompt and KV-cache limit per session
+- `video.enabled` — master toggle for the avatar video stack. When `false`, no video models load and the app behaves exactly like the audio-only pipeline
+- `video.mode` — `library` or `reflective`
+- `video.models.wan22_dit_quant_scheme` — `gguf-Q8_0` (default) or `gguf-Q4_K_M` for lower VRAM; any GGUF quantisation level that LightX2V supports is accepted (the dense 5B Turbo DIT is GGUF-only)
+- `video.loras` — list of LoRA adapters applied to the dense Wan2.2-TI2V-5B DIT at load time. Each entry has `path`, `weight`, `target`, and an optional `name`. `target` is always `both` because the 5B DIT is not MoE; legacy `high_noise`/`low_noise` values are coerced to `both`. User LoRAs are mounted from `./loras/` into the container at `/cache/loras/`
+
 ## Local Development (without Docker)

 ```bash
@@ -45,17 +64,41 @@ python run.py

 The server starts on port 8000.

+## API
+
+- `GET /` — browser UI
+- `WS /ws/chat` — bidirectional audio + control WebSocket
+- `POST /api/set-voice` — switch the Kokoro voice
+- `POST /api/set-avatar` *(video only)* — upload an avatar image; (re)generates idle + library clips
+- `GET  /api/idle-clip` *(video only)* — cached idle loop MP4
+- `POST /api/set-video-mode` *(video only)* — switch between `off` / `library` / `reflective`
+- `POST /api/reload-loras` *(video only)* — hot-swap the LoRA stack; regenerates the idle clip
+
 ## Project Structure

 ```
 server/
-  main.py       — FastAPI app, WebSocket endpoint
-  models.py     — Model loading and management
-  pipeline.py   — VAD -> ASR -> LLM -> TTS orchestration
-  vad.py        — Silero VAD (ONNX) streaming wrapper
-  asr.py        — Speech recognition engine
-  llm.py        — Language model engine
-  tts.py        — Kokoro TTS engine
-  audio_utils.py — PCM/float32 conversion helpers
-static/         — Browser UI (HTML/JS/CSS)
+  main.py          — FastAPI app, WebSocket + video endpoints
+  models.py        — Model loading and management (audio + optional video)
+  pipeline.py      — VAD -> ASR -> LLM -> TTS orchestration, video branch
+  config.py        — config.yml parsing
+  vad.py           — Silero VAD (ONNX) streaming wrapper
+  asr.py           — Speech recognition engine
+  llm.py           — Language model engine (local + LM Studio backends)
+  tts.py           — Kokoro TTS engine
+  audio_utils.py   — PCM/float32 conversion helpers
+  video.py         — VideoEngine orchestrator + VideoConfig + LoRASpec
+  video_models/
+    wan22.py       — LightX2V Wan2.2 I2V pipeline wrapper (fp8 + GGUF)
+    musetalk.py    — MuseTalk lip-sync wrapper
+    muxer.py       — ffmpeg helpers: frames -> MP4, frames+audio -> MP4
+
+configs/
+  lightx2v/        — LightX2V inference config templates (fp8 + GGUF variants)
+
+static/            — Browser UI (HTML/JS/CSS)
+avatars/           — uploaded avatar images (gitignored)
+loras/             — user-supplied Wan2.2 LoRAs, mounted into the container
+reference_audio/   — Kokoro voice reference samples
+tests/             — unit + component tests (see tests/README.md)
 ```
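
The configuration rules added in this diff (allowed `llm.backend` and `video.mode` values, the `video.enabled` master toggle, and the coercion of legacy LoRA targets) could be checked roughly as in the sketch below. It works on an already-parsed config dictionary and is not the repository's `server/config.py`: `LoraEntry`, `normalize_lora`, and `check_config` are hypothetical names, and the default weight of 1.0, the default backend of `local`, and the default target of `both` for a missing key are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, List, Mapping, Optional

# Values listed in the README's Configuration section.
LLM_BACKENDS = {"local", "lmstudio"}
VIDEO_MODES = {"library", "reflective"}
LEGACY_LORA_TARGETS = {"high_noise", "low_noise"}


@dataclass(frozen=True)
class LoraEntry:
    """Hypothetical stand-in for one `video.loras` item (not the repo's LoRASpec)."""
    path: str
    weight: float
    target: str = "both"
    name: Optional[str] = None


def normalize_lora(raw: Mapping[str, Any]) -> LoraEntry:
    """Coerce legacy MoE targets to `both` and reject anything else."""
    target = raw.get("target", "both")  # assumption: a missing target means `both`
    if target in LEGACY_LORA_TARGETS:
        target = "both"
    if target != "both":
        raise ValueError(f"unsupported LoRA target: {target!r}")
    return LoraEntry(
        path=str(raw["path"]),
        weight=float(raw.get("weight", 1.0)),  # assumption: default weight of 1.0
        target=target,
        name=raw.get("name"),
    )


def check_config(cfg: Mapping[str, Any]) -> List[LoraEntry]:
    """Validate the documented keys and return the normalised LoRA list."""
    backend = cfg.get("llm", {}).get("backend", "local")  # assumption: default backend
    if backend not in LLM_BACKENDS:
        raise ValueError(f"unknown llm.backend: {backend!r}")

    video = cfg.get("video", {})
    if not video.get("enabled", False):
        # Audio-only pipeline: the remaining video settings are ignored.
        return []

    mode = video.get("mode")
    if mode not in VIDEO_MODES:
        raise ValueError(f"unknown video.mode: {mode!r}")
    return [normalize_lora(item) for item in (video.get("loras") or [])]


if __name__ == "__main__":
    example = {
        "llm": {"backend": "lmstudio"},
        "video": {
            "enabled": True,
            "mode": "library",
            # Hypothetical file name under the documented mount point.
            "loras": [{"path": "/cache/loras/example.safetensors",
                       "weight": 0.8, "target": "high_noise"}],
        },
    }
    for entry in check_config(example):
        print(entry)
```

Returning an empty list when `video.enabled` is `false` mirrors the README's statement that the app then behaves like the audio-only pipeline, so a stale `video.mode` value does not block startup in this sketch.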