working ok
@@ -1,21 +1,29 @@
 # Voice Chat
 
-A real-time voice conversation app powered by local AI models. Speak into your mic and get spoken responses back — all running on your own GPU with no cloud APIs.
+A real-time voice conversation app powered by local AI models. Speak into your mic and get spoken responses back — and optionally a lip-synced talking-head video of a chosen avatar — all running on your own GPU with no cloud APIs.
 
 ## Pipeline
 
 **Mic input** → **VAD** (Silero ONNX) → **ASR** (Qwen3-ASR-0.6B) → **LLM** (Qwen3.5-0.8B) → **TTS** (Kokoro) → **Speaker output**
 
+When the optional video stack is enabled, each assistant turn also produces an MP4 via:
+
+**TTS audio + avatar image** → **Wan2.2-Lightning I2V** (LightX2V, fp8 or GGUF) → **MuseTalk lip-sync** → **ffmpeg mux** → **`speaking_clip` WebSocket message**
+
 - **VAD** — Silero VAD via ONNX Runtime, detects speech/silence boundaries on CPU
 - **ASR** — Qwen3-ASR-0.6B, bfloat16 on CUDA
-- **LLM** — Qwen3.5-0.8B, loaded via transformers
+- **LLM** — Qwen3.5-0.8B (local) or any model served by LM Studio
 - **TTS** — Kokoro, streams sentence-by-sentence audio at 24 kHz
 - **Barge-in** — interrupt the assistant mid-response by speaking
+- **Video (optional)** — Wan2.2 I2V 14B MoE with LightX2V distill LoRAs, fp8 or GGUF-quantised DIT, lip-synced by MuseTalk. Two modes:
+  - `library` — pre-bakes a small set of speaking base clips per avatar, picks round-robin per turn, lip-syncs on the fly
+  - `reflective` — generates a fresh Wan2.2 clip per turn from a prompt derived from the assistant's reply
 
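The TTS bullet above says Kokoro streams audio sentence by sentence. As a rough illustration of that hand-off, here is a minimal sketch of cutting a stream of LLM text chunks into sentences as soon as each one is complete. The function name `iter_sentences` and the boundary rule (`.`, `!` or `?` followed by whitespace) are assumptions made for this example, not code taken from the repository.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def iter_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last one is known to be a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    tail = buffer.strip()
    if tail:
        # Flush whatever is left when the stream ends.
        yield tail


if __name__ == "__main__":
    for sentence in iter_sentences(["Hello the", "re. How are", " you? Fine"]):
        print(sentence)
```

Each yielded sentence could be passed to the TTS engine while the LLM is still generating, which is one way to keep the time to first audio short. A production splitter would likely need extra rules for abbreviations and numbers.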
 ## Requirements
 
-- NVIDIA GPU with CUDA 12.8 support
+- NVIDIA GPU with CUDA 12.8 support (tested on RTX 5090 / SM120 Blackwell)
 - Docker + Docker Compose with the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
+- ~24 GB VRAM recommended when video is enabled (fp8); ~16 GB with `gguf-Q4_K_M`
 
 ## Quick Start
 
@@ -27,6 +35,17 @@ Then open [http://localhost:8000](http://localhost:8000) in your browser.
 
 Models are downloaded from Hugging Face on first launch and cached in a Docker volume (`huggingface-cache`) so they persist across rebuilds.
 
+## Configuration
+
+All runtime behaviour is driven by [config.yml](config.yml):
+
+- `llm.backend` — `local` (in-process transformers) or `lmstudio` (talks to a local LM Studio server)
+- `llm.system_prompt`, `llm.max_cache_tokens` — prompt and KV-cache limit per session
+- `video.enabled` — master toggle for the avatar video stack. When `false`, no video models load and the app behaves exactly like the audio-only pipeline
+- `video.mode` — `library` or `reflective`
+- `video.models.wan22_dit_quant_scheme` — `gguf-Q8_0` (default) or `gguf-Q4_K_M` for lower VRAM; any GGUF level LightX2V supports (dense 5B Turbo is GGUF-only)
+- `video.loras` — list of LoRA adapters applied to the dense Wan2.2-TI2V-5B DIT at load time. Each entry has `path`, `weight`, `target` (always `both` — the 5B DIT is not MoE; legacy `high_noise`/`low_noise` values are coerced), and optional `name`. User LoRAs are mounted from `./loras/` into the container at `/cache/loras/`
+
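The `video.loras` entry describes a small validation rule: `target` is always `both`, and the legacy values `high_noise` and `low_noise` are coerced to it. A minimal sketch of that rule follows, assuming each entry arrives as a plain dict after YAML parsing. The names `LoraEntry` and `parse_lora_entry`, the default weight of `1.0`, and the example path are hypothetical and are not the project's actual `LoRASpec`.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Legacy per-expert targets that the README says are coerced to "both".
LEGACY_TARGETS = {"high_noise", "low_noise"}


@dataclass(frozen=True)
class LoraEntry:
    """Hypothetical stand-in for one item of the `video.loras` list."""
    path: str
    weight: float
    target: str = "both"
    name: Optional[str] = None


def parse_lora_entry(raw: Dict[str, Any]) -> LoraEntry:
    """Validate one raw config entry and coerce legacy targets to "both"."""
    target = raw.get("target", "both")
    if target in LEGACY_TARGETS:
        target = "both"
    elif target != "both":
        raise ValueError(f"unsupported LoRA target: {target!r}")
    return LoraEntry(
        path=str(raw["path"]),
        # Assumption: a missing weight falls back to 1.0.
        weight=float(raw.get("weight", 1.0)),
        target=target,
        name=raw.get("name"),
    )


if __name__ == "__main__":
    entry = parse_lora_entry(
        {"path": "/cache/loras/example.safetensors", "weight": 0.8, "target": "high_noise"}
    )
    print(entry)
```

Coercing at parse time means the rest of the video stack only ever sees `both`, so older config files keep working without a migration step.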
 ## Local Development (without Docker)
 
 ```bash
@@ -45,17 +64,41 @@ python run.py
 
 The server starts on port 8000.
 
+## API
+
+- `GET /` — browser UI
+- `WS /ws/chat` — bidirectional audio + control WebSocket
+- `POST /api/set-voice` — switch the Kokoro voice
+- `POST /api/set-avatar` *(video only)* — upload an avatar image; (re)generates idle + library clips
+- `GET /api/idle-clip` *(video only)* — cached idle loop MP4
+- `POST /api/set-video-mode` *(video only)* — switch between `off` / `library` / `reflective`
+- `POST /api/reload-loras` *(video only)* — hot-swap the LoRA stack; regenerates the idle clip
+
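The endpoint list gives paths and methods but not request bodies. As a hedged illustration of calling `POST /api/set-video-mode`, the sketch below builds the request with the standard library only. The JSON body shape `{"mode": ...}` and the helper name `build_set_video_mode_request` are assumptions; check `server/main.py` for the real schema before relying on it.

```python
import json
from urllib.request import Request

# The three modes named in the endpoint description.
VIDEO_MODES = ("off", "library", "reflective")


def build_set_video_mode_request(mode: str, base_url: str = "http://localhost:8000") -> Request:
    """Build (but do not send) a request that switches the video mode."""
    if mode not in VIDEO_MODES:
        raise ValueError(f"unknown video mode: {mode!r}")
    # Assumption: the endpoint accepts a JSON object with a "mode" field.
    body = json.dumps({"mode": mode}).encode("utf-8")
    return Request(
        f"{base_url}/api/set-video-mode",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_set_video_mode_request("library")
    print(req.get_method(), req.full_url, req.data)
```

With the server running, passing the returned object to `urllib.request.urlopen` would send it. Building and sending are kept separate here so the request can be inspected without a live server.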
 ## Project Structure
 
 ```
 server/
-  main.py          — FastAPI app, WebSocket endpoint
-  models.py        — Model loading and management
-  pipeline.py      — VAD -> ASR -> LLM -> TTS orchestration
-  vad.py           — Silero VAD (ONNX) streaming wrapper
-  asr.py           — Speech recognition engine
-  llm.py           — Language model engine
-  tts.py           — Kokoro TTS engine
-  audio_utils.py   — PCM/float32 conversion helpers
-static/            — Browser UI (HTML/JS/CSS)
+  main.py          — FastAPI app, WebSocket + video endpoints
+  models.py        — Model loading and management (audio + optional video)
+  pipeline.py      — VAD -> ASR -> LLM -> TTS orchestration, video branch
+  config.py        — config.yml parsing
+  vad.py           — Silero VAD (ONNX) streaming wrapper
+  asr.py           — Speech recognition engine
+  llm.py           — Language model engine (local + LM Studio backends)
+  tts.py           — Kokoro TTS engine
+  audio_utils.py   — PCM/float32 conversion helpers
+  video.py         — VideoEngine orchestrator + VideoConfig + LoRASpec
+  video_models/
+    wan22.py       — LightX2V Wan2.2 I2V pipeline wrapper (fp8 + GGUF)
+    musetalk.py    — MuseTalk lip-sync wrapper
+    muxer.py       — ffmpeg helpers: frames -> MP4, frames+audio -> MP4
+
+configs/
+  lightx2v/        — LightX2V inference config templates (fp8 + GGUF variants)
+
+static/            — Browser UI (HTML/JS/CSS)
+avatars/           — uploaded avatar images (gitignored)
+loras/             — user-supplied Wan2.2 LoRAs, mounted into the container
+reference_audio/   — Kokoro voice reference samples
+tests/             — unit + component tests (see tests/README.md)
 ```
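The tree lists `audio_utils.py` as holding PCM/float32 conversion helpers. A minimal sketch of what such helpers typically do is shown here, assuming 16-bit little-endian mono PCM on the wire and floats in the range [-1.0, 1.0). The function names are hypothetical, and plain Python lists stand in for the float32 arrays the real module presumably uses.

```python
import struct
from typing import Iterable, List

# Assumed wire format: signed 16-bit little-endian samples.
PCM16_SCALE = 32768.0


def pcm16_to_float(pcm: bytes) -> List[float]:
    """Convert 16-bit little-endian PCM bytes to floats in [-1.0, 1.0)."""
    count = len(pcm) // 2
    # Ignore a trailing odd byte rather than failing on a partial sample.
    ints = struct.unpack(f"<{count}h", pcm[: count * 2])
    return [i / PCM16_SCALE for i in ints]


def float_to_pcm16(samples: Iterable[float]) -> bytes:
    """Convert floats to 16-bit little-endian PCM, clamping out-of-range values."""
    ints = [max(-32768, min(32767, round(s * PCM16_SCALE))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


if __name__ == "__main__":
    raw = struct.pack("<3h", 0, 16384, -32768)
    floats = pcm16_to_float(raw)
    print(floats)
    print(float_to_pcm16(floats) == raw)
```

Clamping before packing avoids a `struct.error` when a model emits samples slightly outside the nominal range.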