# Voice Chat

A real-time voice conversation app powered by local AI models. Speak into your mic and get spoken responses back — and optionally a lip-synced talking-head video of a chosen avatar — all running on your own GPU with no cloud APIs.

## Pipeline

**Mic input** &rarr; **VAD** (Silero ONNX) &rarr; **ASR** (Qwen3-ASR-0.6B) &rarr; **LLM** (Qwen3.5-0.8B) &rarr; **TTS** (Kokoro) &rarr; **Speaker output**

When the optional video stack is enabled, each assistant turn also produces an MP4 via:

**TTS audio + avatar image** &rarr; **Wan2.2-Lightning I2V** (LightX2V, fp8 or GGUF) &rarr; **MuseTalk lip-sync** &rarr; **ffmpeg mux** &rarr; **`speaking_clip` WebSocket message**

- **VAD** — Silero VAD via ONNX Runtime, detects speech/silence boundaries on CPU
- **ASR** — Qwen3-ASR-0.6B, bfloat16 on CUDA
- **LLM** — Qwen3.5-0.8B (local) or any model served by LM Studio
- **TTS** — Kokoro, streams sentence-by-sentence audio at 24 kHz
- **Barge-in** — interrupt the assistant mid-response by speaking
- **Video (optional)** — Wan2.2 I2V 14B MoE with LightX2V distill LoRAs, fp8 or GGUF-quantised DIT, lip-synced by MuseTalk. Two modes:
  - `library` — pre-bakes a small set of speaking base clips per avatar, picks round-robin per turn, lip-syncs on the fly
  - `reflective` — generates a fresh Wan2.2 clip per turn from a prompt derived from the assistant's reply
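
The round-robin selection used by `library` mode could look roughly like the sketch below. The class name `ClipLibrary` and the clip file names are hypothetical and are not taken from the project's code; only the "pick the next pre-baked clip each turn, wrapping around" behaviour comes from the description above.

```python
from itertools import cycle


class ClipLibrary:
    """Hypothetical round-robin picker over pre-baked base clips for one avatar."""

    def __init__(self, clip_paths):
        clips = list(clip_paths)
        if not clips:
            raise ValueError("library mode needs at least one pre-baked clip")
        # cycle() yields the clips in order and restarts after the last one.
        self._clips = cycle(clips)

    def next_clip(self):
        # Called once per assistant turn; the chosen clip is then lip-synced.
        return next(self._clips)


# Illustrative file names only.
library = ClipLibrary(["speak_0.mp4", "speak_1.mp4", "speak_2.mp4"])
picked = [library.next_clip() for _ in range(4)]
print(picked)
```

With three clips, the fourth turn wraps back to the first clip. `reflective` mode would skip this step, since it generates a new clip for every turn.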

## Requirements

- NVIDIA GPU with CUDA 12.8 support (tested on RTX 5090 / SM120 Blackwell)
- Docker + Docker Compose with the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
- ~24 GB VRAM recommended when video is enabled (fp8); ~16 GB with `gguf-Q4_K_M`

## Quick Start

```bash
docker compose up --build
```

Then open [http://localhost:8000](http://localhost:8000) in your browser.

Models are downloaded from Hugging Face on first launch and cached in a Docker volume (`huggingface-cache`) so they persist across rebuilds.

## Configuration

All runtime behaviour is driven by [config.yml](config.yml):

- `llm.backend` — `local` (in-process transformers) or `lmstudio` (talks to a local LM Studio server)
- `llm.system_prompt`, `llm.max_cache_tokens` — prompt and KV-cache limit per session
- `video.enabled` — master toggle for the avatar video stack. When `false`, no video models load and the app behaves exactly like the audio-only pipeline
- `video.mode` — `library` or `reflective`
- `video.models.wan22_dit_quant_scheme` — `gguf-Q8_0` (default) or `gguf-Q4_K_M` for lower VRAM. Any GGUF quantisation level that LightX2V supports is accepted (dense 5B Turbo is GGUF-only)
- `video.loras` — list of LoRA adapters applied to the dense Wan2.2-TI2V-5B DIT at load time. Each entry has `path`, `weight`, `target`, and an optional `name`. `target` is always `both` because the 5B DIT is not MoE; legacy `high_noise` / `low_noise` values are coerced to `both`. User LoRAs are mounted from `./loras/` into the container at `/cache/loras/`
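
As an illustration of the `video.loras` entry shape and the legacy `target` coercion described above, here is a minimal sketch. The `LoRAEntry` class, the `parse_loras` helper, the example file name, and the decision to reject unknown targets are all assumptions made for this example; the project's own `LoRASpec` in `server/video.py` may behave differently.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Legacy values from the MoE-era config that are folded into "both".
LEGACY_TARGETS = {"high_noise", "low_noise"}


@dataclass
class LoRAEntry:
    """Hypothetical in-memory form of one `video.loras` entry."""

    path: str
    weight: float
    target: str = "both"
    name: Optional[str] = None

    def __post_init__(self):
        if self.target in LEGACY_TARGETS:
            # The dense 5B DIT has a single set of weights, so coerce.
            self.target = "both"
        elif self.target != "both":
            # Assumption: anything else is treated as a config error.
            raise ValueError(f"unsupported LoRA target: {self.target!r}")


def parse_loras(entries: List[Dict]) -> List[LoRAEntry]:
    """Turn the parsed YAML list into LoRAEntry objects."""
    return [LoRAEntry(**entry) for entry in entries]


# Example data only; the file name is made up.
loras = parse_loras([
    {"path": "/cache/loras/style.safetensors", "weight": 0.8, "target": "high_noise"},
])
print(loras[0].target)
```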

## Local Development (without Docker)

```bash
# Install PyTorch with CUDA 12.8
pip install torch --index-url https://download.pytorch.org/whl/cu128

# Install auto-gptq
pip install "auto-gptq>=0.7.1" --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu128/

# Install dependencies
pip install -r requirements.txt

# Run
python run.py
```

The server starts on port 8000.

## API

- `GET /` — browser UI
- `WS /ws/chat` — bidirectional audio + control WebSocket
- `POST /api/set-voice` — switch the Kokoro voice
- `POST /api/set-avatar` *(video only)* — upload an avatar image; (re)generates idle + library clips
- `GET /api/idle-clip` *(video only)* — cached idle loop MP4
- `POST /api/set-video-mode` *(video only)* — switch between `off` / `library` / `reflective`
- `POST /api/reload-loras` *(video only)* — hot-swap the LoRA stack; regenerates the idle clip
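
A client call to one of these endpoints might be prepared as sketched below, using only the Python standard library. The request body format is not documented here, so the JSON payload `{"mode": ...}` is an assumption; check `server/main.py` for the real schema before relying on it. The helper name `build_set_video_mode_request` is also hypothetical.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"
VIDEO_MODES = ("off", "library", "reflective")


def build_set_video_mode_request(mode, base_url=BASE_URL):
    """Build (but do not send) a POST request for /api/set-video-mode."""
    if mode not in VIDEO_MODES:
        raise ValueError(f"mode must be one of {VIDEO_MODES}")
    # ASSUMPTION: the server accepts a JSON object with a "mode" key.
    body = json.dumps({"mode": mode}).encode("utf-8")
    return Request(
        f"{base_url}/api/set-video-mode",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_set_video_mode_request("library")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it to a running server.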

## Project Structure

```
server/
  main.py          — FastAPI app, WebSocket + video endpoints
  models.py        — Model loading and management (audio + optional video)
  pipeline.py      — VAD -> ASR -> LLM -> TTS orchestration, video branch
  config.py        — config.yml parsing
  vad.py           — Silero VAD (ONNX) streaming wrapper
  asr.py           — Speech recognition engine
  llm.py           — Language model engine (local + LM Studio backends)
  tts.py           — Kokoro TTS engine
  audio_utils.py   — PCM/float32 conversion helpers
  video.py         — VideoEngine orchestrator + VideoConfig + LoRASpec
  video_models/
    wan22.py       — LightX2V Wan2.2 I2V pipeline wrapper (fp8 + GGUF)
    musetalk.py    — MuseTalk lip-sync wrapper
    muxer.py       — ffmpeg helpers: frames -> MP4, frames+audio -> MP4

configs/
  lightx2v/        — LightX2V inference config templates (fp8 + GGUF variants)

static/            — Browser UI (HTML/JS/CSS)
avatars/           — uploaded avatar images (gitignored)
loras/             — user-supplied Wan2.2 LoRAs, mounted into the container
reference_audio/   — Kokoro voice reference samples
tests/             — unit + component tests (see tests/README.md)
```
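
To give a sense of what the PCM/float32 helpers in `audio_utils.py` are for, the sketch below converts between 16-bit little-endian PCM bytes and floats in the range [-1.0, 1.0]. It uses only the standard library so that it is self-contained. The function names, the little-endian assumption, and the scaling constants are choices made for this example; the project's own helpers may use NumPy and different conventions.

```python
import sys
from array import array


def pcm16_to_float32(pcm_bytes):
    """Decode little-endian int16 PCM bytes into float32 samples in [-1.0, 1.0)."""
    samples = array("h")  # signed 16-bit integers
    samples.frombytes(pcm_bytes)
    if sys.byteorder == "big":
        samples.byteswap()  # input is assumed to be little-endian
    return array("f", (s / 32768.0 for s in samples))


def float32_to_pcm16(samples):
    """Encode float samples as little-endian int16 PCM bytes, clipping to [-1.0, 1.0]."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    out = array("h", (int(s * 32767) for s in clipped))
    if sys.byteorder == "big":
        out.byteswap()
    return out.tobytes()


pcm = float32_to_pcm16([0.0, 0.5, -1.0, 2.0])  # 2.0 is clipped to 1.0
decoded = pcm16_to_float32(pcm)
print(len(pcm), list(decoded))
```

Clipping before scaling avoids integer overflow when a sample falls slightly outside the nominal range.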