Voice Chat
A real-time voice conversation app powered by local AI models. Speak into your mic and get spoken responses back — all running on your own GPU with no cloud APIs.
Pipeline
Mic input → VAD (Silero ONNX) → ASR (Qwen3-ASR-0.6B) → LLM (Qwen3.5-0.8B) → TTS (Kokoro) → Speaker output
- VAD — Silero VAD via ONNX Runtime, detects speech/silence boundaries on CPU
- ASR — Qwen3-ASR-0.6B, bfloat16 on CUDA
- LLM — Qwen3.5-0.8B, loaded via transformers
- TTS — Kokoro, streams sentence-by-sentence audio at 24 kHz
- Barge-in — interrupt the assistant mid-response by speaking
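To illustrate how sentence-by-sentence TTS streaming and barge-in could fit together, here is a minimal sketch. It is not taken from this repository: `iter_sentences`, the `interrupted` event, and the punctuation-based boundary rule are all assumptions for illustration. The idea is to buffer streamed LLM text, hand each complete sentence to the TTS stage as soon as it is ready, and stop as soon as the VAD reports that the user has started speaking.

```python
import re
import threading
from typing import Iterable, Iterator

# Assumed boundary rule: '.', '!' or '?' followed by whitespace ends a sentence.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def iter_sentences(tokens: Iterable[str], interrupted: threading.Event) -> Iterator[str]:
    """Buffer streamed text chunks and yield complete sentences.

    Stops early once `interrupted` is set (hypothetical barge-in signal,
    e.g. set by the VAD when the user starts speaking).
    """
    buffer = ""
    for token in tokens:
        if interrupted.is_set():
            return  # barge-in: drop whatever has not been spoken yet
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a complete sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush the trailing text when the stream ends without an interruption.
    tail = buffer.strip()
    if tail and not interrupted.is_set():
        yield tail


if __name__ == "__main__":
    chunks = ["Hello", " there.", " How", " are", " you?"]
    for sentence in iter_sentences(chunks, threading.Event()):
        print(sentence)  # each sentence would be passed to the TTS engine here
```

Splitting on sentence boundaries lets audio playback begin before the LLM has finished generating, and checking the event between chunks keeps interruption latency to roughly one chunk.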
Requirements
- NVIDIA GPU with a driver that supports CUDA 12.8
- Docker + Docker Compose with the NVIDIA Container Toolkit
Quick Start
docker compose up --build
Then open http://localhost:8000 in your browser.
Models are downloaded from Hugging Face on first launch and cached in a Docker volume (huggingface-cache) so they persist across rebuilds.
Local Development (without Docker)
# Install PyTorch with CUDA 12.8
pip install torch --index-url https://download.pytorch.org/whl/cu128
# Install auto-gptq
pip install "auto-gptq>=0.7.1" --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu128/
# Install dependencies
pip install -r requirements.txt
# Run
python run.py
The server starts on port 8000.
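Because models are downloaded on first launch, the port may take a while to start accepting connections. The following is a small sketch for waiting until it does, using only the standard library. The helper name `wait_for_port` and its default timeout are assumptions, not part of this project.

```python
import socket
import time


def wait_for_port(host: str = "localhost", port: int = 8000,
                  timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means the server is listening.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(interval)  # not up yet; retry shortly
    return False


if __name__ == "__main__":
    print("server ready" if wait_for_port() else "server not reachable")
```

This only confirms that something is listening on the port. It does not check that the models have finished loading.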
Project Structure
server/
main.py — FastAPI app, WebSocket endpoint
models.py — Model loading and management
pipeline.py — VAD -> ASR -> LLM -> TTS orchestration
vad.py — Silero VAD (ONNX) streaming wrapper
asr.py — Speech recognition engine
llm.py — Language model engine
tts.py — Kokoro TTS engine
audio_utils.py — PCM/float32 conversion helpers
static/ — Browser UI (HTML/JS/CSS)
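As an illustration of the kind of conversion audio_utils.py is described as providing, here is a minimal sketch of 16-bit PCM to float32 conversion and back. The function names are hypothetical, the input is assumed to be little-endian signed 16-bit PCM, and the sketch uses only the standard library; the actual helpers may well use NumPy and differ in scaling or clipping.

```python
import array
import sys
from typing import Iterable


def pcm16_to_float32(pcm_bytes: bytes) -> array.array:
    """Convert little-endian 16-bit PCM bytes to float32 samples in [-1.0, 1.0)."""
    samples = array.array("h")
    samples.frombytes(pcm_bytes)
    if sys.byteorder == "big":
        samples.byteswap()  # array uses native order; input is little-endian
    return array.array("f", (s / 32768.0 for s in samples))


def float32_to_pcm16(samples: Iterable[float]) -> bytes:
    """Convert float samples to little-endian 16-bit PCM bytes, clipping to [-1.0, 1.0]."""
    out = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    if sys.byteorder == "big":
        out.byteswap()
    return out.tobytes()


if __name__ == "__main__":
    floats = pcm16_to_float32(b"\x00\x00\x00\x40")  # samples 0 and 16384
    print(list(floats))
```

Clipping before scaling avoids integer overflow when a model emits samples slightly outside the nominal range.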