Voice Chat

A real-time voice conversation app powered by local AI models. Speak into your mic and get spoken responses back — all running on your own GPU with no cloud APIs.

Pipeline

Mic input → VAD (Silero ONNX) → ASR (Qwen3-ASR-0.6B) → LLM (Qwen3.5-0.8B) → TTS (Kokoro) → Speaker output

  • VAD — Silero VAD via ONNX Runtime, detects speech/silence boundaries on CPU
  • ASR — Qwen3-ASR-0.6B, bfloat16 on CUDA
  • LLM — Qwen3.5-0.8B, loaded via transformers
  • TTS — Kokoro, streams sentence-by-sentence audio at 24 kHz
  • Barge-in — interrupt the assistant mid-response by speaking
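
Sentence-by-sentence TTS streaming means the LLM's output has to be cut into sentences while it is still being generated. The following is a minimal sketch of one way to do that, not the code in pipeline.py: the function name iter_sentences and the punctuation-based splitting rule are assumptions made for illustration, and the rule is naive (it would also split after abbreviations such as "e.g.").

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? (optionally followed by closing quotes or
# brackets) and is then followed by whitespace. Hypothetical rule, chosen
# for illustration only.
_SENTENCE_END = re.compile(r"([.!?]+[\"')\]]*)\s+")


def iter_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a text stream."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break  # no complete sentence yet, wait for more text
            sentence = buffer[: match.end(1)].strip()
            if sentence:
                yield sentence
            buffer = buffer[match.end():]
    # Whatever is left when the stream ends is the final sentence.
    tail = buffer.strip()
    if tail:
        yield tail


if __name__ == "__main__":
    stream = ["Hello the", "re. How are", " you? I am", " fine"]
    for sentence in iter_sentences(stream):
        print(sentence)
```

Each yielded sentence could be handed to the TTS engine right away, so audio for the first sentence can start playing while later sentences are still being generated.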

Requirements

Quick Start

docker compose up --build

Then open http://localhost:8000 in your browser.

Models are downloaded from Hugging Face on first launch and cached in a Docker volume (huggingface-cache) so they persist across rebuilds.

Local Development (without Docker)

# Install PyTorch with CUDA 12.8
pip install torch --index-url https://download.pytorch.org/whl/cu128

# Install auto-gptq
pip install "auto-gptq>=0.7.1" --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu128/

# Install dependencies
pip install -r requirements.txt

# Run
python run.py

The server starts on port 8000.
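
Because models are downloaded on first launch, the server may take a while to come up. As a rough sketch, a script can poll until port 8000 accepts TCP connections. The helper name wait_for_port and its defaults are hypothetical, and an open port only shows that the server is listening, not that every model has finished loading.

```python
import socket
import time


def wait_for_port(host: str = "localhost", port: int = 8000,
                  timeout: float = 60.0) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # back off briefly before retrying
    return False


if __name__ == "__main__":
    if wait_for_port():
        print("Server is accepting connections on port 8000")
    else:
        print("Timed out waiting for port 8000")
```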

Project Structure

server/
  main.py       — FastAPI app, WebSocket endpoint
  models.py     — Model loading and management
  pipeline.py   — VAD -> ASR -> LLM -> TTS orchestration
  vad.py        — Silero VAD (ONNX) streaming wrapper
  asr.py        — Speech recognition engine
  llm.py        — Language model engine
  tts.py        — Kokoro TTS engine
  audio_utils.py — PCM/float32 conversion helpers
static/         — Browser UI (HTML/JS/CSS)
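
To illustrate the kind of conversion audio_utils.py is described as handling, here is a standard-library sketch of converting between 16-bit PCM bytes and float samples. It is not taken from the repository: the function names, the little-endian byte order, and the 32768 scale factor are all assumptions.

```python
import sys
from array import array
from typing import Iterable


def pcm16_to_float32(pcm: bytes) -> array:
    """Decode little-endian 16-bit PCM into float32 samples in [-1.0, 1.0)."""
    samples = array("h")
    samples.frombytes(pcm)  # raises ValueError if len(pcm) is odd
    if sys.byteorder == "big":
        samples.byteswap()  # input is assumed little-endian
    return array("f", (s / 32768.0 for s in samples))


def float32_to_pcm16(samples: Iterable[float]) -> bytes:
    """Encode float samples as little-endian 16-bit PCM, clipping overflow."""
    ints = array(
        "h",
        (max(-32768, min(32767, round(s * 32768.0))) for s in samples),
    )
    if sys.byteorder == "big":
        ints.byteswap()
    return ints.tobytes()


if __name__ == "__main__":
    pcm = b"\x00\x00\x00\x40\x00\x80\xff\x7f"  # 0, 16384, -32768, 32767
    floats = pcm16_to_float32(pcm)
    print(list(floats))
    print(float32_to_pcm16(floats) == pcm)
```

Clipping before the integer cast matters because out-of-range float samples would otherwise overflow the 16-bit range.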