diff --git a/Dockerfile b/Dockerfile index d41be01..131ba56 100644 --- a/Dockerfile +++ b/Dockerfile @@ -23,7 +23,7 @@ WORKDIR /app # Install PyTorch 2.7+ with CUDA 12.8 support (includes Blackwell/sm_120 support) RUN python3.11 -m pip install --no-cache-dir \ - torch torchvision \ + torch \ --index-url https://download.pytorch.org/whl/cu128 # Install auto-gptq pre-built wheel for CUDA 12.8 (avoids compiling from source) diff --git a/README.md b/README.md new file mode 100644 index 0000000..bb99cbd --- /dev/null +++ b/README.md @@ -0,0 +1,61 @@ +# Voice Chat + +A real-time voice conversation app powered by local AI models. Speak into your mic and get spoken responses back — all running on your own GPU with no cloud APIs. + +## Pipeline + +**Mic input** → **VAD** (Silero ONNX) → **ASR** (Qwen3-ASR-0.6B) → **LLM** (Qwen3.5-0.8B) → **TTS** (Kokoro) → **Speaker output** + +- **VAD** — Silero VAD via ONNX Runtime, detects speech/silence boundaries on CPU +- **ASR** — Qwen3-ASR-0.6B, bfloat16 on CUDA +- **LLM** — Qwen3.5-0.8B, loaded via transformers +- **TTS** — Kokoro, streams sentence-by-sentence audio at 24 kHz +- **Barge-in** — interrupt the assistant mid-response by speaking + +## Requirements + +- NVIDIA GPU with CUDA 12.8 support +- Docker + Docker Compose with the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) + +## Quick Start + +```bash +docker compose up --build +``` + +Then open [http://localhost:8000](http://localhost:8000) in your browser. + +Models are downloaded from Hugging Face on first launch and cached in a Docker volume (`huggingface-cache`) so they persist across rebuilds. + +## Local Development (without Docker) + +```bash +# Install PyTorch with CUDA 12.8 +pip install torch --index-url https://download.pytorch.org/whl/cu128 + +# Install auto-gptq +pip install "auto-gptq>=0.7.1" --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu128/ + +# Install dependencies +pip install -r requirements.txt + +# Run +python run.py +``` + +The server starts on port 8000. + +## Project Structure + +``` +server/ + main.py — FastAPI app, WebSocket endpoint + models.py — Model loading and management + pipeline.py — VAD -> ASR -> LLM -> TTS orchestration + vad.py — Silero VAD (ONNX) streaming wrapper + asr.py — Speech recognition engine + llm.py — Language model engine + tts.py — Kokoro TTS engine + audio_utils.py — PCM/float32 conversion helpers +static/ — Browser UI (HTML/JS/CSS) +```