live-voice-chat/experimental/lightx2v_5b/README.md

# LightX2V + Wan2.2-TI2V-5B-Turbo (GGUF) Experiment

Swap the 14B MoE distill for the dense 5B Turbo model, keeping the LightX2V backend.
Hypothesis: far fewer parameters (5B vs 14B) → lower VRAM footprint (so it can coexist
with the running server) and faster per-step compute, while the 4-step Turbo distill
keeps the step count low so wall time does not regress.

## Config

- **Model**: `hum-ma/Wan2.2-TI2V-5B-Turbo-GGUF` (Q8_0 by default — swap to Q4_K_M via env)
- **Base repo** (configs, T5, VAE): `Wan-AI/Wan2.2-TI2V-5B`
- **model_cls**: `wan2.2` (dense, single DIT — not MoE)
- **Steps**: 4 (Turbo distill)
- **Resolution**: 480×480, 81 frames @ 16 fps

## Key implementation details

- **Dense model (`wan2.2`)**: Uses a single DIT checkpoint, not MoE, so it requires different dtype patching than the 14B pipeline
- **GGUF dequant → fp16**: Requires `DTYPE=FP16` and patches for T5 (bf16→fp16 wrapper), VAE (→fp16), and DIT pre/post weights (fp32→fp16)
- **Wan 2.2 VAE**: 48 latent channels with 16× spatial compression (vs 16 channels / 8× for Wan 2.1) — config must set `vae_stride: [4,16,16]` and `num_channels_latents: 48`
- **fp8 T5**: Uses `lightx2v/Encoders` fp8 checkpoint (~4.9 GB vs ~11.4 GB bf16)
- **Blackwell (SM120)**: Needs `_patch_fp8_scaled_mm_for_blackwell` to replace sgl_kernel's fp8 GEMM
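
The VAE settings above determine the latent tensor the DIT actually sees. The following is a minimal sketch of that arithmetic, not code from this repo: `latent_shape` is a hypothetical helper, and it assumes the usual causal video VAE convention in which the first frame maps to its own latent frame and every further group of `vae_stride[0]` frames adds one more. The Wan 2.1 stride of `(4, 8, 8)` is likewise an assumption based on the 8× spatial figure quoted above.

```python
def latent_shape(frames, height, width,
                 vae_stride=(4, 16, 16), num_channels_latents=48):
    """Return (channels, latent_frames, latent_h, latent_w) for one clip.

    Hypothetical helper for illustration; defaults mirror the Wan 2.2
    values listed above (vae_stride [4,16,16], 48 latent channels).
    """
    t_stride, h_stride, w_stride = vae_stride
    if height % h_stride or width % w_stride:
        raise ValueError("height and width must be multiples of the spatial stride")
    # Assumed causal convention: frame 0 gets its own latent frame,
    # then each further block of t_stride frames adds one.
    latent_frames = (frames - 1) // t_stride + 1
    return (num_channels_latents, latent_frames,
            height // h_stride, width // w_stride)


# 480x480, 81 frames, as in the Config section.
print(latent_shape(81, 480, 480))  # (48, 21, 30, 30)

# Same clip with Wan 2.1-style settings (16 channels, 8x spatial), for comparison.
print(latent_shape(81, 480, 480, vae_stride=(4, 8, 8), num_channels_latents=16))  # (16, 21, 60, 60)
```

If the config is left at Wan 2.1 values, the spatial grid comes out at 60×60 instead of 30×30 and the channel count at 16 instead of 48, so a shape mismatch at the first DIT layer is a likely symptom of a stale `vae_stride` or `num_channels_latents`.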

## Why a separate container

Reuses the existing `voice-chat-voice-chat` image (LightX2V already installed) but runs
under its own compose profile so it doesn't interfere with the live server volumes or
startup. Shares the HF cache volume so model downloads are reused.

## Running

```bash
# Ensure main image is built
docker compose build voice-chat

# Stage model (downloads base + Turbo Q8 GGUF, ~6 GB)
docker compose -f experimental/lightx2v_5b/docker-compose.yml --profile experimental \
    run --rm lightx2v-5b python /app/experimental/lightx2v_5b/setup_model.py

# Run benchmark
docker compose -f experimental/lightx2v_5b/docker-compose.yml --profile experimental \
    run --rm lightx2v-5b python /app/experimental/lightx2v_5b/test_i2v.py
```

The benchmark reports peak VRAM and wall time for one 81-frame 480×480 clip.

## Results

| Metric | Value |
|--------|-------|
| Model load | ~43s |
| VRAM after load | 6.53 GB |
| T5 encode | ~1s |
| VAE encode | ~0.5s |

The full end-to-end benchmark has not completed yet, so wall time and peak VRAM are still pending.

## Go / no-go criteria

- **Go**: < 45s per 81-frame clip AND peak VRAM < 12 GB (leaves ~20 GB for the server)
- **No-go**: if either threshold is missed, keep the 14B MoE Q4_K_M pipeline

### Baselines

- **vLLM-Omni + fp16 Turbo-5B**: 1663s wall time, 22.5 GB VRAM (decisive no-go)
- **LightX2V + 14B MoE Q4_K_M**: ~30s/clip, ~14.5 GB VRAM (current production pipeline)
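
As a rough illustration, the criteria above can be written as a small check and applied to the baseline numbers. This is a sketch rather than part of `test_i2v.py`: `BenchResult` and `is_go` are hypothetical names, the baseline figures are the approximate values listed above, and treating the quoted VRAM numbers as peak VRAM is an assumption.

```python
from dataclasses import dataclass

# Thresholds from the go / no-go criteria above.
MAX_WALL_S = 45.0
MAX_PEAK_VRAM_GB = 12.0


@dataclass(frozen=True)
class BenchResult:
    """One benchmark run for an 81-frame clip (hypothetical container)."""
    name: str
    wall_s: float
    peak_vram_gb: float


def is_go(result: BenchResult) -> bool:
    """Go only if BOTH wall time and peak VRAM are under their thresholds."""
    return result.wall_s < MAX_WALL_S and result.peak_vram_gb < MAX_PEAK_VRAM_GB


# Baselines as quoted above (approximate; VRAM assumed to be peak).
baselines = [
    BenchResult("vLLM-Omni + fp16 Turbo-5B", 1663.0, 22.5),
    BenchResult("LightX2V + 14B MoE Q4_K_M", 30.0, 14.5),
]

for r in baselines:
    print(f"{r.name}: {'go' if is_go(r) else 'no-go'}")
```

Both baselines come out as no-go under these thresholds. The 14B pipeline meets the time budget but not the 12 GB VRAM target, which is the gap the 5B experiment is meant to close.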