bhetherman 9debc56137 Add LightX2V + Wan2.2-TI2V-5B-Turbo GGUF experiment
Benchmarks the dense 5B Turbo model (Q8_0 GGUF + fp8 T5) as a
lower-VRAM alternative to the 14B MoE pipeline. Includes dtype
patches for dense WanModel, Wan 2.2 VAE config (48 channels, 16x
spatial), and Blackwell fp8 workaround.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 01:27:45 -04:00

# LightX2V + Wan2.2-TI2V-5B-Turbo (GGUF) Experiment
Swap the 14B MoE distill for the dense 5B Turbo model while keeping the LightX2V backend.
Hypothesis: with about a third of the 14B model's parameters, the 5B model has a lower VRAM
footprint (so it can coexist with the running server) and faster per-step compute, and the
Turbo 4-step distill keeps wall time low.
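As a rough sanity check on the VRAM side of the hypothesis, the weight footprint can be estimated from bits per weight. The sketch below assumes a nominal parameter count of 5e9 and uses the GGUF Q8_0 layout of 34 bytes per 32-weight block (8.5 bits per weight). It ignores activations, the T5 encoder, and the VAE, so treat it as a lower bound rather than a prediction.

```python
# Back-of-envelope weight footprint. The parameter count is a nominal assumption.
BITS_PER_WEIGHT = {
    "fp16": 16.0,
    "q8_0": 8.5,  # 32 int8 weights + one fp16 scale = 34 bytes per block
}


def weight_gb(n_params: float, fmt: str) -> float:
    """Approximate weight size in GB (10^9 bytes) for a storage format."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


if __name__ == "__main__":
    for fmt in BITS_PER_WEIGHT:
        print(f"5B dense, {fmt}: {weight_gb(5e9, fmt):.2f} GB")
```

Under these assumptions the Q8_0 weights alone come to a little over 5 GB, which is in the same range as the VRAM-after-load figure reported under Results.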
## Config
- **Model**: `hum-ma/Wan2.2-TI2V-5B-Turbo-GGUF` (Q8_0 by default — swap to Q4_K_M via env)
- **Base repo** (configs, T5, VAE): `Wan-AI/Wan2.2-TI2V-5B`
- **model_cls**: `wan2.2` (dense, single DIT — not MoE)
- **Steps**: 4 (Turbo distill)
- **Resolution**: 480×480, 81 frames @ 16 fps
## Key implementation details
- **Dense model (`wan2.2`)**: Uses a single DIT checkpoint rather than MoE, so it requires different dtype patching than the 14B pipeline
- **GGUF dequant → fp16**: Requires `DTYPE=FP16` and patches for T5 (bf16→fp16 wrapper), VAE (→fp16), and DIT pre/post weights (fp32→fp16)
- **Wan 2.2 VAE**: 48 latent channels with 16× spatial compression (vs 16 channels / 8× for Wan 2.1) — config must set `vae_stride: [4,16,16]` and `num_channels_latents: 48`
- **fp8 T5**: Uses `lightx2v/Encoders` fp8 checkpoint (~4.9 GB vs ~11.4 GB bf16)
- **Blackwell (SM120)**: Needs `_patch_fp8_scaled_mm_for_blackwell` to replace sgl_kernel's fp8 GEMM
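To make the VAE bullet concrete, here is a minimal sketch of how `vae_stride` and `num_channels_latents` determine the latent tensor shape for the benchmark clip. The helper `latent_shape` is hypothetical (it is not part of LightX2V). The temporal rule `(frames - 1) // stride + 1` and the Wan 2.1 temporal stride of 4 are assumptions based on the causal VAE design, not values taken from this repo.

```python
from typing import Sequence, Tuple


def latent_shape(
    frames: int,
    height: int,
    width: int,
    vae_stride: Sequence[int],
    num_channels_latents: int,
) -> Tuple[int, int, int, int]:
    """Return (channels, latent_frames, latent_h, latent_w) for a video clip."""
    st, sh, sw = vae_stride
    if (frames - 1) % st or height % sh or width % sw:
        raise ValueError("clip dimensions are not aligned to the VAE stride")
    # Assumption: the first frame is encoded alone, then one latent frame
    # per group of `st` input frames.
    return (num_channels_latents, (frames - 1) // st + 1, height // sh, width // sw)


# Values from the bullet above; the Wan 2.1 temporal stride is an assumption.
WAN22 = dict(vae_stride=(4, 16, 16), num_channels_latents=48)
WAN21 = dict(vae_stride=(4, 8, 8), num_channels_latents=16)

if __name__ == "__main__":
    print(latent_shape(81, 480, 480, **WAN22))  # (48, 21, 30, 30)
    print(latent_shape(81, 480, 480, **WAN21))  # (16, 21, 60, 60)
```

If the config keeps the Wan 2.1 values, the DIT would receive a latent with the wrong channel count and spatial size, which is why both fields have to change together.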
## Why a separate container
Reuses the existing `voice-chat-voice-chat` image (LightX2V already installed) but runs
under its own compose profile so it doesn't interfere with the live server volumes or
startup. Shares the HF cache volume so model downloads are reused.
## Running
```bash
# Ensure main image is built
docker compose build voice-chat
# Stage model (downloads base + Turbo Q8 GGUF, ~6 GB)
docker compose -f experimental/lightx2v_5b/docker-compose.yml --profile experimental \
run --rm lightx2v-5b python /app/experimental/lightx2v_5b/setup_model.py
# Run benchmark
docker compose -f experimental/lightx2v_5b/docker-compose.yml --profile experimental \
run --rm lightx2v-5b python /app/experimental/lightx2v_5b/test_i2v.py
```
The benchmark reports peak VRAM and wall time for one 81-frame 480×480 clip.
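For reference, one common way to collect these two numbers with PyTorch is sketched below. This is not taken from `test_i2v.py`; the `measure` helper is hypothetical, and it falls back to reporting 0.0 GB when torch or CUDA is unavailable. Note that `torch.cuda.max_memory_allocated` only counts memory managed by PyTorch's allocator, so it can read lower than what `nvidia-smi` shows.

```python
import time
from typing import Any, Callable, Dict

try:  # torch is optional so the helper also runs on CPU-only machines
    import torch
except ImportError:
    torch = None


def measure(fn: Callable[[], Any]) -> Dict[str, float]:
    """Run fn once and return wall time (s) and peak allocated VRAM (GB)."""
    use_cuda = torch is not None and torch.cuda.is_available()
    if use_cuda:
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    start = time.perf_counter()
    fn()
    if use_cuda:
        # Wait for queued kernels before stopping the clock.
        torch.cuda.synchronize()
    wall_s = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1e9 if use_cuda else 0.0
    return {"wall_s": wall_s, "peak_vram_gb": peak_gb}


if __name__ == "__main__":
    # Placeholder workload; a real run would wrap the generation call.
    print(measure(lambda: sum(range(1_000_000))))
```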
## Results
| Metric | Value |
|--------|-------|
| Model load | ~43s |
| VRAM after load | 6.53 GB |
| T5 encode | ~1s |
| VAE encode | ~0.5s |
Wall time and peak VRAM are still pending: the full end-to-end benchmark has not completed yet.
## Go / no-go criteria
- **Go**: < 45s per 81-frame clip AND peak VRAM < 12 GB (leaves ~20 GB for the server)
- **No-go**: either threshold is missed; keep the 14B MoE Q4_K_M pipeline
### Baselines
- **vLLM-Omni + fp16 Turbo-5B**: 1663s / 22.5 GB — decisive no-go
- **LightX2V + 14B MoE Q4_K_M**: ~30s/clip, ~14.5 GB VRAM (current production pipeline)
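The go/no-go rule can be written as a small check and applied to the baselines above. The function name and constants are illustrative; the thresholds are the ones stated in the criteria.

```python
MAX_WALL_S = 45.0        # seconds per 81-frame clip
MAX_PEAK_VRAM_GB = 12.0  # leaves ~20 GB for the server


def is_go(wall_s: float, peak_vram_gb: float) -> bool:
    """Return True only if both the wall-time and VRAM thresholds are met."""
    return wall_s < MAX_WALL_S and peak_vram_gb < MAX_PEAK_VRAM_GB


if __name__ == "__main__":
    print(is_go(1663.0, 22.5))  # vLLM-Omni + fp16 Turbo-5B -> False
    print(is_go(30.0, 14.5))    # LightX2V + 14B MoE Q4_K_M -> False
```

The production 14B pipeline passes the time threshold but not the 12 GB VRAM bound, so the experiment only counts as a go if the 5B model lowers peak VRAM while staying under 45 s.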