9debc56137
Benchmarks the dense 5B Turbo model (Q8_0 GGUF + fp8 T5) as a lower-VRAM alternative to the 14B MoE pipeline. Includes dtype patches for dense WanModel, Wan 2.2 VAE config (48 channels, 16x spatial), and Blackwell fp8 workaround. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
66 lines
2.6 KiB
Markdown
# LightX2V + Wan2.2-TI2V-5B-Turbo (GGUF) Experiment

Swap the 14B MoE distill for the dense 5B Turbo model, keeping the LightX2V backend.
Hypothesis: roughly a third of the parameters (5B vs 14B) → lower VRAM footprint (can
coexist with the running server) and faster per-step compute, with the Turbo 4-step
distill preserving wall time.

## Config

- **Model**: `hum-ma/Wan2.2-TI2V-5B-Turbo-GGUF` (Q8_0 by default; swap to Q4_K_M via env)
- **Base repo** (configs, T5, VAE): `Wan-AI/Wan2.2-TI2V-5B`
- **model_cls**: `wan2.2` (dense, single DIT, not MoE)
- **Steps**: 4 (Turbo distill)
- **Resolution**: 480×480, 81 frames @ 16 fps

## Key implementation details

- **Dense model (`wan2.2`)**: Uses a single DIT checkpoint, not MoE, so it requires different dtype patching than the 14B pipeline
- **GGUF dequant → fp16**: Requires `DTYPE=FP16` and patches for T5 (bf16→fp16 wrapper), VAE (→fp16), and DIT pre/post weights (fp32→fp16)
- **Wan 2.2 VAE**: 48 latent channels with 16× spatial compression (vs 16 channels / 8× for Wan 2.1), so the config must set `vae_stride: [4,16,16]` and `num_channels_latents: 48`
- **fp8 T5**: Uses the `lightx2v/Encoders` fp8 checkpoint (~4.9 GB vs ~11.4 GB bf16)
- **Blackwell (SM120)**: Needs `_patch_fp8_scaled_mm_for_blackwell` to replace sgl_kernel's fp8 GEMM

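
As a sanity check on the VAE settings listed above, the stride and channel values determine the latent tensor the DIT receives. The sketch below assumes the usual causal-VAE convention in which the first frame is encoded on its own, so `T` frames map to `(T - 1) // 4 + 1` latent frames. The `latent_shape` helper and the `cfg` dict are illustrative names, not LightX2V code, and the temporal stride of 4 for Wan 2.1 is also an assumption.

```python
def latent_shape(frames, height, width, vae_stride, num_channels_latents):
    """Return (C, T, H, W) of the latent for a clip of the given pixel size."""
    t_stride, h_stride, w_stride = vae_stride
    if height % h_stride or width % w_stride:
        raise ValueError("height and width must be multiples of the spatial stride")
    # Assumption: causal temporal compression, first frame encoded on its own.
    latent_frames = (frames - 1) // t_stride + 1
    return (num_channels_latents, latent_frames, height // h_stride, width // w_stride)


# Values from this README: Wan 2.2 VAE settings, 480x480, 81 frames.
cfg = {"vae_stride": [4, 16, 16], "num_channels_latents": 48}
wan22 = latent_shape(81, 480, 480, **cfg)

# Wan 2.1 settings for comparison (16 channels, 8x spatial).
wan21 = latent_shape(81, 480, 480, vae_stride=[4, 8, 8], num_channels_latents=16)

print(wan22)  # (48, 21, 30, 30)
print(wan21)  # (16, 21, 60, 60)
```

If the Wan 2.1 values were left in the config, the latent would have twice the spatial extent and a third of the channels, which would most likely surface as a shape mismatch when the DIT or VAE weights are applied.
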
## Why a separate container

Reuses the existing `voice-chat-voice-chat` image (LightX2V already installed) but runs
under its own compose profile so it doesn't interfere with the live server volumes or
startup. Shares the HF cache volume so model downloads are reused.

## Running

```bash
# Ensure main image is built
docker compose build voice-chat

# Stage model (downloads base + Turbo Q8 GGUF, ~6 GB)
docker compose -f experimental/lightx2v_5b/docker-compose.yml --profile experimental \
  run --rm lightx2v-5b python /app/experimental/lightx2v_5b/setup_model.py

# Run benchmark
docker compose -f experimental/lightx2v_5b/docker-compose.yml --profile experimental \
  run --rm lightx2v-5b python /app/experimental/lightx2v_5b/test_i2v.py
```

The benchmark reports peak VRAM and wall time for an 81-frame 480p clip.

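
For reference, one way to collect those two numbers is sketched below. This is not the contents of `test_i2v.py`; the `measure` helper is a hypothetical name, and it assumes PyTorch's allocator statistics are an acceptable proxy for peak VRAM. Note that `torch.cuda.max_memory_allocated()` only counts memory allocated by PyTorch in the current process, so it will not include the live server's usage.

```python
import time

try:
    import torch  # optional: only needed for the VRAM reading
except ImportError:
    torch = None


def measure(fn):
    """Run fn() once and return (result, wall_seconds, peak_vram_gb or None)."""
    use_cuda = torch is not None and torch.cuda.is_available()
    if use_cuda:
        torch.cuda.reset_peak_memory_stats()  # start the peak counter from zero
        torch.cuda.synchronize()              # drain pending work before timing
    start = time.perf_counter()
    result = fn()
    if use_cuda:
        torch.cuda.synchronize()              # wait for queued kernels to finish
    wall = time.perf_counter() - start
    # Decimal gigabytes; divide by 2**30 instead if GiB is wanted.
    peak_gb = torch.cuda.max_memory_allocated() / 1e9 if use_cuda else None
    return result, wall, peak_gb


# Stand-in workload; a real run would wrap the generation call instead.
result, wall, peak_gb = measure(lambda: sum(range(1000)))
print(f"wall={wall:.3f}s peak_vram={peak_gb}")
```

The explicit synchronisation matters because CUDA kernels are launched asynchronously; without it the timer can stop before the GPU has finished.
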
## Results

| Metric | Value |
|--------|-------|
| Model load | ~43s |
| VRAM after load | 6.53 GB |
| T5 encode | ~1s |
| VAE encode | ~0.5s |

Awaiting full end-to-end benchmark completion for wall time and peak VRAM.

## Go / no-go criteria

- **Go**: < 45s per 81-frame clip AND peak VRAM < 12 GB (leaves ~20 GB for the server)
- **No-go**: anything else; keep the 14B MoE Q4_K_M pipeline

### Baselines

- **vLLM-Omni + fp16 Turbo-5B**: 1663s / 22.5 GB (decisive no-go)
- **LightX2V + 14B MoE Q4_K_M**: ~30s/clip, ~14.5 GB VRAM (current production pipeline)

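
The go / no-go rule can be written down as a small check and applied to the baselines above. This is a minimal sketch; `verdict` and the two constants are illustrative names, and the thresholds are read as strict inequalities, as the criteria are written.

```python
MAX_WALL_S = 45.0        # Go threshold: seconds per 81-frame clip
MAX_PEAK_VRAM_GB = 12.0  # Go threshold: peak VRAM in GB


def verdict(wall_s, peak_vram_gb):
    """Return 'go' only if both thresholds are met (strict inequalities)."""
    ok = wall_s < MAX_WALL_S and peak_vram_gb < MAX_PEAK_VRAM_GB
    return "go" if ok else "no-go"


# Baseline numbers from the list above.
print(verdict(1663, 22.5))  # no-go (vLLM-Omni + fp16 Turbo-5B)
print(verdict(30, 14.5))    # no-go on VRAM (LightX2V + 14B MoE Q4_K_M)
```

The production 14B pipeline would itself miss the 12 GB bar at ~14.5 GB, so the 5B model has to win on VRAM, not wall time, to justify the switch.
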