9debc56137
Benchmarks the dense 5B Turbo model (Q8_0 GGUF + fp8 T5) as a lower-VRAM alternative to the 14B MoE pipeline. Includes dtype patches for dense WanModel, Wan 2.2 VAE config (48 channels, 16x spatial), and Blackwell fp8 workaround.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
LightX2V + Wan2.2-TI2V-5B-Turbo (GGUF) Experiment
Swap the 14B MoE distill for the dense 5B Turbo model while keeping the LightX2V backend. Hypothesis: roughly a third of the parameters (5B vs 14B) gives a lower VRAM footprint, so the model can coexist with the running server, and faster per-step compute, while the 4-step Turbo distill keeps wall time comparable to the current pipeline.
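To make the VRAM half of the hypothesis concrete, here is a back-of-the-envelope estimate of weight memory only. It assumes a nominal parameter count of 5e9 (the real checkpoint will differ somewhat) and uses the GGUF Q8_0 block layout of 32 int8 weights plus one fp16 scale, which works out to 8.5 bits per weight. Activations, the T5 encoder, and the VAE are not included, so treat the numbers as a lower bound rather than a prediction.

```python
# Q8_0 stores 32 int8 weights plus one fp16 scale per block: 34 bytes / 32 weights.
Q8_0_BITS_PER_WEIGHT = 34 * 8 / 32  # 8.5
FP16_BITS_PER_WEIGHT = 16.0

NOMINAL_PARAMS_5B = 5e9  # assumption: nominal count, not read from the checkpoint


def weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Return the storage needed for the weights alone, in GB (1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    print(f"5B @ Q8_0: {weight_gb(NOMINAL_PARAMS_5B, Q8_0_BITS_PER_WEIGHT):.2f} GB")  # 5.31 GB
    print(f"5B @ fp16: {weight_gb(NOMINAL_PARAMS_5B, FP16_BITS_PER_WEIGHT):.2f} GB")  # 10.00 GB
```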
Config
- Model: `hum-ma/Wan2.2-TI2V-5B-Turbo-GGUF` (Q8_0 by default; swap to Q4_K_M via env)
- Base repo (configs, T5, VAE): `Wan-AI/Wan2.2-TI2V-5B`
- model_cls: `wan2.2` (dense, single DIT, not MoE)
- Steps: 4 (Turbo distill)
- Resolution: 480×480, 81 frames @ 16 fps
Key implementation details
- Dense model (`wan2.2`): Uses a single DIT checkpoint, not MoE; requires different dtype patching than the 14B pipeline
- GGUF dequant → fp16: Requires `DTYPE=FP16` and patches for T5 (bf16→fp16 wrapper), VAE (→fp16), and DIT pre/post weights (fp32→fp16)
- Wan 2.2 VAE: 48 latent channels with 16× spatial compression (vs 16 channels / 8× for Wan 2.1); config must set `vae_stride: [4,16,16]` and `num_channels_latents: 48`
- fp8 T5: Uses `lightx2v/Encoders` fp8 checkpoint (~4.9 GB vs ~11.4 GB bf16)
- Blackwell (SM120): Needs `_patch_fp8_scaled_mm_for_blackwell` to replace sgl_kernel's fp8 GEMM
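The VAE settings above determine the latent tensor the DIT actually sees. As a rough sketch, the function below computes the latent shape for this experiment's 81-frame 480×480 clip under both VAE configurations. It assumes the usual causal-VAE convention where the first frame is encoded on its own, so the latent frame count is `(frames - 1) // t_stride + 1`. `latent_shape` is an illustrative helper, not a LightX2V function.

```python
from math import prod


def latent_shape(frames, height, width, vae_stride, num_channels_latents):
    """Return (channels, latent_frames, latent_height, latent_width)."""
    t_stride, h_stride, w_stride = vae_stride
    # Assumption: causal temporal compression keeps the first frame separate.
    latent_frames = (frames - 1) // t_stride + 1
    return (num_channels_latents, latent_frames, height // h_stride, width // w_stride)


if __name__ == "__main__":
    wan22 = latent_shape(81, 480, 480, (4, 16, 16), 48)
    wan21 = latent_shape(81, 480, 480, (4, 8, 8), 16)
    print(wan22, prod(wan22))  # (48, 21, 30, 30) 907200
    print(wan21, prod(wan21))  # (16, 21, 60, 60) 1209600
```

Leaving `vae_stride` or `num_channels_latents` at Wan 2.1 values would produce a tensor with the wrong channel count and spatial size, which is why the config has to set both explicitly.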
Why a separate container
Reuses the existing voice-chat-voice-chat image (LightX2V already installed) but runs
under its own compose profile so it doesn't interfere with the live server volumes or
startup. Shares the HF cache volume so model downloads are reused.
Running
# Ensure main image is built
docker compose build voice-chat
# Stage model (downloads base + Turbo Q8 GGUF, ~6 GB)
docker compose -f experimental/lightx2v_5b/docker-compose.yml --profile experimental \
run --rm lightx2v-5b python /app/experimental/lightx2v_5b/setup_model.py
# Run benchmark
docker compose -f experimental/lightx2v_5b/docker-compose.yml --profile experimental \
run --rm lightx2v-5b python /app/experimental/lightx2v_5b/test_i2v.py
Reports peak VRAM and wall time for an 81-frame 480p clip.
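How `test_i2v.py` collects these two numbers is not shown here. One common approach, sketched below, wraps the generation call with PyTorch's peak-memory counters and a wall-clock timer. The `measure` helper is hypothetical, and it only tracks memory allocated through PyTorch's caching allocator, so the figure may read lower than what `nvidia-smi` shows. If PyTorch or CUDA is unavailable it falls back to timing only.

```python
import time

try:
    import torch
except ImportError:  # allow the sketch to run without PyTorch installed
    torch = None


def measure(fn):
    """Run fn() and return (result, wall_seconds, peak_vram_gib or None)."""
    use_cuda = torch is not None and torch.cuda.is_available()
    if use_cuda:
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()  # finish pending GPU work before starting the timer
    start = time.perf_counter()
    result = fn()
    if use_cuda:
        torch.cuda.synchronize()  # wait for queued kernels so the timing is complete
    wall_seconds = time.perf_counter() - start
    peak_gib = torch.cuda.max_memory_allocated() / 2**30 if use_cuda else None
    return result, wall_seconds, peak_gib


if __name__ == "__main__":
    # Placeholder workload; in the benchmark this would be the I2V generation call.
    _, seconds, peak = measure(lambda: sum(range(1_000_000)))
    print(f"wall time: {seconds:.3f}s, peak VRAM: {peak}")
```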
Results
| Metric | Value |
|---|---|
| Model load | ~43s |
| VRAM after load | 6.53 GB |
| T5 encode | ~1s |
| VAE encode | ~0.5s |
Wall time and peak VRAM are still pending: the full end-to-end benchmark has not finished.
Go / no-go criteria
- Go: < 45s per 81-frame clip AND peak VRAM < 12 GB (leaves ~20 GB for the server)
- No-go: keep the 14B MoE Q4_K_M pipeline
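The decision rule is simple enough to encode directly, which keeps the call consistent once the remaining numbers arrive. The sketch below is illustrative only: `BenchResult` and `is_go` are hypothetical names, and the thresholds are copied from the criteria above. It is applied to the vLLM-Omni figures listed under Baselines as a sanity check.

```python
from dataclasses import dataclass

MAX_WALL_S = 45.0        # per 81-frame clip
MAX_PEAK_VRAM_GB = 12.0  # leaves ~20 GB for the server


@dataclass(frozen=True)
class BenchResult:
    name: str
    wall_s: float
    peak_vram_gb: float


def is_go(result: BenchResult) -> bool:
    """Go only if BOTH the wall-time and peak-VRAM limits are met."""
    return result.wall_s < MAX_WALL_S and result.peak_vram_gb < MAX_PEAK_VRAM_GB


if __name__ == "__main__":
    vllm_omni = BenchResult("vLLM-Omni + fp16 Turbo-5B", wall_s=1663.0, peak_vram_gb=22.5)
    print(vllm_omni.name, "->", "go" if is_go(vllm_omni) else "no-go")  # no-go
```

Both limits are strict, so a run at exactly 45s or exactly 12 GB counts as a no-go.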
Baselines
- vLLM-Omni + fp16 Turbo-5B: 1663s / 22.5 GB — decisive no-go
- LightX2V + 14B MoE Q4_K_M: ~30s/clip, ~14.5 GB VRAM (current production pipeline)