Add LightX2V + Wan2.2-TI2V-5B-Turbo GGUF experiment

Benchmarks the dense 5B Turbo model (Q8_0 GGUF + fp8 T5) as a lower-VRAM alternative to the 14B MoE pipeline. Includes dtype patches for dense WanModel, Wan 2.2 VAE config (48 channels, 16x spatial), and Blackwell fp8 workaround.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 01:27:45 -04:00
parent 56923ff424
commit 9debc56137
8 changed files with 407 additions and 0 deletions
@@ -0,0 +1,65 @@
+# LightX2V + Wan2.2-TI2V-5B-Turbo (GGUF) Experiment
+
+Swap the 14B MoE distill for the dense 5B Turbo model, keeping the LightX2V backend.
+Hypothesis: about a third of the parameters per step (5B vs 14B) → lower VRAM footprint
+(so it can coexist with the running server) and faster per-step compute, with the
+4-step Turbo distill keeping wall time low.
+
+## Config
+
+- **Model**: `hum-ma/Wan2.2-TI2V-5B-Turbo-GGUF` (Q8_0 by default — swap to Q4_K_M via env)
+- **Base repo** (configs, T5, VAE): `Wan-AI/Wan2.2-TI2V-5B`
+- **model_cls**: `wan2.2` (dense, single DIT — not MoE)
+- **Steps**: 4 (Turbo distill)
+- **Resolution**: 480×480, 81 frames @ 16 fps
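
The "swap to Q4_K_M via env" option above could be resolved along these lines. This is a minimal sketch, not the code in `setup_model.py`: the variable name `GGUF_QUANT` and the helper `resolve_quant` are hypothetical, and the supported set is limited to the two quant levels this README mentions.

```python
import os

# Quant levels named in this README; treating them as the full set is an assumption.
SUPPORTED_QUANTS = ("Q8_0", "Q4_K_M")


def resolve_quant(env=None, var="GGUF_QUANT", default="Q8_0"):
    """Return the GGUF quant label from the environment, defaulting to Q8_0.

    `GGUF_QUANT` is a hypothetical variable name used only for illustration.
    """
    env = os.environ if env is None else env
    quant = env.get(var, default).strip().upper()
    if quant not in SUPPORTED_QUANTS:
        raise ValueError(
            f"unsupported quant {quant!r}; expected one of {SUPPORTED_QUANTS}"
        )
    return quant
```

Rejecting unknown labels early means a typo fails before any download starts rather than surfacing later as a missing file.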
+
+## Key implementation details
+
+- **Dense model (`wan2.2`)**: Uses single DIT checkpoint, not MoE — requires different dtype patching than the 14B pipeline
+- **GGUF dequant → fp16**: Requires `DTYPE=FP16` and patches for T5 (bf16→fp16 wrapper), VAE (→fp16), and DIT pre/post weights (fp32→fp16)
+- **Wan 2.2 VAE**: 48 latent channels with 16× spatial compression (vs 16 channels / 8× for Wan 2.1) — config must set `vae_stride: [4,16,16]` and `num_channels_latents: 48`
+- **fp8 T5**: Uses `lightx2v/Encoders` fp8 checkpoint (~4.9 GB vs ~11.4 GB bf16)
+- **Blackwell (SM120)**: Needs `_patch_fp8_scaled_mm_for_blackwell` to replace sgl_kernel's fp8 GEMM
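
To see why the VAE settings matter, the latent tensor shape can be worked out from `vae_stride` and `num_channels_latents`. The sketch below assumes the usual Wan convention that the first frame is encoded on its own, so the temporal length is `(frames - 1) / 4 + 1`; the helper name `latent_shape` is hypothetical and not part of LightX2V.

```python
def latent_shape(frames, height, width, vae_stride=(4, 16, 16), channels=48):
    """Return (channels, latent_frames, latent_height, latent_width).

    Assumes the temporal rule (frames - 1) // t_stride + 1, i.e. the first
    frame is kept and the rest are compressed in groups of t_stride.
    """
    t_stride, h_stride, w_stride = vae_stride
    if (frames - 1) % t_stride != 0:
        raise ValueError("frames must be of the form t_stride * n + 1")
    if height % h_stride != 0 or width % w_stride != 0:
        raise ValueError("height and width must be multiples of the spatial stride")
    return (
        channels,
        (frames - 1) // t_stride + 1,
        height // h_stride,
        width // w_stride,
    )


# Wan 2.2 settings from this README, at the benchmark resolution.
wan22 = latent_shape(81, 480, 480)  # (48, 21, 30, 30)

# Wan 2.1 settings (16 channels, 8x spatial) for comparison.
wan21 = latent_shape(81, 480, 480, vae_stride=(4, 8, 8), channels=16)  # (16, 21, 60, 60)
```

With the Wan 2.1 values left in the config, the DIT would receive a tensor with a different channel count and spatial size from the one the 5B checkpoint was built for, which is why both fields have to change together.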
+
+## Why a separate container
+
+Reuses the existing `voice-chat-voice-chat` image (LightX2V already installed) but runs
+under its own compose profile so it doesn't interfere with the live server volumes or
+startup. Shares the HF cache volume so model downloads are reused.
+
+## Running
+
+```bash
+# Ensure main image is built
+docker compose build voice-chat
+
+# Stage model (downloads base + Turbo Q8 GGUF, ~6 GB)
+docker compose -f experimental/lightx2v_5b/docker-compose.yml --profile experimental \
+    run --rm lightx2v-5b python /app/experimental/lightx2v_5b/setup_model.py
+
+# Run benchmark
+docker compose -f experimental/lightx2v_5b/docker-compose.yml --profile experimental \
+    run --rm lightx2v-5b python /app/experimental/lightx2v_5b/test_i2v.py
+```
+
+The benchmark run reports peak VRAM and wall time for a single 81-frame 480×480 clip.
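
For reference, wall time and peak VRAM can be captured around any callable roughly as follows. This is a sketch under assumptions, not the contents of `test_i2v.py`: the helper `measure` is hypothetical, and it falls back to `None` for VRAM when PyTorch or CUDA is unavailable.

```python
import time

try:
    import torch
except ImportError:  # allow the timing part to work without PyTorch
    torch = None


def measure(fn, *args, **kwargs):
    """Run fn and return (result, wall_seconds, peak_vram_gib_or_None)."""
    cuda = torch is not None and torch.cuda.is_available()
    if cuda:
        # Reset the allocator's high-water mark and drain pending GPU work
        # so the measurement covers only this call.
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if cuda:
        # CUDA kernels run asynchronously; wait for them before stopping the clock.
        torch.cuda.synchronize()
    wall_s = time.perf_counter() - start
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3 if cuda else None
    return result, wall_s, peak_gib
```

Note that `torch.cuda.max_memory_allocated()` only counts memory handed out by PyTorch's caching allocator, so the figure can be lower than what `nvidia-smi` shows for the process.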
+
+## Results
+
+| Metric | Value |
+|--------|-------|
+| Model load | ~43s |
+| VRAM after load | 6.53 GB |
+| T5 encode | ~1s |
+| VAE encode | ~0.5s |
+
+Awaiting full end-to-end benchmark completion for wall time and peak VRAM.
+
+## Go / no-go criteria
+
+- **Go**: < 45s per 81-frame clip AND peak VRAM < 12 GB (leaves ~20 GB for the server)
+- **No-go**: if either threshold is missed, keep the 14B MoE Q4_K_M pipeline
+
+### Baselines
+
+- **vLLM-Omni + fp16 Turbo-5B**: 1663s / 22.5 GB — decisive no-go
+- **LightX2V + 14B MoE Q4_K_M**: ~30s/clip, ~14.5 GB VRAM (current production pipeline)
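
The go / no-go rule can be written down as a small check and applied to the baseline figures. This is an illustrative sketch; the function name `go_no_go` and the constants are hypothetical, with the thresholds copied from the criteria above.

```python
MAX_WALL_S = 45.0        # go requires wall time strictly below this
MAX_PEAK_VRAM_GB = 12.0  # go requires peak VRAM strictly below this


def go_no_go(wall_s, peak_vram_gb):
    """Return ("go" | "no-go", reasons) for one benchmark result."""
    reasons = []
    if wall_s >= MAX_WALL_S:
        reasons.append(f"wall time {wall_s:g}s is not below {MAX_WALL_S:g}s")
    if peak_vram_gb >= MAX_PEAK_VRAM_GB:
        reasons.append(
            f"peak VRAM {peak_vram_gb:g} GB is not below {MAX_PEAK_VRAM_GB:g} GB"
        )
    return ("go" if not reasons else "no-go", reasons)


# Baseline figures from this README.
vllm_omni = go_no_go(1663, 22.5)  # misses both thresholds
moe_14b = go_no_go(30, 14.5)      # fast enough, but above the VRAM bar
```

Applied to the production baseline's approximate figures, the rule passes on time and fails on VRAM, which matches the aim of this experiment: the 5B model only earns a "go" if it is both fast enough and lighter than the current pipeline.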