Add LightX2V + Wan2.2-TI2V-5B-Turbo GGUF experiment
Benchmarks the dense 5B Turbo model (Q8_0 GGUF + fp8 T5) as a lower-VRAM alternative to the 14B MoE pipeline. Includes dtype patches for dense WanModel, Wan 2.2 VAE config (48 channels, 16x spatial), and Blackwell fp8 workaround.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -0,0 +1,65 @@
# LightX2V + Wan2.2-TI2V-5B-Turbo (GGUF) Experiment

Swap the 14B MoE distill for the dense 5B Turbo model, keeping the LightX2V backend.
Hypothesis: roughly a third of the parameters (5B vs. 14B) → lower VRAM footprint (can
coexist with the running server) and faster per-step compute, while the Turbo 4-step
distill keeps wall time comparable.

## Config

- **Model**: `hum-ma/Wan2.2-TI2V-5B-Turbo-GGUF` (Q8_0 by default; swap to Q4_K_M via env)
- **Base repo** (configs, T5, VAE): `Wan-AI/Wan2.2-TI2V-5B`
- **model_cls**: `wan2.2` (dense, single DIT, not MoE)
- **Steps**: 4 (Turbo distill)
- **Resolution**: 480×480, 81 frames @ 16 fps
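
A minimal sketch of how the settings above could be gathered in one place, assuming the quant is selected through an environment variable. The variable name `GGUF_QUANT` and every dictionary key other than `model_cls` are hypothetical; the README only says the quant can be swapped "via env", and the real scripts may organize this differently.

```python
import os

# Hypothetical env var name: the README only says the quant is switchable "via env".
QUANT_ENV_VAR = "GGUF_QUANT"
SUPPORTED_QUANTS = ("Q8_0", "Q4_K_M")


def resolve_run_config(env=None):
    """Collect the settings from the Config list, picking the GGUF quant from env."""
    env = os.environ if env is None else env
    quant = env.get(QUANT_ENV_VAR, "Q8_0")  # Q8_0 is the documented default
    if quant not in SUPPORTED_QUANTS:
        raise ValueError(f"unsupported quant {quant!r}; expected one of {SUPPORTED_QUANTS}")
    frames, fps = 81, 16
    return {
        "gguf_repo": "hum-ma/Wan2.2-TI2V-5B-Turbo-GGUF",
        "base_repo": "Wan-AI/Wan2.2-TI2V-5B",
        "model_cls": "wan2.2",
        "quant": quant,
        "steps": 4,
        "width": 480,
        "height": 480,
        "frames": frames,
        "fps": fps,
        "clip_seconds": frames / fps,  # playback length of the generated clip
    }


if __name__ == "__main__":
    print(resolve_run_config())
```

With these numbers the generated clip plays for 81 / 16 ≈ 5.06 seconds, which is the unit of work the wall-time target refers to.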

## Key implementation details

- **Dense model (`wan2.2`)**: Uses a single DIT checkpoint rather than MoE experts, so it requires different dtype patching than the 14B pipeline
- **GGUF dequant → fp16**: Requires `DTYPE=FP16` and patches for T5 (bf16→fp16 wrapper), VAE (→fp16), and DIT pre/post weights (fp32→fp16)
- **Wan 2.2 VAE**: 48 latent channels with 16× spatial compression (vs 16 channels / 8× for Wan 2.1), so the config must set `vae_stride: [4,16,16]` and `num_channels_latents: 48`
- **fp8 T5**: Uses the `lightx2v/Encoders` fp8 checkpoint (~4.9 GB vs ~11.4 GB bf16)
- **Blackwell (SM120)**: Needs `_patch_fp8_scaled_mm_for_blackwell` to replace sgl_kernel's fp8 GEMM
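
To make the VAE settings concrete, the sketch below works out the latent tensor shape for the benchmark clip. It assumes the usual Wan causal-VAE convention that the first frame is encoded on its own and the remaining frames are grouped by the temporal stride; the function name and the alignment check are illustrative and not part of LightX2V.

```python
def latent_shape(frames, height, width, vae_stride, num_channels_latents):
    """Return (channels, latent_frames, latent_height, latent_width) for one clip.

    Assumption: the first frame maps to its own latent frame and every further
    group of `t` frames maps to one more, i.e. (frames - 1) // t + 1.
    """
    t, h, w = vae_stride
    if (frames - 1) % t or height % h or width % w:
        raise ValueError("clip size is not aligned to the VAE stride")
    return (num_channels_latents, (frames - 1) // t + 1, height // h, width // w)


if __name__ == "__main__":
    # Wan 2.2 VAE settings from the list above, at 480x480 and 81 frames.
    print(latent_shape(81, 480, 480, (4, 16, 16), 48))
    # Wan 2.1-style settings (16 channels, 8x spatial) for comparison.
    print(latent_shape(81, 480, 480, (4, 8, 8), 16))
```

Under these assumptions the Wan 2.2 latent is 48×21×30×30, while Wan 2.1-style settings would give 16×21×60×60. A config that keeps the Wan 2.1 values would therefore produce a latent with the wrong channel count and spatial size for the 5B model.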

## Why a separate container

Reuses the existing `voice-chat-voice-chat` image (LightX2V already installed) but runs
under its own compose profile so it doesn't interfere with the live server volumes or
startup. Shares the HF cache volume so model downloads are reused.

## Running

```bash
# Ensure main image is built
docker compose build voice-chat

# Stage model (downloads base + Turbo Q8 GGUF, ~6 GB)
docker compose -f experimental/lightx2v_5b/docker-compose.yml --profile experimental \
  run --rm lightx2v-5b python /app/experimental/lightx2v_5b/setup_model.py

# Run benchmark
docker compose -f experimental/lightx2v_5b/docker-compose.yml --profile experimental \
  run --rm lightx2v-5b python /app/experimental/lightx2v_5b/test_i2v.py
```

The benchmark reports peak VRAM and wall time for one 81-frame 480p clip.
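
For readers who want to reproduce the two reported numbers around their own call, here is one possible measurement harness. It is a sketch only and not taken from `test_i2v.py`; the `measure` helper is hypothetical, and it relies on PyTorch's allocator statistics, which count memory allocated by PyTorch and can read lower than what `nvidia-smi` shows for the process.

```python
import time

try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None


def measure(fn, *args, **kwargs):
    """Run fn once and return (result, wall_seconds, peak_vram_gib or None)."""
    use_cuda = torch is not None and torch.cuda.is_available()
    if use_cuda:
        torch.cuda.synchronize()              # finish earlier work before timing
        torch.cuda.reset_peak_memory_stats()  # start the peak counter from now
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if use_cuda:
        torch.cuda.synchronize()              # wait for queued kernels to finish
    wall_seconds = time.perf_counter() - start
    peak_vram_gib = torch.cuda.max_memory_allocated() / 1024**3 if use_cuda else None
    return result, wall_seconds, peak_vram_gib


if __name__ == "__main__":
    # Stand-in workload; replace with the actual generation call.
    _, seconds, peak = measure(sum, range(1_000_000))
    print(f"wall time: {seconds:.3f}s, peak VRAM: {peak}")
```

Synchronizing before stopping the clock matters because CUDA kernels are launched asynchronously; without it the wall time can be understated.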

## Results

| Metric | Value |
|--------|-------|
| Model load | ~43s |
| VRAM after load | 6.53 GB |
| T5 encode | ~1s |
| VAE encode | ~0.5s |

Wall time and peak VRAM are still pending a full end-to-end benchmark run.

## Go / no-go criteria

- **Go**: < 45s per 81-frame clip AND peak VRAM < 12 GB (leaves ~20 GB for the server)
- **No-go**: if either limit is missed, keep the 14B MoE Q4_K_M pipeline

### Baselines

- **vLLM-Omni + fp16 Turbo-5B**: 1663s / 22.5 GB (decisive no-go)
- **LightX2V + 14B MoE Q4_K_M**: ~30s/clip, ~14.5 GB VRAM (current production pipeline)
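
The go/no-go rule can be written down as a small check so that a finished benchmark run is judged the same way every time. This is an illustrative sketch: the function and constant names are made up here, and the thresholds are copied from the criteria in this README.

```python
# Thresholds from the go / no-go criteria in this README.
MAX_WALL_SECONDS = 45.0
MAX_PEAK_VRAM_GB = 12.0


def decide(wall_seconds, peak_vram_gb):
    """Return "go" only if both limits are met (strictly below each threshold)."""
    go = wall_seconds < MAX_WALL_SECONDS and peak_vram_gb < MAX_PEAK_VRAM_GB
    return "go" if go else "no-go"


if __name__ == "__main__":
    # vLLM-Omni + fp16 Turbo-5B baseline: misses both limits.
    print(decide(1663, 22.5))
    # Current 14B MoE pipeline: fast enough, but above the VRAM limit.
    print(decide(30, 14.5))
```

Note that the thresholds are aimed at the 5B candidate: the current 14B pipeline meets the time limit but, at ~14.5 GB, would not meet the 12 GB VRAM limit.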