bhetherman 9debc56137 Add LightX2V + Wan2.2-TI2V-5B-Turbo GGUF experiment
Benchmarks the dense 5B Turbo model (Q8_0 GGUF + fp8 T5) as a
lower-VRAM alternative to the 14B MoE pipeline. Includes dtype
patches for dense WanModel, Wan 2.2 VAE config (48 channels, 16x
spatial), and Blackwell fp8 workaround.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 01:27:45 -04:00

LightX2V + Wan2.2-TI2V-5B-Turbo (GGUF) Experiment

Swap the 14B MoE distill for the dense 5B Turbo model while keeping the LightX2V backend. Hypothesis: with roughly a third of the parameters (5B vs 14B), the model has a lower VRAM footprint (so it can coexist with the running server) and faster per-step compute, while the Turbo 4-step distill keeps wall time comparable.

Config

  • Model: hum-ma/Wan2.2-TI2V-5B-Turbo-GGUF (Q8_0 by default — swap to Q4_K_M via env)
  • Base repo (configs, T5, VAE): Wan-AI/Wan2.2-TI2V-5B
  • model_cls: wan2.2 (dense, single DIT — not MoE)
  • Steps: 4 (Turbo distill)
  • Resolution: 480×480, 81 frames @ 16 fps

Key implementation details

  • Dense model (wan2.2): Uses a single DIT checkpoint, not MoE, so it requires different dtype patching than the 14B pipeline
  • GGUF dequant → fp16: Requires DTYPE=FP16 and patches for T5 (bf16→fp16 wrapper), VAE (→fp16), and DIT pre/post weights (fp32→fp16)
  • Wan 2.2 VAE: 48 latent channels with 16× spatial compression (vs 16 channels / 8× for Wan 2.1) — config must set vae_stride: [4,16,16] and num_channels_latents: 48
  • fp8 T5: Uses lightx2v/Encoders fp8 checkpoint (~4.9 GB vs ~11.4 GB bf16)
  • Blackwell (SM120): Needs _patch_fp8_scaled_mm_for_blackwell to replace sgl_kernel's fp8 GEMM
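
The VAE settings above determine the latent tensor the DIT actually sees. The sketch below works through that arithmetic for the 480×480, 81-frame case. It assumes the usual causal-VAE convention where the first frame is encoded on its own, so the latent frame count is (frames - 1) / temporal stride + 1. The helper name latent_shape is hypothetical and is not part of LightX2V.

```python
def latent_shape(frames, height, width, vae_stride, num_channels_latents):
    """Return (channels, latent_frames, latent_height, latent_width).

    Assumes a causal video VAE: the first frame maps to one latent frame,
    and every further group of `t_stride` frames maps to one more.
    """
    t_stride, h_stride, w_stride = vae_stride
    if (frames - 1) % t_stride or height % h_stride or width % w_stride:
        raise ValueError("input dimensions are not aligned to the VAE stride")
    return (
        num_channels_latents,
        (frames - 1) // t_stride + 1,
        height // h_stride,
        width // w_stride,
    )


# Wan 2.2 VAE settings from this README: 48 channels, stride [4, 16, 16]
wan22 = latent_shape(81, 480, 480, (4, 16, 16), 48)
# Wan 2.1 settings for comparison: 16 channels, 8x spatial compression
wan21 = latent_shape(81, 480, 480, (4, 8, 8), 16)

print("Wan 2.2 latent:", wan22)
print("Wan 2.1 latent:", wan21)
```

A config that keeps the Wan 2.1 values would produce latents with the wrong channel count and spatial size for this model, which is why both vae_stride and num_channels_latents have to change together.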

Why a separate container

Reuses the existing voice-chat-voice-chat image (LightX2V already installed) but runs under its own compose profile so it doesn't interfere with the live server volumes or startup. Shares the HF cache volume so model downloads are reused.

Running

# Ensure main image is built
docker compose build voice-chat

# Stage model (downloads base + Turbo Q8 GGUF, ~6 GB)
docker compose -f experimental/lightx2v_5b/docker-compose.yml --profile experimental \
    run --rm lightx2v-5b python /app/experimental/lightx2v_5b/setup_model.py

# Run benchmark
docker compose -f experimental/lightx2v_5b/docker-compose.yml --profile experimental \
    run --rm lightx2v-5b python /app/experimental/lightx2v_5b/test_i2v.py

Reports peak VRAM and wall time for an 81-frame 480p clip.
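
For readers who want to reproduce the two reported numbers outside the container, here is a minimal sketch of one way to capture wall time and peak VRAM around a generation call. It is not the code in test_i2v.py. The measure helper and the results dictionary are hypothetical names, and the sketch assumes PyTorch's CUDA allocator statistics are an acceptable proxy for peak VRAM (they do not include memory held outside the PyTorch allocator).

```python
import time
from contextlib import contextmanager

try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None


@contextmanager
def measure(label, results):
    """Record wall time and, when CUDA is available, peak allocated VRAM."""
    use_cuda = torch is not None and torch.cuda.is_available()
    if use_cuda:
        torch.cuda.synchronize()              # finish pending GPU work first
        torch.cuda.reset_peak_memory_stats()  # start the peak counter fresh
    start = time.perf_counter()
    try:
        yield
    finally:
        peak_gb = None
        if use_cuda:
            torch.cuda.synchronize()          # wait for kernels before timing
            peak_gb = torch.cuda.max_memory_allocated() / 1e9
        results[label] = {
            "seconds": time.perf_counter() - start,
            "peak_vram_gb": peak_gb,
        }


results = {}
with measure("generate", results):
    # Placeholder workload; the real call would be the pipeline's generate step.
    sum(range(100_000))
print(results)
```

Synchronizing before reading the clock matters because CUDA kernels launch asynchronously; without it the measured wall time can stop before the GPU has finished.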

Results

Metric           Value
Model load       ~43s
VRAM after load  6.53 GB
T5 encode        ~1s
VAE encode       ~0.5s

Wall time and peak VRAM are still pending: the full end-to-end benchmark has not completed yet.

Go / no-go criteria

  • Go: < 45s per 81-frame clip AND peak VRAM < 12 GB (leaves ~20 GB for the server)
  • No-go: either threshold is missed; keep the 14B MoE Q4_K_M pipeline

Baselines

  • vLLM-Omni + fp16 Turbo-5B: 1663s / 22.5 GB — decisive no-go
  • LightX2V + 14B MoE Q4_K_M: ~30s/clip, ~14.5 GB VRAM (current production pipeline)
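
The go / no-go rule can be written as a small check and applied to the baseline figures quoted above. This is a sketch only: the BenchmarkResult class and is_go function are hypothetical names, the approximate "~" figures are treated as exact, and the vLLM-Omni figure of 1663s is assumed to be seconds per clip.

```python
from dataclasses import dataclass

# Thresholds from the go / no-go criteria in this README
MAX_SECONDS_PER_CLIP = 45.0
MAX_PEAK_VRAM_GB = 12.0


@dataclass(frozen=True)
class BenchmarkResult:
    name: str
    seconds_per_clip: float
    peak_vram_gb: float


def is_go(result):
    """Go only if BOTH the time and the VRAM thresholds are met."""
    return (
        result.seconds_per_clip < MAX_SECONDS_PER_CLIP
        and result.peak_vram_gb < MAX_PEAK_VRAM_GB
    )


baselines = [
    BenchmarkResult("vLLM-Omni + fp16 Turbo-5B", 1663.0, 22.5),
    BenchmarkResult("LightX2V + 14B MoE Q4_K_M", 30.0, 14.5),
]

for result in baselines:
    verdict = "go" if is_go(result) else "no-go"
    print(f"{result.name}: {verdict}")
```

Under these thresholds the vLLM-Omni baseline misses both limits, while the production 14B pipeline meets the time limit but not the 12 GB VRAM limit. The VRAM threshold is therefore the one the 5B experiment has to beat to justify a switch.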