live-voice-chat/server/video_models/AGENT.md
2026-04-16 10:00:37 -04:00

Agent guide — server/video_models/

Wrappers around third-party video models. These are the trickiest files in the repo: LightX2V's internals change quickly upstream, and the Blackwell (RTX 5090 / SM120) GPU path needs several non-obvious patches layered on top. Read this before editing.

Scope

  • wan22.py — LightX2V Wan2.2-I2V A14B MoE pipeline. Supports fp8 safetensors and GGUF DIT checkpoints. Loaded once at startup, held resident; per-turn calls go through generate_i2v and switch_lora.
  • musetalk.py — MuseTalk lip-sync over base frames + TTS audio.
  • muxer.py — thin ffmpeg wrappers: frames → MP4 loop, frames + audio → MP4.

Nothing here is imported unless config.video.enabled is true.

LightX2V entry points (upstream API)

Use these symbols, not internal/private ones:

from lightx2v.utils.set_config import set_config
from lightx2v.utils.input_info import init_empty_input_info, update_input_info_from_dict
from lightx2v.infer import init_runner

config = set_config(args)                         # args is an argparse.Namespace
input_info = init_empty_input_info(args.task, args.support_tasks)
runner = init_runner(config)                      # loads all weights — ONCE

update_input_info_from_dict(input_info, {...})    # per-turn inputs
runner.run_pipeline(input_info)                   # MP4 written to save_result_path
runner.switch_lora(lora_path, strength)           # hot-swap

Keep model load out of the per-turn path — init_runner is expensive.
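
One way to honour this is to build the runner lazily, once, and hand the same object to every turn. The sketch below is illustrative only: ResidentRunner and its factory argument are hypothetical names, not part of wan22.py or LightX2V. In the real wrapper the factory would be whatever calls init_runner(config).

```python
import threading
from typing import Any, Callable, Optional


class ResidentRunner:
    """Hypothetical holder that builds an expensive runner once and reuses it."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory            # e.g. lambda: init_runner(config)
        self._runner: Optional[Any] = None
        self._lock = threading.Lock()      # guards against concurrent first calls

    def get(self) -> Any:
        with self._lock:
            if self._runner is None:
                self._runner = self._factory()   # expensive: happens once
            return self._runner
```

Per-turn code would then call holder.get() and only touch update_input_info_from_dict, run_pipeline, and switch_lora on the returned runner.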

Blackwell (SM120) patches — do not remove without testing

The GGUF pipeline works on a 5090 only because of layered patches in wan22.py and tuning in the LightX2V JSON configs under configs/lightx2v/. Each patch exists because a stock upstream path segfaults or silently miscomputes on SM120.

Dtype plumbing (GGUF path):

  • Default DTYPE must be BF16 at init_runner() time: the T5 offload buffers break if it is FP16 at init.
  • Flip the default from BF16 to FP16 after init_runner() returns.
  • Wrap the T5 encoder so it runs under BF16 internally, then cast its outputs bf16 → fp16 before handing them to the DIT. See _patch_t5_dtype_for_gguf.
  • Cast both VAE layers: the inner .model via .to(fp16), and the outer WanVAE wrapper's mean / inv_std / scale tensors. Skipping the wrapper tensors upcasts the latent in decode's z/inv_std + mean.
  • DIT pre_weight.patch_embedding.pin_weight loads as fp32 (only pin_bias is fp16). Cast it and re-pin via .pin_memory(); skipping the re-pin segfaults during the to_cuda H2D copy.
  • sgl_kernel's fp8 scaled matmul is patched to torch._scaled_mm in _patch_fp8_scaled_mm_for_blackwell.
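
The output-cast step of the T5 wrapping can be pictured as a thin decorator. This is a sketch under assumptions, not the body of _patch_t5_dtype_for_gguf: cast_outputs is a hypothetical helper, and it relies only on the returned objects having a .to(dtype) method, which torch.Tensor does.

```python
from typing import Any, Callable


def cast_outputs(encode: Callable[..., Any], dtype: Any) -> Callable[..., Any]:
    """Wrap `encode` so every tensor it returns is cast to `dtype`.

    Hypothetical helper. The encoder itself is left untouched, so it keeps
    computing in whatever dtype it was loaded with; only its outputs change.
    """

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        out = encode(*args, **kwargs)
        if isinstance(out, (list, tuple)):
            # Preserve the container type (list stays list, tuple stays tuple).
            return type(out)(t.to(dtype) for t in out)
        return out.to(dtype)

    return wrapped
```

With torch this would be applied as something like cast_outputs(encoder_fn, torch.float16), where encoder_fn stands in for whichever method the T5 wrapper actually exposes.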

LightX2V JSON config (see wan22_i2v_gguf_distill.json):

  • modulate_type: "torch" — Triton fuse_scale_shift_kernel segfaults in ast_to_ttir on Triton 3.4 + SM120.
  • rope_type: "torch" — flashinfer isn't installed.
  • self_attn_1_type / cross_attn_*_type: "torch_sdpa" — flash_attn3 unavailable; sageattention==1.0.6 from PyPI segfaults on Blackwell (newer requires source build).

If you add a new quant scheme or a new model_cls, create its own JSON under configs/lightx2v/ mirroring these choices, and exercise it end-to-end via a new tests/component/test_NN_*.py before wiring it into the default config.
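
A new JSON can be checked against the settings listed above before it goes anywhere near a GPU. The sketch below is a hypothetical helper, not something that exists in the repo; the key patterns and required values are copied from the list, and the wildcard in cross_attn_*_type is matched against whatever keys the config actually contains.

```python
import fnmatch
from typing import Any, Dict, List

# Key patterns and required values taken from the list above.
BLACKWELL_SAFE = {
    "modulate_type": "torch",
    "rope_type": "torch",
    "self_attn_1_type": "torch_sdpa",
    "cross_attn_*_type": "torch_sdpa",
}


def blackwell_violations(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the config passes."""
    problems = []
    for pattern, wanted in BLACKWELL_SAFE.items():
        matches = [k for k in config if fnmatch.fnmatchcase(k, pattern)]
        if not matches:
            problems.append(f"{pattern}: missing")
        for key in matches:
            if config[key] != wanted:
                problems.append(f"{key}: {config[key]!r}, expected {wanted!r}")
    return problems
```

Typical use would be to json.load a file under configs/lightx2v/ and fail the component test if the returned list is non-empty.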

HF download layout

Wan-AI/Wan2.2-I2V-A14B ships ~28 GB of bf16 DIT shards that we replace with the quantised dit_repo. BASE_REPO_IGNORE_PATTERNS in wan22.py excludes them but keeps high_noise_model/*.json and low_noise_model/*.json, because set_config parses architecture params (dim, etc.) from those. Don't broaden the ignore pattern without checking.
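
"Checking" can be as simple as running a candidate pattern list over a file listing and confirming the two JSON globs survive. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the real contents of BASE_REPO_IGNORE_PATTERNS are not reproduced here, and matching uses Python's fnmatch, where * also matches path separators.

```python
import fnmatch
from typing import Iterable, List

# Globs that must survive filtering, per the paragraph above.
MUST_KEEP = ["high_noise_model/*.json", "low_noise_model/*.json"]


def kept_files(files: Iterable[str], ignore_patterns: Iterable[str]) -> List[str]:
    """Files left after dropping anything that matches an ignore pattern."""
    patterns = list(ignore_patterns)
    return [
        f for f in files
        if not any(fnmatch.fnmatchcase(f, p) for p in patterns)
    ]


def dropped_required(files: Iterable[str], ignore_patterns: Iterable[str]) -> List[str]:
    """Required config files that the ignore patterns would wrongly exclude."""
    files = list(files)
    kept = set(kept_files(files, ignore_patterns))
    required = [
        f for f in files
        if any(fnmatch.fnmatchcase(f, p) for p in MUST_KEEP)
    ]
    return [f for f in required if f not in kept]
```

A pattern such as high_noise_model/* would show up here immediately, because it drops the JSON alongside the shards.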

Supported quant schemes live in wan22_dit_quant_scheme:

  • fp8-sgl: lightx2v/Wan2.2-Distill-Models, two .safetensors files
  • gguf-Q4_K_M, gguf-Q8_0, …: QuantStack/Wan2.2-I2V-A14B-GGUF, layout HighNoise/… and LowNoise/…

Filenames are templated at the top of wan22.py; update those if the upstream repos rename files.
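
The scheme-to-repo mapping above amounts to a small lookup. The function below is a hypothetical sketch of that lookup, not the code in wan22.py; it deliberately returns only the checkpoint format and repo id and leaves the filename templates out, since those are not listed here.

```python
from typing import Tuple

FP8_REPO = "lightx2v/Wan2.2-Distill-Models"
GGUF_REPO = "QuantStack/Wan2.2-I2V-A14B-GGUF"


def resolve_dit_repo(scheme: str) -> Tuple[str, str]:
    """Map a quant scheme name to (checkpoint format, Hugging Face repo id)."""
    if scheme == "fp8-sgl":
        return "safetensors", FP8_REPO
    # Any gguf-<quant> name (gguf-Q4_K_M, gguf-Q8_0, ...) shares one repo.
    if scheme.startswith("gguf-") and len(scheme) > len("gguf-"):
        return "gguf", GGUF_REPO
    raise ValueError(f"unsupported quant scheme: {scheme!r}")
```

Raising on an unknown scheme keeps a typo in config from silently falling through to a default checkpoint.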

Testing

  • tests/component/test_02_wan22_loras.py — full pipeline load + LoRA apply
  • tests/component/test_09_gguf_generate.py — GGUF end-to-end I2V
  • tests/component/test_10_t5_encode.py — T5 encoder dtype path
  • tests/component/test_11_image_encode.py — image → VAE latent
  • tests/component/test_12_dit_single_step.py — one DIT step per expert
  • tests/component/test_13_vae_decode.py — VAE decode → RGB

When diagnosing a Blackwell regression, run 10 → 11 → 12 → 13 in that order; the failure localises to the first failing stage.
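
The staged run can be scripted so it stops at the first failing stage. This is a sketch that assumes the component tests are run with pytest, which is not stated above; run_pytest and first_failing_stage are hypothetical names.

```python
import subprocess
import sys
from typing import Callable, Optional, Sequence

# Stage order from the list above: T5 encode, image encode, DIT step, VAE decode.
STAGES = (
    "tests/component/test_10_t5_encode.py",
    "tests/component/test_11_image_encode.py",
    "tests/component/test_12_dit_single_step.py",
    "tests/component/test_13_vae_decode.py",
)


def run_pytest(path: str) -> bool:
    """Run one test file with pytest (assumed runner); True means it passed."""
    result = subprocess.run([sys.executable, "-m", "pytest", "-x", path])
    return result.returncode == 0


def first_failing_stage(
    stages: Sequence[str] = STAGES,
    run: Callable[[str], bool] = run_pytest,
) -> Optional[str]:
    """Run stages in order and stop at the first failure (None if all pass)."""
    for path in stages:
        if not run(path):
            return path
    return None
```

Later stages are skipped on purpose: once an earlier stage fails, anything downstream of it is not informative.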

LoRAs

switch_lora(path, strength) applies a LoRA; switch_lora("", 0.0) removes it. load_loras/unload_loras in this wrapper iterate over the LoRASpecs from config and call switch_lora once per target sub-model (high_noise, low_noise, or both). A wrong target produces silently wrong output.
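
Because a wrong target fails silently, it can help to validate the target before any switch_lora call. The sketch below is hypothetical: this LoRASpec is a stand-in dataclass whose fields are assumed for illustration, and expand_targets is not a function in the wrapper.

```python
from dataclasses import dataclass
from typing import Tuple

VALID_TARGETS = ("high_noise", "low_noise", "both")


@dataclass(frozen=True)
class LoRASpec:
    """Hypothetical stand-in; the real LoRASpec fields may differ."""
    path: str
    strength: float
    target: str


def expand_targets(spec: LoRASpec) -> Tuple[str, ...]:
    """Sub-models a spec applies to; raises instead of guessing on a bad target."""
    if spec.target not in VALID_TARGETS:
        raise ValueError(f"unknown LoRA target {spec.target!r} for {spec.path}")
    if spec.target == "both":
        return ("high_noise", "low_noise")
    return (spec.target,)
```

A loader would call expand_targets(spec) first and then issue one switch_lora call per returned sub-model, so a misspelled target stops the load instead of producing wrong frames.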