Agent guide — server/video_models/
Wrappers around 3rd-party video models. These are the trickiest files in the repo: LightX2V's internals move quickly upstream, and the Blackwell (RTX 5090 / SM120) GPU path requires several non-obvious patches layered on top. Read this before editing.
Scope
- wan22.py — LightX2V Wan2.2-I2V A14B MoE pipeline. Supports fp8 safetensors and GGUF DIT checkpoints. Loaded once at startup, held resident; per-turn calls go through generate_i2v and switch_lora.
- musetalk.py — MuseTalk lip-sync over base frames + TTS audio.
- muxer.py — thin ffmpeg wrappers: frames → MP4 loop, frames + audio → MP4.
Nothing here is imported unless config.video.enabled is true.
LightX2V entry points (upstream API)
Use these symbols, not internal/private ones:
from lightx2v.utils.set_config import set_config
from lightx2v.utils.input_info import init_empty_input_info, update_input_info_from_dict
from lightx2v.infer import init_runner
config = set_config(args) # args is an argparse.Namespace
input_info = init_empty_input_info(args.task, args.support_tasks)
runner = init_runner(config) # loads all weights — ONCE
update_input_info_from_dict(input_info, {...}) # per-turn inputs
runner.run_pipeline(input_info) # MP4 written to save_result_path
runner.switch_lora(lora_path, strength) # hot-swap
Keep model load out of the per-turn path — init_runner is expensive.
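One way to keep the load out of the per-turn path is to build the runner once and hand the same object to every caller. The sketch below is illustrative only: ResidentRunner is a hypothetical holder, not part of LightX2V or this repo, and the build callable stands in for whatever wraps set_config and init_runner.

```python
import threading
from typing import Any, Callable, Optional


class ResidentRunner:
    """Hypothetical holder: builds an expensive runner once, then reuses it."""

    def __init__(self, build: Callable[[], Any]) -> None:
        self._build = build          # e.g. lambda: init_runner(config)
        self._runner: Optional[Any] = None
        self._lock = threading.Lock()

    def get(self) -> Any:
        # The lock keeps two concurrent first callers from loading weights twice.
        with self._lock:
            if self._runner is None:
                self._runner = self._build()
            return self._runner
```

Calling get() at startup pays the load cost up front; every later call returns the already-loaded object.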
Blackwell (SM120) patches — do not remove without testing
The GGUF pipeline works on a 5090 only because of layered patches in wan22.py and tuning in the LightX2V JSON configs under configs/lightx2v/. Each patch exists because a stock upstream path segfaults or silently miscomputes on SM120.
Dtype plumbing (GGUF path):
- Default DTYPE must be BF16 at init_runner() time — T5 offload buffers break if FP16 at init.
- Flip BF16 → FP16 after init_runner().
- Wrap T5 encoder so it runs under BF16 internally, then cast outputs bf16 → fp16 before handing to the DIT. See _patch_t5_dtype_for_gguf.
- Cast VAE both layers: the inner .model via .to(fp16) and the outer WanVAE wrapper's mean / inv_std / scale tensors. Missing the wrapper tensors upcasts the latent during decode's z / inv_std + mean.
- DIT pre_weight.patch_embedding.pin_weight loads as fp32 (only pin_bias is fp16). Cast and re-pin via .pin_memory() — skipping re-pin segfaults during to_cuda H2D copy.
- sgl_kernel's fp8 scaled matmul is patched to torch._scaled_mm in _patch_fp8_scaled_mm_for_blackwell.
LightX2V JSON config (see wan22_i2v_gguf_distill.json):
- modulate_type: "torch" — Triton fuse_scale_shift_kernel segfaults in ast_to_ttir on Triton 3.4 + SM120.
- rope_type: "torch" — flashinfer isn't installed.
- self_attn_1_type / cross_attn_*_type: "torch_sdpa" — flash_attn3 unavailable; sageattention==1.0.6 from PyPI segfaults on Blackwell (newer requires source build).
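When cloning one of these JSON files for a new scheme, a small guard can catch a kernel choice that drifted back to an upstream default. The following is a sketch, not code from this repo: check_sm120_config and SM120_SAFE are hypothetical names, the expected values are copied from the list above, and treating a missing key as a problem is an assumption (the upstream default may or may not be safe).

```python
import fnmatch
from typing import Dict, List

# Key patterns and expected values, taken from the list above.
# "cross_attn_*_type" is kept as a wildcard and matched with fnmatch.
SM120_SAFE: Dict[str, str] = {
    "modulate_type": "torch",
    "rope_type": "torch",
    "self_attn_1_type": "torch_sdpa",
    "cross_attn_*_type": "torch_sdpa",
}


def check_sm120_config(config: Dict[str, object]) -> List[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems: List[str] = []
    for pattern, expected in SM120_SAFE.items():
        matches = [key for key in config if fnmatch.fnmatchcase(key, pattern)]
        if not matches:
            # Assumption: an absent key is suspicious, so flag it.
            problems.append(f"{pattern}: missing")
        for key in matches:
            if config[key] != expected:
                problems.append(f"{key}: got {config[key]!r}, expected {expected!r}")
    return problems
```

Feeding it the result of json.load on the new config file and failing the component test on a non-empty list would surface the regression before the pipeline gets a chance to segfault.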
If you add a new quant scheme or a new model_cls, create its own JSON under configs/lightx2v/ mirroring these choices, and exercise it end-to-end via a new tests/component/test_NN_*.py before wiring it into the default config.
HF download layout
Wan-AI/Wan2.2-I2V-A14B ships ~28 GB of bf16 DIT shards we replace with the quantised dit_repo. BASE_REPO_IGNORE_PATTERNS in wan22.py excludes them but keeps high_noise_model/*.json and low_noise_model/*.json — set_config parses architecture params (dim, etc.) from those. Don't broaden the ignore pattern without checking.
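Before broadening the ignore pattern, it helps to check a candidate list against the repo's file listing. This sketch assumes the patterns are fnmatch-style globs (the semantics huggingface_hub documents for ignore_patterns); the helper names and every file name and pattern in the usage note are hypothetical, since the real contents of BASE_REPO_IGNORE_PATTERNS are not shown here.

```python
from fnmatch import fnmatch
from typing import Iterable, List

# The per-expert JSON files that set_config needs, per the paragraph above.
REQUIRED_GLOBS = ("high_noise_model/*.json", "low_noise_model/*.json")


def kept_files(repo_files: Iterable[str], ignore_patterns: Iterable[str]) -> List[str]:
    """Files that survive the ignore list."""
    patterns = list(ignore_patterns)
    return [f for f in repo_files if not any(fnmatch(f, p) for p in patterns)]


def dropped_required(repo_files: Iterable[str], ignore_patterns: Iterable[str]) -> List[str]:
    """Required JSON files that the ignore list would wrongly exclude."""
    files = list(repo_files)
    kept = set(kept_files(files, ignore_patterns))
    required = [f for f in files if any(fnmatch(f, g) for g in REQUIRED_GLOBS)]
    return sorted(f for f in required if f not in kept)
```

For example, a pattern such as "high_noise_model/*.safetensors" leaves the JSON alone, while "high_noise_model/*" would drop it, and dropped_required reports the casualty.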
Supported quant schemes live in wan22_dit_quant_scheme:
- fp8-sgl — lightx2v/Wan2.2-Distill-Models, two .safetensors files
- gguf-Q4_K_M, gguf-Q8_0, … — QuantStack/Wan2.2-I2V-A14B-GGUF, layout HighNoise/… and LowNoise/…
Filenames are templated at the top of wan22.py; update those if the upstream repos rename files.
Testing
- tests/component/test_02_wan22_loras.py — full pipeline load + LoRA apply
- tests/component/test_09_gguf_generate.py — GGUF end-to-end I2V
- tests/component/test_10_t5_encode.py — T5 encoder dtype path
- tests/component/test_11_image_encode.py — image → VAE latent
- tests/component/test_12_dit_single_step.py — one DIT step per expert
- tests/component/test_13_vae_decode.py — VAE decode → RGB
When diagnosing a Blackwell regression, run 10 → 11 → 12 → 13 in that order; the failure localises to the first failing stage.
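The staged order can be scripted so the run stops at the first broken stage. This is a sketch that assumes the component tests are run with pytest (the guide does not say which runner is used); first_failing_stage and run_pytest are hypothetical helper names.

```python
import subprocess
import sys
from typing import Callable, Optional, Sequence

# Stage order from the paragraph above: T5 -> image encode -> DIT step -> VAE decode.
STAGES = (
    "tests/component/test_10_t5_encode.py",
    "tests/component/test_11_image_encode.py",
    "tests/component/test_12_dit_single_step.py",
    "tests/component/test_13_vae_decode.py",
)


def run_pytest(path: str) -> bool:
    """Run one test file; True if it passed. Assumes pytest is the test runner."""
    result = subprocess.run([sys.executable, "-m", "pytest", "-x", path])
    return result.returncode == 0


def first_failing_stage(
    stages: Sequence[str] = STAGES,
    run: Callable[[str], bool] = run_pytest,
) -> Optional[str]:
    """Run stages in order and return the first one that fails, or None."""
    for path in stages:
        if not run(path):
            return path  # later stages depend on this one, so stop here
    return None
```

Calling first_failing_stage() from the repo root returns the path of the stage to investigate. A segfault inside a stage shows up as a non-zero exit code from the subprocess, so it is reported like any other failure instead of killing the script.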
LoRAs
switch_lora(path, strength) applies a LoRA; switch_lora("", 0.0) removes it. load_loras / unload_loras in this wrapper iterate over LoRASpecs from config and call switch_lora per target sub-model (high_noise, low_noise, or both). A wrong target fails silently: nothing errors, the output is just wrong.
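Because a wrong target produces no error, one defensive option is to validate the target string before any switch_lora call. The sketch below is an assumption-heavy illustration: the LoRASpecSketch fields and resolve_targets are hypothetical and do not reflect the real LoRASpec in this repo; only the three target names come from the text above.

```python
from dataclasses import dataclass
from typing import Tuple

# Target names from the paragraph above, mapped to the sub-models they cover.
TARGETS = {
    "high_noise": ("high_noise",),
    "low_noise": ("low_noise",),
    "both": ("high_noise", "low_noise"),
}


@dataclass(frozen=True)
class LoRASpecSketch:
    """Hypothetical stand-in for the repo's LoRASpec; field names are assumed."""
    path: str
    strength: float
    target: str


def resolve_targets(spec: LoRASpecSketch) -> Tuple[str, ...]:
    """Return the sub-models a spec applies to, raising on an unknown target."""
    try:
        return TARGETS[spec.target]
    except KeyError:
        valid = ", ".join(sorted(TARGETS))
        raise ValueError(
            f"unknown LoRA target {spec.target!r} for {spec.path!r}; expected one of: {valid}"
        ) from None
```

This only catches misspelled targets. A spec that names a valid but incorrect sub-model still has to be caught by inspecting the generated output.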