# Agent guide — server/video_models/

Wrappers around 3rd-party video models. These are the trickiest files in the repo: LightX2V's internals move quickly upstream, and the Blackwell (RTX 5090 / SM120) GPU path requires several non-obvious patches layered on top. Read this before editing.

## Scope

- [wan22.py](wan22.py) — LightX2V Wan2.2-I2V A14B MoE pipeline. Supports fp8 safetensors and GGUF DIT checkpoints. Loaded once at startup, held resident; per-turn calls go through `generate_i2v` and `switch_lora`.
- [musetalk.py](musetalk.py) — MuseTalk lip-sync over base frames + TTS audio.
- [muxer.py](muxer.py) — thin ffmpeg wrappers: frames → MP4 loop, frames + audio → MP4.

Nothing here is imported unless `config.video.enabled` is true.

## LightX2V entry points (upstream API)

Use these symbols, not internal/private ones:

```python
from lightx2v.utils.set_config import set_config
from lightx2v.utils.input_info import init_empty_input_info, update_input_info_from_dict
from lightx2v.infer import init_runner

config = set_config(args)                        # args is an argparse.Namespace
input_info = init_empty_input_info(args.task, args.support_tasks)
runner = init_runner(config)                     # loads all weights — ONCE

update_input_info_from_dict(input_info, {...})   # per-turn inputs
runner.run_pipeline(input_info)                  # MP4 written to save_result_path
runner.switch_lora(lora_path, strength)          # hot-swap
```

Keep model load out of the per-turn path — `init_runner` is expensive.

## Blackwell (SM120) patches — do not remove without testing

The GGUF pipeline works on a 5090 only because of layered patches in `wan22.py` and tuning in the LightX2V JSON configs under [configs/lightx2v/](../../configs/lightx2v/). Each patch exists because a stock upstream path segfaults or silently miscomputes on SM120.

**Dtype plumbing (GGUF path):**

- Default `DTYPE` must be `BF16` at `init_runner()` time — T5 offload buffers break if FP16 at init.
- Flip `BF16 → FP16` *after* `init_runner()`.
- Wrap the T5 encoder so it runs under BF16 internally, then cast outputs `bf16 → fp16` before handing them to the DIT. See `_patch_t5_dtype_for_gguf`.
- Cast **both** VAE layers: the inner `.model` via `.to(fp16)` **and** the outer `WanVAE` wrapper's `mean` / `inv_std` / `scale` tensors. Missing the wrapper tensors upcasts the latent during decode's `z/inv_std + mean`.
- DIT `pre_weight.patch_embedding.pin_weight` loads as fp32 (only `pin_bias` is fp16). Cast **and** re-pin via `.pin_memory()` — skipping the re-pin segfaults during the `to_cuda` H2D copy.
- `sgl_kernel`'s fp8 scaled matmul is patched to `torch._scaled_mm` in `_patch_fp8_scaled_mm_for_blackwell`.

**LightX2V JSON config (see `wan22_i2v_gguf_distill.json`):**

- `modulate_type: "torch"` — Triton `fuse_scale_shift_kernel` segfaults in `ast_to_ttir` on Triton 3.4 + SM120.
- `rope_type: "torch"` — flashinfer isn't installed.
- `self_attn_1_type` / `cross_attn_*_type`: `"torch_sdpa"` — flash_attn3 is unavailable; `sageattention==1.0.6` from PyPI segfaults on Blackwell (newer versions require a source build).

If you add a new quant scheme or a new model_cls, create its own JSON under `configs/lightx2v/` mirroring these choices, and exercise it end-to-end via a new `tests/component/test_NN_*.py` before wiring it into the default config.

## HF download layout

`Wan-AI/Wan2.2-I2V-A14B` ships ~28 GB of bf16 DIT shards that we replace with the quantised `dit_repo`. `BASE_REPO_IGNORE_PATTERNS` in [wan22.py](wan22.py) excludes them but **keeps** `high_noise_model/*.json` and `low_noise_model/*.json` — `set_config` parses architecture params (`dim`, etc.) from those. Don't broaden the ignore pattern without checking.

Supported quant schemes live in `wan22_dit_quant_scheme`:

- `fp8-sgl` — `lightx2v/Wan2.2-Distill-Models`, two `.safetensors` files
- `gguf-Q4_K_M`, `gguf-Q8_0`, … — `QuantStack/Wan2.2-I2V-A14B-GGUF`, layout `HighNoise/…` and `LowNoise/…`

Filenames are templated at the top of `wan22.py`; update those if the upstream repos rename files.

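
Before broadening the ignore patterns described in this section, it can help to check a candidate list against the JSON files that `set_config` needs. The following is a minimal sketch, not the code in `wan22.py`: the pattern list, the `config.json` filenames, and the `wrongly_ignored` helper are all hypothetical, and it assumes the patterns are matched with `fnmatch`-style globbing (where `*` also matches `/`).

```python
from fnmatch import fnmatch

# Hypothetical candidate patterns, for illustration only.
# The real list is BASE_REPO_IGNORE_PATTERNS in wan22.py.
IGNORE_PATTERNS = [
    "high_noise_model/*.safetensors",
    "low_noise_model/*.safetensors",
]

# Files that must survive the filter. The directory names come from this
# guide; the "config.json" filename is an assumption.
REQUIRED = [
    "high_noise_model/config.json",
    "low_noise_model/config.json",
]


def wrongly_ignored(patterns, required):
    """Return the required paths that at least one ignore pattern would drop."""
    return [
        path
        for path in required
        if any(fnmatch(path, pattern) for pattern in patterns)
    ]


if __name__ == "__main__":
    dropped = wrongly_ignored(IGNORE_PATTERNS, REQUIRED)
    if dropped:
        raise SystemExit(f"ignore patterns would drop required files: {dropped}")
    print("all required JSON files are kept")
```

A pattern such as `high_noise_model/*` would fail this check, which is the kind of over-broad change the warning above is about.
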
## Testing

- `tests/component/test_02_wan22_loras.py` — full pipeline load + LoRA apply
- `tests/component/test_09_gguf_generate.py` — GGUF end-to-end I2V
- `tests/component/test_10_t5_encode.py` — T5 encoder dtype path
- `tests/component/test_11_image_encode.py` — image → VAE latent
- `tests/component/test_12_dit_single_step.py` — one DIT step per expert
- `tests/component/test_13_vae_decode.py` — VAE decode → RGB

When diagnosing a Blackwell regression, run 10 → 11 → 12 → 13 in that order; the failure localises to the first failing stage.

## LoRAs

`switch_lora(path, strength)` applies; `switch_lora("", 0.0)` removes. `load_loras`/`unload_loras` in this wrapper iterate over `LoRASpec`s from config and call `switch_lora` per `target` sub-model (`high_noise`, `low_noise`, or `both`). Wrong target = silently wrong output.

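
Because a wrong LoRA `target` produces silently wrong output, one option is to validate the target before any `switch_lora` call. The sketch below only illustrates that idea. `SketchLoRASpec` and `expand_target` are hypothetical names, and the field layout is an assumption; the real `LoRASpec` in this repo may differ.

```python
from dataclasses import dataclass

# The three target values named in this guide, mapped to the sub-models
# that would receive the LoRA.
TARGET_TO_SUBMODELS = {
    "high_noise": ("high_noise",),
    "low_noise": ("low_noise",),
    "both": ("high_noise", "low_noise"),
}


@dataclass(frozen=True)
class SketchLoRASpec:
    """Hypothetical stand-in for the repo's LoRASpec (fields are assumed)."""
    path: str
    strength: float
    target: str


def expand_target(spec):
    """Return the sub-model names for a spec, failing loudly on a bad target."""
    try:
        return TARGET_TO_SUBMODELS[spec.target]
    except KeyError:
        valid = ", ".join(sorted(TARGET_TO_SUBMODELS))
        raise ValueError(
            f"unknown LoRA target {spec.target!r} for {spec.path!r}; "
            f"expected one of: {valid}"
        ) from None


if __name__ == "__main__":
    spec = SketchLoRASpec(path="style.safetensors", strength=0.8, target="both")
    for submodel in expand_target(spec):
        # A real wrapper would call switch_lora for this sub-model here.
        print(f"would apply {spec.path} to {submodel} at strength {spec.strength}")
```

Raising on an unknown target turns a typo in config into an error at load time instead of a subtly degraded video.
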