# Agent guide — server/video_models/

Wrappers around third-party video models. These are the trickiest files in the repo: LightX2V's internals move quickly upstream, and the Blackwell (RTX 5090 / SM120) GPU path requires several non-obvious patches layered on top. Read this before editing.

## Scope

- [wan22.py](wan22.py) — LightX2V Wan2.2-I2V A14B MoE pipeline. Supports fp8 safetensors and GGUF DIT checkpoints. Loaded once at startup, held resident; per-turn calls go through `generate_i2v` and `switch_lora`.
- [musetalk.py](musetalk.py) — MuseTalk lip-sync over base frames + TTS audio.
- [muxer.py](muxer.py) — thin ffmpeg wrappers: frames → MP4 loop, frames + audio → MP4.

Nothing here is imported unless `config.video.enabled` is true.

## LightX2V entry points (upstream API)

Use these symbols, not internal/private ones:

```python
from lightx2v.utils.set_config import set_config
from lightx2v.utils.input_info import init_empty_input_info, update_input_info_from_dict
from lightx2v.infer import init_runner

config = set_config(args)                         # args is an argparse.Namespace
input_info = init_empty_input_info(args.task, args.support_tasks)
runner = init_runner(config)                      # loads all weights — ONCE

update_input_info_from_dict(input_info, {...})    # per-turn inputs
runner.run_pipeline(input_info)                   # MP4 written to save_result_path
runner.switch_lora(lora_path, strength)           # hot-swap
```

Keep model load out of the per-turn path — `init_runner` is expensive.
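
A minimal sketch of the load-once pattern, assuming the expensive call is wrapped in a zero-argument factory. `RunnerHolder` and `build` are hypothetical names for illustration, not part of LightX2V or this repo:

```python
import threading
from typing import Any, Callable, Optional


class RunnerHolder:
    """Builds the runner on the first get() and reuses it afterwards."""

    def __init__(self, build: Callable[[], Any]) -> None:
        # `build` is a hypothetical factory, e.g. lambda: init_runner(config)
        self._build = build
        self._runner: Optional[Any] = None
        self._lock = threading.Lock()

    def get(self) -> Any:
        # The lock stops two concurrent first calls from both loading weights.
        with self._lock:
            if self._runner is None:
                self._runner = self._build()
            return self._runner
```

Calling `get()` once at startup keeps the load cost out of the first turn; later turns only pay for the lock and an attribute read.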

## Blackwell (SM120) patches — do not remove without testing

The GGUF pipeline works on a 5090 only because of layered patches in `wan22.py` and tuning in the LightX2V JSON configs under [configs/lightx2v/](../../configs/lightx2v/). Each patch exists because a stock upstream path segfaults or silently miscomputes on SM120.

**Dtype plumbing (GGUF path):**

- `DTYPE` must be `BF16` when `init_runner()` runs; the T5 offload buffers break if it is already `FP16` at init.
- Flip `DTYPE` from `BF16` to `FP16` only *after* `init_runner()` returns.
- Wrap T5 encoder so it runs under BF16 internally, then cast outputs `bf16 → fp16` before handing to the DIT. See `_patch_t5_dtype_for_gguf`.
- Cast **both** VAE layers: the inner `.model` via `.to(fp16)` **and** the outer `WanVAE` wrapper's `mean` / `inv_std` / `scale` tensors. If the wrapper tensors are missed, decode's `z/inv_std + mean` upcasts the latent.
- DIT `pre_weight.patch_embedding.pin_weight` loads as fp32 (only `pin_bias` is fp16). Cast it **and** re-pin via `.pin_memory()`; skipping the re-pin segfaults during the `to_cuda` H2D copy.
- `sgl_kernel`'s fp8 scaled matmul is patched to `torch._scaled_mm` in `_patch_fp8_scaled_mm_for_blackwell`.
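
The "cast outputs before handing to the DIT" step can be pictured as a plain function wrapper. This is a sketch only and not the body of `_patch_t5_dtype_for_gguf`; `cast_outputs` and its parameters are hypothetical names, and it does not show how the encoder is kept in BF16 internally:

```python
import functools
from typing import Any, Callable


def cast_outputs(encode: Callable[..., Any],
                 cast: Callable[[Any], Any]) -> Callable[..., Any]:
    """Return a wrapper that passes every output of `encode` through `cast`."""

    @functools.wraps(encode)
    def wrapped(*args: Any, **kwargs: Any) -> Any:
        out = encode(*args, **kwargs)
        # Encoders may return one tensor or a list/tuple of tensors.
        if isinstance(out, (list, tuple)):
            return type(out)(cast(item) for item in out)
        return cast(out)

    return wrapped
```

With PyTorch tensors, `cast` would be something like `lambda t: t.to(torch.float16)`, applied to whichever encoder method the wrapper actually calls.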

**LightX2V JSON config (see `wan22_i2v_gguf_distill.json`):**

- `modulate_type: "torch"` — Triton `fuse_scale_shift_kernel` segfaults in `ast_to_ttir` on Triton 3.4 + SM120.
- `rope_type: "torch"` — flashinfer isn't installed.
- `self_attn_1_type` / `cross_attn_*_type`: `"torch_sdpa"` — flash_attn3 unavailable; `sageattention==1.0.6` from PyPI segfaults on Blackwell (newer requires source build).

If you add a new quant scheme or a new model_cls, create its own JSON under `configs/lightx2v/` mirroring these choices, and exercise it end-to-end via a new `tests/component/test_NN_*.py` before wiring it into the default config.
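
When adding a new JSON, a small check like the following can flag keys that drift from the choices listed above. It is a sketch: `blackwell_violations` is a hypothetical helper, the expected values are copied from the bullets, and `cross_attn_*_type` is matched as a wildcard because the exact key names are not spelled out here:

```python
import fnmatch

# Key pattern -> value the Blackwell path needs (from the bullets above).
REQUIRED = {
    "modulate_type": "torch",
    "rope_type": "torch",
    "self_attn_1_type": "torch_sdpa",
    "cross_attn_*_type": "torch_sdpa",
}


def blackwell_violations(config: dict) -> list[str]:
    """Return one message per missing or mismatched key; empty means OK."""
    problems = []
    for pattern, expected in REQUIRED.items():
        matches = [k for k in config if fnmatch.fnmatchcase(k, pattern)]
        if not matches:
            problems.append(f"{pattern}: missing")
        for key in matches:
            if config[key] != expected:
                problems.append(
                    f"{key}: {config[key]!r}, expected {expected!r}"
                )
    return problems
```

Feed it the dict from `json.load` on the new file; an empty list means the config mirrors the settings above, not that the pipeline runs, so the component test is still required.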

## HF download layout

`Wan-AI/Wan2.2-I2V-A14B` ships ~28 GB of bf16 DIT shards we replace with the quantised `dit_repo`. `BASE_REPO_IGNORE_PATTERNS` in [wan22.py](wan22.py) excludes them but **keeps** `high_noise_model/*.json` and `low_noise_model/*.json` — `set_config` parses architecture params (`dim`, etc.) from those. Don't broaden the ignore pattern without checking.
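
Before changing the ignore list, it can help to test candidate patterns against a file listing. The sketch below assumes patterns are matched fnmatch-style (which is how `huggingface_hub` treats `ignore_patterns`); `dropped_configs` is a hypothetical helper and the file names in the usage note are made up:

```python
from fnmatch import fnmatch

# The JSON files that set_config needs, per the paragraph above.
REQUIRED_GLOBS = ("high_noise_model/*.json", "low_noise_model/*.json")


def dropped_configs(repo_files: list[str],
                    ignore_patterns: list[str]) -> list[str]:
    """Return required config files that the ignore patterns would exclude."""
    return [
        path
        for path in repo_files
        if any(fnmatch(path, glob) for glob in REQUIRED_GLOBS)
        and any(fnmatch(path, pat) for pat in ignore_patterns)
    ]
```

For example, a pattern such as `high_noise_model/*` would report `high_noise_model/config.json` as dropped, while a pattern limited to `*.safetensors` would not. Note that fnmatch's `*` also matches `/`.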

Supported quant schemes live in `wan22_dit_quant_scheme`:

- `fp8-sgl` — `lightx2v/Wan2.2-Distill-Models`, two `.safetensors` files
- `gguf-Q4_K_M`, `gguf-Q8_0`, … — `QuantStack/Wan2.2-I2V-A14B-GGUF`, layout `HighNoise/…` and `LowNoise/…`

Filenames are templated at the top of `wan22.py`; update those if the upstream repos rename files.
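
The scheme-to-repo mapping above can be sketched as a small lookup. `dit_repo_for` is a hypothetical helper, and treating every `gguf-` prefixed scheme as supported is an assumption; the real list and the filename templates live in `wan22.py`:

```python
def dit_repo_for(scheme: str) -> str:
    """Map a wan22_dit_quant_scheme value to the HF repo holding the DIT."""
    if scheme == "fp8-sgl":
        return "lightx2v/Wan2.2-Distill-Models"
    # Assumption: any "gguf-<quant>" scheme resolves to the same GGUF repo.
    if scheme.startswith("gguf-"):
        return "QuantStack/Wan2.2-I2V-A14B-GGUF"
    raise ValueError(f"unsupported wan22_dit_quant_scheme: {scheme!r}")
```

Raising on unknown schemes keeps a typo in config from falling through to a default repo.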

## Testing

- `tests/component/test_02_wan22_loras.py` — full pipeline load + LoRA apply
- `tests/component/test_09_gguf_generate.py` — GGUF end-to-end I2V
- `tests/component/test_10_t5_encode.py` — T5 encoder dtype path
- `tests/component/test_11_image_encode.py` — image → VAE latent
- `tests/component/test_12_dit_single_step.py` — one DIT step per expert
- `tests/component/test_13_vae_decode.py` — VAE decode → RGB

When diagnosing a Blackwell regression, run `test_10` → `test_11` → `test_12` → `test_13` in that order; the first stage to fail localises the regression.
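
The staged run can be scripted roughly as follows. This is a sketch that assumes the component tests run under `pytest`; `run_pytest` and `first_failing_stage` are hypothetical helpers. Each stage runs in its own process so a segfault in one stage shows up as a non-zero exit code instead of killing the driver:

```python
import subprocess
import sys
from typing import Callable, Optional, Sequence

STAGES = (
    "tests/component/test_10_t5_encode.py",
    "tests/component/test_11_image_encode.py",
    "tests/component/test_12_dit_single_step.py",
    "tests/component/test_13_vae_decode.py",
)


def run_pytest(path: str) -> bool:
    """Run one test file in a fresh process; True means it passed."""
    result = subprocess.run([sys.executable, "-m", "pytest", "-x", path])
    return result.returncode == 0


def first_failing_stage(
    stages: Sequence[str] = STAGES,
    run: Callable[[str], bool] = run_pytest,
) -> Optional[str]:
    """Run stages in order and return the first that fails, else None."""
    for path in stages:
        if not run(path):
            return path
    return None
```

Calling `first_failing_stage()` from the repo root returns the path of the first failing stage, or `None` if all four pass.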

## LoRAs

`switch_lora(path, strength)` applies a LoRA; `switch_lora("", 0.0)` removes it. `load_loras` / `unload_loras` in this wrapper iterate over the `LoRASpec`s from config and call `switch_lora` once per `target` sub-model (`high_noise`, `low_noise`, or `both`). A wrong target raises no error; the output is just silently wrong.
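
Because a wrong target fails silently, validating it up front is cheap insurance. The sketch below is not the wrapper's actual code: `SpecSketch` is a stand-in for `LoRASpec` with assumed fields, and `expand_target` / `plan_switch_calls` are hypothetical helpers that only plan the calls without touching a runner:

```python
from dataclasses import dataclass

SUB_MODELS = ("high_noise", "low_noise")


@dataclass(frozen=True)
class SpecSketch:
    """Stand-in for LoRASpec; the field names are assumptions."""
    path: str
    strength: float
    target: str


def expand_target(target: str) -> tuple[str, ...]:
    """Turn a target into the sub-models it covers, rejecting typos."""
    if target == "both":
        return SUB_MODELS
    if target in SUB_MODELS:
        return (target,)
    raise ValueError(f"unknown LoRA target: {target!r}")


def plan_switch_calls(specs: list[SpecSketch]) -> list[tuple[str, str, float]]:
    """List (sub_model, path, strength) for each switch_lora call to make."""
    return [
        (sub_model, spec.path, spec.strength)
        for spec in specs
        for sub_model in expand_target(spec.target)
    ]
```

A spec with `target="both"` yields two planned calls, one per expert; a misspelled target raises instead of quietly producing the wrong output.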