Compare commits
36 Commits
train_codes
...
main
| Author | SHA1 | Date | |
|---|---|---|---|
| ca5b7a8f28 | |||
| 0a89dec45a | |||
| 8c19579b1e | |||
| 9deb9bea0d | |||
| 6e39bd0d00 | |||
| 26ca7c2c03 | |||
| 8ca7d1884c | |||
| 67e7ee3c73 | |||
| 36163fccbd | |||
| 7b829ba778 | |||
| 1ab53a626b | |||
| e636166b85 | |||
| 23d88dcfb9 | |||
| 39ccf69f36 | |||
| fbe6a97dff | |||
| 7f258556b2 | |||
| 53b0e012a4 | |||
| f55237f683 | |||
| 17a93b2ff6 | |||
| 6255496f80 | |||
| 6151bf4ab2 | |||
| db204311a5 | |||
| 058f7ddc7f | |||
| 2141f07945 | |||
| 879f89d524 | |||
| a83f24440a | |||
| 0e1b9a75f3 | |||
| 31ee3e4d12 | |||
| 58350f9452 | |||
| 1c164aae34 | |||
| 6d2e7ac3e8 | |||
| d59f6d778a | |||
| 8bb985f25d | |||
| f051221f76 | |||
| cf72088c48 | |||
| 53a2b7e46a |
+11
-3
@@ -4,7 +4,15 @@
|
|||||||
.vscode/
|
.vscode/
|
||||||
*.pyc
|
*.pyc
|
||||||
.ipynb_checkpoints
|
.ipynb_checkpoints
|
||||||
models
|
|
||||||
results/
|
results/
|
||||||
data/audio/*.wav
|
/models/
|
||||||
data/video/*.mp4
|
**/__pycache__/
|
||||||
|
*.py[cod]
|
||||||
|
*$py.class
|
||||||
|
dataset/
|
||||||
|
ffmpeg*
|
||||||
|
ffmprobe*
|
||||||
|
ffplay*
|
||||||
|
debug
|
||||||
|
exp_out
|
||||||
|
.gradio
|
||||||
@@ -1,20 +1,31 @@
|
|||||||
# MuseTalk
|
# MuseTalk
|
||||||
|
|
||||||
MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting
|
<strong>MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling</strong>
|
||||||
</br>
|
|
||||||
Yue Zhang <sup>\*</sup>,
|
Yue Zhang<sup>\*</sup>,
|
||||||
|
Zhizhou Zhong<sup>\*</sup>,
|
||||||
Minhao Liu<sup>\*</sup>,
|
Minhao Liu<sup>\*</sup>,
|
||||||
Zhaokang Chen,
|
Zhaokang Chen,
|
||||||
Bin Wu<sup>†</sup>,
|
Bin Wu<sup>†</sup>,
|
||||||
Yingjie He,
|
Yubin Zeng,
|
||||||
Chao Zhan,
|
Chao Zhan,
|
||||||
|
Junxin Huang,
|
||||||
|
Yingjie He,
|
||||||
Wenjiang Zhou
|
Wenjiang Zhou
|
||||||
(<sup>*</sup>Equal Contribution, <sup>†</sup>Corresponding Author, benbinwu@tencent.com)
|
(<sup>*</sup>Equal Contribution, <sup>†</sup>Corresponding Author, benbinwu@tencent.com)
|
||||||
|
|
||||||
**[github](https://github.com/TMElyralab/MuseTalk)** **[huggingface](https://huggingface.co/TMElyralab/MuseTalk)** **[space](https://huggingface.co/spaces/TMElyralab/MuseTalk)** **Project (comming soon)** **Technical report (comming soon)**
|
Lyra Lab, Tencent Music Entertainment
|
||||||
|
|
||||||
|
**[github](https://github.com/TMElyralab/MuseTalk)** **[huggingface](https://huggingface.co/TMElyralab/MuseTalk)** **[space](https://huggingface.co/spaces/TMElyralab/MuseTalk)** **[Technical report](https://arxiv.org/abs/2410.10122)**
|
||||||
|
|
||||||
We introduce `MuseTalk`, a **real-time high quality** lip-syncing model (30fps+ on an NVIDIA Tesla V100). MuseTalk can be applied with input videos, e.g., generated by [MuseV](https://github.com/TMElyralab/MuseV), as a complete virtual human solution.
|
We introduce `MuseTalk`, a **real-time high quality** lip-syncing model (30fps+ on an NVIDIA Tesla V100). MuseTalk can be applied with input videos, e.g., generated by [MuseV](https://github.com/TMElyralab/MuseV), as a complete virtual human solution.
|
||||||
|
|
||||||
|
## 🔥 Updates
|
||||||
|
We're excited to unveil MuseTalk 1.5.
|
||||||
|
This version **(1)** integrates training with perceptual loss, GAN loss, and sync loss, significantly boosting its overall performance. **(2)** We've implemented a two-stage training strategy and a spatio-temporal data sampling approach to strike a balance between visual quality and lip-sync accuracy.
|
||||||
|
Learn more details [here](https://arxiv.org/abs/2410.10122).
|
||||||
|
**The inference codes, training codes and model weights of MuseTalk 1.5 are all available now!** 🚀
|
||||||
|
|
||||||
# Overview
|
# Overview
|
||||||
`MuseTalk` is a real-time high quality audio-driven lip-syncing model trained in the latent space of `ft-mse-vae`, which
|
`MuseTalk` is a real-time high quality audio-driven lip-syncing model trained in the latent space of `ft-mse-vae`, which
|
||||||
|
|
||||||
@@ -22,22 +33,384 @@ We introduce `MuseTalk`, a **real-time high quality** lip-syncing model (30fps+
|
|||||||
1. supports audio in various languages, such as Chinese, English, and Japanese.
|
1. supports audio in various languages, such as Chinese, English, and Japanese.
|
||||||
1. supports real-time inference with 30fps+ on an NVIDIA Tesla V100.
|
1. supports real-time inference with 30fps+ on an NVIDIA Tesla V100.
|
||||||
1. supports modification of the center point of the face region proposes, which **SIGNIFICANTLY** affects generation results.
|
1. supports modification of the center point of the face region proposes, which **SIGNIFICANTLY** affects generation results.
|
||||||
1. checkpoint available trained on the HDTF dataset.
|
1. checkpoint available trained on the HDTF and private dataset.
|
||||||
1. training codes (comming soon).
|
|
||||||
|
|
||||||
# News
|
# News
|
||||||
- [04/02/2024] Release MuseTalk project and pretrained models.
|
- [04/05/2025] :mega: We are excited to announce that the training code is now open-sourced! You can now train your own MuseTalk model using our provided training scripts and configurations.
|
||||||
|
- [03/28/2025] We are thrilled to announce the release of our 1.5 version. This version is a significant improvement over the 1.0 version, with enhanced clarity, identity consistency, and precise lip-speech synchronization. We update the [technical report](https://arxiv.org/abs/2410.10122) with more details.
|
||||||
|
- [10/18/2024] We release the [technical report](https://arxiv.org/abs/2410.10122v2). Our report details a superior model to the open-source L1 loss version. It includes GAN and perceptual losses for improved clarity, and sync loss for enhanced performance.
|
||||||
|
- [04/17/2024] We release a pipeline that utilizes MuseTalk for real-time inference.
|
||||||
- [04/16/2024] Release Gradio [demo](https://huggingface.co/spaces/TMElyralab/MuseTalk) on HuggingFace Spaces (thanks to HF team for their community grant)
|
- [04/16/2024] Release Gradio [demo](https://huggingface.co/spaces/TMElyralab/MuseTalk) on HuggingFace Spaces (thanks to HF team for their community grant)
|
||||||
- [04/17/2024] :mega: We release a pipeline that utilizes MuseTalk for real-time inference.
|
- [04/02/2024] Release MuseTalk project and pretrained models.
|
||||||
|
|
||||||
|
|
||||||
## Model
|
## Model
|
||||||

|

|
||||||
MuseTalk was trained in latent spaces, where the images were encoded by a freezed VAE. The audio was encoded by a freezed `whisper-tiny` model. The architecture of the generation network was borrowed from the UNet of the `stable-diffusion-v1-4`, where the audio embeddings were fused to the image embeddings by cross-attention.
|
MuseTalk was trained in latent spaces, where the images were encoded by a freezed VAE. The audio was encoded by a freezed `whisper-tiny` model. The architecture of the generation network was borrowed from the UNet of the `stable-diffusion-v1-4`, where the audio embeddings were fused to the image embeddings by cross-attention.
|
||||||
|
|
||||||
Note that although we use a very similar architecture as Stable Diffusion, MuseTalk is distinct in that it is **NOT** a diffusion model. Instead, MuseTalk operates by inpainting in the latent space with a single step.
|
Note that although we use a very similar architecture as Stable Diffusion, MuseTalk is distinct in that it is **NOT** a diffusion model. Instead, MuseTalk operates by inpainting in the latent space with a single step.
|
||||||
|
|
||||||
## Cases
|
## Cases
|
||||||
### MuseV + MuseTalk make human photos alive!
|
|
||||||
|
<table>
|
||||||
|
<tr>
|
||||||
|
<td width="33%">
|
||||||
|
|
||||||
|
### Input Video
|
||||||
|
---
|
||||||
|
https://github.com/TMElyralab/MuseTalk/assets/163980830/37a3a666-7b90-4244-8d3a-058cb0e44107
|
||||||
|
|
||||||
|
---
|
||||||
|
https://github.com/user-attachments/assets/1ce3e850-90ac-4a31-a45f-8dfa4f2960ac
|
||||||
|
|
||||||
|
---
|
||||||
|
https://github.com/user-attachments/assets/fa3b13a1-ae26-4d1d-899e-87435f8d22b3
|
||||||
|
|
||||||
|
---
|
||||||
|
https://github.com/user-attachments/assets/15800692-39d1-4f4c-99f2-aef044dc3251
|
||||||
|
|
||||||
|
---
|
||||||
|
https://github.com/user-attachments/assets/a843f9c9-136d-4ed4-9303-4a7269787a60
|
||||||
|
|
||||||
|
---
|
||||||
|
https://github.com/user-attachments/assets/6eb4e70e-9e19-48e9-85a9-bbfa589c5fcb
|
||||||
|
|
||||||
|
</td>
|
||||||
|
<td width="33%">
|
||||||
|
|
||||||
|
### MuseTalk 1.0
|
||||||
|
---
|
||||||
|
https://github.com/user-attachments/assets/c04f3cd5-9f77-40e9-aafd-61978380d0ef
|
||||||
|
|
||||||
|
---
|
||||||
|
https://github.com/user-attachments/assets/2051a388-1cef-4c1d-b2a2-3c1ceee5dc99
|
||||||
|
|
||||||
|
---
|
||||||
|
https://github.com/user-attachments/assets/b5f56f71-5cdc-4e2e-a519-454242000d32
|
||||||
|
|
||||||
|
---
|
||||||
|
https://github.com/user-attachments/assets/a5843835-04ab-4c31-989f-0995cfc22f34
|
||||||
|
|
||||||
|
---
|
||||||
|
https://github.com/user-attachments/assets/3dc7f1d7-8747-4733-bbdd-97874af0c028
|
||||||
|
|
||||||
|
---
|
||||||
|
https://github.com/user-attachments/assets/3c78064e-faad-4637-83ae-28452a22b09a
|
||||||
|
|
||||||
|
</td>
|
||||||
|
<td width="33%">
|
||||||
|
|
||||||
|
### MuseTalk 1.5
|
||||||
|
---
|
||||||
|
https://github.com/user-attachments/assets/999a6f5b-61dd-48e1-b902-bb3f9cbc7247
|
||||||
|
|
||||||
|
---
|
||||||
|
https://github.com/user-attachments/assets/d26a5c9a-003c-489d-a043-c9a331456e75
|
||||||
|
|
||||||
|
---
|
||||||
|
https://github.com/user-attachments/assets/471290d7-b157-4cf6-8a6d-7e899afa302c
|
||||||
|
|
||||||
|
---
|
||||||
|
https://github.com/user-attachments/assets/1ee77c4c-8c70-4add-b6db-583a12faa7dc
|
||||||
|
|
||||||
|
---
|
||||||
|
https://github.com/user-attachments/assets/370510ea-624c-43b7-bbb0-ab5333e0fcc4
|
||||||
|
|
||||||
|
---
|
||||||
|
https://github.com/user-attachments/assets/b011ece9-a332-4bc1-b8b7-ef6e383d7bde
|
||||||
|
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
|
||||||
|
# TODO:
|
||||||
|
- [x] trained models and inference codes.
|
||||||
|
- [x] Huggingface Gradio [demo](https://huggingface.co/spaces/TMElyralab/MuseTalk).
|
||||||
|
- [x] codes for real-time inference.
|
||||||
|
- [x] [technical report](https://arxiv.org/abs/2410.10122v2).
|
||||||
|
- [x] a better model with updated [technical report](https://arxiv.org/abs/2410.10122).
|
||||||
|
- [x] realtime inference code for 1.5 version.
|
||||||
|
- [x] training and data preprocessing codes.
|
||||||
|
- [ ] **always** welcome to submit issues and PRs to improve this repository! 😊
|
||||||
|
|
||||||
|
|
||||||
|
# Getting Started
|
||||||
|
We provide a detailed tutorial about the installation and the basic usage of MuseTalk for new users:
|
||||||
|
|
||||||
|
## Third party integration
|
||||||
|
Thanks for the third-party integration, which makes installation and use more convenient for everyone.
|
||||||
|
We also hope you note that we have not verified, maintained, or updated third-party. Please refer to this project for specific results.
|
||||||
|
|
||||||
|
### [ComfyUI](https://github.com/chaojie/ComfyUI-MuseTalk)
|
||||||
|
|
||||||
|
## Installation
|
||||||
|
To prepare the Python environment and install additional packages such as opencv, diffusers, mmcv, etc., please follow the steps below:
|
||||||
|
|
||||||
|
### Build environment
|
||||||
|
We recommend Python 3.10 and CUDA 11.7. Set up your environment as follows:
|
||||||
|
|
||||||
|
```shell
|
||||||
|
conda create -n MuseTalk python==3.10
|
||||||
|
conda activate MuseTalk
|
||||||
|
```
|
||||||
|
|
||||||
|
### Install PyTorch 2.0.1
|
||||||
|
Choose one of the following installation methods:
|
||||||
|
|
||||||
|
```shell
|
||||||
|
# Option 1: Using pip
|
||||||
|
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
|
||||||
|
|
||||||
|
# Option 2: Using conda
|
||||||
|
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
|
||||||
|
```
|
||||||
|
|
||||||
|
### Install Dependencies
|
||||||
|
Install the remaining required packages:
|
||||||
|
|
||||||
|
```shell
|
||||||
|
pip install -r requirements.txt
|
||||||
|
```
|
||||||
|
|
||||||
|
### Install MMLab Packages
|
||||||
|
Install the MMLab ecosystem packages:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pip install --no-cache-dir -U openmim
|
||||||
|
mim install mmengine
|
||||||
|
mim install "mmcv==2.0.1"
|
||||||
|
mim install "mmdet==3.1.0"
|
||||||
|
mim install "mmpose==1.1.0"
|
||||||
|
```
|
||||||
|
|
||||||
|
### Setup FFmpeg
|
||||||
|
1. [Download](https://github.com/BtbN/FFmpeg-Builds/releases) the ffmpeg-static package
|
||||||
|
|
||||||
|
2. Configure FFmpeg based on your operating system:
|
||||||
|
|
||||||
|
For Linux:
|
||||||
|
```bash
|
||||||
|
export FFMPEG_PATH=/path/to/ffmpeg
|
||||||
|
# Example:
|
||||||
|
export FFMPEG_PATH=/musetalk/ffmpeg-4.4-amd64-static
|
||||||
|
```
|
||||||
|
|
||||||
|
For Windows:
|
||||||
|
Add the `ffmpeg-xxx\bin` directory to your system's PATH environment variable. Verify the installation by running `ffmpeg -version` in the command prompt - it should display the ffmpeg version information.
|
||||||
|
|
||||||
|
### Download weights
|
||||||
|
You can download weights in two ways:
|
||||||
|
|
||||||
|
#### Option 1: Using Download Scripts
|
||||||
|
We provide two scripts for automatic downloading:
|
||||||
|
|
||||||
|
For Linux:
|
||||||
|
```bash
|
||||||
|
sh ./download_weights.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
For Windows:
|
||||||
|
```batch
|
||||||
|
# Run the script
|
||||||
|
download_weights.bat
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Option 2: Manual Download
|
||||||
|
You can also download the weights manually from the following links:
|
||||||
|
|
||||||
|
1. Download our trained [weights](https://huggingface.co/TMElyralab/MuseTalk/tree/main)
|
||||||
|
2. Download the weights of other components:
|
||||||
|
- [sd-vae-ft-mse](https://huggingface.co/stabilityai/sd-vae-ft-mse/tree/main)
|
||||||
|
- [whisper](https://huggingface.co/openai/whisper-tiny/tree/main)
|
||||||
|
- [dwpose](https://huggingface.co/yzd-v/DWPose/tree/main)
|
||||||
|
- [syncnet](https://huggingface.co/ByteDance/LatentSync/tree/main)
|
||||||
|
- [face-parse-bisent](https://drive.google.com/file/d/154JgKpzCPW82qINcVieuPH3fZ2e0P812/view?pli=1)
|
||||||
|
- [resnet18](https://download.pytorch.org/models/resnet18-5c106cde.pth)
|
||||||
|
|
||||||
|
Finally, these weights should be organized in `models` as follows:
|
||||||
|
```
|
||||||
|
./models/
|
||||||
|
├── musetalk
|
||||||
|
│ └── musetalk.json
|
||||||
|
│ └── pytorch_model.bin
|
||||||
|
├── musetalkV15
|
||||||
|
│ └── musetalk.json
|
||||||
|
│ └── unet.pth
|
||||||
|
├── syncnet
|
||||||
|
│ └── latentsync_syncnet.pt
|
||||||
|
├── dwpose
|
||||||
|
│ └── dw-ll_ucoco_384.pth
|
||||||
|
├── face-parse-bisent
|
||||||
|
│ ├── 79999_iter.pth
|
||||||
|
│ └── resnet18-5c106cde.pth
|
||||||
|
├── sd-vae
|
||||||
|
│ ├── config.json
|
||||||
|
│ └── diffusion_pytorch_model.bin
|
||||||
|
└── whisper
|
||||||
|
├── config.json
|
||||||
|
├── pytorch_model.bin
|
||||||
|
└── preprocessor_config.json
|
||||||
|
|
||||||
|
```
|
||||||
|
## Quickstart
|
||||||
|
|
||||||
|
### Inference
|
||||||
|
We provide inference scripts for both versions of MuseTalk:
|
||||||
|
|
||||||
|
#### Prerequisites
|
||||||
|
Before running inference, please ensure ffmpeg is installed and accessible:
|
||||||
|
```bash
|
||||||
|
# Check ffmpeg installation
|
||||||
|
ffmpeg -version
|
||||||
|
```
|
||||||
|
If ffmpeg is not found, please install it first:
|
||||||
|
- Windows: Download from [ffmpeg-static](https://github.com/BtbN/FFmpeg-Builds/releases) and add to PATH
|
||||||
|
- Linux: `sudo apt-get install ffmpeg`
|
||||||
|
|
||||||
|
#### Normal Inference
|
||||||
|
##### Linux Environment
|
||||||
|
```bash
|
||||||
|
# MuseTalk 1.5 (Recommended)
|
||||||
|
sh inference.sh v1.5 normal
|
||||||
|
|
||||||
|
# MuseTalk 1.0
|
||||||
|
sh inference.sh v1.0 normal
|
||||||
|
```
|
||||||
|
|
||||||
|
##### Windows Environment
|
||||||
|
|
||||||
|
Please ensure that you set the `ffmpeg_path` to match the actual location of your FFmpeg installation.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# MuseTalk 1.5 (Recommended)
|
||||||
|
python -m scripts.inference --inference_config configs\inference\test.yaml --result_dir results\test --unet_model_path models\musetalkV15\unet.pth --unet_config models\musetalkV15\musetalk.json --version v15 --ffmpeg_path ffmpeg-master-latest-win64-gpl-shared\bin
|
||||||
|
|
||||||
|
# For MuseTalk 1.0, change:
|
||||||
|
# - models\musetalkV15 -> models\musetalk
|
||||||
|
# - unet.pth -> pytorch_model.bin
|
||||||
|
# - --version v15 -> --version v1
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Real-time Inference
|
||||||
|
##### Linux Environment
|
||||||
|
```bash
|
||||||
|
# MuseTalk 1.5 (Recommended)
|
||||||
|
sh inference.sh v1.5 realtime
|
||||||
|
|
||||||
|
# MuseTalk 1.0
|
||||||
|
sh inference.sh v1.0 realtime
|
||||||
|
```
|
||||||
|
|
||||||
|
##### Windows Environment
|
||||||
|
```bash
|
||||||
|
# MuseTalk 1.5 (Recommended)
|
||||||
|
python -m scripts.realtime_inference --inference_config configs\inference\realtime.yaml --result_dir results\realtime --unet_model_path models\musetalkV15\unet.pth --unet_config models\musetalkV15\musetalk.json --version v15 --fps 25 --ffmpeg_path ffmpeg-master-latest-win64-gpl-shared\bin
|
||||||
|
|
||||||
|
# For MuseTalk 1.0, change:
|
||||||
|
# - models\musetalkV15 -> models\musetalk
|
||||||
|
# - unet.pth -> pytorch_model.bin
|
||||||
|
# - --version v15 -> --version v1
|
||||||
|
```
|
||||||
|
|
||||||
|
The configuration file `configs/inference/test.yaml` contains the inference settings, including:
|
||||||
|
- `video_path`: Path to the input video, image file, or directory of images
|
||||||
|
- `audio_path`: Path to the input audio file
|
||||||
|
|
||||||
|
Note: For optimal results, we recommend using input videos with 25fps, which is the same fps used during model training. If your video has a lower frame rate, you can use frame interpolation or convert it to 25fps using ffmpeg.
|
||||||
|
|
||||||
|
Important notes for real-time inference:
|
||||||
|
1. Set `preparation` to `True` when processing a new avatar
|
||||||
|
2. After preparation, the avatar will generate videos using audio clips from `audio_clips`
|
||||||
|
3. The generation process can achieve 30fps+ on an NVIDIA Tesla V100
|
||||||
|
4. Set `preparation` to `False` for generating more videos with the same avatar
|
||||||
|
|
||||||
|
For faster generation without saving images, you can use:
|
||||||
|
```bash
|
||||||
|
python -m scripts.realtime_inference --inference_config configs/inference/realtime.yaml --skip_save_images
|
||||||
|
```
|
||||||
|
|
||||||
|
## Gradio Demo
|
||||||
|
We provide an intuitive web interface through Gradio for users to easily adjust input parameters. To optimize inference time, users can generate only the **first frame** to fine-tune the best lip-sync parameters, which helps reduce facial artifacts in the final output.
|
||||||
|

|
||||||
|
For minimum hardware requirements, we tested the system on a Windows environment using an NVIDIA GeForce RTX 3050 Ti Laptop GPU with 4GB VRAM. In fp16 mode, generating an 8-second video takes approximately 5 minutes. 
|
||||||
|
|
||||||
|
Both Linux and Windows users can launch the demo using the following command. Please ensure that the `ffmpeg_path` parameter matches your actual FFmpeg installation path:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# You can remove --use_float16 for better quality, but it will increase VRAM usage and inference time
|
||||||
|
python app.py --use_float16 --ffmpeg_path ffmpeg-master-latest-win64-gpl-shared\bin
|
||||||
|
```
|
||||||
|
|
||||||
|
## Training
|
||||||
|
|
||||||
|
### Data Preparation
|
||||||
|
To train MuseTalk, you need to prepare your dataset following these steps:
|
||||||
|
|
||||||
|
1. **Place your source videos**
|
||||||
|
|
||||||
|
For example, if you're using the HDTF dataset, place all your video files in `./dataset/HDTF/source`.
|
||||||
|
|
||||||
|
2. **Run the preprocessing script**
|
||||||
|
```bash
|
||||||
|
python -m scripts.preprocess --config ./configs/training/preprocess.yaml
|
||||||
|
```
|
||||||
|
This script will:
|
||||||
|
- Extract frames from videos
|
||||||
|
- Detect and align faces
|
||||||
|
- Generate audio features
|
||||||
|
- Create the necessary data structure for training
|
||||||
|
|
||||||
|
### Training Process
|
||||||
|
After data preprocessing, you can start the training process:
|
||||||
|
|
||||||
|
1. **First Stage**
|
||||||
|
```bash
|
||||||
|
sh train.sh stage1
|
||||||
|
```
|
||||||
|
|
||||||
|
2. **Second Stage**
|
||||||
|
```bash
|
||||||
|
sh train.sh stage2
|
||||||
|
```
|
||||||
|
|
||||||
|
### Configuration Adjustment
|
||||||
|
Before starting the training, you should adjust the configuration files according to your hardware and requirements:
|
||||||
|
|
||||||
|
1. **GPU Configuration** (`configs/training/gpu.yaml`):
|
||||||
|
- `gpu_ids`: Specify the GPU IDs you want to use (e.g., "0,1,2,3")
|
||||||
|
- `num_processes`: Set this to match the number of GPUs you're using
|
||||||
|
|
||||||
|
2. **Stage 1 Configuration** (`configs/training/stage1.yaml`):
|
||||||
|
- `data.train_bs`: Adjust batch size based on your GPU memory (default: 32)
|
||||||
|
- `data.n_sample_frames`: Number of sampled frames per video (default: 1)
|
||||||
|
|
||||||
|
3. **Stage 2 Configuration** (`configs/training/stage2.yaml`):
|
||||||
|
- `random_init_unet`: Must be set to `False` to use the model from stage 1
|
||||||
|
- `data.train_bs`: Smaller batch size due to high GPU memory cost (default: 2)
|
||||||
|
- `data.n_sample_frames`: Higher value for temporal consistency (default: 16)
|
||||||
|
- `solver.gradient_accumulation_steps`: Increase to simulate larger batch sizes (default: 8)
|
||||||
|
|
||||||
|
|
||||||
|
### GPU Memory Requirements
|
||||||
|
Based on our testing on a machine with 8 NVIDIA H20 GPUs:
|
||||||
|
|
||||||
|
#### Stage 1 Memory Usage
|
||||||
|
| Batch Size | Gradient Accumulation | Memory per GPU | Recommendation |
|
||||||
|
|:----------:|:----------------------:|:--------------:|:--------------:|
|
||||||
|
| 8 | 1 | ~32GB | |
|
||||||
|
| 16 | 1 | ~45GB | |
|
||||||
|
| 32 | 1 | ~74GB | ✓ |
|
||||||
|
|
||||||
|
#### Stage 2 Memory Usage
|
||||||
|
| Batch Size | Gradient Accumulation | Memory per GPU | Recommendation |
|
||||||
|
|:----------:|:----------------------:|:--------------:|:--------------:|
|
||||||
|
| 1 | 8 | ~54GB | |
|
||||||
|
| 2 | 2 | ~80GB | |
|
||||||
|
| 2 | 8 | ~85GB | ✓ |
|
||||||
|
|
||||||
|
<details close>
|
||||||
|
## TestCases For 1.0
|
||||||
<table class="center">
|
<table class="center">
|
||||||
<tr style="font-weight: bolder;text-align:center;">
|
<tr style="font-weight: bolder;text-align:center;">
|
||||||
<td width="33%">Image</td>
|
<td width="33%">Image</td>
|
||||||
@@ -123,132 +496,7 @@ Note that although we use a very similar architecture as Stable Diffusion, MuseT
|
|||||||
</tr>
|
</tr>
|
||||||
</table >
|
</table >
|
||||||
|
|
||||||
* The character of the last two rows, `Xinying Sun`, is a supermodel KOL. You can follow her on [douyin](https://www.douyin.com/user/MS4wLjABAAAAWDThbMPN_6Xmm_JgXexbOii1K-httbu2APdG8DvDyM8).
|
#### Use of bbox_shift to have adjustable results(For 1.0)
|
||||||
|
|
||||||
## Video dubbing
|
|
||||||
<table class="center">
|
|
||||||
<tr style="font-weight: bolder;text-align:center;">
|
|
||||||
<td width="70%">MuseTalk</td>
|
|
||||||
<td width="30%">Original videos</td>
|
|
||||||
</tr>
|
|
||||||
<tr>
|
|
||||||
<td>
|
|
||||||
<video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/4d7c5fa1-3550-4d52-8ed2-52f158150f24 controls preload></video>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
<a href="//www.bilibili.com/video/BV1wT411b7HU">Link</a>
|
|
||||||
<href src=""></href>
|
|
||||||
</td>
|
|
||||||
</tr>
|
|
||||||
</table>
|
|
||||||
|
|
||||||
* For video dubbing, we applied a self-developed tool which can identify the talking person.
|
|
||||||
|
|
||||||
## Some interesting videos!
|
|
||||||
<table class="center">
|
|
||||||
<tr style="font-weight: bolder;text-align:center;">
|
|
||||||
<td width="50%">Image</td>
|
|
||||||
<td width="50%">MuseV + MuseTalk</td>
|
|
||||||
</tr>
|
|
||||||
<tr>
|
|
||||||
<td>
|
|
||||||
<img src=assets/demo/video1/video1.png width="95%">
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
<video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/1f02f9c6-8b98-475e-86b8-82ebee82fe0d controls preload></video>
|
|
||||||
</td>
|
|
||||||
</tr>
|
|
||||||
</table>
|
|
||||||
|
|
||||||
# TODO:
|
|
||||||
- [x] trained models and inference codes.
|
|
||||||
- [x] Huggingface Gradio [demo](https://huggingface.co/spaces/TMElyralab/MuseTalk).
|
|
||||||
- [x] codes for real-time inference.
|
|
||||||
- [ ] technical report.
|
|
||||||
- [ ] training codes.
|
|
||||||
- [ ] a better model (may take longer).
|
|
||||||
|
|
||||||
|
|
||||||
# Getting Started
|
|
||||||
We provide a detailed tutorial about the installation and the basic usage of MuseTalk for new users:
|
|
||||||
|
|
||||||
## Third party integration
|
|
||||||
Thanks for the third-party integration, which makes installation and use more convenient for everyone.
|
|
||||||
We also hope you note that we have not verified, maintained, or updated third-party. Please refer to this project for specific results.
|
|
||||||
|
|
||||||
### [ComfyUI](https://github.com/chaojie/ComfyUI-MuseTalk)
|
|
||||||
|
|
||||||
## Installation
|
|
||||||
To prepare the Python environment and install additional packages such as opencv, diffusers, mmcv, etc., please follow the steps below:
|
|
||||||
### Build environment
|
|
||||||
|
|
||||||
We recommend a python version >=3.10 and cuda version =11.7. Then build environment as follows:
|
|
||||||
|
|
||||||
```shell
|
|
||||||
pip install -r requirements.txt
|
|
||||||
```
|
|
||||||
|
|
||||||
### mmlab packages
|
|
||||||
```bash
|
|
||||||
pip install --no-cache-dir -U openmim
|
|
||||||
mim install mmengine
|
|
||||||
mim install "mmcv>=2.0.1"
|
|
||||||
mim install "mmdet>=3.1.0"
|
|
||||||
mim install "mmpose>=1.1.0"
|
|
||||||
```
|
|
||||||
|
|
||||||
### Download ffmpeg-static
|
|
||||||
Download the ffmpeg-static and
|
|
||||||
```
|
|
||||||
export FFMPEG_PATH=/path/to/ffmpeg
|
|
||||||
```
|
|
||||||
for example:
|
|
||||||
```
|
|
||||||
export FFMPEG_PATH=/musetalk/ffmpeg-4.4-amd64-static
|
|
||||||
```
|
|
||||||
### Download weights
|
|
||||||
You can download weights manually as follows:
|
|
||||||
|
|
||||||
1. Download our trained [weights](https://huggingface.co/TMElyralab/MuseTalk).
|
|
||||||
|
|
||||||
2. Download the weights of other components:
|
|
||||||
- [sd-vae-ft-mse](https://huggingface.co/stabilityai/sd-vae-ft-mse)
|
|
||||||
- [whisper](https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt)
|
|
||||||
- [dwpose](https://huggingface.co/yzd-v/DWPose/tree/main)
|
|
||||||
- [face-parse-bisent](https://github.com/zllrunning/face-parsing.PyTorch)
|
|
||||||
- [resnet18](https://download.pytorch.org/models/resnet18-5c106cde.pth)
|
|
||||||
|
|
||||||
|
|
||||||
Finally, these weights should be organized in `models` as follows:
|
|
||||||
```
|
|
||||||
./models/
|
|
||||||
├── musetalk
|
|
||||||
│ └── musetalk.json
|
|
||||||
│ └── pytorch_model.bin
|
|
||||||
├── dwpose
|
|
||||||
│ └── dw-ll_ucoco_384.pth
|
|
||||||
├── face-parse-bisent
|
|
||||||
│ ├── 79999_iter.pth
|
|
||||||
│ └── resnet18-5c106cde.pth
|
|
||||||
├── sd-vae-ft-mse
|
|
||||||
│ ├── config.json
|
|
||||||
│ └── diffusion_pytorch_model.bin
|
|
||||||
└── whisper
|
|
||||||
└── tiny.pt
|
|
||||||
```
|
|
||||||
## Quickstart
|
|
||||||
|
|
||||||
### Inference
|
|
||||||
Here, we provide the inference script.
|
|
||||||
```
|
|
||||||
python -m scripts.inference --inference_config configs/inference/test.yaml
|
|
||||||
```
|
|
||||||
configs/inference/test.yaml is the path to the inference configuration file, including video_path and audio_path.
|
|
||||||
The video_path should be either a video file, an image file or a directory of images.
|
|
||||||
|
|
||||||
You are recommended to input video with `25fps`, the same fps used when training the model. If your video is far less than 25fps, you are recommended to apply frame interpolation or directly convert the video to 25fps using ffmpeg.
|
|
||||||
|
|
||||||
#### Use of bbox_shift to have adjustable results
|
|
||||||
:mag_right: We have found that upper-bound of the mask has an important impact on mouth openness. Thus, to control the mask region, we suggest using the `bbox_shift` parameter. Positive values (moving towards the lower half) increase mouth openness, while negative values (moving towards the upper half) decrease mouth openness.
|
:mag_right: We have found that upper-bound of the mask has an important impact on mouth openness. Thus, to control the mask region, we suggest using the `bbox_shift` parameter. Positive values (moving towards the lower half) increase mouth openness, while negative values (moving towards the upper half) decrease mouth openness.
|
||||||
|
|
||||||
You can start by running with the default configuration to obtain the adjustable value range, and then re-run the script within this range.
|
You can start by running with the default configuration to obtain the adjustable value range, and then re-run the script within this range.
|
||||||
@@ -259,36 +507,13 @@ python -m scripts.inference --inference_config configs/inference/test.yaml --bbo
|
|||||||
```
|
```
|
||||||
:pushpin: More technical details can be found in [bbox_shift](assets/BBOX_SHIFT.md).
|
:pushpin: More technical details can be found in [bbox_shift](assets/BBOX_SHIFT.md).
|
||||||
|
|
||||||
|
|
||||||
#### Combining MuseV and MuseTalk
|
#### Combining MuseV and MuseTalk
|
||||||
|
|
||||||
As a complete solution to virtual human generation, you are suggested to first apply [MuseV](https://github.com/TMElyralab/MuseV) to generate a video (text-to-video, image-to-video or pose-to-video) by referring [this](https://github.com/TMElyralab/MuseV?tab=readme-ov-file#text2video). Frame interpolation is suggested to increase frame rate. Then, you can use `MuseTalk` to generate a lip-sync video by referring [this](https://github.com/TMElyralab/MuseTalk?tab=readme-ov-file#inference).
|
As a complete solution to virtual human generation, you are suggested to first apply [MuseV](https://github.com/TMElyralab/MuseV) to generate a video (text-to-video, image-to-video or pose-to-video) by referring [this](https://github.com/TMElyralab/MuseV?tab=readme-ov-file#text2video). Frame interpolation is suggested to increase frame rate. Then, you can use `MuseTalk` to generate a lip-sync video by referring [this](https://github.com/TMElyralab/MuseTalk?tab=readme-ov-file#inference).
|
||||||
|
|
||||||
#### :new: Real-time inference
|
|
||||||
|
|
||||||
Here, we provide the inference script. This script first applies necessary pre-processing such as face detection, face parsing and VAE encode in advance. During inference, only UNet and the VAE decoder are involved, which makes MuseTalk real-time.
|
|
||||||
|
|
||||||
```
|
|
||||||
python -m scripts.realtime_inference --inference_config configs/inference/realtime.yaml --batch_size 4
|
|
||||||
```
|
|
||||||
configs/inference/realtime.yaml is the path to the real-time inference configuration file, including `preparation`, `video_path` , `bbox_shift` and `audio_clips`.
|
|
||||||
|
|
||||||
1. Set `preparation` to `True` in `realtime.yaml` to prepare the materials for a new `avatar`. (If the `bbox_shift` has changed, you also need to re-prepare the materials.)
|
|
||||||
1. After that, the `avatar` will use an audio clip selected from `audio_clips` to generate video.
|
|
||||||
```
|
|
||||||
Inferring using: data/audio/yongen.wav
|
|
||||||
```
|
|
||||||
1. While MuseTalk is inferring, sub-threads can simultaneously stream the results to the users. The generation process can achieve 30fps+ on an NVIDIA Tesla V100.
|
|
||||||
1. Set `preparation` to `False` and run this script if you want to genrate more videos using the same avatar.
|
|
||||||
|
|
||||||
##### Note for Real-time inference
|
|
||||||
1. If you want to generate multiple videos using the same avatar/video, you can also use this script to **SIGNIFICANTLY** expedite the generation process.
|
|
||||||
1. In the previous script, the generation time is also limited by I/O (e.g. saving images). If you just want to test the generation speed without saving the images, you can run
|
|
||||||
```
|
|
||||||
python -m scripts.realtime_inference --inference_config configs/inference/realtime.yaml --skip_save_images
|
|
||||||
```
|
|
||||||
|
|
||||||
# Acknowledgement
|
# Acknowledgement
|
||||||
1. We thank open-source components like [whisper](https://github.com/openai/whisper), [dwpose](https://github.com/IDEA-Research/DWPose), [face-alignment](https://github.com/1adrianb/face-alignment), [face-parsing](https://github.com/zllrunning/face-parsing.PyTorch), [S3FD](https://github.com/yxlijun/S3FD.pytorch).
|
1. We thank open-source components like [whisper](https://github.com/openai/whisper), [dwpose](https://github.com/IDEA-Research/DWPose), [face-alignment](https://github.com/1adrianb/face-alignment), [face-parsing](https://github.com/zllrunning/face-parsing.PyTorch), [S3FD](https://github.com/yxlijun/S3FD.pytorch) and [LatentSync](https://huggingface.co/ByteDance/LatentSync/tree/main).
|
||||||
1. MuseTalk has referred much to [diffusers](https://github.com/huggingface/diffusers) and [isaacOnline/whisper](https://github.com/isaacOnline/whisper/tree/extract-embeddings).
|
1. MuseTalk has referred much to [diffusers](https://github.com/huggingface/diffusers) and [isaacOnline/whisper](https://github.com/isaacOnline/whisper/tree/extract-embeddings).
|
||||||
1. MuseTalk has been built on [HDTF](https://github.com/MRzzm/HDTF) datasets.
|
1. MuseTalk has been built on [HDTF](https://github.com/MRzzm/HDTF) datasets.
|
||||||
|
|
||||||
@@ -305,10 +530,10 @@ If you need higher resolution, you could apply super resolution models such as [
|
|||||||
# Citation
|
# Citation
|
||||||
```bib
|
```bib
|
||||||
@article{musetalk,
|
@article{musetalk,
|
||||||
title={MuseTalk: Real-Time High Quality Lip Synchorization with Latent Space Inpainting},
|
title={MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling},
|
||||||
author={Zhang, Yue and Liu, Minhao and Chen, Zhaokang and Wu, Bin and He, Yingjie and Zhan, Chao and Zhou, Wenjiang},
|
author={Zhang, Yue and Zhong, Zhizhou and Liu, Minhao and Chen, Zhaokang and Wu, Bin and Zeng, Yubin and Zhan, Chao and He, Yingjie and Huang, Junxin and Zhou, Wenjiang},
|
||||||
journal={arxiv},
|
journal={arxiv},
|
||||||
year={2024}
|
year={2025}
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
# Disclaimer/License
|
# Disclaimer/License
|
||||||
|
|||||||
@@ -4,7 +4,6 @@ import pdb
|
|||||||
import re
|
import re
|
||||||
|
|
||||||
import gradio as gr
|
import gradio as gr
|
||||||
import spaces
|
|
||||||
import numpy as np
|
import numpy as np
|
||||||
import sys
|
import sys
|
||||||
import subprocess
|
import subprocess
|
||||||
@@ -28,11 +27,101 @@ import gdown
|
|||||||
import imageio
|
import imageio
|
||||||
import ffmpeg
|
import ffmpeg
|
||||||
from moviepy.editor import *
|
from moviepy.editor import *
|
||||||
|
from transformers import WhisperModel
|
||||||
|
|
||||||
ProjectDir = os.path.abspath(os.path.dirname(__file__))
|
ProjectDir = os.path.abspath(os.path.dirname(__file__))
|
||||||
CheckpointsDir = os.path.join(ProjectDir, "models")
|
CheckpointsDir = os.path.join(ProjectDir, "models")
|
||||||
|
|
||||||
|
@torch.no_grad()
|
||||||
|
def debug_inpainting(video_path, bbox_shift, extra_margin=10, parsing_mode="jaw",
|
||||||
|
left_cheek_width=90, right_cheek_width=90):
|
||||||
|
"""Debug inpainting parameters, only process the first frame"""
|
||||||
|
# Set default parameters
|
||||||
|
args_dict = {
|
||||||
|
"result_dir": './results/debug',
|
||||||
|
"fps": 25,
|
||||||
|
"batch_size": 1,
|
||||||
|
"output_vid_name": '',
|
||||||
|
"use_saved_coord": False,
|
||||||
|
"audio_padding_length_left": 2,
|
||||||
|
"audio_padding_length_right": 2,
|
||||||
|
"version": "v15",
|
||||||
|
"extra_margin": extra_margin,
|
||||||
|
"parsing_mode": parsing_mode,
|
||||||
|
"left_cheek_width": left_cheek_width,
|
||||||
|
"right_cheek_width": right_cheek_width
|
||||||
|
}
|
||||||
|
args = Namespace(**args_dict)
|
||||||
|
|
||||||
|
# Create debug directory
|
||||||
|
os.makedirs(args.result_dir, exist_ok=True)
|
||||||
|
|
||||||
|
# Read first frame
|
||||||
|
if get_file_type(video_path) == "video":
|
||||||
|
reader = imageio.get_reader(video_path)
|
||||||
|
first_frame = reader.get_data(0)
|
||||||
|
reader.close()
|
||||||
|
else:
|
||||||
|
first_frame = cv2.imread(video_path)
|
||||||
|
first_frame = cv2.cvtColor(first_frame, cv2.COLOR_BGR2RGB)
|
||||||
|
|
||||||
|
# Save first frame
|
||||||
|
debug_frame_path = os.path.join(args.result_dir, "debug_frame.png")
|
||||||
|
cv2.imwrite(debug_frame_path, cv2.cvtColor(first_frame, cv2.COLOR_RGB2BGR))
|
||||||
|
|
||||||
|
# Get face coordinates
|
||||||
|
coord_list, frame_list = get_landmark_and_bbox([debug_frame_path], bbox_shift)
|
||||||
|
bbox = coord_list[0]
|
||||||
|
frame = frame_list[0]
|
||||||
|
|
||||||
|
if bbox == coord_placeholder:
|
||||||
|
return None, "No face detected, please adjust bbox_shift parameter"
|
||||||
|
|
||||||
|
# Initialize face parser
|
||||||
|
fp = FaceParsing(
|
||||||
|
left_cheek_width=args.left_cheek_width,
|
||||||
|
right_cheek_width=args.right_cheek_width
|
||||||
|
)
|
||||||
|
|
||||||
|
# Process first frame
|
||||||
|
x1, y1, x2, y2 = bbox
|
||||||
|
y2 = y2 + args.extra_margin
|
||||||
|
y2 = min(y2, frame.shape[0])
|
||||||
|
crop_frame = frame[y1:y2, x1:x2]
|
||||||
|
crop_frame = cv2.resize(crop_frame,(256,256),interpolation = cv2.INTER_LANCZOS4)
|
||||||
|
|
||||||
|
# Generate random audio features
|
||||||
|
random_audio = torch.randn(1, 50, 384, device=device, dtype=weight_dtype)
|
||||||
|
audio_feature = pe(random_audio)
|
||||||
|
|
||||||
|
# Get latents
|
||||||
|
latents = vae.get_latents_for_unet(crop_frame)
|
||||||
|
latents = latents.to(dtype=weight_dtype)
|
||||||
|
|
||||||
|
# Generate prediction results
|
||||||
|
pred_latents = unet.model(latents, timesteps, encoder_hidden_states=audio_feature).sample
|
||||||
|
recon = vae.decode_latents(pred_latents)
|
||||||
|
|
||||||
|
# Inpaint back to original image
|
||||||
|
res_frame = recon[0]
|
||||||
|
res_frame = cv2.resize(res_frame.astype(np.uint8),(x2-x1,y2-y1))
|
||||||
|
combine_frame = get_image(frame, res_frame, [x1, y1, x2, y2], mode=args.parsing_mode, fp=fp)
|
||||||
|
|
||||||
|
# Save results (no need to convert color space again since get_image already returns RGB format)
|
||||||
|
debug_result_path = os.path.join(args.result_dir, "debug_result.png")
|
||||||
|
cv2.imwrite(debug_result_path, combine_frame)
|
||||||
|
|
||||||
|
# Create information text
|
||||||
|
info_text = f"Parameter information:\n" + \
|
||||||
|
f"bbox_shift: {bbox_shift}\n" + \
|
||||||
|
f"extra_margin: {extra_margin}\n" + \
|
||||||
|
f"parsing_mode: {parsing_mode}\n" + \
|
||||||
|
f"left_cheek_width: {left_cheek_width}\n" + \
|
||||||
|
f"right_cheek_width: {right_cheek_width}\n" + \
|
||||||
|
f"Detected face coordinates: [{x1}, {y1}, {x2}, {y2}]"
|
||||||
|
|
||||||
|
return cv2.cvtColor(combine_frame, cv2.COLOR_RGB2BGR), info_text
|
||||||
|
|
||||||
def print_directory_contents(path):
|
def print_directory_contents(path):
|
||||||
for child in os.listdir(path):
|
for child in os.listdir(path):
|
||||||
child_path = os.path.join(path, child)
|
child_path = os.path.join(path, child)
|
||||||
@@ -40,119 +129,107 @@ def print_directory_contents(path):
|
|||||||
print(child_path)
|
print(child_path)
|
||||||
|
|
||||||
def download_model():
|
def download_model():
|
||||||
if not os.path.exists(CheckpointsDir):
|
# 检查必需的模型文件是否存在
|
||||||
os.makedirs(CheckpointsDir)
|
required_models = {
|
||||||
print("Checkpoint Not Downloaded, start downloading...")
|
"MuseTalk": f"{CheckpointsDir}/musetalkV15/unet.pth",
|
||||||
tic = time.time()
|
"MuseTalk": f"{CheckpointsDir}/musetalkV15/musetalk.json",
|
||||||
snapshot_download(
|
"SD VAE": f"{CheckpointsDir}/sd-vae/config.json",
|
||||||
repo_id="TMElyralab/MuseTalk",
|
"Whisper": f"{CheckpointsDir}/whisper/config.json",
|
||||||
local_dir=CheckpointsDir,
|
"DWPose": f"{CheckpointsDir}/dwpose/dw-ll_ucoco_384.pth",
|
||||||
max_workers=8,
|
"SyncNet": f"{CheckpointsDir}/syncnet/latentsync_syncnet.pt",
|
||||||
local_dir_use_symlinks=True,
|
"Face Parse": f"{CheckpointsDir}/face-parse-bisent/79999_iter.pth",
|
||||||
force_download=True, resume_download=False
|
"ResNet": f"{CheckpointsDir}/face-parse-bisent/resnet18-5c106cde.pth"
|
||||||
)
|
}
|
||||||
# weight
|
|
||||||
os.makedirs(f"{CheckpointsDir}/sd-vae-ft-mse/")
|
missing_models = []
|
||||||
snapshot_download(
|
for model_name, model_path in required_models.items():
|
||||||
repo_id="stabilityai/sd-vae-ft-mse",
|
if not os.path.exists(model_path):
|
||||||
local_dir=CheckpointsDir+'/sd-vae-ft-mse',
|
missing_models.append(model_name)
|
||||||
max_workers=8,
|
|
||||||
local_dir_use_symlinks=True,
|
if missing_models:
|
||||||
force_download=True, resume_download=False
|
# 全用英文
|
||||||
)
|
print("The following required model files are missing:")
|
||||||
#dwpose
|
for model in missing_models:
|
||||||
os.makedirs(f"{CheckpointsDir}/dwpose/")
|
print(f"- {model}")
|
||||||
snapshot_download(
|
print("\nPlease run the download script to download the missing models:")
|
||||||
repo_id="yzd-v/DWPose",
|
if sys.platform == "win32":
|
||||||
local_dir=CheckpointsDir+'/dwpose',
|
print("Windows: Run download_weights.bat")
|
||||||
max_workers=8,
|
|
||||||
local_dir_use_symlinks=True,
|
|
||||||
force_download=True, resume_download=False
|
|
||||||
)
|
|
||||||
#vae
|
|
||||||
url = "https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt"
|
|
||||||
response = requests.get(url)
|
|
||||||
# 确保请求成功
|
|
||||||
if response.status_code == 200:
|
|
||||||
# 指定文件保存的位置
|
|
||||||
file_path = f"{CheckpointsDir}/whisper/tiny.pt"
|
|
||||||
os.makedirs(f"{CheckpointsDir}/whisper/")
|
|
||||||
# 将文件内容写入指定位置
|
|
||||||
with open(file_path, "wb") as f:
|
|
||||||
f.write(response.content)
|
|
||||||
else:
|
else:
|
||||||
print(f"请求失败,状态码:{response.status_code}")
|
print("Linux/Mac: Run ./download_weights.sh")
|
||||||
#gdown face parse
|
sys.exit(1)
|
||||||
url = "https://drive.google.com/uc?id=154JgKpzCPW82qINcVieuPH3fZ2e0P812"
|
|
||||||
os.makedirs(f"{CheckpointsDir}/face-parse-bisent/")
|
|
||||||
file_path = f"{CheckpointsDir}/face-parse-bisent/79999_iter.pth"
|
|
||||||
gdown.download(url, file_path, quiet=False)
|
|
||||||
#resnet
|
|
||||||
url = "https://download.pytorch.org/models/resnet18-5c106cde.pth"
|
|
||||||
response = requests.get(url)
|
|
||||||
# 确保请求成功
|
|
||||||
if response.status_code == 200:
|
|
||||||
# 指定文件保存的位置
|
|
||||||
file_path = f"{CheckpointsDir}/face-parse-bisent/resnet18-5c106cde.pth"
|
|
||||||
# 将文件内容写入指定位置
|
|
||||||
with open(file_path, "wb") as f:
|
|
||||||
f.write(response.content)
|
|
||||||
else:
|
|
||||||
print(f"请求失败,状态码:{response.status_code}")
|
|
||||||
|
|
||||||
|
|
||||||
toc = time.time()
|
|
||||||
|
|
||||||
print(f"download cost {toc-tic} seconds")
|
|
||||||
print_directory_contents(CheckpointsDir)
|
|
||||||
|
|
||||||
else:
|
else:
|
||||||
print("Already download the model.")
|
print("All required model files exist.")
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
download_model() # for huggingface deployment.
|
download_model() # for huggingface deployment.
|
||||||
|
|
||||||
|
|
||||||
from musetalk.utils.utils import get_file_type,get_video_fps,datagen
|
|
||||||
from musetalk.utils.preprocessing import get_landmark_and_bbox,read_imgs,coord_placeholder,get_bbox_range
|
|
||||||
from musetalk.utils.blending import get_image
|
from musetalk.utils.blending import get_image
|
||||||
from musetalk.utils.utils import load_all_model
|
from musetalk.utils.face_parsing import FaceParsing
|
||||||
|
from musetalk.utils.audio_processor import AudioProcessor
|
||||||
|
from musetalk.utils.utils import get_file_type, get_video_fps, datagen, load_all_model
|
||||||
|
from musetalk.utils.preprocessing import get_landmark_and_bbox, read_imgs, coord_placeholder, get_bbox_range
|
||||||
|
|
||||||
|
|
||||||
|
def fast_check_ffmpeg():
|
||||||
|
try:
|
||||||
|
subprocess.run(["ffmpeg", "-version"], capture_output=True, check=True)
|
||||||
|
return True
|
||||||
|
except:
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
@spaces.GPU(duration=600)
|
|
||||||
@torch.no_grad()
|
@torch.no_grad()
|
||||||
def inference(audio_path,video_path,bbox_shift,progress=gr.Progress(track_tqdm=True)):
|
def inference(audio_path, video_path, bbox_shift, extra_margin=10, parsing_mode="jaw",
|
||||||
args_dict={"result_dir":'./results/output', "fps":25, "batch_size":8, "output_vid_name":'', "use_saved_coord":False}#same with inferenece script
|
left_cheek_width=90, right_cheek_width=90, progress=gr.Progress(track_tqdm=True)):
|
||||||
|
# Set default parameters, aligned with inference.py
|
||||||
|
args_dict = {
|
||||||
|
"result_dir": './results/output',
|
||||||
|
"fps": 25,
|
||||||
|
"batch_size": 8,
|
||||||
|
"output_vid_name": '',
|
||||||
|
"use_saved_coord": False,
|
||||||
|
"audio_padding_length_left": 2,
|
||||||
|
"audio_padding_length_right": 2,
|
||||||
|
"version": "v15", # Fixed use v15 version
|
||||||
|
"extra_margin": extra_margin,
|
||||||
|
"parsing_mode": parsing_mode,
|
||||||
|
"left_cheek_width": left_cheek_width,
|
||||||
|
"right_cheek_width": right_cheek_width
|
||||||
|
}
|
||||||
args = Namespace(**args_dict)
|
args = Namespace(**args_dict)
|
||||||
|
|
||||||
input_basename = os.path.basename(video_path).split('.')[0]
|
# Check ffmpeg
|
||||||
audio_basename = os.path.basename(audio_path).split('.')[0]
|
if not fast_check_ffmpeg():
|
||||||
output_basename = f"{input_basename}_{audio_basename}"
|
print("Warning: Unable to find ffmpeg, please ensure ffmpeg is properly installed")
|
||||||
result_img_save_path = os.path.join(args.result_dir, output_basename) # related to video & audio inputs
|
|
||||||
crop_coord_save_path = os.path.join(result_img_save_path, input_basename+".pkl") # only related to video input
|
|
||||||
os.makedirs(result_img_save_path,exist_ok =True)
|
|
||||||
|
|
||||||
if args.output_vid_name=="":
|
input_basename = os.path.basename(video_path).split('.')[0]
|
||||||
output_vid_name = os.path.join(args.result_dir, output_basename+".mp4")
|
audio_basename = os.path.basename(audio_path).split('.')[0]
|
||||||
|
output_basename = f"{input_basename}_{audio_basename}"
|
||||||
|
|
||||||
|
# Create temporary directory
|
||||||
|
temp_dir = os.path.join(args.result_dir, f"{args.version}")
|
||||||
|
os.makedirs(temp_dir, exist_ok=True)
|
||||||
|
|
||||||
|
# Set result save path
|
||||||
|
result_img_save_path = os.path.join(temp_dir, output_basename)
|
||||||
|
crop_coord_save_path = os.path.join(args.result_dir, "../", input_basename+".pkl")
|
||||||
|
os.makedirs(result_img_save_path, exist_ok=True)
|
||||||
|
|
||||||
|
if args.output_vid_name == "":
|
||||||
|
output_vid_name = os.path.join(temp_dir, output_basename+".mp4")
|
||||||
else:
|
else:
|
||||||
output_vid_name = os.path.join(args.result_dir, args.output_vid_name)
|
output_vid_name = os.path.join(temp_dir, args.output_vid_name)
|
||||||
|
|
||||||
############################################## extract frames from source video ##############################################
|
############################################## extract frames from source video ##############################################
|
||||||
if get_file_type(video_path)=="video":
|
if get_file_type(video_path) == "video":
|
||||||
save_dir_full = os.path.join(args.result_dir, input_basename)
|
save_dir_full = os.path.join(temp_dir, input_basename)
|
||||||
os.makedirs(save_dir_full,exist_ok = True)
|
os.makedirs(save_dir_full, exist_ok=True)
|
||||||
# cmd = f"ffmpeg -v fatal -i {video_path} -start_number 0 {save_dir_full}/%08d.png"
|
# Read video
|
||||||
# os.system(cmd)
|
|
||||||
# 读取视频
|
|
||||||
reader = imageio.get_reader(video_path)
|
reader = imageio.get_reader(video_path)
|
||||||
|
|
||||||
# 保存图片
|
# Save images
|
||||||
for i, im in enumerate(reader):
|
for i, im in enumerate(reader):
|
||||||
imageio.imwrite(f"{save_dir_full}/{i:08d}.png", im)
|
imageio.imwrite(f"{save_dir_full}/{i:08d}.png", im)
|
||||||
input_img_list = sorted(glob.glob(os.path.join(save_dir_full, '*.[jpJP][pnPN]*[gG]')))
|
input_img_list = sorted(glob.glob(os.path.join(save_dir_full, '*.[jpJP][pnPN]*[gG]')))
|
||||||
@@ -161,10 +238,21 @@ def inference(audio_path,video_path,bbox_shift,progress=gr.Progress(track_tqdm=T
|
|||||||
input_img_list = glob.glob(os.path.join(video_path, '*.[jpJP][pnPN]*[gG]'))
|
input_img_list = glob.glob(os.path.join(video_path, '*.[jpJP][pnPN]*[gG]'))
|
||||||
input_img_list = sorted(input_img_list, key=lambda x: int(os.path.splitext(os.path.basename(x))[0]))
|
input_img_list = sorted(input_img_list, key=lambda x: int(os.path.splitext(os.path.basename(x))[0]))
|
||||||
fps = args.fps
|
fps = args.fps
|
||||||
#print(input_img_list)
|
|
||||||
############################################## extract audio feature ##############################################
|
############################################## extract audio feature ##############################################
|
||||||
whisper_feature = audio_processor.audio2feat(audio_path)
|
# Extract audio features
|
||||||
whisper_chunks = audio_processor.feature2chunks(feature_array=whisper_feature,fps=fps)
|
whisper_input_features, librosa_length = audio_processor.get_audio_feature(audio_path)
|
||||||
|
whisper_chunks = audio_processor.get_whisper_chunk(
|
||||||
|
whisper_input_features,
|
||||||
|
device,
|
||||||
|
weight_dtype,
|
||||||
|
whisper,
|
||||||
|
librosa_length,
|
||||||
|
fps=fps,
|
||||||
|
audio_padding_length_left=args.audio_padding_length_left,
|
||||||
|
audio_padding_length_right=args.audio_padding_length_right,
|
||||||
|
)
|
||||||
|
|
||||||
############################################## preprocess input image ##############################################
|
############################################## preprocess input image ##############################################
|
||||||
if os.path.exists(crop_coord_save_path) and args.use_saved_coord:
|
if os.path.exists(crop_coord_save_path) and args.use_saved_coord:
|
||||||
print("using extracted coordinates")
|
print("using extracted coordinates")
|
||||||
@@ -176,13 +264,22 @@ def inference(audio_path,video_path,bbox_shift,progress=gr.Progress(track_tqdm=T
|
|||||||
coord_list, frame_list = get_landmark_and_bbox(input_img_list, bbox_shift)
|
coord_list, frame_list = get_landmark_and_bbox(input_img_list, bbox_shift)
|
||||||
with open(crop_coord_save_path, 'wb') as f:
|
with open(crop_coord_save_path, 'wb') as f:
|
||||||
pickle.dump(coord_list, f)
|
pickle.dump(coord_list, f)
|
||||||
bbox_shift_text=get_bbox_range(input_img_list, bbox_shift)
|
bbox_shift_text = get_bbox_range(input_img_list, bbox_shift)
|
||||||
|
|
||||||
|
# Initialize face parser
|
||||||
|
fp = FaceParsing(
|
||||||
|
left_cheek_width=args.left_cheek_width,
|
||||||
|
right_cheek_width=args.right_cheek_width
|
||||||
|
)
|
||||||
|
|
||||||
i = 0
|
i = 0
|
||||||
input_latent_list = []
|
input_latent_list = []
|
||||||
for bbox, frame in zip(coord_list, frame_list):
|
for bbox, frame in zip(coord_list, frame_list):
|
||||||
if bbox == coord_placeholder:
|
if bbox == coord_placeholder:
|
||||||
continue
|
continue
|
||||||
x1, y1, x2, y2 = bbox
|
x1, y1, x2, y2 = bbox
|
||||||
|
y2 = y2 + args.extra_margin
|
||||||
|
y2 = min(y2, frame.shape[0])
|
||||||
crop_frame = frame[y1:y2, x1:x2]
|
crop_frame = frame[y1:y2, x1:x2]
|
||||||
crop_frame = cv2.resize(crop_frame,(256,256),interpolation = cv2.INTER_LANCZOS4)
|
crop_frame = cv2.resize(crop_frame,(256,256),interpolation = cv2.INTER_LANCZOS4)
|
||||||
latents = vae.get_latents_for_unet(crop_frame)
|
latents = vae.get_latents_for_unet(crop_frame)
|
||||||
@@ -192,17 +289,23 @@ def inference(audio_path,video_path,bbox_shift,progress=gr.Progress(track_tqdm=T
|
|||||||
frame_list_cycle = frame_list + frame_list[::-1]
|
frame_list_cycle = frame_list + frame_list[::-1]
|
||||||
coord_list_cycle = coord_list + coord_list[::-1]
|
coord_list_cycle = coord_list + coord_list[::-1]
|
||||||
input_latent_list_cycle = input_latent_list + input_latent_list[::-1]
|
input_latent_list_cycle = input_latent_list + input_latent_list[::-1]
|
||||||
|
|
||||||
############################################## inference batch by batch ##############################################
|
############################################## inference batch by batch ##############################################
|
||||||
print("start inference")
|
print("start inference")
|
||||||
video_num = len(whisper_chunks)
|
video_num = len(whisper_chunks)
|
||||||
batch_size = args.batch_size
|
batch_size = args.batch_size
|
||||||
gen = datagen(whisper_chunks,input_latent_list_cycle,batch_size)
|
gen = datagen(
|
||||||
|
whisper_chunks=whisper_chunks,
|
||||||
|
vae_encode_latents=input_latent_list_cycle,
|
||||||
|
batch_size=batch_size,
|
||||||
|
delay_frame=0,
|
||||||
|
device=device,
|
||||||
|
)
|
||||||
res_frame_list = []
|
res_frame_list = []
|
||||||
for i, (whisper_batch,latent_batch) in enumerate(tqdm(gen,total=int(np.ceil(float(video_num)/batch_size)))):
|
for i, (whisper_batch,latent_batch) in enumerate(tqdm(gen,total=int(np.ceil(float(video_num)/batch_size)))):
|
||||||
|
audio_feature_batch = pe(whisper_batch)
|
||||||
tensor_list = [torch.FloatTensor(arr) for arr in whisper_batch]
|
# Ensure latent_batch is consistent with model weight type
|
||||||
audio_feature_batch = torch.stack(tensor_list).to(unet.device) # torch, B, 5*N,384
|
latent_batch = latent_batch.to(dtype=weight_dtype)
|
||||||
audio_feature_batch = pe(audio_feature_batch)
|
|
||||||
|
|
||||||
pred_latents = unet.model(latent_batch, timesteps, encoder_hidden_states=audio_feature_batch).sample
|
pred_latents = unet.model(latent_batch, timesteps, encoder_hidden_states=audio_feature_batch).sample
|
||||||
recon = vae.decode_latents(pred_latents)
|
recon = vae.decode_latents(pred_latents)
|
||||||
@@ -215,25 +318,24 @@ def inference(audio_path,video_path,bbox_shift,progress=gr.Progress(track_tqdm=T
|
|||||||
bbox = coord_list_cycle[i%(len(coord_list_cycle))]
|
bbox = coord_list_cycle[i%(len(coord_list_cycle))]
|
||||||
ori_frame = copy.deepcopy(frame_list_cycle[i%(len(frame_list_cycle))])
|
ori_frame = copy.deepcopy(frame_list_cycle[i%(len(frame_list_cycle))])
|
||||||
x1, y1, x2, y2 = bbox
|
x1, y1, x2, y2 = bbox
|
||||||
|
y2 = y2 + args.extra_margin
|
||||||
|
y2 = min(y2, frame.shape[0])
|
||||||
try:
|
try:
|
||||||
res_frame = cv2.resize(res_frame.astype(np.uint8),(x2-x1,y2-y1))
|
res_frame = cv2.resize(res_frame.astype(np.uint8),(x2-x1,y2-y1))
|
||||||
except:
|
except:
|
||||||
# print(bbox)
|
|
||||||
continue
|
continue
|
||||||
|
|
||||||
combine_frame = get_image(ori_frame,res_frame,bbox)
|
# Use v15 version blending
|
||||||
|
combine_frame = get_image(ori_frame, res_frame, [x1, y1, x2, y2], mode=args.parsing_mode, fp=fp)
|
||||||
|
|
||||||
cv2.imwrite(f"{result_img_save_path}/{str(i).zfill(8)}.png",combine_frame)
|
cv2.imwrite(f"{result_img_save_path}/{str(i).zfill(8)}.png",combine_frame)
|
||||||
|
|
||||||
# cmd_img2video = f"ffmpeg -y -v fatal -r {fps} -f image2 -i {result_img_save_path}/%08d.png -vcodec libx264 -vf format=rgb24,scale=out_color_matrix=bt709,format=yuv420p temp.mp4"
|
# Frame rate
|
||||||
# print(cmd_img2video)
|
|
||||||
# os.system(cmd_img2video)
|
|
||||||
# 帧率
|
|
||||||
fps = 25
|
fps = 25
|
||||||
# 图片路径
|
# Output video path
|
||||||
# 输出视频路径
|
|
||||||
output_video = 'temp.mp4'
|
output_video = 'temp.mp4'
|
||||||
|
|
||||||
# 读取图片
|
# Read images
|
||||||
def is_valid_image(file):
|
def is_valid_image(file):
|
||||||
pattern = re.compile(r'\d{8}\.png')
|
pattern = re.compile(r'\d{8}\.png')
|
||||||
return pattern.match(file)
|
return pattern.match(file)
|
||||||
@@ -247,13 +349,9 @@ def inference(audio_path,video_path,bbox_shift,progress=gr.Progress(track_tqdm=T
|
|||||||
images.append(imageio.imread(filename))
|
images.append(imageio.imread(filename))
|
||||||
|
|
||||||
|
|
||||||
# 保存视频
|
# Save video
|
||||||
imageio.mimwrite(output_video, images, 'FFMPEG', fps=fps, codec='libx264', pixelformat='yuv420p')
|
imageio.mimwrite(output_video, images, 'FFMPEG', fps=fps, codec='libx264', pixelformat='yuv420p')
|
||||||
|
|
||||||
# cmd_combine_audio = f"ffmpeg -y -v fatal -i {audio_path} -i temp.mp4 {output_vid_name}"
|
|
||||||
# print(cmd_combine_audio)
|
|
||||||
# os.system(cmd_combine_audio)
|
|
||||||
|
|
||||||
input_video = './temp.mp4'
|
input_video = './temp.mp4'
|
||||||
# Check if the input_video and audio_path exist
|
# Check if the input_video and audio_path exist
|
||||||
if not os.path.exists(input_video):
|
if not os.path.exists(input_video):
|
||||||
@@ -261,40 +359,15 @@ def inference(audio_path,video_path,bbox_shift,progress=gr.Progress(track_tqdm=T
|
|||||||
if not os.path.exists(audio_path):
|
if not os.path.exists(audio_path):
|
||||||
raise FileNotFoundError(f"Audio file not found: {audio_path}")
|
raise FileNotFoundError(f"Audio file not found: {audio_path}")
|
||||||
|
|
||||||
# 读取视频
|
# Read video
|
||||||
reader = imageio.get_reader(input_video)
|
reader = imageio.get_reader(input_video)
|
||||||
fps = reader.get_meta_data()['fps'] # 获取原视频的帧率
|
fps = reader.get_meta_data()['fps'] # Get original video frame rate
|
||||||
|
reader.close() # Otherwise, error on win11: PermissionError: [WinError 32] Another program is using this file, process cannot access. : 'temp.mp4'
|
||||||
# 将帧存储在列表中
|
# Store frames in list
|
||||||
frames = images
|
frames = images
|
||||||
|
|
||||||
# 保存视频并添加音频
|
|
||||||
# imageio.mimwrite(output_vid_name, frames, 'FFMPEG', fps=fps, codec='libx264', audio_codec='aac', input_params=['-i', audio_path])
|
|
||||||
|
|
||||||
# input_video = ffmpeg.input(input_video)
|
|
||||||
|
|
||||||
# input_audio = ffmpeg.input(audio_path)
|
|
||||||
|
|
||||||
print(len(frames))
|
print(len(frames))
|
||||||
|
|
||||||
# imageio.mimwrite(
|
|
||||||
# output_video,
|
|
||||||
# frames,
|
|
||||||
# 'FFMPEG',
|
|
||||||
# fps=25,
|
|
||||||
# codec='libx264',
|
|
||||||
# audio_codec='aac',
|
|
||||||
# input_params=['-i', audio_path],
|
|
||||||
# output_params=['-y'], # Add the '-y' flag to overwrite the output file if it exists
|
|
||||||
# )
|
|
||||||
# writer = imageio.get_writer(output_vid_name, fps = 25, codec='libx264', quality=10, pixelformat='yuvj444p')
|
|
||||||
# for im in frames:
|
|
||||||
# writer.append_data(im)
|
|
||||||
# writer.close()
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
# Load the video
|
# Load the video
|
||||||
video_clip = VideoFileClip(input_video)
|
video_clip = VideoFileClip(input_video)
|
||||||
|
|
||||||
@@ -315,11 +388,45 @@ def inference(audio_path,video_path,bbox_shift,progress=gr.Progress(track_tqdm=T
|
|||||||
|
|
||||||
|
|
||||||
# load model weights
|
# load model weights
|
||||||
audio_processor,vae,unet,pe = load_all_model()
|
|
||||||
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
||||||
|
vae, unet, pe = load_all_model(
|
||||||
|
unet_model_path="./models/musetalkV15/unet.pth",
|
||||||
|
vae_type="sd-vae",
|
||||||
|
unet_config="./models/musetalkV15/musetalk.json",
|
||||||
|
device=device
|
||||||
|
)
|
||||||
|
|
||||||
|
# Parse command line arguments
|
||||||
|
parser = argparse.ArgumentParser()
|
||||||
|
parser.add_argument("--ffmpeg_path", type=str, default=r"ffmpeg-master-latest-win64-gpl-shared\bin", help="Path to ffmpeg executable")
|
||||||
|
parser.add_argument("--ip", type=str, default="127.0.0.1", help="IP address to bind to")
|
||||||
|
parser.add_argument("--port", type=int, default=7860, help="Port to bind to")
|
||||||
|
parser.add_argument("--share", action="store_true", help="Create a public link")
|
||||||
|
parser.add_argument("--use_float16", action="store_true", help="Use float16 for faster inference")
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
# Set data type
|
||||||
|
if args.use_float16:
|
||||||
|
# Convert models to half precision for better performance
|
||||||
|
pe = pe.half()
|
||||||
|
vae.vae = vae.vae.half()
|
||||||
|
unet.model = unet.model.half()
|
||||||
|
weight_dtype = torch.float16
|
||||||
|
else:
|
||||||
|
weight_dtype = torch.float32
|
||||||
|
|
||||||
|
# Move models to specified device
|
||||||
|
pe = pe.to(device)
|
||||||
|
vae.vae = vae.vae.to(device)
|
||||||
|
unet.model = unet.model.to(device)
|
||||||
|
|
||||||
timesteps = torch.tensor([0], device=device)
|
timesteps = torch.tensor([0], device=device)
|
||||||
|
|
||||||
|
# Initialize audio processor and Whisper model
|
||||||
|
audio_processor = AudioProcessor(feature_extractor_path="./models/whisper")
|
||||||
|
whisper = WhisperModel.from_pretrained("./models/whisper")
|
||||||
|
whisper = whisper.to(device=device, dtype=weight_dtype).eval()
|
||||||
|
whisper.requires_grad_(False)
|
||||||
|
|
||||||
|
|
||||||
def check_video(video):
|
def check_video(video):
|
||||||
@@ -340,9 +447,6 @@ def check_video(video):
|
|||||||
output_video = os.path.join('./results/input', output_file_name)
|
output_video = os.path.join('./results/input', output_file_name)
|
||||||
|
|
||||||
|
|
||||||
# # Run the ffmpeg command to change the frame rate to 25fps
|
|
||||||
# command = f"ffmpeg -i {video} -r 25 -vcodec libx264 -vtag hvc1 -pix_fmt yuv420p crf 18 {output_video} -y"
|
|
||||||
|
|
||||||
# read video
|
# read video
|
||||||
reader = imageio.get_reader(video)
|
reader = imageio.get_reader(video)
|
||||||
fps = reader.get_meta_data()['fps'] # get fps from original video
|
fps = reader.get_meta_data()['fps'] # get fps from original video
|
||||||
@@ -374,34 +478,45 @@ css = """#input_img {max-width: 1024px !important} #output_vid {max-width: 1024p
|
|||||||
|
|
||||||
with gr.Blocks(css=css) as demo:
|
with gr.Blocks(css=css) as demo:
|
||||||
gr.Markdown(
|
gr.Markdown(
|
||||||
"<div align='center'> <h1>MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting </span> </h1> \
|
"""<div align='center'> <h1>MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling</h1> \
|
||||||
<h2 style='font-weight: 450; font-size: 1rem; margin: 0rem'>\
|
<h2 style='font-weight: 450; font-size: 1rem; margin: 0rem'>\
|
||||||
</br>\
|
</br>\
|
||||||
Yue Zhang <sup>\*</sup>,\
|
Yue Zhang <sup>*</sup>,\
|
||||||
Minhao Liu<sup>\*</sup>,\
|
Zhizhou Zhong <sup>*</sup>,\
|
||||||
|
Minhao Liu<sup>*</sup>,\
|
||||||
Zhaokang Chen,\
|
Zhaokang Chen,\
|
||||||
Bin Wu<sup>†</sup>,\
|
Bin Wu<sup>†</sup>,\
|
||||||
|
Yubin Zeng,\
|
||||||
|
Chao Zhang,\
|
||||||
Yingjie He,\
|
Yingjie He,\
|
||||||
Chao Zhan,\
|
Junxin Huang,\
|
||||||
Wenjiang Zhou\
|
Wenjiang Zhou <br>\
|
||||||
(<sup>*</sup>Equal Contribution, <sup>†</sup>Corresponding Author, benbinwu@tencent.com)\
|
(<sup>*</sup>Equal Contribution, <sup>†</sup>Corresponding Author, benbinwu@tencent.com)\
|
||||||
Lyra Lab, Tencent Music Entertainment\
|
Lyra Lab, Tencent Music Entertainment\
|
||||||
</h2> \
|
</h2> \
|
||||||
<a style='font-size:18px;color: #000000' href='https://github.com/TMElyralab/MuseTalk'>[Github Repo]</a>\
|
<a style='font-size:18px;color: #000000' href='https://github.com/TMElyralab/MuseTalk'>[Github Repo]</a>\
|
||||||
<a style='font-size:18px;color: #000000' href='https://github.com/TMElyralab/MuseTalk'>[Huggingface]</a>\
|
<a style='font-size:18px;color: #000000' href='https://github.com/TMElyralab/MuseTalk'>[Huggingface]</a>\
|
||||||
<a style='font-size:18px;color: #000000' href=''> [Technical report(Coming Soon)] </a>\
|
<a style='font-size:18px;color: #000000' href='https://arxiv.org/abs/2410.10122'> [Technical report] </a>"""
|
||||||
<a style='font-size:18px;color: #000000' href=''> [Project Page(Coming Soon)] </a> </div>"
|
|
||||||
)
|
)
|
||||||
|
|
||||||
with gr.Row():
|
with gr.Row():
|
||||||
with gr.Column():
|
with gr.Column():
|
||||||
audio = gr.Audio(label="Driven Audio",type="filepath")
|
audio = gr.Audio(label="Drving Audio",type="filepath")
|
||||||
video = gr.Video(label="Reference Video",sources=['upload'])
|
video = gr.Video(label="Reference Video",sources=['upload'])
|
||||||
bbox_shift = gr.Number(label="BBox_shift value, px", value=0)
|
bbox_shift = gr.Number(label="BBox_shift value, px", value=0)
|
||||||
bbox_shift_scale = gr.Textbox(label="BBox_shift recommend value lower bound,The corresponding bbox range is generated after the initial result is generated. \n If the result is not good, it can be adjusted according to this reference value", value="",interactive=False)
|
extra_margin = gr.Slider(label="Extra Margin", minimum=0, maximum=40, value=10, step=1)
|
||||||
|
parsing_mode = gr.Radio(label="Parsing Mode", choices=["jaw", "raw"], value="jaw")
|
||||||
|
left_cheek_width = gr.Slider(label="Left Cheek Width", minimum=20, maximum=160, value=90, step=5)
|
||||||
|
right_cheek_width = gr.Slider(label="Right Cheek Width", minimum=20, maximum=160, value=90, step=5)
|
||||||
|
bbox_shift_scale = gr.Textbox(label="'left_cheek_width' and 'right_cheek_width' parameters determine the range of left and right cheeks editing when parsing model is 'jaw'. The 'extra_margin' parameter determines the movement range of the jaw. Users can freely adjust these three parameters to obtain better inpainting results.")
|
||||||
|
|
||||||
btn = gr.Button("Generate")
|
with gr.Row():
|
||||||
out1 = gr.Video()
|
debug_btn = gr.Button("1. Test Inpainting ")
|
||||||
|
btn = gr.Button("2. Generate")
|
||||||
|
with gr.Column():
|
||||||
|
debug_image = gr.Image(label="Test Inpainting Result (First Frame)")
|
||||||
|
debug_info = gr.Textbox(label="Parameter Information", lines=5)
|
||||||
|
out1 = gr.Video()
|
||||||
|
|
||||||
video.change(
|
video.change(
|
||||||
fn=check_video, inputs=[video], outputs=[video]
|
fn=check_video, inputs=[video], outputs=[video]
|
||||||
@@ -412,15 +527,44 @@ with gr.Blocks(css=css) as demo:
|
|||||||
audio,
|
audio,
|
||||||
video,
|
video,
|
||||||
bbox_shift,
|
bbox_shift,
|
||||||
|
extra_margin,
|
||||||
|
parsing_mode,
|
||||||
|
left_cheek_width,
|
||||||
|
right_cheek_width
|
||||||
],
|
],
|
||||||
outputs=[out1,bbox_shift_scale]
|
outputs=[out1,bbox_shift_scale]
|
||||||
)
|
)
|
||||||
|
debug_btn.click(
|
||||||
|
fn=debug_inpainting,
|
||||||
|
inputs=[
|
||||||
|
video,
|
||||||
|
bbox_shift,
|
||||||
|
extra_margin,
|
||||||
|
parsing_mode,
|
||||||
|
left_cheek_width,
|
||||||
|
right_cheek_width
|
||||||
|
],
|
||||||
|
outputs=[debug_image, debug_info]
|
||||||
|
)
|
||||||
|
|
||||||
# Set the IP and port
|
# Check ffmpeg and add to PATH
|
||||||
ip_address = "0.0.0.0" # Replace with your desired IP address
|
if not fast_check_ffmpeg():
|
||||||
port_number = 7860 # Replace with your desired port number
|
print(f"Adding ffmpeg to PATH: {args.ffmpeg_path}")
|
||||||
|
# According to operating system, choose path separator
|
||||||
|
path_separator = ';' if sys.platform == 'win32' else ':'
|
||||||
|
os.environ["PATH"] = f"{args.ffmpeg_path}{path_separator}{os.environ['PATH']}"
|
||||||
|
if not fast_check_ffmpeg():
|
||||||
|
print("Warning: Unable to find ffmpeg, please ensure ffmpeg is properly installed")
|
||||||
|
|
||||||
|
# Solve asynchronous IO issues on Windows
|
||||||
|
if sys.platform == 'win32':
|
||||||
|
import asyncio
|
||||||
|
asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
|
||||||
|
|
||||||
|
# Start Gradio application
|
||||||
demo.queue().launch(
|
demo.queue().launch(
|
||||||
share=False , debug=True, server_name=ip_address, server_port=port_number
|
share=args.share,
|
||||||
|
debug=True,
|
||||||
|
server_name=args.ip,
|
||||||
|
server_port=args.port
|
||||||
)
|
)
|
||||||
|
|||||||
Binary file not shown.
|
After Width: | Height: | Size: 14 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 73 KiB |
@@ -1,10 +1,10 @@
|
|||||||
avator_1:
|
avator_1:
|
||||||
preparation: False
|
preparation: True # your can set it to False if you want to use the existing avator, it will save time
|
||||||
bbox_shift: 5
|
bbox_shift: 5
|
||||||
video_path: "data/video/sun.mp4"
|
video_path: "data/video/yongen.mp4"
|
||||||
audio_clips:
|
audio_clips:
|
||||||
audio_0: "data/audio/yongen.wav"
|
audio_0: "data/audio/yongen.wav"
|
||||||
audio_1: "data/audio/sun.wav"
|
audio_1: "data/audio/eng.wav"
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -3,8 +3,8 @@ task_0:
|
|||||||
audio_path: "data/audio/yongen.wav"
|
audio_path: "data/audio/yongen.wav"
|
||||||
|
|
||||||
task_1:
|
task_1:
|
||||||
video_path: "data/video/sun.mp4"
|
video_path: "data/video/yongen.mp4"
|
||||||
audio_path: "data/audio/sun.wav"
|
audio_path: "data/audio/eng.wav"
|
||||||
bbox_shift: -7
|
bbox_shift: -7
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
Executable
+21
@@ -0,0 +1,21 @@
|
|||||||
|
compute_environment: LOCAL_MACHINE
|
||||||
|
debug: True
|
||||||
|
deepspeed_config:
|
||||||
|
offload_optimizer_device: none
|
||||||
|
offload_param_device: none
|
||||||
|
zero3_init_flag: False
|
||||||
|
zero_stage: 2
|
||||||
|
|
||||||
|
distributed_type: DEEPSPEED
|
||||||
|
downcast_bf16: 'no'
|
||||||
|
gpu_ids: "5, 7" # modify this according to your GPU number
|
||||||
|
machine_rank: 0
|
||||||
|
main_training_function: main
|
||||||
|
num_machines: 1
|
||||||
|
num_processes: 2 # it should be the same as the number of GPUs
|
||||||
|
rdzv_backend: static
|
||||||
|
same_network: true
|
||||||
|
tpu_env: []
|
||||||
|
tpu_use_cluster: false
|
||||||
|
tpu_use_sudo: false
|
||||||
|
use_cpu: false
|
||||||
Executable
+31
@@ -0,0 +1,31 @@
|
|||||||
|
clip_len_second: 30 # the length of the video clip
|
||||||
|
video_root_raw: "./dataset/HDTF/source/" # the path of the original video
|
||||||
|
val_list_hdtf:
|
||||||
|
- RD_Radio7_000
|
||||||
|
- RD_Radio8_000
|
||||||
|
- RD_Radio9_000
|
||||||
|
- WDA_TinaSmith_000
|
||||||
|
- WDA_TomCarper_000
|
||||||
|
- WDA_TomPerez_000
|
||||||
|
- WDA_TomUdall_000
|
||||||
|
- WDA_VeronicaEscobar0_000
|
||||||
|
- WDA_VeronicaEscobar1_000
|
||||||
|
- WDA_WhipJimClyburn_000
|
||||||
|
- WDA_XavierBecerra_000
|
||||||
|
- WDA_XavierBecerra_001
|
||||||
|
- WDA_XavierBecerra_002
|
||||||
|
- WDA_ZoeLofgren_000
|
||||||
|
- WRA_SteveScalise1_000
|
||||||
|
- WRA_TimScott_000
|
||||||
|
- WRA_ToddYoung_000
|
||||||
|
- WRA_TomCotton_000
|
||||||
|
- WRA_TomPrice_000
|
||||||
|
- WRA_VickyHartzler_000
|
||||||
|
|
||||||
|
# following dir will be automatically generated
|
||||||
|
video_root_25fps: "./dataset/HDTF/video_root_25fps/"
|
||||||
|
video_file_list: "./dataset/HDTF/video_file_list.txt"
|
||||||
|
video_audio_clip_root: "./dataset/HDTF/video_audio_clip_root/"
|
||||||
|
meta_root: "./dataset/HDTF/meta/"
|
||||||
|
video_clip_file_list_train: "./dataset/HDTF/train.txt"
|
||||||
|
video_clip_file_list_val: "./dataset/HDTF/val.txt"
|
||||||
Executable
+89
@@ -0,0 +1,89 @@
|
|||||||
|
exp_name: 'test' # Name of the experiment
|
||||||
|
output_dir: './exp_out/stage1/' # Directory to save experiment outputs
|
||||||
|
unet_sub_folder: musetalk # Subfolder name for UNet model
|
||||||
|
random_init_unet: True # Whether to randomly initialize UNet (stage1) or use pretrained weights (stage2)
|
||||||
|
whisper_path: "./models/whisper" # Path to the Whisper model
|
||||||
|
pretrained_model_name_or_path: "./models" # Path to pretrained models
|
||||||
|
resume_from_checkpoint: True # Whether to resume training from a checkpoint
|
||||||
|
padding_pixel_mouth: 10 # Number of pixels to pad around the mouth region
|
||||||
|
vae_type: "sd-vae" # Type of VAE model to use
|
||||||
|
# Validation parameters
|
||||||
|
num_images_to_keep: 8 # Number of validation images to keep
|
||||||
|
ref_dropout_rate: 0 # Dropout rate for reference images
|
||||||
|
syncnet_config_path: "./configs/training/syncnet.yaml" # Path to SyncNet configuration
|
||||||
|
use_adapted_weight: False # Whether to use adapted weights for loss calculation
|
||||||
|
cropping_jaw2edge_margin_mean: 10 # Mean margin for jaw-to-edge cropping
|
||||||
|
cropping_jaw2edge_margin_std: 10 # Standard deviation for jaw-to-edge cropping
|
||||||
|
crop_type: "crop_resize" # Type of cropping method
|
||||||
|
random_margin_method: "normal" # Method for random margin generation
|
||||||
|
num_backward_frames: 16 # Number of frames to use for backward pass in SyncNet
|
||||||
|
|
||||||
|
data:
|
||||||
|
dataset_key: "HDTF" # Dataset to use for training
|
||||||
|
train_bs: 32 # Training batch size (actual batch size is train_bs*n_sample_frames)
|
||||||
|
image_size: 256 # Size of input images
|
||||||
|
n_sample_frames: 1 # Number of frames to sample per batch
|
||||||
|
num_workers: 8 # Number of data loading workers
|
||||||
|
audio_padding_length_left: 2 # Left padding length for audio features
|
||||||
|
audio_padding_length_right: 2 # Right padding length for audio features
|
||||||
|
sample_method: pose_similarity_and_mouth_dissimilarity # Method for sampling frames
|
||||||
|
top_k_ratio: 0.51 # Ratio for top-k sampling
|
||||||
|
contorl_face_min_size: True # Whether to control minimum face size
|
||||||
|
min_face_size: 150 # Minimum face size in pixels
|
||||||
|
|
||||||
|
loss_params:
|
||||||
|
l1_loss: 1.0 # Weight for L1 loss
|
||||||
|
vgg_loss: 0.01 # Weight for VGG perceptual loss
|
||||||
|
vgg_layer_weight: [1, 1, 1, 1, 1] # Weights for different VGG layers
|
||||||
|
pyramid_scale: [1, 0.5, 0.25, 0.125] # Scales for image pyramid
|
||||||
|
gan_loss: 0 # Weight for GAN loss
|
||||||
|
fm_loss: [1.0, 1.0, 1.0, 1.0] # Weights for feature matching loss
|
||||||
|
sync_loss: 0 # Weight for sync loss
|
||||||
|
mouth_gan_loss: 0 # Weight for mouth-specific GAN loss
|
||||||
|
|
||||||
|
model_params:
|
||||||
|
discriminator_params:
|
||||||
|
scales: [1] # Scales for discriminator
|
||||||
|
block_expansion: 32 # Expansion factor for discriminator blocks
|
||||||
|
max_features: 512 # Maximum number of features in discriminator
|
||||||
|
num_blocks: 4 # Number of blocks in discriminator
|
||||||
|
sn: True # Whether to use spectral normalization
|
||||||
|
image_channel: 3 # Number of image channels
|
||||||
|
estimate_jacobian: False # Whether to estimate Jacobian
|
||||||
|
|
||||||
|
discriminator_train_params:
|
||||||
|
lr: 0.000005 # Learning rate for discriminator
|
||||||
|
eps: 0.00000001 # Epsilon for optimizer
|
||||||
|
weight_decay: 0.01 # Weight decay for optimizer
|
||||||
|
patch_size: 1 # Size of patches for discriminator
|
||||||
|
betas: [0.5, 0.999] # Beta parameters for Adam optimizer
|
||||||
|
epochs: 10000 # Number of training epochs
|
||||||
|
start_gan: 1000 # Step to start GAN training
|
||||||
|
|
||||||
|
solver:
|
||||||
|
gradient_accumulation_steps: 1 # Number of steps for gradient accumulation
|
||||||
|
uncond_steps: 10 # Number of unconditional steps
|
||||||
|
mixed_precision: 'fp32' # Precision mode for training
|
||||||
|
enable_xformers_memory_efficient_attention: True # Whether to use memory efficient attention
|
||||||
|
gradient_checkpointing: True # Whether to use gradient checkpointing
|
||||||
|
max_train_steps: 250000 # Maximum number of training steps
|
||||||
|
max_grad_norm: 1.0 # Maximum gradient norm for clipping
|
||||||
|
# Learning rate parameters
|
||||||
|
learning_rate: 2.0e-5 # Base learning rate
|
||||||
|
scale_lr: False # Whether to scale learning rate
|
||||||
|
lr_warmup_steps: 1000 # Number of warmup steps for learning rate
|
||||||
|
lr_scheduler: "linear" # Type of learning rate scheduler
|
||||||
|
# Optimizer parameters
|
||||||
|
use_8bit_adam: False # Whether to use 8-bit Adam optimizer
|
||||||
|
adam_beta1: 0.5 # Beta1 parameter for Adam optimizer
|
||||||
|
adam_beta2: 0.999 # Beta2 parameter for Adam optimizer
|
||||||
|
adam_weight_decay: 1.0e-2 # Weight decay for Adam optimizer
|
||||||
|
adam_epsilon: 1.0e-8 # Epsilon for Adam optimizer
|
||||||
|
|
||||||
|
total_limit: 10 # Maximum number of checkpoints to keep
|
||||||
|
save_model_epoch_interval: 250000 # Interval between model saves
|
||||||
|
checkpointing_steps: 10000 # Number of steps between checkpoints
|
||||||
|
val_freq: 2000 # Frequency of validation
|
||||||
|
|
||||||
|
seed: 41 # Random seed for reproducibility
|
||||||
|
|
||||||
Executable
+89
@@ -0,0 +1,89 @@
|
|||||||
|
exp_name: 'test' # Name of the experiment
|
||||||
|
output_dir: './exp_out/stage2/' # Directory to save experiment outputs
|
||||||
|
unet_sub_folder: musetalk # Subfolder name for UNet model
|
||||||
|
random_init_unet: False # Whether to randomly initialize UNet (stage1) or use pretrained weights (stage2)
|
||||||
|
whisper_path: "./models/whisper" # Path to the Whisper model
|
||||||
|
pretrained_model_name_or_path: "./models" # Path to pretrained models
|
||||||
|
resume_from_checkpoint: True # Whether to resume training from a checkpoint
|
||||||
|
padding_pixel_mouth: 10 # Number of pixels to pad around the mouth region
|
||||||
|
vae_type: "sd-vae" # Type of VAE model to use
|
||||||
|
# Validation parameters
|
||||||
|
num_images_to_keep: 8 # Number of validation images to keep
|
||||||
|
ref_dropout_rate: 0 # Dropout rate for reference images
|
||||||
|
syncnet_config_path: "./configs/training/syncnet.yaml" # Path to SyncNet configuration
|
||||||
|
use_adapted_weight: False # Whether to use adapted weights for loss calculation
|
||||||
|
cropping_jaw2edge_margin_mean: 10 # Mean margin for jaw-to-edge cropping
|
||||||
|
cropping_jaw2edge_margin_std: 10 # Standard deviation for jaw-to-edge cropping
|
||||||
|
crop_type: "dynamic_margin_crop_resize" # Type of cropping method
|
||||||
|
random_margin_method: "normal" # Method for random margin generation
|
||||||
|
num_backward_frames: 16 # Number of frames to use for backward pass in SyncNet
|
||||||
|
|
||||||
|
data:
|
||||||
|
dataset_key: "HDTF" # Dataset to use for training
|
||||||
|
train_bs: 2 # Training batch size (actual batch size is train_bs*n_sample_frames)
|
||||||
|
image_size: 256 # Size of input images
|
||||||
|
n_sample_frames: 16 # Number of frames to sample per batch
|
||||||
|
num_workers: 8 # Number of data loading workers
|
||||||
|
audio_padding_length_left: 2 # Left padding length for audio features
|
||||||
|
audio_padding_length_right: 2 # Right padding length for audio features
|
||||||
|
sample_method: pose_similarity_and_mouth_dissimilarity # Method for sampling frames
|
||||||
|
top_k_ratio: 0.51 # Ratio for top-k sampling
|
||||||
|
contorl_face_min_size: True # Whether to control minimum face size
|
||||||
|
min_face_size: 200 # Minimum face size in pixels
|
||||||
|
|
||||||
|
loss_params:
|
||||||
|
l1_loss: 1.0 # Weight for L1 loss
|
||||||
|
vgg_loss: 0.01 # Weight for VGG perceptual loss
|
||||||
|
vgg_layer_weight: [1, 1, 1, 1, 1] # Weights for different VGG layers
|
||||||
|
pyramid_scale: [1, 0.5, 0.25, 0.125] # Scales for image pyramid
|
||||||
|
gan_loss: 0.01 # Weight for GAN loss
|
||||||
|
fm_loss: [1.0, 1.0, 1.0, 1.0] # Weights for feature matching loss
|
||||||
|
sync_loss: 0.05 # Weight for sync loss
|
||||||
|
mouth_gan_loss: 0.01 # Weight for mouth-specific GAN loss
|
||||||
|
|
||||||
|
model_params:
|
||||||
|
discriminator_params:
|
||||||
|
scales: [1] # Scales for discriminator
|
||||||
|
block_expansion: 32 # Expansion factor for discriminator blocks
|
||||||
|
max_features: 512 # Maximum number of features in discriminator
|
||||||
|
num_blocks: 4 # Number of blocks in discriminator
|
||||||
|
sn: True # Whether to use spectral normalization
|
||||||
|
image_channel: 3 # Number of image channels
|
||||||
|
estimate_jacobian: False # Whether to estimate Jacobian
|
||||||
|
|
||||||
|
discriminator_train_params:
|
||||||
|
lr: 0.000005 # Learning rate for discriminator
|
||||||
|
eps: 0.00000001 # Epsilon for optimizer
|
||||||
|
weight_decay: 0.01 # Weight decay for optimizer
|
||||||
|
patch_size: 1 # Size of patches for discriminator
|
||||||
|
betas: [0.5, 0.999] # Beta parameters for Adam optimizer
|
||||||
|
epochs: 10000 # Number of training epochs
|
||||||
|
start_gan: 1000 # Step to start GAN training
|
||||||
|
|
||||||
|
solver:
|
||||||
|
gradient_accumulation_steps: 8 # Number of steps for gradient accumulation
|
||||||
|
uncond_steps: 10 # Number of unconditional steps
|
||||||
|
mixed_precision: 'fp32' # Precision mode for training
|
||||||
|
enable_xformers_memory_efficient_attention: True # Whether to use memory efficient attention
|
||||||
|
gradient_checkpointing: True # Whether to use gradient checkpointing
|
||||||
|
max_train_steps: 250000 # Maximum number of training steps
|
||||||
|
max_grad_norm: 1.0 # Maximum gradient norm for clipping
|
||||||
|
# Learning rate parameters
|
||||||
|
learning_rate: 5.0e-6 # Base learning rate
|
||||||
|
scale_lr: False # Whether to scale learning rate
|
||||||
|
lr_warmup_steps: 1000 # Number of warmup steps for learning rate
|
||||||
|
lr_scheduler: "linear" # Type of learning rate scheduler
|
||||||
|
# Optimizer parameters
|
||||||
|
use_8bit_adam: False # Whether to use 8-bit Adam optimizer
|
||||||
|
adam_beta1: 0.5 # Beta1 parameter for Adam optimizer
|
||||||
|
adam_beta2: 0.999 # Beta2 parameter for Adam optimizer
|
||||||
|
adam_weight_decay: 1.0e-2 # Weight decay for Adam optimizer
|
||||||
|
adam_epsilon: 1.0e-8 # Epsilon for Adam optimizer
|
||||||
|
|
||||||
|
total_limit: 10 # Maximum number of checkpoints to keep
|
||||||
|
save_model_epoch_interval: 250000 # Interval between model saves
|
||||||
|
checkpointing_steps: 2000 # Number of steps between checkpoints
|
||||||
|
val_freq: 2000 # Frequency of validation
|
||||||
|
|
||||||
|
seed: 41 # Random seed for reproducibility
|
||||||
|
|
||||||
@@ -0,0 +1,19 @@
|
|||||||
|
# This file is modified from LatentSync (https://github.com/bytedance/LatentSync/blob/main/latentsync/configs/training/syncnet_16_pixel.yaml).
|
||||||
|
model:
|
||||||
|
audio_encoder: # input (1, 80, 52)
|
||||||
|
in_channels: 1
|
||||||
|
block_out_channels: [32, 64, 128, 256, 512, 1024, 2048]
|
||||||
|
downsample_factors: [[2, 1], 2, 2, 1, 2, 2, [2, 3]]
|
||||||
|
attn_blocks: [0, 0, 0, 0, 0, 0, 0]
|
||||||
|
dropout: 0.0
|
||||||
|
visual_encoder: # input (48, 128, 256)
|
||||||
|
in_channels: 48
|
||||||
|
block_out_channels: [64, 128, 256, 256, 512, 1024, 2048, 2048]
|
||||||
|
downsample_factors: [[1, 2], 2, 2, 2, 2, 2, 2, 2]
|
||||||
|
attn_blocks: [0, 0, 0, 0, 0, 0, 0, 0]
|
||||||
|
dropout: 0.0
|
||||||
|
|
||||||
|
ckpt:
|
||||||
|
resume_ckpt_path: ""
|
||||||
|
inference_ckpt_path: ./models/syncnet/latentsync_syncnet.pt # this pretrained model is from LatentSync (https://huggingface.co/ByteDance/LatentSync/tree/main)
|
||||||
|
save_ckpt_steps: 2500
|
||||||
Executable
BIN
Binary file not shown.
@@ -0,0 +1,41 @@
|
|||||||
|
@echo off
|
||||||
|
setlocal
|
||||||
|
|
||||||
|
:: Set the checkpoints directory
|
||||||
|
set CheckpointsDir=models
|
||||||
|
|
||||||
|
:: Create necessary directories
|
||||||
|
mkdir %CheckpointsDir%\musetalk
|
||||||
|
mkdir %CheckpointsDir%\musetalkV15
|
||||||
|
mkdir %CheckpointsDir%\syncnet
|
||||||
|
mkdir %CheckpointsDir%\dwpose
|
||||||
|
mkdir %CheckpointsDir%\face-parse-bisent
|
||||||
|
mkdir %CheckpointsDir%\sd-vae-ft-mse
|
||||||
|
mkdir %CheckpointsDir%\whisper
|
||||||
|
|
||||||
|
:: Install required packages
|
||||||
|
pip install -U "huggingface_hub[hf_xet]"
|
||||||
|
|
||||||
|
:: Set HuggingFace endpoint
|
||||||
|
set HF_ENDPOINT=https://hf-mirror.com
|
||||||
|
|
||||||
|
:: Download MuseTalk weights
|
||||||
|
hf download TMElyralab/MuseTalk --local-dir %CheckpointsDir%
|
||||||
|
|
||||||
|
:: Download SD VAE weights
|
||||||
|
hf download stabilityai/sd-vae-ft-mse --local-dir %CheckpointsDir%\sd-vae --include "config.json" "diffusion_pytorch_model.bin"
|
||||||
|
|
||||||
|
:: Download Whisper weights
|
||||||
|
hf download openai/whisper-tiny --local-dir %CheckpointsDir%\whisper --include "config.json" "pytorch_model.bin" "preprocessor_config.json"
|
||||||
|
|
||||||
|
:: Download DWPose weights
|
||||||
|
hf download yzd-v/DWPose --local-dir %CheckpointsDir%\dwpose --include "dw-ll_ucoco_384.pth"
|
||||||
|
|
||||||
|
:: Download SyncNet weights
|
||||||
|
hf download ByteDance/LatentSync --local-dir %CheckpointsDir%\syncnet --include "latentsync_syncnet.pt"
|
||||||
|
|
||||||
|
:: Download face-parse-bisent weights
|
||||||
|
hf download ManyOtherFunctions/face-parse-bisent --local-dir %CheckpointsDir%\face-parse-bisent --include "79999_iter.pth" "resnet18-5c106cde.pth"
|
||||||
|
|
||||||
|
echo All weights have been downloaded successfully!
|
||||||
|
endlocal
|
||||||
@@ -0,0 +1,51 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
# Set the checkpoints directory
|
||||||
|
CheckpointsDir="models"
|
||||||
|
|
||||||
|
# Create necessary directories
|
||||||
|
mkdir -p models/musetalk models/musetalkV15 models/syncnet models/dwpose models/face-parse-bisent models/sd-vae models/whisper
|
||||||
|
|
||||||
|
# Install required packages
|
||||||
|
pip install -U "huggingface_hub[cli]"
|
||||||
|
pip install gdown
|
||||||
|
|
||||||
|
# Set HuggingFace mirror endpoint
|
||||||
|
export HF_ENDPOINT=https://hf-mirror.com
|
||||||
|
|
||||||
|
# Download MuseTalk V1.0 weights
|
||||||
|
huggingface-cli download TMElyralab/MuseTalk \
|
||||||
|
--local-dir $CheckpointsDir \
|
||||||
|
--include "musetalk/musetalk.json" "musetalk/pytorch_model.bin"
|
||||||
|
|
||||||
|
# Download MuseTalk V1.5 weights (unet.pth)
|
||||||
|
huggingface-cli download TMElyralab/MuseTalk \
|
||||||
|
--local-dir $CheckpointsDir \
|
||||||
|
--include "musetalkV15/musetalk.json" "musetalkV15/unet.pth"
|
||||||
|
|
||||||
|
# Download SD VAE weights
|
||||||
|
huggingface-cli download stabilityai/sd-vae-ft-mse \
|
||||||
|
--local-dir $CheckpointsDir/sd-vae \
|
||||||
|
--include "config.json" "diffusion_pytorch_model.bin"
|
||||||
|
|
||||||
|
# Download Whisper weights
|
||||||
|
huggingface-cli download openai/whisper-tiny \
|
||||||
|
--local-dir $CheckpointsDir/whisper \
|
||||||
|
--include "config.json" "pytorch_model.bin" "preprocessor_config.json"
|
||||||
|
|
||||||
|
# Download DWPose weights
|
||||||
|
huggingface-cli download yzd-v/DWPose \
|
||||||
|
--local-dir $CheckpointsDir/dwpose \
|
||||||
|
--include "dw-ll_ucoco_384.pth"
|
||||||
|
|
||||||
|
# Download SyncNet weights
|
||||||
|
huggingface-cli download ByteDance/LatentSync \
|
||||||
|
--local-dir $CheckpointsDir/syncnet \
|
||||||
|
--include "latentsync_syncnet.pt"
|
||||||
|
|
||||||
|
# Download Face Parse Bisent weights
|
||||||
|
gdown --id 154JgKpzCPW82qINcVieuPH3fZ2e0P812 -O $CheckpointsDir/face-parse-bisent/79999_iter.pth
|
||||||
|
curl -L https://download.pytorch.org/models/resnet18-5c106cde.pth \
|
||||||
|
-o $CheckpointsDir/face-parse-bisent/resnet18-5c106cde.pth
|
||||||
|
|
||||||
|
echo "✅ All weights have been downloaded successfully!"
|
||||||
@@ -0,0 +1,72 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
# This script runs inference based on the version and mode specified by the user.
|
||||||
|
# Usage:
|
||||||
|
# To run v1.0 inference: sh inference.sh v1.0 [normal|realtime]
|
||||||
|
# To run v1.5 inference: sh inference.sh v1.5 [normal|realtime]
|
||||||
|
|
||||||
|
# Check if the correct number of arguments is provided
|
||||||
|
if [ "$#" -ne 2 ]; then
|
||||||
|
echo "Usage: $0 <version> <mode>"
|
||||||
|
echo "Example: $0 v1.0 normal or $0 v1.5 realtime"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Get the version and mode from the user input
|
||||||
|
version=$1
|
||||||
|
mode=$2
|
||||||
|
|
||||||
|
# Validate mode
|
||||||
|
if [ "$mode" != "normal" ] && [ "$mode" != "realtime" ]; then
|
||||||
|
echo "Invalid mode specified. Please use 'normal' or 'realtime'."
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Set config path based on mode
|
||||||
|
if [ "$mode" = "normal" ]; then
|
||||||
|
config_path="./configs/inference/test.yaml"
|
||||||
|
result_dir="./results/test"
|
||||||
|
else
|
||||||
|
config_path="./configs/inference/realtime.yaml"
|
||||||
|
result_dir="./results/realtime"
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Define the model paths based on the version
|
||||||
|
if [ "$version" = "v1.0" ]; then
|
||||||
|
model_dir="./models/musetalk"
|
||||||
|
unet_model_path="$model_dir/pytorch_model.bin"
|
||||||
|
unet_config="$model_dir/musetalk.json"
|
||||||
|
version_arg="v1"
|
||||||
|
elif [ "$version" = "v1.5" ]; then
|
||||||
|
model_dir="./models/musetalkV15"
|
||||||
|
unet_model_path="$model_dir/unet.pth"
|
||||||
|
unet_config="$model_dir/musetalk.json"
|
||||||
|
version_arg="v15"
|
||||||
|
else
|
||||||
|
echo "Invalid version specified. Please use v1.0 or v1.5."
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Set script name based on mode
|
||||||
|
if [ "$mode" = "normal" ]; then
|
||||||
|
script_name="scripts.inference"
|
||||||
|
else
|
||||||
|
script_name="scripts.realtime_inference"
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Base command arguments
|
||||||
|
cmd_args="--inference_config $config_path \
|
||||||
|
--result_dir $result_dir \
|
||||||
|
--unet_model_path $unet_model_path \
|
||||||
|
--unet_config $unet_config \
|
||||||
|
--version $version_arg"
|
||||||
|
|
||||||
|
# Add realtime-specific arguments if in realtime mode
|
||||||
|
if [ "$mode" = "realtime" ]; then
|
||||||
|
cmd_args="$cmd_args \
|
||||||
|
--fps 25 \
|
||||||
|
--version $version_arg"
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Run inference
|
||||||
|
python3 -m $script_name $cmd_args
|
||||||
Executable
+168
@@ -0,0 +1,168 @@
|
|||||||
|
import librosa
|
||||||
|
import librosa.filters
|
||||||
|
import numpy as np
|
||||||
|
from scipy import signal
|
||||||
|
from scipy.io import wavfile
|
||||||
|
|
||||||
|
class HParams:
|
||||||
|
# copy from wav2lip
|
||||||
|
def __init__(self):
|
||||||
|
self.n_fft = 800
|
||||||
|
self.hop_size = 200
|
||||||
|
self.win_size = 800
|
||||||
|
self.sample_rate = 16000
|
||||||
|
self.frame_shift_ms = None
|
||||||
|
self.signal_normalization = True
|
||||||
|
|
||||||
|
self.allow_clipping_in_normalization = True
|
||||||
|
self.symmetric_mels = True
|
||||||
|
self.max_abs_value = 4.0
|
||||||
|
self.preemphasize = True
|
||||||
|
self.preemphasis = 0.97
|
||||||
|
self.min_level_db = -100
|
||||||
|
self.ref_level_db = 20
|
||||||
|
self.fmin = 55
|
||||||
|
self.fmax=7600
|
||||||
|
|
||||||
|
self.use_lws=False
|
||||||
|
self.num_mels=80 # Number of mel-spectrogram channels and local conditioning dimensionality
|
||||||
|
self.rescale=True # Whether to rescale audio prior to preprocessing
|
||||||
|
self.rescaling_max=0.9 # Rescaling value
|
||||||
|
self.use_lws=False
|
||||||
|
|
||||||
|
|
||||||
|
hp = HParams()
|
||||||
|
|
||||||
|
def load_wav(path, sr):
|
||||||
|
return librosa.core.load(path, sr=sr)[0]
|
||||||
|
#def load_wav(path, sr):
|
||||||
|
# audio, sr_native = sf.read(path)
|
||||||
|
# if sr != sr_native:
|
||||||
|
# audio = librosa.resample(audio.T, sr_native, sr).T
|
||||||
|
# return audio
|
||||||
|
|
||||||
|
def save_wav(wav, path, sr):
|
||||||
|
wav *= 32767 / max(0.01, np.max(np.abs(wav)))
|
||||||
|
#proposed by @dsmiller
|
||||||
|
wavfile.write(path, sr, wav.astype(np.int16))
|
||||||
|
|
||||||
|
def save_wavenet_wav(wav, path, sr):
|
||||||
|
librosa.output.write_wav(path, wav, sr=sr)
|
||||||
|
|
||||||
|
def preemphasis(wav, k, preemphasize=True):
|
||||||
|
if preemphasize:
|
||||||
|
return signal.lfilter([1, -k], [1], wav)
|
||||||
|
return wav
|
||||||
|
|
||||||
|
def inv_preemphasis(wav, k, inv_preemphasize=True):
|
||||||
|
if inv_preemphasize:
|
||||||
|
return signal.lfilter([1], [1, -k], wav)
|
||||||
|
return wav
|
||||||
|
|
||||||
|
def get_hop_size():
|
||||||
|
hop_size = hp.hop_size
|
||||||
|
if hop_size is None:
|
||||||
|
assert hp.frame_shift_ms is not None
|
||||||
|
hop_size = int(hp.frame_shift_ms / 1000 * hp.sample_rate)
|
||||||
|
return hop_size
|
||||||
|
|
||||||
|
def linearspectrogram(wav):
|
||||||
|
D = _stft(preemphasis(wav, hp.preemphasis, hp.preemphasize))
|
||||||
|
S = _amp_to_db(np.abs(D)) - hp.ref_level_db
|
||||||
|
|
||||||
|
if hp.signal_normalization:
|
||||||
|
return _normalize(S)
|
||||||
|
return S
|
||||||
|
|
||||||
|
def melspectrogram(wav):
|
||||||
|
D = _stft(preemphasis(wav, hp.preemphasis, hp.preemphasize))
|
||||||
|
S = _amp_to_db(_linear_to_mel(np.abs(D))) - hp.ref_level_db
|
||||||
|
|
||||||
|
if hp.signal_normalization:
|
||||||
|
return _normalize(S)
|
||||||
|
return S
|
||||||
|
|
||||||
|
def _lws_processor():
|
||||||
|
import lws
|
||||||
|
return lws.lws(hp.n_fft, get_hop_size(), fftsize=hp.win_size, mode="speech")
|
||||||
|
|
||||||
|
def _stft(y):
|
||||||
|
if hp.use_lws:
|
||||||
|
return _lws_processor(hp).stft(y).T
|
||||||
|
else:
|
||||||
|
return librosa.stft(y=y, n_fft=hp.n_fft, hop_length=get_hop_size(), win_length=hp.win_size)
|
||||||
|
|
||||||
|
##########################################################
|
||||||
|
#Those are only correct when using lws!!! (This was messing with Wavenet quality for a long time!)
|
||||||
|
def num_frames(length, fsize, fshift):
|
||||||
|
"""Compute number of time frames of spectrogram
|
||||||
|
"""
|
||||||
|
pad = (fsize - fshift)
|
||||||
|
if length % fshift == 0:
|
||||||
|
M = (length + pad * 2 - fsize) // fshift + 1
|
||||||
|
else:
|
||||||
|
M = (length + pad * 2 - fsize) // fshift + 2
|
||||||
|
return M
|
||||||
|
|
||||||
|
|
||||||
|
def pad_lr(x, fsize, fshift):
|
||||||
|
"""Compute left and right padding
|
||||||
|
"""
|
||||||
|
M = num_frames(len(x), fsize, fshift)
|
||||||
|
pad = (fsize - fshift)
|
||||||
|
T = len(x) + 2 * pad
|
||||||
|
r = (M - 1) * fshift + fsize - T
|
||||||
|
return pad, pad + r
|
||||||
|
##########################################################
|
||||||
|
#Librosa correct padding
|
||||||
|
def librosa_pad_lr(x, fsize, fshift):
|
||||||
|
return 0, (x.shape[0] // fshift + 1) * fshift - x.shape[0]
|
||||||
|
|
||||||
|
# Conversions
|
||||||
|
_mel_basis = None
|
||||||
|
|
||||||
|
def _linear_to_mel(spectogram):
|
||||||
|
global _mel_basis
|
||||||
|
if _mel_basis is None:
|
||||||
|
_mel_basis = _build_mel_basis()
|
||||||
|
return np.dot(_mel_basis, spectogram)
|
||||||
|
|
||||||
|
def _build_mel_basis():
|
||||||
|
assert hp.fmax <= hp.sample_rate // 2
|
||||||
|
return librosa.filters.mel(sr=hp.sample_rate, n_fft=hp.n_fft, n_mels=hp.num_mels,
|
||||||
|
fmin=hp.fmin, fmax=hp.fmax)
|
||||||
|
|
||||||
|
def _amp_to_db(x):
|
||||||
|
min_level = np.exp(hp.min_level_db / 20 * np.log(10))
|
||||||
|
return 20 * np.log10(np.maximum(min_level, x))
|
||||||
|
|
||||||
|
def _db_to_amp(x):
|
||||||
|
return np.power(10.0, (x) * 0.05)
|
||||||
|
|
||||||
|
def _normalize(S):
|
||||||
|
if hp.allow_clipping_in_normalization:
|
||||||
|
if hp.symmetric_mels:
|
||||||
|
return np.clip((2 * hp.max_abs_value) * ((S - hp.min_level_db) / (-hp.min_level_db)) - hp.max_abs_value,
|
||||||
|
-hp.max_abs_value, hp.max_abs_value)
|
||||||
|
else:
|
||||||
|
return np.clip(hp.max_abs_value * ((S - hp.min_level_db) / (-hp.min_level_db)), 0, hp.max_abs_value)
|
||||||
|
|
||||||
|
assert S.max() <= 0 and S.min() - hp.min_level_db >= 0
|
||||||
|
if hp.symmetric_mels:
|
||||||
|
return (2 * hp.max_abs_value) * ((S - hp.min_level_db) / (-hp.min_level_db)) - hp.max_abs_value
|
||||||
|
else:
|
||||||
|
return hp.max_abs_value * ((S - hp.min_level_db) / (-hp.min_level_db))
|
||||||
|
|
||||||
|
def _denormalize(D):
|
||||||
|
if hp.allow_clipping_in_normalization:
|
||||||
|
if hp.symmetric_mels:
|
||||||
|
return (((np.clip(D, -hp.max_abs_value,
|
||||||
|
hp.max_abs_value) + hp.max_abs_value) * -hp.min_level_db / (2 * hp.max_abs_value))
|
||||||
|
+ hp.min_level_db)
|
||||||
|
else:
|
||||||
|
return ((np.clip(D, 0, hp.max_abs_value) * -hp.min_level_db / hp.max_abs_value) + hp.min_level_db)
|
||||||
|
|
||||||
|
if hp.symmetric_mels:
|
||||||
|
return (((D + hp.max_abs_value) * -hp.min_level_db / (2 * hp.max_abs_value)) + hp.min_level_db)
|
||||||
|
else:
|
||||||
|
return ((D * -hp.min_level_db / hp.max_abs_value) + hp.min_level_db)
|
||||||
Executable
+610
@@ -0,0 +1,610 @@
|
|||||||
|
import os
|
||||||
|
import numpy as np
|
||||||
|
import random
|
||||||
|
from PIL import Image
|
||||||
|
import torch
|
||||||
|
from torch.utils.data import Dataset, ConcatDataset
|
||||||
|
import torchvision.transforms as transforms
|
||||||
|
from transformers import AutoFeatureExtractor
|
||||||
|
import librosa
|
||||||
|
import time
|
||||||
|
import json
|
||||||
|
import math
|
||||||
|
from decord import AudioReader, VideoReader
|
||||||
|
from decord.ndarray import cpu
|
||||||
|
|
||||||
|
from musetalk.data.sample_method import get_src_idx, shift_landmarks_to_face_coordinates, resize_landmark
|
||||||
|
from musetalk.data import audio
|
||||||
|
from musetalk.utils.audio_utils import ensure_wav
|
||||||
|
|
||||||
|
syncnet_mel_step_size = math.ceil(16 / 5 * 16) # latentsync
|
||||||
|
|
||||||
|
|
||||||
|
class FaceDataset(Dataset):
|
||||||
|
"""Dataset class for loading and processing video data
|
||||||
|
|
||||||
|
Each video can be represented as:
|
||||||
|
- Concatenated frame images
|
||||||
|
- '.mp4' or '.gif' files
|
||||||
|
- Folder containing all frames
|
||||||
|
"""
|
||||||
|
def __init__(self,
|
||||||
|
cfg,
|
||||||
|
list_paths,
|
||||||
|
root_path='./dataset/',
|
||||||
|
repeats=None):
|
||||||
|
# Initialize dataset paths
|
||||||
|
meta_paths = []
|
||||||
|
if repeats is None:
|
||||||
|
repeats = [1] * len(list_paths)
|
||||||
|
assert len(repeats) == len(list_paths)
|
||||||
|
|
||||||
|
# Load data list
|
||||||
|
for list_path, repeat_time in zip(list_paths, repeats):
|
||||||
|
with open(list_path, 'r') as f:
|
||||||
|
num = 0
|
||||||
|
f.readline() # Skip header line
|
||||||
|
for line in f.readlines():
|
||||||
|
line_info = line.strip()
|
||||||
|
meta = line_info.split()
|
||||||
|
meta = meta[0]
|
||||||
|
meta_paths.extend([os.path.join(root_path, meta)] * repeat_time)
|
||||||
|
num += 1
|
||||||
|
print(f'{list_path}: {num} x {repeat_time} = {num * repeat_time} samples')
|
||||||
|
|
||||||
|
# Set basic attributes
|
||||||
|
self.meta_paths = meta_paths
|
||||||
|
self.root_path = root_path
|
||||||
|
self.image_size = cfg['image_size']
|
||||||
|
self.min_face_size = cfg['min_face_size']
|
||||||
|
self.T = cfg['T']
|
||||||
|
self.sample_method = cfg['sample_method']
|
||||||
|
self.top_k_ratio = cfg['top_k_ratio']
|
||||||
|
self.max_attempts = 200
|
||||||
|
self.padding_pixel_mouth = cfg['padding_pixel_mouth']
|
||||||
|
|
||||||
|
# Cropping related parameters
|
||||||
|
self.crop_type = cfg['crop_type']
|
||||||
|
self.jaw2edge_margin_mean = cfg['cropping_jaw2edge_margin_mean']
|
||||||
|
self.jaw2edge_margin_std = cfg['cropping_jaw2edge_margin_std']
|
||||||
|
self.random_margin_method = cfg['random_margin_method']
|
||||||
|
|
||||||
|
# Image transformations
|
||||||
|
self.to_tensor = transforms.Compose([
|
||||||
|
transforms.ToTensor(),
|
||||||
|
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
|
||||||
|
])
|
||||||
|
self.pose_to_tensor = transforms.Compose([
|
||||||
|
transforms.ToTensor(),
|
||||||
|
])
|
||||||
|
|
||||||
|
# Feature extractor
|
||||||
|
self.feature_extractor = AutoFeatureExtractor.from_pretrained(cfg['whisper_path'])
|
||||||
|
self.contorl_face_min_size = cfg["contorl_face_min_size"]
|
||||||
|
|
||||||
|
print("The sample method is: ", self.sample_method)
|
||||||
|
print(f"only use face size > {self.min_face_size}", self.contorl_face_min_size)
|
||||||
|
|
||||||
|
def generate_random_value(self):
|
||||||
|
"""Generate random value
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
float: Generated random value
|
||||||
|
"""
|
||||||
|
if self.random_margin_method == "uniform":
|
||||||
|
random_value = np.random.uniform(
|
||||||
|
self.jaw2edge_margin_mean - self.jaw2edge_margin_std,
|
||||||
|
self.jaw2edge_margin_mean + self.jaw2edge_margin_std
|
||||||
|
)
|
||||||
|
elif self.random_margin_method == "normal":
|
||||||
|
random_value = np.random.normal(
|
||||||
|
loc=self.jaw2edge_margin_mean,
|
||||||
|
scale=self.jaw2edge_margin_std
|
||||||
|
)
|
||||||
|
random_value = np.clip(
|
||||||
|
random_value,
|
||||||
|
self.jaw2edge_margin_mean - self.jaw2edge_margin_std,
|
||||||
|
self.jaw2edge_margin_mean + self.jaw2edge_margin_std,
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
raise ValueError(f"Invalid random margin method: {self.random_margin_method}")
|
||||||
|
return max(0, random_value)
|
||||||
|
|
||||||
|
def dynamic_margin_crop(self, img, original_bbox, extra_margin=None):
|
||||||
|
"""Dynamically crop image with dynamic margin
|
||||||
|
|
||||||
|
Args:
|
||||||
|
img: Input image
|
||||||
|
original_bbox: Original bounding box
|
||||||
|
extra_margin: Extra margin
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
tuple: (x1, y1, x2, y2, extra_margin)
|
||||||
|
"""
|
||||||
|
if extra_margin is None:
|
||||||
|
extra_margin = self.generate_random_value()
|
||||||
|
w, h = img.size
|
||||||
|
x1, y1, x2, y2 = original_bbox
|
||||||
|
y2 = min(y2 + int(extra_margin), h)
|
||||||
|
return x1, y1, x2, y2, extra_margin
|
||||||
|
|
||||||
|
def crop_resize_img(self, img, bbox, crop_type='crop_resize', extra_margin=None):
|
||||||
|
"""Crop and resize image
|
||||||
|
|
||||||
|
Args:
|
||||||
|
img: Input image
|
||||||
|
bbox: Bounding box
|
||||||
|
crop_type: Type of cropping
|
||||||
|
extra_margin: Extra margin
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
tuple: (Processed image, extra_margin, mask_scaled_factor)
|
||||||
|
"""
|
||||||
|
mask_scaled_factor = 1.
|
||||||
|
if crop_type == 'crop_resize':
|
||||||
|
x1, y1, x2, y2 = bbox
|
||||||
|
img = img.crop((x1, y1, x2, y2))
|
||||||
|
img = img.resize((self.image_size, self.image_size), Image.LANCZOS)
|
||||||
|
elif crop_type == 'dynamic_margin_crop_resize':
|
||||||
|
x1, y1, x2, y2, extra_margin = self.dynamic_margin_crop(img, bbox, extra_margin)
|
||||||
|
w_original, _ = img.size
|
||||||
|
img = img.crop((x1, y1, x2, y2))
|
||||||
|
w_cropped, _ = img.size
|
||||||
|
mask_scaled_factor = w_cropped / w_original
|
||||||
|
img = img.resize((self.image_size, self.image_size), Image.LANCZOS)
|
||||||
|
elif crop_type == 'resize':
|
||||||
|
w, h = img.size
|
||||||
|
scale = np.sqrt(self.image_size ** 2 / (h * w))
|
||||||
|
new_w = int(w * scale) / 64 * 64
|
||||||
|
new_h = int(h * scale) / 64 * 64
|
||||||
|
img = img.resize((new_w, new_h), Image.LANCZOS)
|
||||||
|
return img, extra_margin, mask_scaled_factor
|
||||||
|
|
||||||
|
def get_audio_file(self, wav_path, start_index):
|
||||||
|
"""Get audio file features
|
||||||
|
|
||||||
|
Args:
|
||||||
|
wav_path: Audio file path
|
||||||
|
start_index: Starting index
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
tuple: (Audio features, start index)
|
||||||
|
"""
|
||||||
|
if not os.path.exists(wav_path):
|
||||||
|
return None
|
||||||
|
wav_path_converted = ensure_wav(wav_path)
|
||||||
|
audio_input_librosa, sampling_rate = librosa.load(wav_path_converted, sr=16000)
|
||||||
|
assert sampling_rate == 16000
|
||||||
|
|
||||||
|
while start_index >= 25 * 30:
|
||||||
|
audio_input = audio_input_librosa[16000*30:]
|
||||||
|
start_index -= 25 * 30
|
||||||
|
if start_index + 2 * 25 >= 25 * 30:
|
||||||
|
start_index -= 4 * 25
|
||||||
|
audio_input = audio_input_librosa[16000*4:16000*34]
|
||||||
|
else:
|
||||||
|
audio_input = audio_input_librosa[:16000*30]
|
||||||
|
|
||||||
|
assert 2 * (start_index) >= 0
|
||||||
|
assert 2 * (start_index + 2 * 25) <= 1500
|
||||||
|
|
||||||
|
audio_input = self.feature_extractor(
|
||||||
|
audio_input,
|
||||||
|
return_tensors="pt",
|
||||||
|
sampling_rate=sampling_rate
|
||||||
|
).input_features
|
||||||
|
return audio_input, start_index
|
||||||
|
|
||||||
|
def get_audio_file_mel(self, wav_path, start_index):
|
||||||
|
"""Get mel spectrogram of audio file
|
||||||
|
|
||||||
|
Args:
|
||||||
|
wav_path: Audio file path
|
||||||
|
start_index: Starting index
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
tuple: (Mel spectrogram, start index)
|
||||||
|
"""
|
||||||
|
if not os.path.exists(wav_path):
|
||||||
|
return None
|
||||||
|
|
||||||
|
wav_path_converted = ensure_wav(wav_path)
|
||||||
|
audio_input_librosa, sampling_rate = librosa.load(wav_path_converted, sr=16000)
|
||||||
|
assert sampling_rate == 16000
|
||||||
|
|
||||||
|
audio_mel = self.mel_feature_extractor(audio_input_librosa)
|
||||||
|
return audio_mel, start_index
|
||||||
|
|
||||||
|
def mel_feature_extractor(self, audio_input):
|
||||||
|
"""Extract mel spectrogram features
|
||||||
|
|
||||||
|
Args:
|
||||||
|
audio_input: Input audio
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
ndarray: Mel spectrogram features
|
||||||
|
"""
|
||||||
|
orig_mel = audio.melspectrogram(audio_input)
|
||||||
|
return orig_mel.T
|
||||||
|
|
||||||
|
def crop_audio_window(self, spec, start_frame_num, fps=25):
|
||||||
|
"""Crop audio window
|
||||||
|
|
||||||
|
Args:
|
||||||
|
spec: Spectrogram
|
||||||
|
start_frame_num: Starting frame number
|
||||||
|
fps: Frames per second
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
ndarray: Cropped spectrogram
|
||||||
|
"""
|
||||||
|
start_idx = int(80. * (start_frame_num / float(fps)))
|
||||||
|
end_idx = start_idx + syncnet_mel_step_size
|
||||||
|
return spec[start_idx: end_idx, :]
|
||||||
|
|
||||||
|
def get_syncnet_input(self, video_path):
|
||||||
|
"""Get SyncNet input features
|
||||||
|
|
||||||
|
Args:
|
||||||
|
video_path: Video file path
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
ndarray: SyncNet input features
|
||||||
|
"""
|
||||||
|
ar = AudioReader(video_path, sample_rate=16000)
|
||||||
|
original_mel = audio.melspectrogram(ar[:].asnumpy().squeeze(0))
|
||||||
|
return original_mel.T
|
||||||
|
|
||||||
|
def get_resized_mouth_mask(
|
||||||
|
self,
|
||||||
|
img_resized,
|
||||||
|
landmark_array,
|
||||||
|
face_shape,
|
||||||
|
padding_pixel_mouth=0,
|
||||||
|
image_size=256,
|
||||||
|
crop_margin=0
|
||||||
|
):
|
||||||
|
landmark_array = np.array(landmark_array)
|
||||||
|
resized_landmark = resize_landmark(
|
||||||
|
landmark_array, w=face_shape[0], h=face_shape[1], new_w=image_size, new_h=image_size)
|
||||||
|
|
||||||
|
landmark_array = np.array(resized_landmark[48 : 67]) # the lip landmarks in 68 landmarks format
|
||||||
|
min_x, min_y = np.min(landmark_array, axis=0)
|
||||||
|
max_x, max_y = np.max(landmark_array, axis=0)
|
||||||
|
min_x = min_x - padding_pixel_mouth
|
||||||
|
max_x = max_x + padding_pixel_mouth
|
||||||
|
|
||||||
|
# Calculate x-axis length and use it for y-axis
|
||||||
|
width = max_x - min_x
|
||||||
|
|
||||||
|
# Calculate old center point
|
||||||
|
center_y = (max_y + min_y) / 2
|
||||||
|
|
||||||
|
# Determine new min_y and max_y based on width
|
||||||
|
min_y = center_y - width / 4
|
||||||
|
max_y = center_y + width / 4
|
||||||
|
|
||||||
|
# Adjust mask position for dynamic crop, shift y-axis
|
||||||
|
min_y = min_y - crop_margin
|
||||||
|
max_y = max_y - crop_margin
|
||||||
|
|
||||||
|
# Prevent out of bounds
|
||||||
|
min_x = max(min_x, 0)
|
||||||
|
min_y = max(min_y, 0)
|
||||||
|
max_x = min(max_x, face_shape[0])
|
||||||
|
max_y = min(max_y, face_shape[1])
|
||||||
|
|
||||||
|
mask = np.zeros_like(np.array(img_resized))
|
||||||
|
mask[round(min_y):round(max_y), round(min_x):round(max_x)] = 255
|
||||||
|
return Image.fromarray(mask)
|
||||||
|
|
||||||
|
def __len__(self):
|
||||||
|
return 100000
|
||||||
|
|
||||||
|
def __getitem__(self, idx):
|
||||||
|
attempts = 0
|
||||||
|
while attempts < self.max_attempts:
|
||||||
|
try:
|
||||||
|
meta_path = random.sample(self.meta_paths, k=1)[0]
|
||||||
|
with open(meta_path, 'r') as f:
|
||||||
|
meta_data = json.load(f)
|
||||||
|
except Exception as e:
|
||||||
|
print(f"meta file error:{meta_path}")
|
||||||
|
print(e)
|
||||||
|
attempts += 1
|
||||||
|
time.sleep(0.1)
|
||||||
|
continue
|
||||||
|
|
||||||
|
video_path = meta_data["mp4_path"]
|
||||||
|
wav_path = meta_data["wav_path"]
|
||||||
|
bbox_list = meta_data["face_list"]
|
||||||
|
landmark_list = meta_data["landmark_list"]
|
||||||
|
T = self.T
|
||||||
|
|
||||||
|
s = 0
|
||||||
|
e = meta_data["frames"]
|
||||||
|
len_valid_clip = e - s
|
||||||
|
|
||||||
|
if len_valid_clip < T * 10:
|
||||||
|
attempts += 1
|
||||||
|
print(f"video {video_path} has less than {T * 10} frames")
|
||||||
|
continue
|
||||||
|
|
||||||
|
try:
|
||||||
|
cap = VideoReader(video_path, fault_tol=1, ctx=cpu(0))
|
||||||
|
total_frames = len(cap)
|
||||||
|
assert total_frames == len(landmark_list)
|
||||||
|
assert total_frames == len(bbox_list)
|
||||||
|
landmark_shape = np.array(landmark_list).shape
|
||||||
|
if landmark_shape != (total_frames, 68, 2):
|
||||||
|
attempts += 1
|
||||||
|
print(f"video {video_path} has invalid landmark shape: {landmark_shape}, expected: {(total_frames, 68, 2)}") # we use 68 landmarks
|
||||||
|
continue
|
||||||
|
except Exception as e:
|
||||||
|
print(f"video file error:{video_path}")
|
||||||
|
print(e)
|
||||||
|
attempts += 1
|
||||||
|
time.sleep(0.1)
|
||||||
|
continue
|
||||||
|
|
||||||
|
shift_landmarks, bbox_list_union, face_shapes = shift_landmarks_to_face_coordinates(
|
||||||
|
landmark_list,
|
||||||
|
bbox_list
|
||||||
|
)
|
||||||
|
if self.contorl_face_min_size and face_shapes[0][0] < self.min_face_size:
|
||||||
|
print(f"video {video_path} has face size {face_shapes[0][0]} less than minimum required {self.min_face_size}")
|
||||||
|
attempts += 1
|
||||||
|
continue
|
||||||
|
|
||||||
|
step = 1
|
||||||
|
drive_idx_start = random.randint(s, e - T * step)
|
||||||
|
drive_idx_list = list(
|
||||||
|
range(drive_idx_start, drive_idx_start + T * step, step))
|
||||||
|
assert len(drive_idx_list) == T
|
||||||
|
|
||||||
|
src_idx_list = []
|
||||||
|
list_index_out_of_range = False
|
||||||
|
for drive_idx in drive_idx_list:
|
||||||
|
src_idx = get_src_idx(
|
||||||
|
drive_idx, T, self.sample_method, shift_landmarks, face_shapes, self.top_k_ratio)
|
||||||
|
if src_idx is None:
|
||||||
|
list_index_out_of_range = True
|
||||||
|
break
|
||||||
|
src_idx = min(src_idx, e - 1)
|
||||||
|
src_idx = max(src_idx, s)
|
||||||
|
src_idx_list.append(src_idx)
|
||||||
|
|
||||||
|
if list_index_out_of_range:
|
||||||
|
attempts += 1
|
||||||
|
print(f"video {video_path} has invalid source index for drive frames")
|
||||||
|
continue
|
||||||
|
|
||||||
|
ref_face_valid_flag = True
|
||||||
|
extra_margin = self.generate_random_value()
|
||||||
|
|
||||||
|
# Get reference images
|
||||||
|
ref_imgs = []
|
||||||
|
for src_idx in src_idx_list:
|
||||||
|
imSrc = Image.fromarray(cap[src_idx].asnumpy())
|
||||||
|
bbox_s = bbox_list_union[src_idx]
|
||||||
|
imSrc, _, _ = self.crop_resize_img(
|
||||||
|
imSrc,
|
||||||
|
bbox_s,
|
||||||
|
self.crop_type,
|
||||||
|
extra_margin=None
|
||||||
|
)
|
||||||
|
if self.contorl_face_min_size and min(imSrc.size[0], imSrc.size[1]) < self.min_face_size:
|
||||||
|
ref_face_valid_flag = False
|
||||||
|
break
|
||||||
|
ref_imgs.append(imSrc)
|
||||||
|
|
||||||
|
if not ref_face_valid_flag:
|
||||||
|
attempts += 1
|
||||||
|
print(f"video {video_path} has reference face size smaller than minimum required {self.min_face_size}")
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Get target images and masks
|
||||||
|
imSameIDs = []
|
||||||
|
bboxes = []
|
||||||
|
face_masks = []
|
||||||
|
face_mask_valid = True
|
||||||
|
target_face_valid_flag = True
|
||||||
|
|
||||||
|
for drive_idx in drive_idx_list:
|
||||||
|
imSameID = Image.fromarray(cap[drive_idx].asnumpy())
|
||||||
|
bbox_s = bbox_list_union[drive_idx]
|
||||||
|
imSameID, _ , mask_scaled_factor = self.crop_resize_img(
|
||||||
|
imSameID,
|
||||||
|
bbox_s,
|
||||||
|
self.crop_type,
|
||||||
|
extra_margin=extra_margin
|
||||||
|
)
|
||||||
|
if self.contorl_face_min_size and min(imSameID.size[0], imSameID.size[1]) < self.min_face_size:
|
||||||
|
target_face_valid_flag = False
|
||||||
|
break
|
||||||
|
crop_margin = extra_margin * mask_scaled_factor
|
||||||
|
face_mask = self.get_resized_mouth_mask(
|
||||||
|
imSameID,
|
||||||
|
shift_landmarks[drive_idx],
|
||||||
|
face_shapes[drive_idx],
|
||||||
|
self.padding_pixel_mouth,
|
||||||
|
self.image_size,
|
||||||
|
crop_margin=crop_margin
|
||||||
|
)
|
||||||
|
if np.count_nonzero(face_mask) == 0:
|
||||||
|
face_mask_valid = False
|
||||||
|
break
|
||||||
|
|
||||||
|
if face_mask.size[1] == 0 or face_mask.size[0] == 0:
|
||||||
|
print(f"video {video_path} has invalid face mask size at frame {drive_idx}")
|
||||||
|
face_mask_valid = False
|
||||||
|
break
|
||||||
|
|
||||||
|
imSameIDs.append(imSameID)
|
||||||
|
bboxes.append(bbox_s)
|
||||||
|
face_masks.append(face_mask)
|
||||||
|
|
||||||
|
if not face_mask_valid:
|
||||||
|
attempts += 1
|
||||||
|
print(f"video {video_path} has invalid face mask")
|
||||||
|
continue
|
||||||
|
|
||||||
|
if not target_face_valid_flag:
|
||||||
|
attempts += 1
|
||||||
|
print(f"video {video_path} has target face size smaller than minimum required {self.min_face_size}")
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Process audio features
|
||||||
|
audio_offset = drive_idx_list[0]
|
||||||
|
audio_step = step
|
||||||
|
fps = 25.0 / step
|
||||||
|
|
||||||
|
try:
|
||||||
|
audio_feature, audio_offset = self.get_audio_file(wav_path, audio_offset)
|
||||||
|
_, audio_offset = self.get_audio_file_mel(wav_path, audio_offset)
|
||||||
|
audio_feature_mel = self.get_syncnet_input(video_path)
|
||||||
|
except Exception as e:
|
||||||
|
print(f"audio file error:{wav_path}")
|
||||||
|
print(e)
|
||||||
|
attempts += 1
|
||||||
|
time.sleep(0.1)
|
||||||
|
continue
|
||||||
|
|
||||||
|
mel = self.crop_audio_window(audio_feature_mel, audio_offset)
|
||||||
|
if mel.shape[0] != syncnet_mel_step_size:
|
||||||
|
attempts += 1
|
||||||
|
print(f"video {video_path} has invalid mel spectrogram shape: {mel.shape}, expected: {syncnet_mel_step_size}")
|
||||||
|
continue
|
||||||
|
|
||||||
|
mel = torch.FloatTensor(mel.T).unsqueeze(0)
|
||||||
|
|
||||||
|
# Build sample dictionary
|
||||||
|
sample = dict(
|
||||||
|
pixel_values_vid=torch.stack(
|
||||||
|
[self.to_tensor(imSameID) for imSameID in imSameIDs], dim=0),
|
||||||
|
pixel_values_ref_img=torch.stack(
|
||||||
|
[self.to_tensor(ref_img) for ref_img in ref_imgs], dim=0),
|
||||||
|
pixel_values_face_mask=torch.stack(
|
||||||
|
[self.pose_to_tensor(face_mask) for face_mask in face_masks], dim=0),
|
||||||
|
audio_feature=audio_feature[0],
|
||||||
|
audio_offset=audio_offset,
|
||||||
|
audio_step=audio_step,
|
||||||
|
mel=mel,
|
||||||
|
wav_path=wav_path,
|
||||||
|
fps=fps,
|
||||||
|
)
|
||||||
|
|
||||||
|
return sample
|
||||||
|
|
||||||
|
raise ValueError("Unable to find a valid sample after maximum attempts.")
|
||||||
|
|
||||||
|
class HDTFDataset(FaceDataset):
|
||||||
|
"""HDTF dataset class"""
|
||||||
|
def __init__(self, cfg):
|
||||||
|
root_path = './dataset/HDTF/meta'
|
||||||
|
list_paths = [
|
||||||
|
'./dataset/HDTF/train.txt',
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
repeats = [10]
|
||||||
|
super().__init__(cfg, list_paths, root_path, repeats)
|
||||||
|
print('HDTFDataset: ', len(self))
|
||||||
|
|
||||||
|
class VFHQDataset(FaceDataset):
|
||||||
|
"""VFHQ dataset class"""
|
||||||
|
def __init__(self, cfg):
|
||||||
|
root_path = './dataset/VFHQ/meta'
|
||||||
|
list_paths = [
|
||||||
|
'./dataset/VFHQ/train.txt',
|
||||||
|
]
|
||||||
|
repeats = [1]
|
||||||
|
super().__init__(cfg, list_paths, root_path, repeats)
|
||||||
|
print('VFHQDataset: ', len(self))
|
||||||
|
|
||||||
|
def PortraitDataset(cfg=None):
|
||||||
|
"""Return dataset based on configuration
|
||||||
|
|
||||||
|
Args:
|
||||||
|
cfg: Configuration dictionary
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dataset: Combined dataset
|
||||||
|
"""
|
||||||
|
if cfg["dataset_key"] == "HDTF":
|
||||||
|
return ConcatDataset([HDTFDataset(cfg)])
|
||||||
|
elif cfg["dataset_key"] == "VFHQ":
|
||||||
|
return ConcatDataset([VFHQDataset(cfg)])
|
||||||
|
else:
|
||||||
|
print("############ use all dataset ############ ")
|
||||||
|
return ConcatDataset([HDTFDataset(cfg), VFHQDataset(cfg)])
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
# Set random seeds for reproducibility
|
||||||
|
seed = 42
|
||||||
|
random.seed(seed)
|
||||||
|
np.random.seed(seed)
|
||||||
|
torch.manual_seed(seed)
|
||||||
|
torch.cuda.manual_seed(seed)
|
||||||
|
torch.cuda.manual_seed_all(seed)
|
||||||
|
|
||||||
|
# Create dataset with configuration parameters
|
||||||
|
dataset = PortraitDataset(cfg={
|
||||||
|
'T': 1, # Number of frames to process at once
|
||||||
|
'random_margin_method': "normal", # Method for generating random margins: "normal" or "uniform"
|
||||||
|
'dataset_key': "HDTF", # Dataset to use: "HDTF", "VFHQ", or None for both
|
||||||
|
'image_size': 256, # Size of processed images (height and width)
|
||||||
|
'sample_method': 'pose_similarity_and_mouth_dissimilarity', # Method for selecting reference frames
|
||||||
|
'top_k_ratio': 0.51, # Ratio for top-k selection in reference frame sampling
|
||||||
|
'contorl_face_min_size': True, # Whether to enforce minimum face size
|
||||||
|
'padding_pixel_mouth': 10, # Padding pixels around mouth region in mask
|
||||||
|
'min_face_size': 200, # Minimum face size requirement for dataset
|
||||||
|
'whisper_path': "./models/whisper", # Path to Whisper model
|
||||||
|
'cropping_jaw2edge_margin_mean': 10, # Mean margin for jaw-to-edge cropping
|
||||||
|
'cropping_jaw2edge_margin_std': 10, # Standard deviation for jaw-to-edge cropping
|
||||||
|
'crop_type': "dynamic_margin_crop_resize", # Type of cropping: "crop_resize", "dynamic_margin_crop_resize", or "resize"
|
||||||
|
})
|
||||||
|
print(len(dataset))
|
||||||
|
|
||||||
|
import torchvision
|
||||||
|
os.makedirs('debug', exist_ok=True)
|
||||||
|
for i in range(10): # Check 10 samples
|
||||||
|
sample = dataset[0]
|
||||||
|
print(f"processing {i}")
|
||||||
|
|
||||||
|
# Get images and mask
|
||||||
|
ref_img = (sample['pixel_values_ref_img'] + 1.0) / 2 # (b, c, h, w)
|
||||||
|
target_img = (sample['pixel_values_vid'] + 1.0) / 2
|
||||||
|
face_mask = sample['pixel_values_face_mask']
|
||||||
|
|
||||||
|
# Print dimension information
|
||||||
|
print(f"ref_img shape: {ref_img.shape}")
|
||||||
|
print(f"target_img shape: {target_img.shape}")
|
||||||
|
print(f"face_mask shape: {face_mask.shape}")
|
||||||
|
|
||||||
|
# Create visualization images
|
||||||
|
b, c, h, w = ref_img.shape
|
||||||
|
|
||||||
|
# Apply mask only to target image
|
||||||
|
target_mask = face_mask
|
||||||
|
|
||||||
|
# Keep reference image unchanged
|
||||||
|
ref_with_mask = ref_img.clone()
|
||||||
|
|
||||||
|
# Create mask overlay for target image
|
||||||
|
target_with_mask = target_img.clone()
|
||||||
|
target_with_mask = target_with_mask * (1 - target_mask) + target_mask # Apply mask only to target
|
||||||
|
|
||||||
|
# Save original images, mask, and overlay results
|
||||||
|
# First row: original images
|
||||||
|
# Second row: mask
|
||||||
|
# Third row: overlay effect
|
||||||
|
concatenated_img = torch.cat((
|
||||||
|
ref_img, target_img, # Original images
|
||||||
|
torch.zeros_like(ref_img), target_mask, # Mask (black for ref)
|
||||||
|
ref_with_mask, target_with_mask # Overlay effect
|
||||||
|
), dim=3)
|
||||||
|
|
||||||
|
torchvision.utils.save_image(
|
||||||
|
concatenated_img, f'debug/mask_check_{i}.jpg', nrow=2)
|
||||||
Executable
+233
@@ -0,0 +1,233 @@
|
|||||||
|
import numpy as np
|
||||||
|
import random
|
||||||
|
|
||||||
|
def summarize_tensor(x):
|
||||||
|
return f"\033[34m{str(tuple(x.shape)).ljust(24)}\033[0m (\033[31mmin {x.min().item():+.4f}\033[0m / \033[32mmean {x.mean().item():+.4f}\033[0m / \033[33mmax {x.max().item():+.4f}\033[0m)"
|
||||||
|
|
||||||
|
def calculate_mouth_open_similarity(landmarks_list, select_idx,top_k=50,ascending=True):
|
||||||
|
num_landmarks = len(landmarks_list)
|
||||||
|
mouth_open_ratios = np.zeros(num_landmarks) # Initialize as a numpy array
|
||||||
|
print(np.shape(landmarks_list))
|
||||||
|
## Calculate mouth opening ratios
|
||||||
|
for i, landmarks in enumerate(landmarks_list):
|
||||||
|
# Assuming landmarks are in the format [x, y] and accessible by index
|
||||||
|
mouth_top = landmarks[165] # Adjust index according to your landmarks format
|
||||||
|
mouth_bottom = landmarks[147] # Adjust index according to your landmarks format
|
||||||
|
mouth_open_ratio = np.linalg.norm(mouth_top - mouth_bottom)
|
||||||
|
mouth_open_ratios[i] = mouth_open_ratio
|
||||||
|
|
||||||
|
# Calculate differences matrix
|
||||||
|
differences_matrix = np.abs(mouth_open_ratios[:, np.newaxis] - mouth_open_ratios[select_idx])
|
||||||
|
differences_matrix_with_signs = mouth_open_ratios[:, np.newaxis] - mouth_open_ratios[select_idx]
|
||||||
|
print(differences_matrix.shape)
|
||||||
|
# Find top_k similar indices for each landmark set
|
||||||
|
if ascending:
|
||||||
|
top_indices = np.argsort(differences_matrix[i])[:top_k]
|
||||||
|
else:
|
||||||
|
top_indices = np.argsort(-differences_matrix[i])[:top_k]
|
||||||
|
similar_landmarks_indices = top_indices.tolist()
|
||||||
|
similar_landmarks_distances = differences_matrix_with_signs[i].tolist() #注意这里不要排序
|
||||||
|
|
||||||
|
return similar_landmarks_indices, similar_landmarks_distances
|
||||||
|
#############################################################################################
|
||||||
|
def get_closed_mouth(landmarks_list,ascending=True,top_k=50):
|
||||||
|
num_landmarks = len(landmarks_list)
|
||||||
|
|
||||||
|
mouth_open_ratios = np.zeros(num_landmarks) # Initialize as a numpy array
|
||||||
|
## Calculate mouth opening ratios
|
||||||
|
#print("landmarks shape",np.shape(landmarks_list))
|
||||||
|
for i, landmarks in enumerate(landmarks_list):
|
||||||
|
# Assuming landmarks are in the format [x, y] and accessible by index
|
||||||
|
#print(landmarks[165])
|
||||||
|
mouth_top = np.array(landmarks[165])# Adjust index according to your landmarks format
|
||||||
|
mouth_bottom = np.array(landmarks[147]) # Adjust index according to your landmarks format
|
||||||
|
mouth_open_ratio = np.linalg.norm(mouth_top - mouth_bottom)
|
||||||
|
mouth_open_ratios[i] = mouth_open_ratio
|
||||||
|
|
||||||
|
# Find top_k similar indices for each landmark set
|
||||||
|
if ascending:
|
||||||
|
top_indices = np.argsort(mouth_open_ratios)[:top_k]
|
||||||
|
else:
|
||||||
|
top_indices = np.argsort(-mouth_open_ratios)[:top_k]
|
||||||
|
return top_indices
|
||||||
|
|
||||||
|
def calculate_landmarks_similarity(selected_idx, landmarks_list,image_shapes, start_index, end_index, top_k=50,ascending=True):
|
||||||
|
"""
|
||||||
|
Calculate the similarity between sets of facial landmarks and return the indices of the most similar faces.
|
||||||
|
|
||||||
|
Parameters:
|
||||||
|
landmarks_list (list): A list containing sets of facial landmarks, each element is a set of landmarks.
|
||||||
|
image_shapes (list): A list containing the shape of each image, each element is a (width, height) tuple.
|
||||||
|
start_index (int): The starting index of the facial landmarks.
|
||||||
|
end_index (int): The ending index of the facial landmarks.
|
||||||
|
top_k (int): The number of most similar landmark sets to return. Default is 50.
|
||||||
|
ascending (bool): Controls the sorting order. If True, sort in ascending order; If False, sort in descending order. Default is True.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
similar_landmarks_indices (list): A list containing the indices of the most similar facial landmarks for each face.
|
||||||
|
resized_landmarks (list): A list containing the resized facial landmarks.
|
||||||
|
"""
|
||||||
|
num_landmarks = len(landmarks_list)
|
||||||
|
resized_landmarks = []
|
||||||
|
|
||||||
|
# Preprocess landmarks
|
||||||
|
for i in range(num_landmarks):
|
||||||
|
landmark_array = np.array(landmarks_list[i])
|
||||||
|
selected_landmarks = landmark_array[start_index:end_index]
|
||||||
|
resized_landmark = resize_landmark(selected_landmarks, w=image_shapes[i][0], h=image_shapes[i][1],new_w=256,new_h=256)
|
||||||
|
resized_landmarks.append(resized_landmark)
|
||||||
|
|
||||||
|
resized_landmarks_array = np.array(resized_landmarks) # Convert list to array for easier manipulation
|
||||||
|
|
||||||
|
# Calculate similarity
|
||||||
|
distances = np.linalg.norm(resized_landmarks_array - resized_landmarks_array[selected_idx][np.newaxis, :], axis=2)
|
||||||
|
overall_distances = np.mean(distances, axis=1) # Calculate mean distance for each set of landmarks
|
||||||
|
|
||||||
|
if ascending:
|
||||||
|
sorted_indices = np.argsort(overall_distances)
|
||||||
|
similar_landmarks_indices = sorted_indices[1:top_k+1].tolist() # Exclude self and take top_k
|
||||||
|
else:
|
||||||
|
sorted_indices = np.argsort(-overall_distances)
|
||||||
|
similar_landmarks_indices = sorted_indices[0:top_k].tolist()
|
||||||
|
|
||||||
|
return similar_landmarks_indices
|
||||||
|
|
||||||
|
def process_bbox_musetalk(face_array, landmark_array):
|
||||||
|
x_min_face, y_min_face, x_max_face, y_max_face = map(int, face_array)
|
||||||
|
x_min_lm = min([int(x) for x, y in landmark_array])
|
||||||
|
y_min_lm = min([int(y) for x, y in landmark_array])
|
||||||
|
x_max_lm = max([int(x) for x, y in landmark_array])
|
||||||
|
y_max_lm = max([int(y) for x, y in landmark_array])
|
||||||
|
x_min = min(x_min_face, x_min_lm)
|
||||||
|
y_min = min(y_min_face, y_min_lm)
|
||||||
|
x_max = max(x_max_face, x_max_lm)
|
||||||
|
y_max = max(y_max_face, y_max_lm)
|
||||||
|
|
||||||
|
x_min = max(x_min, 0)
|
||||||
|
y_min = max(y_min, 0)
|
||||||
|
|
||||||
|
return [x_min, y_min, x_max, y_max]
|
||||||
|
|
||||||
|
def shift_landmarks_to_face_coordinates(landmark_list, face_list):
|
||||||
|
"""
|
||||||
|
Translates the data in landmark_list to the coordinates of the cropped larger face.
|
||||||
|
|
||||||
|
Parameters:
|
||||||
|
landmark_list (list): A list containing multiple sets of facial landmarks.
|
||||||
|
face_list (list): A list containing multiple facial images.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
landmark_list_shift (list): The list of translated landmarks.
|
||||||
|
bbox_union (list): The list of union bounding boxes.
|
||||||
|
face_shapes (list): The list of facial shapes.
|
||||||
|
"""
|
||||||
|
landmark_list_shift = []
|
||||||
|
bbox_union = []
|
||||||
|
face_shapes = []
|
||||||
|
|
||||||
|
for i in range(len(face_list)):
|
||||||
|
landmark_array = np.array(landmark_list[i]) # 转换为numpy数组并创建副本
|
||||||
|
face_array = face_list[i]
|
||||||
|
f_landmark_bbox = process_bbox_musetalk(face_array, landmark_array)
|
||||||
|
x_min, y_min, x_max, y_max = f_landmark_bbox
|
||||||
|
landmark_array[:, 0] = landmark_array[:, 0] - f_landmark_bbox[0]
|
||||||
|
landmark_array[:, 1] = landmark_array[:, 1] - f_landmark_bbox[1]
|
||||||
|
landmark_list_shift.append(landmark_array)
|
||||||
|
bbox_union.append(f_landmark_bbox)
|
||||||
|
face_shapes.append((x_max - x_min, y_max - y_min))
|
||||||
|
|
||||||
|
return landmark_list_shift, bbox_union, face_shapes
|
||||||
|
|
||||||
|
def resize_landmark(landmark, w, h, new_w, new_h):
|
||||||
|
landmark_norm = landmark / [w, h]
|
||||||
|
landmark_resized = landmark_norm * [new_w, new_h]
|
||||||
|
|
||||||
|
return landmark_resized
|
||||||
|
|
||||||
|
def get_src_idx(drive_idx, T, sample_method,landmarks_list,image_shapes,top_k_ratio):
|
||||||
|
"""
|
||||||
|
Calculate the source index (src_idx) based on the given drive index, T, s, e, and sampling method.
|
||||||
|
|
||||||
|
Parameters:
|
||||||
|
- drive_idx (int): The current drive index.
|
||||||
|
- T (int): Total number of frames or a specific range limit.
|
||||||
|
- sample_method (str): Sampling method, which can be "random" or other methods.
|
||||||
|
- landmarks_list (list): List of facial landmarks.
|
||||||
|
- image_shapes (list): List of image shapes.
|
||||||
|
- top_k_ratio (float): Ratio for selecting top k similar frames.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
- src_idx (int): The calculated source index.
|
||||||
|
"""
|
||||||
|
if sample_method == "random":
|
||||||
|
src_idx = random.randint(drive_idx - 5 * T, drive_idx + 5 * T)
|
||||||
|
elif sample_method == "pose_similarity":
|
||||||
|
top_k = int(top_k_ratio*len(landmarks_list))
|
||||||
|
try:
|
||||||
|
top_k = int(top_k_ratio*len(landmarks_list))
|
||||||
|
# facial contour
|
||||||
|
landmark_start_idx = 0
|
||||||
|
landmark_end_idx = 16
|
||||||
|
pose_similarity_list = calculate_landmarks_similarity(drive_idx, landmarks_list,image_shapes, landmark_start_idx, landmark_end_idx,top_k=top_k, ascending=True)
|
||||||
|
src_idx = random.choice(pose_similarity_list)
|
||||||
|
while abs(src_idx-drive_idx)<5:
|
||||||
|
src_idx = random.choice(pose_similarity_list)
|
||||||
|
except Exception as e:
|
||||||
|
print(e)
|
||||||
|
return None
|
||||||
|
elif sample_method=="pose_similarity_and_closed_mouth":
|
||||||
|
# facial contour
|
||||||
|
landmark_start_idx = 0
|
||||||
|
landmark_end_idx = 16
|
||||||
|
try:
|
||||||
|
top_k = int(top_k_ratio*len(landmarks_list))
|
||||||
|
closed_mouth_list = get_closed_mouth(landmarks_list, ascending=True,top_k=top_k)
|
||||||
|
#print("closed_mouth_list",closed_mouth_list)
|
||||||
|
pose_similarity_list = calculate_landmarks_similarity(drive_idx, landmarks_list,image_shapes, landmark_start_idx, landmark_end_idx,top_k=top_k, ascending=True)
|
||||||
|
#print("pose_similarity_list",pose_similarity_list)
|
||||||
|
common_list = list(set(closed_mouth_list).intersection(set(pose_similarity_list)))
|
||||||
|
if len(common_list) == 0:
|
||||||
|
src_idx = random.randint(drive_idx - 5 * T, drive_idx + 5 * T)
|
||||||
|
else:
|
||||||
|
src_idx = random.choice(common_list)
|
||||||
|
|
||||||
|
while abs(src_idx-drive_idx) <5:
|
||||||
|
src_idx = random.randint(drive_idx - 5 * T, drive_idx + 5 * T)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(e)
|
||||||
|
return None
|
||||||
|
|
||||||
|
elif sample_method=="pose_similarity_and_mouth_dissimilarity":
|
||||||
|
top_k = int(top_k_ratio*len(landmarks_list))
|
||||||
|
try:
|
||||||
|
top_k = int(top_k_ratio*len(landmarks_list))
|
||||||
|
|
||||||
|
# facial contour for 68 landmarks format
|
||||||
|
landmark_start_idx = 0
|
||||||
|
landmark_end_idx = 16
|
||||||
|
|
||||||
|
pose_similarity_list = calculate_landmarks_similarity(drive_idx, landmarks_list,image_shapes, landmark_start_idx, landmark_end_idx,top_k=top_k, ascending=True)
|
||||||
|
|
||||||
|
# Mouth inner coutour for 68 landmarks format
|
||||||
|
landmark_start_idx = 60
|
||||||
|
landmark_end_idx = 67
|
||||||
|
|
||||||
|
mouth_dissimilarity_list = calculate_landmarks_similarity(drive_idx, landmarks_list,image_shapes, landmark_start_idx, landmark_end_idx,top_k=top_k, ascending=False)
|
||||||
|
|
||||||
|
common_list = list(set(pose_similarity_list).intersection(set(mouth_dissimilarity_list)))
|
||||||
|
if len(common_list) == 0:
|
||||||
|
src_idx = random.randint(drive_idx - 5 * T, drive_idx + 5 * T)
|
||||||
|
else:
|
||||||
|
src_idx = random.choice(common_list)
|
||||||
|
|
||||||
|
while abs(src_idx-drive_idx) <5:
|
||||||
|
src_idx = random.randint(drive_idx - 5 * T, drive_idx + 5 * T)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(e)
|
||||||
|
return None
|
||||||
|
|
||||||
|
else:
|
||||||
|
raise ValueError(f"Unknown sample_method: {sample_method}")
|
||||||
|
return src_idx
|
||||||
Executable
+81
@@ -0,0 +1,81 @@
|
|||||||
|
import torch
|
||||||
|
import torch.nn as nn
|
||||||
|
from omegaconf import OmegaConf
|
||||||
|
import torch
|
||||||
|
import torch.nn.functional as F
|
||||||
|
from torch import nn, optim
|
||||||
|
from torch.optim.lr_scheduler import CosineAnnealingLR
|
||||||
|
from musetalk.loss.discriminator import MultiScaleDiscriminator,DiscriminatorFullModel
|
||||||
|
import musetalk.loss.vgg_face as vgg_face
|
||||||
|
|
||||||
|
class Interpolate(nn.Module):
|
||||||
|
def __init__(self, size=None, scale_factor=None, mode='nearest', align_corners=None):
|
||||||
|
super(Interpolate, self).__init__()
|
||||||
|
self.size = size
|
||||||
|
self.scale_factor = scale_factor
|
||||||
|
self.mode = mode
|
||||||
|
self.align_corners = align_corners
|
||||||
|
|
||||||
|
def forward(self, input):
|
||||||
|
return F.interpolate(input, self.size, self.scale_factor, self.mode, self.align_corners)
|
||||||
|
|
||||||
|
def set_requires_grad(net, requires_grad=False):
|
||||||
|
if net is not None:
|
||||||
|
for param in net.parameters():
|
||||||
|
param.requires_grad = requires_grad
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
cfg = OmegaConf.load("config/audio_adapter/E7.yaml")
|
||||||
|
|
||||||
|
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
|
||||||
|
pyramid_scale = [1, 0.5, 0.25, 0.125]
|
||||||
|
vgg_IN = vgg_face.Vgg19().to(device)
|
||||||
|
pyramid = vgg_face.ImagePyramide(cfg.loss_params.pyramid_scale, 3).to(device)
|
||||||
|
vgg_IN.eval()
|
||||||
|
downsampler = Interpolate(size=(224, 224), mode='bilinear', align_corners=False)
|
||||||
|
|
||||||
|
image = torch.rand(8, 3, 256, 256).to(device)
|
||||||
|
image_pred = torch.rand(8, 3, 256, 256).to(device)
|
||||||
|
pyramide_real = pyramid(downsampler(image))
|
||||||
|
pyramide_generated = pyramid(downsampler(image_pred))
|
||||||
|
|
||||||
|
|
||||||
|
loss_IN = 0
|
||||||
|
for scale in cfg.loss_params.pyramid_scale:
|
||||||
|
x_vgg = vgg_IN(pyramide_generated['prediction_' + str(scale)])
|
||||||
|
y_vgg = vgg_IN(pyramide_real['prediction_' + str(scale)])
|
||||||
|
for i, weight in enumerate(cfg.loss_params.vgg_layer_weight):
|
||||||
|
value = torch.abs(x_vgg[i] - y_vgg[i].detach()).mean()
|
||||||
|
loss_IN += weight * value
|
||||||
|
loss_IN /= sum(cfg.loss_params.vgg_layer_weight) # 对vgg不同层取均值,金字塔loss是每层叠
|
||||||
|
print(loss_IN)
|
||||||
|
|
||||||
|
#print(cfg.model_params.discriminator_params)
|
||||||
|
|
||||||
|
discriminator = MultiScaleDiscriminator(**cfg.model_params.discriminator_params).to(device)
|
||||||
|
discriminator_full = DiscriminatorFullModel(discriminator)
|
||||||
|
disc_scales = cfg.model_params.discriminator_params.scales
|
||||||
|
# Prepare optimizer and loss function
|
||||||
|
optimizer_D = optim.AdamW(discriminator.parameters(),
|
||||||
|
lr=cfg.discriminator_train_params.lr,
|
||||||
|
weight_decay=cfg.discriminator_train_params.weight_decay,
|
||||||
|
betas=cfg.discriminator_train_params.betas,
|
||||||
|
eps=cfg.discriminator_train_params.eps)
|
||||||
|
scheduler_D = CosineAnnealingLR(optimizer_D,
|
||||||
|
T_max=cfg.discriminator_train_params.epochs,
|
||||||
|
eta_min=1e-6)
|
||||||
|
|
||||||
|
discriminator.train()
|
||||||
|
|
||||||
|
set_requires_grad(discriminator, False)
|
||||||
|
|
||||||
|
loss_G = 0.
|
||||||
|
discriminator_maps_generated = discriminator(pyramide_generated)
|
||||||
|
discriminator_maps_real = discriminator(pyramide_real)
|
||||||
|
|
||||||
|
for scale in disc_scales:
|
||||||
|
key = 'prediction_map_%s' % scale
|
||||||
|
value = ((1 - discriminator_maps_generated[key]) ** 2).mean()
|
||||||
|
loss_G += value
|
||||||
|
|
||||||
|
print(loss_G)
|
||||||
Executable
+44
@@ -0,0 +1,44 @@
|
|||||||
|
import torch
|
||||||
|
from torch import nn
|
||||||
|
from torch.nn import functional as F
|
||||||
|
|
||||||
|
class Conv2d(nn.Module):
|
||||||
|
def __init__(self, cin, cout, kernel_size, stride, padding, residual=False, *args, **kwargs):
|
||||||
|
super().__init__(*args, **kwargs)
|
||||||
|
self.conv_block = nn.Sequential(
|
||||||
|
nn.Conv2d(cin, cout, kernel_size, stride, padding),
|
||||||
|
nn.BatchNorm2d(cout)
|
||||||
|
)
|
||||||
|
self.act = nn.ReLU()
|
||||||
|
self.residual = residual
|
||||||
|
|
||||||
|
def forward(self, x):
|
||||||
|
out = self.conv_block(x)
|
||||||
|
if self.residual:
|
||||||
|
out += x
|
||||||
|
return self.act(out)
|
||||||
|
|
||||||
|
class nonorm_Conv2d(nn.Module):
|
||||||
|
def __init__(self, cin, cout, kernel_size, stride, padding, residual=False, *args, **kwargs):
|
||||||
|
super().__init__(*args, **kwargs)
|
||||||
|
self.conv_block = nn.Sequential(
|
||||||
|
nn.Conv2d(cin, cout, kernel_size, stride, padding),
|
||||||
|
)
|
||||||
|
self.act = nn.LeakyReLU(0.01, inplace=True)
|
||||||
|
|
||||||
|
def forward(self, x):
|
||||||
|
out = self.conv_block(x)
|
||||||
|
return self.act(out)
|
||||||
|
|
||||||
|
class Conv2dTranspose(nn.Module):
|
||||||
|
def __init__(self, cin, cout, kernel_size, stride, padding, output_padding=0, *args, **kwargs):
|
||||||
|
super().__init__(*args, **kwargs)
|
||||||
|
self.conv_block = nn.Sequential(
|
||||||
|
nn.ConvTranspose2d(cin, cout, kernel_size, stride, padding, output_padding),
|
||||||
|
nn.BatchNorm2d(cout)
|
||||||
|
)
|
||||||
|
self.act = nn.ReLU()
|
||||||
|
|
||||||
|
def forward(self, x):
|
||||||
|
out = self.conv_block(x)
|
||||||
|
return self.act(out)
|
||||||
Executable
+145
@@ -0,0 +1,145 @@
|
|||||||
|
from torch import nn
|
||||||
|
import torch.nn.functional as F
|
||||||
|
import torch
|
||||||
|
from musetalk.loss.vgg_face import ImagePyramide
|
||||||
|
|
||||||
|
class DownBlock2d(nn.Module):
|
||||||
|
"""
|
||||||
|
Simple block for processing video (encoder).
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, in_features, out_features, norm=False, kernel_size=4, pool=False, sn=False):
|
||||||
|
super(DownBlock2d, self).__init__()
|
||||||
|
self.conv = nn.Conv2d(in_channels=in_features, out_channels=out_features, kernel_size=kernel_size)
|
||||||
|
|
||||||
|
if sn:
|
||||||
|
self.conv = nn.utils.spectral_norm(self.conv)
|
||||||
|
|
||||||
|
if norm:
|
||||||
|
self.norm = nn.InstanceNorm2d(out_features, affine=True)
|
||||||
|
else:
|
||||||
|
self.norm = None
|
||||||
|
self.pool = pool
|
||||||
|
|
||||||
|
def forward(self, x):
|
||||||
|
out = x
|
||||||
|
out = self.conv(out)
|
||||||
|
if self.norm:
|
||||||
|
out = self.norm(out)
|
||||||
|
out = F.leaky_relu(out, 0.2)
|
||||||
|
if self.pool:
|
||||||
|
out = F.avg_pool2d(out, (2, 2))
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
|
class Discriminator(nn.Module):
|
||||||
|
"""
|
||||||
|
Discriminator similar to Pix2Pix
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, num_channels=3, block_expansion=64, num_blocks=4, max_features=512,
|
||||||
|
sn=False, **kwargs):
|
||||||
|
super(Discriminator, self).__init__()
|
||||||
|
|
||||||
|
down_blocks = []
|
||||||
|
for i in range(num_blocks):
|
||||||
|
down_blocks.append(
|
||||||
|
DownBlock2d(num_channels if i == 0 else min(max_features, block_expansion * (2 ** i)),
|
||||||
|
min(max_features, block_expansion * (2 ** (i + 1))),
|
||||||
|
norm=(i != 0), kernel_size=4, pool=(i != num_blocks - 1), sn=sn))
|
||||||
|
|
||||||
|
self.down_blocks = nn.ModuleList(down_blocks)
|
||||||
|
self.conv = nn.Conv2d(self.down_blocks[-1].conv.out_channels, out_channels=1, kernel_size=1)
|
||||||
|
if sn:
|
||||||
|
self.conv = nn.utils.spectral_norm(self.conv)
|
||||||
|
|
||||||
|
def forward(self, x):
|
||||||
|
feature_maps = []
|
||||||
|
out = x
|
||||||
|
|
||||||
|
for down_block in self.down_blocks:
|
||||||
|
feature_maps.append(down_block(out))
|
||||||
|
out = feature_maps[-1]
|
||||||
|
prediction_map = self.conv(out)
|
||||||
|
|
||||||
|
return feature_maps, prediction_map
|
||||||
|
|
||||||
|
|
||||||
|
class MultiScaleDiscriminator(nn.Module):
|
||||||
|
"""
|
||||||
|
Multi-scale (scale) discriminator
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, scales=(), **kwargs):
|
||||||
|
super(MultiScaleDiscriminator, self).__init__()
|
||||||
|
self.scales = scales
|
||||||
|
discs = {}
|
||||||
|
for scale in scales:
|
||||||
|
discs[str(scale).replace('.', '-')] = Discriminator(**kwargs)
|
||||||
|
self.discs = nn.ModuleDict(discs)
|
||||||
|
|
||||||
|
def forward(self, x):
|
||||||
|
out_dict = {}
|
||||||
|
for scale, disc in self.discs.items():
|
||||||
|
scale = str(scale).replace('-', '.')
|
||||||
|
key = 'prediction_' + scale
|
||||||
|
#print(key)
|
||||||
|
#print(x)
|
||||||
|
feature_maps, prediction_map = disc(x[key])
|
||||||
|
out_dict['feature_maps_' + scale] = feature_maps
|
||||||
|
out_dict['prediction_map_' + scale] = prediction_map
|
||||||
|
return out_dict
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
class DiscriminatorFullModel(torch.nn.Module):
|
||||||
|
"""
|
||||||
|
Merge all discriminator related updates into single model for better multi-gpu usage
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, discriminator):
|
||||||
|
super(DiscriminatorFullModel, self).__init__()
|
||||||
|
self.discriminator = discriminator
|
||||||
|
self.scales = self.discriminator.scales
|
||||||
|
print("scales",self.scales)
|
||||||
|
self.pyramid = ImagePyramide(self.scales, 3)
|
||||||
|
if torch.cuda.is_available():
|
||||||
|
self.pyramid = self.pyramid.cuda()
|
||||||
|
|
||||||
|
self.zero_tensor = None
|
||||||
|
|
||||||
|
def get_zero_tensor(self, input):
|
||||||
|
if self.zero_tensor is None:
|
||||||
|
self.zero_tensor = torch.FloatTensor(1).fill_(0).cuda()
|
||||||
|
self.zero_tensor.requires_grad_(False)
|
||||||
|
return self.zero_tensor.expand_as(input)
|
||||||
|
|
||||||
|
def forward(self, x, generated, gan_mode='ls'):
|
||||||
|
pyramide_real = self.pyramid(x)
|
||||||
|
pyramide_generated = self.pyramid(generated.detach())
|
||||||
|
|
||||||
|
discriminator_maps_generated = self.discriminator(pyramide_generated)
|
||||||
|
discriminator_maps_real = self.discriminator(pyramide_real)
|
||||||
|
|
||||||
|
value_total = 0
|
||||||
|
for scale in self.scales:
|
||||||
|
key = 'prediction_map_%s' % scale
|
||||||
|
if gan_mode == 'hinge':
|
||||||
|
value = -torch.mean(torch.min(discriminator_maps_real[key]-1, self.get_zero_tensor(discriminator_maps_real[key]))) - torch.mean(torch.min(-discriminator_maps_generated[key]-1, self.get_zero_tensor(discriminator_maps_generated[key])))
|
||||||
|
elif gan_mode == 'ls':
|
||||||
|
value = ((1 - discriminator_maps_real[key]) ** 2 + discriminator_maps_generated[key] ** 2).mean()
|
||||||
|
else:
|
||||||
|
raise ValueError('Unexpected gan_mode {}'.format(self.train_params['gan_mode']))
|
||||||
|
|
||||||
|
value_total += value
|
||||||
|
|
||||||
|
return value_total
|
||||||
|
|
||||||
|
def main():
|
||||||
|
discriminator = MultiScaleDiscriminator(scales=[1],
|
||||||
|
block_expansion=32,
|
||||||
|
max_features=512,
|
||||||
|
num_blocks=4,
|
||||||
|
sn=True,
|
||||||
|
image_channel=3,
|
||||||
|
estimate_jacobian=False)
|
||||||
Executable
+152
@@ -0,0 +1,152 @@
|
|||||||
|
import torch.nn as nn
|
||||||
|
import math
|
||||||
|
|
||||||
|
__all__ = ['ResNet', 'resnet50']
|
||||||
|
|
||||||
|
def conv3x3(in_planes, out_planes, stride=1):
|
||||||
|
"""3x3 convolution with padding"""
|
||||||
|
return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
|
||||||
|
padding=1, bias=False)
|
||||||
|
|
||||||
|
|
||||||
|
class BasicBlock(nn.Module):
|
||||||
|
expansion = 1
|
||||||
|
|
||||||
|
def __init__(self, inplanes, planes, stride=1, downsample=None):
|
||||||
|
super(BasicBlock, self).__init__()
|
||||||
|
self.conv1 = conv3x3(inplanes, planes, stride)
|
||||||
|
self.bn1 = nn.BatchNorm2d(planes)
|
||||||
|
self.relu = nn.ReLU(inplace=True)
|
||||||
|
self.conv2 = conv3x3(planes, planes)
|
||||||
|
self.bn2 = nn.BatchNorm2d(planes)
|
||||||
|
self.downsample = downsample
|
||||||
|
self.stride = stride
|
||||||
|
|
||||||
|
def forward(self, x):
|
||||||
|
residual = x
|
||||||
|
|
||||||
|
out = self.conv1(x)
|
||||||
|
out = self.bn1(out)
|
||||||
|
out = self.relu(out)
|
||||||
|
|
||||||
|
out = self.conv2(out)
|
||||||
|
out = self.bn2(out)
|
||||||
|
|
||||||
|
if self.downsample is not None:
|
||||||
|
residual = self.downsample(x)
|
||||||
|
|
||||||
|
out += residual
|
||||||
|
out = self.relu(out)
|
||||||
|
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
|
class Bottleneck(nn.Module):
|
||||||
|
expansion = 4
|
||||||
|
|
||||||
|
def __init__(self, inplanes, planes, stride=1, downsample=None):
|
||||||
|
super(Bottleneck, self).__init__()
|
||||||
|
self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, stride=stride, bias=False)
|
||||||
|
self.bn1 = nn.BatchNorm2d(planes)
|
||||||
|
self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1, bias=False)
|
||||||
|
self.bn2 = nn.BatchNorm2d(planes)
|
||||||
|
self.conv3 = nn.Conv2d(planes, planes * 4, kernel_size=1, bias=False)
|
||||||
|
self.bn3 = nn.BatchNorm2d(planes * 4)
|
||||||
|
self.relu = nn.ReLU(inplace=True)
|
||||||
|
self.downsample = downsample
|
||||||
|
self.stride = stride
|
||||||
|
|
||||||
|
def forward(self, x):
|
||||||
|
residual = x
|
||||||
|
|
||||||
|
out = self.conv1(x)
|
||||||
|
out = self.bn1(out)
|
||||||
|
out = self.relu(out)
|
||||||
|
|
||||||
|
out = self.conv2(out)
|
||||||
|
out = self.bn2(out)
|
||||||
|
out = self.relu(out)
|
||||||
|
|
||||||
|
out = self.conv3(out)
|
||||||
|
out = self.bn3(out)
|
||||||
|
|
||||||
|
if self.downsample is not None:
|
||||||
|
residual = self.downsample(x)
|
||||||
|
|
||||||
|
out += residual
|
||||||
|
out = self.relu(out)
|
||||||
|
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
|
class ResNet(nn.Module):
|
||||||
|
|
||||||
|
def __init__(self, block, layers, num_classes=1000, include_top=True):
|
||||||
|
self.inplanes = 64
|
||||||
|
super(ResNet, self).__init__()
|
||||||
|
self.include_top = include_top
|
||||||
|
|
||||||
|
self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
|
||||||
|
self.bn1 = nn.BatchNorm2d(64)
|
||||||
|
self.relu = nn.ReLU(inplace=True)
|
||||||
|
self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=0, ceil_mode=True)
|
||||||
|
|
||||||
|
self.layer1 = self._make_layer(block, 64, layers[0])
|
||||||
|
self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
|
||||||
|
self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
|
||||||
|
self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
|
||||||
|
self.avgpool = nn.AvgPool2d(7, stride=1)
|
||||||
|
self.fc = nn.Linear(512 * block.expansion, num_classes)
|
||||||
|
|
||||||
|
for m in self.modules():
|
||||||
|
if isinstance(m, nn.Conv2d):
|
||||||
|
n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
|
||||||
|
m.weight.data.normal_(0, math.sqrt(2. / n))
|
||||||
|
elif isinstance(m, nn.BatchNorm2d):
|
||||||
|
m.weight.data.fill_(1)
|
||||||
|
m.bias.data.zero_()
|
||||||
|
|
||||||
|
def _make_layer(self, block, planes, blocks, stride=1):
|
||||||
|
downsample = None
|
||||||
|
if stride != 1 or self.inplanes != planes * block.expansion:
|
||||||
|
downsample = nn.Sequential(
|
||||||
|
nn.Conv2d(self.inplanes, planes * block.expansion,
|
||||||
|
kernel_size=1, stride=stride, bias=False),
|
||||||
|
nn.BatchNorm2d(planes * block.expansion),
|
||||||
|
)
|
||||||
|
|
||||||
|
layers = []
|
||||||
|
layers.append(block(self.inplanes, planes, stride, downsample))
|
||||||
|
self.inplanes = planes * block.expansion
|
||||||
|
for i in range(1, blocks):
|
||||||
|
layers.append(block(self.inplanes, planes))
|
||||||
|
|
||||||
|
return nn.Sequential(*layers)
|
||||||
|
|
||||||
|
def forward(self, x):
|
||||||
|
x = x * 255.
|
||||||
|
x = x.flip(1)
|
||||||
|
x = self.conv1(x)
|
||||||
|
x = self.bn1(x)
|
||||||
|
x = self.relu(x)
|
||||||
|
x = self.maxpool(x)
|
||||||
|
|
||||||
|
x = self.layer1(x)
|
||||||
|
x = self.layer2(x)
|
||||||
|
x = self.layer3(x)
|
||||||
|
x = self.layer4(x)
|
||||||
|
|
||||||
|
x = self.avgpool(x)
|
||||||
|
|
||||||
|
if not self.include_top:
|
||||||
|
return x
|
||||||
|
|
||||||
|
x = x.view(x.size(0), -1)
|
||||||
|
x = self.fc(x)
|
||||||
|
return x
|
||||||
|
|
||||||
|
def resnet50(**kwargs):
|
||||||
|
"""Constructs a ResNet-50 model.
|
||||||
|
"""
|
||||||
|
model = ResNet(Bottleneck, [3, 4, 6, 3], **kwargs)
|
||||||
|
return model
|
||||||
Executable
+95
@@ -0,0 +1,95 @@
|
|||||||
|
import torch
|
||||||
|
from torch import nn
|
||||||
|
from torch.nn import functional as F
|
||||||
|
|
||||||
|
from .conv import Conv2d
|
||||||
|
|
||||||
|
logloss = nn.BCELoss(reduction="none")
|
||||||
|
def cosine_loss(a, v, y):
|
||||||
|
d = nn.functional.cosine_similarity(a, v)
|
||||||
|
d = d.clamp(0,1) # cosine_similarity的取值范围是【-1,1】,BCE如果输入负数会报错RuntimeError: CUDA error: device-side assert triggered
|
||||||
|
loss = logloss(d.unsqueeze(1), y).squeeze()
|
||||||
|
loss = loss.mean()
|
||||||
|
return loss, d
|
||||||
|
|
||||||
|
def get_sync_loss(
|
||||||
|
audio_embed,
|
||||||
|
gt_frames,
|
||||||
|
pred_frames,
|
||||||
|
syncnet,
|
||||||
|
adapted_weight,
|
||||||
|
frames_left_index=0,
|
||||||
|
frames_right_index=16,
|
||||||
|
):
|
||||||
|
# 跟gt_frames做随机的插入交换,节省显存开销
|
||||||
|
assert pred_frames.shape[1] == (frames_right_index - frames_left_index) * 3
|
||||||
|
# 3通道图像
|
||||||
|
frames_sync_loss = torch.cat(
|
||||||
|
[gt_frames[:, :3 * frames_left_index, ...], pred_frames, gt_frames[:, 3 * frames_right_index:, ...]],
|
||||||
|
axis=1
|
||||||
|
)
|
||||||
|
vision_embed = syncnet.get_image_embed(frames_sync_loss)
|
||||||
|
y = torch.ones(frames_sync_loss.size(0), 1).float().to(audio_embed.device)
|
||||||
|
loss, score = cosine_loss(audio_embed, vision_embed, y)
|
||||||
|
return loss, score
|
||||||
|
|
||||||
|
class SyncNet_color(nn.Module):
|
||||||
|
def __init__(self):
|
||||||
|
super(SyncNet_color, self).__init__()
|
||||||
|
|
||||||
|
self.face_encoder = nn.Sequential(
|
||||||
|
Conv2d(15, 32, kernel_size=(7, 7), stride=1, padding=3),
|
||||||
|
|
||||||
|
Conv2d(32, 64, kernel_size=5, stride=(1, 2), padding=1),
|
||||||
|
Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True),
|
||||||
|
Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True),
|
||||||
|
|
||||||
|
Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
|
||||||
|
Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True),
|
||||||
|
Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True),
|
||||||
|
Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True),
|
||||||
|
|
||||||
|
Conv2d(128, 256, kernel_size=3, stride=2, padding=1),
|
||||||
|
Conv2d(256, 256, kernel_size=3, stride=1, padding=1, residual=True),
|
||||||
|
Conv2d(256, 256, kernel_size=3, stride=1, padding=1, residual=True),
|
||||||
|
|
||||||
|
Conv2d(256, 512, kernel_size=3, stride=2, padding=1),
|
||||||
|
Conv2d(512, 512, kernel_size=3, stride=1, padding=1, residual=True),
|
||||||
|
Conv2d(512, 512, kernel_size=3, stride=1, padding=1, residual=True),
|
||||||
|
|
||||||
|
Conv2d(512, 512, kernel_size=3, stride=2, padding=1),
|
||||||
|
Conv2d(512, 512, kernel_size=3, stride=1, padding=0),
|
||||||
|
Conv2d(512, 512, kernel_size=1, stride=1, padding=0),)
|
||||||
|
|
||||||
|
self.audio_encoder = nn.Sequential(
|
||||||
|
Conv2d(1, 32, kernel_size=3, stride=1, padding=1),
|
||||||
|
Conv2d(32, 32, kernel_size=3, stride=1, padding=1, residual=True),
|
||||||
|
Conv2d(32, 32, kernel_size=3, stride=1, padding=1, residual=True),
|
||||||
|
|
||||||
|
Conv2d(32, 64, kernel_size=3, stride=(3, 1), padding=1),
|
||||||
|
Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True),
|
||||||
|
Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True),
|
||||||
|
|
||||||
|
Conv2d(64, 128, kernel_size=3, stride=3, padding=1),
|
||||||
|
Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True),
|
||||||
|
Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True),
|
||||||
|
|
||||||
|
Conv2d(128, 256, kernel_size=3, stride=(3, 2), padding=1),
|
||||||
|
Conv2d(256, 256, kernel_size=3, stride=1, padding=1, residual=True),
|
||||||
|
Conv2d(256, 256, kernel_size=3, stride=1, padding=1, residual=True),
|
||||||
|
|
||||||
|
Conv2d(256, 512, kernel_size=3, stride=1, padding=0),
|
||||||
|
Conv2d(512, 512, kernel_size=1, stride=1, padding=0),)
|
||||||
|
|
||||||
|
def forward(self, audio_sequences, face_sequences): # audio_sequences := (B, dim, T)
|
||||||
|
face_embedding = self.face_encoder(face_sequences)
|
||||||
|
audio_embedding = self.audio_encoder(audio_sequences)
|
||||||
|
|
||||||
|
audio_embedding = audio_embedding.view(audio_embedding.size(0), -1)
|
||||||
|
face_embedding = face_embedding.view(face_embedding.size(0), -1)
|
||||||
|
|
||||||
|
audio_embedding = F.normalize(audio_embedding, p=2, dim=1)
|
||||||
|
face_embedding = F.normalize(face_embedding, p=2, dim=1)
|
||||||
|
|
||||||
|
|
||||||
|
return audio_embedding, face_embedding
|
||||||
Executable
+237
@@ -0,0 +1,237 @@
|
|||||||
|
'''
|
||||||
|
This part of code contains a pretrained vgg_face model.
|
||||||
|
ref link: https://github.com/prlz77/vgg-face.pytorch
|
||||||
|
'''
|
||||||
|
import torch
|
||||||
|
import torch.nn.functional as F
|
||||||
|
import torch.utils.model_zoo
|
||||||
|
import pickle
|
||||||
|
from musetalk.loss import resnet as ResNet
|
||||||
|
|
||||||
|
|
||||||
|
MODEL_URL = "https://github.com/claudio-unipv/vggface-pytorch/releases/download/v0.1/vggface-9d491dd7c30312.pth"
|
||||||
|
VGG_FACE_PATH = '/apdcephfs_cq8/share_1367250/zhentaoyu/Driving/00_VASA/00_data/models/pretrain_models/resnet50_ft_weight.pkl'
|
||||||
|
|
||||||
|
# It was 93.5940, 104.7624, 129.1863 before dividing by 255
|
||||||
|
MEAN_RGB = [
|
||||||
|
0.367035294117647,
|
||||||
|
0.41083294117647057,
|
||||||
|
0.5066129411764705
|
||||||
|
]
|
||||||
|
def load_state_dict(model, fname):
|
||||||
|
"""
|
||||||
|
Set parameters converted from Caffe models authors of VGGFace2 provide.
|
||||||
|
See https://www.robots.ox.ac.uk/~vgg/data/vgg_face2/.
|
||||||
|
|
||||||
|
Arguments:
|
||||||
|
model: model
|
||||||
|
fname: file name of parameters converted from a Caffe model, assuming the file format is Pickle.
|
||||||
|
"""
|
||||||
|
with open(fname, 'rb') as f:
|
||||||
|
weights = pickle.load(f, encoding='latin1')
|
||||||
|
|
||||||
|
own_state = model.state_dict()
|
||||||
|
for name, param in weights.items():
|
||||||
|
if name in own_state:
|
||||||
|
try:
|
||||||
|
own_state[name].copy_(torch.from_numpy(param))
|
||||||
|
except Exception:
|
||||||
|
raise RuntimeError('While copying the parameter named {}, whose dimensions in the model are {} and whose '\
|
||||||
|
'dimensions in the checkpoint are {}.'.format(name, own_state[name].size(), param.size()))
|
||||||
|
else:
|
||||||
|
raise KeyError('unexpected key "{}" in state_dict'.format(name))
|
||||||
|
|
||||||
|
|
||||||
|
def vggface2(pretrained=True):
|
||||||
|
vggface = ResNet.resnet50(num_classes=8631, include_top=True)
|
||||||
|
load_state_dict(vggface, VGG_FACE_PATH)
|
||||||
|
return vggface
|
||||||
|
|
||||||
|
def vggface(pretrained=False, **kwargs):
|
||||||
|
"""VGGFace model.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
pretrained (bool): If True, returns pre-trained model
|
||||||
|
"""
|
||||||
|
model = VggFace(**kwargs)
|
||||||
|
if pretrained:
|
||||||
|
state = torch.utils.model_zoo.load_url(MODEL_URL)
|
||||||
|
model.load_state_dict(state)
|
||||||
|
return model
|
||||||
|
|
||||||
|
|
||||||
|
class VggFace(torch.nn.Module):
|
||||||
|
def __init__(self, classes=2622):
|
||||||
|
"""VGGFace model.
|
||||||
|
|
||||||
|
Face recognition network. It takes as input a Bx3x224x224
|
||||||
|
batch of face images and gives as output a BxC score vector
|
||||||
|
(C is the number of identities).
|
||||||
|
Input images need to be scaled in the 0-1 range and then
|
||||||
|
normalized with respect to the mean RGB used during training.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
classes (int): number of identities recognized by the
|
||||||
|
network
|
||||||
|
|
||||||
|
"""
|
||||||
|
super().__init__()
|
||||||
|
self.conv1 = _ConvBlock(3, 64, 64)
|
||||||
|
self.conv2 = _ConvBlock(64, 128, 128)
|
||||||
|
self.conv3 = _ConvBlock(128, 256, 256, 256)
|
||||||
|
self.conv4 = _ConvBlock(256, 512, 512, 512)
|
||||||
|
self.conv5 = _ConvBlock(512, 512, 512, 512)
|
||||||
|
self.dropout = torch.nn.Dropout(0.5)
|
||||||
|
self.fc1 = torch.nn.Linear(7 * 7 * 512, 4096)
|
||||||
|
self.fc2 = torch.nn.Linear(4096, 4096)
|
||||||
|
self.fc3 = torch.nn.Linear(4096, classes)
|
||||||
|
|
||||||
|
def forward(self, x):
|
||||||
|
x = self.conv1(x)
|
||||||
|
x = self.conv2(x)
|
||||||
|
x = self.conv3(x)
|
||||||
|
x = self.conv4(x)
|
||||||
|
x = self.conv5(x)
|
||||||
|
x = x.view(x.size(0), -1)
|
||||||
|
x = self.dropout(F.relu(self.fc1(x)))
|
||||||
|
x = self.dropout(F.relu(self.fc2(x)))
|
||||||
|
x = self.fc3(x)
|
||||||
|
return x
|
||||||
|
|
||||||
|
|
||||||
|
class _ConvBlock(torch.nn.Module):
|
||||||
|
"""A Convolutional block."""
|
||||||
|
|
||||||
|
def __init__(self, *units):
|
||||||
|
"""Create a block with len(units) - 1 convolutions.
|
||||||
|
|
||||||
|
convolution number i transforms the number of channels from
|
||||||
|
units[i - 1] to units[i] channels.
|
||||||
|
|
||||||
|
"""
|
||||||
|
super().__init__()
|
||||||
|
self.convs = torch.nn.ModuleList([
|
||||||
|
torch.nn.Conv2d(in_, out, 3, 1, 1)
|
||||||
|
for in_, out in zip(units[:-1], units[1:])
|
||||||
|
])
|
||||||
|
|
||||||
|
def forward(self, x):
|
||||||
|
# Each convolution is followed by a ReLU, then the block is
|
||||||
|
# concluded by a max pooling.
|
||||||
|
for c in self.convs:
|
||||||
|
x = F.relu(c(x))
|
||||||
|
return F.max_pool2d(x, 2, 2, 0, ceil_mode=True)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
from torchvision import models
|
||||||
|
class Vgg19(torch.nn.Module):
|
||||||
|
"""
|
||||||
|
Vgg19 network for perceptual loss.
|
||||||
|
"""
|
||||||
|
def __init__(self, requires_grad=False):
|
||||||
|
super(Vgg19, self).__init__()
|
||||||
|
vgg_pretrained_features = models.vgg19(pretrained=True).features
|
||||||
|
self.slice1 = torch.nn.Sequential()
|
||||||
|
self.slice2 = torch.nn.Sequential()
|
||||||
|
self.slice3 = torch.nn.Sequential()
|
||||||
|
self.slice4 = torch.nn.Sequential()
|
||||||
|
self.slice5 = torch.nn.Sequential()
|
||||||
|
for x in range(2):
|
||||||
|
self.slice1.add_module(str(x), vgg_pretrained_features[x])
|
||||||
|
for x in range(2, 7):
|
||||||
|
self.slice2.add_module(str(x), vgg_pretrained_features[x])
|
||||||
|
for x in range(7, 12):
|
||||||
|
self.slice3.add_module(str(x), vgg_pretrained_features[x])
|
||||||
|
for x in range(12, 21):
|
||||||
|
self.slice4.add_module(str(x), vgg_pretrained_features[x])
|
||||||
|
for x in range(21, 30):
|
||||||
|
self.slice5.add_module(str(x), vgg_pretrained_features[x])
|
||||||
|
|
||||||
|
self.mean = torch.nn.Parameter(data=torch.Tensor(np.array([0.485, 0.456, 0.406]).reshape((1, 3, 1, 1))),
|
||||||
|
requires_grad=False)
|
||||||
|
self.std = torch.nn.Parameter(data=torch.Tensor(np.array([0.229, 0.224, 0.225]).reshape((1, 3, 1, 1))),
|
||||||
|
requires_grad=False)
|
||||||
|
|
||||||
|
if not requires_grad:
|
||||||
|
for param in self.parameters():
|
||||||
|
param.requires_grad = False
|
||||||
|
|
||||||
|
def forward(self, X):
|
||||||
|
X = (X - self.mean) / self.std
|
||||||
|
h_relu1 = self.slice1(X)
|
||||||
|
h_relu2 = self.slice2(h_relu1)
|
||||||
|
h_relu3 = self.slice3(h_relu2)
|
||||||
|
h_relu4 = self.slice4(h_relu3)
|
||||||
|
h_relu5 = self.slice5(h_relu4)
|
||||||
|
out = [h_relu1, h_relu2, h_relu3, h_relu4, h_relu5]
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
|
from torch import nn
|
||||||
|
class AntiAliasInterpolation2d(nn.Module):
|
||||||
|
"""
|
||||||
|
Band-limited downsampling, for better preservation of the input signal.
|
||||||
|
"""
|
||||||
|
def __init__(self, channels, scale):
|
||||||
|
super(AntiAliasInterpolation2d, self).__init__()
|
||||||
|
sigma = (1 / scale - 1) / 2
|
||||||
|
kernel_size = 2 * round(sigma * 4) + 1
|
||||||
|
self.ka = kernel_size // 2
|
||||||
|
self.kb = self.ka - 1 if kernel_size % 2 == 0 else self.ka
|
||||||
|
|
||||||
|
kernel_size = [kernel_size, kernel_size]
|
||||||
|
sigma = [sigma, sigma]
|
||||||
|
# The gaussian kernel is the product of the
|
||||||
|
# gaussian function of each dimension.
|
||||||
|
kernel = 1
|
||||||
|
meshgrids = torch.meshgrid(
|
||||||
|
[
|
||||||
|
torch.arange(size, dtype=torch.float32)
|
||||||
|
for size in kernel_size
|
||||||
|
]
|
||||||
|
)
|
||||||
|
for size, std, mgrid in zip(kernel_size, sigma, meshgrids):
|
||||||
|
mean = (size - 1) / 2
|
||||||
|
kernel *= torch.exp(-(mgrid - mean) ** 2 / (2 * std ** 2))
|
||||||
|
|
||||||
|
# Make sure sum of values in gaussian kernel equals 1.
|
||||||
|
kernel = kernel / torch.sum(kernel)
|
||||||
|
# Reshape to depthwise convolutional weight
|
||||||
|
kernel = kernel.view(1, 1, *kernel.size())
|
||||||
|
kernel = kernel.repeat(channels, *[1] * (kernel.dim() - 1))
|
||||||
|
|
||||||
|
self.register_buffer('weight', kernel)
|
||||||
|
self.groups = channels
|
||||||
|
self.scale = scale
|
||||||
|
inv_scale = 1 / scale
|
||||||
|
self.int_inv_scale = int(inv_scale)
|
||||||
|
|
||||||
|
def forward(self, input):
|
||||||
|
if self.scale == 1.0:
|
||||||
|
return input
|
||||||
|
|
||||||
|
out = F.pad(input, (self.ka, self.kb, self.ka, self.kb))
|
||||||
|
out = F.conv2d(out, weight=self.weight, groups=self.groups)
|
||||||
|
out = out[:, :, ::self.int_inv_scale, ::self.int_inv_scale]
|
||||||
|
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
|
class ImagePyramide(torch.nn.Module):
|
||||||
|
"""
|
||||||
|
Create image pyramide for computing pyramide perceptual loss.
|
||||||
|
"""
|
||||||
|
def __init__(self, scales, num_channels):
|
||||||
|
super(ImagePyramide, self).__init__()
|
||||||
|
downs = {}
|
||||||
|
for scale in scales:
|
||||||
|
downs[str(scale).replace('.', '-')] = AntiAliasInterpolation2d(num_channels, scale)
|
||||||
|
self.downs = nn.ModuleDict(downs)
|
||||||
|
|
||||||
|
def forward(self, x):
|
||||||
|
out_dict = {}
|
||||||
|
for scale, down_module in self.downs.items():
|
||||||
|
out_dict['prediction_' + str(scale).replace('-', '.')] = down_module(x)
|
||||||
|
return out_dict
|
||||||
Executable
+240
@@ -0,0 +1,240 @@
|
|||||||
|
"""
|
||||||
|
This file is modified from LatentSync (https://github.com/bytedance/LatentSync/blob/main/latentsync/models/stable_syncnet.py).
|
||||||
|
"""
|
||||||
|
|
||||||
|
import torch
|
||||||
|
from torch import nn
|
||||||
|
from einops import rearrange
|
||||||
|
from torch.nn import functional as F
|
||||||
|
|
||||||
|
import torch.nn as nn
|
||||||
|
import torch.nn.functional as F
|
||||||
|
|
||||||
|
from diffusers.models.attention import Attention as CrossAttention, FeedForward
|
||||||
|
from diffusers.utils.import_utils import is_xformers_available
|
||||||
|
from einops import rearrange
|
||||||
|
|
||||||
|
|
||||||
|
class SyncNet(nn.Module):
|
||||||
|
def __init__(self, config):
|
||||||
|
super().__init__()
|
||||||
|
self.audio_encoder = DownEncoder2D(
|
||||||
|
in_channels=config["audio_encoder"]["in_channels"],
|
||||||
|
block_out_channels=config["audio_encoder"]["block_out_channels"],
|
||||||
|
downsample_factors=config["audio_encoder"]["downsample_factors"],
|
||||||
|
dropout=config["audio_encoder"]["dropout"],
|
||||||
|
attn_blocks=config["audio_encoder"]["attn_blocks"],
|
||||||
|
)
|
||||||
|
|
||||||
|
self.visual_encoder = DownEncoder2D(
|
||||||
|
in_channels=config["visual_encoder"]["in_channels"],
|
||||||
|
block_out_channels=config["visual_encoder"]["block_out_channels"],
|
||||||
|
downsample_factors=config["visual_encoder"]["downsample_factors"],
|
||||||
|
dropout=config["visual_encoder"]["dropout"],
|
||||||
|
attn_blocks=config["visual_encoder"]["attn_blocks"],
|
||||||
|
)
|
||||||
|
|
||||||
|
self.eval()
|
||||||
|
|
||||||
|
def forward(self, image_sequences, audio_sequences):
|
||||||
|
vision_embeds = self.visual_encoder(image_sequences) # (b, c, 1, 1)
|
||||||
|
audio_embeds = self.audio_encoder(audio_sequences) # (b, c, 1, 1)
|
||||||
|
|
||||||
|
vision_embeds = vision_embeds.reshape(vision_embeds.shape[0], -1) # (b, c)
|
||||||
|
audio_embeds = audio_embeds.reshape(audio_embeds.shape[0], -1) # (b, c)
|
||||||
|
|
||||||
|
# Make them unit vectors
|
||||||
|
vision_embeds = F.normalize(vision_embeds, p=2, dim=1)
|
||||||
|
audio_embeds = F.normalize(audio_embeds, p=2, dim=1)
|
||||||
|
|
||||||
|
return vision_embeds, audio_embeds
|
||||||
|
|
||||||
|
def get_image_embed(self, image_sequences):
|
||||||
|
vision_embeds = self.visual_encoder(image_sequences) # (b, c, 1, 1)
|
||||||
|
|
||||||
|
vision_embeds = vision_embeds.reshape(vision_embeds.shape[0], -1) # (b, c)
|
||||||
|
|
||||||
|
# Make them unit vectors
|
||||||
|
vision_embeds = F.normalize(vision_embeds, p=2, dim=1)
|
||||||
|
|
||||||
|
return vision_embeds
|
||||||
|
|
||||||
|
def get_audio_embed(self, audio_sequences):
|
||||||
|
audio_embeds = self.audio_encoder(audio_sequences) # (b, c, 1, 1)
|
||||||
|
|
||||||
|
audio_embeds = audio_embeds.reshape(audio_embeds.shape[0], -1) # (b, c)
|
||||||
|
|
||||||
|
audio_embeds = F.normalize(audio_embeds, p=2, dim=1)
|
||||||
|
|
||||||
|
return audio_embeds
|
||||||
|
|
||||||
|
class ResnetBlock2D(nn.Module):
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
in_channels: int,
|
||||||
|
out_channels: int,
|
||||||
|
dropout: float = 0.0,
|
||||||
|
norm_num_groups: int = 32,
|
||||||
|
eps: float = 1e-6,
|
||||||
|
act_fn: str = "silu",
|
||||||
|
downsample_factor=2,
|
||||||
|
):
|
||||||
|
super().__init__()
|
||||||
|
|
||||||
|
self.norm1 = nn.GroupNorm(num_groups=norm_num_groups, num_channels=in_channels, eps=eps, affine=True)
|
||||||
|
self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1)
|
||||||
|
|
||||||
|
self.norm2 = nn.GroupNorm(num_groups=norm_num_groups, num_channels=out_channels, eps=eps, affine=True)
|
||||||
|
self.dropout = nn.Dropout(dropout)
|
||||||
|
self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1)
|
||||||
|
|
||||||
|
if act_fn == "relu":
|
||||||
|
self.act_fn = nn.ReLU()
|
||||||
|
elif act_fn == "silu":
|
||||||
|
self.act_fn = nn.SiLU()
|
||||||
|
|
||||||
|
if in_channels != out_channels:
|
||||||
|
self.conv_shortcut = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, padding=0)
|
||||||
|
else:
|
||||||
|
self.conv_shortcut = None
|
||||||
|
|
||||||
|
if isinstance(downsample_factor, list):
|
||||||
|
downsample_factor = tuple(downsample_factor)
|
||||||
|
|
||||||
|
if downsample_factor == 1:
|
||||||
|
self.downsample_conv = None
|
||||||
|
else:
|
||||||
|
self.downsample_conv = nn.Conv2d(
|
||||||
|
out_channels, out_channels, kernel_size=3, stride=downsample_factor, padding=0
|
||||||
|
)
|
||||||
|
self.pad = (0, 1, 0, 1)
|
||||||
|
if isinstance(downsample_factor, tuple):
|
||||||
|
if downsample_factor[0] == 1:
|
||||||
|
self.pad = (0, 1, 1, 1) # The padding order is from back to front
|
||||||
|
elif downsample_factor[1] == 1:
|
||||||
|
self.pad = (1, 1, 0, 1)
|
||||||
|
|
||||||
|
def forward(self, input_tensor):
|
||||||
|
hidden_states = input_tensor
|
||||||
|
|
||||||
|
hidden_states = self.norm1(hidden_states)
|
||||||
|
hidden_states = self.act_fn(hidden_states)
|
||||||
|
|
||||||
|
hidden_states = self.conv1(hidden_states)
|
||||||
|
hidden_states = self.norm2(hidden_states)
|
||||||
|
hidden_states = self.act_fn(hidden_states)
|
||||||
|
|
||||||
|
hidden_states = self.dropout(hidden_states)
|
||||||
|
hidden_states = self.conv2(hidden_states)
|
||||||
|
|
||||||
|
if self.conv_shortcut is not None:
|
||||||
|
input_tensor = self.conv_shortcut(input_tensor)
|
||||||
|
|
||||||
|
hidden_states += input_tensor
|
||||||
|
|
||||||
|
if self.downsample_conv is not None:
|
||||||
|
hidden_states = F.pad(hidden_states, self.pad, mode="constant", value=0)
|
||||||
|
hidden_states = self.downsample_conv(hidden_states)
|
||||||
|
|
||||||
|
return hidden_states
|
||||||
|
|
||||||
|
|
||||||
|
class AttentionBlock2D(nn.Module):
|
||||||
|
def __init__(self, query_dim, norm_num_groups=32, dropout=0.0):
|
||||||
|
super().__init__()
|
||||||
|
if not is_xformers_available():
|
||||||
|
raise ModuleNotFoundError(
|
||||||
|
"You have to install xformers to enable memory efficient attetion", name="xformers"
|
||||||
|
)
|
||||||
|
# inner_dim = dim_head * heads
|
||||||
|
self.norm1 = torch.nn.GroupNorm(num_groups=norm_num_groups, num_channels=query_dim, eps=1e-6, affine=True)
|
||||||
|
self.norm2 = nn.LayerNorm(query_dim)
|
||||||
|
self.norm3 = nn.LayerNorm(query_dim)
|
||||||
|
|
||||||
|
self.ff = FeedForward(query_dim, dropout=dropout, activation_fn="geglu")
|
||||||
|
|
||||||
|
self.conv_in = nn.Conv2d(query_dim, query_dim, kernel_size=1, stride=1, padding=0)
|
||||||
|
self.conv_out = nn.Conv2d(query_dim, query_dim, kernel_size=1, stride=1, padding=0)
|
||||||
|
|
||||||
|
self.attn = CrossAttention(query_dim=query_dim, heads=8, dim_head=query_dim // 8, dropout=dropout, bias=True)
|
||||||
|
self.attn._use_memory_efficient_attention_xformers = True
|
||||||
|
|
||||||
|
def forward(self, hidden_states):
|
||||||
|
assert hidden_states.dim() == 4, f"Expected hidden_states to have ndim=4, but got ndim={hidden_states.dim()}."
|
||||||
|
|
||||||
|
batch, channel, height, width = hidden_states.shape
|
||||||
|
residual = hidden_states
|
||||||
|
|
||||||
|
hidden_states = self.norm1(hidden_states)
|
||||||
|
hidden_states = self.conv_in(hidden_states)
|
||||||
|
hidden_states = rearrange(hidden_states, "b c h w -> b (h w) c")
|
||||||
|
|
||||||
|
norm_hidden_states = self.norm2(hidden_states)
|
||||||
|
hidden_states = self.attn(norm_hidden_states, attention_mask=None) + hidden_states
|
||||||
|
hidden_states = self.ff(self.norm3(hidden_states)) + hidden_states
|
||||||
|
|
||||||
|
hidden_states = rearrange(hidden_states, "b (h w) c -> b c h w", h=height, w=width)
|
||||||
|
hidden_states = self.conv_out(hidden_states)
|
||||||
|
|
||||||
|
hidden_states = hidden_states + residual
|
||||||
|
return hidden_states
|
||||||
|
|
||||||
|
|
||||||
|
class DownEncoder2D(nn.Module):
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
in_channels=4 * 16,
|
||||||
|
block_out_channels=[64, 128, 256, 256],
|
||||||
|
downsample_factors=[2, 2, 2, 2],
|
||||||
|
layers_per_block=2,
|
||||||
|
norm_num_groups=32,
|
||||||
|
attn_blocks=[1, 1, 1, 1],
|
||||||
|
dropout: float = 0.0,
|
||||||
|
act_fn="silu",
|
||||||
|
):
|
||||||
|
super().__init__()
|
||||||
|
self.layers_per_block = layers_per_block
|
||||||
|
|
||||||
|
# in
|
||||||
|
self.conv_in = nn.Conv2d(in_channels, block_out_channels[0], kernel_size=3, stride=1, padding=1)
|
||||||
|
|
||||||
|
# down
|
||||||
|
self.down_blocks = nn.ModuleList([])
|
||||||
|
|
||||||
|
output_channels = block_out_channels[0]
|
||||||
|
for i, block_out_channel in enumerate(block_out_channels):
|
||||||
|
input_channels = output_channels
|
||||||
|
output_channels = block_out_channel
|
||||||
|
# is_final_block = i == len(block_out_channels) - 1
|
||||||
|
|
||||||
|
down_block = ResnetBlock2D(
|
||||||
|
in_channels=input_channels,
|
||||||
|
out_channels=output_channels,
|
||||||
|
downsample_factor=downsample_factors[i],
|
||||||
|
norm_num_groups=norm_num_groups,
|
||||||
|
dropout=dropout,
|
||||||
|
act_fn=act_fn,
|
||||||
|
)
|
||||||
|
|
||||||
|
self.down_blocks.append(down_block)
|
||||||
|
|
||||||
|
if attn_blocks[i] == 1:
|
||||||
|
attention_block = AttentionBlock2D(query_dim=output_channels, dropout=dropout)
|
||||||
|
self.down_blocks.append(attention_block)
|
||||||
|
|
||||||
|
# out
|
||||||
|
self.norm_out = nn.GroupNorm(num_channels=block_out_channels[-1], num_groups=norm_num_groups, eps=1e-6)
|
||||||
|
self.act_fn_out = nn.ReLU()
|
||||||
|
|
||||||
|
def forward(self, hidden_states):
|
||||||
|
hidden_states = self.conv_in(hidden_states)
|
||||||
|
|
||||||
|
# down
|
||||||
|
for down_block in self.down_blocks:
|
||||||
|
hidden_states = down_block(hidden_states)
|
||||||
|
|
||||||
|
# post-process
|
||||||
|
hidden_states = self.norm_out(hidden_states)
|
||||||
|
hidden_states = self.act_fn_out(hidden_states)
|
||||||
|
|
||||||
|
return hidden_states
|
||||||
@@ -31,12 +31,16 @@ class UNet():
|
|||||||
unet_config,
|
unet_config,
|
||||||
model_path,
|
model_path,
|
||||||
use_float16=False,
|
use_float16=False,
|
||||||
|
device=None
|
||||||
):
|
):
|
||||||
with open(unet_config, 'r') as f:
|
with open(unet_config, 'r') as f:
|
||||||
unet_config = json.load(f)
|
unet_config = json.load(f)
|
||||||
self.model = UNet2DConditionModel(**unet_config)
|
self.model = UNet2DConditionModel(**unet_config)
|
||||||
self.pe = PositionalEncoding(d_model=384)
|
self.pe = PositionalEncoding(d_model=384)
|
||||||
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
if device != None:
|
||||||
|
self.device = device
|
||||||
|
else:
|
||||||
|
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
||||||
weights = torch.load(model_path) if torch.cuda.is_available() else torch.load(model_path, map_location=self.device)
|
weights = torch.load(model_path) if torch.cuda.is_available() else torch.load(model_path, map_location=self.device)
|
||||||
self.model.load_state_dict(weights)
|
self.model.load_state_dict(weights)
|
||||||
if use_float16:
|
if use_float16:
|
||||||
|
|||||||
Regular → Executable
Executable
+102
@@ -0,0 +1,102 @@
|
|||||||
|
import math
|
||||||
|
import os
|
||||||
|
|
||||||
|
import librosa
|
||||||
|
import numpy as np
|
||||||
|
import torch
|
||||||
|
from einops import rearrange
|
||||||
|
from transformers import AutoFeatureExtractor
|
||||||
|
|
||||||
|
|
||||||
|
class AudioProcessor:
|
||||||
|
def __init__(self, feature_extractor_path="openai/whisper-tiny/"):
|
||||||
|
self.feature_extractor = AutoFeatureExtractor.from_pretrained(feature_extractor_path)
|
||||||
|
|
||||||
|
def get_audio_feature(self, wav_path, start_index=0, weight_dtype=None):
|
||||||
|
if not os.path.exists(wav_path):
|
||||||
|
return None
|
||||||
|
librosa_output, sampling_rate = librosa.load(wav_path, sr=16000)
|
||||||
|
assert sampling_rate == 16000
|
||||||
|
# Split audio into 30s segments
|
||||||
|
segment_length = 30 * sampling_rate
|
||||||
|
segments = [librosa_output[i:i + segment_length] for i in range(0, len(librosa_output), segment_length)]
|
||||||
|
|
||||||
|
features = []
|
||||||
|
for segment in segments:
|
||||||
|
audio_feature = self.feature_extractor(
|
||||||
|
segment,
|
||||||
|
return_tensors="pt",
|
||||||
|
sampling_rate=sampling_rate
|
||||||
|
).input_features
|
||||||
|
if weight_dtype is not None:
|
||||||
|
audio_feature = audio_feature.to(dtype=weight_dtype)
|
||||||
|
features.append(audio_feature)
|
||||||
|
|
||||||
|
return features, len(librosa_output)
|
||||||
|
|
||||||
|
def get_whisper_chunk(
|
||||||
|
self,
|
||||||
|
whisper_input_features,
|
||||||
|
device,
|
||||||
|
weight_dtype,
|
||||||
|
whisper,
|
||||||
|
librosa_length,
|
||||||
|
fps=25,
|
||||||
|
audio_padding_length_left=2,
|
||||||
|
audio_padding_length_right=2,
|
||||||
|
):
|
||||||
|
audio_feature_length_per_frame = 2 * (audio_padding_length_left + audio_padding_length_right + 1)
|
||||||
|
whisper_feature = []
|
||||||
|
# Process multiple 30s mel input features
|
||||||
|
for input_feature in whisper_input_features:
|
||||||
|
input_feature = input_feature.to(device).to(weight_dtype)
|
||||||
|
audio_feats = whisper.encoder(input_feature, output_hidden_states=True).hidden_states
|
||||||
|
audio_feats = torch.stack(audio_feats, dim=2)
|
||||||
|
whisper_feature.append(audio_feats)
|
||||||
|
|
||||||
|
whisper_feature = torch.cat(whisper_feature, dim=1)
|
||||||
|
# Trim the last segment to remove padding
|
||||||
|
sr = 16000
|
||||||
|
audio_fps = 50
|
||||||
|
fps = int(fps)
|
||||||
|
whisper_idx_multiplier = audio_fps / fps
|
||||||
|
num_frames = math.floor((librosa_length / sr) * fps)
|
||||||
|
actual_length = math.floor((librosa_length / sr) * audio_fps)
|
||||||
|
whisper_feature = whisper_feature[:,:actual_length,...]
|
||||||
|
|
||||||
|
# Calculate padding amount
|
||||||
|
padding_nums = math.ceil(whisper_idx_multiplier)
|
||||||
|
# Add padding at start and end
|
||||||
|
whisper_feature = torch.cat([
|
||||||
|
torch.zeros_like(whisper_feature[:, :padding_nums * audio_padding_length_left]),
|
||||||
|
whisper_feature,
|
||||||
|
# Add extra padding to prevent out of bounds
|
||||||
|
torch.zeros_like(whisper_feature[:, :padding_nums * 3 * audio_padding_length_right])
|
||||||
|
], 1)
|
||||||
|
|
||||||
|
audio_prompts = []
|
||||||
|
for frame_index in range(num_frames):
|
||||||
|
try:
|
||||||
|
audio_index = math.floor(frame_index * whisper_idx_multiplier)
|
||||||
|
audio_clip = whisper_feature[:, audio_index: audio_index + audio_feature_length_per_frame]
|
||||||
|
assert audio_clip.shape[1] == audio_feature_length_per_frame
|
||||||
|
audio_prompts.append(audio_clip)
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error occurred: {e}")
|
||||||
|
print(f"whisper_feature.shape: {whisper_feature.shape}")
|
||||||
|
print(f"audio_clip.shape: {audio_clip.shape}")
|
||||||
|
print(f"num frames: {num_frames}, fps: {fps}, whisper_idx_multiplier: {whisper_idx_multiplier}")
|
||||||
|
print(f"frame_index: {frame_index}, audio_index: {audio_index}-{audio_index + audio_feature_length_per_frame}")
|
||||||
|
exit()
|
||||||
|
|
||||||
|
audio_prompts = torch.cat(audio_prompts, dim=0) # T, 10, 5, 384
|
||||||
|
audio_prompts = rearrange(audio_prompts, 'b c h w -> b (c h) w')
|
||||||
|
return audio_prompts
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
audio_processor = AudioProcessor()
|
||||||
|
wav_path = "./2.wav"
|
||||||
|
audio_feature, librosa_feature_length = audio_processor.get_audio_feature(wav_path)
|
||||||
|
print("Audio Feature shape:", audio_feature.shape)
|
||||||
|
print("librosa_feature_length:", librosa_feature_length)
|
||||||
|
|
||||||
@@ -0,0 +1,17 @@
|
|||||||
|
import os, subprocess
|
||||||
|
|
||||||
|
def ensure_wav(input_path: str, target_path: str | None = None) -> str:
|
||||||
|
"""
|
||||||
|
Convert any audio (mp3/ogg/m4a/wav/…) to 16kHz mono PCM WAV via ffmpeg.
|
||||||
|
Returns path to the converted .wav (original if already correct).
|
||||||
|
"""
|
||||||
|
if not isinstance(input_path, str) or not os.path.exists(input_path):
|
||||||
|
return input_path
|
||||||
|
base, ext = os.path.splitext(input_path)
|
||||||
|
ext = ext.lower()
|
||||||
|
|
||||||
|
if target_path is None:
|
||||||
|
target_path = base + "_16k.wav"
|
||||||
|
cmd = ["ffmpeg", "-y", "-i", input_path, "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", target_path]
|
||||||
|
subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
|
||||||
|
return target_path
|
||||||
Regular → Executable
+97
-61
@@ -1,9 +1,8 @@
|
|||||||
from PIL import Image
|
from PIL import Image
|
||||||
import numpy as np
|
import numpy as np
|
||||||
import cv2
|
import cv2
|
||||||
from face_parsing import FaceParsing
|
import copy
|
||||||
|
|
||||||
fp = FaceParsing()
|
|
||||||
|
|
||||||
def get_crop_box(box, expand):
|
def get_crop_box(box, expand):
|
||||||
x, y, x1, y1 = box
|
x, y, x1, y1 = box
|
||||||
@@ -13,78 +12,88 @@ def get_crop_box(box, expand):
|
|||||||
crop_box = [x_c-s, y_c-s, x_c+s, y_c+s]
|
crop_box = [x_c-s, y_c-s, x_c+s, y_c+s]
|
||||||
return crop_box, s
|
return crop_box, s
|
||||||
|
|
||||||
def face_seg(image):
|
|
||||||
seg_image = fp(image)
|
def face_seg(image, mode="raw", fp=None):
|
||||||
|
"""
|
||||||
|
对图像进行面部解析,生成面部区域的掩码。
|
||||||
|
|
||||||
|
Args:
|
||||||
|
image (PIL.Image): 输入图像。
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
PIL.Image: 面部区域的掩码图像。
|
||||||
|
"""
|
||||||
|
seg_image = fp(image, mode=mode) # 使用 FaceParsing 模型解析面部
|
||||||
if seg_image is None:
|
if seg_image is None:
|
||||||
print("error, no person_segment")
|
print("error, no person_segment") # 如果没有检测到面部,返回错误
|
||||||
return None
|
return None
|
||||||
|
|
||||||
seg_image = seg_image.resize(image.size)
|
seg_image = seg_image.resize(image.size) # 将掩码图像调整为输入图像的大小
|
||||||
return seg_image
|
return seg_image
|
||||||
|
|
||||||
def get_image(image,face,face_box,upper_boundary_ratio = 0.5,expand=1.2):
|
|
||||||
#print(image.shape)
|
|
||||||
#print(face.shape)
|
|
||||||
|
|
||||||
body = Image.fromarray(image[:,:,::-1])
|
|
||||||
face = Image.fromarray(face[:,:,::-1])
|
|
||||||
|
|
||||||
x, y, x1, y1 = face_box
|
def get_image(image, face, face_box, upper_boundary_ratio=0.5, expand=1.5, mode="raw", fp=None):
|
||||||
#print(x1-x,y1-y)
|
"""
|
||||||
crop_box, s = get_crop_box(face_box, expand)
|
将裁剪的面部图像粘贴回原始图像,并进行一些处理。
|
||||||
x_s, y_s, x_e, y_e = crop_box
|
|
||||||
face_position = (x, y)
|
|
||||||
|
|
||||||
|
Args:
|
||||||
|
image (numpy.ndarray): 原始图像(身体部分)。
|
||||||
|
face (numpy.ndarray): 裁剪的面部图像。
|
||||||
|
face_box (tuple): 面部边界框的坐标 (x, y, x1, y1)。
|
||||||
|
upper_boundary_ratio (float): 用于控制面部区域的保留比例。
|
||||||
|
expand (float): 扩展因子,用于放大裁剪框。
|
||||||
|
mode: 融合mask构建方式
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
numpy.ndarray: 处理后的图像。
|
||||||
|
"""
|
||||||
|
# 将 numpy 数组转换为 PIL 图像
|
||||||
|
body = Image.fromarray(image[:, :, ::-1]) # 身体部分图像(整张图)
|
||||||
|
face = Image.fromarray(face[:, :, ::-1]) # 面部图像
|
||||||
|
|
||||||
|
x, y, x1, y1 = face_box # 获取面部边界框的坐标
|
||||||
|
crop_box, s = get_crop_box(face_box, expand) # 计算扩展后的裁剪框
|
||||||
|
x_s, y_s, x_e, y_e = crop_box # 裁剪框的坐标
|
||||||
|
face_position = (x, y) # 面部在原始图像中的位置
|
||||||
|
|
||||||
|
# 从身体图像中裁剪出扩展后的面部区域(下巴到边界有距离)
|
||||||
face_large = body.crop(crop_box)
|
face_large = body.crop(crop_box)
|
||||||
ori_shape = face_large.size
|
|
||||||
|
ori_shape = face_large.size # 裁剪后图像的原始尺寸
|
||||||
|
|
||||||
mask_image = face_seg(face_large)
|
# 对裁剪后的面部区域进行面部解析,生成掩码
|
||||||
mask_small = mask_image.crop((x-x_s, y-y_s, x1-x_s, y1-y_s))
|
mask_image = face_seg(face_large, mode=mode, fp=fp)
|
||||||
mask_image = Image.new('L', ori_shape, 0)
|
|
||||||
mask_image.paste(mask_small, (x-x_s, y-y_s, x1-x_s, y1-y_s))
|
mask_small = mask_image.crop((x - x_s, y - y_s, x1 - x_s, y1 - y_s)) # 裁剪出面部区域的掩码
|
||||||
|
|
||||||
# keep upper_boundary_ratio of talking area
|
mask_image = Image.new('L', ori_shape, 0) # 创建一个全黑的掩码图像
|
||||||
width, height = mask_image.size
|
mask_image.paste(mask_small, (x - x_s, y - y_s, x1 - x_s, y1 - y_s)) # 将面部掩码粘贴到全黑图像上
|
||||||
top_boundary = int(height * upper_boundary_ratio)
|
|
||||||
modified_mask_image = Image.new('L', ori_shape, 0)
|
|
||||||
modified_mask_image.paste(mask_image.crop((0, top_boundary, width, height)), (0, top_boundary))
|
# 保留面部区域的上半部分(用于控制说话区域)
|
||||||
|
width, height = mask_image.size
|
||||||
blur_kernel_size = int(0.1 * ori_shape[0] // 2 * 2) + 1
|
top_boundary = int(height * upper_boundary_ratio) # 计算上半部分的边界
|
||||||
mask_array = cv2.GaussianBlur(np.array(modified_mask_image), (blur_kernel_size, blur_kernel_size), 0)
|
modified_mask_image = Image.new('L', ori_shape, 0) # 创建一个新的全黑掩码图像
|
||||||
mask_image = Image.fromarray(mask_array)
|
modified_mask_image.paste(mask_image.crop((0, top_boundary, width, height)), (0, top_boundary)) # 粘贴上半部分掩码
|
||||||
|
|
||||||
|
|
||||||
|
# 对掩码进行高斯模糊,使边缘更平滑
|
||||||
|
blur_kernel_size = int(0.05 * ori_shape[0] // 2 * 2) + 1 # 计算模糊核大小
|
||||||
|
mask_array = cv2.GaussianBlur(np.array(modified_mask_image), (blur_kernel_size, blur_kernel_size), 0) # 高斯模糊
|
||||||
|
#mask_array = np.array(modified_mask_image)
|
||||||
|
mask_image = Image.fromarray(mask_array) # 将模糊后的掩码转换回 PIL 图像
|
||||||
|
|
||||||
|
# 将裁剪的面部图像粘贴回扩展后的面部区域
|
||||||
|
face_large.paste(face, (x - x_s, y - y_s, x1 - x_s, y1 - y_s))
|
||||||
|
|
||||||
face_large.paste(face, (x-x_s, y-y_s, x1-x_s, y1-y_s))
|
|
||||||
body.paste(face_large, crop_box[:2], mask_image)
|
body.paste(face_large, crop_box[:2], mask_image)
|
||||||
body = np.array(body)
|
|
||||||
return body[:,:,::-1]
|
body = np.array(body) # 将 PIL 图像转换回 numpy 数组
|
||||||
|
|
||||||
def get_image_prepare_material(image,face_box,upper_boundary_ratio = 0.5,expand=1.2):
|
return body[:, :, ::-1] # 返回处理后的图像(BGR 转 RGB)
|
||||||
body = Image.fromarray(image[:,:,::-1])
|
|
||||||
|
|
||||||
x, y, x1, y1 = face_box
|
|
||||||
#print(x1-x,y1-y)
|
|
||||||
crop_box, s = get_crop_box(face_box, expand)
|
|
||||||
x_s, y_s, x_e, y_e = crop_box
|
|
||||||
|
|
||||||
face_large = body.crop(crop_box)
|
def get_image_blending(image, face, face_box, mask_array, crop_box):
|
||||||
ori_shape = face_large.size
|
|
||||||
|
|
||||||
mask_image = face_seg(face_large)
|
|
||||||
mask_small = mask_image.crop((x-x_s, y-y_s, x1-x_s, y1-y_s))
|
|
||||||
mask_image = Image.new('L', ori_shape, 0)
|
|
||||||
mask_image.paste(mask_small, (x-x_s, y-y_s, x1-x_s, y1-y_s))
|
|
||||||
|
|
||||||
# keep upper_boundary_ratio of talking area
|
|
||||||
width, height = mask_image.size
|
|
||||||
top_boundary = int(height * upper_boundary_ratio)
|
|
||||||
modified_mask_image = Image.new('L', ori_shape, 0)
|
|
||||||
modified_mask_image.paste(mask_image.crop((0, top_boundary, width, height)), (0, top_boundary))
|
|
||||||
|
|
||||||
blur_kernel_size = int(0.1 * ori_shape[0] // 2 * 2) + 1
|
|
||||||
mask_array = cv2.GaussianBlur(np.array(modified_mask_image), (blur_kernel_size, blur_kernel_size), 0)
|
|
||||||
return mask_array,crop_box
|
|
||||||
|
|
||||||
def get_image_blending(image,face,face_box,mask_array,crop_box):
|
|
||||||
body = Image.fromarray(image[:,:,::-1])
|
body = Image.fromarray(image[:,:,::-1])
|
||||||
face = Image.fromarray(face[:,:,::-1])
|
face = Image.fromarray(face[:,:,::-1])
|
||||||
|
|
||||||
@@ -97,4 +106,31 @@ def get_image_blending(image,face,face_box,mask_array,crop_box):
|
|||||||
face_large.paste(face, (x-x_s, y-y_s, x1-x_s, y1-y_s))
|
face_large.paste(face, (x-x_s, y-y_s, x1-x_s, y1-y_s))
|
||||||
body.paste(face_large, crop_box[:2], mask_image)
|
body.paste(face_large, crop_box[:2], mask_image)
|
||||||
body = np.array(body)
|
body = np.array(body)
|
||||||
return body[:,:,::-1]
|
return body[:,:,::-1]
|
||||||
|
|
||||||
|
|
||||||
|
def get_image_prepare_material(image, face_box, upper_boundary_ratio=0.5, expand=1.5, fp=None, mode="raw"):
|
||||||
|
body = Image.fromarray(image[:,:,::-1])
|
||||||
|
|
||||||
|
x, y, x1, y1 = face_box
|
||||||
|
#print(x1-x,y1-y)
|
||||||
|
crop_box, s = get_crop_box(face_box, expand)
|
||||||
|
x_s, y_s, x_e, y_e = crop_box
|
||||||
|
|
||||||
|
face_large = body.crop(crop_box)
|
||||||
|
ori_shape = face_large.size
|
||||||
|
|
||||||
|
mask_image = face_seg(face_large, mode=mode, fp=fp)
|
||||||
|
mask_small = mask_image.crop((x-x_s, y-y_s, x1-x_s, y1-y_s))
|
||||||
|
mask_image = Image.new('L', ori_shape, 0)
|
||||||
|
mask_image.paste(mask_small, (x-x_s, y-y_s, x1-x_s, y1-y_s))
|
||||||
|
|
||||||
|
# keep upper_boundary_ratio of talking area
|
||||||
|
width, height = mask_image.size
|
||||||
|
top_boundary = int(height * upper_boundary_ratio)
|
||||||
|
modified_mask_image = Image.new('L', ori_shape, 0)
|
||||||
|
modified_mask_image.paste(mask_image.crop((0, top_boundary, width, height)), (0, top_boundary))
|
||||||
|
|
||||||
|
blur_kernel_size = int(0.1 * ori_shape[0] // 2 * 2) + 1
|
||||||
|
mask_array = cv2.GaussianBlur(np.array(modified_mask_image), (blur_kernel_size, blur_kernel_size), 0)
|
||||||
|
return mask_array, crop_box
|
||||||
|
|||||||
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
@@ -8,9 +8,53 @@ from .model import BiSeNet
|
|||||||
import torchvision.transforms as transforms
|
import torchvision.transforms as transforms
|
||||||
|
|
||||||
class FaceParsing():
|
class FaceParsing():
|
||||||
def __init__(self):
|
def __init__(self, left_cheek_width=80, right_cheek_width=80):
|
||||||
self.net = self.model_init()
|
self.net = self.model_init()
|
||||||
self.preprocess = self.image_preprocess()
|
self.preprocess = self.image_preprocess()
|
||||||
|
# Ensure all size parameters are integers
|
||||||
|
cone_height = 21
|
||||||
|
tail_height = 12
|
||||||
|
total_size = cone_height + tail_height
|
||||||
|
|
||||||
|
# Create kernel with explicit integer dimensions
|
||||||
|
kernel = np.zeros((total_size, total_size), dtype=np.uint8)
|
||||||
|
center_x = total_size // 2 # Ensure center coordinates are integers
|
||||||
|
|
||||||
|
# Cone part
|
||||||
|
for row in range(cone_height):
|
||||||
|
if row < cone_height//2:
|
||||||
|
continue
|
||||||
|
width = int(2 * (row - cone_height//2) + 1)
|
||||||
|
start = int(center_x - (width // 2))
|
||||||
|
end = int(center_x + (width // 2) + 1)
|
||||||
|
kernel[row, start:end] = 1
|
||||||
|
|
||||||
|
# Vertical extension part
|
||||||
|
if cone_height > 0:
|
||||||
|
base_width = int(kernel[cone_height-1].sum())
|
||||||
|
else:
|
||||||
|
base_width = 1
|
||||||
|
|
||||||
|
for row in range(cone_height, total_size):
|
||||||
|
start = max(0, int(center_x - (base_width//2)))
|
||||||
|
end = min(total_size, int(center_x + (base_width//2) + 1))
|
||||||
|
kernel[row, start:end] = 1
|
||||||
|
self.kernel = kernel
|
||||||
|
|
||||||
|
# Modify cheek erosion kernel to be flatter ellipse
|
||||||
|
self.cheek_kernel = cv2.getStructuringElement(
|
||||||
|
cv2.MORPH_ELLIPSE, (35, 3))
|
||||||
|
|
||||||
|
# Add cheek area mask (protect chin area)
|
||||||
|
self.cheek_mask = self._create_cheek_mask(left_cheek_width=left_cheek_width, right_cheek_width=right_cheek_width)
|
||||||
|
|
||||||
|
def _create_cheek_mask(self, left_cheek_width=80, right_cheek_width=80):
|
||||||
|
"""Create cheek area mask (1/4 area on both sides)"""
|
||||||
|
mask = np.zeros((512, 512), dtype=np.uint8)
|
||||||
|
center = 512 // 2
|
||||||
|
cv2.rectangle(mask, (0, 0), (center - left_cheek_width, 512), 255, -1) # Left cheek
|
||||||
|
cv2.rectangle(mask, (center + right_cheek_width, 0), (512, 512), 255, -1) # Right cheek
|
||||||
|
return mask
|
||||||
|
|
||||||
def model_init(self,
|
def model_init(self,
|
||||||
resnet_path='./models/face-parse-bisent/resnet18-5c106cde.pth',
|
resnet_path='./models/face-parse-bisent/resnet18-5c106cde.pth',
|
||||||
@@ -30,7 +74,7 @@ class FaceParsing():
|
|||||||
transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
|
transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
|
||||||
])
|
])
|
||||||
|
|
||||||
def __call__(self, image, size=(512, 512)):
|
def __call__(self, image, size=(512, 512), mode="raw"):
|
||||||
if isinstance(image, str):
|
if isinstance(image, str):
|
||||||
image = Image.open(image)
|
image = Image.open(image)
|
||||||
|
|
||||||
@@ -44,8 +88,25 @@ class FaceParsing():
|
|||||||
img = torch.unsqueeze(img, 0)
|
img = torch.unsqueeze(img, 0)
|
||||||
out = self.net(img)[0]
|
out = self.net(img)[0]
|
||||||
parsing = out.squeeze(0).cpu().numpy().argmax(0)
|
parsing = out.squeeze(0).cpu().numpy().argmax(0)
|
||||||
parsing[np.where(parsing>13)] = 0
|
|
||||||
parsing[np.where(parsing>=1)] = 255
|
# Add 14:neck, remove 10:nose and 7:8:9
|
||||||
|
if mode == "neck":
|
||||||
|
parsing[np.isin(parsing, [1, 11, 12, 13, 14])] = 255
|
||||||
|
parsing[np.where(parsing!=255)] = 0
|
||||||
|
elif mode == "jaw":
|
||||||
|
face_region = np.isin(parsing, [1])*255
|
||||||
|
face_region = face_region.astype(np.uint8)
|
||||||
|
original_dilated = cv2.dilate(face_region, self.kernel, iterations=1)
|
||||||
|
eroded = cv2.erode(original_dilated, self.cheek_kernel, iterations=2)
|
||||||
|
face_region = cv2.bitwise_and(eroded, self.cheek_mask)
|
||||||
|
face_region = cv2.bitwise_or(face_region, cv2.bitwise_and(original_dilated, ~self.cheek_mask))
|
||||||
|
parsing[(face_region==255) & (~np.isin(parsing, [10]))] = 255
|
||||||
|
parsing[np.isin(parsing, [11, 12, 13])] = 255
|
||||||
|
parsing[np.where(parsing!=255)] = 0
|
||||||
|
else:
|
||||||
|
parsing[np.isin(parsing, [1, 11, 12, 13])] = 255
|
||||||
|
parsing[np.where(parsing!=255)] = 0
|
||||||
|
|
||||||
parsing = Image.fromarray(parsing.astype(np.uint8))
|
parsing = Image.fromarray(parsing.astype(np.uint8))
|
||||||
return parsing
|
return parsing
|
||||||
|
|
||||||
|
|||||||
Regular → Executable
+2
-1
@@ -118,7 +118,8 @@ def get_landmark_and_bbox(img_list,upperbondrange =0):
|
|||||||
if upperbondrange != 0:
|
if upperbondrange != 0:
|
||||||
half_face_coord[1] = upperbondrange+half_face_coord[1] #手动调整 + 向下(偏29) - 向上(偏28)
|
half_face_coord[1] = upperbondrange+half_face_coord[1] #手动调整 + 向下(偏29) - 向上(偏28)
|
||||||
half_face_dist = np.max(face_land_mark[:,1]) - half_face_coord[1]
|
half_face_dist = np.max(face_land_mark[:,1]) - half_face_coord[1]
|
||||||
upper_bond = half_face_coord[1]-half_face_dist
|
min_upper_bond = 0
|
||||||
|
upper_bond = max(min_upper_bond, half_face_coord[1] - half_face_dist)
|
||||||
|
|
||||||
f_landmark = (np.min(face_land_mark[:, 0]),int(upper_bond),np.max(face_land_mark[:, 0]),np.max(face_land_mark[:,1]))
|
f_landmark = (np.min(face_land_mark[:, 0]),int(upper_bond),np.max(face_land_mark[:, 0]),np.max(face_land_mark[:,1]))
|
||||||
x1, y1, x2, y2 = f_landmark
|
x1, y1, x2, y2 = f_landmark
|
||||||
|
|||||||
@@ -0,0 +1,337 @@
|
|||||||
|
import os
|
||||||
|
import json
|
||||||
|
import logging
|
||||||
|
import torch
|
||||||
|
import torch.nn as nn
|
||||||
|
import torch.optim as optim
|
||||||
|
from torch.optim.lr_scheduler import CosineAnnealingLR
|
||||||
|
from diffusers import AutoencoderKL, UNet2DConditionModel
|
||||||
|
from transformers import WhisperModel
|
||||||
|
from diffusers.optimization import get_scheduler
|
||||||
|
from omegaconf import OmegaConf
|
||||||
|
from einops import rearrange
|
||||||
|
|
||||||
|
from musetalk.models.syncnet import SyncNet
|
||||||
|
from musetalk.loss.discriminator import MultiScaleDiscriminator, DiscriminatorFullModel
|
||||||
|
from musetalk.loss.basic_loss import Interpolate
|
||||||
|
import musetalk.loss.vgg_face as vgg_face
|
||||||
|
from musetalk.data.dataset import PortraitDataset
|
||||||
|
from musetalk.utils.utils import (
|
||||||
|
get_image_pred,
|
||||||
|
process_audio_features,
|
||||||
|
process_and_save_images
|
||||||
|
)
|
||||||
|
|
||||||
|
class Net(nn.Module):
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
unet: UNet2DConditionModel,
|
||||||
|
):
|
||||||
|
super().__init__()
|
||||||
|
self.unet = unet
|
||||||
|
|
||||||
|
def forward(
|
||||||
|
self,
|
||||||
|
input_latents,
|
||||||
|
timesteps,
|
||||||
|
audio_prompts,
|
||||||
|
):
|
||||||
|
model_pred = self.unet(
|
||||||
|
input_latents,
|
||||||
|
timesteps,
|
||||||
|
encoder_hidden_states=audio_prompts
|
||||||
|
).sample
|
||||||
|
return model_pred
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
def initialize_models_and_optimizers(cfg, accelerator, weight_dtype):
|
||||||
|
"""Initialize models and optimizers"""
|
||||||
|
model_dict = {
|
||||||
|
'vae': None,
|
||||||
|
'unet': None,
|
||||||
|
'net': None,
|
||||||
|
'wav2vec': None,
|
||||||
|
'optimizer': None,
|
||||||
|
'lr_scheduler': None,
|
||||||
|
'scheduler_max_steps': None,
|
||||||
|
'trainable_params': None
|
||||||
|
}
|
||||||
|
|
||||||
|
model_dict['vae'] = AutoencoderKL.from_pretrained(
|
||||||
|
cfg.pretrained_model_name_or_path,
|
||||||
|
subfolder=cfg.vae_type,
|
||||||
|
)
|
||||||
|
|
||||||
|
unet_config_file = os.path.join(
|
||||||
|
cfg.pretrained_model_name_or_path,
|
||||||
|
cfg.unet_sub_folder + "/musetalk.json"
|
||||||
|
)
|
||||||
|
|
||||||
|
with open(unet_config_file, 'r') as f:
|
||||||
|
unet_config = json.load(f)
|
||||||
|
model_dict['unet'] = UNet2DConditionModel(**unet_config)
|
||||||
|
|
||||||
|
if not cfg.random_init_unet:
|
||||||
|
pretrained_unet_path = os.path.join(cfg.pretrained_model_name_or_path, cfg.unet_sub_folder, "pytorch_model.bin")
|
||||||
|
print(f"### Loading existing unet weights from {pretrained_unet_path}. ###")
|
||||||
|
checkpoint = torch.load(pretrained_unet_path, map_location=accelerator.device)
|
||||||
|
model_dict['unet'].load_state_dict(checkpoint)
|
||||||
|
|
||||||
|
unet_params = [p.numel() for n, p in model_dict['unet'].named_parameters()]
|
||||||
|
logger.info(f"unet {sum(unet_params) / 1e6}M-parameter")
|
||||||
|
|
||||||
|
model_dict['vae'].requires_grad_(False)
|
||||||
|
model_dict['unet'].requires_grad_(True)
|
||||||
|
|
||||||
|
model_dict['vae'].to(accelerator.device, dtype=weight_dtype)
|
||||||
|
|
||||||
|
model_dict['net'] = Net(model_dict['unet'])
|
||||||
|
|
||||||
|
model_dict['wav2vec'] = WhisperModel.from_pretrained(cfg.whisper_path).to(
|
||||||
|
device="cuda", dtype=weight_dtype).eval()
|
||||||
|
model_dict['wav2vec'].requires_grad_(False)
|
||||||
|
|
||||||
|
if cfg.solver.gradient_checkpointing:
|
||||||
|
model_dict['unet'].enable_gradient_checkpointing()
|
||||||
|
|
||||||
|
if cfg.solver.scale_lr:
|
||||||
|
learning_rate = (
|
||||||
|
cfg.solver.learning_rate
|
||||||
|
* cfg.solver.gradient_accumulation_steps
|
||||||
|
* cfg.data.train_bs
|
||||||
|
* accelerator.num_processes
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
learning_rate = cfg.solver.learning_rate
|
||||||
|
|
||||||
|
if cfg.solver.use_8bit_adam:
|
||||||
|
try:
|
||||||
|
import bitsandbytes as bnb
|
||||||
|
except ImportError:
|
||||||
|
raise ImportError(
|
||||||
|
"Please install bitsandbytes to use 8-bit Adam. You can do so by running `pip install bitsandbytes`"
|
||||||
|
)
|
||||||
|
optimizer_cls = bnb.optim.AdamW8bit
|
||||||
|
else:
|
||||||
|
optimizer_cls = torch.optim.AdamW
|
||||||
|
|
||||||
|
model_dict['trainable_params'] = list(filter(lambda p: p.requires_grad, model_dict['net'].parameters()))
|
||||||
|
if accelerator.is_main_process:
|
||||||
|
print('trainable params')
|
||||||
|
for n, p in model_dict['net'].named_parameters():
|
||||||
|
if p.requires_grad:
|
||||||
|
print(n)
|
||||||
|
|
||||||
|
model_dict['optimizer'] = optimizer_cls(
|
||||||
|
model_dict['trainable_params'],
|
||||||
|
lr=learning_rate,
|
||||||
|
betas=(cfg.solver.adam_beta1, cfg.solver.adam_beta2),
|
||||||
|
weight_decay=cfg.solver.adam_weight_decay,
|
||||||
|
eps=cfg.solver.adam_epsilon,
|
||||||
|
)
|
||||||
|
|
||||||
|
model_dict['scheduler_max_steps'] = cfg.solver.max_train_steps * cfg.solver.gradient_accumulation_steps
|
||||||
|
model_dict['lr_scheduler'] = get_scheduler(
|
||||||
|
cfg.solver.lr_scheduler,
|
||||||
|
optimizer=model_dict['optimizer'],
|
||||||
|
num_warmup_steps=cfg.solver.lr_warmup_steps * cfg.solver.gradient_accumulation_steps,
|
||||||
|
num_training_steps=model_dict['scheduler_max_steps'],
|
||||||
|
)
|
||||||
|
|
||||||
|
return model_dict
|
||||||
|
|
||||||
|
def initialize_dataloaders(cfg):
|
||||||
|
"""Initialize training and validation dataloaders"""
|
||||||
|
dataloader_dict = {
|
||||||
|
'train_dataset': None,
|
||||||
|
'val_dataset': None,
|
||||||
|
'train_dataloader': None,
|
||||||
|
'val_dataloader': None
|
||||||
|
}
|
||||||
|
|
||||||
|
dataloader_dict['train_dataset'] = PortraitDataset(cfg={
|
||||||
|
'image_size': cfg.data.image_size,
|
||||||
|
'T': cfg.data.n_sample_frames,
|
||||||
|
"sample_method": cfg.data.sample_method,
|
||||||
|
'top_k_ratio': cfg.data.top_k_ratio,
|
||||||
|
"contorl_face_min_size": cfg.data.contorl_face_min_size,
|
||||||
|
"dataset_key": cfg.data.dataset_key,
|
||||||
|
"padding_pixel_mouth": cfg.padding_pixel_mouth,
|
||||||
|
"whisper_path": cfg.whisper_path,
|
||||||
|
"min_face_size": cfg.data.min_face_size,
|
||||||
|
"cropping_jaw2edge_margin_mean": cfg.cropping_jaw2edge_margin_mean,
|
||||||
|
"cropping_jaw2edge_margin_std": cfg.cropping_jaw2edge_margin_std,
|
||||||
|
"crop_type": cfg.crop_type,
|
||||||
|
"random_margin_method": cfg.random_margin_method,
|
||||||
|
})
|
||||||
|
|
||||||
|
dataloader_dict['train_dataloader'] = torch.utils.data.DataLoader(
|
||||||
|
dataloader_dict['train_dataset'],
|
||||||
|
batch_size=cfg.data.train_bs,
|
||||||
|
shuffle=True,
|
||||||
|
num_workers=cfg.data.num_workers,
|
||||||
|
)
|
||||||
|
|
||||||
|
dataloader_dict['val_dataset'] = PortraitDataset(cfg={
|
||||||
|
'image_size': cfg.data.image_size,
|
||||||
|
'T': cfg.data.n_sample_frames,
|
||||||
|
"sample_method": cfg.data.sample_method,
|
||||||
|
'top_k_ratio': cfg.data.top_k_ratio,
|
||||||
|
"contorl_face_min_size": cfg.data.contorl_face_min_size,
|
||||||
|
"dataset_key": cfg.data.dataset_key,
|
||||||
|
"padding_pixel_mouth": cfg.padding_pixel_mouth,
|
||||||
|
"whisper_path": cfg.whisper_path,
|
||||||
|
"min_face_size": cfg.data.min_face_size,
|
||||||
|
"cropping_jaw2edge_margin_mean": cfg.cropping_jaw2edge_margin_mean,
|
||||||
|
"cropping_jaw2edge_margin_std": cfg.cropping_jaw2edge_margin_std,
|
||||||
|
"crop_type": cfg.crop_type,
|
||||||
|
"random_margin_method": cfg.random_margin_method,
|
||||||
|
})
|
||||||
|
|
||||||
|
dataloader_dict['val_dataloader'] = torch.utils.data.DataLoader(
|
||||||
|
dataloader_dict['val_dataset'],
|
||||||
|
batch_size=cfg.data.train_bs,
|
||||||
|
shuffle=True,
|
||||||
|
num_workers=1,
|
||||||
|
)
|
||||||
|
|
||||||
|
return dataloader_dict
|
||||||
|
|
||||||
|
def initialize_loss_functions(cfg, accelerator, scheduler_max_steps):
|
||||||
|
"""Initialize loss functions and discriminators"""
|
||||||
|
loss_dict = {
|
||||||
|
'L1_loss': nn.L1Loss(reduction='mean'),
|
||||||
|
'discriminator': None,
|
||||||
|
'mouth_discriminator': None,
|
||||||
|
'optimizer_D': None,
|
||||||
|
'mouth_optimizer_D': None,
|
||||||
|
'scheduler_D': None,
|
||||||
|
'mouth_scheduler_D': None,
|
||||||
|
'disc_scales': None,
|
||||||
|
'discriminator_full': None,
|
||||||
|
'mouth_discriminator_full': None
|
||||||
|
}
|
||||||
|
|
||||||
|
if cfg.loss_params.gan_loss > 0:
|
||||||
|
loss_dict['discriminator'] = MultiScaleDiscriminator(
|
||||||
|
**cfg.model_params.discriminator_params).to(accelerator.device)
|
||||||
|
loss_dict['discriminator_full'] = DiscriminatorFullModel(loss_dict['discriminator'])
|
||||||
|
loss_dict['disc_scales'] = cfg.model_params.discriminator_params.scales
|
||||||
|
loss_dict['optimizer_D'] = optim.AdamW(
|
||||||
|
loss_dict['discriminator'].parameters(),
|
||||||
|
lr=cfg.discriminator_train_params.lr,
|
||||||
|
weight_decay=cfg.discriminator_train_params.weight_decay,
|
||||||
|
betas=cfg.discriminator_train_params.betas,
|
||||||
|
eps=cfg.discriminator_train_params.eps)
|
||||||
|
loss_dict['scheduler_D'] = CosineAnnealingLR(
|
||||||
|
loss_dict['optimizer_D'],
|
||||||
|
T_max=scheduler_max_steps,
|
||||||
|
eta_min=1e-6
|
||||||
|
)
|
||||||
|
|
||||||
|
if cfg.loss_params.mouth_gan_loss > 0:
|
||||||
|
loss_dict['mouth_discriminator'] = MultiScaleDiscriminator(
|
||||||
|
**cfg.model_params.discriminator_params).to(accelerator.device)
|
||||||
|
loss_dict['mouth_discriminator_full'] = DiscriminatorFullModel(loss_dict['mouth_discriminator'])
|
||||||
|
loss_dict['mouth_optimizer_D'] = optim.AdamW(
|
||||||
|
loss_dict['mouth_discriminator'].parameters(),
|
||||||
|
lr=cfg.discriminator_train_params.lr,
|
||||||
|
weight_decay=cfg.discriminator_train_params.weight_decay,
|
||||||
|
betas=cfg.discriminator_train_params.betas,
|
||||||
|
eps=cfg.discriminator_train_params.eps)
|
||||||
|
loss_dict['mouth_scheduler_D'] = CosineAnnealingLR(
|
||||||
|
loss_dict['mouth_optimizer_D'],
|
||||||
|
T_max=scheduler_max_steps,
|
||||||
|
eta_min=1e-6
|
||||||
|
)
|
||||||
|
|
||||||
|
return loss_dict
|
||||||
|
|
||||||
|
def initialize_syncnet(cfg, accelerator, weight_dtype):
|
||||||
|
"""Initialize SyncNet model"""
|
||||||
|
if cfg.loss_params.sync_loss > 0 or cfg.use_adapted_weight:
|
||||||
|
if cfg.data.n_sample_frames != 16:
|
||||||
|
raise ValueError(
|
||||||
|
f"Invalid n_sample_frames {cfg.data.n_sample_frames} for sync_loss, it should be 16."
|
||||||
|
)
|
||||||
|
syncnet_config = OmegaConf.load(cfg.syncnet_config_path)
|
||||||
|
syncnet = SyncNet(OmegaConf.to_container(
|
||||||
|
syncnet_config.model)).to(accelerator.device)
|
||||||
|
print(
|
||||||
|
f"Load SyncNet checkpoint from: {syncnet_config.ckpt.inference_ckpt_path}")
|
||||||
|
checkpoint = torch.load(
|
||||||
|
syncnet_config.ckpt.inference_ckpt_path, map_location=accelerator.device)
|
||||||
|
syncnet.load_state_dict(checkpoint["state_dict"])
|
||||||
|
syncnet.to(dtype=weight_dtype)
|
||||||
|
syncnet.requires_grad_(False)
|
||||||
|
syncnet.eval()
|
||||||
|
return syncnet
|
||||||
|
return None
|
||||||
|
|
||||||
|
def initialize_vgg(cfg, accelerator):
|
||||||
|
"""Initialize VGG model"""
|
||||||
|
if cfg.loss_params.vgg_loss > 0:
|
||||||
|
vgg_IN = vgg_face.Vgg19().to(accelerator.device,)
|
||||||
|
pyramid = vgg_face.ImagePyramide(
|
||||||
|
cfg.loss_params.pyramid_scale, 3).to(accelerator.device)
|
||||||
|
vgg_IN.eval()
|
||||||
|
downsampler = Interpolate(
|
||||||
|
size=(224, 224), mode='bilinear', align_corners=False).to(accelerator.device)
|
||||||
|
return vgg_IN, pyramid, downsampler
|
||||||
|
return None, None, None
|
||||||
|
|
||||||
|
def validation(
|
||||||
|
cfg,
|
||||||
|
val_dataloader,
|
||||||
|
net,
|
||||||
|
vae,
|
||||||
|
wav2vec,
|
||||||
|
accelerator,
|
||||||
|
save_dir,
|
||||||
|
global_step,
|
||||||
|
weight_dtype,
|
||||||
|
syncnet_score=1,
|
||||||
|
):
|
||||||
|
"""Validation function for model evaluation"""
|
||||||
|
net.eval() # Set the model to evaluation mode
|
||||||
|
for batch in val_dataloader:
|
||||||
|
# The same ref_latents
|
||||||
|
ref_pixel_values = batch["pixel_values_ref_img"].to(weight_dtype).to(
|
||||||
|
accelerator.device, non_blocking=True
|
||||||
|
)
|
||||||
|
pixel_values = batch["pixel_values_vid"].to(weight_dtype).to(
|
||||||
|
accelerator.device, non_blocking=True
|
||||||
|
)
|
||||||
|
bsz, num_frames, c, h, w = ref_pixel_values.shape
|
||||||
|
|
||||||
|
audio_prompts = process_audio_features(cfg, batch, wav2vec, bsz, num_frames, weight_dtype)
|
||||||
|
# audio feature for unet
|
||||||
|
audio_prompts = rearrange(
|
||||||
|
audio_prompts,
|
||||||
|
'b f c h w-> (b f) c h w'
|
||||||
|
)
|
||||||
|
audio_prompts = rearrange(
|
||||||
|
audio_prompts,
|
||||||
|
'(b f) c h w -> (b f) (c h) w',
|
||||||
|
b=bsz
|
||||||
|
)
|
||||||
|
# different masked_latents
|
||||||
|
image_pred_train = get_image_pred(
|
||||||
|
pixel_values, ref_pixel_values, audio_prompts, vae, net, weight_dtype)
|
||||||
|
image_pred_infer = get_image_pred(
|
||||||
|
ref_pixel_values, ref_pixel_values, audio_prompts, vae, net, weight_dtype)
|
||||||
|
|
||||||
|
process_and_save_images(
|
||||||
|
batch,
|
||||||
|
image_pred_train,
|
||||||
|
image_pred_infer,
|
||||||
|
save_dir,
|
||||||
|
global_step,
|
||||||
|
accelerator,
|
||||||
|
cfg.num_images_to_keep,
|
||||||
|
syncnet_score
|
||||||
|
)
|
||||||
|
# only infer 1 image in validation
|
||||||
|
break
|
||||||
|
net.train() # Set the model back to training mode
|
||||||
Regular → Executable
+278
-23
@@ -2,26 +2,33 @@ import os
|
|||||||
import cv2
|
import cv2
|
||||||
import numpy as np
|
import numpy as np
|
||||||
import torch
|
import torch
|
||||||
|
from typing import Union, List
|
||||||
|
import torch.nn.functional as F
|
||||||
|
from einops import rearrange
|
||||||
|
import shutil
|
||||||
|
import os.path as osp
|
||||||
|
|
||||||
ffmpeg_path = os.getenv('FFMPEG_PATH')
|
|
||||||
if ffmpeg_path is None:
|
|
||||||
print("please download ffmpeg-static and export to FFMPEG_PATH. \nFor example: export FFMPEG_PATH=/musetalk/ffmpeg-4.4-amd64-static")
|
|
||||||
elif ffmpeg_path not in os.getenv('PATH'):
|
|
||||||
print("add ffmpeg to path")
|
|
||||||
os.environ["PATH"] = f"{ffmpeg_path}:{os.environ['PATH']}"
|
|
||||||
|
|
||||||
|
|
||||||
from musetalk.whisper.audio2feature import Audio2Feature
|
|
||||||
from musetalk.models.vae import VAE
|
from musetalk.models.vae import VAE
|
||||||
from musetalk.models.unet import UNet,PositionalEncoding
|
from musetalk.models.unet import UNet,PositionalEncoding
|
||||||
|
|
||||||
def load_all_model():
|
|
||||||
audio_processor = Audio2Feature(model_path="./models/whisper/tiny.pt")
|
def load_all_model(
|
||||||
vae = VAE(model_path = "./models/sd-vae-ft-mse/")
|
unet_model_path=os.path.join("models", "musetalkV15", "unet.pth"),
|
||||||
unet = UNet(unet_config="./models/musetalk/musetalk.json",
|
vae_type="sd-vae",
|
||||||
model_path ="./models/musetalk/pytorch_model.bin")
|
unet_config=os.path.join("models", "musetalkV15", "musetalk.json"),
|
||||||
|
device=None,
|
||||||
|
):
|
||||||
|
vae = VAE(
|
||||||
|
model_path = os.path.join("models", vae_type),
|
||||||
|
)
|
||||||
|
print(f"load unet model from {unet_model_path}")
|
||||||
|
unet = UNet(
|
||||||
|
unet_config=unet_config,
|
||||||
|
model_path=unet_model_path,
|
||||||
|
device=device
|
||||||
|
)
|
||||||
pe = PositionalEncoding(d_model=384)
|
pe = PositionalEncoding(d_model=384)
|
||||||
return audio_processor,vae,unet,pe
|
return vae, unet, pe
|
||||||
|
|
||||||
def get_file_type(video_path):
|
def get_file_type(video_path):
|
||||||
_, ext = os.path.splitext(video_path)
|
_, ext = os.path.splitext(video_path)
|
||||||
@@ -39,10 +46,13 @@ def get_video_fps(video_path):
|
|||||||
video.release()
|
video.release()
|
||||||
return fps
|
return fps
|
||||||
|
|
||||||
def datagen(whisper_chunks,
|
def datagen(
|
||||||
vae_encode_latents,
|
whisper_chunks,
|
||||||
batch_size=8,
|
vae_encode_latents,
|
||||||
delay_frame=0):
|
batch_size=8,
|
||||||
|
delay_frame=0,
|
||||||
|
device="cuda:0",
|
||||||
|
):
|
||||||
whisper_batch, latent_batch = [], []
|
whisper_batch, latent_batch = [], []
|
||||||
for i, w in enumerate(whisper_chunks):
|
for i, w in enumerate(whisper_chunks):
|
||||||
idx = (i+delay_frame)%len(vae_encode_latents)
|
idx = (i+delay_frame)%len(vae_encode_latents)
|
||||||
@@ -51,14 +61,259 @@ def datagen(whisper_chunks,
|
|||||||
latent_batch.append(latent)
|
latent_batch.append(latent)
|
||||||
|
|
||||||
if len(latent_batch) >= batch_size:
|
if len(latent_batch) >= batch_size:
|
||||||
whisper_batch = np.stack(whisper_batch)
|
whisper_batch = torch.stack(whisper_batch)
|
||||||
latent_batch = torch.cat(latent_batch, dim=0)
|
latent_batch = torch.cat(latent_batch, dim=0)
|
||||||
yield whisper_batch, latent_batch
|
yield whisper_batch, latent_batch
|
||||||
whisper_batch, latent_batch = [], []
|
whisper_batch, latent_batch = [], []
|
||||||
|
|
||||||
# the last batch may smaller than batch size
|
# the last batch may smaller than batch size
|
||||||
if len(latent_batch) > 0:
|
if len(latent_batch) > 0:
|
||||||
whisper_batch = np.stack(whisper_batch)
|
whisper_batch = torch.stack(whisper_batch)
|
||||||
latent_batch = torch.cat(latent_batch, dim=0)
|
latent_batch = torch.cat(latent_batch, dim=0)
|
||||||
|
|
||||||
yield whisper_batch, latent_batch
|
yield whisper_batch.to(device), latent_batch.to(device)
|
||||||
|
|
||||||
|
def cast_training_params(
|
||||||
|
model: Union[torch.nn.Module, List[torch.nn.Module]],
|
||||||
|
dtype=torch.float32,
|
||||||
|
):
|
||||||
|
if not isinstance(model, list):
|
||||||
|
model = [model]
|
||||||
|
for m in model:
|
||||||
|
for param in m.parameters():
|
||||||
|
# only upcast trainable parameters into fp32
|
||||||
|
if param.requires_grad:
|
||||||
|
param.data = param.to(dtype)
|
||||||
|
|
||||||
|
def rand_log_normal(
|
||||||
|
shape,
|
||||||
|
loc=0.,
|
||||||
|
scale=1.,
|
||||||
|
device='cpu',
|
||||||
|
dtype=torch.float32,
|
||||||
|
generator=None
|
||||||
|
):
|
||||||
|
"""Draws samples from an lognormal distribution."""
|
||||||
|
rnd_normal = torch.randn(
|
||||||
|
shape, device=device, dtype=dtype, generator=generator) # N(0, I)
|
||||||
|
sigma = (rnd_normal * scale + loc).exp()
|
||||||
|
return sigma
|
||||||
|
|
||||||
|
def get_mouth_region(frames, image_pred, pixel_values_face_mask):
|
||||||
|
# Initialize lists to store the results for each image in the batch
|
||||||
|
mouth_real_list = []
|
||||||
|
mouth_generated_list = []
|
||||||
|
|
||||||
|
# Process each image in the batch
|
||||||
|
for b in range(frames.shape[0]):
|
||||||
|
# Find the non-zero area in the face mask
|
||||||
|
non_zero_indices = torch.nonzero(pixel_values_face_mask[b])
|
||||||
|
# If there are no non-zero indices, skip this image
|
||||||
|
if non_zero_indices.numel() == 0:
|
||||||
|
continue
|
||||||
|
|
||||||
|
min_y, max_y = torch.min(non_zero_indices[:, 1]), torch.max(
|
||||||
|
non_zero_indices[:, 1])
|
||||||
|
min_x, max_x = torch.min(non_zero_indices[:, 2]), torch.max(
|
||||||
|
non_zero_indices[:, 2])
|
||||||
|
|
||||||
|
# Crop the frames and image_pred according to the non-zero area
|
||||||
|
frames_cropped = frames[b, :, min_y:max_y, min_x:max_x]
|
||||||
|
image_pred_cropped = image_pred[b, :, min_y:max_y, min_x:max_x]
|
||||||
|
# Resize the cropped images to 256*256
|
||||||
|
frames_resized = F.interpolate(frames_cropped.unsqueeze(
|
||||||
|
0), size=(256, 256), mode='bilinear', align_corners=False)
|
||||||
|
image_pred_resized = F.interpolate(image_pred_cropped.unsqueeze(
|
||||||
|
0), size=(256, 256), mode='bilinear', align_corners=False)
|
||||||
|
|
||||||
|
# Append the resized images to the result lists
|
||||||
|
mouth_real_list.append(frames_resized)
|
||||||
|
mouth_generated_list.append(image_pred_resized)
|
||||||
|
|
||||||
|
# Convert the lists to tensors if they are not empty
|
||||||
|
mouth_real = torch.cat(mouth_real_list, dim=0) if mouth_real_list else None
|
||||||
|
mouth_generated = torch.cat(
|
||||||
|
mouth_generated_list, dim=0) if mouth_generated_list else None
|
||||||
|
|
||||||
|
return mouth_real, mouth_generated
|
||||||
|
|
||||||
|
def get_image_pred(pixel_values,
|
||||||
|
ref_pixel_values,
|
||||||
|
audio_prompts,
|
||||||
|
vae,
|
||||||
|
net,
|
||||||
|
weight_dtype):
|
||||||
|
with torch.no_grad():
|
||||||
|
bsz, num_frames, c, h, w = pixel_values.shape
|
||||||
|
|
||||||
|
masked_pixel_values = pixel_values.clone()
|
||||||
|
masked_pixel_values[:, :, :, h//2:, :] = -1
|
||||||
|
|
||||||
|
masked_frames = rearrange(
|
||||||
|
masked_pixel_values, 'b f c h w -> (b f) c h w')
|
||||||
|
masked_latents = vae.encode(masked_frames).latent_dist.mode()
|
||||||
|
masked_latents = masked_latents * vae.config.scaling_factor
|
||||||
|
masked_latents = masked_latents.float()
|
||||||
|
|
||||||
|
ref_frames = rearrange(ref_pixel_values, 'b f c h w-> (b f) c h w')
|
||||||
|
ref_latents = vae.encode(ref_frames).latent_dist.mode()
|
||||||
|
ref_latents = ref_latents * vae.config.scaling_factor
|
||||||
|
ref_latents = ref_latents.float()
|
||||||
|
|
||||||
|
input_latents = torch.cat([masked_latents, ref_latents], dim=1)
|
||||||
|
input_latents = input_latents.to(weight_dtype)
|
||||||
|
timesteps = torch.tensor([0], device=input_latents.device)
|
||||||
|
latents_pred = net(
|
||||||
|
input_latents,
|
||||||
|
timesteps,
|
||||||
|
audio_prompts,
|
||||||
|
)
|
||||||
|
latents_pred = (1 / vae.config.scaling_factor) * latents_pred
|
||||||
|
image_pred = vae.decode(latents_pred).sample
|
||||||
|
image_pred = image_pred.float()
|
||||||
|
|
||||||
|
return image_pred
|
||||||
|
|
||||||
|
def process_audio_features(cfg, batch, wav2vec, bsz, num_frames, weight_dtype):
|
||||||
|
with torch.no_grad():
|
||||||
|
audio_feature_length_per_frame = 2 * \
|
||||||
|
(cfg.data.audio_padding_length_left +
|
||||||
|
cfg.data.audio_padding_length_right + 1)
|
||||||
|
audio_feats = batch['audio_feature'].to(weight_dtype)
|
||||||
|
audio_feats = wav2vec.encoder(
|
||||||
|
audio_feats, output_hidden_states=True).hidden_states
|
||||||
|
audio_feats = torch.stack(audio_feats, dim=2).to(weight_dtype) # [B, T, 10, 5, 384]
|
||||||
|
|
||||||
|
start_ts = batch['audio_offset']
|
||||||
|
step_ts = batch['audio_step']
|
||||||
|
audio_feats = torch.cat([torch.zeros_like(audio_feats[:, :2*cfg.data.audio_padding_length_left]),
|
||||||
|
audio_feats,
|
||||||
|
torch.zeros_like(audio_feats[:, :2*cfg.data.audio_padding_length_right])], 1)
|
||||||
|
audio_prompts = []
|
||||||
|
for bb in range(bsz):
|
||||||
|
audio_feats_list = []
|
||||||
|
for f in range(num_frames):
|
||||||
|
cur_t = (start_ts[bb] + f * step_ts[bb]) * 2
|
||||||
|
audio_clip = audio_feats[bb:bb+1,
|
||||||
|
cur_t: cur_t+audio_feature_length_per_frame]
|
||||||
|
|
||||||
|
audio_feats_list.append(audio_clip)
|
||||||
|
audio_feats_list = torch.stack(audio_feats_list, 1)
|
||||||
|
audio_prompts.append(audio_feats_list)
|
||||||
|
audio_prompts = torch.cat(audio_prompts) # B, T, 10, 5, 384
|
||||||
|
return audio_prompts
|
||||||
|
|
||||||
|
def save_checkpoint(model, save_dir, ckpt_num, name="appearance_net", total_limit=None, logger=None):
|
||||||
|
save_path = os.path.join(save_dir, f"{name}-{ckpt_num}.pth")
|
||||||
|
|
||||||
|
if total_limit is not None:
|
||||||
|
checkpoints = os.listdir(save_dir)
|
||||||
|
checkpoints = [d for d in checkpoints if d.endswith(".pth")]
|
||||||
|
checkpoints = [d for d in checkpoints if name in d]
|
||||||
|
checkpoints = sorted(
|
||||||
|
checkpoints, key=lambda x: int(x.split("-")[1].split(".")[0])
|
||||||
|
)
|
||||||
|
|
||||||
|
if len(checkpoints) >= total_limit:
|
||||||
|
num_to_remove = len(checkpoints) - total_limit + 1
|
||||||
|
removing_checkpoints = checkpoints[0:num_to_remove]
|
||||||
|
logger.info(
|
||||||
|
f"{len(checkpoints)} checkpoints already exist, removing {len(removing_checkpoints)} checkpoints"
|
||||||
|
)
|
||||||
|
logger.info(
|
||||||
|
f"removing checkpoints: {', '.join(removing_checkpoints)}")
|
||||||
|
|
||||||
|
for removing_checkpoint in removing_checkpoints:
|
||||||
|
removing_checkpoint = os.path.join(
|
||||||
|
save_dir, removing_checkpoint)
|
||||||
|
os.remove(removing_checkpoint)
|
||||||
|
|
||||||
|
state_dict = model.state_dict()
|
||||||
|
torch.save(state_dict, save_path)
|
||||||
|
|
||||||
|
def save_models(accelerator, net, save_dir, global_step, cfg, logger=None):
|
||||||
|
unwarp_net = accelerator.unwrap_model(net)
|
||||||
|
save_checkpoint(
|
||||||
|
unwarp_net.unet,
|
||||||
|
save_dir,
|
||||||
|
global_step,
|
||||||
|
name="unet",
|
||||||
|
total_limit=cfg.total_limit,
|
||||||
|
logger=logger
|
||||||
|
)
|
||||||
|
|
||||||
|
def delete_additional_ckpt(base_path, num_keep):
|
||||||
|
dirs = []
|
||||||
|
for d in os.listdir(base_path):
|
||||||
|
if d.startswith("checkpoint-"):
|
||||||
|
dirs.append(d)
|
||||||
|
num_tot = len(dirs)
|
||||||
|
if num_tot <= num_keep:
|
||||||
|
return
|
||||||
|
# ensure ckpt is sorted and delete the ealier!
|
||||||
|
del_dirs = sorted(dirs, key=lambda x: int(x.split("-")[-1]))[: num_tot - num_keep]
|
||||||
|
for d in del_dirs:
|
||||||
|
path_to_dir = osp.join(base_path, d)
|
||||||
|
if osp.exists(path_to_dir):
|
||||||
|
shutil.rmtree(path_to_dir)
|
||||||
|
|
||||||
|
def seed_everything(seed):
|
||||||
|
import random
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
torch.manual_seed(seed)
|
||||||
|
torch.cuda.manual_seed_all(seed)
|
||||||
|
np.random.seed(seed % (2**32))
|
||||||
|
random.seed(seed)
|
||||||
|
|
||||||
|
def process_and_save_images(
|
||||||
|
batch,
|
||||||
|
image_pred,
|
||||||
|
image_pred_infer,
|
||||||
|
save_dir,
|
||||||
|
global_step,
|
||||||
|
accelerator,
|
||||||
|
num_images_to_keep=10,
|
||||||
|
syncnet_score=1
|
||||||
|
):
|
||||||
|
# Rearrange the tensors
|
||||||
|
print("image_pred.shape: ", image_pred.shape)
|
||||||
|
pixel_values_ref_img = rearrange(batch['pixel_values_ref_img'], "b f c h w -> (b f) c h w")
|
||||||
|
pixel_values = rearrange(batch["pixel_values_vid"], 'b f c h w -> (b f) c h w')
|
||||||
|
|
||||||
|
# Create masked pixel values
|
||||||
|
masked_pixel_values = batch["pixel_values_vid"].clone()
|
||||||
|
_, _, _, h, _ = batch["pixel_values_vid"].shape
|
||||||
|
masked_pixel_values[:, :, :, h//2:, :] = -1
|
||||||
|
masked_pixel_values = rearrange(masked_pixel_values, 'b f c h w -> (b f) c h w')
|
||||||
|
|
||||||
|
# Keep only the specified number of images
|
||||||
|
pixel_values = pixel_values[:num_images_to_keep, :, :, :]
|
||||||
|
masked_pixel_values = masked_pixel_values[:num_images_to_keep, :, :, :]
|
||||||
|
pixel_values_ref_img = pixel_values_ref_img[:num_images_to_keep, :, :, :]
|
||||||
|
image_pred = image_pred.detach()[:num_images_to_keep, :, :, :]
|
||||||
|
image_pred_infer = image_pred_infer.detach()[:num_images_to_keep, :, :, :]
|
||||||
|
|
||||||
|
# Concatenate images
|
||||||
|
concat = torch.cat([
|
||||||
|
masked_pixel_values * 0.5 + 0.5,
|
||||||
|
pixel_values_ref_img * 0.5 + 0.5,
|
||||||
|
image_pred * 0.5 + 0.5,
|
||||||
|
pixel_values * 0.5 + 0.5,
|
||||||
|
image_pred_infer * 0.5 + 0.5,
|
||||||
|
], dim=2)
|
||||||
|
print("concat.shape: ", concat.shape)
|
||||||
|
|
||||||
|
# Create the save directory if it doesn't exist
|
||||||
|
os.makedirs(f'{save_dir}/samples/', exist_ok=True)
|
||||||
|
|
||||||
|
# Try to save the concatenated image
|
||||||
|
try:
|
||||||
|
# Concatenate images horizontally and convert to numpy array
|
||||||
|
final_image = torch.cat([concat[i] for i in range(concat.shape[0])], dim=-1).permute(1, 2, 0).cpu().numpy()[:, :, [2, 1, 0]] * 255
|
||||||
|
# Save the image
|
||||||
|
cv2.imwrite(f'{save_dir}/samples/sample_{global_step}_{accelerator.device}_SyncNetScore_{syncnet_score}.jpg', final_image)
|
||||||
|
print(f"Image saved successfully: {save_dir}/samples/sample_{global_step}_{accelerator.device}_SyncNetScore_{syncnet_score}.jpg")
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Failed to save image: {e}")
|
||||||
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
Regular → Executable
@@ -0,0 +1,15 @@
|
|||||||
|
[build-system]
|
||||||
|
requires = ["setuptools>=64"]
|
||||||
|
build-backend = "setuptools.build_meta"
|
||||||
|
|
||||||
|
[project]
|
||||||
|
name = "musetalk"
|
||||||
|
version = "1.5.0"
|
||||||
|
description = "MuseTalk: audio-driven lip-sync (source-only install; dependencies managed by the consumer)"
|
||||||
|
readme = "README.md"
|
||||||
|
requires-python = ">=3.10"
|
||||||
|
license = { text = "MIT" }
|
||||||
|
|
||||||
|
[tool.setuptools.packages.find]
|
||||||
|
include = ["musetalk*"]
|
||||||
|
exclude = ["scripts*", "assets*", "data*", "configs*"]
|
||||||
+6
-7
@@ -1,14 +1,15 @@
|
|||||||
--extra-index-url https://download.pytorch.org/whl/cu118
|
diffusers==0.30.2
|
||||||
torch==2.0.1
|
|
||||||
torchvision==0.15.2
|
|
||||||
torchaudio==2.0.2
|
|
||||||
diffusers==0.27.2
|
|
||||||
accelerate==0.28.0
|
accelerate==0.28.0
|
||||||
|
numpy==1.23.5
|
||||||
tensorflow==2.12.0
|
tensorflow==2.12.0
|
||||||
tensorboard==2.12.0
|
tensorboard==2.12.0
|
||||||
opencv-python==4.9.0.80
|
opencv-python==4.9.0.80
|
||||||
soundfile==0.12.1
|
soundfile==0.12.1
|
||||||
transformers==4.39.2
|
transformers==4.39.2
|
||||||
|
huggingface_hub==0.30.2
|
||||||
|
librosa==0.11.0
|
||||||
|
einops==0.8.1
|
||||||
|
gradio==5.24.0
|
||||||
|
|
||||||
gdown
|
gdown
|
||||||
requests
|
requests
|
||||||
@@ -16,6 +17,4 @@ imageio[ffmpeg]
|
|||||||
|
|
||||||
omegaconf
|
omegaconf
|
||||||
ffmpeg-python
|
ffmpeg-python
|
||||||
gradio
|
|
||||||
spaces
|
|
||||||
moviepy
|
moviepy
|
||||||
|
|||||||
@@ -0,0 +1 @@
|
|||||||
|
|
||||||
+252
-137
@@ -1,161 +1,276 @@
|
|||||||
import argparse
|
|
||||||
import os
|
import os
|
||||||
from omegaconf import OmegaConf
|
|
||||||
import numpy as np
|
|
||||||
import cv2
|
import cv2
|
||||||
|
import math
|
||||||
|
import copy
|
||||||
import torch
|
import torch
|
||||||
import glob
|
import glob
|
||||||
import pickle
|
|
||||||
from tqdm import tqdm
|
|
||||||
import copy
|
|
||||||
|
|
||||||
from musetalk.utils.utils import get_file_type,get_video_fps,datagen
|
|
||||||
from musetalk.utils.preprocessing import get_landmark_and_bbox,read_imgs,coord_placeholder
|
|
||||||
from musetalk.utils.blending import get_image
|
|
||||||
from musetalk.utils.utils import load_all_model
|
|
||||||
import shutil
|
import shutil
|
||||||
|
import pickle
|
||||||
|
import argparse
|
||||||
|
import numpy as np
|
||||||
|
import subprocess
|
||||||
|
from tqdm import tqdm
|
||||||
|
from omegaconf import OmegaConf
|
||||||
|
from transformers import WhisperModel
|
||||||
|
import sys
|
||||||
|
|
||||||
# load model weights
|
from musetalk.utils.blending import get_image
|
||||||
audio_processor, vae, unet, pe = load_all_model()
|
from musetalk.utils.face_parsing import FaceParsing
|
||||||
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
from musetalk.utils.audio_processor import AudioProcessor
|
||||||
timesteps = torch.tensor([0], device=device)
|
from musetalk.utils.utils import get_file_type, get_video_fps, datagen, load_all_model
|
||||||
|
from musetalk.utils.preprocessing import get_landmark_and_bbox, read_imgs, coord_placeholder
|
||||||
|
|
||||||
|
def fast_check_ffmpeg():
|
||||||
|
try:
|
||||||
|
subprocess.run(["ffmpeg", "-version"], capture_output=True, check=True)
|
||||||
|
return True
|
||||||
|
except:
|
||||||
|
return False
|
||||||
|
|
||||||
@torch.no_grad()
|
@torch.no_grad()
|
||||||
def main(args):
|
def main(args):
|
||||||
global pe
|
# Configure ffmpeg path
|
||||||
if args.use_float16 is True:
|
if not fast_check_ffmpeg():
|
||||||
|
print("Adding ffmpeg to PATH")
|
||||||
|
# Choose path separator based on operating system
|
||||||
|
path_separator = ';' if sys.platform == 'win32' else ':'
|
||||||
|
os.environ["PATH"] = f"{args.ffmpeg_path}{path_separator}{os.environ['PATH']}"
|
||||||
|
if not fast_check_ffmpeg():
|
||||||
|
print("Warning: Unable to find ffmpeg, please ensure ffmpeg is properly installed")
|
||||||
|
|
||||||
|
# Set computing device
|
||||||
|
device = torch.device(f"cuda:{args.gpu_id}" if torch.cuda.is_available() else "cpu")
|
||||||
|
# Load model weights
|
||||||
|
vae, unet, pe = load_all_model(
|
||||||
|
unet_model_path=args.unet_model_path,
|
||||||
|
vae_type=args.vae_type,
|
||||||
|
unet_config=args.unet_config,
|
||||||
|
device=device
|
||||||
|
)
|
||||||
|
timesteps = torch.tensor([0], device=device)
|
||||||
|
|
||||||
|
# Convert models to half precision if float16 is enabled
|
||||||
|
if args.use_float16:
|
||||||
pe = pe.half()
|
pe = pe.half()
|
||||||
vae.vae = vae.vae.half()
|
vae.vae = vae.vae.half()
|
||||||
unet.model = unet.model.half()
|
unet.model = unet.model.half()
|
||||||
|
|
||||||
inference_config = OmegaConf.load(args.inference_config)
|
# Move models to specified device
|
||||||
print(inference_config)
|
pe = pe.to(device)
|
||||||
for task_id in inference_config:
|
vae.vae = vae.vae.to(device)
|
||||||
video_path = inference_config[task_id]["video_path"]
|
unet.model = unet.model.to(device)
|
||||||
audio_path = inference_config[task_id]["audio_path"]
|
|
||||||
bbox_shift = inference_config[task_id].get("bbox_shift", args.bbox_shift)
|
|
||||||
|
|
||||||
input_basename = os.path.basename(video_path).split('.')[0]
|
|
||||||
audio_basename = os.path.basename(audio_path).split('.')[0]
|
|
||||||
output_basename = f"{input_basename}_{audio_basename}"
|
|
||||||
result_img_save_path = os.path.join(args.result_dir, output_basename) # related to video & audio inputs
|
|
||||||
crop_coord_save_path = os.path.join(result_img_save_path, input_basename+".pkl") # only related to video input
|
|
||||||
os.makedirs(result_img_save_path,exist_ok =True)
|
|
||||||
|
|
||||||
if args.output_vid_name is None:
|
# Initialize audio processor and Whisper model
|
||||||
output_vid_name = os.path.join(args.result_dir, output_basename+".mp4")
|
audio_processor = AudioProcessor(feature_extractor_path=args.whisper_dir)
|
||||||
else:
|
weight_dtype = unet.model.dtype
|
||||||
output_vid_name = os.path.join(args.result_dir, args.output_vid_name)
|
whisper = WhisperModel.from_pretrained(args.whisper_dir)
|
||||||
############################################## extract frames from source video ##############################################
|
whisper = whisper.to(device=device, dtype=weight_dtype).eval()
|
||||||
if get_file_type(video_path)=="video":
|
whisper.requires_grad_(False)
|
||||||
save_dir_full = os.path.join(args.result_dir, input_basename)
|
|
||||||
os.makedirs(save_dir_full,exist_ok = True)
|
|
||||||
cmd = f"ffmpeg -v fatal -i {video_path} -start_number 0 {save_dir_full}/%08d.png"
|
|
||||||
os.system(cmd)
|
|
||||||
input_img_list = sorted(glob.glob(os.path.join(save_dir_full, '*.[jpJP][pnPN]*[gG]')))
|
|
||||||
fps = get_video_fps(video_path)
|
|
||||||
elif get_file_type(video_path)=="image":
|
|
||||||
input_img_list = [video_path, ]
|
|
||||||
fps = args.fps
|
|
||||||
elif os.path.isdir(video_path): # input img folder
|
|
||||||
input_img_list = glob.glob(os.path.join(video_path, '*.[jpJP][pnPN]*[gG]'))
|
|
||||||
input_img_list = sorted(input_img_list, key=lambda x: int(os.path.splitext(os.path.basename(x))[0]))
|
|
||||||
fps = args.fps
|
|
||||||
else:
|
|
||||||
raise ValueError(f"{video_path} should be a video file, an image file or a directory of images")
|
|
||||||
|
|
||||||
#print(input_img_list)
|
|
||||||
############################################## extract audio feature ##############################################
|
|
||||||
whisper_feature = audio_processor.audio2feat(audio_path)
|
|
||||||
whisper_chunks = audio_processor.feature2chunks(feature_array=whisper_feature,fps=fps)
|
|
||||||
############################################## preprocess input image ##############################################
|
|
||||||
if os.path.exists(crop_coord_save_path) and args.use_saved_coord:
|
|
||||||
print("using extracted coordinates")
|
|
||||||
with open(crop_coord_save_path,'rb') as f:
|
|
||||||
coord_list = pickle.load(f)
|
|
||||||
frame_list = read_imgs(input_img_list)
|
|
||||||
else:
|
|
||||||
print("extracting landmarks...time consuming")
|
|
||||||
coord_list, frame_list = get_landmark_and_bbox(input_img_list, bbox_shift)
|
|
||||||
with open(crop_coord_save_path, 'wb') as f:
|
|
||||||
pickle.dump(coord_list, f)
|
|
||||||
|
|
||||||
i = 0
|
|
||||||
input_latent_list = []
|
|
||||||
for bbox, frame in zip(coord_list, frame_list):
|
|
||||||
if bbox == coord_placeholder:
|
|
||||||
continue
|
|
||||||
x1, y1, x2, y2 = bbox
|
|
||||||
crop_frame = frame[y1:y2, x1:x2]
|
|
||||||
crop_frame = cv2.resize(crop_frame,(256,256),interpolation = cv2.INTER_LANCZOS4)
|
|
||||||
latents = vae.get_latents_for_unet(crop_frame)
|
|
||||||
input_latent_list.append(latents)
|
|
||||||
|
|
||||||
# to smooth the first and the last frame
|
# Initialize face parser with configurable parameters based on version
|
||||||
frame_list_cycle = frame_list + frame_list[::-1]
|
if args.version == "v15":
|
||||||
coord_list_cycle = coord_list + coord_list[::-1]
|
fp = FaceParsing(
|
||||||
input_latent_list_cycle = input_latent_list + input_latent_list[::-1]
|
left_cheek_width=args.left_cheek_width,
|
||||||
############################################## inference batch by batch ##############################################
|
right_cheek_width=args.right_cheek_width
|
||||||
print("start inference")
|
)
|
||||||
video_num = len(whisper_chunks)
|
else: # v1
|
||||||
batch_size = args.batch_size
|
fp = FaceParsing()
|
||||||
gen = datagen(whisper_chunks,input_latent_list_cycle,batch_size)
|
|
||||||
res_frame_list = []
|
# Load inference configuration
|
||||||
for i, (whisper_batch,latent_batch) in enumerate(tqdm(gen,total=int(np.ceil(float(video_num)/batch_size)))):
|
inference_config = OmegaConf.load(args.inference_config)
|
||||||
audio_feature_batch = torch.from_numpy(whisper_batch)
|
print("Loaded inference config:", inference_config)
|
||||||
audio_feature_batch = audio_feature_batch.to(device=unet.device,
|
|
||||||
dtype=unet.model.dtype) # torch, B, 5*N,384
|
# Process each task
|
||||||
audio_feature_batch = pe(audio_feature_batch)
|
for task_id in inference_config:
|
||||||
latent_batch = latent_batch.to(dtype=unet.model.dtype)
|
try:
|
||||||
|
# Get task configuration
|
||||||
|
video_path = inference_config[task_id]["video_path"]
|
||||||
|
audio_path = inference_config[task_id]["audio_path"]
|
||||||
|
if "result_name" in inference_config[task_id]:
|
||||||
|
args.output_vid_name = inference_config[task_id]["result_name"]
|
||||||
|
|
||||||
pred_latents = unet.model(latent_batch, timesteps, encoder_hidden_states=audio_feature_batch).sample
|
# Set bbox_shift based on version
|
||||||
recon = vae.decode_latents(pred_latents)
|
if args.version == "v15":
|
||||||
for res_frame in recon:
|
bbox_shift = 0 # v15 uses fixed bbox_shift
|
||||||
res_frame_list.append(res_frame)
|
else:
|
||||||
|
bbox_shift = inference_config[task_id].get("bbox_shift", args.bbox_shift) # v1 uses config or default
|
||||||
############################################## pad to full image ##############################################
|
|
||||||
print("pad talking image to original video")
|
|
||||||
for i, res_frame in enumerate(tqdm(res_frame_list)):
|
|
||||||
bbox = coord_list_cycle[i%(len(coord_list_cycle))]
|
|
||||||
ori_frame = copy.deepcopy(frame_list_cycle[i%(len(frame_list_cycle))])
|
|
||||||
x1, y1, x2, y2 = bbox
|
|
||||||
try:
|
|
||||||
res_frame = cv2.resize(res_frame.astype(np.uint8),(x2-x1,y2-y1))
|
|
||||||
except:
|
|
||||||
# print(bbox)
|
|
||||||
continue
|
|
||||||
|
|
||||||
combine_frame = get_image(ori_frame,res_frame,bbox)
|
# Set output paths
|
||||||
cv2.imwrite(f"{result_img_save_path}/{str(i).zfill(8)}.png",combine_frame)
|
input_basename = os.path.basename(video_path).split('.')[0]
|
||||||
|
audio_basename = os.path.basename(audio_path).split('.')[0]
|
||||||
|
output_basename = f"{input_basename}_{audio_basename}"
|
||||||
|
|
||||||
|
# Create temporary directories
|
||||||
|
temp_dir = os.path.join(args.result_dir, f"{args.version}")
|
||||||
|
os.makedirs(temp_dir, exist_ok=True)
|
||||||
|
|
||||||
|
# Set result save paths
|
||||||
|
result_img_save_path = os.path.join(temp_dir, output_basename)
|
||||||
|
crop_coord_save_path = os.path.join(args.result_dir, "../", input_basename+".pkl")
|
||||||
|
os.makedirs(result_img_save_path, exist_ok=True)
|
||||||
|
|
||||||
|
# Set output video paths
|
||||||
|
if args.output_vid_name is None:
|
||||||
|
output_vid_name = os.path.join(temp_dir, output_basename + ".mp4")
|
||||||
|
else:
|
||||||
|
output_vid_name = os.path.join(temp_dir, args.output_vid_name)
|
||||||
|
output_vid_name_concat = os.path.join(temp_dir, output_basename + "_concat.mp4")
|
||||||
|
|
||||||
|
# Extract frames from source video
|
||||||
|
if get_file_type(video_path) == "video":
|
||||||
|
save_dir_full = os.path.join(temp_dir, input_basename)
|
||||||
|
os.makedirs(save_dir_full, exist_ok=True)
|
||||||
|
cmd = f"ffmpeg -v fatal -i {video_path} -start_number 0 {save_dir_full}/%08d.png"
|
||||||
|
os.system(cmd)
|
||||||
|
input_img_list = sorted(glob.glob(os.path.join(save_dir_full, '*.[jpJP][pnPN]*[gG]')))
|
||||||
|
fps = get_video_fps(video_path)
|
||||||
|
elif get_file_type(video_path) == "image":
|
||||||
|
input_img_list = [video_path]
|
||||||
|
fps = args.fps
|
||||||
|
elif os.path.isdir(video_path):
|
||||||
|
input_img_list = glob.glob(os.path.join(video_path, '*.[jpJP][pnPN]*[gG]'))
|
||||||
|
input_img_list = sorted(input_img_list, key=lambda x: int(os.path.splitext(os.path.basename(x))[0]))
|
||||||
|
fps = args.fps
|
||||||
|
else:
|
||||||
|
raise ValueError(f"{video_path} should be a video file, an image file or a directory of images")
|
||||||
|
|
||||||
cmd_img2video = f"ffmpeg -y -v warning -r {fps} -f image2 -i {result_img_save_path}/%08d.png -vcodec libx264 -vf format=rgb24,scale=out_color_matrix=bt709,format=yuv420p -crf 18 temp.mp4"
|
# Extract audio features
|
||||||
print(cmd_img2video)
|
whisper_input_features, librosa_length = audio_processor.get_audio_feature(audio_path)
|
||||||
os.system(cmd_img2video)
|
whisper_chunks = audio_processor.get_whisper_chunk(
|
||||||
|
whisper_input_features,
|
||||||
|
device,
|
||||||
|
weight_dtype,
|
||||||
|
whisper,
|
||||||
|
librosa_length,
|
||||||
|
fps=fps,
|
||||||
|
audio_padding_length_left=args.audio_padding_length_left,
|
||||||
|
audio_padding_length_right=args.audio_padding_length_right,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Preprocess input images
|
||||||
|
if os.path.exists(crop_coord_save_path) and args.use_saved_coord:
|
||||||
|
print("Using saved coordinates")
|
||||||
|
with open(crop_coord_save_path, 'rb') as f:
|
||||||
|
coord_list = pickle.load(f)
|
||||||
|
frame_list = read_imgs(input_img_list)
|
||||||
|
else:
|
||||||
|
print("Extracting landmarks... time-consuming operation")
|
||||||
|
coord_list, frame_list = get_landmark_and_bbox(input_img_list, bbox_shift)
|
||||||
|
with open(crop_coord_save_path, 'wb') as f:
|
||||||
|
pickle.dump(coord_list, f)
|
||||||
|
|
||||||
|
print(f"Number of frames: {len(frame_list)}")
|
||||||
|
|
||||||
|
# Process each frame
|
||||||
|
input_latent_list = []
|
||||||
|
for bbox, frame in zip(coord_list, frame_list):
|
||||||
|
if bbox == coord_placeholder:
|
||||||
|
continue
|
||||||
|
x1, y1, x2, y2 = bbox
|
||||||
|
if args.version == "v15":
|
||||||
|
y2 = y2 + args.extra_margin
|
||||||
|
y2 = min(y2, frame.shape[0])
|
||||||
|
crop_frame = frame[y1:y2, x1:x2]
|
||||||
|
crop_frame = cv2.resize(crop_frame, (256,256), interpolation=cv2.INTER_LANCZOS4)
|
||||||
|
latents = vae.get_latents_for_unet(crop_frame)
|
||||||
|
input_latent_list.append(latents)
|
||||||
|
|
||||||
cmd_combine_audio = f"ffmpeg -y -v warning -i {audio_path} -i temp.mp4 {output_vid_name}"
|
# Smooth first and last frames
|
||||||
print(cmd_combine_audio)
|
frame_list_cycle = frame_list + frame_list[::-1]
|
||||||
os.system(cmd_combine_audio)
|
coord_list_cycle = coord_list + coord_list[::-1]
|
||||||
|
input_latent_list_cycle = input_latent_list + input_latent_list[::-1]
|
||||||
os.remove("temp.mp4")
|
|
||||||
shutil.rmtree(result_img_save_path)
|
# Batch inference
|
||||||
print(f"result is save to {output_vid_name}")
|
print("Starting inference")
|
||||||
|
video_num = len(whisper_chunks)
|
||||||
|
batch_size = args.batch_size
|
||||||
|
gen = datagen(
|
||||||
|
whisper_chunks=whisper_chunks,
|
||||||
|
vae_encode_latents=input_latent_list_cycle,
|
||||||
|
batch_size=batch_size,
|
||||||
|
delay_frame=0,
|
||||||
|
device=device,
|
||||||
|
)
|
||||||
|
|
||||||
|
res_frame_list = []
|
||||||
|
total = int(np.ceil(float(video_num) / batch_size))
|
||||||
|
|
||||||
|
# Execute inference
|
||||||
|
for i, (whisper_batch, latent_batch) in enumerate(tqdm(gen, total=total)):
|
||||||
|
audio_feature_batch = pe(whisper_batch)
|
||||||
|
latent_batch = latent_batch.to(dtype=unet.model.dtype)
|
||||||
|
|
||||||
|
pred_latents = unet.model(latent_batch, timesteps, encoder_hidden_states=audio_feature_batch).sample
|
||||||
|
recon = vae.decode_latents(pred_latents)
|
||||||
|
for res_frame in recon:
|
||||||
|
res_frame_list.append(res_frame)
|
||||||
|
|
||||||
|
# Pad generated images to original video size
|
||||||
|
print("Padding generated images to original video size")
|
||||||
|
for i, res_frame in enumerate(tqdm(res_frame_list)):
|
||||||
|
bbox = coord_list_cycle[i%(len(coord_list_cycle))]
|
||||||
|
ori_frame = copy.deepcopy(frame_list_cycle[i%(len(frame_list_cycle))])
|
||||||
|
x1, y1, x2, y2 = bbox
|
||||||
|
if args.version == "v15":
|
||||||
|
y2 = y2 + args.extra_margin
|
||||||
|
y2 = min(y2, frame.shape[0])
|
||||||
|
try:
|
||||||
|
res_frame = cv2.resize(res_frame.astype(np.uint8), (x2-x1, y2-y1))
|
||||||
|
except:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Merge results with version-specific parameters
|
||||||
|
if args.version == "v15":
|
||||||
|
combine_frame = get_image(ori_frame, res_frame, [x1, y1, x2, y2], mode=args.parsing_mode, fp=fp)
|
||||||
|
else:
|
||||||
|
combine_frame = get_image(ori_frame, res_frame, [x1, y1, x2, y2], fp=fp)
|
||||||
|
cv2.imwrite(f"{result_img_save_path}/{str(i).zfill(8)}.png", combine_frame)
|
||||||
|
|
||||||
|
# Save prediction results
|
||||||
|
temp_vid_path = f"{temp_dir}/temp_{input_basename}_{audio_basename}.mp4"
|
||||||
|
cmd_img2video = f"ffmpeg -y -v warning -r {fps} -f image2 -i {result_img_save_path}/%08d.png -vcodec libx264 -vf format=yuv420p -crf 18 {temp_vid_path}"
|
||||||
|
print("Video generation command:", cmd_img2video)
|
||||||
|
os.system(cmd_img2video)
|
||||||
|
|
||||||
|
cmd_combine_audio = f"ffmpeg -y -v warning -i {audio_path} -i {temp_vid_path} {output_vid_name}"
|
||||||
|
print("Audio combination command:", cmd_combine_audio)
|
||||||
|
os.system(cmd_combine_audio)
|
||||||
|
|
||||||
|
# Clean up temporary files
|
||||||
|
shutil.rmtree(result_img_save_path)
|
||||||
|
os.remove(temp_vid_path)
|
||||||
|
|
||||||
|
shutil.rmtree(save_dir_full)
|
||||||
|
if not args.saved_coord:
|
||||||
|
os.remove(crop_coord_save_path)
|
||||||
|
|
||||||
|
print(f"Results saved to {output_vid_name}")
|
||||||
|
except Exception as e:
|
||||||
|
print("Error occurred during processing:", e)
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
parser = argparse.ArgumentParser()
|
parser = argparse.ArgumentParser()
|
||||||
parser.add_argument("--inference_config", type=str, default="configs/inference/test_img.yaml")
|
parser.add_argument("--ffmpeg_path", type=str, default="./ffmpeg-4.4-amd64-static/", help="Path to ffmpeg executable")
|
||||||
parser.add_argument("--bbox_shift", type=int, default=0)
|
parser.add_argument("--gpu_id", type=int, default=0, help="GPU ID to use")
|
||||||
parser.add_argument("--result_dir", default='./results', help="path to output")
|
parser.add_argument("--vae_type", type=str, default="sd-vae", help="Type of VAE model")
|
||||||
|
parser.add_argument("--unet_config", type=str, default="./models/musetalk/config.json", help="Path to UNet configuration file")
|
||||||
parser.add_argument("--fps", type=int, default=25)
|
parser.add_argument("--unet_model_path", type=str, default="./models/musetalkV15/unet.pth", help="Path to UNet model weights")
|
||||||
parser.add_argument("--batch_size", type=int, default=8)
|
parser.add_argument("--whisper_dir", type=str, default="./models/whisper", help="Directory containing Whisper model")
|
||||||
parser.add_argument("--output_vid_name", type=str, default=None)
|
parser.add_argument("--inference_config", type=str, default="configs/inference/test_img.yaml", help="Path to inference configuration file")
|
||||||
parser.add_argument("--use_saved_coord",
|
parser.add_argument("--bbox_shift", type=int, default=0, help="Bounding box shift value")
|
||||||
action="store_true",
|
parser.add_argument("--result_dir", default='./results', help="Directory for output results")
|
||||||
help='use saved coordinate to save time')
|
parser.add_argument("--extra_margin", type=int, default=10, help="Extra margin for face cropping")
|
||||||
parser.add_argument("--use_float16",
|
parser.add_argument("--fps", type=int, default=25, help="Video frames per second")
|
||||||
action="store_true",
|
parser.add_argument("--audio_padding_length_left", type=int, default=2, help="Left padding length for audio")
|
||||||
help="Whether use float16 to speed up inference",
|
parser.add_argument("--audio_padding_length_right", type=int, default=2, help="Right padding length for audio")
|
||||||
)
|
parser.add_argument("--batch_size", type=int, default=8, help="Batch size for inference")
|
||||||
|
parser.add_argument("--output_vid_name", type=str, default=None, help="Name of output video file")
|
||||||
|
parser.add_argument("--use_saved_coord", action="store_true", help='Use saved coordinates to save time')
|
||||||
|
parser.add_argument("--saved_coord", action="store_true", help='Save coordinates for future use')
|
||||||
|
parser.add_argument("--use_float16", action="store_true", help="Use float16 for faster inference")
|
||||||
|
parser.add_argument("--parsing_mode", default='jaw', help="Face blending parsing mode")
|
||||||
|
parser.add_argument("--left_cheek_width", type=int, default=90, help="Width of left cheek region")
|
||||||
|
parser.add_argument("--right_cheek_width", type=int, default=90, help="Width of right cheek region")
|
||||||
|
parser.add_argument("--version", type=str, default="v15", choices=["v1", "v15"], help="Model version to use")
|
||||||
args = parser.parse_args()
|
args = parser.parse_args()
|
||||||
main(args)
|
main(args)
|
||||||
|
|||||||
Executable
+334
@@ -0,0 +1,334 @@
|
|||||||
|
import os
|
||||||
|
import argparse
|
||||||
|
import subprocess
|
||||||
|
import torch
|
||||||
|
import numpy as np
|
||||||
|
from tqdm import tqdm
|
||||||
|
from omegaconf import OmegaConf
|
||||||
|
from typing import Tuple, List, Union
|
||||||
|
import decord
|
||||||
|
import json
|
||||||
|
import cv2
|
||||||
|
from musetalk.utils.face_detection import FaceAlignment,LandmarksType
|
||||||
|
from mmpose.apis import inference_topdown, init_model
|
||||||
|
from mmpose.structures import merge_data_samples
|
||||||
|
import sys
|
||||||
|
|
||||||
|
def fast_check_ffmpeg():
|
||||||
|
try:
|
||||||
|
subprocess.run(["ffmpeg", "-version"], capture_output=True, check=True)
|
||||||
|
return True
|
||||||
|
except:
|
||||||
|
return False
|
||||||
|
|
||||||
|
ffmpeg_path = "./ffmpeg-4.4-amd64-static/"
|
||||||
|
if not fast_check_ffmpeg():
|
||||||
|
print("Adding ffmpeg to PATH")
|
||||||
|
# Choose path separator based on operating system
|
||||||
|
path_separator = ';' if sys.platform == 'win32' else ':'
|
||||||
|
os.environ["PATH"] = f"{args.ffmpeg_path}{path_separator}{os.environ['PATH']}"
|
||||||
|
if not fast_check_ffmpeg():
|
||||||
|
print("Warning: Unable to find ffmpeg, please ensure ffmpeg is properly installed")
|
||||||
|
|
||||||
|
class AnalyzeFace:
|
||||||
|
def __init__(self, device: Union[str, torch.device], config_file: str, checkpoint_file: str):
|
||||||
|
"""
|
||||||
|
Initialize the AnalyzeFace class with the given device, config file, and checkpoint file.
|
||||||
|
|
||||||
|
Parameters:
|
||||||
|
device (Union[str, torch.device]): The device to run the models on ('cuda' or 'cpu').
|
||||||
|
config_file (str): Path to the mmpose model configuration file.
|
||||||
|
checkpoint_file (str): Path to the mmpose model checkpoint file.
|
||||||
|
"""
|
||||||
|
self.device = device
|
||||||
|
self.dwpose = init_model(config_file, checkpoint_file, device=self.device)
|
||||||
|
self.facedet = FaceAlignment(LandmarksType._2D, flip_input=False, device=self.device)
|
||||||
|
|
||||||
|
def __call__(self, im: np.ndarray) -> Tuple[List[np.ndarray], np.ndarray]:
|
||||||
|
"""
|
||||||
|
Detect faces and keypoints in the given image.
|
||||||
|
|
||||||
|
Parameters:
|
||||||
|
im (np.ndarray): The input image.
|
||||||
|
maxface (bool): Whether to detect the maximum face. Default is True.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Tuple[List[np.ndarray], np.ndarray]: A tuple containing the bounding boxes and keypoints.
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
# Ensure the input image has the correct shape
|
||||||
|
if im.ndim == 3:
|
||||||
|
im = np.expand_dims(im, axis=0)
|
||||||
|
elif im.ndim != 4 or im.shape[0] != 1:
|
||||||
|
raise ValueError("Input image must have shape (1, H, W, C)")
|
||||||
|
|
||||||
|
bbox = self.facedet.get_detections_for_batch(np.asarray(im))
|
||||||
|
results = inference_topdown(self.dwpose, np.asarray(im)[0])
|
||||||
|
results = merge_data_samples(results)
|
||||||
|
keypoints = results.pred_instances.keypoints
|
||||||
|
face_land_mark= keypoints[0][23:91]
|
||||||
|
face_land_mark = face_land_mark.astype(np.int32)
|
||||||
|
|
||||||
|
return face_land_mark, bbox
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error during face analysis: {e}")
|
||||||
|
return np.array([]),[]
|
||||||
|
|
||||||
|
def convert_video(org_path: str, dst_path: str, vid_list: List[str]) -> None:
|
||||||
|
|
||||||
|
"""
|
||||||
|
Convert video files to a specified format and save them to the destination path.
|
||||||
|
|
||||||
|
Parameters:
|
||||||
|
org_path (str): The directory containing the original video files.
|
||||||
|
dst_path (str): The directory where the converted video files will be saved.
|
||||||
|
vid_list (List[str]): A list of video file names to process.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
None
|
||||||
|
"""
|
||||||
|
for idx, vid in enumerate(vid_list):
|
||||||
|
if vid.endswith('.mp4'):
|
||||||
|
org_vid_path = os.path.join(org_path, vid)
|
||||||
|
dst_vid_path = os.path.join(dst_path, vid)
|
||||||
|
|
||||||
|
if org_vid_path != dst_vid_path:
|
||||||
|
cmd = [
|
||||||
|
"ffmpeg", "-hide_banner", "-y", "-i", org_vid_path,
|
||||||
|
"-r", "25", "-crf", "15", "-c:v", "libx264",
|
||||||
|
"-pix_fmt", "yuv420p", dst_vid_path
|
||||||
|
]
|
||||||
|
subprocess.run(cmd, check=True)
|
||||||
|
|
||||||
|
if idx % 1000 == 0:
|
||||||
|
print(f"### {idx} videos converted ###")
|
||||||
|
|
||||||
|
def segment_video(org_path: str, dst_path: str, vid_list: List[str], segment_duration: int = 30) -> None:
|
||||||
|
"""
|
||||||
|
Segment video files into smaller clips of specified duration.
|
||||||
|
|
||||||
|
Parameters:
|
||||||
|
org_path (str): The directory containing the original video files.
|
||||||
|
dst_path (str): The directory where the segmented video files will be saved.
|
||||||
|
vid_list (List[str]): A list of video file names to process.
|
||||||
|
segment_duration (int): The duration of each segment in seconds. Default is 30 seconds.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
None
|
||||||
|
"""
|
||||||
|
for idx, vid in enumerate(vid_list):
|
||||||
|
if vid.endswith('.mp4'):
|
||||||
|
input_file = os.path.join(org_path, vid)
|
||||||
|
original_filename = os.path.basename(input_file)
|
||||||
|
|
||||||
|
command = [
|
||||||
|
'ffmpeg', '-i', input_file, '-c', 'copy', '-map', '0',
|
||||||
|
'-segment_time', str(segment_duration), '-f', 'segment',
|
||||||
|
'-reset_timestamps', '1',
|
||||||
|
os.path.join(dst_path, f'clip%03d_{original_filename}')
|
||||||
|
]
|
||||||
|
|
||||||
|
subprocess.run(command, check=True)
|
||||||
|
|
||||||
|
def extract_audio(org_path: str, dst_path: str, vid_list: List[str]) -> None:
|
||||||
|
"""
|
||||||
|
Extract audio from video files and save as WAV format.
|
||||||
|
|
||||||
|
Parameters:
|
||||||
|
org_path (str): The directory containing the original video files.
|
||||||
|
dst_path (str): The directory where the extracted audio files will be saved.
|
||||||
|
vid_list (List[str]): A list of video file names to process.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
None
|
||||||
|
"""
|
||||||
|
for idx, vid in enumerate(vid_list):
|
||||||
|
if vid.endswith('.mp4'):
|
||||||
|
video_path = os.path.join(org_path, vid)
|
||||||
|
audio_output_path = os.path.join(dst_path, os.path.splitext(vid)[0] + ".wav")
|
||||||
|
try:
|
||||||
|
command = [
|
||||||
|
'ffmpeg', '-hide_banner', '-y', '-i', video_path,
|
||||||
|
'-vn', '-acodec', 'pcm_s16le', '-f', 'wav',
|
||||||
|
'-ar', '16000', '-ac', '1', audio_output_path,
|
||||||
|
]
|
||||||
|
|
||||||
|
subprocess.run(command, check=True)
|
||||||
|
print(f"Audio saved to: {audio_output_path}")
|
||||||
|
except subprocess.CalledProcessError as e:
|
||||||
|
print(f"Error extracting audio from {vid}: {e}")
|
||||||
|
|
||||||
|
def split_data(video_files: List[str], val_list_hdtf: List[str]) -> (List[str], List[str]):
|
||||||
|
"""
|
||||||
|
Split video files into training and validation sets based on val_list_hdtf.
|
||||||
|
|
||||||
|
Parameters:
|
||||||
|
video_files (List[str]): A list of video file names.
|
||||||
|
val_list_hdtf (List[str]): A list of validation file identifiers.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
(List[str], List[str]): A tuple containing the training and validation file lists.
|
||||||
|
"""
|
||||||
|
val_files = [f for f in video_files if any(val_id in f for val_id in val_list_hdtf)]
|
||||||
|
train_files = [f for f in video_files if f not in val_files]
|
||||||
|
return train_files, val_files
|
||||||
|
|
||||||
|
def save_list_to_file(file_path: str, data_list: List[str]) -> None:
|
||||||
|
"""
|
||||||
|
Save a list of strings to a file, each string on a new line.
|
||||||
|
|
||||||
|
Parameters:
|
||||||
|
file_path (str): The path to the file where the list will be saved.
|
||||||
|
data_list (List[str]): The list of strings to save.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
None
|
||||||
|
"""
|
||||||
|
with open(file_path, 'w') as file:
|
||||||
|
for item in data_list:
|
||||||
|
file.write(f"{item}\n")
|
||||||
|
|
||||||
|
def generate_train_list(cfg):
|
||||||
|
train_file_path = cfg.video_clip_file_list_train
|
||||||
|
val_file_path = cfg.video_clip_file_list_val
|
||||||
|
val_list_hdtf = cfg.val_list_hdtf
|
||||||
|
|
||||||
|
meta_list = os.listdir(cfg.meta_root)
|
||||||
|
|
||||||
|
sorted_meta_list = sorted(meta_list)
|
||||||
|
train_files, val_files = split_data(meta_list, val_list_hdtf)
|
||||||
|
|
||||||
|
save_list_to_file(train_file_path, train_files)
|
||||||
|
save_list_to_file(val_file_path, val_files)
|
||||||
|
|
||||||
|
print(val_list_hdtf)
|
||||||
|
|
||||||
|
def analyze_video(org_path: str, dst_path: str, vid_list: List[str]) -> None:
|
||||||
|
"""
|
||||||
|
Convert video files to a specified format and save them to the destination path.
|
||||||
|
|
||||||
|
Parameters:
|
||||||
|
org_path (str): The directory containing the original video files.
|
||||||
|
dst_path (str): The directory where the meta json will be saved.
|
||||||
|
vid_list (List[str]): A list of video file names to process.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
None
|
||||||
|
"""
|
||||||
|
device = "cuda" if torch.cuda.is_available() else "cpu"
|
||||||
|
config_file = './musetalk/utils/dwpose/rtmpose-l_8xb32-270e_coco-ubody-wholebody-384x288.py'
|
||||||
|
checkpoint_file = './models/dwpose/dw-ll_ucoco_384.pth'
|
||||||
|
|
||||||
|
analyze_face = AnalyzeFace(device, config_file, checkpoint_file)
|
||||||
|
|
||||||
|
for vid in tqdm(vid_list, desc="Processing videos"):
|
||||||
|
#vid = "clip005_WDA_BernieSanders_000.mp4"
|
||||||
|
#print(vid)
|
||||||
|
if vid.endswith('.mp4'):
|
||||||
|
vid_path = os.path.join(org_path, vid)
|
||||||
|
wav_path = vid_path.replace(".mp4",".wav")
|
||||||
|
vid_meta = os.path.join(dst_path, os.path.splitext(vid)[0] + ".json")
|
||||||
|
if os.path.exists(vid_meta):
|
||||||
|
continue
|
||||||
|
print('process video {}'.format(vid))
|
||||||
|
|
||||||
|
total_bbox_list = []
|
||||||
|
total_pts_list = []
|
||||||
|
isvalid = True
|
||||||
|
|
||||||
|
# process
|
||||||
|
try:
|
||||||
|
cap = decord.VideoReader(vid_path, fault_tol=1)
|
||||||
|
except Exception as e:
|
||||||
|
print(e)
|
||||||
|
continue
|
||||||
|
|
||||||
|
total_frames = len(cap)
|
||||||
|
for frame_idx in range(total_frames):
|
||||||
|
frame = cap[frame_idx]
|
||||||
|
if frame_idx==0:
|
||||||
|
video_height,video_width,_ = frame.shape
|
||||||
|
frame_bgr = cv2.cvtColor(frame.asnumpy(), cv2.COLOR_BGR2RGB)
|
||||||
|
pts_list, bbox_list = analyze_face(frame_bgr)
|
||||||
|
|
||||||
|
if len(bbox_list)>0 and None not in bbox_list:
|
||||||
|
bbox = bbox_list[0]
|
||||||
|
else:
|
||||||
|
isvalid = False
|
||||||
|
bbox = []
|
||||||
|
print(f"set isvalid to False as broken img in {frame_idx} of {vid}")
|
||||||
|
break
|
||||||
|
|
||||||
|
#print(pts_list)
|
||||||
|
if len(pts_list)>0 and pts_list is not None:
|
||||||
|
pts = pts_list.tolist()
|
||||||
|
else:
|
||||||
|
isvalid = False
|
||||||
|
pts = []
|
||||||
|
break
|
||||||
|
|
||||||
|
if frame_idx==0:
|
||||||
|
x1,y1,x2,y2 = bbox
|
||||||
|
face_height, face_width = y2-y1,x2-x1
|
||||||
|
|
||||||
|
total_pts_list.append(pts)
|
||||||
|
total_bbox_list.append(bbox)
|
||||||
|
|
||||||
|
meta_data = {
|
||||||
|
"mp4_path": vid_path,
|
||||||
|
"wav_path": wav_path,
|
||||||
|
"video_size": [video_height, video_width],
|
||||||
|
"face_size": [face_height, face_width],
|
||||||
|
"frames": total_frames,
|
||||||
|
"face_list": total_bbox_list,
|
||||||
|
"landmark_list": total_pts_list,
|
||||||
|
"isvalid":isvalid,
|
||||||
|
}
|
||||||
|
with open(vid_meta, 'w') as f:
|
||||||
|
json.dump(meta_data, f, indent=4)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
def main(cfg):
|
||||||
|
# Ensure all necessary directories exist
|
||||||
|
os.makedirs(cfg.video_root_25fps, exist_ok=True)
|
||||||
|
os.makedirs(cfg.video_audio_clip_root, exist_ok=True)
|
||||||
|
os.makedirs(cfg.meta_root, exist_ok=True)
|
||||||
|
os.makedirs(os.path.dirname(cfg.video_file_list), exist_ok=True)
|
||||||
|
os.makedirs(os.path.dirname(cfg.video_clip_file_list_train), exist_ok=True)
|
||||||
|
os.makedirs(os.path.dirname(cfg.video_clip_file_list_val), exist_ok=True)
|
||||||
|
|
||||||
|
vid_list = os.listdir(cfg.video_root_raw)
|
||||||
|
sorted_vid_list = sorted(vid_list)
|
||||||
|
|
||||||
|
# Save video file list
|
||||||
|
with open(cfg.video_file_list, 'w') as file:
|
||||||
|
for vid in sorted_vid_list:
|
||||||
|
file.write(vid + '\n')
|
||||||
|
|
||||||
|
# 1. Convert videos to 25 FPS
|
||||||
|
convert_video(cfg.video_root_raw, cfg.video_root_25fps, sorted_vid_list)
|
||||||
|
|
||||||
|
# 2. Segment videos into 30-second clips
|
||||||
|
segment_video(cfg.video_root_25fps, cfg.video_audio_clip_root, vid_list, segment_duration=cfg.clip_len_second)
|
||||||
|
|
||||||
|
# 3. Extract audio
|
||||||
|
clip_vid_list = os.listdir(cfg.video_audio_clip_root)
|
||||||
|
extract_audio(cfg.video_audio_clip_root, cfg.video_audio_clip_root, clip_vid_list)
|
||||||
|
|
||||||
|
# 4. Generate video metadata
|
||||||
|
analyze_video(cfg.video_audio_clip_root, cfg.meta_root, clip_vid_list)
|
||||||
|
|
||||||
|
# 5. Generate training and validation set lists
|
||||||
|
generate_train_list(cfg)
|
||||||
|
print("done")
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
parser = argparse.ArgumentParser()
|
||||||
|
parser.add_argument("--config", type=str, default="./configs/training/preprocess.yaml")
|
||||||
|
args = parser.parse_args()
|
||||||
|
config = OmegaConf.load(args.config)
|
||||||
|
|
||||||
|
main(config)
|
||||||
|
|
||||||
+199
-125
@@ -10,26 +10,31 @@ import sys
|
|||||||
from tqdm import tqdm
|
from tqdm import tqdm
|
||||||
import copy
|
import copy
|
||||||
import json
|
import json
|
||||||
from musetalk.utils.utils import get_file_type,get_video_fps,datagen
|
from transformers import WhisperModel
|
||||||
from musetalk.utils.preprocessing import get_landmark_and_bbox,read_imgs,coord_placeholder
|
|
||||||
from musetalk.utils.blending import get_image,get_image_prepare_material,get_image_blending
|
|
||||||
from musetalk.utils.utils import load_all_model
|
|
||||||
import shutil
|
|
||||||
|
|
||||||
|
from musetalk.utils.face_parsing import FaceParsing
|
||||||
|
from musetalk.utils.utils import datagen
|
||||||
|
from musetalk.utils.preprocessing import get_landmark_and_bbox, read_imgs
|
||||||
|
from musetalk.utils.blending import get_image_prepare_material, get_image_blending
|
||||||
|
from musetalk.utils.utils import load_all_model
|
||||||
|
from musetalk.utils.audio_processor import AudioProcessor
|
||||||
|
|
||||||
|
import shutil
|
||||||
import threading
|
import threading
|
||||||
import queue
|
import queue
|
||||||
|
|
||||||
import time
|
import time
|
||||||
|
import subprocess
|
||||||
|
|
||||||
# load model weights
|
|
||||||
audio_processor, vae, unet, pe = load_all_model()
|
|
||||||
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
|
||||||
timesteps = torch.tensor([0], device=device)
|
|
||||||
pe = pe.half()
|
|
||||||
vae.vae = vae.vae.half()
|
|
||||||
unet.model = unet.model.half()
|
|
||||||
|
|
||||||
def video2imgs(vid_path, save_path, ext = '.png',cut_frame = 10000000):
|
def fast_check_ffmpeg():
|
||||||
|
try:
|
||||||
|
subprocess.run(["ffmpeg", "-version"], capture_output=True, check=True)
|
||||||
|
return True
|
||||||
|
except:
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def video2imgs(vid_path, save_path, ext='.png', cut_frame=10000000):
|
||||||
cap = cv2.VideoCapture(vid_path)
|
cap = cv2.VideoCapture(vid_path)
|
||||||
count = 0
|
count = 0
|
||||||
while True:
|
while True:
|
||||||
@@ -42,35 +47,43 @@ def video2imgs(vid_path, save_path, ext = '.png',cut_frame = 10000000):
|
|||||||
else:
|
else:
|
||||||
break
|
break
|
||||||
|
|
||||||
|
|
||||||
def osmakedirs(path_list):
|
def osmakedirs(path_list):
|
||||||
for path in path_list:
|
for path in path_list:
|
||||||
os.makedirs(path) if not os.path.exists(path) else None
|
os.makedirs(path) if not os.path.exists(path) else None
|
||||||
|
|
||||||
|
|
||||||
@torch.no_grad()
|
|
||||||
|
@torch.no_grad()
|
||||||
class Avatar:
|
class Avatar:
|
||||||
def __init__(self, avatar_id, video_path, bbox_shift, batch_size, preparation):
|
def __init__(self, avatar_id, video_path, bbox_shift, batch_size, preparation):
|
||||||
self.avatar_id = avatar_id
|
self.avatar_id = avatar_id
|
||||||
self.video_path = video_path
|
self.video_path = video_path
|
||||||
self.bbox_shift = bbox_shift
|
self.bbox_shift = bbox_shift
|
||||||
self.avatar_path = f"./results/avatars/{avatar_id}"
|
# 根据版本设置不同的基础路径
|
||||||
self.full_imgs_path = f"{self.avatar_path}/full_imgs"
|
if args.version == "v15":
|
||||||
|
self.base_path = f"./results/{args.version}/avatars/{avatar_id}"
|
||||||
|
else: # v1
|
||||||
|
self.base_path = f"./results/avatars/{avatar_id}"
|
||||||
|
|
||||||
|
self.avatar_path = self.base_path
|
||||||
|
self.full_imgs_path = f"{self.avatar_path}/full_imgs"
|
||||||
self.coords_path = f"{self.avatar_path}/coords.pkl"
|
self.coords_path = f"{self.avatar_path}/coords.pkl"
|
||||||
self.latents_out_path= f"{self.avatar_path}/latents.pt"
|
self.latents_out_path = f"{self.avatar_path}/latents.pt"
|
||||||
self.video_out_path = f"{self.avatar_path}/vid_output/"
|
self.video_out_path = f"{self.avatar_path}/vid_output/"
|
||||||
self.mask_out_path =f"{self.avatar_path}/mask"
|
self.mask_out_path = f"{self.avatar_path}/mask"
|
||||||
self.mask_coords_path =f"{self.avatar_path}/mask_coords.pkl"
|
self.mask_coords_path = f"{self.avatar_path}/mask_coords.pkl"
|
||||||
self.avatar_info_path = f"{self.avatar_path}/avator_info.json"
|
self.avatar_info_path = f"{self.avatar_path}/avator_info.json"
|
||||||
self.avatar_info = {
|
self.avatar_info = {
|
||||||
"avatar_id":avatar_id,
|
"avatar_id": avatar_id,
|
||||||
"video_path":video_path,
|
"video_path": video_path,
|
||||||
"bbox_shift":bbox_shift
|
"bbox_shift": bbox_shift,
|
||||||
|
"version": args.version
|
||||||
}
|
}
|
||||||
self.preparation = preparation
|
self.preparation = preparation
|
||||||
self.batch_size = batch_size
|
self.batch_size = batch_size
|
||||||
self.idx = 0
|
self.idx = 0
|
||||||
self.init()
|
self.init()
|
||||||
|
|
||||||
def init(self):
|
def init(self):
|
||||||
if self.preparation:
|
if self.preparation:
|
||||||
if os.path.exists(self.avatar_path):
|
if os.path.exists(self.avatar_path):
|
||||||
@@ -80,7 +93,7 @@ class Avatar:
|
|||||||
print("*********************************")
|
print("*********************************")
|
||||||
print(f" creating avator: {self.avatar_id}")
|
print(f" creating avator: {self.avatar_id}")
|
||||||
print("*********************************")
|
print("*********************************")
|
||||||
osmakedirs([self.avatar_path,self.full_imgs_path,self.video_out_path,self.mask_out_path])
|
osmakedirs([self.avatar_path, self.full_imgs_path, self.video_out_path, self.mask_out_path])
|
||||||
self.prepare_material()
|
self.prepare_material()
|
||||||
else:
|
else:
|
||||||
self.input_latent_list_cycle = torch.load(self.latents_out_path)
|
self.input_latent_list_cycle = torch.load(self.latents_out_path)
|
||||||
@@ -98,16 +111,16 @@ class Avatar:
|
|||||||
print("*********************************")
|
print("*********************************")
|
||||||
print(f" creating avator: {self.avatar_id}")
|
print(f" creating avator: {self.avatar_id}")
|
||||||
print("*********************************")
|
print("*********************************")
|
||||||
osmakedirs([self.avatar_path,self.full_imgs_path,self.video_out_path,self.mask_out_path])
|
osmakedirs([self.avatar_path, self.full_imgs_path, self.video_out_path, self.mask_out_path])
|
||||||
self.prepare_material()
|
self.prepare_material()
|
||||||
else:
|
else:
|
||||||
if not os.path.exists(self.avatar_path):
|
if not os.path.exists(self.avatar_path):
|
||||||
print(f"{self.avatar_id} does not exist, you should set preparation to True")
|
print(f"{self.avatar_id} does not exist, you should set preparation to True")
|
||||||
sys.exit()
|
sys.exit()
|
||||||
|
|
||||||
with open(self.avatar_info_path, "r") as f:
|
with open(self.avatar_info_path, "r") as f:
|
||||||
avatar_info = json.load(f)
|
avatar_info = json.load(f)
|
||||||
|
|
||||||
if avatar_info['bbox_shift'] != self.avatar_info['bbox_shift']:
|
if avatar_info['bbox_shift'] != self.avatar_info['bbox_shift']:
|
||||||
response = input(f" 【bbox_shift】 is changed, you need to re-create it ! (c/continue)")
|
response = input(f" 【bbox_shift】 is changed, you need to re-create it ! (c/continue)")
|
||||||
if response.lower() == "c":
|
if response.lower() == "c":
|
||||||
@@ -115,11 +128,11 @@ class Avatar:
|
|||||||
print("*********************************")
|
print("*********************************")
|
||||||
print(f" creating avator: {self.avatar_id}")
|
print(f" creating avator: {self.avatar_id}")
|
||||||
print("*********************************")
|
print("*********************************")
|
||||||
osmakedirs([self.avatar_path,self.full_imgs_path,self.video_out_path,self.mask_out_path])
|
osmakedirs([self.avatar_path, self.full_imgs_path, self.video_out_path, self.mask_out_path])
|
||||||
self.prepare_material()
|
self.prepare_material()
|
||||||
else:
|
else:
|
||||||
sys.exit()
|
sys.exit()
|
||||||
else:
|
else:
|
||||||
self.input_latent_list_cycle = torch.load(self.latents_out_path)
|
self.input_latent_list_cycle = torch.load(self.latents_out_path)
|
||||||
with open(self.coords_path, 'rb') as f:
|
with open(self.coords_path, 'rb') as f:
|
||||||
self.coord_list_cycle = pickle.load(f)
|
self.coord_list_cycle = pickle.load(f)
|
||||||
@@ -131,36 +144,40 @@ class Avatar:
|
|||||||
input_mask_list = glob.glob(os.path.join(self.mask_out_path, '*.[jpJP][pnPN]*[gG]'))
|
input_mask_list = glob.glob(os.path.join(self.mask_out_path, '*.[jpJP][pnPN]*[gG]'))
|
||||||
input_mask_list = sorted(input_mask_list, key=lambda x: int(os.path.splitext(os.path.basename(x))[0]))
|
input_mask_list = sorted(input_mask_list, key=lambda x: int(os.path.splitext(os.path.basename(x))[0]))
|
||||||
self.mask_list_cycle = read_imgs(input_mask_list)
|
self.mask_list_cycle = read_imgs(input_mask_list)
|
||||||
|
|
||||||
def prepare_material(self):
|
def prepare_material(self):
|
||||||
print("preparing data materials ... ...")
|
print("preparing data materials ... ...")
|
||||||
with open(self.avatar_info_path, "w") as f:
|
with open(self.avatar_info_path, "w") as f:
|
||||||
json.dump(self.avatar_info, f)
|
json.dump(self.avatar_info, f)
|
||||||
|
|
||||||
if os.path.isfile(self.video_path):
|
if os.path.isfile(self.video_path):
|
||||||
video2imgs(self.video_path, self.full_imgs_path, ext = 'png')
|
video2imgs(self.video_path, self.full_imgs_path, ext='png')
|
||||||
else:
|
else:
|
||||||
print(f"copy files in {self.video_path}")
|
print(f"copy files in {self.video_path}")
|
||||||
files = os.listdir(self.video_path)
|
files = os.listdir(self.video_path)
|
||||||
files.sort()
|
files.sort()
|
||||||
files = [file for file in files if file.split(".")[-1]=="png"]
|
files = [file for file in files if file.split(".")[-1] == "png"]
|
||||||
for filename in files:
|
for filename in files:
|
||||||
shutil.copyfile(f"{self.video_path}/{filename}", f"{self.full_imgs_path}/{filename}")
|
shutil.copyfile(f"{self.video_path}/{filename}", f"{self.full_imgs_path}/{filename}")
|
||||||
input_img_list = sorted(glob.glob(os.path.join(self.full_imgs_path, '*.[jpJP][pnPN]*[gG]')))
|
input_img_list = sorted(glob.glob(os.path.join(self.full_imgs_path, '*.[jpJP][pnPN]*[gG]')))
|
||||||
|
|
||||||
print("extracting landmarks...")
|
print("extracting landmarks...")
|
||||||
coord_list, frame_list = get_landmark_and_bbox(input_img_list, self.bbox_shift)
|
coord_list, frame_list = get_landmark_and_bbox(input_img_list, self.bbox_shift)
|
||||||
input_latent_list = []
|
input_latent_list = []
|
||||||
idx = -1
|
idx = -1
|
||||||
# maker if the bbox is not sufficient
|
# maker if the bbox is not sufficient
|
||||||
coord_placeholder = (0.0,0.0,0.0,0.0)
|
coord_placeholder = (0.0, 0.0, 0.0, 0.0)
|
||||||
for bbox, frame in zip(coord_list, frame_list):
|
for bbox, frame in zip(coord_list, frame_list):
|
||||||
idx = idx + 1
|
idx = idx + 1
|
||||||
if bbox == coord_placeholder:
|
if bbox == coord_placeholder:
|
||||||
continue
|
continue
|
||||||
x1, y1, x2, y2 = bbox
|
x1, y1, x2, y2 = bbox
|
||||||
|
if args.version == "v15":
|
||||||
|
y2 = y2 + args.extra_margin
|
||||||
|
y2 = min(y2, frame.shape[0])
|
||||||
|
coord_list[idx] = [x1, y1, x2, y2] # 更新coord_list中的bbox
|
||||||
crop_frame = frame[y1:y2, x1:x2]
|
crop_frame = frame[y1:y2, x1:x2]
|
||||||
resized_crop_frame = cv2.resize(crop_frame,(256,256),interpolation = cv2.INTER_LANCZOS4)
|
resized_crop_frame = cv2.resize(crop_frame, (256, 256), interpolation=cv2.INTER_LANCZOS4)
|
||||||
latents = vae.get_latents_for_unet(resized_crop_frame)
|
latents = vae.get_latents_for_unet(resized_crop_frame)
|
||||||
input_latent_list.append(latents)
|
input_latent_list.append(latents)
|
||||||
|
|
||||||
@@ -170,112 +187,117 @@ class Avatar:
|
|||||||
self.mask_coords_list_cycle = []
|
self.mask_coords_list_cycle = []
|
||||||
self.mask_list_cycle = []
|
self.mask_list_cycle = []
|
||||||
|
|
||||||
for i,frame in enumerate(tqdm(self.frame_list_cycle)):
|
for i, frame in enumerate(tqdm(self.frame_list_cycle)):
|
||||||
cv2.imwrite(f"{self.full_imgs_path}/{str(i).zfill(8)}.png",frame)
|
cv2.imwrite(f"{self.full_imgs_path}/{str(i).zfill(8)}.png", frame)
|
||||||
|
|
||||||
face_box = self.coord_list_cycle[i]
|
x1, y1, x2, y2 = self.coord_list_cycle[i]
|
||||||
mask,crop_box = get_image_prepare_material(frame,face_box)
|
if args.version == "v15":
|
||||||
cv2.imwrite(f"{self.mask_out_path}/{str(i).zfill(8)}.png",mask)
|
mode = args.parsing_mode
|
||||||
|
else:
|
||||||
|
mode = "raw"
|
||||||
|
mask, crop_box = get_image_prepare_material(frame, [x1, y1, x2, y2], fp=fp, mode=mode)
|
||||||
|
|
||||||
|
cv2.imwrite(f"{self.mask_out_path}/{str(i).zfill(8)}.png", mask)
|
||||||
self.mask_coords_list_cycle += [crop_box]
|
self.mask_coords_list_cycle += [crop_box]
|
||||||
self.mask_list_cycle.append(mask)
|
self.mask_list_cycle.append(mask)
|
||||||
|
|
||||||
with open(self.mask_coords_path, 'wb') as f:
|
with open(self.mask_coords_path, 'wb') as f:
|
||||||
pickle.dump(self.mask_coords_list_cycle, f)
|
pickle.dump(self.mask_coords_list_cycle, f)
|
||||||
|
|
||||||
with open(self.coords_path, 'wb') as f:
|
with open(self.coords_path, 'wb') as f:
|
||||||
pickle.dump(self.coord_list_cycle, f)
|
pickle.dump(self.coord_list_cycle, f)
|
||||||
|
|
||||||
torch.save(self.input_latent_list_cycle, os.path.join(self.latents_out_path))
|
torch.save(self.input_latent_list_cycle, os.path.join(self.latents_out_path))
|
||||||
#
|
|
||||||
|
def process_frames(self, res_frame_queue, video_len, skip_save_images):
|
||||||
def process_frames(self,
|
|
||||||
res_frame_queue,
|
|
||||||
video_len,
|
|
||||||
skip_save_images):
|
|
||||||
print(video_len)
|
print(video_len)
|
||||||
while True:
|
while True:
|
||||||
if self.idx>=video_len-1:
|
if self.idx >= video_len - 1:
|
||||||
break
|
break
|
||||||
try:
|
try:
|
||||||
start = time.time()
|
start = time.time()
|
||||||
res_frame = res_frame_queue.get(block=True, timeout=1)
|
res_frame = res_frame_queue.get(block=True, timeout=1)
|
||||||
except queue.Empty:
|
except queue.Empty:
|
||||||
continue
|
continue
|
||||||
|
|
||||||
bbox = self.coord_list_cycle[self.idx%(len(self.coord_list_cycle))]
|
bbox = self.coord_list_cycle[self.idx % (len(self.coord_list_cycle))]
|
||||||
ori_frame = copy.deepcopy(self.frame_list_cycle[self.idx%(len(self.frame_list_cycle))])
|
ori_frame = copy.deepcopy(self.frame_list_cycle[self.idx % (len(self.frame_list_cycle))])
|
||||||
x1, y1, x2, y2 = bbox
|
x1, y1, x2, y2 = bbox
|
||||||
try:
|
try:
|
||||||
res_frame = cv2.resize(res_frame.astype(np.uint8),(x2-x1,y2-y1))
|
res_frame = cv2.resize(res_frame.astype(np.uint8), (x2 - x1, y2 - y1))
|
||||||
except:
|
except:
|
||||||
continue
|
continue
|
||||||
mask = self.mask_list_cycle[self.idx%(len(self.mask_list_cycle))]
|
mask = self.mask_list_cycle[self.idx % (len(self.mask_list_cycle))]
|
||||||
mask_crop_box = self.mask_coords_list_cycle[self.idx%(len(self.mask_coords_list_cycle))]
|
mask_crop_box = self.mask_coords_list_cycle[self.idx % (len(self.mask_coords_list_cycle))]
|
||||||
#combine_frame = get_image(ori_frame,res_frame,bbox)
|
|
||||||
combine_frame = get_image_blending(ori_frame,res_frame,bbox,mask,mask_crop_box)
|
combine_frame = get_image_blending(ori_frame,res_frame,bbox,mask,mask_crop_box)
|
||||||
|
|
||||||
if skip_save_images is False:
|
if skip_save_images is False:
|
||||||
cv2.imwrite(f"{self.avatar_path}/tmp/{str(self.idx).zfill(8)}.png",combine_frame)
|
cv2.imwrite(f"{self.avatar_path}/tmp/{str(self.idx).zfill(8)}.png", combine_frame)
|
||||||
self.idx = self.idx + 1
|
self.idx = self.idx + 1
|
||||||
|
|
||||||
def inference(self,
|
@torch.no_grad()
|
||||||
audio_path,
|
def inference(self, audio_path, out_vid_name, fps, skip_save_images):
|
||||||
out_vid_name,
|
os.makedirs(self.avatar_path + '/tmp', exist_ok=True)
|
||||||
fps,
|
|
||||||
skip_save_images):
|
|
||||||
os.makedirs(self.avatar_path+'/tmp',exist_ok =True)
|
|
||||||
print("start inference")
|
print("start inference")
|
||||||
############################################## extract audio feature ##############################################
|
############################################## extract audio feature ##############################################
|
||||||
start_time = time.time()
|
start_time = time.time()
|
||||||
whisper_feature = audio_processor.audio2feat(audio_path)
|
# Extract audio features
|
||||||
whisper_chunks = audio_processor.feature2chunks(feature_array=whisper_feature,fps=fps)
|
whisper_input_features, librosa_length = audio_processor.get_audio_feature(audio_path, weight_dtype=weight_dtype)
|
||||||
|
whisper_chunks = audio_processor.get_whisper_chunk(
|
||||||
|
whisper_input_features,
|
||||||
|
device,
|
||||||
|
weight_dtype,
|
||||||
|
whisper,
|
||||||
|
librosa_length,
|
||||||
|
fps=fps,
|
||||||
|
audio_padding_length_left=args.audio_padding_length_left,
|
||||||
|
audio_padding_length_right=args.audio_padding_length_right,
|
||||||
|
)
|
||||||
print(f"processing audio:{audio_path} costs {(time.time() - start_time) * 1000}ms")
|
print(f"processing audio:{audio_path} costs {(time.time() - start_time) * 1000}ms")
|
||||||
############################################## inference batch by batch ##############################################
|
############################################## inference batch by batch ##############################################
|
||||||
video_num = len(whisper_chunks)
|
video_num = len(whisper_chunks)
|
||||||
res_frame_queue = queue.Queue()
|
res_frame_queue = queue.Queue()
|
||||||
self.idx = 0
|
self.idx = 0
|
||||||
# # Create a sub-thread and start it
|
# Create a sub-thread and start it
|
||||||
process_thread = threading.Thread(target=self.process_frames, args=(res_frame_queue, video_num, skip_save_images))
|
process_thread = threading.Thread(target=self.process_frames, args=(res_frame_queue, video_num, skip_save_images))
|
||||||
process_thread.start()
|
process_thread.start()
|
||||||
|
|
||||||
gen = datagen(whisper_chunks,
|
gen = datagen(whisper_chunks,
|
||||||
self.input_latent_list_cycle,
|
self.input_latent_list_cycle,
|
||||||
self.batch_size)
|
self.batch_size)
|
||||||
start_time = time.time()
|
start_time = time.time()
|
||||||
res_frame_list = []
|
res_frame_list = []
|
||||||
|
|
||||||
for i, (whisper_batch,latent_batch) in enumerate(tqdm(gen,total=int(np.ceil(float(video_num)/self.batch_size)))):
|
|
||||||
audio_feature_batch = torch.from_numpy(whisper_batch)
|
|
||||||
audio_feature_batch = audio_feature_batch.to(device=unet.device,
|
|
||||||
dtype=unet.model.dtype)
|
|
||||||
audio_feature_batch = pe(audio_feature_batch)
|
|
||||||
latent_batch = latent_batch.to(dtype=unet.model.dtype)
|
|
||||||
|
|
||||||
pred_latents = unet.model(latent_batch,
|
for i, (whisper_batch, latent_batch) in enumerate(tqdm(gen, total=int(np.ceil(float(video_num) / self.batch_size)))):
|
||||||
timesteps,
|
audio_feature_batch = pe(whisper_batch.to(device))
|
||||||
encoder_hidden_states=audio_feature_batch).sample
|
latent_batch = latent_batch.to(device=device, dtype=unet.model.dtype)
|
||||||
|
|
||||||
|
pred_latents = unet.model(latent_batch,
|
||||||
|
timesteps,
|
||||||
|
encoder_hidden_states=audio_feature_batch).sample
|
||||||
|
pred_latents = pred_latents.to(device=device, dtype=vae.vae.dtype)
|
||||||
recon = vae.decode_latents(pred_latents)
|
recon = vae.decode_latents(pred_latents)
|
||||||
for res_frame in recon:
|
for res_frame in recon:
|
||||||
res_frame_queue.put(res_frame)
|
res_frame_queue.put(res_frame)
|
||||||
# Close the queue and sub-thread after all tasks are completed
|
# Close the queue and sub-thread after all tasks are completed
|
||||||
process_thread.join()
|
process_thread.join()
|
||||||
|
|
||||||
if args.skip_save_images is True:
|
if args.skip_save_images is True:
|
||||||
print('Total process time of {} frames without saving images = {}s'.format(
|
print('Total process time of {} frames without saving images = {}s'.format(
|
||||||
video_num,
|
video_num,
|
||||||
time.time()-start_time))
|
time.time() - start_time))
|
||||||
else:
|
else:
|
||||||
print('Total process time of {} frames including saving images = {}s'.format(
|
print('Total process time of {} frames including saving images = {}s'.format(
|
||||||
video_num,
|
video_num,
|
||||||
time.time()-start_time))
|
time.time() - start_time))
|
||||||
|
|
||||||
if out_vid_name is not None and args.skip_save_images is False:
|
if out_vid_name is not None and args.skip_save_images is False:
|
||||||
# optional
|
# optional
|
||||||
cmd_img2video = f"ffmpeg -y -v warning -r {fps} -f image2 -i {self.avatar_path}/tmp/%08d.png -vcodec libx264 -vf format=rgb24,scale=out_color_matrix=bt709,format=yuv420p -crf 18 {self.avatar_path}/temp.mp4"
|
cmd_img2video = f"ffmpeg -y -v warning -r {fps} -f image2 -i {self.avatar_path}/tmp/%08d.png -vcodec libx264 -vf format=yuv420p -crf 18 {self.avatar_path}/temp.mp4"
|
||||||
print(cmd_img2video)
|
print(cmd_img2video)
|
||||||
os.system(cmd_img2video)
|
os.system(cmd_img2video)
|
||||||
|
|
||||||
output_vid = os.path.join(self.video_out_path, out_vid_name+".mp4") # on
|
output_vid = os.path.join(self.video_out_path, out_vid_name + ".mp4") # on
|
||||||
cmd_combine_audio = f"ffmpeg -y -v warning -i {audio_path} -i {self.avatar_path}/temp.mp4 {output_vid}"
|
cmd_combine_audio = f"ffmpeg -y -v warning -i {audio_path} -i {self.avatar_path}/temp.mp4 {output_vid}"
|
||||||
print(cmd_combine_audio)
|
print(cmd_combine_audio)
|
||||||
os.system(cmd_combine_audio)
|
os.system(cmd_combine_audio)
|
||||||
@@ -284,52 +306,104 @@ class Avatar:
|
|||||||
shutil.rmtree(f"{self.avatar_path}/tmp")
|
shutil.rmtree(f"{self.avatar_path}/tmp")
|
||||||
print(f"result is save to {output_vid}")
|
print(f"result is save to {output_vid}")
|
||||||
print("\n")
|
print("\n")
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
'''
|
'''
|
||||||
This script is used to simulate online chatting and applies necessary pre-processing such as face detection and face parsing in advance. During online chatting, only UNet and the VAE decoder are involved, which makes MuseTalk real-time.
|
This script is used to simulate online chatting and applies necessary pre-processing such as face detection and face parsing in advance. During online chatting, only UNet and the VAE decoder are involved, which makes MuseTalk real-time.
|
||||||
'''
|
'''
|
||||||
|
|
||||||
parser = argparse.ArgumentParser()
|
parser = argparse.ArgumentParser()
|
||||||
parser.add_argument("--inference_config",
|
parser.add_argument("--version", type=str, default="v15", choices=["v1", "v15"], help="Version of MuseTalk: v1 or v15")
|
||||||
type=str,
|
parser.add_argument("--ffmpeg_path", type=str, default="./ffmpeg-4.4-amd64-static/", help="Path to ffmpeg executable")
|
||||||
default="configs/inference/realtime.yaml",
|
parser.add_argument("--gpu_id", type=int, default=0, help="GPU ID to use")
|
||||||
)
|
parser.add_argument("--vae_type", type=str, default="sd-vae", help="Type of VAE model")
|
||||||
parser.add_argument("--fps",
|
parser.add_argument("--unet_config", type=str, default="./models/musetalk/musetalk.json", help="Path to UNet configuration file")
|
||||||
type=int,
|
parser.add_argument("--unet_model_path", type=str, default="./models/musetalk/pytorch_model.bin", help="Path to UNet model weights")
|
||||||
default=25,
|
parser.add_argument("--whisper_dir", type=str, default="./models/whisper", help="Directory containing Whisper model")
|
||||||
)
|
parser.add_argument("--inference_config", type=str, default="configs/inference/realtime.yaml")
|
||||||
parser.add_argument("--batch_size",
|
parser.add_argument("--bbox_shift", type=int, default=0, help="Bounding box shift value")
|
||||||
type=int,
|
parser.add_argument("--result_dir", default='./results', help="Directory for output results")
|
||||||
default=4,
|
parser.add_argument("--extra_margin", type=int, default=10, help="Extra margin for face cropping")
|
||||||
)
|
parser.add_argument("--fps", type=int, default=25, help="Video frames per second")
|
||||||
|
parser.add_argument("--audio_padding_length_left", type=int, default=2, help="Left padding length for audio")
|
||||||
|
parser.add_argument("--audio_padding_length_right", type=int, default=2, help="Right padding length for audio")
|
||||||
|
parser.add_argument("--batch_size", type=int, default=20, help="Batch size for inference")
|
||||||
|
parser.add_argument("--output_vid_name", type=str, default=None, help="Name of output video file")
|
||||||
|
parser.add_argument("--use_saved_coord", action="store_true", help='Use saved coordinates to save time')
|
||||||
|
parser.add_argument("--saved_coord", action="store_true", help='Save coordinates for future use')
|
||||||
|
parser.add_argument("--parsing_mode", default='jaw', help="Face blending parsing mode")
|
||||||
|
parser.add_argument("--left_cheek_width", type=int, default=90, help="Width of left cheek region")
|
||||||
|
parser.add_argument("--right_cheek_width", type=int, default=90, help="Width of right cheek region")
|
||||||
parser.add_argument("--skip_save_images",
|
parser.add_argument("--skip_save_images",
|
||||||
action="store_true",
|
action="store_true",
|
||||||
help="Whether skip saving images for better generation speed calculation",
|
help="Whether skip saving images for better generation speed calculation",
|
||||||
)
|
)
|
||||||
|
|
||||||
args = parser.parse_args()
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
# Configure ffmpeg path
|
||||||
|
if not fast_check_ffmpeg():
|
||||||
|
print("Adding ffmpeg to PATH")
|
||||||
|
# Choose path separator based on operating system
|
||||||
|
path_separator = ';' if sys.platform == 'win32' else ':'
|
||||||
|
os.environ["PATH"] = f"{args.ffmpeg_path}{path_separator}{os.environ['PATH']}"
|
||||||
|
if not fast_check_ffmpeg():
|
||||||
|
print("Warning: Unable to find ffmpeg, please ensure ffmpeg is properly installed")
|
||||||
|
|
||||||
|
# Set computing device
|
||||||
|
device = torch.device(f"cuda:{args.gpu_id}" if torch.cuda.is_available() else "cpu")
|
||||||
|
|
||||||
|
# Load model weights
|
||||||
|
vae, unet, pe = load_all_model(
|
||||||
|
unet_model_path=args.unet_model_path,
|
||||||
|
vae_type=args.vae_type,
|
||||||
|
unet_config=args.unet_config,
|
||||||
|
device=device
|
||||||
|
)
|
||||||
|
timesteps = torch.tensor([0], device=device)
|
||||||
|
|
||||||
|
pe = pe.half().to(device)
|
||||||
|
vae.vae = vae.vae.half().to(device)
|
||||||
|
unet.model = unet.model.half().to(device)
|
||||||
|
|
||||||
|
# Initialize audio processor and Whisper model
|
||||||
|
audio_processor = AudioProcessor(feature_extractor_path=args.whisper_dir)
|
||||||
|
weight_dtype = unet.model.dtype
|
||||||
|
whisper = WhisperModel.from_pretrained(args.whisper_dir)
|
||||||
|
whisper = whisper.to(device=device, dtype=weight_dtype).eval()
|
||||||
|
whisper.requires_grad_(False)
|
||||||
|
|
||||||
|
# Initialize face parser with configurable parameters based on version
|
||||||
|
if args.version == "v15":
|
||||||
|
fp = FaceParsing(
|
||||||
|
left_cheek_width=args.left_cheek_width,
|
||||||
|
right_cheek_width=args.right_cheek_width
|
||||||
|
)
|
||||||
|
else: # v1
|
||||||
|
fp = FaceParsing()
|
||||||
|
|
||||||
inference_config = OmegaConf.load(args.inference_config)
|
inference_config = OmegaConf.load(args.inference_config)
|
||||||
print(inference_config)
|
print(inference_config)
|
||||||
|
|
||||||
|
|
||||||
for avatar_id in inference_config:
|
for avatar_id in inference_config:
|
||||||
data_preparation = inference_config[avatar_id]["preparation"]
|
data_preparation = inference_config[avatar_id]["preparation"]
|
||||||
video_path = inference_config[avatar_id]["video_path"]
|
video_path = inference_config[avatar_id]["video_path"]
|
||||||
bbox_shift = inference_config[avatar_id]["bbox_shift"]
|
if args.version == "v15":
|
||||||
|
bbox_shift = 0
|
||||||
|
else:
|
||||||
|
bbox_shift = inference_config[avatar_id]["bbox_shift"]
|
||||||
avatar = Avatar(
|
avatar = Avatar(
|
||||||
avatar_id = avatar_id,
|
avatar_id=avatar_id,
|
||||||
video_path = video_path,
|
video_path=video_path,
|
||||||
bbox_shift = bbox_shift,
|
bbox_shift=bbox_shift,
|
||||||
batch_size = args.batch_size,
|
batch_size=args.batch_size,
|
||||||
preparation= data_preparation)
|
preparation=data_preparation)
|
||||||
|
|
||||||
audio_clips = inference_config[avatar_id]["audio_clips"]
|
audio_clips = inference_config[avatar_id]["audio_clips"]
|
||||||
for audio_num, audio_path in audio_clips.items():
|
for audio_num, audio_path in audio_clips.items():
|
||||||
print("Inferring using:",audio_path)
|
print("Inferring using:", audio_path)
|
||||||
avatar.inference(audio_path,
|
avatar.inference(audio_path,
|
||||||
audio_num,
|
audio_num,
|
||||||
args.fps,
|
args.fps,
|
||||||
args.skip_save_images)
|
args.skip_save_images)
|
||||||
|
|||||||
@@ -0,0 +1,33 @@
|
|||||||
|
import os
|
||||||
|
import subprocess
|
||||||
|
import sys
|
||||||
|
|
||||||
|
def test_ffmpeg(ffmpeg_path):
|
||||||
|
print(f"Testing ffmpeg path: {ffmpeg_path}")
|
||||||
|
|
||||||
|
# Choose path separator based on operating system
|
||||||
|
path_separator = ';' if sys.platform == 'win32' else ':'
|
||||||
|
|
||||||
|
# Add ffmpeg path to environment variable
|
||||||
|
os.environ["PATH"] = f"{ffmpeg_path}{path_separator}{os.environ['PATH']}"
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Try to run ffmpeg
|
||||||
|
result = subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True)
|
||||||
|
print("FFmpeg test successful!")
|
||||||
|
print("FFmpeg version information:")
|
||||||
|
print(result.stdout)
|
||||||
|
return True
|
||||||
|
except Exception as e:
|
||||||
|
print("FFmpeg test failed!")
|
||||||
|
print(f"Error message: {str(e)}")
|
||||||
|
return False
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
# Default ffmpeg path, can be modified as needed
|
||||||
|
default_path = r"ffmpeg-master-latest-win64-gpl-shared\bin"
|
||||||
|
|
||||||
|
# Use command line argument if provided, otherwise use default path
|
||||||
|
ffmpeg_path = sys.argv[1] if len(sys.argv) > 1 else default_path
|
||||||
|
|
||||||
|
test_ffmpeg(ffmpeg_path)
|
||||||
@@ -0,0 +1,580 @@
|
|||||||
|
import argparse
|
||||||
|
import diffusers
|
||||||
|
import logging
|
||||||
|
import math
|
||||||
|
import os
|
||||||
|
import time
|
||||||
|
import torch
|
||||||
|
import torch.nn as nn
|
||||||
|
import torch.nn.functional as F
|
||||||
|
import torch.utils.checkpoint
|
||||||
|
import transformers
|
||||||
|
import warnings
|
||||||
|
import random
|
||||||
|
|
||||||
|
from accelerate import Accelerator
|
||||||
|
from accelerate.utils import LoggerType
|
||||||
|
from accelerate import InitProcessGroupKwargs
|
||||||
|
from accelerate.logging import get_logger
|
||||||
|
from accelerate.utils import DistributedDataParallelKwargs
|
||||||
|
from datetime import datetime
|
||||||
|
from datetime import timedelta
|
||||||
|
|
||||||
|
from diffusers.utils import check_min_version
|
||||||
|
from einops import rearrange
|
||||||
|
from omegaconf import OmegaConf
|
||||||
|
from tqdm.auto import tqdm
|
||||||
|
|
||||||
|
from musetalk.utils.utils import (
|
||||||
|
delete_additional_ckpt,
|
||||||
|
seed_everything,
|
||||||
|
get_mouth_region,
|
||||||
|
process_audio_features,
|
||||||
|
save_models
|
||||||
|
)
|
||||||
|
from musetalk.loss.basic_loss import set_requires_grad
|
||||||
|
from musetalk.loss.syncnet import get_sync_loss
|
||||||
|
from musetalk.utils.training_utils import (
|
||||||
|
initialize_models_and_optimizers,
|
||||||
|
initialize_dataloaders,
|
||||||
|
initialize_loss_functions,
|
||||||
|
initialize_syncnet,
|
||||||
|
initialize_vgg,
|
||||||
|
validation
|
||||||
|
)
|
||||||
|
|
||||||
|
logger = get_logger(__name__, log_level="INFO")
|
||||||
|
warnings.filterwarnings("ignore")
|
||||||
|
check_min_version("0.10.0.dev0")
|
||||||
|
|
||||||
|
def main(cfg):
|
||||||
|
exp_name = cfg.exp_name
|
||||||
|
save_dir = f"{cfg.output_dir}/{exp_name}"
|
||||||
|
os.makedirs(save_dir, exist_ok=True)
|
||||||
|
|
||||||
|
kwargs = DistributedDataParallelKwargs()
|
||||||
|
process_group_kwargs = InitProcessGroupKwargs(
|
||||||
|
timeout=timedelta(seconds=5400))
|
||||||
|
accelerator = Accelerator(
|
||||||
|
gradient_accumulation_steps=cfg.solver.gradient_accumulation_steps,
|
||||||
|
log_with=["tensorboard", LoggerType.TENSORBOARD],
|
||||||
|
project_dir=os.path.join(save_dir, "./tensorboard"),
|
||||||
|
kwargs_handlers=[kwargs, process_group_kwargs],
|
||||||
|
)
|
||||||
|
|
||||||
|
# Make one log on every process with the configuration for debugging.
|
||||||
|
logging.basicConfig(
|
||||||
|
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
|
||||||
|
datefmt="%m/%d/%Y %H:%M:%S",
|
||||||
|
level=logging.INFO,
|
||||||
|
)
|
||||||
|
logger.info(accelerator.state, main_process_only=False)
|
||||||
|
if accelerator.is_local_main_process:
|
||||||
|
transformers.utils.logging.set_verbosity_warning()
|
||||||
|
diffusers.utils.logging.set_verbosity_info()
|
||||||
|
else:
|
||||||
|
transformers.utils.logging.set_verbosity_error()
|
||||||
|
diffusers.utils.logging.set_verbosity_error()
|
||||||
|
|
||||||
|
# If passed along, set the training seed now.
|
||||||
|
if cfg.seed is not None:
|
||||||
|
print('cfg.seed', cfg.seed, accelerator.process_index)
|
||||||
|
seed_everything(cfg.seed + accelerator.process_index)
|
||||||
|
|
||||||
|
weight_dtype = torch.float32
|
||||||
|
|
||||||
|
model_dict = initialize_models_and_optimizers(cfg, accelerator, weight_dtype)
|
||||||
|
dataloader_dict = initialize_dataloaders(cfg)
|
||||||
|
loss_dict = initialize_loss_functions(cfg, accelerator, model_dict['scheduler_max_steps'])
|
||||||
|
syncnet = initialize_syncnet(cfg, accelerator, weight_dtype)
|
||||||
|
vgg_IN, pyramid, downsampler = initialize_vgg(cfg, accelerator)
|
||||||
|
|
||||||
|
# Prepare everything with our `accelerator`.
|
||||||
|
model_dict['net'], model_dict['optimizer'], model_dict['lr_scheduler'], dataloader_dict['train_dataloader'], dataloader_dict['val_dataloader'] = accelerator.prepare(
|
||||||
|
model_dict['net'], model_dict['optimizer'], model_dict['lr_scheduler'], dataloader_dict['train_dataloader'], dataloader_dict['val_dataloader']
|
||||||
|
)
|
||||||
|
print("length train/val", len(dataloader_dict['train_dataloader']), len(dataloader_dict['val_dataloader']))
|
||||||
|
|
||||||
|
# Calculate training steps and epochs
|
||||||
|
num_update_steps_per_epoch = math.ceil(
|
||||||
|
len(dataloader_dict['train_dataloader']) / cfg.solver.gradient_accumulation_steps
|
||||||
|
)
|
||||||
|
num_train_epochs = math.ceil(
|
||||||
|
cfg.solver.max_train_steps / num_update_steps_per_epoch
|
||||||
|
)
|
||||||
|
|
||||||
|
# Initialize trackers on the main process
|
||||||
|
if accelerator.is_main_process:
|
||||||
|
run_time = datetime.now().strftime("%Y%m%d-%H%M")
|
||||||
|
accelerator.init_trackers(
|
||||||
|
cfg.exp_name,
|
||||||
|
init_kwargs={"mlflow": {"run_name": run_time}},
|
||||||
|
)
|
||||||
|
|
||||||
|
# Calculate total batch size
|
||||||
|
total_batch_size = (
|
||||||
|
cfg.data.train_bs
|
||||||
|
* accelerator.num_processes
|
||||||
|
* cfg.solver.gradient_accumulation_steps
|
||||||
|
)
|
||||||
|
|
||||||
|
# Log training information
|
||||||
|
logger.info("***** Running training *****")
|
||||||
|
logger.info(f"Num Epochs = {num_train_epochs}")
|
||||||
|
logger.info(f"Instantaneous batch size per device = {cfg.data.train_bs}")
|
||||||
|
logger.info(
|
||||||
|
f"Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}"
|
||||||
|
)
|
||||||
|
logger.info(
|
||||||
|
f"Gradient Accumulation steps = {cfg.solver.gradient_accumulation_steps}")
|
||||||
|
logger.info(f"Total optimization steps = {cfg.solver.max_train_steps}")
|
||||||
|
|
||||||
|
global_step = 0
|
||||||
|
first_epoch = 0
|
||||||
|
|
||||||
|
# Load checkpoint if resuming training
|
||||||
|
if cfg.resume_from_checkpoint:
|
||||||
|
resume_dir = save_dir
|
||||||
|
dirs = os.listdir(resume_dir)
|
||||||
|
dirs = [d for d in dirs if d.startswith("checkpoint")]
|
||||||
|
dirs = sorted(dirs, key=lambda x: int(x.split("-")[1]))
|
||||||
|
if len(dirs) > 0:
|
||||||
|
path = dirs[-1]
|
||||||
|
accelerator.load_state(os.path.join(resume_dir, path))
|
||||||
|
accelerator.print(f"Resuming from checkpoint {path}")
|
||||||
|
global_step = int(path.split("-")[1])
|
||||||
|
first_epoch = global_step // num_update_steps_per_epoch
|
||||||
|
resume_step = global_step % num_update_steps_per_epoch
|
||||||
|
|
||||||
|
# Initialize progress bar
|
||||||
|
progress_bar = tqdm(
|
||||||
|
range(global_step, cfg.solver.max_train_steps),
|
||||||
|
disable=not accelerator.is_local_main_process,
|
||||||
|
)
|
||||||
|
progress_bar.set_description("Steps")
|
||||||
|
|
||||||
|
# Log model types
|
||||||
|
print("log type of models")
|
||||||
|
print("unet", model_dict['unet'].dtype)
|
||||||
|
print("vae", model_dict['vae'].dtype)
|
||||||
|
print("wav2vec", model_dict['wav2vec'].dtype)
|
||||||
|
|
||||||
|
def get_ganloss_weight(step):
|
||||||
|
"""Calculate GAN loss weight based on training step"""
|
||||||
|
if step < cfg.discriminator_train_params.start_gan:
|
||||||
|
return 0.0
|
||||||
|
else:
|
||||||
|
return 1.0
|
||||||
|
|
||||||
|
# Training loop
|
||||||
|
for epoch in range(first_epoch, num_train_epochs):
|
||||||
|
# Set models to training mode
|
||||||
|
model_dict['unet'].train()
|
||||||
|
if cfg.loss_params.gan_loss > 0:
|
||||||
|
loss_dict['discriminator'].train()
|
||||||
|
if cfg.loss_params.mouth_gan_loss > 0:
|
||||||
|
loss_dict['mouth_discriminator'].train()
|
||||||
|
|
||||||
|
# Initialize loss accumulators
|
||||||
|
train_loss = 0.0
|
||||||
|
train_loss_D = 0.0
|
||||||
|
train_loss_D_mouth = 0.0
|
||||||
|
l1_loss_accum = 0.0
|
||||||
|
vgg_loss_accum = 0.0
|
||||||
|
gan_loss_accum = 0.0
|
||||||
|
gan_loss_accum_mouth = 0.0
|
||||||
|
fm_loss_accum = 0.0
|
||||||
|
sync_loss_accum = 0.0
|
||||||
|
adapted_weight_accum = 0.0
|
||||||
|
|
||||||
|
t_data_start = time.time()
|
||||||
|
for step, batch in enumerate(dataloader_dict['train_dataloader']):
|
||||||
|
t_data = time.time() - t_data_start
|
||||||
|
t_model_start = time.time()
|
||||||
|
|
||||||
|
with torch.no_grad():
|
||||||
|
# Process input data
|
||||||
|
pixel_values = batch["pixel_values_vid"].to(weight_dtype).to(
|
||||||
|
accelerator.device,
|
||||||
|
non_blocking=True
|
||||||
|
)
|
||||||
|
bsz, num_frames, c, h, w = pixel_values.shape
|
||||||
|
|
||||||
|
# Process reference images
|
||||||
|
ref_pixel_values = batch["pixel_values_ref_img"].to(weight_dtype).to(
|
||||||
|
accelerator.device,
|
||||||
|
non_blocking=True
|
||||||
|
)
|
||||||
|
|
||||||
|
# Get face mask for GAN
|
||||||
|
pixel_values_face_mask = batch['pixel_values_face_mask']
|
||||||
|
|
||||||
|
# Process audio features
|
||||||
|
audio_prompts = process_audio_features(cfg, batch, model_dict['wav2vec'], bsz, num_frames, weight_dtype)
|
||||||
|
|
||||||
|
# Initialize adapted weight
|
||||||
|
adapted_weight = 1
|
||||||
|
|
||||||
|
# Process sync loss if enabled
|
||||||
|
if cfg.loss_params.sync_loss > 0:
|
||||||
|
mels = batch['mel']
|
||||||
|
# Prepare frames for latentsync (combine channels and frames)
|
||||||
|
gt_frames = rearrange(pixel_values, 'b f c h w-> b (f c) h w')
|
||||||
|
# Use lower half of face for latentsync
|
||||||
|
height = gt_frames.shape[2]
|
||||||
|
gt_frames = gt_frames[:, :, height // 2:, :]
|
||||||
|
|
||||||
|
# Get audio embeddings
|
||||||
|
audio_embed = syncnet.get_audio_embed(mels)
|
||||||
|
|
||||||
|
# Calculate adapted weight based on audio-visual similarity
|
||||||
|
if cfg.use_adapted_weight:
|
||||||
|
vision_embed_gt = syncnet.get_vision_embed(gt_frames)
|
||||||
|
image_audio_sim_gt = F.cosine_similarity(
|
||||||
|
audio_embed,
|
||||||
|
vision_embed_gt,
|
||||||
|
dim=1
|
||||||
|
)[0]
|
||||||
|
|
||||||
|
if image_audio_sim_gt < 0.05 or image_audio_sim_gt > 0.65:
|
||||||
|
if cfg.adapted_weight_type == "cut_off":
|
||||||
|
adapted_weight = 0.0 # Skip this batch
|
||||||
|
print(
|
||||||
|
f"\nThe i-a similarity in step {global_step} is {image_audio_sim_gt}, set adapted_weight to {adapted_weight}.")
|
||||||
|
elif cfg.adapted_weight_type == "linear":
|
||||||
|
adapted_weight = image_audio_sim_gt
|
||||||
|
else:
|
||||||
|
print(f"unknown adapted_weight_type: {cfg.adapted_weight_type}")
|
||||||
|
adapted_weight = 1
|
||||||
|
|
||||||
|
# Random frame selection for memory efficiency
|
||||||
|
max_start = 16 - cfg.num_backward_frames
|
||||||
|
frames_left_index = random.randint(0, max_start) if max_start > 0 else 0
|
||||||
|
frames_right_index = frames_left_index + cfg.num_backward_frames
|
||||||
|
else:
|
||||||
|
frames_left_index = 0
|
||||||
|
frames_right_index = cfg.data.n_sample_frames
|
||||||
|
|
||||||
|
# Extract frames for backward pass
|
||||||
|
pixel_values_backward = pixel_values[:, frames_left_index:frames_right_index, ...]
|
||||||
|
ref_pixel_values_backward = ref_pixel_values[:, frames_left_index:frames_right_index, ...]
|
||||||
|
pixel_values_face_mask_backward = pixel_values_face_mask[:, frames_left_index:frames_right_index, ...]
|
||||||
|
audio_prompts_backward = audio_prompts[:, frames_left_index:frames_right_index, ...]
|
||||||
|
|
||||||
|
# Encode target images
|
||||||
|
frames = rearrange(pixel_values_backward, 'b f c h w-> (b f) c h w')
|
||||||
|
latents = model_dict['vae'].encode(frames).latent_dist.mode()
|
||||||
|
latents = latents * model_dict['vae'].config.scaling_factor
|
||||||
|
latents = latents.float()
|
||||||
|
|
||||||
|
# Create masked images
|
||||||
|
masked_pixel_values = pixel_values_backward.clone()
|
||||||
|
masked_pixel_values[:, :, :, h//2:, :] = -1
|
||||||
|
masked_frames = rearrange(masked_pixel_values, 'b f c h w -> (b f) c h w')
|
||||||
|
masked_latents = model_dict['vae'].encode(masked_frames).latent_dist.mode()
|
||||||
|
masked_latents = masked_latents * model_dict['vae'].config.scaling_factor
|
||||||
|
masked_latents = masked_latents.float()
|
||||||
|
|
||||||
|
# Encode reference images
|
||||||
|
ref_frames = rearrange(ref_pixel_values_backward, 'b f c h w-> (b f) c h w')
|
||||||
|
ref_latents = model_dict['vae'].encode(ref_frames).latent_dist.mode()
|
||||||
|
ref_latents = ref_latents * model_dict['vae'].config.scaling_factor
|
||||||
|
ref_latents = ref_latents.float()
|
||||||
|
|
||||||
|
# Prepare face mask and audio features
|
||||||
|
pixel_values_face_mask_backward = rearrange(
|
||||||
|
pixel_values_face_mask_backward,
|
||||||
|
"b f c h w -> (b f) c h w"
|
||||||
|
)
|
||||||
|
audio_prompts_backward = rearrange(
|
||||||
|
audio_prompts_backward,
|
||||||
|
'b f c h w-> (b f) c h w'
|
||||||
|
)
|
||||||
|
audio_prompts_backward = rearrange(
|
||||||
|
audio_prompts_backward,
|
||||||
|
'(b f) c h w -> (b f) (c h) w',
|
||||||
|
b=bsz
|
||||||
|
)
|
||||||
|
|
||||||
|
# Apply reference dropout (currently inactive)
|
||||||
|
dropout = nn.Dropout(p=cfg.ref_dropout_rate)
|
||||||
|
ref_latents = dropout(ref_latents)
|
||||||
|
|
||||||
|
# Prepare model inputs
|
||||||
|
input_latents = torch.cat([masked_latents, ref_latents], dim=1)
|
||||||
|
input_latents = input_latents.to(weight_dtype)
|
||||||
|
timesteps = torch.tensor([0], device=input_latents.device)
|
||||||
|
|
||||||
|
# Forward pass
|
||||||
|
latents_pred = model_dict['net'](
|
||||||
|
input_latents,
|
||||||
|
timesteps,
|
||||||
|
audio_prompts_backward,
|
||||||
|
)
|
||||||
|
latents_pred = (1 / model_dict['vae'].config.scaling_factor) * latents_pred
|
||||||
|
image_pred = model_dict['vae'].decode(latents_pred).sample
|
||||||
|
|
||||||
|
# Convert to float
|
||||||
|
image_pred = image_pred.float()
|
||||||
|
frames = frames.float()
|
||||||
|
|
||||||
|
# Calculate L1 loss
|
||||||
|
l1_loss = loss_dict['L1_loss'](frames, image_pred)
|
||||||
|
l1_loss_accum += l1_loss.item()
|
||||||
|
loss = cfg.loss_params.l1_loss * l1_loss * adapted_weight
|
||||||
|
|
||||||
|
# Process mouth GAN loss if enabled
|
||||||
|
if cfg.loss_params.mouth_gan_loss > 0:
|
||||||
|
frames_mouth, image_pred_mouth = get_mouth_region(
|
||||||
|
frames,
|
||||||
|
image_pred,
|
||||||
|
pixel_values_face_mask_backward
|
||||||
|
)
|
||||||
|
pyramide_real_mouth = pyramid(downsampler(frames_mouth))
|
||||||
|
pyramide_generated_mouth = pyramid(downsampler(image_pred_mouth))
|
||||||
|
|
||||||
|
# Process VGG loss if enabled
|
||||||
|
if cfg.loss_params.vgg_loss > 0:
|
||||||
|
pyramide_real = pyramid(downsampler(frames))
|
||||||
|
pyramide_generated = pyramid(downsampler(image_pred))
|
||||||
|
|
||||||
|
loss_IN = 0
|
||||||
|
for scale in cfg.loss_params.pyramid_scale:
|
||||||
|
x_vgg = vgg_IN(pyramide_generated['prediction_' + str(scale)])
|
||||||
|
y_vgg = vgg_IN(pyramide_real['prediction_' + str(scale)])
|
||||||
|
for i, weight in enumerate(cfg.loss_params.vgg_layer_weight):
|
||||||
|
value = torch.abs(x_vgg[i] - y_vgg[i].detach()).mean()
|
||||||
|
loss_IN += weight * value
|
||||||
|
loss_IN /= sum(cfg.loss_params.vgg_layer_weight)
|
||||||
|
loss += loss_IN * cfg.loss_params.vgg_loss * adapted_weight
|
||||||
|
vgg_loss_accum += loss_IN.item()
|
||||||
|
|
||||||
|
# Process GAN loss if enabled
|
||||||
|
if cfg.loss_params.gan_loss > 0:
|
||||||
|
set_requires_grad(loss_dict['discriminator'], False)
|
||||||
|
loss_G = 0.
|
||||||
|
discriminator_maps_generated = loss_dict['discriminator'](pyramide_generated)
|
||||||
|
discriminator_maps_real = loss_dict['discriminator'](pyramide_real)
|
||||||
|
|
||||||
|
for scale in loss_dict['disc_scales']:
|
||||||
|
key = 'prediction_map_%s' % scale
|
||||||
|
value = ((1 - discriminator_maps_generated[key]) ** 2).mean()
|
||||||
|
loss_G += value
|
||||||
|
gan_loss_accum += loss_G.item()
|
||||||
|
|
||||||
|
loss += loss_G * cfg.loss_params.gan_loss * get_ganloss_weight(global_step) * adapted_weight
|
||||||
|
|
||||||
|
# Process feature matching loss if enabled
|
||||||
|
if cfg.loss_params.fm_loss[0] > 0:
|
||||||
|
L_feature_matching = 0.
|
||||||
|
for scale in loss_dict['disc_scales']:
|
||||||
|
key = 'feature_maps_%s' % scale
|
||||||
|
for i, (a, b) in enumerate(zip(discriminator_maps_real[key], discriminator_maps_generated[key])):
|
||||||
|
value = torch.abs(a - b).mean()
|
||||||
|
L_feature_matching += value * cfg.loss_params.fm_loss[i]
|
||||||
|
loss += L_feature_matching * adapted_weight
|
||||||
|
fm_loss_accum += L_feature_matching.item()
|
||||||
|
|
||||||
|
# Process mouth GAN loss if enabled
|
||||||
|
if cfg.loss_params.mouth_gan_loss > 0:
|
||||||
|
set_requires_grad(loss_dict['mouth_discriminator'], False)
|
||||||
|
loss_G = 0.
|
||||||
|
mouth_discriminator_maps_generated = loss_dict['mouth_discriminator'](pyramide_generated_mouth)
|
||||||
|
mouth_discriminator_maps_real = loss_dict['mouth_discriminator'](pyramide_real_mouth)
|
||||||
|
|
||||||
|
for scale in loss_dict['disc_scales']:
|
||||||
|
key = 'prediction_map_%s' % scale
|
||||||
|
value = ((1 - mouth_discriminator_maps_generated[key]) ** 2).mean()
|
||||||
|
loss_G += value
|
||||||
|
gan_loss_accum_mouth += loss_G.item()
|
||||||
|
|
||||||
|
loss += loss_G * cfg.loss_params.mouth_gan_loss * get_ganloss_weight(global_step) * adapted_weight
|
||||||
|
|
||||||
|
# Process feature matching loss for mouth if enabled
|
||||||
|
if cfg.loss_params.fm_loss[0] > 0:
|
||||||
|
L_feature_matching = 0.
|
||||||
|
for scale in loss_dict['disc_scales']:
|
||||||
|
key = 'feature_maps_%s' % scale
|
||||||
|
for i, (a, b) in enumerate(zip(mouth_discriminator_maps_real[key], mouth_discriminator_maps_generated[key])):
|
||||||
|
value = torch.abs(a - b).mean()
|
||||||
|
L_feature_matching += value * cfg.loss_params.fm_loss[i]
|
||||||
|
loss += L_feature_matching * adapted_weight
|
||||||
|
fm_loss_accum += L_feature_matching.item()
|
||||||
|
|
||||||
|
# Process sync loss if enabled
|
||||||
|
if cfg.loss_params.sync_loss > 0:
|
||||||
|
pred_frames = rearrange(
|
||||||
|
image_pred, '(b f) c h w-> b (f c) h w', f=pixel_values_backward.shape[1])
|
||||||
|
pred_frames = pred_frames[:, :, height // 2 :, :]
|
||||||
|
sync_loss, image_audio_sim_pred = get_sync_loss(
|
||||||
|
audio_embed,
|
||||||
|
gt_frames,
|
||||||
|
pred_frames,
|
||||||
|
syncnet,
|
||||||
|
adapted_weight,
|
||||||
|
frames_left_index=frames_left_index,
|
||||||
|
frames_right_index=frames_right_index,
|
||||||
|
)
|
||||||
|
sync_loss_accum += sync_loss.item()
|
||||||
|
loss += sync_loss * cfg.loss_params.sync_loss * adapted_weight
|
||||||
|
|
||||||
|
# Backward pass
|
||||||
|
avg_loss = accelerator.gather(loss.repeat(cfg.data.train_bs)).mean()
|
||||||
|
train_loss += avg_loss.item()
|
||||||
|
accelerator.backward(loss)
|
||||||
|
|
||||||
|
# Train discriminator if GAN loss is enabled
|
||||||
|
if cfg.loss_params.gan_loss > 0:
|
||||||
|
set_requires_grad(loss_dict['discriminator'], True)
|
||||||
|
loss_D = loss_dict['discriminator_full'](frames, image_pred.detach())
|
||||||
|
avg_loss_D = accelerator.gather(loss_D.repeat(cfg.data.train_bs)).mean()
|
||||||
|
train_loss_D += avg_loss_D.item() / 1
|
||||||
|
loss_D = loss_D * get_ganloss_weight(global_step) * adapted_weight
|
||||||
|
accelerator.backward(loss_D)
|
||||||
|
|
||||||
|
if accelerator.sync_gradients:
|
||||||
|
accelerator.clip_grad_norm_(
|
||||||
|
loss_dict['discriminator'].parameters(), cfg.solver.max_grad_norm)
|
||||||
|
if (global_step + 1) % cfg.solver.gradient_accumulation_steps == 0:
|
||||||
|
loss_dict['optimizer_D'].step()
|
||||||
|
loss_dict['scheduler_D'].step()
|
||||||
|
loss_dict['optimizer_D'].zero_grad()
|
||||||
|
|
||||||
|
# Train mouth discriminator if mouth GAN loss is enabled
|
||||||
|
if cfg.loss_params.mouth_gan_loss > 0:
|
||||||
|
set_requires_grad(loss_dict['mouth_discriminator'], True)
|
||||||
|
mouth_loss_D = loss_dict['mouth_discriminator_full'](
|
||||||
|
frames_mouth, image_pred_mouth.detach())
|
||||||
|
avg_mouth_loss_D = accelerator.gather(
|
||||||
|
mouth_loss_D.repeat(cfg.data.train_bs)).mean()
|
||||||
|
train_loss_D_mouth += avg_mouth_loss_D.item() / 1
|
||||||
|
mouth_loss_D = mouth_loss_D * get_ganloss_weight(global_step) * adapted_weight
|
||||||
|
accelerator.backward(mouth_loss_D)
|
||||||
|
|
||||||
|
if accelerator.sync_gradients:
|
||||||
|
accelerator.clip_grad_norm_(
|
||||||
|
loss_dict['mouth_discriminator'].parameters(), cfg.solver.max_grad_norm)
|
||||||
|
if (global_step + 1) % cfg.solver.gradient_accumulation_steps == 0:
|
||||||
|
loss_dict['mouth_optimizer_D'].step()
|
||||||
|
loss_dict['mouth_scheduler_D'].step()
|
||||||
|
loss_dict['mouth_optimizer_D'].zero_grad()
|
||||||
|
|
||||||
|
# Update main model
|
||||||
|
if (global_step + 1) % cfg.solver.gradient_accumulation_steps == 0:
|
||||||
|
if accelerator.sync_gradients:
|
||||||
|
accelerator.clip_grad_norm_(
|
||||||
|
model_dict['trainable_params'],
|
||||||
|
cfg.solver.max_grad_norm,
|
||||||
|
)
|
||||||
|
model_dict['optimizer'].step()
|
||||||
|
model_dict['lr_scheduler'].step()
|
||||||
|
model_dict['optimizer'].zero_grad()
|
||||||
|
|
||||||
|
# Update progress and log metrics
|
||||||
|
if accelerator.sync_gradients:
|
||||||
|
progress_bar.update(1)
|
||||||
|
global_step += 1
|
||||||
|
accelerator.log({
|
||||||
|
"train_loss": train_loss,
|
||||||
|
"train_loss_D": train_loss_D,
|
||||||
|
"train_loss_D_mouth": train_loss_D_mouth,
|
||||||
|
"l1_loss": l1_loss_accum,
|
||||||
|
"vgg_loss": vgg_loss_accum,
|
||||||
|
"gan_loss": gan_loss_accum,
|
||||||
|
"fm_loss": fm_loss_accum,
|
||||||
|
"sync_loss": sync_loss_accum,
|
||||||
|
"adapted_weight": adapted_weight_accum,
|
||||||
|
"lr": model_dict['lr_scheduler'].get_last_lr()[0],
|
||||||
|
}, step=global_step)
|
||||||
|
|
||||||
|
# Reset loss accumulators
|
||||||
|
train_loss = 0.0
|
||||||
|
l1_loss_accum = 0.0
|
||||||
|
vgg_loss_accum = 0.0
|
||||||
|
gan_loss_accum = 0.0
|
||||||
|
fm_loss_accum = 0.0
|
||||||
|
sync_loss_accum = 0.0
|
||||||
|
adapted_weight_accum = 0.0
|
||||||
|
train_loss_D = 0.0
|
||||||
|
train_loss_D_mouth = 0.0
|
||||||
|
|
||||||
|
# Run validation if needed
|
||||||
|
if global_step % cfg.val_freq == 0 or global_step == 10:
|
||||||
|
try:
|
||||||
|
validation(
|
||||||
|
cfg,
|
||||||
|
dataloader_dict['val_dataloader'],
|
||||||
|
model_dict['net'],
|
||||||
|
model_dict['vae'],
|
||||||
|
model_dict['wav2vec'],
|
||||||
|
accelerator,
|
||||||
|
save_dir,
|
||||||
|
global_step,
|
||||||
|
weight_dtype,
|
||||||
|
syncnet_score=adapted_weight,
|
||||||
|
)
|
||||||
|
except Exception as e:
|
||||||
|
print(f"An error occurred during validation: {e}")
|
||||||
|
|
||||||
|
# Save checkpoint if needed
|
||||||
|
if global_step % cfg.checkpointing_steps == 0:
|
||||||
|
save_path = os.path.join(save_dir, f"checkpoint-{global_step}")
|
||||||
|
try:
|
||||||
|
start_time = time.time()
|
||||||
|
if accelerator.is_main_process:
|
||||||
|
save_models(
|
||||||
|
accelerator,
|
||||||
|
model_dict['net'],
|
||||||
|
save_dir,
|
||||||
|
global_step,
|
||||||
|
cfg,
|
||||||
|
logger=logger
|
||||||
|
)
|
||||||
|
delete_additional_ckpt(save_dir, cfg.total_limit)
|
||||||
|
elapsed_time = time.time() - start_time
|
||||||
|
if elapsed_time > 300:
|
||||||
|
print(f"Skipping storage as it took too long in step {global_step}.")
|
||||||
|
else:
|
||||||
|
print(f"Resume states saved at {save_dir} successfully in {elapsed_time}s.")
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error when saving model in step {global_step}:", e)
|
||||||
|
|
||||||
|
# Update progress bar
|
||||||
|
t_model = time.time() - t_model_start
|
||||||
|
logs = {
|
||||||
|
"step_loss": loss.detach().item(),
|
||||||
|
"lr": model_dict['lr_scheduler'].get_last_lr()[0],
|
||||||
|
"td": f"{t_data:.2f}s",
|
||||||
|
"tm": f"{t_model:.2f}s",
|
||||||
|
}
|
||||||
|
t_data_start = time.time()
|
||||||
|
progress_bar.set_postfix(**logs)
|
||||||
|
|
||||||
|
if global_step >= cfg.solver.max_train_steps:
|
||||||
|
break
|
||||||
|
|
||||||
|
# Save model after each epoch
|
||||||
|
if (epoch + 1) % cfg.save_model_epoch_interval == 0:
|
||||||
|
try:
|
||||||
|
start_time = time.time()
|
||||||
|
if accelerator.is_main_process:
|
||||||
|
save_models(accelerator, model_dict['net'], save_dir, global_step, cfg)
|
||||||
|
accelerator.save_state(save_path)
|
||||||
|
elapsed_time = time.time() - start_time
|
||||||
|
if elapsed_time > 120:
|
||||||
|
print(f"Skipping storage as it took too long in step {global_step}.")
|
||||||
|
else:
|
||||||
|
print(f"Model saved successfully in {elapsed_time}s.")
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error when saving model in step {global_step}:", e)
|
||||||
|
accelerator.wait_for_everyone()
|
||||||
|
|
||||||
|
# End training
|
||||||
|
accelerator.end_training()
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
parser = argparse.ArgumentParser()
|
||||||
|
parser.add_argument("--config", type=str, default="./configs/training/stage2.yaml")
|
||||||
|
args = parser.parse_args()
|
||||||
|
config = OmegaConf.load(args.config)
|
||||||
|
main(config)
|
||||||
@@ -0,0 +1,34 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
# MuseTalk Training Script
|
||||||
|
# This script combines both training stages for the MuseTalk model
|
||||||
|
# Usage: sh train.sh [stage1|stage2]
|
||||||
|
# Example: sh train.sh stage1 # To run stage 1 training
|
||||||
|
# Example: sh train.sh stage2 # To run stage 2 training
|
||||||
|
|
||||||
|
# Check if stage argument is provided
|
||||||
|
if [ $# -ne 1 ]; then
|
||||||
|
echo "Error: Please specify the training stage"
|
||||||
|
echo "Usage: ./train.sh [stage1|stage2]"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
STAGE=$1
|
||||||
|
|
||||||
|
# Validate stage argument
|
||||||
|
if [ "$STAGE" != "stage1" ] && [ "$STAGE" != "stage2" ]; then
|
||||||
|
echo "Error: Invalid stage. Must be either 'stage1' or 'stage2'"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Launch distributed training using accelerate
|
||||||
|
# --config_file: Path to the GPU configuration file
|
||||||
|
# --main_process_port: Port number for the main process, used for distributed training communication
|
||||||
|
# train.py: Training script
|
||||||
|
# --config: Path to the training configuration file
|
||||||
|
echo "Starting $STAGE training..."
|
||||||
|
accelerate launch --config_file ./configs/training/gpu.yaml \
|
||||||
|
--main_process_port 29502 \
|
||||||
|
train.py --config ./configs/training/$STAGE.yaml
|
||||||
|
|
||||||
|
echo "Training completed for $STAGE"
|
||||||
Reference in New Issue
Block a user