SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations
This repository contains the official implementation code for SCAIL (Studio-Grade Character Animation via In-Context Learning), a framework that enables high-fidelity character animation under diverse and challenging conditions, including large motion variations, stylized characters, and multi-character interactions.
SCAIL identifies the key bottlenecks that keep character animation from reaching production level: limited generalization across characters and incoherent motion under complex scenarios (e.g., the long-standing challenge of multi-character interactions, as well as common failures in basic motions like flipping and turning). We revisit the core components of character animation: how to design the pose representation and how to inject the pose into the model. Our framework resolves the tension whereby pose representations cannot simultaneously prevent identity leakage and preserve rich motion information, and compels the model to perform spatiotemporal reasoning over the entire motion sequence for more natural and coherent movements. Check out our method, results gallery, and comparisons against other baselines on our project page.
- 2025.12.08: We release the inference code of SCAIL on SAT.
- 2025.12.11: We've added more interesting cases to our gallery on the project page! Check it out!
- 2025.12.11: SCAIL is now officially open-sourced on Hugging Face and ModelScope!
- 2025.12.14: Thanks to friends in the community for testing the work! Even though only 1.5% of SCAIL's training samples are anime data, and we did not intentionally collect any multi-character anime data, we were pleasantly surprised to see that the model can already handle many complex anime characters and even support multi-character anime interactions. The release of SCAIL-Preview is intended to demonstrate the soundness of our proposed pose representation and model architecture, with clear potential for further scaling and enhancement.
- 2025.12.16: Huge thanks to KJ for the adaptation work: SCAIL is now available in ComfyUI-WanVideoWrapper! Meanwhile, the pose extraction & rendering pipeline has also been partly adapted to ComfyUI in ComfyUI-SCAIL-Pose, currently without multi-character tracking and multi-character facial keypoints.
- 2025.12.17: Thanks to VantageWithAI, a GGUF version is now available at SCAIL-Preview-GGUF!
- SCAIL-14B-Preview Model Weights (512p, 5s) and Inference Config
- Prompt Optimization Snippets
- SCAIL-Official (1.3B/14B) Model Weights (Improved Stability and Clarity, Native Long-Video Generation Capability) and Inference Config
| ckpts | Download Link | Notes |
|---|---|---|
| SCAIL-Preview (14B) | Hugging Face / ModelScope | Supports 512P |
Use the following commands to download the model weights (we have integrated both the Wan VAE and T5 modules into this checkpoint for convenience).
# Download the repository (skip automatic LFS file downloads)
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/zai-org/SCAIL-Preview

The files should be organized like:
SCAIL-Preview/
├── Wan2.1_VAE.pth
├── model
│   ├── 1
│   │   └── mp_rank_00_model_states.pt
│   └── latest
├── umt5-xxl
└── ...
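Since GIT_LFS_SKIP_SMUDGE=1 clones only the LFS pointers, you still need to fetch the large weight files afterwards. A minimal sketch of two ways to do this, assuming git-lfs (and, for the second route, the huggingface_hub CLI) is installed:

# Option 1: pull the LFS-tracked weight files into the cloned repository
cd SCAIL-Preview
git lfs pull

# Option 2: download the full snapshot with the Hugging Face CLI instead
huggingface-cli download zai-org/SCAIL-Preview --local-dir SCAIL-Preview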
Please make sure your Python version is between 3.10 and 3.12 (inclusive).
pip install -r requirements.txt
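For example, a fresh environment can be set up as follows (a sketch assuming conda; any Python interpreter in the 3.10-3.12 range works):

# A sketch assuming conda is available; any Python 3.10-3.12 interpreter works
conda create -n scail python=3.10 -y
conda activate scail
pip install -r requirements.txt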
The input data should be organized as follows; we have provided some example data in examples/:
examples/
├── 001
│   ├── driving.mp4
│   └── ref.jpg
└── 002
    ├── driving.mp4
    └── ref.jpg
...
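To add your own case, create a new folder that follows the same layout (a sketch; the folder name 003 and the source paths are placeholders, while driving.mp4 and ref.jpg are the file names the pipeline expects):

# Prepare a new case folder following the layout above (source paths are placeholders)
mkdir -p examples/003
cp /path/to/your_motion_video.mp4 examples/003/driving.mp4
cp /path/to/your_character_image.jpg examples/003/ref.jpg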
We provide our pose extraction and rendering code in a separate repo, SCAIL-Pose, which can be used to extract poses from the driving video and render them. We recommend using a separate environment for pose extraction due to dependency conflicts. Clone that repo into a SCAIL-Pose folder and follow the instructions in it.
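One possible workflow is sketched below; the clone URL is a placeholder (use the SCAIL-Pose link above), and the Python version for that environment is an assumption, so defer to that repository's own instructions:

# A sketch: keep pose extraction in its own environment to avoid dependency conflicts
conda create -n scail-pose python=3.10 -y   # Python version is an assumption; check SCAIL-Pose's requirements
conda activate scail-pose
git clone <SCAIL-Pose-repo-URL> SCAIL-Pose  # placeholder URL: use the SCAIL-Pose link referenced above
cd SCAIL-Pose                               # then follow the setup steps in that repo's README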
After pose extraction and rendering, the input data should be organized as follows:
examples/
├── 001
│   ├── driving.mp4
│   ├── ref.jpg
│   └── rendered.mp4 (or rendered_aligned.mp4)
└── 002
...
Run the following command to start the inference:
bash scripts/sample_sgl_1Bsc_xc_cli.sh
The CLI will ask for input in the format <prompt>@@<example_dir>, e.g. the girl is dancing@@examples/001. The example_dir should contain rendered.mp4 or rendered_aligned.mp4 after pose extraction and rendering. Results will be saved to samples/.
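If you want to script a run, the input line can also be piped in (a sketch assuming the CLI reads the line from stdin; otherwise type it interactively when prompted):

# A sketch assuming the CLI reads its input line from stdin
echo "the girl is dancing@@examples/001" | bash scripts/sample_sgl_1Bsc_xc_cli.sh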
Note that our model is trained with long, detailed prompts. A short or even empty prompt can be used, but the results may not be as good as with a long prompt. We will provide our prompt-generation snippets, which use Google Gemini to read the reference image and the driving motion and generate a detailed prompt such as: A woman with curly hair is joyfully dancing along a rocky shoreline, wearing a sleek blue two-piece outfit. She performs various dance moves, including twirling, raising her hands, and embracing the lively seaside atmosphere, her tattoos and confident demeanor adding to her dynamic presence.
You can further adjust sampling configurations such as resolution in the yaml files under configs/sampling/, or directly modify sample_video.py for customized sampling logic.
Our implementation is built upon Wan 2.1, and the overall project architecture is built using SAT. We use NLFPose for reliable pose extraction. Thanks for their remarkable contributions and released code.
If you find this work useful in your research, please cite:
@article{yan2025scail,
title={SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations},
author={Yan, Wenhao and Ye, Sheng and Yang, Zhuoyi and Teng, Jiayan and Dong, ZhenHui and Wen, Kairui and Gu, Xiaotao and Liu, Yong-Jin and Tang, Jie},
journal={arXiv preprint arXiv:2512.05905},
year={2025}
}

This project is licensed under the Apache License 2.0. See the LICENSE file for details.