Jingyun Liang, Jingkai Zhou, Shikai Li, Chenjie Cao, Lei Sun, Yichen Qian, Weihua Chen, Fan Wang
DAMO Academy, Alibaba
This repository is the official implementation of RealisMotion: Decomposed Human Motion Control and Video Generation in the World Space.
Here are several example videos generated by RealisMotion. Note that the GIFs shown here have some degree of visual quality degradation. Please visit our project page for more than 100 video examples.
Generating human videos with realistic and controllable motions is a challenging task. While existing methods can generate visually compelling videos, they lack separate control over four key video elements: foreground subject, background video, human trajectory and action patterns. In this paper, we propose a decomposed human motion control and video generation framework that explicitly decouples motion from appearance, subject from background, and action from trajectory, enabling flexible mix-and-match composition of these elements. Concretely, we first build a ground-aware 3D world coordinate system and perform motion editing directly in the 3D space. Trajectory control is implemented by unprojecting edited 2D trajectories into 3D with focal-length calibration and coordinate transformation, followed by speed alignment and orientation adjustment; actions are supplied by a motion bank or generated via text-to-motion methods. Then, based on modern text-to-video diffusion transformer models, we inject the subject as tokens for full attention, concatenate the background along the channel dimension, and add motion (trajectory and action) control signals by addition. Such a design opens up the possibility for us to generate realistic videos of anyone doing anything anywhere. Extensive experiments on benchmark datasets and real-world cases demonstrate that our method achieves state-of-the-art performance on both element-wise controllability and overall video quality.
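The three conditioning pathways described above (subject as extra tokens, background concatenated along channels, motion added to the tokens) can be illustrated with a toy NumPy sketch. All shapes, variable names, and the `patchify` helper are illustrative assumptions, not the model's actual dimensions or code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: T frames, C latent channels, H x W latent grid, D token width.
T, C, H, W, D = 4, 8, 6, 6, 16

video_latent = rng.standard_normal((T, C, H, W))       # noisy video latent
background_latent = rng.standard_normal((T, C, H, W))  # encoded background video
subject_latent = rng.standard_normal((1, C, H, W))     # encoded reference subject
motion_latent = rng.standard_normal((T, D, H, W))      # encoded motion signal, already D wide


def patchify(x, weight):
    """Project a (T, C, H, W) latent to (T*H*W, D) tokens with a per-position linear map."""
    t, c, h, w = x.shape
    tokens = x.transpose(0, 2, 3, 1).reshape(t * h * w, c)
    return tokens @ weight


# 1) Background: concatenate along the channel dimension before patch embedding.
x = np.concatenate([video_latent, background_latent], axis=1)  # (T, 2C, H, W)
video_tokens = patchify(x, rng.standard_normal((2 * C, D)))    # (T*H*W, D)

# 2) Motion (trajectory and action): add the control signal to the video tokens.
motion_tokens = motion_latent.transpose(0, 2, 3, 1).reshape(T * H * W, D)
video_tokens = video_tokens + motion_tokens

# 3) Subject: append as extra tokens so full attention can attend to them.
subject_tokens = patchify(subject_latent, rng.standard_normal((C, D)))  # (H*W, D)
sequence = np.concatenate([video_tokens, subject_tokens], axis=0)

print(sequence.shape)  # (180, 16)
```

The sequence length grows only with the subject tokens, while the background and motion signals leave the number of video tokens unchanged.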
git clone https://github.com/jingyunliang/RealisMotion.git
cd RealisMotion
conda create -n realismotion python=3.10
conda activate realismotion
pip install -r requirements.txt
# install FA3
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout 0dfb28174333d9eefb7c1dd4292690a8458d1e89 # Important: other FA3 commits might yield bad results on H20 GPUs
cd hopper
python setup.py install
cd ../../
We provide two versions for inference: the first is the text-to-video (T2V) version (the same model as in the paper); the second is the image-to-video (I2V) version (to avoid duplicate work, we directly combine it with the concurrent work RealisDance-Dit).
| Version Type | Advantage | Disadvantage |
|---|---|---|
| Text-to-Video (T2V) version | | |
| Image-to-Video (I2V) version | | |
Please download the checkpoints as shown below. If you need to speed up downloading, use HF_ENDPOINT=https://hf-mirror.com huggingface-cli xxxx. Placing pretrained_models/ on a fast disk (e.g., /tmp/) can significantly reduce the model loading time; in that case, use --ckpt /tmp/pretrained_models/RealisMotion.
| Version Type | Bash Command |
|---|---|
| T2V | |
| I2V | |
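The fast-disk tip above can be scripted. The following is a minimal sketch, assuming the checkpoints have already been downloaded to a local `pretrained_models/` directory; the helper name `stage_checkpoints` is hypothetical and not part of this repository.

```python
import shutil
from pathlib import Path


def stage_checkpoints(src: str, fast_root: str = "/tmp") -> Path:
    """Copy a checkpoint directory under fast_root (unless already there) and return the copy."""
    src_path = Path(src)
    dst_path = Path(fast_root) / src_path.name
    if not dst_path.exists():
        shutil.copytree(src_path, dst_path)
    return dst_path


# Example (not executed here):
#   ckpt = stage_checkpoints("pretrained_models") / "RealisMotion"
#   then pass the resulting path as --ckpt /tmp/pretrained_models/RealisMotion
```

The copy is skipped when the destination already exists, so repeated runs only pay the cost once per machine boot if `/tmp` is cleared on restart.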
- Inference with a single GPU
| Version Type | Bash Command |
|---|---|
| T2V | |
| I2V | |
Note: add --enable-teacache to run inference with TeaCache for acceleration (optional; may cause quality degradation). Add --save-gpu-memory to run inference with limited GPU memory (optional; this is very slow, and can be combined with TeaCache).
- Inference with multiple GPUs (optional; can be used with TeaCache)
| Version Type | Bash Command |
|---|---|
| T2V | |
| I2V | |
| Version Type | Bash Command |
|---|---|
| T2V | |
| I2V | |
To edit the trajectory, orientation and action of the human, please follow the steps below.
First, install GVHMR and DPVO following the GVHMR installation instructions. For H20 GPUs, the nvcc flags in third-party/DPVO/setup.py should be changed to ['-O3', '-gencode', 'arch=compute_90,code=sm_90']. Then symlink the checkpoints with ln -s YOUR_PATH/GVHMR/inputs/checkpoints inputs/checkpoints.
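For GPUs other than the H20, the same flag list can be derived from the CUDA compute capability. The sketch below is only an illustration; the helper `nvcc_gencode_flags` is hypothetical and is not part of DPVO or GVHMR.

```python
def nvcc_gencode_flags(major: int, minor: int) -> list[str]:
    """Build nvcc flags for a single GPU architecture from its compute capability."""
    arch = f"{major}{minor}"
    return ["-O3", "-gencode", f"arch=compute_{arch},code=sm_{arch}"]


# Compute capability 9.0 reproduces the H20 flags given above.
print(nvcc_gencode_flags(9, 0))
# ['-O3', '-gencode', 'arch=compute_90,code=sm_90']

# With PyTorch installed, the capability of the current GPU can be queried with
# torch.cuda.get_device_capability(), which returns a (major, minor) tuple.
```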
Then, optionally install DepthPro for focal length calibration as follows.
git clone https://github.com/apple/ml-depth-pro
cd ml-depth-pro
pip install .
source get_pretrained_models.sh
cd ..
We first estimate SMPL-X parameters for the input foreground subject, background, and motion videos or images.
cd RealisMotion
export PYTHONPATH="YOUR_PATH/GVHMR/hmr4d:$PYTHONPATH"
# process foreground and background
python hmc/render_demo.py --video=inputs/example_video/internalaffairs.mp4 --output_root inputs/demo --track_id 1
# process motion
python hmc/render_demo.py --video=inputs/example_video/falldown.mp4 --output_root inputs/demo
By default, --track_id is set to 0 to track the first person. Use -s for a static background. If you only have an image, convert it to a video first as shown below.
ffmpeg -loop 1 -i inputs/example_video/YOUR_IMAGE.jpg -c:v libx264 -preset veryslow -crf 0 -t 1 -pix_fmt yuv420p -vf "fps=25,scale=trunc(iw/2)*2:trunc(ih/2)*2" inputs/example_video/YOUR_VIDEO.mp4 -y
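The `scale=trunc(iw/2)*2:trunc(ih/2)*2` filter in the command above rounds both dimensions down to even numbers, which libx264 needs for `yuv420p` output because chroma is subsampled 2x2. The sketch below mirrors that rounding and assembles the same options as an argument list for `subprocess.run`; the helper names are hypothetical, and `-y` is moved to the front of the command.

```python
import shlex


def even_floor(n: int) -> int:
    """Round a positive size down to the nearest even integer, like trunc(n/2)*2."""
    return (n // 2) * 2


def image_to_video_cmd(image: str, video: str, seconds: int = 1, fps: int = 25) -> list[str]:
    """Assemble ffmpeg arguments that loop a still image into a short lossless clip."""
    return [
        "ffmpeg", "-y", "-loop", "1", "-i", image,
        "-c:v", "libx264", "-preset", "veryslow", "-crf", "0",
        "-t", str(seconds), "-pix_fmt", "yuv420p",
        "-vf", f"fps={fps},scale=trunc(iw/2)*2:trunc(ih/2)*2",
        video,
    ]


print(even_floor(1081), even_floor(720))  # 1080 720
print(shlex.join(image_to_video_cmd("in.jpg", "out.mp4")))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting issues with the parentheses and asterisks in the filter string.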
To edit the motion, you need to specify the background path, the motion path, and the reference foreground path. We currently provide four examples for different use cases.
export PYTHONPATH="YOUR_PATH/GVHMR/hmr4d:$PYTHONPATH"
# example 1: affine transformation
python hmc/realismotion_render_demo.py \
--video=inputs/example_video/internalaffairs.mp4 \
--motion_path inputs/motion_bank/falldown \
--reference_path inputs/demo/internalaffairs \
--output_root inputs/demo \
--window_size 1 \
--repeat_smpl 0 50 1 \
--pause_at_begin 50 \
--pause_at_end 105 \
--edit_type affine_transform \
--affine_transform_args 0 0 0 -0.3
# example 2: move according to any trajectory
python hmc/realismotion_render_demo.py \
--video=inputs/example_video/justin.mp4 \
--motion_path inputs/motion_bank/tstageboy \
--reference_path inputs/demo/justin \
--output_root inputs/demo \
--window_size 25 \
--repeat_smpl 18 51 8 \
--edit_type edit_trajectory \
--edit_trajectory_args 815 500 952 1050 1192 502 \
-s
python hmc/realismotion_render_demo.py \
--video=inputs/example_video/male.mp4 \
--motion_path inputs/motion_bank/tstageboy \
--reference_path inputs/demo/male \
--output_root inputs/demo \
--window_size 25 \
--repeat_smpl 18 51 12 \
--edit_type edit_trajectory \
--edit_trajectory_args 1107 48 1098 662 1347 486 1154 303 952 682 1136 832 1100 494 910 787 \
--append circle \
-s
# example 3: move according to a heart-shape trajectory
python hmc/realismotion_render_demo.py \
--video=inputs/example_video/male.mp4 \
--motion_path inputs/motion_bank/tstageboy \
--reference_path inputs/demo/male \
--output_root inputs/demo \
--window_size 25 \
--repeat_smpl 18 51 12 \
--edit_type edit_trajectory_as_heart \
--edit_trajectory_as_heart_args 1 2 \
--append heart \
-s
# example 4: off-the-ground kickoff demo
python hmc/realismotion_render_demo.py \
--video=inputs/example_video/male.mp4 \
--motion_path inputs/demo/male \
--reference_path inputs/demo/male \
--output_root inputs/demo \
--window_size 25 \
--pause_at_begin 200 \
--edit_type edit_trajectory_kickoff \
--edit_trajectory_kickoff_args 546 515 277 510 186 752 311 929 282 601 45 687 \
--speed_ratio 20 \
--append kickoff \
-s
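To give an intuition for how a 2D trajectory drawn on the image can be lifted into 3D, here is a minimal sketch of unprojecting pixel waypoints onto a flat ground plane with a pinhole camera. It is not the repository's implementation: it assumes that trajectory arguments are flattened (x, y) pixel pairs, that the principal point is the image center, that the camera has no roll or pitch, and that the focal length (for example, one estimated by DepthPro) and the camera height are known. The waypoints, focal length, and camera height below are made-up values.

```python
import numpy as np


def unproject_to_ground(points_px, focal_px, image_size, cam_height=1.6):
    """Intersect pixel rays with a ground plane cam_height metres below the camera.

    Camera axes: x right, y down, z forward. Returns (N, 3) points in camera coordinates.
    """
    width, height = image_size
    pts = np.asarray(points_px, dtype=float).reshape(-1, 2)
    # Ray direction of each pixel: K^-1 [u, v, 1]^T for a centered pinhole camera.
    rays = np.stack(
        [
            (pts[:, 0] - width / 2) / focal_px,
            (pts[:, 1] - height / 2) / focal_px,
            np.ones(len(pts)),
        ],
        axis=1,
    )
    if np.any(rays[:, 1] <= 0):
        raise ValueError("pixels on or above the horizon never hit the ground plane")
    # Scale each ray so that its y component equals the camera height.
    scale = cam_height / rays[:, 1]
    return rays * scale[:, None]


# Hypothetical waypoints as a flat list of x y pairs, all below the horizon line.
waypoints = [800, 700, 960, 900, 1200, 720]
ground_points = unproject_to_ground(waypoints, focal_px=1500.0, image_size=(1920, 1080))
print(ground_points.round(2))
```

A pitched camera or an uneven ground would need an additional rotation and plane estimate, which is where the coordinate transformation mentioned in the abstract comes in.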
More examples can be found in inputs/README.md. For a kid subject, add --kid 1.0. A float between 0 and 1 can be used to interpolate between adult and kid.
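One way to picture the adult-to-kid interpolation is as a linear blend between two body templates. The sketch below is only an assumption about what such a blend could look like; the function `blend_templates` and the toy vertex arrays are hypothetical and do not come from this repository or from SMPL-X.

```python
import numpy as np


def blend_templates(adult_verts, kid_verts, kid: float):
    """Linearly interpolate vertex positions: kid=0 gives the adult, kid=1 gives the kid."""
    if not 0.0 <= kid <= 1.0:
        raise ValueError("kid must lie in [0, 1]")
    adult_verts = np.asarray(adult_verts, dtype=float)
    kid_verts = np.asarray(kid_verts, dtype=float)
    return adult_verts + kid * (kid_verts - adult_verts)


# Toy templates with three vertices each (made-up coordinates).
adult = np.array([[0.0, 1.8, 0.0], [0.3, 1.0, 0.0], [-0.3, 1.0, 0.0]])
child = np.array([[0.0, 1.2, 0.0], [0.2, 0.7, 0.0], [-0.2, 0.7, 0.0]])
print(blend_templates(adult, child, 0.5))
```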
The human mask and hamer (hand pose) are optional, but providing them can improve the video quality. To obtain the human mask, install MatAnyone locally or use the MatAnyone Online Demo. Without the human mask, we extract one from the SMPL-X depth. To obtain the hamer, please refer to Hamer Preparation. Without hamer, we use the standard hand pose in SMPL-X.
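As a rough illustration of the depth-based fallback, a binary mask can be taken from the pixels that a rendered body covers. This sketch assumes a rendered depth map in which uncovered pixels hold a fixed background value (0 here) or a non-finite value; the function `mask_from_depth` is hypothetical, and the repository's actual extraction may differ.

```python
import numpy as np


def mask_from_depth(depth, background_value=0.0):
    """Return a boolean mask of pixels whose depth is finite and not the background value."""
    depth = np.asarray(depth, dtype=float)
    return np.isfinite(depth) & (depth != background_value)


# Toy 4x4 depth map with a 2x2 "body" at a depth of 3 metres.
depth = np.zeros((4, 4))
depth[1:3, 1:3] = 3.0
mask = mask_from_depth(depth)
print(mask.astype(int))
```

A matting model such as MatAnyone captures hair and clothing that a body-model silhouette misses, which is consistent with the note above that a provided mask can improve quality.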
This project is released for academic use. We disclaim responsibility for user-generated content.
@article{liang2025realismotion,
title={RealisMotion: Decomposed Human Motion Control and Video Generation in the World Space},
author={Liang, Jingyun and Zhou, Jingkai and Li, Shikai and Cao, Chenjie and Sun, Lei and Qian, Yichen and Chen, Weihua and Wang, Fan},
journal={arXiv preprint arXiv:2508.08588},
year={2025}
}
We thank the authors of WHAM, 4D-Humans, and ViTPose-Pytorch for their great work, without which our project and code would not be possible.