We aim to enhance instructional video generation with a diffusion-based framework, achieving state-of-the-art results in hand motion clarity and task-specific region localization. Visit our project page for more video results.
If you find our project helpful, please give it a star ⭐ or cite it; we would be very grateful 🙏.
*Example results (input image → generated video), with action descriptions: "Knit the fabric.", "Roll dough.", "Pour vinegar into bowl.", "Pick up and crack egg."*
2024.12.9: Released inference code
2025.2.19: Released training/finetuning code
- 🔥 Updated model weights (coming soon)
- 🔥 Solving camera movement issue: data preprocessing
- 🔥 Support Hugging Face Demo / Google Colab
- etc.
This repository is based on animate-anything.
It is recommended to install Anaconda.
Windows Installation: https://docs.anaconda.com/anaconda/install/windows/
Linux Installation: https://docs.anaconda.com/anaconda/install/linux/
conda create -n IVG python=3.10
conda activate IVG
cd Instructional-Video-Generation-IVG
pip install -r requirements.txt
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
- Download our video data, which are preprocessed subsets of EPIC-KITCHENS/EGO4D, and download the corresponding prompt files. Put them under `downloads/dataset/` (e.g., `downloads/dataset/video_epickitchen`, `downloads/dataset/prompt_epickitchen.json`).
- Download the pretrained model to the folder `downloads/weights/` (e.g., `downloads/weights/animate_anything_512_v1.02`).
- Download our region-of-motion masks for the video datasets and put them under `downloads/masks/` (e.g., `downloads/masks/mask_epickitchen`). Then change `mask_path` under the `VideoJsonDataset` class in `utils/dataset.py`.
- In your config in `example/train_mask_motion.yaml`, make sure to set `dataset_types` to `video_json` and set `output_dir`, `train_data:video_dir`, and `train_data:video_json` like this:
dataset_types:
  - video_json
train_data:
  video_dir: '/path/to/your/video_directory'
  video_json: '/path/to/your/json_file.json'
- Run the following command to fine-tune. This config requires around 30 GB of GPU RAM. You can reduce `train_batch_size`, `train_data.width`, `train_data.height`, and `n_sample_frames` in the config to reduce GPU RAM usage:
python train.py --config example/train_mask_motion.yaml pretrained_model_path=downloads/weights/animate_anything_512_v1.02
- Create your own dataset. Simply place the videos into a folder and create a JSON file with captions like this:
[
{"caption": "The person uses their left hand to pick up a plate with a piece of chicken on it.", "video": "1.mp4"},
{"caption": "The person holds a plate with the left hand and places it down on the cupboard, while the right hand holds a paper.", "video": "2.mp4"}
]
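If you have many videos, the caption file above can be stubbed out programmatically. Below is a minimal sketch; `build_caption_json` is a hypothetical helper (not part of this repo), and the placeholder captions must be replaced with real action descriptions afterwards:

```python
import json
from pathlib import Path

def build_caption_json(video_dir, out_path):
    """Scan a folder of .mp4 files and emit the caption JSON the trainer expects.

    Captions are placeholders -- replace them with real action descriptions.
    """
    entries = [
        {"caption": "TODO: describe the action", "video": p.name}
        for p in sorted(Path(video_dir).glob("*.mp4"))
    ]
    Path(out_path).write_text(json.dumps(entries, indent=2))
    return entries
```

Sorting the filenames keeps the JSON stable across runs, which makes it easier to diff after you fill in the captions.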
- Download the pretrained model to `output/latent`.
- Create region-of-motion masks for your own videos by running the following command:
python mask_video.py --video_dir /path/to/video_directory --save_dir /path/to/output_directory
- Follow steps 4 and 5 in the previous section.
I highly recommend using multiple GPUs for training with Accelerate, as it significantly reduces VRAM requirements. First, configure Accelerate with DeepSpeed; an example configuration file can be found at `example/deepspeed.yaml`.
Next, replace the `python train_xx.py ...` commands mentioned earlier with `accelerate launch train_xx.py ...`. For instance:
accelerate launch --config_file example/deepspeed.yaml train.py --config example/train_mask_motion.yaml
Please download the pretrained model to the folder `downloads/weights/` (e.g., `downloads/weights/IVG.1.0`), then run the following command:
python train.py --config downloads/weights/IVG.1.0/config.yaml --eval validation_data.prompt_image=example/Julienne_carrot.png validation_data.prompt='The person holds a carrot on the chopping board with the left hand and uses a knife in the right hand to julienne the carrot.'

To control the motion area, we use the provided script `mask_video.py`. Update the input and output video folder paths as needed, and run the following command:
python mask_video.py --video_dir /path/to/video_directory --save_dir /path/to/output_directory

Below are examples of an input image and its corresponding RoM mask:
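For intuition, a region-of-motion mask can be thought of as the set of pixels that change over the clip. The sketch below illustrates that idea with simple frame differencing; it is an assumption-laden toy (`motion_mask` is a hypothetical helper), not the actual `mask_video.py` implementation:

```python
import numpy as np

def motion_mask(frames, threshold=10):
    """Toy region-of-motion mask via frame differencing (illustration only;
    the repository's mask_video.py is the authoritative implementation).

    frames: uint8 array of shape (T, H, W); returns a binary (H, W) uint8 mask.
    """
    frames = frames.astype(np.int16)            # avoid uint8 wraparound
    diffs = np.abs(np.diff(frames, axis=0))     # per-frame pixel change
    mask = diffs.max(axis=0) > threshold        # any motion across the clip
    return mask.astype(np.uint8) * 255
```

A real RoM mask would additionally smooth and dilate this result so the mask covers the whole task-relevant region, not just the changed pixels.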
Then run the following command for inference:
python train.py --config output/latent/{download_model}/config.yaml --eval validation_data.prompt_image=example/Julienne_carrot.png validation_data.prompt='The person holds a carrot on the chopping board with the left hand and uses a knife in the right hand to julienne the carrot.' validation_data.mask=example/carrot_mask.jpg

To evaluate the model on multiple examples, we provide a script for multi-sample inference along with a small random subset of the test dataset (`downloads/test/source`, `downloads/test/masks`, `downloads/test/prompt.json`). The results will be generated under `downloads/test/result`.
Then run the following command for multi-sample inference:
python evaluation.py --eval --config downloads/weights/IVG.1.0/config.yaml --image_folder downloads/test --prompt_file downloads/test/prompt.json --mask_folder downloads/test/masks --output_folder downloads/test/result

The configuration uses a YAML format borrowed from the Tune-A-Video repository.
All configuration details are in `example/train_mask_motion.yaml`; each parameter is documented with a description of what it does.
Please cite this paper if you find the code useful for your research:
@article{li2024handi,
title={HANDI: Hand-Centric Text-and-Image Conditioned Video Generation},
author={Li, Yayuan and Cao, Zhi and Corso, Jason J},
journal={arXiv preprint arXiv:2412.04189},
year={2024}
}