We aim to enhance instructional video generation with a diffusion-based framework, achieving state-of-the-art results in hand motion clarity and task-specific region localization. Visit our project page for more video results.
If you find our project helpful, please give it a star ⭐ or cite it; we would be very grateful 🙏.
*Example results (input image → generated video), with action descriptions: "Knit the fabric.", "Roll dough.", "Pour vinegar into bowl.", "Pick up and crack egg."*
2024.12.9: Released inference code
2025.2.19: Released training/finetuning code
- 🔥 Updated model weights (coming soon)
- 🔥 Solving camera movement issue: data preprocessing
- 🔥 Support Hugging Face Demo / Google Colab
- etc.
This repository is based on animate-anything.
It is recommended to install Anaconda.
Windows Installation: https://docs.anaconda.com/anaconda/install/windows/
Linux Installation: https://docs.anaconda.com/anaconda/install/linux/
conda create -n IVG python=3.10
conda activate IVG
cd Instructional-Video-Generation-IVG
pip install -r requirements.txt
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
- Download our video data, which are preprocessed subsets of EPIC-KITCHENS/EGO4D, and download the corresponding prompt files. Put them under `downloads/dataset/` (e.g., `downloads/dataset/video_epickitchen`, `downloads/dataset/prompt_epickitchen.json`).
- Download the pretrained model to the folder `downloads/weights/` (e.g., `downloads/weights/animate_anything_512_v1.02`).
- Download our region-of-motion masks for the video datasets and put them under `downloads/masks/` (e.g., `downloads/masks/mask_epickitchen`). Then change `mask_path` under the `VideoJsonDataset` class in `utils/dataset.py`.
- In your config in `example/train_mask_motion.yaml`, make sure to set `dataset_types` to `video_json` and set `output_dir`, `train_data:video_dir`, and `train_data:video_json` like this:
dataset_types:
  - video_json
train_data:
  video_dir: '/path/to/your/video_directory'
  video_json: '/path/to/your/json_file.json'
- Run the following command to fine-tune. This config requires around 30 GB of GPU RAM. You can reduce `train_batch_size`, `train_data.width`, `train_data.height`, and `n_sample_frames` in the config to reduce GPU RAM usage:
python train.py --config example/train_mask_motion.yaml pretrained_model_path=downloads/weights/animate_anything_512_v1.02
- Create your own dataset. Simply place the videos into a folder and create a JSON file with captions like this:
[
{"caption": "The person uses their left hand to pick up a plate with a piece of chicken on it.", "video": "1.mp4"},
{"caption": "The person holds a plate with the left hand and places it down on the cupboard, while the right hand holds a paper.", "video": "2.mp4"}
]
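If you have many videos, the caption file above can be stubbed out programmatically. Below is a minimal sketch; `build_caption_json` is a hypothetical helper (not part of this repo), and the placeholder captions must be replaced with real action descriptions afterwards:

```python
import json
from pathlib import Path

def build_caption_json(video_dir, out_path):
    """Scan a folder of .mp4 files and emit the caption JSON the trainer expects.

    Captions are placeholders -- replace them with real action descriptions.
    """
    entries = [
        {"caption": "TODO: describe the action", "video": p.name}
        for p in sorted(Path(video_dir).glob("*.mp4"))
    ]
    Path(out_path).write_text(json.dumps(entries, indent=2))
    return entries
```

Sorting the filenames keeps the JSON stable across runs, which makes it easier to diff after you fill in the captions.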
- Download the pretrained model to `output/latent`.
- Create region-of-motion masks for your own videos by running the following command:
python mask_video.py --video_dir /path/to/video_directory --save_dir /path/to/output_directory
- Follow steps 4 and 5 in the previous section.
I highly recommend using multiple GPUs for training with Accelerate, as it significantly reduces VRAM requirements. First, configure Accelerate with DeepSpeed; an example configuration file can be found at `example/deepspeed.yaml`.
Next, replace the `python train_xx.py ...` commands mentioned earlier with `accelerate launch train_xx.py ...`. For instance:
accelerate launch --config_file example/deepspeed.yaml train.py --config example/train_mask_motion.yaml
Please download the pretrained model to the folder `downloads/weights/` (e.g., `downloads/weights/IVG.1.0`), then run the following command:
python train.py --config downloads/weights/IVG.1.0/config.yaml --eval validation_data.prompt_image=example/Julienne_carrot.png validation_data.prompt='The person holds a carrot on the chopping board with the left hand and uses a knife in the right hand to julienne the carrot.'

To control the motion area, we use the provided script `mask_video.py`. Update the input and output video folder paths as needed, and run the following command:
python mask_video.py --video_dir /path/to/video_directory --save_dir /path/to/output_directory

Below are examples of an input image and its corresponding RoM mask:
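For intuition, a region-of-motion mask can be thought of as the set of pixels that change over the clip. The sketch below illustrates that idea with simple frame differencing; it is an assumption-laden toy (`motion_mask` is a hypothetical helper), not the actual `mask_video.py` implementation:

```python
import numpy as np

def motion_mask(frames, threshold=10):
    """Toy region-of-motion mask via frame differencing (illustration only;
    the repository's mask_video.py is the authoritative implementation).

    frames: uint8 array of shape (T, H, W); returns a binary (H, W) uint8 mask.
    """
    frames = frames.astype(np.int16)            # avoid uint8 wraparound
    diffs = np.abs(np.diff(frames, axis=0))     # per-frame pixel change
    mask = diffs.max(axis=0) > threshold        # any motion across the clip
    return mask.astype(np.uint8) * 255
```

A real RoM mask would additionally smooth and dilate this result so the mask covers the whole task-relevant region, not just the changed pixels.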
Then run the following command for inference:
python train.py --config output/latent/{download_model}/config.yaml --eval validation_data.prompt_image=example/Julienne_carrot.png validation_data.prompt='The person holds a carrot on the chopping board with the left hand and uses a knife in the right hand to julienne the carrot.' validation_data.mask=example/carrot_mask.jpg

To evaluate the model on multiple examples, we provide a script for multi-sample inference along with a small random subset of the test dataset (`downloads/test/source`, `downloads/test/masks`, `downloads/test/prompt.json`). The results will be generated under `downloads/test/result`.
Then run the following command for multi-sample inference:
python evaluation.py --eval --config downloads/weights/IVG.1.0/config.yaml --image_folder downloads/test --prompt_file downloads/test/prompt.json --mask_folder downloads/test/masks --output_folder downloads/test/result

The configuration uses a YAML format borrowed from the Tune-A-Video repository.
All configuration details are in `example/train_mask_motion.yaml`; each parameter is documented with a description of what it does.
Please cite this paper if you find the code useful for your research:
@article{li2024handi,
title={HANDI: Hand-Centric Text-and-Image Conditioned Video Generation},
author={Li, Yayuan and Cao, Zhi and Corso, Jason J},
journal={arXiv preprint arXiv:2412.04189},
year={2024}
}