HANDI: Hand-Centric Text-and-Image Conditioned Video Generation

Yayuan Li*, Zhi Cao*, Jason J. Corso

COG Research Group, University of Michigan

We enhance instructional video generation with a diffusion-based framework, achieving state-of-the-art results in hand motion clarity and task-specific region localization. Visit the project page for more video results.

If you find our project helpful, please give it a star ⭐ or cite it; we would be very grateful 💖.

Showcases

[Showcase videos: input image / generated result pairs]

  • Action Description: Knit the fabric.
  • Action Description: Roll dough.
  • Action Description: Pour vinegar into bowl.
  • Action Description: Pick up and crack egg.

Framework

[Framework overview figure]

News 🔥

2024.12.9: Released inference code

2025.2.19: Released training/finetuning code

Features Planned

  • 💥 Updated model weights (coming soon)
  • 💥 Resolving the camera movement issue via data preprocessing
  • 💥 Hugging Face Demo / Google Colab support
  • etc.

Getting Started

This repository is based on animate-anything.

Create Conda Environment (Optional)

It is recommended to install Anaconda.

Windows Installation: https://docs.anaconda.com/anaconda/install/windows/

Linux Installation: https://docs.anaconda.com/anaconda/install/linux/

conda create -n IVG python=3.10
conda activate IVG

Python Requirements

cd Instructional-Video-Generation-IVG
pip install -r requirements.txt

PyTorch Dependencies

 pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121

💥 Training / Fine-tuning

Fine-tuning on EPIC-KITCHENS/EGO4D dataset

  1. Download our video data, which consists of preprocessed subsets of EPIC-KITCHENS/EGO4D, along with the corresponding prompt files. Put them under downloads/dataset/ (e.g., downloads/dataset/video_epickitchen, downloads/dataset/prompt_epickitchen.json).
  2. Download the pretrained model to folder downloads/weights/ (e.g., downloads/weights/animate_anything_512_v1.02).
  3. Download our region-of-motion masks for the video datasets and put them under downloads/masks/ (e.g., downloads/masks/mask_epickitchen). Then change mask_path in the VideoJsonDataset class in utils/dataset.py.
  4. In your config in example/train_mask_motion.yaml, make sure to set dataset_types to video_json and set output_dir, train_data:video_dir, and train_data:video_json like this:
  - dataset_types: 
      - video_json
    train_data:
      video_dir: '/path/to/your/video_directory'
      video_json: '/path/to/your/json_file.json'
  5. Run the following command to fine-tune. This config requires around 30 GB of GPU RAM. You can reduce train_batch_size, train_data.width, train_data.height, and n_sample_frames in the config to lower GPU RAM usage:
python train.py --config example/train_mask_motion.yaml pretrained_model_path=downloads/weights/animate_anything_512_v1.02
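The mask_path change from step 3 can be sketched as follows. This is a hypothetical excerpt: the real VideoJsonDataset in utils/dataset.py has many more fields and arguments, and only the mask_path default matters here.

```python
# Hypothetical excerpt of utils/dataset.py; the real class has more arguments.
class VideoJsonDataset:
    def __init__(self, video_dir, video_json,
                 mask_path="downloads/masks/mask_epickitchen"):
        self.video_dir = video_dir
        self.video_json = video_json
        # Point this at the downloaded region-of-motion masks.
        self.mask_path = mask_path
```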

Fine-tuning on your own dataset

  1. Create your own dataset: place the videos in a folder and create a JSON file with captions like this:
[
      {"caption": "The person uses their left hand to pick up a plate with a piece of chicken on it.", "video": "1.mp4"}, 
      {"caption": "The person holds a plate with the left hand and places it down on the cupboard, while the right hand holds a paper.", "video": "2.mp4"}
]

  2. Download the pretrained model to output/latent.
  3. Create region-of-motion masks for your own videos by running the following command:
python mask_video.py --video_dir /path/to/video_directory --save_dir /path/to/output_directory

Then follow steps 4 and 5 of the previous section.
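The caption file from step 1 can also be generated programmatically. A minimal sketch, where the helper name, folder scan, and captions are hypothetical (not part of the repository):

```python
import json
from pathlib import Path

def build_caption_file(video_dir, captions, out_path):
    """Write a caption JSON in the expected format: a list of
    {"caption": ..., "video": ...} entries, one per video file."""
    entries = [{"caption": captions[p.name], "video": p.name}
               for p in sorted(Path(video_dir).glob("*.mp4"))]
    Path(out_path).write_text(json.dumps(entries, indent=2))
    return entries
```

Point video_dir at your video folder and captions at a filename-to-caption mapping; the resulting JSON is what train_data:video_json should reference.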

Multiple GPUs Training

We highly recommend training on multiple GPUs with Accelerate, as it significantly reduces per-GPU VRAM requirements. First, configure Accelerate with DeepSpeed; an example configuration file can be found at example/deepspeed.yaml.

Next, replace the 'python train_xx.py ...' commands mentioned earlier with 'accelerate launch train_xx.py ...'. For instance:

accelerate launch --config_file example/deepspeed.yaml  train.py --config example/train_mask_motion.yaml
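For reference, an Accelerate-plus-DeepSpeed configuration typically looks roughly like the following. This is a hedged sketch; the actual keys and values in the repository's example/deepspeed.yaml may differ:

```yaml
# Hypothetical sketch; see example/deepspeed.yaml for the actual config.
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2                    # ZeRO partitioning of optimizer state
  gradient_accumulation_steps: 1
num_processes: 2                   # one process per GPU
mixed_precision: fp16
```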

💫 Inference

Please download the pretrained model to the folder downloads/weights/ (e.g., downloads/weights/IVG.1.0). Then run the following command:

python train.py --config downloads/weights/IVG.1.0/config.yaml --eval validation_data.prompt_image=example/Julienne_carrot.png validation_data.prompt='The person holds a carrot on the chopping board with the left hand and uses a knife in the right hand to julienne the carrot.'

To control the motion area, use the provided script mask_video.py: update the input and output video folder paths as needed and run the following command:

python mask_video.py --video_dir /path/to/video_directory --save_dir /path/to/output_directory

Below are examples of an input image and its corresponding RoM mask:

[Original video frame | generated RoM mask]

Then run the following command for inference:

python train.py --config output/latent/{download_model}/config.yaml --eval validation_data.prompt_image=example/Julienne_carrot.png validation_data.prompt='The person holds a carrot on the chopping board with the left hand and uses a knife in the right hand to julienne the carrot.' validation_data.mask=example/carrot_mask.jpg 

[Inference result]

Multi-Sample Inference

To evaluate the model on multiple examples, we provide a script for multi-sample inference along with a small random subset of the test dataset (downloads/test/source, downloads/test/masks, downloads/test/prompt.json). The results will be generated under downloads/test/result.

Then run the following command for multi-sample inference:

python evaluation.py --eval --config downloads/weights/IVG.1.0/config.yaml --image_folder downloads/test --prompt_file downloads/test/prompt.json --mask_folder downloads/test/masks --output_folder downloads/test/result
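After the run finishes, a quick way to confirm that outputs landed in downloads/test/result is a small helper like the one below. The helper name and the .mp4 extension are assumptions for illustration, not part of the repository:

```python
from pathlib import Path

def count_results(result_dir, pattern="*.mp4"):
    """Count generated clips under the multi-sample output folder
    (hypothetical helper; the output extension is an assumption)."""
    p = Path(result_dir)
    return len(list(p.glob(pattern))) if p.is_dir() else 0
```

The count should match the number of entries in downloads/test/prompt.json once all samples have been generated.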

Configuration

The configuration uses a YAML format borrowed from the Tune-A-Video repository.

All configuration details live in example/train_mask_motion.yaml, and each parameter is documented there.
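Pulling together the keys mentioned throughout this README, a training config might look roughly like this. The paths and numeric values below are illustrative placeholders, not the repository defaults:

```yaml
# Illustrative excerpt; see example/train_mask_motion.yaml for real defaults.
pretrained_model_path: downloads/weights/animate_anything_512_v1.02
output_dir: output/latent
dataset_types:
  - video_json
train_data:
  video_dir: downloads/dataset/video_epickitchen
  video_json: downloads/dataset/prompt_epickitchen.json
  width: 512            # reduce to lower GPU RAM usage
  height: 512
  n_sample_frames: 16
train_batch_size: 1     # reduce to lower GPU RAM usage
```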

Bibtex

Please cite this paper if you find the code useful for your research:

@article{li2024handi,
  title={HANDI: Hand-Centric Text-and-Image Conditioned Video Generation},
  author={Li, Yayuan and Cao, Zhi and Corso, Jason J},
  journal={arXiv preprint arXiv:2412.04189},
  year={2024}
}
