minisora: A minimal & scalable PyTorch re-implementation of the OpenAI Sora training process

minisora logo (placeholder)

A Minimal & Scalable PyTorch Implementation of DiT Video Generation



πŸ“– Introduction

minisora is a minimalist, educational, yet scalable re-implementation of the training process behind OpenAI's Sora (Diffusion Transformers). This project aims to strip away the complexity of large-scale video generation codebases while maintaining the ability to train on multi-node clusters.

It leverages ColossalAI for distributed training efficiency and Diffusers for a standardized inference pipeline.

✨ Key Features

  • πŸš€ Scalable Training: Built on ColossalAI to support multi-node, multi-GPU training out of the box.
  • 🧩 Simple & Educational: The codebase is designed to be readable and hackable, avoiding the "spaghetti code" common in research repos.
  • 🎬 Video Continuation: Supports not just text-to-video, but also extending existing video clips (autoregressive-style generation in latent space).
  • πŸ› οΈ Modern Tooling: Uses uv for fast dependency management and Docker for reproducible environments.

πŸŽ₯ Demos

Demo videos: random unconditional video generation and video continuation.

πŸ—οΈ Architecture

minisora implements a Latent Diffusion Transformer (DiT). It processes video latents as a sequence of patches, handling both spatial and temporal dimensions via attention mechanisms.

Diffusion Transformer architecture
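
To make the patch-sequence idea concrete, here is a minimal sketch of the patchification step (illustrative only, not the repo's actual module; the class name, patch size, and hidden dimension are assumptions) showing how (B, C, F, H, W) latents become a token sequence:

import torch
import torch.nn as nn

class VideoPatchify(nn.Module):
    """Flatten (B, C, F, H, W) latents into (B, N, D) tokens, N = F * (H/p) * (W/p)."""
    def __init__(self, in_channels=4, patch_size=2, dim=384):
        super().__init__()
        # A 3D conv with temporal kernel 1 patchifies each frame spatially.
        self.proj = nn.Conv3d(
            in_channels, dim,
            kernel_size=(1, patch_size, patch_size),
            stride=(1, patch_size, patch_size),
        )

    def forward(self, latents):
        x = self.proj(latents)               # (B, D, F, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, N, D) token sequence

latents = torch.randn(1, 4, 20, 8, 8)  # e.g. 64x64 frames after an 8x VAE downsample
tokens = VideoPatchify()(latents)
print(tokens.shape)  # torch.Size([1, 320, 384])

The attention layers then operate over this flattened spatiotemporal sequence, so a single transformer handles both spatial and temporal dependencies.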

The library is organized to separate the model definition from the training logic:

  • minisora/models: Contains the DiT implementation and pipeline logic.
  • minisora/data: Data loading logic for DMLab and Minecraft datasets.
  • scripts/: Training and inference entry points.

⬇️ Model Zoo

Model Name          Dataset        Resolution  Frames  Download
minisora-dmlab      DeepMind Lab   64×64       20      🤗 Hugging Face
minisora-minecraft  Minecraft      128×128     20      (Coming Soon)

πŸš€ Quick Start

Installation

We recommend using uv for lightning-fast dependency management.

git clone https://github.com/YN35/minisora
cd minisora

# Install dependencies including dev tools
uv sync --dev
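
A quick import check confirms the environment is usable (assuming the package is importable as minisora, as the inference example below shows):

uv run python -c "import minisora, torch; print(torch.__version__)"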

Inference (Python)

You can generate video using the pre-trained weights hosted on Hugging Face.

from minisora.models import DiTPipeline

# Load the pipeline
pipeline = DiTPipeline.from_pretrained("ramu0e/minisora-dmlab")

# Run inference
output = pipeline(
    batch_size=1,
    num_inference_steps=28,
    height=64,
    width=64,
    num_frames=20,
)

# Access the latents or decode them
latents = output.latents  # shape: (B, C, F, H, W)
print(f"Generated video latents shape: {latents.shape}")

Run Demos

# Random unconditional generation
uv run scripts/demo/full_vgen.py

# Continuation (fixing the first frame and generating the rest)
uv run scripts/demo/full_continuation.py
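
Conceptually, continuation reuses the standard sampler but pins the known frames' latents at every denoising step. The sketch below is illustrative only (the scheduler calls follow the Diffusers scheduler API; the model signature and helper name are assumptions, not the script's actual internals):

import torch

def continue_video(model, scheduler, known_latents, num_frames, steps=28):
    """known_latents: (B, C, F_known, H, W) clean latents of the frames to keep."""
    b, c, f_known, h, w = known_latents.shape
    x = torch.randn(b, c, num_frames, h, w, device=known_latents.device)
    scheduler.set_timesteps(steps)
    for t in scheduler.timesteps:
        # Re-noise the known frames to the current noise level and pin them in place.
        noise = torch.randn_like(known_latents)
        x[:, :, :f_known] = scheduler.add_noise(known_latents, noise, t)
        pred = model(x, t)  # hypothetical signature: predicts the noise
        x = scheduler.step(pred, t, x).prev_sample
    x[:, :, :f_known] = known_latents  # restore the clean known frames
    return x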

πŸ‹οΈ Training

We provide a containerized workflow to ensure reproducibility.

1. Environment Setup (Docker)

Start the development container:

docker compose up -d

Tip: You can mount your local data directories by editing docker-compose.override.yml:

services:
  minisora:
    volumes:
      - .:/workspace/minisora
      - /path/to/your/data:/data
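
Once the container is up, open a shell inside it (the service name minisora comes from the compose file above):

docker compose exec minisora bash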

2. Data Preparation

Download the sample datasets (DMLab or Minecraft) to your data directory:

# Example: Downloading DMLab dataset
uv run bash scripts/download/dmlab.sh /data/minisora

# Example: Downloading Minecraft dataset
uv run bash scripts/download/minecraft.sh /data/minisora

3. Run Training Job

Training is launched via torchrun. The following command starts a single-node, single-GPU training job in the background.

# Set your GPU ID
export CUDA_VISIBLE_DEVICES=0

# Start training (create the log directory first, since the redirect needs it)
mkdir -p outputs
nohup uv run torchrun --standalone --nnodes=1 --nproc_per_node=1 \
  scripts/train.py --dataset_type=dmlab > outputs/train.log 2>&1 &

You can monitor the progress in outputs/train.log. Change --dataset_type to minecraft to train on the Minecraft dataset.
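
Since the trainer targets multi-GPU, multi-node setups via ColossalAI, scaling out should mostly be a matter of the standard torchrun flags. A sketch (the rendezvous host is a placeholder you must fill in):

# Single node, 4 GPUs: torchrun starts one rank per GPU
export CUDA_VISIBLE_DEVICES=0,1,2,3
uv run torchrun --standalone --nnodes=1 --nproc_per_node=4 \
  scripts/train.py --dataset_type=dmlab

# Two nodes: run the same command on each node, pointing at a shared rendezvous host
uv run torchrun --nnodes=2 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=<master-host>:29500 \
  scripts/train.py --dataset_type=dmlab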


πŸ—“οΈ Todo & Roadmap

  • Basic DiT Implementation
  • Integration with Diffusers Pipeline
  • Multi-node training with ColossalAI
  • Video Continuation support

🀝 Acknowledgements

  • ColossalAI: For making distributed training accessible.
  • Diffusers: For the robust diffusion pipeline structure.
  • DiT Paper: Scalable Diffusion Models with Transformers.

πŸ“„ License

This project is licensed under the MIT License. See LICENSE for details.
