# minisora: A Minimal & Scalable PyTorch Implementation of DiT Video Generation
minisora is a minimalist, educational, yet scalable re-implementation of the training process behind OpenAI's Sora (Diffusion Transformers). This project aims to strip away the complexity of large-scale video generation codebases while maintaining the ability to train on multi-node clusters.
It leverages ColossalAI for distributed training efficiency and Diffusers for a standardized inference pipeline.
## Features

- 🚀 Scalable Training: Built on ColossalAI to support multi-node, multi-GPU training out of the box.
- 🧩 Simple & Educational: The codebase is designed to be readable and hackable, avoiding the "spaghetti code" common in research repos.
- 🎬 Video Continuation: Supports not just text-to-video, but also extending existing video clips (autoregressive-style generation in latent space).
- 🛠️ Modern Tooling: Uses `uv` for fast dependency management and Docker for reproducible environments.
## Architecture

minisora implements a Latent Diffusion Transformer (DiT). It processes video latents as a sequence of patches, handling both spatial and temporal dimensions via attention mechanisms.
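To make the patch-sequence idea concrete, here is a minimal illustrative sketch (not the actual minisora module) of flattening a video latent of shape (B, C, F, H, W) into a token sequence that spatial and temporal attention can then operate on. The layer names and sizes here are assumptions, not values taken from the codebase.

```python
import torch
import torch.nn as nn

# Illustrative sketch only -- not the actual minisora module.
# Assumes a video latent of shape (B, C, F, H, W) and square spatial patches.
class VideoPatchify(nn.Module):
    def __init__(self, in_channels=4, patch_size=2, hidden_dim=384):
        super().__init__()
        # Each (patch_size x patch_size) spatial patch of every frame becomes one token.
        self.proj = nn.Conv2d(in_channels, hidden_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, latents):  # (B, C, F, H, W)
        b, c, f, h, w = latents.shape
        x = latents.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)  # fold frames into the batch
        x = self.proj(x)                                            # (B*F, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)                            # (B*F, N_spatial, D)
        return x.reshape(b, f * x.shape[1], -1)                     # (B, F*N_spatial, D)

tokens = VideoPatchify()(torch.randn(1, 4, 20, 8, 8))
print(tokens.shape)  # torch.Size([1, 320, 384]) -- attention then runs over this sequence
```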
The library is organized to separate the model definition from the training logic:

- `minisora/models`: Contains the DiT implementation and pipeline logic.
- `minisora/data`: Data loading logic for the DMLab and Minecraft datasets.
- `scripts/`: Training and inference entry points.
## Pretrained Models

| Model Name | Dataset | Resolution | Frames | Download |
|---|---|---|---|---|
| minisora-dmlab | DeepMind Lab | 64x64 | 20 | 🤗 Hugging Face (`ramu0e/minisora-dmlab`) |
| minisora-minecraft | Minecraft |  | 20 | (Coming Soon) |
## Installation

We recommend using `uv` for lightning-fast dependency management.

```bash
git clone https://github.com/YN35/minisora
cd minisora

# Install dependencies including dev tools
uv sync --dev
```

## Inference

You can generate video using the pre-trained weights hosted on Hugging Face:

```python
from minisora.models import DiTPipeline
# Load the pipeline
pipeline = DiTPipeline.from_pretrained("ramu0e/minisora-dmlab")
# Run inference
output = pipeline(
batch_size=1,
num_inference_steps=28,
height=64,
width=64,
num_frames=20,
)
# Access the latents or decode them
latents = output.latents # shape: (B, C, F, H, W)
print(f"Generated video latents shape: {latents.shape}")# Random unconditional generation
You can also run the bundled demo scripts for end-to-end generation:

```bash
# Random unconditional generation
uv run scripts/demo/full_vgen.py
# Continuation (fixing the first frame and generating the rest)
uv run scripts/demo/full_continuation.py
```
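Conceptually, continuation keeps the conditioning frames fixed in latent space and only denoises the remaining frames. The sketch below illustrates that idea using Diffusers-style scheduler calls; it is not minisora's actual implementation, and `model`, `scheduler`, and their call signatures are stand-ins for illustration.

```python
import torch

def continue_video(model, scheduler, known_latents, num_new_frames, num_inference_steps=28):
    """Illustrative sketch of latent-space video continuation (not the minisora API)."""
    b, c, f_known, h, w = known_latents.shape
    # Start from the known frames plus pure noise for the frames to be generated.
    latents = torch.cat(
        [known_latents, torch.randn(b, c, num_new_frames, h, w, device=known_latents.device)],
        dim=2,
    )
    scheduler.set_timesteps(num_inference_steps)
    for t in scheduler.timesteps:
        # Keep the conditioning frames on the forward-diffusion trajectory at noise level t.
        latents[:, :, :f_known] = scheduler.add_noise(
            known_latents, torch.randn_like(known_latents), t
        )
        noise_pred = model(latents, t)  # hypothetical model call
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    # Paste the clean conditioning frames back in at the end.
    latents[:, :, :f_known] = known_latents
    return latents
```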
## Docker

We provide a containerized workflow to ensure reproducibility.

Start the development container:

```bash
docker compose up -d
```

Tip: You can mount your local data directories by editing `docker-compose.override.yml`:

```yaml
services:
  minisora:
    volumes:
      - .:/workspace/minisora
      - /path/to/your/data:/data
```
## Datasets

Download the sample datasets (DMLab or Minecraft) to your data directory:

```bash
# Example: Downloading DMLab dataset
uv run bash scripts/download/dmlab.sh /data/minisora
# Example: Downloading Minecraft dataset
uv run bash scripts/download/minecraft.sh /data/minisora
```

## Training

Training is launched via `torchrun`. The following command starts a single-node training job:

```bash
# Set your GPU ID
export CUDA_VISIBLE_DEVICES=0
# Start training
nohup uv run torchrun --standalone --nnodes=1 --nproc_per_node=1 \
    scripts/train.py --dataset_type=dmlab > outputs/train.log 2>&1 &
```

You can monitor the progress in `outputs/train.log`. Change `--dataset_type` to `minecraft` to train on the Minecraft dataset.
## Roadmap

- Basic DiT Implementation
- Integration with Diffusers Pipeline
- Multi-node training with ColossalAI
- Video Continuation support
## Acknowledgements

- ColossalAI: For making distributed training accessible.
- Diffusers: For the robust diffusion pipeline structure.
- DiT Paper: Scalable Diffusion Models with Transformers.
## License

This project is licensed under the MIT License. See LICENSE for details.