Dresty is an advanced virtual try-on system that leverages Stable Diffusion 3 to generate realistic images of garments worn by a model. It lets users visualize how different clothing items would look on a person, producing high-quality results that preserve the model's original pose and appearance.
- Multi-category support: Try on upper-body garments (shirts, tops, sweaters), lower-body garments (skirts, pants), and dresses
- High-quality generation: Powered by Stable Diffusion 3 for realistic and detailed results
- Pose preservation: Maintains the original pose of the model through advanced pose guiding
- Interactive masking: Adjust mask parameters to fine-tune the try-on area
- Customizable generation: Control parameters like steps, guidance scale, and output resolution
- User-friendly interface: Simple two-step process with a Gradio web UI
- Deep Learning Framework: PyTorch, Torchvision
- Diffusion Models: Stable Diffusion 3 (customized for virtual try-on)
- Computer Vision: DWpose for pose detection, Human Parsing for segmentation
- Image Encoding: CLIP Vision models (large and bigG variants)
- Web Interface: Gradio
- Additional Libraries: OpenCV, scikit-image, NumPy, Einops
- CUDA-compatible GPU (recommended)
- Python 3.10
- Git LFS (for downloading models)
Run the setup script to create a conda environment and download required models:
```bash
bash setup.sh
```
Alternatively, set up the environment manually:

1. Create and activate a conda environment:

   ```bash
   conda create -y -n viton python=3.10
   conda activate viton
   ```

2. Install PyTorch with CUDA support:

   ```bash
   conda install -y pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
   ```

3. Install the remaining dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Download the models from Hugging Face:

   ```bash
   git lfs install
   git clone https://huggingface.co/BoyuanJiang/Dresty ~/viton_models
   ```
Launch the application, pointing it at the downloaded models:

```bash
python app.py --model_path /path/to/models
```
Additional options:

- `--device cpu`: Run on CPU (much slower)
- `--fp16`: Use FP16 precision (faster but may be less accurate)
- `--offload`: Offload models to CPU when not in use (saves GPU memory)
- `--aggressive_offload`: Offload models more aggressively (for low-memory GPUs)
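For orientation, the flags above could be wired up roughly like this with `argparse` (a minimal sketch; the actual argument handling lives in `app.py` and may differ):

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Dresty virtual try-on")
    parser.add_argument("--model_path", required=True,
                        help="Directory containing the downloaded model weights")
    parser.add_argument("--device", default="cuda",
                        help="Device to run on, e.g. 'cuda' or 'cpu'")
    parser.add_argument("--fp16", action="store_true",
                        help="Use FP16 precision (faster, possibly less accurate)")
    parser.add_argument("--offload", action="store_true",
                        help="Offload idle models to CPU to save GPU memory")
    parser.add_argument("--aggressive_offload", action="store_true",
                        help="Offload models more aggressively for low-memory GPUs")
    return parser.parse_args()
```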
The virtual try-on process consists of two main steps:
- Select a model image from the examples or upload your own
- Select a garment image from the examples or upload your own
- Choose the appropriate garment category (Upper-body, Lower-body, or Dresses)
- Adjust mask offsets if needed (top, bottom, left, right)
- Click "Step 1: Run Mask"
- Adjust generation parameters if desired:
  - Steps: number of diffusion steps (higher = better quality but slower)
  - Guidance scale: controls adherence to the input garment (higher = more faithful to the garment's style)
  - Seed: random seed for reproducibility (-1 for random)
  - Number of images: how many variations to generate
  - Resolution: output image resolution
- Click "Step 2: Run Try-on"
- View and save the generated results
1. Input Processing:
   - The model image and garment image are loaded and preprocessed
   - Human parsing identifies body parts in the model image
   - DWpose detects keypoints for pose information
2. Mask Generation:
   - Based on the garment category and body keypoints
   - Creates a mask indicating where the new garment will be placed
   - Users can adjust the mask boundaries with offset parameters (see the mask-offset sketch after this list)
3. Pose Encoding:
   - Extracts pose features from the model image
   - Helps maintain the person's pose in the final output (see the pose-map sketch below)
4. Garment Processing:
   - The garment image is encoded using CLIP vision models
   - Creates embeddings that guide the generation process (see the CLIP encoding sketch below)
5. Try-On Generation:
   - Combines the masked model image, garment image, pose features, and mask
   - Runs the Stable Diffusion 3 pipeline with two specialized transformers:
     - `transformer_garm`: processes the garment image
     - `transformer_vton`: handles the virtual try-on
   - Generates the final image with the garment realistically fitted on the model
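Mask-offset sketch: to illustrate the idea only (this is not the project's `utils_mask.py` code), growing a binary mask's bounding box by per-side pixel offsets could look like:

```python
import numpy as np

def expand_mask(mask: np.ndarray, top=0, bottom=0, left=0, right=0) -> np.ndarray:
    """Toy illustration: grow a 2D binary mask's bounding box by per-side offsets."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return mask
    h, w = mask.shape
    y0, y1 = max(ys.min() - top, 0), min(ys.max() + bottom, h - 1)
    x0, x1 = max(xs.min() - left, 0), min(xs.max() + right, w - 1)
    out = np.zeros_like(mask)
    out[y0:y1 + 1, x0:x1 + 1] = 1
    return out
```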
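Pose-map sketch: as general background on pose conditioning (Dresty's own logic lives in `src/pose_guider.py`), detected keypoints are typically rasterized into a guidance image, for example:

```python
import numpy as np
import cv2

def render_pose_map(keypoints, size=(512, 512)):
    """Toy example: draw body keypoints as dots on a blank conditioning image."""
    canvas = np.zeros((size[0], size[1], 3), dtype=np.uint8)
    for x, y in keypoints:  # keypoints given in pixel coordinates
        cv2.circle(canvas, (int(x), int(y)), 4, (0, 255, 0), -1)  # filled green dot
    return canvas
```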
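CLIP encoding sketch: a minimal example of CLIP vision encoding with the Hugging Face `transformers` library. The checkpoint name and image path are placeholders, not the exact large/bigG variants Dresty ships:

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Placeholder checkpoint; Dresty uses its own large and bigG CLIP vision variants.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")

garment = Image.open("examples/garment/example.jpg").convert("RGB")  # path is illustrative
inputs = processor(images=garment, return_tensors="pt")
with torch.no_grad():
    embeds = encoder(**inputs).image_embeds  # pooled garment embedding
```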
The preprocessing components in the `preprocess` directory play a crucial role in the virtual try-on process:
Human Parsing:

Usage in the application: the human parsing functionality is used through the `Parsing` class defined in `preprocess/humanparsing/run_parsing.py`. This class is instantiated in `src/generator.py` and used in the `generate_mask` method to segment the different parts of the human body in model images. The parsing results are then used by `get_mask_location` in `src/utils_mask.py` to create masks for the different clothing categories (Upper-body, Lower-body, Dresses).
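A hypothetical call pattern (the constructor arguments and return format here are assumptions; the real interface lives in `run_parsing.py`):

```python
from PIL import Image
from preprocess.humanparsing.run_parsing import Parsing

# Hypothetical usage: the real constructor and return format may differ.
parsing = Parsing()
model_img = Image.open("examples/model/example.jpg").convert("RGB")
parse_map = parsing(model_img)  # assumed: a per-pixel body-part label map
```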
Implementation details: the human parsing functionality uses pre-trained ONNX models (`parsing_atr.onnx` and `parsing_lip.onnx`) for inference. The actual inference is performed by the `onnx_inference` function in `parsing_api.py`, which processes images to segment the different body parts. The segmentation results are used to create masks that define where garments should be placed on the model's body.
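Generic ONNX Runtime usage for such a model looks roughly like this (the input name, shape, and preprocessing are assumptions; the real preprocessing lives in `parsing_api.py`):

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "parsing_atr.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name

# Stand-in for a preprocessed image tensor (NCHW float32; actual shape assumed).
image = np.random.rand(1, 3, 512, 512).astype(np.float32)
logits = session.run(None, {input_name: image})[0]
labels = logits.argmax(axis=1)  # per-pixel body-part class map
```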
Pose Detection (DWpose):

Usage in the application: the pose detection functionality is used through the `DWposeDetector` class defined in `preprocess/dwpose/__init__.py`. This class is instantiated in `src/generator.py` and used in the `generate_mask` method to detect body keypoints. The detected keypoints are used alongside the human parsing results to create accurate masks for virtual try-on.
Implementation details: pose detection uses pre-trained ONNX models (`yolox_l.onnx` and `dw-ll_ucoco_384.onnx`). The `Wholebody` class in `wholebody.py` handles the actual pose detection, identifying keypoints for the body, face, and hands. These keypoints are then used in `utils_mask.py` to determine the boundaries of the different body parts when creating masks.
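As an illustration of how keypoints can bound a garment region (this is not the project's `utils_mask.py` logic; COCO-style keypoint ordering is assumed):

```python
import numpy as np

def torso_bbox(keypoints: np.ndarray):
    """Bounding box spanning the shoulder and hip keypoints (COCO indices assumed)."""
    pts = keypoints[[5, 6, 11, 12]]  # left/right shoulder, left/right hip
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    return int(x0), int(y0), int(x1), int(y1)
```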
When a user uploads model and garment images in the application:

- The `DrestyGenerator.generate_mask` method uses both human parsing and pose detection to create a mask for the specified garment category; this mask defines where the garment should be placed on the model's body.
- The mask, along with the model and garment images, is then passed to the `process` method, which runs the Stable Diffusion 3 pipeline to generate the virtual try-on result.
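Putting it together, a hypothetical programmatic equivalent of the two UI steps (all argument names and return types here are assumptions; see `src/generator.py` for the real signatures):

```python
from PIL import Image
from src.generator import DrestyGenerator

generator = DrestyGenerator(model_path="/path/to/models")  # hypothetical constructor

model_img = Image.open("examples/model/example.jpg").convert("RGB")
garment_img = Image.open("examples/garment/example.jpg").convert("RGB")

# Step 1: build the inpainting mask for the chosen category.
mask = generator.generate_mask(model_img, category="Upper-body")

# Step 2: run the SD3 try-on pipeline on the masked region.
images = generator.process(model_img, garment_img, mask,
                           steps=28, guidance_scale=2.0, seed=42)
images[0].save("tryon_result.png")
```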
The preprocessing components are critical for the application's functionality as they enable:
- Accurate segmentation of the model's body parts
- Precise detection of body keypoints
- Creation of masks that define garment placement
- Generation of realistic virtual try-on results
The `DrestyGenerator` class in `src/generator.py` is responsible for:
- Human parsing and pose detection
- Mask generation based on garment category
- Image preprocessing (resizing, padding)
- CLIP encoding of garment images
- Pose feature extraction
- Running the diffusion model
- Post-processing of generated images
The Gradio UI (`src/ui.py`) lets users control:
- Selection of model image
- Selection of garment image
- Selection of garment category
- Adjustment of mask offsets
- Generation parameters (steps, guidance scale, seed, etc.)
```
Dresty/
├── examples/                                # Example images for models and garments
│   ├── garment/                             # Example garment images
│   └── model/                               # Example model images
├── preprocess/                              # Preprocessing modules
│   ├── dwpose/                              # DWpose for pose detection
│   └── humanparsing/                        # Human parsing for segmentation
├── resource/                                # Resource files
│   └── img/                                 # Images for documentation
├── src/                                     # Source code
│   ├── attention_garm.py                    # Attention mechanism for garment processing
│   ├── attention_processor_garm.py          # Attention processor for garment
│   ├── attention_processor_vton.py          # Attention processor for try-on
│   ├── attention_vton.py                    # Attention mechanism for try-on
│   ├── generator.py                         # Main generator class
│   ├── image_utils.py                       # Image processing utilities
│   ├── pipeline_stable_diffusion_3_tryon.py # Main pipeline
│   ├── pose_guider.py                       # Pose guiding module
│   ├── transformer_sd3_garm.py              # Transformer for garment processing
│   ├── transformer_sd3_vton.py              # Transformer for try-on
│   ├── ui.py                                # Gradio UI components
│   └── utils_mask.py                        # Mask utilities
├── app.py                                   # Main entry point
├── requirements.txt                         # Python dependencies
├── setup.sh                                 # Setup script
└── README.md                                # This file
```
- Best results are achieved with front-facing model images
- Performance depends on GPU capabilities
- Some garment types may produce better results than others
- Mask adjustments may be needed for optimal results with certain body types
This project builds upon several open-source projects and research:
- Stable Diffusion 3 by Stability AI
- DWpose for human pose estimation
- Human parsing for semantic segmentation
- CLIP by OpenAI for vision encoding
- Gradio for the web interface