An open-source multimodal model called BAGEL has been released, with image generation somewhat like GPT-4o's.
It uses MoT (Mixture-of-Transformer-Experts), which combines multiple transformers; mixture-style approaches seem to be the trend these days. Below is a Claude summary.
Multimodal AI: an AI model that can process multiple kinds of data, such as text, images, and video, at the same time
MoT (Mixture-of-Transformer-Experts): an architecture that combines multiple specialized transformers to maximize model performance
VLM (Vision-Language Model): an AI model that can understand and process vision and language together
• ByteDance has released 'BAGEL', an open-source multimodal AI model with 7B active parameters (14B total)
• BAGEL outperforms the best existing open-source models such as Qwen2.5-VL and InternVL-2.5
• It delivers text-to-image quality competitive with SD3 and also performs strongly on image editing
• It implements 'world modeling' capabilities beyond existing models, such as free-form visual manipulation, multiview synthesis, and world navigation
• Scaling up pretraining on large-scale multimodal data yields consistent performance gains on understanding, generation, and editing tasks
The most notable thing about BAGEL is that it goes beyond simple image editing to implement 'world modeling' capabilities such as future frame prediction, 3D manipulation, and world navigation.
This demonstrates a new level of ability for AI to understand and interact with the world and can be called an innovative advance.
We present BAGEL, an open-source multimodal foundation model with 7B active parameters (14B total) trained on large-scale interleaved multimodal data. BAGEL outperforms the current top-tier open-source VLMs like Qwen2.5-VL and InternVL-2.5 on standard multimodal understanding leaderboards, and delivers text-to-image quality that is competitive with strong specialist generators such as SD3. Moreover, BAGEL demonstrates superior qualitative results in classical image-editing scenarios compared with the leading open-source models. More importantly, it extends to free-form visual manipulation, multiview synthesis, and world navigation, capabilities that constitute "world-modeling" tasks beyond the scope of previous image-editing models.
BAGEL adopts a Mixture-of-Transformer-Experts (MoT) architecture to maximize the model’s capacity to learn from richly diverse multimodal information. Following the same principle of capacity maximization, it utilizes two separate encoders to capture pixel-level and semantic-level features of an image. The overall framework follows a Next Group of Token Prediction paradigm, where the model is trained to predict the next group of language or visual tokens as a compression target.
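As a rough schematic of the MoT idea (not BAGEL's actual implementation; the module names, dimensions, and the choice of what is expert-specific are made up for illustration), each block can run shared self-attention over the interleaved sequence while routing every token to a modality-specific feed-forward expert:

import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    # Toy Mixture-of-Transformer-Experts block: attention is shared across the
    # interleaved sequence, but each modality gets its own feed-forward expert.
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.experts = nn.ModuleDict({
            "text": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
            "vision": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
        })

    def forward(self, x, modality_ids):
        # Shared self-attention over the whole interleaved token sequence.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Route each token to the expert matching its modality.
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for idx, name in enumerate(["text", "vision"]):
            mask = modality_ids == idx
            out[mask] = self.experts[name](h[mask])
        return x + out

# Interleaved sequence: 8 text tokens followed by 8 visual tokens.
tokens = torch.randn(1, 16, 256)
modality_ids = torch.tensor([[0] * 8 + [1] * 8])
print(MoTBlock()(tokens, modality_ids).shape)  # torch.Size([1, 16, 256])

In the full model the experts are far larger and may cover more than just the feed-forward layers; the sketch only shows the routing pattern.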
BAGEL scales MoT’s capacity through Pre-training, Continued Training, and Supervised Finetuning on trillions of interleaved multimodal tokens spanning language, image, video, and web data. It surpasses open models on standard understanding and generation benchmarks and demonstrates advanced in-context multimodal abilities like free-form image editing, future frame prediction, 3D manipulation, world navigation, and sequential reasoning.
As we scale up BAGEL’s pretraining with more multimodal tokens, we observe consistent performance gains across understanding, generation, and editing tasks. Different capabilities emerge at distinct training stages—multimodal understanding and generation appear early, followed by basic editing, while complex, intelligent editing emerges later. This staged progression suggests an emergent pattern, where advanced multimodal reasoning builds on well-formed foundational skills. Ablation studies further show that combining VAE and ViT features significantly improves intelligent editing, underscoring the importance of visual-semantic context in enabling complex multimodal reasoning and further supporting its role in the emergence of advanced capabilities.
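To make the VAE-plus-ViT point concrete, here is a tiny hypothetical sketch (the projections and dimensions are invented for the example, not BAGEL's actual pipeline) of giving the transformer both pixel-level VAE latent tokens and semantic ViT patch tokens for the same image:

import torch
import torch.nn as nn

# Stand-ins for the two visual encoders: a VAE latent encoder (pixel-level)
# and a ViT encoder (semantic-level). Dimensions are arbitrary.
vae_proj = nn.Linear(16, 256)    # project 16-dim VAE latent patches to the model dim
vit_proj = nn.Linear(768, 256)   # project 768-dim ViT patch embeddings to the model dim

vae_latents = torch.randn(1, 64, 16)   # e.g. an 8x8 grid of VAE latent patches
vit_patches = torch.randn(1, 64, 768)  # e.g. an 8x8 grid of ViT patch features

# Concatenate both token streams so downstream layers see pixel-level and
# semantic-level context for the same image.
visual_tokens = torch.cat([vae_proj(vae_latents), vit_proj(vit_patches)], dim=1)
print(visual_tokens.shape)  # torch.Size([1, 128, 256])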
Call for Bad Cases: If you have encountered any cases where the model performs poorly, we would greatly appreciate it if you could share them in issue #11 or on Discord.
About Inference Hyperparameters:
- cfg_text_scale: Controls how strongly the model follows the text prompt. 1.0 disables text guidance. Typical range: 4.0–8.0.
- cfg_image_scale: Controls how much the model preserves input image details. 1.0 disables image guidance. Typical range: 1.0–2.0.
- cfg_interval: Fraction of denoising steps where CFG is applied. Later steps can skip CFG to reduce computation. Typical: [0.4, 1.0].
- timestep_shift: Shifts the distribution of denoising steps. Higher values allocate more steps at the start (affects layout); lower values allocate more at the end (improves details).
- num_timesteps: Total denoising steps. Typical: 50.
- cfg_renorm_min: Minimum value for CFG-Renorm. 1.0 disables renorm. Typical: 0.
- cfg_renorm_type: CFG-Renorm method:
  - global: Normalize over all tokens and channels (default for T2I).
  - local: Normalize per channel.
  - text_channel: Like local, but only applies to the text condition (good for editing, may cause blur).
- If edited images appear blurry, try global CFG-Renorm, decrease cfg_renorm_min, or decrease cfg_scale.
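For intuition about how these knobs interact, here is a minimal sketch of one common way a denoising step can combine unconditional, text-conditioned, and image-conditioned predictions using cfg_text_scale / cfg_image_scale and then apply a global-style CFG renorm. The function name and exact formulas are assumptions for illustration, not BAGEL's actual inference code (see inference.ipynb for that).

import torch

def guided_prediction(pred_uncond, pred_text, pred_image,
                      cfg_text_scale=6.0, cfg_image_scale=1.5,
                      cfg_renorm_min=0.0):
    # Hypothetical classifier-free guidance with two conditions: push the
    # prediction toward the text- and image-conditioned outputs.
    pred = (pred_uncond
            + cfg_text_scale * (pred_text - pred_uncond)
            + cfg_image_scale * (pred_image - pred_uncond))
    # One plausible "global" CFG-Renorm: rescale the guided prediction back
    # toward the conditional prediction's norm over all tokens and channels.
    # With cfg_renorm_min=1.0 the scale is pinned to 1.0, i.e. renorm is off.
    scale = (pred_text.norm() / (pred.norm() + 1e-8)).clamp(min=cfg_renorm_min, max=1.0)
    return pred * scale

# Toy predictions for a 4x4 latent with 8 channels.
shape = (1, 8, 4, 4)
out = guided_prediction(torch.randn(shape), torch.randn(shape), torch.randn(shape))
print(out.shape)  # torch.Size([1, 8, 4, 4])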
1️⃣ Set up environment
git clone https://github.com/bytedance-seed/BAGEL.git
cd BAGEL
conda create -n bagel python=3.10 -y
conda activate bagel
pip install -r requirements.txt
2️⃣ Download pretrained checkpoint
from huggingface_hub import snapshot_download

# Download the BAGEL-7B-MoT checkpoint from Hugging Face.
save_dir = "/path/to/save/BAGEL-7B-MoT"
repo_id = "ByteDance-Seed/BAGEL-7B-MoT"
cache_dir = save_dir + "/cache"

snapshot_download(
    cache_dir=cache_dir,
    local_dir=save_dir,
    repo_id=repo_id,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
3️⃣ Go to inference.ipynb to start playing with BAGEL!
bash scripts/train.sh
You can replace the variables in the script with your own before running. Training & fine-tuning docs are coming soon.
We provide scripts for evaluating the VLM, T2I, and editing benchmarks. Please see EVAL for more details.
| Model | MME ↑ | MMBench ↑ | MMMU ↑ | MM-Vet ↑ | MathVista ↑ |
|---|---|---|---|---|---|
| Janus-Pro-7B | - | 79.2 | 41.0 | 50.0 | - |
| Qwen2.5-VL-7B | 2347 | 83.5 | 58.6 | 67.1 | 68.2 |
| BAGEL | 2388 | 85.0 | 55.3 | 67.2 | 73.1 |
| Model | GenEval ↑ | WISE ↑ |
|---|---|---|
| Janus-Pro-7B | 0.80 | 0.35 |
| SD3-Medium | 0.74 | - |
| FLUX-1-dev | 0.82 | 0.50 |
| BAGEL | - | 0.52 |
| BAGEL + CoT | 0.88 | 0.70 |
| Model | GEdit-Bench-EN (SC) ↑ | GEdit-Bench-EN (PQ) ↑ | GEdit-Bench-EN (O) ↑ | IntelligentBench ↑ |
|---|---|---|---|---|
| Step1X-Edit | 7.09 | 6.76 | 6.70 | 14.9 |
| Gemini-2-exp. | 6.73 | 6.61 | 6.32 | 57.6 |
| BAGEL | 7.36 | 6.83 | 6.52 | 44.0 |
| BAGEL+CoT | - | - | - | 55.3 |
@article{deng2025bagel,
title = {Emerging Properties in Unified Multimodal Pretraining},
author = {Deng, Chaorui and Zhu, Deyao and Li, Kunchang and Gou, Chenhui and Li, Feng and Wang, Zeyu and Zhong, Shu and Yu, Weihao and Nie, Xiaonan and Song, Ziang and Shi, Guang and Fan, Haoqi},
journal = {arXiv preprint arXiv:2505.14683},
year = {2025}
}
BAGEL is licensed under the Apache 2.0 license.