📌 This is the official implementation and benchmark evaluation repository of *Unleashing and Benchmarking the Interleaved Cross-modality Comprehension and Generation*.
[Your introduction content will go here]
```
.
├── eval/                 # Evaluation scripts
│   ├── config.py         # Configuration settings
│   ├── main.py           # Main evaluation pipeline
│   ├── prompts.py        # Evaluation prompts
│   ├── summarize.py      # Results summarization
│   ├── utils.py          # Utility functions
│   └── vlm_tools.py      # GPT implementation
├── infer/                # Inference tools
│   ├── case_bagel.py     # Example for combined tasks
│   ├── case_gpt.py       # Example for VQA tasks
│   ├── case_step1x.py    # Example for image generation
│   └── loader.py         # Data loading utilities
└── vis.ipynb             # Visualization notebook
```

First, prepare the test dataset:
```bash
mkdir <YOUR_DATA_PATH>
cd <YOUR_DATA_PATH>
huggingface-cli download WeiChow/WEAVE --include test/ --repo-type dataset --local-dir .
cd test
unzip test.zip
```

Then set `ROOT_PATH` in `loader.py` to your data path (`<YOUR_DATA_PATH>`).
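The download step can also be done from Python with `huggingface_hub` instead of the CLI (a minimal sketch; unzipping `test.zip` and setting `ROOT_PATH` are still required):

```python
from huggingface_hub import snapshot_download

# Fetch only the test split of the WEAVE dataset into <YOUR_DATA_PATH>.
snapshot_download(
    repo_id="WeiChow/WEAVE",
    repo_type="dataset",
    allow_patterns=["test/*"],
    local_dir="<YOUR_DATA_PATH>",
)
```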
The test set file (`test.json`) has the following format:
⚠️ Note: `Image #1` references the first image, with indexing starting from 1. This is the image index, not the conversation turn; it corresponds to the first entry of the `images` array (`images[0]`). For multi-turn conversations, each image index should be replaced once with `Image #{idx}<image>\n`. For single-turn conversations, simply replace it directly.
```
{
    "domain": str,
    "images": [],
    "chats": [
    ]
}
```
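As a concrete illustration of the image-index convention above, here is a minimal sketch of expanding the placeholders in a chat string (the helper below is hypothetical; `loader.py` and the provided `case_*.py` scripts show the actual handling):

```python
import re

def expand_image_placeholders(text: str) -> str:
    r"""Replace each distinct 'Image #<idx>' reference with 'Image #<idx><image>\n',
    applying the replacement only once per index."""
    seen = set()

    def repl(match: re.Match) -> str:
        idx = match.group(1)
        if idx in seen:
            return match.group(0)  # later mentions of the same index stay untouched
        seen.add(idx)
        return f"Image #{idx}<image>\n"

    return re.sub(r"Image #(\d+)", repl, text)

# 'Image #1' maps to images[0], 'Image #2' to images[1], and so on.
print(expand_image_placeholders("Edit Image #1 so that it matches Image #2."))
```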
The `infer/` directory contains tools for inference. To use WeaveBench, you only need to copy `loader.py` and integrate it with your model.

For your convenience, we provide three example scenarios:

- 🖼️ Image generation inference only (`case_step1x.py`)
- 💬 VQA task inference only (`case_gpt.py`)
- 🔄 Combined inference for both image generation and VQA tasks (`case_bagel.py`)
⚠️ Note: keep the check `if chat[1]['type'] != 'text':` when parsing chats; otherwise, the text's ending index may be incorrect.
Use the `eval/` directory to compute scores. First install the dependencies:

```bash
pip3 install -r requirements.txt
```

Then set the GPT config in `eval/config.py`:
```python
ROOT_PATH = <YOUR_DATA_PATH>
OPENAI_MODEL = <MODEL>            # e.g. 'gpt-4o-2024-08-06'
AZURE_API_KEY = <YOUR KEY>
AZURE_ENDPOINT = <YOUR ENDPOINT>  # e.g. "https://api.openai.com/v1/chat/completions"
```

To run the evaluation:
```bash
python3 eval/main.py --input_dir result/bagel/imgs --output_dir result/bagel/ --mode img
python3 eval/summarize.py --metrics_file result/bagel/metrics.jsonl
```

The `input_dir` is the location where you saved images after inference (e.g., `result/bagel/imgs`).
The `output_dir` is where the GPT evaluation results are stored.
The `mode` parameter has three options: `img`, `txt`, or `umm`. Please select the appropriate evaluation mode.
💡 Tip: Since we require GPT to output in JSON format, some items may not be scored due to format errors. You can rerun the first script to score these items; it will automatically skip already scored content.
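For context, the judging call made by `eval/vlm_tools.py` is a standard chat-completions request with JSON output enforced. The sketch below is only an approximation of that call (the payload layout and header are assumptions based on the OpenAI API, not the exact implementation):

```python
import base64
import requests

# Values taken from eval/config.py.
AZURE_ENDPOINT = "https://api.openai.com/v1/chat/completions"
AZURE_API_KEY = "<YOUR KEY>"
OPENAI_MODEL = "gpt-4o-2024-08-06"

def judge(prompt: str, image_path: str) -> str:
    """Send one evaluation prompt plus an image and request a JSON-formatted verdict."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    payload = {
        "model": OPENAI_MODEL,
        # JSON mode; occasional malformed outputs are why some items need a rerun.
        "response_format": {"type": "json_object"},
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    }
    # For a real Azure endpoint the header is typically {"api-key": AZURE_API_KEY} instead.
    headers = {"Authorization": f"Bearer {AZURE_API_KEY}"}
    resp = requests.post(AZURE_ENDPOINT, headers=headers, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```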
WeaveBench evaluates 4 core metrics:
| Metric | Code | Description | Requires Text | Requires Image |
|---|---|---|---|---|
| Key Point Correctness | KP | Measures whether the edited image satisfies the specified editing requirements. | ❌ No | ✅ Yes |
| Visual Consistency | VC | Ensures non-target elements remain unchanged and maintains consistency with the original image. | ❌ No | ✅ Yes |
| Image Quality | IQ | Evaluates the overall quality of the generated image. | ❌ No | ✅ Yes |
| Accuracy | ACC | Measures the correctness of the reasoning result for comprehension tasks. | ✅ Yes | ❌ No |
Consider adding your results to our leaderboard on the Weave Project Page. Please contact weichow@qq.com for submission details!
```bibtex
@article{
}
```