Multi-Modal Modeling and Evaluation on Novel Tasks
This project aims to evaluate and improve Large Vision-Language Models (LVLMs) on multimodal tasks. We designed a series of benchmarks to identify model strengths and limitations in handling complex tasks.
- Clone the repository:

```bash
git clone https://github.com/khazic/Mmoment
cd Mmoment
```

- Install dependencies:

```bash
pip install -e .
```

- Configure:

```bash
cp config_tmp.json config.json
# Edit config.json with your API key and settings
```

- Prepare test data:

```bash
mkdir -p data/inputs/images
mkdir -p data/inputs/videos
# Place your test images and videos in the respective directories
```

- Run a single test:

```bash
python test_api.py
```

- Run the full evaluation:

```bash
python evaluate_task.py
```
Project structure:

```
Mmoment/
├── data/
│   ├── inputs/          # Test data
│   │   ├── images/      # Test images
│   │   └── videos/      # Test videos
│   └── outputs/         # Model outputs
├── results/             # Evaluation results
├── logs/                # Log files
└── config.json          # Configuration file
```
- Purpose: Evaluate model accuracy in handling negative samples
- Example:

```json
{
  "id": "neg_001",
  "prompt": "Is there a banana on the table?",
  "image_path": "data/inputs/images/apple_on_table.jpg",
  "ground_truth": "No, there is no banana on the table."
}
```
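A simple way to score such samples is to check whether the response correctly denies the queried object. The sketch below is a hypothetical heuristic, not the repository's scoring code; the negation cues are assumptions.

```python
def score_negative_sample(response: str) -> float:
    """Hypothetical check: does the response deny the queried object?"""
    text = response.lower()
    # Assumed negation cues; real scoring may use an LLM judge or stricter parsing.
    denies = any(cue in text for cue in ("no,", "there is no", "not present", "cannot see"))
    return 1.0 if denies else 0.0

# Example: score_negative_sample("No, there is no banana on the table.") -> 1.0
```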
- Purpose: Test model's recall ability in multi-object scenes
- Metrics: entity recall rate, attribute accuracy, relationship coverage (entity recall is sketched below)
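For example, entity recall can be computed as the fraction of ground-truth entities mentioned in the response. The helper below is an illustrative sketch, not code from this repository; the naive substring matching is an assumption.

```python
from typing import List

def entity_recall(response: str, gold_entities: List[str]) -> float:
    """Fraction of ground-truth entities mentioned in the response (naive matching)."""
    text = response.lower()
    hits = sum(1 for entity in gold_entities if entity.lower() in text)
    return hits / len(gold_entities) if gold_entities else 0.0

# Example: entity_recall("A dog and a cat on a sofa", ["dog", "cat", "sofa", "ball"]) -> 0.75
```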
- Purpose: Evaluate model's ability to understand relative attributes and degree descriptions
- Example:

```json
{
  "id": "degree_001",
  "prompt": "Is this heavy rain, moderate rain, or light rain?",
  "image_path": "data/inputs/images/rain.jpg",
  "ground_truth": "moderate rain",
  "metrics": ["accuracy", "confidence_score"]
}
```

- Purpose: Evaluate model's comprehension of spatial relationships
- Example:

```json
{
  "id": "spatial_001",
  "prompt": "What is the position of the red square relative to the blue circle?",
  "image_path": "data/inputs/images/shapes.jpg",
  "ground_truth": "The red square is to the left of the blue circle",
  "type": "relative_position"
}
```

- Purpose: Test model's ability to recognize brand logos and famous figures
- Example:

```json
{
  "id": "icon_001",
  "prompt": "Describe this iconic character",
  "image_path": "data/inputs/images/mcdonalds_mascot.jpg",
  "ground_truth": "Ronald McDonald, the McDonald's brand mascot",
  "attributes": ["brand_recognition", "character_identification"]
}
```

- Purpose: Test model's ability to strictly follow given instructions
- Example:

```json
{
  "id": "instruction_001",
  "prompt": "Only describe the red objects in the image",
  "image_path": "data/inputs/images/mixed_objects.jpg",
  "ground_truth": "There is a red apple and a red backpack",
  "constraints": ["only_red_objects"]
}
```
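One way to score constraint adherence is to penalize any mention of objects outside the constraint. The sketch below is purely illustrative; the disallowed-object list is an assumption and is not part of the sample format above.

```python
from typing import Iterable

def follows_constraint(response: str, disallowed: Iterable[str]) -> float:
    """Hypothetical check: 1.0 if the response avoids all disallowed objects."""
    text = response.lower()
    return 0.0 if any(obj.lower() in text for obj in disallowed) else 1.0

# Example: follows_constraint("A red apple and a red backpack", ["blue cup", "green chair"]) -> 1.0
```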
- Purpose: Evaluate model's counting ability in various scenarios
- Example:

```json
{
  "id": "counting_001",
  "prompt": "How many people are in the image?",
  "image_path": "data/inputs/images/crowd.jpg",
  "ground_truth": "27",
  "difficulty": "complex",
  "metrics": ["exact_match", "error_margin"]
}
```
- Purpose: Assess model's ability to extract and understand structured information
- Example:

```json
{
  "id": "chart_001",
  "prompt": "Extract the sales data for Q2 2023",
  "image_path": "data/inputs/images/sales_chart.jpg",
  "ground_truth": {
    "q2_2023": {
      "revenue": 1250000,
      "growth_rate": "15%"
    }
  }
}
```
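Scoring structured extraction typically means comparing the parsed response against the nested ground truth field by field. The helpers below are an illustrative sketch under that assumption; they simply count matching leaf values.

```python
from typing import Any, Dict

def flatten(d: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into paths, e.g. {"q2_2023.revenue": 1250000, ...}."""
    out: Dict[str, Any] = {}
    for key, value in d.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, path + "."))
        else:
            out[path] = value
    return out

def field_accuracy(predicted: Dict[str, Any], ground_truth: Dict[str, Any]) -> float:
    """Fraction of ground-truth leaf fields the prediction reproduces exactly."""
    gold = flatten(ground_truth)
    pred = flatten(predicted)
    hits = sum(1 for k, v in gold.items() if pred.get(k) == v)
    return hits / len(gold) if gold else 0.0
```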
Main `config.json` settings:

```json
{
  "api": {
    "openai": {
      "api_key": "your-api-key",
      "image_config": {
        "model": "gpt-4-vision-preview",
        "max_tokens": 1024
      },
      "video_config": {
        "model": "gpt-4-vision-preview",
        "max_tokens": 2048
      }
    }
  }
}
```
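As a quick sanity check, the settings can be read back with the standard `json` module. This is a minimal sketch assuming the key layout shown above; adjust the path if your config lives elsewhere.

```python
import json

# Load config.json and print the image-model settings (keys follow the example above).
with open("config.json") as f:
    config = json.load(f)

image_cfg = config["api"]["openai"]["image_config"]
print(image_cfg["model"], image_cfg["max_tokens"])
```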
Results are saved in the `results/` directory:

- `metrics_{timestamp}.json`: detailed metrics
- `report_{timestamp}.html`: visual report
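To post-process a run programmatically, the most recent metrics file can be located by its timestamped name. A minimal sketch, assuming the naming pattern above:

```python
import glob
import json

# Pick the newest metrics_{timestamp}.json from results/ (assumes timestamps sort lexicographically).
metric_files = sorted(glob.glob("results/metrics_*.json"))
if metric_files:
    with open(metric_files[-1]) as f:
        metrics = json.load(f)
    print(metric_files[-1], metrics)
```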
- Create new test data (e.g. `data/inputs/custom_test.json`):

```json
{
  "image_tests": [],
  "video_tests": []
}
```
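For instance, a populated file can follow the same sample format used in the tasks above. The snippet below is a hypothetical example; the id, prompt, and image path are placeholders.

```python
import json

# Hypothetical custom test entry, following the sample format shown above.
custom = {
    "image_tests": [
        {
            "id": "custom_001",
            "prompt": "How many chairs are in the image?",
            "image_path": "data/inputs/images/my_room.jpg",
            "ground_truth": "4",
        }
    ],
    "video_tests": [],
}

with open("data/inputs/custom_test.json", "w") as f:
    json.dump(custom, f, indent=2)
```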
- Implement a new evaluation task:

```python
from typing import Dict

from evaluator import TaskBase  # the Sample type is assumed to come from the same package

class CustomTask(TaskBase):
    def evaluate_response(self, sample: "Sample") -> Dict[str, float]:
        # Implement evaluation logic: compare the model response against the
        # sample's ground truth and return a mapping of metric name -> score.
        pass
```
- API Usage
  - Ensure sufficient API quota
  - Video evaluation may require more tokens
  - Set a reasonable concurrency level (see the sketch after this list)
- Data Preparation
  - Supported image formats: jpg, png
  - Supported video formats: mp4
  - Control media file size
- Security
  - Don't commit config.json to version control
  - Protect API keys
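For the concurrency point above, a bounded worker pool is one simple way to cap parallel API calls. This is a generic sketch, not code from this repository; `evaluate_sample` is a placeholder for whatever function sends a single request.

```python
from concurrent.futures import ThreadPoolExecutor

def run_all(samples, evaluate_sample, max_workers=4):
    # Cap the number of in-flight API requests with a small thread pool.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_sample, samples))
```
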
Planned improvements:

- Add more evaluation scenarios
- Support more model backends
- Improve evaluation metrics
- Add batch testing support
- Enhance report visualization
Issues and Pull Requests are welcome!
MIT License