A multimodal embedding model based on Qwen3-VL for text and image retrieval tasks. Built with the Tevatron framework and fine-tuned using LoRA.
- Multimodal retrieval: Encode text queries and images into a shared embedding space
- Based on Qwen3-VL-4B-Instruct: Leverages powerful vision-language foundation model
- Efficient training: Uses LoRA for parameter-efficient fine-tuning
- MTEB evaluation: Compatible with MTEB benchmark tasks
```bash
# Install with uv
uv sync
```

Train a retriever model using the provided training script:

```bash
bash train.sh
```

The training uses:
- DeepSpeed for distributed training
- LoRA fine-tuning of Qwen3-VL-4B-Instruct
- Contrastive learning with temperature scaling
- EOS pooling with normalized embeddings
Configure your training data in dataset_config.yaml.
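For intuition, the contrastive objective with temperature scaling and in-batch negatives mentioned above can be sketched roughly as follows. This is an illustrative sketch, not the actual Tevatron training code, and the temperature value is a placeholder:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, p_emb, temperature=0.05):
    """Illustrative in-batch-negative contrastive loss.

    q_emb: (B, D) query embeddings, p_emb: (B, D) positive passage/image
    embeddings; both assumed L2-normalized. temperature=0.05 is a
    placeholder, not necessarily the value used by train.sh.
    """
    # Score every query against every passage in the batch.
    logits = q_emb @ p_emb.T / temperature            # (B, B)
    # Query i's positive sits on the diagonal; all other columns act as negatives.
    targets = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(logits, targets)
```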
Two example inference paths are provided:
- With the MTEB wrapper: see inference_mteb.py
- With Transformers: see inference_transformers.py (a pooling sketch follows below)
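inference_transformers.py is the reference for the Transformers path; the step that turns hidden states into embeddings can be sketched as below, assuming right-padded inputs and a standard `last_hidden_state`/`attention_mask` pair from a forward pass (prompt formatting and model loading details live in the script itself):

```python
import torch
import torch.nn.functional as F

def eos_pool_and_normalize(last_hidden_state, attention_mask):
    """Last-token (EOS) pooling followed by L2 normalization, as described
    in the model details. last_hidden_state: (B, T, 2560), attention_mask: (B, T)."""
    # Index of the last non-padded token in each sequence (right padding assumed).
    last_token = attention_mask.sum(dim=1) - 1
    emb = last_hidden_state[torch.arange(last_hidden_state.size(0)), last_token]
    # Unit-normalize so cosine similarity reduces to a dot product.
    return F.normalize(emb, p=2, dim=-1)
```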
To evaluate with MTEB (for example, on the ViDoRe v2 benchmark):

```python
import mteb

model_meta = mteb.get_model_meta('eagerworks/eager-embed-v1')
model = model_meta.load_model()

# Get benchmarks and extract tasks from them
benchmarks = mteb.get_benchmarks(["ViDoRe(v2)"])
tasks = []
for benchmark in benchmarks:
    tasks.extend(benchmark.tasks)
print(tasks)

# Run evaluation with a reduced batch size to save CUDA memory
results = mteb.evaluate(model=model, tasks=tasks, encode_kwargs={"batch_size": 8})
print("Evaluation complete!")
print(results)
```

For more examples, see evaluate_mteb.py.
To merge the LoRA adapter into the base model and push the merged weights to the Hugging Face Hub:

```bash
python merge_lora_and_push.py \
  --adapter_path run2_8x5090 \
  --push_to_hub_id eagerworks/eager-embed-v1 \
  --dtype float32
```

Model details:
- Base Model: Qwen3-VL-4B-Instruct (~3B parameters)
- Embedding Dimension: 2560
- Pooling: Last token (EOS) pooling
- Normalization: L2 normalized embeddings
- Training: Contrastive learning with in-batch negatives
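Because the embeddings are L2-normalized, cosine similarity is just a dot product, so retrieval reduces to a matrix multiply and a top-k. A tiny illustration with placeholder tensors (in practice these would come from encoding queries and documents/images):

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings: 1 query vs. 100 candidates, dimension 2560 as noted above.
query = F.normalize(torch.randn(1, 2560), dim=-1)
candidates = F.normalize(torch.randn(100, 2560), dim=-1)

scores = query @ candidates.T               # cosine similarity == dot product on unit vectors
top5 = scores.topk(5, dim=-1).indices       # indices of the 5 best-matching candidates
print(top5)
```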
License: Apache 2.0
If you use this model, please cite:

```bibtex
@article{EagerEmbed,
  title={Eager Embed V1: Multimodal Dense Embeddings for Retrieval},
  author={Juan Pablo Balarini},
  year={2025},
  publisher={Eagerworks},
  url={https://github.com/eagerworks/eager-embed}
}
```