Benchmark suite for comparing ColQwen3 embedding models (BASE vs AWQ quantized).
| Size | BASE Model | AWQ Model |
|---|---|---|
| 4B | TomoroAI/tomoro-colqwen3-embed-4b | shubhamg2208/tomoro-ai-colqwen3-embed-4b-w4a16-autoawq-seqlen-1024 |
| 8B | TomoroAI/tomoro-colqwen3-embed-8b | shubhamg2208/tomoro-ai-colqwen3-embed-8b-w4a16-autoawq-seqlen-1024 |
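To pre-fetch any of these checkpoints into the local Hugging Face cache before a benchmark run, a plain `huggingface_hub` download works. This is a convenience sketch, not part of the benchmark scripts:

```python
from huggingface_hub import snapshot_download

# Pre-download the 4B BASE and AWQ checkpoints (repo IDs from the table above).
for repo_id in (
    "TomoroAI/tomoro-colqwen3-embed-4b",
    "shubhamg2208/tomoro-ai-colqwen3-embed-4b-w4a16-autoawq-seqlen-1024",
):
    local_path = snapshot_download(repo_id=repo_id)
    print(f"{repo_id} -> {local_path}")
```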
Requirements:

- Python 3.11+
- CUDA-capable GPU
- uv package manager
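To confirm the GPU requirement is satisfied before starting a long run, a quick PyTorch check is enough (a standalone snippet, not part of the suite):

```python
import torch

# Fail fast if no CUDA device is visible; the benchmarks require a GPU.
assert torch.cuda.is_available(), "CUDA-capable GPU required"
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB VRAM")
```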
Install dependencies:

```bash
uv sync
```

4B model:

```bash
# Run both BASE and AWQ
uv run python benchmark.py

# Run BASE model only
uv run python benchmark.py --only_base --output_json base_results.json

# Run AWQ model only
uv run python benchmark.py --only_awq --output_json awq_results.json
```

8B model:

```bash
# Run both BASE and AWQ
uv run python benchmark_8b.py

# Run BASE model only
uv run python benchmark_8b.py --only_base --output_json base_8b_results.json

# Run AWQ model only
uv run python benchmark_8b.py --only_awq --output_json awq_8b_results.json
```

Options:

| Option | Default | Description |
|---|---|---|
| `--text_samples` | 64 | Number of text samples |
| `--text_batch_size` | 8 | Batch size for text |
| `--image_samples` | 16 | Number of image samples |
| `--image_batch_size` | 4 | Batch size for images |
| `--image_size` | 512 | Image dimensions (pixels) |
| `--warmup_steps` | 3 | Warmup iterations |
| `--measure_steps` | 10 | Measurement iterations |
| `--only_base` | - | Benchmark BASE model only |
| `--only_awq` | - | Benchmark AWQ model only |
| `--text_only` | - | Skip image benchmarks (text tower only) |
| `--sweep_batch_sizes` | - | Comma-separated batch sizes to sweep |
| `--high_batch_sweep` | - | High batch sizes for the quantized model only (e.g. `512,1024,1536,2048`) |
| `--output_json` | - | Save results to a JSON file |
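For reference, the options above correspond to an argparse interface along these lines. This is a hedged reconstruction of the documented CLI surface, not the actual parser inside benchmark.py:

```python
import argparse

# Hedged reconstruction of the documented CLI; benchmark.py's real parser may differ.
parser = argparse.ArgumentParser(description="ColQwen3 BASE vs AWQ benchmark")
parser.add_argument("--text_samples", type=int, default=64, help="Number of text samples")
parser.add_argument("--text_batch_size", type=int, default=8, help="Batch size for text")
parser.add_argument("--image_samples", type=int, default=16, help="Number of image samples")
parser.add_argument("--image_batch_size", type=int, default=4, help="Batch size for images")
parser.add_argument("--image_size", type=int, default=512, help="Image dimensions (pixels)")
parser.add_argument("--warmup_steps", type=int, default=3, help="Warmup iterations")
parser.add_argument("--measure_steps", type=int, default=10, help="Measurement iterations")
parser.add_argument("--only_base", action="store_true", help="Benchmark BASE model only")
parser.add_argument("--only_awq", action="store_true", help="Benchmark AWQ model only")
parser.add_argument("--text_only", action="store_true", help="Skip image benchmarks")
parser.add_argument("--sweep_batch_sizes", type=str, help="Comma-separated batch sizes")
parser.add_argument("--high_batch_sweep", type=str, help="High batch sizes, quantized only")
parser.add_argument("--output_json", type=str, help="Save results to a JSON file")
args = parser.parse_args()
```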
The sweep mode tests multiple batch sizes to find the optimal throughput for each model under memory constraints. This demonstrates how quantization enables higher throughput by allowing larger batch sizes.
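Conceptually, such a sweep times each batch size in turn and stops at the first CUDA out-of-memory error, keeping the throughput observed for every size that fit. The sketch below illustrates the idea only; it is not benchmark.py's actual loop, and `encode_batch`/`make_batch` are hypothetical stand-ins for the model call and data preparation:

```python
import time

import torch

def sweep_throughput(encode_batch, make_batch, batch_sizes, steps=10):
    """Measure samples/sec for each batch size, stopping at the first CUDA OOM."""
    results = {}
    for bs in sorted(batch_sizes):
        try:
            batch = make_batch(bs)
            torch.cuda.synchronize()
            start = time.perf_counter()
            for _ in range(steps):
                encode_batch(batch)
            torch.cuda.synchronize()
            results[bs] = bs * steps / (time.perf_counter() - start)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            break  # larger batch sizes will also OOM
    return results
```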
Example sweep runs:

```bash
# Sweep batch sizes for BASE model
uv run python benchmark.py --only_base \
  --sweep_batch_sizes "8,16,32,64,128,256" \
  --output_json sweep_base.json

# Sweep batch sizes for Quantized model (can use larger batches)
uv run python benchmark.py --only_awq \
  --sweep_batch_sizes "8,16,32,64,128,256,512" \
  --output_json sweep_awq.json

# High batch sweep for Quantized model only (demonstrates max batch advantage)
uv run python benchmark.py --only_awq \
  --high_batch_sweep "512,1024,1536,2048,2560,3072" \
  --output_json sweep_awq_high.json
```

Custom configuration example:

```bash
uv run python benchmark.py \
  --text_samples 32 \
  --text_batch_size 8 \
  --image_samples 16 \
  --image_batch_size 4 \
  --warmup_steps 5 \
  --measure_steps 20 \
  --output_json results.json
```

Benchmark reports:

- REPORT.md - 4B model benchmark results
- REPORT_8B.md - 8B model benchmark results
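When the BASE and AWQ runs are each saved with `--output_json`, the two files can be compared programmatically. The snippet below assumes only that each file is valid JSON; `throughput` is a hypothetical field name, so inspect the real files for the actual schema:

```python
import json

# Load the result files produced by --output_json (paths from the examples above).
with open("base_results.json") as f:
    base = json.load(f)
with open("awq_results.json") as f:
    awq = json.load(f)

# "throughput" is a hypothetical key; check the files for the real field names.
if "throughput" in base and "throughput" in awq:
    print(f"AWQ speedup: {awq['throughput'] / base['throughput']:.2f}x")
else:
    print("Top-level keys:", sorted(base), "vs", sorted(awq))
```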
Notes:

- AutoRound quantization is applied only to the text tower; the vision tower remains in FP16/BF16
- Run models separately (`--only_base`/`--only_awq`) to avoid GPU memory conflicts
- Results are saved to JSON for further analysis when using `--output_json`
- Quantized models use ~60% less memory, enabling larger batch sizes and higher throughput on memory-constrained GPUs
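To verify the memory saving on your own hardware, PyTorch's peak-allocation counter can bracket a single benchmark pass. A minimal sketch, where the embedding call in the middle is a placeholder:

```python
import torch

# Reset the counter, run one encoding pass, then read the high-water mark.
torch.cuda.reset_peak_memory_stats()
# ... run one warm benchmark pass here (model forward on a batch) ...
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory: {peak_gib:.2f} GiB")
```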