# ColQwen3 Embedding Model Benchmarks

A benchmark suite for comparing ColQwen3 embedding models (BASE vs. AWQ-quantized).

## Models

| Size | BASE Model | AWQ Model |
|------|------------|-----------|
| 4B | TomoroAI/tomoro-colqwen3-embed-4b | shubhamg2208/tomoro-colqwen3-embed-4b-w4a16-autoawq-seqlen-1024 |
| 8B | TomoroAI/tomoro-colqwen3-embed-8b | shubhamg2208/tomoro-ai-colqwen3-embed-8b-w4a16-autoawq-seqlen-1024 |
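
For orientation, below is one plausible way to load a checkpoint with `transformers`. This is a minimal sketch that assumes the models work with `AutoModel` and `trust_remote_code=True`; the exact model class and processor for ColQwen3 may differ, so treat the benchmark scripts as the source of truth.

```python
# Minimal loading sketch (an assumption: these checkpoints may need a
# dedicated ColQwen3 class; AutoModel + trust_remote_code is a guess).
import torch
from transformers import AutoModel

MODEL_ID = "shubhamg2208/tomoro-colqwen3-embed-4b-w4a16-autoawq-seqlen-1024"

model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # the vision tower stays in FP16/BF16 (see Notes)
    trust_remote_code=True,
    device_map="cuda",
)
model.eval()
```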

## Requirements

- Python 3.11+
- CUDA-capable GPU (a quick visibility check is sketched below)
- uv package manager
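
Before launching a long run, it helps to confirm that PyTorch can actually see the GPU:

```python
# Quick environment check: confirm a CUDA device is visible before
# starting a long benchmark run.
import torch

assert torch.cuda.is_available(), "No CUDA device visible to PyTorch"
print(torch.cuda.get_device_name(0))
print(f"{torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB VRAM")
```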

## Installation

```bash
uv sync
```

## Running Benchmarks

### 4B Models

```bash
# Run both BASE and AWQ
uv run python benchmark.py

# Run the BASE model only
uv run python benchmark.py --only_base --output_json base_results.json

# Run the AWQ model only
uv run python benchmark.py --only_awq --output_json awq_results.json
```

### 8B Models

```bash
# Run both BASE and AWQ
uv run python benchmark_8b.py

# Run the BASE model only
uv run python benchmark_8b.py --only_base --output_json base_8b_results.json

# Run the AWQ model only
uv run python benchmark_8b.py --only_awq --output_json awq_8b_results.json
```

## Configuration Options

| Option | Default | Description |
|--------|---------|-------------|
| `--text_samples` | 64 | Number of text samples |
| `--text_batch_size` | 8 | Batch size for text |
| `--image_samples` | 16 | Number of image samples |
| `--image_batch_size` | 4 | Batch size for images |
| `--image_size` | 512 | Image dimensions (pixels) |
| `--warmup_steps` | 3 | Warmup iterations |
| `--measure_steps` | 10 | Measurement iterations |
| `--only_base` | - | Benchmark the BASE model only |
| `--only_awq` | - | Benchmark the AWQ model only |
| `--text_only` | - | Skip image benchmarks (text tower only) |
| `--sweep_batch_sizes` | - | Comma-separated batch sizes to sweep |
| `--high_batch_sweep` | - | Comma-separated high batch sizes, quantized model only (e.g. `512,1024,1536,2048`) |
| `--output_json` | - | Save results to a JSON file |
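
For reference, here is a sketch of how a CLI with these flags could be wired up with `argparse`; the names mirror the table above, but the actual parsing code in `benchmark.py` may be organized differently.

```python
# Hypothetical sketch of the CLI surface described above; benchmark.py
# may define these flags differently.
import argparse

parser = argparse.ArgumentParser(description="ColQwen3 embedding benchmark")
parser.add_argument("--text_samples", type=int, default=64)
parser.add_argument("--text_batch_size", type=int, default=8)
parser.add_argument("--image_samples", type=int, default=16)
parser.add_argument("--image_batch_size", type=int, default=4)
parser.add_argument("--image_size", type=int, default=512)
parser.add_argument("--warmup_steps", type=int, default=3)
parser.add_argument("--measure_steps", type=int, default=10)
parser.add_argument("--only_base", action="store_true")
parser.add_argument("--only_awq", action="store_true")
parser.add_argument("--text_only", action="store_true")
parser.add_argument("--sweep_batch_sizes", type=str, default=None)
parser.add_argument("--high_batch_sweep", type=str, default=None)
parser.add_argument("--output_json", type=str, default=None)
args = parser.parse_args()
```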

## Batch Size Sweep

Sweep mode runs the benchmark at a series of batch sizes to find the highest throughput each model can sustain within GPU memory. Because the quantized model needs less memory per sample, it can fit larger batches, which is where its throughput advantage shows up; a Python sketch of the sweep loop follows the commands below.

```bash
# Sweep batch sizes for the BASE model
uv run python benchmark.py --only_base \
    --sweep_batch_sizes "8,16,32,64,128,256" \
    --output_json sweep_base.json

# Sweep batch sizes for the quantized model (can use larger batches)
uv run python benchmark.py --only_awq \
    --sweep_batch_sizes "8,16,32,64,128,256,512" \
    --output_json sweep_awq.json

# High batch sweep for the quantized model only (demonstrates max batch advantage)
uv run python benchmark.py --only_awq \
    --high_batch_sweep "512,1024,1536,2048,2560,3072" \
    --output_json sweep_awq_high.json
```
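
Conceptually, the sweep reduces to a loop like the one below; `encode_batch` and `make_batch` are hypothetical stand-ins for the model's embedding call and input construction, not functions from this repo.

```python
# Hypothetical sweep loop: encode_batch()/make_batch() stand in for the
# model's real embedding call and input construction (not repo functions).
import time
import torch

def sweep(encode_batch, make_batch, batch_sizes, measure_steps=10):
    results = {}
    for bs in batch_sizes:
        try:
            batch = make_batch(bs)
            torch.cuda.synchronize()
            start = time.perf_counter()
            for _ in range(measure_steps):
                encode_batch(batch)
            torch.cuda.synchronize()
            elapsed = time.perf_counter() - start
            results[bs] = bs * measure_steps / elapsed  # samples/sec
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            results[bs] = None  # this batch size no longer fits in VRAM
    return results
```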

## Example with Custom Parameters

```bash
uv run python benchmark.py \
    --text_samples 32 \
    --text_batch_size 8 \
    --image_samples 16 \
    --image_batch_size 4 \
    --warmup_steps 5 \
    --measure_steps 20 \
    --output_json results.json
```
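
The `--warmup_steps` / `--measure_steps` split exists because the first few forward passes pay one-off costs (CUDA context setup, allocator growth, kernel selection) that would skew the numbers. A generic version of that timing pattern, sketched here rather than copied from the repo:

```python
# Generic warmup-then-measure pattern (a sketch, not benchmark.py's
# exact code): warmup absorbs one-off CUDA costs; only steady-state
# iterations are timed.
import time
import torch

def timed(fn, warmup_steps=3, measure_steps=10):
    for _ in range(warmup_steps):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(measure_steps):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / measure_steps  # seconds per step
```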

## Notes

- AWQ (W4A16) quantization is applied only to the text tower; the vision tower remains in FP16/BF16.
- Run the models separately (`--only_base` / `--only_awq`) to avoid GPU memory conflicts.
- With `--output_json`, results are saved to a JSON file for further analysis (see the sketch below).
- Quantized models use roughly 60% less memory, enabling larger batch sizes and higher throughput on memory-constrained GPUs.
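
Once both runs have written their JSON files, comparing them takes only a few lines. In the sketch below, the field name `text_throughput` is an assumption, so adjust it to whatever schema the scripts actually emit.

```python
# Illustrative post-processing of saved results. The field name
# "text_throughput" is an assumption, not the scripts' actual schema.
import json

with open("base_results.json") as f:
    base = json.load(f)
with open("awq_results.json") as f:
    awq = json.load(f)

speedup = awq["text_throughput"] / base["text_throughput"]
print(f"AWQ text-tower throughput: {speedup:.2f}x BASE")
```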
