SIE unifies embeddings, reranking, and extraction under one API with on-demand loading + LRU eviction. What were the biggest technical challenges in supporting such a diverse set of architectures (dense, sparse, multi-vector, ColBERT, vision, cross-encoders, GLiNER, etc.) in a single Rust/Python backend?
How do you handle model compatibility and quality verification? You mentioned MTEB CI checks — what’s your process for adding new models, and how often do you update the 85+ model catalog?
Memory management: What’s the practical limit for running multiple large models simultaneously on a single GPU (e.g., 24GB or 80GB)? Any clever optimizations for mixed precision, quantization, or batching?
In production (Kubernetes with KEDA autoscaling), what kind of latency and throughput do you typically see for common models like BGE, Stella, or BGE-reranker on GPU vs CPU?
Compared to alternatives like vLLM, TGI, Ollama, or dedicated embedding servers (e.g., TEI), what are the biggest advantages and trade-offs of SIE?
The OpenAI-compatible /v1/embeddings endpoint is great for migration. How complete is the compatibility, and do you plan to expand it (e.g., reranking or extraction endpoints)?
If someone wants to add custom models or new task types (e.g., more advanced extraction, OCR, or multimodal), what’s the best way to extend SIE while keeping it within the unified API?
Roadmap: Do you plan to support fine-tuned models, LoRA adapters, continuous batching, or more advanced routing strategies?
The project mixes Rust (core performance) and Python. How do you manage the boundary, and what lessons have you learned from that hybrid approach?