A blazing-fast inference engine that loads models from SSD to GPU VRAM up to 10x faster than alternative loaders.
Hot-swap large models in under 2 seconds.

Acknowledgement: This project includes substantial code originally developed by ServerlessLLM, used under the Apache License, Version 2.0.
Traditional model loaders slow down your workflow with painful startup times. flashtensors is built on top of the ServerlessLLM storage library to eliminate these bottlenecks and maximize performance.
- ⚡ Up to 10x faster than standard loaders
- ⏱ Coldstarts under 2 seconds
The result: an inference engine that scales with usage, not with the number of models.
- Host hundreds of models on a single device and hot-swap them on demand with little to no impact on user experience (see the sketch after this list).
- Run agentic workflows on constrained devices (robots, wearables, etc.).
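For a concrete picture of hot-swapping, here is a minimal sketch built from the Python API shown further down (`ft.load_model`, `ft.cleanup_gpu`). It assumes both models were already registered with `ft.register_model` and that `ft.load_model` returns a vLLM `LLM` object, as in the full example below:

```python
import flashtensors as ft

# Assumes both models were previously registered with ft.register_model(...)
MODELS = ["Qwen/Qwen3-0.6B", "Qwen/Qwen3-4B"]

def answer(model_id: str, prompt: str) -> str:
    # Pull the requested model from SSD into VRAM, generate once,
    # then free the GPU so the next model can be swapped in.
    llm = ft.load_model(model_id=model_id, backend="vllm", dtype="bfloat16")
    text = llm.generate([prompt])[0].outputs[0].text
    ft.cleanup_gpu()
    return text

# Serve requests for different models back to back on the same GPU.
print(answer(MODELS[0], "Summarize hot-swapping in one sentence."))
print(answer(MODELS[1], "Write a haiku about fast model loads."))
```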
Use cases:
- Affordable Personalized AI
- Serverless AI Inference
- On-Prem Deployments
- Robotics
- Local Inference
Our goal is to make it dead simple to run large models on any device without waiting minutes for startup. flashtensors offers a simple SDK that wraps SOTA inference engines like vLLM (more engines coming soon) and lets you switch models with ease. It also includes an interactive CLI for running models from the terminal, and we are working on a PyTorch-compatible API to make it easy to adapt custom models to fast loading.
```bash
pip install git+https://github.com/leoheuler/flashtensors.git
```

```bash
# Start the daemon server
flash start

# Pull the model of your preference
flash pull Qwen/Qwen3-0.6B

# Run the model
flash run Qwen/Qwen3-0.6B "Hello world"
```

```python
import flashtensors as ft
from vllm import SamplingParams
import time
ft.shutdown_server() # Ensure any existing server is shut down
ft.configure(
    storage_path="/tmp/models",     # Where models will be stored
    mem_pool_size=1024**3*30,       # 30GB memory pool (GPU size)
    chunk_size=1024**2*32,          # 32MB chunks
    num_threads=4,                  # Number of threads
    gpu_memory_utilization=0.8,     # Use 80% of GPU memory
    server_host="0.0.0.0",          # gRPC server host
    server_port=8073                # gRPC server port
)
ft.activate_vllm_integration()
# Step 2: Transform a model to fast-loading format
model_id = "Qwen/Qwen3-0.6B"
result = ft.register_model(
    model_id=model_id,
    backend="vllm",          # We should have an "auto" backend option
    torch_dtype="bfloat16",
    force=False,             # Don't overwrite if already exists
    hf_token=None            # Add HuggingFace token if needed for private models
)
# Step 3: Load model with ultra-fast loading
print(f"\nβ‘ Loading model {model_id} with fast loading...")
load_start_time = time.time()
llm = ft.load_model(
    model_id=model_id,
    backend="vllm",
    dtype="bfloat16",
    gpu_memory_utilization=0.8
)
load_time = time.time() - load_start_time
print(f"β
Model loaded successfully with fast loading in {load_time:.2f}s")
# Step 4: Use the model for inference
print("\nπ€ Running inference...")
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.1, top_p=0.95, max_tokens=50)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f" Prompt: {prompt!r}")
    print(f" Generated: {generated_text!r}")
    print()
# Step 5: Clean up GPU memory
ft.cleanup_gpu()
# Step 6: Show model information
print("\nπ Model information:")
info = ft.get_model_info(model_id)
if info:
    print(f" Model ID: {info['model_id']}")
    print(f" Backend: {info['backend']}")
    print(f" Size: {info['size'] / (1024**3):.2f} GB")
    print(f" Ranks: {info['ranks']}")
# Step 7: List all models
print("\nπ All available models:")
models = ft.list_models()
for model_key, model_info in models.items():
    print(f" {model_key}: {model_info['size'] / (1024**3):.2f} GB")
```

```python
import torch
import torch.nn as nn

from flashtensors import flash
class SimpleModel(nn.Module):
    def __init__(self, size=(3,3)):
        super(SimpleModel, self).__init__()
        # Create a single parameter tensor of shape (3, 3)
        self.weight = nn.Parameter(torch.randn(*size))

    def forward(self, x):
        return x @ self.weight  # Simple matrix multiplication
model = SimpleModel()
state_dict = model.state_dict()
# Save your state dict
flash.save_dict(state_dict, "/your/model/folder")
# Load your state dict blazing fast
device_map = {"":0}
new_state_dict = flash.load_dict("/your/model/folder", device_map)
```

flashtensors drastically reduces coldstart times compared to alternative loaders like safetensors.
| Model | flashtensors (sllm) ⚡ (s) | safetensors (mmap) (s) | Speedup |
|---|---|---|---|
| Qwen/Qwen3-0.6B | 2.74 | 11.68 | ~4.3× |
| Qwen/Qwen3-4B | 2.26 | 8.54 | ~3.8× |
| Qwen/Qwen3-8B | 2.57 | 9.08 | ~3.5× |
| Qwen/Qwen3-14B | 3.02 | 12.91 | ~4.3× |
| Qwen/Qwen3-32B | 4.08 | 24.05 | ~5.9× |
(Results measured on H100 GPUs using NVLink)
⚡ Average speedup: ~4–6× faster model loads
Coldstarts stay consistently under 5 seconds, even for 32B parameter models.
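If you want to sanity-check these numbers on your own hardware, here is a rough timing sketch. It assumes you have already exported the same weights both with `flash.save_dict` (as in the PyTorch example above) and as a `.safetensors` file; the paths are placeholders, and results will vary with your SSD, PCIe/NVLink bandwidth, and GPU.

```python
import time
import torch
from flashtensors import flash
from safetensors.torch import load_file

FLASH_DIR = "/your/model/folder"                    # written earlier with flash.save_dict
SAFETENSORS_FILE = "/your/model/model.safetensors"  # same weights in safetensors format

def timed(label, fn):
    # Wall-clock the loader, synchronizing so GPU copies are included in the timing.
    torch.cuda.synchronize()
    start = time.time()
    result = fn()
    torch.cuda.synchronize()
    print(f"{label}: {time.time() - start:.2f}s")
    return result

# Load the same state dict onto GPU 0 with both loaders and compare.
timed("flashtensors", lambda: flash.load_dict(FLASH_DIR, {"": 0}))
timed("safetensors (mmap)", lambda: load_file(SAFETENSORS_FILE, device="cuda:0"))
```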
Roadmap:
- Run benchmarks on a wider range of hardware
- Docker Integration
- Inference Server
- SGLang Integration
- LlamaCPP Integration
- Dynamo Integration
- Ollama Integration
Credits:
- Inspired by and adapted from the great work of ServerlessLLM