A fast CLI for LLM model merging, format conversion, and diffing. Written in Rust. Reads safetensors natively. Compatible with mergekit configs.
cargo install --path .

Requires Rust (any recent stable version).
# Merge two models
alloy merge config.yaml --output ./merged
# Compare two models tensor-by-tensor
alloy diff ./model_a ./model_b
# Convert dtype (e.g. FP32 to FP16)
alloy convert ./model --output ./model_f16 --dtype f16
# Inspect model metadata
alloy info ./model

| Method | Description |
|---|---|
| linear | Weighted average of N models |
| slerp | Spherical interpolation between 2 models |
| nuslerp | Multi-model SLERP via sequential pairwise interpolation |
| task_arithmetic | Base + scaled sum of task vectors |
| ties | Trim, elect sign, disjoint merge |
| dare_linear | Random dropout + rescaled linear merge |
| dare_ties | DARE dropout + TIES sign election |
| della_linear | Magnitude-aware dropout + linear merge |
| della | Magnitude-aware dropout + TIES sign election |
| passthrough | Concatenate layer ranges from different models |
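To make the one-line descriptions concrete, here is a minimal sketch of three of the methods (`linear`, `slerp`, `ties`) on flat `f32` slices. It is an illustration under simplifying assumptions, not alloy's actual implementation: the function names and signatures are hypothetical, tensors are treated as plain vectors, TIES uses unweighted averaging, and `density` (the fraction of each task vector kept after trimming) borrows mergekit's parameter name.

```rust
/// Weighted average of N equally sized tensors (weights are normalized).
fn linear(tensors: &[&[f32]], weights: &[f32]) -> Vec<f32> {
    assert_eq!(tensors.len(), weights.len());
    let total: f32 = weights.iter().sum();
    let mut out = vec![0.0f32; tensors[0].len()];
    for (t, &w) in tensors.iter().zip(weights) {
        assert_eq!(t.len(), out.len());
        for (o, &x) in out.iter_mut().zip(t.iter()) {
            *o += w / total * x;
        }
    }
    out
}

/// Spherical interpolation between two tensors treated as flat vectors.
fn slerp(a: &[f32], b: &[f32], t: f32) -> Vec<f32> {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    let theta = (dot / (na * nb)).clamp(-1.0, 1.0).acos();
    // Nearly parallel (or zero-norm, where theta is NaN): fall back to lerp.
    if !(theta.sin().abs() > 1e-6) {
        return a.iter().zip(b).map(|(x, y)| (1.0 - t) * x + t * y).collect();
    }
    let wa = ((1.0 - t) * theta).sin() / theta.sin();
    let wb = (t * theta).sin() / theta.sin();
    a.iter().zip(b).map(|(x, y)| wa * x + wb * y).collect()
}

/// Simplified TIES: trim task vectors, elect a sign, average agreeing deltas.
fn ties(base: &[f32], models: &[&[f32]], density: f32) -> Vec<f32> {
    let n = base.len();
    let keep = ((n as f32) * density).ceil() as usize;
    // 1. Trim: keep only the largest-magnitude entries of each task vector.
    let trimmed: Vec<Vec<f32>> = models
        .iter()
        .map(|m| {
            let mut delta: Vec<f32> = m.iter().zip(base).map(|(x, b)| x - b).collect();
            let mut mags: Vec<f32> = delta.iter().map(|d| d.abs()).collect();
            mags.sort_by(|a, b| b.total_cmp(a)); // descending
            let threshold = if keep == 0 { f32::INFINITY } else { mags[keep.min(n) - 1] };
            for d in delta.iter_mut() {
                if d.abs() < threshold {
                    *d = 0.0;
                }
            }
            delta
        })
        .collect();
    (0..n)
        .map(|i| {
            // 2. Elect sign: the sign of the summed trimmed deltas.
            let sign = trimmed.iter().map(|d| d[i]).sum::<f32>().signum();
            // 3. Disjoint merge: average only the deltas that agree with it.
            let (acc, cnt) = trimmed
                .iter()
                .map(|d| d[i])
                .filter(|d| *d != 0.0 && d.signum() == sign)
                .fold((0.0f32, 0u32), |(a, c), d| (a + d, c + 1));
            if cnt == 0 { base[i] } else { base[i] + acc / cnt as f32 }
        })
        .collect()
}
```

The DARE and DELLA variants differ mainly in step 1: they drop entries randomly (or with magnitude-aware probabilities) and rescale the survivors instead of trimming by magnitude.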
alloy reads mergekit-compatible YAML configs:
merge_method: slerp
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: mistralai/Mistral-7B-v0.1
  - model: teknium/OpenHermes-2.5-Mistral-7B
parameters:
  t: 0.5
dtype: float16

HuggingFace model IDs work directly. Models download to ~/.cache/huggingface/hub/. Set HF_TOKEN for gated models (Llama, Gemma, etc.). PyTorch .bin files are auto-converted on first use.
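If you want to locate a downloaded model on disk, the sketch below shows how a model ID maps to a cache folder. It assumes alloy follows the standard `huggingface_hub` cache layout (`models--<org>--<name>`, with files under `snapshots/<revision>/`), which this README does not state explicitly. `hub_cache_dir` and `hf_token` are hypothetical helper names, not part of alloy.

```rust
use std::path::PathBuf;

/// Cache folder for a model ID, assuming the standard `huggingface_hub`
/// layout: `<home>/.cache/huggingface/hub/models--<org>--<name>`.
fn hub_cache_dir(home: &str, model_id: &str) -> PathBuf {
    let folder = format!("models--{}", model_id.replace('/', "--"));
    PathBuf::from(home)
        .join(".cache")
        .join("huggingface")
        .join("hub")
        .join(folder)
}

/// Read the access token for gated models from the environment, if set.
fn hf_token() -> Option<String> {
    std::env::var("HF_TOKEN").ok()
}
```

For example, `hub_cache_dir("/home/user", "mistralai/Mistral-7B-v0.1")` points at `/home/user/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1`.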
Benchmarks use real models on an Azure Standard_E48s_v5 VM (48 cores, 384 GB RAM). All times are wall-clock, measured with hyperfine (3 runs, with warmup).
7B - Mistral-7B-v0.1 (BF16, 14.48 GB)
| Method | alloy | mergekit | alloy vs. mergekit |
|---|---|---|---|
| linear | 7.3 s | 15.4 s | 2.1x faster |
| slerp | 9.5 s | 12.9 s | 1.4x faster |
| ties | 23.3 s | 13.1 s | 1.8x slower |
| dare_ties | 38.0 s | 15.3 s | 2.5x slower |
14B - Qwen2.5-14B (BF16, 29.54 GB)
| Method | alloy | mergekit | alloy vs. mergekit |
|---|---|---|---|
| linear | 14.1 s | 25.8 s | 1.8x faster |
| slerp | 18.5 s | 22.1 s | 1.2x faster |
| ties | 44.9 s | 21.9 s | 2.0x slower |
| dare_ties | 75.7 s | 27.3 s | 2.8x slower |
alloy uses fused SIMD kernels (AVX2/NEON) that read BF16 directly from memory-mapped safetensors and compute in f32 registers, avoiding intermediate allocations. I/O-bound methods (linear, slerp) are consistently faster than mergekit. Compute-heavy methods (ties, dare_ties) are still slower, because mergekit relies on PyTorch's optimized C++ tensor kernels.
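The idea behind a fused kernel can be shown in scalar form. The sketch below is not alloy's SIMD code: it is a plain loop with hypothetical function names, operating on BF16 values stored as raw `u16` bits. It widens each input to `f32`, combines, and narrows back to BF16 in a single pass, so no intermediate `f32` tensor is ever allocated.

```rust
/// Widen BF16 (raw bits) to f32: BF16 is the upper 16 bits of an f32.
fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Narrow f32 to BF16 bits with round-to-nearest-even.
/// NaN inputs are not special-cased in this sketch.
fn f32_to_bf16(x: f32) -> u16 {
    let bits = x.to_bits();
    let rounding = 0x7FFF + ((bits >> 16) & 1);
    (bits.wrapping_add(rounding) >> 16) as u16
}

/// Fused two-model linear merge: read BF16, compute in f32, write BF16.
fn fused_linear_bf16(a: &[u16], b: &[u16], wa: f32, wb: f32, out: &mut [u16]) {
    assert!(a.len() == b.len() && a.len() == out.len());
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        *o = f32_to_bf16(wa * bf16_to_f32(x) + wb * bf16_to_f32(y));
    }
}
```

A SIMD version would process 8 (AVX2) or 4 (NEON) lanes per iteration with the same widen, compute, narrow structure. In a memory-mapped setup the input slices would be views into the safetensors file rather than heap copies.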
See alloy.how for the full technical writeup including architecture, algorithm breakdowns, and future work.