Reusable Python package for unsupervised embedding-space alignment.
Provides two aligners that learn to map embeddings from one model's space to another without paired data:
| Aligner | Method | Speed | Dependencies |
|---|---|---|---|
| `MiniVec2Vec` | Linear (orthogonal Procrustes + ICP) | Fast | torch, scipy, scikit-learn |
| `Vec2Vec` | GAN-based (cycle-consistent) | Slower | above + accelerate, diffusers |

The two aligners are based on the following papers:
- vec2vec: Universal Geometry Alignment (Jha et al., 2025)
- mini-vec2vec: Scaling Universal Geometry Alignment with Linear Transformations (Dar, 2025)
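
For intuition about the linear method named in the table, the orthogonal Procrustes step can be sketched with SciPy. This is an illustration only, not the package's code. It assumes the rows of the two matrices are already paired and that both spaces have the same dimensionality; the package itself works on unpaired data, so the pairing has to be estimated (the ICP part of the method name). The toy sizes and variable names below are made up for the example.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)

# Toy data: 200 points in 16 dimensions (sizes chosen only for illustration).
X = rng.standard_normal((200, 16))

# Build a hidden orthogonal map Q via QR, then apply it to X to get Y.
Q, _ = np.linalg.qr(rng.standard_normal((16, 16)))
Y = X @ Q

# Solve min_R ||X @ R - Y||_F subject to R being orthogonal.
R, _ = orthogonal_procrustes(X, Y)

# With exact pairs and no noise, R should reproduce the hidden map.
recovery_error = np.linalg.norm(X @ R - Y)
is_orthogonal = np.allclose(R.T @ R, np.eye(16), atol=1e-8)
```

Because `R` is constrained to be orthogonal, it preserves norms and angles, which is why a map of this kind can only rotate or reflect one space onto the other rather than distort it.
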
```bash
# Core (MiniVec2Vec only)
pip install git+https://github.com/skojaku/vec2vec.git

# With GAN-based Vec2Vec support
pip install "git+https://github.com/skojaku/vec2vec.git#egg=vec2vec[gan]"
```

```python
import numpy as np
from vec2vec import MiniVec2Vec

src_embs = np.random.randn(5000, 768).astype("float32")   # e.g. from model A
tgt_embs = np.random.randn(5000, 1024).astype("float32")  # e.g. from model B

aligner = MiniVec2Vec(epochs=100, bs=1000, seed=42)
aligner.fit(src_embs, tgt_embs)  # unpaired: N and M can differ

aligned = aligner.transform(src_embs)                     # (5000, 1024)
back = aligner.transform(tgt_embs, src="tgt", tgt="src")  # (5000, 768)

aligner.save("./my_aligner")
aligner2 = MiniVec2Vec.load("./my_aligner")
```

```python
from vec2vec import Vec2Vec

aligner = Vec2Vec(epochs=100, bs=256, normalize_embeddings=True)
aligner.fit(src_embs, tgt_embs)
aligned = aligner.transform(src_embs)
```

| Parameter | Default | Description |
|---|---|---|
| `epochs` | 100 | ICP refinement iterations |
| `bs` | 1000 | Sample size per ICP iteration |
| `icp_k` | 50 | Nearest neighbours averaged per ICP step |
| `n_anchor_runs` | 30 | KMeans+QAP runs to collect anchors |
| `n_anchors` | 20 | Clusters per anchor run |
| `n_qap_repeats` | 30 | QAP restarts per run |
| `anchor_subsample` | 10000 | Points subsampled for KMeans per run |
| `anchor_match_k` | 50 | Top-k tgt points for soft anchor matching |
| `cluster_refine_iters` | 1 | Cluster-correction passes |
| `n_cluster_centers` | 500 | KMeans clusters in correction pass |
| `normalize_embeddings` | True | L2-normalise inputs |
| `seed` | 42 | Random seed |
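
To make the roles of `bs` and `icp_k` more concrete, here is a rough sketch of what a single ICP refinement step could look like. It is a guess at the general shape of such a step, not the package's implementation: `icp_step` is a hypothetical helper, both spaces are assumed to have the same dimensionality, and inputs are assumed to be L2-normalised already.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes


def icp_step(src, tgt, R, bs=1000, icp_k=50, rng=None):
    """One illustrative ICP step (hypothetical helper, not the package API)."""
    rng = np.random.default_rng() if rng is None else rng

    # Sample a batch of source points (the role `bs` plays in the table).
    idx = rng.choice(len(src), size=min(bs, len(src)), replace=False)
    batch = src[idx]

    # Map the batch with the current estimate and score it against all targets.
    sims = (batch @ R) @ tgt.T

    # Indices of the k most similar target points for each mapped source point.
    k = min(icp_k, len(tgt))
    nn = np.argpartition(-sims, k - 1, axis=1)[:, :k]

    # Average those neighbours to form one pseudo-target per source point.
    pseudo = tgt[nn].mean(axis=1)

    # Refit the orthogonal map on the (source, pseudo-target) pairs.
    R_new, _ = orthogonal_procrustes(batch, pseudo)
    return R_new


# Toy run: sizes and the identity initialisation are for demonstration only.
rng = np.random.default_rng(0)
d = 8
tgt = rng.standard_normal((300, d))
tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
src = tgt @ Q.T  # constructed so that src @ Q equals tgt

R = np.eye(d)
for _ in range(5):
    R = icp_step(src, tgt, R, bs=200, icp_k=10, rng=rng)
```

ICP of this kind generally needs a reasonable starting map to converge to a good solution, so the identity initialisation above should not be expected to recover `Q`. Larger `icp_k` values smooth the pseudo-targets at the cost of blurring fine structure, and larger `bs` values make each step more stable but slower.
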