skojaku/vec2vec

vec2vec

Reusable Python package for unsupervised embedding-space alignment.

Provides two aligners that learn to map embeddings from one model's space to another without paired data:

| Aligner | Method | Speed | Dependencies |
|---|---|---|---|
| MiniVec2Vec | Linear (orthogonal Procrustes + ICP) | Fast | torch, scipy, scikit-learn |
| Vec2Vec | GAN-based (cycle-consistent) | Slower | MiniVec2Vec dependencies plus accelerate, diffusers |
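For readers unfamiliar with the orthogonal Procrustes step named in the table, here is a minimal NumPy sketch of the textbook closed-form solution. It is not taken from the package: the helper name `orthogonal_procrustes` is illustrative, and the sketch assumes row-paired inputs of equal dimension, whereas MiniVec2Vec works on unpaired data and presumably obtains its pairs from the anchor and ICP stages.

```python
import numpy as np

def orthogonal_procrustes(src, tgt):
    """Return the orthogonal matrix R minimising ||src @ R - tgt||_F.

    Illustrative helper (not the package API); rows of src and tgt
    are assumed to be paired and to have the same dimension.
    """
    # The SVD of the cross-covariance gives the closed form R = U @ Vt.
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

rng = np.random.default_rng(0)
src = rng.standard_normal((200, 8))

# Build a random orthogonal "ground truth" map from a QR decomposition.
true_rot, _ = np.linalg.qr(rng.standard_normal((8, 8)))
tgt = src @ true_rot

rot = orthogonal_procrustes(src, tgt)
max_error = float(np.abs(src @ rot - tgt).max())  # close to zero for noiseless pairs
```

Because the pairs here are exact and noiseless, the recovered matrix reproduces the target up to floating-point error. With noisy or partly wrong pairs the same formula returns the best orthogonal fit in the least-squares sense.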

Both are based on the papers:

Installation

# Core (MiniVec2Vec only)
pip install git+https://github.com/skojaku/vec2vec.git

# With GAN-based Vec2Vec support
# (PEP 508 direct-reference form; recent pip releases no longer accept extras in "#egg=" fragments)
pip install "vec2vec[gan] @ git+https://github.com/skojaku/vec2vec.git"

Quick start

MiniVec2Vec (recommended)

import numpy as np
from vec2vec import MiniVec2Vec

src_embs = np.random.randn(5000, 768).astype("float32")   # e.g. from model A
tgt_embs = np.random.randn(5000, 1024).astype("float32")  # e.g. from model B

aligner = MiniVec2Vec(epochs=100, bs=1000, seed=42)
aligner.fit(src_embs, tgt_embs)          # unpaired: src and tgt row counts may differ

aligned = aligner.transform(src_embs)   # (5000, 1024)
back    = aligner.transform(tgt_embs, src="tgt", tgt="src")  # (5000, 768)

aligner.save("./my_aligner")
aligner2 = MiniVec2Vec.load("./my_aligner")
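If a small set of paired embeddings happens to be available for evaluation, one simple way to sanity-check the output of `transform` is top-1 retrieval accuracy under cosine similarity. The sketch below uses only NumPy and synthetic arrays; `top1_accuracy` is a hypothetical helper, not part of vec2vec, and it assumes row i of the aligned array corresponds to row i of the target array.

```python
import numpy as np

def top1_accuracy(aligned, tgt):
    """Fraction of rows whose most cosine-similar target row has the same index.

    Hypothetical evaluation helper; assumes aligned[i] should match tgt[i].
    """
    a = aligned / np.linalg.norm(aligned, axis=1, keepdims=True)
    t = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    nearest = (a @ t.T).argmax(axis=1)
    return float((nearest == np.arange(len(a))).mean())

rng = np.random.default_rng(0)
tgt_eval = rng.standard_normal((100, 16))

# A perfect alignment reproduces the targets exactly.
acc_perfect = top1_accuracy(tgt_eval.copy(), tgt_eval)

# Reversing the row order mismatches every pair (100 rows, so no row stays in place).
acc_reversed = top1_accuracy(tgt_eval[::-1], tgt_eval)
```

In practice `aligned` would be the array returned by `aligner.transform(...)` for the paired evaluation rows, and `tgt` the corresponding target-model embeddings.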

Vec2Vec (GAN-based)

from vec2vec import Vec2Vec

aligner = Vec2Vec(epochs=100, bs=256, normalize_embeddings=True)
aligner.fit(src_embs, tgt_embs)   # same arrays as in the MiniVec2Vec example
aligned = aligner.transform(src_embs)

MiniVec2Vec parameters

| Parameter | Default | Description |
|---|---|---|
| epochs | 100 | ICP refinement iterations |
| bs | 1000 | Sample size per ICP iteration |
| icp_k | 50 | Nearest neighbours averaged per ICP step |
| n_anchor_runs | 30 | KMeans+QAP runs to collect anchors |
| n_anchors | 20 | Clusters per anchor run |
| n_qap_repeats | 30 | QAP restarts per run |
| anchor_subsample | 10000 | Points subsampled for KMeans per run |
| anchor_match_k | 50 | Top-k tgt points for soft anchor matching |
| cluster_refine_iters | 1 | Cluster-correction passes |
| n_cluster_centers | 500 | KMeans clusters in correction pass |
| normalize_embeddings | True | L2-normalise inputs |
| seed | 42 | Random seed |
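To make the roles of `epochs`, `bs`, `icp_k`, `normalize_embeddings` and `seed` more concrete, here is a generic ICP-style refinement loop in NumPy. It is a sketch of the general technique under stated assumptions, not the package's implementation: the function names are illustrative, both spaces are assumed to share one dimension so the map can be a square orthogonal matrix, and the mapping of each parameter onto a step is inferred from the descriptions in the table.

```python
import numpy as np

def l2_normalize(x):
    """Scale every row to unit length (cf. normalize_embeddings)."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def icp_refine(src, tgt, rot, epochs=100, bs=1000, icp_k=50, seed=42):
    """Generic ICP-style refinement of an orthogonal map (illustrative only)."""
    rng = np.random.default_rng(seed)
    src, tgt = l2_normalize(src), l2_normalize(tgt)
    for _ in range(epochs):
        # Sample up to `bs` source points for this iteration.
        idx = rng.choice(len(src), size=min(bs, len(src)), replace=False)
        batch = src[idx]
        # Cosine similarity between mapped source points and all targets.
        sims = (batch @ rot) @ tgt.T
        # Average the `icp_k` most similar targets into one pseudo-target per point.
        k = min(icp_k, len(tgt))
        nearest = np.argsort(-sims, axis=1)[:, :k]
        pseudo_tgt = tgt[nearest].mean(axis=1)
        # Refit the orthogonal map on the pseudo-pairs (Procrustes via SVD).
        u, _, vt = np.linalg.svd(batch.T @ pseudo_tgt)
        rot = u @ vt
    return rot

rng = np.random.default_rng(0)
dim = 8
src = rng.standard_normal((300, dim))
true_rot, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
tgt = rng.permutation(src @ true_rot)  # rows shuffled, so the pairing is unknown

# Start from a slightly perturbed copy of the true map, projected back
# onto the orthogonal matrices with an SVD.
u0, _, vt0 = np.linalg.svd(true_rot + 0.05 * rng.standard_normal((dim, dim)))
refined = icp_refine(src, tgt, u0 @ vt0, epochs=20, bs=200, icp_k=1)
```

ICP of this kind is a local method with no general convergence guarantee, so the quality of the starting map matters; the anchor-related parameters in the table presumably exist to produce such a starting point.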
