Skip to content

rdilip/kanzi

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Kanzi: Flow Autoencoders for Protein Tokenization

Kanzi schematic Kanzi is a family of discrete tokenizers for modeling biological structures.
It is parameter-efficient (30M), fast to tokenize, and does not rely on complex SE(3)-invariant architectures.


Installation

pip install .

or with uv:

uv pip install -e .

Quick Start

First, download a pretrained Kanzi checkpoint

from kanzi import DAE, kabsch_rmsd
import fastpdb
import torch

device = "cuda:0"
model = DAE.from_pretrained("checkpoints/cleaned_model.pt").to(device).eval()

# Load PDB data
arr = fastpdb.PDBFile.read("pdbs/3bg1B01.pdb").get_structure(model=1)
arr = arr[arr.atom_name == "CA"].coord.reshape(1, -1, 3)
arr = torch.from_numpy(arr).to(device)

with torch.no_grad():
    *_, idx = model.encode(arr, preprocess=True)
    recon = model.decode(idx)

# Multiply by 10 to convert from Å to nm
print("Reconstruction error:", kabsch_rmsd(recon.cpu() * 10, arr.cpu()))

Usage Notes

  • Kanzi’s encode and decode operate in nanometers. If you read coordinates from a PDB file (typically in Å), divide by 10.
  • Inputs must be zero-centered. You can use preprocess=True in model.encode() to handle both centering and unit conversion.
  • Batch encoding is not yet supported.
  • Current release provides Cα-only tokenizers (full backbone support coming soon).

Encoding

The encoder is a shallow transformer with sliding-window attention, making it both efficient and fast. The main point to be aware of is the encoding requires coordinates to be expressed in angstroms and mean centered. The following snippet describes what you need for a valid input.

coords = torch.randn(1, 100, 3)  # fake Cα-only data
coords = (coords - coords.mean(dim=-2, keepdim=True)) / 10.0

Decoding

model.decode() accepts two key parameters:

  • noise_weight
  • score_weight (recommended: 1.0)

Adjusting score_weight controls exploration by increasing the noise scale in the SDE sampler.

⚠️ Important: Some models are not trained with classifier-free guidance. In those cases, setting cfg_weight != 1 will lead to very poor reconstructions.


Generation

Coming soon!


Training

Coming soon! In the interim, we use the standard AFDB-Foldseek clustered dataset with all proteins smaller than 256 residues.


Citation

Kanzi uses code from several other packages, notably Proteina and torchCFM.

If you use Kanzi, please cite:

@article{kanzi2025,
  title   = {Kanzi: Flow Autoencoders are Effective Protein Tokenizers},
  author  = {Rohit Dilip, Evan Zhang, Ayush Varshney, David Van Valen},
  journal = {arXiv preprint arxiv:2510.00351},
  year    = {2025},
}

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages