tahoebio/tahoe-x1

Tahoe-x1: Scaling Perturbation-Trained Single-Cell Foundation Models to 3 Billion Parameters

📄 Paper | 🤗 HuggingFace | 🚀 Getting Started | 🧑‍🏫 Tutorials

Tahoe-x1 (Tx1) is a family of perturbation-trained single-cell foundation models with up to 3 billion parameters, developed by Tahoe Therapeutics. Tx1 is pretrained on large-scale single-cell transcriptomic datasets including the Tahoe-100M perturbation compendium, and fine-tuned for cancer-relevant tasks. Through architectural optimizations and efficient training strategies, Tx1 achieves 3–30× higher compute efficiency than prior implementations while delivering state-of-the-art performance across disease-relevant benchmarks.

πŸ“ Repository Structure

This repository follows a similar structure to llm-foundry and imports several utility functions from it.

tahoe-x1/
├── tahoe_x1/                    # Core Tahoe-x1 library
│   ├── model/
│   │   ├── blocks/           # Building block modules used across models
│   │   └── model/            # Full architecture subclassed from ComposerModel
│   ├── tasks/                # Helper functions for downstream tasks
│   ├── tokenizer/            # Vocabulary building and tokenization functions
│   ├── data/                 # Data loaders and collators
│   └── utils/                # Utility functions
├── scripts/
│   ├── train.py              # Training script
│   ├── depmap/               # DepMap benchmark scripts
│   ├── msigdb/               # MSigDB pathway benchmark scripts
│   ├── state_transition/     # State transition prediction scripts
│   ├── data_prep/            # Dataset preparation scripts
│   └── inference/            # Inference utilities
│       ├──predict_embeddings.py            # Embedding extraction script
│       └──prepare_for_inference.py         # Prepares model for inference
├── tutorials/                # Jupyter notebook tutorials
│   ├── clustering_tutorial.ipynb  # Cell clustering and UMAP visualization
│   └── training_tutorial.ipynb    # Training walkthrough
└── configs/
    ├──runai/                 # RunAI configuration files
    ├──mcli/                  # MosaicML platform configuration files
    ├──gcloud/                # Google Cloud configuration files
    └──test_run.yaml          # Sample config file

🚀 Installation

Docker installation provides better reproducibility and avoids dependency conflicts.

Docker Installation (Recommended)

# Clone the repository
git clone https://github.com/tahoebio/tahoe-x1.git
cd tahoe-x1

# Pull the latest Docker image with all the dependencies pre-installed
docker pull ghcr.io/tahoebio/tahoe-x1:latest

# Start an interactive container with GPU support
# Note that nvidia-container-toolkit is required for this step
# --shm-size allocates shared memory inside the container; a large value is
# needed to use dataloaders with multiple workers
# /bin/bash starts an interactive Bash shell in the container
docker run -it --rm \
  --gpus all \
  --shm-size=64g \
  -v "$(pwd)":/workspace \
  -w /workspace \
  ghcr.io/tahoebio/tahoe-x1:latest \
  /bin/bash

# Inside the container, install the Tahoe-x1 package (dependencies are pre-installed)
pip install -e . --no-deps

The Docker image includes all necessary dependencies, including PyTorch, the CUDA toolkit, and flash-attention. The NVIDIA driver itself is provided by the host through nvidia-container-toolkit.

Native Installation (Alternative)

For direct installation without Docker, we recommend using uv for dependency management:

# Clone the repository
git clone https://github.com/tahoebio/tahoe-x1.git
cd tahoe-x1

# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate virtual environment
uv venv tx1
source tx1/bin/activate

# Install the package with dependencies
uv pip install -e . --no-build-isolation-package flash-attn

# Alternatively, install the latest stable release from PyPI
# uv pip install tahoe-x1 --no-build-isolation-package flash-attn

Note: Native installation requires compatible CUDA drivers and may encounter dependency conflicts. Docker installation is recommended for the best experience.

Quickstart

Launch a Tx1 pre-training run on Tahoe-100M. The dataset is streamed as needed, so there is no need to wait for a download. Experiment with the options in configs/test_run.yaml to configure your architecture and hyperparameters.

composer scripts/train.py \
  configs/test_run.yaml

You can also pass individual configuration parameters directly from the command line:

composer scripts/train.py \
  configs/test_run.yaml \
  model.d_model=512 \
  model.n_layers=12 \
  train_loader.num_workers=8 \
  max_duration=1000ba
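
Each key=value argument names a nested field of the YAML config by its dotted path. The sketch below illustrates that convention with plain dictionaries. It is not the parser used by scripts/train.py; the helper name apply_overrides and the base values are hypothetical.

```python
import ast
import copy


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of `config` with dotted `key=value` overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Walk (or create) the nested dictionaries along the dotted path.
            node = node.setdefault(name, {})
        try:
            # Interpret numbers and booleans; keep anything else as a string.
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            value = raw_value
        node[leaf] = value
    return result


# Hypothetical base values standing in for configs/test_run.yaml.
base = {"model": {"d_model": 256, "n_layers": 4}, "train_loader": {"num_workers": 2}}
merged = apply_overrides(
    base,
    ["model.d_model=512", "model.n_layers=12", "train_loader.num_workers=8", "max_duration=1000ba"],
)
print(merged)
```

Values such as 1000ba are not valid Python literals, so the sketch keeps them as strings and leaves their interpretation to the training framework.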

💻 System Requirements & Training Capabilities

Tahoe-x1 is built natively on Composer and llm-foundry, inheriting their full suite of large-scale training capabilities:

Hardware Requirements

  • GPU: NVIDIA Ampere (A100) or newer for FlashAttention support
  • CUDA: Version 12.1+
  • Python: 3.10+

Advanced Training Features

The codebase leverages Composer's state-of-the-art training stack, configurable via YAML:

  • Automatic micro-batching for optimal memory utilization
  • Mixed precision training with BF16/FP16, plus FP8 support on Hopper (H100) and newer GPUs
  • Multi-GPU and multi-node distributed training with FSDP (Fully Sharded Data Parallel)
  • Gradient accumulation and checkpointing for training larger models on limited hardware
  • Advanced optimizers and schedulers from the LLM training ecosystem
  • Streaming datasets for efficient data loading at scale

This infrastructure supports training models from 70M to 3B parameters and can scale to larger architectures.
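
To make the relationship between micro-batching and gradient accumulation concrete, the sketch below computes how many micro-batches each GPU processes per optimizer step. The function name and the numbers are illustrative assumptions, not part of Composer's API or of any Tx1 config.

```python
def accumulation_steps(global_batch_size: int, num_gpus: int, microbatch_size: int) -> int:
    """Number of micro-batches each GPU processes per optimizer step."""
    if global_batch_size % num_gpus != 0:
        raise ValueError("global batch size must divide evenly across GPUs")
    per_gpu_batch = global_batch_size // num_gpus
    # Ceiling division: the last micro-batch may be smaller than the rest.
    return -(-per_gpu_batch // microbatch_size)


# Hypothetical setup: 512 samples per step, 8 GPUs, 16 samples per micro-batch.
steps = accumulation_steps(global_batch_size=512, num_gpus=8, microbatch_size=16)
print(steps)
```

A smaller micro-batch lowers peak activation memory at the cost of more forward and backward passes per step, which is the trade-off automatic micro-batching is meant to tune.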

Docker Images

We provide pre-built Docker images for ease of use:

| Image Name | Base Image | Description |
| --- | --- | --- |
| ghcr.io/tahoebio/tahoe-x1:1.0.0 | mosaicml/llm-foundry:2.2.1_cu121_flash2-813d596 | Current release image for Tahoe-x1 |

📊 Datasets

Tx1 was pretrained on 266 million single-cell profiles from three major sources. The following datasets were used for training and evaluation:

| Dataset | Description | Usage | Location |
| --- | --- | --- | --- |
| CellxGene 2025-01 | ~61M cells from Jan 2025 CellxGene release | Tx1-3B stage 1 Pre-training | s3://tahoe-hackathon-data/MFM/cellxgene_2025_01_21_merged_MDS/ |
| scBaseCamp 2025-02 | ~112M cells from Feb 2025 scBaseCamp release | Tx1-3B stage 1 Pre-training | s3://tahoe-hackathon-data/MFM/scbasecamp_2025_02_25_MDS_v2/ |
| Tahoe 100M | ~96M cells from Tahoe-100M | Tx1-3B stage 1 Pre-training | s3://tahoe-hackathon-data/MFM/tahoe_100m_MDS_v2/ |
| filtered CellxGene 2025-01 | ~43M filtered cells from Jan 2025 CellxGene release | Tx1-3B stage 2 Pre-training | s3://tahoe-hackathon-data/MFM/cellxgene_2025_01_21_merged_MDS_filtered/ |
| filtered scBaseCamp 2025-02 | ~76M filtered cells from Feb 2025 scBaseCamp release | Tx1-3B stage 2 Pre-training | s3://tahoe-hackathon-data/MFM/scbasecamp_2025_02_25_MDS_v2_filtered/ |
| filtered Tahoe 100M | ~34M filtered cells from Tahoe-100M | Tx1-3B stage 2 Pre-training | s3://tahoe-hackathon-data/MFM/tahoe_100m_MDS_v2_filtered/ |
| DepMap | Cancer cell line dependency data | DepMap Benchmark | s3://tahoe-hackathon-data/MFM/benchmarks/depmap/ |
| MSigDB | Pathway signature data | MSigDB Benchmark | s3://tahoe-hackathon-data/MFM/benchmarks/msigdb/ |

Filtered versions of the pre-training datasets above exclude cells with very few expressed genes and are used for stage 2 pre-training of Tx1-3B.
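
As an illustration of this kind of filter, the sketch below keeps only cells with at least a minimum number of expressed genes. The threshold and the toy count matrix are assumptions chosen for the example; the cutoff actually used for the filtered datasets is not stated here.

```python
def filter_cells(counts: list[list[int]], min_genes: int) -> list[int]:
    """Return indices of cells (rows) with at least `min_genes` non-zero genes."""
    kept = []
    for index, cell in enumerate(counts):
        n_expressed = sum(1 for value in cell if value > 0)
        if n_expressed >= min_genes:
            kept.append(index)
    return kept


# Toy cell-by-gene count matrix (rows are cells, columns are genes).
toy_counts = [
    [0, 3, 0, 1, 2],  # 3 expressed genes
    [0, 0, 0, 0, 1],  # 1 expressed gene
    [5, 1, 2, 0, 4],  # 4 expressed genes
]
kept = filter_cells(toy_counts, min_genes=3)  # hypothetical threshold
print(kept)  # [0, 2]
```

On real AnnData objects the same operation is usually done with scanpy.pp.filter_cells(adata, min_genes=...) rather than a hand-written loop.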

Public access to datasets: s3://tahoe-hackathon-data/MFM/benchmarks/

If you require access to datasets not available in the public bucket, please open a GitHub issue or contact the team.

For more information on dataset preparation, see scripts/data_prep/README.md.

🤖 Pre-trained Models

We provide pre-trained Tahoe-x1 models of various sizes:

| Model Name | Parameters | Context Length | Checkpoint Path | WandB ID | Config File |
| --- | --- | --- | --- | --- | --- |
| Tx1-3B | 3B | 2048 | s3://tahoe-hackathon-data/MFM/ckpts/3b/ | mygjkq5c | ./configs/mcli/tahoe_x1-3b-v2-cont-train.yaml |
| Tx1-1.3B | 1.3B | 2048 | s3://tahoe-hackathon-data/MFM/ckpts/1b/ | 26iormxc | ./configs/gcloud/tahoe_x1-1_3b-merged.yaml |
| Tx1-70M | 70M | 1024 | s3://tahoe-hackathon-data/MFM/ckpts/70m/ | ftb65le8 | ./configs/gcloud/tahoe_x1-70m-merged.yaml |

Model weights are also available as safetensors files on our 🤗 Hugging Face model card.

πŸ‹οΈ Training and Fine-tuning

Training from Scratch

You can start with configs/test_run.yaml, which is a sample configuration showing how to train the 70M model on the Tahoe100M dataset for a few iterations. Customize this configuration file for your own training runs.

Use the main training script with a YAML configuration file:

composer scripts/train.py -f configs/test_run.yaml

Or with command-line arguments:

composer scripts/train.py \
  --model_name tahoe_x1 \
  --data_path /path/to/data \
  --max_seq_len 2048 \
  --batch_size 32

Note that the current codebase only supports attn_impl: flash and use_attn_mask: False. The Triton backend and custom attention masks (used for training Tx1-1.3B and Tx1-70M) are no longer supported. If you have questions about using custom attention masks with the Triton backend, please contact us.

Fine-tuning

To fine-tune a pre-trained model on your own data:

  1. Download a pre-trained checkpoint from S3
  2. Modify the training configuration to load from checkpoint
  3. Prepare your dataset in the MDS format (see scripts/data_prep/README.md)
  4. Launch training with the --load_path argument:

python scripts/train.py \
  -f configs/finetune_config.yaml \
  --load_path s3://path/to/checkpoint

Launching runs on different platforms

For launching runs on specific platforms such as MosaicML, Google Cloud, or RunAI, refer to the corresponding configuration folders under configs/ and their respective README files.

Preparing Models for Inference

Package a trained model with its vocabulary and metadata by editing the configuration section in the script:

  1. Open scripts/inference/prepare_for_inference.py and update the configuration variables:

    • model_name: Your model name (e.g., "tx-3b-prod")
    • wandb_id: Your WandB run ID (e.g., "mygjkq5c")
    • wandb_project: Your WandB project name (e.g., "vevotx/tahoex")
    • save_dir: Output directory path
    • default_vocab_url: S3 URL for vocabulary file (e.g., "s3://tahoe-hackathon-data/MFM/vevo_v2_vocab.json")
  2. Run the script:

python scripts/inference/prepare_for_inference.py

The script will download the model config from WandB, process the vocabulary, and save inference-ready files to your specified output directory.

🧬 Generating Cell and Gene Embeddings

Quick Start with Inference Script

Extract cell embeddings from an AnnData object:

from omegaconf import OmegaConf as om
from scripts.inference.predict_embeddings import predict_embeddings

cfg = {
    "model_name": "Tx1-70m",
    "paths": {
        "hf_repo_id": "tahoebio/Tahoe-x1",
        "hf_model_size": "70m",
        "adata_input": "/path/to/your_data.h5ad",
    },
    "data": {
        "cell_type_key": "cell_type",
        "gene_id_key": "ensembl_id"
    },
    "predict": {
        "seq_len_dataset": 2048,
        "return_gene_embeddings": False,
    }
}

cfg = om.create(cfg)
adata = predict_embeddings(cfg)

# Access embeddings
cell_embeddings = adata.obsm["Tx1-70m"]
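
Once extracted, embeddings can be compared directly, for example with cosine similarity. The sketch below uses a small made-up matrix in place of adata.obsm["Tx1-70m"] so that it runs without any model or data; the vectors and the helper name are hypothetical, and real workflows would typically use NumPy or scanpy on the full matrix.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy stand-in for three rows of a cell embedding matrix.
toy_embeddings = [
    [1.0, 0.0, 1.0],
    [2.0, 0.0, 2.0],  # same direction as the first cell
    [0.0, 1.0, 0.0],  # orthogonal to the first cell
]
sim_same = cosine_similarity(toy_embeddings[0], toy_embeddings[1])
sim_diff = cosine_similarity(toy_embeddings[0], toy_embeddings[2])
print(round(sim_same, 6), round(sim_diff, 6))
```

Cosine similarity ignores vector magnitude, so two cells whose embeddings point in the same direction score close to 1 regardless of scale.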

Extracting Gene Embeddings

Set return_gene_embeddings: True in the configuration to extract gene-level representations.

📚 Tutorials and Benchmarks

Tutorials

Benchmarks

Tx1 achieves state-of-the-art performance across disease-relevant benchmarks. See our preprint for detailed results.

| Benchmark | Task | Code Location |
| --- | --- | --- |
| DepMap Essentiality | Predict broad and context-specific gene dependencies | scripts/depmap/ |
| MSigDB Hallmarks | Recover 50 hallmark pathway memberships from gene embeddings | scripts/msigdb/ |
| Cell-Type Classification | Classify cell types across 5 tissues (Tabula Sapiens 2.0) | cz-benchmarks |
| Perturbation Prediction | Predict transcriptional responses in held-out contexts | scripts/state_transition/ |

Additional Resources

🔧 Troubleshooting

Common Issues and Solutions

  • PyTorch/CUDA mismatch: Ensure PyTorch is installed with the correct CUDA version for your system
  • Docker permission denied: Run Docker commands with sudo or add your user to the docker group
  • OOM (Out of Memory): Ensure half precision and flash-attention are enabled, and set microbatch_size to auto
  • S3 access denied: For public buckets, the code will automatically retry with unsigned requests

For additional help, please open an issue on GitHub with:

  • Your system configuration (OS, GPU, PyTorch version)
  • Complete error message and stack trace
  • Steps to reproduce the issue

📄 Citation

If you use Tahoe-x1 in your research, please cite:

@article{gandhi2025tahoe,
  title        = {Tahoe-x1: Scaling Perturbation-Trained Single-Cell Foundation Models to 3 Billion Parameters},
  author       = {Gandhi, Shreshth and Javadi, Farnoosh and Svensson, Valentine and Khan, Umair and Jones, Matthew G. and Yu, Johnny and Merico, Daniele and Goodarzi, Hani and Alidoust, Nima},
  journal      = {bioRxiv},
  year         = {2025},
  doi          = {10.1101/2025.10.23.683759},
  url          = {https://www.biorxiv.org/content/10.1101/2025.10.23.683759v1},
  publisher    = {Cold Spring Harbor Laboratory}
}

πŸ™ Acknowledgements

We thank the developers of the following open-source projects:


For questions, issues, or collaboration inquiries, please open an issue on GitHub or write to us at admin@tahoebio.ai.
