Paper | HuggingFace | Getting Started | Tutorials
Tahoe-x1 (Tx1) is a family of perturbation-trained single-cell foundation models with up to 3 billion parameters, developed by Tahoe Therapeutics. Tx1 is pretrained on large-scale single-cell transcriptomic datasets, including the Tahoe-100M perturbation compendium, and fine-tuned for cancer-relevant tasks. Through architectural optimizations and efficient training strategies, Tx1 achieves 3–30× higher compute efficiency than prior implementations while delivering state-of-the-art performance across disease-relevant benchmarks.
- Repository Structure
- Installation
- Training Infrastructure
- Datasets
- Pre-trained Models
- Training and Fine-tuning
- Generating Cell and Gene Embeddings
- Tutorials and Benchmarks
- Developer Guidelines
- Acknowledgements
- License
This repository follows a similar structure to llm-foundry and imports several utility functions from it.
tahoe-x1/
├── tahoe_x1/                          # Core Tahoe-x1 library
│   ├── model/
│   │   ├── blocks/                    # Building block modules used across models
│   │   └── model/                     # Full architecture subclassed from ComposerModel
│   ├── tasks/                         # Helper functions for downstream tasks
│   ├── tokenizer/                     # Vocabulary building and tokenization functions
│   ├── data/                          # Data loaders and collators
│   └── utils/                         # Utility functions
├── scripts/
│   ├── train.py                       # Training script
│   ├── depmap/                        # DepMap benchmark scripts
│   ├── msigdb/                        # MSigDB pathway benchmark scripts
│   ├── state_transition/              # State transition prediction scripts
│   ├── data_prep/                     # Dataset preparation scripts
│   └── inference/                     # Inference utilities
│       ├── predict_embeddings.py      # Embedding extraction script
│       └── prepare_for_inference.py   # Prepares model for inference
├── tutorials/                         # Jupyter notebook tutorials
│   ├── clustering_tutorial.ipynb      # Cell clustering and UMAP visualization
│   └── training_tutorial.ipynb        # Training walkthrough
└── configs/
    ├── runai/                         # RunAI configuration files
    ├── mcli/                          # MosaicML platform configuration files
    ├── gcloud/                        # Google Cloud configuration files
    └── test_run.yaml                  # Sample config file
Docker installation provides better reproducibility and avoids dependency conflicts.
# Clone the repository
git clone https://github.com/tahoebio/tahoe-x1.git
cd tahoe-x1
# Pull the latest Docker image with all the dependencies pre-installed
docker pull ghcr.io/tahoebio/tahoe-x1:latest
# Start an interactive container with GPU support
# Note that nvidia-container-toolkit is required for this step
# Large SHM is needed to use dataloaders with multiple workers
# --shm-size allocates enough shared memory within the container
docker run -it --rm \
  --gpus all \
  --shm-size=64g \
  -v "$(pwd)":/workspace \
  -w /workspace \
  ghcr.io/tahoebio/tahoe-x1:latest \
  /bin/bash # Start an interactive Bash shell in the container
# Inside the container, install the Tahoe-x1 package (dependencies are pre-installed)
pip install -e . --no-deps
The Docker image includes all necessary dependencies, including PyTorch, CUDA libraries, and flash-attention, for optimal performance.
For direct installation without Docker, we recommend using uv for dependency management:
# Clone the repository
git clone https://github.com/tahoebio/tahoe-x1.git
cd tahoe-x1
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create and activate virtual environment
uv venv tx1
source tx1/bin/activate
# Install the package with dependencies
uv pip install -e . --no-build-isolation-package flash-attn
# Alternatively, install the latest stable release via PyPI
# uv pip install tahoe-x1 --no-build-isolation-package flash-attn
Note: Native installation requires compatible CUDA drivers and may encounter dependency conflicts. Docker installation is recommended for the best experience.
Launch a Tx1 pre-training run on Tahoe-100M. The dataset is streamed as needed, so there is no need to wait for a download. Play around with the various options in configs/test_run.yaml to configure your architecture and hyperparameters.
composer scripts/train.py \
  configs/test_run.yaml
You can also pass individual configuration parameters directly from the command line:
composer scripts/train.py \
configs/test_run.yaml \
model.d_model=512 \
model.n_layers=12 \
train_loader.num_workers=8 \
max_duration=1000ba
Tahoe-x1 is built natively on Composer and llm-foundry, inheriting their full suite of large-scale training capabilities. System requirements:
- GPU: NVIDIA Ampere (A100) or newer for FlashAttention support
- CUDA: Version 12.1+
- Python: 3.10+
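The requirements listed above can be checked before launching a run. The following is a minimal sketch, not part of the Tahoe-x1 codebase: the function names are hypothetical, and it assumes that "Ampere or newer" corresponds to a CUDA compute capability of at least 8.0. PyTorch is imported lazily so the pure version checks also work without it.

```python
import sys

# Minimums taken from the requirements list above.
MIN_PYTHON = (3, 10)
MIN_CUDA = (12, 1)
# Assumption: "Ampere (A100) or newer" is expressed as compute capability >= 8.0.
MIN_COMPUTE_CAPABILITY = (8, 0)


def parse_version(text):
    """Turn a version string such as '12.1' into a comparable tuple (12, 1)."""
    return tuple(int(part) for part in text.split(".")[:2])


def check_requirements(python_version, cuda_version, compute_capability):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append("Python 3.10+ is required")
    if cuda_version is None or parse_version(cuda_version) < MIN_CUDA:
        problems.append("CUDA 12.1+ is required")
    if compute_capability is None or tuple(compute_capability) < MIN_COMPUTE_CAPABILITY:
        problems.append("An Ampere (A100) or newer GPU is required for FlashAttention")
    return problems


def detect_environment():
    """Collect local versions; returns None for values that cannot be detected."""
    cuda_version, capability = None, None
    try:
        import torch  # optional dependency for this sketch

        # CUDA version PyTorch was built against (None for CPU-only builds).
        cuda_version = torch.version.cuda
        if torch.cuda.is_available():
            capability = torch.cuda.get_device_capability(0)
    except ImportError:
        pass
    return sys.version_info, cuda_version, capability
```

Calling `check_requirements(*detect_environment())` returns the list of unmet requirements for the current machine.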
The codebase leverages Composer's state-of-the-art training stack, configurable via YAML:
- Automatic micro-batching for optimal memory utilization
- Mixed precision training with BF16/FP16, plus FP8 support on Hopper (H100) and newer GPUs
- Multi-GPU and multi-node distributed training with FSDP (Fully Sharded Data Parallel)
- Gradient accumulation and checkpointing for training larger models on limited hardware
- Advanced optimizers and schedulers from the LLM training ecosystem
- Streaming datasets for efficient data loading at scale
This infrastructure supports training models from 70M to 3B parameters and can scale to larger architectures.
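To make the micro-batching and gradient accumulation items concrete, here is a small sketch of the arithmetic involved. It illustrates the general idea only and is not Composer's implementation: the function names are hypothetical, and `fits_in_memory` is a stand-in for actually attempting a forward and backward pass.

```python
import math


def microbatch_plan(global_batch_size, num_gpus, microbatch_size):
    """Split a global batch across GPUs, then into micro-batches per GPU.

    Returns (per_device_batch, accumulation_steps): each GPU processes
    `accumulation_steps` micro-batches and accumulates gradients before
    a single optimizer step.
    """
    if global_batch_size % num_gpus != 0:
        raise ValueError("global batch size must be divisible by the number of GPUs")
    per_device_batch = global_batch_size // num_gpus
    accumulation_steps = math.ceil(per_device_batch / microbatch_size)
    return per_device_batch, accumulation_steps


def largest_fitting_microbatch(per_device_batch, fits_in_memory):
    """Halve the micro-batch size until it fits, mimicking automatic micro-batching.

    `fits_in_memory` is a hypothetical callable returning True when a
    micro-batch of the given size can be processed without running out of memory.
    """
    size = per_device_batch
    while size >= 1:
        if fits_in_memory(size):
            return size
        size //= 2
    raise MemoryError("even a micro-batch of one sample does not fit")
```

For example, a global batch of 256 on 8 GPUs with a micro-batch size of 8 gives 32 samples per GPU, processed as 4 accumulated micro-batches.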
We provide pre-built Docker images for ease of use:
| Image Name | Base Image | Description |
|---|---|---|
| ghcr.io/tahoebio/tahoe-x1:1.0.0 | mosaicml/llm-foundry:2.2.1_cu121_flash2-813d596 | Current release image for Tahoe-x1 |
Tx1 was pretrained on 266 million single-cell profiles from three major sources. The following datasets were used for training and evaluation:
| Dataset | Description | Usage | Location |
|---|---|---|---|
| CellxGene 2025-01 | ~61M cells from Jan 2025 CellxGene release | Tx1-3B stage 1 Pre-training | s3://tahoe-hackathon-data/MFM/cellxgene_2025_01_21_merged_MDS/ |
| scBaseCamp 2025-02 | ~112M cells from Feb 2025 scBaseCamp release | Tx1-3B stage 1 Pre-training | s3://tahoe-hackathon-data/MFM/scbasecamp_2025_02_25_MDS_v2/ |
| Tahoe 100M | ~96M cells from Tahoe-100M | Tx1-3B stage 1 Pre-training | s3://tahoe-hackathon-data/MFM/tahoe_100m_MDS_v2/ |
| filtered CellxGene 2025-01 | ~43M filtered cells from Jan 2025 CellxGene release | Tx1-3B stage 2 Pre-training | s3://tahoe-hackathon-data/MFM/cellxgene_2025_01_21_merged_MDS_filtered/ |
| filtered scBaseCamp 2025-02 | ~76M filtered cells from Feb 2025 scBaseCamp release | Tx1-3B stage 2 Pre-training | s3://tahoe-hackathon-data/MFM/scbasecamp_2025_02_25_MDS_v2_filtered/ |
| filtered Tahoe 100M | ~34M filtered cells from Tahoe-100M | Tx1-3B stage 2 Pre-training | s3://tahoe-hackathon-data/MFM/tahoe_100m_MDS_v2_filtered/ |
| DepMap | Cancer cell line dependency data | DepMap Benchmark | s3://tahoe-hackathon-data/MFM/benchmarks/depmap/ |
| MSigDB | Pathway signature data | MSigDB Benchmark | s3://tahoe-hackathon-data/MFM/benchmarks/msigdb/ |
Filtered versions of the pre-training datasets above exclude cells with very few expressed genes and are used for stage 2 pre-training of Tx1-3B.
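As a rough illustration of this kind of filter, the sketch below keeps only cells whose number of expressed genes (non-zero counts) reaches a minimum. The threshold `min_genes` and the toy matrix are hypothetical; the actual cutoff used for the filtered datasets is not stated here.

```python
def count_expressed_genes(cell_counts):
    """Number of genes with a non-zero count in one cell."""
    return sum(1 for value in cell_counts if value > 0)


def filter_cells(count_matrix, min_genes):
    """Return the row indices of cells expressing at least `min_genes` genes.

    `count_matrix` is a list of rows, one row of gene counts per cell.
    `min_genes` is a hypothetical threshold chosen by the user.
    """
    return [
        index
        for index, row in enumerate(count_matrix)
        if count_expressed_genes(row) >= min_genes
    ]


# Toy example: three cells, four genes.
toy_counts = [
    [0, 3, 0, 1],  # 2 expressed genes
    [0, 0, 0, 2],  # 1 expressed gene
    [5, 1, 2, 0],  # 3 expressed genes
]
kept = filter_cells(toy_counts, min_genes=2)
```

With real data the same logic would typically be applied to a sparse count matrix rather than nested lists.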
Public access to datasets: s3://tahoe-hackathon-data/MFM/benchmarks/
If you require access to datasets not available in the public bucket, please open a GitHub issue or contact the team.
For more information on dataset preparation, see scripts/data_prep/README.md.
We provide pre-trained Tahoe-x1 models of various sizes:
| Model Name | Parameters | Context Length | Checkpoint Path | WandB ID | Config File |
|---|---|---|---|---|---|
| Tx1-3B | 3B | 2048 | s3://tahoe-hackathon-data/MFM/ckpts/3b/ | mygjkq5c | ./configs/mcli/tahoe_x1-3b-v2-cont-train.yaml |
| Tx1-1.3B | 1.3B | 2048 | s3://tahoe-hackathon-data/MFM/ckpts/1b/ | 26iormxc | ./configs/gcloud/tahoe_x1-1_3b-merged.yaml |
| Tx1-70M | 70M | 1024 | s3://tahoe-hackathon-data/MFM/ckpts/70m/ | ftb65le8 | ./configs/gcloud/tahoe_x1-70m-merged.yaml |
Model weights are also available as safetensors files on our Hugging Face model card.
You can start with configs/test_run.yaml, a sample configuration that trains the 70M model on the Tahoe-100M dataset for a few iterations. Customize this configuration file for your own training runs.
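Dotted overrides such as `model.d_model=512` address nested keys of the YAML configuration. The sketch below shows one way such overrides could be merged into a nested config; it is an illustration under stated assumptions, not the logic in scripts/train.py. The helper names are hypothetical and the base values are placeholders rather than the contents of configs/test_run.yaml.

```python
import copy


def parse_value(text):
    """Interpret a value as bool, int, or float when possible; otherwise keep the string."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def apply_overrides(config, overrides):
    """Return a copy of `config` with dotted `key=value` overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Create intermediate sections that do not exist yet.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return result


# Placeholder base config (not the real test_run.yaml values).
base = {
    "model": {"d_model": 256, "n_layers": 4},
    "train_loader": {"num_workers": 2},
    "max_duration": "10ba",
}
updated = apply_overrides(
    base,
    ["model.d_model=512", "model.n_layers=12", "train_loader.num_workers=8", "max_duration=1000ba"],
)
```

Values that are not valid numbers or booleans, such as the duration string `1000ba`, are kept as strings, and the original `base` dictionary is left unchanged.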
Use the main training script with a YAML configuration file:
composer scripts/train.py -f configs/test_run.yaml
Or with command-line arguments:
composer scripts/train.py \
--model_name tahoe_x1 \
--data_path /path/to/data \
--max_seq_len 2048 \
  --batch_size 32
Note that the current codebase only supports attn_impl: flash and use_attn_mask: False. The Triton backend and custom attention masks (used for training Tx1-1.3B and Tx1-70M) are no longer supported. If you have questions about using custom attention masks with the Triton backend, please contact us.
To fine-tune a pre-trained model on your own data:
- Download a pre-trained checkpoint from S3
- Modify the training configuration to load from checkpoint
- Prepare your dataset in the MDS format (see scripts/data_prep/README.md)
- Launch training with the --load_path argument
python scripts/train.py \
-f configs/finetune_config.yaml \
  --load_path s3://path/to/checkpoint
For launching runs on specific platforms such as MosaicML, Google Cloud, or RunAI, refer to the corresponding configuration folders under configs/ and their respective README files.
Package a trained model with its vocabulary and metadata by editing the configuration section in the script:
- Open scripts/inference/prepare_for_inference.py and update the configuration variables:
  - model_name: Your model name (e.g., "tx-3b-prod")
  - wandb_id: Your WandB run ID (e.g., "mygjkq5c")
  - wandb_project: Your WandB project name (e.g., "vevotx/tahoex")
  - save_dir: Output directory path
  - default_vocab_url: S3 URL for vocabulary file (e.g., "s3://tahoe-hackathon-data/MFM/vevo_v2_vocab.json")
- Run the script:
python scripts/inference/prepare_for_inference.py
The script will download the model config from WandB, process the vocabulary, and save inference-ready files to your specified output directory.
Extract cell embeddings from an AnnData object:
from omegaconf import OmegaConf as om
from scripts.inference.predict_embeddings import predict_embeddings
cfg = {
"model_name": "Tx1-70m",
"paths": {
"hf_repo_id": "tahoebio/Tahoe-x1",
"hf_model_size": "70m",
"adata_input": "/path/to/your_data.h5ad",
},
"data": {
"cell_type_key": "cell_type",
"gene_id_key": "ensembl_id"
},
"predict": {
"seq_len_dataset": 2048,
"return_gene_embeddings": False,
}
}
cfg = om.create(cfg)
adata = predict_embeddings(cfg)
# Access embeddings
cell_embeddings = adata.obsm["Tx1-70m"]
Set return_gene_embeddings: True in the configuration to extract gene-level representations.
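Once embeddings are available, a common next step is to compare cells by cosine similarity. The sketch below is a minimal, dependency-free illustration: the helper names are hypothetical, and the toy vectors stand in for rows of `adata.obsm["Tx1-70m"]` (in practice a NumPy array, where a vectorized implementation would be preferable).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_cells(embeddings, query_index, k):
    """Indices of the k cells most similar to the query cell, excluding itself."""
    query = embeddings[query_index]
    scored = [
        (cosine_similarity(query, other), index)
        for index, other in enumerate(embeddings)
        if index != query_index
    ]
    scored.sort(reverse=True)
    return [index for _, index in scored[:k]]


# Toy 2-dimensional "embeddings" (made-up values for illustration only).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbors = nearest_cells(toy_embeddings, query_index=0, k=2)
```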
- Clustering Tutorial: Cell clustering and UMAP visualization
- Training Tutorial: Step-by-step guide to training Tahoe-x1 models
Tx1 achieves state-of-the-art performance across disease-relevant benchmarks. See our preprint for detailed results.
| Benchmark | Task | Code Location |
|---|---|---|
| DepMap Essentiality | Predict broad and context-specific gene dependencies | scripts/depmap/ |
| MSigDB Hallmarks | Recover 50 hallmark pathway memberships from gene embeddings | scripts/msigdb/ |
| Cell-Type Classification | Classify cell types across 5 tissues (Tabula Sapiens 2.0) | cz-benchmarks |
| Perturbation Prediction | Predict transcriptional responses in held-out contexts | scripts/state_transition/ |
- Data Preparation: scripts/data_prep/README.md
- Platform Usage: mcli/README.md and gcloud/README.md
- PyTorch/CUDA mismatch: Ensure PyTorch is installed with the correct CUDA version for your system
- Docker permission denied: Run Docker commands with sudo or add your user to the docker group
- OOM (Out of Memory): Ensure half-precision and flash-attention are enabled, and set microbatch_size to auto
- S3 access denied: For public buckets, the code will automatically retry with unsigned requests
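The unsigned-request fallback mentioned in the last item can be sketched as follows. This is an illustration of the pattern, not the code used in Tahoe-x1: the helper names are hypothetical, the set of error codes treated as "access denied" is an assumption, and boto3 is imported lazily so that the fallback logic itself has no dependencies.

```python
# Assumption: these error codes are treated as "access denied" for the retry.
DENIED_CODES = {"AccessDenied", "403", "InvalidAccessKeyId", "SignatureDoesNotMatch"}


def is_access_denied(error):
    """Heuristic: botocore ClientError with a denied code, or missing credentials."""
    if type(error).__name__ == "NoCredentialsError":
        return True
    response = getattr(error, "response", None) or {}
    code = str(response.get("Error", {}).get("Code", ""))
    return code in DENIED_CODES


def call_with_unsigned_fallback(operation, signed_client, make_unsigned_client):
    """Run `operation(client)` with signed requests, retrying unsigned if denied."""
    try:
        return operation(signed_client)
    except Exception as error:
        if not is_access_denied(error):
            raise
        return operation(make_unsigned_client())


def make_unsigned_s3_client():
    """Create an anonymous S3 client (requires boto3 to be installed)."""
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    return boto3.client("s3", config=Config(signature_version=UNSIGNED))


# Example usage (needs boto3 and network access):
# import boto3
# listing = call_with_unsigned_fallback(
#     lambda s3: s3.list_objects_v2(
#         Bucket="tahoe-hackathon-data", Prefix="MFM/benchmarks/", MaxKeys=5
#     ),
#     boto3.client("s3"),
#     make_unsigned_s3_client,
# )
```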
For additional help, please open an issue on GitHub with:
- Your system configuration (OS, GPU, PyTorch version)
- Complete error message and stack trace
- Steps to reproduce the issue
If you use Tahoe-x1 in your research, please cite:
@article{gandhi2025tahoe,
title = {Tahoe-x1: Scaling Perturbation-Trained Single-Cell Foundation Models to 3 Billion Parameters},
author = {Gandhi, Shreshth and Javadi, Farnoosh and Svensson, Valentine and Khan, Umair and Jones, Matthew G. and Yu, Johnny and Merico, Daniele and Goodarzi, Hani and Alidoust, Nima},
journal = {bioRxiv},
year = {2025},
doi = {10.1101/2025.10.23.683759},
url = {https://www.biorxiv.org/content/10.1101/2025.10.23.683759v1},
publisher = {Cold Spring Harbor Laboratory}
}
We thank the developers of the following open-source projects:
- scGPT: For inspiring the Tahoe-x1 architecture
- llm-foundry: Efficient training infrastructure for large language models
- streaming: Fast, efficient dataset streaming
- CZ CELLxGENE: Chan Zuckerberg Initiative's single-cell atlas
- Arc scBaseCount: Arc Institute's virtual cell atlas
- Arc Institute STATE: State Transition model for perturbation prediction
For questions, issues, or collaboration inquiries, please open an issue on GitHub or write to us at admin@tahoebio.ai.