Gradient Garden is a research platform for model training, evaluation, and experimentation across architectures, benchmarks, and multi-stage training recipes.
The project began with a decoder-only transformer baseline and has evolved into a broader codebase for training, evaluation, and experimentation across modern machine learning models, distributed training workflows, and post-training methods.
- Research and experimentation across architectures, benchmarks, and training recipes
- Multi-stage model training and post-training workflows
- Distributed training
  - single-GPU training
  - multi-GPU / multi-node training
    - DDP
    - FSDP2 (~ZeRO3-like sharding)
- Gradient accumulation
- Mixed precision
  - BF16
  - FP16
  - FP32
- Early stopping
- Checkpoint save / load / resume
- Torch profiler integration
- Weights & Biases (W&B) integration
- AdamW
  - Fused if available on the device
- Muon
  - Separate max LR, min LR, and warmup settings from AdamW
  - Applied to matrix parameters
  - AdamW is applied to the remaining trainable parameters
- Cosine LR scheduling
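The max LR, min LR, and warmup settings listed above usually combine with cosine scheduling along the lines of the sketch below. It assumes linear warmup followed by cosine decay down to the minimum LR; the function name and parameters are hypothetical and are not taken from lr_schedulers.py, which may differ in detail.

```python
import math


def cosine_lr(step: int, max_lr: float, min_lr: float,
              warmup_steps: int, max_steps: int) -> float:
    """Hypothetical helper: linear warmup, then cosine decay to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear warmup: ramp up so that the last warmup step reaches max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        # Past the end of the schedule, hold the minimum learning rate.
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    # Cosine coefficient goes smoothly from 1 down towards 0.
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)
```

With separate settings per optimizer, the same function could be called once for AdamW and once for Muon, each with its own max LR, min LR, and warmup length.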
- Pretraining
- Supervised fine-tuning (SFT / instruct)
- Direct Preference Optimization (DPO)
- Optional distillation support
  - Teacher models are currently loaded from the Hugging Face Hub, but this can be adapted to other sources if needed
- Tendril
  - Current default model architecture. Dense decoder-only transformer with GQA and RoPE support.
- TendrilMoE
  - Mixture of Experts (MoE) implementation of Tendril.
  - Reuses the existing FF module for expert MLPs.
  - Includes load balancing and z-loss
- LoRA
- KV cache for autoregressive decoding
- Shared multiple-choice evaluation path
  - HellaSwag
  - WinoGrande
  - ARC-Challenge
  - Additional benchmarks can be added through the same multiple-choice evaluation flow
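A shared multiple-choice path can be pictured as scoring every candidate answer and picking the best one. The sketch below assumes length-normalized log-likelihood scoring, a common choice for benchmarks such as HellaSwag. The README does not state the project's exact scoring rule, and both function names are hypothetical.

```python
from typing import Sequence


def pick_choice(token_logprobs_per_choice: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate with the highest mean token log-probability."""
    best_index, best_score = -1, float("-inf")
    for index, logprobs in enumerate(token_logprobs_per_choice):
        if not logprobs:
            continue  # Skip empty completions rather than dividing by zero.
        score = sum(logprobs) / len(logprobs)  # Length normalization.
        if score > best_score:
            best_index, best_score = index, score
    return best_index


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of examples where the predicted choice matches the label."""
    if not labels:
        return 0.0
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)
```

Under this assumption, adding a benchmark only requires mapping each example to a context, a list of candidate completions, and a label.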
- The project is currently focused on CUDA-based training workflows.
- Additional model architectures can be added through the model registry.
- By default, the project uses a Hugging Face tokenizer.
- It also supports a tiktoken tokenizer that can be loaded from a local BPE file. For now, this path expects a tokenizer compatible with the Llama 3 tiktoken configuration and chat/control tokens (e.g. <|begin_of_text|>, <|end_of_text|>, <|start_header_id|>, <|end_header_id|>, <|eot_id|>). The local tokenizer file is not included in this repository and must be provided separately. This will be made more generic later.
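To illustrate how the control tokens above are typically used, the sketch below assembles a prompt in the standard Llama 3 chat layout (role header, blank line, content, end-of-turn token). It is a hypothetical helper working on plain strings, not the project's tokenizer code, and it assumes the project follows the standard layout.

```python
BEGIN_OF_TEXT = "<|begin_of_text|>"
START_HEADER = "<|start_header_id|>"
END_HEADER = "<|end_header_id|>"
EOT = "<|eot_id|>"


def format_chat(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the Llama 3 chat layout."""
    parts = [BEGIN_OF_TEXT]
    for message in messages:
        # Each turn: role header, blank line, content, end-of-turn token.
        header = f"{START_HEADER}{message['role']}{END_HEADER}\n\n"
        parts.append(header + message["content"].strip() + EOT)
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append(f"{START_HEADER}assistant{END_HEADER}\n\n")
    return "".join(parts)
```

The resulting string would then be encoded with the special tokens allowed, so that each control token maps to a single token id.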
- datasets_preparation/: Components used for downloading, preparing, and tokenizing datasets.
- engine/: Trainer and runtime core components.
- evals/: Shared evaluation loading and scoring utilities.
  - Multiple-choice evals:
    - HellaSwag
    - WinoGrande
    - ARC-Challenge
- examples/: Templates for local setup files like the .env secrets file and dataset mix.
- inference/: Contains inference-related components like the KV cache implementation and the logic for sampling and text generation.
- metrics/: Utilities for metric aggregation.
- models/: Contains the model registry, the builder, and the base model interface.
  - adapters/: Contains PEFT adapters.
    - lora.py: LoRA module that handles the model modification. Rank, alpha, dropout, and target modules are configurable.
  - implementations/: Contains model implementations.
- recipes/: Recipe definitions for training and dataset preparation. More will be added here.
  - recipes/config.py: Defines the recipe schema and recipe loading logic.
  - recipes/pretraining/debug.yaml
  - recipes/instruct/debug.yaml
  - recipes/dpo/debug.yaml
- tasks/: Groups the training tasks.
- tests/: Groups tests for different components.
- checkpoints.py: Logic to handle checkpointing.
- config.py: Defines the nested GlobalConfig used by the trainer, dataset preparation, evaluation, checkpointing, and runtime setup.
  - Most experiment settings currently live as defaults in config.py.
  - Recipes define experiment settings under the config section.
  - .env is only used for third-party secrets and local cache settings: WANDB_API_KEY, HF_TOKEN, and HF_HOME.
- dataloaders.py: Dataloader logic for sampling and distributing data.
- ddp_utils.py: Contains the main logic to set up PyTorch DDP (Distributed Data Parallel) and FSDP2 (Fully Sharded Data Parallel).
  - See the PyTorch DDP documentation for details.
- distillation_utils.py: Logic for distillation loss.
- dpo_utils.py: Logic for DPO loss.
- logger.py: Simple reusable logger.
- lr_schedulers.py: Stores learning rate schedulers. At the moment, it includes a cosine scheduler.
- prepare_datasets.py: Entry point for data downloading and preparation.
- test_prompts.json: JSON with the list of input prompts to try during training. The expected keys in the JSON (as provided in the file) are "pretraining", "instruct", and "dpo".
- tokenizer.py: Provides the tokenizer abstraction used by the project and supports two backends:
  - TikTokenizer: loads tiktoken BPE weights from a local file path and configures the special tokens used by the project.
  - HFTokenizer: loads a tokenizer from Hugging Face via AutoTokenizer.from_pretrained(...) and aligns the required special tokens (bos, eos, headers, eot, pad).
  - init_tokenizer(...) selects the backend based on config.tokenizer.huggingface_tokenizer.
- train.py: Entry point for training runs.
- utils.py: Common generic logic that can be reused in different components.
- wandb_utils.py: A wrapper for Weights & Biases.
  - See the Weights & Biases documentation for details.
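For orientation, the DPO loss handled by dpo_utils.py is usually the standard DPO objective: the negative log-sigmoid of the scaled difference between policy and reference log-ratios. The sketch below works on scalar sequence log-probabilities in pure Python. The function name and the default beta are assumptions, and the project's implementation (presumably batched tensors) may differ.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair (beta default is hypothetical)."""
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == softplus(-margin), written in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss equals log(2) when the policy matches the reference and decreases as the policy shifts probability towards the chosen response.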
- Create a Python environment. Example with conda:

      conda create -n my_env python=3.11

- Activate the environment and run:

      pip install -r requirements.txt

- Download and prepare the data:
  - Recipes:
    - Example with recipe: python prepare_datasets.py --recipe recipes/pretraining/debug.yaml
  - Or manually:
    - Evals:
      - HellaSwag: python prepare_datasets.py --hellaswag
      - WinoGrande: python prepare_datasets.py --winogrande
      - ARC-Challenge: python prepare_datasets.py --arc-challenge
    - Training and validation:
      - Pretraining: python prepare_datasets.py --pretraining
      - Instruct: python prepare_datasets.py --instruct
      - DPO: python prepare_datasets.py --dpo
    - NOTE: Dataset paths are configured in config.py under GlobalConfig.paths.datasets and GlobalConfig.paths.evals.
    - In manual mode, the training dataset preparation commands also support a custom mix file by passing --mix-file <file_path>. Check examples/pretraining_data_mix.example.json for an example. Local custom mix files should use the .local.json suffix, for example pretraining_debug.local.json, so they are ignored by Git. If no --mix-file is provided, the built-in default mix for that stage is used.
      - The default mix can be found in datasets_preparation/default_mixes.py.
- (OPTIONAL) Set up your Weights & Biases API key:
  - Set the WANDB_API_KEY environment variable if you want to log progress there.
- NOTE: For some scenarios you might also need to pass your Hugging Face API token HF_TOKEN, e.g. when performing knowledge distillation with a teacher model that requires access permissions.
Configuration can be provided directly through a config YAML file or through a recipe YAML file.
A recipe is the preferred way to define an experiment. It contains:
- config: the nested GlobalConfig used for training and runtime setup
- data: dataset preparation settings, including the dataset mix and optional eval dataset preparation

The main sections of config are:
- runtime: device, precision, FSDP, torch compile, CPU workers
- model: model architecture and model-specific settings
- training: stage, seed, total batch size, max steps, early stopping
- optimizers: AdamW and optional Muon configuration
- paths: dataset, evaluation, runs, and prompt paths
- validation: validation frequency and number of validation steps
- evals: HellaSwag, WinoGrande, and ARC-Challenge settings
- generation: text generation frequency and max generation length
- checkpointing: checkpoint save frequency and retention
- tokenizer: tokenizer backend and checkpoint path
- lora: optional LoRA configs
- distillation: optional teacher model distillation settings
- dpo: DPO configuration
- wandb: W&B logging settings
- torch_profiler: profiler settings
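As a quick sanity check on a recipe's shape, the sketch below flags top-level keys and config sections that are not among the names listed above. It is a hypothetical helper operating on an already-parsed recipe dictionary. The real schema and loading logic live in recipes/config.py and may validate differently.

```python
# Section names taken from the README; everything else here is illustrative.
RECIPE_KEYS = {"config", "data"}
CONFIG_SECTIONS = {
    "runtime", "model", "training", "optimizers", "paths", "validation",
    "evals", "generation", "checkpointing", "tokenizer", "lora",
    "distillation", "dpo", "wandb", "torch_profiler",
}


def unknown_recipe_keys(recipe: dict) -> list[str]:
    """Return a sorted list of unrecognized top-level keys and config sections."""
    problems = [key for key in recipe if key not in RECIPE_KEYS]
    # Treat a missing or empty config as having no sections to check.
    for section in recipe.get("config") or {}:
        if section not in CONFIG_SECTIONS:
            problems.append(f"config.{section}")
    return sorted(problems)
```

A misspelled section such as optimiser would be reported as config.optimiser instead of being silently ignored.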
.env is no longer used as the experiment configuration file. It is only used for secrets and local cache settings:
WANDB_API_KEY=''
HF_TOKEN=''
HF_HOME='./cache'
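The sketch below shows one way a file in this format could be read into the process environment using only the standard library. The helper name is hypothetical, and the project may well rely on a dedicated package for this instead.

```python
import os


def load_env_file(path: str = ".env", override: bool = False) -> dict:
    """Parse KEY='value' lines and export them as environment variables."""
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Ignore blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key = key.strip()
            value = value.strip().strip("'\"")  # Drop surrounding quotes.
            loaded[key] = value
            # Keep values already set in the shell unless override is requested.
            if override or key not in os.environ:
                os.environ[key] = value
    return loaded
```

By default, variables already exported in the shell take precedence over the file, which keeps per-session overrides simple.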
train.py accepts some flags that are useful to load a checkpoint or override some properties:
--config <file> # Load config directly from a config YAML file
--recipe <file> # Load config from a recipe YAML file
--pretraining # Automatically sets pretraining stage.
--instruct # Automatically sets instruct stage.
--dpo # Automatically sets DPO stage.
--checkpoint <file> # Resume training from a specific checkpoint
--reset-optimizers # Ignore stored optimizer(s) state
--start-step <N> # Override internal step counter
NOTES:
- The project output path can be configured in config.py under GlobalConfig.paths.runs. By default it uses ./runs.
- When using --recipe, the recipe defines the training stage. The flags --pretraining, --instruct, and --dpo cannot be combined with --recipe.
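The flag rules above can be expressed with argparse as in the following sketch. It uses only the flag names listed in this README. Treating the three stage flags as mutually exclusive is an assumption, and the actual parser in train.py may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("--config", help="config YAML file")
    parser.add_argument("--recipe", help="recipe YAML file")
    # Assumption: at most one stage flag may be given at a time.
    stage = parser.add_mutually_exclusive_group()
    stage.add_argument("--pretraining", action="store_true")
    stage.add_argument("--instruct", action="store_true")
    stage.add_argument("--dpo", action="store_true")
    parser.add_argument("--checkpoint", help="checkpoint to resume from")
    parser.add_argument("--reset-optimizers", action="store_true")
    parser.add_argument("--start-step", type=int)
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # A recipe already defines the stage, so stage flags are rejected.
    if args.recipe and (args.pretraining or args.instruct or args.dpo):
        parser.error("--pretraining, --instruct, and --dpo cannot be combined with --recipe")
    return args
```

Calling parser.error prints a usage message and exits with status 2, so an invalid combination fails before any training setup starts.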
Example recipe commands:
python prepare_datasets.py --recipe recipes/pretraining/debug.yaml
python train.py --recipe recipes/pretraining/debug.yaml
- To train on a single GPU, run:

      python train.py --recipe recipes/pretraining/debug.yaml

- To train on multiple GPUs, run:

      export OMP_NUM_THREADS=1
      torchrun \
        --standalone \
        --nproc_per_node <NUMBER_OF_GPUs> \
        train.py --recipe recipes/pretraining/debug.yaml

- To load a checkpoint and continue training, pass the --checkpoint flag to any of the above commands. E.g.:

      export OMP_NUM_THREADS=1
      torchrun \
        --standalone \
        --nproc_per_node <NUMBER_OF_GPUs> \
        train.py --recipe recipes/pretraining/debug.yaml --checkpoint <CHECKPOINT_FILE_PATH>

- To train on multiple nodes with 1 or more GPUs per node, configure each node as follows:
  - Static
    - Ethernet

          export NCCL_IB_DISABLE=1
          export NCCL_SOCKET_NTHREADS=4
          export NCCL_NSOCKS_PERTHREAD=8

    - InfiniBand

          export NCCL_IB_DISABLE=0
          export NCCL_IB_HCA=$(ls /sys/class/infiniband | paste -sd, -)

    Then run:

        export OMP_NUM_THREADS=1
        export PYTHONUNBUFFERED=1
        export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
        export TORCH_DIST_BIND_ADDR=0.0.0.0 # only needed in master but no impact on the workers
        export NCCL_DEBUG=WARN

        NNODES=<NUMBER_OF_NODES>
        NPERNODE=<NUMBER_OF_GPUs>
        NODE_RANK=<NODE_RANK>
        MASTER_ADDR=<MASTER_NODE_MACHINE_IP>
        MASTER_PORT=<MASTER_NODE_MACHINE_PORT>

        # make sure we can find the correct NIC
        _IFACE=$(ip -o route get "$MASTER_ADDR" | awk '{for(i=1;i<=NF;i++) if($i=="dev"){print $(i+1); exit}}')
        [ -n "$_IFACE" ] && [ "$_IFACE" != "lo" ] && export NCCL_SOCKET_IFNAME="$_IFACE"

        torchrun \
          --nnodes ${NNODES} \
          --nproc-per-node ${NPERNODE} \
          --node-rank ${NODE_RANK} \
          --master_addr ${MASTER_ADDR} \
          --master_port ${MASTER_PORT} \
          train.py --recipe recipes/pretraining/debug.yaml

  - Elastic

        export OMP_NUM_THREADS=1
        export PYTHONUNBUFFERED=1
        export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
        export TORCH_DIST_BIND_ADDR=0.0.0.0 # only needed in master but no impact on the workers
        export NCCL_DEBUG=WARN

        NNODES=<NUMBER_OF_NODES>
        NPERNODE=<NUMBER_OF_GPUs>
        MASTER_ADDR=<MASTER_NODE_MACHINE_IP>
        MASTER_PORT=<MASTER_NODE_MACHINE_PORT>
        RDZV_EP="$MASTER_ADDR:$MASTER_PORT"
        RDZV_ID=<SOME_SHARED_JOB_NAME>

        # make sure we can find the correct NIC
        _IFACE=$(ip -o route get "$MASTER_ADDR" | awk '{for(i=1;i<=NF;i++) if($i=="dev"){print $(i+1); exit}}')
        [ -n "$_IFACE" ] && [ "$_IFACE" != "lo" ] && export NCCL_SOCKET_IFNAME="$_IFACE"

        torchrun \
          --nnodes ${NNODES} \
          --nproc-per-node ${NPERNODE} \
          --rdzv-backend c10d \
          --rdzv-endpoint ${RDZV_EP} \
          --rdzv-id ${RDZV_ID} \
          train.py --recipe recipes/pretraining/debug.yaml

  - NOTE: The same command needs to be run on all nodes.
- For more details on torchrun, see the PyTorch torchrun documentation.
- For more details on NCCL, see the NVIDIA NCCL documentation.
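For context, torchrun passes each worker its position in the job through the RANK, LOCAL_RANK, and WORLD_SIZE environment variables. The sketch below reads them with single-process defaults. The DistEnv class and read_dist_env function are hypothetical names and are not taken from ddp_utils.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DistEnv:
    rank: int        # Global rank across all nodes.
    local_rank: int  # Rank within the current node (usually the GPU index).
    world_size: int  # Total number of worker processes.

    @property
    def is_distributed(self) -> bool:
        return self.world_size > 1

    @property
    def is_main_process(self) -> bool:
        return self.rank == 0


def read_dist_env(environ: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Read torchrun's variables, falling back to a single-process setup."""
    env = os.environ if environ is None else environ
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )
```

With 2 nodes of 4 GPUs each, the third worker on the second node would see RANK=6, LOCAL_RANK=2, and WORLD_SIZE=8.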
Torch profiler settings are configured in config.py under GlobalConfig.torch_profiler.
More details on the profiler API can be found in the PyTorch profiler documentation.
Run the tests from the root folder:
pytest
For convenience, local configurations, recipes, or other files should use the naming conventions defined in .gitignore:
*.local.json
*.private.json
*.local.ipynb
*.private.ipynb
*.local.yaml
*.private.yaml
*.local.yml
*.private.yml
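To check whether a file name follows these conventions before creating it, a small helper like the one below can be used. It approximates the ignore rules with fnmatch on the file name only, which covers these simple patterns but not the full .gitignore syntax. The helper name is hypothetical.

```python
from fnmatch import fnmatch
from pathlib import PurePath

# Patterns copied from the list above.
LOCAL_PATTERNS = [
    "*.local.json", "*.private.json",
    "*.local.ipynb", "*.private.ipynb",
    "*.local.yaml", "*.private.yaml",
    "*.local.yml", "*.private.yml",
]


def is_local_only(path: str) -> bool:
    """Return True if the file name matches one of the local/private patterns."""
    name = PurePath(path).name  # Patterns apply to the file name, not the folder.
    return any(fnmatch(name, pattern) for pattern in LOCAL_PATTERNS)
```

For a definitive answer inside a repository, git check-ignore reports whether a given path is actually ignored.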
Feel free to reach out if interested in contributing!
Tokenizer files, model weights, and datasets obtained from third parties are not included in this repository unless explicitly stated, and may be subject to their own licenses and terms.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
Please cite this project if it was useful in your work:
@software{rui2024gradientgarden,
author = {Rui Malheiro},
title = {Gradient Garden},
year = {2024},
url = {https://github.com/ruimalheiro/gradient-garden}
}