Gradient Garden

Gradient Garden is a research platform for model training, evaluation, and experimentation across architectures, benchmarks, and multi stage training recipes.

The project began with a decoder-only transformer baseline and has evolved into a broader codebase for training, evaluation, and experimentation across modern machine learning models, distributed training workflows, and post-training methods.

Current focus

Research and experimentation across architectures, benchmarks, and training recipes
Multi-stage model training and post-training workflows
Distributed training

Current capabilities

Training and distributed execution

single-GPU training
multi-GPU / multi-node training
- DDP
- FSDP2 (~ZeRO3-like sharding)
Gradient accumulation
Mixed precision
- BF16
- FP16
- FP32
Early stopping
Checkpoint save / load / resume
Torch profiler integration
Weights & Biases (W&B) integration

Optimization

AdamW
- Fused if available on the device
Muon
- Separate max LR, min LR, and warmup settings from AdamW
- Applied to matrix parameters
- AdamW is applied to the remaining trainable parameters
Cosine LR scheduling

Training workflows

Pretraining
Supervised fine-tuning (SFT / instruct)
Direct Preference Optimization (DPO)
Optional distillation support
- Teacher models are currently loaded from the Hugging Face Hub, but this can be adapted to other sources if needed

Models

Tendril
- Current default model architecture. Dense decoder-only transformer with GQA and RoPE support.
TendrilMoE
- Mixture of Experts (MoE) implementation of Tendril.
- Reuses the existing FF module for expert MLPs.
- Includes load balancing and z-loss

Other features

LoRA
KV cache for autoregressive decoding

Evaluation

Shared multiple-choice evaluation path
- HellaSwag
- WinoGrande
- ARC-Challenge
Additional benchmarks can be added through the same multiple-choice evaluation flow

Notes

The project is currently focused on CUDA-based training workflows.
Additional model architectures can be added through the model registry.
By default, the project uses a Hugging Face tokenizer.
It also supports a tiktoken tokenizer that can be loaded from a local BPE file. For now, this path expects a tokenizer compatible with the Llama 3 tiktoken configuration and chat/control tokens, e.g. (<|begin_of_text|>, <|end_of_text|>, <|start_header_id|>, <|end_header_id|>, <|eot_id|>). The local tokenizer file is not included in this repository and must be provided separately. This will be made more generic later.

Project structure

datasets_preparation/ Components used for downloading, preparing, and tokenizing datasets.
engine/ Trainer and runtime core components.
evals/ Shared evaluation loading and scoring utilities.
- Multiple choice evals:
  - HellaSwag
  - WinoGrande
  - ARC-Challenge
examples/ Templates for local setup files like the .env secrets file and dataset mix.
inference/ Contains inference related components like KV cache implementation and logic for sampling and text generation.
metrics/ Utilities for metric aggregation.
models/ Contains the model registry, the builder and the base model interface.
- adapters/ Contains PEFT adapters.
  - lora.py LoRA module that handles the model modification. Rank, alpha, dropout and target modules can be configured accordingly.
- implementations/ Contains model implementations.
recipes/ Recipe definitions for training and dataset preparation. More will be added here.
- recipes/config.py Defines the recipe schema and recipe loading logic.
- recipes/pretraining/
  - debug.yaml
- recipes/instruct/
  - debug.yaml
- recipes/dpo/
  - debug.yaml
tasks/ Groups the training tasks.
tests/ Groups tests for different components.
checkpoints.py Logic to handle checkpointing.
config.py Defines the nested GlobalConfig used by the trainer, dataset preparation, evaluation, checkpointing, and runtime setup.
- Most experiment settings currently live as defaults in config.py.
- Recipes define experiment settings under the config section.
- .env is only used for third party secrets and local cache settings: WANDB_API_KEY, HF_TOKEN, and HF_HOME.
dataloaders.py Dataloader logic for sampling and distributing data.
ddp_utils.py Contains the main logic to set up the PyTorch DDP (Distributed Data Parallel) and FSDP2 (Fully Sharded Data Parallel).
- PyTorch DDP here
distillation_utils.py Logic for distillation loss.
dpo_utils.py Logic for DPO loss.
logger.py Simple reusable logger.
lr_schedulers.py Stores learning rate schedulers. At the moment, it includes a cosine scheduler.
prepare_datasets.py Entry point for data downloading and preparation.
test_prompts.json JSON with the list of input prompts to try during training. The expected keys in the JSON (as provided in the file) are "pretraining", "instruct", "dpo".
tokenizer.py Provides the tokenizer abstraction used by the project and supports two backends:
- TikTokenizer: loads tiktoken BPE weights from a local file path and configures the special tokens used by the project.
- HFTokenizer: loads a tokenizer from Hugging Face via AutoTokenizer.from_pretrained(...) and aligns the required special tokens (bos, eos, headers, eot, pad).
- init_tokenizer(...) selects the backend based on config.tokenizer.huggingface_tokenizer.
train.py Entry point for training runs.
utils.py Common generic logic that can be reused in different components.
wandb_utils.py A wrapper for Weights & Biases.
- Weights & Biases here

Setup

Create a python environment. Example with conda: conda create -n my_env python=3.11;
Activate the environment and run: pip install -r requirements.txt;
Download and prepare the data:
- Recipes:
  - Example with recipe: python prepare_datasets.py --recipe recipes/pretraining/debug.yaml
- Or manually:
  - Evals:
    - HellaSwag: python prepare_datasets.py --hellaswag
    - WinoGrande: python prepare_datasets.py --winogrande
    - ARC-Challenge: python prepare_datasets.py --arc-challenge
  - Training and validation:
    - Pretraining: python prepare_datasets.py --pretraining
    - Instruct: python prepare_datasets.py --instruct
    - DPO: python prepare_datasets.py --dpo
  - NOTE: Dataset paths are configured in config.py under GlobalConfig.paths.datasets and GlobalConfig.paths.evals.
  - In manual mode, the training dataset preparation commands also support a custom mix file by passing --mix-file <file_path>. Check examples/pretraining_data_mix.example.json for an example. Local custom mix files should use the .local.json suffix, for example pretraining_debug.local.json, so they are ignored by Git. If no --mix-file is provided, the built-in default mix for that stage is used.
    - The default mix can be found in datasets_preparation/default_mixes.py
(OPTIONAL) Setup your Weights & Biases API key:
- Set WANDB_API_KEY environment variable if you want to log the progress there.
NOTE: For some scenarios you might need to also pass your Hugging Face API token HF_TOKEN. E.g.: If performing knowledge distillation and the teacher model requires access permissions.

Configuring and training

Configuration can be provided directly through a config YAML file or through a recipe YAML file.

A recipe is the preferred way to define an experiment. It contains:

config: the nested GlobalConfig used for training and runtime setup
data: dataset preparation settings, including the dataset mix and optional eval dataset preparation

The main sections are:

runtime: device, precision, FSDP, torch compile, CPU workers
model: model architecture and model-specific settings
training: stage, seed, total batch size, max steps, early stopping
optimizers: AdamW and optional Muon configuration
paths: dataset, evaluation, runs, and prompt paths
validation: validation frequency and number of validation steps
evals: HellaSwag, WinoGrande, and ARC-Challenge settings
generation: text generation frequency and max generation length
checkpointing: checkpoint save frequency and retention
tokenizer: tokenizer backend and checkpoint path
lora: optional LoRA configs
distillation: optional teacher model distillation settings
dpo: DPO configuration
wandb: W&B logging settings
torch_profiler: profiler settings

.env is no longer used as the experiment configuration file. It is only used for secrets and local cache settings:

WANDB_API_KEY=''
HF_TOKEN=''
HF_HOME='./cache'

Common flags

train.py accepts some flags that are useful to load a checkpoint or override some properties:

  --config <file>                   # Load config directly from a config YAML file
  --recipe <file>                   # Load config from a recipe YAML file
  --pretraining                     # Automatically sets pretraining stage.
  --instruct                        # Automatically sets instruct stage.
  --dpo                             # Automatically sets DPO stage.
  --checkpoint <file>               # Resume training from a specific checkpoint
  --reset-optimizers                # Ignore stored optimizer(s) state
  --start-step <N>                  # Override internal step counter

NOTES:

The project output path can be configured in config.py under GlobalConfig.paths.runs. By default it will use ./runs
When using --recipe, the recipe defines the training stage. The flags --pretraining, --instruct, and --dpo cannot be combined with --recipe.

Example recipe commands:

python prepare_datasets.py --recipe recipes/pretraining/debug.yaml
python train.py --recipe recipes/pretraining/debug.yaml

Running the Training

To train on single-GPU, run:

python train.py --recipe recipes/pretraining/debug.yaml

To train on multi-GPU run:

export OMP_NUM_THREADS=1

torchrun \
  --standalone \
  --nproc_per_node <NUMBER_OF_GPUs> \
  train.py --recipe recipes/pretraining/debug.yaml

To load a checkpoint and continue training, pass the flag to any of the above commands. E.g.:

export OMP_NUM_THREADS=1

torchrun \
  --standalone \
  --nproc_per_node <NUMBER_OF_GPUs> \
  train.py --recipe recipes/pretraining/debug.yaml --checkpoint <CHECKPOINT_FILE_PATH>

To train on multiple nodes with 1 or more GPUs per node, configure each node as follows:

Static

Ethernet

export NCCL_IB_DISABLE=1
export NCCL_SOCKET_NTHREADS=4
export NCCL_NSOCKS_PERTHREAD=8

InfiniBand

export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=$(ls /sys/class/infiniband | paste -sd, -)

export OMP_NUM_THREADS=1
export PYTHONUNBUFFERED=1
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export TORCH_DIST_BIND_ADDR=0.0.0.0 # only needed in master but no impact on the workers
export NCCL_DEBUG=WARN

NNODES=<NUMBER_OF_NODES>
NPERNODE=<NUMBER_OF_GPUs>
NODE_RANK=<NODE_RANK>
MASTER_ADDR=<MASTER_NODE_MACHINE_IP>
MASTER_PORT=<MASTER_NODE_MACHINE_PORT>

# make sure we can find the correct NIC
_IFACE=$(ip -o route get "$MASTER_ADDR" | awk '{for(i=1;i<=NF;i++) if($i=="dev"){print $(i+1); exit}}')
[ -n "$_IFACE" ] && [ "$_IFACE" != "lo" ] && export NCCL_SOCKET_IFNAME="$_IFACE"

torchrun \
  --nnodes ${NNODES} \
  --nproc-per-node ${NPERNODE} \
  --node-rank ${NODE_RANK} \
  --master_addr ${MASTER_ADDR} \
  --master_port ${MASTER_PORT} \
  train.py --recipe recipes/pretraining/debug.yaml

Elastic

export OMP_NUM_THREADS=1
export PYTHONUNBUFFERED=1
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export TORCH_DIST_BIND_ADDR=0.0.0.0 # only needed in master but no impact on the workers
export NCCL_DEBUG=WARN

NNODES=<NUMBER_OF_NODES>
NPERNODE=<NUMBER_OF_GPUs>
MASTER_ADDR=<MASTER_NODE_MACHINE_IP>
MASTER_PORT=<MASTER_NODE_MACHINE_PORT>
RDZV_EP="$MASTER_ADDR:$MASTER_PORT"
RDZV_ID=<SOME_SHARED_JOB_NAME>

# make sure we can find the correct NIC
_IFACE=$(ip -o route get "$MASTER_ADDR" | awk '{for(i=1;i<=NF;i++) if($i=="dev"){print $(i+1); exit}}')
[ -n "$_IFACE" ] && [ "$_IFACE" != "lo" ] && export NCCL_SOCKET_IFNAME="$_IFACE"


torchrun \
  --nnodes ${NNODES} \
  --nproc-per-node ${NPERNODE} \
  --rdzv-backend c10d \
  --rdzv-endpoint ${RDZV_EP} \
  --rdzv-id ${RDZV_ID} \
  train.py --recipe recipes/pretraining/debug.yaml

NOTE: The same command needs to be run on all nodes

More details on torchrun here
More details on NCCL here

Using Torch Profiler

Torch profiler settings are configured in config.py under GlobalConfig.torch_profiler. More details on the profiler API can be found here.

Running tests

From the root folder:

pytest

Local files

For convenience local configurations, recipes or other files should use the naming convention as defined in .gitignore:

*.local.json
*.private.json
*.local.ipynb
*.private.ipynb
*.local.yaml
*.private.yaml
*.local.yml
*.private.yml

Contributions

Feel free to reach out if interested in contributing!

Third-party assets and licenses

Tokenizer files, model weights, and datasets obtained from third parties are not included in this repository unless explicitly stated, and may be subject to their own licenses and terms.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Citation

Please cite this project if it was useful in your work:

@software{rui2024gradientgarden,
  author = {Rui Malheiro},
  title = {Gradient Garden},
  year = {2024},
  url = {https://github.com/ruimalheiro/gradient-garden}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Gradient Garden

Current focus

Current capabilities

Training and distributed execution

Optimization

Training workflows

Models

Other features

Evaluation

Notes

Project structure

Setup

Configuring and training

Common flags

Running the Training

Using Torch Profiler

Running tests

Local files

Contributions

Third-party assets and licenses

License

Citation

About

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 157 Commits
datasets_preparation		datasets_preparation
engine		engine
evals		evals
examples		examples
inference		inference
metrics		metrics
models		models
recipes		recipes
tasks		tasks
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
checkpoints.py		checkpoints.py
config.py		config.py
dataloaders.py		dataloaders.py
ddp_utils.py		ddp_utils.py
distillation_utils.py		distillation_utils.py
dpo_utils.py		dpo_utils.py
logger.py		logger.py
lr_schedulers.py		lr_schedulers.py
prepare_datasets.py		prepare_datasets.py
requirements.txt		requirements.txt
test_prompts.json		test_prompts.json
tokenizer.py		tokenizer.py
train.py		train.py
utils.py		utils.py
wandb_utils.py		wandb_utils.py

Folders and files

Latest commit

History

Repository files navigation

Gradient Garden

Current focus

Current capabilities

Training and distributed execution

Optimization

Training workflows

Models

Other features

Evaluation

Notes

Project structure

Setup

Configuring and training

Common flags

Running the Training

Using Torch Profiler

Running tests

Local files

Contributions

Third-party assets and licenses

License

Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Uh oh!

Contributors

Uh oh!

Languages