HALoS: Hierarchical Asynchronous Local SGD over Slow Networks for Geo-Distributed Large Language Model Training

This repository contains the geo-distributed LLM-training simulator used in our ICML 2025 paper (HALoS).

Installation

# 1. Create and activate a conda environment
conda create -n halos python=3.10.14
conda activate halos

# 2. Install dependencies and the HALoS package
pip install -r requirements.txt
pip install -r requirements-flashattn.txt
pip install -e .
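
To sanity-check the install, try importing the package from the new environment (the module name halos is assumed from the editable install above; adjust it if the package is named differently):

# Optional: verify the package is importable (module name "halos" assumed)
python -c "import halos; print(halos.__file__)"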

Prepare the deduped Pile dataset (≈ 1.2 TB)

  1. Choose a directory and export its path:
export HALOS_DATA_DIR=/path/to/data
  2. Download EleutherAI/pile-deduped-pythia-preshuffled into $HALOS_DATA_DIR/datasets/pile-deduped-pythia-preshuffled:
mkdir -p $HALOS_DATA_DIR/datasets && cd $HALOS_DATA_DIR/datasets
git lfs clone https://huggingface.co/datasets/EleutherAI/pile-deduped-pythia-preshuffled
  3. Merge the 20 shards into a single mem-map file using utils/unshard_memmap.py from the EleutherAI/pythia repository (clone it alongside the dataset if you have not already):
git clone https://github.com/EleutherAI/pythia.git   # provides utils/unshard_memmap.py
python pythia/utils/unshard_memmap.py \
    --input_file $HALOS_DATA_DIR/datasets/pile-deduped-pythia-preshuffled/document-00000-of-00020.bin \
    --num_shards 20 \
    --output_dir $HALOS_DATA_DIR/datasets/pile-deduped-pythia-preshuffled
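
Optionally verify the merge before training. The exact name of the merged file depends on unshard_memmap.py, so the listing below simply checks that one much larger .bin mem-map now sits alongside the original 20 shards:

# Optional: the merged mem-map should appear as one large .bin file in this directory
ls -lh $HALOS_DATA_DIR/datasets/pile-deduped-pythia-preshuffled/*.bin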

Launch a local Ray cluster

# Start the head node
ray start --head --port=6379

# (Optional) Add a worker node from another machine;
# $HEAD_ADDRESS is the <head-node-ip>:6379 address printed by `ray start --head`
ray start --address="$HEAD_ADDRESS"

# Shut down the cluster later (when you're done)
ray stop
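
Before launching a run, you can confirm that the head node (and any workers) registered correctly:

# Show cluster nodes and available resources
ray status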

Run the examples

Each script trains Pythia-70M for 12.9B tokens and writes results to $HALOS_DATA_DIR/${ALGO_NAME}_results. If $WANDB_API_KEY is set, training and validation losses are logged to Weights & Biases.

# HALoS (our method)
bash examples/run_halos.sh

# DiLoCo baseline
bash examples/run_diloco.sh

# DiLoCo + DynUpd baseline
bash examples/run_diloco_dynupd.sh

# Async-Local-SGD baseline
bash examples/run_async_local_sgd.sh
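
A typical end-to-end invocation might look like the sketch below; the W&B key is a placeholder, and the halos_results directory name assumes ALGO_NAME=halos for run_halos.sh:

export HALOS_DATA_DIR=/path/to/data     # same directory prepared above
export WANDB_API_KEY=<your-wandb-key>   # optional: enables Weights & Biases logging
bash examples/run_halos.sh
ls "$HALOS_DATA_DIR"/halos_results      # assumes ${ALGO_NAME} is "halos" for this script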
