HALoS: Hierarchical Asynchronous Local SGD over Slow Networks for Geo-Distributed Large Language Model Training
This repository contains the geo-distributed LLM-training simulator used in our ICML 2025 paper (HALoS).
# 1. Create and activate a conda environment
conda create -n halos python=3.10.14
conda activate halos
# 2. Install dependencies and the HALoS package
pip install -r requirements.txt
pip install -r requirements-flashattn.txt
pip install -e .
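To quickly confirm the environment is usable, you can try importing the core dependencies (the torch and flash_attn package names below are assumed from the requirements files; adjust if your setup differs):
# Optional sanity check: both imports should succeed without errors
# (package names assumed from requirements.txt / requirements-flashattn.txt)
python -c "import torch, flash_attn; print(torch.__version__)"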
- Choose a directory and export its path:
export HALOS_DATA_DIR=/path/to/data
- Download `EleutherAI/pile-deduped-pythia-preshuffled` into `$HALOS_DATA_DIR/datasets/pile-deduped-pythia-preshuffled`:
mkdir -p $HALOS_DATA_DIR/datasets && cd $HALOS_DATA_DIR/datasets
git lfs clone https://huggingface.co/datasets/EleutherAI/pile-deduped-pythia-preshuffled
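If Git LFS is inconvenient on your machine, the same dataset can usually be fetched with the Hugging Face Hub CLI instead (this assumes `huggingface_hub` is installed; it is not part of the steps above):
# Alternative download via the Hugging Face Hub CLI (assumes huggingface_hub is installed)
huggingface-cli download EleutherAI/pile-deduped-pythia-preshuffled \
  --repo-type dataset \
  --local-dir $HALOS_DATA_DIR/datasets/pile-deduped-pythia-preshuffled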
- Merge the 20 shards into a single mem-map file using `utils/unshard_memmap.py` from the Pythia repository:
python pythia/utils/unshard_memmap.py \
--input_file $HALOS_DATA_DIR/datasets/pile-deduped-pythia-preshuffled/document-00000-of-00020.bin \
--num_shards 20 \
--output_dir $HALOS_DATA_DIR/datasets/pile-deduped-pythia-preshuffled
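As a quick, optional sanity check, the merged mem-map should appear as a single large file next to the original shards (the exact output file name is determined by unshard_memmap.py):
# List the dataset directory; the merged file should be roughly the combined size of the 20 shards
ls -lh $HALOS_DATA_DIR/datasets/pile-deduped-pythia-preshuffled/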
# Start the head node
ray start --head --port=6379
# (Optional) Add a worker node from another machine
# ($HEAD_ADDRESS is the head node's <ip>:6379 address printed by "ray start --head" above)
ray start --address="$HEAD_ADDRESS"
# Shut down the cluster when you're done
ray stop
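Before launching training, you can verify that every machine has joined the cluster by running the following on any connected node:
# Show connected nodes and available resources
ray status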
Each script trains Pythia-70M for 12.9B tokens and writes results to `$HALOS_DATA_DIR/${ALGO_NAME}_results`.
If `$WANDB_API_KEY` is set, training and validation losses are logged to Weights & Biases.
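For example, to enable the optional Weights & Biases logging before launching a run (the key placeholder below is yours to fill in):
# Optional: enable logging to Weights & Biases
export WANDB_API_KEY=<your-wandb-api-key>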
# HALoS (our method)
bash examples/run_halos.sh
# DiLoCo baseline
bash examples/run_diloco.sh
# DiLoCo + DynUpd baseline
bash examples/run_diloco_dynupd.sh
# Async-Local-SGD baseline
bash examples/run_async_local_sgd.sh