Olmix

CI · Python 3.11+ · Status: WIP

Warning

This project is under active development. We are migrating from our internal infrastructure to open source — expect rough edges, missing docs, and breaking changes.

Toolkit for optimizing pretraining data mixtures. Learns from small-scale proxy experiments ("swarms") to predict how data mixing ratios affect downstream performance, then proposes optimized mixtures for full-scale training.

Installation

git clone https://github.com/allenai/olmix.git
cd olmix
uv pip install -e ".[dev]"

Part 1: Fitting from CSV data

The easiest way to use Olmix is to bring your own swarm results as CSV files. Our swarms are available on Hugging Face. This section explains how to run regression fitting and mixture optimization.

Input format

Prepare two CSV files. Each row is one training run from your swarm, and the two files are joined on the ID column (run or run_id).

ratios.csv — the mixing weights used in each run. Domain columns must sum to ~1.0 per row:

```
run,dclm,wikipedia,arxiv
run_001,0.45,0.30,0.25
run_002,0.60,0.20,0.20
run_003,0.33,0.33,0.34
```

metrics.csv — the evaluation metrics measured for each run (lower is better for BPB metrics):

```
run,arc_challenge_bpb,hellaswag_bpb,mmlu_stem_bpb
run_001,1.23,0.87,1.45
run_002,1.15,0.91,1.38
run_003,1.20,0.89,1.42
```

The domain column names in ratios.csv and the metric column names in metrics.csv can be anything — olmix derives them from the CSV headers. Both run and run_id are accepted as the ID column. An optional name column in either file provides human-readable run labels.
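Before running a fit, it can help to sanity-check the two files. A minimal sketch using pandas (not part of olmix) that verifies the join and the ratio sums:

```python
import pandas as pd

# Both "run" and "run_id" are accepted as the ID column; normalize to "run".
ratios = pd.read_csv("ratios.csv").rename(columns={"run_id": "run"})
metrics = pd.read_csv("metrics.csv").rename(columns={"run_id": "run"})

# Every run should appear in both files.
joined = ratios.merge(metrics, on="run", how="inner")
assert len(joined) == len(ratios) == len(metrics), "ID mismatch between files"

# Domain columns (everything except the ID and optional name column) must sum to ~1.0.
domain_cols = [c for c in ratios.columns if c not in ("run", "name")]
row_sums = ratios[domain_cols].sum(axis=1)
assert ((row_sums - 1.0).abs() < 1e-3).all(), "mixture ratios do not sum to ~1"
```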

Fit config

olmix fit is configured entirely via a YAML file. See configs/fits/dclm_baseline.yaml for a full example. The config has these sections:

swarm:
  ratios: path/to/ratios.csv        # Required — CSV with domain mixture ratios per run
  metrics: path/to/metrics.csv      # Required — CSV with eval metrics per run

priors:                              # Required — token distribution across domains
  token_counts:
    domain_a: 600000000
    domain_b: 400000000

eval:                                 # Required — evaluation task definitions
  type: offline                       # offline | inloop
  tasks:
    math:
      - "minerva_math_algebra::olmes"
    code:
      - "codex_humaneval:3shot::none"
    qa:
      - "arc_challenge:rc::olmes"

regression:
  type: log_linear                   # log_linear | lightgbm | search | gp | autoscale | bimix
  seed: 0
  n_test: 0
  train_split: [1.0]
  aggregate_task_families: false

proposer:
  type: exact                        # exact | simulation | search
  temperature: null
  kl_reg: 0.1
  use_natural_kl: false
  fit_only: false
  make_worst_mix: false

constraints:
  enabled: false
  target_tokens: null                # Total token budget for the final training run
  repetition_factor: 5.0

filtering:
  keep_sources: []
  support_domains: []
  drop_metrics: []
  fixed_weight: {}
  obj_weights: {}

The priors section defines the natural token distribution across your domains via token_counts. Relative sizes and total tokens are computed automatically. Use olmix priors compute to scan S3 sources and generate the token counts for a config.

The eval section defines which evaluation tasks to use, grouped by family. Two types are supported:

  • offline — for cookbook-eval metrics (used by olmix fit with CSV data). Tasks are metric names matching CSV column headers.
  • inloop — for WandB in-loop metrics (used by olmix launch and olmix fit). Tasks map olmo-core task IDs to WandB metric names: {task_id: "eval/downstream/task_id (BPB v2)"}.

Task families are defined by the nesting structure (e.g., math, code, qa) and are used by aggregate_task_families.
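For example, a filled-in inloop eval section might look like the following (the task IDs and families here are illustrative):

```yaml
eval:
  type: inloop
  tasks:
    math:
      minerva_math_algebra: "eval/downstream/minerva_math_algebra (BPB v2)"
    qa:
      arc_challenge: "eval/downstream/arc_challenge (BPB v2)"
```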

Running a fit

olmix fit --config configs/fits/dclm_baseline.yaml --output-dir output/my_fit

That's it. All settings come from the YAML config. The two required CLI flags are:

| Flag | Description |
| --- | --- |
| `--config` | Path to the YAML fit configuration file |
| `--output-dir` | Directory for saving fit outputs |

Config reference

eval section

| Field | What it does |
| --- | --- |
| `type` | Eval type: `offline` (cookbook-eval metrics for CSV-based fitting) or `inloop` (WandB in-loop metrics for launch + fitting) |
| `tasks` | Tasks grouped by family. For `offline`: `{family: [metric_name, ...]}`. For `inloop`: `{family: {task_id: wandb_metric_name}}`. |

regression section

| Field | What it does |
| --- | --- |
| `type` | Model type: `log_linear` (default, parametric scaling law), `lightgbm` (gradient-boosted trees), `gp` (Gaussian process), `autoscale` (power-law autoscaling), `bimix` (BiMix-style power law) |
| `aggregate_task_families` | Fit one model per task family (math, code, QA) instead of per individual task. Much faster with many metrics. |
| `train_split` | Fraction of runs used for training. Default `[1.0]` uses all runs for both training and evaluation. |
| `n_test` | Number of held-out test samples for evaluating the regression model. |
| `seed` | Random state for the train-test split. |

proposer section

| Field | What it does |
| --- | --- |
| `type` | How to search for optimal weights: `exact` (convex optimization for log-linear), `simulation` (Dirichlet Monte Carlo), `search` (grid over observed points) |
| `kl_reg` | KL divergence regularization strength (exact proposer only). Penalizes the proposed mix for diverging from the prior. |
| `use_natural_kl` | Use the natural (token-count-based) distribution as the KL reference, even when a manual prior is set. |
| `temperature` | Temperature for adjusting the Dirichlet prior in simulation. Closer to 0 = more uniform. |
| `fit_only` | Only fit the regression models, skip the mixture proposal step. Useful for inspecting model quality. |
| `make_worst_mix` | Invert the objective function and produce a bad mix (for counterfactual analysis). |
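To make the exact proposer's objective concrete, here is a conceptual sketch of KL-regularized mixture optimization over the simplex. It assumes a fitted log-linear model of the form loss(w) = exp(b0 + beta · w); olmix's actual regression and proposer code may differ in detail:

```python
import numpy as np
from scipy.optimize import minimize

def propose_mix(beta, b0, prior, kl_reg=0.1):
    """Minimize predicted loss + kl_reg * KL(w || prior) over the simplex (sketch)."""
    def objective(w):
        pred = np.exp(b0 + beta @ w)  # predicted metric, lower is better
        kl = np.sum(w * np.log(np.maximum(w, 1e-12) / prior))  # KL(w || prior)
        return pred + kl_reg * kl

    d = len(prior)
    w0 = np.full(d, 1.0 / d)
    result = minimize(
        objective, w0, method="SLSQP",
        bounds=[(0.0, 1.0)] * d,
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
    )
    return result.x
```

Larger `kl_reg` keeps the proposal closer to the prior; as `kl_reg` approaches 0, the proposal trusts the regression model alone.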

constraints section

| Field | What it does |
| --- | --- |
| `enabled` | Enable token budget constraints on the proposed mixture. |
| `target_tokens` | Total token budget for the final training run. Required when `enabled: true`. |
| `repetition_factor` | Maximum times a source's tokens can be repeated (default: 5.0). |
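The arithmetic behind the constraint is straightforward: a domain's proposed weight is feasible only if its share of the token budget can be served without repeating its data more than `repetition_factor` times. A toy check with made-up numbers:

```python
# Hypothetical token counts, for illustration only.
target_tokens = 1_000_000_000_000  # 1T-token training budget
available = {"dclm": 600e9, "wikipedia": 4e9}
repetition_factor = 5.0

for domain, weight in {"dclm": 0.9, "wikipedia": 0.1}.items():
    needed = weight * target_tokens
    cap = repetition_factor * available[domain]
    status = "ok" if needed <= cap else f"infeasible ({needed / cap:.1f}x over the repetition cap)"
    print(domain, status)
```

Here wikipedia at weight 0.1 would need 100B tokens but can supply at most 5 × 4B = 20B, so a constrained proposer would have to shift that weight elsewhere.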

filtering section

| Field | What it does |
| --- | --- |
| `keep_sources` | Only use runs where these sources have nonzero weight (and all others are zero). |
| `support_domains` | Only use runs where these domains' ratios sum to 1. |
| `drop_metrics` | Exclude specific metrics from fitting. |
| `fixed_weight` | Pin specific domains to fixed weights — they are excluded from optimization. Native dict syntax (not JSON string). |
| `obj_weights` | Non-uniform weights for averaging BPB across tasks. Default is uniform. |
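A filled-in filtering section might look like this (domain and metric names are illustrative, reusing the CSV example from Part 1):

```yaml
filtering:
  keep_sources: []
  support_domains: [dclm, wikipedia, arxiv]
  drop_metrics: [mmlu_stem_bpb]
  fixed_weight: {wikipedia: 0.05}   # native dict syntax, not a JSON string
  obj_weights: {arc_challenge_bpb: 2.0, hellaswag_bpb: 1.0}
```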

Output

All results are written to a hashed subdirectory under the --output-dir you specify. The subdirectory name is derived from a hash of the config, so different configurations produce separate output folders.

| File | Description |
| --- | --- |
| `config.json` | Full configuration used for this fit (for reproducibility) |
| `interaction_matrix.png` | Heatmap of regression coefficients: rows are domains, columns are metrics. Shows which domains help or hurt each metric. |
| `interaction_matrix_signed_evidence.png` | Same matrix colored by statistical significance. Green = significant positive effect, red = significant negative effect. |
| `interaction_matrix.npy` | Raw interaction matrix as a NumPy array (for downstream analysis). |
| `{metric}_*_fit.png` | Per-metric regression plot: predicted vs. actual values. Tight clustering along the diagonal means the model fits well. |
| `{metric}_*_correlations.json` | Correlation metrics (e.g. R²) for each regression fit. |
| `path_to_regression_model.txt` | Path to the cached regression model (pickle). Reused on subsequent fits with the same regression config. |

When fit_only: false, the proposer step also produces:

| File | Description |
| --- | --- |
| `{metric}_*_optimal.json` | Proposed optimal weights for this metric (list of `{"domain": ..., "weight": ...}`). |
| `{metric}_*_optimal.png` | Bar chart comparing the natural prior (corpus distribution) to the proposed optimal weights. |
| `predicted_performance.json` | Predicted average metric value at the proposed optimal weights. |

The key output is opt_avg_all_metrics_*_optimal.json — the single set of weights that optimizes the average across all metrics.
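Because the filename's middle segment varies per fit and the results live in a hashed subdirectory, the easiest way to pick that file up programmatically is to glob for it. A small sketch, assuming a single fit under output/my_fit:

```python
import glob
import json

[path] = glob.glob("output/my_fit/*/opt_avg_all_metrics_*_optimal.json")
with open(path) as f:
    weights = {entry["domain"]: entry["weight"] for entry in json.load(f)}
print(weights)
```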


Part 2: Launching swarms and fitting from W&B

Once you're comfortable with the fitting workflow above, you can use olmix end-to-end: generate candidate mixtures, launch proxy training runs on Beaker, and fit directly from the W&B results.

The workflow uses two separate configs:

  • GenerationConfig — controls how mixes are sampled (data sources, priors, swarm parameters, token budget). See configs/generations/example.yaml.
  • LaunchConfig — controls how training runs are launched (infra, training hyperparams, eval, mix). See configs in configs/experiments/.

Every LaunchConfig requires an explicit top-level mix field that maps domain keys to weights and repetition factors. The data.sources section describes what data exists; the mix section describes how much of each domain to use.

The mix supports two formats — nested (recommended for hand-written configs) and flat (used by generated configs). Both are equivalent; nested mixes are auto-flattened on load.

Nested format — mirrors the source/topic/quality hierarchy. Weights at each level are multiplied to get the final leaf weight. repetition_factor is inherited from the nearest ancestor that sets it:

mix:
  dclm:
    weight: 0.8
    repetition_factor: 1.0
    science_math_and_technology:
      weight: 0.25
      repetition_factor: 1.0
    software_development:
      weight: 0.625
      repetition_factor: 1.0
    education_and_jobs:
      weight: 0.125
      repetition_factor: 1.0
  wikipedia:
    weight: 0.1
    repetition_factor: 2.0
  arxiv:
    weight: 0.1
    repetition_factor: 1.5

For quality-level nesting:

mix:
  all_dressed:
    weight: 0.98
    repetition_factor: 1.0
    science:
      weight: 0.20
      high: { weight: 0.70, repetition_factor: 1.0 }
      med: { weight: 0.30, repetition_factor: 1.0 }
    code:
      weight: 0.50
      high: { weight: 0.70, repetition_factor: 1.0 }
      med: { weight: 0.30, repetition_factor: 1.0 }
  arxiv:
    weight: 0.02
    repetition_factor: 1.5

Flat format — colon-separated domain keys, each with weight and repetition_factor. This is what olmix generate produces:

mix:
  dclm:science_math_and_technology:
    weight: 0.2
    repetition_factor: 1.0
  dclm:software_development:
    weight: 0.5
    repetition_factor: 1.0
  wikipedia:
    weight: 0.1
    repetition_factor: 2.0
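The flattening rules above (multiply weights down the tree; inherit repetition_factor from the nearest ancestor that sets it) fit in a few lines of Python. This is an illustration of the semantics, not olmix's internal code:

```python
def flatten_mix(node, prefix="", inherited_rf=None, parent_weight=1.0):
    """Flatten a nested mix into {"source:topic": {"weight": ..., "repetition_factor": ...}}."""
    flat = {}
    for key, child in node.items():
        if key in ("weight", "repetition_factor"):
            continue
        weight = parent_weight * child.get("weight", 1.0)
        rf = child.get("repetition_factor", inherited_rf)
        name = f"{prefix}:{key}" if prefix else key
        subkeys = [k for k in child if k not in ("weight", "repetition_factor")]
        if subkeys:  # internal node: recurse with accumulated weight and inherited factor
            flat.update(flatten_mix(child, name, rf, weight))
        else:  # leaf domain
            flat[name] = {"weight": weight, "repetition_factor": rf}
    return flat
```

Applied to the nested dclm example above, this reproduces the flat form: dclm:science_math_and_technology gets weight 0.8 × 0.25 = 0.2, dclm:software_development gets 0.8 × 0.625 = 0.5, and so on.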

Step 0: Compute priors (token counts)

Before generating mixes, compute the token counts for your data sources. This scans S3 paths and outputs the priors block for your generation config:

olmix priors compute --config configs/generations/example.yaml

This outputs a YAML block you can paste directly into your generation config:

priors:
  token_counts:
    arxiv: 21377485731
    dclm:education_and_jobs: 20771836713
    dclm:science_math_and_technology: 84526121193
    dclm:software_development: 23878302458
    wikipedia: 3692487830

Use --output priors.yaml to write the block to a file instead. Results are cached in cache/ for subsequent runs; use --no-cache to force a fresh scan.
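The relative sizes mentioned in Part 1 fall directly out of these counts; normalizing them gives the natural distribution that the proposer's KL term can reference:

```python
token_counts = {
    "arxiv": 21_377_485_731,
    "dclm:education_and_jobs": 20_771_836_713,
    "dclm:science_math_and_technology": 84_526_121_193,
    "dclm:software_development": 23_878_302_458,
    "wikipedia": 3_692_487_830,
}
total = sum(token_counts.values())
natural = {k: count / total for k, count in token_counts.items()}
# e.g. dclm:science_math_and_technology ~ 0.55, wikipedia ~ 0.02
```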

Step 1: Generate candidate mixtures

Use olmix generate to sample mixture variants from a generation config. The --base flag provides a launch config template, and each variant is written as a self-contained launch config YAML file — ready to submit directly.

olmix generate \
  --config configs/generations/example.yaml \
  --base configs/experiments/data_proportions/mix_baseline.yaml \
  --output output/my_variants/

This produces one YAML file per variant in the output directory:

output/my_variants/
  example-swarm-a1b2c3d4-0000.yaml
  example-swarm-a1b2c3d4-0001.yaml
  example-swarm-a1b2c3d4-0002.yaml
  example-swarm-a1b2c3d4-0003.yaml

Each variant file is a complete launch config with infra, training, data, eval, and the sampled mix:

name: example-swarm-a1b2c3d4-0000
description: Data proportions experiment - balanced baseline mix
infra:
  budget: ai2/oe-base
  cluster: ai2/jupiter
  # ...
training:
  proxy_model_id: olmo3_14m
  # ...
data:
  sources:
  - name: dclm
    topics:
    - name: science_math_and_technology
      paths:
      - s3://...
  - name: wikipedia
    paths:
    - s3://...
eval:
  tasks: { ... }
mix:
  dclm:science_math_and_technology:
    weight: 0.55
  wikipedia:
    weight: 0.10
group_id: a1b2c3d4

Inspect and edit these files before launching — this is the point where you have full control over what gets trained.
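One quick mechanical check after hand edits: confirm each variant's flat mix still sums to ~1, as in the examples above. A sketch assuming PyYAML and the flat mix format:

```python
import glob
import yaml

for path in sorted(glob.glob("output/my_variants/*.yaml")):
    cfg = yaml.safe_load(open(path))
    total = sum(domain["weight"] for domain in cfg["mix"].values())
    print(f"{path}: mix sums to {total:.3f}")
```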

Step 2: Preview training commands

olmix launch preview renders the full OLMo training command for each variant. The --variants flag accepts a directory of configs or a single config file, and the command prints to stdout without launching anything.

olmix launch preview --variants output/my_variants/          # directory
olmix launch preview --variants configs/experiments/data_proportions/mix_heavy_code.yaml  # single file

Step 3: Launch a swarm

olmix launch run submits one Beaker job per variant. Each job trains a proxy model on its mixture and logs eval metrics to W&B under a shared group ID. Launch metadata is saved in the variants directory.

olmix launch run --variants output/my_variants/
olmix launch run --variants configs/experiments/data_proportions/mix_heavy_code.yaml  # single file

Use --dry-run to generate the metadata JSON without launching any jobs.

Step 4: Export to CSV and fit

Once the swarm runs complete, export the ratios and metrics to CSV files (e.g. from W&B), then fit using the YAML config workflow described in Part 1:

olmix fit --config configs/fits/my_config.yaml --output-dir output/my_fit
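If your swarm's metrics live in W&B, a minimal export sketch using the wandb public API might look like this (the entity, project, and metric names are placeholders for your own):

```python
import pandas as pd
import wandb

api = wandb.Api()
# Filter to the swarm's shared group ID (see group_id in the generated variant configs).
runs = api.runs("my-entity/my-project", filters={"group": "a1b2c3d4"})

rows = []
for run in runs:
    row = {"run": run.name}
    # Pull the in-loop metrics named in your eval config from each run's summary.
    for metric in ["eval/downstream/arc_challenge (BPB v2)"]:
        row[metric] = run.summary.get(metric)
    rows.append(row)

pd.DataFrame(rows).to_csv("metrics.csv", index=False)
```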

Development

make run-checks   # format + lint + typecheck + test

Citation

@article{chen2026olmix,
  title={Olmix: A Framework for Data Mixing Throughout LM Development},
  author={Chen, Mayee F and Murray, Tyler and Heineman, David and Jordan, Matt and Hajishirzi, Hannaneh and Re, Christopher and Soldaini, Luca and Lo, Kyle},
  year={2026},
  month={February}
}

License

Apache 2.0
