Megaprop

Pedagogical, non-frontier training with non-powerful right-preconditioned optimizers!

Figure: Locoprop-S beats AdamW on wallclock time on a TP=2 sweep on a Megatron GPT over a FineWeb Edu set, 2000 steps, two sweeps, with the feature gram matrix refreshed every 8 steps.

Background

Newton-Muon and Locoprop are part of an evolving line of optimizers that study right-preconditioning of the gradient. A well-known limitation has been that some of them require local activation info at the optimization step.

However,

Newton-Muon can be expressed as $\mathrm{msign}(G f(C))$, where $C$ is the feature gram matrix. (NM chooses $f(C) = C^{-1}$, but from what I understand it could be something else.)
Locoprop can also be expressed purely as a function of $(G,C)$ through the following construction:

Hence, both optimizers can be supported by routing the _feature_gram_ ($X^T X$) alongside the main_grad ($dY^T X$), via a series of changes to TransformerEngine, Megatron-LM, and Emerging-Optimizers, with the possibility of a neat abstraction.
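To make the two routed quantities concrete, here is a minimal NumPy sketch of the Newton-Muon form $\mathrm{msign}(G f(C))$ with $f(C) = C^{-1}$. It is not the code from the forks: the function names, the SVD-based `msign`, and the `damping` term are assumptions made for illustration (a real implementation would more likely use an iterative polar method on GPU).

```python
import numpy as np

def msign(M):
    """Matrix sign (polar factor) of M, computed as U @ Vt from the thin SVD."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def newton_muon_direction(G, C, damping=1e-3):
    """Return msign(G f(C)) with f(C) = (C + damping * I)^{-1}.

    `damping` is an assumption added here for numerical stability.
    """
    d = C.shape[0]
    A = C + damping * np.eye(d)
    # A is symmetric, so G @ inv(A) == solve(A, G.T).T; this avoids an explicit inverse.
    right_preconditioned = np.linalg.solve(A, G.T).T
    return msign(right_preconditioned)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 16))   # activations: (tokens, in_features)
dY = rng.standard_normal((64, 8))   # output gradients: (tokens, out_features)

G = dY.T @ X   # main_grad, shape (out_features, in_features)
C = X.T @ X    # feature gram, shape (in_features, in_features)

update = newton_muon_direction(G, C)
```

Note that the update only needs `G` and `C`; the activations `X` appear solely to build those two matrices.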

Experiments

On a TP=2 sweep on a Megatron GPT over a FineWeb Edu set, 2000 steps, two sweeps, with the feature gram matrix refreshed every 8 steps, Locoprop beats AdamW on wallclock time:

It's wasteful to materialize the full gram matrix, so with Locoprop I tried a diagonal approximation and a block-diagonal approximation. The diagonal approximation seems to do well! NM appears to be slower at the moment, possibly because of the polar iteration; I still need to check.
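As a rough illustration of the two approximations, the sketch below computes the diagonal and the diagonal blocks of $X^T X$ directly from the activations, without ever forming the full matrix. The helper names and the divisibility requirement on `block_size` are assumptions for this example and are not taken from the forks.

```python
import numpy as np

def gram_diagonal(X):
    """diag(X^T X): per-feature sum of squares, shape (d,)."""
    return np.einsum("td,td->d", X, X)

def gram_block_diagonal(X, block_size):
    """Diagonal blocks of X^T X, shape (d // block_size, block_size, block_size)."""
    tokens, d = X.shape
    if d % block_size != 0:
        raise ValueError("this sketch assumes d is divisible by block_size")
    Xb = X.reshape(tokens, d // block_size, block_size)
    # For each block b, contract over tokens: sum_t Xb[t, b, i] * Xb[t, b, j].
    return np.einsum("tbi,tbj->bij", Xb, Xb)

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 8))    # activations: (tokens, in_features)
C_full = X.T @ X                    # full gram, built here only for comparison

diag = gram_diagonal(X)
blocks = gram_block_diagonal(X, block_size=4)
```

For `d` input features, storage drops from `d * d` entries to `d` (diagonal) or `d * block_size` (block diagonal). With a diagonal gram and $f(C) = C^{-1}$, right-preconditioning reduces to a per-column scaling of `G`.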

I think a few more AdamW LRs should be checked, but the initial results look promising, and not streaming the activations seems to work. I double-checked that the calculations come out equivalent.
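One property that makes it possible to route the gram instead of the activations is that $X^T X$ is additive over rows, so it can be accumulated microbatch by microbatch and then reused between refreshes. The sketch below shows that idea with a refresh interval of 8 steps, as in the experiments; `FeatureGramCache` and its methods are hypothetical names for this example, not the API of any of the three forks.

```python
import numpy as np

class FeatureGramCache:
    """Hypothetical helper: holds a feature gram and refreshes it every N steps."""

    def __init__(self, refresh_every=8):
        self.refresh_every = refresh_every
        self.C = None
        self.refreshed_at = None

    def update(self, step, microbatches):
        """Recompute C on refresh steps; otherwise return the stale copy."""
        if self.C is not None and step % self.refresh_every != 0:
            return self.C
        C = None
        for X in microbatches:
            contrib = X.T @ X   # gram of one microbatch
            C = contrib if C is None else C + contrib
        self.C = C
        self.refreshed_at = step
        return self.C

rng = np.random.default_rng(0)
cache = FeatureGramCache(refresh_every=8)
refresh_steps = []
matches_full_gram = []
for step in range(20):
    microbatches = [rng.standard_normal((16, 4)) for _ in range(3)]
    cache.update(step, microbatches)
    if cache.refreshed_at == step:
        refresh_steps.append(step)
        # Compare against the gram of all activations concatenated together.
        X_all = np.concatenate(microbatches, axis=0)
        matches_full_gram.append(bool(np.allclose(cache.C, X_all.T @ X_all)))
```

Between refreshes the optimizer would see a gram that is up to 7 steps stale; whether that is acceptable is an empirical question that the sweeps above speak to.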

I've also attached a design doc here for reference, feature_gram_matrix_optimizers_design.pdf, in case someone wants to upstream this into their own training pipeline.

The diff, excluding test files, is not that large:

repo                 | non-test diff
---------------------|-----------------------
Megatron-LM          | 13 files, +1336/-2
Emerging-Optimizers  | 6 files, +1140/-3
TransformerEngine    | 5 files, +281/-0
Total                | 24 files, +2757/-5

Thanks to @mkhona-nvidia for his help!

TL;DR: Cross-repo control repository for the FEATURE_GRAM matrix optimizer integration.

Pinned Components

Component            | Fork branch | Purpose
---------------------|-------------|--------
Megatron-LM          | `plugyawn/Megatron-LM@codex/feature-gram-matrix-optimizers` | Megatron Core metadata, config, native collection, optimizer routing
Emerging-Optimizers  | `plugyawn/Emerging-Optimizers@codex/feature-gram-matrix-optimizers` | Newton-Muon/LocoProp-S rules, TP apply helpers, FEATURE_GRAM kernels
TransformerEngine    | `plugyawn/TransformerEngine@codex/feature-gram-matrix-optimizers` | TE extra wgrad helper and fused-module FEATURE_GRAM collection

Checkout

git clone --recurse-submodules https://github.com/plugyawn/Megaprop.git

To refresh to the branch tips declared in .gitmodules:

git submodule update --init --remote --recursive

The superproject commit pins exact SHAs for reproducibility.

Pinned submodule SHAs:

Emerging-Optimizers @ 6edcb19
Megatron-LM @ 7d46e3d
TransformerEngine @ 8786177
