Figure: Locoprop-S beats AdamW on wallclock time on a TP=2 sweep on a Megatron GPT over a FineWeb Edu set, 2000 steps, two sweeps, with the feature gram matrix refreshed every 8 steps.
Newton-Muon and Locoprop are part of an evolving line of optimizers that study right-preconditioning of the gradient. A well-known limitation is that some of them require local activation information at the optimization step.
However:
- Newton-Muon can be expressed as $\mathrm{msign}(G f(C))$, where $C$ is the feature gram matrix. (NM chooses $f(C)$ to be $C^{-1}$, but it could be something else, from what I understand.)
- Locoprop can also be expressed purely as a function of $(G, C)$ through the following construction:
Hence, by adding support to route the _feature_gram_ (main_grad (TransformerEngine, Megatron-LM, and Emerging-Optimizers, with the possibility for a neat abstraction:
On a TP=2 sweep on a Megatron GPT over a FineWeb Edu set, 2000 steps, two sweeps, with the feature gram matrix refreshed every 8 steps, Locoprop beats AdamW on wallclock time:
It's wasteful to materialize the full gram matrix, so with Locoprop I tried a diagonal approximation and a block-diagonal approximation. The diagonal approximation seems to do well! NM appears to be slower at the moment, possibly because of the polar iteration; I still need to check.
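To make the approximations concrete, here is a minimal NumPy sketch of the Newton-Muon form $\mathrm{msign}(G f(C))$ with $f(C) = C^{-1}$, using a full, diagonal, or block-diagonal $C$. It is not the code from the branches and not the Locoprop-S rule. It assumes a linear layer with activations `X` (batch × in) and output gradients `D` (batch × out), so `G = D.T @ X` and `C = X.T @ X`. The function names, the `damping` term, and the block size are hypothetical, and `msign` uses an exact SVD in place of a polar iteration.

```python
import numpy as np

def msign(M):
    # Polar factor U @ V^T via a reduced SVD (exact; stands in for a polar iteration).
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def precondition(G, C, mode="full", block=2, damping=1e-6):
    # Right-precondition G by (C + damping * I)^{-1}, with C optionally approximated.
    n = C.shape[0]
    if mode == "diag":
        # Keep only diag(C): each column j of G is divided by C[j, j] + damping.
        return G / (np.diag(C) + damping)
    if mode == "block":
        # Keep only the diagonal blocks of C and solve one small system per block.
        out = np.empty_like(G)
        for s in range(0, n, block):
            e = min(s + block, n)
            Cb = C[s:e, s:e] + damping * np.eye(e - s)
            out[:, s:e] = np.linalg.solve(Cb, G[:, s:e].T).T
        return out
    # Full gram: G @ Cd^{-1} computed as solve(Cd, G^T)^T, valid because Cd is symmetric.
    Cd = C + damping * np.eye(n)
    return np.linalg.solve(Cd, G.T).T

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 4))   # activations, batch x in (toy sizes)
D = rng.standard_normal((64, 3))   # output gradients, batch x out
G = D.T @ X                        # weight gradient, out x in
C = X.T @ X                        # feature gram, in x in

update_diag = msign(precondition(G, C, mode="diag"))
update_full = msign(precondition(G, C, mode="full"))
```

The diagonal mode never forms an `in × in` matrix beyond its diagonal, so in a real layer only a length-`in` vector would need to be stored and routed.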
I think a few more AdamW LRs should be checked, but the initial results look promising, and not streaming the activations seems to work. I double-checked that the calculations come out equivalent.
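One way to picture the refresh cadence used in these runs (feature gram refreshed every 8 steps): the optimizer step reads a cached $C$, and activations are only touched on refresh steps. The sketch below is an illustration under that assumption, not the mechanism implemented in the branches. `GramCache` and `get_activations` are hypothetical names.

```python
import numpy as np

class GramCache:
    # Hypothetical helper: holds the latest feature gram and refreshes it every `every` steps.
    def __init__(self, every=8):
        self.every = every
        self.C = None
        self.refreshes = 0

    def get(self, step, get_activations):
        # Only call get_activations() on refresh steps; otherwise reuse the cached gram.
        if self.C is None or step % self.every == 0:
            X = get_activations()      # batch x in
            self.C = X.T @ X           # in x in feature gram
            self.refreshes += 1
        return self.C

rng = np.random.default_rng(0)
cache = GramCache(every=8)
for step in range(16):
    C = cache.get(step, lambda: rng.standard_normal((32, 4)))
```

Over 16 steps this recomputes the gram on steps 0 and 8 only. Every other step reuses the cached matrix.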
I've also attached a design doc for reference (feature_gram_matrix_optimizers_design.pdf), in case someone wants to upstream this into their own training pipeline.
The diff excluding test files is not that large:

| Repo | Non-test diff |
|---|---|
| Megatron-LM | 13 files, +1336/-2 |
| Emerging-Optimizers | 6 files, +1140/-3 |
| TransformerEngine | 5 files, +281/-0 |
| Total | 24 files, +2757/-5 |
Thanks to @mkhona-nvidia for his help!
TL;DR: Cross-repo control repository for the FEATURE_GRAM matrix optimizer integration.
| Component | Fork branch | Purpose |
|---|---|---|
| Megatron-LM | plugyawn/Megatron-LM@codex/feature-gram-matrix-optimizers | Megatron Core metadata, config, native collection, optimizer routing |
| Emerging-Optimizers | plugyawn/Emerging-Optimizers@codex/feature-gram-matrix-optimizers | Newton-Muon/LocoProp-S rules, TP apply helpers, FEATURE_GRAM kernels |
| TransformerEngine | plugyawn/TransformerEngine@codex/feature-gram-matrix-optimizers | TE extra wgrad helper and fused-module FEATURE_GRAM collection |
    git clone --recurse-submodules https://github.com/plugyawn/Megaprop.git

To refresh to the branch tips declared in .gitmodules:

    git submodule update --init --remote --recursive

The superproject commit pins exact SHAs for reproducibility.