Megaprop

Pedagogical, non-frontier training with non-powerful right-preconditioned optimizers!

image

Figure: Locoprop-S beats AdamW on wallclock time on a TP=2 sweep on a Megatron GPT over a FineWeb Edu set, 2000 steps, two sweeps, with the feature gram matrix refreshed every 8 steps.


Background

Newton-Muon and Locoprop are part of an evolving line of optimizers that study right-preconditioning of the gradient. A well-known limitation is that some of them require local activation information at the optimization step.

However,

  • Newton-Muon can be expressed as $\mathrm{msign}(G\,f(C))$, where $C$ is the feature gram matrix. (NM chooses $f(C) = C^{-1}$, but it could be something else, from what I understand.)
  • Locoprop can also be expressed purely as a function of $(G,C)$ through the following construction:
Image
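To make the $(G, C)$ form concrete, here is a minimal NumPy sketch of the Newton-Muon expression with $f(C) = C^{-1}$. It is not the code from the forks: the helper names and the `damping` term are my own assumptions, and `msign` is computed with an SVD for clarity where a real implementation would more likely use a polar iteration.

```python
import numpy as np

def msign(M):
    """Matrix sign (polar factor) of M: U @ Vt from the reduced SVD."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def newton_muon_direction(G, C, damping=1e-3):
    """Hypothetical helper: msign(G f(C)) with f(C) = (C + damping * I)^{-1}.

    The damping term is an assumption added for numerical stability.
    """
    n = C.shape[0]
    # Solve (C + damping I) Z = G^T rather than forming the inverse.
    # C is symmetric, so Z^T equals G (C + damping I)^{-1}.
    Z = np.linalg.solve(C + damping * np.eye(n), G.T)
    return msign(Z.T)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))    # activations: (tokens, in_features)
dY = rng.standard_normal((64, 4))   # output grads: (tokens, out_features)
G = dY.T @ X                        # main_grad, shape (out, in)
C = X.T @ X                         # feature gram, shape (in, in)
D = newton_muon_direction(G, C)     # update direction, same shape as G
```

Note that the update only ever touches `G` and `C`; the activations `X` are not needed once those two matrices exist.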

Hence, this repository adds support for routing the _feature_gram_ ($X^T X$) beside the main_grad ($dY^T X$), via a series of changes to TransformerEngine, Megatron-LM, and Emerging-Optimizers, which opens up the possibility of a neat abstraction:

Image
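The reason this routing can work is that both $dY^T X$ and $X^T X$ are sums over tokens, so the gram matrix can be accumulated in the same pass as the weight gradient without keeping activations around until the optimizer step. The sketch below checks that identity in plain NumPy; the variable names are illustrative and do not correspond to the Megatron-LM or TransformerEngine APIs.

```python
import numpy as np

rng = np.random.default_rng(0)
in_features, out_features = 8, 4

# Four hypothetical microbatches of (activations, output grads).
microbatches = [
    (rng.standard_normal((16, in_features)),
     rng.standard_normal((16, out_features)))
    for _ in range(4)
]

G = np.zeros((out_features, in_features))   # accumulated main_grad
C = np.zeros((in_features, in_features))    # accumulated feature gram
for X, dY in microbatches:
    G += dY.T @ X   # weight gradient contribution
    C += X.T @ X    # feature gram contribution, same pass over X

# Reference: keep every activation and compute both matrices at the end.
X_all = np.concatenate([X for X, _ in microbatches])
dY_all = np.concatenate([dY for _, dY in microbatches])
G_ref = dY_all.T @ X_all
C_ref = X_all.T @ X_all
```

`G` and `C` agree with `G_ref` and `C_ref` up to floating-point error, so an optimizer that consumes only $(G, C)$ sees the same inputs either way.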

Experiments

On a TP=2 sweep on a Megatron GPT over a FineWeb Edu set, 2000 steps, two sweeps, with the feature gram matrix refreshed every 8 steps, Locoprop beats AdamW on wallclock time:

Image

It's wasteful to materialize the full gram matrix, so with Locoprop I tried a diagonal approximation and a block-diagonal approximation. The diagonal approximation seems to do well! NM appears to be slower at the moment, possibly due to the polar iteration; I still need to check.

image
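For reference, the two approximations can be computed without ever forming the full $X^T X$. The following is a small NumPy sketch under my own assumptions (helper names and `block_size` are hypothetical, and the kernels in the forks may be organized differently):

```python
import numpy as np

def gram_diag(X):
    """diag(X^T X): per-feature sum of squares over tokens."""
    return np.einsum("ti,ti->i", X, X)

def gram_block_diag(X, block_size):
    """Diagonal blocks of X^T X, shape (num_blocks, block_size, block_size)."""
    tokens, n = X.shape
    assert n % block_size == 0, "in_features must be divisible by block_size"
    # Group consecutive features into blocks, then form each block's gram.
    Xb = X.reshape(tokens, n // block_size, block_size)
    return np.einsum("tbi,tbj->bij", Xb, Xb)

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 8))      # (tokens, in_features)
C = X.T @ X                           # full gram, used here only for comparison
d = gram_diag(X)                      # shape (8,)
blocks = gram_block_diag(X, block_size=4)   # shape (2, 4, 4)
```

For `n` input features, the full gram stores `n * n` entries, the block-diagonal form stores `n * block_size`, and the diagonal stores `n`.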

I think a few more AdamW LRs should be checked, but the initial results look promising, and not streaming the activations seems to work. I double-checked that the calculations come out equivalent.

I've also attached a design doc here for reference: feature_gram_matrix_optimizers_design.pdf if someone wants to upstream this in their own training pipeline.

The diff excl. test files is not that huge:

repo                 | non-test diff
---------------------|-----------------------
Megatron-LM          | 13 files, +1336/-2
Emerging-Optimizers  | 6 files, +1140/-3
TransformerEngine    | 5 files, +281/-0
Total                | 24 files, +2757/-5

Thanks to @mkhona-nvidia for his help!

TL;DR: Cross-repo control repository for the FEATURE_GRAM matrix optimizer integration.

Pinned Components

Component            | Fork branch                                                        | Purpose
---------------------|--------------------------------------------------------------------|-----------------------
Megatron-LM          | plugyawn/Megatron-LM@codex/feature-gram-matrix-optimizers          | Megatron Core metadata, config, native collection, optimizer routing
Emerging-Optimizers  | plugyawn/Emerging-Optimizers@codex/feature-gram-matrix-optimizers  | Newton-Muon/LocoProp-S rules, TP apply helpers, FEATURE_GRAM kernels
TransformerEngine    | plugyawn/TransformerEngine@codex/feature-gram-matrix-optimizers    | TE extra wgrad helper and fused-module FEATURE_GRAM collection

Checkout

git clone --recurse-submodules https://github.com/plugyawn/Megaprop.git

To refresh to the branch tips declared in .gitmodules:

git submodule update --init --remote --recursive

The superproject commit pins exact SHAs for reproducibility.
