Add Megatron FA4 CP/GDN CP, streaming offload, and optimized LoRA publishing by FurtherAI · Pull Request #716 · OpenPipe/ART

FurtherAI · 2026-06-04T05:22:39Z

Summary

This PR adds production Megatron support for the FA4/context-parallel training path, GDN CP execution, streaming weight offload for memory reduction, and optimized LoRA publishing for vLLM serving.

Flex attention with the FA4 backend is significantly faster than the old Triton backend. Context parallelism (CP) is efficient in both memory and throughput: throughput scales nearly linearly with CP size, and the maximum packed sequence length scales superlinearly. Note that the workload is repeated rather than growing quadratically with sequence length.

Most of the added code lives in the context_parallel and gdn modules, which implement the custom context-parallel operations and can be treated as black boxes. Tests are the other major source of added lines.

Highlights

Adds FA4-backed Megatron training support with CP-first topology defaults.
Adds real GDN CP execution, including packed/native FLA paths, CP layout handling, conv tail exchange, and recurrent-state transport.
Adds streaming weight offload for large models with resident-layer/slot controls and compile-compatible execution.
Replaces the slow sharded LoRA merge/save with optimized model-support LoRA publishing, preserving fused expert tensors for vLLM. Publishing a LoRA now takes ~1s for Qwen3.5 35B and ~3s for the 397B (down from 14s for the 397B). This is still not nearly as fast as I want it.
Refactors Megatron runtime pieces: provider setup, microbatch prep, backend lifecycle, flex-attn modules, and runtime patches. The goal is to keep train.py clean.
Adds/updates oracle and integration coverage for CP attention, GDN CP, Qwen3/Qwen3.5 workflows, train/inference mismatch, streaming offload, and LoRA publish correctness.
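
To make the "conv tail exchange" and "recurrent-state transport" items above concrete, here is a minimal single-process sketch. It is not the code in this PR. It assumes a scalar gated recurrence as a stand-in for the real GDN state update, simulates CP ranks with a plain loop instead of collectives, and every function name in it is hypothetical. The only point is that handing the last `k-1` conv inputs and the final recurrent state to the next rank reproduces the unsharded result.

```python
from typing import List, Tuple


def causal_conv(x: List[float], w: List[float], tail: List[float]) -> List[float]:
    """y[t] = sum_j w[j] * x[t - j]; `tail` holds the k-1 inputs that precede x."""
    k = len(w)
    padded = tail + x
    return [
        sum(w[j] * padded[t + (k - 1) - j] for j in range(k))
        for t in range(len(x))
    ]


def gated_scan(x: List[float], g: List[float], h0: float) -> Tuple[List[float], float]:
    """Toy recurrence h[t] = g[t] * h[t-1] + x[t]; returns outputs and final state."""
    h = h0
    out = []
    for xt, gt in zip(x, g):
        h = gt * h + xt
        out.append(h)
    return out, h


def run_context_parallel(
    x: List[float], g: List[float], w: List[float], cp_size: int
) -> List[float]:
    """Process the sequence in cp_size contiguous shards (one per simulated rank)."""
    assert len(x) % cp_size == 0, "sketch assumes an evenly divisible sequence"
    k = len(w)
    chunk = len(x) // cp_size
    tail = [0.0] * (k - 1)  # rank 0 sees zero padding, like the unsharded conv
    state = 0.0             # rank 0 starts from a zero recurrent state
    out: List[float] = []
    for rank in range(cp_size):
        lo, hi = rank * chunk, (rank + 1) * chunk
        conv = causal_conv(x[lo:hi], w, tail)
        ys, state = gated_scan(conv, g[lo:hi], state)  # state moves to next rank
        out.extend(ys)
        seen = tail + x[lo:hi]
        tail = seen[len(seen) - (k - 1):]  # conv tail handed to the next rank
    return out


# Hypothetical toy inputs: 16 tokens, conv kernel width 4.
x = [float(i % 5) - 2.0 for i in range(16)]
g = [0.5 + 0.03 * (i % 7) for i in range(16)]
w = [0.5, 0.25, -0.125, 0.0625]

reference = run_context_parallel(x, g, w, cp_size=1)
sharded = run_context_parallel(x, g, w, cp_size=4)
max_err = max(abs(a - b) for a, b in zip(reference, sharded))
print(f"max abs difference vs unsharded: {max_err:.3e}")
```

In this sketch the state handoff is sequential across ranks. The actual packed/native FLA paths and CP layout handling in the gdn module are not modeled here.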

Validation

CP attention oracle and backend tests.
GDN CP packed/native/topology oracle coverage across CP sizes.
Streaming offload oracle, including CP2/EP2 comparison.
LoRA publish tests comparing optimized output against the old implementation.

Headline Throughput

Qwen3.5-35B-A3B on H200, packed shared-prefix workload:

| GPUs | No CP best | CP best | Gain |
| --- | --- | --- | --- |
| 1 H200 | `tp1 ep1`, seq `57344`, `6.05k tok/s` | no CP option | n/a |
| 2 H200 | `tp1 ep1 dp2`, seq `57344`, `11.26k tok/s` | `tp1 cp2 ep2`, seq `122880`, `15.22k tok/s` | `+35%` |
| 4 H200 | `tp1 ep2`, seq `73728`, `21.25k tok/s` | `tp1 cp4 ep2`, seq `262144`, `27.88k tok/s` | `+31%` |
| 8 H200 | `tp1 ep8`, seq `81920`, `29.52k tok/s` | `tp1 cp8 ep8`, seq `581632`, `47.96k tok/s` | `+62%` |
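
As a quick cross-check of the table, the short script below recomputes the Gain column and the scaling of the CP configurations relative to the single-GPU row. All figures are copied from the table; the variable and function names are only illustrative, and using the 1x H200 no-CP row as the baseline is an assumption made here for the comparison.

```python
# Figures copied from the table: (gpus, no-CP tok/s, CP tok/s, CP max packed seq).
BASE_TOK_S, BASE_SEQ = 6.05e3, 57344  # 1x H200, no CP (assumed baseline)
ROWS = [
    (2, 11.26e3, 15.22e3, 122880),
    (4, 21.25e3, 27.88e3, 262144),
    (8, 29.52e3, 47.96e3, 581632),
]


def gain_percent(no_cp_tok_s: float, cp_tok_s: float) -> int:
    """Relative throughput gain of the best CP config over the best no-CP config."""
    return round((cp_tok_s / no_cp_tok_s - 1.0) * 100.0)


gains = [gain_percent(no_cp, cp) for _, no_cp, cp, _ in ROWS]
throughput_scale = [cp / BASE_TOK_S for _, _, cp, _ in ROWS]
seq_scale = [seq / BASE_SEQ for _, _, _, seq in ROWS]

for (gpus, _, _, _), gain, t_scale, s_scale in zip(
    ROWS, gains, throughput_scale, seq_scale
):
    print(
        f"{gpus} GPUs: gain +{gain}%, "
        f"throughput x{t_scale:.2f}, max seq x{s_scale:.2f} vs 1 GPU"
    )
```

With these numbers the maximum packed sequence length grows faster than the GPU count at every CP size, while CP throughput stays close to the GPU count, which matches the scaling claims in the summary.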

# Conflicts: # pyproject.toml # src/art/local/backend.py # src/art/megatron/compile_workarounds.py # src/art/megatron/flex_attention.py # src/art/megatron/jobs.py # src/art/megatron/lora.py # src/art/megatron/offload.py # src/art/megatron/provider.py # src/art/megatron/runtime/backend.py # src/art/megatron/service.py # src/art/megatron/setup.sh # src/art/megatron/train.py # src/art/pipeline_trainer/trainer.py # src/art/preprocessing/tokenize.py # src/art/tinker/renderers.py # src/art/tinker/server.py # src/art/unsloth/service.py # src/art/unsloth/train.py # src/art/vllm/patches.py # tests/integration/megatron/model_support/oracle_worker.py # tests/unit/test_preprocessing_tokenize.py # tests/unit/test_vllm_patches_contract.py # uv.lock

FurtherAI added 30 commits May 5, 2026 08:57

Validate native vLLM LoRA for Qwen3 dense

5b520e3

Promote dense Qwen models to validated support

d70ab2c

Avoid eager model support workflow imports

3d77ba3

Use compact packed GDN kernels for local buckets

3663266

Use chunked FLA GDN kernel

5d32ac0

Use fused Megatron cross entropy

697f392

Remove legacy GDN executor path

632eefb

Add harness CE fusion override worker

4d60c94

Add GDN timing hooks to harness wrapper

d57b48e

Organize Megatron modules and integration tests

02f221b

Fix HF parity invariant handler call

06814b0

Port main dependency and lifecycle updates

df52d07

Update Qwen handler for newer bridge mappings

4c1fde1

Validate Qwen3.5 vLLM LoRA layout

6c66d67

Remove flex attention compile tuning options

470f966

Ignore train inference mismatch artifacts

6b43ef0

Avoid assert bytecode in flex attention forward

5fe1f1b

Report flex attention bias type mismatches

70e9db4

Propagate Qwen3.5 MTP shared-prefix attention

f79e63e

Forward Qwen3.5 MTP attention bias to layers

1506236

Avoid checkpointing Qwen3.5 MTP attention state

dd16e0a

Disable Qwen3.5 MTP in ART Megatron

5bf2c87

Drop MTP diagnostic flex attention changes

e9b869d

Assert Qwen3.5 ART training has no MTP

d26ecb7

Clean PR artifacts and fix type checks

6b40e71

Unify runtime process supervision

7edba06

Model asyncio subprocess contract in runtime tests

a31a581

Defer supervised wait coroutine creation

815d577

Prune oracle topology artifacts by default

f662370

FurtherAI added 25 commits May 30, 2026 21:42

Prestage routing replay targets before forward

d427295

Test prestaged routing replay layout switches

f095db5

Keep workflow architecture inspection single-rank

5705a00

Stage routing replay targets in validation harnesses

6fcacdb

Remove branch-only assertion tests

aedb7ed

Keep CP scoring token UIDs on CPU

88d4f15

Retry train inf mismatch workflow stage

8de63fd

Fix CP routing replay trace token uids

d486c38

Relax router score oracle for CP replay

139a64b

Drop padded expert rows from forward traces

a0df118

Pack oracle LoRA snapshots before safetensors save

9f80b5c

Disable compiled qwen35 routed expert compute

dcc25a8

Normalize Megatron identity LoRA through model support

ad055f8

Preserve GDN layout across checkpoint recompute

899b917

Tighten router score oracle threshold

18dad24

Narrow Qwen3.5 MoE compile workaround

075031c

Use GDN island boundary layout state

23d32d4

Remove GDN layout inference fallback

0264231

Patch weighted SwiGLU compile autograd

3470ce8

Remove no-op CP training guard

d8b2209

Remove CP timing from production training results

342b100

Trim GDN shared-prefix PR test surface

64144ca

Drop GDN shared-prefix README from PR surface

81fc8b2

Remove dead GDN production helpers

a2b0ec8

Merge latest main into vllm merge worktree

25e0a7f

FurtherAI force-pushed the austin/megatron_context_parallel branch from 4c42ebd to 25e0a7f June 4, 2026 06:27

bradhilton approved these changes Jun 5, 2026

Merge gql weave compatibility fix from main

8aac18f

FurtherAI merged commit cd47b88 into main Jun 5, 2026
5 checks passed

FurtherAI deleted the austin/megatron_context_parallel branch June 5, 2026 05:51

FurtherAI merged 416 commits into main from austin/megatron_context_parallel