Add Megatron FA4 CP/GDN CP, streaming offload, and optimized LoRA publishing #716

Merged
FurtherAI merged 416 commits into main from austin/megatron_context_parallel on Jun 5, 2026
@FurtherAI FurtherAI commented Jun 4, 2026

Summary

This PR adds production Megatron support for the FA4/context-parallel training path, GDN CP execution, streaming weight offload for memory reduction, and optimized LoRA publishing for vLLM serving.

Flex attention with the FA4 backend is significantly faster than the old Triton backend. Context parallelism is very efficient in both memory and throughput: throughput scales nearly linearly, and memory capacity (maximum packed sequence length) scales superlinearly. Note that the workload is repeated, though, so it does not grow quadratically.

Most of the added lines are encapsulated in the context_parallel and gdn modules, which implement the actual custom context-parallel operations. The other major source of added lines is tests.

Highlights

  • Adds FA4-backed Megatron training support with CP-first topology defaults.
  • Adds real GDN CP execution, including packed/native FLA paths, CP layout handling, conv tail exchange, and recurrent-state transport.
  • Adds streaming weight offload for large models with resident-layer/slot controls and compile-compatible execution.
  • Replaces slow sharded LoRA merge/save with optimized model-support LoRA publishing, preserving fused expert tensors for vLLM. This is not yet nearly as fast as I want: publishing a LoRA still takes ~1s for 3.5 35B and ~3s for the 397B. For the 397B, though, that is an improvement from 14s to 3s.
  • Refactors Megatron runtime pieces: provider setup, microbatch prep, backend lifecycle, flex-attn modules, and runtime patches. The goal is to keep train.py clean.
  • Adds/updates oracle and integration coverage for CP attention, GDN CP, Qwen3/Qwen3.5 workflows, train/inference mismatch, streaming offload, and LoRA publish correctness.
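
For readers unfamiliar with why GDN CP needs a conv tail exchange and recurrent-state transport, here is a minimal single-process sketch. It is not the code in the gdn module: the function names are hypothetical, the recurrent state is a scalar rather than a matrix, and the CP ranks are simulated sequentially instead of communicating. It only illustrates that each sequence shard needs the last k-1 tokens of the previous shard for a causal convolution of width k, plus the previous shard's final recurrent state, in order to reproduce the unsharded result.

```python
from typing import List, Tuple


def causal_conv(x: List[float], w: List[float], left_ctx: List[float]) -> List[float]:
    """Causal 1-D conv: y[t] = sum_j w[j] * padded[t + j], padded = left_ctx + x.

    left_ctx holds the k-1 tokens preceding x (zeros at the sequence start).
    """
    k = len(w)
    assert len(left_ctx) == k - 1
    padded = left_ctx + x
    return [sum(w[j] * padded[t + j] for j in range(k)) for t in range(len(x))]


def linear_scan(a: List[float], b: List[float], h0: float) -> Tuple[List[float], float]:
    """Scalar gated recurrence h[t] = a[t] * h[t-1] + b[t]; returns outputs and final state."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out, h


def run_sharded(x: List[float], a: List[float], w: List[float], cp_size: int):
    """Simulate cp_size ranks one after another (assumption: stands in for real comms)."""
    k = len(w)
    shard = len(x) // cp_size
    assert shard * cp_size == len(x) and shard >= k - 1
    tail = [0.0] * (k - 1)  # rank 0 sees zero padding on the left
    state = 0.0             # rank 0 starts from a zero recurrent state
    ys: List[float] = []
    hs: List[float] = []
    for rank in range(cp_size):
        xs = x[rank * shard:(rank + 1) * shard]
        gates = a[rank * shard:(rank + 1) * shard]
        conv_out = causal_conv(xs, w, tail)                 # uses the received conv tail
        h_out, state = linear_scan(gates, conv_out, state)  # uses the received state
        ys += conv_out
        hs += h_out
        tail = xs[len(xs) - (k - 1):]  # "send" the last k-1 tokens to the next rank
    return ys, hs
```

In this toy version the ranks run strictly in order, so it says nothing about how the real implementation overlaps or parallelizes the state hand-off. It only shows which two pieces of data have to cross the shard boundary.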

Validation

  • CP attention oracle and backend tests.
  • GDN CP packed/native/topology oracle coverage across CP sizes.
  • Streaming offload oracle, including CP2/EP2 comparison.
  • LoRA publish tests comparing optimized output against the old implementation.
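
The oracle-style checks listed above all compare a candidate output against a trusted reference. A rough sketch of that pattern is below. The helper name, the dict-of-flat-lists representation, and the tolerances are all assumptions for illustration, not the actual test harness in this PR.

```python
import math
from typing import Dict, List


def compare_to_oracle(
    oracle: Dict[str, List[float]],
    candidate: Dict[str, List[float]],
    rel_tol: float = 1e-5,
    abs_tol: float = 1e-6,
) -> List[str]:
    """Return human-readable mismatches; an empty list means the candidate passes."""
    problems: List[str] = []
    for name in sorted(set(oracle) | set(candidate)):
        if name not in candidate:
            problems.append(f"missing tensor: {name}")
        elif name not in oracle:
            problems.append(f"unexpected tensor: {name}")
        elif len(oracle[name]) != len(candidate[name]):
            problems.append(f"size mismatch: {name}")
        else:
            pairs = list(zip(oracle[name], candidate[name]))
            if not all(math.isclose(o, c, rel_tol=rel_tol, abs_tol=abs_tol) for o, c in pairs):
                worst = max(abs(o - c) for o, c in pairs)
                problems.append(f"value mismatch: {name} (max abs diff {worst:.3e})")
    return problems
```

Reporting the tensor name and worst absolute difference, rather than a bare pass/fail, tends to make it easier to tell a missing key from a numerical drift when a comparison such as old versus optimized LoRA output fails.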

Headline Throughput

Qwen3.5-35B-A3B on H200, packed shared-prefix workload:

| GPUs | No CP best | CP best | Gain |
| --- | --- | --- | --- |
| 1 H200 | tp1 ep1, seq 57344, 6.05k tok/s | no CP option | n/a |
| 2 H200 | tp1 ep1 dp2, seq 57344, 11.26k tok/s | tp1 cp2 ep2, seq 122880, 15.22k tok/s | +35% |
| 4 H200 | tp1 ep2, seq 73728, 21.25k tok/s | tp1 cp4 ep2, seq 262144, 27.88k tok/s | +31% |
| 8 H200 | tp1 ep8, seq 81920, 29.52k tok/s | tp1 cp8 ep8, seq 581632, 47.96k tok/s | +62% |
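
The Gain column and the sequence-length scaling can be rechecked from the numbers above with a few lines of arithmetic. The script below is only a convenience sketch: the variable and function names are made up here, and the sequence-length ratio is taken against the 1-GPU no-CP baseline of 57344. Since EP also differs between rows, the ratio should not be read as the effect of CP alone.

```python
BASE_SEQ = 57344  # 1-GPU, no-CP max packed sequence length from the table

ROWS = [
    # (gpus, cp_size, no_cp_tok_per_s, cp_tok_per_s, cp_seq_len)
    (2, 2, 11.26e3, 15.22e3, 122880),
    (4, 4, 21.25e3, 27.88e3, 262144),
    (8, 8, 29.52e3, 47.96e3, 581632),
]


def summarize(rows, base_seq=BASE_SEQ):
    """Return (gpus, gain_pct, seq_scale, seq_scale_per_cp_rank) for each row."""
    out = []
    for gpus, cp, no_cp, with_cp, seq in rows:
        gain_pct = round((with_cp / no_cp - 1.0) * 100)  # throughput gain of CP best vs no-CP best
        seq_scale = seq / base_seq                       # max sequence length vs 1-GPU baseline
        out.append((gpus, gain_pct, round(seq_scale, 2), round(seq_scale / cp, 2)))
    return out


for row in summarize(ROWS):
    print(row)
# (2, 35, 2.14, 1.07)
# (4, 31, 4.57, 1.14)
# (8, 62, 10.14, 1.27)
```

The last column staying above 1.0 and growing with CP size is what "superlinear in max packed sequence length" looks like in these measurements.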

FurtherAI added 30 commits May 5, 2026 08:57
# Conflicts:
#	pyproject.toml
#	src/art/local/backend.py
#	src/art/megatron/compile_workarounds.py
#	src/art/megatron/flex_attention.py
#	src/art/megatron/jobs.py
#	src/art/megatron/lora.py
#	src/art/megatron/offload.py
#	src/art/megatron/provider.py
#	src/art/megatron/runtime/backend.py
#	src/art/megatron/service.py
#	src/art/megatron/setup.sh
#	src/art/megatron/train.py
#	src/art/pipeline_trainer/trainer.py
#	src/art/preprocessing/tokenize.py
#	src/art/tinker/renderers.py
#	src/art/tinker/server.py
#	src/art/unsloth/service.py
#	src/art/unsloth/train.py
#	src/art/vllm/patches.py
#	tests/integration/megatron/model_support/oracle_worker.py
#	tests/unit/test_preprocessing_tokenize.py
#	tests/unit/test_vllm_patches_contract.py
#	uv.lock
@FurtherAI FurtherAI force-pushed the austin/megatron_context_parallel branch from 4c42ebd to 25e0a7f June 4, 2026 06:27
@FurtherAI FurtherAI merged commit cd47b88 into main Jun 5, 2026
5 checks passed
@FurtherAI FurtherAI deleted the austin/megatron_context_parallel branch June 5, 2026 05:51