Reminder
- I have read the above rules and searched the existing issues.
System Info
When training a model with the MCore Adapter, the RoPE scaling configuration does not take effect correctly. This affects both the rope_scaling settings in the model's config.json and those provided via ModelArguments.
As a result, the RoPE scaling parameters are not properly passed to Megatron-LM, which can lead to incorrect attention behavior.
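For context, the setting in question is the rope_scaling block in the HF config.json (or the equivalent --rope_scaling flag in ModelArguments). Below is a minimal sketch, assuming a local Qwen3-32B checkpoint, of how to inspect it on the Hugging Face side; the example values in the comment are illustrative, not taken from the failing run:

```python
# Sketch: inspect the rope_scaling entry that the HF config carries and that
# is expected to be forwarded to Megatron-LM (checkpoint path and example
# values are illustrative).
from transformers import AutoConfig

hf_config = AutoConfig.from_pretrained("Qwen3-32B")
print(hf_config.rope_scaling)
# e.g. {"rope_type": "linear", "factor": 4.0} -- this is the setting that does
# not seem to reach the Megatron-side rotary embedding configuration.
```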
Reproduction
```python
from mcore_adapter.models import AutoModel

model = AutoModel.from_pretrained(model_args.model_name_or_path, training_args)
```

```bash
export DISTRIBUTED_ARGS="
    --nproc_per_node 8 \
    --nnodes 4 \
    --node_rank $node_rank \
    --master_addr $master_addr \
    --master_port $master_port
"

USE_MCA=1 torchrun $DISTRIBUTED_ARGS src/train.py \
    --model_name_or_path Qwen3-32B \
    --do_train \
    --stage sft \
    --finetuning_type full \
    --dataset <any long context dataset> \
    --preprocessing_num_workers 8 \
    --cutoff_len 65536 \
    --rope_scaling linear \
    --template qwen3 \
    --output_dir saves/mca/qwen3_32b_65536_scaling \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --num_train_epochs 2 \
    --learning_rate 3e-6 \
    --logging_steps 1 \
    --max_steps 3 \
    --save_steps 100 \
    --lr_scheduler_type cosine \
    --bf16 \
    --tensor_model_parallel_size 8 \
    --sequence_parallel true \
    --pipeline_model_parallel_size 4 \
    --bias_activation_fusion true \
    --apply_rope_fusion true \
    --overlap_grad_reduce true \
    --use_distributed_optimizer true \
    --overlap_param_gather true \
    --recompute_granularity full
```
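A quick way to observe the problem after loading, reusing the model_args/training_args from the snippet above, is to print whatever RoPE-related fields the MCore-side config exposes. This is only a sketch: the attribute names probed below are guesses, not confirmed mcore_adapter API, so adjust them to whatever the installed version actually provides.

```python
# Sketch: probe the MCore/Megatron-side config for RoPE scaling fields after
# loading. Attribute names are assumptions, not confirmed mcore_adapter API.
from mcore_adapter.models import AutoModel

model = AutoModel.from_pretrained(model_args.model_name_or_path, training_args)

mca_config = getattr(model, "config", None)  # assumed Megatron-side config object
for name in ("rope_scaling", "seq_len_interpolation_factor", "rotary_base", "rotary_percent"):
    print(name, "=", getattr(mca_config, name, "<not set>"))
# The values coming from config.json / --rope_scaling do not appear to show up
# here, which matches the behavior described above.
```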
Others
Related Link: alibaba/ROLL#287