Reminder
- I have read the above rules and searched the existing issues.
System Info
When training a model with the MCore Adapter, the RoPE scaling configuration does not take effect correctly. This affects both the rope_scaling settings in the model's config.json and those provided via ModelArguments.
As a result, the RoPE scaling parameters are not properly passed to Megatron-LM, which can lead to incorrect attention behavior.
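For context, the setting in question is the rope_scaling block in the HF config.json (or the equivalent --rope_scaling flag in ModelArguments). Below is a minimal sketch, assuming a local Qwen3-32B checkpoint, of how to inspect it on the Hugging Face side; the example values in the comment are illustrative, not taken from the failing run:

```python
# Sketch: inspect the rope_scaling entry that the HF config carries and that
# is expected to be forwarded to Megatron-LM (checkpoint path and example
# values are illustrative).
from transformers import AutoConfig

hf_config = AutoConfig.from_pretrained("Qwen3-32B")
print(hf_config.rope_scaling)
# e.g. {"rope_type": "linear", "factor": 4.0} -- this is the setting that does
# not seem to reach the Megatron-side rotary embedding configuration.
```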
Reproduction
```python
from mcore_adapter.models import AutoModel

model = AutoModel.from_pretrained(model_args.model_name_or_path, training_args)
```

```bash
export DISTRIBUTED_ARGS="
    --nproc_per_node 8 \
    --nnodes 4 \
    --node_rank $node_rank \
    --master_addr $master_addr \
    --master_port $master_port
"

USE_MCA=1 torchrun $DISTRIBUTED_ARGS src/train.py \
    --model_name_or_path Qwen3-32B \
    --do_train \
    --stage sft \
    --finetuning_type full \
    --dataset <any long context dataset> \
    --preprocessing_num_workers 8 \
    --cutoff_len 65536 \
    --rope_scaling linear \
    --template qwen3 \
    --output_dir saves/mca/qwen3_32b_65536_scaling \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --num_train_epochs 2 \
    --learning_rate 3e-6 \
    --logging_steps 1 \
    --max_steps 3 \
    --save_steps 100 \
    --lr_scheduler_type cosine \
    --bf16 \
    --tensor_model_parallel_size 8 \
    --sequence_parallel true \
    --pipeline_model_parallel_size 4 \
    --bias_activation_fusion true \
    --apply_rope_fusion true \
    --overlap_grad_reduce true \
    --use_distributed_optimizer true \
    --overlap_param_gather true \
    --recompute_granularity full
```
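A quick way to observe the problem after loading, reusing the model_args/training_args from the snippet above, is to print whatever RoPE-related fields the MCore-side config exposes. This is only a sketch: the attribute names probed below are guesses, not confirmed mcore_adapter API, so adjust them to whatever the installed version actually provides.

```python
# Sketch: probe the MCore/Megatron-side config for RoPE scaling fields after
# loading. Attribute names are assumptions, not confirmed mcore_adapter API.
from mcore_adapter.models import AutoModel

model = AutoModel.from_pretrained(model_args.model_name_or_path, training_args)

mca_config = getattr(model, "config", None)  # assumed Megatron-side config object
for name in ("rope_scaling", "seq_len_interpolation_factor", "rotary_base", "rotary_percent"):
    print(name, "=", getattr(mca_config, name, "<not set>"))
# The values coming from config.json / --rope_scaling do not appear to show up
# here, which matches the behavior described above.
```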
Others
Related Link: alibaba/ROLL#287