Description
System Info / 系統信息
py3.12
cuda12.4
Who can help? / 谁可以帮助到您?
No response
Information / 问题信息
- The official example scripts / 官方的示例脚本
- My own modified scripts / 我自己修改的脚本和任务
Reproduction / 复现过程
[rank3]: OutOfMemoryError: CUDA out of memory. Tried to allocate 986.00 MiB. GPU 3 has a total capacity of 23.64 GiB of which 499.69 MiB is free. Including non-PyTorch memory, this process has 23.15 GiB
[rank3]: memory in use. Of the allocated memory 18.71 GiB is allocated by PyTorch, and 3.83 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting
[rank3]: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
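Following the suggestion in the error message itself, one thing to try is enabling the expandable-segments allocator before launching training to reduce fragmentation. A minimal sketch (the commented launch command is illustrative — adjust the script name, GPU count, and arguments to your actual setup):

```shell
# Enable PyTorch's expandable-segments allocator to mitigate fragmentation,
# as suggested by the OOM message.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Illustrative launch command (assumed entry point and paths):
# torchrun --nproc_per_node=4 finetune.py data/ THUDM/glm-4-9b-chat configs/lora.yaml
```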
My YAML file is as follows:
data_config:
  train_file: train.jsonl
  val_file: dev.jsonl
  test_file: test.jsonl
  num_proc: 1
combine: True
freezeV: True
max_input_length: 256
max_output_length: 64
swanlab: "local"  # set to local if you don't use the cloud
training_args:
  # see `transformers.Seq2SeqTrainingArguments`
  output_dir: ./output
  max_steps: 3000
  # needs to be tuned for the dataset
  learning_rate: 5e-4
  # settings for data loading
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
  dataloader_num_workers: 1
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 500
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  run_name: "glm4-lora-finetune"
  # settings for evaluation
  per_device_eval_batch_size: 4
  eval_strategy: steps
  eval_steps: 500
  # settings for optimizer
  adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  # see `transformers.GenerationConfig`
  generation_config:
    max_new_tokens: 64
  # set your absolute deepspeed path here
  deepspeed: configs/ds_zero_3.json
  bf16: true
  # deepspeed: configs/ds_zero_2.json
peft_config:
  peft_type: LORA
  task_type: CAUSAL_LM
  r: 8
  lora_alpha: 32
  lora_dropout: 0.1
  target_modules: ["query_key_value"]
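For reference, the batch-size settings above combine with the number of GPUs to give the effective global batch size. This is a plain arithmetic sketch, not project code; the GPU count of 4 is an assumption based on the `[rank3]` prefix in the traceback:

```python
# Effective global batch size under the YAML above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
num_gpus = 4  # assumption: the traceback mentions rank 3, so at least 4 processes

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_gpus)
print(effective_batch_size)  # 64
```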
Expected behavior / 期待表现
I hope to be able to run LoRA fine-tuning successfully.