[Troubleshoot]: RL training with pi0.5 and GRPO in RoboTwin is unstable. #1268

@youmo445

Problem description

Using pi0.5 as the VLA, with absolute joint positions as the action space and an action chunk size of 16, GRPO-based RL training on the place_cans_plasticbox task is unstable. In the log below, success_once drops from 0.30 at global step 1 to about 0.205 at global step 2.

Configuration YAML file

defaults:
  - env/robotwin_place_cans_plasticbox@env.train
  - env/robotwin_place_cans_plasticbox@env.eval
  - model/pi0_5@actor.model
  - training_backend/fsdp@actor.fsdp_config
  - weight_syncer/patch_syncer@weight_syncer
  - override hydra/job_logging: stdout

hydra:
  run:
    dir: .
  output_subdir: null
  searchpath:
    - file://${oc.env:EMBODIED_PATH}/config/

cluster:
  num_nodes: 1
  component_placement:
    actor, env, rollout: 0,1,3,4,5,6

runner:
  task_type: embodied
  logger:
    log_path: "../results"
    project_name: rlinf
    experiment_name: "robotwin_grpo_openpi_pi05"
    logger_backends: ["tensorboard"] # wandb, swanlab

  max_epochs: 1000
  max_steps: -1

  only_eval: False
  val_check_interval: -1
  save_interval: 10

  resume_dir: null # Optional: path to a saved checkpoint directory, such as 'checkpoints/global_step_10'. If not None, it will be used to resume training.
  ckpt_path: null  # Optional: path to a .pt checkpoint. If not None, it will be loaded after the model is instantiated (for evaluation).

algorithm:
  normalize_advantages: True
  kl_penalty: kl  # how to estimate kl divergence: kl or kl_penalty
  group_size: 8
  reward_coef: 1.0

  rollout_epoch: 4
  eval_rollout_epoch: 1 # set eval_rollout_epoch > 0 when enable runner.only_eval or runner.val_check_interval > 0

  reward_type: chunk_level
  logprob_type: chunk_level
  entropy_type: token_level

  update_epoch: 5
  adv_type: grpo
  loss_type: actor
  loss_agg_func: "token-mean" 
  kl_beta: 0.0
  entropy_bonus: 0
  clip_ratio_high: 0.2
  clip_ratio_low: 0.2
  clip_ratio_c: 3.0
  value_clip: 0.2
  huber_delta: 10.0

  gamma: 0.99
  gae_lambda: 0.95

  filter_rewards: True
  rewards_lower_bound: 0.1
  rewards_upper_bound: 0.9
  # params for generation
  sampling_params:
    do_sample: True
    temperature_train: 1.0
    temperature_eval: 0.6
    top_k: 50
    top_p: 1.0
    repetition_penalty: 1.0
    add_BOS: False

  # length argument for autoregressive sampling
  # max length means max amount of tokens to generate
  length_params:
    max_new_token: null
    max_length: 1024
    min_length: 1

env:
  group_name: "EnvGroup"
  enable_offload: True
  # Override the default values in env/robotwin_place_cans_plasticbox
  train:
    total_num_envs: 240
    reward_coef: ${algorithm.reward_coef}
    max_episode_steps: 320
    max_steps_per_rollout_epoch: 320
    group_size: ${algorithm.group_size}
    assets_path: "/data/zsq/RoboTwin"
    seeds_path: ${oc.env:REPO_PATH}/rlinf/envs/robotwin/seeds/train_seeds.json
    center_crop: False
    task_config:
      embodiment: [aloha-agilex]
      camera:
        collect_wrist_camera: true
      domain_randomization:
        random_background: false
        cluttered_table: false
        clean_background_rate: 1
        random_head_camera_dis: 0
        random_table_height: 0
        random_light: false
        crazy_random_light_rate: 0
  eval:
    total_num_envs: 240
    auto_reset: True
    ignore_terminations: True
    max_episode_steps: 320
    reward_coef: ${algorithm.reward_coef}
    max_steps_per_rollout_epoch: 320
    group_size: 1
    use_fixed_reset_state_ids: True
    is_eval: True
    assets_path: "/data/zsq/RoboTwin"
    seeds_path: ${oc.env:REPO_PATH}/rlinf/envs/robotwin/seeds/eval_seeds.json
    video_cfg:
      save_video: True
      video_base_dir: ${runner.logger.log_path}/video/eval
    center_crop: False
    task_config:
      embodiment: [aloha-agilex]
      camera:
        collect_wrist_camera: true
      domain_randomization:
        random_background: false
        cluttered_table: false
        clean_background_rate: 1
        random_head_camera_dis: 0
        random_table_height: 0
        random_light: false
        crazy_random_light_rate: 0

rollout:
  group_name: "RolloutGroup"
  backend: "huggingface"
  recompute_logprobs: False
  enable_offload: True
  pipeline_stage_num: 1
  model:
    model_path: ${actor.model.model_path}
    precision: ${actor.model.precision}

actor:
  group_name: "ActorGroup"
  training_backend: "fsdp"
  micro_batch_size: 40
  global_batch_size: 960 # 1024
  seed: 42
  enable_offload: False

  # Override the default values in model/openpi_pi05
  model:
    model_path: "/data/zsq/pi05_ckpt/robotwin_place_cans_plastic/20000_torch"
    num_action_chunks: 16 # interface for the env
    action_dim: 14
    # add_value_head: True
    num_steps: 5
    use_proprio: True
    openpi_data:
      adapt_to_pi: False
      extra_delta_transform: False
    openpi:
      config_name: pi05_robotwin
      num_images_in_input: 3
      action_chunk: ${actor.model.num_action_chunks}
      action_env_dim: ${actor.model.action_dim}
      num_steps: ${actor.model.num_steps}
      noise_method: "flow_sde"
      noise_level: 0.3
      # value_after_vlm: True
      # detach_critic_input: True

  optim:
    lr: 5.0e-06
    value_lr: 1.0e-04
    adam_beta1: 0.9
    adam_beta2: 0.95
    adam_eps: 1.0e-08
    weight_decay: 0.01
    clip_grad: 1.0
    critic_warmup_steps: 0

  # Override the default values in training_backend/fsdp
  fsdp_config:
    strategy: "fsdp"
    gradient_checkpointing: False # for openpi, gradient checkpointing is not supported, please do not change this value
    mixed_precision:
      param_dtype: ${actor.model.precision}
      reduce_dtype: ${actor.model.precision}
      buffer_dtype: ${actor.model.precision}

reward:
  use_reward_model: False

critic:
  use_critic_model: False
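
To make the batch sizes implied by this config explicit, here is a small Python sketch of the arithmetic. The config values are copied from above; what counts as one training sample (a whole trajectory or a single action chunk) is an assumption, since the issue does not say how RLinf slices the rollout buffer, so both cases are shown.

```python
# Values copied from the config above.
total_num_envs = 240
rollout_epoch = 4
max_episode_steps = 320
num_action_chunks = 16
group_size = 8
global_batch_size = 960
update_epoch = 5
gpu_ids = "0,1,3,4,5,6".split(",")  # cluster.component_placement

trajectories = total_num_envs * rollout_epoch                   # per global step
chunks_per_trajectory = max_episode_steps // num_action_chunks  # policy calls per episode
groups = trajectories // group_size                             # GRPO groups per global step
envs_per_gpu = total_num_envs // len(gpu_ids)


def optimizer_steps(num_samples: int) -> int:
    """Optimizer steps per global step if each batch holds global_batch_size samples."""
    return (num_samples // global_batch_size) * update_epoch


# ASSUMPTION A: one training sample is one whole trajectory.
steps_if_trajectory = optimizer_steps(trajectories)
# ASSUMPTION B: one training sample is one action chunk.
steps_if_chunk = optimizer_steps(trajectories * chunks_per_trajectory)

print(trajectories, chunks_per_trajectory, groups, envs_per_gpu)  # 960 20 120 40
print(steps_if_trajectory, steps_if_chunk)                        # 5 100
```

The 960 trajectories agree with num_trajectories=960 in the log. Depending on which assumption holds, each global step performs either a handful or on the order of a hundred optimizer steps on the same rollout data, which matters when judging how far the policy can drift per step.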

Log file

Generating Rollout Epochs:   0%|          | 0/4 [00:00<?, ?it/s]
(RolloutGroup(rank=0) pid=1200033)
Generating Rollout Epochs:  25%|██▌       | 1/4 [15:03<45:11, 903.68s/it]
(RolloutGroup(rank=0) pid=1200033)
Generating Rollout Epochs:  50%|█████     | 2/4 [30:32<30:37, 918.57s/it]
(RolloutGroup(rank=0) pid=1200033)
Generating Rollout Epochs:  75%|███████▌  | 3/4 [45:52<15:19, 919.35s/it]
(RolloutGroup(rank=0) pid=1200033)
Generating Rollout Epochs: 100%|██████████| 4/4 [1:06:31<00:00, 1045.42s/it]
Generating Rollout Epochs: 100%|██████████| 4/4 [1:06:31<00:00, 997.91s/it]

├──────────────────────────────────────────────────── Metric Table ────────────────────────────────────────────────────┤
│ Global Step:    1/1000 │ Progress: ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │   0.1%                                 │
│ Elapsed: 01:15:04 │ ETA: 1250:03:29 │ Step Time: 4504.714s                                                           │
├──────────────────────────────────────────────────────── Time ────────────────────────────────────────────────────────┤
│                                                                                                                      │
│actor/run_training=628.6               │cal_adv_and_returns=0.0054             │env/compute_bootstrap_rewards=0.0034  │
│env/env_interact_step=3529.1           │env/interact=3852.6                    │env/recv_rollout_results=139.4        │
│env/run_interact_once=3852.6           │generate_rollouts=3861.6               │rollout/generate_one_epoch=3848.5     │
│rollout/predict=124.3                  │step=4504.7                            │sync_weights=14.564                   │
│                                                                                                                      │
├──────────────────────────────────────────────────── Environment ─────────────────────────────────────────────────────┤
│                                                                                                                      │
│episode_len=320.0                      │num_trajectories=960                   │return=0.3                            │
│reward=0.0009375                       │success_once=0.3                       │                                      │
│                                                                                                                      │
├────────────────────────────────────────────────────── Rollout ───────────────────────────────────────────────────────┤
│                                                                                                                      │
│advantages_max=2.475                   │advantages_mean=-0.067                 │advantages_min=-2.475                 │
│rewards=9.38e-04                       │                                       │                                      │
│                                                                                                                      │
├─────────────────────────────────────────────────── Training/Actor ───────────────────────────────────────────────────┤
│                                                                                                                      │
│actor/approx_kl=0.032                  │actor/clip_fraction=0.141              │actor/clipped_ratio=0.989             │
│actor/dual_cliped_ratio=0.0000         │actor/entropy_loss=0.0000              │actor/grad_norm=10.714                │
│actor/lr=5.00e-06                      │actor/policy_loss=-0.0056              │actor/policy_loss_abs=0.539           │
│actor/ratio=0.995                      │actor/ratio_abs=0.146                  │actor/total_loss=-0.0014              │
│                                                                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯


╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
├──────────────────────────────────────────────────── Metric Table ────────────────────────────────────────────────────┤
│ Global Step:    2/1000 │ Progress: ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │   0.2%                                 │
│ Elapsed: 02:34:05 │ ETA: 1281:30:50 │ Step Time: 4622.696s                                                           │
├──────────────────────────────────────────────────────── Time ────────────────────────────────────────────────────────┤
│                                                                                                                      │
│actor/run_training=629.4               │cal_adv_and_returns=0.0091             │env/compute_bootstrap_rewards=0.0033  │
│env/env_interact_step=3507.8           │env/interact=4098.7                    │env/recv_rollout_results=132.9        │
│env/run_interact_once=4098.7           │generate_rollouts=4103.0               │rollout/generate_one_epoch=4093.4     │
│rollout/predict=124.0                  │step=4740.7                            │sync_weights=8.167                    │
│                                                                                                                      │
├──────────────────────────────────────────────────── Environment ─────────────────────────────────────────────────────┤
│                                                                                                                      │
│episode_len=320.0                      │num_trajectories=960                   │return=0.20520833                     │
│reward=0.00064127607                   │success_once=0.20520833                │                                      │
│                                                                                                                      │
├────────────────────────────────────────────────────── Rollout ───────────────────────────────────────────────────────┤
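
The logged advantages_max=2.475 and advantages_min=-2.475 can be sanity-checked against a plain group-normalized advantage. The sketch below assumes binary success rewards per trajectory, groups of 8 (group_size in the config), and normalization by the sample standard deviation (n - 1 in the denominator). The function name and the eps value are illustrative, and this is not confirmed to be RLinf's implementation.

```python
import math


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (sample std + eps).

    ASSUMPTION: sample (n - 1) standard deviation; a population std
    would give about 2.646 instead of 2.475 for one success in eight.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / (n - 1)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


one_success = grpo_advantages([1.0] + [0.0] * 7)
one_failure = grpo_advantages([0.0] + [1.0] * 7)
all_failed = grpo_advantages([0.0] * 8)

print(round(max(one_success), 3))  # 2.475
print(round(min(one_failure), 3))  # -2.475
print(all_failed)                  # eight zeros: no learning signal
```

Under these assumptions, the logged extremes are what a group with exactly one success (or exactly one failure) out of 8 produces, and a group whose rollouts all share the same outcome contributes zero advantage.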

Environment

python -V: Python 3.11.14

uv pip list:

Package Version
absl-py 2.4.0
accelerate 1.13.0
addict 2.4.0
aiohappyeyeballs 2.6.2
aiohttp 3.14.0
aiohttp-cors 0.8.1
aiosignal 1.4.0
annotated-types 0.7.0
anyio 4.13.0
argcomplete 3.6.3
array-record 0.8.3
asttokens 3.0.1
astunparse 1.6.3
attrs 26.1.0
augmax 0.4.1
av 17.1.0
bddl 3.6.0
beartype 0.19.0
beautifulsoup4 4.14.3
blinker 1.9.0
boto 2.49.0
cachebox 5.2.3
cachetools 5.5.2
certifi 2026.5.20
cffi 2.0.0
cfgv 3.5.0
charset-normalizer 3.4.7
chex 0.1.90
click 8.4.1
cloudpickle 3.1.2
cmake 4.3.2
colorful 0.6.0a1
colorlog 6.10.1
comm 0.2.3
configargparse 1.7.5
contourpy 1.3.3
crcmod 1.7
cryptography 46.0.7
cycler 0.12.1
dash 4.3.0rc0
datasets 3.6.0
debugpy 1.8.21
decorator 5.3.1
deepdiff 9.1.0
diffusers 0.38.0
dill 0.3.8
distlib 0.4.1
distro 1.9.0
dm-control 1.0.41
dm-env 1.6
dm-tree 0.1.10
docstring-parser 0.18.0
donfig 0.8.1.post1
draccus 0.10.0
easydict 1.13
einops 0.8.2
embreex 4.4.0
equinox 0.13.8
etils 1.14.0
evdev 1.9.3
executing 2.2.1
farama-notifications 0.0.6
fasteners 0.20
fastjsonschema 2.21.2
filelock 3.29.1
flash-attn 2.7.4.post1
flask 3.1.3
flatbuffers 25.12.19
flax 0.10.2
fonttools 4.63.0
frozenlist 1.8.0
fsspec 2025.3.0
ftfy 6.3.1
future 1.0.0
gast 0.7.0
gcs-oauth2-boto-plugin 3.3
gcsfs 2025.3.0
gdown 6.1.0
gitdb 4.0.12
gitpython 3.1.50
glfw 2.10.0
google-api-core 2.31.0
google-apitools 0.5.35
google-auth 2.39.0
google-auth-httplib2 0.4.0
google-auth-oauthlib 1.4.0
google-cloud-core 2.6.0
google-cloud-storage 3.11.0
google-crc32c 1.8.0
google-pasta 0.2.0
google-reauth 0.1.1
google-resumable-media 2.10.0
googleapis-common-protos 1.75.0
grpcio 1.81.0
gsutil 5.37
gym 0.26.2
gym-aloha 0.1.3
gym-notices 0.1.0
gymnasium 0.29.1
h11 0.16.0
h5py 3.14.0
hf-transfer 0.1.9
hf-xet 1.5.1.dev1
httpcore 1.0.9
httplib2 0.20.4
httpx 0.28.1
httpx-sse 0.4.3
huggingface-hub 0.36.2
humanize 4.15.0
hydra-core 1.4.0.dev1
icmplib 3.0.4
identify 2.6.19
idna 3.18
imageio 2.37.3
imageio-ffmpeg 0.6.0
immutabledict 4.3.1
importlib-metadata 9.0.0
importlib-resources 7.1.0
iniconfig 2.3.0
inquirerpy 0.3.4
ipython 9.14.1
ipython-pygments-lexers 1.1.1
ipywidgets 8.1.8
itsdangerous 2.2.0
janus 2.0.0
jax 0.5.3
jax-cuda12-pjrt 0.5.3
jax-cuda12-plugin 0.5.3
jaxlib 0.5.3
jaxtyping 0.2.36
jedi 0.20.0
jinja2 3.1.6
jiter 0.15.0
joblib 1.5.3
jsonlines 4.0.0
jsonschema 4.26.0
jsonschema-specifications 2025.9.1
jupyter-core 5.9.1
jupyterlab-widgets 3.0.16
jupytext 1.19.3
keras 3.14.1
kiwisolver 1.5.0
labmaze 1.0.6
lerobot 0.1.0
libclang 18.1.1
liger-kernel 0.8.0
llvmlite 0.48.0rc1
lxml 7.0.0a1
manifold3d 3.5.1
mapbox-earcut 2.0.0
markdown 3.10.2
markdown-it-py 4.2.0
markupsafe 3.0.3
matplotlib 3.11.0rc2
matplotlib-inline 0.2.2
mcp 1.27.2
mdit-py-plugins 0.6.1
mdurl 0.1.2
mergedeep 1.3.4
ml-collections 1.0.0
ml-dtypes 0.5.4
modelscope 1.37.1
monotonic 1.6
mplib 0.2.1
mpmath 1.3.0
msgpack 1.2.0rc1
msgspec 0.21.1
mujoco 3.9.0
multidict 6.7.1
multiprocess 0.70.16
mypy-extensions 1.1.0
namex 0.1.0
narwhals 2.22.1
nbformat 5.10.4
nest-asyncio 1.6.0
networkx 3.6.1
ninja 1.13.0
nltk 3.9.4
nodeenv 1.10.0
numba 0.66.0rc1
numcodecs 0.16.5
numpy 1.26.4
numpy-quaternion 2024.0.13
numpydantic 1.8.1
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvcc-cu12 12.9.86
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-curobo 0.0.0
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-cusparselt-cu12 0.6.2
nvidia-libnvcomp-cu12 5.2.0.13
nvidia-ml-py 13.595.45
nvidia-nccl-cu12 2.21.5
nvidia-nvcomp-cu12 5.2.0.13
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
nvitop 1.7.0
oauth2client 4.1.3
oauthlib 3.3.1
omegaconf 2.4.0.dev11
open3d 0.19.0
openai 2.41.0
opencensus 0.11.4
opencensus-context 0.2.dev0
opencv-python 4.11.0.86
opencv-python-headless 4.11.0.86
openexr 3.4.12
openpi 0.1.0
openpi-client 0.1.0
opentelemetry-api 1.42.1
opentelemetry-exporter-prometheus 0.63b1
opentelemetry-proto 1.42.1
opentelemetry-sdk 1.42.1
opentelemetry-semantic-conventions 0.63b1
opt-einsum 3.4.0
optax 0.2.8
optree 0.19.1
orbax-checkpoint 0.11.13
orderly-set 5.5.0
orjson 3.11.9
packaging 26.2
pandas 3.0.3
parso 0.8.7
peft 0.19.1
pexpect 4.9.0
pfzy 0.3.4
pillow 12.2.0
pip 26.1.2
platformdirs 4.10.0
plotly 6.8.0
pluggy 1.6.0
polars 1.41.2
polars-runtime-32 1.41.2
pre-commit 4.6.0
prettytable 3.17.0
prometheus-client 0.25.0
promise 2.3
prompt-toolkit 3.0.52
propcache 0.5.2
proto-plus 1.28.0
protobuf 6.33.6
psutil 7.2.2
ptyprocess 0.7.0
pure-eval 0.2.3
py-spy 0.4.2
pyarrow 24.0.0
pyasn1 0.6.3
pyasn1-modules 0.4.2
pybind11 3.0.4
pycollada 0.9.3
pycparser 3.0
pydantic 2.14.0a1
pydantic-core 2.47.0
pydantic-settings 2.14.1
pyecharts 2.1.0
pygments 2.20.0
pyjwt 2.13.0
pymunk 7.2.0
pynput 1.8.2
pyopengl 3.1.10
pyopenssl 26.0.0
pyparsing 3.3.2
pyperclip 1.11.0
pyquaternion 0.9.9
pyrealsense2 2.58.1.10581
pysocks 1.7.1
pytest 9.0.3
python-dateutil 2.9.0.post0
python-discovery 1.4.0
python-dotenv 1.2.2
python-multipart 0.0.32
python-xlib 0.33
pyu2f 0.1.5
pyyaml 6.0.3
pyyaml-include 1.4.1
pyzmq 27.1.0
ray 2.55.1
referencing 0.37.0
regex 2026.5.9
requests 2.34.2
requests-oauthlib 2.0.0
rerun-sdk 0.23.1
retry-decorator 2.0a1
retrying 1.4.2
rich 14.3.4
robosuite 1.4.1
rpds-py 2026.5.1
rsa 4.7.2
rtree 1.4.1
ruff 0.15.16
safetensors 0.8.0rc1
sapien 3.0.1
scikit-learn 1.9.0
scipy 1.17.1
sentencepiece 0.2.1
sentry-sdk 3.0.0a7
setuptools 75.8.2
setuptools-scm 10.0.5
shapely 2.1.2
simple-parsing 0.1.8
simplejson 4.1.1
six 1.17.0
smart-open 7.6.1
smmap 5.0.3
sniffio 1.3.1
soupsieve 2.8.4
sse-starlette 3.4.4
stack-data 0.6.3
starlette 1.2.1
svg-path 7.0
svgwrite 1.4.3
swanlab 0.8.0rc4
sympy 1.13.1
tensorboard 2.20.0
tensorboard-data-server 0.7.2
tensorflow 2.21.0
tensorflow-addons 0.23.0
tensorflow-datasets 4.9.10
tensorflow-graphics 2021.12.3
tensorflow-metadata 1.17.3
tensorstore 0.1.84
termcolor 3.3.0
threadpoolctl 3.6.0
timm 1.0.27
tokenizers 0.21.4
toml 0.10.2
toolz 1.1.0
toppra 0.6.3
torch 2.6.0
torchcodec 0.2.0
torchdata 0.11.0
torchvision 0.21.0
tqdm 4.67.3
tqdm-loggable 0.4.1
traitlets 5.15.1
transformers 4.53.2
transforms3d 0.4.2
tree 0.2.4
treescope 0.1.10
trimesh 4.12.2
triton 3.2.0
typeguard 4.5.2
typing-extensions 4.15.0
typing-inspect 0.9.0
typing-inspection 0.4.2
tyro 1.0.13
urllib3 2.7.0
uv 0.11.19
uvicorn 0.49.0
vcs-versioning 1.1.1
vhacdx 0.0.10
virtualenv 21.4.2
viser 1.0.30
wadler-lindig 0.1.7
wandb 0.25.0
warp-lang 1.11.1
watchdog 6.0.0
wcwidth 0.7.0
websockets 16.0
werkzeug 3.1.8
wheel 0.47.0
widgetsnbextension 4.0.15
wrapt 2.2.1
xxhash 3.7.0
yarl 1.24.2
yourdfpy 0.0.60
zarr 3.1.5
zipp 4.1.0
zstandard 0.25.0
nvidia-smi:
Fri Jun 12 11:51:56 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:10:00.0 Off |                    0 |
| N/A   54C    P0             98W /  400W |   67108MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:16:00.0 Off |                    0 |
| N/A   47C    P0             93W /  400W |   67417MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  |   00000000:2F:00.0 Off |                    0 |
| N/A   63C    P0            356W /  400W |   16538MiB /  81920MiB |     67%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  |   00000000:33:00.0 Off |                    0 |
| N/A   54C    P0            120W /  400W |   67353MiB /  81920MiB |     94%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  |   00000000:8A:00.0 Off |                    0 |
| N/A   50C    P0             94W /  400W |   67929MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  |   00000000:8F:00.0 Off |                    0 |
| N/A   56C    P0            144W /  400W |   67353MiB /  81920MiB |     93%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  |   00000000:C6:00.0 Off |                    0 |
| N/A   53C    P0            200W /  400W |   68109MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  |   00000000:CA:00.0 Off |                    0 |
| N/A   45C    P0             70W /  400W |       4MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
RLinf version: N/A
Docker image tag: N/A

Other reproduction info

The approx_kl and clip_fraction remain relatively low, suggesting that the policy updates may be too conservative.
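
For readers comparing these numbers against their own runs, here is a minimal sketch of how approx_kl and clip_fraction are commonly computed in PPO-style training, using the clip range from the config (clip_ratio_low = clip_ratio_high = 0.2). The function name, the choice of KL estimator (mean of old minus new log-probability), and the toy log-probabilities are assumptions for illustration; RLinf's own computation may differ.

```python
import math


def ppo_diagnostics(logp_new, logp_old, clip_low=0.2, clip_high=0.2):
    """Return (approx_kl, clip_fraction) for paired per-sample log-probs."""
    log_ratios = [new - old for new, old in zip(logp_new, logp_old)]
    ratios = [math.exp(lr) for lr in log_ratios]
    # Simple KL estimate: average of (log pi_old - log pi_new).
    approx_kl = sum(-lr for lr in log_ratios) / len(log_ratios)
    # Fraction of samples whose ratio falls outside [1 - low, 1 + high].
    outside = [r < 1.0 - clip_low or r > 1.0 + clip_high for r in ratios]
    clip_fraction = sum(outside) / len(ratios)
    return approx_kl, clip_fraction


# Toy data (hypothetical): four samples, two of which move past the clip range.
logp_old = [-1.0, -1.0, -1.0, -1.0]
logp_new = [-1.0, -0.7, -1.3, -1.05]
approx_kl, clip_fraction = ppo_diagnostics(logp_new, logp_old)
print(approx_kl, clip_fraction)
```

Note that the two metrics can disagree: in the toy data half the samples are clipped, yet the KL estimate stays small because increases and decreases in log-probability largely cancel in the mean.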
