Evaluation code for the CoRL manipulation benchmark audit, focused on task-preserving test-set changes that probe overfitting to benchmark idiosyncrasies.
Current first target:
- Benchmark: SimplerEnv WidowX / Bridge
- Policy: CogACT-Base
- Calibration: official fixed-grid 24 x 4 object episodes
- Altered test set: randomized valid object positions on the same table
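The altered test set can be pictured with a small rejection sampler. This is only a sketch under stated assumptions: the table bounds, the minimum separation, and the function name `sample_object_positions` are hypothetical, and "valid" here means nothing more than in-bounds and non-overlapping. The real validity rule (reachability, stable placement) would come from the benchmark's own scene definition.

```python
import math
import random


def sample_object_positions(n_objects, x_range, y_range, min_dist, rng, max_tries=10_000):
    """Sample n_objects (x, y) points in a rectangle, pairwise at least min_dist apart.

    Hypothetical helper: bounds and min_dist are illustrative, not benchmark values.
    """
    positions = []
    tries = 0
    while len(positions) < n_objects:
        if tries >= max_tries:
            raise RuntimeError("could not place all objects; relax min_dist or bounds")
        tries += 1
        candidate = (rng.uniform(*x_range), rng.uniform(*y_range))
        # Reject candidates that would sit too close to an already placed object.
        if all(math.dist(candidate, placed) >= min_dist for placed in positions):
            positions.append(candidate)
    return positions


# Seeded generator so that a given episode index always maps to the same layout.
rng = random.Random(0)
layout = sample_object_positions(
    n_objects=2,
    x_range=(-0.25, 0.05),   # assumed table extent in metres
    y_range=(-0.15, 0.15),   # assumed table extent in metres
    min_dist=0.07,           # assumed minimum centre-to-centre spacing
    rng=rng,
)
```

Seeding per episode keeps the randomized set reproducible, so a policy's score on the altered set can be rerun and compared across package versions.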
The first SimplerEnv run is a calibration run against the official fixed-grid setting. Treat differences from the paper as possible environment/package version effects until the local evaluator is checked against the reported CogACT WidowX/Bridge number.
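One way to make "checked against the reported number" concrete is to ask whether the reported success rate falls inside a confidence interval around the local rate. The sketch below uses a Wilson score interval; the choice of interval and the function names are assumptions, not something this repo prescribes, and the reported rate must be taken from the paper rather than from this example.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


def consistent_with_reported(successes, n, reported_rate, z=1.96):
    """True if reported_rate lies inside the interval around the local rate."""
    low, high = wilson_interval(successes, n, z)
    return low <= reported_rate <= high
```

With 24 x 4 = 96 calibration episodes the interval is fairly wide, so a gap of a few percentage points from the paper does not by itself indicate an environment or package problem.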
- eval_policies_corl/: reusable Python helpers.
- configs/: experiment and policy configs.
- scripts/: launch, parsing, and Slurm entry points.
- third_party/: external policy or benchmark repos as pinned git submodules.
- results/, artifacts/, checkpoints/, cache/, envs/: local-only directories ignored by git.
This repo expects external code as submodules:
- third_party/simpler_env: simpler-env/SimplerEnv
- third_party/cogact: microsoft/CogACT
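Before a run it can help to confirm that both submodules are checked out at their pinned commits. The following is a minimal sketch that parses the output of `git submodule status`, whose first character marks each entry as uninitialized (`-`), checked out at a commit other than the pin (`+`), or in merge conflict (`U`). The helper names are hypothetical, and the parser assumes submodule paths contain no spaces.

```python
import subprocess

# Meaning of the one-character prefix in `git submodule status` output.
_STATE_BY_FLAG = {
    " ": "ok",
    "-": "uninitialized",
    "+": "differs-from-pin",
    "U": "merge-conflict",
}


def parse_submodule_status(text):
    """Turn `git submodule status` output into a list of dicts."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        flag, fields = line[0], line[1:].split()
        entries.append(
            {
                "path": fields[1],
                "commit": fields[0],
                "state": _STATE_BY_FLAG.get(flag, "unknown"),
            }
        )
    return entries


def submodule_states(repo_dir="."):
    """Run git in repo_dir and return the parsed submodule entries."""
    result = subprocess.run(
        ["git", "-C", str(repo_dir), "submodule", "status"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_submodule_status(result.stdout)
```

Any entry whose state is not `ok` means the evaluated code may not match the pin recorded in this repo.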
Build the runtime on the cluster with:
    sbatch scripts/slurm/setup_simplerenv_cogact.sbatch

The conda env is created at envs/simplerenv_cogact_py310_np126, with package
caches under cache/, so /home-nfs/tianchong is not used for large
environment state. The setup script pins NumPy/OpenCV because CogACT's
TensorFlow dependency is not NumPy-2 compatible.
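A quick guard at the start of an evaluation script can catch an environment where the NumPy pin was lost. This sketch only checks the installed major version through `importlib.metadata`; the function names and status strings are assumptions, and the exact pinned versions live in the setup script, not here.

```python
from importlib import metadata


def major_version(version_string):
    """Return the leading integer of a version string such as '1.26.4'."""
    return int(version_string.split(".")[0])


def numpy_status():
    """Classify the installed NumPy as 'missing', 'ok' (major < 2), or 'too-new'."""
    try:
        version = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return "missing"
    return "ok" if major_version(version) < 2 else "too-new"


if __name__ == "__main__":
    status = numpy_status()
    if status != "ok":
        raise SystemExit(f"NumPy check failed: {status}; rebuild the env with the setup script")
```

Checking package metadata rather than importing NumPy keeps the guard cheap and avoids triggering the incompatibility it is meant to detect.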
Run a one-episode smoke test:
    EPISODE_END=1 TASK_FILTER=stack sbatch scripts/slurm/simplerenv_cogact_bridge.sbatch

Run the official WidowX/Bridge calibration:

    sbatch scripts/slurm/simplerenv_cogact_bridge.sbatch

Each run writes videos under results/ and version metadata under artifacts/.
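The version metadata is what lets a later discrepancy be traced to an environment change. The sketch below shows one plausible shape for such a record; the JSON layout, file name, and helper names are assumptions for illustration and may differ from what the repo's scripts actually write.

```python
import json
import platform
import subprocess
from importlib import metadata
from pathlib import Path


def package_versions(names):
    """Map each package name to its installed version, or None if not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def git_commit(repo_dir="."):
    """Return the current commit hash of repo_dir, or None if git is unavailable."""
    try:
        result = subprocess.run(
            ["git", "-C", str(repo_dir), "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def write_version_metadata(out_path, packages, repo_dir="."):
    """Write a small JSON record of Python, git, and package versions; return it."""
    record = {
        "python": platform.python_version(),
        "git_commit": git_commit(repo_dir),
        "packages": package_versions(packages),
    }
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(record, indent=2, sort_keys=True) + "\n")
    return record
```

A call such as `write_version_metadata("artifacts/versions.json", ["numpy", "opencv-python", "tensorflow"])` would produce a file small enough to commit alongside configs, with the file name and package list here being placeholders.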
Commit source code, configs, scripts, submodule pins, and small metadata. Do not commit model weights, datasets, rollout videos, simulator caches, conda envs, container images, or large generated artifacts.
Prefer upstream policy repos as pinned submodules. Use a ripl fork only when we need patches, a frozen evaluation branch, or paper-critical reproducibility.
License is TBD before public release.