feat: add RoboSpatial task #1347

Merged
kcz358 merged 2 commits into EvolvingLMMs-Lab:main from njb-nvidia:add-robo_spatial-task
May 25, 2026

@njb-nvidia

Contributor

Summary

Adds RoboSpatial, a spatial-reasoning benchmark for robotic manipulation scenes (RoboSpatial-Home) covering three sub-categories:

  • compatibility — 105 items
  • configuration — 123 items
  • context — 122 items

Total: 350 items.

This port exposes:

  • `robo_spatial` (group)
  • `robo_spatial_all` (union of all three splits via `dataset_kwargs.data_files` with `verification_mode: no_checks`)
  • `robo_spatial_compatibility` / `robo_spatial_configuration` / `robo_spatial_context` (single-category sub-tasks via `_default_template.yaml`)

Metric: `robo_spatial_score` — task-specific scoring (point / region / affordance correctness; see `pre_process.py` for parsing).
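To make the "point correctness" part of the metric concrete, here is a minimal sketch of one way such a check could work: pull the first `(x, y)` pair out of a model response and test whether it falls inside a target region. This is an illustration only and not the code in `pre_process.py` or `utils.py`. The helper names, the `(x, y)` answer format, and the axis-aligned box used as the region are all assumptions.

```python
import re

# Hypothetical helpers for illustration; NOT the parsing in pre_process.py.
# Assumes the model answers with a point written as "(x, y)".
_POINT_RE = re.compile(r"\(\s*(-?\d*\.?\d+)\s*,\s*(-?\d*\.?\d+)\s*\)")


def parse_point(text):
    """Return the first "(x, y)" pair found in text as floats, or None."""
    match = _POINT_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def point_in_box(point, box):
    """True if point lies inside the box (x_min, y_min, x_max, y_max)."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def score_point_answer(response, box):
    """Score 1.0 if a point is parsed and lies in the box, else 0.0."""
    point = parse_point(response)
    if point is not None and point_in_box(point, box):
        return 1.0
    return 0.0
```

An unparseable response scores 0.0 in this sketch, so a formatting failure is counted the same as a wrong answer. Whether the real scorer makes the same choice should be checked in `utils.py`.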

Files

  • `lmms_eval/tasks/robo_spatial/_default_template.yaml` — shared task config.
  • `lmms_eval/tasks/robo_spatial/robo_spatial.yaml` — group definition.
  • `lmms_eval/tasks/robo_spatial/robo_spatial_all.yaml` — concatenated test split.
  • `lmms_eval/tasks/robo_spatial/robo_spatial_{compatibility,configuration,context}.yaml` — per-category tasks.
  • `lmms_eval/tasks/robo_spatial/utils.py` — doc transforms, scoring, aggregation.
  • `lmms_eval/tasks/robo_spatial/pre_process.py` — answer parsing helpers.

Parity vs. local fork

Qwen3-VL-2B-Instruct, full test split on 8x H100, greedy decoding.

| Source | Compat | Config | Context | Overall (350) |
|---|---|---|---|---|
| Fork | 0.610 | 0.675 | 0.320 | 0.5314 |
| Upstream | 0.629 | 0.732 | 0.320 | 0.5571 |
Per-doc analysis on the 309 shared questions matched by doc_id: 91.9% identical scores.
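The per-doc comparison can be sketched as follows, assuming each run's scores are available as a mapping from `doc_id` to score. The function name and the dict layout are hypothetical and are not taken from the PR's tooling.

```python
def per_doc_agreement(fork_scores, upstream_scores):
    """Compare two runs on the doc_ids they share.

    Both arguments are assumed to be dicts mapping doc_id -> score.
    Returns (number of shared docs, fraction with identical scores).
    """
    shared = sorted(set(fork_scores) & set(upstream_scores))
    if not shared:
        return 0, 0.0
    identical = sum(
        1 for doc_id in shared if fork_scores[doc_id] == upstream_scores[doc_id]
    )
    return len(shared), identical / len(shared)


# Toy data: doc 0 exists only in the fork run, doc 4 only upstream.
fork = {0: 1.0, 1: 0.0, 2: 1.0, 3: 0.0}
upstream = {1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0}
n_shared, agreement = per_doc_agreement(fork, upstream)
print(n_shared, round(agreement, 3))
```

Restricting the comparison to shared `doc_id`s is what makes the agreement figure meaningful when the two runs do not cover exactly the same questions.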

Delta (+2.6pp overall) is consistent with the qwen3_vl model-class drift we have observed on other ports (e.g. metavqa, egoplan2).
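As a sanity check on the overall column, the sketch below recomputes it as an item-weighted average of the per-category scores, using the split sizes from the summary. It assumes the overall score is the mean over all 350 items. Because the per-category inputs are rounded to three decimals, the result agrees with the reported values only to about three decimal places.

```python
# Split sizes from the PR summary.
COUNTS = {"compatibility": 105, "configuration": 123, "context": 122}


def overall_score(per_category, counts=COUNTS):
    """Item-weighted mean of per-category scores."""
    total = sum(counts.values())
    weighted = sum(per_category[name] * counts[name] for name in counts)
    return weighted / total


fork = {"compatibility": 0.610, "configuration": 0.675, "context": 0.320}
upstream = {"compatibility": 0.629, "configuration": 0.732, "context": 0.320}

fork_overall = overall_score(fork)
upstream_overall = overall_score(upstream)
delta_pp = (upstream_overall - fork_overall) * 100  # percentage points

print(f"fork={fork_overall:.4f} upstream={upstream_overall:.4f} delta={delta_pp:.1f}pp")
```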

Test plan

  • `uv run lmms-eval --tasks robo_spatial_all --limit 8` smoke
  • Full run on 8x H100 with Qwen3-VL-2B-Instruct; per-category scores match the fork within noise
  • `combined` split assembly via `dataset_kwargs.data_files + verification_mode: no_checks` verified end-to-end (350 docs loaded as expected)

RoboSpatial is a spatial-reasoning benchmark for robotic manipulation
scenes (RoboSpatial-Home) covering three sub-categories:
compatibility, configuration, and context.

Dataset: chanhee-luke/RoboSpatial-Home on HuggingFace.
Per-category splits: compatibility (105), configuration (123), context (122)
(350 items total).

This port exposes:
  - robo_spatial (group)
  - robo_spatial_all (union of all three splits via dataset_kwargs.data_files)
  - robo_spatial_compatibility / robo_spatial_configuration / robo_spatial_context

Metric: robo_spatial_score — task-specific scoring implemented in utils.py
(point/region/affordance correctness; see pre_process.py for parsing).

Collaborator


This one seems a bit redundant if the `all` split is already configured as an individual YAML.

Address reviewer feedback on EvolvingLMMs-Lab#1347: the `robo_spatial` group only listed
`robo_spatial_all`, which is already a standalone task and can be invoked
directly via --tasks robo_spatial_all.
@njb-nvidia

Contributor Author

@kcz358 thanks for the review — dropped robo_spatial.yaml (the single-task group was redundant since robo_spatial_all is already a standalone task). Pushed as fe8b5ec.

@kcz358 kcz358 merged commit bf5e8b8 into EvolvingLMMs-Lab:main May 25, 2026
0 of 2 checks passed