Problem: code and commands related to checkpoint metadata are in different places
Code for dealing with checkpoint metadata is currently scattered across the framework. Here is a short list of the modules involved:
- https://github.com/ecmwf/anemoi-utils/blob/main/src/anemoi/utils/checkpoints.py
- https://github.com/ecmwf/anemoi-core/blob/main/training/src/anemoi/training/utils/checkpoint.py
- https://github.com/ecmwf/anemoi-inference/blob/main/src/anemoi/inference/checkpoint.py and https://github.com/ecmwf/anemoi-inference/blob/main/src/anemoi/inference/metadata.py
Some utilities for interacting with checkpoint metadata are available through the CLIs of anemoi-inference and anemoi-training, listed below.
From anemoi-inference:
- inspect: Inspect the contents of a checkpoint file.
- metadata: Edit, remove, dump or load metadata from a checkpoint file.
- patch: Patch a checkpoint file.
- sanitise: Sanitise a checkpoint file.
From anemoi-training:
- checkpoint: Commands to interact with training checkpoints.
In most cases, interacting with checkpoint metadata is a simple task that does not require the heavy dependencies of both libraries, notably PyTorch. Ideally, these utilities would live primarily in anemoi-utils and could optionally (and temporarily) also remain exposed via the other packages.
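To illustrate why PyTorch is not strictly needed here, the sketch below reads metadata using only the standard library. It assumes the checkpoint is a zip archive (the default format written by `torch.save` in recent PyTorch versions) and that the metadata is stored as a JSON member inside it. The member file name `metadata.json` and the function `read_metadata` are hypothetical, chosen for illustration, and do not describe the existing anemoi-utils API.

```python
import json
import zipfile
from pathlib import PurePosixPath

# Hypothetical file name of the JSON metadata member inside the archive.
METADATA_NAME = "metadata.json"


def read_metadata(path, name=METADATA_NAME):
    """Return the JSON metadata stored in a zip-based checkpoint.

    No PyTorch import is needed: the archive is opened with zipfile
    and only the metadata member is read, not the model weights.
    """
    with zipfile.ZipFile(path) as archive:
        # Look for the member by file name, wherever it sits in the archive.
        matches = [m for m in archive.namelist() if PurePosixPath(m).name == name]
        if len(matches) != 1:
            raise ValueError(
                f"expected exactly one {name!r} in {path}, found {len(matches)}"
            )
        with archive.open(matches[0]) as handle:
            return json.load(handle)
```

Because only one small member is decompressed, this kind of lookup stays cheap even for large checkpoints.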
Proposal
Note
This is still a work in progress. The proposal will be extended gradually as the discussion continues, until an agreement is reached (if one is).
We could move the code and the commands used to deal with the checkpoint metadata under anemoi-utils. It would also be an opportunity to consolidate the metadata that goes into a checkpoint. anemoi-utils could provide:
- a single `CheckpointMetadata` base class that will handle the checkpoint metadata during both writing and reading. This class will provide all the basic functionality to extract metadata from a checkpoint but it will be generic with respect to the task for which a checkpoint was trained;
- a single CLI entry point `anemoi-utils checkpoint`, with optionally more sub-commands, providing all the functionality to deal with checkpoint metadata (inspection, extraction, editing, etc.)
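The two items above could look roughly like the following minimal sketch. Everything in it is an assumption made for discussion: the archive member name `anemoi-metadata/metadata.json`, the method names on `CheckpointMetadata`, and the `inspect` sub-command are hypothetical and not part of any existing anemoi package. It also assumes a zip-based checkpoint with JSON metadata.

```python
import argparse
import json
import zipfile


class CheckpointMetadata:
    """Task-agnostic view over checkpoint metadata (hypothetical sketch)."""

    # Hypothetical archive member that holds the JSON metadata.
    MEMBER = "anemoi-metadata/metadata.json"

    def __init__(self, data):
        self._data = dict(data)

    @classmethod
    def from_checkpoint(cls, path):
        """Read the metadata member from a zip-based checkpoint."""
        with zipfile.ZipFile(path) as archive:
            return cls(json.loads(archive.read(cls.MEMBER)))

    def to_checkpoint(self, path):
        """Add the metadata member to the archive at `path`.

        Assumes the member is not already present; replacing an existing
        member would require rewriting the archive.
        """
        with zipfile.ZipFile(path, "a") as archive:
            archive.writestr(self.MEMBER, json.dumps(self._data))

    def get(self, key, default=None):
        return self._data.get(key, default)

    def dump(self):
        """Return the metadata as pretty-printed JSON."""
        return json.dumps(self._data, indent=2, sort_keys=True)


def main(argv=None):
    """Hypothetical `anemoi-utils checkpoint` entry point."""
    parser = argparse.ArgumentParser(prog="anemoi-utils checkpoint")
    commands = parser.add_subparsers(dest="command", required=True)
    inspect = commands.add_parser("inspect", help="print checkpoint metadata")
    inspect.add_argument("path", help="path to the checkpoint file")

    args = parser.parse_args(argv)
    if args.command == "inspect":
        print(CheckpointMetadata.from_checkpoint(args.path).dump())
    return 0
```

Calling `main(["inspect", "model.ckpt"])` would print the metadata as JSON. Further sub-commands (extraction, editing, and so on) could be added to the same parser.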
Open questions
- Do we first need to have a discussion on what the metadata is?
- What are the downsides of this proposal with respect to the current separation of the repositories (specifically anemoi-inference, anemoi-utils, anemoi-core)? Would it make CI/CD, maintenance and contributions harder?
- How do we separate task-agnostic and task-specific checkpoint metadata? Could task-specific metadata be handled by subclassing `CheckpointMetadata`?
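One possible shape for the subclassing idea in the last question is sketched below. The metadata layout (a top-level `version` key and a `tasks` mapping), the `ForecastingMetadata` class, and the `timestep` field are all hypothetical; they are not taken from the current anemoi metadata schema and only show how a task-specific subclass could reuse a generic base.

```python
class CheckpointMetadata:
    """Task-agnostic base class (hypothetical sketch)."""

    def __init__(self, data):
        self._data = dict(data)

    @property
    def version(self):
        # Generic field assumed to be shared by every checkpoint.
        return self._data.get("version")

    def task_section(self, task):
        """Return the sub-dictionary reserved for one task, or {} if absent."""
        return self._data.get("tasks", {}).get(task, {})


class ForecastingMetadata(CheckpointMetadata):
    """Task-specific view that only adds forecasting-related accessors."""

    TASK = "forecasting"

    @property
    def timestep(self):
        return self.task_section(self.TASK).get("timestep")


meta = ForecastingMetadata(
    {"version": "1.0", "tasks": {"forecasting": {"timestep": "6h"}}}
)
print(meta.version, meta.timestep)
```

With this split, tools that only need generic fields could depend on the base class alone, while each task keeps its own accessors in a separate subclass.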
Implementation plan
Note
Waiting for a proposal to be accepted.
Organisation
MeteoSwiss