
Centralize and consolidate checkpoint/metadata utilities #246

@frazane

Problem: code and commands related to checkpoint metadata are in different places

Code for dealing with checkpoint metadata is currently scattered across the framework. Here is a short list of modules:

Some utilities for interacting with checkpoint metadata are available in the CLIs of anemoi-inference and anemoi-training:

From anemoi-inference:

  • inspect: Inspect the contents of a checkpoint file.
  • metadata: Edit, remove, dump or load metadata from a checkpoint file.
  • patch: Patch a checkpoint file.
  • sanitise: Sanitise a checkpoint file.

From anemoi-training:

  • checkpoint: Commands to interact with training checkpoints.

In most cases, interacting with checkpoint metadata is a simple task that does not require the heavy dependencies of either library, notably PyTorch. Ideally, these utilities would live primarily in anemoi-utils and, optionally and temporarily, also be exposed through the other packages.
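To illustrate why PyTorch should not be needed here, below is a minimal sketch that reads metadata using only the standard library. It assumes the checkpoint is a zip archive and that the metadata is a JSON file stored inside it. The member name `anemoi-metadata/metadata.json` and the function `read_checkpoint_metadata` are placeholders for illustration, not the actual layout or API.

```python
import json
import zipfile

# Assumption: the metadata is one JSON member inside the checkpoint archive.
# This member name is a placeholder, not the real layout.
METADATA_SUFFIX = "anemoi-metadata/metadata.json"


def read_checkpoint_metadata(path, suffix=METADATA_SUFFIX):
    """Return the metadata dict stored in a zip-based checkpoint.

    Only the JSON member is read, so the model weights are never loaded.
    """
    with zipfile.ZipFile(path) as archive:
        matches = [name for name in archive.namelist() if name.endswith(suffix)]
        if len(matches) != 1:
            raise ValueError(
                f"Expected exactly one '{suffix}' member in {path}, found {len(matches)}"
            )
        with archive.open(matches[0]) as member:
            return json.load(member)
```

Matching on the suffix rather than the full member name keeps the sketch independent of the top-level folder name inside the archive.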


Proposal

Note

This is still a work in progress. The proposal will be extended gradually as the discussion continues, until an agreement is reached, if one is.

We could move the code and commands that deal with checkpoint metadata into anemoi-utils. This would also be an opportunity to consolidate the metadata that goes into a checkpoint. anemoi-utils could provide:

  • a single CheckpointMetadata base class that handles checkpoint metadata during both writing and reading. This class would provide all the basic functionality to extract metadata from a checkpoint, while remaining generic with respect to the task for which the checkpoint was trained;
  • a single CLI entry point, anemoi-utils checkpoint, optionally with further sub-commands, providing all the functionality to deal with checkpoint metadata (inspection, extraction, editing, etc.).
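As a rough sketch of the second point, the entry point could be built with nested `argparse` sub-parsers. The sub-command names `inspect` and `get`, the helper `lookup`, and the `load` callable (anything that returns the metadata dict for a path) are all hypothetical and only meant to show the shape of the command, not an agreed interface.

```python
import argparse
import json


def build_parser():
    """Build a parser for `anemoi-utils checkpoint <action> ...`."""
    parser = argparse.ArgumentParser(prog="anemoi-utils")
    commands = parser.add_subparsers(dest="command", required=True)

    checkpoint = commands.add_parser("checkpoint", help="Work with checkpoint metadata.")
    actions = checkpoint.add_subparsers(dest="action", required=True)

    # Hypothetical action: list the top-level metadata keys.
    inspect = actions.add_parser("inspect", help="List top-level metadata keys.")
    inspect.add_argument("path")

    # Hypothetical action: print a single value addressed by a dotted key.
    get = actions.add_parser("get", help="Print one value by dotted key.")
    get.add_argument("path")
    get.add_argument("key")
    return parser


def lookup(metadata, dotted_key):
    """Walk nested dicts following a key such as 'config.training.lr'."""
    node = metadata
    for part in dotted_key.split("."):
        node = node[part]
    return node


def run(argv, load):
    """Parse argv and return the text the command would print.

    `load` is injected so the CLI itself stays free of heavy dependencies.
    """
    args = build_parser().parse_args(argv)
    metadata = load(args.path)
    if args.action == "inspect":
        return "\n".join(sorted(metadata))
    return json.dumps(lookup(metadata, args.key))
```

Injecting `load` keeps the command layer separate from how the metadata is stored, so the same parser could be re-exposed temporarily from the other packages.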

Open questions

  • Do we first need to discuss and agree on what the metadata actually is?
  • What are the downsides of this proposal with respect to the current separation of the repositories (specifically anemoi-inference, anemoi-utils, anemoi-core)? Would it make CI/CD, maintenance and contributions harder?
  • How do we separate task-agnostic and task-specific checkpoint metadata? Could task-specific metadata be handled by subclassing CheckpointMetadata?
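To make the last question more concrete, here is one possible shape for the subclassing approach, offered as a starting point for discussion rather than a design. The `task` key, the `forecasting` section, the `timestep` field and the `ForecastingMetadata` class are invented for illustration. Nothing here reflects the current metadata schema.

```python
class CheckpointMetadata:
    """Task-agnostic view of the metadata (hypothetical sketch)."""

    task = "generic"
    _registry = {}

    def __init_subclass__(cls, **kwargs):
        # Each subclass registers itself under its task name.
        super().__init_subclass__(**kwargs)
        CheckpointMetadata._registry[cls.task] = cls

    def __init__(self, raw):
        self.raw = dict(raw)

    @property
    def version(self):
        return self.raw.get("version")

    @classmethod
    def from_dict(cls, raw):
        """Return the subclass registered for raw['task'], else the base class."""
        target = cls._registry.get(raw.get("task"), CheckpointMetadata)
        return target(raw)

    def to_dict(self):
        return dict(self.raw)


class ForecastingMetadata(CheckpointMetadata):
    """Task-specific accessors live in the subclass (names are invented)."""

    task = "forecasting"

    @property
    def timestep(self):
        return self.raw["forecasting"]["timestep"]
```

With this layout, generic tooling only ever touches the base class, and an unknown task falls back to it instead of failing.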

Implementation plan

Note

Waiting for a proposal to be accepted.

Organisation

MeteoSwiss
