
resume pytorch lightning + aim experiment run #3401

@mmg10


❓Question

I want to resume a training run if it is interrupted. In PyTorch Lightning, I can specify the last checkpoint path via trainer.fit(model, train_loader, val_loader, ckpt_path=resume_ckpt), but this creates a new Aim experiment run. So I am trying the following approach:

  • Create a local file that stores the Aim run hash (AIM_RUN_HASH_FILE.write_text(aim_logger.experiment.hash))
  • Delete this file once the training loop completes (AIM_RUN_HASH_FILE.unlink(missing_ok=True))
  • If training is interrupted, the hash file is not deleted. In that case I 're-use' it via
run_hash = AIM_RUN_HASH_FILE.read_text().strip()
aim_logger = AimLogger(repo=AIM["repo"], experiment=AIM["experiment"], run_hash=run_hash)
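The three steps above can be sketched as a small set of standard-library helpers. This is only an illustration of the marker-file bookkeeping, not Aim's own API: the file name `.aim_run_hash` and the function names `load_run_hash`, `save_run_hash` and `clear_run_hash` are hypothetical, and only the `AIM_RUN_HASH_FILE` variable name comes from the post.

```python
from pathlib import Path
from typing import Optional

# Hypothetical location of the marker file; the post only names the variable.
AIM_RUN_HASH_FILE = Path(".aim_run_hash")


def load_run_hash(path: Path = AIM_RUN_HASH_FILE) -> Optional[str]:
    """Return the stored run hash, or None if no interrupted run is recorded."""
    if not path.exists():
        return None
    run_hash = path.read_text().strip()
    # Treat an empty or whitespace-only file the same as a missing one.
    return run_hash or None


def save_run_hash(run_hash: str, path: Path = AIM_RUN_HASH_FILE) -> None:
    """Record the hash of the run that is about to train."""
    path.write_text(run_hash)


def clear_run_hash(path: Path = AIM_RUN_HASH_FILE) -> None:
    """Remove the marker after training completes (no error if absent)."""
    path.unlink(missing_ok=True)
```

In the training script, the result of `load_run_hash()` would be passed as `run_hash=` to `AimLogger`, on the assumption that `None` (the parameter's default) starts a fresh run. `save_run_hash(aim_logger.experiment.hash)` would be called before `trainer.fit(...)`, and `clear_run_hash()` only after `trainer.fit(...)` returns normally, so an interruption leaves the marker in place.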

Is this a valid approach? Is there a better way to do this?
