❓Question
I want to resume a training run if it is interrupted. In PyTorch Lightning, I can specify the last checkpoint path via trainer.fit(model, train_loader, val_loader, ckpt_path=resume_ckpt). But this creates a new Aim experiment run instead of continuing the old one. So I am trying it this way:
- Create a local file that stores the Aim run hash:
AIM_RUN_HASH_FILE.write_text(aim_logger.experiment.hash)
- Delete this file once the training loop completes:
AIM_RUN_HASH_FILE.unlink(missing_ok=True)
- If training is interrupted, the hash file is not deleted. In that case I re-use the stored hash via:
run_hash = AIM_RUN_HASH_FILE.read_text().strip()
aim_logger = AimLogger(repo=AIM["repo"], experiment=AIM["experiment"], run_hash=run_hash)
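Here is a minimal sketch of the hash-file bookkeeping from the steps above, kept to pathlib so it runs on its own. The helper names (load_run_hash, save_run_hash, clear_run_hash) and the file location are hypothetical. The AimLogger and trainer wiring is shown only as comments and assumes the installed Aim version accepts run_hash, as in my snippet above.

```python
from pathlib import Path
from typing import Optional

# Hypothetical location for the hash file; any writable path works.
AIM_RUN_HASH_FILE = Path("aim_run_hash.txt")


def load_run_hash(path: Path = AIM_RUN_HASH_FILE) -> Optional[str]:
    """Return the stored run hash, or None if no interrupted run is recorded."""
    if not path.exists():
        return None
    # Treat an empty file the same as a missing one.
    return path.read_text().strip() or None


def save_run_hash(run_hash: str, path: Path = AIM_RUN_HASH_FILE) -> None:
    """Persist the hash of the run that is about to train."""
    path.write_text(run_hash)


def clear_run_hash(path: Path = AIM_RUN_HASH_FILE) -> None:
    """Remove the hash file; a missing file is not an error."""
    path.unlink(missing_ok=True)


# Intended wiring (comments only, so the sketch runs without aim or lightning):
#   run_hash = load_run_hash()            # None on a fresh start
#   aim_logger = AimLogger(repo=AIM["repo"], experiment=AIM["experiment"],
#                          run_hash=run_hash)
#   save_run_hash(aim_logger.experiment.hash)
#   trainer.fit(model, train_loader, val_loader, ckpt_path=resume_ckpt)
#   clear_run_hash()                      # reached only if fit() returns normally
```

Passing run_hash=None on a fresh start means the same construction line covers both the new-run and the resumed-run case, and clear_run_hash() is skipped automatically when fit() raises or the process is killed.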
Is this a valid approach? Is there a better way to do this?