Load pipeline results#1196
Conversation
Implement the class method `load_from_directory` to reload a `PipelineResult` object starting from the results saved using `save_to_directory`.
…pykeen into load-pipeline-results
mberr
left a comment
There was a problem hiding this comment.
First of all, thanks for the PR. It looks already promising!
I quickly went over the changes and left a few comments.
You can do this in the load function via logging.warn (or better, a module-specific logger instance).
Maybe we can try to modify the existing Let me know if you need any help. |
Convert exists check to is_file Co-authored-by: Max Berrendorf <max.berrendorf@gmail.com>
Convert exists check to is_dir Co-authored-by: Max Berrendorf <max.berrendorf@gmail.com>
|
Hi @mberr! I think it is a good idea, we could add a new keyword argument to |
Sounds good. There is already a |
|
Hello @mberr . I've looked into By the way, I tried to store these attributes in a |
Sounds good 👍
Hm, that is quite nasty. I think the reason is that at that point we have already resolved names to the respective classes. Using pickle comes with some downsides compared to simple strings, e.g., that re-loading may fail if the code structure changed in between. However, if we keep reloading these parts optional, I would be fine with your proposed solution. |
Add the possibility to save the pipeline_info, a set of attributes of PipelineResult that were not saved in the previous implementation of `save_to_directory`. Reload the pipeline_info.
|
Hi @mberr ! With these changes we would just miss |
| #: How long in seconds did evaluation take? | ||
| evaluate_seconds: float | ||
|
|
||
| #: The model trained by the pipeline |
There was a problem hiding this comment.
Sorry I haven't been following discussion yet, but I'm pretty opposed to weakening the typing in this class. When people get results from the pipeline, they shouldn't have to check for if there are results there - it should be guaranteed.
This is definitely true for the model and seed. If the training triples don't get loaded properly, people get confused since they need to reference the training triples to do anything valuable. I don't want to make changes that will result in getting more github issues from confused users. Wrt the training loop, this could be a reasonable thing to make optional.
There was a problem hiding this comment.
Hi @cthoyt ! Yes, it's perfectly understandable. So, maybe we should give up the implmentation of this feature or rethink it completely if users need to reload the PipelineResult, since the current implementation of save_to_directory saves only partial information of the object.
Link to the relevant feature request
#1192
Description of the Change
Add the classmethod
load_from_directoryto the classpykeen.pipeline.PipelineResult. The class method produces an instance ofPipelineResultcontaining the information stored in the directory produced viasave_to_directory.Not all the instance attributes can be reloaded, since they're not saved to the directory. The attributes
training_loop,training,modelandrandom_seedhave been made optional, withNonedefault values. Thestopper(if present) is not reloaded because of some missing information (it was already optional). Other attributes such asconfiguration,versionandgit_hashare not reloaded, but their default value has been kept.Alternatives
I considered subclassing
PipelineResultwith a new classReloadedResultwhich contained default values for the parameters that could've not been restored. However, I think this solution adds too much complexity.Possible Drawbacks
Users should be warned about the missing information. Making the
random_seed,training,modelandtraining_loopas optional should not create any problem.Verification Process
Performed tests with
tox -e pyandtox -e docs-testwithout any error.Tried to save and reload the results obtained via:
with different configurations for the boolean flags, with success. Still have to write some unit tests for the new function.