Related to #1896
The current glacier directory generation and fetch is very lazy (in the sense that not much thought has been given to optimizing any of it). Each level is built on top of the previous one, and data from each level is stored together with data from all the previous ones.
A classic example:
- L2 directories contain gridded data, outlines, and flowlines
- L3 adds climate data and calibration
- L4 adds spinup (in terms of code, this is agnostic of what kind of climate data was added to L3 btw, it's the same code).
L2 data is stored again and again and again. Add two options in L3 (e.g. ERA5 and W5E5), and data duplication is multiplied.
Now that I think of it, this is so silly. Or, better said, I knew this was silly, but laziness and ease of use made it worthwhile. The advantage of this strategy is that all users need to describe the setup they use is the `base_url`, and all OGGM needs is a `base_url`.
I think we need to change this paradigm towards a "shopping list" of "slices" or "levels" that users have to add to their gdirs to have what they want. I can easily envision how that would work. The ideal setup would be a "shopping list of everything" (this is what #1896 would have as end goal), but this is a bit scary to think about.
A much lower-hanging fruit would be a slightly more layered approach to glacier dir creation:
- L0: outlines
- L1: DEM
- L2: Additional gridded shop datasets
- L3: flowlines (with or without binned data) - which is currently L2
- L4: climate data + calibration (currently L3)
- L5: spinup files (currently L4)
Each level would store only the data it created, and a record of what urls / levels it used to create this data (whether this needs to be strict or not depends on the decisions below).
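A minimal sketch of what such a per-level record could look like, assuming a hypothetical `LevelRecord` dataclass serialised to JSON inside the gdir. None of these names, file names or urls exist in OGGM; they only illustrate the idea of storing "what I created" next to "what I was built from".

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class LevelRecord:
    """Hypothetical provenance record stored alongside one level's data."""
    level: str                                    # e.g. "L5"
    files: list                                   # files created by this level only
    options: dict = field(default_factory=dict)   # e.g. {"climate": "ERA5"}
    parents: list = field(default_factory=list)   # urls / levels this was built from

    def to_json(self):
        # Sorted keys keep the file stable across runs
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text))


record = LevelRecord(
    level="L5",
    files=["spinup_file.nc"],                     # placeholder file name
    options={"climate": "ERA5", "dem": "COPDEM"},
    parents=["https://example.org/L4/ERA5"],      # placeholder url
)

# Round trip: what is written to disk can be read back unchanged
restored = LevelRecord.from_json(record.to_json())
```

Whether `parents` has to be complete (strict separation) or is only informative (some duplication allowed) is exactly the open question discussed next.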
A user would then say: "I take the L5 spinup files which have been created with ERA5 and COPDEM."
- If the separation of files is strict, the user would also have to add at least the L4 files to build a glacier directory that, in the simplest setup, has enough data to run a projection (like the current L5 files).
- If we choose to duplicate a few files (which I am tending towards right now), L4 and L5 would already be enough data to start a run (like the current L5 files).
The user could also say: "I want to add L2 files (preprocessed shop datasets) to my gdir", which is one Python line in their code and works regardless of the rest, because this is upstream of and independent from the choices made afterwards. Some internal checks would be needed, because there is a risk that users take L5 files that have been calibrated with, say, COPDEM and then want to add gridded DEMs from another product. This can be checked inside OGGM using the internal record of what has happened.
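To make the internal check concrete, here is a rough sketch under the assumption that each slice carries a small dict of the options it was built with. The function name `check_slice_compatible`, the `"options"` / `"dem"` keys and the record layout are all hypothetical, not existing OGGM API.

```python
def check_slice_compatible(gdir_records, new_slice, strict_keys=("dem",)):
    """Raise if a new slice conflicts with choices already recorded in the gdir.

    Both `gdir_records` (list of dicts) and `new_slice` (dict) are hypothetical
    record structures with an "options" mapping.
    """
    for record in gdir_records:
        for key in strict_keys:
            old = record.get("options", {}).get(key)
            new = new_slice.get("options", {}).get(key)
            # Only complain when both sides made an explicit, different choice
            if old is not None and new is not None and old != new:
                raise ValueError(
                    f"{new_slice['level']} uses {key}={new!r} but "
                    f"{record['level']} was created with {key}={old!r}"
                )


# Records already present in the gdir (illustrative values)
existing = [{"level": "L5", "options": {"climate": "ERA5", "dem": "COPDEM"}}]

# Same DEM product: passes silently
check_slice_compatible(existing, {"level": "L2", "options": {"dem": "COPDEM"}})
```

Adding an L2 slice built from a different DEM product would raise a `ValueError` here; whether that should be an error or just a warning is a design choice.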
In terms of changes needed to OGGM, I think this is very doable. The current mechanism for downloading and unpacking tar-ed gdirs would be slightly adjusted to add data to a pre-existing gdir instead of overwriting it. In terms of server access, the current "one access per setup" would become something like 2 or 3 accesses, which is very manageable.
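The "add instead of overwrite" behaviour could look roughly like the sketch below. It uses only the standard library and is not how OGGM currently unpacks gdirs; `add_slice_to_gdir` is a hypothetical helper name.

```python
import os
import shutil
import tarfile


def add_slice_to_gdir(tar_path, gdir_path, overwrite=False):
    """Extract a tar-ed slice into an existing directory.

    Files already present in `gdir_path` are kept unless `overwrite` is True.
    Returns the list of member names that were actually written.
    """
    root = os.path.realpath(gdir_path)
    added = []
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            target = os.path.realpath(os.path.join(root, member.name))
            # Refuse members that would land outside the gdir
            if not target.startswith(root + os.sep):
                raise ValueError(f"Unsafe path in archive: {member.name}")
            if os.path.exists(target) and not overwrite:
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            added.append(member.name)
    return added
```

Calling this once per downloaded slice would give the 2 or 3 accesses mentioned above, each one only adding the files the gdir does not have yet.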
As part of this process, we could revisit some old ways OGGM is doing things:
- the very archaic BASENAMES should be replaced by one or more proper YAML or JSON files describing the "data dependency tree" required by OGGM to run (to be refined)
- the task logging mechanism could be improved to be a more integral part of the workflow.
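As a starting point for discussion, the "data dependency tree" could be as simple as the mapping below, shown here as a Python dict standing in for the content of a YAML or JSON file. The keys (`provides`, `requires`), the data names and the specific dependencies between levels are assumptions for illustration, using the level numbering proposed above with strict separation.

```python
# Hypothetical content of a file replacing BASENAMES
DEPENDENCIES = {
    "L0": {"provides": ["outlines"], "requires": []},
    "L1": {"provides": ["dem"], "requires": ["L0"]},
    "L2": {"provides": ["gridded_shop_data"], "requires": ["L1"]},
    "L3": {"provides": ["flowlines"], "requires": ["L1"]},
    "L4": {"provides": ["climate", "calibration"], "requires": ["L3"]},
    "L5": {"provides": ["spinup"], "requires": ["L4"]},
}


def resolve(level, tree=DEPENDENCIES):
    """Return the levels needed to use `level`, dependencies first."""
    ordered = []

    def visit(name):
        if name in ordered:
            return
        for parent in tree[name]["requires"]:
            visit(parent)
        ordered.append(name)

    visit(level)
    return ordered


# The "shopping list" for someone who wants the L5 spinup files
shopping_list = resolve("L5")
```

In this sketch L2 never shows up in the list for L5, which matches the idea that the shop datasets are an optional add-on independent of the rest.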
It's going to be quite hard to make this entirely backwards compatible for old urls, but perhaps not impossible. I think it's a really manageable change which should be discussed internally first (@gampnico @pat-schmitt).