XL Datasets: Minimal Zarr-only dataset implementation#186
Conversation
Test-driven development! Yeah
Now the fun starts...
* updates for calculating zarr manifests & adding basic tests for it * moving cache_dir assignment to DatasetV1 and DatasetV2 model validators * Updating argument types for parquet utils * Updating argument types for md5 util * fixing DatasetV1 export & dataset model validators * PR feedback updates * Adding test that checks the length of the manifest after update * PR feedback
jstlaurent
left a comment
There was a problem hiding this comment.
I have a few comments and suggestions. Overall, good, solid work. 😄
|
@jstlaurent @Andrewq11 Thank you for the review! This ended up being a good opportunity to also review the Dataset code that was mostly written over a year ago. One of the things I realized, is that the Unfortunately, this does imply that some of the code examples we've shown in presentations are no longer accurate. See also https://github.com/polaris-hub/polaris-hub/pull/471. I am personally okay with this, since the Dataset API generally does not get a lot of attention compared to the Benchmark API. What do you think? |
Andrewq11
left a comment
There was a problem hiding this comment.
Looks good on my side. Thanks for all the work here!
More progress toward XL datasets 🚀
…est from the v2 dataset and moved the verify_checksum parameter to v1
jstlaurent
left a comment
There was a problem hiding this comment.
Looks good. Still had a couple of things to say. 😉
Changelogs
DatasetV2class in a newpolaris.experimentalmodule.DatasetV1, but create theDatasetalias for it.BaseDatasetclass that encompasses all functionality that is shared between V1 and V2.__getitem__to only allow indexing a specific value or a row.Checklist:
feature,fix,chore,documentationortest(or ask a maintainer to do it for you).Closes #132
The implementation has been relatively straight-forward given the PoC. There are two points that required some further thinking.
Datasetalias for it to not break backwards compatibility. I also introduced a newexperimentalmodule to house the V2 implementation.On V1 and V2 compatibility
You can think of a Pandas DataFrame as a set of named NumPy arrays. For storage, the conversion from Pandas to a Zarr archive is therefore relatively straight-forward. Note that this is limited to the storage - It's not straight-forward to implement the Pandas API on top of a Zarr archive, but this was decided to be out of scope.
An example of this conversion was implemented in this test case. A toy example would look like: