XL Datasets: Upload#191
Conversation
jstlaurent
left a comment
There was a problem hiding this comment.
Added some comments and question. Happy to see the new data structure is pretty similar. That means our initial design choices were good. 😄
Test-driven development! Yeah
Now the fun starts...
* updates for calculating zarr manifests & adding basic tests for it * moving cache_dir assignment to DatasetV1 and DatasetV2 model validators * Updating argument types for parquet utils * Updating argument types for md5 util * fixing DatasetV1 export & dataset model validators * PR feedback updates * Adding test that checks the length of the manifest after update * PR feedback
c3561a4 to
64e8799
Compare
cwognum
left a comment
There was a problem hiding this comment.
This looks great! I think it came together very nicely. Thanks @Andrewq11 and @jstlaurent ! 🙏
My only comment concerns the _md5sum field in the DatasetV2 class. I'm confused... I thought we explicitly decided to remove that from the V2 dataset, see e.g. #186 (comment) . Looking through the commit history, it seems I may have added it? Could it have been reintroduced due to rebasing this branch?
Context switching is hard... I feel like it was a whole other person that worked on this.
There was a problem hiding this comment.
I can't approve it (because I'm assigned to it), but it looks good to me as well. Left just a few comments but none are blocking. Thanks @jstlaurent!
@cwognum: It might have been a rebase issue. It got a little confusing at some point. So, it should be removed? |
cwognum
left a comment
There was a problem hiding this comment.
@jstlaurent Yes, it should be removed. I think #186 is a good PR to reference for what I would expect.
Since the checksum for the DatasetV2 dataset is not deterministic, it's a significant departure from the DatasetV1 class. For example, as @Andrewq11 pointed out, verifying the checksum after download doesn't make much sense.
I've highlighted the changes I think should be reverted, but may have missed some.
Changelogs
DatasetV2.upload_to_hubmethod to call theclient.upload_datasetmethod in the PolarisHubClient classclient._upload_v1_datasetorclient._upload_v2_datasetdepending on the type of the dataset object passed as an argumentclient._upload_v2_datasetmethod to handle uploading V2 datasetsStorageSessionand the storage tokens to upload the manifest and Zarr archive to the storage backendCurrently on-hold until #173 is addressed:- Leveraging storage JWTs inclient._upload_v2_dataset~- Leveraging storage JWTs in
client._get_v2_dataset~Edit: Storage JWT work is done.
Checklist:
feature,fix,chore,documentationortest(or ask a maintainer to do it for you).discussion related to that PR