Skip to content

XL Datasets: Upload#191

Merged
jstlaurent merged 18 commits into
mainfrom
feat/interact-with-xl-datasets
Oct 21, 2024
Merged

XL Datasets: Upload#191
jstlaurent merged 18 commits into
mainfrom
feat/interact-with-xl-datasets

Conversation

@Andrewq11

@Andrewq11 Andrewq11 commented Sep 5, 2024

Copy link
Copy Markdown
Contributor

Changelogs

  • Updated the DatasetV2.upload_to_hub method to call the client.upload_dataset method in the PolarisHubClient class
  • Added some conditional logic to call client._upload_v1_dataset or client._upload_v2_dataset depending on the type of the dataset object passed as an argument
  • Added the client._upload_v2_dataset method to handle uploading V2 datasets
  • Leverage StorageSession and the storage tokens to upload the manifest and Zarr archive to the storage backend

Currently on-hold until #173 is addressed:
- Leveraging storage JWTs in client._upload_v2_dataset
~- Leveraging storage JWTs in client._get_v2_dataset ~
Edit: Storage JWT work is done.


Checklist:

  • Was this PR discussed in an issue? It is recommended to first discuss a new feature into a GitHub issue before opening a PR.
  • Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
  • Update the API documentation if a new function is added, or an existing one is deleted.
  • Write concise and explanatory changelogs above.
  • If possible, assign one of the following labels to the PR: feature, fix, chore, documentation or test (or ask a maintainer to do it for you).

discussion related to that PR

@Andrewq11 Andrewq11 added the feature Annotates any PR that adds new features; Used in the release process label Sep 5, 2024
@Andrewq11 Andrewq11 self-assigned this Sep 5, 2024
@cwognum cwognum added this to the XL Datasets milestone Sep 6, 2024
Base automatically changed from feat/dataset-v2 to main September 11, 2024 19:43

@jstlaurent jstlaurent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments and question. Happy to see the new data structure is pretty similar. That means our initial design choices were good. 😄

Comment thread polaris/dataset/_base.py Outdated
Comment thread polaris/dataset/_base.py Outdated
Comment thread polaris/dataset/_base.py Outdated
Comment thread polaris/dataset/_base.py
Comment thread polaris/dataset/_base.py
Comment thread polaris/dataset/zarr/_manifest.py Outdated
Comment thread polaris/dataset/zarr/_manifest.py Outdated
Comment thread polaris/dataset/zarr/_manifest.py Outdated
Comment thread polaris/dataset/_base.py Outdated
Comment thread polaris/experimental/_dataset_v2.py Outdated
cwognum and others added 14 commits September 26, 2024 21:12
Test-driven development! Yeah
Now the fun starts...
* updates for calculating zarr manifests & adding basic tests for it

* moving cache_dir assignment to DatasetV1 and DatasetV2 model validators

* Updating argument types for parquet utils

* Updating argument types for md5 util

* fixing DatasetV1 export & dataset model validators

* PR feedback updates

* Adding test that checks the length of the manifest after update

* PR feedback
@jstlaurent jstlaurent force-pushed the feat/interact-with-xl-datasets branch from c3561a4 to 64e8799 Compare October 15, 2024 19:38
@jstlaurent jstlaurent self-assigned this Oct 16, 2024
@jstlaurent jstlaurent marked this pull request as ready for review October 16, 2024 02:33
@jstlaurent jstlaurent requested a review from cwognum as a code owner October 16, 2024 02:33
@jstlaurent jstlaurent changed the title XL Datasets: Upload/Download XL Datasets: Upload Oct 16, 2024
@jstlaurent jstlaurent assigned jstlaurent and unassigned jstlaurent and Andrewq11 Oct 16, 2024

@cwognum cwognum left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! I think it came together very nicely. Thanks @Andrewq11 and @jstlaurent ! 🙏

My only comment concerns the _md5sum field in the DatasetV2 class. I'm confused... I thought we explicitly decided to remove that from the V2 dataset, see e.g. #186 (comment) . Looking through the commit history, it seems I may have added it? Could it have been reintroduced due to rebasing this branch?

Context switching is hard... I feel like it was a whole other person that worked on this.

@Andrewq11 Andrewq11 left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't approve it (because I'm assigned to it), but it looks good to me as well. Left just a few comments but none are blocking. Thanks @jstlaurent!

Comment thread polaris/experimental/_dataset_v2.py Outdated
Comment thread polaris/experimental/_dataset_v2.py Outdated
Comment thread polaris/hub/client.py Outdated
Comment thread polaris/hub/client.py
@jstlaurent

Copy link
Copy Markdown
Contributor

My only comment concerns the _md5sum field in the DatasetV2 class. I'm confused... I thought we explicitly decided to remove that from the V2 dataset, see e.g. #186 (comment) . Looking through the commit history, it seems I may have added it? Could it have been reintroduced due to rebasing this branch?

@cwognum: It might have been a rebase issue. It got a little confusing at some point. So, it should be removed?

@cwognum cwognum left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jstlaurent Yes, it should be removed. I think #186 is a good PR to reference for what I would expect.

Since the checksum for the DatasetV2 dataset is not deterministic, it's a significant departure from the DatasetV1 class. For example, as @Andrewq11 pointed out, verifying the checksum after download doesn't make much sense.

I've highlighted the changes I think should be reverted, but may have missed some.

Comment thread polaris/dataset/_base.py Outdated
Comment thread polaris/dataset/_base.py
Comment thread polaris/dataset/_base.py
Comment thread polaris/dataset/_base.py Outdated
Comment thread polaris/experimental/_dataset_v2.py Outdated
Comment thread polaris/hub/client.py
Comment thread tests/test_dataset_v2.py Outdated

@cwognum cwognum left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jstlaurent ! 👍

@jstlaurent jstlaurent merged commit dbb441e into main Oct 21, 2024
@jstlaurent jstlaurent deleted the feat/interact-with-xl-datasets branch October 21, 2024 17:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature Annotates any PR that adds new features; Used in the release process

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants