feat: Bucket operations using storage token #200
Merged
70138cc to 2deb92b
Andrewq11
reviewed
Sep 17, 2024
Andrewq11
left a comment
Contributor
Thanks for this @jstlaurent, you weren't kidding when you said this was growing into something bigger than expected 😅
Looks really good and I'm excited to get this integrated for the V2 datasets. Left some comments and suggestions!
…g the storage token with the bucket
…fficient fsspec implementation. Use storage token to get Zarr content, when present.
…rr's FSStore, I implemented a S3 store for Zarr, without intermediary library.
cwognum
approved these changes
Sep 23, 2024
cwognum
left a comment
Collaborator
Thank you @jstlaurent ! Great work! Excited to see this come together!
mercuryseries
approved these changes
Sep 24, 2024
mercuryseries
left a comment
Contributor
Thanks, @jstlaurent, for the massive work on this PR and for providing such a detailed description – I think I needed every bit of it! 😅 I've left a few comments and suggestions throughout. Overall, it's looking great!
6bbc284 to d7908f5
6efaf64 to 1aa72af
1aa72af to 02b1b72
Changelogs
This PR enables the use of a Hub-supplied storage JWT to access the R2 bucket. Compared to our current approach of signed URLs and an fsspec-compliant implementation on top of the Hub's REST API, this approach does not require any requests to the Hub while the storage JWT is valid. This storage token is obtained through a standard OAuth exchange flow with the Hub. A client in possession of a valid, Hub-issued JWT (used to interact with the Hub's REST API) can exchange it for a valid storage JWT, enabling access for a scope (read or write) on one resource (specified as a URN; datasets are currently the only resource supported).
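To make the exchange concrete, here is a minimal sketch of what the client side could look like. It assumes the Hub follows the RFC 8693 token exchange parameter names, which this description does not confirm; the function names and the example URN are hypothetical, and the expiry check reads the `exp` claim without verifying the signature.

```python
import base64
import json
import time


def build_exchange_request(hub_token: str, scope: str, resource_urn: str) -> dict:
    """Form fields for an RFC 8693 style exchange (assumed, not confirmed for the Hub)."""
    return {
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "subject_token": hub_token,
        "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
        "scope": scope,            # "read" or "write"
        "resource": resource_urn,  # hypothetical URN format for a dataset
    }


def jwt_expiry(token: str) -> float:
    """Read the `exp` claim from a JWT payload without verifying the signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    return float(claims["exp"])


def is_token_valid(token: str, now=None, leeway: float = 30.0) -> bool:
    """True while the token has more than `leeway` seconds left."""
    now = time.time() if now is None else now
    return jwt_expiry(token) - leeway > now
```

While `is_token_valid` returns true, the client can talk to the bucket directly; only once it returns false does it need to go back to the Hub for a new exchange.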
The final implementation is a context manager that handles getting a valid storage token when instantiated, and which exposes methods plus a Zarr store to support the current `Dataset` operations: getting/setting the root Parquet file, and getting/setting content in the optional Zarr extension. The approach should be flexible enough to support the new flows in the upcoming XL Datasets.
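The shape of such a context manager can be sketched as follows. This is an illustration only, not the code in this PR: `StorageSession`, `fetch_token`, and the `root.parquet` key are hypothetical names, the token is fetched on entering the `with` block for simplicity, and an in-memory dict stands in for the R2 bucket.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class StorageSession:
    """Hypothetical stand-in for the context manager described above."""

    fetch_token: Callable[[], str]  # performs the token exchange with the Hub
    bucket: Dict[str, bytes] = field(default_factory=dict)  # stands in for R2
    token: Optional[str] = None

    def __enter__(self) -> "StorageSession":
        # Obtain a storage token before any bucket operation is attempted.
        self.token = self.fetch_token()
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Drop the token; returning None lets exceptions propagate.
        self.token = None

    def _require_token(self) -> None:
        if self.token is None:
            raise RuntimeError("use StorageSession inside a 'with' block")

    def set_root(self, data: bytes) -> None:
        """Store the root Parquet file."""
        self._require_token()
        self.bucket["root.parquet"] = data

    def get_root(self) -> bytes:
        """Fetch the root Parquet file."""
        self._require_token()
        return self.bucket["root.parquet"]
```

A caller would write `with StorageSession(fetch_token=...) as session:` and use `session.get_root()` or `session.set_root(...)` inside the block, so token handling never leaks into the dataset code.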
Along the way, some adventures led me to implement a Zarr store for S3. My initial approach, using Zarr's `FSStore` on top of either fsspec's `s3fs` or PyArrow's `S3FileSystem` implementation, encountered some snags:
- […] `s3fs` does not do.
- `PyArrow` […] `FSStore`, since a PyArrow filesystem is not fsspec-compliant, and that's what Zarr's `FSStore` expects.

Ultimately, building a store implementation directly over Boto3 ended up being the only answer that would satisfy our needs: equally-sized parts for multipart uploads, and integrity checks for the uploaded content.
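The two requirements named above can be illustrated with a small sketch. It is not the store from this PR: the helper names are hypothetical, and using MD5 as the integrity check is an assumption. Each yielded part could then be sent with Boto3's `upload_part`, passing the digest as `ContentMD5`.

```python
import base64
import hashlib
from typing import Iterator, Tuple


def iter_equal_parts(data: bytes, part_size: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (part_number, chunk) pairs; every chunk except the last has part_size bytes."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    for offset in range(0, len(data), part_size):
        part_number = offset // part_size + 1  # S3 part numbers start at 1
        yield part_number, data[offset:offset + part_size]


def content_md5(part: bytes) -> str:
    """Base64-encoded MD5 digest, the format expected by the Content-MD5 header."""
    return base64.b64encode(hashlib.md5(part).digest()).decode("ascii")
```

Because the split is driven by a fixed `part_size`, only the final part can be shorter, and the per-part digest lets the server reject a part whose bytes were corrupted in transit.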
Additional work
Closes #173
Checklist:
`feature`, `fix`, `chore`, `documentation` or `test` (or ask a maintainer to do it for you).