Skip to content

feat: Bucket operations using storage token#200

Merged
jstlaurent merged 12 commits into
mainfrom
feat/bucket-token-auth
Sep 26, 2024
Merged

feat: Bucket operations using storage token#200
jstlaurent merged 12 commits into
mainfrom
feat/bucket-token-auth

Conversation

@jstlaurent

@jstlaurent jstlaurent commented Sep 17, 2024

Copy link
Copy Markdown
Contributor

Changelogs

This PR enables the use of a Hub-supplied storage JWT to access the R2 bucket. Compared to our current approach of signed URLs and fsspec-compliant implementation on top of the Hub's REST API, this approach does not require any requests to the Hub while the storage JWT is valid. This storage token is obtained through a standard OAuth exchange flow with the Hub. A client in possession of a valid, Hub-issued JWT (used to interact with the Hub's REST API) can exchange it for a valid storage JWT, enabling access for a scope (read or write) on one resource (specified as a URN, where datasets are currently the only resource supported.)

The final implementation is a context manager that handles getting a valid storage token when instantiated, and which exposes methods plus a Zarr store to support the current Dataset operations: getting/setting the root Parquet file, and getting/setting content in the optional Zarr extension.

The approach should be flexible enough to support the new flows in the upcoming XL Datasets.

Along the way, some adventures lead me to implement a Zarr store for S3. My initial approach to use Zarr's FSStore, on top of either fssspec's s3fs or PyArrow's S3FileSystem implementations encountered some snags:

  • R2 has the specificity that, when using multipart upload, all parts (except for the last) must be of equal size. Something S3 does not require, and something s3fs does not do.
  • PyArrow's implementation has been updated to fix the issue with R2, but being based on AWS' C++ SDK, does not offer the ability to set MD5 checksums on upload, or set metadata on the uploaded files. It's unfortunate that I discovered this after having built a full PyArrowFSStore, since a PyArrow filesystem is not fsspec-compliant, and that's what Zarr FSStore expects.

Ultimately, building a store implementation directly over Boto3 ended up the only answer that would satisfy our needs: equally-sized parts for multipart uploads, and integrity checks for the uploaded content.

Additional work

  • This could do a better job of handling token expiry. There's probably a way to gracefully fetch a new storage token if it expires, provided the client's underlying token is still valid.
  • The multipart upload could be the entrypoint for making a large upload retryable. It likewise could be made concurrent for some performance gains.

Closes #173


Checklist:

  • Was this PR discussed in an issue? It is recommended to first discuss a new feature into a GitHub issue before opening a PR.
  • Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
  • Update the API documentation if a new function is added, or an existing one is deleted.
  • Write concise and explanatory changelogs above.
  • If possible, assign one of the following labels to the PR: feature, fix, chore, documentation or test (or ask a maintainer to do it for you).

@jstlaurent jstlaurent requested a review from cwognum as a code owner September 17, 2024 03:30
@jstlaurent jstlaurent added the feature Annotates any PR that adds new features; Used in the release process label Sep 17, 2024
@jstlaurent jstlaurent self-assigned this Sep 17, 2024
@jstlaurent jstlaurent added this to the XL Datasets milestone Sep 17, 2024
@jstlaurent jstlaurent force-pushed the feat/bucket-token-auth branch from 70138cc to 2deb92b Compare September 17, 2024 03:59

@Andrewq11 Andrewq11 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @jstlaurent, you weren't kidding when you said this was growing into something bigger than expected 😅

Looks really good and I'm excited to get this integrated for the V2 datasets. Left some comments and suggestions!

Comment thread polaris/_artifact.py Outdated
Comment thread polaris/benchmark/_base.py
Comment thread polaris/dataset/_dataset.py Outdated
Comment thread polaris/hub/storage.py Outdated
Comment thread polaris/hub/client.py
Comment thread polaris/hub/storage.py Outdated
Comment thread polaris/hub/storage.py Outdated
Comment thread polaris/dataset/_dataset.py Outdated
Comment thread polaris/dataset/_dataset.py Outdated
Comment thread polaris/dataset/_dataset.py Outdated

@cwognum cwognum left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jstlaurent ! Great work! Excited to see this come together!

Comment thread polaris/hub/client.py Outdated
Comment thread polaris/dataset/_dataset.py Outdated
Comment thread polaris/hub/storage.py Outdated
Comment thread tests/test_storage.py
Comment thread pyproject.toml
Comment thread tests/test_storage.py
Comment thread tests/test_storage.py Outdated

@mercuryseries mercuryseries left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @jstlaurent, for the massive work on this PR and for providing such a detailed description – I think I needed every bit of it! 😅 I've left a few comments and suggestions throughout. Overall, it's looking great!

Comment thread .gitignore
Comment thread polaris/_artifact.py Outdated
Comment thread polaris/_artifact.py Outdated
Comment thread polaris/_artifact.py
Comment thread polaris/dataset/_dataset.py Outdated
Comment thread polaris/hub/storage.py
Comment thread polaris/utils/errors.py
@jstlaurent jstlaurent force-pushed the feat/bucket-token-auth branch from 6bbc284 to d7908f5 Compare September 24, 2024 13:48
@jstlaurent jstlaurent force-pushed the feat/bucket-token-auth branch from 6efaf64 to 1aa72af Compare September 24, 2024 14:30
@jstlaurent jstlaurent force-pushed the feat/bucket-token-auth branch from 1aa72af to 02b1b72 Compare September 24, 2024 14:38
@jstlaurent jstlaurent merged commit 23fd61e into main Sep 26, 2024
@jstlaurent jstlaurent deleted the feat/bucket-token-auth branch September 26, 2024 02:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature Annotates any PR that adds new features; Used in the release process

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Client Bucket Authentication: Get a bucket JWT from the Hub and use it to access the R2 bucket

4 participants