[RFC] fetch: add batch-oids plumbing option #5061
base: main
Reads newline-delimited oids from stdin, and performs a download Batch API request, returning the results as JSON to stdout. This allows advanced users to access content directly from storage instead of downloading via LFS.
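As a rough illustration of how a consumer might use this, here is a minimal sketch that validates newline-delimited oids and pulls download URLs out of the JSON result. It assumes the JSON written to stdout mirrors the documented Batch API response shape (`objects[].actions.download.href`); the PR does not spell out the exact output format, and the function names `parse_oids` and `download_urls` are hypothetical.

```python
import json
import re

# Git LFS oids are SHA-256 digests written as 64 lowercase hex characters.
OID_RE = re.compile(r"[0-9a-f]{64}")


def parse_oids(text):
    """Return the oids found in newline-delimited text, skipping blank lines."""
    oids = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if not OID_RE.fullmatch(line):
            raise ValueError(f"not a SHA-256 oid: {line!r}")
        oids.append(line)
    return oids


def download_urls(batch_json):
    """Map oid -> download href, assuming a Batch API style response document."""
    response = json.loads(batch_json)
    urls = {}
    for obj in response.get("objects", []):
        # Objects the server could not resolve carry an "error" key and no actions.
        action = obj.get("actions", {}).get("download")
        if action:
            urls[obj["oid"]] = action["href"]
    return urls
```

Objects that come back with an error are simply skipped here; a real consumer would probably want to report them.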
Hey, I'm a little interested in your use case for this. In general, there's a lot more to processing a batch request than just extracting the URLs and making a request. For example, if the request isn't already authenticated, there's a lot of logic to decide about which authentication is to be used, to handle native Kerberos authentication (if that's in use), to handle TLS certificates and proxy information from the Git config, and so on. As a consequence, I'm having a little trouble imagining that this would be a generally useful feature to add, although I can admit that in some specific cases it might be more straightforward. But I can fully anticipate that maybe I'm not understanding why this is valuable, so that's why I'm asking. I will say that I think this would be best put in a separate API target, such as Finally, as I mentioned in #5065, the core team is about to be on vacation, and so we'll probably discuss this more when we get back.
For example, a git+lfs repository containing large data files of some flavour that are HTTP-consumable. In the geospatial world, there are Cloud Optimised GeoTIFFs, flatgeobuf files, Cloud Optimised Point Clouds, and others — these are all file formats with predefined layouts which can be usefully and efficiently direct-accessed from object storage systems (S3/CDNs/etc) using HTTP Range requests — streaming just an overview of a dataset, or a specific subset to a user. The same concepts apply in other science/data-wrangling domains too. It's logical to manage these files via Git+LFS, but many read-only repo users don't need to actually download the full files: they could clone the git repo, get all the metadata, then pass object URLs to software that already understands the file format and can stream only the subsets of the LFS data objects it actually wants to the local machine.
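The direct-access pattern described above can be sketched as follows. This is a minimal illustration, assuming a download action shaped like the Batch API's (`href` plus an optional `header` map); the helper name `range_request` is hypothetical, and the sketch only builds the request without sending it.

```python
import urllib.request


def range_request(action, start, end):
    """Build a GET request for bytes start..end (inclusive) of an LFS object."""
    if start < 0 or end < start:
        raise ValueError("invalid byte range")
    # Carry over any headers the server asked for (e.g. authorization).
    headers = dict(action.get("header") or {})
    headers["Range"] = f"bytes={start}-{end}"
    return urllib.request.Request(action["href"], headers=headers, method="GET")
```

Passing the result to `urllib.request.urlopen` should yield a `206 Partial Content` response when the storage backend honours range requests; a server that ignores the header would return the whole object with `200`.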
It isn't intended to solve all the potential use cases on day one, and network/auth/etc aspects will definitely complicate things :-) A user could always make assumptions about the server, auth, and network environment to achieve the same thing (if I know an object url is alongside the git repo at
Yes, feels like
No hurry, enjoy your break 😃
Okay, I think I'm okay with this as If you have questions about how to do any of that or need help, please say so, and we'll try to help out.
@bk2204 thanks! FYI, if I don't get time for it in the next week it's likely to be mid-August until I get back to it again.
That's fine. There's no rush. Whenever you're ready, we can look at it; just let us know.
I've been working on some other stuff recently, but I'll attempt to update this over the holiday period.
Adds the ability to perform a download Batch API request from the command line, without downloading the objects. The use case is to get content URLs for direct access, e.g. for use with HTTP range requests or with software that can read HTTP URLs directly. This command reuses all the significant config, server discovery, authentication, and transfer adapter steps necessary for the Batch API request, which is where the key value lies.
The new `--batch-oids` option to `lfs fetch` reads newline-delimited oids from stdin, performs a download Batch API request for those objects, and returns the results as a JSON object to stdout. There are some caveats:
What I've attached here is a basic implementation, but it will obviously need some further work.

Questions

- Is `fetch` a reasonable place for this to live?

Todo

- `Action.expires_at` should be `null`/absent in the JSON output, not `"0001-01-01T00:00:00Z"`
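Until that is addressed, a consumer of the JSON output could normalise the value itself. `"0001-01-01T00:00:00Z"` is what Go's zero `time.Time` serialises to, so it carries no real expiry information. The sketch below is a consumer-side workaround only, not the fix proposed for the command; `parse_expires_at` is a hypothetical helper, and it assumes timestamps without unusual fractional-second precision.

```python
from datetime import datetime

# Go's zero time.Time marshals to this string; treat it as "no expiry given".
GO_ZERO_TIME = "0001-01-01T00:00:00Z"


def parse_expires_at(value):
    """Return an aware datetime, or None when expires_at is missing or zero."""
    if not value or value == GO_ZERO_TIME:
        return None
    # Older datetime.fromisoformat versions reject a trailing "Z", so map it to an offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))
```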