[RFC] fetch: add batch-oids plumbing option #5061

rcoup · 2022-06-29T09:33:00Z

Adds the ability to perform a download Batch API request from the command line without downloading the objects. The use case is getting content URLs to access directly, e.g. for HTTP range requests or for software that can read HTTP URLs itself. The command reuses all the significant config, server discovery, authentication, and transfer adapter steps needed for the Batch API request, which is where the key value lies.

$ git lfs fetch --batch-oids <<EOF
4c95c46a18f120048d7bcf5a3f665c26965844e41217da1488810acabf69a990
61b4c705859f4158d38090c1e38e8fdc4f3d29db007f012766276aa498835cf6
EOF

{
  "objects": [
    {
      "oid": "4c95c46a18f120048d7bcf5a3f665c26965844e41217da1488810acabf69a990",
      "size": 12345,
      "authenticated": true,
      "actions": {
        "download": {
          "href": "https://example.com/lfs/myrepo/4c95c46a18f120048d7bcf5a3f665c26965844e41217da1488810acabf69a990",
          "expires_at": "0001-01-01T00:00:00Z",
          "expires_in": 3600
        }
      }
    },
    {
      "oid": "61b4c705859f4158d38090c1e38e8fdc4f3d29db007f012766276aa498835cf6",
      "size": 23456,
      "authenticated": true,
      "actions": {
        "download": {
          "href": "https://example.com/lfs/myrepo/61b4c705859f4158d38090c1e38e8fdc4f3d29db007f012766276aa498835cf6",
          "expires_at": "0001-01-01T00:00:00Z",
          "expires_in": 3600
        }
      }
    }
  ],
  "transfer": "basic",
  "hash_algo": "sha256"
}
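For illustration, a caller might consume output like the sample above as follows. This is a minimal sketch, not part of the patch: the function name `download_urls` is hypothetical, and the embedded sample is trimmed to a single object. It assumes objects the server could not resolve carry an `error` member instead of `actions`, as the Batch API describes.

```python
import json

# One object from the sample output above, trimmed for brevity.
SAMPLE = """
{
  "objects": [
    {
      "oid": "4c95c46a18f120048d7bcf5a3f665c26965844e41217da1488810acabf69a990",
      "size": 12345,
      "authenticated": true,
      "actions": {
        "download": {
          "href": "https://example.com/lfs/myrepo/4c95c46a18f120048d7bcf5a3f665c26965844e41217da1488810acabf69a990",
          "expires_in": 3600
        }
      }
    }
  ],
  "transfer": "basic",
  "hash_algo": "sha256"
}
"""


def download_urls(batch_json):
    """Map each oid to its download href (hypothetical helper).

    Objects without a download action (for example, ones reported with
    an "error" member) are skipped rather than raising.
    """
    response = json.loads(batch_json)
    urls = {}
    for obj in response.get("objects", []):
        download = obj.get("actions", {}).get("download")
        if download and "href" in download:
            urls[obj["oid"]] = download["href"]
    return urls


for oid, href in download_urls(SAMPLE).items():
    print(oid, href)
```

In practice the JSON would come from the command's stdout rather than an embedded string.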

The new --batch-oids option to lfs fetch reads newline-delimited oids from stdin, performs a download Batch API request for those objects, and writes the result to stdout as a JSON object. There are some caveats:

The caller is responsible for understanding auth/expires/etc and potentially refreshing URLs.
Object URLs may need to use authentication from Git's credentials store/etc if they're not part of the Batch response.
Conceptually, the option could be extended to rewrite URLs/responses further, down to the final URL and headers for a download request. IMO that can wait until there's some clear feedback from users.
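As a rough illustration of the first caveat, a caller could track expiry along these lines. This is a sketch under stated assumptions: `expiry` and `needs_refresh` are hypothetical names, the 60-second safety margin is arbitrary, and the zero timestamp is treated as "not set" because that is how a missing expires_at currently appears in the sample output. It prefers `expires_in` over `expires_at` when both are present, which is what the Batch API recommends.

```python
from datetime import datetime, timedelta, timezone

# Go's zero time; the sample output shows this where expires_at is missing.
ZERO_TIME = "0001-01-01T00:00:00Z"


def expiry(action, fetched_at):
    """Return when a download action expires, or None if it does not say.

    `fetched_at` is the (timezone-aware) time the batch response was received.
    """
    expires_in = action.get("expires_in")
    if expires_in:
        # Relative lifetime, counted from the local receipt time.
        return fetched_at + timedelta(seconds=expires_in)
    expires_at = action.get("expires_at")
    if expires_at and expires_at != ZERO_TIME:
        # fromisoformat() on older Pythons does not accept a trailing "Z".
        return datetime.fromisoformat(expires_at.replace("Z", "+00:00"))
    return None


def needs_refresh(action, fetched_at, now, margin=timedelta(seconds=60)):
    """True if the URL is expired or will expire within `margin`."""
    deadline = expiry(action, fetched_at)
    return deadline is not None and now >= deadline - margin


fetched = datetime.now(timezone.utc)
action = {"href": "https://example.com/...", "expires_at": ZERO_TIME, "expires_in": 3600}
print(needs_refresh(action, fetched, fetched + timedelta(minutes=5)))
```

When `needs_refresh` returns true, the caller would rerun the batch request to obtain fresh URLs.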

What I've attached here is a basic implementation, but it will obviously need some further work.

Questions

Is this something the team would accept?
Is fetch a reasonable place for this to live?
Comments on the implementation, caveats noted above, etc.

Todo

tests
fix the output encoding of the response (particularly a missing Action.expires_at should be null/absent in the JSON output, not "0001-01-01T00:00:00Z")
check it works via the SSH transfer/adapter mechanism (may be of limited/no use, but shouldn't break)
docs

Reads newline-delimited oids from stdin, and performs a download Batch API request, returning the results as JSON to stdout. This allows advanced users to access content directly from storage instead of downloading via LFS.

bk2204 · 2022-06-30T18:03:10Z

Hey,

I'm a little interested in your use case for this. In general, there's a lot more to processing a batch request than just extracting the URLs and making a request. For example, if the request isn't already authenticated, there's a lot of logic to decide about which authentication is to be used, to handle native Kerberos authentication (if that's in use), to handle TLS certificates and proxy information from the Git config, and so on.

As a consequence, I'm having a little trouble imagining that this would be a generally useful feature to add, although I can admit that in some specific cases it might be more straightforward. But I can fully anticipate that maybe I'm not understanding why this is valuable, so that's why I'm asking.

I will say that I think this would be best put in a separate API target, such as git lfs api batch, although I wouldn't suggest implementing that change quite yet while we discuss whether we'd like to have this as a feature.

Finally, as I mentioned in #5065, the core team is about to be on vacation, and so we'll probably discuss this more when we get back.

rcoup · 2022-07-01T14:05:48Z

> I'm a little interested in your use case for this.

For example, a git+lfs repository containing large data files of some flavour that are HTTP-consumable. In the geospatial world, there are Cloud Optimised GeoTIFFs, flatgeobuf files, Cloud Optimised Point Clouds, and others — these are all file formats with predefined layouts which can be usefully and efficiently direct-accessed from object storage systems (S3/CDNs/etc) using HTTP Range requests — streaming just an overview of a dataset, or a specific subset to a user. The same concepts apply in other science/data-wrangling domains too.

It's logical to manage these files via Git+LFS, but many read-only repo users don't need to actually download the full files: they could clone the git repo, get all the metadata, then pass object URLs to software that already understands the file format and can stream only the subsets of the LFS data objects it actually wants to the local machine.
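To make the range-request idea concrete, here is a minimal sketch of fetching one byte range from an object URL. The names `range_request` and `read_range` are hypothetical, and the sketch assumes the storage server honours HTTP Range requests and that any extra headers the download action supplies are passed in by the caller.

```python
import urllib.request


def range_request(href, start, length, headers=None):
    """Build (but do not send) a GET for bytes [start, start + length)."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    request = urllib.request.Request(href, headers=dict(headers or {}))
    # HTTP byte ranges are inclusive at both ends.
    request.add_header("Range", f"bytes={start}-{start + length - 1}")
    return request


def read_range(href, start, length, headers=None):
    """Send the request and return the partial body."""
    with urllib.request.urlopen(range_request(href, start, length, headers)) as response:
        # 206 Partial Content means the server honoured the Range header;
        # a plain 200 would mean it is sending the whole object.
        if response.status != 206:
            raise RuntimeError(f"range not honoured (HTTP {response.status})")
        return response.read()


# Example: the request for the first 16 KiB of an object (not sent here).
req = range_request("https://example.com/lfs/myrepo/<oid>", 0, 16 * 1024)
print(req.get_header("Range"))
```

Format-aware software would typically issue many such requests itself; the point is only that an object URL plus a Range header is enough to read a subset.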

> In general, there's a lot more to processing a batch request than just extracting the URLs and making a request. For example, if the request isn't already authenticated, there's a lot of logic to decide about which authentication is to be used, to handle native Kerberos authentication (if that's in use), to handle TLS certificates and proxy information from the Git config, and so on.

It isn't intended to solve all the potential use cases on day one, and network/auth/etc aspects will definitely complicate things :-) A user could always make assumptions about the server, auth, and network environment to achieve the same thing (if I know an object url is alongside the git repo at https://example.com/{repo}.git/lfs/objects/{oid} then I can go from oid → URL myself).
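The "assume the layout" shortcut mentioned above would look roughly like this. It is a sketch of the assumption, not of anything Git LFS guarantees: `guess_object_url` is a hypothetical helper, and the `/lfs/objects/{oid}` layout only holds for servers that happen to expose objects that way.

```python
HEX_DIGITS = set("0123456789abcdef")


def guess_object_url(repo_url, oid):
    """Build an object URL from a repo URL, assuming the
    {repo}.git/lfs/objects/{oid} layout (server-specific, not guaranteed)."""
    if len(oid) != 64 or not set(oid) <= HEX_DIGITS:
        raise ValueError("expected a 64-character lowercase hex SHA-256 oid")
    return f"{repo_url.rstrip('/')}/lfs/objects/{oid}"


print(guess_object_url(
    "https://example.com/myrepo.git",
    "4c95c46a18f120048d7bcf5a3f665c26965844e41217da1488810acabf69a990",
))
```

This skips server discovery and authentication entirely, which is exactly the work a batch request via git-lfs would otherwise do for the caller.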

> I will say that I think this would be best put in a separate API target, such as git lfs api batch, although I wouldn't suggest implementing that change quite yet while we discuss whether we'd like to have this as a feature.

Yes, feels like api batch might be a useful approach: then it can be clear what it is/isn't doing; and that it's low-level plumbing — just a manual way to perform some steps of what "workflow" commands like lfs fetch undertake.

git-lfs ls-files              path → oids
git-lfs api batch             batch request: oids → download actions
git-lfs api download basic    basic-transfer download action → request/stream
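The first step in that list (path → oids) already works today. As a sketch, the output of `git lfs ls-files --long` could be turned into a path → oid mapping like this. The helper names and the sample paths are hypothetical, and the parser assumes each line has the form `<oid> <marker> <path>`, where the marker is `*` or `-`.

```python
import subprocess


def parse_ls_files(output):
    """Parse `git lfs ls-files --long` text into {path: oid}."""
    oids = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        # Split on the first two spaces only, so paths may contain spaces.
        oid, _marker, path = line.split(" ", 2)
        oids[path] = oid
    return oids


def list_oids(repo_dir="."):
    """Run git-lfs in `repo_dir` and return {path: oid}."""
    result = subprocess.run(
        ["git", "lfs", "ls-files", "--long"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return parse_ls_files(result.stdout)


# Hypothetical sample output; the paths are made up for illustration.
SAMPLE_OUTPUT = (
    "4c95c46a18f120048d7bcf5a3f665c26965844e41217da1488810acabf69a990 * data/overview.tif\n"
    "61b4c705859f4158d38090c1e38e8fdc4f3d29db007f012766276aa498835cf6 - data/points.copc.laz\n"
)
for path, oid in parse_ls_files(SAMPLE_OUTPUT).items():
    print(path, oid)
```

The resulting oids are what the proposed batch command would read from stdin, one per line.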

> Finally, as I mentioned in #5065, the core team is about to be on vacation, and so we'll probably discuss this more when we get back.

No hurry, enjoy your break 😃

bk2204 · 2022-07-12T20:07:50Z

Okay, I think I'm okay with this as git lfs api batch if you want to do that. I'd like to make this experimental without backward compatibility guarantees for now, so that should be noted in the manual page (which, once you rebase on main, will need to be in AsciiDoc). We'll also need tests in t/ for this to make sure it continues to work as expected.

If you have questions about how to do any of that or need help, please say so, and we'll try to help out.

rcoup · 2022-07-12T20:56:22Z

@bk2204 thanks! FYI, if I don't get time for it in the next week it's likely to be mid-August until I get back to it again.

bk2204 · 2022-07-13T12:06:32Z

That's fine. There's no rush. Whenever you're ready, we can look at it; just let us know.

rcoup · 2022-12-13T22:29:26Z

I've been working on some other stuff recently, but I'll attempt to update this over the holiday period.

fetch: add batch-oids plumbing option

d7ccde3


rcoup requested a review from a team as a code owner June 29, 2022 09:33

AnuchitB approved these changes Dec 1, 2022
