[RFC] fetch: add batch-oids plumbing option #5061

Open
wants to merge 1 commit into
base: main

rcoup (Contributor) commented Jun 29, 2022

Adds the ability to perform a download Batch API request from the command line without downloading the objects. The use case is getting content URLs to access directly, e.g. with HTTP range requests or software that can read from HTTP URLs itself. The command reuses all the significant config, server discovery, authentication, and transfer adapter steps needed for the Batch API request, which is where the key value lies.

$ git lfs fetch --batch-oids <<EOF
4c95c46a18f120048d7bcf5a3f665c26965844e41217da1488810acabf69a990
61b4c705859f4158d38090c1e38e8fdc4f3d29db007f012766276aa498835cf6
EOF

{
  "objects": [
    {
      "oid": "4c95c46a18f120048d7bcf5a3f665c26965844e41217da1488810acabf69a990",
      "size": 12345,
      "authenticated": true,
      "actions": {
        "download": {
          "href": "https://example.com/lfs/myrepo/4c95c46a18f120048d7bcf5a3f665c26965844e41217da1488810acabf69a990",
          "expires_at": "0001-01-01T00:00:00Z",
          "expires_in": 3600
        }
      }
    },
    {
      "oid": "61b4c705859f4158d38090c1e38e8fdc4f3d29db007f012766276aa498835cf6",
      "size": 23456,
      "authenticated": true,
      "actions": {
        "download": {
          "href": "https://example.com/lfs/myrepo/61b4c705859f4158d38090c1e38e8fdc4f3d29db007f012766276aa498835cf6",
          "expires_at": "0001-01-01T00:00:00Z",
          "expires_in": 3600
        }
      }
    }
  ],
  "transfer": "basic",
  "hash_algo": "sha256"
}

The new --batch-oids option to lfs fetch reads newline-delimited oids from stdin, performs a download Batch API request for those objects, and writes the results to stdout as a JSON object. There are some caveats:

  • The caller is responsible for understanding auth/expires/etc and potentially refreshing URLs
  • Object URLs may need to use authentication from Git's credentials store/etc if they're not part of the Batch response.
  • Conceptually, it could be enhanced to rewrite URLs/responses further, down to the final URL and headers for a download request. IMO this can wait until there's clear feedback from users.
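
To illustrate the first two caveats, here is a minimal sketch of what a caller might do with the JSON output. It is not part of the PR. The function name download_actions is hypothetical, and the field names (href, header, expires_in, expires_at) are taken from the Batch API response shape shown above plus the optional header field from the Batch API. It prefers expires_in over expires_at, and treats the zero timestamp that the draft currently emits as "no expiry given".

```python
import json
from datetime import datetime, timedelta, timezone

# Go's zero time, which the draft currently emits when expires_at is missing.
ZERO_TIME = "0001-01-01T00:00:00Z"


def download_actions(batch_json, now=None):
    """Map each oid to (href, headers, expiry) from a batch response.

    `expiry` is an aware datetime, or None when the server gave no expiry.
    Hypothetical helper: the caller still has to refresh expired URLs and
    supply credentials if the response does not carry them.
    """
    now = now or datetime.now(timezone.utc)
    result = {}
    for obj in json.loads(batch_json).get("objects", []):
        action = obj.get("actions", {}).get("download")
        if action is None:
            # Object-level error, or no download action offered: skip it.
            continue
        expiry = None
        expires_at = action.get("expires_at")
        if "expires_in" in action:
            # Relative expiry takes precedence when both fields are present.
            expiry = now + timedelta(seconds=action["expires_in"])
        elif expires_at and expires_at != ZERO_TIME:
            expiry = datetime.fromisoformat(expires_at.replace("Z", "+00:00"))
        result[obj["oid"]] = (action["href"], action.get("header", {}), expiry)
    return result
```

An empty headers dict here only means the Batch response carried no headers; the URL may still need credentials from Git's credential store.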

What I've attached here is a basic implementation that will obviously need further work. Questions:

  1. Is this something the team would accept?
  2. Is fetch a reasonable place for this to live?
  3. Comments on the implementation, caveats noted above, etc.

Todo

  • tests
  • fix the output encoding of the response (particularly a missing Action.expires_at should be null/absent in the JSON output, not "0001-01-01T00:00:00Z")
  • check it works via the SSH transfer/adapter mechanism (may be of limited/no use, but shouldn't break)
  • docs

Reads newline-delimited oids from stdin, and performs a download Batch
API request, returning the results as JSON to stdout.

This allows advanced users to access content directly from storage
instead of downloading via LFS.

rcoup requested a review from a team as a code owner on June 29, 2022 09:33

bk2204 (Member) commented Jun 30, 2022

Hey,

I'm a little interested in your use case for this. In general, there's a lot more to processing a batch request than just extracting the URLs and making a request. For example, if the request isn't already authenticated, there's a lot of logic to decide about which authentication is to be used, to handle native Kerberos authentication (if that's in use), to handle TLS certificates and proxy information from the Git config, and so on.

As a consequence, I'm having a little trouble imagining that this would be a generally useful feature to add, although I can admit that in some specific cases it might be more straightforward. But I can fully anticipate that maybe I'm not understanding why this is valuable, so that's why I'm asking.

I will say that I think this would be best put in a separate API target, such as git lfs api batch, although I wouldn't suggest implementing that change quite yet while we discuss whether we'd like to have this as a feature.

Finally, as I mentioned in #5065, the core team is about to be on vacation, and so we'll probably discuss this more when we get back.

rcoup (Contributor, Author) commented Jul 1, 2022

> I'm a little interested in your use case for this.

For example, a git+lfs repository containing large data files of some flavour that are HTTP-consumable. In the geospatial world, there are Cloud Optimised GeoTIFFs, flatgeobuf files, Cloud Optimised Point Clouds, and others — these are all file formats with predefined layouts which can be usefully and efficiently direct-accessed from object storage systems (S3/CDNs/etc) using HTTP Range requests — streaming just an overview of a dataset, or a specific subset to a user. The same concepts apply in other science/data-wrangling domains too.

It's logical to manage these files via Git+LFS, but many read-only repo users don't need to actually download the full files: they could clone the git repo, get all the metadata, then pass object URLs to software that already understands the file format and can stream only the subsets of the LFS data objects it actually wants to the local machine.
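
As a rough illustration of that last step, this is a minimal sketch of fetching only a byte range from a download URL with the Python standard library. The helper names build_range_request and read_range are hypothetical, the headers argument stands for whatever headers the batch response (or the caller's own auth) supplies, and whether the storage backend honours Range requests is an assumption about the server.

```python
import urllib.request


def build_range_request(href, start, length, headers=None):
    """Build a GET request for bytes [start, start + length) of href."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    req = urllib.request.Request(href, headers=dict(headers or {}))
    # HTTP byte ranges are inclusive at both ends.
    req.add_header("Range", f"bytes={start}-{start + length - 1}")
    return req


def read_range(href, start, length, headers=None):
    """Fetch the requested bytes, failing if the server ignores Range."""
    req = build_range_request(href, start, length, headers)
    with urllib.request.urlopen(req) as resp:
        # 206 is Partial Content; a 200 means the whole object is coming.
        if resp.status != 206:
            raise RuntimeError(f"range request not honoured: HTTP {resp.status}")
        return resp.read()
```

Format-aware readers (for example, ones that understand Cloud Optimised GeoTIFFs) issue requests like this internally; the sketch only shows the underlying HTTP mechanism.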

> In general, there's a lot more to processing a batch request than just extracting the URLs and making a request. For example, if the request isn't already authenticated, there's a lot of logic to decide about which authentication is to be used, to handle native Kerberos authentication (if that's in use), to handle TLS certificates and proxy information from the Git config, and so on.

It isn't intended to solve all the potential use cases on day one, and network/auth/etc aspects will definitely complicate things :-) A user could always make assumptions about the server, auth, and network environment to achieve the same thing. For example, if I know an object URL sits alongside the git repo at https://example.com/{repo}.git/lfs/objects/{oid}, then I can go from oid → URL myself.
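
That "make assumptions" shortcut might look like the sketch below. The function name guess_object_url is hypothetical, and the URL layout is exactly the server-specific guess described above: nothing guarantees a given LFS server stores objects at this path, which is the gap a real Batch API request closes.

```python
import re

# SHA-256 oids are 64 lowercase hex characters.
OID_RE = re.compile(r"[0-9a-f]{64}")


def guess_object_url(repo_url, oid):
    """Guess an object URL, ASSUMING the {repo}.git/lfs/objects/{oid} layout."""
    if not OID_RE.fullmatch(oid):
        raise ValueError(f"not a sha256 oid: {oid!r}")
    base = repo_url.rstrip("/")
    if not base.endswith(".git"):
        base += ".git"
    return f"{base}/lfs/objects/{oid}"
```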

> I will say that I think this would be best put in a separate API target, such as git lfs api batch, although I wouldn't suggest implementing that change quite yet while we discuss whether we'd like to have this as a feature.

Yes, api batch feels like a useful approach: it makes clear what the command is and isn't doing, and that it's low-level plumbing, just a manual way to perform some of the steps that "workflow" commands like lfs fetch undertake:

  • git-lfs ls-files: path → oids
  • git-lfs api batch: batch request, oids → download actions
  • git-lfs api download basic: basic-transfer download action → request/stream

> Finally, as I mentioned in #5065, the core team is about to be on vacation, and so we'll probably discuss this more when we get back.

No hurry, enjoy your break 😃

bk2204 (Member) commented Jul 12, 2022

Okay, I think I'm okay with this as git lfs api batch if you want to do that. I'd like to make this experimental without backward compatibility guarantees for now, so that should be noted in the manual page (which, once you rebase on main, will need to be in AsciiDoc). We'll also need tests in t/ for this to make sure it continues to work as expected.

If you have questions about how to do any of that or need help, please say so, and we'll try to help out.

rcoup (Contributor, Author) commented Jul 12, 2022

@bk2204 thanks! FYI, if I don't get time for it in the next week it's likely to be mid-August until I get back to it again.

bk2204 (Member) commented Jul 13, 2022

That's fine. There's no rush. Whenever you're ready, we can look at it; just let us know.

rcoup (Contributor, Author) commented Dec 13, 2022

I've been working on some other stuff recently, but I'll attempt to update this over the holiday period.
