go,proto: remotesapi: Add a more efficient RPC, StreamChunkLocations, to use for fetch and pull. #10918
Merged
StreamChunkLocations is a more efficient StreamDownloadLocations. It is available on a given remotesapi implementation if GetRepoMetadataResponse advertises FEATURE_STREAM_CHUNK_LOCATIONS in its features.
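The capability check described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the `Feature` type, its constants, and the `chooseLocationsRPC` helper are hypothetical stand-ins for whatever the generated remotesapi types actually expose.

```go
package main

import "fmt"

// Feature is a hypothetical stand-in for the enum carried in the
// features field of GetRepoMetadataResponse.
type Feature int

const (
	FeatureUnknown Feature = iota
	FeatureStreamChunkLocations
)

// supports reports whether the advertised feature list contains want.
func supports(features []Feature, want Feature) bool {
	for _, f := range features {
		if f == want {
			return true
		}
	}
	return false
}

// chooseLocationsRPC returns the name of the RPC a client would call,
// falling back to the legacy RPC when the feature is not advertised.
func chooseLocationsRPC(features []Feature) string {
	if supports(features, FeatureStreamChunkLocations) {
		return "StreamChunkLocations"
	}
	return "StreamDownloadLocations"
}

func main() {
	fmt.Println(chooseLocationsRPC([]Feature{FeatureStreamChunkLocations}))
	fmt.Println(chooseLocationsRPC(nil))
}
```

A server that predates the features field advertises nothing, so a client written this way keeps using the legacy path without any special casing.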
…pi.GetRepoMetadataResponse.
…instead of repeated bytes. Saves ~10% on the wire.
…ations support. This is useful for testing because client and server all need to still support StreamDownloadLocations for now. Set DOLT_REMOTESAPI_DISABLED_FEATURES=FEATURE_STREAM_CHUNK_LOCATIONS environment variable.
…g sure to test pull/fetch on legacy StreamDownloadLocations path as well.
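To illustrate how a server might honor DOLT_REMOTESAPI_DISABLED_FEATURES, here is a small sketch. Only the variable name and the FEATURE_STREAM_CHUNK_LOCATIONS value come from the commit notes above; the comma-separated format and the `parseDisabled` and `advertised` helpers are assumptions made for illustration, not the remotesrv implementation.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

const disabledFeaturesEnv = "DOLT_REMOTESAPI_DISABLED_FEATURES"

// parseDisabled turns the raw environment value into a set of feature
// names. Comma separation is an assumption; the source only shows a
// single value being set.
func parseDisabled(raw string) map[string]bool {
	disabled := make(map[string]bool)
	for _, name := range strings.Split(raw, ",") {
		name = strings.TrimSpace(name)
		if name != "" {
			disabled[name] = true
		}
	}
	return disabled
}

// advertised returns the supported features that are not disabled,
// preserving their original order.
func advertised(supported []string, disabled map[string]bool) []string {
	var out []string
	for _, f := range supported {
		if !disabled[f] {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	disabled := parseDisabled(os.Getenv(disabledFeaturesEnv))
	fmt.Println(advertised([]string{"FEATURE_STREAM_CHUNK_LOCATIONS"}, disabled))
}
```

With the variable set as in the commit note, a server like this advertises no features, so clients stay on the StreamDownloadLocations path.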
Comparative statistics for a fetch of a database with So in this example we save ~13% egress and ~86% ingress for the download location resolution overhead. The database itself is about 704MB of transited chunk data. This is far from the only overhead associated with a fetch, but it's a real win.
The current streaming RPC, StreamDownloadLocations, was a straight translation of the unary RPC GetDownloadLocations. We added streaming because we found it interacted much better with TCP and HTTP/2 window scaling. At the time, the RPCs were not reworked to take advantage of the stateful nature of the stream. Since then, the pipelined ChunkFetcher machinery has been added, which makes opportunities for reuse on the individual streaming RPC even better. We also added the RefreshTableFileUrl endpoint, which decouples a client's ability to continue using previously communicated table files from its need to see them in a GetDownloadLocsResponse in particular.
StreamChunkLocations transits exactly the same semantic payloads as StreamDownloadLocations. It just does not retransmit data the client does not need. In particular:
We transit table file URLs separately from the chunk locations. A table_file_id is assigned to a table file the first time the server tells the client about it. That same table_file_id is then used to refer to that table file for all communicated chunk locations in all response messages on the same stream.
We do not re-transit chunk hashes. The responses refer by index to the chunk hashes which were provided in the corresponding request. The client already knows them.
We do not need to transit RefreshTableFileUrlRequest messages for the table files. The client can build these with its own knowledge.
Those are three major improvements for bandwidth utilization. There are also some smaller things, like sending chunk_hashes as bytes instead of repeated bytes.
This PR adds a features field in GetRepoMetadataResponse. That field lets a client know that it can call the newly available endpoint. Otherwise the client continues calling StreamDownloadLocations.
This PR adds both server-side and client-side implementations for the new endpoint. The server-side implementation ends up looking a lot like the existing StreamDownloadLocations code. It keeps some local maps so it can include the appropriate reference ids in the outgoing messages. The client-side implementation intentionally remains about as minimal as possible. In particular, it does not touch range coalescing or most aspects of the fetch pipeline. It targets just generating the StreamChunkLocationsRequest and handling the StreamChunkLocationsResponse messages. It translates the responses back into what StreamDownloadLocations would have generated before handing those pieces off to the rest of the fetch pipeline.
In addition to unit tests, some machinery in remotesrv is updated so we can optionally disable advertising support for StreamChunkLocations. This allows us to update some integration tests so that they continue to exercise the StreamDownloadLocations code paths on both the client and the server.
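The client-side translation described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: every type and field name here (`TableFile`, `Location`, `Response`, `Resolved`, `streamState`) is hypothetical, and the real StreamChunkLocationsResponse message may be shaped differently. The sketch shows only the two ideas from the description: table file URLs are remembered per stream by table_file_id, and chunk hashes are recovered by index into the hashes the client sent in the request.

```go
package main

import "fmt"

// TableFile announces a table file and the ID that later locations
// on the same stream use to refer to it. Hypothetical shape.
type TableFile struct {
	ID  uint32
	URL string
}

// Location refers to a requested chunk by its index in the request
// and to its table file by ID. Hypothetical shape.
type Location struct {
	HashIndex   int
	TableFileID uint32
	Offset      uint64
	Length      uint32
}

// Response is a hypothetical stand-in for StreamChunkLocationsResponse.
type Response struct {
	NewTableFiles []TableFile
	Locations     []Location
}

// Resolved is the expanded form the rest of a fetch pipeline would consume.
type Resolved struct {
	Hash   string
	URL    string
	Offset uint64
	Length uint32
}

// streamState is what the client remembers for the lifetime of one stream.
type streamState struct {
	urls map[uint32]string
}

// resolve records newly announced table files, then expands each
// location using the request's own hashes and the remembered URLs.
func (s *streamState) resolve(requested []string, resp Response) ([]Resolved, error) {
	for _, tf := range resp.NewTableFiles {
		s.urls[tf.ID] = tf.URL
	}
	out := make([]Resolved, 0, len(resp.Locations))
	for _, loc := range resp.Locations {
		if loc.HashIndex < 0 || loc.HashIndex >= len(requested) {
			return nil, fmt.Errorf("hash index %d out of range", loc.HashIndex)
		}
		url, ok := s.urls[loc.TableFileID]
		if !ok {
			return nil, fmt.Errorf("unknown table file id %d", loc.TableFileID)
		}
		out = append(out, Resolved{
			Hash:   requested[loc.HashIndex],
			URL:    url,
			Offset: loc.Offset,
			Length: loc.Length,
		})
	}
	return out, nil
}

func main() {
	st := &streamState{urls: map[uint32]string{}}

	// First response announces table file 1 and uses it.
	first, err := st.resolve([]string{"hashA", "hashB"}, Response{
		NewTableFiles: []TableFile{{ID: 1, URL: "https://example.invalid/tf1"}},
		Locations:     []Location{{HashIndex: 1, TableFileID: 1, Offset: 0, Length: 64}},
	})
	fmt.Println(first, err)

	// Second response reuses ID 1 without sending the URL again.
	second, err := st.resolve([]string{"hashC"}, Response{
		Locations: []Location{{HashIndex: 0, TableFileID: 1, Offset: 64, Length: 32}},
	})
	fmt.Println(second, err)
}
```

Keeping the state in a per-stream value mirrors the description's point that IDs are only meaningful within a single stream; a new stream would start with an empty map.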