-
Notifications
You must be signed in to change notification settings - Fork 875
[COR-68] Make cluster traversal executor (with DFS) batch its neighbour retrieval #22081
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: devel
Are you sure you want to change the base?
[COR-68] Make cluster traversal executor (with DFS) batch its neighbour retrieval #22081
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the final PR Bugbot will review for you during this billing cycle
Your free Bugbot reviews will reset on November 28
Details
Your team is on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle for each member of your team.
To receive Bugbot reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.
54a9942 to
9579482
Compare
0140306 to
4aa3ef4
Compare
9579482 to
850065f
Compare
4aa3ef4 to
9aa0f0b
Compare
d977225 to
0571b84
Compare
b6cbd71 to
ae36866
Compare
e126b68 to
1e5810d
Compare
ae36866 to
a03639c
Compare
5084d48 to
1396ff1
Compare
a03639c to
7de83bd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is being reviewed by Cursor Bugbot
Details
Your team is on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle for each member of your team.
To receive Bugbot reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.
Bug: QueueTracer::isBatched() doesn't delegate to wrapped implementation
The QueueTracer::isBatched() method always returns false instead of delegating to _impl.isBatched(). When tracing is enabled (useTracing = true), the queue is wrapped in QueueTracer, but the batching optimization will never be used because isBatched() returns false. Additionally, append(Expansion) contains TRI_ASSERT(false) which will abort in debug builds if batching is attempted with tracing. The QueueTracer wrapper needs to properly forward these batching-related methods to the underlying BatchedLifoQueue implementation.
arangod/Graph/Queues/QueueTracer.h#L44-L45
arangodb/arangod/Graph/Queues/QueueTracer.h
Lines 44 to 45 in 7de83bd
| bool isBatched() { return false; } |
arangod/Graph/Queues/QueueTracer.h#L47-L48
arangodb/arangod/Graph/Queues/QueueTracer.h
Lines 47 to 48 in 7de83bd
| void append(Step step); | |
| void append(Expansion expansion) { TRI_ASSERT(false); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: QueueTracer isBatched returns false, disabling batched mode
The QueueTracer::isBatched() method always returns false instead of delegating to _impl.isBatched(). This PR adds template instantiations for QueueTracer<BatchedLifoQueue<...>> to enable tracing with batched queues, but the hardcoded return value means when tracing is enabled, the batched code path in OneSidedEnumerator::computeNeighbourhoodOfNextVertex (which checks _queue.isBatched()) will never be taken, falling back to the non-batched implementation. This defeats the purpose of the batching optimization when tracing is active.
arangod/Graph/Queues/QueueTracer.h#L44-L45
arangodb/arangod/Graph/Queues/QueueTracer.h
Lines 44 to 45 in 9b6d305
| bool isBatched() { return false; } |
arangod/Graph/Queues/QueueTracer.cpp#L186-L188
arangodb/arangod/Graph/Queues/QueueTracer.cpp
Lines 186 to 188 in 9b6d305
| arangodb::graph::LifoQueue<SingleServerProviderStep>>; | |
| template class ::arangodb::graph::QueueTracer< | |
| arangodb::graph::BatchedLifoQueue<SingleServerProviderStep>>; |
1396ff1 to
f01b5cd
Compare
f01b5cd to
d702fa5
Compare
9b6d305 to
3ec9760
Compare
Request resolution can be out of sync with with actual request inquiry when creating new cursors.
Smart traversals on coordinator need to to use the non-batched queue at the moment. With the batched queue, the SmartGraphProvider needs at least to implement addExpansionIterator and expandToNextBatch properly, possibly more is needed.
Code should stay as it makes some optimizations possible
3ec9760 to
ca017e4
Compare
This PR adds the capability of batching neighbour retrieval in traversals in the cluster. The batching is already implemented for single server (see #22062).
Batching is here only implemented for DFS (depth first search) and does not include smart traversals.
Additionally to the batching, this PR improves on the waiting time for neighbour requests to db servers in case of batching:
Before, we sent a neighbour request to the involved db servers after we were working on one vertex and then directly waited for all these requests to resolve, although we did not directly needed these neighbours. Now we send a neighbour request (now for one batch) to the db servers. But we just continue here without waiting and wait for the request only when we need it: Which is when we need to start work on one of these neighbours. At that time, the request is probably already resolved and we don't have to wait at all.
These are the main changes:
On coordinator side
OneSidedEnumeratorprovides the paths to the traversal. Here, the PR switches on batching in the cluster.algorithm-aliases.hmake sure that smart traversals don't use the batching.ClusterProviderprovides the neighbours for the currently traversed vertex. Now, theClusterProviderowns a map of edge-requests per cursor that are sent to the db servers to get new batches of neighbours - these are requests to the open cursors in theTraverserEngine(described below).ClusterProviderproperly implements the two new functions that were added in [COR-67] Make single-server traversal executor (with DFS) batch its neighbour retrieval #22062. The implementation of these two functions is basically a split version of theClusterProvider::fetchEdgesFromEnginesfunction that is used in the non-batched case. Caching of edges in ClusterProvider is disabled for now, we need to see if we should switch it on again, then it has to be rewritten because of the batching.BatchedLifoQueue.On db server side
TraverserEnginegets the neighbours (in our case in batches) from RocksDB. Instead of just one cursor, it owns now a map of not-yet finished edge cursors . Resusing cursor - as it was done before - can now lead to incorrect behaviour, this is why I deleted cached cursor variables like_generalCursorand_depthSpecificCursors.InternalRestTraverserHandlerreceives the neighbour REST request from the coordinator and calls theTraverserEngineaccordingly. The handler is just adapted based on the changes in theTraverserEngine.InternalRestTraverserHandlerEdgeTestare adapted to account for the new map of cursors in theTraverserEngine.Other
_neighboursStackmember inSingleServerProviderFirst performance results (ran on my machine at home)
Explanation
Code: see arangodb/simple-performance-test#69
Run a traversal on a binary tree of depth 5 where one vertex is a supernode with 10.000 neighbours. (
FOR v IN 0..8 OUTBOUND <root-vertex> GRAPH <graph> LIMIT @limit RETURN v.data)With limit: limit vertex results by 18 (this makes sure that we require at least one neighbour of the supernode to be in the results).
Without limit: just get rid of the limit at all
Data
Interpretation
On the given graph, the batched version gives large runtime improvements when using a limit: In the batched version only the first batch of neighbours is requested instead of all neighbours at once in devel.
Also without a limit, the new version is faster than devel. This is probably due to the runtime improvement described above: The devel version directly request all neighbours of a worked on vertex from the db servers and waits on their replies before continuing. This new version requests the first batch of neighbours of a worked on vertex and only waits for that result when starting to work on of of these neighbours (at which time the request has probably already been resolved and we are not waiting at all).
Enterprise PR
https://github.com/arangodb/enterprise/pull/1557
Note
Adds batched DFS neighbor retrieval for clustered traversals via per-cursor streaming in
TraverserEngine, with coordinator/provider support, updated REST API, queues, and tests.TraverserEnginenow streams edges via multiple per-request cursors (_cursors), addscreateNewCursorandnextEdgeBatch(cursorId, batchId, ...), and updatesallEdges(..., variables, ...).getCursorreturns a freshEdgeCursorper depth; fixesEdgeCursorForMultipleVerticesiteration and state.ClusterProviderimplements batched expansion:addExpansionIteratorandexpandToNextBatchsend initial/continuation requests asynchronously, track requests percursorId, and accumulate results; destructor waits outstanding requests.fetchEdgesFromEnginesadapted to batched protocol; usesExecutionBlock::DefaultBatchSize; stats wiring and optional caching retained.SingleServerProvider: renames_neighboursStack→_neighbourCursorsand aligns batched expansion helpers.BatchedLifoQueue(except smart traversals, which stay non-batched); fixesBatchedLifoQueue::firstIsVertexFetchedfor expansion markers.inspecthooks for steps/queues; minor debug logging macros.InternalRestTraverserHandler):batchSizeand continuing via{cursorId,batchId}; validates input and injects variables; preserves non-batched behavior whenbatchSizeomitted.InternalRestTraverserHandlerEdgeTestfor multi-cursor and multi-vertex batching flows; minor mock updates.Written by Cursor Bugbot for commit ca017e4. This will update automatically on new commits. Configure here.