Skip to content

Conversation

@jvolmer
Copy link
Contributor

@jvolmer jvolmer commented Nov 6, 2025

This PR adds the capability of batching neighbour retrieval in traversals in the cluster. The batching is already implemented for single server (see #22062).
Batching is here only implemented for DFS (depth first search) and does not include smart traversals.

Additionally to the batching, this PR improves on the waiting time for neighbour requests to db servers in case of batching:
Before, we sent a neighbour request to the involved db servers after we were working on one vertex and then directly waited for all these requests to resolve, although we did not directly needed these neighbours. Now we send a neighbour request (now for one batch) to the db servers. But we just continue here without waiting and wait for the request only when we need it: Which is when we need to start work on one of these neighbours. At that time, the request is probably already resolved and we don't have to wait at all.

These are the main changes:

On coordinator side

  • The OneSidedEnumerator provides the paths to the traversal. Here, the PR switches on batching in the cluster.
  • The changes in algorithm-aliases.h make sure that smart traversals don't use the batching.
  • The ClusterProvider provides the neighbours for the currently traversed vertex. Now, the ClusterProvider owns a map of edge-requests per cursor that are sent to the db servers to get new batches of neighbours - these are requests to the open cursors in the TraverserEngine (described below).
  • The ClusterProvider properly implements the two new functions that were added in [COR-67] Make single-server traversal executor (with DFS) batch its neighbour retrieval #22062. The implementation of these two functions is basically a split version of the ClusterProvider::fetchEdgesFromEngines function that is used in the non-batched case. Caching of edges in ClusterProvider is disabled for now, we need to see if we should switch it on again, then it has to be rewritten because of the batching.
  • Fixed a bug in BatchedLifoQueue.

On db server side

  • The TraverserEngine gets the neighbours (in our case in batches) from RocksDB. Instead of just one cursor, it owns now a map of not-yet finished edge cursors . Resusing cursor - as it was done before - can now lead to incorrect behaviour, this is why I deleted cached cursor variables like _generalCursor and _depthSpecificCursors.
  • The InternalRestTraverserHandler receives the neighbour REST request from the coordinator and calls the TraverserEngine accordingly. The handler is just adapted based on the changes in the TraverserEngine.
  • Tests in InternalRestTraverserHandlerEdgeTest are adapted to account for the new map of cursors in the TraverserEngine.

Other

  • I added some inspectors during debugging which could be of help at a later point.
  • I renamed the _neighboursStack member in SingleServerProvider

First performance results (ran on my machine at home)

Explanation

Code: see arangodb/simple-performance-test#69
Run a traversal on a binary tree of depth 5 where one vertex is a supernode with 10.000 neighbours. (FOR v IN 0..8 OUTBOUND <root-vertex> GRAPH <graph> LIMIT @limit RETURN v.data)
With limit: limit vertex results by 18 (this makes sure that we require at least one neighbour of the supernode to be in the results).
Without limit: just get rid of the limit at all

Data

| use-case                | version     | runtime |
|-------------------------+-------------+---------|
| traversal with limit    | devel       |   1.855 |
|                         | this branch |   0.319 |
|-------------------------+-------------+---------|
| traversal without limit | devel       |    16.6 |
|                         | this branch |    11.9 |

Interpretation

On the given graph, the batched version gives large runtime improvements when using a limit: In the batched version only the first batch of neighbours is requested instead of all neighbours at once in devel.

Also without a limit, the new version is faster than devel. This is probably due to the runtime improvement described above: The devel version directly request all neighbours of a worked on vertex from the db servers and waits on their replies before continuing. This new version requests the first batch of neighbours of a worked on vertex and only waits for that result when starting to work on of of these neighbours (at which time the request has probably already been resolved and we are not waiting at all).

Enterprise PR

https://github.com/arangodb/enterprise/pull/1557


Note

Adds batched DFS neighbor retrieval for clustered traversals via per-cursor streaming in TraverserEngine, with coordinator/provider support, updated REST API, queues, and tests.

  • Cluster traversal (DFS) batching:
    • TraverserEngine now streams edges via multiple per-request cursors (_cursors), adds createNewCursor and nextEdgeBatch(cursorId, batchId, ...), and updates allEdges(..., variables, ...).
    • Removes depth/general cursor caching; getCursor returns a fresh EdgeCursor per depth; fixes EdgeCursorForMultipleVertices iteration and state.
  • Coordinator/provider changes:
    • ClusterProvider implements batched expansion: addExpansionIterator and expandToNextBatch send initial/continuation requests asynchronously, track requests per cursorId, and accumulate results; destructor waits outstanding requests.
    • fetchEdgesFromEngines adapted to batched protocol; uses ExecutionBlock::DefaultBatchSize; stats wiring and optional caching retained.
    • SingleServerProvider: renames _neighboursStack_neighbourCursors and aligns batched expansion helpers.
  • Enumerators/queues:
    • DFS uses BatchedLifoQueue (except smart traversals, which stay non-batched); fixes BatchedLifoQueue::firstIsVertexFetched for expansion markers.
    • Adds lightweight inspect hooks for steps/queues; minor debug logging macros.
  • REST API (InternalRestTraverserHandler):
    • Edge endpoint supports starting cursors with batchSize and continuing via {cursorId,batchId}; validates input and injects variables; preserves non-batched behavior when batchSize omitted.
  • Tests:
    • Extend/update InternalRestTraverserHandlerEdgeTest for multi-cursor and multi-vertex batching flows; minor mock updates.
  • CHANGELOG:
    • Notes DFS batching for cluster traversal and related fixes.

Written by Cursor Bugbot for commit ca017e4. This will update automatically on new commits. Configure here.

@jvolmer jvolmer self-assigned this Nov 6, 2025
@cla-bot cla-bot bot added the cla-signed label Nov 6, 2025
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the final PR Bugbot will review for you during this billing cycle

Your free Bugbot reviews will reset on November 28

Details

Your team is on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle for each member of your team.

To receive Bugbot reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.

@jvolmer jvolmer force-pushed the feature/make-traversal-executor-get-one-batch-after-the-other-cluster branch 2 times, most recently from 54a9942 to 9579482 Compare November 18, 2025 11:16
@jvolmer jvolmer force-pushed the feature/make-traversal-executor-get-one-batch-after-the-other branch from 0140306 to 4aa3ef4 Compare November 18, 2025 12:00
@jvolmer jvolmer force-pushed the feature/make-traversal-executor-get-one-batch-after-the-other-cluster branch from 9579482 to 850065f Compare November 18, 2025 12:01
@jvolmer jvolmer force-pushed the feature/make-traversal-executor-get-one-batch-after-the-other branch from 4aa3ef4 to 9aa0f0b Compare November 18, 2025 12:21
@jvolmer jvolmer force-pushed the feature/make-traversal-executor-get-one-batch-after-the-other-cluster branch 2 times, most recently from d977225 to 0571b84 Compare November 19, 2025 15:41
@jvolmer jvolmer force-pushed the feature/make-traversal-executor-get-one-batch-after-the-other-cluster branch from b6cbd71 to ae36866 Compare November 26, 2025 17:56
@jvolmer jvolmer force-pushed the feature/make-traversal-executor-get-one-batch-after-the-other branch 2 times, most recently from e126b68 to 1e5810d Compare November 27, 2025 08:40
@jvolmer jvolmer force-pushed the feature/make-traversal-executor-get-one-batch-after-the-other-cluster branch from ae36866 to a03639c Compare November 27, 2025 08:52
@jvolmer jvolmer force-pushed the feature/make-traversal-executor-get-one-batch-after-the-other branch 2 times, most recently from 5084d48 to 1396ff1 Compare December 4, 2025 09:11
@jvolmer jvolmer force-pushed the feature/make-traversal-executor-get-one-batch-after-the-other-cluster branch from a03639c to 7de83bd Compare December 4, 2025 10:24
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is being reviewed by Cursor Bugbot

Details

Your team is on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle for each member of your team.

To receive Bugbot reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.

Bug: QueueTracer::isBatched() doesn't delegate to wrapped implementation

The QueueTracer::isBatched() method always returns false instead of delegating to _impl.isBatched(). When tracing is enabled (useTracing = true), the queue is wrapped in QueueTracer, but the batching optimization will never be used because isBatched() returns false. Additionally, append(Expansion) contains TRI_ASSERT(false) which will abort in debug builds if batching is attempted with tracing. The QueueTracer wrapper needs to properly forward these batching-related methods to the underlying BatchedLifoQueue implementation.

arangod/Graph/Queues/QueueTracer.h#L44-L45

bool isBatched() { return false; }

arangod/Graph/Queues/QueueTracer.h#L47-L48

void append(Step step);
void append(Expansion expansion) { TRI_ASSERT(false); }

Fix in Cursor Fix in Web


Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: QueueTracer isBatched returns false, disabling batched mode

The QueueTracer::isBatched() method always returns false instead of delegating to _impl.isBatched(). This PR adds template instantiations for QueueTracer<BatchedLifoQueue<...>> to enable tracing with batched queues, but the hardcoded return value means when tracing is enabled, the batched code path in OneSidedEnumerator::computeNeighbourhoodOfNextVertex (which checks _queue.isBatched()) will never be taken, falling back to the non-batched implementation. This defeats the purpose of the batching optimization when tracing is active.

arangod/Graph/Queues/QueueTracer.h#L44-L45

bool isBatched() { return false; }

arangod/Graph/Queues/QueueTracer.cpp#L186-L188

arangodb::graph::LifoQueue<SingleServerProviderStep>>;
template class ::arangodb::graph::QueueTracer<
arangodb::graph::BatchedLifoQueue<SingleServerProviderStep>>;

Fix in Cursor Fix in Web


@jvolmer jvolmer force-pushed the feature/make-traversal-executor-get-one-batch-after-the-other branch from 1396ff1 to f01b5cd Compare December 9, 2025 14:18
@jvolmer jvolmer marked this pull request as draft December 11, 2025 12:12
@jvolmer jvolmer force-pushed the feature/make-traversal-executor-get-one-batch-after-the-other branch from f01b5cd to d702fa5 Compare December 16, 2025 12:09
Base automatically changed from feature/make-traversal-executor-get-one-batch-after-the-other to devel December 17, 2025 10:11
@jvolmer jvolmer force-pushed the feature/make-traversal-executor-get-one-batch-after-the-other-cluster branch from 9b6d305 to 3ec9760 Compare December 17, 2025 10:40
Request resolution can be out of sync with with actual request inquiry
when creating new cursors.
Smart traversals on coordinator need to to use the non-batched queue
at the moment. With the batched queue, the SmartGraphProvider needs at
least to implement addExpansionIterator and expandToNextBatch
properly, possibly more is needed.
Code should stay as it makes some optimizations possible
@jvolmer jvolmer force-pushed the feature/make-traversal-executor-get-one-batch-after-the-other-cluster branch from 3ec9760 to ca017e4 Compare December 17, 2025 10:49
@jvolmer jvolmer changed the title [COR-7] Make cluster traversal executor (with DFS) batch its neighbour retrieval [COR-68] Make cluster traversal executor (with DFS) batch its neighbour retrieval Dec 17, 2025
@jvolmer jvolmer marked this pull request as ready for review December 17, 2025 12:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant