
Fix/eviction queue purge correctness #22680

Draft

krleonid wants to merge 3 commits into duckdb:main from krleonid:fix/eviction-queue-purge-correctness


Conversation

Contributor

@krleonid krleonid commented May 15, 2026

Summary

This PR fixes three related bugs in the buffer pool's eviction queue purge path. Together they caused dead nodes to silently accumulate in the eviction queue, alive-but-pinned nodes to be dropped (disturbing LRU and corrupting the dead-node counter), and the multi-iteration purge loop to make essentially no progress under contention.

The fix is small (3 files, +37 / −16) but the underlying issues interact and are worth fixing together so the purge accounting becomes self-consistent again.

Background

EvictionQueue uses a duckdb_moodycamel::ConcurrentQueue<BufferEvictionNode> where each node holds:

  • weak_ptr<BlockMemory> — the block this entry refers to.
  • handle_sequence_number — incremented every time the block is re-added to the queue.

A node is logically dead if either:

  1. The weak_ptr is expired (the BlockMemory was destroyed), or
  2. The block's current eviction_seq_num is newer than the node's recorded sequence number (a newer entry superseded it).

Dead nodes are never proactively removed — they are dequeued lazily by EvictionQueue::Purge(), which runs on one thread at a time and tries to keep the queue's dead-node ratio under control.
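For orientation, here is a minimal sketch of a queue entry and the insertion path, reconstructed from the description above; the struct layout and the IncrementEvictionSequenceNumber helper are illustrative, not the exact DuckDB declarations:

```cpp
// Simplified sketch, not the actual DuckDB declarations.
struct BufferEvictionNode {
	weak_ptr<BlockMemory> memory_p;   // the block this entry refers to
	idx_t handle_sequence_number;     // the block's sequence number at insertion time
};

// Hypothetical insertion path: every re-add bumps the block's eviction
// sequence number, so any node enqueued earlier for the same block becomes
// stale and will later be classified as dead by the purge.
void AddToEvictionQueue(shared_ptr<BlockMemory> &memory) {
	auto seq = memory->IncrementEvictionSequenceNumber(); // assumed helper name
	queue.enqueue(BufferEvictionNode {weak_ptr<BlockMemory>(memory), seq});
}
```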

Bugs fixed

1. IncrementDeadNodes was skipped when the buffer was already null

BlockMemory::~BlockMemory only called IncrementDeadNodes when GetBuffer() returned non-null:

```cpp
// before
if (GetBuffer() && GetBufferType() != FileBufferType::TINY_BUFFER) {
	GetBufferManager().GetBufferPool().IncrementDeadNodes(*this);
}
```

But the queue entry's lifetime is independent of whether the buffer is currently loaded — only the dequeue side ever removes it. If the block had previously been evicted (buffer = nullptr) and then the BlockMemory itself was destroyed, the queue entry that pointed at it became dead but was never accounted for. This systematically under-counted total_dead_nodes, which made the purge ratio early-out (approx_alive_nodes * (ALIVE_NODE_MULTIPLIER - 1) > approx_dead_nodes) fire too often and let dead entries accumulate.
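For context, one plausible shape of that early-out, using the names from the expression above (the exact DuckDB arithmetic may differ slightly):

```cpp
// Sketch of the purge ratio early-out. An under-counted total_dead_nodes
// lowers approx_dead_nodes, so this check passes too often and purging stops
// while dead entries keep accumulating.
idx_t approx_dead_nodes = total_dead_nodes;
idx_t approx_alive_nodes = approx_q_size - MinValue(approx_dead_nodes, approx_q_size);
if (approx_alive_nodes * (ALIVE_NODE_MULTIPLIER - 1) > approx_dead_nodes) {
	return; // queue looks mostly alive: skip further purge iterations
}
```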

Fix: drop the GetBuffer() guard.
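A sketch of the destructor check with the guard dropped (the second commit below later replaces this with an explicit ever_in_eviction_queue flag, but the accounting intent is the same):

```cpp
// after (sketch): account for the queue entry regardless of whether the
// buffer is still loaded
if (GetBufferType() != FileBufferType::TINY_BUFFER) {
	GetBufferManager().GetBufferPool().IncrementDeadNodes(*this);
}
```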

2. Pinned-but-current nodes were classified as dead during purge

PurgeIteration used TryGetBlockMemory() to decide whether a dequeued node was alive:

```cpp
shared_ptr<BlockMemory> BufferEvictionNode::TryGetBlockMemory() {
	auto shared_memory_p = memory_p.lock();
	if (!shared_memory_p) {
		return nullptr;
	}
	if (!CanUnload(*shared_memory_p)) { // also checks readers / pinning
		return nullptr;
	}
	return shared_memory_p;
}
```

CanUnload is not a deadness predicate — it also returns false when the block is currently pinned (readers != 0). So a perfectly valid latest-version node that just happened to be pinned at the moment of the purge was:

  • Dropped from the queue (LRU position lost).
  • Counted toward total_dead_nodes -= actually_dequeued - alive_nodes, decrementing the dead counter for a node that was never dead.

This actively desyncs total_dead_nodes from reality and combines badly with bug 1.

Fix: introduce a dedicated BufferEvictionNode::IsDeadNode() that checks only the two real deadness conditions (expired weak_ptr or stale sequence number) and use it for purge classification:

```cpp
bool BufferEvictionNode::IsDeadNode() {
	auto shared_memory_p = memory_p.lock();
	if (!shared_memory_p) {
		return true;
	}
	if (handle_sequence_number != shared_memory_p->GetEvictionSequenceNumber()) {
		return true;
	}
	return false;
}
```

PurgeIteration now decrements total_dead_nodes by an explicit dead_count, which can never include pinned-but-current entries.
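A sketch of that classification, where IsDeadNode, purge_nodes, actually_dequeued, total_dead_nodes and alive_nodes_to_reenqueue are names used in this PR (the latter is the held-aside vector introduced for bug 3 below) and the loop itself is illustrative:

```cpp
// inside PurgeIteration (sketch)
idx_t dead_count = 0;
for (idx_t i = 0; i < actually_dequeued; i++) {
	auto &node = purge_nodes[i];
	if (node.IsDeadNode()) {
		dead_count++; // expired weak_ptr or stale sequence number
	} else {
		// alive nodes (including currently pinned ones) are kept for re-enqueue
		alive_nodes_to_reenqueue.push_back(std::move(node));
	}
}
total_dead_nodes -= dead_count; // never includes pinned-but-current entries
```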

3. Multi-iteration purge churned the purge thread's own moodycamel sub-queue

This is the most impactful bug.

moodycamel::ConcurrentQueue is not a single FIFO. Internally it maintains one sub-queue per producer thread (implicit producers are created lazily on first enqueue from a given thread). try_dequeue_bulk walks the producer list and tends to drain the most recently used sub-queues first.
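To make the sub-queue behaviour concrete, here is a tiny standalone toy (not DuckDB code) against the upstream library, which DuckDB vendors under the duckdb_moodycamel namespace; the exact dequeue order depends on internal heuristics, the point is only that there is no single global FIFO:

```cpp
// Toy illustration of per-producer sub-queues, not DuckDB code.
#include "concurrentqueue.h" // upstream moodycamel header
#include <cstdio>
#include <thread>
#include <vector>

int main() {
	moodycamel::ConcurrentQueue<int> q;

	// The worker thread becomes one implicit producer with its own sub-queue.
	std::thread worker([&] {
		for (int i = 0; i < 4; i++) {
			q.enqueue(i);
		}
	});
	worker.join();

	// Enqueueing from the main thread creates a second implicit producer
	// sub-queue: this mirrors what happened on the purge thread.
	q.enqueue(100);

	// try_dequeue_bulk walks the producer list and may drain the most
	// recently used sub-queue (the main thread's) before the worker's,
	// so the output is not necessarily 0 1 2 3 100.
	std::vector<int> out(8);
	size_t n = q.try_dequeue_bulk(out.begin(), out.size());
	for (size_t i = 0; i < n; i++) {
		std::printf("%d ", out[i]);
	}
	std::printf("\n");
	return 0;
}
```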

The previous purge loop re-enqueued alive nodes between iterations:

```cpp
// PurgeIteration (old)
q.enqueue_bulk(purge_nodes.begin(), alive_nodes);
```

That enqueue happens on the purge thread, which in nearly all interesting scenarios was not previously a producer for this queue (producers are worker threads). It creates an implicit-producer sub-queue owned by the purge thread, and on the next try_dequeue_bulk call moodycamel preferentially re-drains it — pulling back the very alive nodes the purge just inserted, instead of reaching dead entries sitting in worker producer sub-queues. Under sustained pressure the loop made little to no real progress: it kept "purging" the same alive nodes round and round while real dead entries accumulated elsewhere.

Fix: accumulate alive nodes into a local vector across the entire Purge() loop and re-enqueue them once at the end. The early-out queue-size estimate is compensated with the size of the held-aside vector so the early-out check still reflects the true logical queue size:

```cpp
vector<BufferEvictionNode> alive_nodes_to_reenqueue;
while (max_purges != 0) {
	PurgeIteration(purge_size, alive_nodes_to_reenqueue);
	approx_q_size = q.size_approx() + alive_nodes_to_reenqueue.size();
	// ... ratio / early-out checks unchanged ...
	max_purges--;
}
if (!alive_nodes_to_reenqueue.empty()) {
	q.enqueue_bulk(alive_nodes_to_reenqueue.begin(), alive_nodes_to_reenqueue.size());
}
```

This guarantees that each purge iteration sees fresh entries from worker sub-queues instead of bouncing the purge thread's own buffer.

How they interact

| Bug | Effect on total_dead_nodes | Effect on the queue |
| --- | --- | --- |
| 1 — buffer-null skip | Under-counts dead nodes | Dead entries leak in |
| 2 — pinned treated as dead | Decremented for non-dead nodes | Pinned entries are dropped (LRU disturbed) |
| 3 — sub-queue churn | | Multi-iteration purge does no useful work |

Bugs 1 and 2 push the counter in opposite directions, masking each other intermittently and making the symptom hard to trace. Bug 3 means even when the ratio check correctly demands more aggressive purging, the loop fails to actually purge.

Files changed

  • src/include/duckdb/storage/buffer/buffer_pool.hpp — declare BufferEvictionNode::IsDeadNode.
  • src/storage/buffer/buffer_pool.cpp — implement IsDeadNode, restructure Purge / PurgeIteration to defer re-enqueue.
  • src/storage/buffer/block_handle.cpp — drop GetBuffer() guard before IncrementDeadNodes.

Test plan

  • make reldebug builds cleanly.
  • build/reldebug/test/unittest passes (full unit suite).
  • Targeted buffer-manager / eviction tests pass (build/reldebug/test/unittest "[buffermanager]").
  • Local repro of dead-node accumulation under heavy churn no longer grows total_dead_nodes unboundedly between purges.
  • LRU-sensitive workloads do not show pinned blocks being repeatedly evicted/reloaded after purge.
  • Multi-iteration purge actually shrinks the queue under contention (verified by observing q.size_approx() between iterations in a stress harness).

Notes for reviewers

  • BufferEvictionNode::IsDeadNode() is intentionally not const because weak_ptr::lock() is a mutating operation only in spirit — happy to mark it const if preferred (it would still compile since lock() is const on the weak_ptr).
  • The held-aside vector is local to a single Purge() invocation and bounded by the total size of dequeued nodes, so memory overhead is at most one purge_size worth of entries beyond what purge_nodes already held.
  • Re-enqueuing once at the end means alive nodes briefly leave the queue during purge. Any concurrent TryDequeueWithLock on the same queue is gated by purge_lock (held for the entire Purge call), so no observer can see the queue mid-purge.

krleonid and others added 2 commits May 14, 2026 23:36
…ue churn

Three related bugs in the buffer pool's eviction queue purge path
caused dead nodes to accumulate, alive but pinned nodes to be lost,
and the multi-iteration purge loop to make no real progress.

1. Dead increment skipped when buffer was already null

   `BlockMemory::~BlockMemory` only called `IncrementDeadNodes` when
   `GetBuffer()` was non-null. But an eviction queue entry holds a
   `weak_ptr<BlockMemory>` and a sequence number, and stays in the
   queue regardless of whether the buffer has been unloaded - only
   the dequeue side ever removes it. So when the `BlockMemory`
   itself is destroyed, the latest queue entry pointing at it is
   dead independently of the buffer state. The old guard caused
   `total_dead_nodes` to under-count, which made the purge ratio
   early-out fire too often and let dead entries accumulate.

   Fix: remove the `GetBuffer()` guard.

2. Pinned/current nodes were classified as dead during purge

   `PurgeIteration` used `TryGetBlockMemory()` to decide whether
   a dequeued node was alive. `TryGetBlockMemory()` also requires
   `CanUnload()` to be true, so a perfectly valid latest-version
   node with `readers != 0` (pinned) was treated as dead, dropped
   from the queue, and `total_dead_nodes` was decremented for it.
   This both disturbed LRU and desynced the dead counter from
   reality.

   Fix: introduce `BufferEvictionNode::IsDeadNode()` that checks
   only the two conditions that actually mean "dead" - expired
   weak_ptr or stale sequence number - and use it in the purge
   loop.

3. Multi-iteration purge churned the purge thread's own
   moodycamel sub-queue

   moodycamel's `ConcurrentQueue` is not a single FIFO; internally
   it has one sub-queue per producer thread, and `try_dequeue_bulk`
   tends to drain the most recently used sub-queues first. The old
   purge loop re-enqueued alive nodes between iterations, so on
   the next iteration `try_dequeue_bulk` preferentially re-drained
   the purge thread's own sub-queue (the alive nodes it had just
   pushed) instead of reaching dead entries sitting in worker
   producer sub-queues. Under contention this turned "purge more
   aggressively" into a no-op.

   Fix: collect alive nodes into a local vector across the entire
   purge loop and re-enqueue them once after the loop completes.
   Compensate the early-out queue-size estimate with the held-aside
   alive nodes so the size check still reflects the true logical
   queue size.

Also adjust the dead-node decrement in `PurgeIteration` to count
genuinely dead nodes (`dead_count`) instead of `dequeued - alive`,
which previously counted pinned-but-current nodes as dead and drove
the counter further out of sync.

Co-authored-by: Cursor <cursoragent@cursor.com>
Two follow-up changes on top of the eviction-queue purge fix.

1. Refine the dead-node accounting fix in ~BlockMemory.

   The previous commit removed the GetBuffer() guard from ~BlockMemory's
   IncrementDeadNodes call so that blocks evicted before destruction were
   correctly accounted for. That fixed the under-counting case but
   over-counted in two others: blocks that were never inserted into the
   eviction queue at all (DestroyBufferUpon::UNPIN, used by tuple_data
   and column_data allocators) and BLOCK-type BlockHandles that were
   created but never pinned. This broke the dead_nodes <= total_insertions
   invariant in test/sql/table_function/duckdb_eviction_queues.test.

   Replace the ad-hoc guard with an explicit per-BlockMemory flag,
   ever_in_eviction_queue, set once by BufferPool::AddToEvictionQueue
   (the single chokepoint where queue entries originate). This captures
   the real precondition - "this BlockMemory has at least one entry in
   the eviction queue" - directly, without depending on buffer state or
   destroy_buffer_upon as proxies. The flag deliberately stays true even
   after Unload (which resets eviction_seq_num) because the stale queue
   entry from before the Unload is still in the queue and must be
   accounted for at destruction.

   The flag is atomic<bool> with relaxed ordering on both the store and
   the load. We don't need release/acquire here: the destructor only
   uses the flag value as a local boolean, not as a publication signal
   for other memory. The happens-before edge between the producer's
   store and the destructor's load is already provided by the
   shared_ptr<BlockHandle> refcount drop, so relaxed is both correct
   and avoids the seq_cst fences the implicit operator= would emit on
   the hot Unpin path.

2. Add C++ regression tests in test/api/test_buffer_pool_eviction.cpp.

   - "dead_nodes is incremented on BlockMemory destruction even after
     eviction" is a deterministic regression test for the original
     under-counting bug. It allocates 6 destroyable buffers under a
     memory limit that only fits 2, verifies that 4 of them are evicted
     (state == BLOCK_UNLOADED, buffer == nullptr), then drops all
     BlockHandles and asserts dead_nodes increased by exactly 6. With
     the original GetBuffer()-guarded destructor this fails (2 == 6)
     because only the still-loaded handles are counted. Verified by
     reverting just the destructor guard.

   - "dead_nodes invariants hold under destroyable block churn" runs
     20000 destroyable allocate/drop cycles (>> the 4096 INSERT_INTERVAL
     so Purge is invoked many times) while keeping a small set of
     pinned-resident blocks alive. It asserts that total_dead_nodes
     never wraps under unsigned underflow and that per-queue
     dead_nodes <= total_insertions. This guards against the previous
     PurgeIteration bug where pinned-but-current entries were
     classified as dead and over-decremented the counter.

All [storage] tests pass (120 cases, 88009 assertions), including the
existing duckdb_eviction_queues.test invariants.

Co-authored-by: Cursor <cursoragent@cursor.com>
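The relaxed-ordering argument in the commit message above can be summarised in a small sketch; the surrounding members are simplified and only ever_in_eviction_queue, IncrementDeadNodes and AddToEvictionQueue are names taken from this PR:

```cpp
#include <atomic>

struct BlockMemory {
	// Set once by the eviction-queue insertion path and never reset. Relaxed
	// ordering suffices: the destructor only reads the flag as a plain boolean,
	// and the happens-before edge is already provided by the shared_ptr
	// refcount drop that precedes destruction.
	std::atomic<bool> ever_in_eviction_queue {false};

	~BlockMemory() {
		if (ever_in_eviction_queue.load(std::memory_order_relaxed)) {
			// account for the (possibly stale) queue entries pointing at this block,
			// e.g. GetBufferManager().GetBufferPool().IncrementDeadNodes(*this);
		}
	}
};

// Producer side (sketch of the single chokepoint where queue entries originate):
void AddToEvictionQueue(BlockMemory &memory) {
	memory.ever_in_eviction_queue.store(true, std::memory_order_relaxed);
	// ... build a BufferEvictionNode and enqueue it ...
}
```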
@krleonid krleonid marked this pull request as draft May 15, 2026 05:56
Member

@taniabogatsch taniabogatsch left a comment


Great find! Left a small review.

Comment thread src/storage/buffer/buffer_pool.cpp Outdated
Comment thread test/api/test_buffer_pool_eviction.cpp Outdated
Comment thread test/api/test_buffer_pool_eviction.cpp
…buffers, add stress test

- Move debug_sleep_micros into BufferEvictionNode::IsDeadNode() so race
  condition testing exercises the ownership path after acquiring the shared_ptr
- Rename page_size/total_pages/held_pages/resident_pages to buffer-based
  naming (buffer_size/total_buffers/held_buffers/resident_buffers)
- Add concurrent stress test (test_slow) using duckdb_eviction_queues that
  validates dead_nodes never goes negative or exceeds total_insertions
@krleonid krleonid requested a review from taniabogatsch May 15, 2026 18:29