Fix/eviction queue purge correctness #22680
Draft
krleonid wants to merge 3 commits into
Conversation
…ue churn

Three related bugs in the buffer pool's eviction queue purge path caused dead nodes to accumulate, alive but pinned nodes to be lost, and the multi-iteration purge loop to make no real progress.

1. Dead increment skipped when buffer was already null

`BlockMemory::~BlockMemory` only called `IncrementDeadNodes` when `GetBuffer()` was non-null. But an eviction queue entry holds a `weak_ptr<BlockMemory>` and a sequence number, and stays in the queue regardless of whether the buffer has been unloaded - only the dequeue side ever removes it. So when the `BlockMemory` itself is destroyed, the latest queue entry pointing at it is dead independently of the buffer state. The old guard caused `total_dead_nodes` to under-count, which made the purge ratio early-out fire too often and let dead entries accumulate.

Fix: remove the `GetBuffer()` guard.

2. Pinned/current nodes were classified as dead during purge

`PurgeIteration` used `TryGetBlockMemory()` to decide whether a dequeued node was alive. `TryGetBlockMemory()` also requires `CanUnload()` to be true, so a perfectly valid latest-version node with `readers != 0` (pinned) was treated as dead, dropped from the queue, and `total_dead_nodes` was decremented for it. This both disturbed LRU and desynced the dead counter from reality.

Fix: introduce `BufferEvictionNode::IsDeadNode()` that checks only the two conditions that actually mean "dead" - expired weak_ptr or stale sequence number - and use it in the purge loop.

3. Multi-iteration purge churned the purge thread's own moodycamel sub-queue

moodycamel's `ConcurrentQueue` is not a single FIFO; internally it has one sub-queue per producer thread, and `try_dequeue_bulk` tends to drain the most recently used sub-queues first. The old purge loop re-enqueued alive nodes between iterations, so on the next iteration `try_dequeue_bulk` preferentially re-drained the purge thread's own sub-queue (the alive nodes it had just pushed) instead of reaching dead entries sitting in worker producer sub-queues. Under contention this turned "purge more aggressively" into a no-op.

Fix: collect alive nodes into a local vector across the entire purge loop and re-enqueue them once after the loop completes. Compensate the early-out queue-size estimate with the held-aside alive nodes so the size check still reflects the true logical queue size. Also adjust the dead-node decrement in `PurgeIteration` to count genuinely dead nodes (`dead_count`) instead of `dequeued - alive`, which previously counted pinned-but-current nodes as dead and drove the counter further out of sync.

Co-authored-by: Cursor <cursoragent@cursor.com>
Two follow-up changes on top of the eviction-queue purge fix.
1. Refine the dead-node accounting fix in ~BlockMemory.
The previous commit removed the GetBuffer() guard from ~BlockMemory's
IncrementDeadNodes call so that blocks evicted before destruction were
correctly accounted for. That fixed the under-counting case but
over-counted in two others: blocks that were never inserted into the
eviction queue at all (DestroyBufferUpon::UNPIN, used by tuple_data
and column_data allocators) and BLOCK-type BlockHandles that were
created but never pinned. This broke the dead_nodes <= total_insertions
invariant in test/sql/table_function/duckdb_eviction_queues.test.
Replace the ad-hoc guard with an explicit per-BlockMemory flag,
ever_in_eviction_queue, set once by BufferPool::AddToEvictionQueue
(the single chokepoint where queue entries originate). This captures
the real precondition - "this BlockMemory has at least one entry in
the eviction queue" - directly, without depending on buffer state or
destroy_buffer_upon as proxies. The flag deliberately stays true even
after Unload (which resets eviction_seq_num) because the stale queue
entry from before the Unload is still in the queue and must be
accounted for at destruction.
The flag is atomic<bool> with relaxed ordering on both the store and
the load. We don't need release/acquire here: the destructor only
uses the flag value as a local boolean, not as a publication signal
for other memory. The happens-before edge between the producer's
store and the destructor's load is already provided by the
shared_ptr<BlockHandle> refcount drop, so relaxed is both correct
and avoids the seq_cst fences the implicit operator= would emit on
the hot Unpin path.
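
A minimal sketch of the flag wiring described above; the surrounding signatures and member placement are assumptions for illustration, not the actual diff:

```cpp
// Declared in BlockMemory (assumed placement); set once, never cleared.
std::atomic<bool> ever_in_eviction_queue {false};

void BufferPool::AddToEvictionQueue(const shared_ptr<BlockMemory> &memory) {
	// Relaxed store: this is the single chokepoint where queue entries
	// originate, and the flag deliberately survives later Unload calls.
	memory->ever_in_eviction_queue.store(true, std::memory_order_relaxed);
	// ... build and enqueue the BufferEvictionNode ...
}

BlockMemory::~BlockMemory() {
	// Relaxed load: the value is only used as a local boolean here; the
	// happens-before edge comes from the shared_ptr<BlockHandle> refcount drop.
	// (This replaces the old GetBuffer()/destroy_buffer_upon-based guesswork.)
	if (ever_in_eviction_queue.load(std::memory_order_relaxed)) {
		GetBufferManager().GetBufferPool().IncrementDeadNodes(*this);
	}
}
```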
2. Add C++ regression tests in test/api/test_buffer_pool_eviction.cpp.
- "dead_nodes is incremented on BlockMemory destruction even after
eviction" is a deterministic regression test for the original
under-counting bug. It allocates 6 destroyable buffers under a
memory limit that only fits 2, verifies that 4 of them are evicted
(state == BLOCK_UNLOADED, buffer == nullptr), then drops all
BlockHandles and asserts dead_nodes increased by exactly 6. With
the original GetBuffer()-guarded destructor this fails with 2 == 6,
because only the still-loaded handles are counted. Verified by
reverting just the destructor guard.
- "dead_nodes invariants hold under destroyable block churn" runs
20000 destroyable allocate/drop cycles (>> the 4096 INSERT_INTERVAL
so Purge is invoked many times) while keeping a small set of
pinned-resident blocks alive. It asserts that total_dead_nodes
never wraps under unsigned underflow and that per-queue
dead_nodes <= total_insertions. This guards against the previous
PurgeIteration bug where pinned-but-current entries were
classified as dead and over-decremented the counter.
All [storage] tests pass (120 cases, 88009 assertions), including the
existing duckdb_eviction_queues.test invariants.
Co-authored-by: Cursor <cursoragent@cursor.com>
taniabogatsch
suggested changes
May 15, 2026
Member
left a comment
Great find! Left a small review.
…buffers, add stress test

- Move debug_sleep_micros into BufferEvictionNode::IsDeadNode() so race condition testing exercises the ownership path after acquiring the shared_ptr
- Rename page_size/total_pages/held_pages/resident_pages to buffer-based naming (buffer_size/total_buffers/held_buffers/resident_buffers)
- Add concurrent stress test (test_slow) using duckdb_eviction_queues that validates dead_nodes never goes negative or exceeds total_insertions
Summary
This PR fixes three related bugs in the buffer pool's eviction queue purge path. Together they caused dead nodes to silently accumulate in the eviction queue, alive-but-pinned nodes to be dropped (disturbing LRU and corrupting the dead-node counter), and the multi-iteration purge loop to make essentially no progress under contention.
The fix is small (3 files, +37 / −16) but the underlying issues interact and are worth fixing together so the purge accounting becomes self-consistent again.
Background
`EvictionQueue` uses a `duckdb_moodycamel::ConcurrentQueue<BufferEvictionNode>` where each node holds:

- `weak_ptr<BlockMemory>` — the block this entry refers to.
- `handle_sequence_number` — incremented every time the block is re-added to the queue.

A node is logically dead if either:

- the `weak_ptr` is expired (the `BlockMemory` was destroyed), or
- `eviction_seq_num` is newer than the node's recorded sequence number (a newer entry superseded it).

Dead nodes are never proactively removed — they are dequeued lazily by `EvictionQueue::Purge()`, which runs on one thread at a time and tries to keep the queue's dead-node ratio under control.
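
For orientation, a sketch of the node shape implied by the description above; member and method names match the code excerpts below, everything else is illustrative:

```cpp
// Sketch only: the real declaration lives in buffer_pool.hpp and may carry
// additional members or different types.
struct BufferEvictionNode {
	weak_ptr<BlockMemory> memory_p;   // expired => the BlockMemory is gone
	idx_t handle_sequence_number;     // stale => a newer queue entry superseded this one

	shared_ptr<BlockMemory> TryGetBlockMemory();
	bool IsDeadNode();                // added by this PR (see below)
};
```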
Bugs fixed

1. `IncrementDeadNodes` was skipped when the buffer was already null

`BlockMemory::~BlockMemory` only called `IncrementDeadNodes` when `GetBuffer()` returned non-null:

```cpp
// before
if (GetBuffer() && GetBufferType() != FileBufferType::TINY_BUFFER) {
	GetBufferManager().GetBufferPool().IncrementDeadNodes(*this);
}
```
But the queue entry's lifetime is independent of whether the buffer is currently loaded — only the dequeue side ever removes it. If the block had previously been evicted (buffer = nullptr) and then the `BlockMemory` itself was destroyed, the queue entry that pointed at it became dead but was never accounted for. This systematically under-counted `total_dead_nodes`, which made the purge ratio early-out (`approx_alive_nodes * (ALIVE_NODE_MULTIPLIER - 1) > approx_dead_nodes`) fire too often and let dead entries accumulate.

Fix: drop the `GetBuffer()` guard.
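
A sketch of the resulting destructor check, assuming nothing else in the function changes (a later commit in this PR refines this further with an `ever_in_eviction_queue` flag):

```cpp
// after (sketch): only the GetBuffer() check is dropped; the
// TINY_BUFFER exclusion stays.
if (GetBufferType() != FileBufferType::TINY_BUFFER) {
	GetBufferManager().GetBufferPool().IncrementDeadNodes(*this);
}
```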
2. Pinned-but-current nodes were classified as dead during purge

`PurgeIteration` used `TryGetBlockMemory()` to decide whether a dequeued node was alive:

```cpp
shared_ptr<BlockMemory> BufferEvictionNode::TryGetBlockMemory() {
	auto shared_memory_p = memory_p.lock();
	if (!shared_memory_p) {
		return nullptr;
	}
	if (!CanUnload(*shared_memory_p)) { // also checks readers / pinning
		return nullptr;
	}
	return shared_memory_p;
}
```
`CanUnload` is not a deadness predicate — it also returns `false` when the block is currently pinned (`readers != 0`). So a perfectly valid latest-version node that just happened to be pinned at the moment of the purge was dropped from the queue and counted as dead via `total_dead_nodes -= actually_dequeued - alive_nodes`, decrementing the dead counter for a node that was never dead. This actively desyncs `total_dead_nodes` from reality and combines badly with bug 1.

Fix: introduce a dedicated `BufferEvictionNode::IsDeadNode()` that checks only the two real deadness conditions (expired weak_ptr or stale sequence number) and use it for purge classification:

```cpp
bool BufferEvictionNode::IsDeadNode() {
	auto shared_memory_p = memory_p.lock();
	if (!shared_memory_p) {
		return true;
	}
	if (handle_sequence_number != shared_memory_p->GetEvictionSequenceNumber()) {
		return true;
	}
	return false;
}
```
`PurgeIteration` now decrements `total_dead_nodes` by an explicit `dead_count`, which can never include pinned-but-current entries.
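
A sketch of the new classification inside `PurgeIteration`; the loop shape is illustrative, while `IsDeadNode`, `purge_nodes`, `actually_dequeued`, `alive_nodes_to_reenqueue` and `total_dead_nodes` are the names used elsewhere in this description:

```cpp
// Classify each dequeued node; only genuinely dead nodes count as dead.
idx_t dead_count = 0;
for (idx_t i = 0; i < actually_dequeued; i++) {
	auto &node = purge_nodes[i];
	if (node.IsDeadNode()) {
		dead_count++; // expired weak_ptr or stale sequence number: drop it
	} else {
		// alive (possibly pinned): hold aside, re-enqueued once after the loop
		alive_nodes_to_reenqueue.push_back(std::move(node));
	}
}
// Decrement by dead_count, never by (dequeued - alive), so pinned-but-current
// nodes no longer corrupt the counter.
total_dead_nodes -= dead_count;
```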
3. Multi-iteration purge churned the purge thread's own moodycamel sub-queue

This is the most impactful bug. `moodycamel::ConcurrentQueue` is not a single FIFO. Internally it maintains one sub-queue per producer thread (implicit producers are created lazily on first enqueue from a given thread). `try_dequeue_bulk` walks the producer list and tends to drain the most recently used sub-queues first.

The previous purge loop re-enqueued alive nodes between iterations:
```cpp
// PurgeIteration (old)
q.enqueue_bulk(purge_nodes.begin(), alive_nodes);
```
That enqueue happens on the purge thread, which in nearly all interesting scenarios was not previously a producer for this queue (producers are worker threads). It creates an implicit-producer sub-queue owned by the purge thread, and on the next `try_dequeue_bulk` call moodycamel preferentially re-drains it — pulling back the very alive nodes the purge just inserted, instead of reaching dead entries sitting in worker producer sub-queues. Under sustained pressure the loop made little to no real progress: it kept "purging" the same alive nodes round and round while real dead entries accumulated elsewhere.

Fix: accumulate alive nodes into a local `vector` across the entire `Purge()` loop and re-enqueue them once at the end. The early-out queue-size estimate is compensated with the size of the held-aside vector so the early-out check still reflects the true logical queue size:

```cpp
vector<BufferEvictionNode> alive_nodes_to_reenqueue;
while (max_purges != 0) {
	PurgeIteration(purge_size, alive_nodes_to_reenqueue);
	approx_q_size = q.size_approx() + alive_nodes_to_reenqueue.size();
	// ... ratio / early-out checks unchanged ...
	max_purges--;
}
if (!alive_nodes_to_reenqueue.empty()) {
	q.enqueue_bulk(alive_nodes_to_reenqueue.begin(), alive_nodes_to_reenqueue.size());
}
```
This guarantees that each purge iteration sees fresh entries from worker sub-queues instead of bouncing the purge thread's own buffer.
How they interact
Bugs 1 and 2 push `total_dead_nodes` in opposite directions, masking each other intermittently and making the symptom hard to trace. Bug 3 means even when the ratio check correctly demands more aggressive purging, the loop fails to actually purge.
Files changed
- `src/include/duckdb/storage/buffer/buffer_pool.hpp` — declare `BufferEvictionNode::IsDeadNode`.
- `src/storage/buffer/buffer_pool.cpp` — implement `IsDeadNode`, restructure `Purge`/`PurgeIteration` to defer re-enqueue.
- `src/storage/buffer/block_handle.cpp` — drop `GetBuffer()` guard before `IncrementDeadNodes`.

Test plan
- `make reldebug` builds cleanly.
- `build/reldebug/test/unittest` passes (full unit suite).
- `build/reldebug/test/unittest "[buffermanager]"` passes.
- …`total_dead_nodes` unboundedly between purges.
- …`q.size_approx()` between iterations in a stress harness).

Notes for reviewers
- `BufferEvictionNode::IsDeadNode()` is intentionally not `const` because `weak_ptr::lock()` is a mutating operation only in spirit — happy to mark it `const` if preferred (it would still compile since `lock()` is `const` on the weak_ptr; see the sketch below).
- The alive-node `vector` is local to a single `Purge()` invocation and bounded by the total size of dequeued nodes, so memory overhead is at most one `purge_size` worth of entries beyond what `purge_nodes` already held.
- `TryDequeueWithLock` on the same queue is gated by `purge_lock` (held for the entire `Purge` call), so no observer can see the queue mid-purge.
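
For reference, the `const`-qualified variant mentioned in the first note would look like this and compile unchanged, since `weak_ptr::lock()` is `const`:

```cpp
bool BufferEvictionNode::IsDeadNode() const {
	auto shared_memory_p = memory_p.lock(); // lock() is const on weak_ptr
	if (!shared_memory_p) {
		return true; // the BlockMemory is gone
	}
	// stale sequence number: a newer queue entry superseded this one
	return handle_sequence_number != shared_memory_p->GetEvictionSequenceNumber();
}
```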