Fix/eviction queue purge correctness #22680
Draft
krleonid wants to merge 3 commits into
Conversation
…ue churn

Three related bugs in the buffer pool's eviction queue purge path caused dead nodes to accumulate, alive but pinned nodes to be lost, and the multi-iteration purge loop to make no real progress.

1. Dead increment skipped when buffer was already null

`BlockMemory::~BlockMemory` only called `IncrementDeadNodes` when `GetBuffer()` was non-null. But an eviction queue entry holds a `weak_ptr<BlockMemory>` and a sequence number, and stays in the queue regardless of whether the buffer has been unloaded - only the dequeue side ever removes it. So when the `BlockMemory` itself is destroyed, the latest queue entry pointing at it is dead independently of the buffer state. The old guard caused `total_dead_nodes` to under-count, which made the purge ratio early-out fire too often and let dead entries accumulate.

Fix: remove the `GetBuffer()` guard.

2. Pinned/current nodes were classified as dead during purge

`PurgeIteration` used `TryGetBlockMemory()` to decide whether a dequeued node was alive. `TryGetBlockMemory()` also requires `CanUnload()` to be true, so a perfectly valid latest-version node with `readers != 0` (pinned) was treated as dead, dropped from the queue, and `total_dead_nodes` was decremented for it. This both disturbed LRU and desynced the dead counter from reality.

Fix: introduce `BufferEvictionNode::IsDeadNode()` that checks only the two conditions that actually mean "dead" - expired weak_ptr or stale sequence number - and use it in the purge loop.

3. Multi-iteration purge churned the purge thread's own moodycamel sub-queue

moodycamel's `ConcurrentQueue` is not a single FIFO; internally it has one sub-queue per producer thread, and `try_dequeue_bulk` tends to drain the most recently used sub-queues first. The old purge loop re-enqueued alive nodes between iterations, so on the next iteration `try_dequeue_bulk` preferentially re-drained the purge thread's own sub-queue (the alive nodes it had just pushed) instead of reaching dead entries sitting in worker producer sub-queues. Under contention this turned "purge more aggressively" into a no-op.

Fix: collect alive nodes into a local vector across the entire purge loop and re-enqueue them once after the loop completes. Compensate the early-out queue-size estimate with the held-aside alive nodes so the size check still reflects the true logical queue size. Also adjust the dead-node decrement in `PurgeIteration` to count genuinely dead nodes (`dead_count`) instead of `dequeued - alive`, which previously counted pinned-but-current nodes as dead and drove the counter further out of sync.

Co-authored-by: Cursor <cursoragent@cursor.com>
Two follow-up changes on top of the eviction-queue purge fix.
1. Refine the dead-node accounting fix in ~BlockMemory.
The previous commit removed the GetBuffer() guard from ~BlockMemory's
IncrementDeadNodes call so that blocks evicted before destruction were
correctly accounted for. That fixed the under-counting case but
over-counted in two others: blocks that were never inserted into the
eviction queue at all (DestroyBufferUpon::UNPIN, used by tuple_data
and column_data allocators) and BLOCK-type BlockHandles that were
created but never pinned. This broke the dead_nodes <= total_insertions
invariant in test/sql/table_function/duckdb_eviction_queues.test.
Replace the ad-hoc guard with an explicit per-BlockMemory flag,
ever_in_eviction_queue, set once by BufferPool::AddToEvictionQueue
(the single chokepoint where queue entries originate). This captures
the real precondition - "this BlockMemory has at least one entry in
the eviction queue" - directly, without depending on buffer state or
destroy_buffer_upon as proxies. The flag deliberately stays true even
after Unload (which resets eviction_seq_num) because the stale queue
entry from before the Unload is still in the queue and must be
accounted for at destruction.
The flag is atomic<bool> with relaxed ordering on both the store and
the load. We don't need release/acquire here: the destructor only
uses the flag value as a local boolean, not as a publication signal
for other memory. The happens-before edge between the producer's
store and the destructor's load is already provided by the
shared_ptr<BlockHandle> refcount drop, so relaxed is both correct
and avoids the seq_cst fences the implicit operator= would emit on
the hot Unpin path.
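
A minimal sketch of the flag wiring described above; the surrounding signatures and member placement are assumptions for illustration, not the actual diff:

```cpp
// Declared in BlockMemory (assumed placement); set once, never cleared.
std::atomic<bool> ever_in_eviction_queue {false};

void BufferPool::AddToEvictionQueue(const shared_ptr<BlockMemory> &memory) {
	// Relaxed store: this is the single chokepoint where queue entries
	// originate, and the flag deliberately survives later Unload calls.
	memory->ever_in_eviction_queue.store(true, std::memory_order_relaxed);
	// ... build and enqueue the BufferEvictionNode ...
}

BlockMemory::~BlockMemory() {
	// Relaxed load: the value is only used as a local boolean here; the
	// happens-before edge comes from the shared_ptr<BlockHandle> refcount drop.
	// (This replaces the old GetBuffer()/destroy_buffer_upon-based guesswork.)
	if (ever_in_eviction_queue.load(std::memory_order_relaxed)) {
		GetBufferManager().GetBufferPool().IncrementDeadNodes(*this);
	}
}
```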
2. Add C++ regression tests in test/api/test_buffer_pool_eviction.cpp.
- "dead_nodes is incremented on BlockMemory destruction even after
eviction" is a deterministic regression test for the original
under-counting bug. It allocates 6 destroyable buffers under a
memory limit that only fits 2, verifies that 4 of them are evicted
(state == BLOCK_UNLOADED, buffer == nullptr), then drops all
BlockHandles and asserts dead_nodes increased by exactly 6. With
the original GetBuffer()-guarded destructor this fails with 2 == 6,
because only the still-loaded handles are counted. Verified by
reverting just the destructor guard.
- "dead_nodes invariants hold under destroyable block churn" runs
20000 destroyable allocate/drop cycles (>> the 4096 INSERT_INTERVAL
so Purge is invoked many times) while keeping a small set of
pinned-resident blocks alive. It asserts that total_dead_nodes
never wraps under unsigned underflow and that per-queue
dead_nodes <= total_insertions. This guards against the previous
PurgeIteration bug where pinned-but-current entries were
classified as dead and over-decremented the counter.
All [storage] tests pass (120 cases, 88009 assertions), including the
existing duckdb_eviction_queues.test invariants.
Co-authored-by: Cursor <cursoragent@cursor.com>
taniabogatsch
suggested changes
May 15, 2026
Member
left a comment
Great find! Left a small review.
…buffers, add stress test

- Move debug_sleep_micros into BufferEvictionNode::IsDeadNode() so race condition testing exercises the ownership path after acquiring the shared_ptr
- Rename page_size/total_pages/held_pages/resident_pages to buffer-based naming (buffer_size/total_buffers/held_buffers/resident_buffers)
- Add concurrent stress test (test_slow) using duckdb_eviction_queues that validates dead_nodes never goes negative or exceeds total_insertions
Summary
This PR fixes three related bugs in the buffer pool's eviction queue purge path. Together they caused dead nodes to silently accumulate in the eviction queue, alive-but-pinned nodes to be dropped (disturbing LRU and corrupting the dead-node counter), and the multi-iteration purge loop to make essentially no progress under contention.
The fix is small (3 files, +37 / −16) but the underlying issues interact and are worth fixing together so the purge accounting becomes self-consistent again.
Background
`EvictionQueue` uses a `duckdb_moodycamel::ConcurrentQueue<BufferEvictionNode>` where each node holds:

- `weak_ptr<BlockMemory>` — the block this entry refers to.
- `handle_sequence_number` — incremented every time the block is re-added to the queue.

A node is logically dead if either:

- the `weak_ptr` is expired (the `BlockMemory` was destroyed), or
- `eviction_seq_num` is newer than the node's recorded sequence number (a newer entry superseded it).

Dead nodes are never proactively removed — they are dequeued lazily by `EvictionQueue::Purge()`, which runs on one thread at a time and tries to keep the queue's dead-node ratio under control.
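
For orientation, a sketch of the node shape implied by the description above; member and method names match the code excerpts below, everything else is illustrative:

```cpp
// Sketch only: the real declaration lives in buffer_pool.hpp and may carry
// additional members or different types.
struct BufferEvictionNode {
	weak_ptr<BlockMemory> memory_p;   // expired => the BlockMemory is gone
	idx_t handle_sequence_number;     // stale => a newer queue entry superseded this one

	shared_ptr<BlockMemory> TryGetBlockMemory();
	bool IsDeadNode();                // added by this PR (see below)
};
```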
Bugs fixed

1. `IncrementDeadNodes` was skipped when the buffer was already null

`BlockMemory::~BlockMemory` only called `IncrementDeadNodes` when `GetBuffer()` returned non-null:

```cpp
// before
if (GetBuffer() && GetBufferType() != FileBufferType::TINY_BUFFER) {
	GetBufferManager().GetBufferPool().IncrementDeadNodes(*this);
}
```
But the queue entry's lifetime is independent of whether the buffer is currently loaded — only the dequeue side ever removes it. If the block had previously been evicted (buffer = nullptr) and then the `BlockMemory` itself was destroyed, the queue entry that pointed at it became dead but was never accounted for. This systematically under-counted `total_dead_nodes`, which made the purge ratio early-out (`approx_alive_nodes * (ALIVE_NODE_MULTIPLIER - 1) > approx_dead_nodes`) fire too often and let dead entries accumulate.

Fix: drop the `GetBuffer()` guard.
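
A sketch of the resulting destructor check, assuming nothing else in the function changes (a later commit in this PR refines this further with an `ever_in_eviction_queue` flag):

```cpp
// after (sketch): only the GetBuffer() check is dropped; the
// TINY_BUFFER exclusion stays.
if (GetBufferType() != FileBufferType::TINY_BUFFER) {
	GetBufferManager().GetBufferPool().IncrementDeadNodes(*this);
}
```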
2. Pinned-but-current nodes were classified as dead during purge

`PurgeIteration` used `TryGetBlockMemory()` to decide whether a dequeued node was alive:

```cpp
shared_ptr<BlockMemory> BufferEvictionNode::TryGetBlockMemory() {
	auto shared_memory_p = memory_p.lock();
	if (!shared_memory_p) {
		return nullptr;
	}
	if (!CanUnload(*shared_memory_p)) { // also checks readers / pinning
		return nullptr;
	}
	return shared_memory_p;
}
```
`CanUnload` is not a deadness predicate — it also returns `false` when the block is currently pinned (`readers != 0`). So a perfectly valid latest-version node that just happened to be pinned at the moment of the purge was dropped from the queue and counted as dead via `total_dead_nodes -= actually_dequeued - alive_nodes`, decrementing the dead counter for a node that was never dead. This actively desyncs `total_dead_nodes` from reality and combines badly with bug 1.

Fix: introduce a dedicated `BufferEvictionNode::IsDeadNode()` that checks only the two real deadness conditions (expired weak_ptr or stale sequence number) and use it for purge classification:

```cpp
bool BufferEvictionNode::IsDeadNode() {
	auto shared_memory_p = memory_p.lock();
	if (!shared_memory_p) {
		return true;
	}
	if (handle_sequence_number != shared_memory_p->GetEvictionSequenceNumber()) {
		return true;
	}
	return false;
}
```
`PurgeIteration` now decrements `total_dead_nodes` by an explicit `dead_count`, which can never include pinned-but-current entries.
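
A sketch of the new classification inside `PurgeIteration`; the loop shape is illustrative, while `IsDeadNode`, `purge_nodes`, `actually_dequeued`, `alive_nodes_to_reenqueue` and `total_dead_nodes` are the names used elsewhere in this description:

```cpp
// Classify each dequeued node; only genuinely dead nodes count as dead.
idx_t dead_count = 0;
for (idx_t i = 0; i < actually_dequeued; i++) {
	auto &node = purge_nodes[i];
	if (node.IsDeadNode()) {
		dead_count++; // expired weak_ptr or stale sequence number: drop it
	} else {
		// alive (possibly pinned): hold aside, re-enqueued once after the loop
		alive_nodes_to_reenqueue.push_back(std::move(node));
	}
}
// Decrement by dead_count, never by (dequeued - alive), so pinned-but-current
// nodes no longer corrupt the counter.
total_dead_nodes -= dead_count;
```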
3. Multi-iteration purge churned the purge thread's own moodycamel sub-queue

This is the most impactful bug. `moodycamel::ConcurrentQueue` is not a single FIFO. Internally it maintains one sub-queue per producer thread (implicit producers are created lazily on first enqueue from a given thread). `try_dequeue_bulk` walks the producer list and tends to drain the most recently used sub-queues first.

The previous purge loop re-enqueued alive nodes between iterations:
```cpp
// PurgeIteration (old)
q.enqueue_bulk(purge_nodes.begin(), alive_nodes);
```
That enqueue happens on the purge thread, which in nearly all interesting scenarios was not previously a producer for this queue (producers are worker threads). It creates an implicit-producer sub-queue owned by the purge thread, and on the next `try_dequeue_bulk` call moodycamel preferentially re-drains it — pulling back the very alive nodes the purge just inserted, instead of reaching dead entries sitting in worker producer sub-queues. Under sustained pressure the loop made little to no real progress: it kept "purging" the same alive nodes round and round while real dead entries accumulated elsewhere.

Fix: accumulate alive nodes into a local `vector` across the entire `Purge()` loop and re-enqueue them once at the end. The early-out queue-size estimate is compensated with the size of the held-aside vector so the early-out check still reflects the true logical queue size:

```cpp
vector<BufferEvictionNode> alive_nodes_to_reenqueue;
while (max_purges != 0) {
	PurgeIteration(purge_size, alive_nodes_to_reenqueue);
	approx_q_size = q.size_approx() + alive_nodes_to_reenqueue.size();
	// ... ratio / early-out checks unchanged ...
	max_purges--;
}
if (!alive_nodes_to_reenqueue.empty()) {
	q.enqueue_bulk(alive_nodes_to_reenqueue.begin(), alive_nodes_to_reenqueue.size());
}
```
This guarantees that each purge iteration sees fresh entries from worker sub-queues instead of bouncing the purge thread's own buffer.
How they interact
Bugs 1 and 2 push `total_dead_nodes` in opposite directions, masking each other intermittently and making the symptom hard to trace. Bug 3 means even when the ratio check correctly demands more aggressive purging, the loop fails to actually purge.
Files changed
- `src/include/duckdb/storage/buffer/buffer_pool.hpp` — declare `BufferEvictionNode::IsDeadNode`.
- `src/storage/buffer/buffer_pool.cpp` — implement `IsDeadNode`, restructure `Purge`/`PurgeIteration` to defer re-enqueue.
- `src/storage/buffer/block_handle.cpp` — drop `GetBuffer()` guard before `IncrementDeadNodes`.

Test plan
- `make reldebug` builds cleanly.
- `build/reldebug/test/unittest` passes (full unit suite).
- `build/reldebug/test/unittest "[buffermanager]"` passes.
- …`total_dead_nodes` unboundedly between purges.
- …`q.size_approx()` between iterations in a stress harness).

Notes for reviewers
- `BufferEvictionNode::IsDeadNode()` is intentionally not `const` because `weak_ptr::lock()` is a mutating operation only in spirit — happy to mark it `const` if preferred (it would still compile since `lock()` is `const` on the weak_ptr; see the sketch below).
- The alive-node `vector` is local to a single `Purge()` invocation and bounded by the total size of dequeued nodes, so memory overhead is at most one `purge_size` worth of entries beyond what `purge_nodes` already held.
- `TryDequeueWithLock` on the same queue is gated by `purge_lock` (held for the entire `Purge` call), so no observer can see the queue mid-purge.
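
For reference, the `const`-qualified variant mentioned in the first note would look like this and compile unchanged, since `weak_ptr::lock()` is `const`:

```cpp
bool BufferEvictionNode::IsDeadNode() const {
	auto shared_memory_p = memory_p.lock(); // lock() is const on weak_ptr
	if (!shared_memory_p) {
		return true; // the BlockMemory is gone
	}
	// stale sequence number: a newer queue entry superseded this one
	return handle_sequence_number != shared_memory_p->GetEvictionSequenceNumber();
}
```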