fix: reject source-superset schema mismatch at SegmentBuilder#9110
Conversation
execute_optimization captures `target_config` from the optimizer's frozen config and then wraps source segments in proxies. Between those two points, `CollectionUpdater::update` can apply a `CreateVectorName(V)` to the source segments via `apply_segments`, leaving the optimizer with sources that have V but a target_config that does not. The optimization then produces a merged segment without V, and a follow-up optimization (running with the refreshed config that includes V) fails to use that segment as a source: "Cannot update from other segment because it is missing vector name X". Close the race by extending the scope of the existing `LockedSegmentHolder::acquire_updates_lock` to cover the proxy install window. `CollectionUpdater::update` already takes this lock before processing any shard update, so concurrent writers wait until proxies are in place — at which point further mutations hit the proxies (recorded as intent and propagated to the merged segment in `finish_optimization`) instead of the originals. The guard is dropped right after proxy install so the slow build phase does not extend it. Tests: - Three `SegmentBuilder::update` tests document the precondition the lock now guarantees: with a target schema that adds a named vector the source lacks, update errors with "missing vector name X". Quantized and mixed-source variants exercise the same error path. - `test_optimize_blocks_proxy_install_on_updates_lock` asserts the invariant directly: while the updates lock is held, proxies are not yet installed. Verified to fail when the new guard is removed (otherwise it passes because `finish_optimization` also takes the same lock, so a naive "did optimize finish?" check would not catch a missing proxy-install guard). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds vector schema validation to SegmentBuilder::update: it first scans source segments' vector storage keys and aborts with OperationError::cancelled if any source includes a vector name missing from the target schema. It also changes per-vector lookup failures (missing vector on a source) to return OperationError::cancelled with a retry-after-optimizer-config-refresh message. Two integration tests cover rejection when the target has an extra vector and when the source has an extra vector. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
lib/segment/tests/integration/segment_builder_test.rs (1)
674-674: ⚡ Quick winUse explicit point-id conversion at Line 674.
Replace the implicit
.into()with an explicit type conversion to match repository Rust style in new code.Suggested diff
- .upsert_point(10 + i, (100 + i).into(), vectors, &hw_counter) + .upsert_point( + 10 + i, + segment::types::PointIdType::from(100 + i), + vectors, + &hw_counter, + )As per coding guidelines: "
**/*.rs: Prefer explicitSomeType::from(x)over implicitx.into()in Rust".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/segment/tests/integration/segment_builder_test.rs` at line 674, The call to upsert_point uses an implicit conversion `(100 + i).into()` for the point id; replace that with an explicit from conversion using the crate's point-id type (e.g., PointId::from(100 + i) or the actual PointId type used in this module) so the call becomes upsert_point(10 + i, PointIdType::from(100 + i), vectors, &hw_counter); update the type name to the concrete point-id type used in this crate and ensure imports/paths are adjusted accordingly.lib/collection/src/collection_manager/optimizers/indexing_optimizer.rs (1)
827-831: 💤 Low valueConsider adding a comment about timing sensitivity for future maintainers.
The 250ms sleep is a reasonable approach for this concurrency test, but could be flaky on slow or heavily-loaded CI runners. If this test becomes unreliable, consider either increasing the sleep duration or adding a brief comment noting the timing dependency.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/collection/src/collection_manager/optimizers/indexing_optimizer.rs` around lines 827 - 831, Add a short explanatory comment above the std::thread::sleep(Duration::from_millis(250)) in the concurrency test that explains this 250ms wait is timing-sensitive and may be flaky on slow CI, and suggest options for future maintainers (e.g., increase the duration, replace with a synchronization primitive, or tune for CI). Reference the optimize_done atomic and the assert! that follows (which checks optimize_done.load(Ordering::SeqCst)) so readers know the sleep exists to avoid the race that would make the assert flaky.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@lib/collection/src/collection_manager/optimizers/indexing_optimizer.rs`:
- Around line 827-831: Add a short explanatory comment above the
std::thread::sleep(Duration::from_millis(250)) in the concurrency test that
explains this 250ms wait is timing-sensitive and may be flaky on slow CI, and
suggest options for future maintainers (e.g., increase the duration, replace
with a synchronization primitive, or tune for CI). Reference the optimize_done
atomic and the assert! that follows (which checks
optimize_done.load(Ordering::SeqCst)) so readers know the sleep exists to avoid
the race that would make the assert flaky.
In `@lib/segment/tests/integration/segment_builder_test.rs`:
- Line 674: The call to upsert_point uses an implicit conversion `(100 +
i).into()` for the point id; replace that with an explicit from conversion using
the crate's point-id type (e.g., PointId::from(100 + i) or the actual PointId
type used in this module) so the call becomes upsert_point(10 + i,
PointIdType::from(100 + i), vectors, &hw_counter); update the type name to the
concrete point-id type used in this crate and ensure imports/paths are adjusted
accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ff67ebad-d708-433e-935b-e16ce79f06ef
📒 Files selected for processing (3)
lib/collection/src/collection_manager/optimizers/indexing_optimizer.rslib/segment/tests/integration/segment_builder_test.rslib/shard/src/optimize.rs
Address review of #9110: 1. `finish_optimization` was acquiring `upgradable_read` before `acquire_updates_lock`, while the new guard at the start of `execute_optimization` acquires them in the reverse order. With two optimizer threads in flight, thread A in `finish_optimization` could hold `upgradable_read` and wait on `updates_lock` while thread B at the top of `execute_optimization` held `updates_lock` and waited on `upgradable_read` (parking_lot allows only one upgradable reader), deadlocking. Swap `finish_optimization` to take `updates_lock` first so both halves agree. 2. Drop the quantized and mixed-source variants of the inverted `SegmentBuilder` unit test — all three asserted the same error path (the mismatch check fires before quantization training or per-source branching), so only one is useful as documentation of the precondition. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/shard/src/optimize.rs (1)
695-705:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftDon't hold
updates_lockacross preflight work and callbacks.Line 704 now blocks collection updates until Line 812, and that window includes
check_segments_size()plus the caller-providedon_successful_start(). That widens the "serialize proxy install" critical section to arbitrary I/O/callback latency, and the callback can deadlock if it re-enters a path that also needs the same lock. Please narrow the locked region to the schema/COW/proxy-install steps, or at least move the callback outside the guard.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/shard/src/optimize.rs` around lines 695 - 705, The updates_lock (held via segment_holder.acquire_updates_lock() in proxy_install_guard) is being held too long — across check_segments_size() and the caller-provided on_successful_start() — which can block CollectionUpdater::update and cause deadlocks; release the proxy_install_guard immediately after performing only the schema/COW/proxy-install steps that must be serialized (the actual proxy creation/installation and any copy-on-write adjustments), and move check_segments_size() and the on_successful_start() callback to execute after the guard is dropped so the critical section only encloses the minimal proxy-installation work (ensure finish_optimization behavior remains unchanged).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@lib/shard/src/optimize.rs`:
- Around line 695-705: The updates_lock (held via
segment_holder.acquire_updates_lock() in proxy_install_guard) is being held too
long — across check_segments_size() and the caller-provided
on_successful_start() — which can block CollectionUpdater::update and cause
deadlocks; release the proxy_install_guard immediately after performing only the
schema/COW/proxy-install steps that must be serialized (the actual proxy
creation/installation and any copy-on-write adjustments), and move
check_segments_size() and the on_successful_start() callback to execute after
the guard is dropped so the critical section only encloses the minimal
proxy-installation work (ensure finish_optimization behavior remains unchanged).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0f8950a4-7095-49fa-8d17-3ca5c2c9ae85
📒 Files selected for processing (2)
lib/segment/tests/integration/segment_builder_test.rslib/shard/src/optimize.rs
💤 Files with no reviewable changes (1)
- lib/segment/tests/integration/segment_builder_test.rs
Drop the lock approach (deadlocked test_continuous_snapshot) and fix the bug at the merge layer instead. Snapshot's proxy_all_segments_and_apply acquires the segment_holder upgradable_read first and then takes acquire_updates_lock tactically inside the snapshot operation. The previous commits' lock-extension acquired updates_lock before upgradable_read, so a snapshot in flight and an optimization just entering execute_optimization could deadlock holding each other's required next lock. Snapshot cannot easily reverse its order — that would hold updates_lock for the entire snapshot duration, blocking all writes. Move the fix to where the actual harm happens: SegmentBuilder::update iterates the target's vector_data and silently drops source vectors that aren't in target. That silent drop is what produces the broken merged segment in the CreateVectorName-vs-optimizer race. Add a check that every source vector name is in the target schema; the optimization aborts cleanly on mismatch and the next round (with refreshed config) merges correctly. This is strictly stronger than the lock: the lock only closed the window where V arrived *during* the proxy-install region. The schema check catches both that window and the window where V's apply_segments completed before the optimizer's lock acquisition. Diff is contained to lib/segment; no locking changes, no cross-crate plumbing. test_continuous_snapshot passes again. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
lib/segment/tests/integration/segment_builder_test.rs (1)
625-625: ⚡ Quick winUse explicit
fromconversion instead of.into()for point IDs.Line 625 uses implicit
.into()when converting the point ID. Replace with explicitExtendedPointId::from(100 + i)for consistency with the coding guidelines.Proposed change
- .upsert_point(10 + i, (100 + i).into(), vectors, &hw_counter) + .upsert_point(10 + i, segment::types::ExtendedPointId::from(100 + i), vectors, &hw_counter)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/segment/tests/integration/segment_builder_test.rs` at line 625, Replace the implicit .into() conversion used for point IDs in the upsert call with an explicit ExtendedPointId::from(...) to follow coding guidelines: locate the upsert_point call (the .upsert_point(10 + i, (100 + i).into(), vectors, &hw_counter) invocation) and change the second argument to ExtendedPointId::from(100 + i) so the point ID conversion is explicit and consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@lib/segment/tests/integration/segment_builder_test.rs`:
- Line 625: Replace the implicit .into() conversion used for point IDs in the
upsert call with an explicit ExtendedPointId::from(...) to follow coding
guidelines: locate the upsert_point call (the .upsert_point(10 + i, (100 +
i).into(), vectors, &hw_counter) invocation) and change the second argument to
ExtendedPointId::from(100 + i) so the point ID conversion is explicit and
consistent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 18900485-9dfc-4aa8-ab2b-2f21d423694e
📒 Files selected for processing (2)
lib/segment/src/segment_constructor/segment_builder.rslib/segment/tests/integration/segment_builder_test.rs
…a mismatch ServiceError flips the shard to RED status (via `report_optimizer_error` → `segments.optimizer_errors`) and stays sticky until the next `recreate_optimizers_blocking` clears it. That's the right shape for hardware/IO failures but wrong for the schema-mismatch case here, which is an expected, recoverable race outcome — the next optimizer round with a refreshed target_config merges the same originals cleanly. `Cancelled` is the variant the optimization worker treats as a recoverable cancellation: logged at debug, tracker marked Cancelled, no `report_optimizer_error` call, no RED status. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rset error The existing "missing vector name" check at the start of the merge loop also fires during a race — specifically the optimizer-vs-DeleteVectorName shape, where V is removed from originals before J wraps proxies but J's frozen target_config still has V. Like the new source-superset check, this is an expected, recoverable race outcome, so use Cancelled instead of ServiceError to avoid flipping the shard to RED. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: serialize optimization proxy install against shard updates execute_optimization captures `target_config` from the optimizer's frozen config and then wraps source segments in proxies. Between those two points, `CollectionUpdater::update` can apply a `CreateVectorName(V)` to the source segments via `apply_segments`, leaving the optimizer with sources that have V but a target_config that does not. The optimization then produces a merged segment without V, and a follow-up optimization (running with the refreshed config that includes V) fails to use that segment as a source: "Cannot update from other segment because it is missing vector name X". Close the race by extending the scope of the existing `LockedSegmentHolder::acquire_updates_lock` to cover the proxy install window. `CollectionUpdater::update` already takes this lock before processing any shard update, so concurrent writers wait until proxies are in place — at which point further mutations hit the proxies (recorded as intent and propagated to the merged segment in `finish_optimization`) instead of the originals. The guard is dropped right after proxy install so the slow build phase does not extend it. Tests: - Three `SegmentBuilder::update` tests document the precondition the lock now guarantees: with a target schema that adds a named vector the source lacks, update errors with "missing vector name X". Quantized and mixed-source variants exercise the same error path. - `test_optimize_blocks_proxy_install_on_updates_lock` asserts the invariant directly: while the updates lock is held, proxies are not yet installed. Verified to fail when the new guard is removed (otherwise it passes because `finish_optimization` also takes the same lock, so a naive "did optimize finish?" check would not catch a missing proxy-install guard). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(optimize): finish_optimization lock order; drop redundant tests Address review of #9110: 1. `finish_optimization` was acquiring `upgradable_read` before `acquire_updates_lock`, while the new guard at the start of `execute_optimization` acquires them in the reverse order. With two optimizer threads in flight, thread A in `finish_optimization` could hold `upgradable_read` and wait on `updates_lock` while thread B at the top of `execute_optimization` held `updates_lock` and waited on `upgradable_read` (parking_lot allows only one upgradable reader), deadlocking. Swap `finish_optimization` to take `updates_lock` first so both halves agree. 2. Drop the quantized and mixed-source variants of the inverted `SegmentBuilder` unit test — all three asserted the same error path (the mismatch check fires before quantization training or per-source branching), so only one is useful as documentation of the precondition. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: reject source-superset schema mismatch at SegmentBuilder Drop the lock approach (deadlocked test_continuous_snapshot) and fix the bug at the merge layer instead. Snapshot's proxy_all_segments_and_apply acquires the segment_holder upgradable_read first and then takes acquire_updates_lock tactically inside the snapshot operation. The previous commits' lock-extension acquired updates_lock before upgradable_read, so a snapshot in flight and an optimization just entering execute_optimization could deadlock holding each other's required next lock. Snapshot cannot easily reverse its order — that would hold updates_lock for the entire snapshot duration, blocking all writes. Move the fix to where the actual harm happens: SegmentBuilder::update iterates the target's vector_data and silently drops source vectors that aren't in target. That silent drop is what produces the broken merged segment in the CreateVectorName-vs-optimizer race. Add a check that every source vector name is in the target schema; the optimization aborts cleanly on mismatch and the next round (with refreshed config) merges correctly. This is strictly stronger than the lock: the lock only closed the window where V arrived *during* the proxy-install region. The schema check catches both that window and the window where V's apply_segments completed before the optimizer's lock acquisition. Diff is contained to lib/segment; no locking changes, no cross-crate plumbing. test_continuous_snapshot passes again. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(segment_builder): use Cancelled instead of ServiceError for schema mismatch ServiceError flips the shard to RED status (via `report_optimizer_error` → `segments.optimizer_errors`) and stays sticky until the next `recreate_optimizers_blocking` clears it. That's the right shape for hardware/IO failures but wrong for the schema-mismatch case here, which is an expected, recoverable race outcome — the next optimizer round with a refreshed target_config merges the same originals cleanly. `Cancelled` is the variant the optimization worker treats as a recoverable cancellation: logged at debug, tracker marked Cancelled, no `report_optimizer_error` call, no RED status. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(segment_builder): also use Cancelled for the existing target-superset error The existing "missing vector name" check at the start of the merge loop also fires during a race — specifically the optimizer-vs-DeleteVectorName shape, where V is removed from originals before J wraps proxies but J's frozen target_config still has V. Like the new source-superset check, this is an expected, recoverable race outcome, so use Cancelled instead of ServiceError to avoid flipping the shard to RED. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Alternative to #9053.
Bug
SegmentBuilder::updateiterates the target'svector_dataand ignores any source vector not in target. Combined with the optimizer-vs-CreateVectorName(V)race, this silently drops V's data and produces a merged segment without V at version ≥ V_opnum.The next optimization round picks that segment as a source and fails with
ServiceError"missing vector name V", flipping the shard to RED until manual intervention. The symmetricDeleteVectorNamerace triggers the existing "missing vector name" check, also asServiceError.What this PR does
SegmentBuilder::updateso the corrupting merge cannot occur.OperationError::Cancelledinstead ofServiceError.Behavior change
CreateVectorName(V)mid-optimizationServiceError→ RED status, stuck.Cancelledbefore C is produced. Debug log, no RED. Next round with refreshed config merges cleanly.DeleteVectorName(V)mid-optimizationServiceError→ RED, stuck.Cancelled. Debug log, no RED. Next round merges cleanly.What this does not do
target_configcan still diverge from source segments. We catch the divergence at the merge boundary instead of letting it produce a broken segment.Trade-off vs #9053: this PR rejects the mismatch and relies on
recreate_optimizers_blockingto make the retry succeed. Smaller surface, no new code path in the builder. If field damage from past versions is a concern, #9053 (or its tolerance behavior on top of this PR as a one-release heal-on-upgrade) is the better fit.Why not a lock (Tim's suggestion in #9053)
Tried it; reverted. Two reasons:
apply_segmentscompleted before the optimizer's lock acquisition.test_continuous_snapshot.snapshot::proxy_all_segments_and_applyholdsupgradable_readfor the snapshot duration and only takesacquire_updates_locktactically inside; the lock extension required the opposite order. Reversing snapshot's order would holdupdates_lockfor the entire snapshot, blocking writes — not viable.The schema check catches both race windows and has no lock-order surface.
Tests
test_segment_builder_rejects_target_with_extra_vector_name— target has a vector source lacks (existing path, included for symmetry).test_segment_builder_rejects_source_with_extra_vector_name— the new check.Test plan
cargo check --workspace --all-targetscargo test -p segment --test integration(122/122)cargo test -p shard --lib(72/72)cargo test -p collection --lib(197/197)cargo test -p collection --test integration test_continuous_snapshot(was deadlocked with the lock approach; passes now)🤖 Generated with Claude Code