Skip to content

Conversation

@gouhongshen
Copy link
Contributor

@gouhongshen gouhongshen commented Dec 8, 2025

User description

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue ##13959

What this PR does / why we need it:

fix block filter EQ, IN, ...


PR Type

Bug fix


Description

  • Fix block filter search functions to correctly handle duplicate values

  • Add early return checks for empty input vectors or value sets

  • Refactor binary search logic to find all matching occurrences, not just first

  • Improve handling of consecutive duplicate values in search results


Diagram Walkthrough

flowchart LR
  A["Search Functions<br/>OrderedBinarySearch<br/>VarlenBinarySearch<br/>FixedSizedBinarySearch"] -->|"Add empty input checks"| B["Early Return<br/>for empty data"]
  A -->|"Refactor duplicate handling"| C["Find All Matches<br/>for each value"]
  C -->|"Collect consecutive<br/>duplicates"| D["Return Complete<br/>Result Set"]
  E["Test Suite"] -->|"Verify correctness"| D
Loading

File Walkthrough

Relevant files
Bug fix
search.go
Fix duplicate value handling in block filter searches       

pkg/container/vector/search.go

  • Add empty input validation to OrderedBinarySearchOffsetByValFactory,
    VarlenBinarySearchOffsetByValFactory, and
    FixedSizedBinarySearchOffsetByValFactory
  • Refactor small set binary search to find all consecutive duplicate
    matches instead of just first occurrence
  • Refactor large set merge-based search to collect all matching rows for
    each search value
  • Optimize search by skipping duplicate search values and tracking
    position ranges
+120/-56
Tests
search_test.go
Add comprehensive tests for search functions                         

pkg/container/vector/search_test.go

  • Add comprehensive test for varlen binary search with small duplicate
    value set
  • Add large dataset test with 400 elements and duplicate search targets
  • Add empty input edge case tests for varlen search
  • Add benchmark tests for varlen search with various hit rates
  • Add tests for ordered and fixed-sized binary search with duplicate
    values
+292/-0 

@qodo-code-review
Copy link

qodo-code-review bot commented Dec 8, 2025

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
🟡
🎫 #13959
🟢 Ensure filtering works for both fixed-size (ordered) and varlen columns with potential
duplicate values.
Handle edge cases like empty inputs to avoid unnecessary processing or panics.
Improve performance characteristics for large value sets while maintaining correctness.
Optimize reader to support filter on unsorted block so that equality and IN predicates can
prune/locate rows even when data is not sorted by fake/cluster key.
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
No Auditing: The new search functions add logic but do not emit any audit logs for critical actions,
which may be acceptable for low-level utilities but cannot be verified from the diff
alone.

Referred Code
func OrderedBinarySearchOffsetByValFactory[T types.OrderedT](vals []T) func(*Vector) []int64 {
	return func(vec *Vector) []int64 {
		var sels []int64
		rows := MustFixedColNoTypeCheck[T](vec)
		if len(rows) == 0 || len(vals) == 0 {
			return sels
		}
		subVals := vals
		if len(vals) >= kMinLenForSubVector {
			minVal := rows[0]
			maxVal := rows[len(rows)-1]
			lowerBound := sort.Search(len(vals), func(i int) bool {
				return minVal <= vals[i]
			})
			upperBound := sort.Search(len(vals), func(i int) bool {
				return maxVal < vals[i]
			})
			subVals = vals[lowerBound:upperBound]
			if len(subVals) == 0 {
				return sels
			}


 ... (clipped 208 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Input Assumptions: Functions assume sorted inputs and trust vector contents without explicit validation or
bounds checks beyond length checks, which may be acceptable internally but cannot be fully
verified from the diff.

Referred Code
func OrderedBinarySearchOffsetByValFactory[T types.OrderedT](vals []T) func(*Vector) []int64 {
	return func(vec *Vector) []int64 {
		var sels []int64
		rows := MustFixedColNoTypeCheck[T](vec)
		if len(rows) == 0 || len(vals) == 0 {
			return sels
		}
		subVals := vals
		if len(vals) >= kMinLenForSubVector {
			minVal := rows[0]
			maxVal := rows[len(rows)-1]
			lowerBound := sort.Search(len(vals), func(i int) bool {
				return minVal <= vals[i]
			})
			upperBound := sort.Search(len(vals), func(i int) bool {
				return maxVal < vals[i]
			})
			subVals = vals[lowerBound:upperBound]
			if len(subVals) == 0 {
				return sels
			}


 ... (clipped 208 lines)

Learn more about managing compliance generic rules or creating your own custom rules

  • Update
Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-code-review
Copy link

qodo-code-review bot commented Dec 8, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
High-level
PR and ticket descriptions are misaligned

The PR's code fixes binary search on sorted data, but the linked ticket
describes a problem with filtering on unsorted data. This discrepancy should be
clarified by updating the ticket to match the PR's actual changes for better
maintainability.

Examples:

Solution Walkthrough:

Before:

// PR Code in pkg/container/vector/search.go
// Fixes binary search for duplicate values in *sorted* data.
func OrderedBinarySearchOffsetByValFactory[T types.OrderedT](vals []T) func(*Vector) []int64 {
  // ... logic to find all occurrences of `vals` in a sorted vector
}

// Ticket #13959 Description
// Title: optimize reader to support filer on *unsorted* block
// Description: "t1 is never sorted by the fake key and right now the reader will not apply any input filter"

After:

// PR Code in pkg/container/vector/search.go
// The code is likely correct for its intended purpose and may not need changes.
func OrderedBinarySearchOffsetByValFactory[T types.OrderedT](vals []T) func(*Vector) []int64 {
  // ... logic to find all occurrences of `vals` in a sorted vector
}

// Updated Ticket #13959 Description (or new ticket)
// Title: Fix block filter (EQ, IN) to correctly handle duplicate values in sorted blocks
// Description: "The binary search functions used by the block filter did not correctly return all indices for search values that have duplicates in the data block. This PR fixes the search logic to find all occurrences."

Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a critical misalignment between the PR's code, which fixes binary search on sorted data, and the linked ticket's goal of filtering on unsorted data, which severely impacts project maintainability and traceability.

Medium
Possible issue
Correct assertion for empty slice result

In TestVarlenBinarySearchOffsetByValFactoryEmptyInputs, change the test
assertion from require.Nil to require.Empty to correctly reflect that the
function returns an empty slice, not a nil one.

pkg/container/vector/search_test.go [81-86]

 fn := VarlenBinarySearchOffsetByValFactory(nil)
-require.Nil(t, fn(vec))
+require.Empty(t, fn(vec))
 
 require.NoError(t, AppendBytes(vec, []byte("a"), false, mp))
 fn = VarlenBinarySearchOffsetByValFactory(nil)
-require.Nil(t, fn(vec))
+require.Empty(t, fn(vec))
  • Apply / Chat
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that the test's use of require.Nil is inconsistent with the function's behavior, which returns an empty slice, not a nil one. Correcting the assertion to require.Empty improves test accuracy and robustness.

Medium
  • Update

@mergify mergify bot added the kind/bug Something isn't working label Dec 8, 2025
@heni02 heni02 merged commit 58944ac into matrixorigin:v3.0.4-hotfix Dec 8, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Something isn't working Review effort 3/5

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants