Skip to content

fix: use per-block filter instead of per-segment filter during query#223

Merged
egolearner merged 7 commits intomainfrom
fix/id_conversion
Mar 16, 2026
Merged

fix: use per-block filter instead of per-segment filter during query#223
egolearner merged 7 commits intomainfrom
fix/id_conversion

Conversation

@zhourrr
Copy link
Collaborator

@zhourrr zhourrr commented Mar 13, 2026

Greptile Summary

This PR fixes a correctness bug in multi-block vector search: when a segment's vector index is split across multiple blocks, each block's internal doc IDs are block-local (starting from 0), but the incoming IndexFilter operates on segment-level IDs. Previously, the segment-level filter was passed as-is to every block's indexer, causing incorrect filtering results for blocks beyond the first. The fix introduces a BlockOffsetFilter wrapper that adds a block's starting offset to any incoming block-local ID before delegating to the inner filter, correctly translating block-local → segment-level IDs.

Key changes:

  • combined_vector_column_indexer.cc: For each block with offset > 0, wraps query_params.filter in a BlockOffsetFilter; block 0 continues to use the filter directly (offset is always 0 so no translation is needed, avoiding unnecessary pointer indirection).
  • combined_vector_column_indexer.h: Introduces BlockOffsetFilter as a protected inner class of CombinedVectorColumnIndexer, with clear documentation of its purpose.
  • segment.cc: Improves comments on member variables to clarify which ID space (segment-local vs. block-local) each component uses.
  • recall_base.h: Adds options.max_buffer_size_ = 256 * 1024 to force multi-block creation during tests, ensuring all VectorRecallTest cases now exercise the multi-block code path.
  • vector_recall_test.cc: Adds DeleteFilter and HybridInvertForwardDeleteFilter tests that verify correct filter application across multiple blocks after bulk deletes. Note that HybridInvertForwardDeleteFilter explicitly depends on DeleteFilter having run first (shared static segment state), as documented in the test comment — this fragility was flagged in earlier review rounds.

Confidence Score: 4/5

  • This PR is safe to merge; the core fix is logically correct and well-covered by new tests, with only minor pre-existing test-design concerns remaining.
  • The BlockOffsetFilter translation logic is correct: block 0 (offset 0) passes the original filter unchanged, while subsequent blocks wrap it to add their base offset before delegating. The fix is minimal, targeted, and directly validated by the two new end-to-end tests. The max_buffer_size_ reduction makes all existing filter tests implicitly exercise the multi-block path as well. The one point deducted is for the inter-test state dependency in vector_recall_test.cc (previously flagged) — HybridInvertForwardDeleteFilter relies on DeleteFilter having run first, which makes the suite fragile under test reordering.
  • tests/db/sqlengine/vector_recall_test.cc — the DeleteFilter/HybridInvertForwardDeleteFilter ordering dependency warrants attention if the test suite grows or shuffle mode is enabled.

Important Files Changed

Filename Overview
src/db/index/column/vector_column/combined_vector_column_indexer.cc Core fix: adds per-block BlockOffsetFilter wrapping to translate block-local IDs to segment-level IDs before delegating to the inner filter during multi-block vector search. Logic is sound; block 0 (offset == 0) skips the wrapper as an optimisation. The per_block_filter is still constructed unconditionally even when query_params.filter is null (previously flagged; developer intentionally kept this structure).
src/db/index/column/vector_column/combined_vector_column_indexer.h Adds BlockOffsetFilter as a protected inner class (per previous review feedback); removes the now-redundant #include "vector_index_results.h" (safely provided transitively via vector_column_indexer.h); adds #include "db/index/common/index_filter.h" required by the new class.
src/db/index/segment/segment.cc Documentation-only: clarifies ID-space semantics of member variables (segment-local vs. block-local) in private member comments. No logic changes.
tests/db/sqlengine/recall_base.h Adds options.max_buffer_size_ = 256 * 1024 to force multi-block creation during test suite setup, enabling all VectorRecallTest cases — including the new delete-filter tests — to exercise the multi-block code path that this PR fixes.
tests/db/sqlengine/vector_recall_test.cc Adds DeleteFilter and HybridInvertForwardDeleteFilter tests that validate the per-block filter fix. The two tests share mutable state through the static segments_ fixture — HybridInvertForwardDeleteFilter explicitly depends on DeleteFilter having run first, making the suite order-sensitive (previously flagged).

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Search called with query_params.filter] --> B{query_params.filter != nullptr?}

    B -- No --> C[filter = nullptr\npass-through to indexer]

    B -- Yes --> D{block_offsets_i > 0?}

    D -- No\nBlock 0\noffset == 0 --> E[filter = query_params.filter\nblock-local ID == segment-level ID\nno translation needed]

    D -- Yes\nBlock N\noffset > 0 --> F[filter = &BlockOffsetFilter\ninner_filter_ = query_params.filter\noffset_ = block_offsets_i]

    F --> G[BlockOffsetFilter::is_filtered\nblock_local_id + offset\n→ segment_level_id\n→ inner_filter_->is_filtered]

    E --> H[indexers_i->Search\nwith per-block filter]
    C --> H
    G --> H

    H --> I[Translate block-local doc IDs\nback to segment-level:\ndoc.key += block_offsets_i]

    I --> J[Merge & sort\nall block results\nby score]

    J --> K[Truncate to topk\nreturn VectorIndexResults]
Loading

Last reviewed commit: f774e03

@zhourrr zhourrr requested review from chinaux and egolearner March 13, 2026 07:37
@zhourrr
Copy link
Collaborator Author

zhourrr commented Mar 13, 2026

@greptile

@zhourrr zhourrr force-pushed the fix/id_conversion branch from 26e7b41 to 1aeb2e6 Compare March 13, 2026 09:38
@zhourrr
Copy link
Collaborator Author

zhourrr commented Mar 13, 2026

@greptile

@zhourrr
Copy link
Collaborator Author

zhourrr commented Mar 13, 2026

@greptile

egolearner
egolearner previously approved these changes Mar 16, 2026
Copy link
Collaborator

@egolearner egolearner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zhourrr
Copy link
Collaborator Author

zhourrr commented Mar 16, 2026

@greptile

@egolearner egolearner merged commit 9314cb1 into main Mar 16, 2026
10 checks passed
@egolearner egolearner deleted the fix/id_conversion branch March 16, 2026 02:55
kdy1 added a commit to ZephyrCloudIO/zvec that referenced this pull request Mar 16, 2026
## Summary
Rebase   onto  (), then re-apply Zephyr-only commits on top.

## Upstream commits included (from )
- fix: use per-block filter instead of per-segment filter during query
(alibaba#223)
-  fix: minor typo (alibaba#225)
-  fix/fix mips euclidean (alibaba#226)
-  feat: buildwheel in ghrunner (alibaba#221)
-  minor: add deepwiki badges (alibaba#228)
-  feat: enlarge indice size limit for sparse vectors (alibaba#229)

## Zephyr commits re-applied
-  chore: Add C bindings (#1)
- fix(locking): prevent lock fd inheritance and add read-only C open
(#2)

## Commit mapping (same patch, new SHA)
-  => 
-  => 

## Safety / rollback
- Backup branch: 
- Previous  tip: 

## Reviewer notes
This PR primarily aligns history with upstream and brings in upstream
changes. Please focus review on integration points and CI stability.

---------

Co-authored-by: Qinren Zhou <zhouqinren.zqr@alibaba-inc.com>
Co-authored-by: Abdur-Rahmaan Janhangeer <cryptolabour@gmail.com>
Co-authored-by: rayx <rui.xing@alibaba-inc.com>
Co-authored-by: Cuiys <cuiyushuai.cys@alibaba-inc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants