fix: use per-block filter instead of per-segment filter during query #223
Merged
egolearner merged 7 commits into main on Mar 16, 2026
Author (Collaborator): @greptile
egolearner
reviewed
Mar 13, 2026
26e7b41 to
1aeb2e6
src/db/index/column/vector_column/combined_vector_column_indexer.h
Outdated
egolearner
approved these changes
Mar 16, 2026
kdy1 added a commit to ZephyrCloudIO/zvec that referenced this pull request on Mar 16, 2026
## Summary
Rebase onto (), then re-apply Zephyr-only commits on top.

## Upstream commits included (from )
- fix: use per-block filter instead of per-segment filter during query (alibaba#223)
- fix: minor typo (alibaba#225)
- fix/fix mips euclidean (alibaba#226)
- feat: buildwheel in ghrunner (alibaba#221)
- minor: add deepwiki badges (alibaba#228)
- feat: enlarge indice size limit for sparse vectors (alibaba#229)

## Zephyr commits re-applied
- chore: Add C bindings (#1)
- fix(locking): prevent lock fd inheritance and add read-only C open (#2)

## Commit mapping (same patch, new SHA)
- =>
- =>

## Safety / rollback
- Backup branch:
- Previous tip:

## Reviewer notes
This PR primarily aligns history with upstream and brings in upstream changes. Please focus review on integration points and CI stability.

---------

Co-authored-by: Qinren Zhou <zhouqinren.zqr@alibaba-inc.com>
Co-authored-by: Abdur-Rahmaan Janhangeer <cryptolabour@gmail.com>
Co-authored-by: rayx <rui.xing@alibaba-inc.com>
Co-authored-by: Cuiys <cuiyushuai.cys@alibaba-inc.com>
Greptile Summary
This PR fixes a correctness bug in multi-block vector search: when a segment's vector index is split across multiple blocks, each block's internal doc IDs are block-local (starting from 0), but the incoming `IndexFilter` operates on segment-level IDs. Previously, the segment-level filter was passed as-is to every block's indexer, causing incorrect filtering results for blocks beyond the first. The fix introduces a `BlockOffsetFilter` wrapper that adds a block's starting offset to any incoming block-local ID before delegating to the inner filter, correctly translating block-local → segment-level IDs.

Key changes:

- `combined_vector_column_indexer.cc`: For each block with `offset > 0`, wraps `query_params.filter` in a `BlockOffsetFilter`; block 0 continues to use the filter directly (its offset is always 0, so no translation is needed and the extra pointer indirection is avoided).
- `combined_vector_column_indexer.h`: Introduces `BlockOffsetFilter` as a `protected` inner class of `CombinedVectorColumnIndexer`, with clear documentation of its purpose.
- `segment.cc`: Improves comments on member variables to clarify which ID space (segment-local vs. block-local) each component uses.
- `recall_base.h`: Adds `options.max_buffer_size_ = 256 * 1024` to force multi-block creation during tests, ensuring all `VectorRecallTest` cases now exercise the multi-block code path.
- `vector_recall_test.cc`: Adds `DeleteFilter` and `HybridInvertForwardDeleteFilter` tests that verify correct filter application across multiple blocks after bulk deletes. Note that `HybridInvertForwardDeleteFilter` explicitly depends on `DeleteFilter` having run first (shared static segment state), as documented in the test comment; this fragility was flagged in earlier review rounds.

Confidence Score: 4/5
- The `BlockOffsetFilter` translation logic is correct: block 0 (offset 0) passes the original filter unchanged, while subsequent blocks wrap it to add their base offset before delegating.
- The fix is minimal, targeted, and directly validated by the two new end-to-end tests. The `max_buffer_size_` reduction makes all existing filter tests implicitly exercise the multi-block path as well.
- The one point deducted is for the inter-test state dependency in `vector_recall_test.cc` (previously flagged): `HybridInvertForwardDeleteFilter` relies on `DeleteFilter` having run first, which makes the suite fragile under test reordering.
- The `DeleteFilter`/`HybridInvertForwardDeleteFilter` ordering dependency warrants attention if the test suite grows or shuffle mode is enabled.

Important Files Changed
- Adds `BlockOffsetFilter` wrapping to translate block-local IDs to segment-level IDs before delegating to the inner filter during multi-block vector search. Logic is sound; block 0 (offset == 0) skips the wrapper as an optimisation. The `per_block_filter` is still constructed unconditionally even when `query_params.filter` is null (previously flagged; developer intentionally kept this structure).
- Introduces `BlockOffsetFilter` as a `protected` inner class (per previous review feedback); removes the now-redundant `#include "vector_index_results.h"` (safely provided transitively via `vector_column_indexer.h`); adds `#include "db/index/common/index_filter.h"`, required by the new class.
- Adds `options.max_buffer_size_ = 256 * 1024` to force multi-block creation during test suite setup, enabling all `VectorRecallTest` cases, including the new delete-filter tests, to exercise the multi-block code path that this PR fixes.
- Adds `DeleteFilter` and `HybridInvertForwardDeleteFilter` tests that validate the per-block filter fix. The two tests share mutable state through the static `segments_` fixture: `HybridInvertForwardDeleteFilter` explicitly depends on `DeleteFilter` having run first, making the suite order-sensitive (previously flagged).

Flowchart
    %%{init: {'theme': 'neutral'}}%%
    flowchart TD
        A[Search called with query_params.filter] --> B{query_params.filter != nullptr?}
        B -- No --> C[filter = nullptr\npass-through to indexer]
        B -- Yes --> D{block_offsets_i > 0?}
        D -- No\nBlock 0\noffset == 0 --> E[filter = query_params.filter\nblock-local ID == segment-level ID\nno translation needed]
        D -- Yes\nBlock N\noffset > 0 --> F[filter = &BlockOffsetFilter\ninner_filter_ = query_params.filter\noffset_ = block_offsets_i]
        F --> G[BlockOffsetFilter::is_filtered\nblock_local_id + offset\n→ segment_level_id\n→ inner_filter_->is_filtered]
        E --> H[indexers_i->Search\nwith per-block filter]
        C --> H
        G --> H
        H --> I[Translate block-local doc IDs\nback to segment-level:\ndoc.key += block_offsets_i]
        I --> J[Merge & sort\nall block results\nby score]
        J --> K[Truncate to topk\nreturn VectorIndexResults]

Last reviewed commit: f774e03
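The flow in the chart can be sketched in a few dozen lines of standalone C++. This is only an illustration, not the code from the PR: the `IndexFilter` interface shape, `DeletedSetFilter`, the toy `Block` type, and `SearchAllBlocks` are all hypothetical stand-ins, and only the names `BlockOffsetFilter` and `is_filtered` are taken from the summary. It assumes higher scores rank first.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for the project's IndexFilter interface.
struct IndexFilter {
  virtual ~IndexFilter() = default;
  // Returns true when the document must be excluded from results.
  virtual bool is_filtered(uint64_t id) const = 0;
};

// Hypothetical segment-level filter: excludes a set of deleted segment-level IDs.
class DeletedSetFilter : public IndexFilter {
 public:
  explicit DeletedSetFilter(std::unordered_set<uint64_t> deleted)
      : deleted_(std::move(deleted)) {}
  bool is_filtered(uint64_t id) const override { return deleted_.count(id) > 0; }

 private:
  std::unordered_set<uint64_t> deleted_;
};

// Wrapper that turns a block-local ID into a segment-level ID
// (block_local_id + offset) before asking the inner filter.
class BlockOffsetFilter : public IndexFilter {
 public:
  BlockOffsetFilter(const IndexFilter* inner, uint64_t offset)
      : inner_(inner), offset_(offset) {}
  bool is_filtered(uint64_t block_local_id) const override {
    return inner_->is_filtered(block_local_id + offset_);
  }

 private:
  const IndexFilter* inner_;
  uint64_t offset_;
};

struct Doc {
  uint64_t key;
  float score;
};

// Toy block (assumption): one score per block-local ID, scanned exhaustively.
struct Block {
  std::vector<float> scores;
  std::vector<Doc> Search(const IndexFilter* filter) const {
    std::vector<Doc> out;
    for (uint64_t id = 0; id < scores.size(); ++id) {
      if (filter != nullptr && filter->is_filtered(id)) continue;
      out.push_back({id, scores[id]});
    }
    return out;
  }
};

// Searches every block with a per-block filter, maps keys back to
// segment-level IDs, then merges, sorts by score, and truncates to topk.
std::vector<Doc> SearchAllBlocks(const std::vector<Block>& blocks,
                                 const std::vector<uint64_t>& offsets,
                                 const IndexFilter* filter, std::size_t topk) {
  assert(blocks.size() == offsets.size());
  std::vector<Doc> merged;
  for (std::size_t i = 0; i < blocks.size(); ++i) {
    // Built unconditionally, but only used when a filter exists and offset > 0.
    BlockOffsetFilter wrapped(filter, offsets[i]);
    const IndexFilter* per_block = filter;
    if (filter != nullptr && offsets[i] > 0) per_block = &wrapped;
    for (Doc doc : blocks[i].Search(per_block)) {
      doc.key += offsets[i];  // block-local -> segment-level
      merged.push_back(doc);
    }
  }
  std::sort(merged.begin(), merged.end(),
            [](const Doc& a, const Doc& b) { return a.score > b.score; });
  if (merged.size() > topk) merged.resize(topk);
  return merged;
}
```

With two blocks of two documents each (offsets 0 and 2) and segment-level ID 2 deleted, the second block sees local ID 0 as filtered. Passing the segment-level filter unwrapped would instead test local ID 2, which does not exist in that block, so the deleted document would slip through.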