[SVS] Implement 2-stage backend SVS index initialization#903
[SVS] Implement 2-stage backend SVS index initialization#903
Conversation
|
Hi, I’m Jit, a friendly security platform designed to help developers build secure applications from day zero with an MVS (Minimal viable security) mindset. In case there are security findings, they will be communicated to you as a comment inside the PR. Hope you’ll enjoy using Jit. Questions? Comments? Want to learn more? Get in touch with us. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #903 +/- ##
==========================================
+ Coverage 97.09% 97.11% +0.01%
==========================================
Files 129 129
Lines 7500 7549 +49
==========================================
+ Hits 7282 7331 +49
Misses 218 218 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
alonre24
left a comment
There was a problem hiding this comment.
LGTM!
A few small comments.
Also please:
- Validate that covering the affected scenarios in unit tests (I believe we are)
- Add micro benchmarks that will prove the improvement (add vector/run query while initial index creation is executed in the background)
src/VecSim/algorithms/svs/svs.h
Outdated
| } | ||
|
|
||
| void setImpl(std::unique_ptr<ImplHandler> handler) override { | ||
| assert(handler && "SVSIndex::setImpl called with null handler"); |
There was a problem hiding this comment.
If this is a debug-only assert, let's add this to the log in a warning level as well
There was a problem hiding this comment.
This assert is added just to simplify logic error catching in DEBUG mode - as well as the next-line assert.
In release mode, the logic_error will be thrown later if handler is null.
assert()s here are not really needed - except for debugging purposes.
I can just remove them.
| std::span<const labelType> ids(labels, n); | ||
| auto processed_blob = this->preprocessForBatchStorage(vectors_data, n); | ||
| auto typed_vectors_data = static_cast<DataType *>(processed_blob.get()); | ||
| // Wrap data into SVS SimpleDataView for SVS API | ||
| auto points = svs::data::SimpleDataView<DataType>{typed_vectors_data, n, this->dim}; | ||
|
|
There was a problem hiding this comment.
This logic seems to be a duplication of what we do in addVectorsImpl. Consider unifying these into a single function
There was a problem hiding this comment.
The main point here is the processed_blob which lifetime should be managed till end of initImpl() and impl_->add_points() calls.
A single function, which will wrap all this code would look like:
std::tuple<std::span<const labelType>, MemoryUtils::unique_blob, svs::data::SimpleDataView<DataType>> preprocessAndPrepareSVSArgs(...)| SVSImplHandler *svs_handler = dynamic_cast<SVSImplHandler *>(handler.get()); | ||
| if (!svs_handler) { | ||
| throw std::logic_error("Failed to cast to SVSImplHandler"); | ||
| } |
There was a problem hiding this comment.
What is the motivation to have an abstract ImplHandler rather than have only SVSImplHandler? The dynamic_cast here seems a bit awkward
There was a problem hiding this comment.
SVSImplHandler is not just a simple type - it is template class with a number of parameters, and it's full declaration looks like:
template <typename MetricType,
typename DataType,
bool isMulti,
size_t QuantBits,
size_t ResidualBits,
bool IsLeanVec>
struct SVSIndex<MetricType, DataType, isMulti, QuantBits, ResidualBits, IsLeanVec>::SVSImplHandler;This why the abstract SVSIndexBase::ImplHandler is defined for client code (TieredSVSIndex).
| svs_index->setNumThreads(std::min(availableThreads, labels_to_move.size())); | ||
| svs_index->addVectors(vectors_to_move.data(), labels_to_move.data(), | ||
| labels_to_move.size()); | ||
| if (this->backendIndex->indexSize() == 0) { |
There was a problem hiding this comment.
Does this also handle re-initialization after the index was emptied? Is that scenario tested?
There was a problem hiding this comment.
Yes, as it was before in SVSIndex::AddVectors()
23d0223 to
2538673
Compare
❌ Jit Scanner failed - Our team is investigatingJit Scanner failed - Our team has been notified and is working to resolve the issue. Please contact support if you have any questions. 💡 Need to bypass this check? Comment |
33ef4e8 to
ff11b80
Compare
|
|
||
| this->markIndexUpdate(deleted_num); | ||
| return deleted_num; | ||
| } |
There was a problem hiding this comment.
Missing single-threading enforcement for single-element deletions
Medium Severity
The new deleteVectorImpl doesn't enforce single-threading before calling impl_->delete_entries with a single element. The removed code from deleteVectorsImpl explicitly set numThreads to 1 when n == 1, with a comment stating "we should ensure single-threading." The addVector path has an analogous assertion (getNumThreads() == 1). The tiered callers (addVector async mode and deleteVector) also don't call setNumThreads(1) before invoking backendIndex->deleteVector, unlike the in-place addVector path which explicitly does setNumThreads(1).
Additional Locations (2)
There was a problem hiding this comment.
SVS deleteVectors() operation does not utilize the threadpool - this why any TP manipulations are redundant here.
Purpose: prevent long time SVS Tiered index lock at initialization time. Logic: First stage: create `svs::...::MutableVamanaIndex` instance with R/O shared lock Second stage: set `svs::...::MutableVamanaIndex` created before under R/W unique lock
ff11b80 to
e270c00
Compare
| auto svs_index = GetSVSIndex(); | ||
| svs_index->deleteVectors(deleted_labels_journal.data(), | ||
| deleted_labels_journal.size()); | ||
| } |
There was a problem hiding this comment.
Lock ordering inversion causes potential deadlock
High Severity
The new cleanup code in updateSVSIndex acquires mainIndexGuard exclusive (line 714) while already holding flatIndexGuard exclusive (line 706), establishing a flat → main lock order. The pre-existing in-place addVector acquires flatIndexGuard exclusive (line 796) while holding mainIndexGuard shared (line 788), establishing a main → flat lock order. This inverted ordering can deadlock when both paths run concurrently — for example, when the update job is in its cleanup phase and a concurrent in-place addVector sees an empty backend.


Purpose: prevent long time SVS Tiered index lock at initialization time.
Logic:
svs::...::MutableVamanaIndexinstance with R/O shared locksvs::...::MutableVamanaIndexcreated before under R/W unique lockWhich issues this PR fixes
Main objects this PR modified
createImpl()andsetImpl()createImpl()under shared lock andsetImpl()under unique lock if backend index is empty.Mark if applicable
Note
Medium Risk
Touches tiered index locking and backend initialization/deletion flows, so concurrency/consistency bugs could surface under racey update/delete workloads despite added test coverage.
Overview
Enables two-stage initialization of the backend SVS index to reduce time spent holding the tiered index’s write lock during the initial build.
SVSIndexBasenow exposescreateImpl()/setImpl()(plusisLabelExists()), andSVSIndexrefactors initial build into aninitImpl()that returns the implementation so the tiered update job can build the SVS impl under a shared lock and install it under an exclusive lock. Tiered SVS also starts journaling labels deleted/overwritten during an update and applies those deletions to the backend after the update, and adjusts delete paths to check label existence under shared locks before taking exclusive locks. A new unit test covers the two-stage init and re-init behavior.Written by Cursor Bugbot for commit e270c00. This will update automatically on new commits. Configure here.