feat(search): add offline AI-powered semantic search system #1208
Arunodoy18 wants to merge 1 commit into AOSSIE-Org:main from
Conversation
Introduces a fully offline semantic search backend with no external
API dependencies. Images are represented as fixed-dimension (163-dim)
L2-normalised float32 vectors derived from existing metadata:
YOLO object tags, time-of-day, season, day-of-week, face-cluster
presence, and GPS coordinates.
New modules
-----------
- app/ai/embedding_service.py EmbeddingService: metadata -> vector
- app/ai/vector_index_manager.py VectorIndexManager: numpy flat index
- app/ai/search_service.py SearchService: indexing + search facade
- app/ai/__init__.py package exports
- app/database/embeddings.py image_embeddings SQLite table (BLOB)
- app/routes/search.py /search/* FastAPI endpoints
- app/schemas/search.py Pydantic response models
Modified files
--------------
- app/config/settings.py AI constants (MAX_FACE_CLUSTERS, TOP_K, ...)
- app/database/images.py db_get_images_by_ids() helper
- app/database/faces.py db_get_faces_by_image_id() helper
- main.py table init, SearchService singleton,
background startup indexing, router mount
API endpoints
-------------
GET /search/semantic?q=... ranked semantic search
GET /search/index-status index health / size
POST /search/index incremental or forced re-index
DELETE /search/index/{image_id} remove one image from the index
Design properties
-----------------
- Fully offline: numpy + stdlib only, no external models or APIs
- No recomputation on startup: already-indexed images are skipped
- Persistent: embeddings stored as BLOBs in existing SQLite DB
- Scalable: 10k images ~6.5 MB RAM, O(N) cosine via matmul
- Thread-safe incremental add/remove without full index rebuild
- Zero impact on existing features
Closes AOSSIE-Org#1205
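The scalability claim above can be sanity-checked with a small numpy sketch. This is a rough illustration of a flat index, not the PR's actual `VectorIndexManager`: L2-normalised float32 rows make cosine similarity a single matmul, and 10k vectors of dim 163 occupy about 6.5 MB.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 10_000, 163

# Flat index: one L2-normalised float32 row per image.
matrix = rng.normal(size=(n, dim)).astype(np.float32)
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

query = rng.normal(size=dim).astype(np.float32)
query /= np.linalg.norm(query)

# Cosine similarity over all rows in one O(N) matmul, then top-k.
scores = matrix @ query
top5 = np.argsort(scores)[::-1][:5]

print(matrix.nbytes)  # 10_000 * 163 * 4 bytes = 6_520_000 ≈ 6.5 MB
```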
Walkthrough
-----------
A new offline semantic search subsystem is added, featuring embedding generation from image metadata, a thread-safe in-memory vector index backed by SQLite persistence, and a high-level search service orchestrating both. New API endpoints expose semantic search, index status, and index management. Integration occurs in the app startup with background indexing.
Sequence Diagram
----------------
```mermaid
sequenceDiagram
    actor User
    participant API as FastAPI Router
    participant SearchSvc as SearchService
    participant EmbedSvc as EmbeddingService
    participant VecIdx as VectorIndexManager
    participant DB as SQLite DB
    User->>API: GET /search/semantic?q=beach
    API->>SearchSvc: search("beach", top_k=20)
    SearchSvc->>EmbedSvc: embed_query("beach")
    EmbedSvc-->>SearchSvc: query_embedding [163D]
    SearchSvc->>VecIdx: ensure_loaded()
    VecIdx->>DB: db_get_all_embeddings()
    DB-->>VecIdx: [(image_id, embedding), ...]
    VecIdx-->>SearchSvc: index loaded
    SearchSvc->>VecIdx: search(query_embedding, top_k=20)
    VecIdx-->>SearchSvc: [(image_id, score), ...]
    SearchSvc->>DB: db_get_images_by_ids([...])
    DB-->>SearchSvc: [image metadata, ...]
    SearchSvc-->>API: [SearchResult(...), ...]
    API-->>User: 200 SemanticSearchResponse
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pull request overview
Adds an offline, semantic (embedding-based) search subsystem to the PictoPy backend, including persistent embedding storage in SQLite, an in-memory vector index, and FastAPI endpoints for querying and managing the index.
Changes:
- Introduces EmbeddingService, VectorIndexManager, and SearchService to generate/query embeddings and manage an in-memory index backed by SQLite.
- Adds new /search/* API routes and Pydantic schemas for semantic search and index management.
- Extends the database layer with an image_embeddings table plus helper queries for enriching results (image info + face clusters).
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| backend/main.py | Initializes embeddings table + search service at startup and starts background indexing thread; registers /search router. |
| backend/app/schemas/search.py | Adds response models for semantic search and index operations. |
| backend/app/routes/search.py | Adds /search/semantic, /search/index-status, /search/index, and /search/index/{image_id} endpoints. |
| backend/app/database/images.py | Adds db_get_images_by_ids() to enrich search results with path/thumbnail/tags. |
| backend/app/database/faces.py | Adds db_get_faces_by_image_id() for face-cluster enrichment during embedding. |
| backend/app/database/embeddings.py | Adds SQLite persistence layer for per-image embeddings (CRUD + bulk load). |
| backend/app/config/settings.py | Adds AI/search-related settings (paths, defaults, thresholds, face-cluster cap). |
| backend/app/ai/vector_index_manager.py | Implements a thread-safe in-memory flat index + cosine similarity search. |
| backend/app/ai/search_service.py | Orchestrates embedding generation, persistence, indexing, and result enrichment. |
| backend/app/ai/embedding_service.py | Implements deterministic, offline embedding construction for images + queries. |
| backend/app/ai/__init__.py | Exposes the AI subsystem components as a package. |
```python
with self._lock:
    self.ensure_loaded()
    if image_id in self._id_to_row:
        # In-place update (O(1))
        self._matrix[self._id_to_row[image_id]] = vec_row
    else:
```
VectorIndexManager is documented as lazily loaded on first query, but add_or_update() always calls ensure_loaded(), which forces a full DB load the first time any embedding is added/updated (e.g., during startup indexing). If you want to preserve lazy loading, consider only updating the in-memory matrix when the index is already loaded, or buffering updates until the first explicit load/search.
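One way to preserve lazy loading, sketched below with hypothetical names (the real `VectorIndexManager` API may differ): buffer writes while the index is unloaded, then merge them into the in-memory state on the first explicit load.

```python
import threading
import numpy as np

class LazyIndex:
    """Minimal sketch of deferred loading; class and attribute names are illustrative."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._loaded = False
        self._pending: dict[str, np.ndarray] = {}   # writes buffered pre-load
        self._vectors: dict[str, np.ndarray] = {}   # live in-memory index

    def add_or_update(self, image_id: str, vec: np.ndarray) -> None:
        with self._lock:
            # Route the write to the buffer while unloaded: no DB load triggered.
            target = self._vectors if self._loaded else self._pending
            target[image_id] = vec

    def ensure_loaded(self) -> None:
        with self._lock:
            if self._loaded:
                return
            # ... bulk-load persisted embeddings from SQLite here ...
            self._loaded = True
            self._vectors.update(self._pending)  # merge buffered writes last
            self._pending.clear()

idx = LazyIndex()
idx.add_or_update("img1", np.ones(3, dtype=np.float32))
print(idx._loaded)             # False — adding did not force a full DB load
idx.ensure_loaded()
print("img1" in idx._vectors)  # True — buffered write merged on first load
```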
```python
# Instantiate the semantic search service and attach it to app state
app.state.search_service = SearchService()
# Index any unindexed tagged images in a background thread so startup
# is not blocked. Already-indexed images are skipped automatically.
threading.Thread(
    target=app.state.search_service.index_all,
    daemon=True,
    name="semantic-index-startup",
).start()
```
The startup indexing runs in a daemon thread. Because shutdown can terminate the process abruptly (e.g. os._exit(0) on Windows), this thread may be killed mid-run, leaving indexing partially complete. Consider tracking the thread/task and allowing a graceful stop/join on shutdown, or making indexing explicitly triggerable only via the API to avoid background work during startup/shutdown cycles.
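A hedged sketch of the graceful-stop suggestion (the `stop` event and function name are illustrative, not from the PR): check a flag between images so shutdown can set it and join the thread instead of killing a daemon mid-run.

```python
import threading

def index_all_interruptible(records, index_one, stop: threading.Event) -> int:
    """Index records one by one, stopping cleanly if `stop` is set."""
    done = 0
    for rec in records:
        if stop.is_set():
            break  # leave remaining work for the next startup pass
        index_one(rec)
        done += 1
    return done

stop = threading.Event()
indexed = []
t = threading.Thread(
    target=index_all_interruptible,
    args=(["a", "b", "c"], indexed.append, stop),
    name="semantic-index-startup",
)
t.start()
t.join()        # on shutdown: stop.set() then t.join(), instead of os._exit()
print(indexed)  # ['a', 'b', 'c']
```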
```python
def index_status(request: Request) -> IndexStatusResponse:
    service = _get_service(request)
    return IndexStatusResponse(
        success=True,
        indexed_count=service.index_size,
        is_loaded=service.is_index_loaded,
    )
```
indexed_count is derived from service.index_size, which reports the in-memory index size. When the index is still lazily unloaded, this will return 0 even if embeddings exist on disk, making /search/index-status misleading. Consider either forcing a load for this endpoint or returning a persisted count from the embeddings table (and optionally reporting both persisted vs in-memory counts).
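A minimal sketch of the "persisted count" alternative, assuming the `image_embeddings` SQLite table described elsewhere in this PR (the helper name is hypothetical):

```python
import sqlite3

def persisted_count(conn: sqlite3.Connection) -> int:
    """Count embeddings on disk, independent of the in-memory index state."""
    (n,) = conn.execute("SELECT COUNT(*) FROM image_embeddings").fetchone()
    return n

# Tiny in-memory demo with the schema implied by the PR description.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image_embeddings (image_id TEXT PRIMARY KEY, embedding BLOB, dim INTEGER)"
)
conn.execute("INSERT INTO image_embeddings VALUES ('img1', ?, 163)", (b"\x00",))
print(persisted_count(conn))  # 1, even if the in-memory index is still unloaded
```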
```python
conn.execute(
    """
    CREATE INDEX IF NOT EXISTS ix_image_embeddings_image_id
    ON image_embeddings(image_id)
    """
)
```
image_id is already the PRIMARY KEY of image_embeddings, so SQLite will already maintain an index for it. The additional CREATE INDEX ... ON image_embeddings(image_id) is redundant and can be removed to reduce schema clutter.
```python
if images is None:
    images = db_get_all_images(tagged=True)

already_indexed: set[str] = set() if force else set(db_get_indexed_image_ids())

count = 0
for record in images:
    iid = record.get("id")
    if not iid:
        continue
    if not force and iid in already_indexed:
        continue
```
index_all(force=False) skips any image_id already present in image_embeddings, but VectorIndexManager will ignore stored embeddings whose dim doesn't match the current EMBEDDING_DIM (e.g., after changing MAX_FACE_CLUSTERS in a new release). In that scenario, the in-memory index will be empty and index_all still won’t rebuild those embeddings unless the user explicitly passes force=true. Consider treating rows with a mismatched dim as not indexed (e.g., query only ids where dim == EMBEDDING_DIM, or add a DB helper to return ids by dim).
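The suggested dim-aware helper could look like this sketch (helper name and schema details assumed from the PR description, not verified against the actual code):

```python
import sqlite3

EMBEDDING_DIM = 163  # assumed current dimension

def db_get_indexed_image_ids_for_dim(conn: sqlite3.Connection, dim: int) -> set:
    """Treat only rows whose stored dim matches as 'already indexed'."""
    rows = conn.execute(
        "SELECT image_id FROM image_embeddings WHERE dim = ?", (dim,)
    ).fetchall()
    return {r[0] for r in rows}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image_embeddings (image_id TEXT PRIMARY KEY, embedding BLOB, dim INTEGER)"
)
conn.executemany(
    "INSERT INTO image_embeddings VALUES (?, ?, ?)",
    [("old", b"", 150), ("new", b"", 163)],
)
# 'old' has a stale dim, so index_all() would re-embed it without force=True.
print(db_get_indexed_image_ids_for_dim(conn, EMBEDDING_DIM))  # {'new'}
```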
```python
vec_row = embedding.astype(np.float32).reshape(1, -1)

with self._lock:
    self.ensure_loaded()
    if image_id in self._id_to_row:
        # In-place update (O(1))
        self._matrix[self._id_to_row[image_id]] = vec_row
    else:
        # Append row (amortised O(1) via np.vstack)
        row_idx = len(self._image_ids)
        self._image_ids.append(image_id)
        self._id_to_row[image_id] = row_idx
        if self._matrix.shape[0] > 0:
            self._matrix = np.vstack([self._matrix, vec_row])
        else:
            self._matrix = vec_row
```
In add_or_update, vec_row is reshaped to (1, dim) but the in-place update assigns it into a 1-D row (self._matrix[row]), which will raise a NumPy broadcasting error when updating an existing embedding. Use a 1-D vector for assignment (e.g., avoid the extra leading dimension for the update path).
```python
@router.get(
    "/semantic",
    response_model=SemanticSearchResponse,
    summary="Semantic image search",
    description=(
        "Search for images using a natural-language query. The query is mapped "
        "onto the same fixed-dimension embedding space used to index image "
        "metadata (object classes, date/time, season, faces, GPS) and the "
        "most similar images are returned ranked by cosine similarity."
    ),
)
def semantic_search(
    request: Request,
    q: str = Query(..., min_length=1, description="Free-text search query."),
    limit: int = Query(
        default=SEARCH_DEFAULT_TOP_K,
        ge=1,
        le=200,
        description="Maximum number of results to return.",
    ),
    min_score: float = Query(
        default=SEARCH_MIN_SCORE,
        ge=0.0,
        le=1.0,
        description="Minimum cosine-similarity score threshold.",
    ),
) -> SemanticSearchResponse:
    service = _get_service(request)
```
This PR adds new search/indexing endpoints and core ranking logic, but there are no corresponding tests under backend/tests/ (the repo already has substantial FastAPI/DB test coverage). Adding tests for /search/semantic, /search/index, and the indexing/search flow (including score thresholding and empty-index behavior) would help prevent regressions.
```python
results: List[SearchResult] = []
for image_id, score in raw:
    info = path_map.get(image_id, {})
    results.append(
        SearchResult(
            image_id=image_id,
            score=round(score, 4),
            path=info.get("path", ""),
            thumbnail_path=info.get("thumbnailPath", ""),
            tags=info.get("tags") or [],
        )
```
SearchResultItem.score is constrained to [0, 1] in the response schema, but cosine similarity computed in float32 can drift slightly outside this range (e.g., 1.0000x) due to numerical error, which can cause response-model validation failures. Consider clamping scores into [0.0, 1.0] (and optionally also clamping at 0.0) before returning them.
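A one-line clamp along the suggested lines (illustrative; the function name is not from the PR):

```python
def clamp_score(score: float) -> float:
    """Clamp a float32 cosine similarity into the schema's [0, 1] range."""
    return min(max(float(score), 0.0), 1.0)

print(clamp_score(1.0000001))  # 1.0 — float32 drift above 1 is clipped
print(clamp_score(-3e-8))      # 0.0 — tiny negative noise is clipped
```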
```python
elif len(tok) >= 4:
    # Partial substring match only for sufficiently long tokens
    # (avoids "at" → "cat", "park" → "parking meter" false positives).
    # Rule: the token must be a prefix, suffix, or the class name must
    # start with the token (prevents mid-word collisions like "ark").
    for name, idx in _OBJ_INDEX.items():
        if name.startswith(tok) or tok.startswith(name):
            matched.add(idx)
i += 1
```
The comment describing the partial-match rule mentions allowing prefix/suffix matches, but the implementation only checks name.startswith(tok) and tok.startswith(name) (no suffix matching). Either update the comment to reflect the actual rule or adjust the matching logic so documentation and behavior stay aligned.
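The mismatch can be seen with a toy reconstruction of the rule (the class names here are illustrative, not the PR's actual `_OBJ_INDEX`):

```python
_OBJ_INDEX = {"cat": 0, "parking meter": 1, "park bench": 2}

def matched_names(tok: str) -> set:
    """Reconstruction of the implemented rule: two prefix checks, no suffix check."""
    if len(tok) < 4:
        return set()
    return {
        name for name in _OBJ_INDEX
        if name.startswith(tok) or tok.startswith(name)
    }

print(sorted(matched_names("park")))   # ['park bench', 'parking meter'] — prefix works
print(sorted(matched_names("meter")))  # [] — the documented suffix match never fires
```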
Actionable comments posted: 8
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/ai/embedding_service.py`:
- Around line 197-205: The constructor EmbeddingService.__init__ currently
allows max_face_clusters > module MAX_FC which can cause out-of-bounds writes in
_fill_faces; clamp the runtime value by setting self._max_fc =
min(max_face_clusters, MAX_FC) (and keep the existing warning but reflect the
clamping), and update _fill_faces (and the other location referenced around
lines 357-359) to always use min(self._max_fc, MAX_FC) or the clamped
self._max_fc when computing per-face offsets/indices so no code ever assumes
more than MAX_FC face clusters.
- Around line 371-373: The spatial encoding currently normalizes lat/lon without
bounds checking; update the logic in embedding_service.py (around the vec,
_OFF_FLAG and _OFF_SPA usage) to validate GPS values: clamp latitude to [-90,
90] and longitude to [-180, 180] (or raise a ValueError if you prefer strict
validation) before computing vec[_OFF_SPA] = float(lat)/90.0 and vec[_OFF_SPA +
1] = float(lon)/180.0; ensure vec[_OFF_FLAG + 1] (has_location) is only set when
a (possibly clamped) numeric lat/lon is present and convert inputs to float
first.
In `@backend/app/ai/search_service.py`:
- Around line 149-154: The current index_all loop swallows all exceptions when
calling self.index_image(record), making callers unable to tell “already
indexed” from real failures; update index_all so it does not blanket-catch
Exception: either catch only the specific expected "already indexed" exception
(or whatever custom exception your index_image raises) and continue, but for all
other exceptions capture/log them and accumulate failures to return or re-raise
an aggregated error; specifically modify the try/except around
self.index_image(record) and replace the bare except Exception with targeted
exception handling (e.g., except AlreadyIndexedError: continue) and for other
exceptions append iid (or record) and the exception to a failures list, then
after the loop either raise a combined Exception or return a result object
indicating successes and failures, while keeping the logger.exception(...) call
to record details.
- Around line 115-227: Add automated unit tests for SearchService covering:
index_all(force=False) skips already indexed IDs and returns correct count,
index_all(force=True) re-embeds all items and overwrites existing entries, and
index_all handles per-image exceptions (mock index_image to raise for one record
and assert count reflects only successful indexes and that logger.exception was
called); remove_image removes persisted embedding and the in-memory index (mock
db_delete_embedding and self._index.remove and then run search to assert the
image no longer appears); and search returns correctly ordered SearchResult
entries with trimmed query, threshold filtering, and proper path/thumbnail/tags
mapping (instantiate SearchService, mock its _embed.embed_query, _index.search,
and _build_path_map to assert ranking, score rounding, and empty-query
behavior). Use mocks for db_get_all_images, db_get_indexed_image_ids,
db_delete_embedding, _build_path_map, and methods index_image/_index/_embed so
tests are isolated and deterministic.
In `@backend/app/ai/vector_index_manager.py`:
- Around line 138-145: The update path in add_or_update creates vec_row with
shape (1, D) then assigns it into a single row of self._matrix, causing a shape
mismatch; fix by making vec_row a 1-D array (shape (D,)) before assignment.
Locate the add_or_update method (symbols: vec_row, ensure_loaded, _id_to_row,
_matrix, image_id) and change the creation/reshape of vec_row to use
.reshape(-1) or index [0] so that assignments like
self._matrix[self._id_to_row[image_id]] = vec_row succeed without broadcasting
errors.
In `@backend/app/database/embeddings.py`:
- Around line 90-104: Validate that the embedding is a 1-D array before
persisting: after creating vec via embedding.astype(np.float32) (and before
computing dim or calling conn.execute on the INSERT into image_embeddings),
check vec.ndim == 1 and use vec.shape[0] as the dim; if the shape is invalid,
raise a ValueError (or otherwise abort) rather than using len(vec) so you never
store an incorrect dim for image_id and corrupt future reads. Ensure this
validation is applied where vec is created and before calling _connect() /
conn.execute.
In `@backend/app/routes/search.py`:
- Around line 86-88: The exception handler in search.py currently calls
logger.exception("Semantic search failed for query %r", q) which logs raw user
input q; change this to log non-sensitive metadata instead: replace the
logger.exception call in the except block handling semantic search with
something like logger.exception("Semantic search failed for request_id=%s
query_len=%d", request_id, len(q or "")) (or include any existing request
identifier) and remove q from logs, ensuring the raise HTTPException remains
unchanged; update references around logger.exception, the except block, and the
variable q to compute/query length and attach request id if available.
In `@backend/main.py`:
- Around line 66-75: Add automated tests that verify the startup block calls
db_create_embeddings_table(), creates a SearchService instance on
app.state.search_service, and starts a background thread invoking
SearchService.index_all(); specifically, in tests (e.g., new test file using the
test client and monkeypatch/patch), assert db_create_embeddings_table was called
during app setup, assert app.state.search_service is an instance of
SearchService, and patch SearchService.index_all to confirm it was scheduled (or
invoked) by the background Thread (or assert a Thread named
"semantic-index-startup" was started); also add integration tests against the
search endpoint to confirm search responses after indexing behavior is
simulated.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- backend/app/ai/__init__.py
- backend/app/ai/embedding_service.py
- backend/app/ai/search_service.py
- backend/app/ai/vector_index_manager.py
- backend/app/config/settings.py
- backend/app/database/embeddings.py
- backend/app/database/faces.py
- backend/app/database/images.py
- backend/app/routes/search.py
- backend/app/schemas/search.py
- backend/main.py
```python
def __init__(self, max_face_clusters: int = MAX_FC) -> None:
    if max_face_clusters != MAX_FC:
        logger.warning(
            "EmbeddingService initialised with max_face_clusters=%d "
            "but module constant MAX_FC=%d; EMBEDDING_DIM will be wrong.",
            max_face_clusters,
            MAX_FC,
        )
    self._max_fc = max_face_clusters
```
Prevent face-cluster width mismatch from breaking fixed vector bounds
The vector layout is fixed using module-level MAX_FC. Allowing a larger runtime max_face_clusters can cause out-of-bounds writes in _fill_faces.
Proposed fix
```diff
 def __init__(self, max_face_clusters: int = MAX_FC) -> None:
-    if max_face_clusters != MAX_FC:
-        logger.warning(
-            "EmbeddingService initialised with max_face_clusters=%d "
-            "but module constant MAX_FC=%d; EMBEDDING_DIM will be wrong.",
-            max_face_clusters,
-            MAX_FC,
-        )
-    self._max_fc = max_face_clusters
+    if max_face_clusters != MAX_FC:
+        raise ValueError(
+            f"max_face_clusters must be {MAX_FC} to match fixed EMBEDDING_DIM={EMBEDDING_DIM}"
+        )
+    self._max_fc = MAX_FC
```
Also applies to: 357-359
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/ai/embedding_service.py` around lines 197 - 205, The constructor
EmbeddingService.__init__ currently allows max_face_clusters > module MAX_FC
which can cause out-of-bounds writes in _fill_faces; clamp the runtime value by
setting self._max_fc = min(max_face_clusters, MAX_FC) (and keep the existing
warning but reflect the clamping), and update _fill_faces (and the other
location referenced around lines 357-359) to always use min(self._max_fc,
MAX_FC) or the clamped self._max_fc when computing per-face offsets/indices so
no code ever assumes more than MAX_FC face clusters.
```python
vec[_OFF_FLAG + 1] = 1.0                # has_location
vec[_OFF_SPA] = float(lat) / 90.0       # [−90, 90] → [−1, 1]
vec[_OFF_SPA + 1] = float(lon) / 180.0  # [−180, 180] → [−1, 1]
```
Validate GPS bounds before encoding spatial features
This currently accepts any numeric latitude/longitude and can encode values outside the documented [-1, 1] range. Reject or clamp out-of-range coordinates.
Proposed fix
```diff
-vec[_OFF_FLAG + 1] = 1.0  # has_location
-vec[_OFF_SPA] = float(lat) / 90.0  # [−90, 90] → [−1, 1]
-vec[_OFF_SPA + 1] = float(lon) / 180.0  # [−180, 180] → [−1, 1]
+lat_f = float(lat)
+lon_f = float(lon)
+if not (-90.0 <= lat_f <= 90.0 and -180.0 <= lon_f <= 180.0):
+    return
+vec[_OFF_FLAG + 1] = 1.0  # has_location
+vec[_OFF_SPA] = lat_f / 90.0
+vec[_OFF_SPA + 1] = lon_f / 180.0
```
+ vec[_OFF_SPA + 1] = lon_f / 180.0📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| vec[_OFF_FLAG + 1] = 1.0 # has_location | |
| vec[_OFF_SPA] = float(lat) / 90.0 # [−90, 90] → [−1, 1] | |
| vec[_OFF_SPA + 1] = float(lon) / 180.0 # [−180, 180] → [−1, 1] | |
| lat_f = float(lat) | |
| lon_f = float(lon) | |
| if not (-90.0 <= lat_f <= 90.0 and -180.0 <= lon_f <= 180.0): | |
| return | |
| vec[_OFF_FLAG + 1] = 1.0 # has_location | |
| vec[_OFF_SPA] = lat_f / 90.0 | |
| vec[_OFF_SPA + 1] = lon_f / 180.0 |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/ai/embedding_service.py` around lines 371 - 373, The spatial
encoding currently normalizes lat/lon without bounds checking; update the logic
in embedding_service.py (around the vec, _OFF_FLAG and _OFF_SPA usage) to
validate GPS values: clamp latitude to [-90, 90] and longitude to [-180, 180]
(or raise a ValueError if you prefer strict validation) before computing
vec[_OFF_SPA] = float(lat)/90.0 and vec[_OFF_SPA + 1] = float(lon)/180.0; ensure
vec[_OFF_FLAG + 1] (has_location) is only set when a (possibly clamped) numeric
lat/lon is present and convert inputs to float first.
```python
def index_all(
    self,
    *,
    force: bool = False,
    images: Optional[List[Dict[str, Any]]] = None,
) -> int:
    """
    Index all tagged images that do not yet have a stored embedding.

    Parameters
    ----------
    force :
        If ``True``, re-embed and overwrite every image even if already
        indexed. Useful after a settings change (e.g. ``MAX_FACE_CLUSTERS``).
    images :
        Pre-fetched image records. Fetched from the DB when ``None``.

    Returns
    -------
    int
        Number of images actually (re-)indexed in this call.
    """
    if images is None:
        images = db_get_all_images(tagged=True)

    already_indexed: set[str] = set() if force else set(db_get_indexed_image_ids())

    count = 0
    for record in images:
        iid = record.get("id")
        if not iid:
            continue
        if not force and iid in already_indexed:
            continue
        try:
            self.index_image(record)
            count += 1
        except Exception:
            logger.exception("Failed to index image %s", iid)

    if count:
        logger.info("Indexed %d new image(s)", count)
    else:
        logger.debug("index_all: no new images to index")
    return count

def remove_image(self, image_id: str) -> None:
    """
    Remove an image from both the persistent embedding store and the
    in-memory index.

    Should be called whenever an image is deleted from the library so
    it no longer appears in search results.
    """
    db_delete_embedding(image_id)
    self._index.remove(image_id)
    logger.debug("Removed image %s from semantic index", image_id)

# ------------------------------------------------------------------
# Search
# ------------------------------------------------------------------

def search(
    self,
    query: str,
    top_k: int = 20,
    min_score: float = 0.05,
) -> List[SearchResult]:
    """
    Perform a semantic search across all indexed images.

    Parameters
    ----------
    query :
        Free-text query (e.g. ``"dog at the park"``, ``"evening sunset"``,
        ``"person in summer"``).
    top_k :
        Maximum number of results to return.
    min_score :
        Cosine-similarity threshold. Keeps only results with
        ``score >= min_score``.

    Returns
    -------
    list of :class:`SearchResult` ordered by descending similarity score.
    """
    query = query.strip()
    if not query:
        return []

    query_vec = self._embed.embed_query(query)
    raw = self._index.search(query_vec, top_k=top_k, min_score=min_score)
    if not raw:
        return []

    # Batch-fetch path / thumbnail / tags for the matched IDs
    id_set = {iid for iid, _ in raw}
    path_map = _build_path_map(id_set)

    results: List[SearchResult] = []
    for image_id, score in raw:
        info = path_map.get(image_id, {})
        results.append(
            SearchResult(
                image_id=image_id,
                score=round(score, 4),
                path=info.get("path", ""),
                thumbnail_path=info.get("thumbnailPath", ""),
                tags=info.get("tags") or [],
            )
        )
    return results
```
Add automated tests for critical semantic search functionality.
This introduces core behavior (index_all, remove_image, search) without accompanying automated tests. Critical gaps include:
- Force reindex behavior (the force parameter)
- Partial-failure propagation in batch indexing (exceptions are caught per image but not propagated)
- Remove-then-search consistency
- Query execution and result ranking
The test infrastructure also lacks SearchService instantiation and integration tests, despite the service being integral to the application.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/ai/search_service.py` around lines 115 - 227, Add automated unit
tests for SearchService covering: index_all(force=False) skips already indexed
IDs and returns correct count, index_all(force=True) re-embeds all items and
overwrites existing entries, and index_all handles per-image exceptions (mock
index_image to raise for one record and assert count reflects only successful
indexes and that logger.exception was called); remove_image removes persisted
embedding and the in-memory index (mock db_delete_embedding and
self._index.remove and then run search to assert the image no longer appears);
and search returns correctly ordered SearchResult entries with trimmed query,
threshold filtering, and proper path/thumbnail/tags mapping (instantiate
SearchService, mock its _embed.embed_query, _index.search, and _build_path_map
to assert ranking, score rounding, and empty-query behavior). Use mocks for
db_get_all_images, db_get_indexed_image_ids, db_delete_embedding,
_build_path_map, and methods index_image/_index/_embed so tests are isolated and
deterministic.
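A self-contained pytest-style sketch of part of the coverage the prompt asks for. `FakeSearchService` is a hypothetical stand-in that mirrors the `index_all` contract described above; real tests would import `SearchService` from `backend/app/ai/search_service.py` and patch its collaborators instead.

```python
from unittest.mock import MagicMock

# Hypothetical stand-in mirroring index_all: skip already-indexed IDs,
# count successes, log (but do not propagate) per-image exceptions.
class FakeSearchService:
    def __init__(self, images, already_indexed, index_image, logger):
        self._images = images
        self._already = already_indexed
        self.index_image = index_image
        self._logger = logger

    def index_all(self, force=False):
        count = 0
        for record in self._images:
            iid = record.get("id")
            if not iid:
                continue
            if not force and iid in self._already:
                continue
            try:
                self.index_image(record)
                count += 1
            except Exception:
                self._logger.exception("Failed to index image %s", iid)
        return count

def test_index_all_skips_indexed_and_counts_failures():
    images = [{"id": 1}, {"id": 2}, {"id": 3}]
    # id=1 is skipped, id=2 succeeds, id=3 raises and is only logged
    index_image = MagicMock(side_effect=[None, RuntimeError("boom")])
    logger = MagicMock()
    svc = FakeSearchService(images, already_indexed={1},
                            index_image=index_image, logger=logger)
    assert svc.index_all() == 1
    assert logger.exception.call_count == 1

def test_force_reindexes_everything():
    images = [{"id": 1}, {"id": 2}]
    svc = FakeSearchService(images, already_indexed={1, 2},
                            index_image=MagicMock(), logger=MagicMock())
    assert svc.index_all(force=True) == 2
```

The same structure extends naturally to remove-then-search consistency by mocking `db_delete_embedding` and the index's `remove` method.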
try:
    self.index_image(record)
    count += 1
except Exception:
    logger.exception("Failed to index image %s", iid)
Do not silently swallow indexing failures in index_all.
Line 152 catches all exceptions and continues, so callers cannot distinguish “already indexed” from “indexing failed.” This can surface false success to /search/index.
⚙️ Proposed fix
  count = 0
+ failures = 0
  for record in images:
      iid = record.get("id")
      if not iid:
          continue
      if not force and iid in already_indexed:
          continue
      try:
          self.index_image(record)
          count += 1
      except Exception:
+         failures += 1
          logger.exception("Failed to index image %s", iid)
+ if failures:
+     raise RuntimeError(f"Indexing failed for {failures} image(s)")
  if count:
      logger.info("Indexed %d new image(s)", count)
  else:
      logger.debug("index_all: no new images to index")
  return count
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/ai/search_service.py` around lines 149 - 154, The current
index_all loop swallows all exceptions when calling self.index_image(record),
making callers unable to tell “already indexed” from real failures; update
index_all so it does not blanket-catch Exception: either catch only the specific
expected "already indexed" exception (or whatever custom exception your
index_image raises) and continue, but for all other exceptions capture/log them
and accumulate failures to return or re-raise an aggregated error; specifically
modify the try/except around self.index_image(record) and replace the bare
except Exception with targeted exception handling (e.g., except
AlreadyIndexedError: continue) and for other exceptions append iid (or record)
and the exception to a failures list, then after the loop either raise a
combined Exception or return a result object indicating successes and failures,
while keeping the logger.exception(...) call to record details.
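A minimal sketch of the "result object" alternative the prompt suggests: rather than raising on the first failure, `index_all` reports successes and failures so `/search/index` can surface partial errors. `IndexReport` and the free-function form are hypothetical names for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical report object: callers can distinguish "nothing to do"
# from "some images failed to index" without parsing logs.
@dataclass
class IndexReport:
    indexed: int = 0
    failed: list = field(default_factory=list)  # (image_id, error message)

    @property
    def ok(self) -> bool:
        return not self.failed

def index_all(images, index_image, already_indexed=frozenset(),
              force=False) -> IndexReport:
    report = IndexReport()
    for record in images:
        iid = record.get("id")
        if not iid or (not force and iid in already_indexed):
            continue
        try:
            index_image(record)
            report.indexed += 1
        except Exception as exc:  # still logged in the real code
            report.failed.append((iid, str(exc)))
    return report
```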
vec_row = embedding.astype(np.float32).reshape(1, -1)

with self._lock:
    self.ensure_loaded()
    if image_id in self._id_to_row:
        # In-place update (O(1))
        self._matrix[self._id_to_row[image_id]] = vec_row
    else:
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
python - <<'PY'
import numpy as np
m = np.zeros((2, 4), dtype=np.float32)
vec_row = np.ones((1, 4), dtype=np.float32)
try:
    m[0] = vec_row
    print("assignment_ok")
except Exception as e:
    print(type(e).__name__, str(e))
PY
Length of output: 179
🏁 Script executed:
head -160 backend/app/ai/vector_index_manager.py | tail -50
Length of output: 1849
🏁 Script executed:
cat -n backend/app/ai/vector_index_manager.py | sed -n '118,160p'
Length of output: 1943
🏁 Script executed:
find . -type f -name "*.py" -path "*/test*" | xargs grep -l "vector_index_manager\|VectorIndexManager" 2>/dev/null | head -5
Length of output: 44
Normalize the vector shape in the update path (add_or_update).
Line 144 assigns a (1, D) shaped array into a (D,) row slot. NumPy broadcasting accepts this (the verification script prints "assignment_ok"), so the update does not actually crash, but mixing (1, D) and (D,) shapes across the update and append paths is fragile and worth making explicit.
Proposed fix
- vec_row = embedding.astype(np.float32).reshape(1, -1)
+ vec = embedding.astype(np.float32).reshape(-1)
  with self._lock:
      self.ensure_loaded()
      if image_id in self._id_to_row:
          # In-place update (O(1))
-         self._matrix[self._id_to_row[image_id]] = vec_row
+         self._matrix[self._id_to_row[image_id], :] = vec
      else:
          # Append row (amortised O(1) via np.vstack)
          row_idx = len(self._image_ids)
          self._image_ids.append(image_id)
          self._id_to_row[image_id] = row_idx
          if self._matrix.shape[0] > 0:
-             self._matrix = np.vstack([self._matrix, vec_row])
+             self._matrix = np.vstack([self._matrix, vec[np.newaxis, :]])
          else:
-             self._matrix = vec_row
+             self._matrix = vec[np.newaxis, :]
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
vec = embedding.astype(np.float32).reshape(-1)
with self._lock:
    self.ensure_loaded()
    if image_id in self._id_to_row:
        # In-place update (O(1))
        self._matrix[self._id_to_row[image_id], :] = vec
    else:
        # Append row (amortised O(1) via np.vstack)
        row_idx = len(self._image_ids)
        self._image_ids.append(image_id)
        self._id_to_row[image_id] = row_idx
        if self._matrix.shape[0] > 0:
            self._matrix = np.vstack([self._matrix, vec[np.newaxis, :]])
        else:
            self._matrix = vec[np.newaxis, :]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/ai/vector_index_manager.py` around lines 138 - 145, The update
path in add_or_update creates vec_row with shape (1, D) then assigns it into a
single row of self._matrix, causing a shape mismatch; fix by making vec_row a
1-D array (shape (D,)) before assignment. Locate the add_or_update method
(symbols: vec_row, ensure_loaded, _id_to_row, _matrix, image_id) and change the
creation/reshape of vec_row to use .reshape(-1) or index [0] so that assignments
like self._matrix[self._id_to_row[image_id]] = vec_row succeed without
broadcasting errors.
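The behaviour the verification script checks can be reproduced locally. This sketch shows that NumPy broadcasting accepts a `(1, D)` row assignment without error, and how the flattened `(D,)` form keeps the update and append paths shape-consistent (array names here are illustrative):

```python
import numpy as np

# Assigning a (1, D) array into a (D,) row slot broadcasts cleanly,
# so the reshape(-1) change is a consistency cleanup, not a crash fix.
m = np.zeros((2, 4), dtype=np.float32)
vec_row = np.ones((1, 4), dtype=np.float32)
m[0] = vec_row                                # broadcasts (1, 4) -> (4,)
assert m[0].tolist() == [1.0, 1.0, 1.0, 1.0]

# The flattened form keeps update and append uniform:
vec = vec_row.reshape(-1)                     # shape (4,)
m[1, :] = vec                                 # in-place row update
stacked = np.vstack([m, vec[np.newaxis, :]])  # append path -> (3, 4)
assert stacked.shape == (3, 4)
```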
vec = embedding.astype(np.float32)
blob = vec.tobytes()
conn = _connect()
try:
    conn.execute(
        """
        INSERT INTO image_embeddings (image_id, embedding, dim, updated_at)
        VALUES (?, ?, ?, CURRENT_TIMESTAMP)
        ON CONFLICT(image_id) DO UPDATE SET
            embedding = excluded.embedding,
            dim = excluded.dim,
            updated_at = excluded.updated_at
        """,
        (image_id, blob, len(vec)),
    )
Validate embedding dimensionality before persisting
len(vec) is unsafe for non-1D arrays and can store an incorrect dim, which later makes reads look corrupt. Reject invalid shapes up front.
Proposed fix
- vec = embedding.astype(np.float32)
+ vec = np.asarray(embedding, dtype=np.float32)
+ if vec.ndim != 1:
+     raise ValueError(f"Embedding must be 1-D, got shape={vec.shape}")
+ if not np.isfinite(vec).all():
+     raise ValueError("Embedding contains non-finite values")
  blob = vec.tobytes()
@@
- (image_id, blob, len(vec)),
+ (image_id, blob, vec.shape[0]),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/database/embeddings.py` around lines 90 - 104, Validate that the
embedding is a 1-D array before persisting: after creating vec via
embedding.astype(np.float32) (and before computing dim or calling conn.execute
on the INSERT into image_embeddings), check vec.ndim == 1 and use vec.shape[0]
as the dim; if the shape is invalid, raise a ValueError (or otherwise abort)
rather than using len(vec) so you never store an incorrect dim for image_id and
corrupt future reads. Ensure this validation is applied where vec is created and
before calling _connect() / conn.execute.
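A hedged sketch of the validation plus the BLOB round-trip it protects, using an in-memory SQLite table. The `image_embeddings` schema here is simplified from the PR description, and `validate_embedding` is an illustrative helper name, not the real module's API.

```python
import sqlite3
import numpy as np

def validate_embedding(embedding) -> np.ndarray:
    # Reject anything that is not a finite 1-D float32 vector, so the
    # stored dim column can never disagree with the actual blob layout.
    vec = np.asarray(embedding, dtype=np.float32)
    if vec.ndim != 1:
        raise ValueError(f"Embedding must be 1-D, got shape={vec.shape}")
    if not np.isfinite(vec).all():
        raise ValueError("Embedding contains non-finite values")
    return vec

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image_embeddings "
    "(image_id INTEGER PRIMARY KEY, embedding BLOB, dim INTEGER)"
)
vec = validate_embedding(np.random.rand(163))  # 163-dim as in the PR
conn.execute(
    "INSERT INTO image_embeddings VALUES (?, ?, ?)",
    (7, vec.tobytes(), vec.shape[0]),
)
blob, dim = conn.execute(
    "SELECT embedding, dim FROM image_embeddings WHERE image_id = 7"
).fetchone()
restored = np.frombuffer(blob, dtype=np.float32)
assert restored.shape[0] == dim == 163
```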
except Exception:
    logger.exception("Semantic search failed for query %r", q)
    raise HTTPException(
Avoid logging raw user queries in exception logs.
Line 87 logs q verbatim, which can capture sensitive user-entered text. Log metadata (e.g., length/request id) instead of raw query content.
🔒 Proposed fix
  except Exception:
-     logger.exception("Semantic search failed for query %r", q)
+     logger.exception("Semantic search failed (query_len=%d)", len(q))
      raise HTTPException(
          status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
          detail="Semantic search failed.",
      )
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
except Exception:
    logger.exception("Semantic search failed (query_len=%d)", len(q))
    raise HTTPException(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/routes/search.py` around lines 86 - 88, The exception handler in
search.py currently calls logger.exception("Semantic search failed for query
%r", q) which logs raw user input q; change this to log non-sensitive metadata
instead: replace the logger.exception call in the except block handling semantic
search with something like logger.exception("Semantic search failed for
request_id=%s query_len=%d", request_id, len(q or "")) (or include any existing
request identifier) and remove q from logs, ensuring the raise HTTPException
remains unchanged; update references around logger.exception, the except block,
and the variable q to compute/query length and attach request id if available.
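One way to keep failure logs correlatable without storing user text is to log the query length plus a short digest of it. `log_search_failure` below is an illustrative helper, not part of the PR:

```python
import hashlib
import logging

logger = logging.getLogger("search")

def log_search_failure(q: str, request_id: str = "-") -> str:
    # A truncated SHA-256 lets operators spot repeated failures for the
    # same query without the raw user-entered text ever reaching the logs.
    digest = hashlib.sha256(q.encode("utf-8")).hexdigest()[:8]
    logger.exception(
        "Semantic search failed (request_id=%s query_len=%d query_sha=%s)",
        request_id, len(q), digest,
    )
    return digest
```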
db_create_embeddings_table()
# Instantiate the semantic search service and attach it to app state
app.state.search_service = SearchService()
# Index any unindexed tagged images in a background thread so startup
# is not blocked. Already-indexed images are skipped automatically.
threading.Thread(
    target=app.state.search_service.index_all,
    daemon=True,
    name="semantic-index-startup",
).start()
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Search for test files that cover SearchService, index_all, embeddings table, and search endpoints
echo "=== Searching for test files covering semantic search startup/indexing ==="
rg -nP --type=py -g '*test*.py' -C2 'SearchService|index_all|db_create_embeddings_table|/search/semantic|/search/index-status|/search/index|lifespan' 2>/dev/null || echo "No matches found"
echo ""
echo "=== Checking for test directory structure ==="
fd -t f -e py . -g '*test*' | head -20
Length of output: 311
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Finding all test files in repository ==="
find . -type f -name "*test*.py" -o -name "test_*.py" | head -30
echo ""
echo "=== Checking for common test directories ==="
find . -type d \( -name "*test*" -o -name "tests" \) 2>/dev/null | head -20
echo ""
echo "=== Listing all Python files in common test locations ==="
if [ -d "tests" ]; then
echo "Found 'tests' directory:"
ls -la tests/ 2>/dev/null | head -20
fi
if [ -d "test" ]; then
echo "Found 'test' directory:"
ls -la test/ 2>/dev/null | head -20
fi
echo ""
echo "=== Searching for SearchService class definition ==="
rg -n "class SearchService" --type=py
Length of output: 616
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Searching for SearchService references in test files ==="
rg -n "SearchService|index_all" backend/tests/ --type=py
echo ""
echo "=== Checking conftest.py for startup/lifespan fixtures ==="
cat -n backend/tests/conftest.py
echo ""
echo "=== Listing all test functions in test files ==="
rg -n "^def test_|^async def test_" backend/tests/ --type=py
echo ""
echo "=== Checking backend/test.py ==="
wc -l backend/test.py
head -50 backend/test.py
Length of output: 122
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Checking conftest.py for fixtures ==="
cat backend/tests/conftest.py
echo ""
echo "=== Checking backend/test.py ==="
cat backend/test.py 2>/dev/null || echo "File not found or error"
echo ""
echo "=== Checking backend/app/routes/test.py ==="
cat backend/app/routes/test.py 2>/dev/null || echo "File not found or error"
echo ""
echo "=== Examining SearchService class and startup integration ==="
rg -A 20 "class SearchService" backend/app/ai/search_service.py
echo ""
echo "=== Checking main.py lines 60-80 for full context ==="
sed -n '60,80p' backend/main.py
Length of output: 11617
Add automated test coverage for semantic search startup initialization
This startup block is critical but completely untested. The test suite's conftest.py does not create the embeddings table or initialize SearchService, and no tests exist for startup indexing behavior or search endpoints. Add tests covering:
- Embeddings table creation during setup
- SearchService instantiation and index_all() background-thread trigger
- Search endpoint functionality
Relates to coding guidelines: "Ensure that test code is automated, comprehensive, and follows testing best practices" and "Verify that all critical functionality is covered by tests".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/main.py` around lines 66 - 75, Add automated tests that verify the
startup block calls db_create_embeddings_table(), creates a SearchService
instance on app.state.search_service, and starts a background thread invoking
SearchService.index_all(); specifically, in tests (e.g., new test file using the
test client and monkeypatch/patch), assert db_create_embeddings_table was called
during app setup, assert app.state.search_service is an instance of
SearchService, and patch SearchService.index_all to confirm it was scheduled (or
invoked) by the background Thread (or assert a Thread named
"semantic-index-startup" was started); also add integration tests against the
search endpoint to confirm search responses after indexing behavior is
simulated.
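A sketch of how the startup behaviour could be asserted without running FastAPI: the lifespan logic is extracted into a function that takes its collaborators as parameters, so each call can be observed with mocks. `startup` and its signature are hypothetical; real tests would patch the names inside `backend/main.py` instead.

```python
import threading
from unittest.mock import MagicMock

def startup(state, create_table, search_service_cls):
    # Mirrors the startup block above: create the table, attach the
    # service to app state, kick off background indexing.
    create_table()
    state.search_service = search_service_cls()
    t = threading.Thread(
        target=state.search_service.index_all,
        daemon=True,
        name="semantic-index-startup",
    )
    t.start()
    return t

state = MagicMock()
create_table = MagicMock()
svc_cls = MagicMock()
thread = startup(state, create_table, svc_cls)
thread.join(timeout=5)

create_table.assert_called_once()
svc_cls.assert_called_once()
assert thread.name == "semantic-index-startup"
assert state.search_service.index_all.called
```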
🚀 Feature: AI-Powered Natural Language Search (Phase 1)
Implements the backend foundation for semantic photo search as discussed in #1205.
This PR introduces a modular and offline-first semantic search system that enables natural language queries against locally stored photo metadata.
🧠 What This PR Includes
1️⃣ Embedding Pipeline
2️⃣ Persistent Embedding Storage
3️⃣ Modular Architecture
Introduces:
- EmbeddingService – handles embedding generation.
- VectorIndexManager – manages the vector index lifecycle.
- SearchService – executes semantic similarity queries.
All components are isolated and extensible.
🔒 Privacy Considerations
⚙️ Performance Considerations
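A sketch of the flat-index search this design implies: with L2-normalised float32 rows, cosine similarity against a normalised query is a single matrix product followed by `argsort`. Array names are illustrative; 163 is the embedding dimension from the PR description, and 10k rows lands at roughly 6.5 MB of RAM.

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 L2-normalised 163-dim float32 rows, one per indexed image.
matrix = rng.random((10_000, 163)).astype(np.float32)
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

# Normalised query vector built from the same embedding pipeline.
query = rng.random(163).astype(np.float32)
query /= np.linalg.norm(query)

scores = matrix @ query               # (10000,) cosine similarities, O(N)
top_k = np.argsort(scores)[::-1][:5]  # indices of the 5 best matches
```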
📌 Scope
This PR introduces backend semantic search only.
UI integration and extended features will be handled in follow-up issues.
🧪 Testing
🔄 Future Improvements
Closes #1205
Summary by CodeRabbit