
feat(search): add offline AI-powered semantic search system#1208

Open
Arunodoy18 wants to merge 1 commit into AOSSIE-Org:main from Arunodoy18:feature/1205-ai-natural-search

Conversation

@Arunodoy18 commented Mar 3, 2026

🚀 Feature: AI-Powered Natural Language Search (Phase 1)

Implements the backend foundation for semantic photo search as discussed in #1205.

This PR introduces a modular and offline-first semantic search system that enables natural language queries against locally stored photo metadata.


🧠 What This PR Includes

1️⃣ Embedding Pipeline

  • Generates text embeddings from photo metadata (tags, detected objects, faces, dates, locations).
  • Uses a lightweight local embedding model (offline only).
  • No external API calls.
  • No cloud dependency.

2️⃣ Persistent Embedding Storage

  • Embeddings are stored locally.
  • Avoids recomputation on app startup.
  • Designed to scale to large libraries.

3️⃣ Modular Architecture

Introduces:

  • EmbeddingService – Handles embedding generation.
  • VectorIndexManager – Manages vector index lifecycle.
  • SearchService – Executes semantic similarity queries.

All components are isolated and extensible.
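As a rough sketch of how these three components might compose (the class names come from the PR; every method name and the toy three-word embedding here are hypothetical, not the PR's actual API):

```python
# Illustrative wiring only -- embed_metadata/add_or_update/search signatures
# are assumptions for this sketch, not the PR's real interfaces.
class EmbeddingService:
    def embed_metadata(self, record: dict) -> list:
        # Toy embedding: presence flags for a tiny fixed vocabulary.
        vocab = ["dog", "beach", "sunset"]
        tags = set(record.get("tags", []))
        return [1.0 if t in tags else 0.0 for t in vocab]


class VectorIndexManager:
    def __init__(self) -> None:
        self._vectors = {}  # image_id -> vector

    def add_or_update(self, image_id: str, vec: list) -> None:
        self._vectors[image_id] = vec

    def search(self, query: list) -> list:
        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))
        scored = [(iid, dot(query, v)) for iid, v in self._vectors.items()]
        return sorted(scored, key=lambda p: p[1], reverse=True)


class SearchService:
    def __init__(self) -> None:
        self.embed = EmbeddingService()
        self.index = VectorIndexManager()

    def index_image(self, record: dict) -> None:
        self.index.add_or_update(record["id"], self.embed.embed_metadata(record))


svc = SearchService()
svc.index_image({"id": "img1", "tags": ["dog", "beach"]})
svc.index_image({"id": "img2", "tags": ["sunset"]})
print(svc.index.search([1.0, 1.0, 0.0])[0][0])  # img1 ranks first
```

The isolation claimed above is what makes this shape testable: each class can be mocked independently of the other two.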


🔒 Privacy Considerations

  • Fully offline.
  • No external network calls.
  • No metadata leaves the device.
  • Model runs locally.

⚙️ Performance Considerations

  • Embeddings generated only when new photos are indexed.
  • No blocking operations on startup.
  • Vector index persisted to disk.
  • Designed to scale without excessive memory growth.

📌 Scope

This PR introduces backend semantic search only.

UI integration and extended features such as:

  • Duplicate detection
  • Auto-generated albums
  • Advanced filtering

will be handled in follow-up issues.


🧪 Testing

  • Verified embedding generation correctness.
  • Verified consistent similarity ranking.
  • Tested incremental indexing flow.
  • Confirmed no regression in existing search.

🔄 Future Improvements

  • Optional lightweight model selection.
  • Threshold tuning for semantic similarity.
  • Search result ranking refinement.
  • UI integration layer.

Closes #1205

Summary by CodeRabbit

  • New Features
    • Added semantic image search endpoint enabling natural language queries to find images
    • Added index status endpoint to monitor indexing statistics
    • Added index trigger endpoint to index or re-index all images
    • Added endpoint to remove individual images from the search index

Introduces a fully offline semantic search backend with no external
API dependencies. Images are represented as fixed-dimension (163-dim)
L2-normalised float32 vectors derived from existing metadata:
YOLO object tags, time-of-day, season, day-of-week, face-cluster
presence, and GPS coordinates.
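The metadata-to-vector construction might look roughly like the following sketch. The feature groups (object tags, time-of-day, season, day-of-week, GPS) come from the commit message above; the concrete offsets, bin counts, and the three-class stand-in vocabulary are assumptions for illustration, not the real 163-dim layout:

```python
import math

# Hypothetical layout -- the real offsets live in EmbeddingService.
OBJECT_CLASSES = ["person", "dog", "car"]  # stand-in for the YOLO class list
TIME_BINS, SEASONS, WEEKDAYS = 4, 4, 7


def embed(tags, time_bin, season, weekday, lat=None, lon=None):
    vec = [0.0] * (len(OBJECT_CLASSES) + TIME_BINS + SEASONS + WEEKDAYS + 2)
    for t in tags:  # multi-hot object tags
        if t in OBJECT_CLASSES:
            vec[OBJECT_CLASSES.index(t)] = 1.0
    off = len(OBJECT_CLASSES)
    vec[off + time_bin] = 1.0                       # one-hot time of day
    vec[off + TIME_BINS + season] = 1.0             # one-hot season
    vec[off + TIME_BINS + SEASONS + weekday] = 1.0  # one-hot weekday
    if lat is not None and lon is not None:
        vec[-2] = lat / 90.0                        # scale into [-1, 1]
        vec[-1] = lon / 180.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # L2-normalise: dot product == cosine


v = embed(["dog"], time_bin=2, season=1, weekday=4, lat=48.1, lon=11.6)
assert abs(sum(x * x for x in v) - 1.0) < 1e-9  # unit norm
```

Because every stored vector is L2-normalised, similarity search reduces to a plain dot product against the query vector.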

New modules
-----------
- app/ai/embedding_service.py    EmbeddingService: metadata -> vector
- app/ai/vector_index_manager.py VectorIndexManager: numpy flat index
- app/ai/search_service.py       SearchService: indexing + search facade
- app/ai/__init__.py             package exports
- app/database/embeddings.py     image_embeddings SQLite table (BLOB)
- app/routes/search.py           /search/* FastAPI endpoints
- app/schemas/search.py          Pydantic response models

Modified files
--------------
- app/config/settings.py  AI constants (MAX_FACE_CLUSTERS, TOP_K, ...)
- app/database/images.py  db_get_images_by_ids() helper
- app/database/faces.py   db_get_faces_by_image_id() helper
- main.py                 table init, SearchService singleton,
                          background startup indexing, router mount

API endpoints
-------------
GET    /search/semantic?q=...      ranked semantic search
GET    /search/index-status        index health / size
POST   /search/index               incremental or forced re-index
DELETE /search/index/{image_id}    remove one image from the index

Design properties
-----------------
- Fully offline: numpy + stdlib only, no external models or APIs
- No recomputation on startup: already-indexed images are skipped
- Persistent: embeddings stored as BLOBs in existing SQLite DB
- Scalable: 10k images ~6.5 MB RAM, O(N) cosine via matmul
- Thread-safe incremental add/remove without full index rebuild
- Zero impact on existing features
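The memory figure and the matmul-based cosine scan above can be sanity-checked with a short numpy sketch (illustrative only; the real index lives in VectorIndexManager):

```python
import numpy as np

D = 163  # embedding dimension from this commit

# Memory check: 10k images x 163 dims x 4 bytes (float32) ~= 6.5 MB.
mem_mb = 10_000 * D * 4 / 1e6
print(round(mem_mb, 2))  # 6.52

# O(N) cosine in one matmul: because all rows are L2-normalised, the dot
# product with a normalised query IS the cosine similarity.
rng = np.random.default_rng(0)
matrix = rng.random((1_000, D), dtype=np.float32)
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
query = rng.random(D, dtype=np.float32)
query /= np.linalg.norm(query)

scores = matrix @ query                # shape (1000,): all similarities at once
top5 = np.argsort(scores)[::-1][:5]    # indices of the best-scoring rows
assert scores.shape == (1_000,)
```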

Closes AOSSIE-Org#1205
Copilot AI review requested due to automatic review settings March 3, 2026 06:25

coderabbitai bot commented Mar 3, 2026

Walkthrough

A new offline semantic search subsystem is added, featuring embedding generation from image metadata, a thread-safe in-memory vector index backed by SQLite persistence, and a high-level search service orchestrating both. New API endpoints expose semantic search, index status, and index management. Integration occurs in the app startup with background indexing.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **AI Embedding & Search Core**<br>`backend/app/ai/__init__.py`, `backend/app/ai/embedding_service.py`, `backend/app/ai/vector_index_manager.py`, `backend/app/ai/search_service.py` | Introduces the offline semantic search subsystem. EmbeddingService generates L2-normalized vectors from YOLO object classes, temporal bins, face presence, and spatial coordinates. VectorIndexManager maintains a thread-safe in-memory cosine-similarity index with SQLite backing. SearchService orchestrates both for semantic image search with ranked results. |
| **Configuration**<br>`backend/app/config/settings.py` | Added AI configuration constants: AI_DATA_PATH, MAX_FACE_CLUSTERS, SEARCH_DEFAULT_TOP_K, SEARCH_MIN_SCORE. |
| **Database Embedding Storage**<br>`backend/app/database/embeddings.py` | New module for persisting embeddings as SQLite BLOBs with upsert, retrieval, deletion, and bulk-load operations. Handles float32 normalization and cascade delete on image removal. |
| **Database Query Helpers**<br>`backend/app/database/faces.py`, `backend/app/database/images.py` | Enhanced face and image query paths: db_get_faces_by_image_id retrieves face cluster associations; db_get_images_by_ids batch-fetches essential metadata (id, path, thumbnail, tags) for search result enrichment. |
| **Search API Layer**<br>`backend/app/routes/search.py`, `backend/app/schemas/search.py` | New FastAPI router exposing semantic search, index status, indexing trigger, and image removal endpoints. Pydantic schemas define SearchResultItem, SemanticSearchResponse, IndexStatusResponse, and IndexTriggerResponse. |
| **Application Integration**<br>`backend/main.py` | Initialization of embeddings table, SearchService singleton attachment to app.state, background daemon thread for batch image indexing, and route registration at /search prefix. |
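The BLOB persistence summarized above can be sketched with stdlib `sqlite3` + `struct`. This is a simplified stand-in (the table name matches the PR; the two-column schema and helper names are assumptions):

```python
import sqlite3
import struct

# Minimal sketch of float32 BLOB round-tripping; the real table also tracks
# vector dim and handles cascade deletes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image_embeddings (image_id TEXT PRIMARY KEY, dim INTEGER, vec BLOB)"
)


def save(image_id, vec):
    blob = struct.pack(f"{len(vec)}f", *vec)  # pack as float32
    conn.execute(
        "INSERT INTO image_embeddings VALUES (?, ?, ?) "
        "ON CONFLICT(image_id) DO UPDATE SET dim=excluded.dim, vec=excluded.vec",
        (image_id, len(vec), blob),
    )


def load(image_id):
    dim, blob = conn.execute(
        "SELECT dim, vec FROM image_embeddings WHERE image_id=?", (image_id,)
    ).fetchone()
    return list(struct.unpack(f"{dim}f", blob))


save("img1", [0.25, 0.5, 0.75])  # values chosen to be float32-exact
print(load("img1"))  # [0.25, 0.5, 0.75]
```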

Sequence Diagram

```mermaid
sequenceDiagram
    actor User
    participant API as FastAPI Router
    participant SearchSvc as SearchService
    participant EmbedSvc as EmbeddingService
    participant VecIdx as VectorIndexManager
    participant DB as SQLite DB

    User->>API: GET /search/semantic?q=beach
    API->>SearchSvc: search("beach", top_k=20)
    SearchSvc->>EmbedSvc: embed_query("beach")
    EmbedSvc-->>SearchSvc: query_embedding [163D]
    SearchSvc->>VecIdx: ensure_loaded()
    VecIdx->>DB: db_get_all_embeddings()
    DB-->>VecIdx: [(image_id, embedding), ...]
    VecIdx-->>SearchSvc: index loaded
    SearchSvc->>VecIdx: search(query_embedding, top_k=20)
    VecIdx-->>SearchSvc: [(image_id, score), ...]
    SearchSvc->>DB: db_get_images_by_ids([...])
    DB-->>SearchSvc: [image metadata, ...]
    SearchSvc-->>API: [SearchResult(...), ...]
    API-->>User: 200 SemanticSearchResponse
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Suggested labels

Python

Suggested reviewers

  • rahulharpal1603

Poem

🐰 A hop through embeddings, a leap through the search,
L2-normalizing each photo with care,
From "beach" to your memories, vectors in flight,
Semantic rabbits find images so fair! ✨📸

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly and concisely summarizes the main change: adding an offline AI-powered semantic search system, which directly aligns with the primary objective of implementing AI-powered natural language search for photos. |
| Linked Issues check | ✅ Passed | The PR fully implements the core coding requirements from #1205: generating text embeddings from photo metadata, storing embeddings in a vector index, accepting natural language queries, and ensuring offline processing with no external APIs. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to the semantic search implementation. Database additions, configuration constants, embedding/indexing services, and API routes all support the stated objective. No extraneous modifications detected. |


Copilot AI left a comment

Pull request overview

Adds an offline, semantic (embedding-based) search subsystem to the PictoPy backend, including persistent embedding storage in SQLite, an in-memory vector index, and FastAPI endpoints for querying and managing the index.

Changes:

  • Introduces EmbeddingService, VectorIndexManager, and SearchService to generate/query embeddings and manage an in-memory index backed by SQLite.
  • Adds new /search/* API routes and Pydantic schemas for semantic search and index management.
  • Extends the database layer with an image_embeddings table plus helper queries for enriching results (image info + face clusters).

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.

Summary per file

| File | Description |
|---|---|
| backend/main.py | Initializes embeddings table + search service at startup and starts background indexing thread; registers /search router. |
| backend/app/schemas/search.py | Adds response models for semantic search and index operations. |
| backend/app/routes/search.py | Adds /search/semantic, /search/index-status, /search/index, and /search/index/{image_id} endpoints. |
| backend/app/database/images.py | Adds db_get_images_by_ids() to enrich search results with path/thumbnail/tags. |
| backend/app/database/faces.py | Adds db_get_faces_by_image_id() for face-cluster enrichment during embedding. |
| backend/app/database/embeddings.py | Adds SQLite persistence layer for per-image embeddings (CRUD + bulk load). |
| backend/app/config/settings.py | Adds AI/search-related settings (paths, defaults, thresholds, face-cluster cap). |
| backend/app/ai/vector_index_manager.py | Implements a thread-safe in-memory flat index + cosine similarity search. |
| backend/app/ai/search_service.py | Orchestrates embedding generation, persistence, indexing, and result enrichment. |
| backend/app/ai/`__init__.py` | Exposes the AI subsystem components as a package. |


Comment on lines +140 to +145
```python
with self._lock:
    self.ensure_loaded()
    if image_id in self._id_to_row:
        # In-place update (O(1))
        self._matrix[self._id_to_row[image_id]] = vec_row
    else:
```
Copilot AI Mar 3, 2026

VectorIndexManager is documented as lazily loaded on first query, but add_or_update() always calls ensure_loaded(), which forces a full DB load the first time any embedding is added/updated (e.g., during startup indexing). If you want to preserve lazy loading, consider only updating the in-memory matrix when the index is already loaded, or buffering updates until the first explicit load/search.
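The deferred-update option suggested here could be sketched as follows (all names — `LazyIndex`, `_pending`, `_loaded` — are hypothetical illustrations, not the PR's code):

```python
# Hypothetical sketch: buffer writes until the index is actually loaded,
# so add_or_update() no longer forces a full DB load on first use.
class LazyIndex:
    def __init__(self) -> None:
        self._loaded = False
        self._pending = {}   # deferred image_id -> vector writes
        self._vectors = {}   # the "loaded" in-memory index

    def add_or_update(self, image_id, vec):
        if self._loaded:
            self._vectors[image_id] = vec   # index hot: update in place
        else:
            self._pending[image_id] = vec   # defer until first load

    def ensure_loaded(self):
        if not self._loaded:
            # ...bulk-load persisted embeddings from SQLite here...
            self._loaded = True
        self._vectors.update(self._pending)  # replay deferred writes
        self._pending.clear()


idx = LazyIndex()
idx.add_or_update("img1", [0.1, 0.2])  # buffered; no DB load triggered
assert not idx._loaded
idx.ensure_loaded()                    # first search would call this
assert "img1" in idx._vectors
```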

Comment on lines +67 to +75
```python
# Instantiate the semantic search service and attach it to app state
app.state.search_service = SearchService()
# Index any unindexed tagged images in a background thread so startup
# is not blocked. Already-indexed images are skipped automatically.
threading.Thread(
    target=app.state.search_service.index_all,
    daemon=True,
    name="semantic-index-startup",
).start()
```
Copilot AI Mar 3, 2026

The startup indexing runs in a daemon thread. Because shutdown can terminate the process abruptly (e.g. os._exit(0) on Windows), this thread may be killed mid-run, leaving indexing partially complete. Consider tracking the thread/task and allowing a graceful stop/join on shutdown, or making indexing explicitly triggerable only via the API to avoid background work during startup/shutdown cycles.

Comment on lines +116 to +122
```python
def index_status(request: Request) -> IndexStatusResponse:
    service = _get_service(request)
    return IndexStatusResponse(
        success=True,
        indexed_count=service.index_size,
        is_loaded=service.is_index_loaded,
    )
```
Copilot AI Mar 3, 2026

indexed_count is derived from service.index_size, which reports the in-memory index size. When the index is still lazily unloaded, this will return 0 even if embeddings exist on disk, making /search/index-status misleading. Consider either forcing a load for this endpoint or returning a persisted count from the embeddings table (and optionally reporting both persisted vs in-memory counts).

Comment on lines +62 to +67
```python
conn.execute(
    """
    CREATE INDEX IF NOT EXISTS ix_image_embeddings_image_id
    ON image_embeddings(image_id)
    """
)
```
Copilot AI Mar 3, 2026

image_id is already the PRIMARY KEY of image_embeddings, so SQLite will already maintain an index for it. The additional CREATE INDEX ... ON image_embeddings(image_id) is redundant and can be removed to reduce schema clutter.

Suggested change (delete the redundant index creation):

```diff
-conn.execute(
-    """
-    CREATE INDEX IF NOT EXISTS ix_image_embeddings_image_id
-    ON image_embeddings(image_id)
-    """
-)
```
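The redundancy claimed here is easy to verify with stdlib `sqlite3`: a non-INTEGER PRIMARY KEY on an ordinary (rowid) table already gets an implicit unique index. A simplified two-column schema is assumed for the demo:

```python
import sqlite3

# A TEXT PRIMARY KEY already gets an implicit sqlite_autoindex_* index,
# so an extra CREATE INDEX on image_id adds nothing.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image_embeddings (image_id TEXT PRIMARY KEY, vec BLOB)"
)
indexes = conn.execute("PRAGMA index_list('image_embeddings')").fetchall()
names = [row[1] for row in indexes]  # row = (seq, name, unique, origin, partial)
print(names)  # ['sqlite_autoindex_image_embeddings_1']
```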

Comment on lines +137 to +148
```python
if images is None:
    images = db_get_all_images(tagged=True)

already_indexed: set[str] = set() if force else set(db_get_indexed_image_ids())

count = 0
for record in images:
    iid = record.get("id")
    if not iid:
        continue
    if not force and iid in already_indexed:
        continue
```
Copilot AI Mar 3, 2026

index_all(force=False) skips any image_id already present in image_embeddings, but VectorIndexManager will ignore stored embeddings whose dim doesn't match the current EMBEDDING_DIM (e.g., after changing MAX_FACE_CLUSTERS in a new release). In that scenario, the in-memory index will be empty and index_all still won’t rebuild those embeddings unless the user explicitly passes force=true. Consider treating rows with a mismatched dim as not indexed (e.g., query only ids where dim == EMBEDDING_DIM, or add a DB helper to return ids by dim).

Comment on lines +138 to +154
```python
vec_row = embedding.astype(np.float32).reshape(1, -1)

with self._lock:
    self.ensure_loaded()
    if image_id in self._id_to_row:
        # In-place update (O(1))
        self._matrix[self._id_to_row[image_id]] = vec_row
    else:
        # Append row (amortised O(1) via np.vstack)
        row_idx = len(self._image_ids)
        self._image_ids.append(image_id)
        self._id_to_row[image_id] = row_idx
        if self._matrix.shape[0] > 0:
            self._matrix = np.vstack([self._matrix, vec_row])
        else:
            self._matrix = vec_row
```
Copilot AI Mar 3, 2026

In add_or_update, vec_row is reshaped to (1, dim) but the in-place update assigns it into a 1-D row (self._matrix[row]), which will raise a NumPy broadcasting error when updating an existing embedding. Use a 1-D vector for assignment (e.g., avoid the extra leading dimension for the update path).

Comment on lines +56 to +83
```python
@router.get(
    "/semantic",
    response_model=SemanticSearchResponse,
    summary="Semantic image search",
    description=(
        "Search for images using a natural-language query. The query is mapped "
        "onto the same fixed-dimension embedding space used to index image "
        "metadata (object classes, date/time, season, faces, GPS) and the "
        "most similar images are returned ranked by cosine similarity."
    ),
)
def semantic_search(
    request: Request,
    q: str = Query(..., min_length=1, description="Free-text search query."),
    limit: int = Query(
        default=SEARCH_DEFAULT_TOP_K,
        ge=1,
        le=200,
        description="Maximum number of results to return.",
    ),
    min_score: float = Query(
        default=SEARCH_MIN_SCORE,
        ge=0.0,
        le=1.0,
        description="Minimum cosine-similarity score threshold.",
    ),
) -> SemanticSearchResponse:
    service = _get_service(request)
```
Copilot AI Mar 3, 2026

This PR adds new search/indexing endpoints and core ranking logic, but there are no corresponding tests under backend/tests/ (the repo already has substantial FastAPI/DB test coverage). Adding tests for /search/semantic, /search/index, and the indexing/search flow (including score thresholding and empty-index behavior) would help prevent regressions.

Comment on lines +214 to +224
```python
results: List[SearchResult] = []
for image_id, score in raw:
    info = path_map.get(image_id, {})
    results.append(
        SearchResult(
            image_id=image_id,
            score=round(score, 4),
            path=info.get("path", ""),
            thumbnail_path=info.get("thumbnailPath", ""),
            tags=info.get("tags") or [],
        )
```
Copilot AI Mar 3, 2026

SearchResultItem.score is constrained to [0, 1] in the response schema, but cosine similarity computed in float32 can drift slightly outside this range (e.g., 1.0000x) due to numerical error, which can cause response-model validation failures. Consider clamping scores into [0.0, 1.0] (and optionally also clamping at 0.0) before returning them.
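The suggested clamp is a one-liner; a sketch (helper name is illustrative):

```python
# Clamp float32 cosine drift back into the schema's [0, 1] bound
# before rounding, so Pydantic validation cannot fail on 1.0000x.
def clamp_score(score: float) -> float:
    return round(min(1.0, max(0.0, score)), 4)


print(clamp_score(1.0000002))  # 1.0
print(clamp_score(-0.00003))   # 0.0
print(clamp_score(0.87654))    # 0.8765
```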

Comment on lines +274 to +282
```python
elif len(tok) >= 4:
    # Partial substring match only for sufficiently long tokens
    # (avoids "at" → "cat", "park" → "parking meter" false positives).
    # Rule: the token must be a prefix, suffix, or the class name must
    # start with the token (prevents mid-word collisions like "ark").
    for name, idx in _OBJ_INDEX.items():
        if name.startswith(tok) or tok.startswith(name):
            matched.add(idx)
i += 1
```
Copilot AI Mar 3, 2026

The comment describing the partial-match rule mentions allowing prefix/suffix matches, but the implementation only checks name.startswith(tok) and tok.startswith(name) (no suffix matching). Either update the comment to reflect the actual rule or adjust the matching logic so documentation and behavior stay aligned.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/ai/embedding_service.py`:
- Around line 197-205: The constructor EmbeddingService.__init__ currently
allows max_face_clusters > module MAX_FC which can cause out-of-bounds writes in
_fill_faces; clamp the runtime value by setting self._max_fc =
min(max_face_clusters, MAX_FC) (and keep the existing warning but reflect the
clamping), and update _fill_faces (and the other location referenced around
lines 357-359) to always use min(self._max_fc, MAX_FC) or the clamped
self._max_fc when computing per-face offsets/indices so no code ever assumes
more than MAX_FC face clusters.
- Around line 371-373: The spatial encoding currently normalizes lat/lon without
bounds checking; update the logic in embedding_service.py (around the vec,
_OFF_FLAG and _OFF_SPA usage) to validate GPS values: clamp latitude to [-90,
90] and longitude to [-180, 180] (or raise a ValueError if you prefer strict
validation) before computing vec[_OFF_SPA] = float(lat)/90.0 and vec[_OFF_SPA +
1] = float(lon)/180.0; ensure vec[_OFF_FLAG + 1] (has_location) is only set when
a (possibly clamped) numeric lat/lon is present and convert inputs to float
first.

In `@backend/app/ai/search_service.py`:
- Around line 149-154: The current index_all loop swallows all exceptions when
calling self.index_image(record), making callers unable to tell “already
indexed” from real failures; update index_all so it does not blanket-catch
Exception: either catch only the specific expected "already indexed" exception
(or whatever custom exception your index_image raises) and continue, but for all
other exceptions capture/log them and accumulate failures to return or re-raise
an aggregated error; specifically modify the try/except around
self.index_image(record) and replace the bare except Exception with targeted
exception handling (e.g., except AlreadyIndexedError: continue) and for other
exceptions append iid (or record) and the exception to a failures list, then
after the loop either raise a combined Exception or return a result object
indicating successes and failures, while keeping the logger.exception(...) call
to record details.
- Around line 115-227: Add automated unit tests for SearchService covering:
index_all(force=False) skips already indexed IDs and returns correct count,
index_all(force=True) re-embeds all items and overwrites existing entries, and
index_all handles per-image exceptions (mock index_image to raise for one record
and assert count reflects only successful indexes and that logger.exception was
called); remove_image removes persisted embedding and the in-memory index (mock
db_delete_embedding and self._index.remove and then run search to assert the
image no longer appears); and search returns correctly ordered SearchResult
entries with trimmed query, threshold filtering, and proper path/thumbnail/tags
mapping (instantiate SearchService, mock its _embed.embed_query, _index.search,
and _build_path_map to assert ranking, score rounding, and empty-query
behavior). Use mocks for db_get_all_images, db_get_indexed_image_ids,
db_delete_embedding, _build_path_map, and methods index_image/_index/_embed so
tests are isolated and deterministic.

In `@backend/app/ai/vector_index_manager.py`:
- Around line 138-145: The update path in add_or_update creates vec_row with
shape (1, D) then assigns it into a single row of self._matrix, causing a shape
mismatch; fix by making vec_row a 1-D array (shape (D,)) before assignment.
Locate the add_or_update method (symbols: vec_row, ensure_loaded, _id_to_row,
_matrix, image_id) and change the creation/reshape of vec_row to use
.reshape(-1) or index [0] so that assignments like
self._matrix[self._id_to_row[image_id]] = vec_row succeed without broadcasting
errors.

In `@backend/app/database/embeddings.py`:
- Around line 90-104: Validate that the embedding is a 1-D array before
persisting: after creating vec via embedding.astype(np.float32) (and before
computing dim or calling conn.execute on the INSERT into image_embeddings),
check vec.ndim == 1 and use vec.shape[0] as the dim; if the shape is invalid,
raise a ValueError (or otherwise abort) rather than using len(vec) so you never
store an incorrect dim for image_id and corrupt future reads. Ensure this
validation is applied where vec is created and before calling _connect() /
conn.execute.

In `@backend/app/routes/search.py`:
- Around line 86-88: The exception handler in search.py currently calls
logger.exception("Semantic search failed for query %r", q) which logs raw user
input q; change this to log non-sensitive metadata instead: replace the
logger.exception call in the except block handling semantic search with
something like logger.exception("Semantic search failed for request_id=%s
query_len=%d", request_id, len(q or "")) (or include any existing request
identifier) and remove q from logs, ensuring the raise HTTPException remains
unchanged; update references around logger.exception, the except block, and the
variable q to compute/query length and attach request id if available.

In `@backend/main.py`:
- Around line 66-75: Add automated tests that verify the startup block calls
db_create_embeddings_table(), creates a SearchService instance on
app.state.search_service, and starts a background thread invoking
SearchService.index_all(); specifically, in tests (e.g., new test file using the
test client and monkeypatch/patch), assert db_create_embeddings_table was called
during app setup, assert app.state.search_service is an instance of
SearchService, and patch SearchService.index_all to confirm it was scheduled (or
invoked) by the background Thread (or assert a Thread named
"semantic-index-startup" was started); also add integration tests against the
search endpoint to confirm search responses after indexing behavior is
simulated.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8274ebd and 2759d36.

📒 Files selected for processing (11)
  • backend/app/ai/__init__.py
  • backend/app/ai/embedding_service.py
  • backend/app/ai/search_service.py
  • backend/app/ai/vector_index_manager.py
  • backend/app/config/settings.py
  • backend/app/database/embeddings.py
  • backend/app/database/faces.py
  • backend/app/database/images.py
  • backend/app/routes/search.py
  • backend/app/schemas/search.py
  • backend/main.py

Comment on lines +197 to +205
```python
def __init__(self, max_face_clusters: int = MAX_FC) -> None:
    if max_face_clusters != MAX_FC:
        logger.warning(
            "EmbeddingService initialised with max_face_clusters=%d "
            "but module constant MAX_FC=%d; EMBEDDING_DIM will be wrong.",
            max_face_clusters,
            MAX_FC,
        )
    self._max_fc = max_face_clusters
```

⚠️ Potential issue | 🟠 Major

Prevent face-cluster width mismatch from breaking fixed vector bounds

The vector layout is fixed using module-level MAX_FC. Allowing a larger runtime max_face_clusters can cause out-of-bounds writes in _fill_faces.

Proposed fix

```diff
     def __init__(self, max_face_clusters: int = MAX_FC) -> None:
-        if max_face_clusters != MAX_FC:
-            logger.warning(
-                "EmbeddingService initialised with max_face_clusters=%d "
-                "but module constant MAX_FC=%d; EMBEDDING_DIM will be wrong.",
-                max_face_clusters,
-                MAX_FC,
-            )
-        self._max_fc = max_face_clusters
+        if max_face_clusters != MAX_FC:
+            raise ValueError(
+                f"max_face_clusters must be {MAX_FC} to match fixed EMBEDDING_DIM={EMBEDDING_DIM}"
+            )
+        self._max_fc = MAX_FC
```

Also applies to: 357-359


Comment on lines +371 to +373
```python
vec[_OFF_FLAG + 1] = 1.0  # has_location
vec[_OFF_SPA] = float(lat) / 90.0  # [−90,  90] → [−1, 1]
vec[_OFF_SPA + 1] = float(lon) / 180.0  # [−180, 180] → [−1, 1]
```

⚠️ Potential issue | 🟡 Minor

Validate GPS bounds before encoding spatial features

This currently accepts any numeric latitude/longitude and can encode values outside the documented [-1, 1] range. Reject or clamp out-of-range coordinates.

Proposed fix

```diff
-            vec[_OFF_FLAG + 1] = 1.0  # has_location
-            vec[_OFF_SPA] = float(lat) / 90.0  # [−90,  90] → [−1, 1]
-            vec[_OFF_SPA + 1] = float(lon) / 180.0  # [−180, 180] → [−1, 1]
+            lat_f = float(lat)
+            lon_f = float(lon)
+            if not (-90.0 <= lat_f <= 90.0 and -180.0 <= lon_f <= 180.0):
+                return
+            vec[_OFF_FLAG + 1] = 1.0  # has_location
+            vec[_OFF_SPA] = lat_f / 90.0
+            vec[_OFF_SPA + 1] = lon_f / 180.0
```

Comment on lines +115 to +227
```python
def index_all(
    self,
    *,
    force: bool = False,
    images: Optional[List[Dict[str, Any]]] = None,
) -> int:
    """
    Index all tagged images that do not yet have a stored embedding.

    Parameters
    ----------
    force :
        If ``True``, re-embed and overwrite every image even if already
        indexed. Useful after a settings change (e.g. ``MAX_FACE_CLUSTERS``).
    images :
        Pre-fetched image records. Fetched from the DB when ``None``.

    Returns
    -------
    int
        Number of images actually (re-)indexed in this call.
    """
    if images is None:
        images = db_get_all_images(tagged=True)

    already_indexed: set[str] = set() if force else set(db_get_indexed_image_ids())

    count = 0
    for record in images:
        iid = record.get("id")
        if not iid:
            continue
        if not force and iid in already_indexed:
            continue
        try:
            self.index_image(record)
            count += 1
        except Exception:
            logger.exception("Failed to index image %s", iid)

    if count:
        logger.info("Indexed %d new image(s)", count)
    else:
        logger.debug("index_all: no new images to index")
    return count

def remove_image(self, image_id: str) -> None:
    """
    Remove an image from both the persistent embedding store and the
    in-memory index.

    Should be called whenever an image is deleted from the library so
    it no longer appears in search results.
    """
    db_delete_embedding(image_id)
    self._index.remove(image_id)
    logger.debug("Removed image %s from semantic index", image_id)

# ------------------------------------------------------------------
# Search
# ------------------------------------------------------------------

def search(
    self,
    query: str,
    top_k: int = 20,
    min_score: float = 0.05,
) -> List[SearchResult]:
    """
    Perform a semantic search across all indexed images.

    Parameters
    ----------
    query :
        Free-text query (e.g. ``"dog at the park"``, ``"evening sunset"``,
        ``"person in summer"``).
    top_k :
        Maximum number of results to return.
    min_score :
        Cosine-similarity threshold. Keeps only results with
        ``score >= min_score``.

    Returns
    -------
    list of :class:`SearchResult` ordered by descending similarity score.
    """
    query = query.strip()
    if not query:
        return []

    query_vec = self._embed.embed_query(query)
    raw = self._index.search(query_vec, top_k=top_k, min_score=min_score)
    if not raw:
        return []

    # Batch-fetch path / thumbnail / tags for the matched IDs
    id_set = {iid for iid, _ in raw}
    path_map = _build_path_map(id_set)

    results: List[SearchResult] = []
    for image_id, score in raw:
        info = path_map.get(image_id, {})
        results.append(
            SearchResult(
                image_id=image_id,
                score=round(score, 4),
                path=info.get("path", ""),
                thumbnail_path=info.get("thumbnailPath", ""),
                tags=info.get("tags") or [],
            )
        )
    return results
```


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed:

# Locate test files and directories
find . -type f -name "*test*.py" | head -20
ls -la backend/tests/
find backend/app/ai -name "*test*.py" -o -name "test_*.py"
find backend -name "*search*test*.py" -o -name "test*search*.py"

# Search for references to the new search code in tests
rg -l "SearchService|search_service" --type py
grep -r "SearchService\|search_service\|index_all\|remove_image" backend/tests/ 2>/dev/null
grep -r "/search\|semantic_search\|trigger_index\|remove_from_index" backend/tests/ 2>/dev/null
grep -r "db_create_embeddings_table" backend/tests/
grep -n "embedding" backend/tests/conftest.py -i

# Inspect the implementation and test setup
cat -n backend/tests/conftest.py
cat -n backend/app/ai/search_service.py | sed -n '115,160p'
cat -n backend/app/routes/search.py
cat -n backend/app/database/embeddings.py 2>/dev/null | head -60
grep -n "embeddings" backend/main.py -i
cat -n backend/main.py | sed -n '60,75p'

Repository: AOSSIE-Org/PictoPy


Add automated tests for critical semantic search functionality.

This introduces core behavior (index_all, remove_image, search) without accompanying automated tests. Critical gaps include:

  • Force reindex behavior (the force parameter)
  • Partial-failure propagation in batch indexing (exceptions are caught per image but not propagated)
  • Remove-then-search consistency
  • Query execution and result ranking

The test infrastructure also lacks SearchService instantiation and integration tests, despite the service being integral to the application.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/ai/search_service.py` around lines 115-227: add automated unit
tests for SearchService covering: index_all(force=False) skips already indexed
IDs and returns correct count, index_all(force=True) re-embeds all items and
overwrites existing entries, and index_all handles per-image exceptions (mock
index_image to raise for one record and assert count reflects only successful
indexes and that logger.exception was called); remove_image removes persisted
embedding and the in-memory index (mock db_delete_embedding and
self._index.remove and then run search to assert the image no longer appears);
and search returns correctly ordered SearchResult entries with trimmed query,
threshold filtering, and proper path/thumbnail/tags mapping (instantiate
SearchService, mock its _embed.embed_query, _index.search, and _build_path_map
to assert ranking, score rounding, and empty-query behavior). Use mocks for
db_get_all_images, db_get_indexed_image_ids, db_delete_embedding,
_build_path_map, and methods index_image/_index/_embed so tests are isolated and
deterministic.
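A minimal, self-contained sketch of the force/skip/failure tests the prompt describes. `StubSearchService` is a stand-in that mirrors the `index_all` logic quoted above, not the real class; a real test would patch `SearchService.index_image` and the DB helpers instead:

```python
import logging
from unittest.mock import MagicMock

logger = logging.getLogger("semantic-index")

class StubSearchService:
    """Stand-in mirroring the quoted index_all skip/force/failure logic."""

    def __init__(self, indexed_ids):
        self._indexed = set(indexed_ids)
        self.index_image = MagicMock()  # would be patched on the real class

    def index_all(self, images, *, force=False):
        already = set() if force else set(self._indexed)
        count = 0
        for record in images:
            iid = record.get("id")
            if not iid or (not force and iid in already):
                continue
            try:
                self.index_image(record)
                count += 1
            except Exception:
                logger.exception("Failed to index image %s", iid)
        return count

def test_force_reindexes_everything():
    svc = StubSearchService(indexed_ids={"a"})
    images = [{"id": "a"}, {"id": "b"}]
    assert svc.index_all(images) == 1              # "a" is skipped
    assert svc.index_all(images, force=True) == 2  # both re-embedded

def test_partial_failure_is_not_propagated():
    svc = StubSearchService(indexed_ids=set())
    svc.index_image.side_effect = [None, RuntimeError("boom"), None]
    images = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
    assert svc.index_all(images) == 2  # failure on "b" is swallowed
```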

Comment on lines +149 to +154
            try:
                self.index_image(record)
                count += 1
            except Exception:
                logger.exception("Failed to index image %s", iid)

⚠️ Potential issue | 🟠 Major

Do not silently swallow indexing failures in index_all.

Line 152 catches all exceptions and continues, so callers cannot distinguish “already indexed” from “indexing failed.” This can surface false success to /search/index.

⚙️ Proposed fix
-        count = 0
+        count = 0
+        failures = 0
         for record in images:
             iid = record.get("id")
             if not iid:
                 continue
             if not force and iid in already_indexed:
                 continue
             try:
                 self.index_image(record)
                 count += 1
             except Exception:
+                failures += 1
                 logger.exception("Failed to index image %s", iid)
 
+        if failures:
+            raise RuntimeError(f"Indexing failed for {failures} image(s)")
         if count:
             logger.info("Indexed %d new image(s)", count)
         else:
             logger.debug("index_all: no new images to index")
         return count
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/ai/search_service.py` around lines 149-154: the current
index_all loop swallows all exceptions when calling self.index_image(record),
making callers unable to tell “already indexed” from real failures; update
index_all so it does not blanket-catch Exception: either catch only the specific
expected "already indexed" exception (or whatever custom exception your
index_image raises) and continue, but for all other exceptions capture/log them
and accumulate failures to return or re-raise an aggregated error; specifically
modify the try/except around self.index_image(record) and replace the bare
except Exception with targeted exception handling (e.g., except
AlreadyIndexedError: continue) and for other exceptions append iid (or record)
and the exception to a failures list, then after the loop either raise a
combined Exception or return a result object indicating successes and failures,
while keeping the logger.exception(...) call to record details.
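One way to surface partial failures without aborting the batch, along the lines the prompt suggests (the `IndexResult` type and `index_batch` helper are hypothetical, not from the PR):

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

@dataclass
class IndexResult:
    """Aggregated outcome of a batch indexing run (hypothetical helper type)."""
    indexed: int = 0
    failures: List[Tuple[str, str]] = field(default_factory=list)  # (image_id, error)

    @property
    def ok(self) -> bool:
        return not self.failures

def index_batch(index_image: Callable[[Dict[str, Any]], None],
                images: List[Dict[str, Any]]) -> IndexResult:
    """Index every record, collecting per-image errors instead of hiding them."""
    result = IndexResult()
    for record in images:
        iid = record.get("id")
        if not iid:
            continue
        try:
            index_image(record)
            result.indexed += 1
        except Exception as exc:  # still isolate failures per image
            result.failures.append((iid, str(exc)))
    return result
```

A caller such as the /search/index route could then report `result.failures` instead of returning false success.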

Comment on lines +138 to +145
        vec_row = embedding.astype(np.float32).reshape(1, -1)

        with self._lock:
            self.ensure_loaded()
            if image_id in self._id_to_row:
                # In-place update (O(1))
                self._matrix[self._id_to_row[image_id]] = vec_row
            else:

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Scripts executed:

#!/bin/bash
python - <<'PY'
import numpy as np
m = np.zeros((2, 4), dtype=np.float32)
vec_row = np.ones((1, 4), dtype=np.float32)
try:
    m[0] = vec_row
    print("assignment_ok")
except Exception as e:
    print(type(e).__name__, str(e))
PY

cat -n backend/app/ai/vector_index_manager.py | sed -n '118,160p'
find . -type f -name "*.py" -path "*/test*" | xargs grep -l "vector_index_manager\|VectorIndexManager" 2>/dev/null | head -5

Repository: AOSSIE-Org/PictoPy


Fix shape mismatch in update path (add_or_update).

Line 144 assigns a (1, D) shaped array into a (D,) row slot, causing a shape incompatibility. When updating an existing image_id, this triggers a broadcast error that breaks the update mechanism.

Proposed fix
-        vec_row = embedding.astype(np.float32).reshape(1, -1)
+        vec = embedding.astype(np.float32).reshape(-1)
 
         with self._lock:
             self.ensure_loaded()
             if image_id in self._id_to_row:
                 # In-place update (O(1))
-                self._matrix[self._id_to_row[image_id]] = vec_row
+                self._matrix[self._id_to_row[image_id], :] = vec
             else:
                 # Append row (amortised O(1) via np.vstack)
                 row_idx = len(self._image_ids)
                 self._image_ids.append(image_id)
                 self._id_to_row[image_id] = row_idx
                 if self._matrix.shape[0] > 0:
-                    self._matrix = np.vstack([self._matrix, vec_row])
+                    self._matrix = np.vstack([self._matrix, vec[np.newaxis, :]])
                 else:
-                    self._matrix = vec_row
+                    self._matrix = vec[np.newaxis, :]
📝 Committable suggestion


Suggested change
-        vec_row = embedding.astype(np.float32).reshape(1, -1)
-
-        with self._lock:
-            self.ensure_loaded()
-            if image_id in self._id_to_row:
-                # In-place update (O(1))
-                self._matrix[self._id_to_row[image_id]] = vec_row
-            else:
+        vec = embedding.astype(np.float32).reshape(-1)
+
+        with self._lock:
+            self.ensure_loaded()
+            if image_id in self._id_to_row:
+                # In-place update (O(1))
+                self._matrix[self._id_to_row[image_id], :] = vec
+            else:
+                # Append row (amortised O(1) via np.vstack)
+                row_idx = len(self._image_ids)
+                self._image_ids.append(image_id)
+                self._id_to_row[image_id] = row_idx
+                if self._matrix.shape[0] > 0:
+                    self._matrix = np.vstack([self._matrix, vec[np.newaxis, :]])
+                else:
+                    self._matrix = vec[np.newaxis, :]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/ai/vector_index_manager.py` around lines 138-145: the update
path in add_or_update creates vec_row with shape (1, D) then assigns it into a
single row of self._matrix, causing a shape mismatch; fix by making vec_row a
1-D array (shape (D,)) before assignment. Locate the add_or_update method
(symbols: vec_row, ensure_loaded, _id_to_row, _matrix, image_id) and change the
creation/reshape of vec_row to use .reshape(-1) or index [0] so that assignments
like self._matrix[self._id_to_row[image_id]] = vec_row succeed without
broadcasting errors.
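A small standalone check of the 1-D pattern from the proposed fix (matrix sizes here are arbitrary, chosen only to demonstrate the shapes):

```python
import numpy as np

# Simulate the index matrix: 2 stored embeddings of dimension 4.
matrix = np.zeros((2, 4), dtype=np.float32)

# An embedding may arrive as (1, D); flatten it to (D,) before writing a row.
embedding = np.arange(4, dtype=np.float32).reshape(1, -1)
vec = embedding.astype(np.float32).reshape(-1)  # shape (4,)

# In-place row update with an explicit row slice.
matrix[0, :] = vec

# Append path: promote back to (1, D) for vstack.
matrix = np.vstack([matrix, vec[np.newaxis, :]])

print(matrix.shape)  # (3, 4)
```

Keeping the stored form 1-D and promoting to 2-D only at the `vstack` call makes every shape explicit at the point of use.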

Comment on lines +90 to +104
    vec = embedding.astype(np.float32)
    blob = vec.tobytes()
    conn = _connect()
    try:
        conn.execute(
            """
            INSERT INTO image_embeddings (image_id, embedding, dim, updated_at)
            VALUES (?, ?, ?, CURRENT_TIMESTAMP)
            ON CONFLICT(image_id) DO UPDATE SET
                embedding = excluded.embedding,
                dim = excluded.dim,
                updated_at = excluded.updated_at
            """,
            (image_id, blob, len(vec)),
        )

⚠️ Potential issue | 🟠 Major

Validate embedding dimensionality before persisting

len(vec) is unsafe for non-1D arrays and can store an incorrect dim, which later makes reads look corrupt. Reject invalid shapes up front.

Proposed fix
-    vec = embedding.astype(np.float32)
+    vec = np.asarray(embedding, dtype=np.float32)
+    if vec.ndim != 1:
+        raise ValueError(f"Embedding must be 1-D, got shape={vec.shape}")
+    if not np.isfinite(vec).all():
+        raise ValueError("Embedding contains non-finite values")
     blob = vec.tobytes()
@@
-            (image_id, blob, len(vec)),
+            (image_id, blob, vec.shape[0]),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/database/embeddings.py` around lines 90-104: validate that the
embedding is a 1-D array before persisting: after creating vec via
embedding.astype(np.float32) (and before computing dim or calling conn.execute
on the INSERT into image_embeddings), check vec.ndim == 1 and use vec.shape[0]
as the dim; if the shape is invalid, raise a ValueError (or otherwise abort)
rather than using len(vec) so you never store an incorrect dim for image_id and
corrupt future reads. Ensure this validation is applied where vec is created and
before calling _connect() / conn.execute.
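A hedged sketch of that up-front validation as a standalone helper (the function name is illustrative; in the PR this logic would live inline before the INSERT):

```python
import numpy as np

def validate_embedding(embedding) -> np.ndarray:
    """Return a float32 1-D copy of `embedding`, rejecting invalid shapes/values."""
    vec = np.asarray(embedding, dtype=np.float32)
    if vec.ndim != 1:
        raise ValueError(f"Embedding must be 1-D, got shape={vec.shape}")
    if vec.shape[0] == 0:
        raise ValueError("Embedding must be non-empty")
    if not np.isfinite(vec).all():
        raise ValueError("Embedding contains non-finite values")
    return vec
```

The caller would then persist `vec.tobytes()` with `vec.shape[0]` as `dim`, so the stored dimension always matches the blob length.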

Comment on lines +86 to +88
    except Exception:
        logger.exception("Semantic search failed for query %r", q)
        raise HTTPException(

⚠️ Potential issue | 🟠 Major

Avoid logging raw user queries in exception logs.

Line 87 logs q verbatim, which can capture sensitive user-entered text. Log metadata (e.g., length/request id) instead of raw query content.

🔒 Proposed fix
-    except Exception:
-        logger.exception("Semantic search failed for query %r", q)
+    except Exception:
+        logger.exception("Semantic search failed (query_len=%d)", len(q))
         raise HTTPException(
             status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
             detail="Semantic search failed.",
         )
📝 Committable suggestion


Suggested change
-    except Exception:
-        logger.exception("Semantic search failed for query %r", q)
-        raise HTTPException(
+    except Exception:
+        logger.exception("Semantic search failed (query_len=%d)", len(q))
+        raise HTTPException(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/routes/search.py` around lines 86-88: the exception handler in
search.py currently calls logger.exception("Semantic search failed for query
%r", q) which logs raw user input q; change this to log non-sensitive metadata
instead: replace the logger.exception call in the except block handling semantic
search with something like logger.exception("Semantic search failed for
request_id=%s query_len=%d", request_id, len(q or "")) (or include any existing
request identifier) and remove q from logs, ensuring the raise HTTPException
remains unchanged; update references around logger.exception, the except block,
and the variable q to compute/query length and attach request id if available.
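A minimal sketch of logging query metadata instead of content (the helper name is hypothetical; a short hash lets operators correlate repeated failures without seeing the text):

```python
import hashlib
import logging

logger = logging.getLogger("search")

def log_search_failure(q: str, request_id: str = "-") -> str:
    """Log a stable fingerprint of the query (length + short hash), never the raw text."""
    digest = hashlib.sha256(q.encode("utf-8")).hexdigest()[:8]
    msg = (
        f"Semantic search failed request_id={request_id} "
        f"query_len={len(q)} query_sha={digest}"
    )
    logger.error(msg)
    return msg
```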

Comment on lines +66 to +75
    db_create_embeddings_table()
    # Instantiate the semantic search service and attach it to app state
    app.state.search_service = SearchService()
    # Index any unindexed tagged images in a background thread so startup
    # is not blocked. Already-indexed images are skipped automatically.
    threading.Thread(
        target=app.state.search_service.index_all,
        daemon=True,
        name="semantic-index-startup",
    ).start()

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Scripts executed:

#!/bin/bash
set -euo pipefail

# Search for tests covering semantic search startup/indexing and endpoints
rg -nP --type=py -g '*test*.py' -C2 'SearchService|index_all|db_create_embeddings_table|/search/semantic|/search/index-status|/search/index|lifespan' 2>/dev/null || echo "No matches found"
find . -type f -name "*test*.py" -o -name "test_*.py" | head -30
find . -type d \( -name "*test*" -o -name "tests" \) 2>/dev/null | head -20
rg -n "class SearchService" --type=py
rg -n "SearchService|index_all" backend/tests/ --type=py
rg -n "^def test_|^async def test_" backend/tests/ --type=py
cat backend/tests/conftest.py
rg -A 20 "class SearchService" backend/app/ai/search_service.py
sed -n '60,80p' backend/main.py

Repository: AOSSIE-Org/PictoPy


Add automated test coverage for semantic search startup initialization

This startup block is critical but completely untested. The test suite's conftest.py does not create the embeddings table or initialize SearchService, and no tests exist for startup indexing behavior or search endpoints. Add tests covering:

  • Embeddings table creation during setup
  • SearchService instantiation and index_all() background thread trigger
  • Search endpoint functionality

Relates to coding guidelines: "Ensure that test code is automated, comprehensive, and follows testing best practices" and "Verify that all critical functionality is covered by tests".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/main.py` around lines 66-75: add automated tests that verify the
startup block calls db_create_embeddings_table(), creates a SearchService
instance on app.state.search_service, and starts a background thread invoking
SearchService.index_all(); specifically, in tests (e.g., new test file using the
test client and monkeypatch/patch), assert db_create_embeddings_table was called
during app setup, assert app.state.search_service is an instance of
SearchService, and patch SearchService.index_all to confirm it was scheduled (or
invoked) by the background Thread (or assert a Thread named
"semantic-index-startup" was started); also add integration tests against the
search endpoint to confirm search responses after indexing behavior is
simulated.
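A self-contained sketch of the thread-startup assertion the prompt suggests, using stubs in place of the FastAPI app and the real service (in a real test these would come from the test client and monkeypatch):

```python
import threading
from types import SimpleNamespace
from unittest.mock import MagicMock

def startup(app, search_service_factory, create_table):
    """Mirrors the quoted startup block, with injectable dependencies (stub)."""
    create_table()
    app.state.search_service = search_service_factory()
    t = threading.Thread(
        target=app.state.search_service.index_all,
        daemon=True,
        name="semantic-index-startup",
    )
    t.start()
    return t

# Test: table created, service attached, background thread ran index_all.
app = SimpleNamespace(state=SimpleNamespace())
create_table = MagicMock()
service = MagicMock()
t = startup(app, lambda: service, create_table)
t.join(timeout=2)

create_table.assert_called_once()
assert app.state.search_service is service
assert t.name == "semantic-index-startup"
service.index_all.assert_called_once()
```

Returning the thread from the startup helper (or recording it on app state) is what makes the join-then-assert pattern possible in a test.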

Successfully merging this pull request may close these issues.

Feat: AI-powered natural language search for photos

3 participants