feat(search): add offline AI-powered semantic search system #1208
Arunodoy18 wants to merge 1 commit into AOSSIE-Org:main from
Conversation
Introduces a fully offline semantic search backend with no external
API dependencies. Images are represented as fixed-dimension (163-dim)
L2-normalised float32 vectors derived from existing metadata:
YOLO object tags, time-of-day, season, day-of-week, face-cluster
presence, and GPS coordinates.
New modules
-----------
- app/ai/embedding_service.py EmbeddingService: metadata -> vector
- app/ai/vector_index_manager.py VectorIndexManager: numpy flat index
- app/ai/search_service.py SearchService: indexing + search facade
- app/ai/__init__.py package exports
- app/database/embeddings.py image_embeddings SQLite table (BLOB)
- app/routes/search.py /search/* FastAPI endpoints
- app/schemas/search.py Pydantic response models
Modified files
--------------
- app/config/settings.py AI constants (MAX_FACE_CLUSTERS, TOP_K, ...)
- app/database/images.py db_get_images_by_ids() helper
- app/database/faces.py db_get_faces_by_image_id() helper
- main.py table init, SearchService singleton,
background startup indexing, router mount
API endpoints
-------------
GET /search/semantic?q=... ranked semantic search
GET /search/index-status index health / size
POST /search/index incremental or forced re-index
DELETE /search/index/{image_id} remove one image from the index
Design properties
-----------------
- Fully offline: numpy + stdlib only, no external models or APIs
- No recomputation on startup: already-indexed images are skipped
- Persistent: embeddings stored as BLOBs in existing SQLite DB
- Scalable: 10k images ~6.5 MB RAM, O(N) cosine via matmul
- Thread-safe incremental add/remove without full index rebuild
- Zero impact on existing features
Closes AOSSIE-Org#1205
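The scalability claim above can be sanity-checked with a small numpy sketch. This is a rough illustration of a flat index, not the PR's actual `VectorIndexManager`: L2-normalised float32 rows make cosine similarity a single matmul, and 10k vectors of dim 163 occupy about 6.5 MB.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 10_000, 163

# Flat index: one L2-normalised float32 row per image.
matrix = rng.normal(size=(n, dim)).astype(np.float32)
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

query = rng.normal(size=dim).astype(np.float32)
query /= np.linalg.norm(query)

# Cosine similarity over all rows in one O(N) matmul, then top-k.
scores = matrix @ query
top5 = np.argsort(scores)[::-1][:5]

print(matrix.nbytes)  # 10_000 * 163 * 4 bytes = 6_520_000 ≈ 6.5 MB
```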
Walkthrough
-----------
A new offline semantic search subsystem is added, featuring embedding generation from image metadata, a thread-safe in-memory vector index backed by SQLite persistence, and a high-level search service orchestrating both. New API endpoints expose semantic search, index status, and index management. Integration occurs in the app startup with background indexing.
Sequence Diagram
----------------
```mermaid
sequenceDiagram
    actor User
    participant API as FastAPI Router
    participant SearchSvc as SearchService
    participant EmbedSvc as EmbeddingService
    participant VecIdx as VectorIndexManager
    participant DB as SQLite DB
    User->>API: GET /search/semantic?q=beach
    API->>SearchSvc: search("beach", top_k=20)
    SearchSvc->>EmbedSvc: embed_query("beach")
    EmbedSvc-->>SearchSvc: query_embedding [163D]
    SearchSvc->>VecIdx: ensure_loaded()
    VecIdx->>DB: db_get_all_embeddings()
    DB-->>VecIdx: [(image_id, embedding), ...]
    VecIdx-->>SearchSvc: index loaded
    SearchSvc->>VecIdx: search(query_embedding, top_k=20)
    VecIdx-->>SearchSvc: [(image_id, score), ...]
    SearchSvc->>DB: db_get_images_by_ids([...])
    DB-->>SearchSvc: [image metadata, ...]
    SearchSvc-->>API: [SearchResult(...), ...]
    API-->>User: 200 SemanticSearchResponse
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pull request overview
Adds an offline, semantic (embedding-based) search subsystem to the PictoPy backend, including persistent embedding storage in SQLite, an in-memory vector index, and FastAPI endpoints for querying and managing the index.
Changes:
- Introduces EmbeddingService, VectorIndexManager, and SearchService to generate/query embeddings and manage an in-memory index backed by SQLite.
- Adds new /search/* API routes and Pydantic schemas for semantic search and index management.
- Extends the database layer with an image_embeddings table plus helper queries for enriching results (image info + face clusters).
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| backend/main.py | Initializes embeddings table + search service at startup and starts background indexing thread; registers /search router. |
| backend/app/schemas/search.py | Adds response models for semantic search and index operations. |
| backend/app/routes/search.py | Adds /search/semantic, /search/index-status, /search/index, and /search/index/{image_id} endpoints. |
| backend/app/database/images.py | Adds db_get_images_by_ids() to enrich search results with path/thumbnail/tags. |
| backend/app/database/faces.py | Adds db_get_faces_by_image_id() for face-cluster enrichment during embedding. |
| backend/app/database/embeddings.py | Adds SQLite persistence layer for per-image embeddings (CRUD + bulk load). |
| backend/app/config/settings.py | Adds AI/search-related settings (paths, defaults, thresholds, face-cluster cap). |
| backend/app/ai/vector_index_manager.py | Implements a thread-safe in-memory flat index + cosine similarity search. |
| backend/app/ai/search_service.py | Orchestrates embedding generation, persistence, indexing, and result enrichment. |
| backend/app/ai/embedding_service.py | Implements deterministic, offline embedding construction for images + queries. |
| backend/app/ai/__init__.py | Exposes the AI subsystem components as a package. |
```python
with self._lock:
    self.ensure_loaded()
    if image_id in self._id_to_row:
        # In-place update (O(1))
        self._matrix[self._id_to_row[image_id]] = vec_row
    else:
```
VectorIndexManager is documented as lazily loaded on first query, but add_or_update() always calls ensure_loaded(), which forces a full DB load the first time any embedding is added/updated (e.g., during startup indexing). If you want to preserve lazy loading, consider only updating the in-memory matrix when the index is already loaded, or buffering updates until the first explicit load/search.
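One way to preserve lazy loading, sketched below with hypothetical names (the real `VectorIndexManager` API may differ): buffer writes while the index is unloaded, then merge them into the in-memory state on the first explicit load.

```python
import threading
import numpy as np

class LazyIndex:
    """Minimal sketch of deferred loading; class and attribute names are illustrative."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._loaded = False
        self._pending: dict[str, np.ndarray] = {}   # writes buffered pre-load
        self._vectors: dict[str, np.ndarray] = {}   # live in-memory index

    def add_or_update(self, image_id: str, vec: np.ndarray) -> None:
        with self._lock:
            # Route the write to the buffer while unloaded: no DB load triggered.
            target = self._vectors if self._loaded else self._pending
            target[image_id] = vec

    def ensure_loaded(self) -> None:
        with self._lock:
            if self._loaded:
                return
            # ... bulk-load persisted embeddings from SQLite here ...
            self._loaded = True
            self._vectors.update(self._pending)  # merge buffered writes last
            self._pending.clear()

idx = LazyIndex()
idx.add_or_update("img1", np.ones(3, dtype=np.float32))
print(idx._loaded)             # False — adding did not force a full DB load
idx.ensure_loaded()
print("img1" in idx._vectors)  # True — buffered write merged on first load
```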
```python
# Instantiate the semantic search service and attach it to app state
app.state.search_service = SearchService()
# Index any unindexed tagged images in a background thread so startup
# is not blocked. Already-indexed images are skipped automatically.
threading.Thread(
    target=app.state.search_service.index_all,
    daemon=True,
    name="semantic-index-startup",
).start()
```
The startup indexing runs in a daemon thread. Because shutdown can terminate the process abruptly (e.g. os._exit(0) on Windows), this thread may be killed mid-run, leaving indexing partially complete. Consider tracking the thread/task and allowing a graceful stop/join on shutdown, or making indexing explicitly triggerable only via the API to avoid background work during startup/shutdown cycles.
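A hedged sketch of the graceful-stop suggestion (the `stop` event and function name are illustrative, not from the PR): check a flag between images so shutdown can set it and join the thread instead of killing a daemon mid-run.

```python
import threading

def index_all_interruptible(records, index_one, stop: threading.Event) -> int:
    """Index records one by one, stopping cleanly if `stop` is set."""
    done = 0
    for rec in records:
        if stop.is_set():
            break  # leave remaining work for the next startup pass
        index_one(rec)
        done += 1
    return done

stop = threading.Event()
indexed = []
t = threading.Thread(
    target=index_all_interruptible,
    args=(["a", "b", "c"], indexed.append, stop),
    name="semantic-index-startup",
)
t.start()
t.join()        # on shutdown: stop.set() then t.join(), instead of os._exit()
print(indexed)  # ['a', 'b', 'c']
```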
```python
def index_status(request: Request) -> IndexStatusResponse:
    service = _get_service(request)
    return IndexStatusResponse(
        success=True,
        indexed_count=service.index_size,
        is_loaded=service.is_index_loaded,
    )
```
indexed_count is derived from service.index_size, which reports the in-memory index size. When the index is still lazily unloaded, this will return 0 even if embeddings exist on disk, making /search/index-status misleading. Consider either forcing a load for this endpoint or returning a persisted count from the embeddings table (and optionally reporting both persisted vs in-memory counts).
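A minimal sketch of the "persisted count" alternative, assuming the `image_embeddings` SQLite table described elsewhere in this PR (the helper name is hypothetical):

```python
import sqlite3

def persisted_count(conn: sqlite3.Connection) -> int:
    """Count embeddings on disk, independent of the in-memory index state."""
    (n,) = conn.execute("SELECT COUNT(*) FROM image_embeddings").fetchone()
    return n

# Tiny in-memory demo with the schema implied by the PR description.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image_embeddings (image_id TEXT PRIMARY KEY, embedding BLOB, dim INTEGER)"
)
conn.execute("INSERT INTO image_embeddings VALUES ('img1', ?, 163)", (b"\x00",))
print(persisted_count(conn))  # 1, even if the in-memory index is still unloaded
```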
```python
conn.execute(
    """
    CREATE INDEX IF NOT EXISTS ix_image_embeddings_image_id
    ON image_embeddings(image_id)
    """
)
```
image_id is already the PRIMARY KEY of image_embeddings, so SQLite will already maintain an index for it. The additional CREATE INDEX ... ON image_embeddings(image_id) is redundant and can be removed to reduce schema clutter.
```python
if images is None:
    images = db_get_all_images(tagged=True)

already_indexed: set[str] = set() if force else set(db_get_indexed_image_ids())

count = 0
for record in images:
    iid = record.get("id")
    if not iid:
        continue
    if not force and iid in already_indexed:
        continue
```
index_all(force=False) skips any image_id already present in image_embeddings, but VectorIndexManager will ignore stored embeddings whose dim doesn't match the current EMBEDDING_DIM (e.g., after changing MAX_FACE_CLUSTERS in a new release). In that scenario, the in-memory index will be empty and index_all still won’t rebuild those embeddings unless the user explicitly passes force=true. Consider treating rows with a mismatched dim as not indexed (e.g., query only ids where dim == EMBEDDING_DIM, or add a DB helper to return ids by dim).
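The suggested dim-aware helper could look like this sketch (helper name and schema details assumed from the PR description, not verified against the actual code):

```python
import sqlite3

EMBEDDING_DIM = 163  # assumed current dimension

def db_get_indexed_image_ids_for_dim(conn: sqlite3.Connection, dim: int) -> set:
    """Treat only rows whose stored dim matches as 'already indexed'."""
    rows = conn.execute(
        "SELECT image_id FROM image_embeddings WHERE dim = ?", (dim,)
    ).fetchall()
    return {r[0] for r in rows}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image_embeddings (image_id TEXT PRIMARY KEY, embedding BLOB, dim INTEGER)"
)
conn.executemany(
    "INSERT INTO image_embeddings VALUES (?, ?, ?)",
    [("old", b"", 150), ("new", b"", 163)],
)
# 'old' has a stale dim, so index_all() would re-embed it without force=True.
print(db_get_indexed_image_ids_for_dim(conn, EMBEDDING_DIM))  # {'new'}
```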
```python
vec_row = embedding.astype(np.float32).reshape(1, -1)

with self._lock:
    self.ensure_loaded()
    if image_id in self._id_to_row:
        # In-place update (O(1))
        self._matrix[self._id_to_row[image_id]] = vec_row
    else:
        # Append row (amortised O(1) via np.vstack)
        row_idx = len(self._image_ids)
        self._image_ids.append(image_id)
        self._id_to_row[image_id] = row_idx
        if self._matrix.shape[0] > 0:
            self._matrix = np.vstack([self._matrix, vec_row])
        else:
            self._matrix = vec_row
```
In add_or_update, vec_row is reshaped to (1, dim) but the in-place update assigns it into a 1-D row (self._matrix[row]), which will raise a NumPy broadcasting error when updating an existing embedding. Use a 1-D vector for assignment (e.g., avoid the extra leading dimension for the update path).
```python
@router.get(
    "/semantic",
    response_model=SemanticSearchResponse,
    summary="Semantic image search",
    description=(
        "Search for images using a natural-language query. The query is mapped "
        "onto the same fixed-dimension embedding space used to index image "
        "metadata (object classes, date/time, season, faces, GPS) and the "
        "most similar images are returned ranked by cosine similarity."
    ),
)
def semantic_search(
    request: Request,
    q: str = Query(..., min_length=1, description="Free-text search query."),
    limit: int = Query(
        default=SEARCH_DEFAULT_TOP_K,
        ge=1,
        le=200,
        description="Maximum number of results to return.",
    ),
    min_score: float = Query(
        default=SEARCH_MIN_SCORE,
        ge=0.0,
        le=1.0,
        description="Minimum cosine-similarity score threshold.",
    ),
) -> SemanticSearchResponse:
    service = _get_service(request)
```
This PR adds new search/indexing endpoints and core ranking logic, but there are no corresponding tests under backend/tests/ (the repo already has substantial FastAPI/DB test coverage). Adding tests for /search/semantic, /search/index, and the indexing/search flow (including score thresholding and empty-index behavior) would help prevent regressions.
```python
results: List[SearchResult] = []
for image_id, score in raw:
    info = path_map.get(image_id, {})
    results.append(
        SearchResult(
            image_id=image_id,
            score=round(score, 4),
            path=info.get("path", ""),
            thumbnail_path=info.get("thumbnailPath", ""),
            tags=info.get("tags") or [],
        )
```
SearchResultItem.score is constrained to [0, 1] in the response schema, but cosine similarity computed in float32 can drift slightly outside this range (e.g., 1.0000x) due to numerical error, which can cause response-model validation failures. Consider clamping scores into [0.0, 1.0] (and optionally also clamping at 0.0) before returning them.
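A one-line clamp along the suggested lines (illustrative; the function name is not from the PR):

```python
def clamp_score(score: float) -> float:
    """Clamp a float32 cosine similarity into the schema's [0, 1] range."""
    return min(max(float(score), 0.0), 1.0)

print(clamp_score(1.0000001))  # 1.0 — float32 drift above 1 is clipped
print(clamp_score(-3e-8))      # 0.0 — tiny negative noise is clipped
```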
```python
elif len(tok) >= 4:
    # Partial substring match only for sufficiently long tokens
    # (avoids "at" → "cat", "park" → "parking meter" false positives).
    # Rule: the token must be a prefix, suffix, or the class name must
    # start with the token (prevents mid-word collisions like "ark").
    for name, idx in _OBJ_INDEX.items():
        if name.startswith(tok) or tok.startswith(name):
            matched.add(idx)
i += 1
```
The comment describing the partial-match rule mentions allowing prefix/suffix matches, but the implementation only checks name.startswith(tok) and tok.startswith(name) (no suffix matching). Either update the comment to reflect the actual rule or adjust the matching logic so documentation and behavior stay aligned.
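The mismatch can be seen with a toy reconstruction of the rule (the class names here are illustrative, not the PR's actual `_OBJ_INDEX`):

```python
_OBJ_INDEX = {"cat": 0, "parking meter": 1, "park bench": 2}

def matched_names(tok: str) -> set:
    """Reconstruction of the implemented rule: two prefix checks, no suffix check."""
    if len(tok) < 4:
        return set()
    return {
        name for name in _OBJ_INDEX
        if name.startswith(tok) or tok.startswith(name)
    }

print(sorted(matched_names("park")))   # ['park bench', 'parking meter'] — prefix works
print(sorted(matched_names("meter")))  # [] — the documented suffix match never fires
```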
Actionable comments posted: 8
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/ai/embedding_service.py`:
- Around line 197-205: The constructor EmbeddingService.__init__ currently
allows max_face_clusters > module MAX_FC which can cause out-of-bounds writes in
_fill_faces; clamp the runtime value by setting self._max_fc =
min(max_face_clusters, MAX_FC) (and keep the existing warning but reflect the
clamping), and update _fill_faces (and the other location referenced around
lines 357-359) to always use min(self._max_fc, MAX_FC) or the clamped
self._max_fc when computing per-face offsets/indices so no code ever assumes
more than MAX_FC face clusters.
- Around line 371-373: The spatial encoding currently normalizes lat/lon without
bounds checking; update the logic in embedding_service.py (around the vec,
_OFF_FLAG and _OFF_SPA usage) to validate GPS values: clamp latitude to [-90,
90] and longitude to [-180, 180] (or raise a ValueError if you prefer strict
validation) before computing vec[_OFF_SPA] = float(lat)/90.0 and vec[_OFF_SPA +
1] = float(lon)/180.0; ensure vec[_OFF_FLAG + 1] (has_location) is only set when
a (possibly clamped) numeric lat/lon is present and convert inputs to float
first.
In `@backend/app/ai/search_service.py`:
- Around line 149-154: The current index_all loop swallows all exceptions when
calling self.index_image(record), making callers unable to tell “already
indexed” from real failures; update index_all so it does not blanket-catch
Exception: either catch only the specific expected "already indexed" exception
(or whatever custom exception your index_image raises) and continue, but for all
other exceptions capture/log them and accumulate failures to return or re-raise
an aggregated error; specifically modify the try/except around
self.index_image(record) and replace the bare except Exception with targeted
exception handling (e.g., except AlreadyIndexedError: continue) and for other
exceptions append iid (or record) and the exception to a failures list, then
after the loop either raise a combined Exception or return a result object
indicating successes and failures, while keeping the logger.exception(...) call
to record details.
- Around line 115-227: Add automated unit tests for SearchService covering:
index_all(force=False) skips already indexed IDs and returns correct count,
index_all(force=True) re-embeds all items and overwrites existing entries, and
index_all handles per-image exceptions (mock index_image to raise for one record
and assert count reflects only successful indexes and that logger.exception was
called); remove_image removes persisted embedding and the in-memory index (mock
db_delete_embedding and self._index.remove and then run search to assert the
image no longer appears); and search returns correctly ordered SearchResult
entries with trimmed query, threshold filtering, and proper path/thumbnail/tags
mapping (instantiate SearchService, mock its _embed.embed_query, _index.search,
and _build_path_map to assert ranking, score rounding, and empty-query
behavior). Use mocks for db_get_all_images, db_get_indexed_image_ids,
db_delete_embedding, _build_path_map, and methods index_image/_index/_embed so
tests are isolated and deterministic.
In `@backend/app/ai/vector_index_manager.py`:
- Around line 138-145: The update path in add_or_update creates vec_row with
shape (1, D) then assigns it into a single row of self._matrix, causing a shape
mismatch; fix by making vec_row a 1-D array (shape (D,)) before assignment.
Locate the add_or_update method (symbols: vec_row, ensure_loaded, _id_to_row,
_matrix, image_id) and change the creation/reshape of vec_row to use
.reshape(-1) or index [0] so that assignments like
self._matrix[self._id_to_row[image_id]] = vec_row succeed without broadcasting
errors.
In `@backend/app/database/embeddings.py`:
- Around line 90-104: Validate that the embedding is a 1-D array before
persisting: after creating vec via embedding.astype(np.float32) (and before
computing dim or calling conn.execute on the INSERT into image_embeddings),
check vec.ndim == 1 and use vec.shape[0] as the dim; if the shape is invalid,
raise a ValueError (or otherwise abort) rather than using len(vec) so you never
store an incorrect dim for image_id and corrupt future reads. Ensure this
validation is applied where vec is created and before calling _connect() /
conn.execute.
In `@backend/app/routes/search.py`:
- Around line 86-88: The exception handler in search.py currently calls
logger.exception("Semantic search failed for query %r", q) which logs raw user
input q; change this to log non-sensitive metadata instead: replace the
logger.exception call in the except block handling semantic search with
something like logger.exception("Semantic search failed for request_id=%s
query_len=%d", request_id, len(q or "")) (or include any existing request
identifier) and remove q from logs, ensuring the raise HTTPException remains
unchanged; update references around logger.exception, the except block, and the
variable q to compute/query length and attach request id if available.
In `@backend/main.py`:
- Around line 66-75: Add automated tests that verify the startup block calls
db_create_embeddings_table(), creates a SearchService instance on
app.state.search_service, and starts a background thread invoking
SearchService.index_all(); specifically, in tests (e.g., new test file using the
test client and monkeypatch/patch), assert db_create_embeddings_table was called
during app setup, assert app.state.search_service is an instance of
SearchService, and patch SearchService.index_all to confirm it was scheduled (or
invoked) by the background Thread (or assert a Thread named
"semantic-index-startup" was started); also add integration tests against the
search endpoint to confirm search responses after indexing behavior is
simulated.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- backend/app/ai/__init__.py
- backend/app/ai/embedding_service.py
- backend/app/ai/search_service.py
- backend/app/ai/vector_index_manager.py
- backend/app/config/settings.py
- backend/app/database/embeddings.py
- backend/app/database/faces.py
- backend/app/database/images.py
- backend/app/routes/search.py
- backend/app/schemas/search.py
- backend/main.py
```python
def __init__(self, max_face_clusters: int = MAX_FC) -> None:
    if max_face_clusters != MAX_FC:
        logger.warning(
            "EmbeddingService initialised with max_face_clusters=%d "
            "but module constant MAX_FC=%d; EMBEDDING_DIM will be wrong.",
            max_face_clusters,
            MAX_FC,
        )
    self._max_fc = max_face_clusters
```
Prevent face-cluster width mismatch from breaking fixed vector bounds
The vector layout is fixed using module-level MAX_FC. Allowing a larger runtime max_face_clusters can cause out-of-bounds writes in _fill_faces.
Proposed fix
```diff
 def __init__(self, max_face_clusters: int = MAX_FC) -> None:
-    if max_face_clusters != MAX_FC:
-        logger.warning(
-            "EmbeddingService initialised with max_face_clusters=%d "
-            "but module constant MAX_FC=%d; EMBEDDING_DIM will be wrong.",
-            max_face_clusters,
-            MAX_FC,
-        )
-    self._max_fc = max_face_clusters
+    if max_face_clusters != MAX_FC:
+        raise ValueError(
+            f"max_face_clusters must be {MAX_FC} to match fixed EMBEDDING_DIM={EMBEDDING_DIM}"
+        )
+    self._max_fc = MAX_FC
```
Also applies to: 357-359
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/ai/embedding_service.py` around lines 197 - 205, The constructor
EmbeddingService.__init__ currently allows max_face_clusters > module MAX_FC
which can cause out-of-bounds writes in _fill_faces; clamp the runtime value by
setting self._max_fc = min(max_face_clusters, MAX_FC) (and keep the existing
warning but reflect the clamping), and update _fill_faces (and the other
location referenced around lines 357-359) to always use min(self._max_fc,
MAX_FC) or the clamped self._max_fc when computing per-face offsets/indices so
no code ever assumes more than MAX_FC face clusters.
```python
vec[_OFF_FLAG + 1] = 1.0                # has_location
vec[_OFF_SPA] = float(lat) / 90.0       # [−90, 90] → [−1, 1]
vec[_OFF_SPA + 1] = float(lon) / 180.0  # [−180, 180] → [−1, 1]
```
Validate GPS bounds before encoding spatial features
This currently accepts any numeric latitude/longitude and can encode values outside the documented [-1, 1] range. Reject or clamp out-of-range coordinates.
Proposed fix
```diff
-vec[_OFF_FLAG + 1] = 1.0  # has_location
-vec[_OFF_SPA] = float(lat) / 90.0  # [−90, 90] → [−1, 1]
-vec[_OFF_SPA + 1] = float(lon) / 180.0  # [−180, 180] → [−1, 1]
+lat_f = float(lat)
+lon_f = float(lon)
+if not (-90.0 <= lat_f <= 90.0 and -180.0 <= lon_f <= 180.0):
+    return
+vec[_OFF_FLAG + 1] = 1.0  # has_location
+vec[_OFF_SPA] = lat_f / 90.0
+vec[_OFF_SPA + 1] = lon_f / 180.0
```
+ vec[_OFF_SPA + 1] = lon_f / 180.0📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| vec[_OFF_FLAG + 1] = 1.0 # has_location | |
| vec[_OFF_SPA] = float(lat) / 90.0 # [−90, 90] → [−1, 1] | |
| vec[_OFF_SPA + 1] = float(lon) / 180.0 # [−180, 180] → [−1, 1] | |
| lat_f = float(lat) | |
| lon_f = float(lon) | |
| if not (-90.0 <= lat_f <= 90.0 and -180.0 <= lon_f <= 180.0): | |
| return | |
| vec[_OFF_FLAG + 1] = 1.0 # has_location | |
| vec[_OFF_SPA] = lat_f / 90.0 | |
| vec[_OFF_SPA + 1] = lon_f / 180.0 |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/ai/embedding_service.py` around lines 371 - 373, The spatial
encoding currently normalizes lat/lon without bounds checking; update the logic
in embedding_service.py (around the vec, _OFF_FLAG and _OFF_SPA usage) to
validate GPS values: clamp latitude to [-90, 90] and longitude to [-180, 180]
(or raise a ValueError if you prefer strict validation) before computing
vec[_OFF_SPA] = float(lat)/90.0 and vec[_OFF_SPA + 1] = float(lon)/180.0; ensure
vec[_OFF_FLAG + 1] (has_location) is only set when a (possibly clamped) numeric
lat/lon is present and convert inputs to float first.
```python
def index_all(
    self,
    *,
    force: bool = False,
    images: Optional[List[Dict[str, Any]]] = None,
) -> int:
    """
    Index all tagged images that do not yet have a stored embedding.

    Parameters
    ----------
    force :
        If ``True``, re-embed and overwrite every image even if already
        indexed. Useful after a settings change (e.g. ``MAX_FACE_CLUSTERS``).
    images :
        Pre-fetched image records. Fetched from the DB when ``None``.

    Returns
    -------
    int
        Number of images actually (re-)indexed in this call.
    """
    if images is None:
        images = db_get_all_images(tagged=True)

    already_indexed: set[str] = set() if force else set(db_get_indexed_image_ids())

    count = 0
    for record in images:
        iid = record.get("id")
        if not iid:
            continue
        if not force and iid in already_indexed:
            continue
        try:
            self.index_image(record)
            count += 1
        except Exception:
            logger.exception("Failed to index image %s", iid)

    if count:
        logger.info("Indexed %d new image(s)", count)
    else:
        logger.debug("index_all: no new images to index")
    return count

def remove_image(self, image_id: str) -> None:
    """
    Remove an image from both the persistent embedding store and the
    in-memory index.

    Should be called whenever an image is deleted from the library so
    it no longer appears in search results.
    """
    db_delete_embedding(image_id)
    self._index.remove(image_id)
    logger.debug("Removed image %s from semantic index", image_id)

# ------------------------------------------------------------------
# Search
# ------------------------------------------------------------------

def search(
    self,
    query: str,
    top_k: int = 20,
    min_score: float = 0.05,
) -> List[SearchResult]:
    """
    Perform a semantic search across all indexed images.

    Parameters
    ----------
    query :
        Free-text query (e.g. ``"dog at the park"``, ``"evening sunset"``,
        ``"person in summer"``).
    top_k :
        Maximum number of results to return.
    min_score :
        Cosine-similarity threshold. Keeps only results with
        ``score >= min_score``.

    Returns
    -------
    list of :class:`SearchResult` ordered by descending similarity score.
    """
    query = query.strip()
    if not query:
        return []

    query_vec = self._embed.embed_query(query)
    raw = self._index.search(query_vec, top_k=top_k, min_score=min_score)
    if not raw:
        return []

    # Batch-fetch path / thumbnail / tags for the matched IDs
    id_set = {iid for iid, _ in raw}
    path_map = _build_path_map(id_set)

    results: List[SearchResult] = []
    for image_id, score in raw:
        info = path_map.get(image_id, {})
        results.append(
            SearchResult(
                image_id=image_id,
                score=round(score, 4),
                path=info.get("path", ""),
                thumbnail_path=info.get("thumbnailPath", ""),
                tags=info.get("tags") or [],
            )
        )
    return results
```
Add automated tests for critical semantic search functionality.
This introduces core behavior (index_all, remove_image, search) without accompanying automated tests. Critical gaps include:
- Force reindex behavior (the force parameter)
- Partial-failure propagation in batch indexing (exceptions are caught per image but not propagated)
- Remove-then-search consistency
- Query execution and result ranking
The test infrastructure also lacks SearchService instantiation and integration tests, despite the service being integral to the application.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/ai/search_service.py` around lines 115 - 227, Add automated unit
tests for SearchService covering: index_all(force=False) skips already indexed
IDs and returns correct count, index_all(force=True) re-embeds all items and
overwrites existing entries, and index_all handles per-image exceptions (mock
index_image to raise for one record and assert count reflects only successful
indexes and that logger.exception was called); remove_image removes persisted
embedding and the in-memory index (mock db_delete_embedding and
self._index.remove and then run search to assert the image no longer appears);
and search returns correctly ordered SearchResult entries with trimmed query,
threshold filtering, and proper path/thumbnail/tags mapping (instantiate
SearchService, mock its _embed.embed_query, _index.search, and _build_path_map
to assert ranking, score rounding, and empty-query behavior). Use mocks for
db_get_all_images, db_get_indexed_image_ids, db_delete_embedding,
_build_path_map, and methods index_image/_index/_embed so tests are isolated and
deterministic.
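A self-contained pytest-style sketch of part of the coverage the prompt asks for. `FakeSearchService` is a hypothetical stand-in that mirrors the `index_all` contract described above; real tests would import `SearchService` from `backend/app/ai/search_service.py` and patch its collaborators instead.

```python
from unittest.mock import MagicMock

# Hypothetical stand-in mirroring index_all: skip already-indexed IDs,
# count successes, log (but do not propagate) per-image exceptions.
class FakeSearchService:
    def __init__(self, images, already_indexed, index_image, logger):
        self._images = images
        self._already = already_indexed
        self.index_image = index_image
        self._logger = logger

    def index_all(self, force=False):
        count = 0
        for record in self._images:
            iid = record.get("id")
            if not iid:
                continue
            if not force and iid in self._already:
                continue
            try:
                self.index_image(record)
                count += 1
            except Exception:
                self._logger.exception("Failed to index image %s", iid)
        return count

def test_index_all_skips_indexed_and_counts_failures():
    images = [{"id": 1}, {"id": 2}, {"id": 3}]
    # id=1 is skipped, id=2 succeeds, id=3 raises and is only logged
    index_image = MagicMock(side_effect=[None, RuntimeError("boom")])
    logger = MagicMock()
    svc = FakeSearchService(images, already_indexed={1},
                            index_image=index_image, logger=logger)
    assert svc.index_all() == 1
    assert logger.exception.call_count == 1

def test_force_reindexes_everything():
    images = [{"id": 1}, {"id": 2}]
    svc = FakeSearchService(images, already_indexed={1, 2},
                            index_image=MagicMock(), logger=MagicMock())
    assert svc.index_all(force=True) == 2
```

The same structure extends naturally to remove-then-search consistency by mocking `db_delete_embedding` and the index's `remove` method.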
try:
    self.index_image(record)
    count += 1
except Exception:
    logger.exception("Failed to index image %s", iid)
Do not silently swallow indexing failures in index_all.
Line 152 catches all exceptions and continues, so callers cannot distinguish “already indexed” from “indexing failed.” This can surface false success to /search/index.
⚙️ Proposed fix
  count = 0
+ failures = 0
  for record in images:
      iid = record.get("id")
      if not iid:
          continue
      if not force and iid in already_indexed:
          continue
      try:
          self.index_image(record)
          count += 1
      except Exception:
+         failures += 1
          logger.exception("Failed to index image %s", iid)
+ if failures:
+     raise RuntimeError(f"Indexing failed for {failures} image(s)")
  if count:
      logger.info("Indexed %d new image(s)", count)
  else:
      logger.debug("index_all: no new images to index")
  return count
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/ai/search_service.py` around lines 149 - 154, The current
index_all loop swallows all exceptions when calling self.index_image(record),
making callers unable to tell “already indexed” from real failures; update
index_all so it does not blanket-catch Exception: either catch only the specific
expected "already indexed" exception (or whatever custom exception your
index_image raises) and continue, but for all other exceptions capture/log them
and accumulate failures to return or re-raise an aggregated error; specifically
modify the try/except around self.index_image(record) and replace the bare
except Exception with targeted exception handling (e.g., except
AlreadyIndexedError: continue) and for other exceptions append iid (or record)
and the exception to a failures list, then after the loop either raise a
combined Exception or return a result object indicating successes and failures,
while keeping the logger.exception(...) call to record details.
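A minimal sketch of the "result object" alternative the prompt suggests: rather than raising on the first failure, `index_all` reports successes and failures so `/search/index` can surface partial errors. `IndexReport` and the free-function form are hypothetical names for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical report object: callers can distinguish "nothing to do"
# from "some images failed to index" without parsing logs.
@dataclass
class IndexReport:
    indexed: int = 0
    failed: list = field(default_factory=list)  # (image_id, error message)

    @property
    def ok(self) -> bool:
        return not self.failed

def index_all(images, index_image, already_indexed=frozenset(),
              force=False) -> IndexReport:
    report = IndexReport()
    for record in images:
        iid = record.get("id")
        if not iid or (not force and iid in already_indexed):
            continue
        try:
            index_image(record)
            report.indexed += 1
        except Exception as exc:  # still logged in the real code
            report.failed.append((iid, str(exc)))
    return report
```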
vec_row = embedding.astype(np.float32).reshape(1, -1)

with self._lock:
    self.ensure_loaded()
    if image_id in self._id_to_row:
        # In-place update (O(1))
        self._matrix[self._id_to_row[image_id]] = vec_row
    else:
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
python - <<'PY'
import numpy as np
m = np.zeros((2, 4), dtype=np.float32)
vec_row = np.ones((1, 4), dtype=np.float32)
try:
    m[0] = vec_row
    print("assignment_ok")
except Exception as e:
    print(type(e).__name__, str(e))
PY
Length of output: 179
🏁 Script executed:
head -160 backend/app/ai/vector_index_manager.py | tail -50
Length of output: 1849
🏁 Script executed:
cat -n backend/app/ai/vector_index_manager.py | sed -n '118,160p'
Length of output: 1943
🏁 Script executed:
find . -type f -name "*.py" -path "*/test*" | xargs grep -l "vector_index_manager\|VectorIndexManager" 2>/dev/null | head -5
Length of output: 44
Normalize the vector shape in the update path (add_or_update).
Line 144 assigns a (1, D) shaped array into a (D,) row slot. NumPy broadcasting accepts this (the verification script prints "assignment_ok"), so the update does not actually crash, but mixing (1, D) and (D,) shapes across the update and append paths is fragile and worth making explicit.
Proposed fix
- vec_row = embedding.astype(np.float32).reshape(1, -1)
+ vec = embedding.astype(np.float32).reshape(-1)
  with self._lock:
      self.ensure_loaded()
      if image_id in self._id_to_row:
          # In-place update (O(1))
-         self._matrix[self._id_to_row[image_id]] = vec_row
+         self._matrix[self._id_to_row[image_id], :] = vec
      else:
          # Append row (amortised O(1) via np.vstack)
          row_idx = len(self._image_ids)
          self._image_ids.append(image_id)
          self._id_to_row[image_id] = row_idx
          if self._matrix.shape[0] > 0:
-             self._matrix = np.vstack([self._matrix, vec_row])
+             self._matrix = np.vstack([self._matrix, vec[np.newaxis, :]])
          else:
-             self._matrix = vec_row
+             self._matrix = vec[np.newaxis, :]
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
vec = embedding.astype(np.float32).reshape(-1)
with self._lock:
    self.ensure_loaded()
    if image_id in self._id_to_row:
        # In-place update (O(1))
        self._matrix[self._id_to_row[image_id], :] = vec
    else:
        # Append row (amortised O(1) via np.vstack)
        row_idx = len(self._image_ids)
        self._image_ids.append(image_id)
        self._id_to_row[image_id] = row_idx
        if self._matrix.shape[0] > 0:
            self._matrix = np.vstack([self._matrix, vec[np.newaxis, :]])
        else:
            self._matrix = vec[np.newaxis, :]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/ai/vector_index_manager.py` around lines 138 - 145, The update
path in add_or_update creates vec_row with shape (1, D) then assigns it into a
single row of self._matrix, causing a shape mismatch; fix by making vec_row a
1-D array (shape (D,)) before assignment. Locate the add_or_update method
(symbols: vec_row, ensure_loaded, _id_to_row, _matrix, image_id) and change the
creation/reshape of vec_row to use .reshape(-1) or index [0] so that assignments
like self._matrix[self._id_to_row[image_id]] = vec_row succeed without
broadcasting errors.
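The behaviour the verification script checks can be reproduced locally. This sketch shows that NumPy broadcasting accepts a `(1, D)` row assignment without error, and how the flattened `(D,)` form keeps the update and append paths shape-consistent (array names here are illustrative):

```python
import numpy as np

# Assigning a (1, D) array into a (D,) row slot broadcasts cleanly,
# so the reshape(-1) change is a consistency cleanup, not a crash fix.
m = np.zeros((2, 4), dtype=np.float32)
vec_row = np.ones((1, 4), dtype=np.float32)
m[0] = vec_row                                # broadcasts (1, 4) -> (4,)
assert m[0].tolist() == [1.0, 1.0, 1.0, 1.0]

# The flattened form keeps update and append uniform:
vec = vec_row.reshape(-1)                     # shape (4,)
m[1, :] = vec                                 # in-place row update
stacked = np.vstack([m, vec[np.newaxis, :]])  # append path -> (3, 4)
assert stacked.shape == (3, 4)
```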
vec = embedding.astype(np.float32)
blob = vec.tobytes()
conn = _connect()
try:
    conn.execute(
        """
        INSERT INTO image_embeddings (image_id, embedding, dim, updated_at)
        VALUES (?, ?, ?, CURRENT_TIMESTAMP)
        ON CONFLICT(image_id) DO UPDATE SET
            embedding = excluded.embedding,
            dim = excluded.dim,
            updated_at = excluded.updated_at
        """,
        (image_id, blob, len(vec)),
    )
Validate embedding dimensionality before persisting
len(vec) is unsafe for non-1D arrays and can store an incorrect dim, which later makes reads look corrupt. Reject invalid shapes up front.
Proposed fix
- vec = embedding.astype(np.float32)
+ vec = np.asarray(embedding, dtype=np.float32)
+ if vec.ndim != 1:
+     raise ValueError(f"Embedding must be 1-D, got shape={vec.shape}")
+ if not np.isfinite(vec).all():
+     raise ValueError("Embedding contains non-finite values")
  blob = vec.tobytes()
@@
- (image_id, blob, len(vec)),
+ (image_id, blob, vec.shape[0]),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/database/embeddings.py` around lines 90 - 104, Validate that the
embedding is a 1-D array before persisting: after creating vec via
embedding.astype(np.float32) (and before computing dim or calling conn.execute
on the INSERT into image_embeddings), check vec.ndim == 1 and use vec.shape[0]
as the dim; if the shape is invalid, raise a ValueError (or otherwise abort)
rather than using len(vec) so you never store an incorrect dim for image_id and
corrupt future reads. Ensure this validation is applied where vec is created and
before calling _connect() / conn.execute.
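A hedged sketch of the validation plus the BLOB round-trip it protects, using an in-memory SQLite table. The `image_embeddings` schema here is simplified from the PR description, and `validate_embedding` is an illustrative helper name, not the real module's API.

```python
import sqlite3
import numpy as np

def validate_embedding(embedding) -> np.ndarray:
    # Reject anything that is not a finite 1-D float32 vector, so the
    # stored dim column can never disagree with the actual blob layout.
    vec = np.asarray(embedding, dtype=np.float32)
    if vec.ndim != 1:
        raise ValueError(f"Embedding must be 1-D, got shape={vec.shape}")
    if not np.isfinite(vec).all():
        raise ValueError("Embedding contains non-finite values")
    return vec

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image_embeddings "
    "(image_id INTEGER PRIMARY KEY, embedding BLOB, dim INTEGER)"
)
vec = validate_embedding(np.random.rand(163))  # 163-dim as in the PR
conn.execute(
    "INSERT INTO image_embeddings VALUES (?, ?, ?)",
    (7, vec.tobytes(), vec.shape[0]),
)
blob, dim = conn.execute(
    "SELECT embedding, dim FROM image_embeddings WHERE image_id = 7"
).fetchone()
restored = np.frombuffer(blob, dtype=np.float32)
assert restored.shape[0] == dim == 163
```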
except Exception:
    logger.exception("Semantic search failed for query %r", q)
    raise HTTPException(
Avoid logging raw user queries in exception logs.
Line 87 logs q verbatim, which can capture sensitive user-entered text. Log metadata (e.g., length/request id) instead of raw query content.
🔒 Proposed fix
  except Exception:
-     logger.exception("Semantic search failed for query %r", q)
+     logger.exception("Semantic search failed (query_len=%d)", len(q))
      raise HTTPException(
          status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
          detail="Semantic search failed.",
      )
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
except Exception:
    logger.exception("Semantic search failed (query_len=%d)", len(q))
    raise HTTPException(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/routes/search.py` around lines 86 - 88, The exception handler in
search.py currently calls logger.exception("Semantic search failed for query
%r", q) which logs raw user input q; change this to log non-sensitive metadata
instead: replace the logger.exception call in the except block handling semantic
search with something like logger.exception("Semantic search failed for
request_id=%s query_len=%d", request_id, len(q or "")) (or include any existing
request identifier) and remove q from logs, ensuring the raise HTTPException
remains unchanged; update references around logger.exception, the except block,
and the variable q to compute/query length and attach request id if available.
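One way to keep failure logs correlatable without storing user text is to log the query length plus a short digest of it. `log_search_failure` below is an illustrative helper, not part of the PR:

```python
import hashlib
import logging

logger = logging.getLogger("search")

def log_search_failure(q: str, request_id: str = "-") -> str:
    # A truncated SHA-256 lets operators spot repeated failures for the
    # same query without the raw user-entered text ever reaching the logs.
    digest = hashlib.sha256(q.encode("utf-8")).hexdigest()[:8]
    logger.exception(
        "Semantic search failed (request_id=%s query_len=%d query_sha=%s)",
        request_id, len(q), digest,
    )
    return digest
```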
db_create_embeddings_table()
# Instantiate the semantic search service and attach it to app state
app.state.search_service = SearchService()
# Index any unindexed tagged images in a background thread so startup
# is not blocked. Already-indexed images are skipped automatically.
threading.Thread(
    target=app.state.search_service.index_all,
    daemon=True,
    name="semantic-index-startup",
).start()
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Search for test files that cover SearchService, index_all, embeddings table, and search endpoints
echo "=== Searching for test files covering semantic search startup/indexing ==="
rg -nP --type=py -g '*test*.py' -C2 'SearchService|index_all|db_create_embeddings_table|/search/semantic|/search/index-status|/search/index|lifespan' 2>/dev/null || echo "No matches found"
echo ""
echo "=== Checking for test directory structure ==="
fd -t f -e py . -g '*test*' | head -20
Length of output: 311
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Finding all test files in repository ==="
find . -type f -name "*test*.py" -o -name "test_*.py" | head -30
echo ""
echo "=== Checking for common test directories ==="
find . -type d \( -name "*test*" -o -name "tests" \) 2>/dev/null | head -20
echo ""
echo "=== Listing all Python files in common test locations ==="
if [ -d "tests" ]; then
echo "Found 'tests' directory:"
ls -la tests/ 2>/dev/null | head -20
fi
if [ -d "test" ]; then
echo "Found 'test' directory:"
ls -la test/ 2>/dev/null | head -20
fi
echo ""
echo "=== Searching for SearchService class definition ==="
rg -n "class SearchService" --type=py
Length of output: 616
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Searching for SearchService references in test files ==="
rg -n "SearchService|index_all" backend/tests/ --type=py
echo ""
echo "=== Checking conftest.py for startup/lifespan fixtures ==="
cat -n backend/tests/conftest.py
echo ""
echo "=== Listing all test functions in test files ==="
rg -n "^def test_|^async def test_" backend/tests/ --type=py
echo ""
echo "=== Checking backend/test.py ==="
wc -l backend/test.py
head -50 backend/test.py
Length of output: 122
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Checking conftest.py for fixtures ==="
cat backend/tests/conftest.py
echo ""
echo "=== Checking backend/test.py ==="
cat backend/test.py 2>/dev/null || echo "File not found or error"
echo ""
echo "=== Checking backend/app/routes/test.py ==="
cat backend/app/routes/test.py 2>/dev/null || echo "File not found or error"
echo ""
echo "=== Examining SearchService class and startup integration ==="
rg -A 20 "class SearchService" backend/app/ai/search_service.py
echo ""
echo "=== Checking main.py lines 60-80 for full context ==="
sed -n '60,80p' backend/main.py
Length of output: 11617
Add automated test coverage for semantic search startup initialization
This startup block is critical but completely untested. The test suite's conftest.py does not create the embeddings table or initialize SearchService, and no tests exist for startup indexing behavior or search endpoints. Add tests covering:
- Embeddings table creation during setup
- SearchService instantiation and index_all() background-thread trigger
- Search endpoint functionality
Relates to coding guidelines: "Ensure that test code is automated, comprehensive, and follows testing best practices" and "Verify that all critical functionality is covered by tests".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/main.py` around lines 66 - 75, Add automated tests that verify the
startup block calls db_create_embeddings_table(), creates a SearchService
instance on app.state.search_service, and starts a background thread invoking
SearchService.index_all(); specifically, in tests (e.g., new test file using the
test client and monkeypatch/patch), assert db_create_embeddings_table was called
during app setup, assert app.state.search_service is an instance of
SearchService, and patch SearchService.index_all to confirm it was scheduled (or
invoked) by the background Thread (or assert a Thread named
"semantic-index-startup" was started); also add integration tests against the
search endpoint to confirm search responses after indexing behavior is
simulated.
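A sketch of how the startup behaviour could be asserted without running FastAPI: the lifespan logic is extracted into a function that takes its collaborators as parameters, so each call can be observed with mocks. `startup` and its signature are hypothetical; real tests would patch the names inside `backend/main.py` instead.

```python
import threading
from unittest.mock import MagicMock

def startup(state, create_table, search_service_cls):
    # Mirrors the startup block above: create the table, attach the
    # service to app state, kick off background indexing.
    create_table()
    state.search_service = search_service_cls()
    t = threading.Thread(
        target=state.search_service.index_all,
        daemon=True,
        name="semantic-index-startup",
    )
    t.start()
    return t

state = MagicMock()
create_table = MagicMock()
svc_cls = MagicMock()
thread = startup(state, create_table, svc_cls)
thread.join(timeout=5)

create_table.assert_called_once()
svc_cls.assert_called_once()
assert thread.name == "semantic-index-startup"
assert state.search_service.index_all.called
```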
🚀 Feature: AI-Powered Natural Language Search (Phase 1)
Implements the backend foundation for semantic photo search as discussed in #1205.
This PR introduces a modular and offline-first semantic search system that enables natural language queries against locally stored photo metadata.
🧠 What This PR Includes
1️⃣ Embedding Pipeline
2️⃣ Persistent Embedding Storage
3️⃣ Modular Architecture
Introduces:
- EmbeddingService – handles embedding generation.
- VectorIndexManager – manages the vector index lifecycle.
- SearchService – executes semantic similarity queries.
All components are isolated and extensible.
🔒 Privacy Considerations
⚙️ Performance Considerations
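A sketch of the flat-index search this design implies: with L2-normalised float32 rows, cosine similarity against a normalised query is a single matrix product followed by `argsort`. Array names are illustrative; 163 is the embedding dimension from the PR description, and 10k rows lands at roughly 6.5 MB of RAM.

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 L2-normalised 163-dim float32 rows, one per indexed image.
matrix = rng.random((10_000, 163)).astype(np.float32)
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

# Normalised query vector built from the same embedding pipeline.
query = rng.random(163).astype(np.float32)
query /= np.linalg.norm(query)

scores = matrix @ query               # (10000,) cosine similarities, O(N)
top_k = np.argsort(scores)[::-1][:5]  # indices of the 5 best matches
```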
📌 Scope
This PR introduces backend semantic search only.
UI integration and extended features will be handled in follow-up issues.
🧪 Testing
🔄 Future Improvements
Closes #1205
Summary by CodeRabbit