Conversation
Summary of Changes

Hello @superintendent2521, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request represents a significant architectural change, transitioning the application's data persistence layer from MongoDB to PostgreSQL. This involves a comprehensive migration script to move existing data, a re-engineered database abstraction layer that mimics the MongoDB API for minimal code changes in the application, and updates to dependencies and configurations to support the new database system.
Code Review
This pull request initiates a migration from MongoDB to PostgreSQL, introducing a new Python script (scripts/mongo_to_postgres_migration.py) to handle the data transfer by backing up MongoDB and then upserting collections into PostgreSQL JSONB tables. The core database layer (src/database.py) has been rewritten to use asyncpg for PostgreSQL, emulating MongoDB's collection API (e.g., find, aggregate) through in-memory processing of fetched documents and defining JSONB indexes. Supporting changes include adding POSTGRES_DSN to .env, updating requirements.txt with asyncpg and pydantic-settings, and modifying src/server.py to use pydantic-settings. However, the review highlights critical architectural and functional issues:

- The in-memory emulation of query operations in src/database.py will lead to severe performance and memory problems, necessitating a translation to native SQL queries.
- The removal of transactions for page creation in src/services/page_service.py introduces data inconsistency risks.
- The search functionality is broken because the custom _matches_filter does not support $regex.
- The migration script's upsert_documents performs slow one-by-one insertions instead of batch operations.
- The PostgresCursor's synchronous iteration blocks the event loop.
- The new database connection logic lacks the resilience (retry and monitoring) of the previous MongoDB implementation.
I am having trouble creating individual review comments, so my feedback is included below.
src/database.py (260-457)
The current implementation of PostgresCollection and PostgresCursor emulates MongoDB's query functionality by fetching all documents from a table into memory and then performing filtering, sorting, projection, and aggregation in Python. This is a major architectural issue that will lead to severe performance degradation and high memory usage as the database grows. For example, _fetch_docs loads the entire table, and methods like find, count_documents, and aggregate process this in-memory list. This negates the benefits of using a powerful database like PostgreSQL.
The correct approach is to translate the MongoDB-style queries into equivalent SQL queries that can be executed by the database engine.
- Filtering (find) should be translated to a WHERE clause.
- Counting (count_documents) should use SELECT COUNT(*) ... WHERE ....
- Aggregations (aggregate) should be translated to GROUP BY, SUM(), etc.
- Sorting should use ORDER BY.
- Projections should use SELECT col1, col2, ....
This will be a significant refactoring, but it's essential for the application to be performant and scalable.
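As a rough illustration of that direction, here is a minimal sketch of pushing a simple equality filter, sort, and limit down to PostgreSQL over the JSONB doc column. The (id, doc) table layout matches the migration script, but the helper names (build_where, find_docs) are hypothetical and only top-level equality conditions are handled.

import json
from typing import Any, Dict, List, Optional, Tuple

import asyncpg


def build_where(filt: Dict[str, Any]) -> Tuple[str, List[Any]]:
    """Translate simple top-level equality filters into a parameterized WHERE clause."""
    clauses: List[str] = []
    params: List[Any] = []
    for key, value in filt.items():
        # Keys are interpolated into the SQL text, so real code must validate
        # them against a whitelist of known fields.
        params.append(str(value))
        clauses.append(f"(doc->>'{key}') = ${len(params)}")
    return " AND ".join(clauses) or "TRUE", params


async def find_docs(
    pool: asyncpg.Pool,
    table: str,
    filt: Dict[str, Any],
    sort_key: Optional[str] = None,
    limit: Optional[int] = None,
) -> List[Dict[str, Any]]:
    where, params = build_where(filt)
    query = f"SELECT doc FROM {table} WHERE {where}"
    if sort_key:
        query += f" ORDER BY doc->>'{sort_key}'"
    if limit:
        query += f" LIMIT {int(limit)}"
    async with pool.acquire() as conn:
        rows = await conn.fetch(query, *params)
    # asyncpg returns JSONB as text unless a custom codec is registered.
    return [json.loads(row["doc"]) for row in rows]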
src/services/page_service.py (225-244)
The database transaction for creating a new page along with its talk page has been removed. Previously, this was wrapped in client.start_session() and s.start_transaction(), ensuring atomicity. Now, the two create_page calls are independent. If the second call to create the 'talk' page fails, the 'main' page will still have been created, leaving the database in an inconsistent state. Transactions should be reintroduced to ensure both pages are created successfully or neither is.
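For illustration only, a minimal sketch of restoring atomicity with an asyncpg transaction; the pool handle and the insert helper shown here are assumed names, not the PR's actual page_service API.

import json

import asyncpg


async def insert_page(conn: asyncpg.Connection, doc: dict) -> None:
    # Stand-in for the real create_page logic, executed on the transaction's connection.
    await conn.execute(
        "INSERT INTO pages (id, doc) VALUES ($1, $2::jsonb)",
        str(doc["_id"]),
        json.dumps(doc, ensure_ascii=False),
    )


async def create_page_with_talk(pool: asyncpg.Pool, main_doc: dict, talk_doc: dict) -> None:
    async with pool.acquire() as conn:
        async with conn.transaction():
            # Both inserts commit together; if the talk page fails, the main page rolls back.
            await insert_page(conn, main_doc)
            await insert_page(conn, talk_doc)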
src/services/page_service.py (352-355)
The search functionality is broken. This find call uses a $regex operator in the filter. However, the custom _matches_filter function in src/database.py does not implement the $regex operator. As a result, this query will not match any documents, and the search will always return empty results.
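If filtering stays in Python for now, one possible fix is a $regex branch in the matcher, sketched below with a hypothetical helper name; a longer-term option is translating $regex into PostgreSQL's ~ / ~* operators so the filter runs in the database.

import re
from typing import Any, Dict


def matches_regex_condition(value: Any, condition: Dict[str, Any]) -> bool:
    # Handles {'field': {'$regex': 'pattern', '$options': 'i'}} style conditions.
    pattern = condition.get("$regex", "")
    options = condition.get("$options", "")
    flags = re.IGNORECASE if "i" in options else 0
    return isinstance(value, str) and re.search(pattern, value, flags) is not None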
scripts/mongo_to_postgres_migration.py (146-170)
The upsert_documents function inserts documents one by one in a loop. For large collections, this will be very slow due to the overhead of individual INSERT statements. asyncpg supports batch operations via executemany, which is significantly more performant. Consider refactoring this to use executemany to speed up the migration process.
async def upsert_documents(
    pool: asyncpg.Pool, collection: str, documents: Iterable[Dict]
) -> int:
    table_name = table_name_for(collection)
    records_to_insert = []
    for doc in documents:
        doc = dict(doc)
        doc_id = str(doc.get("_id") or doc.get("id") or doc.get("uuid") or "")
        if not doc_id:
            continue
        clean_doc = sanitize_document(doc)
        clean_doc["_id"] = doc_id
        json_payload = json.dumps(clean_doc, ensure_ascii=False)
        records_to_insert.append((doc_id, json_payload))
    if not records_to_insert:
        return 0
    async with pool.acquire() as conn:
        async with conn.transaction():
            stmt = await conn.prepare(
                f"""
                INSERT INTO {table_name} (id, doc)
                VALUES ($1, $2::jsonb)
                ON CONFLICT (id) DO UPDATE SET doc = EXCLUDED.doc
                """
            )
            await stmt.executemany(records_to_insert)
    return len(records_to_insert)

src/database.py (248-254)
The synchronous iterator __iter__ uses asyncio.run(self._ensure_loaded()). This will block the entire event loop and will raise a RuntimeError if called from within an already running async function, as correctly checked. This design makes it impossible to use synchronous iteration on a cursor from within async code (e.g., for doc in my_cursor: inside an async def function). This is a significant limitation and can lead to unexpected errors. It's generally better to require async iteration (async for) for async database drivers and avoid mixing sync/async paradigms in this way.
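As a usage sketch under that recommendation, callers would consume cursors with async for instead of a plain for loop; the collection and field names below are illustrative, and the cursor is assumed to expose __aiter__/__anext__ as described.

from typing import List


async def list_main_branch_titles(pages_collection) -> List[str]:
    titles: List[str] = []
    # Async iteration keeps the event loop running while documents are fetched.
    async for doc in pages_collection.find({"branch": "main"}):
        titles.append(doc.get("title", ""))
    return titles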
src/database.py (114-182)
The database connection logic has lost its resilience. The previous implementation had a connect method with retry logic and a monitor_connection background task to handle disconnects. The new implementation attempts to connect only once, and if it fails, the database remains disconnected. This makes the application less robust to transient network issues or database restarts. Consider reintroducing connection retry logic and a monitoring task.
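A minimal sketch of what that could look like with asyncpg, assuming a plain create_pool call; the retry counts, intervals, and function names are assumptions rather than the previous implementation's exact behaviour.

import asyncio
import logging

import asyncpg

logger = logging.getLogger(__name__)


async def connect_with_retry(dsn: str, attempts: int = 5, base_delay: float = 1.0) -> asyncpg.Pool:
    for attempt in range(1, attempts + 1):
        try:
            return await asyncpg.create_pool(dsn)
        except (OSError, asyncpg.PostgresError) as exc:
            logger.warning("Postgres connect attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


async def monitor_connection(pool: asyncpg.Pool, interval: float = 30.0) -> None:
    # Background task: ping the pool periodically and log (or trigger reconnect) on failure.
    while True:
        try:
            async with pool.acquire() as conn:
                await conn.execute("SELECT 1")
        except Exception as exc:
            logger.warning("Postgres health check failed: %s", exc)
        await asyncio.sleep(interval)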
/gemini review
Code Review
This pull request represents a significant and impressive migration of the database backend from MongoDB to PostgreSQL. The new database abstraction layer in src/database.py is well-designed, emulating the motor API on top of Postgres's JSONB features. The refactoring of complex queries, such as log aggregation, to use raw SQL will yield substantial performance benefits.
However, the review has identified several critical issues that must be addressed. Most importantly, several unique indexes from the MongoDB schema have not been correctly translated to unique constraints in Postgres, which introduces a serious data integrity risk. Additionally, there is a major feature regression in the page search functionality and a high-severity bug in the update_many implementation. I have provided detailed comments and suggestions for these and other medium-severity issues related to performance and code quality.
INDEX_SPECS = {
    "pages": [
        ("title_branch", "((doc->>'title'), (doc->>'branch'))"),
        ("branch", "((doc->>'branch'))"),
        ("updated_at", "((doc->>'updated_at'))"),
        ("title_trgm", "USING gin ((doc->>'title') gin_trgm_ops)"),
    ],
    "sessions": [
        ("session_id", {"unique": True}),
        ("user_id", {}),
        ("expires_at", {"expireAfterSeconds": 0}),
    ],
    "history": [
        ("title_branch", "((doc->>'title'), (doc->>'branch'))"),
        ("updated_at", "((doc->>'updated_at'))"),
    ],
    "branches": [
        ("page_branch", "((doc->>'page_title'), (doc->>'branch_name'))"),
        ("created_at", "((doc->>'created_at'))"),
        ("branch_name", "((doc->>'branch_name'))"),
    ],
    "users": [
        ("username", "((doc->>'username'))"),
        ("email", "((doc->>'email'))"),
    ],
    "image_hashes": [
        ("filename", {"unique": True}),
        ("sha256", {}),
        ("filename", "((doc->>'filename'))"),
        ("sha256", "((doc->>'sha256'))"),
    ],
    "analytics_events": [
        ([("event_type", 1), ("timestamp", -1)], {}),
        ("timestamp", {}),
        ("query_normalized", {}),
        ("event_type_timestamp", "((doc->>'event_type'), ((doc->>'timestamp')::timestamptz))"),
        ("query_normalized_trgm", "USING gin ((doc->>'query_normalized') gin_trgm_ops)"),
        ("timestamp_only", "(((doc->>'timestamp'))::timestamptz)"),
    ],
    "sessions": [
        ("session_id", "((doc->>'session_id'))"),
        ("expires_at", "(((doc->>'expires_at'))::timestamptz)"),
        ("user_id", "((doc->>'user_id'))"),
    ],
    "settings": [
        ("doc_id", "((doc->>'_id'))"),
    ],
    "system_logs": [
        ("action_timestamp", "((doc->>'action'), ((doc->>'timestamp')::timestamptz))"),
        ("timestamp_only", "(((doc->>'timestamp'))::timestamptz)"),
    ],
}
Several indexes that were unique in the MongoDB schema are no longer defined as unique in the new INDEX_SPECS. This is a critical data integrity issue that could lead to duplicate data. Specifically:
- pages: The index on (title, branch) should be unique.
- users: The indexes on username and email should be unique.
- image_hashes: The index on filename should be unique.
- sessions: The index on session_id should be unique.
Please update the index definitions to include the UNIQUE constraint. I've also renamed the index suffixes to reflect this (e.g., _unique).
INDEX_SPECS = {
    "pages": [
        ("title_branch_unique", "UNIQUE ((doc->>'title'), (doc->>'branch'))"),
        ("branch", "((doc->>'branch'))"),
        ("updated_at", "((doc->>'updated_at'))"),
        ("title_trgm", "USING gin ((doc->>'title') gin_trgm_ops)"),
    ],
    "history": [
        ("title_branch", "((doc->>'title'), (doc->>'branch'))"),
        ("updated_at", "((doc->>'updated_at'))"),
    ],
    "branches": [
        ("page_branch", "((doc->>'page_title'), (doc->>'branch_name'))"),
        ("created_at", "((doc->>'created_at'))"),
        ("branch_name", "((doc->>'branch_name'))"),
    ],
    "users": [
        ("username_unique", "UNIQUE ((doc->>'username'))"),
        ("email_unique", "UNIQUE ((doc->>'email'))"),
    ],
    "image_hashes": [
        ("filename_unique", "UNIQUE ((doc->>'filename'))"),
        ("sha256", "((doc->>'sha256'))"),
    ],
    "analytics_events": [
        ("event_type_timestamp", "((doc->>'event_type'), ((doc->>'timestamp')::timestamptz))"),
        ("query_normalized_trgm", "USING gin ((doc->>'query_normalized') gin_trgm_ops)"),
        ("timestamp_only", "(((doc->>'timestamp'))::timestamptz)"),
    ],
    "sessions": [
        ("session_id_unique", "UNIQUE ((doc->>'session_id'))"),
        ("expires_at", "(((doc->>'expires_at'))::timestamptz)"),
        ("user_id", "((doc->>'user_id'))"),
    ],
    "settings": [("doc_id", "((doc->>'_id'))")],
    "system_logs": [
        ("action_timestamp", "((doc->>'action'), ((doc->>'timestamp')::timestamptz))"),
        ("timestamp_only", "(((doc->>'timestamp'))::timestamptz)"),
    ],
}

async def update_many(
    self,
    filt: Dict[str, Any],
    update: Dict[str, Any],
    *,
    upsert: bool = False,
    connection: Optional[asyncpg.Connection] = None,
) -> UpdateResult:
    await self._ensure_table()
    params: List[Any] = []
    where_clause = self._build_where_clause(filt or {}, params)
    update_expr = self._build_update_expression(update, params)

    query = f"UPDATE {self._table_name} SET doc = {update_expr}"
    if where_clause:
        query += f" WHERE {where_clause}"
    query += " RETURNING 1"

    rows = await self._db.fetch(query, *params, conn=connection)
    matched = len(rows)
    modified = matched
    upserted_id = None

    if upsert and matched == 0:
        base = {k: v for k, v in filt.items() if not isinstance(v, dict)}
        new_doc = _apply_update(base, update)
        result = await self.insert_one(new_doc, connection=connection)
        upserted_id = result.inserted_id
        modified += 1

    return UpdateResult(matched_count=matched, modified_count=modified, upserted_id=upserted_id)
There is a bug in the upsert logic for update_many. When no documents match the filter, it attempts to create a new document. However, the base for this new document is created using base = {k: v for k, v in filt.items() if not isinstance(v, dict)}, which only handles simple equality filters. If the filter contains operators (e.g., {'field': {'$gt': 5}}), the base dictionary will be empty, and the created document will be incorrect. This can lead to data corruption and does not match MongoDB's behavior.
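A sketch of a closer approximation, with a hypothetical helper name: keep plain equalities and {'$eq': ...} conditions from the filter when building the upsert base, and let range or set operators contribute nothing, which is roughly MongoDB's upsert behaviour.

from typing import Any, Dict


def upsert_base_from_filter(filt: Dict[str, Any]) -> Dict[str, Any]:
    base: Dict[str, Any] = {}
    for key, value in (filt or {}).items():
        if key.startswith("$"):
            continue  # skip top-level logical operators like $or
        if isinstance(value, dict):
            if set(value.keys()) == {"$eq"}:
                base[key] = value["$eq"]  # exact-match operator still seeds the new document
            # $gt, $in, etc. contribute nothing to the upserted document
        else:
            base[key] = value
    return base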
@classmethod
async def _maybe_enforce_retention(cls, collection) -> None:
    """Delete analytics events older than the retention window on a fixed cadence."""
    now = _utcnow()
    if cls._last_retention_check and now - cls._last_retention_check < cls._RETENTION_CHECK_INTERVAL:
        return
    cutoff = now - timedelta(days=cls._RETENTION_DAYS)
    try:
        await collection.delete_many({"timestamp": {"$lt": cutoff}})
    except Exception as exc:  # IGNORE W0718
        logger.warning(f"Failed to enforce analytics retention: {exc}")
    cls._last_retention_check = now
Calling _maybe_enforce_retention after every event insertion is inefficient. While the internal time check prevents the DELETE query from running every time, it still adds overhead to every write operation. A more robust approach would be to run this retention logic in a separate, periodic background task (e.g., using asyncio.create_task on application startup) that runs once every few hours.
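For illustration, a sketch of such a task; the interval, retention window, and the get_collection accessor are assumptions.

import asyncio
import logging
from datetime import datetime, timedelta, timezone

logger = logging.getLogger(__name__)


async def retention_loop(get_collection, retention_days: int = 90, interval_seconds: int = 6 * 3600) -> None:
    # Periodically delete analytics events older than the retention window.
    while True:
        cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
        try:
            await get_collection("analytics_events").delete_many({"timestamp": {"$lt": cutoff}})
        except Exception as exc:
            logger.warning("Failed to enforce analytics retention: %s", exc)
        await asyncio.sleep(interval_seconds)

# At application startup (e.g. in the startup hook):
# retention_task = asyncio.create_task(retention_loop(db.get_collection))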
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
/gemini review
Code Review
This pull request represents a significant and well-executed migration from MongoDB to PostgreSQL. The introduction of a new database abstraction layer in src/database.py to emulate the MongoDB collection API is a clever approach to minimize changes in the service layer. The migration script is thorough, and many parts of the application have been refactored to leverage PostgreSQL's features, such as the improved search and log aggregation. My review focuses on some correctness issues in the database emulation layer, a few performance considerations, and a regression in observability. Overall, this is a fantastic piece of engineering.
if op == "$exists":
    clauses.append(f"{expr} IS NOT NULL" if raw_expected else f"{expr} IS NULL")
    continue
The implementation of the $exists operator is incorrect. The current logic, {expr} IS NOT NULL, checks whether the value of a field is not SQL NULL. However, in JSONB, if a key does not exist, extracting it with ->> also results in SQL NULL. Therefore, this check cannot distinguish between a non-existent key and a key that exists with a null value.
To correctly check for the existence of a key, you should use the #> operator (which _json_path_expr(as_text=False) provides) and check if the result IS NOT NULL.
Suggested change:

if op == "$exists":
    # Use #> to check for path existence, as #>> returns null for both
    # non-existent keys and keys with an explicit null value.
    expr_for_exists = self._json_path_expr(key, as_text=False)
    clauses.append(f"{expr_for_exists} IS NOT NULL" if raw_expected else f"{expr_for_exists} IS NULL")
    continue
async def insert_one(
    self, document: Dict[str, Any], *, connection: Optional[asyncpg.Connection] = None
) -> InsertOneResult:
    await self._ensure_table()
    doc = dict(document)
    if "_id" not in doc:
        doc["_id"] = str(uuid.uuid4())
    json_payload = json.dumps(_jsonable(doc), ensure_ascii=False)
    await self._db.execute(
        f"""
        INSERT INTO {self._table_name} (id, doc)
        VALUES ($1, $2::jsonb)
        ON CONFLICT (id) DO UPDATE SET doc = EXCLUDED.doc
        """,
        str(doc["_id"]),
        json_payload,
        conn=connection,
    )
    return InsertOneResult(inserted_id=str(doc["_id"]))
The implementation of insert_one uses ON CONFLICT (id) DO UPDATE, which makes it behave as an 'upsert' operation. This differs from the standard pymongo.insert_one behavior, which would raise a DuplicateKeyError on an ID collision. This could lead to unexpected silent overwrites of data if the calling code assumes the original behavior. If the intention is to prevent overwrites, you should remove the ON CONFLICT clause and let the database raise a unique constraint violation error, which can then be handled by the caller.
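A sketch of the stricter variant, with assumed helper and table names: dropping the ON CONFLICT clause lets asyncpg raise UniqueViolationError on a duplicate id, which callers can handle the way they previously handled DuplicateKeyError.

import json
import uuid

import asyncpg


async def insert_one_strict(conn: asyncpg.Connection, table: str, document: dict) -> str:
    doc = dict(document)
    doc.setdefault("_id", str(uuid.uuid4()))
    # No ON CONFLICT clause: a duplicate primary key raises asyncpg.UniqueViolationError
    # instead of silently overwriting the existing row.
    await conn.execute(
        f"INSERT INTO {table} (id, doc) VALUES ($1, $2::jsonb)",
        str(doc["_id"]),
        json.dumps(doc, ensure_ascii=False),
    )
    return str(doc["_id"])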
if group_stage is None:
    # Fall back to a find-like query if no grouping is requested
    combined_sorts: List[tuple[str, int]] = []
    if sort_stage:
        combined_sorts.extend(list(sort_stage.items()))
    combined_sorts.extend(sorts)
    effective_limit = limit_stage if limit_stage is not None else limit
    results = await self._find_docs(match_filter, None, combined_sorts, effective_limit)
    if project_stage:
        results = self._apply_project_stage(results, project_stage)
    return results
In the aggregation pipeline, when no $group stage is present, the implementation falls back to _find_docs. However, it passes projection=None and then applies the $project stage in Python (_apply_project_stage). This is inefficient, as it fetches full documents from the database only to discard fields in the application.
You could improve performance by translating the $project stage into a projection dictionary and passing it to _find_docs. This would allow the database to handle the projection for include-only cases, reducing network traffic and memory usage.
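A small sketch of that translation, with a hypothetical helper name: include-only projections become a projection mapping that _find_docs could turn into a narrower SELECT, while mixed or exclude-only projections keep the existing Python fallback.

from typing import Any, Dict, Optional


def project_stage_to_projection(project_stage: Dict[str, Any]) -> Optional[Dict[str, int]]:
    includes = {field: 1 for field, flag in project_stage.items() if flag in (1, True)}
    excludes = [field for field, flag in project_stage.items()
                if flag in (0, False) and field != "_id"]
    if excludes:
        return None  # exclude-style projections still fall back to in-Python projection
    return includes or None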
Force-pushed from be81c1f to bcde4e7
clusterfuck
/gemini review
long term pr