feat(data-catalog): server-side pagination, asset columns, hierarchy filters & expanded search #337

Merged: larsgeorge-db merged 5 commits into main from feat/data-catalog-333 on May 21, 2026

larsgeorge-db (Collaborator) commented May 6, 2026

Summary

Resolves #333

Complete overhaul of the Data Catalog (column dictionary) feature:

  • Merged data sources: Columns from both Data Contracts and the Asset DB (Table/View/Dataset entities with hasColumn children) are now surfaced, deduplicated by (table_full_name, column_name).
  • Server-side pagination: offset/limit params on /columns and /columns/search (default 50 per page), with total_count and has_more in responses.
  • Faceted hierarchy filters: New filter dropdowns for Asset Type, System, Catalog, Schema — values served via GET /api/data-catalog/hierarchy.
  • Expanded search: Full-field search across column name, description, business terms (label + IRI), parent table, contract name, system name, catalog, schema. Removed the previous 5000-record search cap.
  • Source provenance: Each column entry carries a source field (contract | asset | both) rendered as a badge in the UI.
  • Navigation: Clicking a row now navigates to the correct detail page (Asset or Contract) depending on provenance.
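
The pagination contract described above (offset/limit in, total_count and has_more out) can be sketched as follows. This is a minimal illustration, not the PR's code: the `Page` class and `paginate` helper are hypothetical names, and the sketch assumes the merged column list is already in memory.

```python
from dataclasses import dataclass
from typing import Generic, List, Sequence, TypeVar

T = TypeVar("T")


@dataclass
class Page(Generic[T]):
    """Hypothetical response envelope: one page plus pagination metadata."""
    items: List[T]
    total_count: int
    has_more: bool


def paginate(rows: Sequence[T], offset: int = 0, limit: int = 50) -> Page[T]:
    """Slice an already-filtered sequence into a single page."""
    if offset < 0 or limit < 1:
        raise ValueError("offset must be >= 0 and limit must be >= 1")
    total = len(rows)
    items = list(rows[offset:offset + limit])
    # More rows remain only if this page ends before the full result set does.
    return Page(items=items, total_count=total, has_more=offset + len(items) < total)
```

Note that total_count counts the full filtered set rather than the returned page, which is what lets the UI render "page X of Y".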

Files Changed

| File | Change |
| --- | --- |
| src/backend/src/models/data_catalog.py | Pagination metadata, source field, HierarchyFilters model |
| src/backend/src/controller/data_catalog_manager.py | Asset extraction, merge/dedup logic, full-field search, hierarchy filters |
| src/backend/src/routes/data_catalog_routes.py | New query params, /hierarchy endpoint |
| src/frontend/src/types/data-catalog.ts | Updated TS interfaces |
| src/frontend/src/views/data-catalog.tsx | Filter bar, pagination controls, source badges, smart navigation |

Test plan

  • Load Data Catalog page — verify columns from both Contracts and Assets appear
  • Confirm pagination: next/prev buttons, page size selector, total count accuracy
  • Filter by Asset Type, System, Catalog, Schema — verify results narrow correctly
  • Search for a business term — confirm it appears in results
  • Search for a table name — confirm matching columns appear
  • Click a row sourced from Asset → navigates to /assets/:id
  • Click a row sourced from Contract → navigates to /data-contracts/:id
  • Verify "Clear filters" resets all dropdowns and reloads full set

Follow-ups

  • #409 — Push data-catalog pagination/filters into SQL (perf). Acknowledged in review; deferred from this PR.

larsgeorge-db requested a review from a team as a code owner May 6, 2026 13:54

will-yuponce-db (Contributor) commented:

Looks like there's a TypeScript error.

mvkonchits-db (Contributor) left a comment:

Thanks for taking this on — the API contract, layering, and the contract+asset merge with source provenance are clean, and the frontend pagination UX is sound (filters reset offset to 0,
debounced search, sort honestly scoped to the current page).

Three things I'd like to discuss before merge:

  1. Pagination is in-memory, not pushed to SQL. get_all_columns / search_columns / get_hierarchy_filters / get_table_list each call _get_columns_from_contracts() +
    _get_columns_from_assets() + _merge_columns() end-to-end on every request and slice the result in Python (data_catalog_manager.py:580-585, 540-555). For tenants with thousands of registered
    columns, every page click and every debounced keystroke re-pays the full merge cost. At a minimum I'd like to see this called out as a known follow-up; ideally filters get pushed into the
    SQLAlchemy query so pagination is genuinely server-side.

  2. N+1 in _get_asset_child_columns (data_catalog_manager.py:281-291): one SELECT per hasColumn relationship per table-like asset. Combined with the unbounded fetch above, this is the
    actual hot path. selectinload / joinedload on the column children, or a batched target_asset_id IN (...) query, would close it. (Note: this method also appears in #340, so whichever PR merges
    second will conflict; worth coordinating the fix once.)

  3. CI scope. The lockfile auto-commit job in test-coverage.yml adds contents: write + pull-requests: write and a LOCKFILE_BOT_TOKEN with write permission to the repo. That's a
    meaningful security-surface change and ideally lands in its own PR so it can be reviewed independently of the data-catalog feature.
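
To illustrate the batched `target_asset_id IN (...)` option from point 2, here is a rough sketch using the standard-library sqlite3 module instead of SQLAlchemy. The `assets` table, its `id` and `name` columns, and the `load_targets_batched` helper are assumptions made for the example; they are not taken from the PR.

```python
import sqlite3
from typing import Dict, Iterable


def load_targets_batched(conn: sqlite3.Connection, target_ids: Iterable[int]) -> Dict[int, str]:
    """Fetch every target asset in one query instead of one SELECT per relationship.

    Assumes a hypothetical `assets(id, name)` table.
    """
    ids = sorted(set(target_ids))  # de-duplicate so each asset is requested once
    if not ids:
        return {}
    # One bound parameter per id; the values never enter the SQL string itself.
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id, name FROM assets WHERE id IN ({placeholders})", ids
    )
    return {asset_id: name for asset_id, name in rows}
```

With this shape the caller collects all hasColumn target ids first, issues a single query, and resolves each relationship from the returned dict. Very large id lists would need chunking, since databases limit the number of bound parameters per statement.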

Nits (non-blocking):

  • No tests added for pagination math, filter composition, merge/dedup precedence, or the new search field surface — meanwhile coverage gates go up.
  • get_table_list iterates all_columns twice (data_catalog_manager.py:651-685); single pass would do.
  • _resolve_asset_parents recurses without a visited-set — a cyclic asset graph would loop.
  • Search route min_length=1 lets single-char q="a" scan everything; consider min 2 chars.
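
The visited-set nit can be illustrated with a small sketch. It assumes, purely for the example, that parent links are available as a child-to-parent dict; the real _resolve_asset_parents works on asset entities, and `resolve_parents` here is a hypothetical stand-in.

```python
from typing import Dict, List, Optional, Set


def resolve_parents(
    asset_id: str,
    parent_of: Dict[str, str],
    visited: Optional[Set[str]] = None,
) -> List[str]:
    """Return the chain of ancestors for asset_id, nearest parent first.

    The visited set is threaded through the recursion so that a cycle in
    parent_of ends the walk instead of recursing forever.
    """
    if visited is None:
        visited = set()
    if asset_id in visited:
        return []
    visited.add(asset_id)
    parent = parent_of.get(asset_id)
    if parent is None or parent in visited:
        return []
    return [parent] + resolve_parents(parent, parent_of, visited)
```

For an acyclic chain the result is the full ancestor list; for a two-node cycle the walk stops after the first hop.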

…filters, expanded search

Resolves #333 — Complete overhaul of the Data Catalog feature:

Backend:
- Merge columns from both Data Contracts and Asset DB (Table/View/Dataset
  with hasColumn children)
- Server-side pagination with offset/limit (default 50 per page)
- Faceted filtering by asset_type, system, catalog, schema
- Full-field search across column name, description, business terms,
  parent table, contract name, system name (no 5000-record cap)
- New GET /api/data-catalog/hierarchy endpoint for filter dropdown values
- Source provenance tracking (contract/asset/both) on each column entry
- Deduplication: asset metadata is base, contract enriches
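
One way to read "asset metadata is base, contract enriches" is sketched below. The dict-based column shape, the `merge_columns` name, and the rule that a contract only fills fields the asset left empty are all assumptions for illustration; the PR's _merge_columns may apply a different precedence.

```python
from typing import Any, Dict, Iterable, List, Tuple

Column = Dict[str, Any]


def _key(col: Column) -> Tuple[str, str]:
    # Case-insensitive dedup key on (table_full_name, column_name).
    return (col["table_full_name"].lower(), col["column_name"].lower())


def merge_columns(asset_cols: Iterable[Column], contract_cols: Iterable[Column]) -> List[Column]:
    merged: Dict[Tuple[str, str], Column] = {}
    for col in asset_cols:
        merged[_key(col)] = {**col, "source": "asset",
                             "business_terms": list(col.get("business_terms", []))}
    for col in contract_cols:
        base = merged.get(_key(col))
        if base is None:
            merged[_key(col)] = {**col, "source": "contract",
                                 "business_terms": list(col.get("business_terms", []))}
            continue
        base["source"] = "both"
        # Assumed enrichment rule: the contract fills only empty asset fields.
        for field, value in col.items():
            if field in ("source", "business_terms"):
                continue
            if base.get(field) in (None, "") and value not in (None, ""):
                base[field] = value
        # Business terms are unioned, keeping first-seen order.
        for term in col.get("business_terms", []):
            if term not in base["business_terms"]:
                base["business_terms"].append(term)
    return list(merged.values())
```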

Frontend:
- Replace single table dropdown with faceted filter bar
  (Asset Type, System, Catalog, Schema)
- Pagination controls (prev/next, page size selector, page X of Y)
- Source badge on each row showing provenance
- Search placeholder updated to reflect broader search scope
- Click-through to asset detail when column sourced from Asset DB

Eager-load source_relationships.target_asset in _get_columns_from_assets so
that _get_asset_child_columns can read rel.target_asset directly instead of
issuing one SELECT per hasColumn relationship per table-like asset.

Note: PR #340 (feat/concepts-panels-and-fixes) also touches this method —
whichever PR merges second will need a rebase carry-over.

- _resolve_asset_parents: thread a visited-set through the recursion so
  cyclic asset graphs can no longer loop.
- get_table_list: single pass over merged columns (accumulate counts and
  per-table metadata together, materialise list at the end).
- search route: bump query min_length from 1 to 2 so single-char q="a"
  no longer scans every column.
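
The single-pass shape described for get_table_list might look roughly like this. The field names (`table_full_name`, `catalog`, `schema`, `column_count`) and the `table_list` helper are illustrative assumptions rather than the actual model.

```python
from typing import Any, Dict, Iterable, List


def table_list(columns: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Accumulate per-table metadata and column counts in one pass."""
    tables: Dict[str, Dict[str, Any]] = {}
    for col in columns:
        name = col["table_full_name"]
        entry = tables.get(name)
        if entry is None:
            # First column seen for this table: record its metadata once.
            entry = tables[name] = {
                "table_full_name": name,
                "catalog": col.get("catalog"),
                "schema": col.get("schema"),
                "column_count": 0,
            }
        entry["column_count"] += 1
    # Materialise the list only at the end; dicts preserve insertion order.
    return list(tables.values())
```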
…lters

Adds unit tests around the parts of #337 the reviewer flagged as untested:

- _merge_columns: dedup precedence (asset base + contract enrichment),
  case-insensitive key, business-term union.
- _matches_search: each searchable field branch (name, description,
  label, table, contract, system, catalog, schema, business term
  label + IRI).
- get_all_columns pagination math: first slice, offset skipping,
  has_more boundary.
- Filter composition: catalog alone, and the combined
  catalog/schema/asset_type/system narrow.
- search_columns: total_count reflects full match set, not the page.
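
For orientation, a full-field matcher with one branch per searchable field could be sketched as below. The field names, the dict shape of business terms, and case-insensitive substring matching are assumptions for the example; the PR's _matches_search may differ.

```python
from typing import Any, Dict

# Hypothetical scalar fields searched on each column entry.
SCALAR_FIELDS = (
    "column_name", "description", "label", "table_full_name",
    "contract_name", "system", "catalog", "schema",
)


def matches_search(column: Dict[str, Any], query: str) -> bool:
    """Return True if the query is a case-insensitive substring of any field."""
    needle = query.strip().lower()
    if not needle:
        return False
    for field in SCALAR_FIELDS:
        value = column.get(field)
        if isinstance(value, str) and needle in value.lower():
            return True
    # Business terms are matched on both their label and their IRI.
    for term in column.get("business_terms", []):
        if needle in term.get("label", "").lower() or needle in term.get("iri", "").lower():
            return True
    return False
```

A test per field branch then amounts to building a column where only that field contains the query and asserting a match.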

Extractors are monkeypatched to keep the suite fast and DB-free.

larsgeorge-db (Collaborator, Author) commented:

@mvkonchits-db Thanks for the thorough review — pushed an update addressing each point. Branch was rebased onto current main first (force-push).

1. In-memory pagination → tracked as #409. Agreed this is a real concern at scale. The proper fix is sizeable (push filters into SQL, switch to keyset or LIMIT/OFFSET at the DB layer, probably a tsvector for the full-field search). Filed #409 with the hot paths and a "done when" target so it doesn't drift.

2. N+1 in _get_asset_child_columns — fixed in 7f85ede. Eager-loaded source_relationships.target_asset in _get_columns_from_assets and changed _get_asset_child_columns to read rel.target_asset directly. Left a docstring note on the eager-load contract. Coordination note for #340: it touches the same method — whichever lands second will need a trivial carry-over.

3. CI scope — auto-resolved. The lockfile auto-commit + LOCKFILE_BOT_TOKEN PAT landed independently via #313 and #314 while #337 was in review, and #372 restructured test-coverage.yml into composite actions. The rebase dropped this branch's now-redundant copies of those edits — git log origin/main..HEAD is back to feature commits only.
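
As background for the keyset option mentioned in point 1, here is a minimal sketch using the standard-library sqlite3 module. The `catalog_columns` table and the `fetch_page` helper are hypothetical, and this is not the design tracked in #409; it only shows the general technique of resuming after the last-seen key instead of skipping rows with OFFSET.

```python
import sqlite3
from typing import List, Optional, Tuple

Key = Tuple[str, str]  # (table_full_name, column_name)

_BASE = "SELECT table_full_name, column_name FROM catalog_columns"
_ORDER = " ORDER BY table_full_name, column_name LIMIT ?"


def fetch_page(
    conn: sqlite3.Connection, after: Optional[Key] = None, limit: int = 50
) -> Tuple[List[Key], bool]:
    """Return (rows, has_more) for the page that follows the `after` key."""
    if after is None:
        sql, params = _BASE + _ORDER, (limit + 1,)
    else:
        # Resume strictly after the last row of the previous page.
        sql = (
            _BASE
            + " WHERE table_full_name > ?"
            + " OR (table_full_name = ? AND column_name > ?)"
            + _ORDER
        )
        params = (after[0], after[0], after[1], limit + 1)
    rows = conn.execute(sql, params).fetchall()
    # Fetching one extra row reveals whether another page exists.
    return rows[:limit], len(rows) > limit
```

The client passes the last row of one page as `after` for the next request. Unlike OFFSET, the cost of a page does not grow with its position, but jumping to an arbitrary page number is no longer possible.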

Nits:

  • get_table_list single-pass: 72cbd37 — single pass over merged columns, counts and per-table metadata accumulated together, list materialised at the end.
  • _resolve_asset_parents visited-set: same commit — threads a _visited set through the recursion.
  • Search route min_length: same commit — bumped to 2.
  • Duplicate except Exception as e: — false alarm; was a display artifact in my earlier inspection. The file only has one per try-block (confirmed).
  • Tests: 9de92f4 — 21 cases covering merge/dedup precedence (incl. case-insensitive key and business-term union), every _matches_search field branch, pagination math (first slice / offset / boundary has_more), filter composition, and search total_count. Extractors monkeypatched, DB-free.

CI should run cleanly now that the TS error is gone (dropped the unused Database import in 5ea6bc5).

@larsgeorge-db larsgeorge-db merged commit 4b9115f into main May 21, 2026
5 of 6 checks passed
Development

Successfully merging this pull request may close these issues.

[PRD]: Data Catalog — Server-Side Pagination, Asset Column Integration, Hierarchy Filters & Expanded Search

3 participants