feat(data-catalog): server-side pagination, asset columns, hierarchy filters & expanded search by larsgeorge-db · Pull Request #337 · databrickslabs/ontos

larsgeorge-db · 2026-05-06T13:54:08Z

Summary

Resolves #333

Complete overhaul of the Data Catalog (column dictionary) feature:

Merged data sources: Columns from both Data Contracts and the Asset DB (Table/View/Dataset entities with hasColumn children) are now surfaced, deduplicated by (table_full_name, column_name).
Server-side pagination: offset/limit params on /columns and /columns/search (default 50 per page), with total_count and has_more in responses.
Faceted hierarchy filters: New filter dropdowns for Asset Type, System, Catalog, Schema — values served via GET /api/data-catalog/hierarchy.
Expanded search: Full-field search across column name, description, business terms (label + IRI), parent table, contract name, system name, catalog, schema. Removed the previous 5000-record search cap.
Source provenance: Each column entry carries a source field (contract | asset | both) rendered as a badge in the UI.
Navigation: Clicking a row now navigates to the correct detail page (Asset or Contract) depending on provenance.
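
For reference, the offset/limit contract described above can be sketched roughly as follows. This is a minimal illustration, not the code in data_catalog_manager.py: the `Page` dataclass and `paginate` helper are hypothetical names, and only `offset`, `limit`, `total_count`, `has_more` and the default page size of 50 come from this description.

```python
from dataclasses import dataclass
from typing import Generic, List, Sequence, TypeVar

T = TypeVar("T")


@dataclass
class Page(Generic[T]):
    """Hypothetical response envelope carrying the pagination metadata."""
    items: List[T]
    total_count: int
    has_more: bool


def paginate(rows: Sequence[T], offset: int = 0, limit: int = 50) -> Page[T]:
    """Slice an already-filtered sequence and report pagination metadata."""
    if offset < 0 or limit < 1:
        raise ValueError("offset must be >= 0 and limit must be >= 1")
    total = len(rows)
    items = list(rows[offset:offset + limit])
    # has_more is true only if rows remain after the last item on this page.
    return Page(items=items, total_count=total, has_more=offset + len(items) < total)
```

An offset past the end of the data yields an empty page with `has_more` set to false, while `total_count` still reports the size of the full filtered set.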

Files Changed

File	Change
`src/backend/src/models/data_catalog.py`	Pagination metadata, source field, `HierarchyFilters` model
`src/backend/src/controller/data_catalog_manager.py`	Asset extraction, merge/dedup logic, full-field search, hierarchy filters
`src/backend/src/routes/data_catalog_routes.py`	New query params, `/hierarchy` endpoint
`src/frontend/src/types/data-catalog.ts`	Updated TS interfaces
`src/frontend/src/views/data-catalog.tsx`	Filter bar, pagination controls, source badges, smart navigation

Test plan

Load Data Catalog page — verify columns from both Contracts and Assets appear
Confirm pagination: next/prev buttons, page size selector, total count accuracy
Filter by Asset Type, System, Catalog, Schema — verify results narrow correctly
Search for a business term — confirm it appears in results
Search for a table name — confirm matching columns appear
Click a row sourced from Asset → navigates to /assets/:id
Click a row sourced from Contract → navigates to /data-contracts/:id
Verify "Clear filters" resets all dropdowns and reloads full set

Follow-ups

#409 — Push data-catalog pagination/filters into SQL (perf). Acknowledged in review; deferred from this PR.

will-yuponce-db · 2026-05-06T15:31:08Z

Looks like there's a TypeScript error.

mvkonchits-db

Thanks for taking this on. The API contract, layering, and the contract+asset merge with source provenance are clean, and the frontend pagination UX is sound (filters reset offset to 0, debounced search, sort honestly scoped to the current page).

Three things I'd like to discuss before merge:

1. Pagination is in-memory, not pushed to SQL. get_all_columns / search_columns / get_hierarchy_filters / get_table_list each call _get_columns_from_contracts() + _get_columns_from_assets() + _merge_columns() end-to-end on every request and slice the result in Python (data_catalog_manager.py:580-585, 540-555). For tenants with thousands of registered columns, every page click and every debounced keystroke re-pays the full merge cost. At a minimum I'd like to see this called out as a known follow-up; ideally filters get pushed into the SQLAlchemy query so pagination is genuinely server-side.
2. N+1 in _get_asset_child_columns (data_catalog_manager.py:281-291): one SELECT per hasColumn relationship per table-like asset. Combined with the unbounded fetch above, this is the actual hot path. selectinload / joinedload on the column children, or a batched target_asset_id IN (...) query, would close it. (Note: this method also appears in #340; whichever PR merges second will conflict, so it is worth coordinating the fix once.)
3. CI scope. The lockfile auto-commit job in test-coverage.yml adds contents: write + pull-requests: write and a LOCKFILE_BOT_TOKEN with write permission to the repo. That's a meaningful security-surface change and ideally lands in its own PR so it can be reviewed independently of the data-catalog feature.
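
To make the batched alternative from the N+1 point concrete, here is a rough sketch using the standard-library `sqlite3` module rather than the project's SQLAlchemy models. The `assets(id, name)` table and the `load_child_columns` helper are assumptions made purely for illustration; the relationship pairs are taken as already loaded.

```python
import sqlite3
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def load_child_columns(
    conn: sqlite3.Connection,
    rels: Sequence[Tuple[int, int]],
) -> Dict[int, List[str]]:
    """Resolve (table_id, target_asset_id) pairs to column names in one query.

    The naive version would issue one SELECT per pair; here all target ids
    are fetched with a single `id IN (...)` lookup and regrouped in Python.
    """
    if not rels:
        return {}
    target_ids = sorted({target_id for _, target_id in rels})
    placeholders = ",".join("?" for _ in target_ids)
    rows = conn.execute(
        f"SELECT id, name FROM assets WHERE id IN ({placeholders})",
        target_ids,
    ).fetchall()
    name_by_id = dict(rows)

    grouped: Dict[int, List[str]] = defaultdict(list)
    for table_id, target_id in rels:
        if target_id in name_by_id:  # skip dangling relationships
            grouped[table_id].append(name_by_id[target_id])
    return dict(grouped)
```

Very long id lists may need to be chunked, since databases cap the number of bound parameters per statement.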

Nits (non-blocking):

No tests added for pagination math, filter composition, merge/dedup precedence, or the new search field surface — meanwhile coverage gates go up.
get_table_list iterates all_columns twice (data_catalog_manager.py:651-685); single pass would do.
_resolve_asset_parents recurses without a visited-set — a cyclic asset graph would loop.
Search route min_length=1 lets single-char q="a" scan everything; consider min 2 chars.
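
As an illustration of the visited-set nit, a parent walk that tolerates cycles could look like the sketch below. The `parent_of` mapping and the `resolve_parents` name are hypothetical stand-ins; the real _resolve_asset_parents works on asset entities, not plain strings.

```python
from typing import Dict, List, Optional, Set


def resolve_parents(
    asset_id: str,
    parent_of: Dict[str, str],
    _visited: Optional[Set[str]] = None,
) -> List[str]:
    """Return the chain of ancestors for asset_id, nearest first.

    The visited set is threaded through the recursion so that a cyclic
    graph terminates instead of recursing forever.
    """
    visited = set() if _visited is None else _visited
    if asset_id in visited:
        return []
    visited.add(asset_id)

    parent = parent_of.get(asset_id)
    if parent is None or parent in visited:
        return []  # reached a root, or the next hop would close a cycle
    return [parent] + resolve_parents(parent, parent_of, visited)
```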

…filters, expanded search

Resolves #333. Complete overhaul of the Data Catalog feature:

Backend:
- Merge columns from both Data Contracts and Asset DB (Table/View/Dataset with hasColumn children)
- Server-side pagination with offset/limit (default 50 per page)
- Faceted filtering by asset_type, system, catalog, schema
- Full-field search across column name, description, business terms, parent table, contract name, system name (no 5000-record cap)
- New GET /api/data-catalog/hierarchy endpoint for filter dropdown values
- Source provenance tracking (contract/asset/both) on each column entry
- Deduplication: asset metadata is base, contract enriches

Frontend:
- Replace single table dropdown with faceted filter bar (Asset Type, System, Catalog, Schema)
- Pagination controls (prev/next, page size selector, page X of Y)
- Source badge on each row showing provenance
- Search placeholder updated to reflect broader search scope
- Click-through to asset detail when column sourced from Asset DB
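
The dedup rule in this commit message (asset metadata as base, contract enriches) might be sketched as follows. This is an assumption-laden illustration rather than the actual _merge_columns: the dict-based column shape and the `merge_columns` name are hypothetical, and "enriches" is interpreted here as non-empty contract values being layered on top of the asset entry.

```python
from typing import Any, Dict, Iterable, List, Tuple

Column = Dict[str, Any]
_IDENTITY_FIELDS = {"table_full_name", "column_name", "business_terms", "source"}


def _key(col: Column) -> Tuple[str, str]:
    # Case-insensitive dedup key on (table_full_name, column_name).
    return (str(col["table_full_name"]).lower(), str(col["column_name"]).lower())


def merge_columns(assets: Iterable[Column], contracts: Iterable[Column]) -> List[Column]:
    merged: Dict[Tuple[str, str], Column] = {}

    for col in assets:
        entry = dict(col)
        entry["source"] = "asset"
        entry["business_terms"] = list(col.get("business_terms", []))
        merged[_key(col)] = entry

    for col in contracts:
        key = _key(col)
        if key not in merged:
            entry = dict(col)
            entry["source"] = "contract"
            entry["business_terms"] = list(col.get("business_terms", []))
            merged[key] = entry
            continue
        entry = merged[key]
        entry["source"] = "both"
        # Assumed enrichment rule: non-empty contract values win.
        for field, value in col.items():
            if field not in _IDENTITY_FIELDS and value not in (None, ""):
                entry[field] = value
        # Union of business terms, preserving first-seen order.
        for term in col.get("business_terms", []):
            if term not in entry["business_terms"]:
                entry["business_terms"].append(term)

    return list(merged.values())
```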

Eager-load source_relationships.target_asset in _get_columns_from_assets so that _get_asset_child_columns can read rel.target_asset directly instead of issuing one SELECT per hasColumn relationship per table-like asset. Note: PR #340 (feat/concepts-panels-and-fixes) also touches this method — whichever PR merges second will need a rebase carry-over.

- _resolve_asset_parents: thread a visited-set through the recursion so cyclic asset graphs can no longer loop.
- get_table_list: single pass over merged columns (accumulate counts and per-table metadata together, materialise list at the end).
- search route: bump query min_length from 1 to 2 so single-char q="a" no longer scans every column.
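
A single-pass accumulation of the kind described for get_table_list could look roughly like this. The `build_table_list` name and the dict-shaped columns are hypothetical; the point is only that counts and per-table metadata are collected in the same loop and the list is built once at the end.

```python
from typing import Any, Dict, Iterable, List


def build_table_list(columns: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collect per-table metadata and column counts in one pass."""
    tables: Dict[str, Dict[str, Any]] = {}
    for col in columns:
        name = col["table_full_name"]
        entry = tables.get(name)
        if entry is None:
            # First column seen for this table: record its metadata.
            entry = {
                "table_full_name": name,
                "catalog": col.get("catalog"),
                "schema": col.get("schema"),
                "column_count": 0,
            }
            tables[name] = entry
        entry["column_count"] += 1
    # Materialise the list once; dicts preserve first-seen order.
    return list(tables.values())
```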

…lters

Adds unit tests around the parts of #337 the reviewer flagged as untested:
- _merge_columns: dedup precedence (asset base + contract enrichment), case-insensitive key, business-term union.
- _matches_search: each searchable field branch (name, description, label, table, contract, system, catalog, schema, business term label + IRI).
- get_all_columns pagination math: first slice, offset skipping, has_more boundary.
- Filter composition: catalog alone, and the combined catalog/schema/asset_type/system narrow.
- search_columns: total_count reflects full match set, not the page.

Extractors are monkeypatched to keep the suite fast and DB-free.
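
For readers unfamiliar with the search surface these tests exercise, a full-field matcher might be sketched as below. The field names in `SEARCH_FIELDS`, the business-term dict shape, and the `matches_search` signature are all assumptions for illustration; the actual _matches_search may be structured differently.

```python
from typing import Any, Dict, List

# Hypothetical field names standing in for the searchable column attributes.
SEARCH_FIELDS = (
    "column_name", "description", "label", "table_full_name",
    "contract_name", "system", "catalog", "schema",
)


def matches_search(column: Dict[str, Any], query: str) -> bool:
    """Case-insensitive substring match across all searchable fields."""
    needle = query.strip().lower()
    if not needle:
        return False
    haystack: List[Any] = [column.get(field) for field in SEARCH_FIELDS]
    # Business terms are searched by both label and IRI.
    for term in column.get("business_terms", []):
        haystack.extend([term.get("label"), term.get("iri")])
    return any(needle in str(value).lower() for value in haystack if value)
```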

larsgeorge-db · 2026-05-21T10:34:02Z

@mvkonchits-db Thanks for the thorough review — pushed an update addressing each point. Branch was rebased onto current main first (force-push).

1. In-memory pagination → tracked as #409. Agreed this is a real concern at scale. The proper fix is sizeable (push filters into SQL, switch to keyset or LIMIT/OFFSET at the DB layer, probably a tsvector for the full-field search). Filed #409 with the hot paths and a "done when" target so it doesn't drift.
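
To show what the keyset option mentioned for #409 means in practice, here is a small sketch against the standard-library `sqlite3` module. The `catalog_columns` table and the `fetch_page` helper are hypothetical and not part of this PR or of #409; the real follow-up would go through SQLAlchemy.

```python
import sqlite3
from typing import List, Optional, Tuple

Row = Tuple[str, str]  # (table_full_name, column_name)

_ORDER = "ORDER BY table_full_name, column_name LIMIT ?"


def fetch_page(
    conn: sqlite3.Connection,
    limit: int = 50,
    after: Optional[Row] = None,
) -> Tuple[List[Row], Optional[Row]]:
    """Return one page plus the cursor for the next page (None on the last).

    Instead of OFFSET, the query resumes strictly after the last key seen,
    so deep pages cost the same as the first one given a matching index.
    """
    base = "SELECT table_full_name, column_name FROM catalog_columns "
    if after is None:
        rows = conn.execute(base + _ORDER, (limit + 1,)).fetchall()
    else:
        rows = conn.execute(
            base
            + "WHERE table_full_name > ? "
            + "OR (table_full_name = ? AND column_name > ?) "
            + _ORDER,
            (after[0], after[0], after[1], limit + 1),
        ).fetchall()
    # One extra row is fetched only to learn whether another page exists.
    has_more = len(rows) > limit
    page = rows[:limit]
    return page, (page[-1] if has_more else None)
```

The trade-off is that keyset paging gives up jumping to an arbitrary page number, which matters for a "page X of Y" control.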

2. N+1 in _get_asset_child_columns — fixed in 7f85ede. Eager-loaded source_relationships.target_asset in _get_columns_from_assets and changed _get_asset_child_columns to read rel.target_asset directly. Left a docstring note on the eager-load contract. Coordination note for #340: it touches the same method — whichever lands second will need a trivial carry-over.

3. CI scope — auto-resolved. The lockfile auto-commit + LOCKFILE_BOT_TOKEN PAT landed independently via #313 and #314 while #337 was in review, and #372 restructured test-coverage.yml into composite actions. The rebase dropped this branch's now-redundant copies of those edits — git log origin/main..HEAD is back to feature commits only.

Nits:

get_table_list single-pass: 72cbd37 — single pass over merged columns, counts and per-table metadata accumulated together, list materialised at the end.
_resolve_asset_parents visited-set: same commit — threads a _visited set through the recursion.
Search route min_length: same commit — bumped to 2.
Duplicate except Exception as e: — false alarm; was a display artifact in my earlier inspection. The file only has one per try-block (confirmed).
Tests: 9de92f4 — 21 cases covering merge/dedup precedence (incl. case-insensitive key and business-term union), every _matches_search field branch, pagination math (first slice / offset / boundary has_more), filter composition, and search total_count. Extractors monkeypatched, DB-free.

CI should run cleanly now that the TS error is gone (dropped the unused Database import in 5ea6bc5).

larsgeorge-db requested a review from a team as a code owner May 6, 2026 13:54

mvkonchits-db reviewed May 13, 2026


larsgeorge-db added 5 commits May 21, 2026 12:26

fix(data-catalog): drop unused Database icon import

5ea6bc5

larsgeorge-db force-pushed the feat/data-catalog-333 branch 2 times, most recently from 1fbdd58 to 9de92f4 Compare May 21, 2026 10:32

larsgeorge-db mentioned this pull request May 21, 2026

[Perf]: Push data-catalog pagination/filters into SQL (perf follow-up to #337) #409 (Open)

larsgeorge-db merged commit 4b9115f into main May 21, 2026
5 of 6 checks passed

larsgeorge-db mentioned this pull request May 21, 2026

feat(concepts): full-page browser/detail, links, neighbourhood, robust ontology save #340 (Merged, 6 tasks)

larsgeorge-db merged 5 commits into main from feat/data-catalog-333
