feat(data-catalog): server-side pagination, asset columns, hierarchy filters & expanded search#337
Conversation
|
Looks like there's a Typescript error |
mvkonchits-db
left a comment
There was a problem hiding this comment.
Thanks for taking this on — the API contract, layering, and the contract+asset merge with source provenance are clean, and the frontend pagination UX is sound (filters reset offset to 0,
debounced search, sort honestly scoped to the current page).
Three things I'd like to discuss before merge:
-
Pagination is in-memory, not pushed to SQL.**
get_all_columns/search_columns/get_hierarchy_filters/get_table_listeach call_get_columns_from_contracts()+
_get_columns_from_assets()+_merge_columns()end-to-end on every request and slice the result in Python (data_catalog_manager.py:580-585, 540-555). For tenants with thousands of registered
columns, every page click and every debounced keystroke re-pays the full merge cost. At a minimum I'd like to see this called out as a known follow-up; ideally filters get pushed into the
SQLAlchemy query so pagination is genuinely server-side. -
N+1 in
_get_asset_child_columns** (data_catalog_manager.py:281-291): oneSELECTperhasColumnrelationship per table-like asset. Combined with the unbounded fetch above, this is the
actual hot path.selectinload/joinedloadon the column children, or a batchedtarget_asset_id IN (...)query, would close it. (Note: this method also appears in #340 — whichever PR merges
second will conflict, so worth coordinating the fix once.) -
CI scope.** The lockfile auto-commit job in
test-coverage.ymladdscontents: write+pull-requests: writeand aLOCKFILE_BOT_TOKENwith write permission to the repo. That's a
meaningful security-surface change and ideally lands in its own PR so it can be reviewed independently of the data-catalog feature.
Nits (non-blocking):
- No tests added for pagination math, filter composition, merge/dedup precedence, or the new search field surface — meanwhile coverage gates go up.
get_table_listiteratesall_columnstwice (data_catalog_manager.py:651-685); single pass would do._resolve_asset_parentsrecurses without a visited-set — a cyclic asset graph would loop.- Search route
min_length=1lets single-charq="a"scan everything; consider min 2 chars.
…filters, expanded search Resolves #333 — Complete overhaul of the Data Catalog feature: Backend: - Merge columns from both Data Contracts and Asset DB (Table/View/Dataset with hasColumn children) - Server-side pagination with offset/limit (default 50 per page) - Faceted filtering by asset_type, system, catalog, schema - Full-field search across column name, description, business terms, parent table, contract name, system name (no 5000-record cap) - New GET /api/data-catalog/hierarchy endpoint for filter dropdown values - Source provenance tracking (contract/asset/both) on each column entry - Deduplication: asset metadata is base, contract enriches Frontend: - Replace single table dropdown with faceted filter bar (Asset Type, System, Catalog, Schema) - Pagination controls (prev/next, page size selector, page X of Y) - Source badge on each row showing provenance - Search placeholder updated to reflect broader search scope - Click-through to asset detail when column sourced from Asset DB
Eager-load source_relationships.target_asset in _get_columns_from_assets so that _get_asset_child_columns can read rel.target_asset directly instead of issuing one SELECT per hasColumn relationship per table-like asset. Note: PR #340 (feat/concepts-panels-and-fixes) also touches this method — whichever PR merges second will need a rebase carry-over.
- _resolve_asset_parents: thread a visited-set through the recursion so cyclic asset graphs can no longer loop. - get_table_list: single pass over merged columns (accumulate counts and per-table metadata together, materialise list at the end). - search route: bump query min_length from 1 to 2 so single-char q="a" no longer scans every column.
…lters Adds unit tests around the parts of #337 the reviewer flagged as untested: - _merge_columns: dedup precedence (asset base + contract enrichment), case-insensitive key, business-term union. - _matches_search: each searchable field branch (name, description, label, table, contract, system, catalog, schema, business term label + IRI). - get_all_columns pagination math: first slice, offset skipping, has_more boundary. - Filter composition: catalog alone, and the combined catalog/schema/asset_type/system narrow. - search_columns: total_count reflects full match set, not the page. Extractors are monkeypatched to keep the suite fast and DB-free.
1fbdd58 to
9de92f4
Compare
|
@mvkonchits-db Thanks for the thorough review — pushed an update addressing each point. Branch was rebased onto current 1. In-memory pagination → tracked as #409. Agreed this is a real concern at scale. The proper fix is sizeable (push filters into SQL, switch to keyset or 2. N+1 in 3. CI scope — auto-resolved. The lockfile auto-commit + Nits:
CI should run cleanly now that the TS error is gone (dropped the unused |
Summary
Resolves #333
Complete overhaul of the Data Catalog (column dictionary) feature:
hasColumnchildren) are now surfaced, deduplicated by(table_full_name, column_name).offset/limitparams on/columnsand/columns/search(default 50 per page), withtotal_countandhas_morein responses.GET /api/data-catalog/hierarchy.sourcefield (contract|asset|both) rendered as a badge in the UI.Files Changed
src/backend/src/models/data_catalog.pyHierarchyFiltersmodelsrc/backend/src/controller/data_catalog_manager.pysrc/backend/src/routes/data_catalog_routes.py/hierarchyendpointsrc/frontend/src/types/data-catalog.tssrc/frontend/src/views/data-catalog.tsxTest plan
/assets/:id/data-contracts/:idFollow-ups