Add Parquet feature cache format and CLI options #350
Pull request overview
Adds Parquet-backed feature caching and exposes cache-format / cache-path controls across CLI and GUI, including new UI actions for clearing feature caches and refactors to feature extraction to operate more efficiently on flat (columnar) feature maps.
Changes:
- Introduce CacheFormat plumbing end-to-end (project setting, CLI --cache-format, cache reader/writer selection, JSON-friendly enum).
- Add GUI settings + menu actions for feature-cache management, plus classification timing display.
- Refactor feature extraction to support flat per-frame feature access and pose-hash-based cache directory layouts.
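
The pose-hash-based layout in the last item can be pictured roughly as follows. This is a minimal sketch, not the JABS implementation: the function name `feature_cache_dir`, the choice of SHA-256 over the pose file's bytes, the directory naming, and the example file names are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def feature_cache_dir(cache_root: Path, pose_file: Path, use_pose_hash: bool) -> Path:
    """Return a hypothetical per-video cache directory.

    Without the hash, two pose files that share a name map to the same
    directory. With it, the directory also depends on the file contents.
    """
    base = cache_root / pose_file.stem
    if not use_pose_hash:
        return base
    digest = hashlib.sha256(pose_file.read_bytes()).hexdigest()
    return base / digest


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Two pose files with identical names but different contents,
    # as might happen when several projects share one cache.
    (root / "a").mkdir()
    (root / "b").mkdir()
    pose_a = root / "a" / "video_pose_est_v6.h5"
    pose_b = root / "b" / "video_pose_est_v6.h5"
    pose_a.write_bytes(b"pose data A")
    pose_b.write_bytes(b"pose data B")

    cache = root / "cache"
    plain_collides = (
        feature_cache_dir(cache, pose_a, False) == feature_cache_dir(cache, pose_b, False)
    )
    hashed_collides = (
        feature_cache_dir(cache, pose_a, True) == feature_cache_dir(cache, pose_b, True)
    )
    print(plain_collides, hashed_collides)
```

In this sketch the name-only layout collides for the two files while the hashed layout keeps them apart, which is the collision the option is meant to avoid in shared caches.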
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tests/project/test_project_cache_format.py | Adds project-level tests around cache_format persistence/migration and feature cache clearing. |
| tests/project/test_initialize_project.py | Extends jabs-init CLI parsing tests to include --cache-format. |
| src/jabs/ui/settings_dialog/settings_dialog.py | Registers a new settings group in the project settings dialog. |
| src/jabs/ui/settings_dialog/cache_format_settings_group.py | Adds GUI controls/help text for selecting feature cache format. |
| src/jabs/ui/main_window/menu_handlers.py | Splits pose-cache clearing vs feature-cache clearing and adds feature-cache confirmation/hinting. |
| src/jabs/ui/main_window/menu_builder.py | Moves “Clear Pose Cache” into File menu and adds “Clear Feature Cache…” action. |
| src/jabs/ui/main_window/main_window.py | Enables the new clear-feature-cache action on project load. |
| src/jabs/ui/main_window/central_widget.py | Updates classification completion handler to accept/display elapsed time. |
| src/jabs/ui/classification_thread.py | Emits elapsed time and passes project cache_format into feature extraction. |
| src/jabs/scripts/initialize_project.py | Adds --cache-format option and propagates it through multiprocessing feature generation. |
| src/jabs/scripts/generate_features.py | Adds --use-pose-hash and writes Parquet caches with optional pose-hash subdir. |
| src/jabs/scripts/classify.py | Adds --use-pose-hash and uses Parquet feature caching for CLI classification. |
| src/jabs/resources/docs/user_guide/project-setup.md | Documents feature cache formats, structure, and migration workflow. |
| src/jabs/resources/docs/user_guide/cli-tools.md | Documents new CLI flags and clarifies cache behavior. |
| docs/user-guide/project-setup.md | Mirrors user guide updates for feature cache formats and migration. |
| docs/user-guide/cli-tools.md | Mirrors CLI documentation updates. |
| src/jabs/project/project.py | Persists cache_format into project settings and adds Project.clear_feature_cache(). |
| src/jabs/project/parallel_workers.py | Passes cache_format through multiprocessing feature loading and uses flat per-frame features. |
| src/jabs/feature_extraction/features.py | Adds pose-hash cache paths, cache format auto-detection, and flat per-frame feature access. |
| packages/jabs-io/src/jabs/io/feature_cache/parquet.py | Normalizes optional metadata values to float/None for JSON. |
| packages/jabs-core/src/jabs/core/enums/cache_format.py | Makes CacheFormat a str-backed Enum for easier serialization. |
| packages/jabs-core/src/jabs/core/constants.py | Adds CACHE_FORMAT_KEY constant for project settings. |
…ut the setting to HDF5
…=None; fix clear_feature_cache for pose-hash layout; update cache format tests for new-project Parquet default

    try:
        cache_format = CacheFormat(job["cache_format"])
    except (KeyError, ValueError):
        logger.error(
            "Unknown cache_format %r in job spec; falling back to HDF5",
            job.get("cache_format"),
        )
        cache_format = CacheFormat.HDF5

just double checking: should this exception not be fatal / fail-fast?
maybe it should be fatal. I was thinking we can just fall back to the lowest common denominator, HDF5, but this shouldn't happen and if it did maybe it would be best to fail and have to "repair" the project by setting the "cache_format" value in the project.json file to a supported value
so I think I'm going to just have the bare code here without wrapping it in try/except and let it raise a KeyError if "cache_format" is missing from the job spec or a ValueError if it's not a valid CacheFormat enum value
this shouldn't happen, and if it does we'll just propagate the exception back to the caller with the stack trace. The caller will catch the exception and re-raise a RuntimeError ("RuntimeError: Feature collection failed for video: ") with the original exception chained via __cause__
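
The fail-fast behavior described above can be sketched like this. It is an illustration only: `collect_features`, `run_job`, the `"video"` key, and the stand-in `CacheFormat` values are hypothetical names chosen for the example, not the project's actual code.

```python
from enum import Enum


class CacheFormat(str, Enum):
    """Stand-in for the real enum; member values are assumed."""

    HDF5 = "hdf5"
    PARQUET = "parquet"


def collect_features(job: dict) -> CacheFormat:
    # No try/except: a missing key raises KeyError and an unsupported
    # value raises ValueError, both propagating to the caller.
    return CacheFormat(job["cache_format"])


def run_job(job: dict) -> CacheFormat:
    try:
        return collect_features(job)
    except Exception as exc:
        # "from exc" stores the original exception on __cause__,
        # so the full stack trace is preserved.
        raise RuntimeError(
            f"Feature collection failed for video: {job.get('video')}"
        ) from exc


def cause_of(job: dict):
    """Return the type of the chained exception, or None on success."""
    try:
        run_job(job)
    except RuntimeError as err:
        return type(err.__cause__)
    return None


missing = cause_of({"video": "a.avi"})
invalid = cause_of({"video": "a.avi", "cache_format": "csv"})
valid = run_job({"video": "a.avi", "cache_format": "parquet"})
print(missing.__name__, invalid.__name__, valid.value)
```

Compared with the silent HDF5 fallback, a corrupt job spec surfaces immediately with both the wrapping message and the underlying cause.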

    try:
        cache_format = CacheFormat(params.get("cache_format", CacheFormat.HDF5.value))
    except ValueError:
        logger.error(
            "Unknown cache_format %r in job params; falling back to HDF5",
            params.get("cache_format"),
        )
        cache_format = CacheFormat.HDF5

candidate for a small utility function?
I think this is unnecessary code -- the "cache_format" parameter is always set by the job producer and its value is going to be validated by Click
…isting project cache format when --cache-format is not passed
…e_format when --cache-format is omitted
This pull request adds support for Parquet as the default cache format and enhances the cache directory structure options. The changes also clarify and expand the documentation for feature cache formats and migration, and add new CLI options for cache management.
For large projects, testing shows that by using the more efficient Parquet feature cache format, the classification loop in the GUI can see a ~40% speedup.
Documentation improvements:
- … --use-pose-hash, --skip-window-cache, and --cache-format.
Feature cache and CLI enhancements:
- … --use-pose-hash option to CLI tools and feature extraction to support more robust cache directory layouts and avoid filename collisions in shared caches.
- … jabs-init to support explicit cache format selection (--cache-format {hdf5,parquet}).
Core code and serialization updates:
- … CacheFormat enum to inherit from str for easier JSON serialization and added CACHE_FORMAT_KEY constant for project settings.
Feature extraction and performance improvements: