Add Parquet feature cache format and CLI options#350

Merged
gbeane merged 7 commits into main from finish-parquet-feature-cache-option on Apr 21, 2026

gbeane (Collaborator) commented on Apr 20, 2026

This pull request adds support for Parquet as the default cache format and enhances the cache directory structure options. The changes clarify and expand the documentation for feature cache formats and migration, and add new CLI options for cache management.

For large projects, testing shows that by using the more efficient Parquet feature cache format, the classification loop in the GUI can see a ~40% speedup.

Documentation improvements:

  • Expanded user and project setup guides to document both HDF5 and Parquet feature cache formats, migration steps, and cache directory structure, including new CLI options like --use-pose-hash, --skip-window-cache, and --cache-format.

Feature cache and CLI enhancements:

  • Added --use-pose-hash option to CLI tools and feature extraction to support more robust cache directory layouts and avoid filename collisions in shared caches.
  • Updated jabs-init to support explicit cache format selection (--cache-format {hdf5,parquet}).
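
The collision problem that --use-pose-hash addresses can be illustrated with a small sketch. This is not the layout JABS actually uses: the function names `pose_hash` and `feature_cache_dir`, the choice of SHA-256, and the directory structure are all assumptions made for illustration. The idea is only that keying the cache directory on the pose file's content keeps two identically named files from sharing a cache entry.

```python
import hashlib
from pathlib import Path


def pose_hash(pose_path: Path, chunk_size: int = 1 << 20) -> str:
    """Return a hex digest of the pose file's bytes (hypothetical helper)."""
    digest = hashlib.sha256()
    with pose_path.open("rb") as handle:
        # Read in chunks so large pose files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def feature_cache_dir(cache_root: Path, pose_path: Path, use_pose_hash: bool) -> Path:
    """Pick a cache directory for one pose file (assumed layout, not the real one)."""
    if use_pose_hash:
        # Content-derived subdirectory: same filename, different content -> different dir.
        return cache_root / pose_hash(pose_path) / pose_path.stem
    # Filename-only layout: two files with the same name map to the same directory.
    return cache_root / pose_path.stem
```

With the filename-only layout, `a/video_pose.h5` and `b/video_pose.h5` resolve to the same cache directory. With the hashed layout they resolve to different directories whenever their contents differ.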

Core code and serialization updates:

  • Changed CacheFormat enum to inherit from str for easier JSON serialization and added CACHE_FORMAT_KEY constant for project settings.
  • Improved Parquet cache metadata serialization to ensure values are always stored as floats or None.
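
As a rough illustration of the two serialization changes above, here is a minimal sketch rather than the project's actual code. The member values "hdf5" and "parquet" are assumed from the --cache-format {hdf5,parquet} choices, the string value of CACHE_FORMAT_KEY is a guess, and `normalize_metadata` and the metadata keys are hypothetical.

```python
import json
from enum import Enum


class CacheFormat(str, Enum):
    """str-backed enum: members are real strings, so json.dumps accepts them."""

    HDF5 = "hdf5"        # assumed value
    PARQUET = "parquet"  # assumed value


CACHE_FORMAT_KEY = "cache_format"  # assumed value of the constant


def normalize_metadata(metadata: dict) -> dict:
    """Coerce every metadata value to float, keeping None as None (hypothetical helper)."""
    return {key: None if value is None else float(value) for key, value in metadata.items()}


# Round-trip the enum through JSON without a custom encoder.
settings_json = json.dumps({CACHE_FORMAT_KEY: CacheFormat.PARQUET})
restored = CacheFormat(json.loads(settings_json)[CACHE_FORMAT_KEY])

# Integers become floats; missing values stay None.
clean = normalize_metadata({"fps": 30, "distance_scale": None})
```

A plain `Enum` without the `str` mixin would make `json.dumps` raise a `TypeError`, which is why the mixin simplifies writing the setting to the project file.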

Feature extraction and performance improvements:

  • Refactored feature extraction code to normalize operation settings, optimize per-frame feature handling, and add efficient flat feature access methods, reducing unnecessary data transformations.
  • Consolidated and simplified feature filtering logic for both per-frame and window features based on operation settings.

Copilot AI (Contributor) left a comment

Pull request overview

Adds Parquet-backed feature caching and exposes cache-format / cache-path controls across the CLI and GUI, including new UI actions for clearing feature caches. Also refactors feature extraction to operate more efficiently on flat (columnar) feature maps.

Changes:

  • Introduce CacheFormat plumbing end-to-end (project setting, CLI --cache-format, cache reader/writer selection, JSON-friendly enum).
  • Add GUI settings + menu actions for feature-cache management, plus classification timing display.
  • Refactor feature extraction to support flat per-frame feature access and pose-hash-based cache directory layouts.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| tests/project/test_project_cache_format.py | Adds project-level tests around cache_format persistence/migration and feature cache clearing. |
| tests/project/test_initialize_project.py | Extends jabs-init CLI parsing tests to include --cache-format. |
| src/jabs/ui/settings_dialog/settings_dialog.py | Registers a new settings group in the project settings dialog. |
| src/jabs/ui/settings_dialog/cache_format_settings_group.py | Adds GUI controls/help text for selecting feature cache format. |
| src/jabs/ui/main_window/menu_handlers.py | Splits pose-cache clearing vs feature-cache clearing and adds feature-cache confirmation/hinting. |
| src/jabs/ui/main_window/menu_builder.py | Moves “Clear Pose Cache” into File menu and adds “Clear Feature Cache…” action. |
| src/jabs/ui/main_window/main_window.py | Enables the new clear-feature-cache action on project load. |
| src/jabs/ui/main_window/central_widget.py | Updates classification completion handler to accept/display elapsed time. |
| src/jabs/ui/classification_thread.py | Emits elapsed time and passes project cache_format into feature extraction. |
| src/jabs/scripts/initialize_project.py | Adds --cache-format option and propagates it through multiprocessing feature generation. |
| src/jabs/scripts/generate_features.py | Adds --use-pose-hash and writes Parquet caches with optional pose-hash subdir. |
| src/jabs/scripts/classify.py | Adds --use-pose-hash and uses Parquet feature caching for CLI classification. |
| src/jabs/resources/docs/user_guide/project-setup.md | Documents feature cache formats, structure, and migration workflow. |
| src/jabs/resources/docs/user_guide/cli-tools.md | Documents new CLI flags and clarifies cache behavior. |
| docs/user-guide/project-setup.md | Mirrors user guide updates for feature cache formats and migration. |
| docs/user-guide/cli-tools.md | Mirrors CLI documentation updates. |
| src/jabs/project/project.py | Persists cache_format into project settings and adds Project.clear_feature_cache(). |
| src/jabs/project/parallel_workers.py | Passes cache_format through multiprocessing feature loading and uses flat per-frame features. |
| src/jabs/feature_extraction/features.py | Adds pose-hash cache paths, cache format auto-detection, and flat per-frame feature access. |
| packages/jabs-io/src/jabs/io/feature_cache/parquet.py | Normalizes optional metadata values to float/None for JSON. |
| packages/jabs-core/src/jabs/core/enums/cache_format.py | Makes CacheFormat a str-backed Enum for easier serialization. |
| packages/jabs-core/src/jabs/core/constants.py | Adds CACHE_FORMAT_KEY constant for project settings. |

Comment thread src/jabs/resources/docs/user_guide/project-setup.md
Comment thread src/jabs/project/project.py Outdated
Comment thread src/jabs/project/project.py Outdated
Comment thread src/jabs/feature_extraction/features.py Outdated
Comment thread src/jabs/ui/settings_dialog/cache_format_settings_group.py Outdated
Comment thread tests/project/test_project_cache_format.py Outdated
Comment thread docs/user-guide/project-setup.md
@gbeane gbeane self-assigned this Apr 20, 2026
@gbeane gbeane marked this pull request as ready for review April 20, 2026 01:07
@gbeane gbeane requested review from bergsalex and keithshep April 20, 2026 01:07
Comment thread src/jabs/project/parallel_workers.py Outdated
Comment on lines +139 to +146
try:
    cache_format = CacheFormat(job["cache_format"])
except (KeyError, ValueError):
    logger.error(
        "Unknown cache_format %r in job spec; falling back to HDF5",
        job.get("cache_format"),
    )
    cache_format = CacheFormat.HDF5

Contributor

just double checking: should this exception not be fatal / fail-fast?

gbeane (Collaborator, Author) commented on Apr 21, 2026

maybe it should be fatal. I was thinking we can just fall back to the lowest common denominator, HDF5, but this shouldn't happen and if it did maybe it would be best to fail and have to "repair" the project by setting the "cache_format" value in the project.json file to a supported value

Collaborator Author

so I think I'm going to just have the bare code here without wrapping it in try/except and let it raise a KeyError if "cache_format" is missing from the job spec or a ValueError if it's not a valid CacheFormat enum value

this shouldn't happen, and if it does we'll just propagate the exception back to the caller with the stack trace. The caller will catch the exception and re-raise a RuntimeError ("RuntimeError: Feature collection failed for video: ") with the original exception chained via __cause__
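
For reference, a minimal sketch of the fail-fast behaviour described in this thread. It is an illustration under assumptions, not the merged code: `parse_cache_format`, `collect_features_for_video`, the "video" job key, and the enum member values are hypothetical stand-ins.

```python
from enum import Enum


class CacheFormat(str, Enum):
    HDF5 = "hdf5"        # assumed value
    PARQUET = "parquet"  # assumed value


def parse_cache_format(job: dict) -> CacheFormat:
    """Bare lookup with no fallback.

    Raises KeyError if "cache_format" is missing from the job spec and
    ValueError if the value is not a valid CacheFormat.
    """
    return CacheFormat(job["cache_format"])


def collect_features_for_video(job: dict) -> CacheFormat:
    """Hypothetical caller that wraps any worker failure in a RuntimeError."""
    try:
        return parse_cache_format(job)
    except Exception as exc:
        # "from exc" stores the original exception on __cause__, so the
        # traceback shows both the RuntimeError and the underlying error.
        raise RuntimeError(
            f"Feature collection failed for video: {job.get('video')}"
        ) from exc
```

Calling `collect_features_for_video({"video": "a.mp4"})` raises a `RuntimeError` whose `__cause__` is the `KeyError`, so a bad project setting surfaces immediately instead of silently falling back to HDF5.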

Comment thread src/jabs/scripts/initialize_project.py Outdated
Comment on lines +38 to +45
try:
    cache_format = CacheFormat(params.get("cache_format", CacheFormat.HDF5.value))
except ValueError:
    logger.error(
        "Unknown cache_format %r in job params; falling back to HDF5",
        params.get("cache_format"),
    )
    cache_format = CacheFormat.HDF5

Contributor

candidate for a small utility function?

Collaborator Author

I think this is unnecessary code -- the "cache_format" parameter is always set by the job producer AND its value is going to be validated by Click

gbeane added 2 commits April 21, 2026 13:48
…isting project cache format when --cache-format is not passed
@gbeane gbeane merged commit 18227f2 into main Apr 21, 2026
5 checks passed
@gbeane gbeane deleted the finish-parquet-feature-cache-option branch April 21, 2026 18:18