feat: add metrics services and aggregation engine #45
Conversation
Summary of Changes

Hello @jeremi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a robust and centralized system for metrics computation and data aggregation within OpenSPP, established in two new modules: spp_metrics_services and spp_aggregation.

Highlights
Changelog
Activity
Semgrep OSS found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
Code Review
This pull request introduces two significant new modules, spp_metrics_services and spp_aggregation, which together form a powerful and well-designed aggregation engine. The architecture is clean, with a clear separation of concerns into services for breakdown, fairness, distribution, and privacy. The use of a statistic registry and a flexible scope definition model are excellent design choices for extensibility.
My review has identified a few key areas for improvement, primarily related to performance. There are instances of database queries inside loops that should be refactored into single, more efficient queries. Most critically, the fairness analysis service currently loads entire datasets into memory, which will not scale in a production environment and needs to be reworked using database-level aggregation like read_group.
Overall, this is a very strong contribution that lays a solid foundation for metrics and statistics across the platform. Addressing the identified performance bottlenecks will be crucial for its success with large data volumes.
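To illustrate the registry pattern mentioned above, here is a minimal, self-contained sketch. All names here (StatisticRegistry, register, compute) are hypothetical, inferred from the review description rather than taken from the PR code:

```python
# Hypothetical sketch of a statistic registry; names are assumptions,
# not the PR's actual API.
from typing import Callable, Dict


class StatisticRegistry:
    """Maps statistic codes to compute functions, so new statistics can
    be registered without modifying the aggregation engine itself."""

    def __init__(self):
        self._statistics: Dict[str, Callable] = {}

    def register(self, code: str, compute_fn: Callable) -> None:
        if code in self._statistics:
            raise ValueError(f"Statistic '{code}' is already registered")
        self._statistics[code] = compute_fn

    def compute(self, code: str, records, **kwargs):
        try:
            compute_fn = self._statistics[code]
        except KeyError:
            raise KeyError(f"Unknown statistic '{code}'") from None
        return compute_fn(records, **kwargs)


# Usage: register a simple count statistic, then compute it.
registry = StatisticRegistry()
registry.register("beneficiary_count", lambda records, **kw: len(records))
```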
```python
population = partner_model.search(base_domain)
values = set(population.mapped(f"{field_name}.id"))
values.discard(False)
```
The implementation of fairness analysis in this service and its helper methods loads the entire population recordset into memory (partner_model.search(base_domain)). This will cause severe performance and memory issues on production-sized datasets and will not scale.
The analysis should be refactored to use read_group to efficiently get population and beneficiary counts per category directly from the database. This avoids loading all records and delegates the counting to the database, which is significantly more performant.
For example, in _analyze_many2one_dimension, instead of searching and then looping to count, you could do something like this:
```python
# Get population counts per group
population_groups = partner_model.read_group(
    base_domain, [field_name], [field_name]
)
# Get beneficiary counts per group
beneficiary_domain = base_domain + [('id', 'in', list(beneficiary_set))]
beneficiary_groups = partner_model.read_group(
    beneficiary_domain, [field_name], [field_name]
)
# Then, merge these results to calculate coverage and disparity.
```

This same pattern should be applied to _analyze_selection_dimension, _analyze_boolean_dimension, and _analyze_expression_dimension to ensure the entire fairness service is scalable.
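To make the merge step concrete, here is a minimal sketch of combining the two read_group results into per-category coverage. It assumes the usual read_group result shape for a single many2one groupby (the grouped field holds an (id, display_name) pair and the count sits under '<field>_count'; the exact count key can vary across Odoo versions), and the helper name is hypothetical:

```python
# Minimal sketch, assuming read_group's usual result shape for a single
# many2one groupby; the helper name and output structure are hypothetical.
def _merge_group_counts(population_groups, beneficiary_groups, field_name):
    count_key = f"{field_name}_count"
    # Map each category id to its beneficiary count.
    beneficiaries = {
        (group[field_name] and group[field_name][0]): group[count_key]
        for group in beneficiary_groups
    }
    coverage_by_category = {}
    for group in population_groups:
        category = group[field_name] and group[field_name][0]
        population_count = group[count_key]
        beneficiary_count = beneficiaries.get(category, 0)
        coverage_by_category[category] = {
            "population": population_count,
            "beneficiaries": beneficiary_count,
            "coverage": beneficiary_count / population_count if population_count else 0.0,
        }
    return coverage_by_category
```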
```python
if self.include_child_areas:
    # Use parent_path for efficient child lookup
    for area in self.allowed_area_ids:
        if area.parent_path:
            # Find all child areas using parent_path prefix
            children = self.env["spp.area"].sudo().search([("parent_path", "like", f"{area.parent_path}%")])
            allowed_area_ids.update(children.ids)
```
The current implementation for finding child areas performs a database search inside a loop. This pattern is inefficient and can lead to significant performance degradation, especially if allowed_area_ids contains a large number of areas. It's better to collect all parent_path values first and then perform a single database search.
Suggested change:

```python
if self.include_child_areas and self.allowed_area_ids:
    # Use a single search to find all child areas efficiently
    parent_paths = [area.parent_path for area in self.allowed_area_ids if area.parent_path]
    if parent_paths:
        # Build a domain with OR conditions for each parent path
        domain = ['|'] * (len(parent_paths) - 1)
        for path in parent_paths:
            domain.append(('parent_path', 'like', f'{path}%'))
        child_areas = self.env["spp.area"].sudo().search(domain)
        allowed_area_ids.update(child_areas.ids)
```
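As a side note, Odoo ships a helper for exactly this kind of domain composition: odoo.osv.expression.OR builds the same prefix-notation OR chain and avoids the manual ['|'] arithmetic. A sketch of the equivalent lookup:

```python
# Equivalent sketch using Odoo's domain helper instead of manual
# '|' prefix counting; behavior matches the suggestion above.
from odoo.osv import expression

if self.include_child_areas and self.allowed_area_ids:
    parent_paths = [area.parent_path for area in self.allowed_area_ids if area.parent_path]
    if parent_paths:
        domain = expression.OR(
            [[("parent_path", "like", f"{path}%")] for path in parent_paths]
        )
        child_areas = self.env["spp.area"].sudo().search(domain)
        allowed_area_ids.update(child_areas.ids)
```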
```python
if include_children:
    # Use parent_path for efficient child lookup
    areas = self.env["spp.area"].sudo().browse(area_ids)
    all_area_ids = set(area_ids)
    for area in areas:
        if area.parent_path:
            # Find all child areas using parent_path prefix
            children = self.env["spp.area"].sudo().search([("parent_path", "like", f"{area.parent_path}%")])
            all_area_ids.update(children.ids)
    area_ids = list(all_area_ids)
```
This method for finding child areas has a performance issue due to performing a database search inside a loop. This pattern should be avoided as it does not scale well. It's more efficient to gather all parent paths and use a single search query to fetch all child areas at once.
Suggested change:

```python
if include_children:
    areas = self.env["spp.area"].sudo().browse(area_ids)
    all_area_ids = set(area_ids)
    parent_paths = [area.parent_path for area in areas if area.parent_path]
    if parent_paths:
        # Build a domain with OR conditions for each parent path
        domain = ['|'] * (len(parent_paths) - 1)
        for path in parent_paths:
            domain.append(('parent_path', 'like', f'{path}%'))
        child_areas = self.env["spp.area"].sudo().search(domain)
        all_area_ids.update(child_areas.ids)
    area_ids = list(all_area_ids)
```
```python
scope_record = self._resolve_scope(scope)
scope_type = self._get_scope_type(scope_record)

# Invalidate all cache entries of this scope type
# This is a conservative approach - it may invalidate more than needed,
# but ensures consistency
entries = self.env["spp.aggregation.cache.entry"].sudo().search(
    [("scope_type", "=", scope_type)]
)
```
The cache invalidation for a scope is currently based on scope_type, which invalidates all cached results for that type, not just for the specific scope being refreshed. This is a safe but overly broad approach that can lead to unnecessary re-computations. For more granular control, consider adding a scope_id field to the spp.aggregation.cache.entry model. This would allow invalidating the cache for a specific spp.aggregation.scope record while leaving others of the same type intact. For inline scopes, the current behavior is acceptable.
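A hypothetical sketch of that more granular invalidation, with the scope_id field added. The _resolve_scope and _get_scope_type calls come from the code above, while the field, the service model name, the method name, and the unlink-based invalidation are assumptions:

```python
# Hypothetical sketch; scope_id and the invalidation method name are
# assumptions, not the PR's actual code.
from odoo import fields, models


class AggregationCacheEntry(models.Model):
    _inherit = "spp.aggregation.cache.entry"

    # Link each entry to the specific scope record it was computed for.
    # Left empty for inline (ad-hoc) scopes.
    scope_id = fields.Many2one("spp.aggregation.scope", index=True)


class AggregationService(models.AbstractModel):
    _inherit = "spp.aggregation.service"  # assumed model name

    def _invalidate_scope_cache(self, scope):
        scope_record = self._resolve_scope(scope)
        scope_type = self._get_scope_type(scope_record)
        if scope_record and scope_record._name == "spp.aggregation.scope":
            # Granular: only entries cached for this specific scope record.
            domain = [("scope_id", "=", scope_record.id)]
        else:
            # Inline scopes keep the conservative type-wide invalidation.
            domain = [("scope_type", "=", scope_type)]
        self.env["spp.aggregation.cache.entry"].sudo().search(domain).unlink()
```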
```python
# Log warning but don't fail (depends on counts)
_logger.warning(
    "Partial suppression for gender %s: %d suppressed, %d visible",
    gender,
```
Check failure — Code scanning / CodeQL: Clear-text logging of sensitive information (High, test)

Copilot Autofix
In general, to fix clear-text logging of sensitive data, avoid including sensitive values directly in log messages. Either (1) remove the logging, (2) replace sensitive values with non-sensitive identifiers, or (3) obfuscate/anonymize them so that the logs remain useful but do not reveal sensitive attributes.
For this specific case, we want to keep the warning that partial suppression occurred (which is useful for debugging the aggregation logic) but avoid logging the exact gender value, which CodeQL considers tainted. The best minimally invasive fix is to remove gender from the logged message and instead log only non-sensitive structural information, such as the number of suppressed and visible cells, or perhaps the index/order of the group if needed. This preserves the existing test behavior and assertions, changes only a single log call, and does not affect any functional outcomes of the test.
Concretely, in spp_aggregation/tests/test_integration_demo.py, at the _logger.warning(...) call around line 687–692, we will change the format string and arguments so that gender is no longer passed to the logger. For example, we can log a generic message like "Partial suppression detected for a gender group: %d suppressed, %d visible" and only pass len(suppressed) and len(visible) as arguments. No new imports or helper methods are required.
```diff
@@ -685,8 +685,7 @@
     # Partial suppression detected - this could allow differencing
     # Log warning but don't fail (depends on counts)
     _logger.warning(
-        "Partial suppression for gender %s: %d suppressed, %d visible",
-        gender,
+        "Partial suppression detected for a gender group: %d suppressed, %d visible",
         len(suppressed),
         len(visible),
     )
```
Codecov Report

❌ Patch coverage is …

Additional details and impacted files

```
@@            Coverage Diff            @@
##             19.0      #45     +/-   ##
==========================================
- Coverage   71.31%   56.66%   -14.66%
==========================================
  Files         299      147      -152
  Lines       23618    11965    -11653
==========================================
- Hits        16844     6780    -10064
+ Misses       6774     5185     -1589
```

Flags with carried forward coverage won't be shown.
New modules:
- spp_metrics_services: demographic dimensions and shared metric computation services built on CEL domain
- spp_aggregation: unified aggregation engine with TTL-based caching, scope-based access control, and cron-managed cache cleanup
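For context, a TTL-based cache entry in this design might look like the following minimal sketch; the field names and the cron method are assumptions based on the commit description, not the actual module code:

```python
# Minimal sketch of a TTL cache entry with cron cleanup; field and
# method names are assumptions based on the commit description.
from odoo import api, fields, models


class AggregationCacheEntry(models.Model):
    _name = "spp.aggregation.cache.entry"
    _description = "Aggregation cache entry (sketch)"

    cache_key = fields.Char(required=True, index=True)
    result_json = fields.Text()
    expires_at = fields.Datetime(required=True, index=True)

    @api.model
    def _gc_expired_entries(self):
        """Cron target: delete entries whose TTL has elapsed."""
        self.sudo().search([("expires_at", "<", fields.Datetime.now())]).unlink()
```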
…optimize fairness analysis
- Replace per-area child lookup loops with a single OR-chained domain search in aggregation_access.py and service_scope_resolver.py to avoid N+1 queries
- Refactor _analyze_many2one_dimension, _analyze_selection_dimension, and _analyze_boolean_dimension in fairness_service.py to use read_group instead of loading entire partner recordsets into memory
Force-pushed from bb9743c to a11688a.
Summary
Dependencies
- spp_cel_domain
- spp_aggregation builds on spp_metrics_core

Origin
From openspp-modules-v2, branch claude/global-alliance-policy-basket.

Test plan