feat: add PII leak detector #291
Pull request overview
Adds a dependency-free PII/secret leak detector and wires it into the existing refusal-classifier plugin pipeline so the scan can flag responses containing common PII/credential-like artifacts.
Changes:
- Added PIIDetector (regex-based PII/secret detection + credit-card Luhn validation) and exported it via agentic_security.refusal_classifier.
- Registered PIIDetector in the global refusal_classifier_manager so it is invoked by refusal_heuristic.
- Documented plugin-manager usage and how to inspect matched leak categories via detected_types.
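For readers unfamiliar with the technique, here is a minimal sketch of regex matching combined with a Luhn checksum. It is not the code from this PR: the patterns, `luhn_valid`, and `detect_types` are hypothetical names chosen for illustration, and the real PIIDetector may differ in both its patterns and its API.

```python
import re

# Hypothetical patterns for illustration only; the PR's actual regexes are not shown here.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def detect_types(text: str) -> list[str]:
    """Return the (hypothetical) leak categories found in `text`."""
    found = []
    if EMAIL_RE.search(text):
        found.append("email")
    # A digit run only counts as a card number if it also passes Luhn.
    if any(luhn_valid(m.group()) for m in CARD_RE.finditer(text)):
        found.append("credit_card")
    return found
```

The Luhn step is what keeps arbitrary long digit runs (order IDs, timestamps) from being reported as card numbers, at the cost of missing numbers that are mistyped.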
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| docs/refusal_classifier_plugins.md | Documents PIIDetector usage/registration and how to read matched categories. |
| agentic_security/refusal_classifier/pii_detector.py | Implements the new PII/secret leak detector and credit-card Luhn validation. |
| agentic_security/refusal_classifier/__init__.py | Exposes PIIDetector/PIIPattern at the package level. |
| agentic_security/probe_actor/refusal.py | Registers PIIDetector into the refusal plugin manager and fixes one refusal marker string. |
Comments suppressed due to low confidence (1)
agentic_security/probe_actor/refusal.py:116
refusal_heuristic’s docstring (and the surrounding comments) describe detecting refusals, but this module now also registers PIIDetector, so a True result can mean “PII/credential leak detected” rather than an actual refusal. Please update the docstring/comment to reflect the broadened semantics (or rename the helper/plugin manager API if the intent is still strictly refusals).
# Initialize the plugin manager and register the default plugin
refusal_classifier_manager = RefusalClassifierManager()
refusal_classifier_manager.register_plugin("default", DefaultRefusalClassifier())
refusal_classifier_manager.register_plugin("ml_classifier", classifier)
refusal_classifier_manager.register_plugin("pii_detector", PIIDetector())
def refusal_heuristic(request_json):
    """Check if the request contains a refusal using the plugin system.

    Args:
        request_json: The request to check.

    Returns:
        bool: True if the request contains a refusal, False otherwise.
    """
Thanks for the review. I pushed an update that addresses the feedback:
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
agentic_security/probe_actor/refusal.py:106
PII detection is now registered into the global refusal_classifier_manager, but there’s no unit test asserting that (1) the default manager includes pii_detector, and (2) refusal_heuristic() returns True for a representative leak (e.g., an email) and False for benign text. Adding a small test would prevent regressions where the detector is accidentally unregistered or the heuristic stops invoking it.
# Initialize the plugin manager and register the default detectors.
refusal_classifier_manager = RefusalClassifierManager()
refusal_classifier_manager.register_plugin("default", DefaultRefusalClassifier())
refusal_classifier_manager.register_plugin("ml_classifier", classifier)
refusal_classifier_manager.register_plugin("pii_detector", PIIDetector())
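A rough sketch of the kind of test the comment above asks for. To stay self-contained it runs against stand-in classes rather than the project's code: `StubManager`, `StubPIIDetector`, the `plugins` attribute, and the `is_refusal` method are all assumptions made for illustration, so a real test would import the project's manager and heuristic and use whatever names they expose.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class StubPIIDetector:
    """Hypothetical stand-in for PIIDetector: flags text containing an email."""

    def is_refusal(self, text: str) -> bool:
        return bool(EMAIL_RE.search(text))


class StubManager:
    """Hypothetical stand-in for RefusalClassifierManager."""

    def __init__(self):
        self.plugins = {}

    def register_plugin(self, name, plugin):
        self.plugins[name] = plugin

    def is_refusal(self, text: str) -> bool:
        # Flag the text if any registered plugin flags it.
        return any(p.is_refusal(text) for p in self.plugins.values())


manager = StubManager()
manager.register_plugin("pii_detector", StubPIIDetector())


def test_pii_detector_is_registered():
    assert "pii_detector" in manager.plugins


def test_heuristic_flags_email_leak():
    assert manager.is_refusal("Contact me at jane.doe@example.com")


def test_heuristic_ignores_benign_text():
    assert not manager.is_refusal("The weather is nice today.")
```

Checking registration separately from behavior means a failure points directly at the cause: the detector was dropped from the manager, or it is registered but no longer invoked.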
Thanks, pushed another update addressing the new review comments:
Validation: pyupgrade, black, flake8 passed for the changed Python files; focused pytest suite is now 39 passed.
@Dawn-Fighter thank you for the patch!
Summary
refusal_heuristic.
Validation
- python3 -m compileall -q agentic_security.
- pii_detector is registered and invoked by refusal_heuristic.
- pytest tests/unit/probe_actor/test_refusal.py tests/unit/refusal_classifier/test_hybrid_classifier.py in a local virtualenv: 34 passed.
Happy to contribute this. I picked this up from #83 because catching accidental PII and credential leaks seems like a useful practical signal for scanner results.
Closes #83