feat(server): server-side privacy filter at heartbeat ingestion #598
TimeToBuildBob wants to merge 1 commit into
Adds a configurable regex-based privacy filter that runs on every heartbeat and batch event insert, regardless of whether the watcher pre-filtered. This is the consistency guarantee described in ActivityWatch#482.

Rules are declared in aw-server.toml under `privacy_filters`:

```toml
[[privacy_filters]]
bucket_prefix = "aw-watcher-window"
field = "title"
pattern = "(?i)private browsing|incognito"
action = "drop"

[[privacy_filters]]
bucket_prefix = "aw-watcher-window"
field = "title"
pattern = "(?i)password|credential"
action = "redact"
replacement = "REDACTED"
```

Each rule has:
- `bucket_prefix`: bucket id prefix to scope the rule (empty = all)
- `field`: dotted path into event.data (e.g. `url.host`)
- `pattern`: fancy-regex pattern (invalid patterns disable the rule with a loud log, never silently escaped)
- `action`: `drop` (discard event) or `redact` (replace field value)
- `replacement`: custom replacement for redact (default: `REDACTED`)
- `enabled`: set false to disable without removing (default: true)

On a `drop` match the heartbeat endpoint returns 200 OK with an empty event so older clients do not retry-storm. Batch inserts filter events inline before storage.

9 unit tests in privacy_filter.rs; 2 API integration tests in tests/api.rs (drop-heartbeat and redact-heartbeat end-to-end flows).

Closes ActivityWatch#482
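The rule semantics described above can be illustrated with a minimal, standard-library-only sketch. It is not the PR's implementation: `Value`, `Rule`, `needle`, `lookup_mut` and `apply` are hypothetical names, and a case-insensitive substring match stands in for the compiled fancy-regex pattern so the example needs no external crates. Applying rules in declaration order, and skipping a rule when the field is missing or not a string, are assumptions as well.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for JSON event data: a string leaf or a nested object.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Str(String),
    Obj(HashMap<String, Value>),
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Action {
    Drop,
    Redact,
}

/// Simplified rule. `needle` is matched as a case-insensitive substring and
/// stands in for the regex `pattern` of the configured rule (assumption).
pub struct Rule {
    pub bucket_prefix: String,
    pub field: String, // dotted path, e.g. "url.host"
    pub needle: String,
    pub action: Action,
    pub replacement: String,
}

/// Walk a dotted path and return a mutable reference to the value it names.
fn lookup_mut<'a>(data: &'a mut Value, path: &str) -> Option<&'a mut Value> {
    let mut cur = data;
    for key in path.split('.') {
        cur = match cur {
            Value::Obj(map) => map.get_mut(key)?,
            Value::Str(_) => return None, // cannot descend into a string
        };
    }
    Some(cur)
}

/// Returns `None` when a drop rule matches, otherwise the (possibly redacted) data.
pub fn apply(rules: &[Rule], bucket_id: &str, mut data: Value) -> Option<Value> {
    for rule in rules {
        // An empty prefix matches every bucket id.
        if !bucket_id.starts_with(rule.bucket_prefix.as_str()) {
            continue;
        }
        let text = match lookup_mut(&mut data, &rule.field) {
            Some(Value::Str(text)) => text,
            _ => continue, // field missing or not a string: rule does not apply
        };
        if !text.to_lowercase().contains(&rule.needle.to_lowercase()) {
            continue;
        }
        match rule.action {
            Action::Drop => return None,
            Action::Redact => *text = rule.replacement.clone(),
        }
    }
    Some(data)
}
```

In this sketch a `drop` rule short-circuits the remaining rules, while a `redact` rule rewrites the field in place and lets later rules see the redacted value.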
Closing in favor of #600, which implements the same feature with a cleaner architecture (settings-driven via
This PR was created in a parallel autonomous session: three privacy-filter PRs hit the repo within ~1 minute (#598, #599, #600). #599 was already closed; consolidating on #600.
The extra integration tests and config-file path from this PR can be ported to #600 as a follow-up if desired.
Greptile Summary
This PR adds server-side privacy filtering at heartbeat and batch event ingestion, driven by TOML-configured rules that can
Confidence Score: 3/5
The batch event endpoint holds the shared datastore mutex during regex filtering, which should be fixed before merge.
The new privacy-filter logic is correct and well-tested, but the batch event endpoint acquires the datastore lock before running regex matching, which can block other server operations for the duration of a potentially expensive filter pass on large batches.
aw-server/src/endpoints/bucket.rs: the lock ordering in
Sequence Diagram
sequenceDiagram
participant Watcher
participant HeartbeatEndpoint as /heartbeat
participant BatchEndpoint as /events (POST)
participant PrivacyFilter as privacy_filter::apply
participant Datastore
Note over HeartbeatEndpoint: Filter BEFORE lock (correct)
Watcher->>HeartbeatEndpoint: POST heartbeat
HeartbeatEndpoint->>PrivacyFilter: apply(rules, bucket_id, event)
alt drop rule matches
PrivacyFilter-->>HeartbeatEndpoint: None
HeartbeatEndpoint-->>Watcher: 200 OK (empty Event)
else passes / redacted
PrivacyFilter-->>HeartbeatEndpoint: Some(event)
HeartbeatEndpoint->>Datastore: acquire lock
Datastore-->>HeartbeatEndpoint: stored event
HeartbeatEndpoint-->>Watcher: 200 OK (event)
end
Note over BatchEndpoint: Lock acquired BEFORE filter (issue)
Watcher->>BatchEndpoint: POST events[]
BatchEndpoint->>Datastore: acquire lock
BatchEndpoint->>PrivacyFilter: apply_batch(rules, bucket_id, events)
alt all dropped
PrivacyFilter-->>BatchEndpoint: []
BatchEndpoint-->>Watcher: 200 OK ([])
else some remain
PrivacyFilter-->>BatchEndpoint: filtered[]
BatchEndpoint->>Datastore: insert_events(filtered)
Datastore-->>BatchEndpoint: stored events
BatchEndpoint-->>Watcher: 200 OK (events)
end
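The heartbeat branch of the diagram can be sketched in a few lines. This is only an illustration under stated assumptions: `Event` is reduced to two fields, `apply_stub` is a hypothetical stand-in for `privacy_filter::apply` that uses a fixed substring check instead of configured rules, and a plain `Vec` replaces the locked datastore.

```rust
/// Hypothetical, simplified event; the real `Event` model carries more fields.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Event {
    pub id: Option<i64>,
    pub title: String,
}

/// Stand-in for `privacy_filter::apply`: `None` means a drop rule matched.
/// A fixed substring check replaces the configured regex rules (assumption).
pub fn apply_stub(event: Event) -> Option<Event> {
    if event.title.to_lowercase().contains("incognito") {
        None
    } else {
        Some(event)
    }
}

/// Heartbeat branch of the diagram: returns (HTTP status, response body).
/// `store` is a plain Vec standing in for the locked datastore.
pub fn heartbeat(store: &mut Vec<Event>, event: Event) -> (u16, Event) {
    match apply_stub(event) {
        // Dropped: report success with an empty event and persist nothing.
        None => (200, Event::default()),
        // Kept (possibly redacted): store it and echo the stored event back.
        Some(mut kept) => {
            kept.id = Some(store.len() as i64 + 1);
            store.push(kept.clone());
            (200, kept)
        }
    }
}
```

Both branches answer with status 200, so a watcher cannot tell a dropped heartbeat from a stored one by status code alone; only the empty body distinguishes them.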
Reviews (1): Last reviewed commit: "feat(server): add server-side privacy fi..."
     let datastore = endpoints_get_lock!(state.datastore);
-    let res = datastore.insert_events(bucket_id, &events);
-    match res {
+    let filtered =
+        privacy_filter::apply_batch(&state.privacy_filters, bucket_id, events.into_inner());
+    if filtered.is_empty() {
+        return Ok(Json(Vec::new()));
+    }
+    match datastore.insert_events(bucket_id, &filtered) {
Lock acquired before filtering in `bucket_events_create`. The datastore mutex is grabbed on line 138 and then held for the duration of the regex-matching pass over the entire event batch. The heartbeat handler gets this right: it filters first and only acquires the lock after. Under a large batch or a complex pattern, every concurrent request to the datastore stalls until filtering completes.
Suggested change:
    let filtered =
        privacy_filter::apply_batch(&state.privacy_filters, bucket_id, events.into_inner());
    if filtered.is_empty() {
        return Ok(Json(Vec::new()));
    }
    let datastore = endpoints_get_lock!(state.datastore);
    match datastore.insert_events(bucket_id, &filtered) {
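The filter-then-lock ordering in the suggestion can be shown with a self-contained sketch built on `std::sync::Mutex`. The names here are hypothetical: `Datastore` is a toy struct, `apply_batch_stub` replaces `privacy_filter::apply_batch` with a fixed substring check, and `insert_filtered` is not a function from the PR. The real handler acquires its lock through the `endpoints_get_lock!` macro, which is not reproduced.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the server's datastore.
#[derive(Default)]
pub struct Datastore {
    pub events: Vec<String>,
}

/// Stand-in for `privacy_filter::apply_batch`: drops titles containing
/// "incognito" (a fixed check instead of configured regex rules).
pub fn apply_batch_stub(events: Vec<String>) -> Vec<String> {
    events
        .into_iter()
        .filter(|title| !title.to_lowercase().contains("incognito"))
        .collect()
}

/// Filter first, then lock: the mutex is held only for the insert itself.
pub fn insert_filtered(store: &Mutex<Datastore>, events: Vec<String>) -> Vec<String> {
    // Potentially slow pattern matching runs with no lock held.
    let filtered = apply_batch_stub(events);
    if filtered.is_empty() {
        // Nothing to store, so the lock is never taken.
        return filtered;
    }
    let mut guard = store.lock().expect("datastore mutex poisoned");
    guard.events.extend(filtered.iter().cloned());
    filtered
    // `guard` is dropped here, releasing the lock.
}
```

One consequence of this ordering is that the early return never touches the datastore, so a batch whose events are all filtered out succeeds without any check that the bucket exists.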
        match Regex::new(&f.pattern) {
            Ok(regex) => out.push(CompiledRule {
                bucket_prefix: f.bucket_prefix.clone(),
                field: f.field.clone(),
                action: f.action,
                replacement: f.replacement.clone(),
                regex,
            }),
            Err(err) => warn!(
                "Disabling privacy filter rule for bucket_prefix={:?} field={:?}: invalid regex: {}",
                f.bucket_prefix, f.field, err
            ),
        }
    }
    out
No integration test for batch event path with privacy filters.
`apply_batch` is wired into `bucket_events_create`, but none of the integration tests in `aw-server/tests/api.rs` exercise that endpoint under a drop or redact rule; only the heartbeat path has end-to-end coverage. A regression in how the batch early-return interacts with the datastore (e.g., a non-existent bucket returning 200 OK when all events are filtered) would not be caught by the current test suite.
pub fn compile(filters: &[PrivacyFilter]) -> Vec<CompiledRule> {
    let mut out = Vec::new();
    for f in filters {
        if !f.enabled {
            continue;
        }
        match Regex::new(&f.pattern) {
            Ok(regex) => out.push(CompiledRule {
                bucket_prefix: f.bucket_prefix.clone(),
                field: f.field.clone(),
                action: f.action,
                replacement: f.replacement.clone(),
                regex,
            }),
            Err(err) => warn!(
                "Disabling privacy filter rule for bucket_prefix={:?} field={:?}: invalid regex: {}",
                f.bucket_prefix, f.field, err
            ),
        }
    }
    out
ReDoS risk from unbounded fancy-regex patterns.
`compile` only rejects syntactically invalid patterns; it accepts syntactically valid patterns that exhibit catastrophic backtracking (e.g., `(a+)+b` on a long non-matching string). An admin who pastes a pattern from an untrusted source could cause the event-ingestion thread to spin indefinitely on every matching event. Consider using `regex` (the linear-time engine) for filters where lookahead/lookbehind semantics are not required, or document that patterns must be reviewed for backtracking risk.
Implements server-side privacy filtering at heartbeat ingestion, as proposed in #482.

What this does
Rules are declared in `aw-server.toml` under a `[[privacy_filters]]` array.

Each rule supports:
- `bucket_prefix`: scope the rule to matching bucket ids (empty = all buckets)
- `field`: dotted path into `event.data` (e.g. `url.host` for nested JSON)
- `pattern`: fancy-regex pattern (lookbehind/ahead supported); invalid patterns are loud-rejected at startup, never silently escaped
- `action`: `drop` (discard the event) or `redact` (replace field value)
- `replacement`: custom replacement string for redact (default: `REDACTED`)
- `enabled`: set `false` to disable a rule without removing it

Design decisions
- `drop` returns 200 OK with an empty `Event` rather than 4xx so older watchers don't enter retry loops. The watcher's contract ("event was accepted") is preserved even though nothing was persisted.
- Config-file only for now. Central configuration lives in `aw-server.toml`. The field is named `privacy_filters` so it can be extended later to include a live settings-API path (e.g. `/api/0/settings/privacy_filters`) without breaking the config-file path.
- fancy-regex is already a workspace dependency (used by `aw-query` and `aw-transform`), so this adds no new crates.

Tests
- `aw-server/src/privacy_filter.rs`: drop, redact, bucket scoping, invalid regex disabling, disabled rule, empty prefix, dotted field path, custom replacement, batch
- `aw-server/tests/api.rs`: full heartbeat round-trip for drop and redact actions

Related
- `exclude_titles` feature for enhanced window title exclusion aw-watcher-window#99 (merged client-side title exclusion): this is the server-side double-filter that catches events from watchers that don't pre-filter