
feat(server): server-side privacy filter at heartbeat ingestion #598

Closed
TimeToBuildBob wants to merge 1 commit into ActivityWatch:master from TimeToBuildBob:feat/privacy-filter-server

Conversation

@TimeToBuildBob
Contributor

Implements server-side privacy filtering at heartbeat ingestion, as proposed in #482.

What this does

Rules are declared in aw-server.toml under a [[privacy_filters]] array:

# Drop events matching a regex (silent discard, 200 OK so watchers don't retry)
[[privacy_filters]]
bucket_prefix = "aw-watcher-window"
field = "title"
pattern = "(?i)private browsing|incognito"
action = "drop"

# Redact a field value when it matches a regex
[[privacy_filters]]
field = "title"
pattern = "(?i)password|credential|secret"
action = "redact"
replacement = "REDACTED"

Each rule supports:

  • bucket_prefix: scope the rule to bucket ids starting with this prefix (empty = all buckets)
  • field: dotted path into event.data (e.g. url.host for nested JSON)
  • pattern: fancy-regex pattern (lookbehind/lookahead supported); an invalid pattern disables its rule with a loud warning at startup and is never silently escaped
  • action: drop (discard the event) or redact (replace the field value)
  • replacement: custom replacement string for redact (default: REDACTED)
  • enabled: set to false to disable a rule without removing it
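
To illustrate the bucket_prefix and field semantics listed above, here is a minimal, stdlib-only sketch. The `Value` enum and the function names `bucket_in_scope` and `lookup` are hypothetical stand-ins chosen for this example; they are not taken from the PR, which presumably works on the server's own JSON types.

```rust
use std::collections::BTreeMap;

/// Minimal JSON-like value standing in for the event's `data` payload.
/// (Assumption: the real server uses a JSON library type instead.)
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Str(String),
    Num(f64),
    Obj(BTreeMap<String, Value>),
}

/// An empty prefix matches every bucket id, because every string
/// starts with the empty string.
pub fn bucket_in_scope(bucket_prefix: &str, bucket_id: &str) -> bool {
    bucket_id.starts_with(bucket_prefix)
}

/// Walk a dotted path such as "url.host" through nested objects.
/// Returns None if a segment is missing or a non-object is reached early.
pub fn lookup<'a>(data: &'a Value, path: &str) -> Option<&'a Value> {
    let mut current = data;
    for segment in path.split('.') {
        match current {
            Value::Obj(map) => current = map.get(segment)?,
            _ => return None,
        }
    }
    Some(current)
}
```

Under these assumptions, a rule with field = "url.host" would only ever be tested against events whose data contains a nested url object with a host key; any other event passes through that rule untouched.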

Design decisions

drop returns 200 OK with an empty Event rather than 4xx so older watchers don't enter retry loops. The watcher's contract ("event was accepted") is preserved even though nothing was persisted.
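
The drop contract described above can be sketched roughly as follows. This is a simplified model, not the PR's code: `Rule`, `Event`, `apply`, and `heartbeat` are hypothetical shapes, the event data is flattened to string fields, and a plain function pointer (`matches`) stands in for the compiled fancy-regex pattern so the example needs no external crates.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Action {
    Drop,
    Redact,
}

/// Hypothetical, simplified rule; `matches` stands in for the compiled regex.
pub struct Rule {
    pub bucket_prefix: String,
    pub field: String,
    pub matches: fn(&str) -> bool,
    pub action: Action,
    pub replacement: String,
}

/// Simplified event with flat string data (the real payload is nested JSON).
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Event {
    pub data: BTreeMap<String, String>,
}

/// Returns None when a drop rule matches, otherwise the (possibly redacted) event.
pub fn apply(rules: &[Rule], bucket_id: &str, mut event: Event) -> Option<Event> {
    for rule in rules {
        if !bucket_id.starts_with(rule.bucket_prefix.as_str()) {
            continue; // rule is scoped to other buckets
        }
        let hit = event.data.get(&rule.field).map_or(false, |v| (rule.matches)(v));
        if !hit {
            continue;
        }
        match rule.action {
            Action::Drop => return None,
            Action::Redact => {
                // Replace the whole field value, as the rule description states.
                event.data.insert(rule.field.clone(), rule.replacement.clone());
            }
        }
    }
    Some(event)
}

/// Heartbeat sketch: a dropped event still produces a successful, empty
/// response, and nothing is pushed to `store` for it.
pub fn heartbeat(rules: &[Rule], bucket_id: &str, event: Event, store: &mut Vec<Event>) -> Event {
    match apply(rules, bucket_id, event) {
        None => Event::default(),
        Some(kept) => {
            store.push(kept.clone());
            kept
        }
    }
}
```

The point of the design is visible in `heartbeat`: the caller cannot tell a dropped event from an accepted one by status alone, which is what keeps older watchers out of retry loops.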

Config-file only for now. Central configuration lives in aw-server.toml. The field is named privacy_filters so it can be extended later to include a live settings-API path (e.g. /api/0/settings/privacy_filters) without breaking the config-file path.

fancy-regex is already a workspace dependency (used by aw-query and aw-transform), so this adds no new crates.

Tests

  • 9 unit tests in aw-server/src/privacy_filter.rs — drop, redact, bucket scoping, invalid regex disabling, disabled rule, empty prefix, dotted field path, custom replacement, batch
  • 2 API integration tests in aw-server/tests/api.rs — full heartbeat round-trip for drop and redact actions

Related

Adds a configurable regex-based privacy filter that runs on every
heartbeat and batch event insert, regardless of whether the watcher
pre-filtered. This is the consistency guarantee described in
ActivityWatch#482.

Rules are declared in aw-server.toml under `privacy_filters`:

```toml
[[privacy_filters]]
bucket_prefix = "aw-watcher-window"
field = "title"
pattern = "(?i)private browsing|incognito"
action = "drop"

[[privacy_filters]]
bucket_prefix = "aw-watcher-window"
field = "title"
pattern = "(?i)password|credential"
action = "redact"
replacement = "REDACTED"
```

Each rule has:
- `bucket_prefix`: bucket id prefix to scope the rule (empty = all)
- `field`: dotted path into event.data (e.g. `url.host`)
- `pattern`: fancy-regex pattern (invalid patterns disable the rule
  with a loud log, never silently escaped)
- `action`: `drop` (discard event) or `redact` (replace field value)
- `replacement`: custom replacement for redact (default: `REDACTED`)
- `enabled`: set false to disable without removing (default: true)

On a `drop` match the heartbeat endpoint returns 200 OK with an empty
event so older clients do not retry-storm. Batch inserts filter events
inline before storage.

9 unit tests in privacy_filter.rs; 2 API integration tests in
tests/api.rs (drop-heartbeat and redact-heartbeat end-to-end flows).

Closes ActivityWatch#482
@TimeToBuildBob
Contributor Author

Closing in favor of #600, which implements the same feature with a cleaner architecture (settings-driven via RefreshPrivacyFilter command, lives in aw-datastore closer to storage) and passing CI.

This PR was created in a parallel autonomous session — three privacy-filter PRs hit the repo within ~1 minute (#598, #599, #600). #599 was already closed; consolidating on #600.

The extra integration tests and config-file path from this PR can be ported to #600 as a follow-up if desired.

@greptile-apps

greptile-apps Bot commented May 9, 2026

Greptile Summary

This PR adds server-side privacy filtering at heartbeat and batch event ingestion, driven by TOML-configured rules that can drop (silently discard with 200 OK) or redact (replace a field value) events whose data matches a regex pattern. Rules are compiled once at startup using fancy-regex, and the new privacy_filter module is well-covered by unit tests.

  • Heartbeat path (bucket_events_heartbeat) correctly filters before acquiring the datastore lock; a dropped event returns Event::default() with 200 OK so older clients do not retry-storm.
  • Batch events path (bucket_events_create) acquires the datastore mutex before running apply_batch, holding the lock during all regex work — inconsistent with the heartbeat pattern and a potential source of lock contention under large batches or complex patterns.
  • Integration tests cover the heartbeat endpoint under both drop and redact scenarios, but the bucket_events_create path has no integration-test coverage under privacy-filter conditions.

Confidence Score: 3/5

The batch event endpoint holds the shared datastore mutex during regex filtering, which should be fixed before merge.

The new privacy-filter logic is correct and well-tested, but the batch event endpoint acquires the datastore lock before running regex matching, which can block other server operations for the duration of a potentially expensive filter pass on large batches.

aw-server/src/endpoints/bucket.rs — the lock ordering in bucket_events_create needs correction before merge.

Important Files Changed

| Filename | Overview |
| --- | --- |
| aw-server/src/endpoints/bucket.rs | Heartbeat filtering is correct (filter before lock). Batch event endpoint acquires the datastore lock before running regex filtering; the lock should be moved after the filter call to match the heartbeat pattern and avoid unnecessary contention. |
| aw-server/src/privacy_filter.rs | New module implementing drop/redact filter logic. Core logic is well-structured and well-tested (9 unit tests). Minor concerns: no ReDoS guard on user-supplied patterns, and no startup warning when a compiled rule's field never produces a string match. |
| aw-server/src/config.rs | Adds privacy_filters field with serde default of empty vec. Integration with AWConfig is straightforward and non-breaking. |
| aw-server/src/endpoints/mod.rs | Adds privacy_filters: Vec<CompiledRule> to ServerState. Clean, minimal change. |
| aw-server/src/main.rs | Calls privacy_filter::compile once at startup and passes compiled rules into ServerState. Correctly logs the loaded rule count. |
| aw-server/tests/api.rs | Adds two integration tests (drop and redact) for the heartbeat endpoint. Batch event endpoint (bucket_events_create) is not covered under privacy-filter conditions. |

Sequence Diagram

sequenceDiagram
    participant Watcher
    participant HeartbeatEndpoint as /heartbeat
    participant BatchEndpoint as /events (POST)
    participant PrivacyFilter as privacy_filter::apply
    participant Datastore

    Note over HeartbeatEndpoint: Filter BEFORE lock (correct)
    Watcher->>HeartbeatEndpoint: POST heartbeat
    HeartbeatEndpoint->>PrivacyFilter: apply(rules, bucket_id, event)
    alt drop rule matches
        PrivacyFilter-->>HeartbeatEndpoint: None
        HeartbeatEndpoint-->>Watcher: 200 OK (empty Event)
    else passes / redacted
        PrivacyFilter-->>HeartbeatEndpoint: Some(event)
        HeartbeatEndpoint->>Datastore: acquire lock
        Datastore-->>HeartbeatEndpoint: stored event
        HeartbeatEndpoint-->>Watcher: 200 OK (event)
    end

    Note over BatchEndpoint: Lock acquired BEFORE filter (issue)
    Watcher->>BatchEndpoint: POST events[]
    BatchEndpoint->>Datastore: acquire lock
    BatchEndpoint->>PrivacyFilter: apply_batch(rules, bucket_id, events)
    alt all dropped
        PrivacyFilter-->>BatchEndpoint: []
        BatchEndpoint-->>Watcher: 200 OK ([])
    else some remain
        PrivacyFilter-->>BatchEndpoint: filtered[]
        BatchEndpoint->>Datastore: insert_events(filtered)
        Datastore-->>BatchEndpoint: stored events
        BatchEndpoint-->>Watcher: 200 OK (events)
    end

Reviews (1): Last reviewed commit: "feat(server): add server-side privacy fi..."

Comment on lines 138 to +144
  let datastore = endpoints_get_lock!(state.datastore);
- let res = datastore.insert_events(bucket_id, &events);
- match res {
+ let filtered =
+     privacy_filter::apply_batch(&state.privacy_filters, bucket_id, events.into_inner());
+ if filtered.is_empty() {
+     return Ok(Json(Vec::new()));
+ }
+ match datastore.insert_events(bucket_id, &filtered) {

P1 Lock acquired before filtering in bucket_events_create. The datastore mutex is grabbed on line 138 and then held for the duration of the regex-matching pass over the entire event batch. The heartbeat handler gets this right — it filters first and only acquires the lock after. Under a large batch or a complex pattern, every concurrent request to the datastore stalls until filtering completes.

Suggested change
- let datastore = endpoints_get_lock!(state.datastore);
- let res = datastore.insert_events(bucket_id, &events);
- match res {
- let filtered =
-     privacy_filter::apply_batch(&state.privacy_filters, bucket_id, events.into_inner());
- if filtered.is_empty() {
-     return Ok(Json(Vec::new()));
- }
- match datastore.insert_events(bucket_id, &filtered) {
+ let filtered =
+     privacy_filter::apply_batch(&state.privacy_filters, bucket_id, events.into_inner());
+ if filtered.is_empty() {
+     return Ok(Json(Vec::new()));
+ }
+ let datastore = endpoints_get_lock!(state.datastore);
+ match datastore.insert_events(bucket_id, &filtered) {
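
The ordering argued for in this comment can be shown in isolation with a small stdlib-only sketch. It is not the server's code: `insert_batch` is a hypothetical function, the store is a plain `Mutex<Vec<String>>`, and the `keep` closure stands in for the privacy-filter pass.

```rust
use std::sync::Mutex;

/// Filter first, then take the lock only for the insert itself.
/// `keep` is a hypothetical stand-in for the privacy-filter pass.
pub fn insert_batch<F>(store: &Mutex<Vec<String>>, events: Vec<String>, keep: F) -> usize
where
    F: Fn(&String) -> bool,
{
    // Potentially slow matching happens while no lock is held.
    let filtered: Vec<String> = events.into_iter().filter(|e| keep(e)).collect();
    if filtered.is_empty() {
        return 0; // nothing to store, so the lock is never taken
    }
    let inserted = filtered.len();
    // Critical section: only the write itself.
    let mut guard = store.lock().expect("datastore mutex poisoned");
    guard.extend(filtered);
    inserted
}
```

With this shape, the time the mutex is held depends only on the size of the filtered batch, not on how expensive the patterns are to evaluate.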

Comment on lines +80 to +94
        match Regex::new(&f.pattern) {
            Ok(regex) => out.push(CompiledRule {
                bucket_prefix: f.bucket_prefix.clone(),
                field: f.field.clone(),
                action: f.action,
                replacement: f.replacement.clone(),
                regex,
            }),
            Err(err) => warn!(
                "Disabling privacy filter rule for bucket_prefix={:?} field={:?}: invalid regex: {}",
                f.bucket_prefix, f.field, err
            ),
        }
    }
    out

P2 No integration test for batch event path with privacy filters. apply_batch is wired into bucket_events_create but none of the integration tests in aw-server/tests/api.rs exercise that endpoint under a drop or redact rule — only the heartbeat path has end-to-end coverage. A regression in how the batch early-return interacts with the datastore (e.g., non-existent bucket returning 200 OK when all events are filtered) would not be caught by the current test suite.
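
The kind of regression this comment has in mind can be modelled without the server. The sketch below is hypothetical throughout: `Status`, `create_events`, and the bucket set are illustrative names, and it assumes a missing bucket should be reported as an error even when every event in the batch is filtered out.

```rust
use std::collections::HashSet;

/// Hypothetical response model for the batch endpoint.
#[derive(Debug, PartialEq)]
pub enum Status {
    Ok(Vec<String>),
    NotFound,
}

/// The bucket check runs before filtering, so a missing bucket is
/// reported even when no event survives the filter.
pub fn create_events<F>(
    buckets: &HashSet<String>,
    bucket_id: &str,
    events: Vec<String>,
    keep: F,
) -> Status
where
    F: Fn(&String) -> bool,
{
    if !buckets.contains(bucket_id) {
        return Status::NotFound;
    }
    let filtered: Vec<String> = events.into_iter().filter(|e| keep(e)).collect();
    Status::Ok(filtered)
}
```

An integration test for the real endpoint would make the analogous pair of assertions: an all-dropped batch to an existing bucket succeeds with an empty list, while the same batch to a non-existent bucket does not silently succeed.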

Comment on lines +74 to +94
pub fn compile(filters: &[PrivacyFilter]) -> Vec<CompiledRule> {
    let mut out = Vec::new();
    for f in filters {
        if !f.enabled {
            continue;
        }
        match Regex::new(&f.pattern) {
            Ok(regex) => out.push(CompiledRule {
                bucket_prefix: f.bucket_prefix.clone(),
                field: f.field.clone(),
                action: f.action,
                replacement: f.replacement.clone(),
                regex,
            }),
            Err(err) => warn!(
                "Disabling privacy filter rule for bucket_prefix={:?} field={:?}: invalid regex: {}",
                f.bucket_prefix, f.field, err
            ),
        }
    }
    out

P2 ReDoS risk from unbounded fancy-regex patterns. compile only rejects syntactically invalid patterns; it accepts syntactically valid patterns that exhibit catastrophic backtracking (e.g., (a+)+b on a long non-matching string). An admin who pastes a pattern from an untrusted source could cause the event-ingestion thread to spin indefinitely on every matching event. Consider using regex (the linear-time engine) for filters where lookahead/lookbehind semantics are not required, or document that patterns must be reviewed for backtracking risk.

Development

Successfully merging this pull request may close these issues.

Suggestion: Regex-based inbound data filters

1 participant