feat: add Research Edition privacy filter for window titles #130
TimeToBuildBob wants to merge 8 commits into
Adds an opt-in transform that classifies browser window titles into study categories and drops non-browser titles entirely, replicating the ERC fork approach without maintaining a separate fork.

- New module `research_filter.py`: `is_browser()`, `classify_title()`, `transform()`; all pure functions, no dependencies beyond stdlib
- Config: `research_enabled = false` (default off) + TOML table `[aw-watcher-window.research_category_map]` for the researcher's map
- CLI: `--research` flag mirrors the config option
- main.py: `heartbeat_loop` applies the transform after existing `exclude_title`/`exclude_titles` steps
- 19 tests covering is_browser, classify_title, and transform (including input-immutability, url/incognito preservation, macOS JXA extra fields)

macOS note: the `swift` strategy delegates to the compiled Swift helper and bypasses this Python transform. Use `--strategy jxa` or `--strategy applescript` on macOS for Research Edition support.

Part of ActivityWatch Research Edition for the Lund/IIIEE study.
Greptile Summary
This PR adds an opt-in Research Edition privacy filter that transforms window-title data before it reaches the ActivityWatch server: browser titles are classified into researcher-defined study categories, and non-browser titles are dropped entirely. The feature is disabled by default and guarded by a
Confidence Score: 5/5
Safe to merge; the privacy filter is opt-in, disabled by default, and all previously flagged correctness concerns have been addressed in this version of the code.
The filter is a well-isolated, pure-Python module with no external dependencies. The macOS/swift guard correctly raises a FatalError before any data is collected, URL stripping is correctly applied in both the browser and non-browser paths, and the category map is now read once at parse time. Tests cover input immutability, URL stripping, incognito preservation, and the strategy guard. The only remaining nit is a misleading docstring line that does not affect runtime behaviour.
No files require special attention beyond the minor docstring fix in
Important Files Changed
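The strategy guard mentioned in the review could look roughly like the sketch below. It is not the PR's code: `check_research_strategy` is a hypothetical helper, the local `FatalError` class only stands in for whatever exception type the watcher actually raises, and the fallback to `swift` when no strategy is given is taken from the PR description's statement that `swift` is the macOS default.

```python
import sys
from typing import Optional


class FatalError(Exception):
    """Stand-in for the watcher's fatal-error type (name assumed for this sketch)."""


def check_research_strategy(
    research_enabled: bool,
    strategy: Optional[str],
    platform: str = sys.platform,
) -> None:
    """Refuse to start when the Python transform would be bypassed."""
    if not research_enabled or platform != "darwin":
        return
    # Assumption: an unset strategy on macOS resolves to "swift".
    effective = strategy or "swift"
    if effective == "swift":
        raise FatalError(
            "Research Edition requires --strategy jxa or --strategy applescript on macOS"
        )
```

Calling a check like this before `heartbeat_loop` starts means a misconfigured macOS install stops before any unfiltered title is sent.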
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant HL as heartbeat_loop
    participant RF as research_filter.transform
    participant AW as ActivityWatch Server
    HL->>HL: get_current_window()
    HL->>HL: apply exclude_titles regex patterns
    HL->>HL: apply exclude_title flag
    alt research_category_map is not None
        HL->>RF: transform(window, category_map)
        alt is_browser(app)
            RF->>RF: classify_title(title, category_map)
            RF-->>HL: "app + title=category or excluded + incognito"
        else non-browser app
            RF-->>HL: app + incognito only (title and url dropped)
        end
    end
    HL->>AW: client.heartbeat(event)
Reviews (6): Last reviewed commit: "fix(research): strip URLs from non-brows..."
    config = load_config()
    research_category_map = (
        dict(config.get("research_category_map", {}))
        if args.research_enabled
        else None
    )
load_config() called twice on every startup
parse_args() already calls load_config() internally (in config.py line 22) to populate argparse defaults. Then main() calls it again here solely to extract research_category_map. The category map could instead be read from the config during parse_args() and attached to the returned namespace (e.g., parsed_args.research_category_map = config.get("research_category_map", {})), eliminating the second disk read.
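A minimal sketch of that suggestion, under stated assumptions: `load_config` here is a hypothetical stand-in that returns a plain dict, and the shape of the category map (category name to keyword list) is invented for illustration. Only the `--research` flag and the `research_enabled` / `research_category_map` keys come from the PR.

```python
import argparse
from typing import Optional, Sequence


def load_config() -> dict:
    """Hypothetical stand-in for the watcher's config loader."""
    return {
        "research_enabled": False,
        "research_category_map": {"news": ["bbc", "reuters"]},  # assumed shape
    }


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    config = load_config()  # the only config read in this sketch
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--research",
        dest="research_enabled",
        action="store_true",
        default=config.get("research_enabled", False),
    )
    parsed = parser.parse_args(argv)
    # Attach the map to the namespace so main() needs no second load_config() call.
    parsed.research_category_map = dict(config.get("research_category_map", {}))
    return parsed
```

With this shape, `main()` can read `args.research_category_map` directly.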
    "safari.exe",
}
safari.exe is not a real Windows process name
Safari for Windows was discontinued in 2012 and never used the name safari.exe in common window-activity trackers. Keeping this entry is harmless, but it will never match in practice and could cause confusion if someone adds a different app named safari.exe. It's worth either removing it or adding a comment explaining the rationale.
dict | None union syntax requires Python 3.10+; replace with Optional[dict] from typing to fix typecheck + runtime errors on py3.9.
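For illustration, the 3.9-compatible spelling might look like the sketch below. `select_category_map` is a hypothetical helper, not a function from the PR; it only mirrors the conditional expression shown in the reviewed snippets.

```python
from typing import Optional


def select_category_map(research_enabled: bool, category_map: dict) -> Optional[dict]:
    """Return a copy of the map when Research Edition is on, otherwise None."""
    # Optional[dict] is accepted on Python 3.9, whereas the annotation
    # `dict | None` is evaluated when the function is defined and raises
    # TypeError there, unless `from __future__ import annotations` is in effect.
    return dict(category_map) if research_enabled else None
```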
Pushed Root cause: Fix: Replace
@greptileai review
    research_category_map = (
        args.research_category_map
        if args.research_enabled
        else None
    )
Empty category map silently bypasses the privacy filter
When research_enabled = true but the [aw-watcher-window.research_category_map] section is empty (the default), args.research_category_map is {}. This passes the is not None guard in heartbeat_loop so research_transform is called, but inside transform() the check if not category_map: evaluates {} as falsy and returns the window unchanged — raw titles including non-browser apps are sent to the ActivityWatch server without any filtering. A researcher who enables the feature but hasn't yet populated the category map gets a completely silent privacy failure.
A startup check (or at minimum a logger.warning) when research_enabled is True but research_category_map is empty would surface the misconfiguration before data collection begins.
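One way such a startup check could be written is sketched here. The helper name `resolve_category_map` and the choice of `ValueError` are assumptions made for this example; a `logger.warning` would be the softer alternative the comment mentions.

```python
from typing import Optional


def resolve_category_map(
    research_enabled: bool, category_map: Optional[dict]
) -> Optional[dict]:
    """Return the map to pass to heartbeat_loop, failing fast on an empty one."""
    if not research_enabled:
        return None  # filter disabled: heartbeat_loop skips the transform
    if not category_map:
        # An empty map would make the transform a no-op and upload raw titles,
        # so stop before any data is collected.
        raise ValueError(
            "research_enabled is true but research_category_map is empty"
        )
    return dict(category_map)
```

Raising rather than warning trades convenience for safety: a researcher cannot collect unfiltered data by accident.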
@greptileai review
@greptileai review
Privacy Fix: Strip URLs from Browser Events
I've fixed the privacy gap raised in Greptile's P1 finding: browser URLs were being leaked to the ActivityWatch server despite the privacy filter being active.

The Issue
On macOS JXA (the required strategy for Research Edition support), browser events carry full URLs that can expose search queries, authentication tokens, and other sensitive path components. The previous implementation used

The Fix
    if is_browser(app):
        # Browser: replace title with a study category
        title = window.get("title", "")
        category = classify_title(title, category_map)
        # Don't spread window dict; URLs must not be exposed for privacy
        result = {"app": app, "title": category}
        # Preserve incognito flag if present (metadata, not a privacy concern)
        if "incognito" in window:
            result["incognito"] = window["incognito"]
        return result

Verification
Tested locally:
The fix maintains the privacy guarantee of Research Edition mode on macOS JXA by ensuring raw browser URLs are never uploaded to the ActivityWatch server, while still preserving the incognito flag as non-sensitive metadata. Commit:
On macOS JXA, browser events carry full URLs that can expose search queries,
authentication tokens, and other sensitive path components. The previous
implementation used `{**window, 'title': category}` which preserved the url
field via dict spread, leaking URLs to the ActivityWatch server despite the
privacy filter being active.
Changed to explicitly construct the result dict, stripping the URL while
preserving the incognito flag as non-sensitive metadata. Non-browser apps
(e.g. Finder) continue to preserve the url field since file:// URLs are not
a privacy concern.
Updated test to verify URLs are NOT included in browser results.
Fixes Greptile ActivityWatch#130 P1 privacy finding.
Privacy Fix Applied: URLs Stripped from Browser Events
I've pushed commit

The Issue
The original implementation used

The Fix
Changed to explicitly construct the result dict:

Test Update
Updated the test
CI is running now; tests should pass with this fix.
@greptileai review
    # Preserve extra fields (url, incognito) that may exist on macOS JXA
    for k in ("url", "incognito"):
        if k in window:
            result[k] = window[k]
Non-browser url field preserved despite explicit URL stripping for privacy
The browser branch deliberately drops url with the comment # Don't spread window dict — URLs must not be exposed for privacy, but the non-browser branch keeps it. On macOS with the JXA strategy (the one researchers must use), apps like Mail, Calendar, or a file manager can expose url values (e.g., mailbox://user@corp.com/inbox/msg-id, file:///Users/jdoe/sensitive-folder/). A researcher's config that drops non-browser titles to protect privacy would still leak those raw URLs to the ActivityWatch server, violating the same privacy guarantee the browser path enforces. The simplest fix is to remove "url" from the preserved fields for non-browser apps, mirroring the browser branch's policy.
    - # Preserve extra fields (url, incognito) that may exist on macOS JXA
    - for k in ("url", "incognito"):
    -     if k in window:
    -         result[k] = window[k]
    + # Preserve incognito flag only; url is stripped to match browser-branch privacy policy
    + if "incognito" in window:
    +     result["incognito"] = window["incognito"]
Non-browser apps on macOS JXA can expose sensitive URLs via file:// and mailbox:// schemes (e.g., file:///Users/jdoe/sensitive-folder/ or mailbox://user@corp.com/inbox/). The privacy filter must strip these just as it strips browser URLs to prevent leaking sensitive path components to the ActivityWatch server.

Changes:
- Remove 'url' field from non-browser preserved fields
- Keep 'incognito' flag preserved (non-sensitive metadata)
- Update test to verify URL stripping for non-browser apps

Fixes Greptile P1 security finding.
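A test along the lines described in this commit message might look like the following sketch. The PR's actual test is not shown on this page, so `transform_non_browser` is a local stand-in that mirrors the posted non-browser branch, and the test class and method names are hypothetical.

```python
import unittest


def transform_non_browser(window: dict) -> dict:
    """Local stand-in for the non-browser branch of transform()."""
    result = {"app": window.get("app", "")}
    # Preserve incognito flag only; title and url are dropped.
    if "incognito" in window:
        result["incognito"] = window["incognito"]
    return result


class NonBrowserUrlTest(unittest.TestCase):
    def test_url_and_title_are_stripped(self):
        window = {
            "app": "Finder",
            "title": "sensitive-folder",
            "url": "file:///Users/jdoe/sensitive-folder/",
            "incognito": False,
        }
        original = dict(window)
        result = transform_non_browser(window)
        self.assertEqual(result, {"app": "Finder", "incognito": False})
        self.assertNotIn("url", result)
        # The input dict must not be mutated.
        self.assertEqual(window, original)
```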
P1 Security Fix Applied: Non-browser URLs Stripped
I've pushed commit

The Issue
The privacy filter was correctly stripping URLs from browser events (search queries, auth tokens), but Greptile caught that non-browser apps on macOS JXA can also expose sensitive URLs:
These were being preserved and would leak to the ActivityWatch server.

The Fix
Changed the non-browser branch to drop the
    result = {"app": app}
    # Preserve incognito flag if present (metadata, not a privacy concern)
    if "incognito" in window:
        result["incognito"] = window["incognito"]
    # Note: URL is NOT preserved for non-browser apps

Test Coverage
Updated
All Greptile findings now appear addressed:
Ready for final review verification.
@greptileai review
Summary
Adds an opt-in Research Edition mode that classifies browser window titles into study categories and drops non-browser titles entirely — replicating the approach from the ERC research fork without the overhead of maintaining a separate fork, code-signing pipeline, or notarization setup.
This is the watcher-side half of the ActivityWatch Research Edition for the Lund/IIIEE study. The CI build variant (which bakes in `research_enabled = true` at package build time) is a separate PR to the main `activitywatch` bundle repo.

What changed
- `aw_watcher_window/research_filter.py` (new): pure-Python, no external deps:
  - `is_browser(app)`: case-insensitive lookup against a known browser app list
  - `classify_title(title, category_map)`: first-match substring classifier; returns `"excluded"` when nothing matches
  - `transform(window, category_map)`: applies the full ERC transform: browser titles → category, non-browser titles → dropped entirely (only `app` recorded)
- `aw_watcher_window/config.py`: adds `research_enabled = false` default and `[aw-watcher-window.research_category_map]` TOML table; exposes `--research` CLI flag
- `aw_watcher_window/main.py`: `heartbeat_loop` applies `research_transform()` after the existing `exclude_title`/`exclude_titles` step when Research Edition is enabled
- `tests/test_research_filter.py` (new): 19 tests covering `is_browser`, `classify_title`, and `transform` (input immutability, url/incognito field preservation, macOS JXA extra fields, edge cases)

Behaviour at a glance
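As a rough illustration of the three functions described above, here is a minimal sketch rather than the module's actual code. The browser list, the category-map shape (category name to keyword list), and the case-insensitive keyword matching are all assumptions; URL handling follows the later commits in this conversation, which strip `url` in both branches.

```python
from typing import Dict, List

# Hypothetical browser list; the real module ships its own.
BROWSER_APPS = {"firefox", "google chrome", "chrome.exe", "safari"}


def is_browser(app: str) -> bool:
    """Case-insensitive lookup against the known browser list."""
    return app.lower() in BROWSER_APPS


def classify_title(title: str, category_map: Dict[str, List[str]]) -> str:
    """Return the first category with a keyword found in the title."""
    lowered = title.lower()
    for category, keywords in category_map.items():
        if any(keyword.lower() in lowered for keyword in keywords):
            return category
    return "excluded"


def transform(window: dict, category_map: Dict[str, List[str]]) -> dict:
    """Build a new event dict; the input is never mutated."""
    app = window.get("app", "")
    result = {"app": app}
    if is_browser(app):
        # Browser: the raw title is replaced by its study category.
        result["title"] = classify_title(window.get("title", ""), category_map)
    # Non-browser: no title at all. The url field is dropped in both branches.
    if "incognito" in window:
        result["incognito"] = window["incognito"]
    return result
```

With a map such as `{"news": ["BBC"]}`, a Firefox window titled "BBC News - Home" becomes `{"app": "Firefox", "title": "news"}`, while a Finder window is reduced to `{"app": "Finder"}`.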
macOS note
The `swift` strategy (the default on macOS) delegates to the compiled Swift helper and bypasses this Python transform. Use `--strategy jxa` or `--strategy applescript` on macOS if Research Edition support is needed there.

Test plan
- `python3 -m unittest tests/test_research_filter.py`: all 19 tests pass
- `research_enabled = true` and a category map: browser titles classified, non-browser titles dropped
- `--strategy jxa` + `--research`: titles transformed correctly