feat: add Research Edition privacy filter for window titles #130

Open
TimeToBuildBob wants to merge 8 commits into ActivityWatch:master from TimeToBuildBob:feat/research-edition-filter

Conversation

@TimeToBuildBob

Contributor

Summary

Adds an opt-in Research Edition mode that classifies browser window titles into study categories and drops non-browser titles entirely — replicating the approach from the ERC research fork without the overhead of maintaining a separate fork, code-signing pipeline, or notarization setup.

This is the watcher-side half of the ActivityWatch Research Edition for the Lund/IIIEE study. The CI build variant (which bakes in research_enabled = true at package build time) is a separate PR to the main activitywatch bundle repo.

What changed

  • aw_watcher_window/research_filter.py (new) — pure-Python, no external deps:
    • is_browser(app) — case-insensitive lookup against a known browser app list
    • classify_title(title, category_map) — first-match substring classifier; returns "excluded" when nothing matches
    • transform(window, category_map) — applies the full ERC transform: browser titles → category, non-browser titles → dropped entirely (only app recorded)
  • aw_watcher_window/config.py — adds research_enabled = false default and [aw-watcher-window.research_category_map] TOML table; exposes --research CLI flag
  • aw_watcher_window/main.py: heartbeat_loop applies research_transform() after the existing exclude_title/exclude_titles step when Research Edition is enabled
  • tests/test_research_filter.py (new) — 19 tests covering is_browser, classify_title, and transform (input immutability, url/incognito field preservation, macOS JXA extra fields, edge cases)
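
The two lookup helpers described above might look roughly like the sketch below. It is an illustration under assumptions, not the PR's actual code: the contents of `KNOWN_BROWSERS` are hypothetical, and the case-insensitive substring match is inferred from the example where the key `youtube` matches the title "YouTube - Google Chrome".

```python
from typing import Dict

# Hypothetical browser list; the PR's real list is not shown in this thread.
KNOWN_BROWSERS = {"chrome", "firefox", "safari", "brave", "msedge.exe"}


def is_browser(app: str) -> bool:
    """Case-insensitive membership test against the known-browser set."""
    return app.lower() in KNOWN_BROWSERS


def classify_title(title: str, category_map: Dict[str, str]) -> str:
    """Return the category of the first key found in the title, else "excluded"."""
    lowered = title.lower()
    # dicts keep insertion order, so "first match" follows the order of the map.
    for needle, category in category_map.items():
        if needle.lower() in lowered:
            return category
    return "excluded"
```

Because the first matching key wins, the order of entries in the category map matters when a title contains more than one key.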

Behaviour at a glance

# Normal user — zero behaviour change (research_enabled = false by default)

# Researcher config (~/.config/activitywatch/aw-watcher-window/aw-watcher-window.toml):
# [aw-watcher-window]
# research_enabled = true
#
# [aw-watcher-window.research_category_map]
# youtube = "Entertainment"
# facebook = "Social Media"
# gmail = "Email"

# Browser event: {"app": "Chrome", "title": "YouTube - Google Chrome"}
# → {"app": "Chrome", "title": "Entertainment"}

# Browser event (unmatched): {"app": "Firefox", "title": "Hacker News"}
# → {"app": "Firefox", "title": "excluded"}

# Non-browser event: {"app": "iTerm2", "title": "~/projects/foo"}
# → {"app": "iTerm2"}   (title dropped)

macOS note

The swift strategy (the default on macOS) delegates to the compiled Swift helper and bypasses this Python transform. Use --strategy jxa or --strategy applescript on macOS if Research Edition support is needed there.

Test plan

  • python3 -m unittest tests/test_research_filter.py — all 19 tests pass
  • Normal run with no config change: behaviour unchanged
  • Run with research_enabled = true and a category map: browser titles classified, non-browser titles dropped
  • Run on macOS with --strategy jxa + --research: titles transformed correctly

Adds an opt-in transform that classifies browser window titles into
study categories and drops non-browser titles entirely, replicating the
ERC fork approach without maintaining a separate fork.

- New module `research_filter.py`: `is_browser()`, `classify_title()`,
  `transform()` — all pure functions, no dependencies beyond stdlib
- Config: `research_enabled = false` (default off) + TOML table
  `[aw-watcher-window.research_category_map]` for the researcher's map
- CLI: `--research` flag mirrors the config option
- main.py: `heartbeat_loop` applies the transform after existing
  `exclude_title`/`exclude_titles` steps
- 19 tests covering is_browser, classify_title, and transform (including
  input-immutability, url/incognito preservation, macOS JXA extra fields)

macOS note: the `swift` strategy delegates to the compiled Swift helper
and bypasses this Python transform. Use `--strategy jxa` or
`--strategy applescript` on macOS for Research Edition support.

Part of ActivityWatch Research Edition for the Lund/IIIEE study.
@greptile-apps

greptile-apps Bot commented Jul 2, 2026


Greptile Summary

This PR adds an opt-in Research Edition privacy filter that transforms window-title data before it reaches the ActivityWatch server: browser titles are classified into researcher-defined study categories, and non-browser titles are dropped entirely. The feature is disabled by default and guarded by a FatalError on macOS when the incompatible swift strategy is in use.

  • research_filter.py (new) — pure-Python is_browser, classify_title, and transform functions; URLs are always stripped for privacy, only the incognito flag is forwarded; transform is a no-op when category_map is None.
  • config.py / main.py — adds research_enabled config/CLI flag; research_category_map is attached to args at parse time; heartbeat_loop applies research_transform after the existing exclude-title step when enabled.
  • tests/ — 19 unit tests for the filter plus tests for the macOS strategy guard and config parsing.

Confidence Score: 5/5

Safe to merge; the privacy filter is opt-in, disabled by default, and all previously flagged correctness concerns have been addressed in this version of the code.

The filter is a well-isolated, pure-Python module with no external dependencies. The macOS/swift guard correctly raises a FatalError before any data is collected, URL stripping is correctly applied in both the browser and non-browser paths, and the category map is now read once at parse time. Tests cover input immutability, URL stripping, incognito preservation, and the strategy guard. The only remaining nit is a misleading docstring line that does not affect runtime behaviour.

No files require special attention beyond the minor docstring fix in aw_watcher_window/research_filter.py.

Important Files Changed

| Filename | Overview |
| --- | --- |
| aw_watcher_window/research_filter.py | New privacy-filter module; transform logic is correct and immutable. Docstring for transform() incorrectly states url is preserved for browsers when it is explicitly stripped. |
| aw_watcher_window/main.py | Adds ensure_research_strategy_supported guard and wires research_category_map into heartbeat_loop; ordering relative to exclude_title is correct. |
| aw_watcher_window/config.py | Adds research_enabled default and research_category_map TOML section; attaches category map to parsed_args to avoid a second load_config() call in main. |
| tests/test_research_filter.py | 19 tests covering is_browser, classify_title, and transform including input-immutability, url-stripping, incognito preservation, and macOS JXA field handling. |
| tests/test_main.py | Adds three tests for ensure_research_strategy_supported covering darwin/swift (FatalError), darwin/jxa (allowed), and non-research mode. |
| tests/test_config.py | New test file verifying research_category_map is attached to args and that --no-research overrides a config-enabled flag. |
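
The strategy guard mentioned for main.py could be sketched as follows. The function name comes from the review; its signature, the error message, and the local `FatalError` class are assumptions made so the example runs on its own.

```python
import sys


class FatalError(Exception):
    """Local stand-in for the watcher's FatalError (assumed, so the sketch is self-contained)."""


def ensure_research_strategy_supported(
    research_enabled: bool, strategy: str, platform: str = sys.platform
) -> None:
    """Refuse to start when research mode is combined with the macOS swift strategy."""
    if research_enabled and platform == "darwin" and strategy == "swift":
        # The swift helper bypasses the Python transform, so raw titles would be sent.
        raise FatalError(
            "Research Edition is not supported with --strategy swift; "
            "use --strategy jxa or --strategy applescript"
        )
```

Failing before the heartbeat loop starts means no unfiltered event is ever sent in the unsupported configuration.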

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant HL as heartbeat_loop
    participant RF as research_filter.transform
    participant AW as ActivityWatch Server

    HL->>HL: get_current_window()
    HL->>HL: apply exclude_titles regex patterns
    HL->>HL: apply exclude_title flag

    alt research_category_map is not None
        HL->>RF: transform(window, category_map)
        alt is_browser(app)
            RF->>RF: classify_title(title, category_map)
            RF-->>HL: "app + title=category or excluded + incognito"
        else non-browser app
            RF-->>HL: app + incognito only (title and url dropped)
        end
    end

    HL->>AW: client.heartbeat(event)
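
The ordering in the diagram can be sketched as a small function. This is a simplified illustration, not the body of heartbeat_loop: the name `apply_filters` and its parameters are hypothetical, and replacing a matched title with "excluded" is an assumption about how the existing exclude steps behave.

```python
import re
from typing import Callable, Dict, List, Optional, Pattern


def apply_filters(
    window: Dict,
    exclude_title: bool,
    exclude_titles: List[Pattern],
    category_map: Optional[Dict[str, str]],
    research_transform: Callable[[Dict, Dict[str, str]], Dict],
) -> Dict:
    """Apply the existing exclude steps first, then the research transform."""
    window = dict(window)  # work on a copy so the caller's dict is untouched
    # Step 1: regex-based exclusion of individual titles.
    for pattern in exclude_titles:
        if pattern.search(window.get("title", "")):
            window["title"] = "excluded"
    # Step 2: blanket exclusion flag.
    if exclude_title:
        window["title"] = "excluded"
    # Step 3: research transform, only when a category map was supplied.
    if category_map is not None:
        window = research_transform(window, category_map)
    return window
```

Running the research transform last means it always sees titles that earlier exclusion rules have already redacted.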

Reviews (6): Last reviewed commit: "fix(research): strip URLs from non-brows..." | Re-trigger Greptile

Comment thread aw_watcher_window/main.py Outdated
Comment on lines +102 to +107
config = load_config()
research_category_map = (
    dict(config.get("research_category_map", {}))
    if args.research_enabled
    else None
)

P2 load_config() called twice on every startup

parse_args() already calls load_config() internally (in config.py line 22) to populate argparse defaults. Then main() calls it again here solely to extract research_category_map. The category map could instead be read from the config during parse_args() and attached to the returned namespace (e.g., parsed_args.research_category_map = config.get("research_category_map", {})), eliminating the second disk read.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
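
A minimal sketch of the suggested approach: read the config once inside parse_args() and attach the map to the returned namespace. The `load_config()` stub and its return value are hypothetical stand-ins for the watcher's real loader; the `--research` and `--no-research` flags are the ones named in this PR, but how they are actually wired up is assumed.

```python
import argparse
from typing import Any, Dict, List, Optional


def load_config() -> Dict[str, Any]:
    """Hypothetical stand-in for the real loader, which would read TOML from disk."""
    return {
        "research_enabled": True,
        "research_category_map": {"youtube": "Entertainment"},
    }


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    config = load_config()  # the only config read on startup
    parser = argparse.ArgumentParser()
    parser.add_argument("--research", dest="research_enabled", action="store_true")
    parser.add_argument("--no-research", dest="research_enabled", action="store_false")
    # The config value is the default; either CLI flag overrides it.
    parser.set_defaults(research_enabled=config.get("research_enabled", False))
    parsed = parser.parse_args(argv)
    # Attach the map so main() does not need a second load_config() call.
    parsed.research_category_map = dict(config.get("research_category_map", {}))
    return parsed
```

With this shape, main() can read `args.research_category_map` directly instead of loading the config again.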

Comment thread aw_watcher_window/research_filter.py Outdated
Comment on lines +37 to +38
"safari.exe",
}

P2 safari.exe is not a real Windows process name

Safari for Windows was discontinued in 2012 and never used the name safari.exe in common window-activity trackers. Keeping this entry is harmless, but it will never match in practice and could cause confusion if someone adds a different app named safari.exe. It's worth either removing it or adding a comment explaining the rationale.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

dict | None union syntax requires Python 3.10+; replace with
Optional[dict] from typing to fix typecheck + runtime errors on py3.9.
@TimeToBuildBob

Contributor Author

Pushed ef488e8 to fix the CI failures.

Root cause: dict | None union syntax (PEP 604) requires Python 3.10+, but CI runs Python 3.9. The transform() function in research_filter.py used this syntax, causing both a mypy typecheck error and a TypeError at import time.

Fix: Replace dict | None with Optional[dict] from typing — compatible with Python 3.9+.
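
The difference can be reproduced in isolation. The sketch below is illustrative only: `pep604_union_works` and `transform_stub` are hypothetical names, and the stub shows just the portable annotation, not the real transform body.

```python
import sys
from typing import Optional


def pep604_union_works() -> bool:
    """Evaluate `dict | None` the way an eagerly evaluated annotation would."""
    try:
        _ = dict | None  # raises TypeError on Python 3.9 and earlier
        return True
    except TypeError:
        return False


def transform_stub(window: dict, category_map: Optional[dict] = None) -> dict:
    """Hypothetical stand-in: Optional[dict] imports cleanly on Python 3.9 as well."""
    return window if category_map is None else dict(window)
```

On Python 3.9 `pep604_union_works()` returns False, which is why the annotation failed as soon as the module was imported rather than only under mypy.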

@TimeToBuildBob

Contributor Author

@greptileai review

Comment thread tests/test_main.py Fixed
Comment thread aw_watcher_window/main.py
Comment on lines +116 to +120
research_category_map = (
    args.research_category_map
    if args.research_enabled
    else None
)

P1 Empty category map silently bypasses the privacy filter

When research_enabled = true but the [aw-watcher-window.research_category_map] section is empty (the default), args.research_category_map is {}. This passes the is not None guard in heartbeat_loop so research_transform is called, but inside transform() the check if not category_map: evaluates {} as falsy and returns the window unchanged — raw titles including non-browser apps are sent to the ActivityWatch server without any filtering. A researcher who enables the feature but hasn't yet populated the category map gets a completely silent privacy failure.

A startup check (or at minimum a logger.warning) when research_enabled is True but research_category_map is empty would surface the misconfiguration before data collection begins.
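
One way such a startup check could look is sketched below. The function name, return value, and message wording are assumptions; this thread does not show how the PR ultimately handled the empty-map case.

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger(__name__)


def check_research_config(
    research_enabled: bool, category_map: Optional[Dict[str, str]]
) -> bool:
    """Return False (and warn) when research mode is on but no categories are defined."""
    if research_enabled and not category_map:
        logger.warning(
            "research_enabled is true but research_category_map is empty; "
            "no browser title can be classified"
        )
        return False
    return True
```

A caller could treat a False result as fatal instead of a warning if silent misconfiguration is considered unacceptable for a study.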

@TimeToBuildBob

Contributor Author

@greptileai review

@TimeToBuildBob

Contributor Author

@greptileai review

@TimeToBuildBob

Contributor Author

Privacy Fix: Strip URLs from Browser Events

I've fixed the privacy gap raised in Greptile's P1 finding: browser URLs were being leaked to the ActivityWatch server despite the privacy filter being active.

The Issue

On macOS JXA (the required strategy for Research Edition support), browser events carry full URLs that can expose search queries, authentication tokens, and other sensitive path components. The previous implementation used {**window, "title": category} which preserved the url field via dict spread.

The Fix

if is_browser(app):
    # Browser: replace title with a study category
    title = window.get("title", "")
    category = classify_title(title, category_map)
    # Don't spread window dict — URLs must not be exposed for privacy
    result = {"app": app, "title": category}
    # Preserve incognito flag if present (metadata, not a privacy concern)
    if "incognito" in window:
        result["incognito"] = window["incognito"]
    return result

Verification

Tested locally:

  • Input: {'app': 'chrome', 'title': 'Google Search', 'url': 'https://google.com/search?q=secret', 'incognito': True}
  • Output: {'app': 'chrome', 'title': 'searching', 'incognito': True}
  • ✅ URL is NOT leaked
  • ✅ Incognito flag is preserved as useful metadata
  • ✅ Title is correctly classified

The fix maintains the privacy guarantee of Research Edition mode on macOS JXA by ensuring raw browser URLs are never uploaded to the ActivityWatch server, while still preserving the incognito flag as non-sensitive metadata.

Commit: c75c6b3 (local, ready to push when auth is configured)

On macOS JXA, browser events carry full URLs that can expose search queries,
authentication tokens, and other sensitive path components. The previous
implementation used `{**window, 'title': category}` which preserved the url
field via dict spread, leaking URLs to the ActivityWatch server despite the
privacy filter being active.

Changed to explicitly construct the result dict, stripping the URL while
preserving the incognito flag as non-sensitive metadata. Non-browser apps
(e.g. Finder) continue to preserve the url field since file:// URLs are not
a privacy concern.

Updated test to verify URLs are NOT included in browser results.

Fixes Greptile ActivityWatch#130 P1 privacy finding.
@TimeToBuildBob

Contributor Author

Privacy Fix Applied: URLs Stripped from Browser Events

I've pushed commit 6e6f3c1 that addresses the P1 privacy finding from Greptile review.

The Issue

The original implementation used {**window, "title": category} which preserved the entire window dict via spread operator. On macOS JXA (the required strategy for Research Edition), this leaked full browser URLs to the ActivityWatch server — exposing search queries, authentication tokens, and other sensitive path components.

The Fix

Changed to explicitly construct the result dict:

  • Browsers: Only include app and title (classified category)
  • Preserved: incognito flag (non-sensitive metadata)
  • Stripped: url field (privacy-critical for browsers)
  • Non-browsers: Continue to preserve url (e.g., file:// URLs are not a privacy concern)

Test Update

Renamed the test test_browser_preserves_url_and_incognito to test_browser_strips_url_but_preserves_incognito and updated it to verify that URLs are NOT included in browser results.

CI is running now — tests should pass with this fix.

@ErikBjare

Member

@greptileai review

Comment thread aw_watcher_window/research_filter.py Outdated
Comment on lines +101 to +104
# Preserve extra fields (url, incognito) that may exist on macOS JXA
for k in ("url", "incognito"):
    if k in window:
        result[k] = window[k]

P1 security Non-browser url field preserved despite explicit URL stripping for privacy

The browser branch deliberately drops url with the comment # Don't spread window dict — URLs must not be exposed for privacy, but the non-browser branch keeps it. On macOS with the JXA strategy (the one researchers must use), apps like Mail, Calendar, or a file manager can expose url values (e.g., mailbox://user@corp.com/inbox/msg-id, file:///Users/jdoe/sensitive-folder/). A researcher's config that drops non-browser titles to protect privacy would still leak those raw URLs to the ActivityWatch server, violating the same privacy guarantee the browser path enforces. The simplest fix is to remove "url" from the preserved fields for non-browser apps, mirroring the browser branch's policy.

Suggested change
-    # Preserve extra fields (url, incognito) that may exist on macOS JXA
-    for k in ("url", "incognito"):
-        if k in window:
-            result[k] = window[k]
+    # Preserve incognito flag only; url is stripped to match browser-branch privacy policy
+    if "incognito" in window:
+        result["incognito"] = window["incognito"]

Non-browser apps on macOS JXA can expose sensitive URLs via file:// and
mailbox:// schemes (e.g., file:///Users/jdoe/sensitive-folder/ or
mailbox://user@corp.com/inbox/). The privacy filter must strip these just as
it strips browser URLs to prevent leaking sensitive path components to the
ActivityWatch server.

Changes:
- Remove 'url' field from non-browser preserved fields
- Keep 'incognito' flag preserved (non-sensitive metadata)
- Update test to verify URL stripping for non-browser apps

Fixes Greptile P1 security finding.
@TimeToBuildBob

Contributor Author

P1 Security Fix Applied: Non-browser URLs Stripped

I've pushed commit 97f32b26 addressing the P1 security finding from Greptile's latest review.

The Issue

The privacy filter was correctly stripping URLs from browser events (search queries, auth tokens), but Greptile caught that non-browser apps on macOS JXA can also expose sensitive URLs:

  • file:///Users/jdoe/sensitive-folder/ (file manager)
  • mailbox://user@corp.com/inbox/msg-id (Mail app)

These were being preserved and would leak to the ActivityWatch server.

The Fix

Changed the non-browser branch to drop the url field entirely (matching the browser privacy policy):

result = {"app": app}
# Preserve incognito flag if present (metadata, not a privacy concern)
if "incognito" in window:
    result["incognito"] = window["incognito"]
# Note: URL is NOT preserved for non-browser apps

Test Coverage

Updated test_non_browser_drops_title_and_url to verify URLs are correctly stripped for non-browser apps.
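
Putting both branches together, the transform after this fix might look like the sketch below. It is assembled from the snippets quoted in this thread, with a hypothetical `KNOWN_BROWSERS` set and an inlined classifier; the real module may differ in detail.

```python
from typing import Dict, Optional

# Hypothetical browser list; the PR's real list is not shown in this thread.
KNOWN_BROWSERS = {"chrome", "firefox", "safari"}


def transform(window: Dict, category_map: Optional[Dict[str, str]]) -> Dict:
    """Return a new, privacy-filtered event; the input dict is never mutated."""
    if category_map is None:
        return window  # research mode is off
    app = window.get("app", "")
    if app.lower() in KNOWN_BROWSERS:
        # Browser: replace the title with the first matching study category.
        title = window.get("title", "").lower()
        category = next(
            (cat for key, cat in category_map.items() if key.lower() in title),
            "excluded",
        )
        result = {"app": app, "title": category}
    else:
        # Non-browser: keep only the app name.
        result = {"app": app}
    # The url field is never copied; incognito is kept as non-sensitive metadata.
    if "incognito" in window:
        result["incognito"] = window["incognito"]
    return result
```

Building the result from an explicit allow-list of fields, rather than copying the window dict and deleting keys, means any field added by a future strategy is dropped by default.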

All Greptile findings now appear addressed:

  • ✅ Non-browser URL stripping (P1) — fixed
  • ✅ Empty category_map handling (P1) — fixed + tested
  • ✅ safari.exe removal (P2) — fixed
  • ✅ load_config() call efficiency (P2) — fixed

Ready for final review verification.

@TimeToBuildBob

Contributor Author

@greptileai review


3 participants