-
Notifications
You must be signed in to change notification settings - Fork 39
tests: add Playwright E2E tests with screenshot golden testing #115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Add E2E tests for all 8 MCP server examples - Screenshot golden images for visual regression testing - CI workflow for running E2E tests - npm scripts: test:e2e, test:e2e:update, test:e2e:ui 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
commit: |
When no code is provided, show a rotating green cube demo. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Set minimal read-only permissions for security best practices. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Playwright requires setTimeout to be called within a test or describe block. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
- Remove explicit setTimeout calls (default 30s is sufficient) - Replace waitForTimeout(5000/6000) with proper waitForAppLoad() - Wait for inner iframe visibility instead of fixed delays - Keep only 500ms stabilization for screenshot animations 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Use updateSnapshots: 'missing' in CI to handle cross-platform screenshot differences (macOS vs Linux). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
- Remove -chromium-darwin suffix from snapshot filenames - Configure snapshotPathTemplate for cross-platform compatibility - Increase tolerance to 5% for rendering differences between platforms 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
May help with Playwright test discovery issues in CI. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
- Use npm ci for exact package versions - Use --reporter=list instead of html to avoid potential re-evaluation issues - Only upload test-results on failure 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Run bun test only on src/ directory to avoid running Playwright tests with Bun's test runner. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Test that clicking buttons in the MCP App triggers the corresponding host callbacks: - Send Message → host logs "Message from MCP App" - Send Log → host logs "Log message from MCP App" - Open Link → host logs "Open link request" Tests both React and Vanilla JS implementations. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
|
I will take a closer look tomorrow (sorry!), but how does this handle examples that have dynamic output? For example, the |
Add Playwright mask option to handle servers with dynamic/random content: - basic-react/vanillajs: mask server time display - system-monitor: mask CPU chart, memory stats, uptime - cohort-heatmap: mask heatmap grid (random data) - customer-segmentation: mask scatter chart (random data) This addresses PR feedback about handling examples with non-deterministic output. Masking replaces the previous 10% tolerance with proper exclusion of dynamic elements, allowing tighter 1% tolerance for the rest of the UI. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
|
@jonathanhefner it was using 10% pixel tolerance, i've now tigthened it to 1% by using per-test masks for variable stuff (they show up in purple) |
jonathanhefner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool idea! 😎
CONTRIBUTING.md
Outdated
| npm run test:e2e:update -- --grep "Three.js" | ||
| ``` | ||
|
|
||
| **Note**: Golden screenshots are platform-specific (e.g., `*-chromium-darwin.png` for macOS). CI runs on Linux, so you may need to update screenshots in both environments or run in a container. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having platform-specific golden screenshots, what do you think about comparing screenshots of the current commit with the last known good commit? That way, both screenshots would be using the same OS, browser, installed fonts, etc.
It might also allow dropping the masks in some cases, e.g., "Hostname" and "Platform" for the system-monitor example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry already made them platform agnostic will update the text
playwright.config.ts
Outdated
| fullyParallel: false, // Run tests sequentially to share server | ||
| forbidOnly: !!process.env.CI, | ||
| retries: process.env.CI ? 2 : 0, | ||
| workers: 1, // Single worker since we share the server |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means all E2E tests are run sequentially (not in parallel), right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah yes. no real point to avoid (host) server interaction, and tests are independent (different mcp servers). Not sure if they'd use distinct headless browsers, will experiment.
| const SERVERS = [ | ||
| { key: "basic-react", index: 0, name: "Basic MCP App Server (React-based)" }, | ||
| { | ||
| key: "basic-vanillajs", | ||
| index: 1, | ||
| name: "Basic MCP App Server (Vanilla JS)", | ||
| }, | ||
| { key: "budget-allocator", index: 2, name: "Budget Allocator Server" }, | ||
| { key: "cohort-heatmap", index: 3, name: "Cohort Heatmap Server" }, | ||
| { | ||
| key: "customer-segmentation", | ||
| index: 4, | ||
| name: "Customer Segmentation Server", | ||
| }, | ||
| { key: "scenario-modeler", index: 5, name: "SaaS Scenario Modeler" }, | ||
| { key: "system-monitor", index: 6, name: "System Monitor Server" }, | ||
| { key: "threejs", index: 7, name: "Three.js Server" }, | ||
| ]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly for a v2, what do you think about having each example define its own test/e2e/ directory? That might make it easier for each example to define its own testing parameters. For example, I could add a SEED env variable just for the customer-segmentation example, which would sidestep the issue of random data.
Also, possibly for a v2 or even v3, if we factor out test helpers for each example to use in its own test/e2e/ directory, we could possibly offer those test helpers as part of the Apps SDK so that other developers can test their own apps more easily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this scale I'd rather keep things simple and centralized and one-fits-all (e.g. define same SEED for all - or just get the cue from CI env that the seed needs to be predictible - and have some unified way to disable animations - as in the wiki stuff - etc)
Was hoping the examples don't get complex enough that any of them requires specialized handling. For instance I've added defaults to relevant args schemas (wiki & three.js)
- Add default URL "https://en.wikipedia.org/wiki/Model_Context_Protocol" to wiki-explorer-server's get-first-degree-links tool inputSchema - Update basic-host to automatically populate input field with tool defaults extracted from inputSchema.properties - Add wiki-explorer-server to E2E test suite with dynamic masking for the force-directed graph 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
This exposes the default code and height values in the JSON Schema, allowing basic-host to auto-populate the input field with defaults. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
- Enable fullyParallel with 4 workers locally (2 in CI) - Add 30s timeout per test - Mask threejs canvas for stable screenshots - Update snapshots to reflect default values in input fields 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Same check as in CI to catch issues before push. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Font rendering differs between macOS and Linux, causing ~5% pixel differences. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Summary
.default()for tool schema defaults (wiki-explorer URL, threejs code/height) so basic-host can auto-populate input fieldsChanges
E2E Tests (
tests/e2e/servers.spec.ts)Playwright Config
Tool Schema Improvements
wiki-explorer-server: Default URLhttps://en.wikipedia.org/wiki/Model_Context_Protocolthreejs-server: Default code (rotating cube) and height (400px)basic-host: Auto-populates input field with tool defaults from JSON SchemaTest plan
npm test- Unit tests passnpm run test:e2e- 26 E2E tests pass in ~28s🤖 Generated with Claude Code