Testing System Architecture
The toolkit includes 10 specialized testing agents organized into layers. An orchestrator routes to unit test specialists, E2E testers, and QA agents—each optimized for different testing needs.
Testing Agent Hierarchy
The testing system is organized into four specialized layers, coordinated by Builder — the agent you interact with. Builder automatically delegates to the tester orchestrator, which routes to specialists based on file patterns.
User-Facing Layer
builder: You interact with Builder; it delegates testing automatically
Orchestrator Layer
tester: Routes to specialists based on file patterns
Unit Test Specialists
jest-tester: Backend JS/TS
react-tester: React + RTL
go-tester: Go testify
E2E Testing Layer
ui-tester-playwright: Writes E2E tests
ui-test-reviewer: Identifies test gaps
ui-test-full-app-auditor: Full app audits
QA / Adversarial Layer
qa: Coordinates testing
qa-explorer: Finds bugs adversarially
qa-browser-tester: Bug → regression test
How the Tester Orchestrator Works
The tester agent is the central coordinator for all test generation. It receives test requests, analyzes which files need testing, and delegates to the appropriate specialist agents.
The Orchestration Process
1. Receive Test Request
The orchestrator receives a request to generate tests—either from a workflow automation or a direct user request.
2. Analyze Files
Examines file extensions, import patterns, and project structure to determine which specialist to invoke.
3. Route to Specialists
Delegates test generation to the appropriate agent(s) based on file patterns and project configuration.
4. Run & Verify
Executes the generated tests and handles failures through the retry loop (up to 3 attempts).
Routing Logic
The orchestrator uses file patterns to determine which specialist agent handles each file:
| File Pattern | Routes To | Test Framework |
|---|---|---|
| *.tsx, *.jsx | react-tester | React Testing Library + Jest |
| *.ts, *.js (backend) | jest-tester | Jest |
| *.go | go-tester | Go testing + testify |
Note: Python files (*.py) do not yet have a dedicated specialist agent. The orchestrator will route to a general-purpose agent for Python test generation.
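As a rough sketch, the extension-based routing in the table above could look like the following. The `routeFile` helper is hypothetical; the real orchestrator also weighs import patterns and project structure before choosing a specialist:

```typescript
// Hypothetical sketch of the orchestrator's extension-based routing.
// The real agent also inspects import patterns and project structure.
type Specialist = "react-tester" | "jest-tester" | "go-tester" | "general-purpose";

function routeFile(path: string): Specialist {
  if (/\.(tsx|jsx)$/.test(path)) return "react-tester"; // React components
  if (/\.(ts|js)$/.test(path)) return "jest-tester";    // backend JS/TS
  if (/\.go$/.test(path)) return "go-tester";           // Go packages
  return "general-purpose"; // no dedicated specialist yet (e.g. *.py)
}
```

For example, `routeFile("src/Button.tsx")` resolves to `react-tester`, while a Python file falls through to the general-purpose agent, matching the note above.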
Test Loop & Retry Handling
When tests fail, the orchestrator attempts to fix and retry—up to 3 times before deferring:
Max 3 attempts: If tests still fail after 3 fix attempts, the orchestrator defers to human review rather than continuing indefinitely. This prevents infinite loops and ensures complex issues get proper attention.
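The retry policy reads as a simple bounded loop. A minimal sketch of the assumed shape (not the toolkit's actual source):

```typescript
// Bounded fix-and-retry loop: at most MAX_ATTEMPTS test runs,
// then the failure is handed to a human instead of looping forever.
const MAX_ATTEMPTS = 3;

function runWithRetry(
  runTests: () => boolean, // hypothetical: executes the generated tests
  attemptFix: () => void   // hypothetical: asks the specialist to fix failures
): "passed" | "deferred-to-human" {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    if (runTests()) return "passed";
    if (attempt < MAX_ATTEMPTS) attemptFix(); // no point fixing after the final run
  }
  return "deferred-to-human";
}
```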
Test Failure Output Policy
All testing agents follow a strict output policy to ensure failures are never hidden. This helps you debug issues quickly without missing critical error information.
The Rule
Passing Tests
Can be summarized: "42 tests passed"
Failing Tests
Must show complete output — never truncated. Full stack traces, assertion messages, and context are always preserved.
Why this matters
Truncated failure output is the #1 cause of wasted debugging time. When a test fails, you need to see exactly what went wrong — not a summary that hides the actual error.
Agents with this policy
All 6 testing agents enforce this policy.
Four Operational Modes
The tester orchestrator operates in different modes depending on context. Each mode optimizes for different testing scenarios.
Story Mode
Used during PRD implementation. After each user story is completed, tests are automatically generated to cover the new functionality.
Example:
"When builder completes US-003, tester automatically generates tests for all changed files in that story"
When to use:
Building features from PRDs—tests are generated as part of the workflow after each story completion.
Ad-hoc Mode
For direct test requests outside of PRD workflows. Scopes testing to specific files for fast, targeted coverage.
Example:
"Write tests for UserProfile.tsx" → tester scopes to just that component and routes to react-tester
When to use:
Quick fixes, bug patches, or when you just need tests for specific files without running a full workflow.
Full Suite Mode
Comprehensive test generation and execution across the entire codebase. Identifies coverage gaps and ensures all critical paths are tested.
Example:
"Run all tests with coverage analysis" → tester executes entire suite, identifies untested code, generates coverage report
When to use:
Before major releases, when bootstrapping test coverage, or for periodic quality audits.
Visual Audit Mode
Full-site UX sweeps that systematically check every page for visual inconsistencies, accessibility issues, and design system violations. Can also perform targeted post-fix re-checks after issues are resolved.
Example:
"Run visual audit on all pages" → tester crawls site, captures screenshots, flags inconsistent spacing, broken dark mode, and accessibility violations
When to use:
After major UI refactors, before releases, when verifying design system compliance, or for targeted re-checks after fixing visual bugs.
Automatic Test Activity Selection
The toolkit automatically determines what testing to run by matching changed files against glob patterns in test-activity-rules.json. There are no story-level metadata flags — Planner assigns zero test-related fields to stories. Non-Playwright tests (typecheck, lint, unit tests, critics) run unconditionally. Playwright execution is gated solely by testVerifySettings.
How File Patterns Drive Activity
When files change, the test-flow skill matches each file against glob patterns in test-activity-rules.json to collect unit testers and critics. Cross-cutting rules and code-pattern matching add additional activities. The result is a set of resolved activities — no prompt, no choice.
// Resolved activity structure
{
"baseline": ["typecheck", "lint"],
"unit": ["react-tester", "jest-tester"],
"critics": ["frontend-critic", "security-critic"],
"quality": ["aesthetic-critic"],
"reasoning": ["src/Button.tsx → *.tsx", "Code pattern: useAuth"]
}
Resolution Pipeline
Activities are resolved in four passes. The first three collect non-Playwright work; Playwright is handled separately in the execution step (gated by testVerifySettings).
| Pass | Input | Activities Collected |
|---|---|---|
| File patterns | Each changed file matched against filePatterns globs | Unit testers, critics, quality critics |
| Code patterns | Diff content matched against codePatterns regexes | Additional critics (e.g., security-critic for auth patterns) |
| Cross-cutting | Multiple directories touched, shared modules modified | oddball-critic, dx-critic |
| Hotspots | Files listed in test-debt.json | Extra critics, forced E2E for known-fragile files |
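The file-pattern pass can be sketched as a glob match over changed files. Everything below (the `filePatterns` shape, the rule entries, `resolveFilePass`) is illustrative, not the actual test-activity-rules.json schema:

```typescript
// Illustrative pass 1: match changed files against glob patterns to
// collect unit testers and critics, recording a reasoning trail.
const filePatterns: Record<string, { unit?: string[]; critics?: string[] }> = {
  "*.tsx": { unit: ["react-tester"], critics: ["frontend-critic"] },
  "*.ts":  { unit: ["jest-tester"] },
  "*.go":  { unit: ["go-tester"] },
};

function globToRegex(glob: string): RegExp {
  // Minimal translation: '*' matches any run of non-separator characters.
  return new RegExp("(^|/)" + glob.replace(/\./g, "\\.").replace(/\*/g, "[^/]*") + "$");
}

function resolveFilePass(changedFiles: string[]) {
  const unit = new Set<string>();
  const critics = new Set<string>();
  const reasoning: string[] = [];
  for (const file of changedFiles) {
    for (const [glob, acts] of Object.entries(filePatterns)) {
      if (!globToRegex(glob).test(file)) continue;
      acts.unit?.forEach(u => unit.add(u));
      acts.critics?.forEach(c => critics.add(c));
      reasoning.push(`${file} → ${glob}`);
    }
  }
  return { unit: [...unit], critics: [...critics], reasoning };
}
```

Running this over `["src/Button.tsx", "api/user.ts"]` produces the same kind of `reasoning` entries shown in the resolved-activity structure above, e.g. `"src/Button.tsx → *.tsx"`.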
Execution Order
After resolution, activities execute in a fixed order. Steps 1–4 always run; Step 5 (Playwright) is gated by testVerifySettings.
testVerifySettings: write & run E2E tests
Zero Story Metadata
Planner assigns no test-related fields to stories. Activity resolution is driven entirely by which files change — making it consistent across PRD and ad-hoc work.
Playwright Gating
Playwright is the only activity that can be disabled. The six booleans in testVerifySettings control exactly which invocation points run automatically.
Test Flow Automation
The test-flow skill is a unified test orchestrator (~698 lines) that manages the entire test lifecycle—from activity resolution through execution to failure handling. It loads 5 focused Tier 2 sub-skills on demand (verification loop, prerequisite detection, E2E flow, UI verification, failure handling) and coordinates between Builder and testing agents to ensure quality gates are met in both PRD and ad-hoc modes.
Automatic Test Generation Triggers
Tests are automatically generated at specific points in the workflow, without requiring manual intervention:
After Each Story
In PRD mode, tests are automatically generated after each user story completes. Unit tests run immediately; E2E tests are queued.
On Request
In ad-hoc mode, tests are generated when explicitly requested or after all ad-hoc todos complete. User chooses when to run E2E.
Coverage Gaps
During full suite mode, tests are generated for any untested code paths identified through coverage analysis.
The Fix Loop Mechanism
When tests fail, the test-flow skill automatically attempts to fix the issues—up to 3 times before deferring to human review:
Why 3 attempts? The fix loop allows up to 3 fix attempts per failing test. There is no "save as-is" option: the loop either succeeds within those attempts or stops and reports the failure for human review.
Playwright Gating
Playwright is the only test activity that can be disabled. The six booleans in testVerifySettings control exactly which automated invocation points run. All other test activities (typecheck, lint, unit tests, critics) run unconditionally.
Non-Playwright: Always Run
Typecheck, lint, unit tests, and critics always run when resolved by file patterns. There is no setting to disable them.
Baseline, unit, and critic activities are unconditional
Playwright: Gated by Settings
Each automated Playwright invocation point has a corresponding boolean in testVerifySettings. All default to true.
6 booleans control analysis probes, story tests, and completion tests
User-invoked workflows are never gated. These settings only affect automated Playwright invocations during the build workflow. You can always invoke @qa or @ui-test-full-app-auditor directly regardless of these settings.
Integration with Tester Agent
The test-flow skill acts as a coordinator between the builder workflow and the tester agent hierarchy:
builder → test-flow skill → tester
1. Builder loads test-flow skill in both PRD and ad-hoc modes
2. test-flow determines when to generate/run tests
3. test-flow invokes @tester for test generation
4. tester routes to specialists (react-tester, jest-tester, etc.)
5. test-flow manages pendingTests state in builder-state.json
Architecture-Aware Verification
Before running verification, test-flow detects app architecture from project.json to choose the optimal strategy. Desktop apps always use playwright-electron — never browser-based verification. The webContent field determines whether a rebuild is needed, not whether to use Electron.
| App Type | webContent | Strategy | How Verification Works |
|---|---|---|---|
| frontend / fullstack | n/a | browser | Standard Playwright against dev server (HMR) |
| desktop | bundled | rebuild-then-launch-app | Build → relaunch Electron → verify with Playwright-Electron |
| desktop | remote | ensure-electron-running | Ensure Electron is running (HMR handles changes) → verify with Playwright-Electron |
| desktop | hybrid | rebuild-then-launch-app | Build → relaunch Electron → verify with Playwright-Electron |
| mobile | remote | verify-web-url | Test web URL directly in browser |
| backend / cli | n/a | not-required | No UI verification |
⛔ All desktop strategies use playwright: "electron"
Never use browser-based verification for desktop apps. Even webContent: "remote" (where HMR delivers changes via dev server) requires connecting Playwright to the Electron process. Opening localhost in a browser is not the same as testing inside Electron — Electron has its own window chrome, IPC, and process model.
Pre-check: Before running desktop tests, test-flow performs a zombie process pre-check to clean up any orphaned Electron instances from previous runs.
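The strategy table reduces to a small decision function. A sketch, under the assumption that project.json exposes an app type and a webContent field as described above:

```typescript
// Strategy selection mirroring the verification table above.
// Desktop always ends up on a Playwright-Electron strategy; webContent
// only decides whether a rebuild is needed first.
type Strategy =
  | "browser"
  | "rebuild-then-launch-app"
  | "ensure-electron-running"
  | "verify-web-url"
  | "not-required";

function selectStrategy(appType: string, webContent?: string): Strategy {
  switch (appType) {
    case "frontend":
    case "fullstack":
      return "browser"; // standard Playwright against the dev server (HMR)
    case "desktop":
      // bundled and hybrid need a rebuild; remote relies on HMR
      return webContent === "remote" ? "ensure-electron-running" : "rebuild-then-launch-app";
    case "mobile":
      return "verify-web-url"; // test the web URL directly in a browser
    default:
      return "not-required"; // backend / cli: no UI verification
  }
}
```

Note that no desktop branch ever returns "browser": per the rule above, both desktop strategies connect Playwright to the Electron process.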
Layer Deep Dive
Orchestrator Layer
The tester agent is the main entry point for all testing tasks. It analyzes file patterns and routes to the appropriate specialist:
*.tsx / *.jsx → routes to react-tester
*.ts / *.js (backend) → routes to jest-tester
*.go → routes to go-tester
Unit Test Specialists
Language-specific agents that understand testing idioms and best practices for each stack.
jest-tester: Backend JS/TS testing with Jest. Handles Node.js services, utilities, and API logic.
react-tester: React Testing Library. Tests components, hooks, and UI interactions.
go-tester: Go testing with testify and httptest for handlers and services.
E2E Testing Layer
End-to-end testing using Playwright. Tests complete user flows through the browser.
ui-tester-playwright: Writes Playwright E2E tests for user flows, forms, and critical paths.
ui-test-reviewer: Reviews UI changes and identifies areas needing E2E coverage.
ui-test-full-app-auditor: Comprehensive E2E test audits with 5-retry resilience.
QA / Adversarial Layer
Exploratory testing that finds bugs through adversarial thinking—testing edge cases, unusual inputs, and unexpected user behaviors.
qa: Coordinates exploratory testing sessions and prioritizes what to test.
qa-explorer: Uses browser-use CLI to actively explore the app and find bugs.
qa-browser-tester: Converts bug findings into Playwright regression tests.
Unit Test Specialists
Each specialist agent is optimized for a specific language and testing framework, understanding the idioms and best practices unique to that stack.
jest-tester
Backend JavaScript/TypeScript Testing
Generates comprehensive Jest tests for Node.js services, utilities, API routes, and backend logic. Understands async patterns, mocking strategies, and TypeScript typing.
File Patterns
*.ts, *.js (backend only)
Testing Framework
Jest
Key Capabilities
- Module mocking with jest.mock()
- Spy functions and mock implementations
- Async/await and Promise testing
- Coverage reporting with thresholds
- Snapshot testing for data structures
- Timer mocking (setTimeout, setInterval)
react-tester
React Component Testing
Generates tests for React components using React Testing Library. Focuses on testing user interactions and behavior rather than implementation details.
File Patterns
*.tsx, *.jsx
Testing Framework
React Testing Library + Jest
Key Capabilities
- User event simulation (click, type, etc.)
- Accessible queries (getByRole, getByLabelText)
- Async rendering with waitFor/findBy
- Custom hooks testing with renderHook
- Component snapshot testing
- Context and provider mocking
go-tester
Go Testing
Generates idiomatic Go tests using the standard library testing package, testify assertions, and httptest for HTTP handlers. Follows Go testing conventions and table-driven test patterns.
File Patterns
*.go
Testing Framework
Go testing + testify + httptest
Key Capabilities
- Table-driven tests with subtests
- Testify assertions (assert, require)
- HTTP handler testing with httptest
- Mock interfaces with testify/mock
- Parallel test execution (t.Parallel)
- Benchmark testing support
Project-Specific Overrides
Projects can override global testers with project-specific versions to customize testing behavior, add custom utilities, or enforce project conventions.
How it works
Place a custom tester definition in your project's docs/agents/ directory. The toolkit will use your project version instead of the global agent.
Example: Override jest-tester
your-project/
├── docs/
│ └── agents/
│ └── jest-tester.md ← Overrides global
├── src/
│ └── ...
└── package.json
Common Use Cases
- Custom test utilities or helpers unique to your project
- Project-specific mocking patterns (e.g., mock your auth layer)
- Enforcing team conventions (naming, structure, coverage thresholds)
- Integration with project-specific testing infrastructure
Override Priority
Project agents in docs/agents/ take priority over global toolkit agents. This applies to all agent types, not just testers.
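The lookup order can be sketched as a two-step path resolution. Only docs/agents/ is specified above; the global agents directory and the `exists` predicate are placeholders for whatever lookup the toolkit actually performs:

```typescript
// Project-local agent definitions shadow global toolkit agents.
// `exists` stands in for a real filesystem check; `globalDir` is a
// hypothetical install location, not a documented path.
function resolveAgent(
  name: string,
  exists: (path: string) => boolean,
  globalDir: string
): string | undefined {
  const local = `docs/agents/${name}.md`;
  if (exists(local)) return local; // project override wins
  const fallback = `${globalDir}/${name}.md`;
  return exists(fallback) ? fallback : undefined;
}
```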
E2E Testing System
The toolkit includes a complete end-to-end testing pipeline using Playwright. Specialized agents work together to identify test gaps, write thorough tests, run full browser-based suites, and audit test coverage across your entire application.
E2E Testing Workflow
UI Changes Made
Code changes to components, pages, or user flows
ui-test-reviewer analyzes: Reviews git diff, identifies UI areas needing E2E coverage
e2e-areas.json → ui-tester-playwright writes: Reads manifest, writes comprehensive Playwright tests
Builder runs: Executes Playwright tests, handles failures, and ships
Execution Mode Detection
Before running any E2E tests, the ui-test-flow skill automatically detects whether the project uses Electron desktop testing or standard browser testing, and routes to the correct test runner.
🌐 Browser Mode
- Resolves test base URL
- Checks dev server is running
- Uses standard Playwright config
- Parallel workers supported
🖥️ Electron Mode
- Skips base URL resolution
- Skips dev server checks
- Routes to playwright-electron
- Single worker only (--workers=1)
Detection logic: The skill checks architecture.deployment and apps.*.framework in your project.json. If either indicates Electron, the skill loads the ui-test-electron skill and skips browser-specific setup steps entirely.
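The detection check above can be sketched as a couple of field reads. The field names follow the note; the "electron" sentinel values are assumptions:

```typescript
// Sketch of Electron detection from project.json. The exact values that
// "indicate Electron" are assumed here to be the string "electron".
interface ProjectConfig {
  architecture?: { deployment?: string };
  apps?: Record<string, { framework?: string }>;
}

function isElectronProject(cfg: ProjectConfig): boolean {
  if (cfg.architecture?.deployment === "electron") return true;
  return Object.values(cfg.apps ?? {}).some(app => app.framework === "electron");
}
```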
E2E Testing Agents
ui-test-reviewer
Test Gap Analyzer
Analyzes git diffs to identify UI areas that need E2E test coverage. Creates a structured manifest of test requirements for other agents to consume.
Primary Output
e2e-areas.json
Typically Invoked By
builder, developer, or workflow automation
Key Capabilities
- Git diff analysis for changed UI components
- Identifies user flows affected by changes
- Detects forms, modals, and interactive elements
- Prioritizes critical paths for testing
- Structured JSON manifest generation
- Integration with existing test coverage
ui-tester-playwright
E2E Test Writer
Reads the e2e-areas.json manifest and writes comprehensive Playwright E2E tests for each identified area. Generates tests that cover user flows, forms, and critical paths.
Test Output
*.spec.ts in e2e/ or tests/
Testing Framework
Playwright Test
Key Capabilities
- Page object pattern for maintainability
- Form submission and validation tests
- Navigation and routing verification
- Authentication flow testing
- Visual regression with screenshots
- Multi-browser testing support
ui-test-full-app-auditor
Comprehensive E2E Test Auditor
Autonomous agent for comprehensive E2E test audits. Analyzes your entire application, generates tests for all critical flows, and executes with 5-retry resilience—committing after each passing test to preserve progress.
Manifest Output
ui-test-audit-manifest.json
Use Cases
- Full-app E2E audits
- Legacy app test bootstrapping
- Pre-release coverage checks
Key Capabilities
- Project selection on startup (like @builder)
- Full-app analysis and test planning
- 5-retry resilience per test
- Per-test commits for progress preservation
- Continue-on-failure execution
- Manifest-driven test tracking
- PRD integration for test manifests
Key Concepts
The e2e-areas.json Manifest
A structured JSON file that describes which UI areas need E2E test coverage. Created by ui-test-reviewer and consumed by ui-tester-playwright.
{
"areas": [
{
"name": "Login Flow",
"priority": "critical",
"paths": ["/login", "/forgot-password"],
"interactions": ["form-submit", "validation"]
},
{
"name": "Dashboard Navigation",
"priority": "high",
"paths": ["/dashboard/*"],
"interactions": ["nav-click", "search"]
}
]
}
Failures Become Draft PRDs
When E2E tests fail, the ui-test-reviewer automatically generates a draft PRD describing the issue. This ensures failures are tracked and queued for resolution rather than being lost.
Using the Dev Server for E2E Tests
E2E tests run against your local development server, configured via project.json. Builder manages starting and stopping the server automatically.
From project.json
"apps": {
"web": {
"devPort": 3000,
"framework": "nextjs"
}
}
Authentication in E2E Tests
For authenticated test runs, use globalSetup with shared storageState. This authenticates once at suite start and shares the session across all tests.
Recommended: globalSetup + storageState
// playwright.config.ts
export default defineConfig({
globalSetup: './tests/global-setup.ts',
use: {
storageState: './tests/.auth/user.json',
},
});
// tests/global-setup.ts
async function globalSetup() {
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('/login');
await page.fill('#email', process.env.TEST_USER);
await page.fill('#password', process.env.TEST_PASS);
await page.click('button[type="submit"]');
await page.context().storageState({
path: './tests/.auth/user.json'
});
await browser.close();
}
Anti-pattern: per-suite beforeAll auth
Avoid authenticating in each test file's beforeAll. This triggers multiple login requests per test run, which:
- Rate limit risk — Auth providers may throttle or block
- Slower runs — Login flow repeated N times
- Flaky tests — Network timing issues multiply
// ❌ Don't do this for default user flows
test.beforeAll(async ({ browser }) => {
const page = await browser.newPage();
await page.goto('/login');
await page.fill('#email', user);
// Runs for every test file!
});
Exception: Per-suite auth is appropriate when testing different user roles or permission levels that need distinct sessions.
QA & Adversarial Testing
Beyond correctness testing, the toolkit includes a dedicated QA layer for exploratory, adversarial testing. These agents actively try to break your application—finding edge cases, race conditions, and unexpected behaviors that scripted tests miss.
QA Testing Workflow
qa coordinates: Dispatches exploratory testing tasks, prioritizes test areas
qa-explorer explores: Autonomously browses the app, tries edge cases, finds bugs
browser-use CLI → Bug Findings → qa-browser-tester converts: Turns bug findings into permanent Playwright regression tests
QA Testing Agents
qa
QA Coordinator
The central coordinator for all QA and exploratory testing. Dispatches testing tasks to specialists, manages testing sessions, and prioritizes which areas of the application need the most attention.
Dispatch Mechanism
Routes to qa-explorer or qa-browser-tester
Based on testing phase and objectives
Typically Invoked By
builder, developer, or manual request
Key Capabilities
- Exploratory testing session management
- Test area prioritization based on risk
- Dispatches to explorer or browser-tester
- Aggregates findings across sessions
- Tracks bug discovery and resolution
- Coordinates with E2E testing workflow
qa-explorer
Adversarial Testing Agent
Uses the browser-use CLI to autonomously browse and interact with your application like a real user—but with an adversarial mindset. Tries edge cases, rapid interactions, unusual inputs, and unexpected navigation patterns to find bugs.
Primary Tool
browser-use CLI
Required dependency for autonomous browsing
Output
Bug findings with screenshots
Includes reproduction steps for each issue
Key Capabilities
- Autonomous browser navigation
- Edge case input testing
- Rapid interaction sequences
- Unusual navigation patterns
- Screenshot capture on failures
- Reproduction step documentation
qa-browser-tester
Bug-to-Test Converter
Takes bug findings from qa-explorer and converts them into permanent Playwright regression tests. Ensures that once a bug is found, it can never silently regress—the test will catch it.
Input
Bug findings from qa-explorer
Screenshots and reproduction steps
Output
*.spec.ts regression tests
Key Capabilities
- Converts bug reports to Playwright tests
- Extracts reproduction steps automatically
- Generates reliable test selectors
- Adds appropriate assertions
- Integrates with existing E2E suite
- Prevents silent regressions
Browser-Use CLI Dependency
External Dependency: browser-use
The qa-explorer agent relies on the browser-use CLI—an external tool that enables AI agents to control a real browser autonomously.
Note
The browser-use CLI must be installed separately. See the browser-use documentation for installation instructions.
Why Adversarial Testing?
Finds Edge Cases
Tests combinations and scenarios that developers didn't anticipate—unusual inputs, rapid sequences, boundary conditions.
Simulates Real Users
Real users don't follow happy paths. They click randomly, navigate unexpectedly, and use the app in ways you never imagined.
Permanent Protection
Every bug found becomes a regression test. Once discovered, that issue can never silently return to your codebase.
E2E Quality Patterns
Quality testing goes beyond basic assertions. The ui-test-ux-quality skill provides specialized patterns for catching visual glitches, performance issues, layout shifts, and intermediate bad states that users experience but traditional tests miss.
When to Use Quality Patterns
Use these patterns when you need to verify the experience of using your UI, not just the final state. Basic assertions check "does this element exist?" — quality patterns check "did the user see a flash of wrong content?", "did the layout jump?", "was it fast enough?"
✓ Use quality patterns for:
- Loading states that shouldn't flash
- Drag-and-drop interactions
- Animations and transitions
- Performance-critical pages
- Layout-sensitive components
✗ Basic assertions suffice for:
- Form validation messages
- Navigation between pages
- Simple CRUD operations
- Static content verification
- API response checks
1. assertNeverAppears
Verifies that an element never appears during an action — catches flickers, loading state flashes, and momentary error displays that users see but final-state assertions miss.
// Watch for skeleton flash during cached navigation
await assertNeverAppears(
page,
'[data-testid="skeleton"]',
async () => {
await page.click('[data-testid="cached-link"]');
await page.waitForSelector('[data-testid="content"]');
},
{ checkInterval: 16 } // Check every frame (~60fps)
);
// Ensure error toast doesn't flash during successful submit
await assertNeverAppears(
page,
'[data-testid="error-toast"]',
async () => {
await page.click('[data-testid="submit-button"]');
await page.waitForSelector('[data-testid="success-message"]');
}
);
Use when: Testing cached navigations, optimistic updates, or any action where intermediate states shouldn't be visible.
2. withPerformanceBudget
Enforces performance constraints as test assertions. Fails the test if an action exceeds time or memory budgets — catches performance regressions before they reach production.
// Dashboard must load within 2 seconds
await withPerformanceBudget(
page,
{ timeout: 2000 },
async () => {
await page.goto('/dashboard');
await page.waitForSelector('[data-testid="dashboard-loaded"]');
}
);
// Search should respond within 500ms with reasonable memory
await withPerformanceBudget(
page,
{ timeout: 500, maxHeapUsage: 50 * 1024 * 1024 }, // 50MB
async () => {
await page.fill('[data-testid="search"]', 'query');
await page.waitForSelector('[data-testid="results"]');
}
);
Use when: Protecting critical user journeys from performance regressions, especially after optimization work.
3. assertNoLayoutShift
Captures element positions before and after an action, failing if any tracked element moved unexpectedly. Prevents the frustrating experience of clicking a button that jumps away.
// Verify ad loading doesn't shift article content
await assertNoLayoutShift(
page,
['article', '.sidebar', '.cta-button'], // Elements to track
async () => {
// Wait for lazy-loaded ad to appear
await page.waitForSelector('[data-testid="ad-loaded"]');
}
);
// Ensure image loading doesn't shift text below
await assertNoLayoutShift(
page,
['.hero-text', '.nav-button'],
async () => {
await page.waitForSelector('img[data-loaded="true"]');
},
{ tolerance: 2 } // Allow 2px variance for subpixel rendering
);
Use when: Testing pages with lazy-loaded content, images without dimensions, or dynamic elements that could push other content around.
4. assertStableRender
Monitors an element's content over time, failing if it changes unexpectedly. Catches React hydration mismatches, flickering values, and components that re-render with different content.
// Price shouldn't flicker between different values
await assertStableRender(
page,
'[data-testid="price"]',
{ duration: 1000 } // Watch for 1 second
);
// Dashboard metrics should stabilize after load
await page.waitForSelector('[data-testid="dashboard-ready"]');
await assertStableRender(
page,
'[data-testid="total-revenue"]',
{
duration: 500,
allowedChanges: 1 // Allow one update, then must stabilize
}
);
Use when: Testing SSR hydration, real-time data displays, or any component where content flickering would confuse users.
5. measureCLS
Measures Cumulative Layout Shift using the browser's Performance Observer API. Returns the actual CLS score for assertions or reporting — essential for Core Web Vitals compliance.
// Measure CLS during full page lifecycle
const cls = await measureCLS(page, async () => {
await page.goto('/article');
await page.waitForLoadState('networkidle');
// Wait for all lazy content
await page.waitForTimeout(2000);
});
// Google's "Good" threshold is < 0.1
expect(cls).toBeLessThan(0.1);
// For stricter pages, use tighter threshold
const checkoutCLS = await measureCLS(page, async () => {
await page.goto('/checkout');
await page.waitForSelector('[data-testid="checkout-ready"]');
});
expect(checkoutCLS).toBeLessThan(0.05);
Use when: Monitoring Core Web Vitals, testing pages with ads or dynamic content, or after any layout-related changes.
6. assertStateStability
Verifies that once a desired state is reached, it doesn't regress. Catches race conditions where success briefly appears then reverts to loading or error states.
// Once saved, the button shouldn't revert to "saving..."
await assertStateStability(
page,
'[data-testid="save-button"]',
{
desiredState: { text: 'Saved' },
forbiddenStates: [{ text: 'Saving...' }, { text: 'Save' }],
duration: 2000
}
);
// Verify successful state persists after form submit
await page.click('[data-testid="submit"]');
await assertStateStability(
page,
'[data-testid="status"]',
{
desiredState: { attribute: 'data-status', value: 'success' },
forbiddenStates: [
{ attribute: 'data-status', value: 'loading' },
{ attribute: 'data-status', value: 'error' }
],
duration: 3000
}
);
Use when: Testing async operations, optimistic updates with server reconciliation, or any action with multiple state transitions.
7. expectMutualExclusivity
Asserts that certain UI states never coexist — catches impossible states like showing both a loading spinner and loaded content, or both success and error messages simultaneously.
// Loading and content should never appear together
await expectMutualExclusivity(
page,
['[data-testid="loading"]', '[data-testid="content"]'],
async () => {
await page.goto('/dashboard');
await page.waitForSelector('[data-testid="content"]');
}
);
// Success and error toasts are mutually exclusive
await expectMutualExclusivity(
page,
[
'[data-testid="success-toast"]',
'[data-testid="error-toast"]',
'[data-testid="warning-toast"]'
],
async () => {
await page.click('[data-testid="submit"]');
await page.waitForSelector('[data-testid*="toast"]');
}
);
Use when: Testing state machines, loading/error states, or any UI with states that should be mutually exclusive.
Full Skill Documentation
The ui-test-ux-quality skill includes implementation details, helper functions, and integration guidance. Load the ui-test-ux-quality skill or see the full documentation.
Mutation Testing Pattern
The 3-step mutation testing pattern ensures state changes are truly persisted, not just optimistically displayed. This pattern catches bugs at three distinct verification stages—from immediate UI feedback to permanent data persistence.
Why Three Stages?
Single-assertion tests only verify the final state. But users experience the full journey: they click a button, see it respond, wait for completion, and expect the change to survive a refresh. The 3-step pattern tests what users actually experience—catching bugs that final-state tests miss.
The Three Verification Stages
Immediate State
Assert right after the action. Verifies the UI responds immediately to user input.
Catches
- • Action handler bugs
- • Validation errors
- • Event binding issues
- • Missing optimistic updates
Example
"After clicking save, button shows 'Saving...'"
Stable State
Assert after async operations settle. Verifies the operation completed successfully.
Catches
- • Async bugs
- • Race conditions
- • Optimistic UI mismatches
- • API error handling
Example
"After save completes, success toast appears"
Persistence
Assert after page reload or re-fetch. Verifies the change was actually persisted.
Catches
- • Persistence bugs
- • Cache issues
- • Serialization problems
- • Database transaction failures
Example
"After reload, the saved data is still there"
Complete Pattern Example
Here's a full Playwright test demonstrating all three verification stages for a profile update flow:
test('saving profile updates persists', async ({ page }) => {
// Setup: Navigate to the profile page
await page.goto('/profile');
await page.waitForSelector('[name="bio"]');
// === Stage 1: Immediate State ===
// Assert right after the action
await page.fill('[name="bio"]', 'New bio text');
await page.click('button:text("Save")');
// Verify the button shows loading state immediately
await expect(page.locator('button:text("Save")'))
.toHaveAttribute('aria-busy', 'true');
// Verify optimistic update appears in form
await expect(page.locator('[name="bio"]'))
.toHaveValue('New bio text');
// === Stage 2: Stable State ===
// Assert after async operations settle
await expect(page.locator('[role="alert"]'))
.toHaveText('Profile saved');
await expect(page.locator('button:text("Save")'))
.not.toHaveAttribute('aria-busy', 'true');
// Verify the form still shows the correct value
await expect(page.locator('[name="bio"]'))
.toHaveValue('New bio text');
// === Stage 3: Persistence ===
// Assert after page reload or re-fetch
await page.reload();
await page.waitForSelector('[name="bio"]');
// Verify the saved data survived the refresh
await expect(page.locator('[name="bio"]'))
.toHaveValue('New bio text');
});
Why Each Stage Matters
Each stage catches a different category of bugs. Skipping any stage leaves blind spots in your test coverage:
Immediate State catches UX failures
If Stage 1 fails, users don't know their action registered. They might click again, causing duplicate submissions. Or they might think the app is broken and leave. Common bugs: missing onClick handlers, broken form bindings, disabled state not applying.
Stable State catches async failures
If Stage 2 fails, the optimistic update showed success but the server rejected it. Or a race condition caused the UI to revert. Users see a "success" that disappears. Common bugs: unhandled API errors, race conditions in state updates, missing error boundaries.
Persistence catches data loss
If Stage 3 fails, everything looked correct but the data was never actually saved. Users think their work is safe, but refreshing the page reveals it's gone. Common bugs: cache-only updates without API calls, transaction rollbacks, serialization errors.
More Pattern Examples
The 3-step pattern applies to any mutation operation. Here are additional examples:
test('deleting an item removes it permanently', async ({ page }) => {
// Stage 1: Immediate - item starts fading/strikethrough
await page.click('[data-testid="delete-item-1"]');
await expect(page.locator('[data-testid="item-1"]'))
.toHaveClass(/deleting/);
// Stage 2: Stable - item is removed from list
await expect(page.locator('[data-testid="item-1"]'))
.not.toBeVisible();
await expect(page.locator('[data-testid="toast"]'))
.toHaveText('Item deleted');
// Stage 3: Persistence - item stays gone after reload
await page.reload();
await expect(page.locator('[data-testid="item-1"]'))
.not.toBeVisible();
});
test('creating a task adds it to the list', async ({ page }) => {
// Stage 1: Immediate - new task appears optimistically
await page.fill('[data-testid="new-task"]', 'Buy groceries');
await page.click('[data-testid="add-task"]');
await expect(page.locator('text=Buy groceries')).toBeVisible();
await expect(page.locator('[data-testid="add-task"]'))
.toBeDisabled(); // Prevent double-submit
// Stage 2: Stable - task gets permanent ID, button re-enabled
await expect(page.locator('[data-testid="add-task"]'))
.toBeEnabled();
const task = page.locator('text=Buy groceries');
await expect(task).toHaveAttribute('data-id', /.+/); // Has server ID
// Stage 3: Persistence - task survives page refresh
await page.reload();
await expect(page.locator('text=Buy groceries')).toBeVisible();
});
When to Apply This Pattern
Use the 3-step mutation testing pattern for any operation that modifies data: creates, updates, deletes, settings changes, form submissions, and user preference updates. Skip it only for read-only operations like navigation and search.
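The three stages can also be captured in a small helper so none of them is skipped by accident. A minimal sketch, where `verifyMutation` and the stage names are hypothetical (not part of the toolkit); in a real test each callback would wrap Playwright actions and `expect()` assertions:

```typescript
// Hypothetical helper (not part of the toolkit) that makes each
// verification stage a required callback, so none can be skipped.
type Stage = () => Promise<void>;

interface MutationStages {
  act: Stage;             // perform the mutation (fill, click, ...)
  immediate: Stage;       // Stage 1: assert right after the action
  stable: Stage;          // Stage 2: assert after async work settles
  reloadAndAssert: Stage; // Stage 3: assert after reload/re-fetch
}

async function verifyMutation(stages: MutationStages): Promise<string[]> {
  const completed: string[] = [];
  await stages.act();
  await stages.immediate();
  completed.push('immediate');
  await stages.stable();
  completed.push('stable');
  await stages.reloadAndAssert();
  completed.push('persistence');
  return completed;
}
```

In a real test, `act` would wrap `page.fill`/`page.click`, and `reloadAndAssert` would call `page.reload()` before its assertions.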
Testing in the Workflow
Here's how testing agents integrate with the broader toolkit workflow:
@builder completes a feature
Implementation is done, code is ready for testing
tester orchestrator is invoked
Analyzes changed files and determines which specialists to call
Specialists generate tests
react-tester, jest-tester, etc. write tests for their domains
Tests run and verify
If tests pass, feature is ready. If they fail, issues are reported back.
E2E tests run (if configured)
ui-tester-playwright runs full browser tests for complete coverage
Electron Desktop Testing
For Electron desktop apps, the toolkit uses Playwright's Electron API instead of browser-based testing. The ui-test-electron skill is automatically loaded when your project includes an Electron app entry.
Playwright Web vs Playwright Electron
Web (Standard)
- • Connects to URL via browser
- • Uses page.goto()
- • Standard DOM selectors
Electron
- • Launches Electron binary directly
- • Uses electron.launch()
- • Access to main + renderer processes
project.json Configuration
Configure your Electron app in the apps[] array:
{
"apps": [
{
"name": "desktop",
"type": "electron",
"devServer": {
"startCommand": "npm run electron:dev",
"port": null, // Electron doesn't use HTTP port
"readyPattern": "Electron ready"
},
"electron": {
// Path to built executable (for production testing)
"executablePath": "dist/MyApp-darwin-arm64/MyApp.app",
// Args to launch in dev mode (uses electron .)
"devLaunchArgs": [".", "--no-sandbox"]
},
// Architecture detection for verification strategy
"webContent": "bundled", // bundled | remote | hybrid
"remoteUrl": null // Only for remote/hybrid apps
}
]
}
executablePath
Path to your built Electron app. On macOS, this is typically the .app bundle. On Windows/Linux, point to the executable directly.
devLaunchArgs
Arguments passed to the electron binary during development. The first arg is typically "." to run from the project root.
port: null
Unlike web apps, Electron apps don't expose an HTTP port. Set port: null to indicate Playwright should launch the binary directly instead of connecting to a URL.
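As an illustration of what port: null implies, a runner's decision might look like this sketch (hypothetical; not the toolkit's actual internals):

```typescript
// Hypothetical sketch: choose a strategy from the devServer config
// shown above. port: null means there is no URL to connect to.
interface DevServerConfig {
  startCommand: string;
  port: number | null;
}

function launchStrategy(
  devServer: DevServerConfig,
): 'launch-binary' | 'connect-url' {
  // Electron apps expose no HTTP port, so the binary is launched directly
  return devServer.port === null ? 'launch-binary' : 'connect-url';
}
```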
webContent
Describes how the app loads its UI content. Used for architecture-aware verification strategy selection:
- • bundled — UI is packaged with the app (file:// protocol)
- • remote — UI loads from a remote URL
- • hybrid — Mix of bundled shell with remote content
remoteUrl
For remote or hybrid apps, specify the URL where the UI content is loaded from. Set to null for bundled apps.
Zombie Process Cleanup
Electron apps can leave zombie processes if tests fail or are interrupted. The ui-test-electron skill includes a globalSetup.ts pattern that cleans up orphaned processes before each test run:
// playwright/globalSetup.ts
import { execSync } from 'child_process';
export default async function globalSetup() {
// Kill any orphaned Electron processes from previous runs
try {
if (process.platform === 'darwin') {
execSync('pkill -f "Electron" || true', { stdio: 'ignore' });
} else if (process.platform === 'win32') {
execSync('taskkill /F /IM electron.exe 2>nul || exit 0', { stdio: 'ignore' });
} else {
execSync('pkill -f electron || true', { stdio: 'ignore' });
}
} catch {
// Ignore errors if no processes found
}
// Brief pause to ensure cleanup completes
await new Promise(resolve => setTimeout(resolve, 500));
}
Pre-test verification: The test-flow skill runs a zombie process pre-check before Electron tests. If orphaned processes are detected, they are cleaned up automatically to prevent "another instance already running" errors.
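For the cleanup script to run, it must be registered via Playwright's standard globalSetup option, for example:

```typescript
// playwright.config.ts -- run the cleanup once before the suite
import { defineConfig } from '@playwright/test';

export default defineConfig({
  globalSetup: './playwright/globalSetup.ts',
});
```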
How the ui-test-electron Skill Works
Skill is loaded automatically
The ui-test-flow skill detects Electron projects via architecture.deployment or apps.*.framework and automatically routes to the ui-test-electron skill, skipping browser-specific setup.
Playwright launches Electron
Tests use electron.launch() with your devLaunchArgs or executablePath.
Tests interact with the window
Standard Playwright page APIs work on the Electron renderer process. Main process can be accessed via electronApp.evaluate().
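The steps above can be sketched as a single test. A minimal sketch using Playwright's Electron API, assuming the devLaunchArgs from the earlier project.json example and a hypothetical [data-testid="app-root"] element:

```typescript
import { test, expect } from '@playwright/test';
import { _electron as electron } from 'playwright';

test('Electron main window renders', async () => {
  // Launch the Electron binary directly -- no URL, no browser
  const electronApp = await electron.launch({ args: ['.', '--no-sandbox'] });

  // Standard page APIs work against the renderer process
  const window = await electronApp.firstWindow();
  await expect(window.locator('[data-testid="app-root"]')).toBeVisible();

  // The main process is reachable via electronApp.evaluate()
  const version = await electronApp.evaluate(({ app }) => app.getVersion());
  expect(version).toBeTruthy();

  await electronApp.close();
});
```

Running this requires a local Electron install; the selector and launch args are assumptions to be adapted to your app.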
Multi-Platform Testing
If your project has both web and Electron targets in the apps[] array, agents automatically detect which platform a test targets based on the test file location or explicit annotations. See the Multi-Platform Apps section for configuration details.
Test Verify Settings
The testVerifySettings object in project.json controls which automated Playwright invocation points are enabled. All settings default to true when absent, so the system runs all verification steps unless you explicitly opt out.
What These Settings Control
These settings gate automated Playwright invocations triggered during the build workflow. They do not gate user-invoked workflows like @qa or @ui-test-full-app-auditor, nor do they affect test file creation or maintenance.
Configuration
// project.json
{
"testVerifySettings": {
"adHocUIVerify_Analysis": true,
"adHocUIVerify_StoryTest": true,
"adHocUIVerify_CompletionTest": true,
"prdUIVerify_Analysis": true,
"prdUIVerify_StoryTest": true,
"prdUIVerify_PRDCompletionTest": true
}
}
Settings Reference
| Setting | Mode | Description |
|---|---|---|
| adHocUIVerify_Analysis | Ad-hoc | Run Playwright analysis probe after code changes (adhoc-workflow Step 0.1b) |
| adHocUIVerify_StoryTest | Ad-hoc | Write and run Playwright tests for completed tasks (test-flow Step 5 ad-hoc) |
| adHocUIVerify_CompletionTest | Ad-hoc | Run holistic Playwright tests covering the full batch of changes at task spec completion |
| prdUIVerify_Analysis | PRD | Run per-story Playwright verification after implementation (test-flow Step 3 PRD) |
| prdUIVerify_StoryTest | PRD | Write and run per-story Playwright tests (test-flow Step 5 PRD, tester Step 7) |
| prdUIVerify_PRDCompletionTest | PRD | Generate deferred UI tests at PRD completion (prd-workflow Ship Phase "G" option) |
Common Patterns
Skip analysis probes, keep test generation
Useful when analysis probes are slow but you still want Playwright tests written for each story.
"testVerifySettings": {
"adHocUIVerify_Analysis": false,
"prdUIVerify_Analysis": false
}
Disable all automated Playwright
For projects that rely on manual QA or external CI for UI testing. You can still invoke @qa or @ui-test-full-app-auditor directly.
"testVerifySettings": {
"adHocUIVerify_Analysis": false,
"adHocUIVerify_StoryTest": false,
"adHocUIVerify_CompletionTest": false,
"prdUIVerify_Analysis": false,
"prdUIVerify_StoryTest": false,
"prdUIVerify_PRDCompletionTest": false
}
Default Behavior
If the testVerifySettings object is absent from project.json, all six settings default to true. This means existing projects get full automated Playwright verification without any configuration changes.
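The default-to-true rule can be sketched in a few lines (hypothetical helper, illustrating the behavior described above rather than the toolkit's actual code):

```typescript
// Hypothetical sketch of the default-to-true rule: a verification
// step is enabled unless project.json explicitly sets it to false.
type TestVerifySettings = Partial<Record<string, boolean>>;

function isVerifyEnabled(
  settings: TestVerifySettings | undefined,
  key: string,
): boolean {
  // A missing settings object and a missing key both fall back to true
  return settings?.[key] ?? true;
}
```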
CORS & Browser Verification
Cross-Origin Resource Sharing (CORS) is a browser security mechanism that controls which domains can access resources from another domain. This has important implications for how agents verify API behavior.
⚠️ Critical: CORS Is Browser-Enforced
Agents must never use curl, wget, or similar CLI tools to verify CORS behavior. CORS headers are enforced by browsers, not by servers or CLI tools.
Why CLI Tools Cannot Test CORS
CORS works as follows:
1. Browser makes a preflight OPTIONS request
2. Server responds with CORS headers (Access-Control-Allow-Origin, etc.)
3. Browser decides whether to allow or block the actual request
CLI tools like curl skip step 3 entirely—they receive the response regardless of CORS headers. A curl request succeeding tells you nothing about whether a browser would allow the same request.
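The browser-side decision in step 3 can be modeled in a few lines. A simplified sketch: real CORS also involves credentials, methods, and header checks, and browserAllowsResponse is a hypothetical name:

```typescript
// Simplified model of step 3: the check browsers perform and curl
// skips. Only the Access-Control-Allow-Origin header is modeled.
function browserAllowsResponse(
  requestOrigin: string,
  allowOriginHeader: string | null,
): boolean {
  if (allowOriginHeader === null) return false; // no header: block
  if (allowOriginHeader === '*') return true;   // wildcard: allow
  return allowOriginHeader === requestOrigin;   // exact match only
}
```

curl receives the response body in every case above; only a browser applies this check.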
Correct CORS Verification Methods
| Method | When to Use |
|---|---|
| Playwright E2E test | Primary method—runs in a real browser context |
| Browser DevTools (manual) | Quick verification during development |
| QA adversarial agent | Exploratory testing of cross-origin scenarios |
Example: Playwright CORS Test
test('API allows cross-origin requests from allowed domain', async ({ page }) => {
// Navigate to the allowed origin
await page.goto('https://allowed-origin.example.com');
// Make cross-origin request from browser context
const response = await page.evaluate(async () => {
const res = await fetch('https://api.example.com/data');
return { ok: res.ok, status: res.status };
});
expect(response.ok).toBe(true);
});
test('API blocks cross-origin requests from disallowed domain', async ({ page }) => {
await page.goto('https://disallowed-origin.example.com');
// This should fail due to CORS
const error = await page.evaluate(async () => {
try {
await fetch('https://api.example.com/data');
return null;
} catch (e) {
return e.message;
}
});
// Browsers report blocked cross-origin fetches as generic network
// failures (e.g. "Failed to fetch"), so assert that an error
// occurred rather than matching "CORS" in the message
expect(error).not.toBeNull();
});
Agent Enforcement
The security-critic and backend-critic agents are configured to flag any CORS verification that uses CLI tools. If you see a CORS test using curl, the test is invalid and must be rewritten to use browser-based verification.