Testing System Architecture

The toolkit includes 10 specialized testing agents organized into layers. An orchestrator routes to unit test specialists, E2E testers, and QA agents—each optimized for different testing needs.

Testing Agent Hierarchy

The testing system is organized into four specialized layers, coordinated by Builder — the agent you interact with. Builder automatically delegates to the tester orchestrator, which routes to specialists based on file patterns.

User-Facing Layer

builder

You interact with Builder — it delegates testing automatically

Orchestrator Layer

tester

Routes to specialists based on file patterns

Unit Test Specialists

jest-tester

Backend JS/TS

react-tester

React + RTL

go-tester

Go testify

E2E Testing Layer

ui-tester-playwright

Writes E2E tests

ui-test-reviewer

Identifies test gaps

ui-test-full-app-auditor

Full app audits

QA / Adversarial Layer

qa

Coordinates testing

qa-explorer

Finds bugs adversarially

qa-browser-tester

Bug → regression test

How the Tester Orchestrator Works

The tester agent is the central coordinator for all test generation. It receives test requests, analyzes which files need testing, and delegates to the appropriate specialist agents.

The Orchestration Process

  1. Receive Test Request

     The orchestrator receives a request to generate tests—either from a workflow automation or a direct user request.

  2. Analyze Files

     Examines file extensions, import patterns, and project structure to determine which specialist to invoke.

  3. Route to Specialists

     Delegates test generation to the appropriate agent(s) based on file patterns and project configuration.

  4. Run & Verify

     Executes the generated tests and handles failures through the retry loop (up to 3 attempts).

Routing Logic

The orchestrator uses file patterns to determine which specialist agent handles each file:

File Pattern         | Routes To    | Test Framework
*.tsx, *.jsx         | react-tester | React Testing Library + Jest
*.ts, *.js (backend) | jest-tester  | Jest
*.go                 | go-tester    | Go testing + testify

Note: Python files (*.py) do not yet have a dedicated specialist agent. The orchestrator will route to a general-purpose agent for Python test generation.
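The routing table above can be expressed as a simple pattern match. The sketch below is illustrative only (the real orchestrator also weighs import patterns and project structure, which this version omits):

```typescript
// Illustrative sketch of the orchestrator's routing decision.
// The real toolkit reads patterns from configuration; names here are hypothetical.
type Specialist = "react-tester" | "jest-tester" | "go-tester" | "general-purpose";

function routeFile(path: string): Specialist {
  if (/\.(tsx|jsx)$/.test(path)) return "react-tester"; // React components
  if (/\.(ts|js)$/.test(path)) return "jest-tester";    // backend JS/TS
  if (/\.go$/.test(path)) return "go-tester";           // Go sources
  return "general-purpose";                             // e.g. *.py has no dedicated agent yet
}
```

In practice, distinguishing backend `*.ts` from frontend `*.ts` requires import analysis; the sketch routes all non-JSX TypeScript to jest-tester for brevity.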

Test Loop & Retry Handling

When tests fail, the orchestrator attempts to fix and retry—up to 3 times before deferring:

Run Tests
Tests Pass?
Yes
Done ✓
No
Analyze & Fix
Retry (max 3×)
Defer to Human

Max 3 attempts: If tests still fail after 3 fix attempts, the orchestrator defers to human review rather than continuing indefinitely. This prevents infinite loops and ensures complex issues get proper attention.
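The loop above can be simulated in a few lines. This is a toy model with hypothetical function names; the real orchestrator delegates both running and fixing to specialist agents:

```typescript
// Simulated retry loop: run tests, fix on failure, defer after 3 fix attempts.
// runTests and attemptFix stand in for the orchestrator's real delegation.
type Outcome = "pass" | "deferred-to-human";

function testLoop(runTests: () => boolean, attemptFix: () => void): Outcome {
  if (runTests()) return "pass";
  for (let fixes = 0; fixes < 3; fixes++) {
    attemptFix();                  // analyze failure output, apply a candidate fix
    if (runTests()) return "pass";
  }
  return "deferred-to-human";      // 3 fix attempts exhausted, hand off to a human
}
```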

Test Failure Output Policy

All testing agents follow a strict output policy to ensure failures are never hidden. This helps you debug issues quickly without missing critical error information.

The Rule

Passing Tests

Can be summarized: "42 tests passed"

Failing Tests

Must show complete output — never truncated. Full stack traces, assertion messages, and context are always preserved.

Why this matters

Truncated failure output is the #1 cause of wasted debugging time. When a test fails, you need to see exactly what went wrong — not a summary that hides the actual error.

Agents with this policy

All 6 testing agents enforce this policy.

Four Operational Modes

The tester orchestrator operates in different modes depending on context. Each mode optimizes for different testing scenarios.

Story Mode

Used during PRD implementation. After each user story is completed, tests are automatically generated to cover the new functionality.

Example:

"When builder completes US-003, tester automatically generates tests for all changed files in that story"

When to use:

Building features from PRDs—tests are generated as part of the workflow after each story completion.

Ad-hoc Mode

For direct test requests outside of PRD workflows. Scopes testing to specific files for fast, targeted coverage.

Example:

"Write tests for UserProfile.tsx" → tester scopes to just that component and routes to react-tester

When to use:

Quick fixes, bug patches, or when you just need tests for specific files without running a full workflow.

Full Suite Mode

Comprehensive test generation and execution across the entire codebase. Identifies coverage gaps and ensures all critical paths are tested.

Example:

"Run all tests with coverage analysis" → tester executes entire suite, identifies untested code, generates coverage report

When to use:

Before major releases, when bootstrapping test coverage, or for periodic quality audits.

Visual Audit Mode

Full-site UX sweeps that systematically check every page for visual inconsistencies, accessibility issues, and design system violations. Can also perform targeted post-fix re-checks after issues are resolved.

Example:

"Run visual audit on all pages" → tester crawls site, captures screenshots, flags inconsistent spacing, broken dark mode, and accessibility violations

When to use:

After major UI refactors, before releases, when verifying design system compliance, or for targeted re-checks after fixing visual bugs.

Automatic Test Activity Selection

The toolkit automatically determines what testing to run by matching changed files against glob patterns in test-activity-rules.json. There are no story-level metadata flags — Planner assigns zero test-related fields to stories. Non-Playwright tests (typecheck, lint, unit tests, critics) run unconditionally. Playwright execution is gated solely by testVerifySettings.

How File Patterns Drive Activity

When files change, the test-flow skill matches each file against glob patterns in test-activity-rules.json to collect unit testers and critics. Cross-cutting rules and code-pattern matching add additional activities. The result is a set of resolved activities — no prompt, no choice.

// Resolved activity structure
{
  "baseline": ["typecheck", "lint"],
  "unit": ["react-tester", "jest-tester"],
  "critics": ["frontend-critic", "security-critic"],
  "quality": ["aesthetic-critic"],
  "reasoning": ["src/Button.tsx → *.tsx", "Code pattern: useAuth"]
}

Resolution Pipeline

Activities are resolved in four passes. The first three collect non-Playwright work; Playwright is handled separately in the execution step (gated by testVerifySettings).

Pass          | Input                                                 | Activities Collected
File patterns | Each changed file matched against filePatterns globs  | Unit testers, critics, quality critics
Code patterns | Diff content matched against codePatterns regexes     | Additional critics (e.g., security-critic for auth patterns)
Cross-cutting | Multiple directories touched, shared modules modified | oddball-critic, dx-critic
Hotspots      | Files listed in test-debt.json                        | Extra critics, forced E2E for known-fragile files
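A minimal sketch of how these passes accumulate activities, assuming simplified rule shapes (the actual test-activity-rules.json schema is richer, and the patterns below are hypothetical):

```typescript
// Sketch of activity resolution: each pass adds to a growing set of activities.
// Rule shapes and pattern lists are illustrative, not the real schema.
interface Resolved {
  unit: Set<string>;
  critics: Set<string>;
  reasoning: string[];
}

function resolveActivities(changedFiles: string[], diff: string): Resolved {
  const r: Resolved = { unit: new Set(), critics: new Set(), reasoning: [] };

  // Pass 1: file patterns → unit testers
  for (const f of changedFiles) {
    if (/\.(tsx|jsx)$/.test(f)) {
      r.unit.add("react-tester");
      r.reasoning.push(`${f} → *.tsx`);
    } else if (/\.(ts|js)$/.test(f)) {
      r.unit.add("jest-tester");
      r.reasoning.push(`${f} → *.ts`);
    }
  }

  // Pass 2: code patterns in the diff → additional critics
  if (/useAuth|password|jwt/.test(diff)) {
    r.critics.add("security-critic");
    r.reasoning.push("Code pattern: auth");
  }

  // Pass 3: cross-cutting — more than one top-level directory touched
  const dirs = new Set(changedFiles.map((f) => f.split("/")[0]));
  if (dirs.size > 1) r.critics.add("oddball-critic");

  // Playwright is not resolved here — it is gated by testVerifySettings.
  return r;
}
```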

Execution Order

After resolution, activities execute in a fixed order. Steps 1–4 always run; Step 5 (Playwright) is gated by testVerifySettings.

1. Baseline: typecheck, lint (always)
2. Unit tests: resolved testers for changed file types
3. Critics: resolved critics run in parallel
4. Quality: aesthetic-critic, tailwind-critic if resolved
5. Playwright: gated by testVerifySettings — write & run E2E tests

Zero Story Metadata

Planner assigns no test-related fields to stories. Activity resolution is driven entirely by which files change — making it consistent across PRD and ad-hoc work.

Playwright Gating

Playwright is the only activity that can be disabled. The six booleans in testVerifySettings control exactly which invocation points run automatically.

Test Flow Automation

The test-flow skill is a unified test orchestrator (~698 lines) that manages the entire test lifecycle—from activity resolution through execution to failure handling. It loads 5 focused Tier 2 sub-skills on demand (verification loop, prerequisite detection, E2E flow, UI verification, failure handling) and coordinates between Builder and testing agents to ensure quality gates are met in both PRD and ad-hoc modes.

Automatic Test Generation Triggers

Tests are automatically generated at specific points in the workflow, without requiring manual intervention:

After Each Story

In PRD mode, tests are automatically generated after each user story completes. Unit tests run immediately; E2E tests are queued.

On Request

In ad-hoc mode, tests are generated when explicitly requested or after all ad-hoc todos complete. User chooses when to run E2E.

Coverage Gaps

During full suite mode, tests are generated for any untested code paths identified through coverage analysis.

The Fix Loop Mechanism

When tests fail, the test-flow skill automatically attempts to fix the issues, making up to 20 attempts before stopping and reporting the failure:

Run Tests
Tests Pass?
Yes
Done ✓
No
Analyze Failures
@developer fixes
Attempts < 20?
Yes
↑ Retry
No
STOP

Why 20 attempts? The fix loop allows up to 20 attempts per failing test, with no escape hatches. There is no "save as-is" or "defer" option — the loop either succeeds or stops and reports the failure to the user.

Playwright Gating

Playwright is the only test activity that can be disabled. The six booleans in testVerifySettings control exactly which automated invocation points run. All other test activities (typecheck, lint, unit tests, critics) run unconditionally.

Non-Playwright: Always Run

Typecheck, lint, unit tests, and critics always run when resolved by file patterns. There is no setting to disable them.

Baseline, unit, and critic activities are unconditional

Playwright: Gated by Settings

Each automated Playwright invocation point has a corresponding boolean in testVerifySettings. All default to true.

6 booleans control analysis probes, story tests, and completion tests

User-invoked workflows are never gated. These settings only affect automated Playwright invocations during the build workflow. You can always invoke @qa or @ui-test-full-app-auditor directly regardless of these settings.

Integration with Tester Agent

The test-flow skill acts as a coordinator between the builder workflow and the tester agent hierarchy:

builder
test-flow skill
tester

1. Builder loads test-flow skill in both PRD and ad-hoc modes

2. test-flow determines when to generate/run tests

3. test-flow invokes @tester for test generation

4. tester routes to specialists (react-tester, jest-tester, etc.)

5. test-flow manages pendingTests state in builder-state.json

Architecture-Aware Verification

Before running verification, test-flow detects app architecture from project.json to choose the optimal strategy. Desktop apps always use playwright-electron — never browser-based verification. The webContent field determines whether a rebuild is needed, not whether to use Electron.

App Type             | webContent | Strategy                | How Verification Works
frontend / fullstack | n/a        | browser                 | Standard Playwright against dev server (HMR)
desktop              | bundled    | rebuild-then-launch-app | Build → relaunch Electron → verify with Playwright-Electron
desktop              | remote     | ensure-electron-running | Ensure Electron is running (HMR handles changes) → verify with Playwright-Electron
desktop              | hybrid     | rebuild-then-launch-app | Build → relaunch Electron → verify with Playwright-Electron
mobile               | remote     | verify-web-url          | Test web URL directly in browser
backend / cli        | n/a        | not-required            | No UI verification

⛔ All desktop strategies use playwright: "electron"

Never use browser-based verification for desktop apps. Even webContent: "remote" (where HMR delivers changes via dev server) requires connecting Playwright to the Electron process. Opening localhost in a browser is not the same as testing inside Electron — Electron has its own window chrome, IPC, and process model.

Pre-check: Before running desktop tests, test-flow performs a zombie process pre-check to clean up any orphaned Electron instances from previous runs.
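The strategy table can be expressed as a small selection function. The field values follow the table above, but the function itself is an illustrative sketch, not the skill's actual code:

```typescript
// Sketch of architecture-aware strategy selection, mirroring the table above.
type Strategy =
  | "browser"
  | "rebuild-then-launch-app"
  | "ensure-electron-running"
  | "verify-web-url"
  | "not-required";

function pickStrategy(appType: string, webContent?: string): Strategy {
  switch (appType) {
    case "frontend":
    case "fullstack":
      return "browser";
    case "desktop":
      // Desktop always verifies through Playwright-Electron;
      // webContent only decides whether a rebuild is needed first.
      return webContent === "remote"
        ? "ensure-electron-running"
        : "rebuild-then-launch-app";
    case "mobile":
      return "verify-web-url";
    default:
      return "not-required"; // backend / cli
  }
}
```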

Layer Deep Dive

1. Orchestrator Layer

The tester agent is the main entry point for all testing tasks. It analyzes file patterns and routes to the appropriate specialist:

  • *.tsx / *.jsx → routes to react-tester
  • *.ts / *.js (backend) → routes to jest-tester
  • *.go → routes to go-tester
2. Unit Test Specialists

Language-specific agents that understand testing idioms and best practices for each stack.

jest-tester

Backend JS/TS testing with Jest. Handles Node.js services, utilities, and API logic.

react-tester

React Testing Library. Tests components, hooks, and UI interactions.

go-tester

Go testing with testify and httptest for handlers and services.

3. E2E Testing Layer

End-to-end testing using Playwright. Tests complete user flows through the browser.

ui-tester-playwright

Writes Playwright E2E tests for user flows, forms, and critical paths.

ui-test-reviewer

Reviews UI changes and identifies areas needing E2E coverage.

ui-test-full-app-auditor

Comprehensive E2E test audits with 5-retry resilience.

4. QA / Adversarial Layer

Exploratory testing that finds bugs through adversarial thinking—testing edge cases, unusual inputs, and unexpected user behaviors.

qa

Coordinates exploratory testing sessions and prioritizes what to test.

qa-explorer

Uses browser-use CLI to actively explore the app and find bugs.

qa-browser-tester

Converts bug findings into Playwright regression tests.

Unit Test Specialists

Each specialist agent is optimized for a specific language and testing framework, understanding the idioms and best practices unique to that stack.

jest-tester

Backend JavaScript/TypeScript Testing

Generates comprehensive Jest tests for Node.js services, utilities, API routes, and backend logic. Understands async patterns, mocking strategies, and TypeScript typing.

File Patterns

*.ts, *.js (backend only)

Testing Framework

Jest

Key Capabilities

  • Module mocking with jest.mock()
  • Spy functions and mock implementations
  • Async/await and Promise testing
  • Coverage reporting with thresholds
  • Snapshot testing for data structures
  • Timer mocking (setTimeout, setInterval)

react-tester

React Component Testing

Generates tests for React components using React Testing Library. Focuses on testing user interactions and behavior rather than implementation details.

File Patterns

*.tsx, *.jsx

Testing Framework

React Testing Library + Jest

Key Capabilities

  • User event simulation (click, type, etc.)
  • Accessible queries (getByRole, getByLabelText)
  • Async rendering with waitFor/findBy
  • Custom hooks testing with renderHook
  • Component snapshot testing
  • Context and provider mocking

go-tester

Go Testing

Generates idiomatic Go tests using the standard library testing package, testify assertions, and httptest for HTTP handlers. Follows Go testing conventions and table-driven test patterns.

File Patterns

*.go

Testing Framework

Go testing + testify + httptest

Key Capabilities

  • Table-driven tests with subtests
  • Testify assertions (assert, require)
  • HTTP handler testing with httptest
  • Mock interfaces with testify/mock
  • Parallel test execution (t.Parallel)
  • Benchmark testing support

Project-Specific Overrides

Projects can override global testers with project-specific versions to customize testing behavior, add custom utilities, or enforce project conventions.

How it works

Place a custom tester definition in your project's docs/agents/ directory. The toolkit will use your project version instead of the global agent.

Example: Override jest-tester

your-project/
├── docs/
│   └── agents/
│       └── jest-tester.md   ← Overrides global
├── src/
│   └── ...
└── package.json

Common Use Cases

  • Custom test utilities or helpers unique to your project
  • Project-specific mocking patterns (e.g., mock your auth layer)
  • Enforcing team conventions (naming, structure, coverage thresholds)
  • Integration with project-specific testing infrastructure

Override Priority

Project agents in docs/agents/ take priority over global toolkit agents. This applies to all agent types, not just testers.

E2E Testing System

The toolkit includes a complete end-to-end testing pipeline using Playwright. Four specialized agents work together to identify test gaps, write comprehensive tests, run full browser-based test suites, and audit test coverage across your entire application.

E2E Testing Workflow

1. UI Changes Made

   Code changes to components, pages, or user flows

2. ui-test-reviewer analyzes

   Reviews git diff, identifies UI areas needing E2E coverage → produces e2e-areas.json

3. ui-tester-playwright writes

   Reads manifest, writes comprehensive Playwright tests

4. Builder runs

   Executes Playwright tests, handles failures, and ships

   Pass → Ready to ship
   Fail → Draft PRD created

Execution Mode Detection

Before running any E2E tests, the ui-test-flow skill automatically detects whether the project uses Electron desktop testing or standard browser testing, and routes to the correct test runner.

🌐 Browser Mode

  • Resolves test base URL
  • Checks dev server is running
  • Uses standard Playwright config
  • Parallel workers supported

🖥️ Electron Mode

  • Skips base URL resolution
  • Skips dev server checks
  • Routes to playwright-electron
  • Single worker only (--workers=1)

Detection logic: The skill checks architecture.deployment and apps.*.framework in your project.json. If either indicates Electron, the skill loads the ui-test-electron skill and skips browser-specific setup steps entirely.
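A sketch of that check, assuming the project.json field names described above (the exact values the skill accepts may differ):

```typescript
// Sketch of the Electron-vs-browser detection described above.
// Field names follow project.json as documented; the helper is illustrative.
interface ProjectJson {
  architecture?: { deployment?: string };
  apps?: Record<string, { framework?: string }>;
}

function isElectronProject(p: ProjectJson): boolean {
  if (p.architecture?.deployment === "electron") return true;
  return Object.values(p.apps ?? {}).some((a) => a.framework === "electron");
}
```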

E2E Testing Agents

ui-test-reviewer

Test Gap Analyzer

Analyzes git diffs to identify UI areas that need E2E test coverage. Creates a structured manifest of test requirements for other agents to consume.

Primary Output

e2e-areas.json

Typically Invoked By

builder, developer, or workflow automation

Key Capabilities

  • Git diff analysis for changed UI components
  • Identifies user flows affected by changes
  • Detects forms, modals, and interactive elements
  • Prioritizes critical paths for testing
  • Structured JSON manifest generation
  • Integration with existing test coverage

ui-tester-playwright

E2E Test Writer

Reads the e2e-areas.json manifest and writes comprehensive Playwright E2E tests for each identified area. Generates tests that cover user flows, forms, and critical paths.

Test Output

*.spec.ts in e2e/ or tests/

Testing Framework

Playwright Test

Key Capabilities

  • Page object pattern for maintainability
  • Form submission and validation tests
  • Navigation and routing verification
  • Authentication flow testing
  • Visual regression with screenshots
  • Multi-browser testing support

ui-test-full-app-auditor

Comprehensive E2E Test Auditor

Autonomous agent for comprehensive E2E test audits. Analyzes your entire application, generates tests for all critical flows, and executes with 5-retry resilience—committing after each passing test to preserve progress.

Manifest Output

ui-test-audit-manifest.json

Use Cases

  • Full-app E2E audits
  • Legacy app test bootstrapping
  • Pre-release coverage checks

Key Capabilities

  • Project selection on startup (like @builder)
  • Full-app analysis and test planning
  • 5-retry resilience per test
  • Per-test commits for progress preservation
  • Continue-on-failure execution
  • Manifest-driven test tracking
  • PRD integration for test manifests

Key Concepts

The e2e-areas.json Manifest

A structured JSON file that describes which UI areas need E2E test coverage. Created by ui-test-reviewer and consumed by ui-tester-playwright.

{
  "areas": [
    {
      "name": "Login Flow",
      "priority": "critical",
      "paths": ["/login", "/forgot-password"],
      "interactions": ["form-submit", "validation"]
    },
    {
      "name": "Dashboard Navigation",
      "priority": "high",
      "paths": ["/dashboard/*"],
      "interactions": ["nav-click", "search"]
    }
  ]
}

Failures Become Draft PRDs

When E2E tests fail, the ui-test-reviewer automatically generates a draft PRD describing the issue. This ensures failures are tracked and queued for resolution rather than being lost.

Test Failure
Draft PRD Created
Fixed in Next Sprint

Using the Dev Server for E2E Tests

E2E tests run against your local development server, configured via project.json. Builder manages starting and stopping the server automatically.

From project.json

"apps": {
  "web": {
    "devPort": 3000,
    "framework": "nextjs"
  }
}

Authentication in E2E Tests

For authenticated test runs, use globalSetup with shared storageState. This authenticates once at suite start and shares the session across all tests.

Recommended: globalSetup + storageState

// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  globalSetup: './tests/global-setup.ts',
  use: {
    baseURL: 'http://localhost:3000', // match your dev server port
    storageState: './tests/.auth/user.json',
  },
});

// tests/global-setup.ts
import { chromium } from '@playwright/test';

async function globalSetup() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  // globalSetup does not inherit baseURL, so use the full URL here
  await page.goto('http://localhost:3000/login');
  await page.fill('#email', process.env.TEST_USER!);
  await page.fill('#password', process.env.TEST_PASS!);
  await page.click('button[type="submit"]');
  await page.context().storageState({
    path: './tests/.auth/user.json'
  });
  await browser.close();
}

export default globalSetup;

Anti-pattern: per-suite beforeAll auth

Avoid authenticating in each test file's beforeAll. This triggers multiple login requests per test run, which:

  • Rate limit risk — Auth providers may throttle or block
  • Slower runs — Login flow repeated N times
  • Flaky tests — Network timing issues multiply
// ❌ Don't do this for default user flows
test.beforeAll(async ({ browser }) => {
  const page = await browser.newPage();
  await page.goto('/login');
  await page.fill('#email', user);
  // Runs for every test file!
});

Exception: Per-suite auth is appropriate when testing different user roles or permission levels that need distinct sessions.
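For role-based sessions, one common Playwright approach is a separate storage-state file per role, selected via projects. Role names and paths below are illustrative, not toolkit conventions:

```typescript
// Sketch: one storage-state file per role, selected via Playwright projects.
// In a real playwright.config.ts this object would be passed to defineConfig().
const config = {
  globalSetup: "./tests/global-setup.ts", // logs in once per role, writes each file
  projects: [
    { name: "admin", use: { storageState: "./tests/.auth/admin.json" } },
    { name: "member", use: { storageState: "./tests/.auth/member.json" } },
  ],
};
```

Each role still authenticates only once per run, preserving the rate-limit and speed benefits of the shared-state approach.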

QA & Adversarial Testing

Beyond correctness testing, the toolkit includes a dedicated QA layer for exploratory, adversarial testing. These agents actively try to break your application—finding edge cases, race conditions, and unexpected behaviors that scripted tests miss.

QA Testing Workflow

1. qa coordinates

   Dispatches exploratory testing tasks, prioritizes test areas

2. qa-explorer explores

   Autonomously browses the app via the browser-use CLI, tries edge cases, finds bugs — producing bug findings with screenshots and reproduction steps

3. qa-browser-tester converts

   Turns bug findings into permanent Playwright regression tests, saved in e2e/

QA Testing Agents

qa

QA Coordinator

The central coordinator for all QA and exploratory testing. Dispatches testing tasks to specialists, manages testing sessions, and prioritizes which areas of the application need the most attention.

Dispatch Mechanism

Routes to qa-explorer or qa-browser-tester

Based on testing phase and objectives

Typically Invoked By

builder, developer, or manual request

Key Capabilities

  • Exploratory testing session management
  • Test area prioritization based on risk
  • Dispatches to explorer or browser-tester
  • Aggregates findings across sessions
  • Tracks bug discovery and resolution
  • Coordinates with E2E testing workflow

qa-explorer

Adversarial Testing Agent

Uses the browser-use CLI to autonomously browse and interact with your application like a real user—but with an adversarial mindset. Tries edge cases, rapid interactions, unusual inputs, and unexpected navigation patterns to find bugs.

Primary Tool

browser-use CLI

Required dependency for autonomous browsing

Output

Bug findings with screenshots

Includes reproduction steps for each issue

Key Capabilities

  • Autonomous browser navigation
  • Edge case input testing
  • Rapid interaction sequences
  • Unusual navigation patterns
  • Screenshot capture on failures
  • Reproduction step documentation

qa-browser-tester

Bug-to-Test Converter

Takes bug findings from qa-explorer and converts them into permanent Playwright regression tests. Ensures that once a bug is found, it can never silently regress—the test will catch it.

Input

Bug findings from qa-explorer

Screenshots and reproduction steps

Output

*.spec.ts regression tests

Key Capabilities

  • Converts bug reports to Playwright tests
  • Extracts reproduction steps automatically
  • Generates reliable test selectors
  • Adds appropriate assertions
  • Integrates with existing E2E suite
  • Prevents silent regressions
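Conceptually, the conversion maps a structured finding onto a spec skeleton. A toy sketch follows; the BugFinding shape and generator are hypothetical, not the agent's real format:

```typescript
// Toy sketch of bug-finding → regression-test conversion.
interface BugFinding {
  title: string;
  steps: string[];
  expected: string;
}

function toPlaywrightSpec(bug: BugFinding): string {
  // Reproduction steps become commented placeholders for real page actions.
  const steps = bug.steps.map((s) => `  // ${s}`).join("\n");
  return [
    `test(${JSON.stringify(`regression: ${bug.title}`)}, async ({ page }) => {`,
    steps,
    `  // expect: ${bug.expected}`,
    `});`,
  ].join("\n");
}
```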

Browser-Use CLI Dependency

External Dependency: browser-use

The qa-explorer agent relies on the browser-use CLI—an external tool that enables AI agents to control a real browser autonomously.

Real browser control: Navigates pages, clicks elements, fills forms, and simulates real user behavior
AI-driven exploration: The agent decides what to try next based on what it sees, enabling truly autonomous testing
Beyond scripted tests: Can discover issues that predefined test scripts would never think to check

Note

The browser-use CLI must be installed separately. See the browser-use documentation for installation instructions.

Why Adversarial Testing?

Finds Edge Cases

Tests combinations and scenarios that developers didn't anticipate—unusual inputs, rapid sequences, boundary conditions.

Simulates Real Users

Real users don't follow happy paths. They click randomly, navigate unexpectedly, and use the app in ways you never imagined.

Permanent Protection

Every bug found becomes a regression test. Once discovered, that issue can never silently return to your codebase.

E2E Quality Patterns

Quality testing goes beyond basic assertions. The ui-test-ux-quality skill provides specialized patterns for catching visual glitches, performance issues, layout shifts, and intermediate bad states that users experience but traditional tests miss.

When to Use Quality Patterns

Use these patterns when you need to verify the experience of using your UI, not just the final state. Basic assertions check "does this element exist?" — quality patterns check "did the user see a flash of wrong content?", "did the layout jump?", "was it fast enough?"

✓ Use quality patterns for:

  • Loading states that shouldn't flash
  • Drag-and-drop interactions
  • Animations and transitions
  • Performance-critical pages
  • Layout-sensitive components

✗ Basic assertions suffice for:

  • Form validation messages
  • Navigation between pages
  • Simple CRUD operations
  • Static content verification
  • API response checks

1. assertNeverAppears

Verifies that an element never appears during an action — catches flickers, loading state flashes, and momentary error displays that users see but final-state assertions miss.

flicker-test.spec.ts
// Watch for skeleton flash during cached navigation
await assertNeverAppears(
  page,
  '[data-testid="skeleton"]',
  async () => {
    await page.click('[data-testid="cached-link"]');
    await page.waitForSelector('[data-testid="content"]');
  },
  { checkInterval: 16 } // Check every frame (~60fps)
);

// Ensure error toast doesn't flash during successful submit
await assertNeverAppears(
  page,
  '[data-testid="error-toast"]',
  async () => {
    await page.click('[data-testid="submit-button"]');
    await page.waitForSelector('[data-testid="success-message"]');
  }
);

Use when: Testing cached navigations, optimistic updates, or any action where intermediate states shouldn't be visible.

2. withPerformanceBudget

Enforces performance constraints as test assertions. Fails the test if an action exceeds time or memory budgets — catches performance regressions before they reach production.

performance.spec.ts
// Dashboard must load within 2 seconds
await withPerformanceBudget(
  page,
  { timeout: 2000 },
  async () => {
    await page.goto('/dashboard');
    await page.waitForSelector('[data-testid="dashboard-loaded"]');
  }
);

// Search should respond within 500ms with reasonable memory
await withPerformanceBudget(
  page,
  { timeout: 500, maxHeapUsage: 50 * 1024 * 1024 }, // 50MB
  async () => {
    await page.fill('[data-testid="search"]', 'query');
    await page.waitForSelector('[data-testid="results"]');
  }
);

Use when: Protecting critical user journeys from performance regressions, especially after optimization work.

3. assertNoLayoutShift

Captures element positions before and after an action, failing if any tracked element moved unexpectedly. Prevents the frustrating experience of clicking a button that jumps away.

layout-shift.spec.ts
// Verify ad loading doesn't shift article content
await assertNoLayoutShift(
  page,
  ['article', '.sidebar', '.cta-button'], // Elements to track
  async () => {
    // Wait for lazy-loaded ad to appear
    await page.waitForSelector('[data-testid="ad-loaded"]');
  }
);

// Ensure image loading doesn't shift text below
await assertNoLayoutShift(
  page,
  ['.hero-text', '.nav-button'],
  async () => {
    await page.waitForSelector('img[data-loaded="true"]');
  },
  { tolerance: 2 } // Allow 2px variance for subpixel rendering
);

Use when: Testing pages with lazy-loaded content, images without dimensions, or dynamic elements that could push other content around.

4. assertStableRender

Monitors an element's content over time, failing if it changes unexpectedly. Catches React hydration mismatches, flickering values, and components that re-render with different content.

stable-render.spec.ts
// Price shouldn't flicker between different values
await assertStableRender(
  page,
  '[data-testid="price"]',
  { duration: 1000 } // Watch for 1 second
);

// Dashboard metrics should stabilize after load
await page.waitForSelector('[data-testid="dashboard-ready"]');
await assertStableRender(
  page,
  '[data-testid="total-revenue"]',
  { 
    duration: 500,
    allowedChanges: 1 // Allow one update, then must stabilize
  }
);

Use when: Testing SSR hydration, real-time data displays, or any component where content flickering would confuse users.

5. measureCLS

Measures Cumulative Layout Shift using the browser's Performance Observer API. Returns the actual CLS score for assertions or reporting — essential for Core Web Vitals compliance.

cls-measurement.spec.ts
// Measure CLS during full page lifecycle
const cls = await measureCLS(page, async () => {
  await page.goto('/article');
  await page.waitForLoadState('networkidle');
  // Wait for all lazy content
  await page.waitForTimeout(2000);
});

// Google's "Good" threshold is < 0.1
expect(cls).toBeLessThan(0.1);

// For stricter pages, use tighter threshold
const checkoutCLS = await measureCLS(page, async () => {
  await page.goto('/checkout');
  await page.waitForSelector('[data-testid="checkout-ready"]');
});
expect(checkoutCLS).toBeLessThan(0.05);

Use when: Monitoring Core Web Vitals, testing pages with ads or dynamic content, or after any layout-related changes.

6. assertStateStability

Verifies that once a desired state is reached, it doesn't regress. Catches race conditions where success briefly appears then reverts to loading or error states.

state-stability.spec.ts
// Once saved, the button shouldn't revert to "saving..."
await assertStateStability(
  page,
  '[data-testid="save-button"]',
  {
    desiredState: { text: 'Saved' },
    forbiddenStates: [{ text: 'Saving...' }, { text: 'Save' }],
    duration: 2000
  }
);

// Verify successful state persists after form submit
await page.click('[data-testid="submit"]');
await assertStateStability(
  page,
  '[data-testid="status"]',
  {
    desiredState: { attribute: 'data-status', value: 'success' },
    forbiddenStates: [
      { attribute: 'data-status', value: 'loading' },
      { attribute: 'data-status', value: 'error' }
    ],
    duration: 3000
  }
);

Use when: Testing async operations, optimistic updates with server reconciliation, or any action with multiple state transitions.

7. expectMutualExclusivity

Asserts that certain UI states never coexist — catches impossible states like showing both a loading spinner and loaded content, or both success and error messages simultaneously.

mutual-exclusivity.spec.ts
// Loading and content should never appear together
await expectMutualExclusivity(
  page,
  ['[data-testid="loading"]', '[data-testid="content"]'],
  async () => {
    await page.goto('/dashboard');
    await page.waitForSelector('[data-testid="content"]');
  }
);

// Success and error toasts are mutually exclusive
await expectMutualExclusivity(
  page,
  [
    '[data-testid="success-toast"]',
    '[data-testid="error-toast"]',
    '[data-testid="warning-toast"]'
  ],
  async () => {
    await page.click('[data-testid="submit"]');
    await page.waitForSelector('[data-testid*="toast"]');
  }
);

Use when: Testing state machines, loading/error states, or any UI with states that should be mutually exclusive.
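The core of such a check is a snapshot: record which of the watched selectors are visible at a given instant, and treat more than one as a violation. A minimal, framework-agnostic sketch (names are assumptions; the real skill helper runs this check against Playwright locators while the action executes):

```typescript
// Sketch only: the real expectMutualExclusivity helper may differ.
type VisibilityProbe = (selector: string) => Promise<boolean>;

// Returns the selectors that are simultaneously visible when more than
// one is (the evidence of a violation), or an empty array otherwise.
async function findCoexistingStates(
  selectors: string[],
  isVisible: VisibilityProbe,
): Promise<string[]> {
  const visible: string[] = [];
  for (const selector of selectors) {
    if (await isVisible(selector)) visible.push(selector);
  }
  return visible.length > 1 ? visible : [];
}
```

In the real helper this snapshot would be taken on an interval while the action runs, failing the test on the first non-empty result.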

Full Skill Documentation

The ui-test-ux-quality skill includes implementation details, helper functions, and integration guidance. Load it with Loading skill: ui-test-ux-quality or see the full documentation.

View ui-test-ux-quality skill documentation

Mutation Testing Pattern

The 3-step mutation testing pattern ensures state changes are truly persisted, not just optimistically displayed. This pattern catches bugs at three distinct verification stages—from immediate UI feedback to permanent data persistence.

Why Three Stages?

Single-assertion tests only verify the final state. But users experience the full journey: they click a button, see it respond, wait for completion, and expect the change to survive a refresh. The 3-step pattern tests what users actually experience—catching bugs that final-state tests miss.

The Three Verification Stages

1. Immediate State

Assert right after the action. Verifies the UI responds immediately to user input.

Catches

  • Action handler bugs
  • Validation errors
  • Event binding issues
  • Missing optimistic updates

Example

"After clicking save, button shows 'Saving...'"

2. Stable State

Assert after async operations settle. Verifies the operation completed successfully.

Catches

  • Async bugs
  • Race conditions
  • Optimistic UI mismatches
  • API error handling

Example

"After save completes, success toast appears"

3. Persistence

Assert after page reload or re-fetch. Verifies the change was actually persisted.

Catches

  • Persistence bugs
  • Cache issues
  • Serialization problems
  • Database transaction failures

Example

"After reload, the saved data is still there"

Complete Pattern Example

Here's a full Playwright test demonstrating all three verification stages for a profile update flow:

mutation-test.spec.ts
test('saving profile updates persists', async ({ page }) => {
  // Setup: Navigate to the profile page
  await page.goto('/profile');
  await page.waitForSelector('[name="bio"]');

  // === Stage 1: Immediate State ===
  // Assert right after the action
  await page.fill('[name="bio"]', 'New bio text');
  await page.click('button:text("Save")');
  
  // Verify the button shows loading state immediately
  await expect(page.locator('button:text("Save")'))
    .toHaveAttribute('aria-busy', 'true');
  
  // Verify optimistic update appears in form
  await expect(page.locator('[name="bio"]'))
    .toHaveValue('New bio text');

  // === Stage 2: Stable State ===
  // Assert after async operations settle
  await expect(page.locator('[role="alert"]'))
    .toHaveText('Profile saved');
  
  await expect(page.locator('button:text("Save")'))
    .not.toHaveAttribute('aria-busy', 'true');
  
  // Verify the form still shows the correct value
  await expect(page.locator('[name="bio"]'))
    .toHaveValue('New bio text');

  // === Stage 3: Persistence ===
  // Assert after page reload or re-fetch
  await page.reload();
  await page.waitForSelector('[name="bio"]');
  
  // Verify the saved data survived the refresh
  await expect(page.locator('[name="bio"]'))
    .toHaveValue('New bio text');
});

Why Each Stage Matters

Each stage catches a different category of bugs. Skipping any stage leaves blind spots in your test coverage:

1. Immediate State catches UX failures

If Stage 1 fails, users don't know their action registered. They might click again, causing duplicate submissions. Or they might think the app is broken and leave. Common bugs: missing onClick handlers, broken form bindings, disabled state not applying.

2. Stable State catches async failures

If Stage 2 fails, the optimistic update showed success but the server rejected it. Or a race condition caused the UI to revert. Users see a "success" that disappears. Common bugs: unhandled API errors, race conditions in state updates, missing error boundaries.

3. Persistence catches data loss

If Stage 3 fails, everything looked correct but the data was never actually saved. Users think their work is safe, but refreshing the page reveals it's gone. Common bugs: cache-only updates without API calls, transaction rollbacks, serialization errors.

More Pattern Examples

The 3-step pattern applies to any mutation operation. Here are additional examples:

delete-item.spec.ts — Delete operation
test('deleting an item removes it permanently', async ({ page }) => {
  // Stage 1: Immediate - item starts fading/strikethrough
  await page.click('[data-testid="delete-item-1"]');
  await expect(page.locator('[data-testid="item-1"]'))
    .toHaveClass(/deleting/);

  // Stage 2: Stable - item is removed from list
  await expect(page.locator('[data-testid="item-1"]'))
    .not.toBeVisible();
  await expect(page.locator('[data-testid="toast"]'))
    .toHaveText('Item deleted');

  // Stage 3: Persistence - item stays gone after reload
  await page.reload();
  await expect(page.locator('[data-testid="item-1"]'))
    .not.toBeVisible();
});
create-task.spec.ts — Create operation
test('creating a task adds it to the list', async ({ page }) => {
  // Stage 1: Immediate - new task appears optimistically
  await page.fill('[data-testid="new-task"]', 'Buy groceries');
  await page.click('[data-testid="add-task"]');
  await expect(page.locator('text=Buy groceries')).toBeVisible();
  await expect(page.locator('[data-testid="add-task"]'))
    .toBeDisabled(); // Prevent double-submit

  // Stage 2: Stable - task gets permanent ID, button re-enabled
  await expect(page.locator('[data-testid="add-task"]'))
    .toBeEnabled();
  const task = page.locator('text=Buy groceries');
  await expect(task).toHaveAttribute('data-id', /.+/); // Has server ID

  // Stage 3: Persistence - task survives page refresh
  await page.reload();
  await expect(page.locator('text=Buy groceries')).toBeVisible();
});

When to Apply This Pattern

Use the 3-step mutation testing pattern for any operation that modifies data: creates, updates, deletes, settings changes, form submissions, and user preference updates. Skip it only for read-only operations like navigation and search.

  • CRUD operations
  • Form submissions
  • Settings changes
  • User preferences
  • File uploads

Testing in the Workflow

Here's how testing agents integrate with the broader toolkit workflow:

1. @builder completes a feature

Implementation is done, code is ready for testing

2. tester orchestrator is invoked

Analyzes changed files and determines which specialists to call

3. Specialists generate tests

react-tester, jest-tester, etc. write tests for their domains

4. Tests run and verify

If tests pass, feature is ready. If they fail, issues are reported back.

5. E2E tests run (if configured)

ui-tester-playwright runs full browser tests for complete coverage

Electron Desktop Testing

For Electron desktop apps, the toolkit uses Playwright's Electron API instead of browser-based testing. The ui-test-electron skill is automatically loaded when your project includes an Electron app entry.

Playwright Web vs Playwright Electron

Web (Standard)

  • Connects to URL via browser
  • Uses page.goto()
  • Standard DOM selectors

Electron

  • Launches Electron binary directly
  • Uses electron.launch()
  • Access to main + renderer processes

project.json Configuration

Configure your Electron app in the apps[] array:

{
  "apps": [
    {
      "name": "desktop",
      "type": "electron",
      "devServer": {
        "startCommand": "npm run electron:dev",
        "port": null,  // Electron doesn't use HTTP port
        "readyPattern": "Electron ready"
      },
      "electron": {
        // Path to built executable (for production testing)
        "executablePath": "dist/MyApp-darwin-arm64/MyApp.app",
        
        // Args to launch in dev mode (uses electron .)
        "devLaunchArgs": [".", "--no-sandbox"]
      },
      // Architecture detection for verification strategy
      "webContent": "bundled",  // bundled | remote | hybrid
      "remoteUrl": null         // Only for remote/hybrid apps
    }
  ]
}

executablePath

Path to your built Electron app. On macOS, this is typically the .app bundle. On Windows and Linux, point to the executable directly.

devLaunchArgs

Arguments passed to electron binary during development. The first arg is typically "." to run from project root.

port: null

Unlike web apps, Electron apps don't expose an HTTP port. Set port: null to indicate Playwright should launch the binary directly instead of connecting to a URL.

webContent

Describes how the app loads its UI content. Used for architecture-aware verification strategy selection:

  • bundled — UI is packaged with the app (file:// protocol)
  • remote — UI loads from a remote URL
  • hybrid — Mix of bundled shell with remote content

remoteUrl

For remote or hybrid apps, specify the URL where the UI content is loaded from. Set to null for bundled apps.
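Putting webContent and remoteUrl together, a remote-content app entry might look like this (the URL is illustrative):

```json
{
  "apps": [
    {
      "name": "desktop",
      "type": "electron",
      "devServer": {
        "startCommand": "npm run electron:dev",
        "port": null,
        "readyPattern": "Electron ready"
      },
      "webContent": "remote",
      "remoteUrl": "https://app.example.com"
    }
  ]
}
```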

Zombie Process Cleanup

Electron apps can leave zombie processes if tests fail or are interrupted. The ui-test-electron skill includes a globalSetup.ts pattern that cleans up orphaned processes before each test run:

// playwright/globalSetup.ts
import { execSync } from 'child_process';

export default async function globalSetup() {
  // Kill any orphaned Electron processes from previous runs
  try {
    if (process.platform === 'darwin') {
      execSync('pkill -f "Electron" || true', { stdio: 'ignore' });
    } else if (process.platform === 'win32') {
      execSync('taskkill /F /IM electron.exe 2>nul || exit 0', { stdio: 'ignore' });
    } else {
      execSync('pkill -f electron || true', { stdio: 'ignore' });
    }
  } catch {
    // Ignore errors if no processes found
  }
  
  // Brief pause to ensure cleanup completes
  await new Promise(resolve => setTimeout(resolve, 500));
}
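The cleanup only runs if the file is registered as Playwright's global setup; assuming the path above, the registration would look like:

```typescript
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Runs once before the whole suite, cleaning up orphaned Electron processes
  globalSetup: './playwright/globalSetup.ts',
});
```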

Pre-test verification: The test-flow skill runs a zombie process pre-check before Electron tests. If orphaned processes are detected, they are cleaned up automatically to prevent "another instance already running" errors.

How the ui-test-electron Skill Works

1. Skill is loaded automatically

The ui-test-flow skill detects Electron projects via architecture.deployment or apps.*.framework and automatically routes to the ui-test-electron skill, skipping browser-specific setup.

2. Playwright launches Electron

Tests use electron.launch() with your devLaunchArgs or executablePath.

3. Tests interact with the window

Standard Playwright page APIs work on the Electron renderer process. Main process can be accessed via electronApp.evaluate().
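The steps above can be sketched with Playwright's Electron API roughly like this (the launch args mirror the devLaunchArgs example earlier; the data-testid selector is hypothetical):

```typescript
import { _electron as electron } from 'playwright';

(async () => {
  // Step 2: launch the Electron binary directly (equivalent of `electron . --no-sandbox`)
  const app = await electron.launch({ args: ['.', '--no-sandbox'] });

  // Step 3a: the renderer is a standard Page, so normal selectors work
  const window = await app.firstWindow();
  await window.click('[data-testid="open-settings"]');

  // Step 3b: evaluate in the main process to reach Electron APIs
  const version = await app.evaluate(({ app: electronApp }) => electronApp.getVersion());
  console.log(`App version: ${version}`);

  await app.close();
})();
```

Because firstWindow() returns a regular Playwright Page, the assertion patterns from earlier sections apply to Electron tests unchanged.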

Multi-Platform Testing

If your project has both web and Electron targets in the apps[] array, agents automatically detect which platform a test targets based on the test file location or explicit annotations. See the Multi-Platform Apps section for configuration details.

Test Verify Settings

The testVerifySettings object in project.json controls which automated Playwright invocation points are enabled. All settings default to true when absent, so the system runs all verification steps unless you explicitly opt out.

What These Settings Control

These settings gate automated Playwright invocations triggered during the build workflow. They do not gate user-invoked workflows like @qa or @ui-test-full-app-auditor, nor do they affect test file creation or maintenance.

Configuration

// project.json
{
  "testVerifySettings": {
    "adHocUIVerify_Analysis": true,
    "adHocUIVerify_StoryTest": true,
    "adHocUIVerify_CompletionTest": true,
    "prdUIVerify_Analysis": true,
    "prdUIVerify_StoryTest": true,
    "prdUIVerify_PRDCompletionTest": true
  }
}

Settings Reference

Setting | Mode | Description
adHocUIVerify_Analysis | Ad-hoc | Run Playwright analysis probe after code changes (adhoc-workflow Step 0.1b)
adHocUIVerify_StoryTest | Ad-hoc | Write and run Playwright tests for completed tasks (test-flow Step 5 ad-hoc)
adHocUIVerify_CompletionTest | Ad-hoc | Run holistic Playwright tests covering the full batch of changes at task spec completion
prdUIVerify_Analysis | PRD | Run per-story Playwright verification after implementation (test-flow Step 3 PRD)
prdUIVerify_StoryTest | PRD | Write and run per-story Playwright tests (test-flow Step 5 PRD, tester Step 7)
prdUIVerify_PRDCompletionTest | PRD | Generate deferred UI tests at PRD completion (prd-workflow Ship Phase "G" option)

Common Patterns

Skip analysis probes, keep test generation

Useful when analysis probes are slow but you still want Playwright tests written for each story.

"testVerifySettings": {
  "adHocUIVerify_Analysis": false,
  "prdUIVerify_Analysis": false
}

Disable all automated Playwright

For projects that rely on manual QA or external CI for UI testing. You can still invoke @qa or @ui-test-full-app-auditor directly.

"testVerifySettings": {
  "adHocUIVerify_Analysis": false,
  "adHocUIVerify_StoryTest": false,
  "adHocUIVerify_CompletionTest": false,
  "prdUIVerify_Analysis": false,
  "prdUIVerify_StoryTest": false,
  "prdUIVerify_PRDCompletionTest": false
}

Default Behavior

If the testVerifySettings object is absent from project.json, all six settings default to true. This means existing projects get full automated Playwright verification without any configuration changes.
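The lookup behind this default can be sketched as follows (the helper name is hypothetical; the setting keys are the real ones from the table above):

```typescript
// Hypothetical helper illustrating the absent-defaults-to-true rule.
type TestVerifySettings = Partial<Record<string, boolean>>;

function isVerifyEnabled(
  settings: TestVerifySettings | undefined,
  key: string,
): boolean {
  // A missing settings object and a missing key both default to true.
  return settings?.[key] ?? true;
}

console.log(isVerifyEnabled(undefined, 'adHocUIVerify_Analysis')); // true
console.log(isVerifyEnabled({ adHocUIVerify_Analysis: false }, 'adHocUIVerify_Analysis')); // false
```

Only an explicit false disables a verification step; absence leaves it enabled.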

CORS & Browser Verification

Cross-Origin Resource Sharing (CORS) is a browser security mechanism that controls which domains can access resources from another domain. This has important implications for how agents verify API behavior.

⚠️ Critical: CORS Is Browser-Enforced

Agents must never use curl, wget, or similar CLI tools to verify CORS behavior. CORS headers are enforced by browsers, not by servers or CLI tools.

Why CLI Tools Cannot Test CORS

CORS works as follows:

  1. Browser makes a preflight OPTIONS request
  2. Server responds with CORS headers (Access-Control-Allow-Origin, etc.)
  3. Browser decides whether to allow or block the actual request

CLI tools like curl perform neither the preflight nor the enforcement step—they receive the response regardless of CORS headers. A curl request succeeding therefore tells you nothing about whether a browser would allow the same request.

Correct CORS Verification Methods

Method | When to Use
Playwright E2E test | Primary method—runs in a real browser context
Browser DevTools (manual) | Quick verification during development
QA adversarial agent | Exploratory testing of cross-origin scenarios

Example: Playwright CORS Test

test('API allows cross-origin requests from allowed domain', async ({ page }) => {
  // Navigate to the allowed origin
  await page.goto('https://allowed-origin.example.com');
  
  // Make cross-origin request from browser context
  const response = await page.evaluate(async () => {
    const res = await fetch('https://api.example.com/data');
    return { ok: res.ok, status: res.status };
  });
  
  expect(response.ok).toBe(true);
});

test('API blocks cross-origin requests from disallowed domain', async ({ page }) => {
  await page.goto('https://disallowed-origin.example.com');
  
  // This should fail due to CORS. Browsers surface the block as a generic
  // TypeError (e.g. "Failed to fetch"); the CORS detail appears only in the
  // console, so assert that the request failed rather than matching the message.
  const error = await page.evaluate(async () => {
    try {
      await fetch('https://api.example.com/data');
      return null;
    } catch (e) {
      return (e as Error).message;
    }
  });
  
  expect(error).not.toBeNull();
});

Agent Enforcement

The security-critic and backend-critic agents are configured to flag any CORS verification that uses CLI tools. If you see a CORS test using curl, the test is invalid and must be rewritten to use browser-based verification.