Test-Driven Design Systems: How We're Enforcing Accessibility and Token Usage at Build Time

Here's a problem we didn't see coming: adoption of our design system was excellent, but compliance with it was inconsistent. Components would ship with hardcoded colors. Contrast ratios would fail WCAG AA. Spacing values would drift from our Aurora tokens.

The usual solution is post-launch audits and Slack reminders. We tried something different: treating design system compliance the same way we treat broken business logic. If a component fails contrast requirements or uses a deprecated token, the test suite fails, and once the audit is promoted to blocking status, CI blocks the merge.

The Pattern: Audit Files That Look Like Tests

We created a pattern called audit test files. They sit alongside regular component tests but enforce architectural guardrails instead of functional behavior.

Take our WCAG contrast audit. The file blog-page.audit.test.tsx loads the published blog index page, extracts every text element's computed color and background, and validates contrast ratios against WCAG AA thresholds (4.5:1 for normal text, 3:1 for large text). If a component fails, the test output shows exactly which element, which colors, and the actual ratio.
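The contrast check itself is small. Here is a minimal sketch of the math, following the WCAG 2.x definitions of relative luminance and contrast ratio. The function names (parseHex, contrastRatio, isLargeText, meetsAA) are illustrative and not taken from our audit file, and the sketch only handles 6-digit hex colors, whereas a real audit would also need to parse the rgb() strings that computed styles return.

```typescript
// Sketch of the WCAG 2.x contrast math. All names are illustrative.
type Rgb = { r: number; g: number; b: number };

// Parse a 6-digit hex color such as "#1a2b3c"; returns null for anything else.
function parseHex(hex: string): Rgb | null {
  const trimmed = hex.trim();
  if (!/^#[0-9a-fA-F]{6}$/.test(trimmed)) return null;
  const value = parseInt(trimmed.slice(1), 16);
  return { r: (value >> 16) & 0xff, g: (value >> 8) & 0xff, b: value & 0xff };
}

// Convert one 0-255 sRGB channel to its linear value (WCAG 2.x formula).
function channelToLinear(channel: number): number {
  const c = channel / 255;
  return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Relative luminance: weighted sum of the linearized channels.
function relativeLuminance(color: Rgb): number {
  return (
    0.2126 * channelToLinear(color.r) +
    0.7152 * channelToLinear(color.g) +
    0.0722 * channelToLinear(color.b)
  );
}

// Contrast ratio: (lighter + 0.05) / (darker + 0.05), ranging from 1 to 21.
function contrastRatio(a: Rgb, b: Rgb): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

// WCAG treats text as "large" at 18pt (24px) or at 14pt bold (about 18.67px).
function isLargeText(fontSizePx: number, isBold: boolean): boolean {
  return fontSizePx >= 24 || (isBold && fontSizePx >= (14 * 96) / 72);
}

// AA thresholds: 4.5:1 for normal text, 3:1 for large text.
function meetsAA(ratio: number, largeText: boolean): boolean {
  return ratio >= (largeText ? 3 : 4.5);
}
```

Mid-gray text is where this tends to bite: #777777 on white comes out just under 4.5:1, so it fails for body copy but passes for large headings.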

For the Aurora token audit, we took a different approach: static source analysis. The file os-thesis-section.audit.test.tsx reads component source code and flags four violation categories:

Hardcoded color values (hex codes or rgb() instead of semantic tokens)
Non-semantic token usage (aurora-cyan-600 instead of text-primary)
Missing responsive utilities (fixed spacing instead of responsive scale)
Inconsistent spacing patterns (arbitrary values instead of design system scale)
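To make the first two categories concrete, here is a rough sketch of a line-by-line source scan. It is not our actual audit code: the regular expressions, the scanSource name, and the assumption that raw tokens follow an aurora-<hue>-<scale> naming pattern are all illustrative. The responsive and spacing categories are left out because they depend on project-specific rules.

```typescript
// Sketch of a static scan for two violation categories. Names and patterns
// are hypothetical; a real audit would need to handle more color syntaxes.
type ViolationKind = "hardcoded-color" | "non-semantic-token";

interface Violation {
  kind: ViolationKind;
  line: number; // 1-based line number in the scanned source
  match: string; // the offending text
}

// Hex codes (#fff, #ff0000, ...) and rgb()/rgba() calls. This can produce
// false positives, e.g. an anchor link like "#add".
const HARDCODED_COLOR = /#[0-9a-fA-F]{3,8}\b|rgba?\([^)]*\)/g;

// Raw palette tokens, assuming an "aurora-<hue>-<scale>" naming convention.
const RAW_AURORA_TOKEN = /\baurora-[a-z]+-\d{2,3}\b/g;

function scanSource(source: string): Violation[] {
  const violations: Violation[] = [];
  source.split("\n").forEach((text, index) => {
    const line = index + 1;
    (text.match(HARDCODED_COLOR) ?? []).forEach((match) => {
      violations.push({ kind: "hardcoded-color", line, match });
    });
    (text.match(RAW_AURORA_TOKEN) ?? []).forEach((match) => {
      violations.push({ kind: "non-semantic-token", line, match });
    });
  });
  return violations;
}

// Format violations so the failure output points at the exact line.
function formatViolations(file: string, violations: Violation[]): string {
  return violations
    .map((v) => `${file}:${v.line} [${v.kind}] ${v.match}`)
    .join("\n");
}
```

In a test, the assertion would simply be that scanSource returns an empty array, with formatViolations supplying the failure message.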

Both audits run in the same Vitest harness that validates functional behavior. Same CI pipeline, same failure modes, same developer experience.

Why This Works Better Than Linters

ESLint and Stylelint are excellent for syntax and pattern enforcement. But they don't see the rendered output. A component can pass linting and still ship with insufficient contrast because the linter doesn't know which colors end up adjacent in the DOM.

The WCAG audit closes this gap. It evaluates the final rendered artifact, the thing users actually see: it loads the full page, computes styles after Tailwind processing, and validates the actual contrast ratios that reach the browser. The Aurora token audit, being static analysis, sits closer to a linter rule; the rendered-output advantage belongs to the WCAG audit.

This also means the tests catch regressions across component boundaries. If a layout component passes a className prop that overrides a child's text color, the WCAG audit will catch the resulting contrast failure even though neither component individually violated design system rules.

Real Numbers: What We Found

When we ran the WCAG contrast audit, we found violations in eight locations across three page types and five shared components. Most failures involved muted text on light backgrounds—readable to some users, but below WCAG AA thresholds.

The Aurora token audit surfaced four violation categories across multiple components. The most common issue was direct color token usage (aurora-cyan-600) instead of semantic tokens (text-primary). This worked visually but broke the abstraction that lets us ship dark mode and theming without component rewrites.

Both audits provided exact line numbers and violation context. Fixing them wasn't a research project—the test output told us exactly what to change.

The Migration Path: From Audit to Enforcement

Here's the pattern we're following for both audits:

Phase 1: Run the audit, document violations, fix them in a focused PR
Phase 2: Add the audit test to CI but don't block merges yet (warning mode)
Phase 3: Once the test is green for 2+ weeks, promote it to blocking status
Phase 4: Expand coverage to additional pages and component categories
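One way to express the difference between warning mode and blocking mode is a single switch that decides whether violations fail the build. The sketch below is an assumption about how that could look, not a description of our CI configuration; AuditMode, shouldFailBuild, and readyToPromote are hypothetical names, and "2+ weeks" is modeled as 14 consecutive green daily runs.

```typescript
// Sketch of the warn/block switch and the promotion rule. Names are hypothetical.
type AuditMode = "warn" | "block";

interface AuditResult {
  name: string;
  violationCount: number;
}

// In "warn" mode violations are reported but never fail the build;
// in "block" mode any violation fails it.
function shouldFailBuild(results: AuditResult[], mode: AuditMode): boolean {
  const hasViolations = results.some((r) => r.violationCount > 0);
  return mode === "block" && hasViolations;
}

// Phase 3 rule: promote to blocking only after the most recent
// `requiredGreenDays` daily runs all reported zero violations.
function readyToPromote(dailyViolationCounts: number[], requiredGreenDays = 14): boolean {
  if (dailyViolationCounts.length < requiredGreenDays) return false;
  return dailyViolationCounts.slice(-requiredGreenDays).every((count) => count === 0);
}
```

Keeping the mode as data rather than deleting or skipping tests means the audit keeps reporting throughout the warning phase, so the promotion decision is based on a real history.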

We're currently in Phase 1 for both audits. The WCAG violations are fixed, and the Aurora token violations have a migration roadmap. The tests exist but aren't blocking CI yet.

This gradual rollout matters because audit tests change the team contract. Once they're blocking, every new component must pass before merge. That's the right long-term state, but forcing it immediately would create friction and workaround pressure.

What This Enables: Confident Refactoring

The long-term value of audit tests isn't the initial cleanup—it's the confidence they provide during refactoring. When we migrate from Aurora tokens to a new design system version, the audit tests will tell us immediately if the migration breaks accessibility or token semantics.

The same benefit applies to component library updates. If we upgrade a shared component and accidentally introduce a contrast violation, the test suite catches it before the change reaches production. No manual QA pass required, no accessibility regression tickets three weeks later.

This is the same confidence property that makes unit tests valuable. The test suite becomes a specification of required behavior, and refactoring becomes safe as long as the tests stay green.

The Pattern Is Portable

We're using this pattern for accessibility and design tokens, but it generalizes to any architectural guardrail:

Performance budgets (page weight, bundle size, LCP thresholds)
Security headers (CSP directives, CORS configuration)
SEO requirements (meta tags, structured data, sitemap coverage)
API contract enforcement (response schemas, rate limit headers)
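As an illustration of how the same shape carries over, here is a sketch of a performance budget check. The metric names and budget values are placeholders we made up for the example (2500 ms for LCP matches the commonly cited "good" threshold; the bundle size figure is arbitrary), and checkBudgets is a hypothetical helper, not an existing audit.

```typescript
// Sketch of a budget audit: compare measured values against declared limits.
// Metric names, limits, and function names are illustrative assumptions.
interface Budget {
  metric: string;
  max: number;
  unit: string;
}

interface BudgetFailure {
  metric: string;
  message: string;
}

function checkBudgets(measured: Record<string, number>, budgets: Budget[]): BudgetFailure[] {
  const failures: BudgetFailure[] = [];
  for (const budget of budgets) {
    const actual = measured[budget.metric];
    if (actual === undefined) {
      // A missing measurement is reported rather than silently passing.
      failures.push({ metric: budget.metric, message: `${budget.metric}: no measurement recorded` });
    } else if (actual > budget.max) {
      failures.push({
        metric: budget.metric,
        message: `${budget.metric}: ${actual}${budget.unit} exceeds budget of ${budget.max}${budget.unit}`,
      });
    }
  }
  return failures;
}

// Example budgets; the numbers are placeholders, not our real thresholds.
const exampleBudgets: Budget[] = [
  { metric: "lcpMs", max: 2500, unit: "ms" },
  { metric: "bundleKb", max: 200, unit: "KB" },
];
```

As with the other audits, the test asserts that the failure list is empty and prints the messages when it is not.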

Any requirement that's important enough to document is important enough to test. And any test that enforces architectural quality should run in CI and block broken builds.

What's Next

We're expanding audit test coverage to product pages (Strug Works, Sabine, Feedtumi, Poppin) and building a performance budget audit that validates Core Web Vitals thresholds. The pattern has held up for the first two audits; now we're scaling it across the platform.

The goal isn't perfect compliance on day one. It's a system that makes compliance the path of least resistance, catches regressions automatically, and gives us confidence to refactor without manual audits.

That's how you build quality into the development process instead of bolting it on afterward.