Fixing the Schema Gap: Legal Document Domain Mapping

Sometimes the most interesting bugs aren't in the logic; they're in the assumptions. This week we fixed a schema mismatch in Sabine's legal document ingest pipeline, which was emitting domain categories our database didn't recognize.

The Problem

Our legal document ingest pipeline was generating domain hints like 'vehicle,' 'financial,' and 'medical,' all perfectly reasonable categories for legal documents. The problem? Our database schema only recognizes three domain values: 'work,' 'family,' and 'persona.' When the ingest system tried to store the legal-specific values, none of them matched the schema, and it hit a wall.

This is a classic case of specialized pipeline logic running ahead of core schema design. The legal ingest module was built with domain-specific categories that made sense in isolation, but hadn't been reconciled with the database constraints that govern the entire biographical data layer.

The Fix

The solution was straightforward: map the legal domain hints to valid database enum values. Vehicle and financial documents now map to 'work,' medical documents to 'family,' and general legal documents to 'persona.' While we were in the code, we also cleaned up debug logging that was cluttering production output, a small quality-of-life improvement.
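To make the mapping concrete, here is a minimal sketch of what such a lookup could look like. The hint strings and target domains come from the description above; the names `Domain`, `LEGAL_HINT_TO_DOMAIN`, and `map_legal_domain` are hypothetical, and treating every unlisted hint as a "general legal document" is an assumption, not necessarily how the pipeline behaves.

```python
from enum import Enum


class Domain(str, Enum):
    """The three domain values the database schema recognizes."""
    WORK = "work"
    FAMILY = "family"
    PERSONA = "persona"


# Hypothetical lookup table: hint strings are taken from the post,
# the table name and structure are illustrative only.
LEGAL_HINT_TO_DOMAIN = {
    "vehicle": Domain.WORK,
    "financial": Domain.WORK,
    "medical": Domain.FAMILY,
}


def map_legal_domain(hint: str) -> Domain:
    """Map a legal-ingest domain hint to a valid database domain.

    Assumption: any hint not listed above is treated as a general
    legal document and falls back to PERSONA.
    """
    normalized = hint.strip().lower()
    return LEGAL_HINT_TO_DOMAIN.get(normalized, Domain.PERSONA)
```

Keeping the table in one place means the pipeline can only ever hand the database a member of `Domain`, whatever hints the legal module produces.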

The mapping isn't perfect (calling a car lease agreement 'work' is semantically fuzzy), but it's pragmatic. It gets legal documents into the system without expanding our core schema, which would have cascading effects across retrieval, UI filters, and agent context building.

What We Learned

Specialized ingest pipelines need tighter coordination with core schema definitions. When you're building modular systems, it's easy for domain-specific logic to drift from shared constraints. The fix was small, but it highlighted a gap in how we validate ingest outputs against database schemas before deployment.

What's Next

We're evaluating whether to expand the domain enum to include legal-specific categories or keep the current three-domain model and rely on document type metadata for finer categorization. Expanding the schema gives us more precision but adds complexity to retrieval and UI logic. Staying lean keeps the system simpler but sacrifices some semantic clarity.
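As a rough illustration of the second option, the sketch below keeps the three-value domain and carries the finer category in a separate metadata field. The `LegalDocument` class, the `document_type` field, and `filter_by_type` are all hypothetical names chosen for this example; they are not taken from the actual codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LegalDocument:
    title: str
    domain: str         # stays within "work", "family", "persona"
    document_type: str  # hypothetical metadata field, e.g. "vehicle"


def filter_by_type(documents, document_type):
    """Select documents by the finer-grained metadata, ignoring domain."""
    return [d for d in documents if d.document_type == document_type]


docs = [
    LegalDocument("Car lease", "work", "vehicle"),
    LegalDocument("Clinic consent form", "family", "medical"),
]
vehicle_docs = filter_by_type(docs, "vehicle")
```

Under this design the domain enum never changes, and precision comes from filtering on metadata, at the cost of the domain label itself staying semantically fuzzy.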

We're also adding schema validation checks to the ingest pipeline test suite so future domain mismatches get caught before merge. Small fixes, big leverage.
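A check of this kind might look roughly like the sketch below. It assumes ingest output can be represented as dictionaries with a `domain` key and that the allowed values are hardcoded; in practice the allowed set would more likely be read from the schema itself. The names `ALLOWED_DOMAINS` and `find_invalid_domains` are hypothetical.

```python
# Assumption: the allowed values are hardcoded here for illustration;
# a real check would ideally load them from the database schema.
ALLOWED_DOMAINS = frozenset({"work", "family", "persona"})


def find_invalid_domains(records, allowed=ALLOWED_DOMAINS):
    """Return the sorted list of domain values the schema does not allow."""
    seen = {record["domain"] for record in records}
    return sorted(seen - set(allowed))


def test_ingest_domains_match_schema():
    # Stand-in for real ingest output; file names are made up.
    records = [
        {"doc": "car_lease.pdf", "domain": "work"},
        {"doc": "clinic_form.pdf", "domain": "family"},
        {"doc": "nda.pdf", "domain": "persona"},
    ]
    assert find_invalid_domains(records) == []
```

Had a check like this been run against the unmapped hints, it would have reported values such as 'vehicle' as invalid before the code reached the database.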