Voice interfaces are unforgiving. When they work, they feel like magic. When they break, users notice immediately. This week we fixed two related issues in Sabine that were eroding the voice chat experience: context loss after browser reload and noisy transcription artifacts from Whisper.
What Changed
The first issue was straightforward but frustrating: if you refreshed your browser mid-conversation, Sabine forgot everything you'd discussed. The voice session state wasn't being persisted to browser storage, so a reload meant starting over. For users in long planning sessions or debugging conversations, this was a dealbreaker.
The second issue was more subtle. Whisper, the transcription model we use, occasionally outputs silence markers and non-verbal artifacts—little transcription hiccups that show up as '[BLANK_AUDIO]' or similar noise in the chat history. These don't affect understanding, but they clutter the interface and make conversation history harder to scan.
We solved both in this commit. Voice chat context—including conversation history, active session state, and user preferences—now persists to localStorage on every turn. On reload, the app hydrates from that stored state and resumes exactly where you left off. We also implemented a filter layer between Whisper output and the UI that strips known silence artifacts before they hit the chat log.
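As a rough sketch of the persist-and-hydrate flow, assuming a session shape and storage key like the ones below (the names are illustrative, not Sabine's actual code), the write happens after every turn and the read happens once on load. A storage interface is injected so the sketch also runs outside a browser, where `localStorage` would be passed in directly:

```typescript
// Illustrative session shape; field names are assumptions, not Sabine's schema.
interface ChatTurn {
  role: "user" | "assistant";
  text: string;
}

interface VoiceSession {
  history: ChatTurn[];
  sessionId: string | null;
  prefs: Record<string, unknown>;
}

// Structural subset of the browser's Storage API, so the sketch is testable
// without a DOM. In the app this would just be window.localStorage.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const STORAGE_KEY = "sabine.voiceSession"; // hypothetical key

// Called after every conversation turn: serialize the whole session.
export function persistSession(session: VoiceSession, storage: KeyValueStore): void {
  storage.setItem(STORAGE_KEY, JSON.stringify(session));
}

// Called once on page load: restore the stored session, or return null
// (fresh session) if nothing is stored or the stored JSON is corrupt.
export function hydrateSession(storage: KeyValueStore): VoiceSession | null {
  const raw = storage.getItem(STORAGE_KEY);
  if (raw === null) return null;
  try {
    return JSON.parse(raw) as VoiceSession;
  } catch {
    return null; // corrupt state: fall back to starting over
  }
}
```

Treating corrupt or missing state as "start fresh" rather than throwing keeps a bad write from locking users out of the app entirely.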
Why It Matters
Voice interfaces need to be as reliable as text interfaces. When you refresh Slack or a code editor, your work is still there. Voice shouldn't be different. Users expect session persistence as a baseline, and we weren't delivering it.
The Whisper artifact filtering is about polish. Clean transcripts make conversation history more useful for review and reference. It's a small detail, but it's the kind of thing that separates a prototype from a product people trust with real work.
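A filter like this can be a small pure function sitting between transcription output and the chat log. The sketch below is a minimal version; only `[BLANK_AUDIO]` appears in this post, so the other patterns are assumptions about what similar Whisper artifacts might look like:

```typescript
// Known non-verbal artifact patterns to strip from transcripts.
// [BLANK_AUDIO] is from this post; the rest are assumed examples.
const ARTIFACT_PATTERNS: RegExp[] = [
  /\[BLANK_AUDIO\]/g,
  /\[INAUDIBLE\]/g,
  /\(silence\)/gi,
];

// Remove known artifacts, then collapse the whitespace they leave behind.
export function cleanTranscript(raw: string): string {
  let text = raw;
  for (const pattern of ARTIFACT_PATTERNS) {
    text = text.replace(pattern, "");
  }
  return text.replace(/\s{2,}/g, " ").trim();
}
```

Keeping the filter a pure function over the raw string also leaves the door open to the toggle discussed below: the UI can simply skip the call when a user wants unfiltered output.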
What's Next
This fix addressed the immediate pain points, but it opened up a few bigger questions. Should voice context sync across devices, not just persist locally? What about conversation branching, so you could explore a different direction mid-session without losing the original thread? We're tracking both as potential improvements.
We're also evaluating whether the Whisper artifact filter should be user-configurable. Some users might want to see every transcription detail for debugging or accessibility reasons. For now, we're defaulting to clean output, but we're open to making it a toggle if there's demand.
Commit: ac53154 | Author: strugcity | SCE-769