Vividiary Blog
Inside Vividiary

One Conversation Mode, Two Inputs: How Voice and Text Coexist

Vividiary Team · Apr 8, 2026 · 11 min read

Vividiary unified voice and text into a single conversation mode where users switch freely mid-conversation — no separate 'voice mode' or 'text mode.' The engineering challenges included seamless turn transitions (solved with a segment-based architecture), latency hiding for speech-to-text (streaming transcription overlaps with natural attention gaps), and inline error correction. The result: 2.3x more conversation turns compared to a two-mode prototype, producing richer diary drafts. The AI conversation engine is deliberately input-agnostic — it processes ordered segments regardless of whether they were typed or spoken.


The Problem With Separate Modes

Open most journaling apps and you'll find two distinct features: a "text journal" and a "voice journal." They live in separate tabs, produce separate entries, and feel like separate products bolted together.

This separation exists because it's easier to engineer. Voice needs speech-to-text processing, audio buffering, and silence detection. Text needs a keyboard, cursor management, and formatting. Mixing them introduces synchronization complexity.

But from the user's perspective, this separation makes no sense.

When you're talking to a friend about your day, you don't choose a "mode." You might start typing a message, then decide it's easier to send a voice note, then type a follow-up. The medium shifts based on context, environment, and energy — not a mode toggle.

Vividiary treats the AI conversation the same way: one mode, two inputs, switch freely.

What "One Mode" Means in Practice

In Vividiary's conversation:

1. You tap the text input and type "rough day at work"
2. You decide typing is tedious. You hold the mic button and say "basically my manager changed the requirements again and I spent three hours redoing the presentation"
3. The AI responds with a follow-up question (displayed as text)
4. You type a quick "yeah exactly" because you're on a bus and can't speak
5. You hold the mic again for a longer response about how it made you feel

All of this is one continuous conversation. No mode switch. No "exit voice mode." No separate entries. The AI processes text and transcribed voice identically — it's all words by the time it reaches the conversation engine.

The Engineering Challenge

Challenge 1: Seamless Turn Transitions

In a pure-text chat, a "turn" is clear: user types, presses send, AI responds. In a pure-voice app, a turn ends with silence detection.

In our hybrid model, we needed to handle:

  • User types partial message, switches to voice mid-thought
  • User holds mic, speaks, releases — this should send immediately (no extra "send" tap)
  • User alternates rapidly between voice and text within a single emotional train of thought

Solution: We treat the conversation as a stream of "segments." Each segment is either typed-text or transcribed-voice, tagged with its input type. A segment is "committed" when:

  • Text: user taps send
  • Voice: user releases the mic button (push-to-talk model)

The AI receives committed segments in order, regardless of input type. From the AI's perspective, it's just a sequence of user messages.
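
To make that concrete, here's a minimal TypeScript sketch of the segment model. The type names and shapes are simplified for illustration, not our production code:

```typescript
type InputType = "typed" | "voice";

interface Segment {
  id: string;
  inputType: InputType; // tagged at capture time
  text: string;         // typed text, or the STT transcript
  committedAt: number;  // epoch ms at commit
}

class SegmentQueue {
  private segments: Segment[] = [];

  /** Text commits on "send"; voice commits when the mic is released. */
  commit(inputType: InputType, text: string): Segment {
    const segment: Segment = {
      id: crypto.randomUUID(),
      inputType,
      text,
      committedAt: Date.now(),
    };
    this.segments.push(segment);
    return segment;
  }

  /** The conversation engine only ever sees ordered text. */
  orderedMessages(): string[] {
    return this.segments.map((s) => s.text);
  }
}
```

The `inputType` tag survives only for UI purposes (waveforms, edit affordances); it never influences how the engine interprets the words.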

Challenge 2: Latency Hiding for Voice

Speech-to-text adds latency — typically 500ms-2s depending on utterance length. If the user speaks and then has to wait for transcription before they can see their words, the "seamless" feeling breaks.

Solution: We display a voice waveform during recording, then show a brief "processing" indicator (200ms average perceived wait due to streaming transcription), then reveal the transcribed text in the chat bubble. The key insight: the user's cognitive transition from "I just spoke" to "let me read what I said" naturally takes ~500ms, which overlaps with our transcription time. We're hiding latency in a natural attention gap.

For longer utterances (10+ seconds), we use streaming transcription that begins populating the text bubble while the user is still speaking. By the time they release the mic, 80-90% of the text is already visible.
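
Here's roughly what the push-to-talk wiring looks like, sketched in TypeScript against a hypothetical `SttClient` interface (the real streaming API differs, but the shape is the same):

```typescript
interface SttClient {
  start(): void;
  onPartial(handler: (partial: string) => void): void;
  stop(): Promise<string>; // stops recording, resolves with the final transcript
}

function wireMicButton(
  stt: SttClient,
  updateBubble: (text: string, isFinal: boolean) => void,
  commitVoiceSegment: (text: string) => void
) {
  // Partial transcripts populate the bubble while the user is still
  // speaking, so most of a long utterance is visible before release.
  stt.onPartial((partial) => updateBubble(partial, false));

  const onMicPress = () => stt.start();

  const onMicRelease = async () => {
    // Release doubles as "send": finalize and commit, no extra tap.
    const finalText = await stt.stop();
    updateBubble(finalText, true);
    commitVoiceSegment(finalText);
  };

  return { onMicPress, onMicRelease };
}
```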

Challenge 3: Error Correction Flow

Voice transcription isn't perfect. What happens when speech-to-text mishears "I felt dismissed" as "I felt this missed"?

Solution: Each voice segment shows a small "edit" affordance after transcription. Tapping it converts the bubble to an editable text field with the transcription pre-filled. Users can fix errors without re-recording. In practice, users correct about 8% of voice segments — but the corrections matter because emotional vocabulary is precisely where STT struggles most.

Critically, this correction happens within the same conversation flow. You don't leave the conversation to fix a transcription error. Fix it inline, and continue.
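
In code terms, the correction is just an in-place text swap on the committed segment (reusing the `Segment` type from the sketch above):

```typescript
/** Replace a voice segment's transcript in place; order is untouched. */
function correctSegment(
  segments: Segment[],
  segmentId: string,
  correctedText: string
): boolean {
  const segment = segments.find((s) => s.id === segmentId);
  if (!segment || segment.inputType !== "voice") return false;

  // Position in the conversation is preserved, so the engine simply
  // sees the corrected text the next time it reads the context.
  segment.text = correctedText;
  return true;
}
```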

Challenge 4: Context Window Management

The AI needs full conversation context to generate a coherent diary draft. But mixing voice and text creates variable-length segments — a voice segment might be 200 words, while a text segment might be 3 words.

Solution: We normalize segments by semantic content rather than word count. Our context management weights segments by:

  • Emotional density (detected via sentiment analysis)
  • Novelty (does this add new information vs. confirming previous statements?)
  • Recency (more recent segments get priority)

This ensures the AI's diary draft reflects the emotional arc of the conversation regardless of whether key moments were spoken (verbose) or typed (terse).
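
A simplified version of that weighting, with illustrative coefficients (the real scoring functions are tuned and considerably more involved than this):

```typescript
interface WeightedSegment extends Segment {
  weight: number;
}

function weightSegments(
  segments: Segment[],
  emotionalDensity: (text: string) => number,              // 0..1, e.g. sentiment analysis
  novelty: (text: string, previous: string[]) => number    // 0..1, new info vs. repetition
): WeightedSegment[] {
  const n = segments.length;
  return segments.map((segment, i) => {
    const recency = (i + 1) / n; // later segments score higher
    const previous = segments.slice(0, i).map((s) => s.text);
    const weight =
      0.4 * emotionalDensity(segment.text) +
      0.3 * novelty(segment.text, previous) +
      0.3 * recency;
    return { ...segment, weight };
  });
}
```

Note that word count appears nowhere in the score: a terse typed "yeah exactly" confirming an earlier point weighs less than a three-word segment that introduces a new feeling.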

The UX Rationale

Why Not Two Separate Modes?

We prototyped a "two-tab" version early on. The data was clear:

  • Users who started in text mode rarely switched to voice (even when voice would be faster)
  • Users who started in voice mode rarely switched to text (even when text would be more precise)
  • Average conversation depth was 40% lower in the two-mode version — because users committed to one input and stopped when that input became inconvenient, rather than switching

The unified mode produced 2.3x more conversation turns on average. More turns = richer conversations = better diary drafts.

When Users Switch (and Why It Matters)

From our analytics:

  • Voice to text: Most common trigger is environment change (left a private space, arrived somewhere public). Also happens when users want precision ("it was specifically March 14th, not the 15th").
  • Text to voice: Most common trigger is emotion escalation. When feelings intensify, typing feels inadequate — users instinctively reach for voice because it's faster and carries emotional texture.
  • Average switches per conversation: 1.4

That 1.4 number is key. It means the majority of conversations involve at least one switch. If we'd built separate modes, these users would have either:
1. Stopped at the switch point (lost depth), or
2. Started a new entry in the other mode (lost context)

Neither is acceptable for diary generation quality.

The Technical Architecture

```
User Input Layer
├── TextInput (keyboard → committed segment)
├── VoiceInput (mic → STT → committed segment)
└── Shared: segment queue, conversation state

Conversation Engine (input-agnostic)
├── Receives: ordered segments (text or transcribed voice)
├── Maintains: full conversation context
├── Generates: AI follow-up questions
└── Triggers: diary draft generation after conversation end

Diary Generation Layer
├── Receives: full conversation context (all segments)
├── Processes: emotional arc analysis, theme extraction
└── Outputs: first-person diary draft for review
```

The conversation engine is deliberately input-agnostic. It never asks "was this typed or spoken?" It only asks "what did the user say, and in what order?" This is the architectural decision that makes the seamless experience possible.
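
In practice, that boils down to a message type with no `inputType` field at all. A sketch, with illustrative names:

```typescript
interface EngineMessage {
  role: "user" | "assistant";
  text: string;
}

/** The inputType tag is dropped at the engine boundary. */
function toEngineMessages(segments: Segment[]): EngineMessage[] {
  return segments.map((s) => ({ role: "user", text: s.text }));
}
```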

What We Shipped vs. What We Wanted

Honest disclosure of current limitations:

  • No simultaneous input: You can't type while a voice recording is in progress. We explored this but the UX was confusing.
  • No voice-to-text real-time overlay: Some users wanted to see transcription in real-time while speaking (like live captions). We opted for post-recording reveal to reduce cognitive load, but this is a debatable decision.
  • Transcription language detection: Currently, the app processes one language per session. Code-switching (Korean + English in the same utterance) has ~85% accuracy vs ~97% for single-language. Improving this is on our roadmap.

The Lesson

The biggest lesson from building unified voice+text: engineering convenience is not UX convenience. Separate modes are easier to build, test, and maintain. But they impose a false choice on users who naturally blend inputs based on context.

Building "one mode, two inputs" cost us roughly 3x the engineering time of building separate modes. But it produced conversations that are 2.3x richer. That's a trade we'd make again.

Frequently Asked Questions

Can I switch between voice and text mid-conversation in Vividiary?
Yes. Vividiary uses a single conversation mode where voice and text coexist. You can type a message, then hold the mic to speak your next thought, then type again — all in the same conversation without any mode switching.
How does Vividiary handle voice transcription errors?
Each voice segment shows an inline edit button after transcription. Tap it to fix errors without leaving the conversation flow. About 8% of voice segments are corrected by users, typically for emotional vocabulary that speech-to-text struggles with.
Why didn't Vividiary build separate voice and text journaling features?
We prototyped separate modes and found users rarely switched between them — even when switching would help. This produced 40% shallower conversations. The unified mode generates 2.3x more conversation turns because users naturally adapt their input to context (voice when emotional, text when precise).
Does the AI treat voice and text input differently?
No. The conversation engine is deliberately input-agnostic. It processes ordered segments of text regardless of whether they originated from typing or voice transcription. The diary draft is generated from the full conversation context without distinguishing input types.

Experience what we've built

One tap to log, AI to write. Try the journaling app we keep talking about.

Download Vividiary
