Dictation Guide

Dictation is the core of Conjure. This guide covers how it works, the different modes available, and tips for getting the best results.

How Dictation Works

The dictation pipeline has four stages:

Record -- your microphone captures audio while the hotkey is held (or toggled on)
Preprocess -- audio passes through the preprocessing pipeline (noise gate, high-pass filter, normalization)
Transcribe -- the audio is sent to whisper.cpp (local) or a cloud provider for speech-to-text
Post-process -- the raw transcript is run through your selected writing style via an LLM, then injected into the active application

Push-to-Talk vs Toggle Mode

Push-to-Talk (default)

Hold the dictation hotkey to record. Release to stop and process. This is the most natural mode for short dictations -- a sentence or two at a time.

Toggle Mode

Press the hotkey once to start recording, press again to stop. Better for longer dictations where holding a key becomes uncomfortable.

Configure this in Settings > Input.

Streaming vs Batch Mode

Batch Mode (default)

Your entire recording is transcribed and post-processed after you stop. The result is pasted all at once. This mode supports all writing styles and produces the most polished output.

Best for: emails, documents, chat messages, and any dictation where you want AI cleanup.

Streaming Mode

Text appears in real-time as you speak, with approximately 70ms latency. Each chunk is injected directly into the active application using character-by-character input (not clipboard paste).

Streaming requires a compatible provider

Real-time text output (streaming mode) works with AssemblyAI, Deepgram, ElevenLabs, or the local whisper sidecar. It also requires the Verbatim writing style selected.

Streaming mode requires:

Verbatim tone selected (no post-processing)
A streaming-capable transcription provider (AssemblyAI, Deepgram, ElevenLabs, or the local sidecar)

Best for: terminal dictation, live note-taking, coding comments, or any situation where you want to see words as you speak them.

TIP

Toggle between Realtime and Bulk modes from the styling bar at the bottom of the dictation overlay. The toggle only appears when you have Verbatim selected and a streaming-capable provider configured.

Streaming Text Injection

In streaming mode, Conjure uses a special text injection method (type_text) instead of clipboard paste. This is important because:

No clipboard interference -- your clipboard contents are preserved
No modifier key conflicts -- direct Unicode injection avoids issues with held hotkey modifiers (Ctrl, Alt, Shift)
Terminal-safe -- works in PowerShell, Windows Terminal, and other console apps where Ctrl+V doesn't always work

On Windows, this uses SendInput with KEYEVENTF_UNICODE for GUI apps and clipboard + right-click for console windows.

Terminal dictation

Conjure uses direct Unicode injection (type_text) for terminal apps, avoiding clipboard and modifier key conflicts that break push-to-talk hotkeys. This works in PowerShell, Windows Terminal, and similar console applications.

The Dictation Pill

The floating pill overlay shows your current dictation state:

Color	State
Red (pulsing)	Recording in progress
Blue (spinning)	Processing audio
Green (check)	Text pasted successfully
Purple (volume icon)	TTS is speaking

Click the pill during TTS playback to stop it. The pill position and size are configurable.

Post-Processing

When a writing style other than Verbatim is selected, your raw transcript is sent to an LLM for cleanup. The LLM applies the style's prompt template to:

Remove filler words ("um", "uh", "like")
Fix grammar and punctuation
Format the text according to the style (email format, casual chat, formal document, etc.)
Convert spoken formatting cues to actual formatting
Apply self-correction (when you say "actually, I mean...")

Post-processing requires an LLM provider configured in Settings. The same API key used for transcription can often be used for post-processing (e.g., Groq provides both speech-to-text and text generation).

Tips for Better Accuracy

Microphone

Use a dedicated microphone rather than a laptop's built-in mic
Position the mic 6-12 inches from your mouth
Use a pop filter for plosive sounds if you notice distortion

Environment

Dictate in a quiet room when possible
Enable the noise gate in Settings > Input > Audio to filter background noise
Enable the high-pass filter to remove low-frequency rumble from fans or AC

Speaking Style

Speak at a natural, conversational pace -- not too fast, not too slow
Pause briefly between sentences rather than rushing through
Use voice commands for punctuation rather than saying "period" and hoping the AI adds it
State corrections clearly: "actually, I mean..." triggers self-correction in most writing styles

Model Selection

Local whisper (large-v3 model) provides good accuracy for most languages
Groq (free tier) offers excellent cloud transcription quality
Smaller whisper models are faster but less accurate -- use the largest model your hardware can handle

Dictionary

Add frequently used jargon, names, and technical terms to your dictionary. These are fed to whisper as vocabulary hints and measurably improve recognition of uncommon words.

See the Dictionary Guide for details.

Dictation Guide ​

How Dictation Works ​

Push-to-Talk vs Toggle Mode ​

Push-to-Talk (default) ​

Toggle Mode ​

Streaming vs Batch Mode ​

Batch Mode (default) ​

Streaming Mode ​

Streaming Text Injection ​

The Dictation Pill ​

Post-Processing ​

Tips for Better Accuracy ​

Microphone ​

Environment ​

Speaking Style ​

Model Selection ​

Dictionary ​