Dictation Guide

Dictation is the core of Conjure. This guide covers how it works, the different modes available, and tips for getting the best results.

How Dictation Works

The dictation pipeline has four stages:

  1. Record -- your microphone captures audio while the hotkey is held (or toggled on)
  2. Preprocess -- audio passes through the preprocessing pipeline (noise gate, high-pass filter, normalization)
  3. Transcribe -- the audio is sent to whisper.cpp (local) or a cloud provider for speech-to-text
  4. Post-process -- an LLM rewrites the raw transcript according to your selected writing style (skipped for Verbatim), then the text is injected into the active application
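The four stages above can be pictured as a simple chain of functions. The following is only a minimal sketch of that flow: the `Pipeline` class and every stage function in it are hypothetical stand-ins, not Conjure's actual internals.

```python
from dataclasses import dataclass
from typing import Callable, List

Audio = List[float]  # mono samples in the range [-1.0, 1.0] (assumed format)


@dataclass
class Pipeline:
    """Hypothetical container for stages 2-4; recording (stage 1) supplies the input."""
    preprocess: Callable[[Audio], Audio]
    transcribe: Callable[[Audio], str]
    post_process: Callable[[str], str]
    inject: Callable[[str], None]

    def run(self, recorded: Audio) -> str:
        cleaned = self.preprocess(recorded)        # stage 2: clean up the audio
        raw_text = self.transcribe(cleaned)        # stage 3: speech-to-text
        final_text = self.post_process(raw_text)   # stage 4: apply the writing style
        self.inject(final_text)                    # stage 4: hand text to the active app
        return final_text


# Stand-in stages so the sketch runs without a microphone, model, or LLM.
typed: List[str] = []
demo = Pipeline(
    preprocess=lambda audio: list(audio),
    transcribe=lambda audio: "um hello world",
    post_process=lambda text: text.replace("um ", "").capitalize() + ".",
    inject=typed.append,
)
result = demo.run([0.0, 0.1, -0.1])
```

Keeping each stage behind a plain callable mirrors the idea that transcription can be local or cloud-based without the rest of the chain changing.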

Push-to-Talk vs Toggle Mode

Push-to-Talk (default)

Hold the dictation hotkey to record. Release to stop and process. This is the most natural mode for short dictations -- a sentence or two at a time.

Toggle Mode

Press the hotkey once to start recording, press again to stop. Better for longer dictations where holding a key becomes uncomfortable.

Choose between push-to-talk and toggle mode in Settings > Input.
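The difference between the two modes comes down to how key-down and key-up events change the recording state. Here is a minimal sketch, assuming a hypothetical `HotkeyRecorder` class and assuming that the operating system's key auto-repeat has to be ignored while the key is held:

```python
from enum import Enum


class Mode(Enum):
    PUSH_TO_TALK = "push_to_talk"
    TOGGLE = "toggle"


class HotkeyRecorder:
    """Hypothetical state holder; not Conjure's actual hotkey handler."""

    def __init__(self, mode: Mode = Mode.PUSH_TO_TALK) -> None:
        self.mode = mode
        self.recording = False
        self._held = False  # True while the hotkey is physically down

    def on_key_down(self) -> None:
        if self._held:
            return  # ignore auto-repeat events while the key stays down
        self._held = True
        if self.mode is Mode.PUSH_TO_TALK:
            self.recording = True                 # record for as long as the key is held
        else:
            self.recording = not self.recording   # each fresh press flips the state

    def on_key_up(self) -> None:
        self._held = False
        if self.mode is Mode.PUSH_TO_TALK:
            self.recording = False                # releasing the key ends the recording
```

In toggle mode the key-up event deliberately does nothing to the recording state, which is what lets you take your hand off the keyboard during a long dictation.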

Streaming vs Batch Mode

Batch Mode (default)

Your entire recording is transcribed and post-processed after you stop. The result is pasted all at once. This mode supports all writing styles and produces the most polished output.

Best for: emails, documents, chat messages, and any dictation where you want AI cleanup.

Streaming Mode

Text appears in real time as you speak, with approximately 70 ms latency. Each chunk is injected directly into the active application using character-by-character input (not clipboard paste).

Streaming requires a compatible provider

Streaming mode (real-time text output) requires both of the following:

  • The Verbatim writing style selected (no post-processing)
  • A streaming-capable transcription provider: AssemblyAI, Deepgram, ElevenLabs, or the local whisper sidecar
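The two requirements amount to a simple check. In the sketch below, the function name and the provider identifier strings are hypothetical; only the list of providers comes from this guide.

```python
# Provider identifiers are illustrative spellings, not Conjure's real config keys.
STREAMING_PROVIDERS = {"assemblyai", "deepgram", "elevenlabs", "local"}


def streaming_available(style: str, provider: str) -> bool:
    """Return True only when both streaming requirements are met."""
    is_verbatim = style.strip().lower() == "verbatim"
    can_stream = provider.strip().lower() in STREAMING_PROVIDERS
    return is_verbatim and can_stream
```

For example, `streaming_available("Verbatim", "Deepgram")` is true, while any other writing style, or a provider outside the list, falls back to batch mode.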

Best for: terminal dictation, live note-taking, coding comments, or any situation where you want to see words as you speak them.

TIP

Toggle between Realtime (streaming) and Bulk (batch) modes from the styling bar at the bottom of the dictation overlay. The toggle only appears when you have Verbatim selected and a streaming-capable provider configured.

Streaming Text Injection

In streaming mode, Conjure uses a special text injection method (type_text) instead of clipboard paste. This is important because:

  • No clipboard interference -- your clipboard contents are preserved
  • No modifier key conflicts -- direct Unicode injection avoids issues with held hotkey modifiers (Ctrl, Alt, Shift)
  • Terminal-safe -- works in PowerShell, Windows Terminal, and other console apps where Ctrl+V doesn't always work

On Windows, this uses SendInput with KEYEVENTF_UNICODE for GUI apps, and clipboard plus right-click paste for console windows.
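To illustrate the KEYEVENTF_UNICODE approach: each character is delivered as UTF-16 code units, with a key-down and a key-up event per unit, so characters outside the Basic Multilingual Plane become two units (a surrogate pair). The sketch below only builds the list of events and does not call SendInput, so it runs on any platform. The flag values are the documented Win32 constants; the helper names are hypothetical and this is not Conjure's type_text implementation.

```python
import struct
from typing import List, Tuple

KEYEVENTF_KEYUP = 0x0002    # Win32 constant: the event is a key release
KEYEVENTF_UNICODE = 0x0004  # Win32 constant: the scan code field carries a UTF-16 unit


def utf16_units(text: str) -> List[int]:
    """Split text into UTF-16 code units (surrogate pairs become two units)."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


def unicode_key_events(text: str) -> List[Tuple[int, int]]:
    """Return (code_unit, flags) pairs: one key-down and one key-up per unit."""
    events: List[Tuple[int, int]] = []
    for unit in utf16_units(text):
        events.append((unit, KEYEVENTF_UNICODE))                    # key down
        events.append((unit, KEYEVENTF_UNICODE | KEYEVENTF_KEYUP))  # key up
    return events
```

Because no virtual-key code is involved, held modifiers such as Ctrl or Shift do not turn the injected characters into shortcuts, which is the property the list above relies on.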

Terminal dictation

Conjure uses direct Unicode injection (type_text) for terminal apps, avoiding clipboard and modifier key conflicts that break push-to-talk hotkeys. This works in PowerShell, Windows Terminal, and similar console applications.

The Dictation Pill

The floating pill overlay shows your current dictation state:

Color                   State
Red (pulsing)           Recording in progress
Blue (spinning)         Processing audio
Green (check)           Text pasted successfully
Purple (volume icon)    TTS is speaking

Click the pill during TTS playback to stop it. The pill position and size are configurable.

Post-Processing

When a writing style other than Verbatim is selected, your raw transcript is sent to an LLM for cleanup. The LLM applies the style's prompt template to:

  • Remove filler words ("um", "uh", "like")
  • Fix grammar and punctuation
  • Format the text according to the style (email format, casual chat, formal document, etc.)
  • Convert spoken formatting cues to actual formatting
  • Apply self-correction (when you say "actually, I mean...")

Post-processing requires an LLM provider configured in Settings. The same API key used for transcription can often be used for post-processing (e.g., Groq provides both speech-to-text and text generation).
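As a rough illustration of how a style's prompt template and the raw transcript might be combined before being sent to the LLM, consider the sketch below. The template wording, the `{transcript}` placeholder, and both function names are assumptions made for this example; Conjure's real templates may look quite different.

```python
# Hypothetical template for an email-like style; not one of Conjure's real prompts.
EMAIL_TEMPLATE = (
    "Rewrite the dictated text as a concise email. "
    "Remove filler words, fix grammar and punctuation, and apply any "
    "spoken self-corrections.\n\nDictated text:\n{transcript}"
)


def needs_post_processing(style: str) -> bool:
    """Verbatim skips the LLM step; every other style uses it."""
    return style.strip().lower() != "verbatim"


def build_cleanup_prompt(style_template: str, transcript: str) -> str:
    """Fill the style's template with the trimmed raw transcript."""
    return style_template.format(transcript=transcript.strip())


prompt = build_cleanup_prompt(
    EMAIL_TEMPLATE,
    "  um send it Tuesday, actually, I mean Wednesday ",
)
```

The resulting string is what would be handed to whichever LLM provider is configured; the cleanup itself happens in the model, not in this code.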

Tips for Better Accuracy

Microphone

  • Use a dedicated microphone rather than a laptop's built-in mic
  • Position the mic 6-12 inches from your mouth
  • Use a pop filter for plosive sounds if you notice distortion

Environment

  • Dictate in a quiet room when possible
  • Enable the noise gate in Settings > Input > Audio to filter background noise
  • Enable the high-pass filter to remove low-frequency rumble from fans or AC
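To give a feel for what these filters do, here is a minimal sketch of a frame-based noise gate, a first-order high-pass filter, and peak normalization in plain Python. The thresholds, the 80 Hz cutoff, the 16 kHz sample rate, and the function names are all illustrative assumptions; Conjure's actual preprocessing may use different algorithms and settings.

```python
import math
from typing import List


def noise_gate(samples: List[float], threshold: float = 0.02, frame: int = 160) -> List[float]:
    """Silence every frame whose RMS level falls below the threshold."""
    out: List[float] = []
    for start in range(0, len(samples), frame):
        block = samples[start:start + frame]
        rms = math.sqrt(sum(s * s for s in block) / len(block))
        out.extend(block if rms >= threshold else [0.0] * len(block))
    return out


def high_pass(samples: List[float], cutoff_hz: float = 80.0, sample_rate: int = 16000) -> List[float]:
    """First-order RC high-pass: attenuates rumble below the cutoff frequency."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out: List[float] = []
    prev_in = prev_out = 0.0
    for x in samples:
        y = alpha * (prev_out + x - prev_in)
        out.append(y)
        prev_in, prev_out = x, y
    return out


def normalize(samples: List[float], peak: float = 0.9) -> List[float]:
    """Scale the signal so its loudest sample reaches the target peak."""
    loudest = max((abs(s) for s in samples), default=0.0)
    if loudest == 0.0:
        return list(samples)
    gain = peak / loudest
    return [s * gain for s in samples]
```

A constant offset or slow rumble fed into `high_pass` decays toward zero, while quiet background hiss below the gate threshold is replaced with silence before transcription.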

Speaking Style

  • Speak at a natural, conversational pace -- not too fast, not too slow
  • Pause briefly between sentences rather than rushing through
  • Use voice commands for punctuation rather than saying "period" and hoping the AI adds it
  • State corrections clearly: "actually, I mean..." triggers self-correction in most writing styles

Model Selection

  • Local whisper (large-v3 model) provides good accuracy for most languages
  • Groq (free tier) offers excellent cloud transcription quality
  • Smaller whisper models are faster but less accurate -- use the largest model your hardware can handle

Dictionary

Add frequently used jargon, names, and technical terms to your dictionary. These are fed to whisper as vocabulary hints and measurably improve recognition of uncommon words.

See the Dictionary Guide for details.
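Whisper models accept an optional initial prompt that biases decoding toward the words it contains, so one plausible way to pass vocabulary hints is to join dictionary terms into a short prompt string. The sketch below assumes that approach; the function name, the comma-separated format, and the character budget are hypothetical, and Conjure may build its hints differently.

```python
from typing import Iterable, List, Set


def build_vocabulary_hint(terms: Iterable[str], max_chars: int = 600) -> str:
    """Join unique, non-empty terms into a comma-separated hint within a size budget."""
    seen: Set[str] = set()
    kept: List[str] = []
    used = 0
    for term in terms:
        cleaned = " ".join(term.split())   # trim and collapse internal whitespace
        key = cleaned.lower()
        if not cleaned or key in seen:
            continue                        # skip blanks and case-insensitive duplicates
        cost = len(cleaned) + (2 if kept else 0)  # account for the ", " separator
        if used + cost > max_chars:
            break                           # stop once the budget would be exceeded
        seen.add(key)
        kept.append(cleaned)
        used += cost
    return ", ".join(kept)
```

For example, `build_vocabulary_hint(["Kubernetes", "kubernetes", "  gRPC ", ""])` yields `"Kubernetes, gRPC"`. Keeping the hint short matters because the prompt shares the model's limited context with the audio being decoded.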

Released under the AGPLv3 License.