Audio Preprocessing

Conjure applies a signal processing pipeline to your audio before it reaches whisper. This improves transcription accuracy by reducing noise, removing rumble, and normalizing volume levels.

All preprocessing runs in the Rust sidecar alongside whisper -- there's no additional latency from a separate processing step.

Local transcription only

Audio preprocessing is applied in the local whisper sidecar. Cloud transcription providers (Groq, Deepgram, AssemblyAI, etc.) handle their own audio preprocessing server-side.

Pipeline Stages

The preprocessing pipeline has three active stages, applied in order:

1. Noise Gate

The noise gate zeroes out audio frames that fall below a volume threshold. This removes background noise during pauses in your speech without affecting your actual voice.

  • How it works: Audio is processed in 20ms frames (320 samples at 16kHz). If a frame's peak amplitude is below the threshold, the entire frame is set to zero.
  • Default threshold: 0.015 (adjustable from 0.0 to 0.1)
  • Effect: Suppresses quiet keyboard clicks, mouse clicks, ambient room noise, and other low-level sounds between words
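The frame-based gating described above can be sketched as follows. This is a minimal illustration rather than Conjure's actual implementation: the function name `noise_gate` and the constants are hypothetical, and the sketch assumes mono `f32` samples at 16kHz.

```rust
/// Sample rate assumed by this sketch (16 kHz mono audio).
const SAMPLE_RATE_HZ: usize = 16_000;
/// Frame length in milliseconds, as described in the text.
const FRAME_MS: usize = 20;
/// 16_000 * 20 / 1000 = 320 samples per frame.
const FRAME_LEN: usize = SAMPLE_RATE_HZ * FRAME_MS / 1000;

/// Zero every frame whose peak absolute amplitude is below `threshold`.
/// Hypothetical helper name; not taken from Conjure's source.
pub fn noise_gate(samples: &mut [f32], threshold: f32) {
    // The final chunk may be shorter than FRAME_LEN; it is gated the same way.
    for frame in samples.chunks_mut(FRAME_LEN) {
        let peak = frame.iter().fold(0.0_f32, |max, s| max.max(s.abs()));
        if peak < threshold {
            frame.fill(0.0);
        }
    }
}
```

Because the comparison is a strict "less than", a threshold of 0.0 never zeroes a frame in this sketch, which matches the idea of 0.0 meaning "disabled".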

TIP

Start with the default threshold (0.015) and increase it only if background noise shows up as stray text in your transcriptions. Setting it too high can clip the beginnings and ends of words.

2. High-Pass Filter

A single-pole IIR (Infinite Impulse Response) filter that attenuates frequencies below 80Hz. This reduces low-frequency rumble that whisper can otherwise misinterpret as speech.

  • Cutoff frequency: 80Hz
  • Filter type: Single-pole IIR (first-order)
  • Effect: Removes fan noise, air conditioning hum, traffic rumble, desk vibrations, and other low-frequency sounds

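Human speech starts around 85Hz for bass voices, so the 80Hz cutoff sits just below the speech range. A first-order filter rolls off gently (about 6 dB per octave), so the lowest voice fundamentals are only slightly attenuated while sub-bass interference is progressively reduced.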
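A first-order high-pass filter of this kind can be sketched as below. This is the textbook RC-style difference equation, shown under assumptions: the function name `high_pass` and its parameters are hypothetical, and Conjure's actual coefficient calculation may differ.

```rust
use std::f32::consts::PI;

/// First-order (single-pole) high-pass filter, applied in place.
/// Hypothetical helper; uses the standard RC discretization:
///   y[n] = alpha * (y[n-1] + x[n] - x[n-1])
pub fn high_pass(samples: &mut [f32], cutoff_hz: f32, sample_rate_hz: f32) {
    let rc = 1.0 / (2.0 * PI * cutoff_hz); // time constant for the cutoff
    let dt = 1.0 / sample_rate_hz; // time between samples
    let alpha = rc / (rc + dt); // roughly 0.97 for 80 Hz at 16 kHz

    let mut prev_in = 0.0_f32;
    let mut prev_out = 0.0_f32;
    for s in samples.iter_mut() {
        let x = *s;
        let y = alpha * (prev_out + x - prev_in);
        prev_in = x;
        prev_out = y;
        *s = y;
    }
}
```

Calling `high_pass(&mut audio, 80.0, 16_000.0)` on a buffer with a constant (DC) offset drives the offset toward zero, while a 1kHz tone passes through almost unchanged.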

3. RMS Normalization

Normalizes the audio to a consistent volume level by scaling all samples to a target RMS (Root Mean Square) amplitude.

  • Target RMS: 0.1
  • Effect: If you speak quietly or your mic gain is low, normalization boosts the signal. If you're too loud, it attenuates. This gives whisper a consistent input level.
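RMS normalization amounts to computing one gain factor and applying it to every sample. The sketch below illustrates the idea; the names `rms` and `normalize_rms` are hypothetical, and both the silent-buffer guard and the final clamp to [-1.0, 1.0] are assumptions added for safety, not behavior confirmed by Conjure.

```rust
/// Root-mean-square amplitude of a buffer (0.0 for an empty buffer).
pub fn rms(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = samples.iter().map(|s| s * s).sum();
    (sum_sq / samples.len() as f32).sqrt()
}

/// Scale all samples so the buffer's RMS matches `target_rms`.
/// Hypothetical helper; not taken from Conjure's source.
pub fn normalize_rms(samples: &mut [f32], target_rms: f32) {
    let current = rms(samples);
    // Assumption: leave (near-)silent buffers untouched to avoid dividing by zero.
    if current <= f32::EPSILON {
        return;
    }
    let gain = target_rms / current;
    for s in samples.iter_mut() {
        // Assumption: clamp so boosted peaks stay within the valid sample range.
        *s = (*s * gain).clamp(-1.0, 1.0);
    }
}
```

With a target of 0.1, a quiet buffer with RMS 0.02 is multiplied by 5, and a loud buffer with RMS 0.5 is multiplied by 0.2.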

WARNING

Normalization amplifies everything, including noise. For best results, use the noise gate and high-pass filter in combination with normalization to ensure only clean speech is amplified.

Whisper Parameter Tuning

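In addition to the audio preprocessing pipeline, Conjure sets two whisper inference parameters explicitly. The no-speech threshold is lowered from whisper's default; the entropy threshold is kept at the default value.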

No-Speech Threshold

Value: 0.4 (whisper's default is 0.6)

Controls how aggressively whisper suppresses segments it considers non-speech. A segment is treated as silence when whisper's estimated no-speech probability exceeds this threshold, so the lower value suppresses more borderline segments. This makes whisper less likely to hallucinate text from silence -- a common problem where whisper generates phrases like "Thank you for watching" or "Subscribe" from quiet audio.
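The effect of lowering the threshold can be illustrated with a deliberately simplified check. The function `is_no_speech` is hypothetical and only models the threshold comparison; real whisper implementations may combine it with other signals, such as the segment's average log probability.

```rust
/// Simplified model of the no-speech decision: a segment is dropped when
/// the model's no-speech probability exceeds the configured threshold.
/// Hypothetical helper; actual whisper implementations apply extra conditions.
pub fn is_no_speech(no_speech_prob: f32, no_speech_threshold: f32) -> bool {
    no_speech_prob > no_speech_threshold
}
```

For a segment with a no-speech probability of 0.5, the check passes with a threshold of 0.4 (the segment is dropped) but not with the default of 0.6 (the segment is kept and may be transcribed).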

Entropy Threshold

Value: 2.4 (the same as whisper's default)

Controls the maximum entropy (randomness) allowed in whisper's output tokens. High entropy indicates the model is uncertain, which often produces garbled or repetitive text. Segments exceeding this threshold are suppressed.

Configuration

All preprocessing settings are in Settings > Input > Audio:

Setting | Control | Default | Description
Noise Gate Threshold | Slider (0.0 - 0.1) | 0.015 | Amplitude below which audio is zeroed
High-Pass Filter | Switch | On | Enable/disable 80Hz high-pass filter
Audio Normalization | Switch | On | Enable/disable RMS normalization
VAD Sensitivity | Slider | 0.5 | Voice Activity Detection (stored but not yet active)
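As a rough picture of how these settings and their defaults might be represented, the sketch below models them as a struct. The type `AudioSettings`, its field names, and the `sanitized` method are all hypothetical; only the default values and the 0.0 to 0.1 slider range come from the settings listed above.

```rust
/// Hypothetical container for the audio preprocessing settings.
#[derive(Debug, Clone, PartialEq)]
pub struct AudioSettings {
    /// Amplitude below which a frame is zeroed (slider range 0.0 to 0.1).
    pub noise_gate_threshold: f32,
    /// Enables the 80 Hz high-pass filter.
    pub high_pass_filter: bool,
    /// Enables RMS normalization.
    pub audio_normalization: bool,
    /// Stored for future use; VAD is not yet active.
    pub vad_sensitivity: f32,
}

impl Default for AudioSettings {
    fn default() -> Self {
        Self {
            noise_gate_threshold: 0.015,
            high_pass_filter: true,
            audio_normalization: true,
            vad_sensitivity: 0.5,
        }
    }
}

impl AudioSettings {
    /// Clamp the noise gate threshold to the slider's documented range.
    /// The VAD range is not documented, so it is left untouched here.
    pub fn sanitized(mut self) -> Self {
        self.noise_gate_threshold = self.noise_gate_threshold.clamp(0.0, 0.1);
        self
    }
}
```

Clamping on load is one way to guard against out-of-range values in a hand-edited preferences file; whether Conjure does this is not stated.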

VAD Sensitivity

The VAD (Voice Activity Detection) sensitivity slider is visible in settings, but it currently has no effect. Full VAD requires bundling the Silero ONNX model, which is planned for a future release. The preference is already stored, so your chosen value will apply once VAD is implemented.

VAD coming soon

VAD sensitivity is stored in preferences but not yet active. It will be enabled when the Silero ONNX model is bundled in a future release.

When to Adjust Preprocessing

Noisy environment

  • Increase noise gate threshold (try 0.03-0.05)
  • Enable both high-pass filter and normalization
  • Consider a directional microphone or noise-canceling headset

Clean/quiet environment

  • Lower noise gate threshold or set to 0.0 (disabled)
  • High-pass filter and normalization still help but are less critical

Clipped words or distorted output

  • Lower the noise gate threshold -- it may be cutting off the start of words
  • Check your microphone gain in your OS audio settings

Whisper hallucinating (generating text from silence)

  • Increase noise gate threshold to ensure silence is truly silent
  • The tuned no-speech threshold (0.4) should help, but a higher noise gate threshold provides additional protection

Released under the AGPLv3 License.