Audio Preprocessing
Conjure applies a signal processing pipeline to your audio before it reaches whisper. This improves transcription accuracy by reducing noise, removing rumble, and normalizing volume levels.
All preprocessing runs in the Rust sidecar alongside whisper -- there's no additional latency from a separate processing step.
Local transcription only
Audio preprocessing is applied in the local whisper sidecar. Cloud transcription providers (Groq, Deepgram, AssemblyAI, etc.) handle their own audio preprocessing server-side.
Pipeline Stages
The preprocessing pipeline has three active stages, applied in order:
1. Noise Gate
The noise gate zeroes out audio frames that fall below a volume threshold. This removes background noise during pauses in your speech without affecting your actual voice.
- How it works: Audio is processed in 20ms frames (320 samples at 16kHz). If a frame's peak amplitude is below the threshold, the entire frame is set to zero.
- Default threshold: 0.015 (adjustable from 0.0 to 0.1)
- Effect: Silences quiet keyboard and mouse clicks, ambient room noise, and other low-level sounds between words, as long as they stay below the threshold
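The frame-based gating described above can be sketched in a few lines of Rust. This is an illustrative sketch, not Conjure's actual source: the function name `noise_gate`, the constant `FRAME_LEN`, and the assumption that samples are `f32` values in the range -1.0 to 1.0 are all hypothetical.

```rust
/// Frame length assumed from the text: 20 ms at 16 kHz = 320 samples.
const FRAME_LEN: usize = 320;

/// Zeroes every frame whose peak absolute amplitude is below `threshold`.
/// Hypothetical helper; samples are assumed to be f32 in [-1.0, 1.0].
fn noise_gate(samples: &mut [f32], threshold: f32) {
    // `chunks_mut` also yields a shorter trailing frame if the buffer
    // length is not a multiple of FRAME_LEN.
    for frame in samples.chunks_mut(FRAME_LEN) {
        // Peak = largest absolute sample value in the frame.
        let peak = frame.iter().fold(0.0_f32, |max, s| max.max(s.abs()));
        if peak < threshold {
            frame.fill(0.0);
        }
    }
}
```

With a threshold of 0.0 no frame can have a peak below the threshold, so the gate passes everything through unchanged. A single loud sample keeps its whole 20ms frame, because the decision is made per frame rather than per sample.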
TIP
Start with the default threshold (0.015) and increase it only if you hear background noise in your transcriptions. Setting it too high can clip the beginnings and ends of words.
2. High-Pass Filter
A single-pole IIR (Infinite Impulse Response) filter that attenuates frequencies below 80Hz. This removes low-frequency rumble that whisper can misinterpret as speech.
- Cutoff frequency: 80Hz
- Filter type: Single-pole IIR (first-order)
- Effect: Removes fan noise, air conditioning hum, traffic rumble, desk vibrations, and other low-frequency sounds
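The lowest bass voices have a fundamental frequency around 85Hz, so the 80Hz cutoff removes sub-bass interference while leaving speech content essentially intact. A first-order filter rolls off gently (about 6 dB per octave), so the very lowest speech fundamentals are attenuated only mildly rather than cut off.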
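A first-order high-pass filter of this kind is commonly written as the recurrence `y[n] = alpha * (y[n-1] + x[n] - x[n-1])`. The sketch below uses that textbook form. It is an assumption about the shape of the filter, not Conjure's actual code, and the name `high_pass` and its parameters are hypothetical. At a 16kHz sample rate and an 80Hz cutoff, `alpha` works out to roughly 0.97.

```rust
use std::f32::consts::PI;

/// Single-pole (first-order) high-pass filter, applied in place.
/// Hypothetical helper using the textbook RC-filter discretization.
fn high_pass(samples: &mut [f32], cutoff_hz: f32, sample_rate_hz: f32) {
    // Time constant of the equivalent analog RC filter.
    let rc = 1.0 / (2.0 * PI * cutoff_hz);
    // Time between two consecutive samples.
    let dt = 1.0 / sample_rate_hz;
    let alpha = rc / (rc + dt);

    let mut prev_in = 0.0_f32;
    let mut prev_out = 0.0_f32;
    for s in samples.iter_mut() {
        let x = *s;
        // The output follows changes in the input; a constant (DC) input
        // decays towards zero.
        let y = alpha * (prev_out + x - prev_in);
        prev_in = x;
        prev_out = y;
        *s = y;
    }
}
```

Calling `high_pass(&mut buffer, 80.0, 16_000.0)` on a buffer with a constant offset drives that offset towards zero within a few hundred samples, while a 1kHz tone passes through at nearly full amplitude.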
3. RMS Normalization
Normalizes the audio to a consistent volume level by scaling all samples to a target RMS (Root Mean Square) amplitude.
- Target RMS: 0.1
- Effect: If you speak quietly or your mic gain is low, normalization boosts the signal. If you're too loud, it attenuates. This gives whisper a consistent input level.
WARNING
Normalization amplifies everything, including noise. For best results, use the noise gate and high-pass filter in combination with normalization to ensure only clean speech is amplified.
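As a rough illustration of RMS normalization, the sketch below computes the current RMS of a buffer and applies a single gain so that the result lands on the target. The function names, the silence guard, and the final clamp to the range -1.0 to 1.0 are assumptions made for this example and are not taken from Conjure's source.

```rust
/// Root mean square amplitude of a buffer (0.0 for an empty buffer).
fn rms(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = samples.iter().map(|s| s * s).sum();
    (sum_sq / samples.len() as f32).sqrt()
}

/// Scales all samples by one gain factor so the buffer's RMS matches
/// `target_rms`. Hypothetical helper.
fn normalize_rms(samples: &mut [f32], target_rms: f32) {
    let current = rms(samples);
    // Assumption: leave (near-)silent buffers untouched to avoid dividing
    // by zero and amplifying pure noise without bound.
    if current <= f32::EPSILON {
        return;
    }
    let gain = target_rms / current;
    for s in samples.iter_mut() {
        // Assumption: clamp so boosted peaks stay within [-1.0, 1.0].
        *s = (*s * gain).clamp(-1.0, 1.0);
    }
}
```

For example, a quiet recording with an RMS of 0.02 gets a gain of 0.1 / 0.02 = 5, and any noise left in that recording is multiplied by the same factor of 5. That is why gating and filtering before normalization matters.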
Whisper Parameter Tuning
In addition to the audio preprocessing pipeline, Conjure explicitly sets two whisper inference parameters:
No-Speech Threshold
Value: 0.4 (whisper's default is 0.6)
Controls how aggressively whisper suppresses segments it considers non-speech. A segment is treated as silence when whisper's estimated no-speech probability exceeds this threshold, so the lower value suppresses more borderline segments. That makes whisper less likely to hallucinate text from silence -- a common problem where whisper generates phrases like "Thank you for watching" or "Subscribe" from quiet audio.
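The effect of lowering the threshold can be shown with a deliberately simplified check. The function below is hypothetical and only models the comparison against the threshold. Real whisper implementations may also consult other signals, such as average log probability, before discarding a segment.

```rust
/// Conjure's value and whisper's default, as stated in the text.
const CONJURE_NO_SPEECH_THRESHOLD: f32 = 0.4;
const WHISPER_DEFAULT_NO_SPEECH_THRESHOLD: f32 = 0.6;

/// Simplified, hypothetical model of the no-speech decision:
/// a segment counts as silence when its no-speech probability
/// exceeds the threshold.
fn is_treated_as_silence(no_speech_prob: f32, threshold: f32) -> bool {
    no_speech_prob > threshold
}
```

Under this model, a segment with a no-speech probability of 0.5 would be transcribed at the default threshold of 0.6 but suppressed at Conjure's 0.4.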
Entropy Threshold
Value: 2.4 (same as the whisper default)
Controls the maximum entropy (randomness) allowed in whisper's output tokens. High entropy indicates the model is uncertain, which often produces garbled or repetitive text. Segments exceeding this threshold are suppressed.
Configuration
All preprocessing settings are in Settings > Input > Audio:
| Setting | Control | Default | Description |
|---|---|---|---|
| Noise Gate Threshold | Slider (0.0 - 0.1) | 0.015 | Amplitude below which audio is zeroed |
| High-Pass Filter | Switch | On | Enable/disable 80Hz high-pass filter |
| Audio Normalization | Switch | On | Enable/disable RMS normalization |
| VAD Sensitivity | Slider | 0.5 | Voice Activity Detection (stored but not yet active) |
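To make the defaults and ranges in the table concrete, here is a minimal sketch of how such settings might be modeled in Rust. The struct name, field names, and the clamping helper are all hypothetical. Only the default values and the 0.0 to 0.1 slider range come from the table above.

```rust
/// Hypothetical in-memory representation of the settings in the table.
#[derive(Debug, Clone, PartialEq)]
struct AudioPreprocessingSettings {
    noise_gate_threshold: f32,
    high_pass_enabled: bool,
    normalization_enabled: bool,
    /// Stored but not yet used by the pipeline.
    vad_sensitivity: f32,
}

impl Default for AudioPreprocessingSettings {
    fn default() -> Self {
        Self {
            noise_gate_threshold: 0.015,
            high_pass_enabled: true,
            normalization_enabled: true,
            vad_sensitivity: 0.5,
        }
    }
}

impl AudioPreprocessingSettings {
    /// Returns a copy with the threshold clamped to the slider range.
    fn with_noise_gate_threshold(mut self, value: f32) -> Self {
        self.noise_gate_threshold = value.clamp(0.0, 0.1);
        self
    }
}
```

Clamping in one place means an out-of-range value, for example from a hand-edited preferences file, cannot push the gate beyond what the slider allows.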
VAD Sensitivity
The VAD (Voice Activity Detection) sensitivity slider is visible in settings but currently has no effect on transcription. Full VAD requires bundling the Silero ONNX model, which is planned for a future release. The preference is stored now so that your chosen value is ready when VAD is implemented.
VAD coming soon
VAD sensitivity is stored in preferences but not yet active. It will be enabled when the Silero ONNX model is bundled in a future release.
When to Adjust Preprocessing
Noisy environment
- Increase noise gate threshold (try 0.03-0.05)
- Enable both high-pass filter and normalization
- Consider a directional microphone or noise-canceling headset
Clean/quiet environment
- Lower noise gate threshold or set to 0.0 (disabled)
- High-pass filter and normalization still help but are less critical
Clipping or distorted output
- Lower the noise gate threshold -- it may be cutting off the start of words
- Check your microphone gain in your OS audio settings
Whisper hallucinating (generating text from silence)
- Increase noise gate threshold to ensure silence is truly silent
- The tuned no-speech threshold (0.4) should help, but a higher noise gate threshold provides additional protection
