Voice Clone Cache - Solving Filler Audio Latency in Voice AI Pipelines
Real-time voice AI systems face a fundamental UX problem: silence during processing feels broken. When a user finishes speaking, the system must transcribe (ASR), generate a response (LLM), and synthesize speech (TTS). At a 200-300ms baseline the gap is barely perceptible, but under load it stretches to multiple seconds of dead air.
This post explores architectural approaches to filler audio—acknowledgment sounds like “Hmm…” or “Let me think…”—with a focus on the Voice Clone Cache pattern that achieves zero-latency playback while maintaining voice consistency.
The Latency Problem in Voice Pipelines
A typical voice AI pipeline executes sequentially:
┌──────────────────────────────────────────────────────────────────────────────┐
│ VOICE AI PIPELINE (Sequential) │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌──────────┐ │
│ │ User │ │ ASR │ │ LLM │ │ TTS │ │ Audio │ │
│ │ Speech │───▶│ (STT) │───▶│ (Gen) │───▶│ Synth │───▶│ Playback │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └──────────┘ │
│ │
│ ◀─────────────────────────────────────────────────────────────────────────▶│
│ │
│ T=0ms T=50ms T=100ms T=200ms T=350ms │
│ User stops ASR final LLM starts LLM done Audio starts │
│ speaking transcript generating + TTS starts playing │
│ │
│ ◀──────────────────────────────────────────▶ │
│ "DEAD AIR" - User hears nothing │
│ (~250-350ms baseline) │
│ (~1-5 seconds under load) │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Under normal conditions with warm GPU inference:
| Stage | Baseline Latency | Under Load |
|---|---|---|
| ASR final transcript | ~50ms | ~100ms |
| LLM TTFT (time to first token) | ~40-80ms | 500ms+ |
| TTS synthesis | ~150-200ms | 500ms+ |
| Total end-to-end | ~250ms | 1-5 seconds |
The baseline 250ms is acceptable—barely perceptible. The problem emerges under load when LLM queues build up or GPU saturation occurs. A 2-5 second silence after speaking feels like a dropped connection.
Approaches to Filler Audio
There are four main architectural approaches:
| Approach | Latency | Voice Match | Complexity |
|---|---|---|---|
| Real-time TTS | 150-300ms | ✅ Perfect | Low |
| Pre-recorded audio files | ~0ms | ❌ Mismatch | Low |
| Voice Clone Cache | ~0ms | ✅ Perfect | Medium |
| Client-side signaling | ~0ms | N/A | Medium |
┌──────────────────────────────────────────────────────────────────────────────┐
│ FILLER AUDIO APPROACHES COMPARISON │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. REAL-TIME TTS 2. PRE-RECORDED FILES │
│ ════════════════ ═════════════════════ │
│ │
│ Request ──▶ TTS GPU ──▶ Audio Request ──▶ Disk ──▶ Audio │
│ │ │ │
│ ▼ ▼ │
│ 150-300ms ~5-10ms │
│ Voice: ✓ Match Voice: ✗ Mismatch │
│ Cost: GPU cycles/req Cost: Disk I/O │
│ │
│ Problem: Filler takes as long Problem: Different voice │
│ as the response itself breaks user immersion │
│ │
│ ────────────────────────────────────────────────────────────────────────── │
│ │
│ 3. VOICE CLONE CACHE (Recommended) 4. CLIENT-SIDE SIGNAL │
│ ══════════════════════════════════ ═════════════════════ │
│ │
│ Startup: TTS GPU ──▶ RAM Cache Server ──▶ {"status":"thinking"} │
│ │ │
│ Runtime: Request ──▶ RAM ──▶ Audio Client plays local audio │
│ │ │ │
│ ▼ ▼ │
│ ~0ms (memcpy) ~0ms network │
│ Voice: ✓ Match Voice: N/A │
│ Cost: None Cost: None │
│ │
│ Best of both: Speed of files, Best when you control the │
│ quality of TTS client application │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Real-Time TTS Generation
The naive approach: when latency is detected, send filler text to TTS.
async def send_filler(self, tts_client):
    filler_text = random.choice(["Hmm...", "Let me think.", "One moment."])
    audio = await tts_client.synthesize(filler_text)  # 150-300ms
    return audio
Problem: The filler itself takes 150-300ms to generate. If your pipeline latency is 300ms, the filler arrives at the same time as the actual response. And under load, when TTS is saturated, filler generation adds to the very queue it is meant to mask.
Pre-Recorded Audio Files
Store .wav files and play them directly:
import random

FILLER_FILES = {
    "hmm": load_wav("assets/fillers/hmm.wav"),
    "thinking": load_wav("assets/fillers/thinking.wav"),
}

def get_filler() -> bytes:
    return random.choice(list(FILLER_FILES.values()))
Problem: Voice mismatch. If your TTS engine produces a specific voice (cloned or selected), pre-recorded files from a different voice/speaker break immersion. Users notice the inconsistency.
Voice Clone Cache (Recommended)
The Voice Clone Cache combines the speed of pre-recorded files with the voice consistency of TTS:
- At startup: Generate filler phrases using your production TTS engine
- Store in RAM: Keep the raw PCM/WAV bytes in memory
- At runtime: Serve cached bytes with zero inference cost
┌──────────────────────────────────────────────────────────────────────────────┐
│ VOICE CLONE CACHE ARCHITECTURE │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ STARTUP PHASE (Once per pod lifecycle) │
│ ═══════════════════════════════════════ │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────┐ │
│ │ Phrases │ │ TTS Engine │ │ RAM Cache │ │
│ │ Config │────────▶│ (GPU) │────────▶│ (dict) │ │
│ │ │ │ │ │ │ │
│ │ "Hmm..." │ 5 req │ Synthesize │ Store │ {"Hmm": bytes, │ │
│ │ "Let me..." │ ──────▶ │ each phrase│ ──────▶ │ "Let me": ...} │ │
│ │ "One moment"│ ~1 sec │ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────────┘ │
│ │
│ Cost: ~1 second │
│ Frequency: Once at startup │
│ │
│ ────────────────────────────────────────────────────────────────────────── │
│ │
│ RUNTIME PHASE (Every request needing filler) │
│ ════════════════════════════════════════════ │
│ │
│ ┌─────────────┐ ┌─────────────────┐ │
│ │ Latency │ Threshold exceeded? │ RAM Cache │ │
│ │ Estimator │────────────────────────────▶│ (dict) │ │
│ │ │ │ │ │ │
│ │ queue_depth │ Yes │ O(1) lookup │ {"Hmm": bytes}──┼──▶ Audio│
│ │ > threshold │ ───────▶│ No GPU call │ │ Stream│
│ └─────────────┘ │ └─────────────────┘ │
│ │ │
│ Cost: ~0ms (memory read) │
│ Frequency: Per-request (when needed) │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
import random

class FillerAudioCache:
    def __init__(self, tts_client):
        self.tts_client = tts_client
        self.cache: dict[str, bytes] = {}
        self.phrases = [
            "Hmm...",
            "Let me think.",
            "One moment.",
            "Just a second.",
            "Alright...",
        ]

    async def initialize(self):
        """Generate fillers at startup. ~1 second total."""
        for phrase in self.phrases:
            audio_bytes = await self.tts_client.synthesize(phrase)
            self.cache[phrase] = audio_bytes

    def get_random_filler(self) -> bytes:
        """O(1) RAM lookup. Zero inference."""
        phrase = random.choice(self.phrases)
        return self.cache[phrase]
Key properties:
- Zero runtime latency: Memory read, not GPU inference
- Voice-matched: Generated by the same TTS engine/voice as responses
- One-time cost: Pay for TTS generation once at startup, not per-request
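Wiring the cache into a turn handler is then a one-line branch before the LLM call. A minimal sketch with assumed names (expect_high_latency, llm_client, and a websocket whose send accepts raw bytes, as in the websockets library):

async def handle_turn(self, websocket, transcript: str):
    # Bridge the expected gap with a cached filler; the triggering
    # heuristic (expect_high_latency) is covered in the next section.
    if self.expect_high_latency():
        await websocket.send(self.filler_cache.get_random_filler())  # ~0ms RAM read
    response_text = await self.llm_client.generate(transcript)
    audio = await self.tts_client.synthesize(response_text)
    await websocket.send(audio)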
When to Send Fillers
Sending fillers on every request degrades UX—a 200ms response prefaced with “Hmm…” feels artificial. Use adaptive triggering:
┌──────────────────────────────────────────────────────────────────────────────┐
│ ADAPTIVE FILLER TRIGGERING │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────┐ │
│ │ ASR Final │ │
│ │ Transcript │ │
│ └───────┬───────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────┐ │
│ │ Estimate Expected Latency │ │
│ │ ───────────────────────── │ │
│ │ • LLM queue depth │ │
│ │ • GPU utilization % │ │
│ │ • Rolling P95 latency │ │
│ └─────────────┬───────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────┐ │
│ │ estimated_latency > 500ms? │ │
│ └─────────────┬───────────────┘ │
│ │ │
│ ┌──────────────────┼──────────────────┐ │
│ │ YES │ │ NO │
│ ▼ │ ▼ │
│ ┌──────────────────┐ │ ┌──────────────────┐ │
│ │ Send Filler │ │ │ Skip Filler │ │
│ │ ──────────────── │ │ │ ──────────────── │ │
│ │ Stream cached │ │ │ Proceed directly │ │
│ │ audio bytes │ │ │ to LLM call │ │
│ │ (~300ms audio) │ │ │ │ │
│ └────────┬─────────┘ │ └────────┬─────────┘ │
│ │ │ │ │
│ └────────────────────┴────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────┐ │
│ │ Continue Pipeline │ │
│ │ LLM → TTS → Audio │ │
│ └─────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
from typing import Optional

FILLER_THRESHOLD_MS = 500

async def maybe_send_filler(self, estimated_latency_ms: int) -> Optional[bytes]:
    """Only send filler when latency is noticeable."""
    if estimated_latency_ms > FILLER_THRESHOLD_MS:
        return self.filler_cache.get_random_filler()
    return None
Latency estimation approaches:
- Queue depth: If LLM request queue > N, expect delays
- Rolling average: Track recent P95 latency
- GPU utilization: If > 80%, expect queueing
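A minimal estimator combining these signals might look like the sketch below; the metric sources (queue depth, recorded latencies) are assumptions to be wired to your serving stack:

import statistics
from collections import deque

class LatencyEstimator:
    def __init__(self, window: int = 200):
        self.recent_ms: deque[float] = deque(maxlen=window)

    def record_latency(self, ms: float) -> None:
        """Call with each completed request's end-to-end latency."""
        self.recent_ms.append(ms)

    def estimate_ms(self, queue_depth: int, per_request_ms: float = 150.0) -> float:
        # Rolling P95 of recent latencies as the baseline estimate.
        if len(self.recent_ms) >= 20:
            p95 = statistics.quantiles(self.recent_ms, n=20)[18]
        else:
            p95 = 250.0  # cold-start assumption until enough samples arrive
        # Each request queued ahead adds roughly one request's worth of latency.
        return p95 + queue_depth * per_request_ms

The result feeds directly into maybe_send_filler above.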
Scaling Behavior
A common concern: does filler generation at startup cause problems during scale-out?
┌──────────────────────────────────────────────────────────────────────────────┐
│ SCALING: FILLER CACHE INITIALIZATION (1 → 5 Pods) │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ BEFORE SCALE (Normal operation) │
│ ═══════════════════════════════ │
│ │
│ ┌─────────┐ ┌─────────────────┐ │
│ │ Pod 1 │ │ TTS Service │ │
│ │ ─────── │ Production TTS │ (GPU) │ │
│ │ Cache ✓ │──────────────────────────▶│ │ │
│ │ Ready │ ~11 req/s sustained │ Handling │ │
│ └─────────┘ │ normal load │ │
│ └─────────────────┘ │
│ │
│ ────────────────────────────────────────────────────────────────────────── │
│ │
│ DURING SCALE-UP (HPA triggers 1 → 5) │
│ ════════════════════════════════════ │
│ │
│ ┌─────────┐ │
│ │ Pod 1 │────┐ ┌─────────────────┐ │
│ │ Cache ✓ │ │ Production TTS │ TTS Service │ │
│ └─────────┘ ├─────────────────────▶│ (GPU) │ │
│ │ │ │ │
│ ┌─────────┐ │ 5 phrases each │ Receives: │ │
│ │ Pod 2 │────┼─────────────────────▶│ • Prod load │ │
│ │ Init... │ │ (25 total) │ • 25 filler │ │
│ └─────────┘ │ │ requests │ │
│ │ ~2-3 sec burst │ │ │
│ ┌─────────┐ │ │ Impact: │ │
│ │ Pod 3 │────┼─────────────────────▶│ Equivalent to │ │
│ │ Init... │ │ │ ~2 sec of │ │
│ └─────────┘ │ │ prod traffic │ │
│ │ │ │ │
│ ┌─────────┐ │ │ │ │
│ │ Pod 4 │────┼─────────────────────▶│ │ │
│ │ Init... │ │ │ │ │
│ └─────────┘ │ └─────────────────┘ │
│ │ │
│ ┌─────────┐ │ │
│ │ Pod 5 │────┘ │
│ │ Init... │ │
│ └─────────┘ │
│ │
│ ────────────────────────────────────────────────────────────────────────── │
│ │
│ AFTER SCALE (All pods ready) │
│ ════════════════════════════ │
│ │
│ Pod 1 ─────┐ ┌─────────────────┐ │
│ Cache ✓ │ │ TTS Service │ │
│ │ │ (GPU) │ │
│ Pod 2 ─────┤ Production TTS only │ │ │
│ Cache ✓ │ (no filler gen) │ Normal load │ │
│ ├─────────────────────────▶│ distributed │ │
│ Pod 3 ─────┤ │ across 5 pods │ │
│ Cache ✓ │ │ │ │
│ │ │ │ │
│ Pod 4 ─────┤ └─────────────────┘ │
│ Cache ✓ │ │
│ │ │
│ Pod 5 ─────┘ │
│ Cache ✓ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Consider a Kubernetes deployment with HPA scaling from 1 → 5 pods:
T+0s:  Load spike detected
T+30s: HPA triggers scale-up
T+31s: 4 new pods start simultaneously
Each new pod initializes its filler cache:
- 5 phrases × ~200ms TTS = ~1 second per pod
- All 4 new pods hit TTS concurrently = 20 requests (25 on a cold deploy of all 5)
T+33s: All pods have cached fillers, ready to serve
Impact analysis:
- 20-25 short-phrase TTS requests = trivial workload
- Equivalent to ~2-3 seconds of normal production traffic
- New pods don’t receive traffic until ready (K8s readiness probes)
- Existing pods continue serving unaffected
The one-time startup cost is negligible compared to ongoing production TTS load.
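One way to enforce the "no traffic until ready" property above is to gate the pod's readiness endpoint on cache initialization. A minimal sketch, assuming FastAPI (create_tts_client is a hypothetical factory; any framework with a health endpoint works):

from fastapi import FastAPI, Response

app = FastAPI()
filler_cache: FillerAudioCache | None = None  # class defined earlier in this post

@app.on_event("startup")
async def warm_filler_cache():
    global filler_cache
    filler_cache = FillerAudioCache(create_tts_client())  # hypothetical factory
    await filler_cache.initialize()  # ~1 second, runs before readiness passes

@app.get("/healthz")
async def healthz():
    # K8s readiness probe target: return 503 until every phrase is cached,
    # so the pod receives no user traffic mid-initialization.
    ready = filler_cache is not None and len(filler_cache.cache) == len(filler_cache.phrases)
    return Response(status_code=200 if ready else 503)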
Client-Side Signaling Alternative
If you control the frontend application, consider not sending audio at all:
┌──────────────────────────────────────────────────────────────────────────────┐
│ CLIENT-SIDE SIGNALING ARCHITECTURE │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ SERVER NETWORK CLIENT │
│ ══════ ═══════ ══════ │
│ │
│ ┌────────────┐ ┌────────────────┐ │
│ │ Voice │ {"status": "thinking"} │ Mobile App / │ │
│ │ Gateway │───────────────────────────────────────▶│ Web Client │ │
│ │ │ ~50 bytes │ │ │
│ │ Detects │ (JSON signal) │ Receives │ │
│ │ latency │ │ signal │ │
│ │ threshold │ │ │ │ │
│ └────────────┘ │ ▼ │ │
│ │ ┌──────────┐ │ │
│ │ │ Local │ │ │
│ COMPARE: Audio stream │ │ Audio │ │ │
│ ───────────────────── │ │ Files │ │ │
│ ~50KB for 1 sec audio │ │ (bundled)│ │ │
│ Network dependent │ └────┬─────┘ │ │
│ Can drop packets │ │ │ │
│ │ ▼ │ │
│ │ Play locally │ │
│ │ (0ms latency) │ │
│ └────────────────┘ │
│ │
│ Benefits: │
│ • Zero bandwidth for filler audio │
│ • Instant playback (no network RTT) │
│ • Works with high packet loss │
│ • Can trigger haptic feedback │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
import json

# Server sends signal, not audio
async def notify_thinking(self, websocket):
    await websocket.send(json.dumps({"status": "thinking"}))
// Client plays local audio
socket.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.status === "thinking") {
    playLocalSound("thinking.mp3"); // Already on device
  }
};
Advantages:
- Zero bandwidth for filler audio
- Instant playback (no network latency)
- Works even with packet loss
- Can trigger haptic feedback on mobile
Disadvantage: Requires client modification; not applicable for telephony/PSTN integrations.
Comfort Noise as Alternative
Instead of linguistic fillers (“Hmm…”), consider non-linguistic audio, which sidesteps repetition fatigue:
- Breathing sounds: Subtle “intake of breath” signals “about to speak”
- Room tone: Low-volume ambient noise keeps the line “alive”
- Typing sounds: For text-based contexts, keyboard clicks indicate processing
These loops are tiny (KB-sized), can be extremely low bitrate, and don’t require TTS generation.
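Because these assets are static, they can be generated offline once. A minimal sketch producing a loopable room-tone file, assuming 16kHz mono 16-bit PCM output (numpy supplies the noise; duration and level are illustrative):

import wave
import numpy as np

SAMPLE_RATE = 16_000  # match your pipeline's output rate

def make_room_tone(seconds: float = 2.0, level: float = 0.01) -> bytes:
    """Loopable low-amplitude Gaussian noise as 16-bit mono PCM."""
    samples = np.random.normal(0.0, level, int(SAMPLE_RATE * seconds))
    return (np.clip(samples, -1.0, 1.0) * 32767).astype(np.int16).tobytes()

with wave.open("room_tone.wav", "wb") as f:
    f.setnchannels(1)       # mono
    f.setsampwidth(2)       # 16-bit samples
    f.setframerate(SAMPLE_RATE)
    f.writeframes(make_room_tone())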
Implementation Checklist
- Choose filler phrases: 5-10 variations to avoid repetition
- Generate at startup: Initialize cache before accepting traffic
- Match audio format: Same sample rate (16kHz/24kHz), encoding (PCM/Opus) as TTS output (see the check after this list)
- Add latency estimation: Only trigger when delay > threshold
- Monitor: Track filler send rate to understand real-world latency distribution
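For the format-matching item, a quick startup guard catches mismatches before they reach users. A minimal sketch, assuming the cache stores WAV-encoded bytes and sample_tts_audio is a hypothetical reference response from the production TTS engine:

import io
import wave

def wav_params(wav_bytes: bytes) -> tuple[int, int, int]:
    """(sample_rate, channels, sample_width) read from a WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as f:
        return f.getframerate(), f.getnchannels(), f.getsampwidth()

# Run once after initialize(): every filler must match real TTS output.
# sample_tts_audio: hypothetical reference clip from the production engine.
assert wav_params(filler_cache.get_random_filler()) == wav_params(sample_tts_audio)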
Summary
| If your problem is… | Solution |
|---|---|
| Users don’t know they were heard | Visual feedback / client signals |
| 1-2 second silence feels broken | Voice Clone Cache |
| Latency varies unpredictably | Threshold-based filler triggering |
| 5+ second latency is common | Scale GPU capacity (root cause) |
The Voice Clone Cache pattern provides the best balance: zero-latency playback with perfect voice consistency. The one-time startup cost (~1 second) is negligible, and the approach scales cleanly with horizontal pod scaling.
Filler audio is a UX optimization, not a fix for underlying throughput problems. If your P95 latency is consistently > 2 seconds, address GPU capacity first.