Use Cases for Voice Agents: What's Holding Us Back
Long, Performant STT
When handling long-running conversations or live-streaming audio, new concerns emerge. To keep the problem space manageable, I'd break the concerns down into three categories: Latency, Context, and Enterprise Compatibility.
Note
At the bottom, I'll keep a running list of problems we should acknowledge but that I'm explicitly not addressing, and why.
Latency - Real-time performance becomes significantly more important: we want our apps to react fluidly to user input. While ~500ms response gaps are typical in human-to-human conversation, ~800ms is a good target to aim for in conversational AI applications.
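To make that target concrete, here's a rough back-of-the-envelope budget. Every per-stage number below is an illustrative assumption, not a measurement of any particular stack:

```python
# Rough voice-to-voice latency budget; all stage numbers are assumptions.
latency_budget_ms = {
    "vad_endpointing": 200,   # deciding the user has stopped speaking
    "stt_finalization": 150,  # final transcript after end of speech
    "llm_first_token": 300,   # time to first token from the LLM
    "tts_first_audio": 150,   # time to first audio byte from TTS
}

total_ms = sum(latency_budget_ms.values())
print(f"Estimated voice-to-voice latency: {total_ms} ms")  # 800 ms in this sketch
```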
Context - Partial transcriptions are helpful for managing LLM conversation history and for reducing the cost of storing data (for voice AI, one minute of speech stored as audio takes about 13x more tokens than the equivalent text). Diarization is helpful for workflows that require speaker identification (e.g. doctors vs. patients).
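As a quick sanity check on that token math, here's a sketch under assumed speaking and tokenization rates. The ~13x ratio is the figure quoted above; the other numbers are common rules of thumb, not measurements:

```python
# Back-of-the-envelope cost of keeping one minute of speech in LLM history.
WORDS_PER_MINUTE = 150          # assumed conversational speaking rate
TEXT_TOKENS_PER_WORD = 1.3      # assumed tokenizer ratio for English text
AUDIO_TO_TEXT_TOKEN_RATIO = 13  # "~13x more tokens as audio" (figure from above)

text_tokens = WORDS_PER_MINUTE * TEXT_TOKENS_PER_WORD
audio_tokens = text_tokens * AUDIO_TO_TEXT_TOKEN_RATIO
print(f"~{text_tokens:.0f} text tokens vs ~{audio_tokens:.0f} audio tokens per minute")
```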
Enterprise Compatibility - Currently, WebSocket
Evals become even more important when working with voice AI. While we can
Real, performant STT has a ton of rabbit holes to deal with:
- Audio Quality: Mid-stream audio jitter, partial transcriptions, echo cancellation, background noise suppression, and more all need to be handled.
- Device Access: Browser-based access to microphones and the handling of permissions are non-standard and hacky.
- Conversation Design: push-to-talk, long-form conversations, and more.
- Performance: Chunking, parallel processing, and more.
- Reliability: Error handling, retries, and more (see the retry sketch after this list).
- Security: Authentication, E2E encryption, and more.
- Scalability: Horizontal scaling, load balancing, and more.
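To illustrate the reliability point above, here's a minimal retry-with-backoff wrapper. `transcribe_chunk` is a hypothetical stand-in for whatever async STT call your stack makes, and the backoff constants are arbitrary:

```python
import asyncio
import random
from typing import Awaitable, Callable


async def transcribe_with_retries(
    transcribe_chunk: Callable[[bytes], Awaitable[str]],  # hypothetical STT call
    audio_chunk: bytes,
    max_attempts: int = 3,
) -> str:
    """Retry a flaky async STT call with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await transcribe_chunk(audio_chunk)
        except Exception as exc:
            if attempt == max_attempts:
                raise
            # Exponential backoff with a little jitter: ~0.5s, ~1s, ~2s, ...
            delay = 0.5 * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            print(f"STT attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            await asyncio.sleep(delay)
```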
With that in mind, a few principles apply:
- Recoverability: Handle errors and retries gracefully
- Continuity: Stream audio in smaller, manageable chunks for transcription (a chunking sketch follows this list)
- Persistence: Persist transcripts for long-form conversations of up to 30 minutes
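Here's the chunking sketch mentioned under Continuity: slicing a raw PCM buffer into fixed-size pieces for incremental transcription. The 16 kHz, 16-bit format and 200ms chunk size are assumptions, not anything pipecat mandates:

```python
from typing import Iterator

SAMPLE_RATE = 16_000   # assumed 16 kHz mono PCM
BYTES_PER_SAMPLE = 2   # 16-bit samples
CHUNK_MS = 200         # assumed chunk size per transcription request
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000


def chunk_pcm(pcm: bytes) -> Iterator[bytes]:
    """Yield fixed-size chunks of raw PCM audio for incremental transcription."""
    for start in range(0, len(pcm), CHUNK_BYTES):
        yield pcm[start:start + CHUNK_BYTES]
```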
Browser/App Microphone Capture
If you're not familiar with WebRTC, here's a simple diagram to show how peer-to-peer audio streaming works:
(Person 1) ↔ (Signaling Server) ↔ (SFU) ↔ (Signaling Server) ↔ (Person 2)
                                    ↓
                         (Speech-to-Text Service)
                                    ↓
                         (Transcripts + Storage)
Let's see what this looks like in an implementation:
```python
import os

from dotenv import load_dotenv
from loguru import logger

# NOTE: exact import paths may differ slightly between pipecat versions.
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import Frame, TranscriptionFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.processors.logger import FrameLogger
from pipecat.services.openai.stt import OpenAISTTService
from pipecat.transports.base_transport import TransportParams
from pipecat.transports.network.small_webrtc import SmallWebRTCTransport
from pipecat.transports.network.webrtc_connection import SmallWebRTCConnection

load_dotenv(override=True)


class TranscriptionLogger(FrameProcessor):
    """Logs transcription frames."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TranscriptionFrame):
            logger.info(f"Transcription: {frame.text}")
        # Pass every frame along so downstream processors still receive it.
        await self.push_frame(frame, direction)


async def run_bot(webrtc_connection: SmallWebRTCConnection):
    """Runs a pipecat bot that transcribes audio from a microphone."""
    transport = SmallWebRTCTransport(
        webrtc_connection=webrtc_connection,
        params=TransportParams(
            audio_in_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
        ),
    )

    stt = OpenAISTTService(
        model="whisper-1",
        api_key=os.getenv("OPENAI_API_KEY"),
    )

    tl = TranscriptionLogger()

    logger.info("SmallWebRTCTransport created, waiting for audio...")

    # Pipeline: Transport -> STT -> Frame Logger -> Transcription Logger
    pipeline = Pipeline([
        transport.input(),
        stt,
        FrameLogger(),
        tl,
    ])

    task = PipelineTask(pipeline)

    @transport.event_handler("on_client_disconnected")
    async def on_client_disconnected(transport, client):
        logger.info(f"Client disconnected: {client}")

    @transport.event_handler("on_client_closed")
    async def on_client_closed(transport, client):
        logger.info(f"Client closed connection: {client}")
        await task.cancel()

    runner = PipelineRunner(handle_sigint=False)
    await runner.run(task)


if __name__ == "__main__":
    from run import main

    # run.main():
    # 1. Loads the bot file
    # 2. Starts a web server with WebRTC endpoints
    # 3. Handles the application lifecycle
    # 4. Logs status
    main()
```
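For orientation, here's a rough sketch of the shape such a runner can take, using FastAPI. This is not pipecat's actual `run.py`: `negotiate_webrtc` is a hypothetical placeholder for the SDP offer/answer handling that pipecat's SmallWebRTC helpers provide, and the module name, route, and port are assumptions.

```python
# Sketch of a runner script (NOT pipecat's actual run.py).
import uvicorn
from fastapi import BackgroundTasks, FastAPI

from bot import run_bot  # assumes the bot code above lives in bot.py


async def negotiate_webrtc(sdp: str, sdp_type: str):
    """Hypothetical placeholder: build a WebRTC connection from an SDP offer
    and return (connection, sdp_answer). In practice, pipecat's SmallWebRTC
    connection helpers handle this."""
    raise NotImplementedError


app = FastAPI()


@app.post("/api/offer")
async def offer(request: dict, background_tasks: BackgroundTasks):
    # 1. Accept the browser's SDP offer and negotiate a connection.
    connection, answer = await negotiate_webrtc(request["sdp"], request["type"])
    # 2. Run the transcription bot for this connection in the background.
    background_tasks.add_task(run_bot, connection)
    # 3. Return the SDP answer so the browser can complete the handshake.
    return answer


def main():
    # 4. Serve the endpoints and log status.
    uvicorn.run(app, host="0.0.0.0", port=7860)
```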
The full code can be found here.