Jun 22, 2026

Real-Time Speech for Conversational AI

Today’s digest spotlights low-latency speech generation, prosody-aware voice conversion, and robust spoken-language agents. The common thread: speech systems that sound better, respond faster, and work more reliably in live conversation.

Figure: Model overview of S5-TTS, showing the streaming architecture and limited-lookahead mechanism. From S5-TTS.

TTS, Prosody & Voice Conversion

S5-TTS

Streaming T5-based Text-to-Speech Synthesis with Limited Lookahead

S5-TTS performs streaming, low-latency text-to-speech synthesis by generating speech word by word with limited lookahead. It preserves quality and speaker similarity through monotonic alignment and lookahead-causal masks, which makes it well suited to real-time conversational AI systems.
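To make the idea of a lookahead-causal mask concrete, here is a minimal sketch in plain Python. It assumes the simplest reading of the term: position i may attend to every earlier position plus at most `lookahead` future positions. The function name, the boolean convention (True means "may attend"), and the position-level granularity are illustrative assumptions, not details taken from the S5-TTS paper, which may build its masks differently (for example, at the word level).

```python
def lookahead_causal_mask(num_positions: int, lookahead: int) -> list[list[bool]]:
    """Build a hypothetical lookahead-causal attention mask.

    mask[i][j] is True when query position i may attend to key position j,
    i.e. when j is in the past, is i itself, or lies at most `lookahead`
    positions in the future. With lookahead=0 this is an ordinary causal mask.
    """
    if num_positions < 0 or lookahead < 0:
        raise ValueError("num_positions and lookahead must be non-negative")
    return [
        [j <= i + lookahead for j in range(num_positions)]
        for i in range(num_positions)
    ]


if __name__ == "__main__":
    # Visualize a 5-position mask with one position of lookahead.
    for row in lookahead_causal_mask(5, 1):
        print("".join("x" if allowed else "." for allowed in row))
    # Prints:
    # xx...
    # xxx..
    # xxxx.
    # xxxxx
    # xxxxx
```

A larger `lookahead` gives the model more future context at the cost of waiting for more input before it can emit audio, which is the latency/quality trade-off a limited-lookahead design has to balance.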

ProsoCodec

ProsoCodec: Prosody-Oriented Speech Codec for Voice Conversion

ProsoCodec is a prosody-oriented speech codec designed for voice conversion. By conditioning on text and speaker embeddings, it isolates the residual prosody, which improves prosody preservation and reduces source-timbre leakage during conversion.

SpeechLLMs & Voice Agents

CORTIS

CORTIS: Text-Only Adaptation of Spoken Language Models for Task-Oriented Voice Agents

CORTIS adapts spoken language models for task-oriented voice agents using text-only supervision, enabling direct speech-to-structured-output generation without paired speech-target data. It is more robust than ASR-LLM cascades under noise, especially in preserving high-level semantic understanding.

Streaming ASR Architectures

Online Predictive Coding for Dual-Mode Speech

Online Predictive Coding for Dual-Mode Self-Supervised Speech Model

This paper introduces Online Predictive Coding (OPC) to improve dual-mode self-supervised speech models, which handle both streaming and offline modes with shared parameters. OPC regularizes online registers to predict future frames, which narrows the performance gap between the two modes and stabilizes training for low-latency speech recognition.
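One common way a single set of weights can serve both modes is to switch only the attention mask: full context offline, chunk-limited context when streaming. The sketch below illustrates that general pattern in plain Python. It is an assumption for illustration only: the digest does not say how this paper masks attention, and the sketch does not implement OPC or its online registers. The function name, the `mode` strings, and the `chunk_size` parameter are hypothetical.

```python
def dual_mode_mask(num_frames: int, mode: str, chunk_size: int = 4) -> list[list[bool]]:
    """Return mask[i][j] == True when frame i may attend to frame j.

    "offline":   every frame sees the whole utterance.
    "streaming": a frame sees all frames up to the end of its own chunk,
                 so no information leaks in from later chunks.
    """
    if mode == "offline":
        return [[True] * num_frames for _ in range(num_frames)]
    if mode == "streaming":
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        return [
            [j // chunk_size <= i // chunk_size for j in range(num_frames)]
            for i in range(num_frames)
        ]
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    # Same (hypothetical) model, two masks: only the visible context changes.
    for mode in ("offline", "streaming"):
        print(mode)
        for row in dual_mode_mask(4, mode, chunk_size=2):
            print("  " + "".join("x" if allowed else "." for allowed in row))
```

Under this kind of setup the streaming mode sees strictly less context than the offline mode, which is one plausible source of the gap that a regularizer such as OPC aims to reduce.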