Jun 27, 2026

Reliable voice AI for multi-speaker dialogue and long-form synthesis

Today’s digest spans speech LLMs, voice agents, and TTS: from diarization-aware multi-speaker grounding and persona-driven speech role-play to more robust, faster, and longer-form speech generation. The common thread is making voice systems more controllable, consistent, and reliable.

Proposed Decoupled Speech Role-Playing Agent (DeSRPA) framework. A Frozen LLM Controller (left) steers a Frozen StyleTTS~2~ (right) via Inference-Time Control Vectors, injecting personality and acoustic styles directly without parameter updates. From DeSRPA.

SpeechLLMs & Voice Agents

Dixtral

Grounding Spoken LLMs in Multi-Speaker Audio via Diarization Conditioning

Dixtral improves multi-speaker transcription by conditioning the acoustic encoder on diarization masks to isolate speakers, while keeping the language model decoder fixed. This preserves language reasoning and supports zero-shot multi-speaker QA without altering the decoder or output format.

DeSRPA

DeSRPA: Decoupled Speech Role-Playing Agent via Inference-Time Intervention

DeSRPA is a speech role-playing agent framework that decouples persona reasoning from speech rendering at inference. It uses frozen LLM and TTS models steered by control vectors, enabling scalable, training-free character adaptation with enhanced personality and emotional consistency.

TTS & Voice Synthesis

MagpieTTS-LF

MagpieTTS-LF: Inference-Time Long-Form Speech Generation Without Training on Long-Form data

MagpieTTS-LF improves long-form speech synthesis without retraining by maintaining context and guiding attention across sentence chunks. It combines stateful chunk generation, soft attention priors, and history-aware text encoding to enhance prosodic continuity, speaker consistency, and boundary smoothness.

Reliable Neural-Codec TTS

Reliable Neural-Codec Text-to-Speech by ASR Self-Verification and Distillation: Near-Zero Catastrophic Failures Across Models and Codecs

Eliminates catastrophic failures in neural-codec text-to-speech by ASR self-verification and distillation, improving single-shot decoding robustness across models and codecs without added inference cost.

MeanFlow Token2Wav

One-Step Token-to-Waveform Generation with MeanFlow in Latent Space

This paper presents a one-step Token-to-Waveform synthesis using MeanFlow in latent space, greatly speeding up generation while keeping quality high. It replaces costly iterative decoding with a single latent prediction, plus novel refinements to improve output fidelity without slowing inference.