Jun 27, 2026

Speech Agents Learn to Listen

Today’s digest spans full-duplex spoken dialogue, emotional voice synthesis, and ASR that catches hesitation and disfluency. Together, these papers push voice agents toward more natural, responsive, and expressive interaction.

BayLing-Duplex model architecture, showing the multi-channel autoregressive design that integrates speech and text dialogue management. From BayLing-Duplex.

SpeechLLMs & Spoken Dialogue

BayLing-Duplex

BayLing-Duplex: Native Full-Duplex Speech Dialogue with a Single Autoregressive LLM

BayLing-Duplex is a full-duplex speech dialogue system that uses a single autoregressive LLM to listen and speak simultaneously, handling overlaps and interruptions without external turn-taking modules. It models dialogue states as token predictions, which enables real-time interaction.
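
To make "dialogue states as token predictions" concrete, here is a minimal sketch of how a decoded stream could carry its own turn-taking signal. The control tokens `<listen>` and `<speak>` and both helper functions are hypothetical, chosen for illustration; the paper's actual vocabulary and channel layout are not described in this digest.

```python
from typing import List, Tuple

# Hypothetical control tokens (assumption, not the paper's vocabulary).
LISTEN = "<listen>"
SPEAK = "<speak>"


def track_dialogue_state(tokens: List[str]) -> List[Tuple[str, str]]:
    """Pair each decoded token with the dialogue state in effect when it was emitted."""
    state = "listening"  # assume the agent starts out listening
    labeled = []
    for tok in tokens:
        if tok == LISTEN:
            state = "listening"
        elif tok == SPEAK:
            state = "speaking"
        labeled.append((tok, state))
    return labeled


def spoken_tokens(tokens: List[str]) -> List[str]:
    """Keep only content tokens produced while the state is 'speaking'."""
    return [
        tok
        for tok, state in track_dialogue_state(tokens)
        if state == "speaking" and tok not in (LISTEN, SPEAK)
    ]


# Toy stream: the agent speaks, yields the floor to the user, then resumes.
stream = [SPEAK, "hi", "there", LISTEN, "_", SPEAK, "ok"]
print(spoken_tokens(stream))
```

Because the state change is just another predicted token, no separate turn-taking component is needed in this toy setup: an interruption shows up as the model emitting `<listen>` mid-utterance.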

TTS & Voice Synthesis

Latent-Conditioned Emotional TTS

An Empirical Study on Learning Latent Representations for Emotional Speech Synthesis

This work presents a system for emotional speech synthesis in Vietnamese, extending FastSpeech 2 with latent emotion and speaker embeddings plus a prosody bottleneck. It emphasizes emotional control and speaker adaptation on limited, noisy datasets through careful preprocessing.
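
As a rough illustration of embedding-based conditioning, the sketch below adds a speaker vector and an emotion vector to every encoder frame, then projects each frame through a narrow bottleneck. The function names, the additive combination, and the linear bottleneck are assumptions made for this example; the digest does not specify how the paper combines these components.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def condition_frames(hidden: Matrix, speaker: Vector, emotion: Vector) -> Matrix:
    """Add the same speaker and emotion vectors to every encoder frame (assumed additive)."""
    return [
        [h + s + e for h, s, e in zip(frame, speaker, emotion)]
        for frame in hidden
    ]


def bottleneck(frames: Matrix, weights: Matrix) -> Matrix:
    """Project each frame to a lower dimension; weights has shape (k, d) with k < d."""
    return [
        [sum(w * x for w, x in zip(row, frame)) for row in weights]
        for frame in frames
    ]


# Two encoder frames of dimension 2, with toy speaker and emotion vectors.
hidden = [[1.0, 2.0], [3.0, 4.0]]
speaker = [0.5, 0.5]
emotion = [-1.0, 1.0]

conditioned = condition_frames(hidden, speaker, emotion)
compressed = bottleneck(conditioned, weights=[[1.0, 1.0]])  # 2 dims down to 1
print(conditioned, compressed)
```

In a real model the vectors and the projection would be learned; the point here is only that a narrow projection limits how much information can pass through to the prosody predictors.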

ASR for Voice Agents

Disfluency-Aware ASR

Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR

This work presents a continual-learning method that enables ASR models to detect and emit disfluency markers such as fillers and pauses. It balances transcription accuracy against learning disfluency detection, using explicit tokens and adapted training strategies.
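
One way to picture "explicit tokens" is as a transformation of the training transcripts: disfluencies are replaced by marker tokens the model learns to emit, and the markers are stripped again when scoring plain transcription accuracy. The `<filler>` token, the filler word list, and both helpers below are hypothetical and cover fillers only; the paper's actual token inventory and pause handling are not given in this digest.

```python
from typing import List

# Hypothetical filler inventory and marker token (assumptions for illustration).
FILLERS = {"uh", "um", "er"}
FILLER_TOKEN = "<filler>"


def annotate(words: List[str]) -> List[str]:
    """Replace filler words with an explicit marker token for use as a training target."""
    return [FILLER_TOKEN if w.lower() in FILLERS else w for w in words]


def strip_markers(tokens: List[str]) -> List[str]:
    """Drop marker tokens to recover a clean transcript, e.g. before computing WER."""
    return [t for t in tokens if not (t.startswith("<") and t.endswith(">"))]


verbatim = "so um I think uh yes".split()
target = annotate(verbatim)
print(target)
print(strip_markers(target))
```

Scoring the stripped output against a clean reference, alongside marker precision and recall on the annotated output, is one simple way to watch both sides of the accuracy trade-off during continual training.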