Jun 20, 2026

Speech systems move toward richer dialogue, controllable voices, and stronger robustness

Today’s digest spans full-duplex spoken dialogue, multi-speaker scene generation, controllable TTS, and improved ASR representations. The papers focus on making speech systems more natural, more editable, and more reliable in noisy or complex settings.

Figure: ScenA teaser showing multi-speaker conversational scenes, with overlapping speech, paralinguistic events, and ambient sound, synthesized from natural-language prompts and reference voices. From ScenA.

SpeechLLMs & Spoken Dialogue

Full-Duplex Spoken Dialogue Survey

A Survey of Full-Duplex Spoken Dialogue Systems: Architectural Hierarchy, Interaction Ontology, and Decision State Machine

This survey organizes full-duplex spoken dialogue systems under a new framework covering their architectural hierarchy, interaction ontology, and decision state machine. It audits existing systems and data to reveal gaps and outlines future research directions in full-duplex conversation.

ScenA

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

ScenA generates natural multi-speaker audio scenes from reference voices and free-form prompts, without turn-based tags. A high-noise training scheme makes speaker assignment follow the text, and the model draws on real-world audio to produce rich overlapping dialogue, ambient sound, and paralinguistic events.

TTS & Voice Synthesis

FineCombo-TTS

FineCombo-TTS: Collaborative and Precise Controllable Speech Synthesis Using Text Descriptions and Reference Speech

FineCombo-TTS combines reference speech and text descriptions for precise, flexible control of speech synthesis, jointly modeling timbre, prosody, and emotion. It learns a unified acoustic embedding and uses a text-guided flow model to edit speech attributes as relative changes from the reference.

Phoneme Addition Transfer

Exploring Pre-training Benefits on Phoneme Addition through Fine-tuning in Speech Synthesis

This paper studies how pre-training affects the addition of new phonemes during fine-tuning in text-to-speech. Using a synthetic phoneme-controlled setup and real cross-lingual transfer, it shows that pre-training mainly boosts naturalness but offers limited help for learning new phonemes.

ASR & Speech Representation

DASH

DASH: Dual-View Self-Distillation with Multi-Layer Hidden Representations for Robust Speech Recognition

DASH is a self-distillation framework for robust speech recognition that aligns prototype assignment distributions from multiple encoder layers between clean and noisy views. This approach stabilizes training and learns noise-invariant representations, improving robustness without sacrificing clean accuracy.
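
To make the idea of aligning prototype assignment distributions concrete, here is a minimal sketch, not the paper's implementation. It assumes dot-product similarity to a set of prototypes, a softmax with a temperature, the clean view acting as the target, and a KL divergence averaged over layers. All of these choices, and every function and parameter name, are illustrative assumptions rather than details taken from DASH.

```python
import math

def softmax(scores, temperature=1.0):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def prototype_assignment(hidden, prototypes, temperature=0.1):
    # Soft assignment of one hidden vector to each prototype,
    # using dot-product similarity (an assumption of this sketch).
    scores = [sum(h * p for h, p in zip(hidden, proto)) for proto in prototypes]
    return softmax(scores, temperature)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def dual_view_loss(clean_layers, noisy_layers, prototypes_per_layer, temperature=0.1):
    # clean_layers / noisy_layers: one hidden vector per encoder layer.
    # The clean-view distribution is treated as the target (assumed direction).
    losses = []
    for clean_h, noisy_h, protos in zip(clean_layers, noisy_layers, prototypes_per_layer):
        target = prototype_assignment(clean_h, protos, temperature)
        pred = prototype_assignment(noisy_h, protos, temperature)
        losses.append(kl_divergence(target, pred))
    return sum(losses) / len(losses)

# Toy example: two layers, two prototypes per layer, 2-dimensional hidden states.
prototypes = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [-1.0, 1.0]]]
clean = [[0.9, 0.1], [0.5, 0.4]]
noisy = [[0.6, 0.3], [0.2, 0.5]]
loss_same = dual_view_loss(clean, clean, prototypes)
loss_diff = dual_view_loss(clean, noisy, prototypes)
```

In this toy setup the loss is zero when the two views produce identical hidden states and positive when noise shifts the assignments, which is the signal an alignment objective of this kind would minimize.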

S-JEPA

S-JEPA: Soft Clustering Anchors for Self-Supervised Speech Representation Learning

S-JEPA uses soft Gaussian Mixture Model posteriors for self-supervised speech representation learning, avoiding hard cluster assignments and offline re-clustering. It continuously updates clusters online, preserving acoustic ambiguity and improving efficiency and performance on speech recognition and emotion tasks.
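
The difference between soft GMM posteriors and hard cluster assignments can be illustrated with a small sketch. It assumes diagonal-covariance Gaussians and a simple responsibility-weighted online mean update. The update rule, the learning rate, and all names below are hypothetical choices for illustration, not the procedure described in S-JEPA.

```python
import math

def gmm_posteriors(x, means, variances, weights):
    # Posterior probability of each mixture component given frame x,
    # for a diagonal-covariance GMM, computed in the log domain.
    log_probs = []
    for mu, var, w in zip(means, variances, weights):
        ll = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_probs.append(ll)
    m = max(log_probs)
    exps = [math.exp(lp - m) for lp in log_probs]
    total = sum(exps)
    return [e / total for e in exps]

def online_mean_update(means, x, posteriors, lr=0.05):
    # Move each mean toward x in proportion to its posterior
    # (an assumed online rule, used here instead of offline re-clustering).
    return [
        [mi + lr * r * (xi - mi) for xi, mi in zip(x, mu)]
        for mu, r in zip(means, posteriors)
    ]

# A frame that lies exactly between two components.
means = [[-1.0], [1.0]]
variances = [[1.0], [1.0]]
weights = [0.5, 0.5]
frame = [0.0]

soft = gmm_posteriors(frame, means, variances, weights)
hard = max(range(len(soft)), key=lambda k: soft[k])  # a hard label discards the tie
new_means = online_mean_update(means, frame, soft)
```

For the ambiguous frame, the soft posterior splits its mass evenly across both components, while a hard assignment has to pick one of them arbitrarily. That is the sense in which soft targets preserve acoustic ambiguity.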