Jun 8, 2026

Voice Gets More Real-Time

Today’s digest spans holistic video dubbing, streaming speech LLMs, expressive and waveform-native TTS, and low-latency voice conversion. The common thread: better control, faster inference, and more natural-sounding speech across generation and recognition.

Figure: HoliDubber overall framework diagram (from HoliDubber).

Video Dubbing & Visual Speech Alignment

HoliDubber

HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis

HoliDubber generates both speech and sound effects from text prompts for video dubbing, aligning the audio with visual lip movements to produce a natural, holistic acoustic scene. Unlike prior speech-only models, it produces synchronized speech and sound effects from a single model.

SpeechLLMs & Streaming Recognition

TRADE

TRADE: Transducer-Augmented Decoder for Speech LLM

TRADE integrates a speech transducer with a large multimodal language model, tightly coupling frame-synchronous acoustic alignment with linguistic reasoning. The result is a single system that supports accurate, low-latency streaming as well as flexible long-form speech recognition.

TTS & Expressive Voice Synthesis

EmoInstruct-TTS

EmoInstruct-TTS: Dual-Path Instruction-Guided Emotional Speech Synthesis

EmoInstruct-TTS is a dual-path emotional TTS system that separates semantic planning from fine-grained emotion control using natural language instructions and a rich emotion embedding. This approach improves emotion accuracy and speech naturalness beyond prior instruction-based systems.

BareWave

BareWave: Waveform-Native Flow-Matching Text-to-Speech

BareWave is a waveform-native flow-matching text-to-speech model that synthesizes waveforms directly from text and prompt audio, without intermediate steps. It introduces new training methods to address the challenges of waveform modeling, enabling high-quality zero-shot voice cloning with a streamlined inference process.
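For readers new to the term in the paper's title, here is a minimal sketch of generic flow matching on a toy vector. It assumes the common linear-path formulation (interpolate between a noise sample and a data sample, regress the constant velocity between them) and is not BareWave's training recipe, which this digest does not describe. The function names and the `oracle` stand-in for a trained network are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def interpolate(noise: Vector, data: Vector, t: float) -> List[float]:
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise: Vector, data: Vector) -> List[float]:
    """Regression target on the linear path: the constant velocity data - noise."""
    return [d - n for n, d in zip(noise, data)]


def euler_sample(
    velocity_fn: Callable[[List[float], float], Sequence[float]],
    noise: Vector,
    steps: int,
) -> List[float]:
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check: an "oracle" velocity (hypothetical stand-in for a trained model)
# carries the noise sample to the data sample.
noise_sample = [0.0, 0.0]
data_sample = [1.0, -2.0]
oracle = lambda x, t: target_velocity(noise_sample, data_sample)
generated = euler_sample(oracle, noise_sample, steps=4)
```

In a real system the oracle would be a neural network conditioned on text and prompt audio, and the vectors would be waveform samples rather than two-element lists.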

Voice Conversion & Realtime Speech

MeanVC 2

MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion

MeanVC 2 improves streaming zero-shot voice conversion in two ways: future-receptive chunking for stable low-latency output, and a universal timbre token encoder for robustness to low-quality references. Together these enable real-time speaker conversion with better naturalness and speaker similarity.
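To make the chunking idea concrete, here is a minimal sketch of generic lookahead chunking, in which each chunk is processed together with a small window of future frames. This is an assumption about what "future-receptive" means in general terms, not MeanVC 2's actual scheme. The function names, chunk sizes, and the 20 ms frame duration are illustrative only.

```python
from typing import Iterator, List, Sequence, Tuple


def lookahead_chunks(
    frames: Sequence[float], chunk_size: int, lookahead: int
) -> Iterator[Tuple[List[float], List[float]]]:
    """Yield (current_chunk, future_context) pairs over a frame sequence."""
    if chunk_size < 1 or lookahead < 0:
        raise ValueError("chunk_size must be >= 1 and lookahead >= 0")
    for start in range(0, len(frames), chunk_size):
        end = min(start + chunk_size, len(frames))
        future_end = min(end + lookahead, len(frames))
        # The model would emit output for frames[start:end] only, but may
        # also read frames[end:future_end] as right-hand context.
        yield list(frames[start:end]), list(frames[end:future_end])


def algorithmic_latency_ms(chunk_size: int, lookahead: int, frame_ms: int) -> int:
    """Audio that must be buffered before a chunk can be processed."""
    return (chunk_size + lookahead) * frame_ms


chunks = list(lookahead_chunks(range(10), chunk_size=4, lookahead=2))
# chunks == [([0, 1, 2, 3], [4, 5]), ([4, 5, 6, 7], [8, 9]), ([8, 9], [])]
latency = algorithmic_latency_ms(chunk_size=4, lookahead=2, frame_ms=20)
# latency == 120
```

The trade-off is visible in the latency function: more lookahead gives each chunk more right-hand context, which tends to smooth chunk boundaries, but every extra frame adds directly to the delay.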

Palindromic-VC

From A to B to A: Palindromic Zero-Shot Voice Conversion with Non-Parallel Data

Palindromic-VC is a zero-shot voice conversion method trained on non-parallel data. It builds synthetic training pairs via KNN retrieval over self-supervised features, targeting strong speaker identity preservation and generalization across languages without parallel data.
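The retrieval step can be illustrated with a minimal sketch of frame-level KNN matching: each source feature frame is replaced by the average of its k most similar frames from a reference speaker's pool, ranked by cosine similarity. This is a generic illustration under that assumption, not Palindromic-VC's pipeline. The function names and the tiny 2-dimensional "features" are hypothetical stand-ins for real self-supervised features.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def knn_indices(query: Vector, pool: Sequence[Vector], k: int) -> List[int]:
    """Indices of the k pool frames most similar to the query frame."""
    ranked = sorted(
        range(len(pool)),
        key=lambda i: cosine_similarity(query, pool[i]),
        reverse=True,
    )
    return ranked[:k]


def knn_convert(source: Sequence[Vector], pool: Sequence[Vector], k: int) -> List[List[float]]:
    """Replace each source frame with the mean of its k nearest pool frames."""
    if not pool or k < 1:
        raise ValueError("pool must be non-empty and k must be >= 1")
    converted = []
    for frame in source:
        idx = knn_indices(frame, pool, k)
        converted.append(
            [sum(pool[i][d] for i in idx) / len(idx) for d in range(len(frame))]
        )
    return converted


# Toy features: the source frame sits closest to pool frames 0 and 2.
reference_pool = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
source_frames = [[1.0, 0.05]]
nearest = knn_indices(source_frames[0], reference_pool, k=2)
converted_frames = knn_convert(source_frames, reference_pool, k=2)
```

Pairing a source sequence with its retrieved counterpart is one way to obtain a synthetic (source, target) training pair without any parallel recordings.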