Jun 6, 2026

Smarter Voices, Fewer Errors

Today’s digest spotlights more expressive speech synthesis, emotion conversion, and better ASR reliability for voice agents. From image-based TTS and continuous latent speech models to hallucination steering in Whisper, the focus is on making spoken AI sound better and fail less.

Overview of the proposed method. From Pixel-TTS.

TTS & Voice Synthesis

dots.tts

dots.tts Technical Report

dots.tts is a 2B-parameter continuous autoregressive text-to-speech model that generates speech in a semantically structured continuous latent space. Its main contributions are full-history conditioning and self-corrective post-training, aimed at robust, expressive, low-latency multilingual speech.

Pixel-TTS

Pixel-TTS: Image based Text Rendering for Robust Text-to-Speech

Pixel-TTS renders text as images to obtain robust, visually grounded embeddings for text-to-speech. This improves synthesis quality, speeds up training, and strengthens zero-shot multilingual generalization without expanding the embedding matrix.

Expressive Voice Conversion

TargetSEC

TargetSEC: Plug-and-Play In-the-Wild Speech Emotion Conversion via Arousal-Conditioned Latent Style Diffusion

TargetSEC introduces a latent diffusion method that converts speech emotion by generating emotion-aligned style embeddings conditioned on speaker identity and arousal level. It delivers high-quality emotion conversion without altering the core speech synthesis model and performs well on in-the-wild speech data.

ASR Reliability for Voice Agents

Whisper Hallucination Steering

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

This paper detects and mitigates hallucinations in Whisper ASR outputs by steering the audio encoder's internal representations, especially via sparse autoencoder latents. It reduces hallucinations at inference time without fine-tuning, while maintaining transcription quality.
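
To make "steering" concrete, here is a minimal sketch of one common form of inference-time representation steering: removing the component of a hidden vector that lies along a chosen direction. It is an illustration only, not the paper's method. The function names, the toy 4-dimensional frame, the direction vector, and the `alpha` strength parameter are all assumptions made for this example, and no Whisper or sparse autoencoder code is involved.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length; reject the zero vector."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("steering direction must be non-zero")
    return [x / norm for x in v]


def steer(hidden, direction, alpha=1.0):
    """Subtract alpha times the projection of `hidden` onto `direction`.

    alpha=0.0 leaves the vector unchanged; alpha=1.0 removes the
    component along `direction` entirely. Both names are hypothetical.
    """
    unit = normalize(direction)
    coeff = dot(hidden, unit)
    return [h - alpha * coeff * u for h, u in zip(hidden, unit)]


# Toy stand-in for one encoder frame and a hypothetical
# "hallucination-associated" direction (assumed, not taken from the paper).
frame = [0.5, -1.0, 2.0, 0.25]
direction = [0.0, 0.0, 1.0, 0.0]

steered = steer(frame, direction, alpha=1.0)
```

In this toy case the steered frame has no remaining component along the chosen direction, and the other coordinates are untouched. In a real system the direction would have to be estimated from data, for example from a sparse autoencoder latent as the summary suggests, and the edit would be applied to the model's activations during the forward pass. No weights change, which is why this kind of intervention needs no fine-tuning.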