Jun 21, 2026

Speech synthesis, voice conversion, and interactive avatars push toward real-time realism

Today’s digest spans transcript-free text-to-speech, diffusion-guided speech generation, streaming voice conversion, and speech codecs built for cleaner identity control. It also features physically interactive 3D avatars that deform realistically under contact and motion.

An illustration of our framework. (a) To faithfully reflect the user-defined motion, we decouple the kinematic velocity from the deformation gradient update. (b) By computing the velocity from the transformations of the embedded skeletal structure, our method preserves pose consistency throughout the simulation. From PIAvatar.

Digital Humans & 3D Avatars

PIAvatar

PIAvatar: Physically Interactive Avatars via Deformation Gradient Decoupling

PIAvatar enables physically interactive 3D human avatars by decoupling kinematic movement from deformation, allowing realistic physical interactions with environments and other avatars. It combines a skeletal pose tracking system with a material simulation for real-time, controllable, non-rigid avatar deformation.

TTS & Voice Synthesis

RTFree-F5

Transcript-Free Flow-Matching Text-to-Speech via Speech Feature Conditioning

RTFree-F5 adapts flow-matching zero-shot TTS to condition on self-supervised speech features instead of reference transcripts at inference. This improves robustness and speech quality for atypical speakers, such as those with dysarthria, while remaining competitive on typical voices.

Score Subnet

Repurposing a Speech Classifier for Guided Diffusion-Based Speech Generation

This work repurposes a pretrained noise-conditioned speech classifier for diffusion-based speech generation with a lightweight decoder. It unifies classifier guidance and score modeling in one backbone, cutting parameters and compute while delivering high-quality conditional speech synthesis.
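
For readers new to classifier guidance, the underlying identity is that the class-conditional score equals the unconditional score plus the gradient of the classifier's log-probability. The toy sketch below checks that identity in one dimension with two analytic Gaussians. It is an illustration only, not the paper's model: there is no noise conditioning or shared backbone here, and every name (`MEANS`, `score_uncond`, `classifier_grad`, `guided_score`, `scale`) is hypothetical.

```python
import math

# Toy assumption: two equally likely classes, p(x|y=0)=N(-2,1) and p(x|y=1)=N(+2,1).
MEANS = {0: -2.0, 1: 2.0}

def gauss_pdf(x, mean):
    """Density of a unit-variance Gaussian centred at `mean`."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)

def posterior(x, y):
    """Classifier output p(y|x) under equal class priors."""
    total = sum(gauss_pdf(x, m) for m in MEANS.values())
    return gauss_pdf(x, MEANS[y]) / total

def score_uncond(x):
    """d/dx log p(x) for the equal-weight mixture of both classes."""
    # The score of a mixture is the posterior-weighted sum of component scores.
    return sum(posterior(x, y) * (MEANS[y] - x) for y in MEANS)

def classifier_grad(x, y, eps=1e-5):
    """d/dx log p(y|x), estimated with a central finite difference."""
    hi = math.log(posterior(x + eps, y))
    lo = math.log(posterior(x - eps, y))
    return (hi - lo) / (2 * eps)

def guided_score(x, y, scale=1.0):
    """Unconditional score plus scaled classifier gradient."""
    return score_uncond(x) + scale * classifier_grad(x, y)

if __name__ == "__main__":
    x = 0.5
    for y in MEANS:
        # With scale=1 this should match the exact conditional score MEANS[y] - x.
        print(y, guided_score(x, y), MEANS[y] - x)
```

With `scale=1.0` the guided score reproduces the exact conditional score; larger values push samples further toward the target class at the cost of diversity.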

Voice Conversion & Streaming Speech

Zero-VC

Zero-VC: Zero-Lookahead Streaming Voice Conversion via Speaker Anonymization

Zero-VC uses speaker anonymization to enable zero-lookahead streaming voice conversion that minimizes timbre leakage while preserving prosody. This real-time approach better balances identity hiding and utility, allowing low-latency conversion to unseen speakers without buffering future audio frames.
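
To make "zero-lookahead" concrete, the minimal sketch below processes one frame at a time using only the current and past frames, so each output is available as soon as its input arrives. It is a generic causal filter, not Zero-VC's architecture: scalar frames stand in for feature vectors, and the names `CausalStreamer`, `push`, and `stream` are hypothetical.

```python
from collections import deque

class CausalStreamer:
    """Emits one output per input frame from current and past frames only."""

    def __init__(self, weights):
        # weights[0] applies to the current frame, weights[k] to the frame k steps back.
        self.weights = list(weights)
        # History starts as silence; maxlen makes the oldest frame drop off automatically.
        self.history = deque([0.0] * len(self.weights), maxlen=len(self.weights))

    def push(self, frame):
        """Consume one frame and return its output immediately (no future frames needed)."""
        self.history.appendleft(frame)  # newest frame first
        return sum(w * h for w, h in zip(self.weights, self.history))

def stream(frames, weights):
    """Run a sequence through a fresh streamer, one frame at a time."""
    streamer = CausalStreamer(weights)
    return [streamer.push(f) for f in frames]

if __name__ == "__main__":
    weights = [0.5, 0.3, 0.2]
    a = stream([1.0, 2.0, 3.0, 4.0], weights)
    b = stream([1.0, 2.0, 3.0, 99.0], weights)
    # Changing a later frame leaves all earlier outputs untouched.
    print(a[:3] == b[:3])
```

Because no output depends on frames that have not yet arrived, the only latency is the per-frame compute time, which is the property the summary describes as conversion without buffering future audio.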

SDP-Codec

SDP-Codec: A Speaker-Decoupled Speech Codec with Pitch Injection for Low-Bitrate Coding and Zero-Shot Voice Conversion

SDP-Codec is a low-bitrate speech codec that separates speaker identity from content and prosody to enable high-quality speech reconstruction and zero-shot voice conversion. It uses a single-stage training method with pitch injection to reduce speaker leakage and improve voice conversion accuracy.