Jun 3, 2026

Talking Avatars Get Real

Today’s digest spans full-duplex speech-motion avatars, portrait animation, outfit-personalized 3D humans, speech-LLM reasoning fixes, and raw-waveform zero-shot TTS. A strong day for more natural voices, more expressive bodies, and better spoken reasoning.

Figure: DyaPlex overview. The partner's speech and motion are the input; the agent listens, backchannels, and responds with synchronized speech and motion. Application scenarios include human-agent and human-robot interaction and synthetic dyadic interaction generation. Source: DyaPlex.

Talking Avatars & Embodied Interaction

DyaPlex

DyaPlex: Full-Duplex Speech-Motion Model for Dyadic Interaction

DyaPlex is a streaming full-duplex model that generates synchronized speech and full-body motion for dyadic interactions. It perceives and responds to both partners' speech and motion in real time, enabling natural continuous communication with improved multi-modal coherence for conversational AI agents.
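
To make "full-duplex" concrete, here is a toy sketch of the control flow only: on every frame the agent both reads the partner's frame and emits its own speech and motion, with no turn-taking gate. Everything in it (the `Frame` type, `agent_step`, the backchannel rule) is a hypothetical stand-in for illustration, not DyaPlex's actual model or interface.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    speech: str  # stand-in for one audio chunk ("" means silence)
    motion: str  # stand-in for one pose chunk


def agent_step(partner: Frame, partner_talk_run: int) -> Frame:
    """Toy policy (an assumption): backchannel while listening, reply on silence."""
    if partner.speech:
        # Partner is speaking: nod, and backchannel on every third talking frame.
        speech = "mm-hm" if partner_talk_run % 3 == 0 else ""
        return Frame(speech=speech, motion="nod")
    # Partner is silent: take the floor with speech and a matching gesture.
    return Frame(speech="reply", motion="gesture")


def run_full_duplex(partner_stream):
    """Emit one agent frame per partner frame, so listening and output overlap."""
    outputs = []
    talk_run = 0  # number of consecutive frames in which the partner has spoken
    for partner in partner_stream:
        talk_run = talk_run + 1 if partner.speech else 0
        outputs.append(agent_step(partner, talk_run))
    return outputs
```

The point of the sketch is the loop shape: speech and motion leave the agent together in a single frame, and a frame is produced at every step whether or not the partner is talking.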

Mamba-Enhanced Implicit Motion

Mamba-Enhanced Implicit Motion Learning for Audio-Driven Portrait Animation

A two-stage implicit motion learning framework for audio-driven portrait animation. It predicts detailed motion features without explicit landmarks using Mamba-enhanced diffusion, delivering high-quality, coherent talking-head and gesture animations from a single image and audio.

Digital Humans & 3D Avatars

AvatarMix

AvatarMix: Identity-Preserving Cross-Avatar Composition for Outfit Personalization

AvatarMix is a compositional method for 3D avatar outfit personalization that preserves both identity and garment quality by directly combining head and body from two Gaussian avatars. It uses a two-tier diffusion refinement and mesh retargeting to ensure seamless joins and adapt garments to diverse body shapes.

Speech LLMs & Spoken Reasoning

Entity-Aware CoT for Speech LLMs

Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention

This paper diagnoses entity binding failures in speech large language models during complex reasoning. Its Entity-Aware Chain-of-Thought (EA-CoT) method explicitly enumerates and binds entities during inference, significantly improving performance on speech inputs and narrowing the gap with text models.
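
As a rough illustration of "enumerate, then bind, then reason," the sketch below builds a prompt that asks a model to list entities and attach facts to them before answering. The wording of the instructions and the `build_ea_cot_prompt` helper are assumptions made for this digest; the paper's actual prompt and pipeline may differ.

```python
# Hypothetical instruction text: not taken from the paper.
EA_COT_STEPS = [
    "Step 1. List every entity mentioned in the question, one per line.",
    "Step 2. For each entity, write the facts or values bound to it.",
    "Step 3. Reason step by step, referring only to the entities listed above.",
    "Step 4. State the final answer.",
]


def build_ea_cot_prompt(question: str) -> str:
    """Wrap a (transcribed or spoken) question in entity-first reasoning steps."""
    lines = ["Question: " + question.strip(), ""]
    lines.extend(EA_COT_STEPS)
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_ea_cot_prompt("Ana is older than Ben, and Ben is older than Cy. Who is youngest?"))
```

Forcing the entity list to appear before any reasoning gives later steps an explicit table to refer back to, which is the intuition behind binding entities up front.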

TTS & Voice Synthesis

WavTTS

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

WavTTS models raw waveforms directly for zero-shot TTS, avoiding lossy compressed representations. It uses diffusion transformers with multi-scale mel-spectrogram guidance and tailored noise scheduling to deliver high-quality speech synthesis without relying on vocoders or codecs.
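
One way to picture multi-scale spectral guidance on a raw waveform is a distance that compares two signals at several analysis window sizes. The sketch below is a simplified stand-in under stated assumptions: it uses linear-frequency STFT magnitudes from a naive DFT rather than mel filterbanks, and the window sizes and L1 distance are illustrative choices, not WavTTS's actual guidance term.

```python
import cmath
import math


def stft_magnitudes(signal, win, hop):
    """Magnitude spectra of Hann-windowed frames (naive DFT, for tiny inputs only)."""
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = [
            signal[start + n] * (0.5 - 0.5 * math.cos(2 * math.pi * n / win))
            for n in range(win)
        ]
        spectrum = []
        for k in range(win // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / win) for n in range(win))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


def multi_scale_spectral_distance(a, b, windows=(16, 32, 64)):
    """Mean L1 distance between magnitude spectra, averaged over window sizes."""
    total = 0.0
    for win in windows:
        hop = win // 4
        spec_a = stft_magnitudes(a, win, hop)
        spec_b = stft_magnitudes(b, win, hop)
        diffs = [abs(x - y) for row_a, row_b in zip(spec_a, spec_b) for x, y in zip(row_a, row_b)]
        total += sum(diffs) / len(diffs)
    return total / len(windows)
```

Short windows are sensitive to timing and long windows to pitch, so averaging across scales penalizes both kinds of mismatch. Both inputs must be at least as long as the largest window.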