Jun 5, 2026

Avatars Get More Social

Today’s digest spans social digital humans, efficient 3D avatar creation, noise-robust spoken dialogue, and more controllable voice synthesis. From theory-of-mind avatars to stronger zero-shot TTS, the common thread is making conversational systems feel more natural and expressive.

Diagram of the closed-loop framework overview for dual-agent interaction. From Resonant Minds.

Talking Avatars & Social Digital Humans

Resonant Minds

Resonant Minds: Closed-Loop Social Avatars with Theory of Mind

Resonant Minds is a closed-loop dual-agent avatar system combining perception, Theory of Mind reasoning, and emotion-driven video synthesis for strategic, lifelike conversations. It models social dialogue with information asymmetry, generating speaker and reactive listener behaviors in rich multimodal interaction.

Digital Humans & 3D Avatars

SAGE

Self-Learning Expression Deformations for Data-Efficient Gaussian Avatars

SAGE creates animatable 3D Gaussian avatars from minimal data by self-learning expression deformations using geometric and appearance consistency, eliminating the need for long expression data. It supports multiview, monocular, and one-shot inputs without pretraining, enabling efficient and accessible avatar creation.

RigPAPR

RigPAPR: Rig-Based Animation of Static Neural Point Clouds from a Fixed-Viewpoint Video

RigPAPR animates static neural point clouds to follow a fixed-viewpoint monocular video, recovering a rigged 3D asset. It avoids joint-boundary artifacts by using a shape-free proximity-attention renderer that naturally reforms surfaces under articulation, unlike prior Gaussian splat or mesh-based approaches.

SpeechLLMs & Voice Agents

IRAF

IRAF: Interference-Resilient Adaptive Fusion for Noise-Robust End-to-End Full-Duplex Spoken Dialogue Systems

IRAF introduces an adaptive gating module to dynamically rescale user audio embeddings for noise-robust, real-time full-duplex spoken dialogue systems. This mitigates interference from overlapping speakers and noise, improving turn-taking and response quality without added latency.

TTS & Voice Synthesis

VoxCPM2

VoxCPM2 Technical Report

VoxCPM2 is a 2B-parameter multilingual text-to-speech foundation model unifying voice design, zero-shot cloning, and high-fidelity super-resolution. It uses continuous-latent modeling for controllable and scalable speech generation without external discrete tokenizers.

GLASS

GLASS: GRPO-Trained LoRA for Acoustic Style Steering in Zero-Shot Text-to-Speech

GLASS is a zero-shot TTS framework that controls speaking rate and pitch via LoRA adapters trained from post-generation rewards, not style labels. This modular approach enables smooth style edits and multi-axis composition without changing speaker prompts, preserving speaker identity and naturalness.