Jun 26, 2026

Real-Time Voices, Agents, and Avatars

Today’s digest spans end-to-end interactive models, better speech and TTS synthesis, and instant 3D avatar generation. The common thread: more natural, controllable, low-latency conversational AI across voice and visual embodiment.

Qualitative Results. We show the animated results of our generated 3D Gaussian avatars for test IDs and novel expressions. Our ~generates authentic, ID-preserving avatars for diverse attributes, , races, genders, ages, hairstyles, and expressions, only from a single image. Also, the input image's visual details, such as tattoos or accessories, are faithfully reflected in the 3D Gaussian avatars. Note that ~can generate unseen observations from the input image, such as the mouth interior and eye pupil, aided by our diffusion model. Please refer to the supplementary video for the dynamic avatar animation results. From FiCA.

Digital Humans & 3D Avatars

FiCA

FiCA: Feed-forward instant Gaussian Codec Avatars from a Single Portrait Image

FiCA generates photorealistic, animatable 3D Gaussian Codec Avatars from a single portrait image using a feed-forward pipeline. It uniquely combines vision foundation models, diffusion-based completion, and UV refinement for real-time, identity-preserving avatars without offline tracking or optimization.

SpeechLLMs & Voice Agents

Wan-Streamer

Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

Wan-Streamer is an end-to-end interactive foundation model unifying text, audio, and video in a single Transformer for real-time full-duplex audio-visual interaction. It jointly learns perception and generation without separate modules, enabling sub-second latency streaming with synchronized multimodal responses.

Translation-Enhanced Speech Encoder

Does Translation-Enhanced Speech Encoder Pre-training Affect Speech LLMs?

This paper studies how adding translation objectives in speech encoder pre-training improves cross-lingual, language-agnostic representations for Speech LLMs. The bidirectional translation task aligns speech embeddings better with the LLM's shared semantic space, boosting downstream speech recognition and translation.

TTS & Voice Synthesis

ZONOS2

ZONOS2 Technical Report

ZONOS2 is an advanced text-to-speech model that excels in naturalness, prosody, and zero-shot voice cloning across multiple languages. It uniquely combines a large-scale mixture-of-experts architecture with a massive multilingual training corpus and simplified conditioning for high-quality, low-latency streaming TTS.

OscillaTTS

Adaptive Oscillatory Inductive Bias for Modeling Sharp Prosodic Dynamics in Diffusion-Based TTS

OscillaTTS enhances diffusion-based text-to-speech by introducing an adaptive oscillatory activation that improves modeling of sharp prosodic transitions and rapid pitch variations, enabling more expressive and stable speech synthesis compared to fixed periodic functions.

CrossAccent-TTS

CrossAccent-TTS: Cross-Lingual Accent-Intensity Controllable Text-to-Speech via Disentangled Speaker and Accent Representations

CrossAccent-TTS enables precise control of accent and accent intensity in cross-lingual TTS by disentangling speaker and accent features and conditioning synthesis on learned language embeddings. It preserves speaker identity while allowing smooth accent modulation, performing well on Indic and foreign English accents.