Jun 27, 2026

Voice, Avatars, and Spoken Reasoning

Today’s digest spans real-time talking avatars, relightable digital humans, universal speech synthesis, and a new look at how speech-language models reason internally. Together they point to more natural, controllable, and interactive conversational AI systems.

"We propose InteractiveAvatar, a real-time streaming audio-driven avatar generation framework that enables intent-aware interaction. InteractiveAvatar interprets user intent to generate contextually relevant actions throughout the dialogue while maintaining long-range visual consistency." (From InteractiveAvatar.)

Talking Avatars & Interactive Video

InteractiveAvatar

InteractiveAvatar: Real-Time Streaming Video Generation for Consistent and Intent-Aware Avatars

A real-time streaming avatar system that ensures long-term visual consistency and enables context-aware speech and actions by understanding user intent. It combines memory mechanisms for visual coherence with an intent reasoning module for interactive, natural avatar behavior beyond standard audio-driven models.

Digital Humans & 3D Avatars

Generative Relightable Avatars

Generative Relightable Avatars enable photorealistic free-view rendering and relighting of full-body humans by combining physics-based relighting with generative video refinement. The combination captures fine details and pose-dependent effects, producing temporally coherent avatars under diverse lighting.

TTS & Voice Synthesis

Bagpiper-TTS

Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis

Bagpiper-TTS is a universal speech synthesis system that uses natural-language descriptions to guide speech generation in rich detail. These descriptions replace rigid metadata inputs, and a single unified model supports diverse tasks such as multi-talker dialogue, intent-to-speech, role-play, and singing voice synthesis.

SpeechLLMs & Spoken Reasoning

Interleaved SLMs Latent Text

Interleaved Speech Language Models Latently Work In Text

This paper shows that interleaved speech-text language models internally transcribe speech into text tokens in their intermediate layers, even without explicit transcription training. This latent transcription lets the speech and text modalities interact inside the model, which the paper ties to improved model function.
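
The summary does not say how the paper detects this latent transcription. One common way to probe claims of this kind is a logit-lens style readout: take the hidden state at each layer and decode it through the model's output embedding to see which token it most resembles. The sketch below illustrates that general idea only. The function name, the three-token vocabulary, and all numeric values are hypothetical toy data, not taken from the paper, and the final layer normalization that a real readout would usually apply is omitted.

```python
import numpy as np

def logit_lens(hidden_states, unembed, vocab):
    """Decode each layer's hidden state to its highest-scoring vocabulary token.

    hidden_states: (num_layers, d_model), one vector per layer at a single
                   speech-token position.
    unembed:       (vocab_size, d_model), the output embedding matrix.
    vocab:         list of vocab_size token strings.
    """
    logits = hidden_states @ unembed.T      # (num_layers, vocab_size)
    top_ids = logits.argmax(axis=-1)        # best-scoring token per layer
    return [vocab[int(i)] for i in top_ids]

# Toy, hand-made values for illustration only (NOT results from the paper).
vocab = ["<speech_17>", "hello", "world"]
unembed = np.eye(3)                         # identity keeps the arithmetic obvious
hidden_states = np.array([
    [1.0, 0.1, 0.0],   # layer 0: still looks like the speech token
    [0.4, 0.9, 0.1],   # layer 1: closest to the text token "hello"
    [0.2, 1.0, 0.3],   # layer 2: closest to the text token "hello"
    [0.9, 0.2, 0.1],   # layer 3: back to a speech token at the output
])

decoded = logit_lens(hidden_states, unembed, vocab)
for layer, token in enumerate(decoded):
    print(f"layer {layer}: {token}")
```

In this toy setup the middle layers decode to a text token while the first and last layers decode to a speech token, which is the kind of pattern "latent transcription" describes.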