Akapulu Labs Research

Talking avatars and voice models push toward more natural, controllable generation

Today’s digest spans fine-tuning-free talking-face synthesis, semantically grounded gesture generation, unified digital human models, and faster streaming TTS. Across speech and avatar generation, the theme is better alignment between meaning, motion, and voice with less latency and more control.

Archon unified multimodal framework overview image. From Archon.

Talking Avatars, Digital Humans & Motion

IP-Adapter Fine-Tuning-Free Talking Face

IP-Adapter Is All You Need: Towards Fine-Tuning-Free Diffusion-Based Talking Face Generation

This paper presents a fine-tuning-free diffusion framework that uses pretrained Stable Diffusion and IP-Adapter for lip-synced talking-face generation. It tackles identity drift, inaccurate lip sync, and temporal flicker with parameter-free modules, so talking-face synthesis can scale without costly model training.

Semantic Motion Anchors

Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures

This paper introduces semantic motion anchors, a new intermediate representation that links 3D co-speech gestures to their communicative intent by verbalizing motion and grounding it in the spoken text. The representation improves retrieval of semantically meaningful gestures, and users preferred gestures that convey intent.

NAVA

Native Audio-Visual Alignment for Generation

NAVA is a joint audio-video generation model that separates audio-video synchronization from semantic context. By dedicating a space to native audio-visual alignment before context conditioning, it enables precise alignment and controllable multi-speaker timbre, improving on dual-tower and unified tri-modal methods.

Archon

Archon: A Unified Multimodal Model for Holistic Digital Human Generation

Archon is a unified multimodal model that generates holistic digital humans by jointly modeling seven modalities. It uses efficient semantic video tokenization and a "Thinking in Modality" strategy to improve control and fidelity in talking-head video synthesis.

TTS & Unified Speech Modeling