Jun 25, 2026

Controlling speech, audio, and avatars

Today’s digest spans new ways to shape how machines speak, listen, and move: from universal audio generation and controlled TTS to stronger ASR alignment and adaptive speech encoders. It also includes a fresh step toward smoother digital-human motion from sparse sensor input.

AudioCALM architecture overview illustrating the continuous autoregressive model with flow-matching head and asymmetric experts for universal audio generation. From AudioCALM.

Digital Humans & Avatar Motion

MotionMAR

MotionMAR: Multi-scale Auto-Regressive Human Motion Reconstruction from Sparse Observations

MotionMAR reconstructs full-body human motion from sparse VR sensor data using a multi-scale autoregressive framework, capturing global trajectories to fine details. Its coarse-to-fine tokenization and scale-aware control provide accurate, jitter-free human motion respecting temporal hierarchy.

SpeechLLMs & Spoken Audio Generation

CAAD

CAAD: Contrastive Audio-Aware Distillation for Efficient Speech Language Models

CAAD distills contrastive audio-aware decoding into a student model to improve speech language understanding with efficiency and stronger acoustic grounding. It uses synchronized teacher forcing and metadata-based pseudo-ground truths to distill contrastive reasoning without inference-time overhead.

AudioCALM

AudioCALM: Continuous Autoregressive Language Modeling for Universal Audio Generation

AudioCALM is a universal audio generation model that autoregressively predicts continuous audio latents to unify speech, sound, and music synthesis. It balances modality differences with asymmetric experts and descriptive conditioning for high-quality, variable-length, end-to-end audio generation.

TTS & Voice Synthesis

FlowTTS-GRPO

FlowTTS-GRPO: Online Reinforcement Learning with Multi-Objective Reward Optimization for Flow-Matching Based Text-to-Speech

FlowTTS-GRPO uses online reinforcement learning to fine-tune flow-matching TTS models with multi-objective rewards for speaker similarity, quality, and intelligibility. It enables exploration via stochastic sampling without auxiliary models, improving voice cloning and cross-lingual transfer.

LombardTTS

Synthesizing the Lombard Effect: Multi-Level Control of Speech Clarity and Vocal Effort in TTS

This paper presents a TTS system that independently controls vocal effort and articulation to simulate the Lombard effect, enhancing speech clarity and intelligibility in noisy conditions. It enables continuous multi-level and word-level control for nuanced, context-specific speech emphasis.

ASR Architectures & Adaptation

InterAligner

Progressive Alignment Objectives for Aligner-Encoder based ASR

InterAligner enhances Aligner-Encoder ASR by adding progressive alignment objectives at intermediate layers, guiding the encoder to form monotonic alignment gradually. This approach stabilizes training and improves recognition, especially on long utterances, outperforming methods using only final-layer alignment.

Audio-Image Alignment for Low-Resource ASR

Audio--Image Alignment as a Continued-Pretraining Stage Improves Low-Resource ASR

This paper proposes an intermediate pretraining step using audio-image pairs to adapt pretrained audio encoders without transcripts. This stage improves ASR performance in low-resource languages by enhancing representation robustness and transferability before supervised fine-tuning.