Jun 2, 2026

Speech AI Across Empathy, Scale, and Timing

Today’s digest covers an emotionally adaptive voice assistant, a speech-understanding benchmark spanning 100+ languages and dialects, and a temporally aligned evaluation method for talking-head generation. Together, these papers push speech systems toward richer conversation, broader coverage, and fairer assessment.

System overview diagram of Sympatheia, illustrating speech input, continuous valence-arousal affect conditioning, multimodal sensing modules, and speech-to-speech dialogue output. From Sympatheia.

Talking Avatars & Lip Sync

Temporally-Aligned Evaluation

Temporally-Aligned Evaluation for Audio-Driven Talking Head Generation

This paper introduces a sequence-alignment approach based on Soft-DTW for evaluating audio-driven talking-head generation, designed to tolerate timing shifts in speech-driven motion. Its unified metric framework benchmarks 20 methods across diverse datasets, revealing clearer trade-offs among performance aspects.
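To make the idea concrete, here is a minimal sketch of a Soft-DTW distance between two motion sequences. It is an illustration of the general technique only, not the paper's metric: the function names `softmin` and `soft_dtw`, the squared-Euclidean frame cost, the `gamma` smoothing value, and the toy one-dimensional "mouth opening" signals are all assumptions made for this example.

```python
import numpy as np


def softmin(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    v = -np.asarray(values, dtype=float) / gamma
    m = v.max()
    return -gamma * (m + np.log(np.exp(v - m).sum()))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW discrepancy between sequences x (n, d) and y (m, d).

    1-D inputs are treated as sequences of scalar features.
    Smaller gamma approaches classic (hard) DTW.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    n, m = len(x), len(y)

    # Pairwise squared-Euclidean cost between every frame of x and y.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)

    # r[i, j] holds the soft alignment cost of x[:i] against y[:j].
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r[i, j] = cost[i - 1, j - 1] + softmin(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma
            )
    return r[n, m]


# Toy example: the same motion curve, delayed by one frame.
reference = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])
generated = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 0.0])

framewise = float(((reference - generated) ** 2).sum())
aligned = soft_dtw(reference, generated, gamma=0.001)
print(f"frame-wise error: {framewise:.3f}, soft-DTW: {aligned:.3f}")
```

In this toy case a frame-by-frame comparison penalizes the one-frame delay heavily, while the alignment-based score stays near zero because the two curves have the same shape. That is the kind of timing tolerance a temporally aligned metric is meant to provide.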

SpeechLLMs & Voice Agents

Sympatheia

Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning

Sympatheia is a speech-to-speech dialogue system that adapts voice assistant responses using continuous valence-arousal affect signals from user speech or multimodal sensors. It combines implicit emotional inference with explicit affect control for nuanced empathetic conversation.

PolySpeech-100

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

PolySpeech-100 is a benchmark that measures speech understanding across 100+ languages and dialects using both human and synthetic data. It highlights the strengths of end-to-end models on dialects and reveals gaps in low-resource language comprehension that go beyond transcription.