Speech AI Across Empathy, Scale, and Timing
Today’s digest covers an emotionally adaptive voice assistant, a speech benchmark spanning 100+ languages, and a talking-head generation evaluation method with better temporal alignment. Together, these papers push speech systems toward richer conversation, broader coverage, and fairer assessment.
System overview diagram of Sympatheia, illustrating speech input, continuous valence-arousal affect conditioning, multimodal sensing modules, and speech-to-speech dialogue output. From Sympatheia.
Talking Avatars & Lip Sync
SpeechLLMs & Voice Agents
Sympatheia
Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning
Sympatheia is a speech-to-speech dialogue system that adapts voice assistant responses using continuous valence-arousal affect signals derived from user speech or multimodal sensors. It combines implicit emotional inference with explicit affect control to support nuanced, empathetic conversation.
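To make the idea of combining implicit inference with explicit control concrete, here is a minimal Python sketch. It is not Sympatheia's implementation: the `Affect` class, the `blend_affect` function, the `control_weight` parameter, the [-1, 1] range for each axis, and the linear interpolation are all assumptions chosen for illustration, since the summary does not say how the two signals are merged.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Affect:
    """Hypothetical point in valence-arousal space; each axis assumed to lie in [-1, 1]."""
    valence: float  # unpleasant (-1) to pleasant (+1)
    arousal: float  # calm (-1) to excited (+1)


def _clamp(x: float, lo: float = -1.0, hi: float = 1.0) -> float:
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def blend_affect(
    inferred: Affect,
    explicit: Optional[Affect] = None,
    control_weight: float = 0.5,
) -> Affect:
    """Merge an implicitly inferred affect estimate with an optional explicit target.

    Assumption: a simple linear interpolation. control_weight = 0 keeps the
    inferred estimate, control_weight = 1 uses only the explicit target.
    """
    if explicit is None:
        # No explicit control supplied: fall back to the inferred estimate.
        return Affect(_clamp(inferred.valence), _clamp(inferred.arousal))
    w = _clamp(control_weight, 0.0, 1.0)
    return Affect(
        valence=_clamp((1.0 - w) * inferred.valence + w * explicit.valence),
        arousal=_clamp((1.0 - w) * inferred.arousal + w * explicit.arousal),
    )


if __name__ == "__main__":
    # An agitated, unhappy estimate nudged toward a calm, mildly positive target.
    estimate = Affect(valence=-1.0, arousal=1.0)
    target = Affect(valence=1.0, arousal=-1.0)
    print(blend_affect(estimate, target, control_weight=0.5))
```

In a sketch like this, the resulting `Affect` value would be passed to the response generator as a conditioning signal. A continuous representation lets the response style shift gradually rather than jumping between discrete emotion labels.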
PolySpeech-100
PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects
PolySpeech-100 is a benchmark that measures speech understanding across 100+ languages and dialects using human and synthetic data. It highlights the strengths of end-to-end models on dialects and reveals gaps in the comprehension of low-resource languages that go beyond transcription.