Prosody Powers Spoken AI
Today’s digest spotlights two directions in speech-native AI: an empathetic multi-agent dialogue system that uses prosody to improve emotional alignment, and a study of how speech-token design affects the way frozen text LLMs reason over spoken input.
PRISM framework architecture illustrating multi-agent coordination and prosody-to-language translation for empathetic dialogue. From PRISM.
SpeechLLMs & Spoken Dialogue
PRISM
PRISM: Prosody-Integrated Multi-Agent Reasoning Framework for Empathetic Spoken Dialogue
PRISM is a multi-agent empathetic spoken dialogue framework that integrates prosody into its reasoning to produce emotionally aligned responses. It decouples speech perception, dialogue management, response generation, and speech synthesis, aiming for empathetic conversations with interpretable prosody and knowledge control.
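To make the decoupled design concrete, here is a minimal sketch of a four-stage pipeline in which prosody is first rendered as plain language and then passed along to later stages. It is not PRISM's implementation: the function names, the `Percept` type, the pitch and pace thresholds, and the canned replies are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Percept:
    transcript: str    # what was said
    prosody_note: str  # how it was said, rendered as plain language


def perceive(transcript: str, pitch_hz: float, words_per_sec: float) -> Percept:
    """Hypothetical perception stage: describe acoustic cues in words."""
    # Thresholds are arbitrary illustration values, not taken from the paper.
    pitch = "low pitch" if pitch_hz < 150 else "raised pitch"
    pace = "slow pace" if words_per_sec < 2.0 else "fast pace"
    return Percept(transcript, f"{pitch}, {pace}")


def manage(percept: Percept) -> str:
    """Hypothetical dialogue-management stage: pick a response strategy."""
    note = percept.prosody_note
    if "low pitch" in note and "slow pace" in note:
        return "comfort"
    return "engage"


def generate(percept: Percept, strategy: str) -> str:
    """Hypothetical generation stage: canned replies stand in for an LLM."""
    if strategy == "comfort":
        return "That sounds hard. Do you want to talk about it?"
    return "Tell me more!"


def synthesize(reply: str, strategy: str) -> dict:
    """Stand-in for speech synthesis: return the text plus a speaking style."""
    style = "soft, slow" if strategy == "comfort" else "bright, lively"
    return {"text": reply, "style": style}


def respond(transcript: str, pitch_hz: float, words_per_sec: float):
    """Run the four stages in order and expose every intermediate result."""
    percept = perceive(transcript, pitch_hz, words_per_sec)
    strategy = manage(percept)
    reply = generate(percept, strategy)
    return percept, strategy, synthesize(reply, strategy)
```

Because each stage returns a readable intermediate value, a call such as `respond("I'm fine.", 120.0, 1.5)` lets you inspect the prosody description and the chosen strategy alongside the final reply, which is the kind of interpretability a decoupled pipeline makes possible.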
Speech-Text Alignment for Reasoning
Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation
This paper studies how speech token design affects reasoning in frozen text LLMs by controlling frame rate and alignment. It introduces factorized quantization and a non-autoregressive audio head to enable efficient low-rate speech tokenization that aligns well with text embeddings for improved spoken dialogue.
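A quick way to see why frame rate matters is to count tokens. The sketch below assumes one speech token per frame (a single token stream) and uses illustrative frame rates and an illustrative utterance; none of these numbers come from the paper, and the function names are hypothetical.

```python
def speech_token_count(duration_s: float, frame_rate_hz: float) -> int:
    """Tokens for an utterance, assuming exactly one token per frame."""
    return round(duration_s * frame_rate_hz)


def tokens_per_word(duration_s: float, frame_rate_hz: float, n_words: int) -> float:
    """Average number of speech tokens spent on each spoken word."""
    return speech_token_count(duration_s, frame_rate_hz) / n_words


if __name__ == "__main__":
    # Illustrative utterance: 4 seconds of speech containing 12 words.
    for rate in (50.0, 12.5, 5.0):
        count = speech_token_count(4.0, rate)
        ratio = tokens_per_word(4.0, rate, 12)
        print(f"{rate:>5} Hz: {count:>3} tokens, {ratio:.2f} tokens per word")
```

Under these assumptions, lowering the frame rate shortens the sequence the LLM must read and brings the number of tokens per word closer to what a text tokenizer would produce, which is one intuition for why low-rate tokenization might sit more comfortably with a text-native model.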