Jun 16, 2026

Prosody and Stress Take Center Stage

Today’s digest focuses on speech fidelity at the syllable level, from dynamic prosody prediction and duration-based watermarking in LLM TTS to stress-preserving speech-to-speech translation. Together, these papers push synthesized and translated speech closer to natural, speaker-faithful delivery.

The overall architecture of our proposed stress-aware S2ST system. From StressPreserve S2ST.

TTS & Voice Synthesis

Dynamic Prosody Prediction

Dynamic Prosody Prediction in LLM-based TTS for Improving Speaker Similarity

This work proposes dynamic prosody prediction in LLM-based TTS, where prosody for each syllable is predicted conditioned on previously generated speech, improving speaker similarity and style consistency. It differs from prior static or implicit methods by tightly linking prosody with the evolving synthesized speech.

DuraMark

DuraMark: Duration-Embedded Watermarking in LLM-based TTS

DuraMark introduces a novel watermarking method for LLM-based text-to-speech by embedding information in the syllable duration rather than the audio signal. This duration-level watermarking offers superior robustness against neural codec and vocoder attacks compared to traditional signal-level techniques.

Speech-to-Speech Translation

StressPreserve S2ST

Evaluating and Preserving Lexical Stress in English-to-Chinese Speech-to-Speech Translation

This paper improves English-to-Chinese speech translation by preserving lexical stress, which conveys speaker intent. It introduces a Mandarin stress detector and a new metric to evaluate stress transfer, alongside a stress-aware system that enhances emphasis without compromising naturalness or translation quality.