Multi-Faceted Interactivity Alignment

Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

This work improves full-duplex spoken dialogue models via reinforcement learning, optimizing conversational timing and behaviors such as pauses, turn-taking, backchannels, and interruptions using real human audio segments and specialized rewards while preserving response quality.

dialogue
speech-to-speech
streaming
realtime

Authors: Atsumoto Ohashi, Neil Zeghidour, Alexandre Défossez, Eugene Kharitonov

Categories: cs.CL, eess.AS

Published 2026-06-09 · Updated 2026-06-09

Abstract

Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. However, current models are trained solely with supervised learning through token-level likelihood maximization, which does not directly optimize interaction-level behaviors, causing interactivity issues such as excessive silence and ill-timed turn-taking. Recent work has applied reinforcement learning (RL) to improve interactivity, but existing methods address only a limited set of interactive behaviors in their rewards. In this work, we propose a post-training alignment method that comprehensively improves the interactivity of full-duplex spoken dialogue models through RL. We address the four canonical axes of interactivity: pause handling, turn-taking, backchanneling, and user interruption. For each axis, we extract short audio segments from human conversation corpora and optimize the model with axis-specific reward functions. An extra LLM-based reward for response quality prevents semantic degradation. We apply our method to two open-source models, Moshi and PersonaPlex, demonstrating consistent improvements in interactivity on both offline evaluation with pre-recorded audio and real-time multi-turn dialogue evaluation.

Introduction

This paper studies post-training alignment for full-duplex spoken dialogue models, where the system listens and speaks in parallel rather than waiting for a strict turn boundary. The core motivation is that supervised training of such models typically maximizes token-level likelihood and therefore does not directly optimize interaction-level behaviors such as when to remain silent, when to take the floor, when to produce backchannels, and when to yield during interruption. The authors argue that these behaviors are central to natural spoken conversation and that current full-duplex models still exhibit practical failures such as excessive silence, mistimed turn transitions, and weak backchanneling.

The proposed method is an RL-based alignment procedure that targets the four canonical axes used by Full-Duplex-Bench: pause handling, turn-taking, backchanneling, and user interruption. Instead of rewarding the model on synthetic proxies, the paper extracts short segments from real two-speaker human conversations and uses axis-specific reward functions to train the model. A separate LLM-based reward is added to preserve semantic quality so that optimizing timing does not collapse response content.

The approach is applied to two open-source full-duplex systems, Moshi and PersonaPlex, and evaluated both on static pre-recorded audio benchmarks and on real-time multi-turn dialogues. The central claim is that the method improves interactivity consistently across both settings, and that these gains transfer from short extracted segments to longer streaming conversations.

Overview of the proposed method. We first extract segments related to each interactivity axis $\ell \in \{\mathrm{pause}, \mathrm{turn}, \mathrm{bc}, \mathrm{int}\}$ from human conversation datasets to construct the training data $D_\ell$. For each segment (and optionally its dialogue context) sampled from $D_\ell$, the full-duplex spoken dialogue model generates multiple outputs, which are scored by axis-specific reward functions $R_\ell$ and used to optimize the model via GRPO.

Model setting and optimization objective

The paper assumes the standard discrete-token formulation used by recent end-to-end full-duplex models. A speech tokenizer maps the two-channel dialogue waveforms to token sequences for speaker $X$ and speaker $Y$, written as $[x_{1:N}; y_{1:N}]$. The model then autoregressively predicts speaker $Y$'s text and audio streams conditioned on the user-side stream:

$$ p_\theta(y_{1:N}, w_{1:N} \mid x_{1:N}) = \prod_{n=1}^{N} p_\theta(y_n, w_n \mid x_{\le n}, y_{

The inclusion of a parallel text stream $w_{1:N}$ is important for two reasons in the authors' framing. First, it provides semantic guidance for the generated speech. Second, it is also treated as a control signal for timing, which is why the RL objective is computed over text-token probabilities rather than audio-token probabilities.

During RL, each training step samples one interactivity axis $\ell \in \{\mathrm{pause}, \mathrm{turn}, \mathrm{bc}, \mathrm{int}\}$, draws a segment from the axis-specific dataset $\mathcal{D}_\ell$, and generates $G$ completions from the current policy. Each completion is decoded back to audio and scored by the reward associated with the selected axis. The completion-level rewards are normalized across the $G$ samples to obtain advantages, and the policy is optimized with a clipped surrogate objective plus a KL penalty against the frozen pre-RL reference policy.

$$ \hat{A}^{(g)} = \frac{r^{(g)} - \operatorname{mean}({r^{(g)}}_{g=1}^{G})}{\operatorname{std}({r^{(g)}}_{g=1}^{G})} $$

$$ \mathcal{L}(\theta) = -\frac{1}{G} \sum_{g=1}^{G} \frac{1}{N} \sum_{n=1}^{N} \Bigl[ \min\bigl(\rho_n^{(g)}\hat{A}^{(g)}, \operatorname{clip}(\rho_n^{(g)},1-\epsilon,1+\epsilon)\hat{A}^{(g)}\bigr) - \beta\, \operatorname{KL}[p_\theta \| p_{\mathrm{ref}}]_n \Bigr]. $$

Here, $\rho_n^{(g)}$ is the importance ratio between the current policy and the sampling-time policy, $\epsilon$ is the PPO-style clipping parameter, and $\beta$ is the KL coefficient. A key implementation detail is that $\rho_n^{(g)}$ is computed only from the text-token stream, not from the audio tokens, because the authors view text generation as the main driver of both content and interaction timing.

To improve generalization beyond the short extracted segment, the model is also given the immediately preceding dialogue context as prepended audio. The context length is randomly sampled during training, and the loss is applied only to the target segment, with the context masked out.

Training data curation

The training data are built from two-party conversational corpora with separate audio channels for both speakers. One channel is treated as the user-side input $X$, and the other as the target behavior $Y$ that the model is trained to imitate. The authors first run VAD to segment each recording into inter-pausal units and silence intervals, then group consecutive IPUs from the same speaker into utterances by placing an utterance boundary at any silence longer than $1.0$ s. Silences of at most $1.0$ s inside an utterance are treated as pauses.

From the utterance sequence, the method extracts short segments that exemplify one of the four interaction axes:

Pause handling: a single utterance by $X$ of duration at least $\tau_{\min}$, containing at least one internal pause, with no speech from $Y$. This captures hesitation without yielding the floor.
Turn-taking: a consecutive pair $(U_k, U_{k+1})$ with $X \to Y$, both long enough, and a gap of at most $0.4$ s between them. This captures smooth turn yield and response onset.
Backchanneling: an utterance by $X$ of duration at least $\tau_{\min}$ during which $Y$ produces only short utterances of at most $1$ s. This isolates acknowledgments and feedback cues.
User interruption: a four-utterance pattern $X \to Y \to X \to Y$ where $X$ interrupts $Y$, and all utterances are sufficiently long. This captures yielding and responding after a barge-in.

The paper emphasizes that these segments are drawn from real human conversation corpora rather than TTS-generated or artificially noised dialogues, which is intended to better reflect natural overlap, timing, and channel behavior.

Reward design

Each axis has a dedicated reward function. The rewards are intentionally simple and interpretable, but they are specialized enough to encourage distinct timing behaviors.

Pause handling reward: the model should stay silent throughout the segment, including during intra-utterance pauses. The reward is $-1$ if the generated audio contains any speech interval longer than $1$ s and $0$ otherwise.
Turn-taking reward: the model should speak promptly after $X$ finishes. Let $d$ be the delay from the end of $X$'s turn to the onset of the model's first utterance longer than $1$ s. The reward is $-d$. If the model never speaks, $d$ is set to the remaining segment duration.
Backchanneling reward: short generated utterances are treated as backchannels, longer ones as takeovers. A generated backchannel is counted as a true positive if it falls within $\pm 1$ s of a ground-truth backchannel. The final reward is the F1 score, with unmatched backchannels and takeovers counted as false positives.
User interruption reward: analogous to turn-taking, but the response delay is measured from the end of the interrupting utterance $U_{k+2}$.

To preserve response quality, the authors add an LLM-judge reward for the turn-taking and interruption axes. Both the input and generated speech are transcribed with ASR, then judged for contextual relevance and naturalness. When the LLM reward is combined with a delay-based reward, the paper uses reward-decoupled normalization: the two components are standardized independently across the $G$ samples before the advantages are summed with equal weight.

The appendix specifies the LLM prompt as a three-point relevance scale and notes that the judge is used only as a quality-preserving constraint rather than a direct content optimizer.

Training setup

The RL training data are extracted from two multi-party telephone-style conversational resources with separate channels for each speaker: Fisher and Seamless Interaction. Fisher contributes $2{,}000$ hours of telephone conversations between random pairs of people. Seamless Interaction combines an Improvised subset ($1{,}300$ hours) and a Naturalistic subset ($2{,}700$ hours), and the paper trains on the combined Seamless set. The stated goal is to test whether the method transfers across corpora with different recording conditions, speaker populations, and dialogue styles.

Segment extraction uses Silero VAD. Up to $2{,}000$ segments per axis are extracted, with minimum duration thresholds of $\tau_{\min}=4.0$ s for pause handling, $\tau_{\min}=5.0$ s for turn-taking and backchanneling, and $\tau_{\min}=3.0$ s for user interruption.

The method is applied to two base models. Moshi is a $7$B-parameter speech-text model; following prior work, the authors prepend $3$ s of silence so the system can emit an initiation phrase before the user starts speaking. PersonaPlex extends Moshi with text-prompt control and voice cloning; the paper uses the standard prompt from the official code and a $3$ s female voice prompt, with the same prompts during training and inference.

Optimization is performed for $100$ epochs on $32$ NVIDIA H100 GPUs with FSDP. At each epoch, $32$ segments are sampled, each producing $G=16$ completions. The optimizer is AdamW with learning rate $2 \times 10^{-7}$, betas $(0.9, 0.95)$, weight decay $0.1$, gradient clipping at norm $2$, PPO-style clipping parameter $\epsilon=0.2$, and KL coefficient $\beta=0.01$. Sampling uses text temperature $0.7$, audio temperature $0.8$, and top-$k=250$ for audio tokens.

A context curriculum is also used. With probability $0.5$, a context window sampled from $[0, l_{\max}]$ seconds is prepended; otherwise, no context is added. The maximum context length $l_{\max}$ is linearly increased from $0$ to $30$ s over training.

Evaluation benchmarks

The paper evaluates both static interaction behavior and live multi-turn dialogue. The main static benchmark is Full-Duplex-Bench v1, which feeds pre-recorded audio to the model and reports scenario-specific interaction metrics. The dynamic benchmark is Full-Duplex-Bench v2, which uses a real-time automated examiner and evaluates the model in streaming multi-turn conversation.

Full-Duplex-Bench v1 measures the four axes directly aligned with the RL objectives. It reports Takeover Rate (TOR), response latency for turn-taking and interruption, backchannel frequency, backchannel timing divergence via JSD, and GPT-4o semantic quality for post-interruption responses. The authors note that they corrected a bug in the official backchannel evaluation script: the generated audio had not been resampled to the VAD's expected $16$ kHz sampling rate.

Full-Duplex-Bench v2 evaluates real-time dialogue with GPT-Realtime as the examiner and Gemini 2.5 Flash as the judge. The benchmark covers four task families: Daily, Correction, Entity Tracking, and Safety. For each task, it reports turn-taking fluency and instruction-following, and for Correction, Entity Tracking, and Safety it additionally reports task-specific competence. The dialogue duration is capped at $60$ s and the fast pacing mode is used, in which the examiner continues speaking even while the model is speaking.

Results on static interaction behavior

The main result on Full-Duplex-Bench v1 is that RL improves both responsiveness and timing balance within each model family. Compared with the base models, both Moshi and PersonaPlex show lower pause-handling TOR, better turn-taking latency and TOR, improved backchannel timing and frequency, and better interruption handling. The LLM-based reward is important because it prevents the semantic quality of the response from degrading when the policy is pushed toward more aggressive timing.

The paper also compares against three external references: dGSLM, Freeze-Omni, and ASPIRin. The qualitative comparison is that dGSLM achieves very high turn-taking TOR but at the expense of poor pause handling, effectively treating many silences as turn yields; Freeze-Omni is weaker on several interaction metrics; and ASPIRin improves some timing behavior but does not comprehensively address all four axes. In contrast, the proposed method is reported to improve all axes jointly.

Model	Pause TOR (Synthetic)	Pause TOR (Candor)	BC TOR	BC Freq	BC JSD	Turn TOR	Turn Latency	Interrupt TOR	Interrupt GPT-4o	Interrupt Latency
Moshi	0.445	0.528	0.255	0.074	0.824	0.739	0.162	0.920	3.440	1.377
Moshi + RL (Fisher)	0.226	0.417	0.091	0.095	0.789	0.966	0.121	1.000	3.575	0.461
Moshi + RL (Seamless)	0.307	0.463	0.145	0.101	0.794	0.958	0.160	1.000	3.630	0.409
PersonaPlex	0.482	0.444	0.182	0.046	0.841	0.958	0.219	0.940	4.500	0.271
PersonaPlex + RL (Fisher)	0.328	0.361	0.127	0.122	0.783	0.950	0.079	1.000	4.520	0.187
PersonaPlex + RL (Seamless)	0.350	0.356	0.073	0.112	0.786	0.975	0.086	0.995	4.533	0.223

On Moshi, the Fisher-trained RL model reduces pause TOR from $0.445$ to $0.226$ on synthetic pause handling and from $0.528$ to $0.417$ on the Candor subset, while also lowering turn-taking latency from $0.162$ to $0.121$ s and interruption latency from $1.377$ to $0.461$ s. Seamless also improves Moshi, with slightly weaker pause results than Fisher but better interruption semantics and similar turn-taking responsiveness. On PersonaPlex, the improvements are similarly strong, and the Seamless-trained model achieves the best overall balance: better pause handling, lower backchannel JSD, better turn-taking latency, and the strongest interruption quality score among the reported variants.

The authors highlight an important interaction between the axes: dGSLM has very strong turn-taking TOR but poor pause handling, while the proposed joint training learns to distinguish a short intra-utterance pause from a true yield of the floor. This is one of the paper's central empirical points: high responsiveness and correct pause discrimination need not be mutually exclusive if all four axes are optimized together.

Results on real-time multi-turn dialogue

The dynamic evaluation on Full-Duplex-Bench v2 shows that the gains transfer beyond static pre-recorded segments. Across all four task families, both Moshi and PersonaPlex improve after RL, especially in turn-taking fluency. In many cases, instruction-following and task-specific competence also increase, suggesting that the LLM reward helps preserve content quality during interaction optimization rather than merely preventing collapse on short segments.

The Seamless-trained models are consistently stronger than the Fisher-trained ones, which the authors attribute to Seamless having more varied and more consistently structured dialogue. Fisher can still improve interaction behavior, but it appears more likely to push the model toward a cooperative casual style that may interfere with safety-related behavior.

Model	Daily Turn	Daily Instruct	Correction Turn	Correction Instruct	Correction Task	Entity Turn	Entity Instruct	Entity Task	Safety Turn	Safety Instruct	Safety Task
Moshi	3.284	2.221	3.248	2.189	2.340	3.951	2.537	2.440	3.839	2.831	2.720
Moshi + RL (Fisher)	3.397	2.502	3.957	2.706	2.820	4.110	2.626	2.640	3.858	3.058	2.820
Moshi + RL (Seamless)	3.442	2.615	4.003	2.895	3.300	3.965	2.609	2.740	4.161	3.503	3.440
PersonaPlex	3.327	2.861	3.803	2.945	3.080	3.748	3.130	3.200	3.841	3.596	3.260
PersonaPlex + RL (Fisher)	3.627	2.915	3.840	3.026	3.500	4.055	3.562	3.700	3.695	3.288	3.000
PersonaPlex + RL (Seamless)	4.017	3.197	4.501	3.369	3.620	4.647	4.059	3.840	4.511	3.780	3.280

The strongest dynamic result is PersonaPlex + RL trained on Seamless, which achieves the best scores across nearly all reported metrics. Moshi also benefits clearly from RL, although the gains are more modest and more sensitive to the training corpus. Importantly, the paper reports that Safety can degrade when PersonaPlex is aligned on Fisher, indicating that a conversational style that is highly cooperative in ordinary dialogue can conflict with refusal and redirection behavior in safety-critical settings.

Ablation analysis

The ablation study is carried out on Moshi trained on Fisher and isolates the contribution of each reward component, the context curriculum, and the context itself. The reported results use a subset of the static metrics plus the Daily task from Full-Duplex-Bench v2.

Setting	Pause TOR	BC JSD	Turn Latency	Interrupt GPT-4o	Daily Turn	Daily Instruct
+ RL (Fisher)	0.42	0.79	0.12	3.58	3.40	2.50
w/o $\mathcal{D}_{\mathrm{pause}}$	0.74	0.77	0.05	3.66	3.14	2.32
w/o $\mathcal{D}_{\mathrm{turn}}$	0.29	0.79	0.30	3.28	3.41	2.46
w/o $\mathcal{D}_{\mathrm{bc}}$	0.47	0.83	0.22	3.67	3.61	2.39
w/o $\mathcal{D}_{\mathrm{int}}$	0.39	0.78	0.14	3.42	3.28	2.24
w/o $R_{\mathrm{llm}}$	0.48	0.78	0.17	3.05	3.00	2.18
w/o sched	0.51	0.78	0.15	3.70	3.50	2.41
w/o context	0.49	0.78	0.09	3.53	3.33	2.21

The ablations support several distinct conclusions. Removing the pause data makes the model overly eager to speak, which hurts pause handling. Removing the turn-taking data has the opposite effect: the model becomes too conservative and latency worsens sharply. Removing the backchannel data degrades backchannel timing, as reflected by the worst JSD. Removing interruption data weakens interruption handling. The LLM reward turns out to be especially important, because removing it causes the largest broad degradation, confirming that timing gains alone are not enough if semantic quality is not protected.

The context curriculum also matters. Training without the scheduled growth of the context window, or without context at all, hurts performance on multi-turn evaluation. This indicates that even though RL optimization happens on short clips, the model still benefits from exposure to preceding conversational context so that it can generalize to longer streaming interactions.

Speech quality preservation

The paper also checks whether the RL objective harms the perceptual quality of the generated speech using UTMOSv2. The reported means remain close to the base models for both Moshi and PersonaPlex, with only small fluctuations across the Fisher and Seamless training variants. For example, overall UTMOSv2 remains around the mid-$2.5$ range for all models, suggesting that the alignment process improves interactivity without obvious audio-quality collapse.

The authors attribute this stability to three factors: the reward is computed on VAD outputs, so obviously degraded speech is still penalized; the LLM judge reward indirectly encourages intelligible and contextually relevant content; and the KL penalty keeps the policy close to the pretrained model.

Case studies

The paper includes qualitative examples that illustrate the reported quantitative gains. In the Moshi case study, the baseline model can exhibit very long response delay and long speech overlap during a real-time Daily task. After RL, the model shows smoother turn transitions and better-timed backchannels. The figure in the main paper highlights transition latencies of $0.76$ s and $0.08$ s for two turn boundaries, and the caption reports turn-taking fluency and instruction-following scores of $4.80$ and $2.60$.

Example of a conversation between GPT-Realtime (Examiner) and Moshi in the Daily task of Full-Duplex-Bench v2 (Turn-taking fluency score: 2.75, Instruction-following score: 2.50).

Example of a conversation between GPT-Realtime (Examiner) and Moshi + RL in the Daily task of Full-Duplex-Bench v2. Arrows highlight turn-taking transitions with their latencies, and boxes mark backchannelings. Turn-taking fluency and instruction-following scores are 4.80 and 2.60, respectively.

The appendix also uses PersonaPlex to show a safety-related failure mode. The base model already maintains reasonably structured dialogue, but the Fisher-trained RL version shifts toward short cooperative utterances and backchannels such as "Yeah". The authors interpret this as a conversational-style bias learned from Fisher, which can conflict with the ability to refuse or redirect harmful requests in the Safety task. This is the paper's clearest example of the trade-off between interactive fluency and safety alignment.

PersonaPlex (Turn-taking fluency score: 5.00, Instruction-following score: 2.33, Task-specific competence: 1.00)

Limitations

The paper is explicit about several limitations. First, the reward design is manually engineered and therefore does not scale cleanly to more interaction axes or richer conversational behavior. The authors suggest that future work should learn reward models directly from speech rather than handcrafting them.

Second, the approach depends on a model that generates a parallel text stream. That makes the method less directly applicable to full-duplex architectures without text-side decoding, even if they generate audio well.

Third, the evaluation is fully automated: the paper relies on benchmark scripts, GPT-Realtime interactions, and LLM-based judges rather than human raters. This is practical and reasonably correlated with human judgments, but it may still miss conversational qualities that people care about.

Fourth, the authors warn that making a model more fluent and responsive can hurt safety. The Fisher-trained PersonaPlex experiment provides evidence for this concern, showing that interactivity optimization may increase cooperative behavior in contexts where the correct response is to refuse or redirect. The paper therefore frames safety-aware reward design or constraints as an important next step.

Conclusion

The paper's main contribution is a general RL-based alignment method for full-duplex speech models that treats interactivity as a multi-axis optimization problem rather than a single timing metric. By combining short real-conversation segments, axis-specific rewards, context-aware training, and an LLM-based semantic reward, the method improves pause handling, turn-taking, backchanneling, and interruption behavior in both Moshi and PersonaPlex. The gains appear in both static and dynamic benchmarks, and the speech quality stays broadly stable. At the same time, the work identifies important open problems around reward scaling, generalization to architectures without a text stream, automated evaluation, and safety preservation.

Code & Implementation

This repository contains the codebase for PersonaPlex, a real-time full-duplex conversational speech model that supports voice and role control as presented in the paper.

The core implementation primarily resides in the moshi directory, which includes:

moshi/moshi/server.py: The main server-side real-time interaction entrypoint, handling streaming speech-to-speech generation using Mimi encoders/decoders and a Moshi language model.
moshi/moshi/offline.py: An offline inference script that mirrors server behavior for batch audio input and output, useful for evaluation.
Model loading and utilities centralized in modules under moshi/moshi/models and moshi/moshi/modules.

The code leverages pretrained Moshi and Mimi models for speech generation and implements streaming audio tokenization, autoregressive language modeling, and audio decoding.

The server and offline scripts both perform warmup routines to initialize CUDA graphs and manage streaming states, consistent with the full-duplex interaction mechanism discussed in the paper.

The repository's README provides detailed usage instructions for live server interaction, offline evaluation, voice prompt management, and model prompting strategies to emulate the multi-faceted interactive behavior optimized by reinforcement learning as described in the work.

Multi-Faceted Interactivity Alignment

Links

Paper & demos

Code & resources

Impact

Abstract

Introduction

Model setting and optimization objective

Training data curation

Reward design

Training setup

Evaluation benchmarks

Results on static interaction behavior

Results on real-time multi-turn dialogue

Ablation analysis

Speech quality preservation

Case studies

Limitations

Conclusion

Code & Implementation

Introduction

Model setting and optimization objective

Training data curation

Reward design

Training setup

Evaluation benchmarks

Results on static interaction behavior

Results on real-time multi-turn dialogue

Ablation analysis

Speech quality preservation

Case studies

Limitations

Conclusion

Code & Implementation

Related papers