IRAF

IRAF: Interference-Resilient Adaptive Fusion for Noise-Robust End-to-End Full-Duplex Spoken Dialogue Systems

IRAF introduces an adaptive gating module to dynamically rescale user audio embeddings for noise-robust, real-time full-duplex spoken dialogue systems. This mitigates interference from overlapping speakers and noise, improving turn-taking and response quality without added latency.

Tags: llm, dialogue, agent, streaming, low-latency

Authors: Tao Zhong, Jiajun Deng, Nikita Kuzmin, Yinke Zhu, Tianxiang Cao, Tristan Tsoi, Zhili Tan, Simon Lui, Xunying Liu

Categories: cs.SD, cs.AI, eess.AS

Published 2026-06-04 · Updated 2026-06-04

Abstract

Full-duplex spoken dialogue models allow voice agents to listen and speak concurrently, enabling natural interaction with real-time overlap. However, end-to-end dual-channel models that jointly encode user and agent streams may degrade in realistic acoustic environments: interfering speakers leaking into the user microphone can be encoded as part of the user query, corrupting the LLM's conditioning and causing unstable turn-taking and reduced response quality. We propose Interference-Resilient Adaptive Fusion (IRAF), a lightweight, streaming-compatible module that modulates the contribution of user audio to the LLM frame by frame. IRAF predicts a scalar reliability gate from target-speaker and user audio embeddings and rescales user representations before fusion with agent embeddings. Experiments on MS-MARCO and InstructS2S-200K show consistent gains in response quality and full-duplex interaction under interfering-speaker conditions.

1. Problem Setting and Core Idea

This paper addresses a specific failure mode in end-to-end full-duplex spoken dialogue systems: when the user microphone picks up interfering speech, an E2E dual-channel model may encode non-target speech as if it were part of the user query. In the authors' framing, this corrupts the conditioning context seen by the language model, which can destabilize turn-taking, produce false barge-in behavior, and reduce response quality. The problem is especially acute for full-duplex systems because they must operate causally and with low latency, while also handling overlap between user and agent speech in real time.

The proposed solution is Interference-Resilient Adaptive Fusion (IRAF), a lightweight, streaming-compatible module that predicts a frame-level reliability gate from target-speaker and user-audio embeddings. The gate rescales the user representation before it is fused with the agent-side text representation, so interference-dominated frames contribute less to the language model input while target-user frames are preserved.

Full-duplex dialogue in (a) clean and (b) noisy conditions. Interference leaking into the user channel can corrupt conditioning, causing unstable turn-taking and false barge-in.

2. Model Architecture

The overall system is a multi-stream E2E duplex model with a user speech stream and an agent text stream. A streaming speech encoder operating at 12.5 Hz, followed by a modality adapter, converts the user waveform into frame-level embeddings $X \in \mathbb{R}^{T \times D}$, where $T$ is the number of frames and $D$ is the embedding dimension. The agent-side input is represented as text embeddings $Y^{\mathrm{txt}} \in \mathbb{R}^{T \times D}$.

The paper follows a duplex design in which the fused representation is consumed by an LLM backbone, while a separate autoregressive speech transformer decoder predicts the agent's speech tokens $Y^a$ conditioned on the LLM hidden states $h$. The speech decoder is described as a 12-layer causal Transformer following the T5 architecture. The text tokenizer is SentencePiece with a 32k vocabulary. The authors also state that agent speech tokens are extracted with NanoCodec at 12.5 Hz using finite scalar quantization.

In the baseline duplex fusion, the user audio and agent text embeddings are summed directly, so every user frame carries the same weight. IRAF replaces this static treatment with a learned, time-varying reliability estimate. At each frame $t$, the model concatenates a target-speaker embedding $s \in \mathbb{R}^n$ with the current and past user audio embeddings $X_{\le t}$ and passes the result through a small Transformer-based fusion module $f(\cdot \mid \psi)$. The module is described as having three parts: an input projection layer that maps speaker and audio features into a shared space, a causal Transformer layer that aggregates streaming context, and a linear output head that produces the reliability gate.

The gate is defined as

$$g_t = 2 \times \operatorname{Sigmoid}(f(s, X_{\le t} \mid \psi)) \in [0,2].$$

This gate rescales the user audio embedding before fusion with the agent text embedding, yielding the fused input $g_t X_t + Y_t^{\mathrm{txt}}$ for the LLM. The range $[0,2]$ allows the module to both suppress and amplify the user stream relative to the direct-sum baseline: $g_t = 1$ recovers the baseline, values below 1 attenuate the user frame, and values above 1 emphasize it.
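The gating and fusion step can be illustrated with a minimal pure-Python sketch. It assumes the fusion module $f$ has already produced a scalar logit for the frame; the function names `reliability_gate` and `fuse_frame` and the toy three-dimensional vectors are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def reliability_gate(logit: float) -> float:
    """g_t = 2 * Sigmoid(logit); the logit stands in for f(s, X_<=t | psi)."""
    return 2.0 * sigmoid(logit)


def fuse_frame(gate: float, user_frame, agent_text_frame):
    """Fused LLM input for one frame: g_t * X_t + Y_t^txt."""
    return [gate * x + y for x, y in zip(user_frame, agent_text_frame)]


# Toy frame with D = 3 (illustrative values only).
user = [0.5, -1.0, 2.0]
agent = [0.1, 0.2, 0.3]

baseline = fuse_frame(reliability_gate(0.0), user, agent)     # gate = 1: direct sum
suppressed = fuse_frame(reliability_gate(-6.0), user, agent)  # gate near 0: mostly agent text
```

A logit of zero gives a gate of exactly 1, which reproduces the direct-sum baseline, while a strongly negative logit leaves the fused input close to the agent text embedding alone.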

Overview of the proposed E2E full-duplex model with the Interference-Resilient Adaptive Fusion (IRAF) module. A streaming speech encoder produces frame-level user embeddings, which IRAF adaptively gates before fusion with agent text embeddings and processing by the LLM to generate text tokens; a speech decoder, conditioned on the LLM hidden states, generates audio tokens.

3. Training Objective and Supervision

The paper formulates training as a multi-channel next-token prediction objective. Let $Y^{\mathrm{txt}}$ denote the agent text tokens and $Y^a$ the agent speech tokens. The losses are weighted cross-entropies for the text decoder and the speech decoder:

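$$ \mathcal{L}(Y^{\mathrm{txt}}, Y^a \mid X, \theta, \phi) = - \sum_{t=1}^{T} \Big[ \lambda_1 \log p_{\theta}(Y_t^{\mathrm{txt}} \mid Y_{<t}^{\mathrm{txt}}, X_{\le t}) + \lambda_2 \log p_{\phi}(Y_t^{a} \mid Y_{<t}^{a}, h_{\le t}) \Big] $$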

The weights are set to $\lambda_1 = 1.0$ for text and $\lambda_2 = 5.0$ for speech. IRAF is trained jointly with the full duplex model, and it receives an auxiliary binary supervision signal derived from clean training utterances: frame labels are $1$ when the target speaker is active and $0$ otherwise. This auxiliary task is weighted by $0.1$.
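The weighting described above can be sketched for a single frame as follows. The summary does not specify how the auxiliary signal is applied to the gate, so this sketch assumes a binary cross-entropy on the pre-scaling sigmoid output ($g_t / 2$), and it collapses the speech codebooks into a single probability for brevity. Both choices, and all function names, are assumptions made for illustration.

```python
import math

LAMBDA_TEXT = 1.0    # lambda_1 from the text
LAMBDA_SPEECH = 5.0  # lambda_2 from the text
AUX_WEIGHT = 0.1     # weight of the auxiliary target-speaker activity task


def cross_entropy(prob_of_target: float) -> float:
    """Negative log-likelihood of the correct token."""
    return -math.log(prob_of_target)


def binary_cross_entropy(p_active: float, label: int) -> float:
    """BCE between predicted activity probability and a 0/1 frame label."""
    eps = 1e-12
    p = min(max(p_active, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def frame_loss(p_text: float, p_speech: float, p_active: float, label: int) -> float:
    """Weighted text + speech cross-entropy plus the auxiliary gate term.

    Assumption: p_active is the sigmoid output before the factor of 2.
    """
    main = LAMBDA_TEXT * cross_entropy(p_text) + LAMBDA_SPEECH * cross_entropy(p_speech)
    return main + AUX_WEIGHT * binary_cross_entropy(p_active, label)


# Example: moderately confident predictions on a frame where the target speaker is active.
example = frame_loss(p_text=0.5, p_speech=0.5, p_active=0.5, label=1)
```

With these weights, an error on a speech token costs five times as much as the same error on a text token, and the auxiliary term acts as a small regularizer on the gate.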

The paper emphasizes that IRAF is lightweight and streaming-compatible, so it preserves the end-to-end formulation without adding extra response latency.

4. Dataset Construction and Noise/Interference Simulation

Two datasets are used to evaluate the method. The first is single-turn MS MARCO, a large-scale QA benchmark with anonymized Bing queries and human-written answers. For spoken QA, the authors synthesize speech for the question-answer pairs using CosyVoice2. The second is multi-turn InstructS2S-200K, a speech-to-speech dialogue dataset with approximately 200,000 conversation sessions covering common-sense and general world-knowledge interactions.

To convert the data into duplex training examples, each conversation is turned into two synchronized streams: user and agent. When one party speaks, the other channel is filled with silence, yielding non-overlapping duplex signals. A fixed inter-turn pause of 0.64 s is inserted between the user's utterance and the agent's response.

For user interruption modeling, the authors shorten the inter-turn gap so the next user utterance overlaps with the agent's ongoing speech. Once interruption starts, the remaining agent speech is truncated and replaced with silence after a fixed latency of 0.64 s. During training, barge-in events are injected stochastically with probability $0.5$ per turn.
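The stream construction and barge-in injection described above can be sketched as a timeline of segments. This is a minimal sketch under stated assumptions: the gap between the end of an agent response and the next user utterance is not given in the summary and is assumed here to be the same 0.64 s pause, and `overlap_s` (how far before the agent's scheduled end the interrupting user starts) is a hypothetical parameter.

```python
import random
from dataclasses import dataclass

PAUSE_S = 0.64         # inter-turn pause from the text
STOP_LATENCY_S = 0.64  # agent falls silent this long after interruption onset
BARGE_IN_PROB = 0.5    # per-turn barge-in probability during training


@dataclass
class Segment:
    channel: str  # "user" or "agent"
    start: float  # seconds
    end: float    # seconds


def build_timeline(turns, rng, overlap_s=1.0):
    """turns: list of (user_duration_s, agent_duration_s) pairs.

    Everything outside the returned segments is silence on that channel.
    """
    segments = []
    t = 0.0
    for i, (user_dur, agent_dur) in enumerate(turns):
        user = Segment("user", t, t + user_dur)
        agent_start = user.end + PAUSE_S
        agent_end = agent_start + agent_dur
        has_next = i < len(turns) - 1
        if has_next and agent_dur > overlap_s and rng.random() < BARGE_IN_PROB:
            # Barge-in: the next user turn starts while the agent is still speaking,
            # and the agent's remaining speech is cut after the stop latency.
            next_user_start = agent_end - overlap_s
            agent_end = min(agent_end, next_user_start + STOP_LATENCY_S)
            t = next_user_start
        else:
            t = agent_end + PAUSE_S  # assumed gap before the next user turn
        segments.append(user)
        segments.append(Segment("agent", agent_start, agent_end))
    return segments


timeline = build_timeline([(2.0, 3.0), (1.5, 2.0)], random.Random(0))
for seg in timeline:
    print(f"{seg.channel:5s} {seg.start:5.2f} -> {seg.end:5.2f}")
```

Passing the random generator in explicitly keeps the injection reproducible, which is convenient when the same conversations need to be rendered with and without interruptions.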

To approximate real-world noisy full-duplex conditions, the paper also constructs interference-augmented data using MUSAN. MUSAN speech is used as interfering speakers, and MUSAN noise is used as background noise. The speech and noise corpora are partitioned into three non-overlapping splits for train, validation, and test. For MS MARCO, the interfering SNR is sampled uniformly from 0 to 10 dB; for InstructS2S-200K, the SNR range is 0 to 20 dB.
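Mixing at a sampled SNR follows the standard power-ratio formula, $\mathrm{SNR_{dB}} = 10 \log_{10}(P_{\text{target}} / P_{\text{interference}})$. The sketch below scales an interference signal to hit a target SNR and samples the SNR from the dataset-specific ranges quoted above. It assumes equal-length signals and a single interference source; how the paper combines interfering speech with background noise is not described in this summary, and the helper names are hypothetical.

```python
import math
import random

# SNR ranges in dB, as quoted in the text.
SNR_RANGES_DB = {"ms_marco": (0.0, 10.0), "instructs2s_200k": (0.0, 20.0)}


def power(signal):
    """Mean squared amplitude."""
    return sum(v * v for v in signal) / len(signal)


def sample_snr_db(dataset: str, rng: random.Random) -> float:
    low, high = SNR_RANGES_DB[dataset]
    return rng.uniform(low, high)


def mix_at_snr(target, interference, snr_db: float):
    """Scale the interference so that target/interference power equals snr_db, then add."""
    p_target = power(target)
    p_interf = power(interference)
    if p_target == 0.0 or p_interf == 0.0:
        raise ValueError("both signals must have non-zero power")
    scale = math.sqrt(p_target / (p_interf * 10.0 ** (snr_db / 10.0)))
    mixed = [t + scale * i for t, i in zip(target, interference)]
    return mixed, scale


# Toy signals (illustrative only).
target = [1.0, -1.0, 1.0, -1.0]
interference = [2.0, 2.0, -2.0, -2.0]
mixed, scale = mix_at_snr(target, interference, snr_db=6.0)
```

At 0 dB the interfering speaker is as loud as the target user, which is the hardest end of both sampled ranges.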

The paper evaluates two acoustic settings: (1) interfering speakers only, and (2) interfering speakers plus background noise.

5. Implementation Details

The implementation uses the NeMo Toolkit. The speech encoder is a 100M-parameter streaming pretrained encoder with 80 ms right context. The LLM backbone is initialized from TinyLlama with 1.1B parameters. The speech representation uses NanoCodec at 0.6 kbps by default, producing four code channels with vocabulary size 4,037 each. The speech decoder is a 12-layer causal Transformer, and the target-speaker embedding $s$ is extracted with a pretrained ECAPA-TDNN. For simplicity, speakers from different conversations are treated as distinct identities.
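The codec figures quoted above can be cross-checked with a short calculation. The sketch assumes each code is stored at $\log_2(\text{vocabulary size})$ bits and that the 4,037 entries per channel include any special tokens; neither detail is stated in this summary.

```python
import math

FRAME_RATE_HZ = 12.5  # codec and encoder frame rate from the text
NUM_CODEBOOKS = 4     # code channels
VOCAB_SIZE = 4037     # entries per channel

frame_period_ms = 1000.0 / FRAME_RATE_HZ
bits_per_code = math.log2(VOCAB_SIZE)
bitrate_bps = FRAME_RATE_HZ * NUM_CODEBOOKS * bits_per_code

print(f"frame period: {frame_period_ms:.0f} ms")
print(f"bitrate: {bitrate_bps / 1000:.2f} kbps")
```

Under these assumptions the result is about 0.60 kbps, in line with the stated default, and the 80 ms frame period equals the encoder's stated right context, so the lookahead amounts to one frame.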

The IRAF fusion module itself uses only a single Transformer layer. Training uses AdamW with cosine-annealing learning rate scheduling, peak learning rate $3 \times 10^{-4}$, 2,500 warm-up steps, and gradient clipping with maximum norm 1.0. The dataset split ratio is 0.945 / 0.005 / 0.05 for train / validation / test. MS MARCO uses per-GPU batch size 1 with gradient accumulation of 8. InstructS2S-200K uses duration-based bucketing with 60 s batch duration and gradient accumulation of 4 per GPU.
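The learning-rate schedule can be sketched as warm-up followed by cosine annealing. The summary gives only the peak rate and warm-up length, so the linear warm-up shape, the final rate `min_lr`, and the total step count `total_steps` are assumptions chosen for illustration.

```python
import math

PEAK_LR = 3e-4       # peak learning rate from the text
WARMUP_STEPS = 2500  # warm-up length from the text


def learning_rate(step: int, total_steps: int, min_lr: float = 0.0) -> float:
    """Linear warm-up (assumed) to PEAK_LR, then cosine annealing to min_lr (assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length, used only to show the shape of the curve.
TOTAL_STEPS = 100_000
for step in (0, 1250, 2500, 51_250, 100_000):
    print(step, f"{learning_rate(step, TOTAL_STEPS):.2e}")
```

The rate reaches its peak exactly at the end of warm-up and falls to half the peak midway through the annealing phase.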

6. Evaluation Protocol and Metrics

The paper evaluates two complementary aspects of duplex behavior: response quality and full-duplex interaction.

Response quality: The generated agent speech is transcribed with an ASR system, then compared with the reference text using BLEU and Sentence-BERT similarity.
Turn-taking performance: Response latency (RL) is measured as the time between estimated user end-of-speech and agent speech onset, using Silero VAD. Response success rate (RSR) is the fraction of responses that start within the latency threshold. Latencies over 1.5 s are counted as failures and clipped to 1.5 s for reporting.
User barge-in performance: For multi-turn dialogue, stop latency (SL) measures the time between interruption onset and agent cessation, also estimated with VAD. Stop success rate (SSR) measures whether the agent stops in time. Latencies above 1.5 s are counted as failures and clipped similarly. The test set is modified by shortening user gaps to encourage barge-ins.
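The latency and success-rate metrics share the same clipping rule, which can be sketched in a few lines. The sketch assumes that the reported latency is the mean over clipped values and that a response (or stop) that never occurs is treated as a failure; the function name and the use of `None` for missing events are hypothetical.

```python
THRESHOLD_S = 1.5  # latency threshold from the text


def summarize_latencies(latencies_s):
    """Return (mean clipped latency, success rate).

    latencies_s: measured latencies in seconds; None marks an event that never happened
    (assumed to count as a failure). Works for both RL/RSR and SL/SSR.
    """
    clipped = []
    successes = 0
    for latency in latencies_s:
        if latency is None or latency > THRESHOLD_S:
            clipped.append(THRESHOLD_S)  # failure, clipped for reporting
        else:
            clipped.append(latency)
            successes += 1
    return sum(clipped) / len(clipped), successes / len(clipped)


mean_latency, success_rate = summarize_latencies([0.8, 1.2, 2.4, None])
print(f"latency = {mean_latency:.2f} s, success rate = {success_rate:.0%}")
```

Under this rule a system that almost always fails reports a mean latency close to 1.5 s, so latency and success rate should be read together.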

7. Main Results on MS MARCO

Table 1 reports results on MS MARCO under MUSAN speech interference. The main takeaway is that clean-only training is fragile, noise augmentation helps substantially, and IRAF adds another consistent gain across both interference conditions.

Table 1: Performance on MS MARCO under MUSAN speech interference.
Method	Noisy Source	BLEU	sBERT	RL (s)	RSR
Interfering speakers only
CleanBase	ALL	0.66	0.11	1.46	6.2%
NoisyAug	LIBRI	12.69	0.503	0.98	91.0%
NoisyAug	US-GOV	13.30	0.512	0.96	94.1%
NoisyAug	ALL	12.74	0.506	0.97	93.1%
IRAF	LIBRI	13.81	0.516	0.97	95.4%
IRAF	US-GOV	14.38	0.536	0.94	98.2%
IRAF	ALL	14.20	0.523	0.96	95.7%
Both interfering speakers and background noise
CleanBase	ALL	0.00	0.03	1.49	2.8%
NoisyAug	LIBRI	11.12	0.445	0.98	87.1%
NoisyAug	US-GOV	11.53	0.465	0.96	91.3%
NoisyAug	ALL	11.33	0.454	0.98	88.2%
IRAF	LIBRI	11.64	0.472	0.94	91.2%
IRAF	US-GOV	12.34	0.486	0.93	92.8%
IRAF	ALL	12.01	0.476	0.94	92.5%

Under interfering speakers only, the clean baseline collapses to 0.66 BLEU, 0.11 sBERT, 1.46 s RL, and 6.2% RSR. Noise augmentation recovers most of the lost performance, reaching 12.74 BLEU, 0.506 sBERT, 0.97 s RL, and 93.1% RSR. IRAF improves further to 14.20 BLEU, 0.523 sBERT, 0.96 s RL, and 95.7% RSR.

The same pattern holds when both interfering speakers and background noise are present. The clean baseline is nearly unusable in this setting (0.00 BLEU, 2.8% RSR), while NoisyAug reaches 11.33 BLEU, 0.454 sBERT, 0.98 s RL, and 88.2% RSR. IRAF improves this to 12.01 BLEU, 0.476 sBERT, 0.94 s RL, and 92.5% RSR.

The authors summarize the effect of IRAF on MS MARCO as a consistent improvement over NoisyAug in both response quality and interaction reliability. With the ALL noisy source, IRAF gains 1.46 BLEU and 2.6 RSR percentage points under interfering speakers only, and 0.68 BLEU and 4.3 RSR percentage points when background noise is added; response latency changes by at most 0.04 s.

8. Main Results on InstructS2S-200K

Table 2 evaluates the method on the multi-turn InstructS2S-200K benchmark, which is a harder setting because it stresses both long-horizon response quality and interaction control, including explicit barge-ins.

Table 2: Performance on InstructS2S-200K under MUSAN speech interference.
Method	BLEU	sBERT	RL (s)	RSR	SL (s)	SSR
Interfering speakers only
CleanBase	1.13	0.22	1.39	13.9%	1.29	42.7%
NoisyAug	9.64	0.47	0.97	69.2%	0.74	99.0%
IRAF	13.76	0.58	0.82	91.0%	0.73	99.8%
Both interfering speakers and background noise
CleanBase	0.91	0.21	1.41	9.8%	1.34	40.2%
NoisyAug	8.32	0.44	1.05	56.0%	0.74	99.6%
IRAF	9.83	0.47	0.98	69.2%	0.73	100.0%

The reported deltas over NoisyAug show that IRAF is particularly beneficial in the multi-turn setting. Under interfering speakers only, it improves BLEU by +4.12, sBERT by +0.11, reduces RL by 0.15 s, raises RSR by 21.8 percentage points, and slightly lowers stop latency by 0.01 s while increasing SSR to 99.8%. Under both interference and background noise, it improves BLEU by +1.51, sBERT by +0.03, reduces RL by 0.07 s, and raises RSR by 13.2 percentage points, while maintaining near-ceiling barge-in performance.
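The deltas quoted above follow directly from the NoisyAug and IRAF rows of Table 2, as the short calculation below shows. The dictionary layout and key names are illustrative; the numbers are copied from the table.

```python
# NoisyAug and IRAF rows of Table 2 (RSR and SSR in percent, RL and SL in seconds).
TABLE_2 = {
    "speakers_only": {
        "NoisyAug": {"BLEU": 9.64, "sBERT": 0.47, "RL": 0.97, "RSR": 69.2, "SL": 0.74, "SSR": 99.0},
        "IRAF": {"BLEU": 13.76, "sBERT": 0.58, "RL": 0.82, "RSR": 91.0, "SL": 0.73, "SSR": 99.8},
    },
    "speakers_and_noise": {
        "NoisyAug": {"BLEU": 8.32, "sBERT": 0.44, "RL": 1.05, "RSR": 56.0, "SL": 0.74, "SSR": 99.6},
        "IRAF": {"BLEU": 9.83, "sBERT": 0.47, "RL": 0.98, "RSR": 69.2, "SL": 0.73, "SSR": 100.0},
    },
}


def deltas(condition: str):
    """IRAF minus NoisyAug for every metric; negative RL/SL means lower latency."""
    rows = TABLE_2[condition]
    return {m: round(rows["IRAF"][m] - rows["NoisyAug"][m], 2) for m in rows["IRAF"]}


for condition in TABLE_2:
    print(condition, deltas(condition))
```

For BLEU, sBERT, RSR, and SSR a positive delta is an improvement, whereas for RL and SL the improvement shows up as a negative delta.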

Importantly, the barge-in metrics are already high for NoisyAug in the test setting, so the main value of IRAF in InstructS2S-200K is not merely improved stopping, but substantially better response quality and turn-taking reliability in the presence of interference.

BLEU and response success rate (RSR) on InstructS2S-200K with interfering speakers across SNRs.

The paper also reports an SNR study on InstructS2S-200K, showing that IRAF improves both BLEU and response success rate across all tested SNRs. The authors use this figure to argue that the method is not tuned to a narrow acoustic regime, but remains effective as interference severity changes.

9. Interpretation of the Method

The key conceptual contribution is that the model does not treat every frame of user audio as equally reliable. Instead, it uses target-speaker information to estimate whether the current acoustic evidence is likely to come from the intended user. This matters in the duplex setting because the user channel can be contaminated by other speakers, and naive fusion would pass those corrupted frames straight into the LLM.

In effect, IRAF is a learned, causal, per-frame confidence estimator for the user stream. Because the gate is produced by a small single-layer Transformer module and applied before fusion, the approach is compatible with streaming and avoids the cost and latency of explicit source separation. The paper positions this as especially useful in realistic conversational environments where overlap is irregular, interference is non-stationary, and the system must decide quickly whether to continue listening, respond, or stop speaking.

10. Stated Contributions and Novelty

The paper claims to be the first to explicitly address noise- and interference-induced conditioning corruption in end-to-end full-duplex spoken dialogue systems.
It proposes IRAF, a lightweight and streaming-compatible adaptive fusion module that predicts a frame-level reliability gate from target-speaker and user-audio embeddings.
It evaluates the method on both single-turn and multi-turn spoken dialogue tasks, and under both interfering-speaker-only and interfering-speaker-plus-background-noise conditions.
It reports improvements in response quality, turn-taking, and barge-in metrics, suggesting the method improves both linguistic output and interaction control.

11. Scope, Ablations, and Limitations Reflected by the Paper

The paper's empirical comparison is centered on three configurations: CleanBase, NoisyAug, and IRAF. In that sense, the reported ablation is at the system level rather than a component-by-component dissection of the gating module. The paper does not report finer-grained ablations such as removing the speaker embedding, changing the gate range, or replacing the causal Transformer inside IRAF with a simpler predictor.

The evaluation is also bounded by the paper's simulation setup. All noisy conditions are synthetic mixtures based on MS MARCO and InstructS2S-200K with MUSAN speech/noise, and the multi-turn interruption setup uses a fixed 0.64 s truncation/latency rule. Speakers from different conversations are treated as distinct identities. These choices make the experiments controlled and reproducible, but they also mean the paper does not yet demonstrate performance on fully unconstrained live microphone recordings.

A further scope limitation is that the paper does not present a detailed runtime or memory benchmark, even though it claims streaming compatibility and low latency. The evidence for efficiency is therefore architectural rather than measured in a dedicated speed table.

12. Bottom Line

IRAF is a simple but well-targeted modification to duplex spoken dialogue models: it learns when the user stream should be trusted and when it should be suppressed before fusion with the agent-side context. Across both MS MARCO and InstructS2S-200K, and across both interference conditions tested, the method consistently improves response quality and turn-taking behavior over a strong noise-augmented baseline. The results suggest that frame-level reliability gating is an effective strategy for making end-to-end full-duplex spoken dialogue systems more robust in realistic multi-speaker acoustics.