Context-Aware Decoding (CAD)

From Awareness to Adherence: Bridging the Context Gap in Spoken Dialogue Systems via Context-Aware Decoding

Introduces an inference-only method that boosts multi-round spoken dialogue systems by amplifying relevant context during response generation. It bridges the gap between internal context awareness and active adherence, improving context fidelity without retraining or extra modules.

llm
dialogue
asr
multimodal
autoregressive

Authors: Che Hyun Lee, Heeseung Kim, Sungroh Yoon

Categories: cs.CL

Comment: Interspeech 2026 Main Track

Published 2026-06-15 · Updated 2026-06-15

Abstract

Despite the success of end-to-end (E2E) spoken dialogue systems, maintaining strict context adherence in multi-round conversations remains a challenge. While prior works attribute these failures to models forgetting dialogue history, we highlight an equally critical but overlooked bottleneck: a gap between latent context awareness and active adherence. Although models internally recognize relevant past utterances, strong parametric priors often overshadow these signals during decoding. To bridge this gap, we propose an audio-adapted Context-Aware Decoding (CAD) approach. By leveraging internal attention mechanisms to isolate key historical rounds, our approach contrasts output distributions with and without this key context during inference, directly amplifying multimodal contextual signals. Evaluations on the Audio MultiChallenge benchmark demonstrate significant improvements in Semantic Memory and Self Coherence subtasks, successfully enforcing strict, context-faithful adherence.

Introduction

This paper addresses a specific failure mode in end-to-end spoken dialogue systems: in multi-round conversations, models often appear to recognize relevant past information internally, yet still fail to express that information in the final response. The authors argue that this is not only a memory problem, but also a generation-time adherence problem. They frame the issue as a gap between latent context awareness and active adherence.

In the paper's terminology, latent awareness is the model's internal ability to attend to or recognize useful historical utterances, while active adherence is the successful manifestation of that context in the generated output. The key observation is that strong parametric priors can dominate the decoding process even when the model has internally identified the correct historical evidence. Rather than treating the system as if it had completely forgotten the dialogue, the paper argues that the bottleneck often lies in the decoding stage.

To address this, the paper introduces an audio-adapted Context-Aware Decoding (CAD) method. The method is inference-only, does not require extra training, and does not add an external retrieval module. Instead, it amplifies the impact of selected historical context by contrasting output distributions computed with and without that context. The approach is designed for spoken dialogue systems and is presented as model-agnostic across different end-to-end architectures.

The authors summarize their main contributions as follows: (1) they explicitly formalize context failures in multi-round spoken dialogue as a gap between latent awareness and active adherence; (2) they propose an audio-adapted CAD mechanism that strengthens contextual signals during decoding; and (3) they show consistent gains on the Audio MultiChallenge benchmark, especially on strict context-adherence tasks such as Semantic Memory and Self Coherence.

Overall framework of our audio-adapted Context-Aware Decoding (CAD) for spoken dialogue systems. (a) Formalization of multi-round spoken dialogue. (b) CAD mechanism amplifying contextual adherence by penalizing parametric priors during inference. (c) Latent awareness pipeline identifying the key context $C_{\mathrm{key}}$ via attention analysis.

Method

Problem formulation: multi-round spoken dialogue

The paper defines a conversation as a sequence of rounds, where each round consists of a user turn followed by an assistant turn. For a single-round setting, a user utterance $U = [u_1, u_2, \dots, u_i]$ is mapped to an assistant response $A = [a_1, a_2, \dots, a_k]$ autoregressively:

$$ P_{\phi}(A \mid U) = \prod_{j=1}^{k} P_{\phi}(a_j \mid U, a_{

In the multi-round case, the model receives the prior dialogue history $H_{n-1} = [U^1, A^1, \dots, U^{n-1}, A^{n-1}]$ and the current user utterance $U^n$. The full input context is $C_n = [H_{n-1}, U^n]$, and the assistant generates $A^n$ conditioned on that context:

$$ P_{\phi}(A^n \mid C_n) = \prod_{j=1}^{|A^n|} P_{\phi}(a^n_j \mid C_n, a^n_{

The paper focuses on a speech-to-text setting, where the user input is composed of speech tokens and the response is generated as text. The authors note, however, that the decoding strategy is modality-agnostic and can be extended to speech-to-speech generation as well.

Context-Aware Decoding for active adherence

Standard CAD contrasts the model's output distribution under the full context with an unconditional version computed after removing the history. This paper adapts that idea for spoken dialogue by defining a key context $C_{\mathrm{key}} \subset H_{n-1}$ rather than masking the entire history. The unconditional context is then defined as $C_{\mathrm{unc}} = C_n \setminus C_{\mathrm{key}}$.

The intuition is that the difference between the full-context and key-context-removed predictions reflects how much the selected historical evidence should influence the next token. The modified logit for token $a^n_j$ is:

$$ \tilde{l}_j = \log P_{\phi}(a^n_j \mid C_{\mathrm{unc}}, a^n_{

Here, $\alpha$ is a penalty weight that controls how strongly the key context is amplified relative to the model's prior. The paper explores $\alpha \in [1.0, 3.0]$ with a step size of $0.5$.

In words, the decoding rule increases the influence of tokens whose probabilities change meaningfully when the key context is present. This is the paper's main mechanism for converting internal awareness into explicit adherence at generation time.

Latent context awareness and key-context extraction

A central novelty of the paper is that it does not use the whole history as the key context. Instead, it tries to identify the rounds that the model already attends to most strongly, using the model's own attention patterns as a proxy for latent awareness. The pipeline is:

Compute attention-based token scores from the current query to history tokens.
Aggregate token scores into turn scores.
Combine user and assistant turns into round scores.
Select the top-scoring historical rounds as $C_{\mathrm{key}}$.

The paper studies four design axes.

Layer selection: token scores can be computed from the final layer only (last_1), the average of the last four layers (last_4), or all layers.
Token-to-turn aggregation: token scores within a turn can be aggregated with mean, max, or sum.
Turn-to-round aggregation: since a user speech turn is tokenized into many more codecs than the assistant's text tokens, the user turn is down-weighted by $\beta$ when forming the round score:
$$ S_{\mathrm{round}}(R^m) = S_{\mathrm{turn}}(A^m) + \beta \cdot S_{\mathrm{turn}}(U^m). $$
The paper studies $\beta \in \{0.25, 0.5, 1.0\}$.
Context scope: the system selects the top $K$ historical rounds by score, with $K \in \{1, 2\}$.

This design is explicitly meant to avoid the noise introduced by naive whole-history conditioning. The authors argue that selecting too much history can dilute the relevant signal and even hurt decoding.

Experiments

Benchmark and evaluation protocol

The paper evaluates on the Audio MultiChallenge benchmark, which is designed for multi-round, end-to-end spoken dialogue under realistic conditions. The benchmark includes unscripted speech, disfluencies, and ambient noise, and conversations span roughly 3 to 8 continuous rounds. This makes it a good stress test for long-horizon state tracking.

The paper focuses on two benchmark subsets that require strict adherence to context:

Semantic Memory (90 samples): tests whether the assistant respects user-provided constraints and facts from earlier rounds, such as avoiding previously mentioned allergens.
Self Coherence (83 samples): tests whether the assistant remains consistent with its own prior responses, such as recalling recommendations made earlier in the conversation.

Evaluation is done with a strict rubric-based binary judge on the assistant's final utterance. A response passes only if it satisfies all instance-specific rubrics. The paper reports Average Pass Ratio (APR), averaged over 5 independent generation runs. The authors use gpt-5-nano as the judge instead of the benchmark's default o4-mini, while otherwise following the official pipeline.

Baselines and setup

The experiments compare CAD against three state-of-the-art spoken dialogue systems: MiMo-Audio-7B-Instruct with its thinking mode enabled, Qwen3-Omni-30B-A3B-Instruct, and Kimi-Audio-7B-Instruct. All models are run with their default generation settings, such as the official temperature and top-p parameters, to isolate the effect of the decoding method itself.

For the main table, the paper uses the empirically chosen CAD configuration last_4, mean, and $\beta = 0.5$ across all baselines.

Main results

Model	Semantic Memory	Self Coherence	Average
MiMo-Audio-7B-Instruct	26.00	26.02	26.01
+ CAD	32.00	36.39	34.11
Qwen3-Omni-30B-A3B-Instruct	22.67	29.16	25.78
+ CAD	39.33	38.80	39.08
Kimi-Audio-7B-Instruct	13.56	19.04	16.19
+ CAD	23.11	22.65	22.89

The main finding is that CAD improves every evaluated model on both subtasks and on the average APR. The strongest gain is observed for Qwen3-Omni-30B-A3B-Instruct, where the average APR increases from 25.78 to 39.08, an absolute gain of 13.30 points. MiMo-Audio improves from 26.01 to 34.11, and Kimi-Audio improves from 16.19 to 22.89.

The paper interprets these gains as evidence that the models were not simply unable to remember the context; rather, the relevant context was present internally but underused at decoding time. CAD makes the models actively rely on the selected history instead of following stronger parametric priors.

Ablation Studies

The ablation studies are performed with MiMo-Audio as the base model. They probe the design choices behind key-context extraction and the effect of the CAD hyperparameters.

Ablation of CAD hyperparameters: layer selection, token-to-turn aggregation, and turn-to-round aggregation ($\beta$) under a penalty weight of $\alpha=3.0$ and Top-1 context scope. The purple dashed line denotes APR without CAD, and the red dotted line represents Whole History CAD, where the whole history is considered as $C_{\mathrm{key}}$.

Ablation of CAD hyperparameters: penalty weight ($\alpha$) and context scope ($K$) under the empirically optimal configuration last_4, mean, and $\beta=0.5$. — Ablation of CAD hyperparameters: penalty weight ($\alpha$) and context scope ($K$) under the empirically optimal configuration `last_4`, `mean`, and $\beta=0.5$.

Why whole-history CAD fails

One of the most important ablations is the comparison between selective key-context CAD and a naive variant that uses the entire history as $C_{\mathrm{key}}$. The paper reports that this Whole History CAD setting drops to 21.04 compared with the No CAD baseline of 26.01. In contrast, using carefully isolated key context raises performance to 33.10.

The authors interpret this as evidence that indiscriminately amplifying all prior rounds introduces substantial noise. In other words, more history is not always better; context selection must be precise for decoding-time contrastive amplification to help.

Layer selection and aggregation behavior

The ablation on layer selection shows that relying on only the final layer does not capture enough contextual nuance, while averaging all layers introduces too much low-level noise from shallow representations. The best trade-off comes from averaging the last four layers, which the paper selects as the default.

For token-to-turn aggregation, max is too sensitive to isolated outlier tokens and sum is biased toward longer turns. The paper finds mean to be the most stable and representative turn-level aggregation strategy.

For turn-to-round aggregation, the paper argues that user speech turns contain many more tokens than assistant text turns, so a moderate down-weighting of the user side is helpful. Among the tested values, $\beta = 0.5$ performs best, outperforming both the uncompensated case ($\beta = 1.0$) and the stronger compensation setting ($\beta = 0.25$).

Penalty weight and context scope

The paper also studies the penalty weight $\alpha$ and the number of selected rounds $K$. For $K = 1$, increasing $\alpha$ steadily improves APR until the performance saturates, and the paper chooses $\alpha = 2.5$ as the best setting in that regime. The searched range is $\alpha \in \{1.0, 1.5, 2.0, 2.5, 3.0\}$.

Expanding the scope to $K = 2$ generally hurts performance, and high values of $\alpha$ are especially harmful in that setting. The paper's interpretation is that a broader context window is more likely to include irrelevant rounds, and aggressively amplifying such noisy context degrades decoding quality.

Discussion and limitations

The paper's key takeaway is that strict context adherence in spoken dialogue can be improved without retraining the base model. By using the model's own attention as a proxy for latent awareness and then applying contrastive decoding against a context-masked variant, the method turns internal recognition into visible output adherence.

The stated evidence is strong but also scoped: the experiments are limited to the Audio MultiChallenge benchmark, to two benchmark subsets for detailed reporting, and to three end-to-end spoken dialogue systems. The provided LaTeX does not include a separate limitations section, so this summary cannot claim limitations beyond the experimental scope reported by the authors. The paper also does not present training of a new model; it is a decoding-time intervention evaluated under default generation settings.

Conclusion

The paper concludes that audio-adapted CAD is a robust, model-agnostic way to reduce context failures in multi-round spoken dialogue systems. The method consistently improves strict adherence metrics across models, especially on Semantic Memory and Self Coherence, and the ablations support the central claim that precise key-context isolation is essential. Overall, the work argues that many dialogue failures are not pure memory failures, but failures to convert latent contextual understanding into active adherence during decoding.

Code & Implementation

This repository provides the official implementation of the paper's audio-adapted Context-Aware Decoding (CAD) method for spoken dialogue systems. It implements CAD as a training-free, inference-time procedure that modifies the generation loop of existing state-of-the-art models to enforce context adherence by contrasting output distributions with and without key dialogue context.

The repo is organized around three separate model implementations—Kimi-Audio-7B-Instruct, MiMo-Audio-7B-Instruct (with Thinking mode), and Qwen3-Omni-30B-A3B-Instruct—each in their own subdirectory (kimi/, mimo/, and qwen/ respectively). Each model folder contains a dedicated inference.py script that shows how the CAD method is applied at inference time on the Audio MultiChallenge benchmark dataset.

These inference scripts handle dataset loading, caching of audio files, building conversations from dialog turns, and invoking the respective model's generation call with CAD-specific hyperparameters that include aggregation methods (intra, inter), layer selection (layer_from), the number of key context rounds (num_rounds), and the guidance scale (guidance_scale) that controls the CAD penalty strength.

The eval/ directory contains evaluation scripts such as evaluate_openai.py which uses GPT-based LLMs as judges to assess the quality of generated dialogues. The repo also includes shell scripts in each model folder's bash_scripts/ to reproduce key experiment tables and ablations reported in the paper.

Overall, the repository provides full reproducibility of the paper's experiments and a clear separation of the CAD logic intertwined with each upstream model's inference pipeline, making it straightforward to adapt the CAD decoding procedure to new models that support token-level attention inspection.