Entity-Aware CoT for Speech LLMs
Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention
This paper identifies a failure in speech large language models related to entity binding in complex reasoning. The Entity-Aware Chain-of-Thought (EA-CoT) method explicitly enumerates and binds entities during inference, significantly improving performance on speech inputs and narrowing the gap with text models.
Abstract
Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this gap is not a uniform cognitive deficit. Evaluating two architecturally diverse SLLMs, we show speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, syntactic, and factual tasks. Yet on logical tasks requiring entity tracking, S2T accuracy collapses to chance. We diagnose this as an entity binding failure: continuous speech features blur precise entity-property associations during implicit reasoning. To validate this diagnosis, we introduce Entity-Aware Chain-of-Thought (EA-CoT), a lightweight inference-time intervention forcing SLLMs to enumerate entities and bind them to claims before reasoning. EA-CoT bridges the gap, even when spoken names are misrecognized, yielding up to a 24.4 percentage-point accuracy gain. Ablations confirm the gains stem from explicit semantic binding, reframing the gap as an elicitation failure rather than a missing capability.
Overview
This paper studies why speech large language models (SLLMs) underperform text-based models on reasoning tasks, and argues that the gap is not uniform across task types. Rather than a broad cognitive deficit, the authors identify a concentrated failure mode in logical tasks that require tracking entities and their changing properties. They call this an entity binding failure: continuous speech representations preserve broad semantics but blur the discrete associations needed to keep entities, claims, and state changes aligned during multi-step reasoning.
The main empirical claim is that, for two architecturally different SLLMs, speech-to-text input can match or even exceed text-to-text performance on spatial, syntactic, and factual tasks, but collapses to chance on the entity-tracking logical task web of lies. To test whether this gap reflects a missing capability or an elicitation problem, the paper introduces Entity-Aware Chain-of-Thought (EA-CoT), an inference-time prompting intervention that forces explicit entity enumeration and claim binding before reasoning. EA-CoT substantially closes the gap, including when spoken names are not transcribed perfectly, which supports the paper's diagnosis that the bottleneck is structural semantic binding rather than simple speech recognition errors.
Problem framing and paper claim
The work positions the speech/text modality gap as task-specific. The authors evaluate two SLLMs that differ architecturally: Qwen2.5-Omni-7B, which uses a dedicated thinker module that generates internal reasoning tokens before the visible response, and Phi-4-Multimodal, which generates responses directly without a separate reasoning module. Across these models, the paper reports that speech input is not consistently worse than text input. Instead, the largest degradation is concentrated in a single category, web of lies, where one must propagate truth values through a chain of people making claims about previous speakers. In that setting, speech accuracy falls to chance, while text accuracy remains high.
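To make the task concrete, here is a minimal sketch of the truth-value propagation that web of lies requires. The claim format (a first speaker with a known status, followed by speakers who each assert that the previous speaker tells the truth or lies) and the names are assumptions for illustration, not items from the benchmark.

```python
from typing import Dict, List, Tuple


def solve_web_of_lies(
    first_name: str,
    first_truthful: bool,
    claims: List[Tuple[str, bool]],
) -> Dict[str, bool]:
    """Propagate truth values through a chain of claims.

    claims holds (speaker, says_previous_is_truthful) pairs in order.
    """
    status = {first_name: first_truthful}
    prev_truthful = first_truthful
    for speaker, says_prev_truthful in claims:
        # A speaker is truthful exactly when their claim matches reality.
        speaker_truthful = says_prev_truthful == prev_truthful
        status[speaker] = speaker_truthful
        prev_truthful = speaker_truthful
    return status


# Hypothetical chain: Ana tells the truth; Bea says Ana lies;
# Cal says Bea tells the truth; Dee says Cal lies.
chain = [("Bea", False), ("Cal", True), ("Dee", False)]
status = solve_web_of_lies("Ana", True, chain)
print(status["Dee"])  # True
```

Every step depends on knowing exactly which person made which claim, so a single blurred name-to-claim binding flips all later truth values.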
The authors argue that this pattern is best explained by the way speech encoders compress continuous audio. Pooling and downsampling preserve global semantics but can erase fine-grained token boundaries and entity-specific detail. That matters less for tasks where the model can rely on holistic meaning, but it is harmful when the model must maintain exact bindings between entities and changing properties over multiple reasoning steps. In the paper's framing, this is not merely a lower-level recognition problem; it is a failure to keep semantic associations stable during implicit reasoning.
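The intuition can be illustrated with a toy example. This is not the paper's analysis and does not model the encoders it evaluates: real speech encoders use learned downsampling over much shorter windows. The one-hot "frames" and the `pool` helper are hypothetical. The sketch only shows that averaging over a wide enough window keeps what was said while losing who it was said about.

```python
from typing import List

Vector = List[float]


def pool(frames: List[Vector], window: int) -> List[Vector]:
    """Average non-overlapping windows of frames (a stand-in for downsampling)."""
    pooled = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(f[i] for f in chunk) / len(chunk) for i in range(dim)])
    return pooled


# Hypothetical one-hot features for two names and two properties.
ANA, BEA = [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]
LIES, TRUTH = [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]

utt_a = [ANA, LIES, BEA, TRUTH]  # "Ana lies, Bea tells the truth"
utt_b = [BEA, LIES, ANA, TRUTH]  # "Bea lies, Ana tells the truth"

same_fine = pool(utt_a, 2) == pool(utt_b, 2)    # bindings still distinguishable
same_coarse = pool(utt_a, 4) == pool(utt_b, 4)  # bindings averaged away
print(same_fine, same_coarse)  # False True
```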
Evaluation protocol and datasets
The paper decomposes the modality gap by task category rather than reporting only aggregate averages. It uses four categories from the VoiceBench BBH split, with 250 items each for a total of 1,000 samples:
- hyperbaton
- navigate
- sports understanding
- web of lies
Each item is evaluated in both synthesized speech and plain text, using the same model and the same answer format. This paired design allows a direct S2T versus T2T comparison. Final answers are extracted with format guarding and a deterministic fallback; outputs that still yield no parseable final answer are counted as incorrect. Statistical significance is tested with McNemar's test on the paired outcomes.
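A paired significance test of this kind can be sketched as follows. The summary does not say which variant of McNemar's test the paper uses, so the exact binomial form below is an assumption, and the correctness vectors are made-up data. Only the discordant pairs (items that one condition gets right and the other gets wrong) enter the test.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(
    a_correct: Sequence[bool], b_correct: Sequence[bool]
) -> Tuple[int, int, float]:
    """Two-sided exact McNemar test on paired correctness flags.

    Returns (a_only, b_only, p_value), where a_only counts items that only
    condition A answers correctly and b_only the reverse.
    """
    a_only = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    b_only = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0  # no discordant pairs, no evidence
    k = min(a_only, b_only)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)


# Made-up per-item outcomes for the same eight items in both modalities.
s2t = [True, True, False, False, False, False, True, False]
t2t = [True, True, True, True, True, True, False, True]
s2t_only, t2t_only, p_value = mcnemar_exact(s2t, t2t)
print(s2t_only, t2t_only, p_value)  # 1 5 0.21875
```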
All reported experiments use the released model configurations, without fine-tuning. The baseline generation limit is 256 tokens. EA-CoT and the structured control prompts use a 1,024-token limit to accommodate longer reasoning traces.
EA-CoT method
EA-CoT is a prompt-based, inference-time intervention designed specifically for the entity-binding bottleneck. For web of lies, the prompt instructs the model to:
- enumerate all people mentioned;
- record each claim linking a person to a property;
- reason step by step through the chain of claims;
- extract the final answer in the required format.
The key design choice is that the model itself must generate the entity list and claim record. There are no human-provided entity annotations, no oracle transcripts, and no externally supplied binding structure. The intervention is therefore fully automatic and can be viewed as a causal probe: if explicit entity binding restores accuracy, then the diagnosis of binding failure is strengthened.
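The four steps above can be sketched as an instruction string plus an answer extractor. The prompt wording, the yes/no answer format, and the function names are assumptions for illustration; the paper's exact prompt and extraction rules are not reproduced here.

```python
import re
from typing import Optional

# Hypothetical wording that mirrors the four EA-CoT steps.
EA_COT_INSTRUCTION = (
    "Before answering, work through these steps:\n"
    "1. List every person mentioned.\n"
    "2. For each person, write the claim that links them to a property.\n"
    "3. Reason step by step through the chain of claims.\n"
    "4. End with a line of the form 'Final answer: yes' or 'Final answer: no'.\n"
)


def extract_final_answer(output: str) -> Optional[str]:
    """Return 'yes' or 'no', or None when nothing parseable is found."""
    match = re.search(r"final answer\s*:\s*(yes|no)\b", output, re.IGNORECASE)
    if match:
        return match.group(1).lower()
    # Deterministic fallback: the last standalone yes/no in the output.
    candidates = re.findall(r"\b(yes|no)\b", output, re.IGNORECASE)
    return candidates[-1].lower() if candidates else None


def is_correct(output: str, gold: str) -> bool:
    # Outputs with no parseable answer count as incorrect.
    return extract_final_answer(output) == gold


print(extract_final_answer("People: Ana, Bea.\nFinal answer: Yes"))  # yes
```

The instruction asks the model to write out the entity list and claim record itself, so no annotations or transcripts are supplied from outside.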
The paper also uses structured control prompts for the other categories to make sure EA-CoT's effect is not just a generic chain-of-thought benefit. These controls induce explicit step-by-step reasoning without entity tracking, such as classifying adjectives in hyperbaton, sequentially tracking coordinates in navigate, and identifying sports in sports understanding.
The paper formalizes the token-budget confound as a decomposition of total improvement:
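A form consistent with the surrounding description is given below. The notation is assumed for illustration and may differ from the paper's: $A_{p,T}$ denotes speech accuracy under prompt $p$ with a $T$-token generation limit, $\alpha$ is the budget term, and $\beta$ is the instruction term. The two terms telescope, so their sum equals $A_{\text{EA-CoT},1024} - A_{\text{BL},256}$.

$$
\Delta_{\text{total}}
  = \underbrace{\left(A_{\text{BL},1024} - A_{\text{BL},256}\right)}_{\alpha}
  + \underbrace{\left(A_{\text{EA-CoT},1024} - A_{\text{BL},1024}\right)}_{\beta}
$$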
Here, the first term isolates the effect of increasing the generation budget from 256 to 1,024 tokens, and the second term isolates the effect of the instruction itself.
Main experimental results
The main result is that the modality gap is concentrated in web of lies. On the other categories, baseline S2T performance is comparable to T2T, and in some cases slightly better. Excluding web of lies, Qwen speech accuracy is actually higher than text by $+1.9$ pp, and the Phi-4 gap shrinks from $13.1$ pp to $3.9$ pp. This supports the paper's claim that the major weakness is not a uniform speech reasoning deficit.
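The gaps quoted above follow from the per-category baseline accuracies in the paper's main results table. The sketch below recomputes them as unweighted means, which is appropriate here because every category has 250 items. The function and variable names are illustrative.

```python
from typing import Dict, Iterable, Tuple

# Baseline (BL) accuracies in percent, as (S2T, T2T) pairs.
BASELINE: Dict[str, Dict[str, Tuple[float, float]]] = {
    "Qwen2.5-Omni": {
        "HYP": (73.2, 72.0), "NAV": (58.0, 52.8),
        "SPO": (55.6, 56.4), "WOL": (52.8, 86.8),
    },
    "Phi-4-MM": {
        "HYP": (56.4, 61.6), "NAV": (59.2, 58.0),
        "SPO": (48.0, 55.6), "WOL": (50.8, 91.6),
    },
}


def modality_gap(
    scores: Dict[str, Tuple[float, float]], exclude: Iterable[str] = ()
) -> float:
    """Mean T2T minus mean S2T accuracy, in percentage points."""
    skipped = set(exclude)
    kept = [pair for cat, pair in scores.items() if cat not in skipped]
    s2t = sum(s for s, _ in kept) / len(kept)
    t2t = sum(t for _, t in kept) / len(kept)
    return t2t - s2t


for model, scores in BASELINE.items():
    full = modality_gap(scores)
    without_wol = modality_gap(scores, exclude=("WOL",))
    print(f"{model}: all={full:+.1f} pp, without WOL={without_wol:+.1f} pp")
```

A negative gap means speech is ahead of text, which is the case for Qwen once web of lies is excluded.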
EA-CoT improves speech-input accuracy substantially on both models. On the full four-category benchmark, the speech gains are $+8.5$ to $+9.1$ percentage points overall, and these improvements are significant under McNemar's test. The largest single-task recovery is on web of lies, where speech accuracy rises by $+16.4$ pp for Qwen and by $+24.4$ pp for Phi-4. The paper highlights that this recovery is especially striking because it occurs even when spoken names are misrecognized, as long as the model creates a stable textual anchor during the chain-of-thought.
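The table below reproduces the paper's main accuracy results (%), reported as S2T / T2T pairs. BL is the baseline prompt and CoT is the structured prompt (EA-CoT for web of lies, the structured controls for the other categories); HYP, NAV, SPO, and WOL abbreviate hyperbaton, navigate, sports understanding, and web of lies.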
| Model | Method | Overall | HYP | NAV | SPO | WOL |
|---|---|---|---|---|---|---|
| Qwen2.5-Omni | BL | 59.9 / 67.0 | 73.2 / 72.0 | 58.0 / 52.8 | 55.6 / 56.4 | 52.8 / 86.8 |
| Qwen2.5-Omni | CoT | 68.4 / 84.3 | 62.4 / 83.2 | 80.4 / 80.8 | 61.6 / 77.6 | 69.2 / 95.6 |
| Phi-4-MM | BL | 53.6 / 66.7 | 56.4 / 61.6 | 59.2 / 58.0 | 48.0 / 55.6 | 50.8 / 91.6 |
| Phi-4-MM | CoT | 62.7 / 77.6 | 54.8 / 77.2 | 66.4 / 82.0 | 54.4 / 64.0 | 75.2 / 87.2 |
The results show a sharp asymmetry: web of lies is the only category where speech gains exceed text gains. For Qwen, speech improves by $+16.4$ pp versus $+8.8$ pp on text. For Phi-4, speech improves by $+24.4$ pp, while text accuracy drops by $4.4$ pp under the structured prompt. In every other category, the structured prompts help text more than speech, and on hyperbaton speech accuracy falls for both models. The authors interpret this as evidence that EA-CoT repairs a speech-specific binding issue rather than simply making reasoning easier in general.
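The asymmetry can be checked directly from the table. The sketch below computes the BL-to-CoT gain for each modality and lists the categories where the speech gain is larger. The data are the table values; the helper names are illustrative.

```python
from typing import Dict, List, Tuple

Pair = Tuple[float, float]  # (S2T, T2T) accuracy in percent

RESULTS: Dict[str, Dict[str, Dict[str, Pair]]] = {
    "Qwen2.5-Omni": {
        "BL": {"HYP": (73.2, 72.0), "NAV": (58.0, 52.8),
               "SPO": (55.6, 56.4), "WOL": (52.8, 86.8)},
        "CoT": {"HYP": (62.4, 83.2), "NAV": (80.4, 80.8),
                "SPO": (61.6, 77.6), "WOL": (69.2, 95.6)},
    },
    "Phi-4-MM": {
        "BL": {"HYP": (56.4, 61.6), "NAV": (59.2, 58.0),
               "SPO": (48.0, 55.6), "WOL": (50.8, 91.6)},
        "CoT": {"HYP": (54.8, 77.2), "NAV": (66.4, 82.0),
                "SPO": (54.4, 64.0), "WOL": (75.2, 87.2)},
    },
}


def gains(model: str) -> Dict[str, Pair]:
    """Per-category (speech gain, text gain) from BL to CoT, in pp."""
    base, cot = RESULTS[model]["BL"], RESULTS[model]["CoT"]
    return {
        cat: (round(cot[cat][0] - base[cat][0], 1),
              round(cot[cat][1] - base[cat][1], 1))
        for cat in base
    }


def speech_favoring(model: str) -> List[str]:
    """Categories where the speech gain exceeds the text gain."""
    return [cat for cat, (s, t) in gains(model).items() if s > t]


for model in RESULTS:
    print(model, gains(model)["WOL"], speech_favoring(model))
```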
Token budget control and figure-based evidence
The paper explicitly checks whether the gain comes from using more tokens rather than from the entity-aware instruction. It finds that increasing the generation limit from 256 to 1,024 tokens without changing the instruction gives essentially no speech improvement, so the budget term $\alpha$ is approximately zero.