Spoken Language Adherence in Multimodal LLMs

Are you speaking my languages? On spoken language adherence in multimodal LLMs

Addresses language misidentification by multimodal ASR models with a soft prompting approach and a new adherence metric, and compares three mitigation strategies for improving transcription fidelity in multilingual and code-switching speech.

llm
multimodal
asr

Authors: Hyungwon Kim, Kandarp Joshi, Lillian Zhou, Pavel Golik, Petar Aleksic

Categories: cs.CL, cs.SD, eess.AS

Comment: 7 pages, 3 tables in the main body

Published 2026-06-15 · Updated 2026-06-15

Abstract

While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To preserve flexibility and code-switching capabilities, we propose a soft prompting approach that hints at potential spoken languages without strictly constraining the output. We formally define this challenge as a lack of language adherence, introduce a novel metric to quantify violations, and evaluate three mitigation strategies: (1) zero-shot prompting for robust guidance under uncertainty, (2) supervised fine-tuning (SFT) to improve prompt adherence, and (3) Chain-of-Thought (CoT) reasoning to enforce adherence during decoding. We present a comparative analysis of these methods across multiple languages, evaluating effectiveness in reducing the language violation while maintaining overall ASR performance. Finally, we discuss trade-offs to guide strategy selection under various compute constraints.

Problem framing and motivation

This paper studies a failure mode in multimodal LLM-based ASR: the model may transcribe speech in the wrong output language even when the audio is otherwise intelligible. The authors call this lack of language adherence. The issue is especially harmful in multilingual and code-switching settings, because a transcript in the wrong script or language can distort meaning, break downstream pipelines such as machine translation or command systems, and create a poor user experience that can feel culturally inappropriate or biased.

The paper’s core design goal is not to hard-constrain the model to one language. Hard constraints would solve the adherence problem but would also remove flexibility for code-switching and for cases where the upstream language signal is imperfect. Instead, the paper explores soft language hinting: the prompt suggests one or more plausible spoken languages, while still allowing the model to follow the audio evidence if the hint is wrong.

The study makes three contributions:

a formal definition of language adherence violations and a new metric, Language Adherence Violation Rate (LAVR);
a comparison of three mitigation strategies: zero-shot prompting, supervised fine-tuning (SFT), and Chain-of-Thought (CoT) prompting;
a comparative evaluation across monolingual and code-switching speech in several languages, with a focus on robustness under correct, incorrect, mixed, and absent language hints.

Formalization and metric

For offline evaluation, each utterance is assumed to have a reference set of languages, $L_{ref}$, that appear in the audio, and the model output is mapped to a hypothesis set, $L_{hyp}$, using an external text language identifier. A language adherence violation occurs whenever

$$L_{hyp} \not\subseteq L_{ref}.$$

For a test set of $N$ utterances, the paper defines

$$\mathrm{LAVR} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}(L_{hyp,i} \not\subseteq L_{ref,i}),$$

where $\mathbb{I}(\cdot)$ is the indicator function. The metric is intentionally permissive for common language mixtures and proper names, and it is implemented at the character/script level rather than the word level. In practice, the paper operationalizes language by canonical orthography. For example, a German reference set maps to the acceptable character class $[\text{a-zäöüß}]$, ASCII digits and punctuation are treated as neutral, and a single unexpected character is enough to flag the whole utterance as a violation.

This design choice is meant to capture characters that would stand out to a user, rather than to judge every token-level language choice. The authors explicitly note the trade-off: character-level LAVR can miss outputs that are globally wrong-language but still use only acceptable characters, and it cannot measure severity. For that reason, they always report LAVR together with WER (or CER for scripts where character error rate is more appropriate).
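The character-level check described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `ALLOWED_CHARS` inventories and the function names `is_violation` and `lavr` are assumptions made for this example, and the paper's actual inventories are not given here.

```python
import string
import unicodedata

# Hypothetical per-language character inventories (assumption for this
# sketch; the German class follows the [a-zäöüß] example in the text).
ALLOWED_CHARS = {
    "de": set("abcdefghijklmnopqrstuvwxyzäöüß"),
    "en": set(string.ascii_lowercase),
}

# ASCII digits and punctuation are neutral; whitespace is also treated
# as neutral here so that word separators never count as violations.
NEUTRAL = set(string.digits + string.punctuation + string.whitespace)


def is_violation(hypothesis: str, ref_langs: list[str]) -> bool:
    """Return True if any character lies outside the reference inventories."""
    allowed = set(NEUTRAL)
    for lang in ref_langs:
        allowed |= ALLOWED_CHARS[lang]
    # NFC normalization keeps "ä" as one code point before comparison.
    text = unicodedata.normalize("NFC", hypothesis).lower()
    return any(ch not in allowed for ch in text)


def lavr(hypotheses: list[str], ref_langs: list[list[str]]) -> float:
    """Fraction of utterances with at least one unexpected character."""
    if not hypotheses or len(hypotheses) != len(ref_langs):
        raise ValueError("need one reference language set per hypothesis")
    flags = [is_violation(h, r) for h, r in zip(hypotheses, ref_langs)]
    return sum(flags) / len(flags)
```

Under these assumptions, `is_violation("hello world", ["de"])` returns `False` even though the text is English, which is exactly the blind spot noted above: a wrong-language output built only from acceptable characters goes undetected.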

The paper also discusses an online interpretation of the metric: user language settings and prior conversation history can serve as a proxy for $L_{ref}$, but this is inherently imperfect for bilingual users, travel, learning scenarios, and natural code-switching. The recommendation is to track relative changes in adherence rather than to expect absolute zero violations in production.

Methods

The experiments are built around a proprietary multimodal ASR foundation model: Gemini Flash lite 2.0, described as a deep transformer-based LLM trained on large-scale transcribed speech in the languages covered by the paper. The baseline used for zero-shot experiments is a proprietary ASR-tuned variant; the SFT and CoT variants are further fine-tuned on proprietary transcribed speech data containing mostly monolingual and code-switching single-utterance recordings. The paper does not explore reinforcement learning methods, explicitly limiting itself to prompting-based and supervised approaches.

1) Zero-shot language-hint prompting

The first strategy is prompt engineering. The model is given a language hint that biases it toward a target language, but the prompt is designed to remain tolerant when the hint is incorrect. The paper evaluates three prompt styles:

P1: “Transcribe the following speech segment in <languages>:”
P2: “The following speech segment is spoken by someone who knows <languages>. Transcribe the following speech segment:”
P3: “Transcribe this speech segment. It may contain a mix of <languages> and other languages.”

The best prompt is selected by evaluating language adherence on short utterances, where language inference is especially hard because there is little context and phonetic ambiguity is high. The authors choose P3 because it is the most robust to incorrect hints.
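The three prompt styles can be expressed as simple templates. The sketch below is illustrative only: the template strings are quoted from the list above, while the function name `build_prompt` and the comma-joined rendering of `<languages>` are assumptions, since the paper's exact formatting of multiple languages is not stated here.

```python
# Prompt templates quoted from the three styles evaluated in the paper.
PROMPT_TEMPLATES = {
    "P1": "Transcribe the following speech segment in {languages}:",
    "P2": (
        "The following speech segment is spoken by someone who knows "
        "{languages}. Transcribe the following speech segment:"
    ),
    "P3": (
        "Transcribe this speech segment. It may contain a mix of "
        "{languages} and other languages."
    ),
}

# Prompt used when no language hint is available.
NO_HINT_PROMPT = "Transcribe the following speech segment:"


def build_prompt(style: str, languages: list[str]) -> str:
    """Fill a prompt template with a language hint list.

    Assumption: multiple languages are joined with ", ".
    """
    if not languages:
        return NO_HINT_PROMPT
    return PROMPT_TEMPLATES[style].format(languages=", ".join(languages))
```

For example, `build_prompt("P3", ["Korean", "English"])` yields a prompt that names both languages while still leaving room for others, which is the tolerance property that motivates choosing P3.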

2) Supervised fine-tuning with instruction

The second strategy fine-tunes the model to better follow language-hinted prompts. The training objective is the standard token-level cross-entropy loss used in instruction tuning, but with prompts that explicitly encode language hints. The training prompt follows the P3 style from the zero-shot stage.

To encourage robustness, the paper randomizes the language hint condition during training across four categories: no-hint, correct, distractor, and mix. For distractor and mix cases, the authors add up to three randomly selected languages from a pool of 56 languages, with a 60%/30%/10% split for one/two/three distractor languages. The best training mixture is selected by short-utterance validation.

The final SFT mixture is 10% no-hint, 40% correct, 35% distractor, and 15% mix. The authors note that the SFT model has the same latency as the baseline because it does not add any extra decoding steps.
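The hint randomization can be sketched as a small sampler. The ratios come from the text above; everything else is an assumption for illustration, including the function name `sample_hint`, the shuffling of the hint order, and the placeholder language pool (the paper draws distractors from 56 languages that are not listed here).

```python
import random

# Final SFT mixture reported in the text.
HINT_MIXTURE = {"no-hint": 0.10, "correct": 0.40, "distractor": 0.35, "mix": 0.15}
# Probability of adding one, two, or three distractor languages.
DISTRACTOR_COUNT_WEIGHTS = {1: 0.60, 2: 0.30, 3: 0.10}


def sample_hint(spoken: list[str], pool: list[str], rng: random.Random):
    """Sample a hint condition and the language list placed in the prompt."""
    condition = rng.choices(
        list(HINT_MIXTURE), weights=list(HINT_MIXTURE.values())
    )[0]
    if condition == "no-hint":
        return condition, []
    if condition == "correct":
        return condition, list(spoken)

    # Distractors must differ from the languages actually spoken.
    candidates = [lang for lang in pool if lang not in spoken]
    if not candidates:
        raise ValueError("pool contains no distractor candidates")
    k = rng.choices(
        list(DISTRACTOR_COUNT_WEIGHTS),
        weights=list(DISTRACTOR_COUNT_WEIGHTS.values()),
    )[0]
    distractors = rng.sample(candidates, min(k, len(candidates)))
    if condition == "distractor":
        return condition, distractors

    # "mix": correct languages plus distractors, in random order
    # (the ordering is an assumption of this sketch).
    hint = list(spoken) + distractors
    rng.shuffle(hint)
    return condition, hint
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across training runs.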

3) Chain-of-Thought prompting

The third strategy asks the model to first reason about the spoken language before producing the transcript. The paper frames this as a way to narrow the sampling space by committing to a language estimate before decoding the transcription. The prompt used is:

Think about the languages of the speech and transcribe it in those languages.

For training, the reference language is prepended to the transcript inside special control tokens. The CoT training mixture is 90% distractor-only and 10% no-hint. The authors emphasize that their models are non-streaming, so the full audio is available before decoding. They therefore consider the extra CoT tokens to have negligible average decoding overhead in their setup, while noting that streaming systems could experience higher latency.
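One way to picture the CoT training target is a transcript prefixed with a language tag. The sketch below is hypothetical: the control tokens `<lang>` and `</lang>` and the helper names are assumptions, since the paper's actual special tokens are not given here.

```python
# Hypothetical control tokens; the paper's actual tokens are not specified.
LANG_OPEN = "<lang>"
LANG_CLOSE = "</lang>"


def make_cot_target(ref_langs: list[str], transcript: str) -> str:
    """Prepend the reference languages to the transcript inside control tokens."""
    return f"{LANG_OPEN}{', '.join(ref_langs)}{LANG_CLOSE}{transcript}"


def split_cot_output(output: str) -> tuple[list[str], str]:
    """Separate the predicted languages from the transcript.

    If the model emitted no language prefix, return an empty language
    list and treat the whole output as the transcript.
    """
    if output.startswith(LANG_OPEN) and LANG_CLOSE in output:
        head, _, transcript = output[len(LANG_OPEN):].partition(LANG_CLOSE)
        langs = [lang.strip() for lang in head.split(",") if lang.strip()]
        return langs, transcript
    return [], output
```

Because the language prefix is only a handful of tokens, stripping it after decoding is cheap in a non-streaming setup; in a streaming system the transcript cannot start until the prefix has been emitted.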

Experimental setup

The evaluation uses two classes of data: monolingual speech and code-switching speech. The monolingual sets contain a few thousand user queries per language, sourced from real-world interactions with a production AI agent. The code-switching sets are synthesized from approximately 10,000 anonymized queries, with diverse voices and controlled accents.

In all experiments, the authors compare four prompt conditions:

no-hint: the prompt is simply “Transcribe the following speech segment:”
correct: the prompt contains only the correct spoken language(s)
distractor: the prompt contains only an incorrect language
mix: the prompt contains the correct language plus a distractor language

For code-switching, correct includes both spoken languages, distractor includes only the non-English distractor, and mix includes the non-English language plus its distractor. The paper always allows Latin characters $[\text{a-z}]$ in adherence scoring because English proper names and abbreviations are ubiquitous; for example, a Korean transcript containing “Netflix” is not treated as a violation.
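The four evaluation conditions, including the code-switching rules, can be written down as a small helper. This is a sketch under stated assumptions: the function name `eval_hint_conditions` and the convention that English is the base language of every code-switching pair are illustrative choices, not details confirmed by the paper.

```python
def eval_hint_conditions(
    spoken: list[str], distractor: str, base: str = "English"
) -> dict[str, list[str]]:
    """Return the language hint list for each of the four prompt conditions.

    For code-switching audio (several spoken languages), the mix
    condition keeps only the non-English language plus its distractor,
    following the description in the text.
    """
    # Non-English spoken languages; fall back to the full list for
    # monolingual English so that "mix" still contains a correct language.
    non_base = [lang for lang in spoken if lang != base] or list(spoken)
    return {
        "no-hint": [],
        "correct": list(spoken),
        "distractor": [distractor],
        "mix": non_base + [distractor],
    }
```

For Korean-English audio with Japanese as the distractor, this gives `correct = ["Korean", "English"]`, `distractor = ["Japanese"]`, and `mix = ["Korean", "Japanese"]`.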

Model selection is done on dedicated short-utterance datasets: 1,500 English one-word utterances and 3,000 Korean one-word utterances. The selection criterion is the smallest overall LAVR across scenarios plus the smallest gap between the correct-hint and distractor-hint conditions, which is taken as a proxy for robustness.

Dataset details

The appendix provides statistics for the evaluation corpora. Hindi transcripts are transliterated in the table for presentation, but native scripts are used in the actual evaluation. The code-switching datasets are generated with ten distinct male and female voices per non-English language, selecting one voice per utterance. The voices preserve the accent of the non-English language in mixed-language speech.

Dataset	Utterances	Total hours
English	1,760	5.7
French	3,152	2.1
Hindi	2,784	2.4
Korean	6,448	4.5
French-English	10,944	6.7
Hindi-English	10,160	7.2
Korean-English	9,968	7.9

Zero-shot prompt selection

The short-utterance validation confirms that prompt wording matters substantially when the hint is wrong. P3 is selected because it has the best robustness to distractors while retaining good adherence when the hint is correct.

Prompt	English LAVR (%)			Korean LAVR (%)
	correct	distractor	mix	correct	distractor	mix
P1	1.7	22.5	5.9	0.0	6.5	7.3
P2	1.9	3.0	1.8	0.0	6.2	0.1
P3	2.3	2.0	1.8	0.0	3.3	0.0

Two takeaways are clear: P1 is highly brittle under incorrect hints (22.5% English LAVR with a distractor-only hint, versus 2.0% for P3), and P3 is the most stable across both English and Korean short utterances. This prompt is then used for SFT and CoT training.
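The selection criterion (low overall LAVR plus a small gap between correct and distractor hints) can be applied directly to the table values. The paper does not say how the two terms are combined, so the equal-weight sum in `selection_score` below is an assumption made for illustration.

```python
from statistics import mean

# LAVR (%) from the table above, as (correct, distractor, mix) per language.
SHORT_UTTERANCE_LAVR = {
    "P1": {"en": (1.7, 22.5, 5.9), "ko": (0.0, 6.5, 7.3)},
    "P2": {"en": (1.9, 3.0, 1.8), "ko": (0.0, 6.2, 0.1)},
    "P3": {"en": (2.3, 2.0, 1.8), "ko": (0.0, 3.3, 0.0)},
}


def selection_score(per_lang: dict[str, tuple[float, float, float]]) -> float:
    """Mean LAVR over all cells plus the mean |distractor - correct| gap.

    Lower is better. The equal weighting is an assumption of this sketch.
    """
    overall = mean(v for row in per_lang.values() for v in row)
    gap = mean(abs(row[1] - row[0]) for row in per_lang.values())
    return overall + gap


scores = {name: selection_score(rows) for name, rows in SHORT_UTTERANCE_LAVR.items()}
best_prompt = min(scores, key=scores.get)
```

With these numbers, P3 has both the lowest mean LAVR and the smallest correct-to-distractor gap, so it is selected under any positive weighting of the two terms.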

SFT mixture selection and prompt ablation

The paper also performs a small ablation over training-mixture ratios. The tested mixtures are:

Mixture	Correct	Distractor	Mix	No-hint
M1	0.40	0.25	0.25	0.10
M2	0.40	0.35	0.15	0.10
M3	0.40	0.30	0.20	0.10
M4	0.40	0.30	0.10	0.20
M5	0.40	0.20	0.20	0.20
M6	0.40	0.10	0.30	0.20

Using the short English and Korean validation sets, the authors choose M2 as the final SFT mixture. The ablation shows that overly skewed training mixes can severely hurt distractor robustness, especially in English, while the selected mixture balances the four prompt conditions more effectively.

Prompt	English short-utterance LAVR (%)
	correct	distractor	mix	no-hint
P1	5.4	21.5	23.9	6.7
P2	3.0	23.8	19.3	4.1
P3	3.0	3.0	2.4	5.4

On Korean short utterances, P3 also outperforms P1 and P2, so the same prompt is used consistently throughout the final experiments.

Main results on monolingual and code-switching speech

The main tables report both LAVR and WER (or CER for scripts such as Korean and Japanese). The authors compare the baseline zero-shot prompt, SFT, and CoT under each of the four prompt conditions. Overall, the three approaches are broadly comparable when given the same hinting setup, with the correct hint usually giving the best result. A strong pattern also emerges: having at least one correct language hint (correct or mix) is far better than either no hint or a distractor-only prompt.

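Below is a compact summary of the main monolingual results reported in the paper for the zero-shot model. Each cell gives LAVR (%) followed by WER, or CER where applicable, in parentheses. These numbers illustrate the overall trend: correct hints produce low LAVR, and distractor hints worsen adherence in every language except English. The no-hint condition falls between correct and distractor for Hindi and Korean but has the highest LAVR for English and French, so it is not consistently better than distractor-only prompting.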

Language	correct	distractor	mix	no-hint
English	0.8 (6.8)	0.7 (7.2)	0.7 (7.7)	1.0 (6.9)
French	0.2 (9.6)	1.2 (12.4)	0.3 (11.0)	2.2 (10.6)
Hindi	0.0 (12.2)	1.1 (13.0)	0.0 (11.2)	0.6 (11.4)
Korean	0.4 (11.0)	3.5 (11.7)	0.6 (11.0)	1.7 (11.3)

In the code-switching setting with English, the zero-shot model remains strong when the prompt contains the correct non-English language: LAVR stays at or below 0.6% in the correct and mix conditions. The main exception is Korean-English, where a distractor-only hint raises LAVR to 6.1%. Each cell gives LAVR (%) followed by WER or CER (%) in parentheses.

Language	correct	distractor	mix	no-hint
French-English	0.1 (31.1)	0.2 (31.3)	0.1 (30.8)	0.1 (31.9)
Hindi-English	0.0 (24.4)	0.4 (25.1)	0.0 (24.3)	0.4 (25.9)
Korean-English	0.1 (19.2)	6.1 (21.9)	0.6 (19.2)	0.7 (20.7)

The paper then shows that the same qualitative pattern continues for SFT and CoT. The main practical difference is that these trained variants can perform worse in the no-hint condition, which the authors attribute to catastrophic forgetting: the fine-tuning mixtures reduce the proportion of no-hint examples compared with the baseline model’s prior behavior. In other words, the extra training makes the system better conditioned on hints, but less robust when no hint is provided.

A representative summary from the paper is as follows:

Zero-shot prompting is already competitive with SFT and CoT once a good prompt is used.
Correct hints almost always give the best LAVR and WER.
Mix typically performs close to correct, showing that adding a distractor does not hurt much if the right language is present.
Distractor-only hints are usually worse than no hint.
No-hint can degrade noticeably for the fine-tuned variants, especially in WER, due to the reduced share of no-hint training examples.

Additional languages in the appendix

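The appendix extends the same evaluation to German, Japanese, and Brazilian Portuguese, again in both monolingual and English code-switching settings. The reported results reinforce the main conclusions. In the tables below, ZS denotes the zero-shot baseline, and each cell gives LAVR (%) followed by WER or CER (%) in parentheses.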

Language	Method	correct	distractor	mix	no-hint
German	ZS	0.2 (8.3)	0.4 (9.1)	0.2 (8.4)	2.1 (9.0)
German	SFT	1.4 (8.5)	1.3 (8.8)	1.2 (8.6)	2.9 (8.8)
German	CoT	1.1 (8.9)	1.2 (9.3)	0.9 (8.9)	2.5 (16.5)
Japanese	ZS	1.7 (12.6)	40.8 (35.5)	14.0 (16.7)	6.7 (13.4)
Japanese	SFT	6.0 (12.8)	14.2 (16.1)	8.8 (13.5)	6.1 (13.1)
Japanese	CoT	4.6 (15.5)	15.5 (18.3)	8.5 (16.5)	6.1 (17.5)
Portuguese	ZS	0.1 (6.8)	0.2 (8.4)	0.0 (6.9)	1.0 (7.3)
Portuguese	SFT	0.6 (7.1)	0.8 (7.7)	0.4 (7.9)	1.8 (7.6)
Portuguese	CoT	0.4 (7.3)	0.6 (7.8)	0.4 (7.4)	1.3 (15.2)

Two notable observations stand out. First, the overall trend remains stable across languages: correct and mix hints are best, while distractor-only prompting is weakest. Second, Japanese is the most challenging case in the appendix: the zero-shot distractor LAVR on monolingual speech reaches 40.8%. SFT and CoT cut this to 14.2% and 15.5%, but they also raise LAVR under correct hints (6.0% and 4.6%, versus 1.7% for zero-shot). The trained variants therefore trade correct-hint adherence for distractor robustness rather than dominating the zero-shot baseline, which leaves the paper's main conclusion unchanged.

Language	Method	correct	distractor	mix	no-hint
German-English	ZS	0.3 (19.5)	0.4 (20.1)	0.3 (19.5)	0.5 (20.1)
German-English	SFT	0.3 (19.4)	0.3 (19.6)	0.3 (19.4)	0.6 (20.3)
German-English	CoT	0.4 (19.2)	0.4 (19.5)	0.4 (19.3)	0.5 (22.4)
Japanese-English	ZS	0.7 (16.2)	18.2 (20.3)	2.9 (16.5)	1.6 (15.9)
Japanese-English	SFT	1.0 (15.9)	3.9 (17.0)	1.3 (16.0)	1.6 (17.8)
Japanese-English	CoT	1.0 (15.1)	2.8 (16.5)	1.4 (15.2)	1.2 (16.8)
Portuguese-English	ZS	0.3 (16.4)	0.3 (17.2)	0.3 (16.4)	0.7 (17.1)
Portuguese-English	SFT	0.4 (16.9)	0.6 (19.3)	0.4 (17.1)	0.9 (18.2)
Portuguese-English	CoT	0.5 (16.6)	0.5 (17.3)	0.5 (16.1)	0.9 (23.0)

Interpretation of the results

The paper’s main conclusion is that the quality of the language hint is more important than the choice among zero-shot prompting, SFT, and CoT. In the reported experiments, zero-shot P3 is already close to the best trained variants, and the addition of supervised training does not reliably surpass it. This makes zero-shot prompting especially attractive when compute, training data, or iteration budget is limited.

The most actionable pattern is that the system benefits greatly when at least one correct language is present in the prompt. In contrast, distractor-only prompting is typically worse than no hint at all, which implies that upstream language prediction should aim to provide a reliable spoken-language estimate before the ASR model is invoked. The authors therefore recommend using language-ID models or metadata to predict at least one spoken language with high confidence.
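One way to act on this recommendation is to gate the hint on upstream language-ID confidence. The sketch below is hypothetical and not taken from the paper: the function name `choose_hints`, the score dictionary format, and both thresholds are assumptions chosen only to illustrate the idea of preferring no hint over a likely distractor.

```python
def choose_hints(
    lid_scores: dict[str, float],
    primary_threshold: float = 0.7,    # hypothetical value
    secondary_threshold: float = 0.15,  # hypothetical value
    max_hints: int = 3,
) -> list[str]:
    """Turn language-ID scores into a hint list for the ASR prompt.

    If the top language is not confident enough, return an empty list
    (the no-hint condition) instead of risking a distractor-only hint.
    Otherwise include the top language and any plausible runners-up,
    which corresponds to the mix condition.
    """
    ranked = sorted(lid_scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] < primary_threshold:
        return []
    hints = [ranked[0][0]]
    for lang, score in ranked[1:]:
        if score >= secondary_threshold and len(hints) < max_hints:
            hints.append(lang)
    return hints
```

The design leans on two findings reported above: a mix hint that contains the correct language performs close to a correct-only hint, while a distractor-only hint is the condition to avoid. Any real thresholds would need tuning against LAVR and WER on held-out data.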

The paper also emphasizes that the mix condition is usually a safe compromise: if the correct language is present, adding a distractor language rarely hurts much and often remains close to the correct-only condition. This is especially relevant for code-switching, where users may speak more than one language naturally and the prompt should be expressive enough to reflect that.

Limitations and scope

The paper is careful about limitations. First, the metric is coarse: it flags unexpected scripts/characters, but it cannot distinguish between mild and severe violations. It can also miss outputs that are wrong-language in meaning but composed entirely of allowed characters, or outputs that are gibberish but happen to stay within the acceptable character set.

Second, the evaluation relies on a proprietary ASR foundation model and proprietary datasets, so the exact numbers are not directly reproducible from public resources. The authors nevertheless argue that the qualitative findings are general: the importance of accurate language hints, the weakness of distractor-only hints, and the limited advantage of more complex prompting strategies.

Third, the paper does not provide a detailed latency or compute analysis. This matters because the CoT variant may be less suitable for low-latency streaming deployments, even if it is acceptable in the non-streaming setting used here.

Takeaways for a talking-head / conversational-AI team

Language adherence is a distinct ASR quality dimension, separate from WER alone.
Soft language hints are a practical compromise between strict language locking and fully unconstrained multilingual decoding.
Zero-shot prompt design matters a lot; the more tolerant P3 prompt is the best choice among those tested.
Correct language prediction upstream is the highest-leverage intervention; the presence of at least one correct hint dominates the choice among ZS, SFT, and CoT.
SFT and CoT are not clear wins over well-designed zero-shot prompting, and they can hurt no-hint behavior if the training mixture underrepresents that condition.
Distractor-only hints are risky and often worse than providing no hint.
Code-switching support is preserved by the soft prompting approach, which is important for natural conversational systems.