Translation-Enhanced Speech Encoder

Does Translation-Enhanced Speech Encoder Pre-training Affect Speech LLMs?

This paper studies how adding translation objectives in speech encoder pre-training improves cross-lingual, language-agnostic representations for Speech LLMs. The bidirectional translation task aligns speech embeddings better with the LLM's shared semantic space, boosting downstream speech recognition and translation.

llm
multimodal
asr
speech-to-speech

Authors: Tomoya Mizumoto, Yusuke Fujita

Categories: eess.AS, cs.CL, cs.SD

Comment: Accepted to Interspeech 2026

Published 2026-06-24 · Updated 2026-06-24

Abstract

Connecting a pre-trained speech encoder to a Large Language Model (LLM) is the standard architecture for building Speech LLMs. However, a structural misalignment exists between the encoder and the LLM. Unlike encoders based on automatic speech recognition, which often produce representations in separate language-specific spaces, LLMs operate within a unified language-agnostic space. A mechanism is required to align the encoder's language-specific representations with the LLM's shared space. We argue that speech translation provides a principled way to achieve this. Unlike monolingual transcription, translation requires the model to bridge different languages and learn language-agnostic representations. We experimentally evaluate the impact of incorporating translation objectives into speech encoder pre-training. Our results demonstrate that translation-enhanced pre-training improves cross-modal integration and leads to superior performance across downstream Speech LLM tasks.

Introduction

This paper studies a practical question in Speech LLM design: does the pre-training objective used for the speech encoder matter once that encoder is connected to a frozen LLM? The authors focus on the standard Speech LLM architecture in which a pre-trained speech encoder produces continuous acoustic features, a small trainable adaptor maps those features into the LLM embedding space, and the LLM itself remains frozen during adaptor training.

Their central argument is that a conventional ASR-oriented encoder is not necessarily the best interface for an LLM. ASR pre-training often organizes representations around language-specific speech-to-text mappings, whereas LLMs operate in a shared, language-agnostic semantic space. The paper hypothesizes that speech translation is a better pre-training signal because translation forces the encoder to extract meaning across languages rather than merely reproduce a transcript in the same language.

The main empirical claim is that adding translation objectives to speech encoder pre-training improves cross-modal alignment and downstream Speech LLM performance. The strongest variant is a symmetric bidirectional objective that trains on both $X \rightarrow \text{en}$ and $\text{en} \rightarrow X$ translation, rather than only the unidirectional non-English-to-English direction used by Whisper-style systems.

Problem Setup and Motivation

The paper frames the issue as a representational mismatch. Speech encoders trained for ASR or with self-supervised learning (SSL) tend to preserve acoustic and phonetic structure, and even when they capture semantics, the resulting mappings are often language-specific. In contrast, the LLM’s token space is shared across languages and relies on more abstract semantic organization. A lightweight adaptor must therefore bridge two very different spaces, and it becomes a bottleneck when the encoder representations are not sufficiently language-agnostic.

The authors propose that translation is a principled way to encourage language-agnostic abstractions. Translating speech into another language requires the model to ignore surface phonetic overlap and recover meaning. They further argue that English inputs may be under-trained in common Whisper-style setups because English speech is typically only exposed to monolingual transcription, while non-English speech is exposed to translation. Their experiment asks whether adding English-to-other-language translation during encoder pre-training improves the resulting Speech LLM, especially for English inputs.

Model Architecture

The overall Speech LLM architecture is intentionally standard so that changes in downstream performance can be attributed to the encoder pre-training objective rather than to architectural novelty. The system consists of three parts:

a speech encoder that converts waveform input into continuous representations,
a trainable adaptor that projects those representations into the LLM input embedding space, and
a frozen LLM that performs generation or classification conditioned on the projected speech embeddings.

The encoder used for downstream integration is the Whisper medium encoder. The decoder used during pre-training is discarded after the encoder is trained, and only the encoder is transferred to the Speech LLM pipeline.

The adaptor is deliberately lightweight: a two-layer CNN followed by a linear projection. The CNN downsamples temporal resolution to reduce the sequence length presented to the LLM, and the linear layer matches the hidden dimensionality expected by the LLM text embeddings.
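The downsampling effect of the CNN can be illustrated with the standard convolution output-length formula. The sketch below is only an illustration: the kernel size, stride, and padding are assumed values (this summary does not state them), and the 1500-frame input is the usual Whisper encoder output for a 30-second window.

```python
def conv1d_out_len(length: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a 1-D convolution (standard formula, dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


def adaptor_out_len(frames: int, layers=((3, 2, 1), (3, 2, 1))) -> int:
    """Sequence length after a stack of CNN layers.

    Each layer is (kernel, stride, padding). These defaults are
    ASSUMPTIONS chosen for illustration, not values from the paper.
    """
    for kernel, stride, padding in layers:
        frames = conv1d_out_len(frames, kernel, stride, padding)
    return frames


# Whisper encoders emit 1500 frames for a 30-second input window.
encoder_frames = 1500
llm_positions = adaptor_out_len(encoder_frames)
print(encoder_frames, "->", llm_positions)  # 1500 -> 375
```

Under these assumed settings the LLM sees a quarter as many positions as the encoder emits; the linear projection that follows changes only the feature dimension, not the sequence length.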

Two frozen LLM back ends are evaluated: Llama-3.2-1B-Instruct and Llama-3.2-3B-Instruct. During the main experiments, both the speech encoder and the LLM are kept frozen while the adaptor is trained. This design makes the comparison between encoder pre-training objectives as clean as possible.

Encoder Pre-training Objectives

The paper compares three encoder pre-training configurations, all built on the same Seq2Seq-style Whisper architecture:

ASR Only: multilingual transcription only, across English, Japanese, Chinese, and German.
ASR & ST ($X \rightarrow \text{en}$): multilingual transcription plus translation from Japanese, Chinese, and German into English, mirroring Whisper’s original asymmetric translation setup.
ASR & ST ($X \leftrightarrow \text{en}$): multilingual transcription plus bidirectional translation between English and the other three languages, so the encoder is trained on both $X \rightarrow \text{en}$ and $\text{en} \rightarrow X$.

The key methodological difference is the inclusion of English-to-non-English translation. The authors treat this as the mechanism that forces the encoder to abstract away from language-specific surface forms in a symmetric way across all input languages.

Prompt formatting for bidirectional translation

The original Whisper prompt format only naturally supports translation into English. To enable bidirectional generation, the paper adopts a redesigned prompt format inspired by multi-target decoding approaches. The target language token is placed before the task token as a provided condition, while the source language token is placed after the task token as a predicted attribute.

For example, German-to-English translation is formatted as <|BOS|> <|en|> <|translate|> <|de|>, while German transcription uses <|BOS|> <|de|> <|transcribe|> <|de|>. This explicit conditioning on target language is what allows the bidirectional translation objective to be implemented within the same Seq2Seq framework.
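The prompt layout described above can be written as a small helper. This is a minimal sketch: the function names and the language set are hypothetical, and only the token order (target language, task, source language) comes from the description.

```python
LANGS = {"en", "ja", "zh", "de"}  # the four pre-training languages


def build_prompt(source_lang: str, target_lang: str) -> list[str]:
    """Return the full decoder prefix for one training example.

    Token order follows the redesigned format: target language first
    (a provided condition), then the task, then the source language
    (an attribute the decoder learns to predict).
    """
    if source_lang not in LANGS or target_lang not in LANGS:
        raise ValueError("unsupported language code")
    task = "transcribe" if source_lang == target_lang else "translate"
    return ["<|BOS|>", f"<|{target_lang}|>", f"<|{task}|>", f"<|{source_lang}|>"]


def conditioning_prefix(target_lang: str, task: str) -> list[str]:
    """Tokens supplied at inference; the source-language token is left
    for the model to predict (hypothetical helper)."""
    return ["<|BOS|>", f"<|{target_lang}|>", f"<|{task}|>"]


print(" ".join(build_prompt("de", "en")))  # <|BOS|> <|en|> <|translate|> <|de|>
print(" ".join(build_prompt("de", "de")))  # <|BOS|> <|de|> <|transcribe|> <|de|>
print(" ".join(build_prompt("en", "ja")))  # <|BOS|> <|ja|> <|translate|> <|en|>
```

The third call shows the direction the original Whisper format has no natural way to express: English speech translated into a non-English target.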

Data and Training Pipeline

The experimental setup is split into two stages: base model pre-training for the encoder-decoder speech model, and Speech LLM training for the adaptor.

Base model pre-training data

The base speech corpus contains approximately 130k hours across four languages: English (73.6k hours), Japanese (36.2k hours), German (10.0k hours), and Chinese (9.8k hours). The corpus is assembled from multiple public datasets: LibriSpeech for English, ReazonSpeech for Japanese, Multilingual LibriSpeech for German, WenetSpeech for Chinese, plus subsets of YODAS-OWSMv4 and Common Voice across all four languages.

Because the underlying corpora are mostly monolingual transcription datasets, the translation supervision is synthetic. The paper uses Qwen2.5-32B-Instruct to translate the original transcriptions into the desired target language for the translation objectives.

The task mixtures are:

ASR Only: 100% transcription for all languages.
ASR & ST ($X \rightarrow \text{en}$): English is 100% transcription; non-English languages use a 75% transcription / 25% translation-to-English mix.
ASR & ST ($X \leftrightarrow \text{en}$): the same 75% transcription / 25% translation mix is applied uniformly across all four languages.
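To get a rough sense of how much translation supervision each mixture implies, the ratios can be combined with the per-language hours. The sketch assumes the 75% / 25% split applies in proportion to audio hours, which is an assumption; the split may instead be defined per utterance or per sampled batch.

```python
# Per-language corpus size in thousands of hours (from the data description).
HOURS_K = {"en": 73.6, "ja": 36.2, "de": 10.0, "zh": 9.8}

# Fraction of each language's data used for translation in each mixture.
ST_FRACTION = {
    "ASR Only": {lang: 0.0 for lang in HOURS_K},
    "ASR & ST (X->en)": {"en": 0.0, "ja": 0.25, "de": 0.25, "zh": 0.25},
    "ASR & ST (X<->en)": {lang: 0.25 for lang in HOURS_K},
}


def translation_hours_k(mixture: str) -> float:
    """Approximate translation hours (in thousands), ASSUMING the mix
    ratio is applied proportionally to audio hours."""
    return sum(HOURS_K[lang] * frac for lang, frac in ST_FRACTION[mixture].items())


for name in ST_FRACTION:
    print(f"{name}: {translation_hours_k(name):.1f}k hours of translation")
```

Under that assumption the unidirectional mixture yields about 14.0k hours of translation and the bidirectional mixture about 32.4k hours, with the difference coming entirely from English speech.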

Speech LLM training data

The adaptor is trained on a separate multi-task speech dataset of about 6.2k hours covering transcription, translation, intent classification, and emotion recognition. Transcription covers English, Japanese, Chinese, and German. Translation covers both directions between English and each of Japanese, Chinese, and German, plus English-to-X translation into four additional target languages not used in the base pre-training stage: Persian, Indonesian, Swedish, and Turkish.

The datasets named in the paper for this stage are VoxPopuli, FLEURS, AISHELL, JSUT, CoVoST2, and SpeechBSD for transcription and translation; SLURP and the German subset of Speech-MASSIVE for intent classification; and MELD for emotion recognition. The authors also add 300 hours of transcription data and 200 hours of translation data sampled from the pre-training corpora. To reduce task imbalance, they oversample smaller tasks, for example upsampling MELD by a factor of 5.

Optimization details

Base model pre-training runs for 3 epochs with a global batch size of 512. The learning rate uses a piecewise-linear warmup: it rises to $5 \times 10^{-5}$ over the first 15k steps and then to the peak value of $2 \times 10^{-4}$ over the next 15k steps, followed by cosine decay. This stage uses 16 NVIDIA H100 GPUs.

The adaptor training stage runs for 25k steps with a global batch size of 512. The peak learning rate is $1 \times 10^{-4}$ with a 500-step linear warmup and cosine decay afterward. This stage uses 8 NVIDIA H100 GPUs.
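Both schedules can be expressed with one function: linear interpolation through a list of warmup knots, then cosine decay. The sketch below makes two assumptions not stated above: the learning rate starts at 0, and cosine decay ends at `min_lr = 0` at the final step. The pre-training total of 200,000 steps is a placeholder, since only the epoch count is given.

```python
import math


def lr_at(step, knots, total_steps, min_lr=0.0):
    """Learning rate at `step`.

    knots: ascending [(step, lr), ...]; warmup interpolates linearly
    between consecutive knots, starting from (0, 0.0) by ASSUMPTION.
    After the last knot, cosine decay runs to `min_lr` at `total_steps`.
    """
    prev_step, prev_lr = 0, 0.0
    for knot_step, knot_lr in knots:
        if step <= knot_step:
            frac = (step - prev_step) / (knot_step - prev_step)
            return prev_lr + frac * (knot_lr - prev_lr)
        prev_step, prev_lr = knot_step, knot_lr
    progress = min(1.0, (step - prev_step) / (total_steps - prev_step))
    return min_lr + 0.5 * (prev_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Base pre-training: two warmup segments, then cosine decay.
PRETRAIN_KNOTS = [(15_000, 5e-5), (30_000, 2e-4)]
PRETRAIN_TOTAL = 200_000  # placeholder, not from the paper

# Adaptor training: 500-step warmup to 1e-4, 25k steps in total.
ADAPTOR_KNOTS = [(500, 1e-4)]
ADAPTOR_TOTAL = 25_000

for s in (0, 15_000, 22_500, 30_000):
    print(s, lr_at(s, PRETRAIN_KNOTS, PRETRAIN_TOTAL))
```

The two-segment warmup keeps the learning rate low for the first 15k steps and only then ramps to the peak, a gentler start than a single linear ramp to $2 \times 10^{-4}$.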

Evaluation Protocol

The paper evaluates four downstream task families to test both generation and understanding:

ASR: WER for English and German, CER for Japanese and Chinese, evaluated on FLEURS.
Speech translation: BLEU for both $X \rightarrow \text{en}$ and $\text{en} \rightarrow X$ directions. For $\text{en} \rightarrow X$, results are reported as CoVoST2 / FLEURS.
Intent classification: accuracy on SLURP and Speech-MASSIVE.
Emotion recognition: accuracy on MELD.

All reported scores are averaged over three independent inference runs. The main setting freezes both the speech encoder and the LLM, so the only trainable component during Speech LLM integration is the adaptor. The paper also includes a supplementary experiment where the encoder is unfrozen during the Speech LLM stage to see whether the benefits of translation-enhanced pre-training persist under joint adaptation.
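For readers less familiar with the ASR metrics, WER and CER are the same edit-distance computation applied to words and to characters respectively. The following is a minimal per-utterance sketch; it assumes no text normalization and strips spaces for CER, whereas real evaluations usually normalize text and aggregate errors over the whole test set. The averaging helper mirrors the three-run protocol, and all function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions,
    insertions, and deletions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, splitting on whitespace."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, ignoring spaces (an assumption)."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def mean_over_runs(scores):
    """Average a metric over independent inference runs."""
    return sum(scores) / len(scores)


print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion in six words
print(cer("こんにちは", "こんにちわ"))  # one substitution in five characters
```

CER is the natural choice for Japanese and Chinese because those scripts do not mark word boundaries with spaces, so a word-level metric would depend on the choice of segmenter.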

Main Results: ASR and Speech Translation

The strongest and most consistent result is that translation-enhanced pre-training improves downstream performance across both model sizes. The gains are visible in ASR, speech translation into English, and speech translation out of English. The bidirectional objective $X \leftrightarrow \text{en}$ is usually the best configuration.

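Downstream ASR and speech translation results for frozen-encoder Speech LLMs. ASR columns give error rates (WER for en and de, CER for ja and zh; lower is better). ST columns give BLEU (higher is better); $\text{en} \rightarrow X$ entries are CoVoST2 / FLEURS. Target languages ja, zh, and de were seen in encoder pre-training; fa, id, sv, and tr were not.
LLM	Pre-training mixture	ASR en	ASR ja	ASR zh	ASR de	ST $X \rightarrow \text{en}$ ja	ST $X \rightarrow \text{en}$ zh	ST $X \rightarrow \text{en}$ de	ST $\text{en} \rightarrow X$ ja (seen)	ST $\text{en} \rightarrow X$ zh (seen)	ST $\text{en} \rightarrow X$ de (seen)	ST $\text{en} \rightarrow X$ fa (unseen)	ST $\text{en} \rightarrow X$ id (unseen)	ST $\text{en} \rightarrow X$ sv (unseen)	ST $\text{en} \rightarrow X$ tr (unseen)
1B	ASR Only	16.6	29.2	30.0	27.1	7.1	7.2	21.3	15.4/13.5	21.5/18.7	16.6/14.7	5.8/5.9	16.4/18.6	16.5/17.2	3.7/3.8
1B	ASR & ST ($X \rightarrow \text{en}$)	16.3	21.1	25.9	26.1	10.5	10.0	23.4	15.9/13.7	21.4/18.7	16.8/15.5	6.1/6.4	16.6/19.2	16.7/17.6	3.9/3.8
1B	ASR & ST ($X \leftrightarrow \text{en}$)	14.6	19.7	23.0	24.3	11.8	11.3	23.9	18.2/15.7	24.8/21.3	19.3/18.2	7.7/7.6	19.4/22.0	19.9/20.9	5.3/4.5
3B	ASR Only	11.6	22.4	24.3	42.5	11.6	11.4	28.7	20.8/18.9	28.3/26.5	22.1/22.2	9.8/12.0	22.5/27.0	23.7/24.8	7.6/8.8
3B	ASR & ST ($X \rightarrow \text{en}$)	11.6	16.6	21.4	26.2	15.1	14.5	30.0	20.9/19.6	28.6/26.9	22.4/22.3	10.1/12.2	22.7/26.5	23.8/25.6	7.8/9.0
3B	ASR & ST ($X \leftrightarrow \text{en}$)	11.0	15.8	21.1	24.3	15.1	15.5	30.9	22.7/21.3	30.8/28.5	24.2/24.5	11.3/13.2	24.5/28.5	26.3/27.2	9.0/10.6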

For the 1B LLM, moving from ASR-only to $X \rightarrow \text{en}$ reduces Japanese CER from 29.2 to 21.1 and improves ST into English for the non-English source languages. The bidirectional setting is better still, lowering ASR errors further and producing the best ST scores across almost all columns.
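The size of the Japanese improvement is easier to judge as a relative error reduction. The snippet below only does arithmetic on the 1B Japanese CER values quoted from the table; the helper name is illustrative.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Relative error reduction in percent (positive means fewer errors)."""
    return 100.0 * (baseline - new) / baseline


# Japanese CER with the 1B LLM, taken from the results table.
ja_cer = {"ASR Only": 29.2, "X->en": 21.1, "X<->en": 19.7}

for name in ("X->en", "X<->en"):
    gain = relative_reduction(ja_cer["ASR Only"], ja_cer[name])
    print(f"{name}: {gain:.1f}% relative CER reduction")
# X->en: 27.7% relative CER reduction
# X<->en: 32.5% relative CER reduction
```

Most of the Japanese gain therefore comes from adding any translation objective at all, with the bidirectional objective contributing a smaller additional step.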

For the 3B LLM, the same pattern holds. The bidirectional objective provides the strongest ASR results overall and the best speech translation performance in the out-of-English direction. The improvements are especially notable for English-to-other-language translation, which the authors emphasize as evidence that the pre-trained encoder is helping unlock the frozen LLM's multilingual generative capacity.

An important detail is that the $\text{en} \rightarrow X$ gains extend to target languages that were unseen during Seq2Seq pre-training but present during Speech LLM training: Persian, Indonesian, Swedish, and Turkish. This suggests that the representation learned under bidirectional translation is not merely memorizing the pre-training language pairs; instead, it yields a more general interface that helps the frozen LLM perform multilingual generation more broadly.

Classification Results: Intent and Emotion

The paper also tests whether translation-enhanced pre-training helps tasks that require more than transcription, namely spoken intent classification and emotion recognition. The results show a clear benefit for intent understanding, but not for emotion recognition.

Classification accuracy for the 3B model.
Pre-training mixture	Intent en	Intent de	Emotion en
ASR Only	57.3	57.9	49.2
ASR & ST ($X \rightarrow \text{en}$)	58.5	62.1	50.3
ASR & ST ($X \leftrightarrow \text{en}$)	64.5	66.3	49.5

The authors interpret these results as follows. Intent classification depends on extracting semantic content and user goals, so translation objectives help by pushing the encoder toward language-agnostic meaning representations. The gains are largest when bidirectional translation is used, with English intent accuracy rising from 57.3 to 64.5 and German from 57.9 to 66.3.

Emotion recognition, by contrast, depends more on fine-grained acoustic and paralinguistic cues than on semantic abstraction. As a result, translation-enhanced pre-training does not materially improve MELD accuracy. Importantly, it also does not harm performance: the bidirectional objective yields 49.5 versus 49.2 for ASR-only, while the unidirectional configuration gives the highest value at 50.3. The paper uses this to argue that semantic abstraction via translation can be introduced without sacrificing acoustic-sensitive tasks.

Effect of Unfreezing the Encoder

The supplementary experiment unfreezes the encoder during Speech LLM training, more closely resembling some existing end-to-end Speech LLM pipelines. Even in this setting, the bidirectional pre-training objective remains best.

Downstream translation with an unfrozen encoder, evaluated on the 3B model. $\text{en} \rightarrow X$ entries are CoVoST2 / FLEURS.
Pre-training mixture	ST $X \rightarrow \text{en}$ avg. BLEU	ST $\text{en} \rightarrow X$ avg. BLEU
ASR Only	20.1	20.4/21.0
ASR & ST ($X \rightarrow \text{en}$)	22.1	21.1/21.5
ASR & ST ($X \leftrightarrow \text{en}$)	22.9	21.9/22.5

Unfreezing the encoder improves the absolute scores, but the ranking of the three pre-training objectives does not change. The authors take this as evidence that the benefit of symmetric translation pre-training is not an artifact of freezing the encoder. Instead, it provides a stronger starting representation that remains advantageous even when additional downstream adaptation is allowed.

Interpretation and Contributions

The paper's main contribution is not a new Speech LLM architecture, but rather a controlled study of which speech encoder pre-training objective best supports LLM integration. Within the same Whisper-style encoder, the same adaptor, and the same frozen LLM integration pipeline, the authors isolate the effect of adding translation objectives. The empirical message is that monolingual ASR is not sufficient if the goal is to produce representations that interface cleanly with a language-agnostic LLM space.

A second contribution is the comparison between asymmetric and symmetric translation pre-training. Whisper-style $X \rightarrow \text{en}$ translation helps, but the full $X \leftrightarrow \text{en}$ objective consistently does better. This supports the paper's conceptual claim that English-to-other-language translation forces the encoder to abstract meaning more completely, including for English inputs that would otherwise only see transcription supervision.

A third contribution is the demonstration that these gains are not narrowly confined to the exact pre-training language pairs. The bidirectional objective improves generation for unseen target languages in the later Speech LLM stage, suggesting better cross-modal integration rather than simple memorization of paired training data.

What the Paper Actually Shows About the Representation Space

The results summarized here include no direct embedding visualizations or probing analyses, so the evidence is entirely behavioral. Still, the result pattern is consistent with the following interpretation: translation objectives encourage the encoder to produce representations that are easier for a frozen LLM to consume because they are less tied to language-specific phonetic surfaces and more tied to semantic content. The effect is strongest when translation is symmetric across input languages.

The tasks also separate semantic and acoustic effects. ASR, speech translation, and intent classification benefit from translation pre-training, while emotion recognition largely does not. The authors use this contrast to argue that the method improves semantic abstraction without erasing lower-level acoustic information needed by paralinguistic tasks.

Limitations and Scope

The paper's own experimental scope is intentionally controlled, but that also defines its limitations:

The base pre-training study is limited to four languages: English, Japanese, Chinese, and German.
Translation supervision is synthetic, generated by an LLM from monolingual transcriptions, rather than being collected as human parallel speech translation data.
The main Speech LLM experiments freeze both the encoder and the LLM, so the conclusions are about representation quality under adaptor-only integration rather than about fully end-to-end adaptation.
The architecture choices are fixed to Whisper medium for the encoder backbone, Llama 3.2 for the LLM, and a simple CNN-plus-linear adaptor, so the results are strongest as evidence about training objective design within this specific framework.
The reported evidence is performance-based; no probing, interpretability, or theoretical analysis beyond the task results is summarized here.

These constraints do not invalidate the findings, but they do mean the paper’s conclusions are best read as a strong empirical case for symmetric translation pre-training under a standard Speech LLM recipe, rather than as a universal proof across all possible speech encoders, languages, or adaptor designs.

Conclusion

The paper concludes that bidirectional speech translation pre-training is a better foundation for Speech LLMs than transcription-only or one-way translation pre-training. In the authors' experiments, the $X \leftrightarrow \text{en}$ objective consistently improves ASR, speech translation, and intent classification, while leaving emotion recognition essentially intact. The results hold for both 1B and 3B frozen LLM back ends, and they remain visible even when the encoder is later unfrozen.

The practical takeaway for a conversational-AI or talking-head system is straightforward: if a Speech LLM depends on a pre-trained speech encoder plus a small adaptor, then encoder pre-training should not be treated as a generic ASR problem. Training the encoder to translate in both directions appears to produce representations that align better with a frozen LLM’s shared semantic space, which in turn yields better downstream generative behavior.