BLoRA Code-Switching Adaptation

Adding Robust Code-Switching Capabilities to High Performance Multilingual ASR

BLoRA is a Bayesian low-rank adaptation method for extending strong multilingual ASR models to handle English-German code-switching. It integrates new knowledge selectively, reducing errors on code-switched words by up to 32.87% while preserving monolingual accuracy, whereas naive fine-tuning degrades both.

asr
multimodal
tts

Authors: Enes Yavuz Ugan, Alexander Waibel

Categories: cs.CL, eess.AS

Comment: Accepted to INTERSPEECH 2026

Published 2026-06-20 · Updated 2026-06-20

Abstract

Code-switching (CSW) remains challenging for large multi-lingual ASR systems in real-world deployment. While fine-tuning on synthetic CSW data is possible, it generally degrades strong monolingual baselines. Our goal is to preserve these capabilities while extending models to handle complex code-switching, including morphological variations across languages. We propose Bayesian factorized adaptation, which learns to efficiently integrate switching-relevant knowledge into strong pretrained models without overwriting existing capabilities. Requiring only a small amount of synthetic data, our approach reduces transcription errors by 32.87% on code-switched words while improving overall WER by 5.31%, all while maintaining mono-lingual performance. Our results demonstrate that effective CSW adaptation depends more on knowledge integration than data complexity.

Introduction

This paper studies a deployment-oriented version of code-switching ASR: starting from an already strong multilingual speech model, can we add code-switching capability without erasing the model's existing monolingual strengths? The authors argue that this is the practically important setting for production ASR, but it is also the hardest one, because standard adaptation to synthetic code-switched data tends to overwrite a model's prior knowledge rather than extend it.

The paper focuses on English-German code-switching as a deliberately difficult test case. The starting Whisper v3 turbo model is already strong on both languages in the monolingual setting, reporting $8.53\%$ WER on German and $13.56\%$ WER on English on CommonVoice 14. The central claim is that, for such a strong baseline, the main bottleneck is not how complex the synthetic data pipeline is, but how the new code-switching knowledge is integrated into the model.

The work is positioned around four scenarios for code-switching research. The paper's focus is the most realistic but least studied one: improving code-switching while preserving a strong pretrained multilingual model. In contrast to prior work that either evaluates only on code-switching test sets, uses weak baselines, or does not verify preservation on diverse monolingual speech, this paper treats preservation as a first-class requirement.

Scenario 1: optimize only for in-domain code-switching.
Scenario 2: train jointly on monolingual and synthetic code-switching, but evaluate monolingual speech mostly in-domain.
Scenario 3: adapt weak baselines, where gains are hard to interpret.
Scenario 4: preserve a strong multilingual model while adding code-switching support; this is the paper's target.

The headline result is that ordinary LoRA fine-tuning on synthetic code-switching data is harmful in this regime, while Bayesian factorized adaptation, or BLoRA, can turn the same synthetic data into a net benefit. Depending on the setting, the paper reports up to a $32.87\%$ relative reduction in PIER on code-switched words and a $5.31\%$ relative improvement in overall WER, while keeping monolingual performance intact.

Core idea: knowledge integration, not just data generation

The paper's main thesis is that once the base ASR model is already strong, simply adding more synthetic code-switching data is not enough. In fact, naive fine-tuning can make the model substantially worse on both the source languages and the code-switched target behavior. The authors show this across multiple synthetic data sizes and across both simple and more elaborate synthesis pipelines.

Their alternative is a parameter-efficient adaptation strategy that explicitly constrains how new knowledge is written into the model. Rather than updating all weights densely, they use low-rank adaptation, and then extend it with a Bayesian treatment that encourages sparse, uncertainty-aware updates. The emphasis is on selectively integrating switching-relevant knowledge while preserving the original model's general multilingual competence.

Method

Synthetic code-switching text generation

The synthetic data pipeline begins with text generation. The authors use GPT-4o at temperature $0.3$ to generate German matrix sentences containing English insertions. The prompt is constrained by linguistic rules so that each English insertion matches the syntactic role, valency, reflexivity, and register of the German element it replaces, following the equivalence constraint. The generated text is intended to reflect natural German-English code-switching rather than naive word replacement.

Morphological integration is important in this setup. The paper explicitly notes that foreign words are inflected according to German morphology when they appear inside German sentences, for example an English verb can receive German endings. This is meant to capture the kind of hybrid forms observed in real code-switched speech, including sub-lexical or morphologically integrated switches.

The prompt also wraps substituted words in delimiter tags of the form §§...§§, which the authors use to recover switch locations automatically for downstream synthesis. This eliminates manual annotation of switch points and makes the pipeline fully synthetic and annotation-free.
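The tag-based recovery of switch points can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the `de`/`en` language labels, and the example sentence are assumptions made up for this sketch, and only the §§...§§ delimiter convention comes from the paper.

```python
import re

# Matches one tagged insertion, e.g. "§§Meeting§§" (delimiter form from the paper).
TAG = re.compile(r"§§(.+?)§§")


def split_by_language(tagged, matrix_lang="de", embedded_lang="en"):
    """Split a tagged transcript into ordered (language, text) segments."""
    segments = []
    pos = 0
    for match in TAG.finditer(tagged):
        before = tagged[pos:match.start()].strip()
        if before:
            segments.append((matrix_lang, before))
        segments.append((embedded_lang, match.group(1).strip()))
        pos = match.end()
    tail = tagged[pos:].strip()
    if tail:
        segments.append((matrix_lang, tail))
    return segments


def strip_tags(tagged):
    """Return the plain transcript that serves as the ASR training target."""
    return TAG.sub(r"\1", tagged)


# Hypothetical example sentence, invented for illustration.
example = "Wir haben das §§Meeting§§ auf morgen verschoben."
print(split_by_language(example))
print(strip_tags(example))
```

Keeping the segment list ordered matters because the same list can later drive per-language synthesis and concatenation.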

The full prompt and linguistic rules are provided in the authors' repository. The technical point for the method is that switch markers are produced during generation rather than by human labeling.

Multilingual TTS and stitching

After text generation, the authors synthesize speech using x-tts-v2, a multilingual TTS model with 58 speaker embeddings. They compare three synthesis strategies:

German: synthesize the entire transcript as German.
English: synthesize the entire transcript as English.
Stitching: use the delimiter tags to segment the transcript by language, synthesize segments separately, then concatenate them after trimming silence and smoothing boundaries.

The paper reports that the stitching strategy works best in listening tests and is used in the main experiments. The important design point is that the pipeline uses multilingual TTS to approximate code-switched acoustics without requiring any real code-switched speech recordings.
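The stitching strategy can be illustrated with a small sketch over raw sample lists. The summary does not say how silence is trimmed or how boundaries are smoothed, so the amplitude threshold and the linear crossfade below are assumptions, and `synthesize` is a hypothetical callback standing in for the TTS model.

```python
def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose magnitude is below threshold."""
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]


def crossfade_concat(left, right, overlap):
    """Join two clips, blending linearly over `overlap` samples (assumed smoothing)."""
    overlap = min(overlap, len(left), len(right))
    if overlap == 0:
        return left + right
    keep = len(left) - overlap
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the incoming clip
        mixed.append((1 - w) * left[keep + i] + w * right[i])
    return left[:keep] + mixed + right[overlap:]


def stitch(segments, synthesize, overlap=2, threshold=0.01):
    """Synthesize each (language, text) segment separately, then concatenate."""
    audio = []
    for lang, text in segments:
        clip = trim_silence(synthesize(text, lang), threshold)
        audio = crossfade_concat(audio, clip, overlap) if audio else clip
    return audio


# Hypothetical stand-in for a TTS call: returns a short padded clip.
def fake_tts(text, lang):
    level = 1.0 if lang == "de" else 0.5
    return [0.0, 0.0, level, level, level, 0.0]


print(len(stitch([("de", "Wir haben das"), ("en", "Meeting")], fake_tts)))
```

In a real pipeline the samples would be arrays at the TTS sampling rate and the overlap would span a few milliseconds rather than two samples.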

Adaptation methods: LoRA versus BLoRA

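The base adaptation method is LoRA. For a frozen pretrained weight matrix $\mathbf{W} \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$, LoRA adds a trainable low-rank update $$ \Delta \mathbf{W} = \frac{\alpha}{r} \mathbf{A}\mathbf{B}, \qquad \mathbf{A} \in \mathbb{R}^{d_{\text{out}} \times r}, \quad \mathbf{B} \in \mathbb{R}^{r \times d_{\text{in}}}, $$ where $r$ is the rank, $\alpha$ is a scaling constant, and $r \ll \min(d_{\text{in}}, d_{\text{out}})$. This is standard parameter-efficient fine-tuning.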
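A small pure-Python sketch makes the update and its parameter savings concrete. It assumes the factor shapes implied by the product $\mathbf{A}\mathbf{B}$ (A is $d_{\text{out}} \times r$, B is $r \times d_{\text{in}}$); the helper names and the layer width of 1280 are illustrative choices, not values taken from the paper.

```python
def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_delta(A, B, alpha):
    """Compute (alpha / r) * A @ B, where r is the shared inner dimension."""
    r = len(B)
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(A, B)]


def apply_lora(W, A, B, alpha):
    """Effective weight W + delta; W itself stays frozen during training."""
    delta = lora_delta(A, B, alpha)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


def lora_param_count(d_in, d_out, r):
    """Trainable parameters in the two factors."""
    return d_out * r + r * d_in


# Rank-1 toy example: A is 2x1, B is 1x3, so delta is 2x3.
A = [[1.0], [2.0]]
B = [[3.0, 4.0, 5.0]]
print(lora_delta(A, B, alpha=2.0))

# Illustrative square layer (width chosen for the example only) at rank 32.
print(lora_param_count(1280, 1280, 32), "trainable vs", 1280 * 1280, "dense")
```

At rank 32 the factors hold 5% of the parameters of the dense 1280 by 1280 matrix in this example, which is why the adapter is cheap to train and store.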

The paper's proposed method, BLoRA, turns the factor matrices into random variables with Gaussian posteriors. In the notation used in the paper, $$ q_\phi(A_{ij}) = \mathcal{N}(\mu_{ij}, \sigma_{ij}^2), \qquad q_\phi(B_{ij}) = \mathcal{N}(\mu'_{ij}, {\sigma'_{ij}}^2). $$ The priors for the means and standard deviations are centered at $0$ and $0.01$, respectively, which biases the learned update toward sparsity. The paper describes this as changing fixed low-rank weights into learned distributions, so the adapter can integrate new information more cautiously and overwrite less of the pretrained model.
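The following sketch shows the usual variational ingredients such a method relies on: a reparameterised draw of each factor entry and a closed-form Gaussian KL penalty. It is an assumption-laden illustration, not the paper's implementation. In particular, it reads the prior as a zero-mean Gaussian with standard deviation $0.01$ on each entry, and it assumes the training objective is the task loss plus $\lambda_{\text{KL}}$ times the summed KL term; the summary does not spell out either detail.

```python
import math
import random

PRIOR_MU = 0.0      # assumed prior mean
PRIOR_SIGMA = 0.01  # assumed prior standard deviation


def kl_gaussian(mu, sigma, prior_mu=PRIOR_MU, prior_sigma=PRIOR_SIGMA):
    """KL( N(mu, sigma^2) || N(prior_mu, prior_sigma^2) ) for one scalar entry."""
    return (
        math.log(prior_sigma / sigma)
        + (sigma ** 2 + (mu - prior_mu) ** 2) / (2 * prior_sigma ** 2)
        - 0.5
    )


def sample_entry(mu, sigma, rng):
    """Reparameterised draw: mu + sigma * eps with eps ~ N(0, 1)."""
    return mu + sigma * rng.gauss(0.0, 1.0)


def total_kl(mus, sigmas):
    """Sum the per-entry KL over a factor matrix stored as nested lists."""
    return sum(
        kl_gaussian(m, s)
        for mu_row, sigma_row in zip(mus, sigmas)
        for m, s in zip(mu_row, sigma_row)
    )


def blora_loss(task_loss, kl, lambda_kl=0.5):
    """Assumed objective: ASR loss plus weighted KL regulariser."""
    return task_loss + lambda_kl * kl


rng = random.Random(0)
mus = [[0.0, 0.05], [0.0, 0.0]]
sigmas = [[0.01, 0.01], [0.01, 0.01]]
sampled = [[sample_entry(m, s, rng) for m, s in zip(mr, sr)] for mr, sr in zip(mus, sigmas)]
print(total_kl(mus, sigmas))
```

Only the entry whose mean moved away from zero contributes to the penalty in this example, which is the mechanism behind the sparsity pressure: an entry pays a cost for leaving the prior, so the adapter changes few entries unless the data justify it.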

Conceptually, the paper argues that this Bayesian regularization is the key difference between successful and unsuccessful adaptation in the strong-baseline regime. The aim is not to make the adapter more expressive; it is to make it more selective.

Experimental setup

The experiments use Whisper v3 turbo as the pretrained multilingual ASR backbone. The adaptation methods are configured as follows:

LoRA: rank $r = 32$.
BLoRA: rank $r = 32$ and KL regularization weight $\lambda_{\text{KL}} = 0.5$.
Optimization: learning rate $10^{-3}$, warmup $2000$ steps, weight decay $5 \times 10^{-4}$, maximum $30000$ steps.

The paper states that $\lambda_{\text{KL}} = 0.5$ follows prior Bayesian LoRA work and is treated as a robust default.
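For reference, the reported hyperparameters can be collected in a single configuration object. The class and field names are hypothetical, and the schedule is an assumption: the summary gives only the warmup length, so the sketch uses linear warmup followed by a constant rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdaptationConfig:
    """Hyperparameters as reported in the summary; names are illustrative."""
    method: str = "blora"        # "lora" or "blora"
    rank: int = 32
    lambda_kl: float = 0.5       # only used by BLoRA
    learning_rate: float = 1e-3
    warmup_steps: int = 2000
    weight_decay: float = 5e-4
    max_steps: int = 30000


def warmup_lr(step, cfg):
    """Linear warmup to the peak rate, then constant (post-warmup shape assumed)."""
    if step < cfg.warmup_steps:
        return cfg.learning_rate * (step + 1) / cfg.warmup_steps
    return cfg.learning_rate


cfg = AdaptationConfig()
print(warmup_lr(0, cfg), warmup_lr(1999, cfg), warmup_lr(29999, cfg))
```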

The code-switching benchmark is CSFleurs, a recently published German-English CSW dataset. For monolingual preservation, the authors evaluate on CommonVoice 14.0. They explicitly choose CommonVoice because it is read speech, because it covers diverse topics, and because their synthetic training data is derived from CommonVoice transcripts, making it a conservative test of whether the model's prior capabilities are preserved.

Evaluation uses standard WER for overall transcription quality and PIER for the embedded-language words that matter most for code-switching. PIER is computed on manually annotated English code-switch points in CSFleurs. The annotation rules include lexical insertions, English function words, hybrid morphological forms, and English phrases, while excluding proper names, standardized codes, and fully integrated German loanwords.
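The difference between the two metrics is easiest to see on a toy example. The sketch below computes WER from a standard Levenshtein alignment and a simplified point-of-interest error rate that counts only substitutions and deletions on reference words flagged as code-switch points. This is a simplification for illustration: the actual PIER definition may handle insertions and alignment differently, and the sentence is invented.

```python
def align(ref, hyp):
    """Levenshtein alignment. Returns (op, ref_index) pairs with op in
    {"ok", "sub", "del", "ins"}; ref_index is None for insertions."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost, d[i - 1][j] + 1, d[i][j - 1] + 1)
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append(("ok" if ref[i - 1] == hyp[j - 1] else "sub", i - 1))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", i - 1))
            i -= 1
        else:
            ops.append(("ins", None))
            j -= 1
    return ops[::-1]


def wer(ref, hyp):
    """Word error rate: all edit operations over the reference length."""
    return sum(op != "ok" for op, _ in align(ref, hyp)) / len(ref)


def poi_error_rate(ref, hyp, poi):
    """Simplified PIER-style score: errors on flagged reference words only."""
    errors = sum(1 for op, idx in align(ref, hyp) if idx in poi and op in ("sub", "del"))
    return errors / len(poi)


# Invented example: the English insertion is the only word that goes wrong.
ref = ["das", "ist", "eine", "wichtige", "matter"]
hyp = ["das", "ist", "eine", "wichtige", "meta"]
print(wer(ref, hyp), poi_error_rate(ref, hyp, poi={4}))
```

One wrong word out of five gives a modest WER, while the point-of-interest score shows that the single code-switched word was missed entirely. That is the motivation for reporting both.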

The paper also notes an additional evaluation on DECM in a footnote: LoRA degrades substantially, whereas BLoRA remains near baseline; the remaining gap is attributed to acoustic mismatch between read-speech synthetic data and conversational speech.

Main results

The core empirical finding is stark: on this strong baseline, standard LoRA is not a safe way to add code-switching capability. Across all tested data sizes, LoRA sharply degrades monolingual WER and also leaves PIER well above the baseline. BLoRA, by contrast, can improve code-switching behavior while keeping monolingual performance stable or close to stable.

The table below collects the main numbers reported in the paper for the baseline, a prior synthetic-data pipeline from Nguyen et al. 2025, and the LoRA/BLoRA comparison across data sizes.

Setup	German WER	English WER	CSFleurs WER	CSFleurs PIER
Whisper baseline	8.53	13.56	11.49	26.59
Nguyen et al. 2025, 1word, 10k	22.54	44.05	28.81	38.91
Nguyen et al. 2025, 3word, 10k	30.78	47.90	35.51	37.49
Nguyen et al. 2025, 0.2, 10k	26.84	50.67	33.89	43.09
BLoRA, 1k	11.59	15.00	13.31	23.60
LoRA, 1k	44.15	66.05	66.00	82.30
BLoRA, 10k	9.77	13.68	11.37	22.25
LoRA, 10k	20.80	50.47	33.61	62.14
BLoRA, 20k	9.31	13.35	11.09	21.58
LoRA, 20k	17.69	49.19	30.21	56.46
BLoRA, 246k	9.29	13.59	10.88	20.84
LoRA, 246k	13.00	33.19	20.02	43.47

Several conclusions follow from these numbers. First, standard LoRA is unstable in the strong-baseline regime: for example, German WER rises from $8.53$ to $44.15$ at 1k samples, which is the kind of catastrophic degradation the paper highlights. Second, the prior data-generation method from Nguyen et al. 2025 also degrades both monolingual and code-switching performance. Third, BLoRA is consistently much safer and often beneficial, especially as the amount of synthetic data increases.

The best overall CSFleurs WER in the main table is $10.88$ with BLoRA and 246k utterances, which is a $5.31\%$ relative improvement over the Whisper baseline of $11.49$. The corresponding PIER is $20.84$, a relative reduction of roughly $21.6\%$ from the baseline $26.59$. The paper also reports an even stronger PIER reduction, down to $17.85$, under aggressive filtering at low data scale, which corresponds to a $32.87\%$ relative reduction.
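The relative figures follow directly from the table values, as a short calculation shows. The helper name is illustrative.

```python
def relative_reduction(baseline, adapted):
    """Relative improvement in percent: 100 * (baseline - adapted) / baseline."""
    return 100.0 * (baseline - adapted) / baseline


# CSFleurs WER: baseline 11.49, BLoRA at 246k utterances 10.88.
print(round(relative_reduction(11.49, 10.88), 2))  # 5.31

# CSFleurs PIER: baseline 26.59, best filtered low-data run 17.85.
print(round(relative_reduction(26.59, 17.85), 2))  # 32.87
```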

In contrast, LoRA's best reported CSFleurs WER in the table, $20.02$ at 246k utterances, is still far worse than the baseline $11.49$. The paper summarizes this as evidence that naive fine-tuning is fundamentally ill-suited to adapting an already-strong multilingual model: the adapter writes over the base model rather than extending it.

Effect of data quantity and filtering

One important ablation asks how much synthetic data is actually useful, and whether quality filtering matters. The authors synthesize text and speech, then re-transcribe each segment with Whisper-medium and filter out utterances whose segment-level CER is at least $40\%$. They also compare this default against stricter thresholds of $5\%$ and $20\%$ CER and against no filtering at all.
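The filtering step amounts to a character-level edit distance and a threshold. The sketch below assumes that an utterance is kept only if every one of its segments falls below the threshold and that no text normalisation is applied first; both are assumptions, since the summary does not specify how segment scores are combined. The re-transcription itself (Whisper-medium in the paper) is represented simply by the second string in each pair.

```python
def edit_distance(a, b):
    """Character-level Levenshtein distance with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate relative to the intended (reference) text."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def keep_utterance(segments, max_cer=0.40):
    """segments: (intended_text, retranscribed_text) pairs for one utterance.
    Keep it only if every segment's CER is strictly below max_cer."""
    return all(cer(ref, hyp) < max_cer for ref, hyp in segments)


# Hypothetical utterance: the short English segment was synthesized badly.
utterance = [("wir haben das", "wir haben das"), ("meeting", "mieten")]
print(keep_utterance(utterance, max_cer=0.40), keep_utterance(utterance, max_cer=0.05))
```

Scoring per segment rather than per utterance is what lets the filter catch a corrupted single-word insertion that would barely move the CER of a long sentence.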

The qualitative trend is that filtering matters most in the low-data regime, especially because the TTS model can hallucinate on short, single-word segments. With only 1k samples, aggressive filtering produces the best PIER. As the data size grows, the negative effect of noisy synthetic audio becomes less severe, but the best results still come from filtering rather than using unfiltered data.

Filter threshold	PIER, 1k samples	PIER, 10k samples	PIER, 20k samples	PIER, larger run
5% CER	17.85	18.37	18.44	18.15 at 40,869 utterances
20% CER	20.84	20.31	20.54	21.29 at 103,550 utterances
40% CER	23.60	22.26	21.58	20.84 at 246,503 utterances
No filter	23.90	24.20	24.72	23.00 at 580,000 utterances
Whisper baseline (no adaptation)	26.59	26.59	26.59	26.59

The best score in this study is $17.85$ PIER with a $5\%$ CER filter and only 1k samples. The authors interpret this as strong evidence that the pipeline should prioritize excluding obviously corrupted synthetic segments rather than maximizing raw quantity alone. They also observe that the advantage of filtering is strongest when data are scarce and becomes less dramatic as more data are added.

The qualitative example in the paper illustrates the same phenomenon. The base Whisper model repeatedly transcribes the English word "matter" as the German-looking token "meta", apparently due to acoustic similarity and a stronger German prior. After BLoRA adaptation with filtered synthetic data, the model correctly preserves "matter" in both occurrences. The example is used to argue that the method helps the model resist an overly strong monolingual prior when the English insertion is the intended transcription.

Speaker diversity versus text diversity

The paper includes a controlled ablation at a fixed budget of 6,535 utterances to ask whether gains come more from acoustic diversity or linguistic diversity. Two settings are compared under BLoRA: a text-rich, speaker-poor condition with one speaker and 6,535 different transcripts, and a speaker-rich, text-poor condition with 58 speakers and only 117 different transcripts.

Setup	German WER	English WER	CSFleurs WER	CSFleurs PIER
Whisper baseline	8.53	13.56	11.49	26.59
TextRich, BLoRA	9.15	13.16	10.85	21.58
SpeakerRich, BLoRA	10.33	14.11	12.03	22.11

This ablation suggests that, for this setup, linguistic diversity is slightly more valuable than acoustic diversity. Both settings improve PIER relative to baseline ($21.58$ and $22.11$ versus $26.59$), but only the text-rich variant also improves overall CSFleurs WER ($10.85$ versus the baseline $11.49$); the speaker-rich variant is slightly worse at $12.03$. The paper uses this to reinforce its broader argument that the critical factor is not just how much diversity is present, but how the synthetic knowledge is integrated into the pretrained model.

What the experiments collectively show

Naive synthetic-data fine-tuning is unsafe on strong models. LoRA often catastrophically degrades monolingual performance, even on read-speech CommonVoice.
More complex synthesis is not a substitute for better adaptation. The multi-stage pipeline from prior work underperforms BLoRA and, in all three reported configurations, is worse than the unadapted baseline on both monolingual WER and PIER.
Selective Bayesian integration is effective. BLoRA preserves the base model better while allowing code-switching knowledge to be added.
Quality filtering helps, especially at low data scale. Short, noisy synthetic segments can be particularly harmful if left unfiltered.
Text diversity matters slightly more than speaker diversity in the controlled 6,535-utterance ablation.

Limitations and scope

The paper is careful about the limits of the result. First, it is centered on one language pair, English-German, so the findings are strongest for that pair and for similar adaptation conditions. Second, the synthetic speech is derived from read-speech CommonVoice transcripts, so the method is naturally aligned with read speech rather than spontaneous conversational speech. This likely explains some of the residual gap the authors mention on DECM.

Third, the synthetic pipeline depends on accurate text generation, switch tagging, and TTS quality; the authors observe hallucinations in short segments, which is why filtering matters. Fourth, the paper does not claim that more synthetic data alone is always useful; with ordinary LoRA, even the largest synthetic set (246k utterances) leaves every reported metric worse than the unadapted baseline. The implication is that future work should treat adaptation design as the main lever, not just data volume or generation complexity.

Finally, the annotation scheme for PIER is manually constructed for CSFleurs, so the reported code-switching metrics are tied to that specific evaluation protocol. The paper nonetheless presents this as a strength, because PIER focuses directly on the embedded-language words where code-switching failures actually occur.

Conclusion

The paper's main contribution is a clear empirical demonstration that strong multilingual ASR models require a different adaptation philosophy than weak or from-scratch models. In this regime, the problem is not simply to generate more synthetic code-switched data; it is to integrate the new linguistic behavior without destroying existing competence. Bayesian factorized adaptation, as implemented with BLoRA, addresses exactly that issue by constraining the low-rank update with Bayesian regularization and sparsity pressure.

The result is a practical, annotation-free route to code-switching ASR: generate synthetic code-switched text with explicit linguistic constraints, synthesize speech with a multilingual TTS model, filter noisy segments, and adapt the base ASR model with a Bayesian low-rank update rather than standard fine-tuning. The authors' broader message is that, once foundation models are strong, adaptation methodology matters more than data methodology.

Code & Implementation

This repository provides supplementary resources accompanying the paper "Adding Robust Code-Switching Capabilities to High Performance Multilingual ASR". It focuses on the experimental materials including the prompting strategies for synthetic German–English code-switch generation, detailed annotation guidelines for code-switch tokens, and resources for data generation and evaluation.

Notably, the implementation of the model training, adaptation methods, and ASR system integration is not contained here. Instead, these components are available in a separate companion repository named continual-asr, which includes the full training and adaptation framework, various LoRA-based methods, and evaluation pipelines.

Thus, this repository acts primarily as a resource and documentation hub for the linguistic and experimental setup of code-switching data and annotation, rather than an implementation of the ASR system or adaptation algorithms themselves.