UR-BERT

UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction

UR-BERT is a Romanized transcription-based text encoder for massively multilingual TTS. It scales to 495 languages without phoneme systems and uses speech token prediction to improve phonetic fidelity and alignment, enabling strong results especially for low-resource and unseen languages.

Demos

The demos illustrate UR-BERT's architecture and how it scales massively multilingual TTS through universal Romanization and speech token prediction. Points to look for are coverage of 495 languages, phonetic fidelity, text-speech alignment, and generalization to unseen languages.

Architecture overview of UR-BERT illustrating the universal Romanization and speech token prediction approach for multilingual TTS.

Authors: Sangmin Lee, Eekgyun Ahn, Woongjib Choi, Hong-Goo Kang

Categories: cs.CL, cs.SD

Comment: Accepted to Interspeech 2026. GitHub: https://github.com/sanghyang00/ur-bert

Published 2026-06-10 · Updated 2026-06-11

Abstract

We propose UR-BERT, a Romanized transcription-based text-to-speech (TTS) encoder for massively multilingual TTS systems. Conventional grapheme-to-phoneme (G2P)-based approaches are limited to around 100 languages due to the availability of reliable G2P resources. In contrast, UR-BERT scales to 495 languages by unifying diverse writing systems into a shared Romanization representation. To further enhance phonetic fidelity and text-speech alignment, we introduce a speech token prediction objective during training, which encourages the encoder to learn speech-aware phonetic representations in a data-efficient manner. Experiments show that TTS systems built on UR-BERT consistently outperform recent text encoder baselines across a wide range of languages and resource conditions, and demonstrate strong generalization to unseen languages.

Introduction and Motivation

The paper tackles a specific bottleneck in massively multilingual text-to-speech (TTS): text encoders that rely on language-specific grapheme-to-phoneme (G2P) toolkits do not scale to the long tail of the world’s languages. The authors argue that the encoder side of modern encoder-decoder TTS has been comparatively underexplored, even though encoder quality strongly affects text-speech alignment and phonetic accuracy. Prior BERT-style TTS encoders improved naturalness, but they usually operate on phoneme sequences produced by G2P systems, which limits language coverage to roughly one hundred languages and creates a structural dependency on external linguistic resources.

UR-BERT is proposed as a multilingual, speech-aware TTS text encoder that removes the G2P dependency by using Romanization as a shared text interface across writing systems. The model is pretrained on speech-text pairs from 495 languages and adds an auxiliary speech token prediction objective so that the text encoder learns not just textual context but also acoustic structure. The intended outcome is a text encoder that can be used in TTS systems for both high-resource and low-resource languages, while remaining usable for unseen languages.

Core claims from the paper:

Romanization can serve as a scalable replacement for phoneme conversion in multilingual TTS.
Speech token prediction injects acoustic information into the text encoder and improves alignment.
UR-BERT outperforms prior BERT-style TTS encoders across the evaluated languages and resource conditions.

Overview of UR-BERT showing the pretraining and finetuning stages.

Positioning Against Prior Work

The related-work discussion centers on multilingual PLBERT (m-PLBERT) and XPhoneBERT. m-PLBERT follows the PL-BERT idea and is pretrained on phoneme sequences from 15 languages generated with Phonemizer. XPhoneBERT extends this to 88 languages using CharsiuG2P. Both improve multilingual TTS, but both remain constrained by the availability and quality of G2P systems, and their language distribution is still skewed toward better-resourced European and Asian languages.

UR-BERT’s novelty is therefore not only larger language coverage, but also a change in representation: it avoids phoneme inventories and instead maps diverse scripts into Romanized strings. This is meant to make the encoder language-agnostic and to reduce the engineering burden associated with maintaining per-language phoneme resources.

Reported scale of the multilingual text encoders discussed in the paper.
Model	Languages used for training	Languages supported	Text pretraining data	Speech data
m-PLBERT	15	127	150K sentences	None reported
XPhoneBERT	88	99	330M sentences	None reported
UR-BERT	495	1,162 (empirically validated in prior studies)	8M sentences	13K hours

UR-BERT Method

Architecture

UR-BERT uses a standard BERT-base-style encoder with a character-level tokenizer and 12 Transformer encoder layers. The encoder is pretrained on speech-text pairs rather than text alone. During pretraining it combines masked language modeling with an auxiliary speech token prediction objective.

The material summarized here does not include a closed-form loss equation, but conceptually the training signal is the sum of the usual masked-text reconstruction objective and a prediction objective over discrete speech tokens aligned to characters.
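As a rough illustration of the objective described above, the following sketch sums a masked-character cross-entropy term and a speech-token cross-entropy term. The function names, the stp_weight parameter, and the choice to average each term over its positions are assumptions made for illustration; none of them is specified in the summary.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def mean_cross_entropy(logit_rows, targets):
    """Average cross-entropy over a list of positions."""
    losses = [cross_entropy(row, t) for row, t in zip(logit_rows, targets)]
    return sum(losses) / len(losses)


def combined_loss(mlm_logits, mlm_targets, stp_logits, stp_targets, stp_weight=1.0):
    """Hypothetical total loss: MLM term plus a weighted STP term."""
    mlm = mean_cross_entropy(mlm_logits, mlm_targets)
    stp = mean_cross_entropy(stp_logits, stp_targets)
    return mlm + stp_weight * stp


# Toy example: one masked character over a 2-symbol vocabulary and
# one speech-token prediction over a 4-entry codebook, both with uniform logits.
loss = combined_loss([[0.0, 0.0]], [0], [[0.0, 0.0, 0.0, 0.0]], [3])
```

With uniform logits, each term reduces to the log of its vocabulary size, so the toy loss equals log 2 + log 4.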

Romanization as the text interface

The first design choice is to Romanize all input text into Latin characters. The motivation is practical as well as representational. Compared with phoneme-based encodings, Romanization avoids dependence on language-specific G2P systems and keeps the symbol inventory compact. The authors contrast this with IPA-style phoneme encodings, which can require very large symbol sets and complicated tokenization rules, especially when diacritics and suprasegmental markers are treated inconsistently.

In the paper’s framing, Romanization reduces the token inventory to roughly 30 alphabetic symbols, while still preserving sufficient phonetic information for TTS. The authors cite the Uroman toolkit as a practical example of a transliteration approach that can scale across writing systems.
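To make the idea of a compact, shared character inventory concrete, here is a toy sketch of Romanization followed by character-level tokenization. The transliteration table is a hypothetical six-entry stand-in and does not reproduce Uroman's rules; the vocabulary layout and special tokens are likewise assumptions.

```python
import unicodedata

# Hypothetical toy table for a few Cyrillic letters (NOT Uroman's rules).
TOY_TABLE = {"п": "p", "р": "r", "и": "i", "в": "v", "е": "e", "т": "t"}

# Assumed vocabulary: two special tokens, space, and 26 Latin letters.
VOCAB = ["<pad>", "<mask>", " "] + list("abcdefghijklmnopqrstuvwxyz")
CHAR_TO_ID = {ch: i for i, ch in enumerate(VOCAB)}


def toy_romanize(text):
    """Lowercase, transliterate via the toy table, and strip diacritics."""
    out = []
    for ch in text.lower():
        ch = TOY_TABLE.get(ch, ch)
        # Decompose accented letters and drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


def encode(text):
    """Map Romanized characters to IDs, skipping symbols outside the vocabulary."""
    return [CHAR_TO_ID[ch] for ch in toy_romanize(text) if ch in CHAR_TO_ID]


romanized = toy_romanize("Привет")
ids = encode("ab c")
```

Whatever the input script, the encoder only ever sees IDs from the same small table, which is the property the Romanization interface is meant to provide.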

Speech token prediction

Romanization improves coverage, but it also introduces ambiguity because multiple pronunciations can collapse onto the same Romanized form. To recover phonetic detail, UR-BERT uses a knowledge-distillation-style pretraining target called speech token prediction (STP). A pretrained multilingual speech self-supervised model serves as the teacher, and the student encoder learns to predict discrete speech tokens derived from the teacher’s representations.

The paper describes the STP pipeline as three steps: extract speech representations from a multilingual self-supervised speech model, align them to text at the character level, and discretize the aligned vectors into a token codebook. The teacher model used for extraction is omnilingual-ASR-W2V-300M, and the representation is taken from layer 16 because intermediate layers are expected to contain more phonetic information than higher semantic layers.

Illustration of the CTC-based speech-text alignment.

For alignment, the authors use CTC-based forced alignment with MMS-FA, followed by average pooling over the aligned frames for each character. This converts a long speech sequence into a sequence of character-level acoustic representations. Those vectors are then clustered with k-means to form a discrete codebook.
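The pooling step can be sketched as follows. The frame spans would come from a forced aligner such as MMS-FA; here they are hard-coded, and returning None for a character with no aligned frames is an assumption of this sketch rather than a documented behavior.

```python
def average_pool_by_character(frames, spans):
    """Average the frames aligned to each character.

    frames: list of equal-length feature vectors, one per speech frame.
    spans:  list of (start, end) frame indices per character, end exclusive.
    Returns one pooled vector per character, or None for an empty span.
    """
    pooled = []
    for start, end in spans:
        segment = frames[start:end]
        if not segment:
            pooled.append(None)  # no frames were aligned to this character
            continue
        dim = len(segment[0])
        pooled.append([sum(f[d] for f in segment) / len(segment) for d in range(dim)])
    return pooled


# Five 2-dimensional frames aligned to three characters.
frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
spans = [(0, 2), (2, 2), (2, 5)]
char_vectors = average_pool_by_character(frames, spans)
```

The output has one entry per character regardless of how many frames the utterance contains, which is what lets the speech side line up with the character-level text side.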

The codebook size is fixed at 257: index 0 is a mute token, and indices 1 through 256 represent acoustic tokens. The authors explicitly avoid larger codebooks because they may capture speaker-specific or paralinguistic variation instead of phonetic content, and because larger vocabularies can destabilize training given the compact text side. This design choice is presented as a balance between representational capacity and phonetic abstraction.
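A minimal sketch of the token assignment, assuming the k-means centroids have already been fitted. Only the index layout (0 for mute, 1 through 256 for acoustic tokens) comes from the text; mapping a character without an acoustic vector to the mute token is an assumption, and the three-centroid codebook is a toy stand-in.

```python
MUTE_TOKEN = 0  # index 0 is reserved for the mute token


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def to_speech_tokens(char_vectors, centroids):
    """Map pooled character vectors to token IDs in 1..len(centroids).

    A vector of None is mapped to MUTE_TOKEN (an assumption of this sketch).
    """
    tokens = []
    for vec in char_vectors:
        if vec is None:
            tokens.append(MUTE_TOKEN)
            continue
        nearest = min(range(len(centroids)), key=lambda k: squared_distance(vec, centroids[k]))
        tokens.append(nearest + 1)  # shift by one so that 0 stays reserved
    return tokens


# Toy codebook with 3 centroids; the codebook described above has 256 acoustic entries.
centroids = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]]
tokens = to_speech_tokens([[1.9, 0.1], None, [0.2, 3.5]], centroids)
```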

Pretraining Data and Optimization

The pretraining corpus combines three ASR-style speech-text datasets: FLEURS, Common Voice, and the Omnilingual ASR corpus. The reported corpus contains approximately 13K hours of speech, 8M sentences, and 495 languages. FLEURS contributes 102 read-speech languages, Common Voice covers 131 languages, and Omnilingual ASR contributes 348 low-resource languages.

The appendix adds two important preprocessing details. First, samples containing digits or parenthetical expressions were removed because their pronunciations can be ambiguous or inconsistently realized across languages. Second, the longer Omnilingual ASR utterances were segmented into chunks of up to 30 seconds using MMS-FA to reduce padding and computation while better matching the duration distribution of FLEURS and Common Voice.
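The first filtering rule could look roughly like this. The exact pattern used by the authors is not given, so the character class below (Unicode digits plus ASCII and full-width parentheses) and the function name are assumptions.

```python
import re

# \d matches Unicode decimal digits in str patterns; the remaining characters
# are ASCII and full-width parentheses.
DIGIT_OR_PAREN = re.compile(r"[\d()（）]")


def keep_transcript(text):
    """Return True if the transcript has no digits or parentheses."""
    return DIGIT_OR_PAREN.search(text) is None


transcripts = ["hello world", "call 911", "a word (aside) here"]
kept = [t for t in transcripts if keep_transcript(t)]
```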

Pretraining runs for 150K steps with batch size 1024 using gradient accumulation. Optimization uses AdamW and a tri-stage learning-rate schedule with warm-up, peak, and decay ratios of 0.1, 0.5, and 0.4, respectively, and a peak learning rate of 1e-4.
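The schedule can be sketched as below. The stage ratios, step count, and peak learning rate are the reported values; the linear shape of the warm-up and decay ramps, and the start and end value of zero, are assumptions, since the summary does not state them.

```python
def tri_stage_lr(step, total_steps=150_000, peak_lr=1e-4, ratios=(0.1, 0.5, 0.4)):
    """Learning rate at `step` for a warm-up / hold / decay schedule.

    Assumes linear warm-up from 0 and linear decay to 0 (not confirmed by the source).
    """
    warmup_steps = round(ratios[0] * total_steps)
    hold_steps = round(ratios[1] * total_steps)
    decay_steps = total_steps - warmup_steps - hold_steps
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < warmup_steps + hold_steps:
        return peak_lr
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps


# Sample the schedule at the stage boundaries and midpoints.
samples = {s: tri_stage_lr(s) for s in (0, 7_500, 15_000, 90_000, 120_000, 150_000)}
```

Under these assumptions the model warms up for 15K steps, holds the peak rate for 75K steps, and decays over the final 60K steps.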

Downstream TTS Setup

For downstream synthesis experiments, the paper uses VITS as the backbone TTS model and swaps in different text encoders: the original VITS encoder, m-PLBERT, XPhoneBERT, and UR-BERT. All datasets are resampled to 22,050 Hz.

The evaluation spans 11 languages. The high-resource group contains English, German, and Mandarin Chinese, each with 20 hours of training data. The low-resource group contains eight languages: Javanese and Sundanese at 5 hours each, Khmer at 3 hours, Afrikaans, Nepali, Setswana, and Xhosa at 2 hours each, and Sinhala at 1 hour.

Training follows the protocols of prior work: low-resource models are trained for 100K steps and high-resource models for 300K steps, both with batch size 32. The text encoder is frozen for the first 25 percent of training steps, and warm-up is omitted during fine-tuning to stabilize the monotonic alignment search module.
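The freezing rule amounts to a simple step threshold. The helper below is a hypothetical illustration with 0-indexed steps; the repository may implement the rule differently.

```python
def encoder_is_frozen(step, total_steps, frozen_fraction=0.25):
    """True while the text encoder should stay frozen (steps are 0-indexed)."""
    return step < int(frozen_fraction * total_steps)


# Low-resource runs (100K steps) unfreeze at step 25,000;
# high-resource runs (300K steps) unfreeze at step 75,000.
low_resource_unfreeze = next(s for s in range(100_000) if not encoder_is_frozen(s, 100_000))
high_resource_unfreeze = next(s for s in range(300_000) if not encoder_is_frozen(s, 300_000))
```

In a PyTorch training loop, such a flag would typically be used to toggle requires_grad on the encoder parameters.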

Evaluation Protocol

The paper uses one subjective measure and four objective measures. Subjective quality is assessed with mean opinion score on a 1-to-5 scale, with phonetic guidance. Ratings are collected on 520 samples from 44 participants with diverse regional backgrounds.

Objective quality is measured with the UTokyo MOS Prediction System, reported as relative degradation against ground-truth samples. Intelligibility is measured with character error rate from transcriptions generated by Omnilingual-ASR-CTC-1B, also reported as relative degradation against ground truth. Spectral and prosodic differences are captured by mel-cepstral distance and log-F0 root mean squared error.
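For reference, the two distance-style metrics can be computed as in the sketch below, using their standard textbook definitions. The ASR transcription step, the time alignment of frames, and the normalization into relative degradation are omitted because the summary does not give their exact form; the function names are illustrative.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)


def mel_cepstral_distance(ref_frames, syn_frames):
    """Mean MCD in dB over frames that are assumed to be time-aligned already."""
    scale = 10.0 / math.log(10.0)
    per_frame = [
        scale * math.sqrt(2.0 * sum((a - b) ** 2 for a, b in zip(r, s)))
        for r, s in zip(ref_frames, syn_frames)
    ]
    return sum(per_frame) / len(per_frame)


cer = character_error_rate("kitten", "sitting")
mcd = mel_cepstral_distance([[1.0, 0.0]], [[0.0, 0.0]])
```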

The appendix provides additional MOS protocol details: participants were recruited through community outreach, all were fluent in English, and several had exposure to other language regions. Participants were instructed to evaluate in a quiet environment with headphones, and loudspeakers were prohibited. Grapheme transcriptions and Romanized transcriptions were shown as phonetic guidance, along with the language identity for each sample.

Sampling was designed for fair comparison. The authors randomly sampled 10 utterances per configuration, reused the same text prompts across models within a language, and evaluated 110 ground-truth samples plus 410 generated samples. To reduce fatigue, the 520 samples were split into four disjoint groups of 130 utterances, with 11 participants assigned to each group.
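The sample counts can be checked with a few lines of arithmetic. The per-language model counts are read off the result tables; attributing the remaining configurations to the "without STP" ablation systems is an inference that reproduces the reported totals, not a breakdown stated by the authors.

```python
UTTERANCES_PER_CONFIG = 10

# Number of synthesized systems per language in the main result tables.
models_per_language = {
    "English": 4, "German": 4, "Mandarin Chinese": 4,  # VITS, m-PLBERT, XPhoneBERT, UR-BERT
    "Afrikaans": 3, "Khmer": 3,                        # VITS, XPhoneBERT, UR-BERT
    "Javanese": 2, "Nepali": 2, "Setswana": 2,         # VITS, UR-BERT
    "Xhosa": 2, "Sinhala": 2, "Sundanese": 2,          # VITS, UR-BERT
}
num_languages = len(models_per_language)
ablation_configs = num_languages  # assumed: one "UR-BERT without STP" system per language

ground_truth = num_languages * UTTERANCES_PER_CONFIG
generated = (sum(models_per_language.values()) + ablation_configs) * UTTERANCES_PER_CONFIG
total = ground_truth + generated
samples_per_group = total // 4
raters_per_group = 44 // 4
```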

(a) Instructions provided to the participants.

(b) Interface of the MOS evaluation platform.

Results on High-Resource Languages

UR-BERT improves over VITS, m-PLBERT, and XPhoneBERT on all three high-resource languages in terms of MOS, and it achieves the best or near-best scores on most objective metrics. The qualitative conclusion is that Romanization plus STP does not merely expand coverage; it also yields a stronger encoder for languages already well represented in TTS benchmarks.

A notable result is that UR-BERT was trained on 8M sentences, which the authors note is only 2.5 percent of the text used for XPhoneBERT, yet it still surpasses XPhoneBERT on most metrics. The paper attributes this data efficiency to the compact Romanized token space and to the acoustic guidance supplied by STP.

High-resource TTS results. Lower is better for relative UTM degradation, relative CER degradation, mel-cepstral distance, and log-F0 RMSE. Higher is better for MOS.
Language	Model	MOS	Rel. UTM degradation	Rel. CER degradation	MCD	Log-F0 RMSE
English	GT	4.61	-	-	-	-
	VITS	3.78	0.29	6.15	5.71	0.162
	m-PLBERT	1.83	0.58	66.50	8.34	0.163
	XPhoneBERT	4.11	0.23	4.79	5.17	0.152
	UR-BERT	4.35	0.12	3.78	5.23	0.149
German	GT	4.02	-	-	-	-
	VITS	3.45	0.53	6.37	5.36	0.188
	m-PLBERT	2.65	0.57	67.78	6.93	0.203
	XPhoneBERT	3.53	0.53	5.85	4.77	0.171
	UR-BERT	3.78	0.33	3.07	4.66	0.170
Mandarin Chinese	GT	4.35	-	-	-	-
	VITS	3.65	0.46	29.28	5.17	0.102
	m-PLBERT	2.88	0.35	67.45	5.90	0.105
	XPhoneBERT	3.49	0.30	25.98	5.16	0.097
	UR-BERT	3.88	0.36	21.83	4.95	0.098

UR-BERT has the highest MOS in all three languages. In German, it is best on every reported metric. In English, it is best on every metric except MCD, where XPhoneBERT is slightly better (5.17 vs. 5.23). In Mandarin Chinese, UR-BERT is best on MOS, relative CER degradation, and MCD, while XPhoneBERT is better on relative UTM degradation (0.30 vs. 0.36) and slightly better on log-F0 RMSE (0.097 vs. 0.098). The paper also highlights that m-PLBERT can degrade sharply when plugged into VITS, producing speech that may sound fluent but is often phonetically incorrect.

Results on Low-Resource and Zero-Shot Languages

The low-resource experiments are where UR-BERT’s scale advantage becomes especially clear. Phoneme-based baselines do not support many of the languages because reliable G2P resources are unavailable, whereas UR-BERT can still be used through Romanization. Across the reported languages, UR-BERT achieves the highest MOS in every case.

The objective metrics are mostly favorable as well, but not uniformly so for every language. This is worth emphasizing because it shows that subjective quality and some objective measures can diverge slightly on very small data sets. Even so, UR-BERT is consistently strong and often best on relative UTM degradation and relative CER degradation.

Low-resource and zero-shot TTS results. Lower is better for relative UTM degradation, relative CER degradation, mel-cepstral distance, and log-F0 RMSE. Higher is better for MOS.
Language	Setting	Model	MOS	Rel. UTM degradation	Rel. CER degradation	MCD	Log-F0 RMSE
Afrikaans	Seen by XPhoneBERT and UR-BERT	GT	4.21	-	-	-	-
Afrikaans	Seen by XPhoneBERT and UR-BERT	VITS	2.80	0.59	26.03	6.41	0.127
Afrikaans	Seen by XPhoneBERT and UR-BERT	XPhoneBERT	2.85	0.38	18.54	6.25	0.122
Afrikaans	Seen by XPhoneBERT and UR-BERT	UR-BERT	3.34	0.37	15.82	6.09	0.121
Khmer	Seen by XPhoneBERT and UR-BERT	GT	3.58	-	-	-	-
Khmer	Seen by XPhoneBERT and UR-BERT	VITS	2.96	0.51	15.51	6.17	0.117
Khmer	Seen by XPhoneBERT and UR-BERT	XPhoneBERT	2.98	0.50	12.40	5.72	0.113
Khmer	Seen by XPhoneBERT and UR-BERT	UR-BERT	3.21	0.52	6.88	5.59	0.112
Javanese	Seen only by UR-BERT	GT	3.90	-	-	-	-
Javanese	Seen only by UR-BERT	VITS	2.80	0.61	23.50	6.42	0.107
Javanese	Seen only by UR-BERT	UR-BERT	3.05	0.52	28.03	6.64	0.105
Nepali	Seen only by UR-BERT	GT	4.35	-	-	-	-
Nepali	Seen only by UR-BERT	VITS	3.33	0.47	13.55	7.10	0.071
Nepali	Seen only by UR-BERT	UR-BERT	3.66	0.46	6.71	6.76	0.068
Setswana	Seen only by UR-BERT	GT	3.92	-	-	-	-
Setswana	Seen only by UR-BERT	VITS	2.51	0.78	32.27	5.38	0.118
Setswana	Seen only by UR-BERT	UR-BERT	2.92	0.77	25.50	5.13	0.113
Xhosa	Seen only by UR-BERT	GT	4.20	-	-	-	-
Xhosa	Seen only by UR-BERT	VITS	3.05	0.58	17.70	6.96	0.112
Xhosa	Seen only by UR-BERT	UR-BERT	3.48	0.54	10.74	6.13	0.106
Sinhala	Seen only by UR-BERT	GT	4.11	-	-	-	-
Sinhala	Seen only by UR-BERT	VITS	3.49	0.48	13.09	5.31	0.095
Sinhala	Seen only by UR-BERT	UR-BERT	3.82	0.39	9.49	4.75	0.091
Sundanese	Zero-shot for UR-BERT	GT	4.25	-	-	-	-
Sundanese	Zero-shot for UR-BERT	VITS	3.15	0.59	14.86	4.98	0.078
Sundanese	Zero-shot for UR-BERT	UR-BERT	3.43	0.47	13.46	5.08	0.077

Two notable nuances appear in the low-resource table. First, UR-BERT is not universally best on every objective metric for every language: for example, Javanese keeps better relative CER and MCD under VITS, while Sundanese keeps slightly better MCD under VITS. Second, despite those metric-level exceptions, UR-BERT still wins on MOS throughout and usually improves the most interpretable speech-quality and intelligibility indicators.

The zero-shot Sundanese result is important because Sundanese is excluded from the pretraining corpus. Even in that setting, UR-BERT outperforms VITS on MOS, relative UTM degradation, relative CER degradation, and log-F0 RMSE, showing that the Romanized representation transfers beyond the languages seen during pretraining.

Ablation on Speech Token Prediction

The paper’s ablation removes the STP objective and compares MOS. The overall pattern is clear: STP helps most languages, especially where the phonetic mapping is harder or the resource level is lower. The authors interpret this as evidence that STP counteracts the phonetic abstraction introduced by Romanization and stabilizes text-speech alignment.

MOS ablation for UR-BERT with and without speech token prediction.
Language	MOS with STP	MOS without STP
English	4.35	4.00
German	3.78	3.64
Mandarin Chinese	3.88	3.75
Afrikaans	3.34	3.04
Khmer	3.21	3.23
Javanese	3.05	2.65
Nepali	3.66	3.66
Setswana	2.92	2.70
Xhosa	3.48	3.34
Sinhala	3.82	3.52
Sundanese	3.43	3.41

The improvement is especially visible in English, Afrikaans, Javanese, Setswana, and Sinhala, each gaining at least 0.2 MOS, with smaller gains in German, Mandarin Chinese, and Xhosa. There are three near-neutral cases: Khmer is very slightly better without STP (3.23 vs. 3.21), Nepali is tied, and Sundanese gains only 0.02. Even with those exceptions, the overall trend supports the claim that acoustic-token prediction is useful for multilingual TTS pretraining.
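The per-language gains can be recomputed directly from the ablation table. The snippet below only restates the table values and derives the differences; the variable names are illustrative.

```python
# (MOS with STP, MOS without STP), copied from the ablation table.
mos = {
    "English": (4.35, 4.00),
    "German": (3.78, 3.64),
    "Mandarin Chinese": (3.88, 3.75),
    "Afrikaans": (3.34, 3.04),
    "Khmer": (3.21, 3.23),
    "Javanese": (3.05, 2.65),
    "Nepali": (3.66, 3.66),
    "Setswana": (2.92, 2.70),
    "Xhosa": (3.48, 3.34),
    "Sinhala": (3.82, 3.52),
    "Sundanese": (3.43, 3.41),
}

# Round to two decimals to match the precision of the table.
gains = {lang: round(with_stp - without_stp, 2) for lang, (with_stp, without_stp) in mos.items()}
mean_gain = round(sum(gains.values()) / len(gains), 2)
largest_gain_language = max(gains, key=gains.get)
improved = [lang for lang, g in gains.items() if g > 0]
```

Nine of the eleven languages improve with STP, and the average gain across all eleven is 0.18 MOS.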

Practical Implications and Limitations

The paper does not include a dedicated limitations section, but several practical constraints are clear from the method. UR-BERT still depends on paired speech-text corpora for pretraining, a pretrained multilingual speech teacher, and forced alignment through MMS-FA. It therefore removes the dependence on language-specific G2P toolkits, but not the dependence on substantial multilingual speech resources.

Another important tradeoff is that Romanization intentionally compresses the symbol inventory, which can blur pronunciation differences. STP is introduced specifically to recover some of that lost detail, but the reported results show that objective metrics can still vary by language, especially in very low-resource or zero-shot settings. The authors also note that they did not consider larger speech-token codebooks, because such codebooks may begin to encode speaker or paralinguistic information rather than the phonetic content needed for TTS.

Overall, the method is best understood as a scalable compromise: it sacrifices some phonemic specificity in exchange for language coverage and token efficiency, then partially restores acoustic fidelity through speech-aware pretraining.

Conclusion

UR-BERT is a multilingual TTS text encoder that combines universal Romanization with speech token prediction. The paper’s main contribution is a practical route to scaling text encoders to 495 languages without relying on per-language G2P systems, while still keeping the encoder speech-aware enough to support high-quality synthesis. In the reported experiments, UR-BERT improves MOS across high-resource, low-resource, and zero-shot languages, and it usually improves the objective metrics as well. The resulting system is presented as a building block for more universal and inclusive multilingual TTS.

Code & Implementation

This repository implements the UR-BERT model described in the paper, designed for massively multilingual text-to-speech (TTS) with universal Romanization and speech token prediction.

The codebase is organized into two main stages:

Pretraining: This stage trains the UR-BERT encoder using masked language modeling (MLM), optionally combined with distillation objectives. The relevant code is under the pretraining/ directory, including training scripts, configuration files, and preprocessing utilities.
Finetuning: This stage builds multilingual TTS models based on the VITS architecture with various encoders such as UR-BERT, PLBERT, and XPhoneBERT. The finetuning scripts and configurations reside in the finetuning/ directory, which also includes inference code for generating synthesized speech.

The repository provides a Hugging Face model export of the UR-BERT backbone for easy integration. Pretraining focuses on learning robust phonetic representations across 495 languages, while finetuning adapts the model for TTS tasks. Configuration files define experiment-specific parameters, which makes training and evaluation runs repeatable.

Typical usage includes running train.py in pretraining/ for initial UR-BERT encoder training, followed by TTS model training using train.py in finetuning/, and finally speech synthesis via inference.py.