Akapulu Labs Research

UniVocal: Unified Speech-Singing Code-Switching Synthesis

UniVocal unifies speech and singing synthesis into one model that switches automatically based on text cues, without explicit tags. Using a refined pitch token and a staged learning approach, it enables smooth speech-singing code-switching driven purely by textual semantics.

  • tts
  • prosody
  • speech-to-speech
  • autoregressive

Demos

The demos show UniVocal generating seamless speech-singing code-switching audio from text with implicit and explicit cues, as well as emotionally empathetic speech and diverse singing styles. Listen for natural transitions between speech and singing, and check the pitch contours in the refined cent token visualizations, which show the pitch framework underlying the synthesis.

Authors: Yufei Shi, Qian Chen, Wen Wang, Xiangang Li, Zhen-Hua Ling, Yang Ai

Categories: cs.SD

Comment: accepted by ACL 2026

Published 2026-06-01 · Updated 2026-06-01

Abstract

We propose UniVocal, a unified framework that implicitly infers vocal modes from text context to pioneer Speech-Singing Code-Switching (SCS) Synthesis - a task where transitions are autonomously driven by textual semantics, akin to seamless human language blending. Unlike single-mode generation or systems relying on switching-control tags, our proposed UniVocal implicitly infers vocal modes solely from text context. To achieve this, we employ a data-efficient two-stage curriculum learning strategy that progressively trains a competitive TTS system to acquire the desired SCS capability. Addressing data scarcity, we introduce a scalable pipeline to synthesize diverse code-switching data that is both semantically and acoustically natural, alongside a new multi-scenario benchmark, SCSBench. To address limitations of semantic tokenizers in capturing acoustic details, we also introduce refined cent token and Chain-of-Thought (CoT) generation for planning prosody before content generation, effectively enhancing empathetic speech generation and singing melody. Experimental results demonstrate that UniVocal achieves state-of-the-art performance on SCSBench while maintaining competitive performance on regular speech and singing tasks. Audio samples are available at https://project-univocal-demo.github.io/demo/. The code and dataset are released at https://github.com/FunAudioLLM/FunResearch/tree/main/UniVocal.


1. Problem Setting and Main Idea

UniVocal: Unified Speech-Singing Code-Switching Synthesis addresses a new generation problem: producing a single vocal stream in which the model autonomously switches between speaking and singing based only on text semantics. The key distinction from prior speech, singing, or unified audio systems is that the switch is not driven by explicit segment tags or external segmentation; instead, the model is supposed to infer when to speak, hum, or sing from the textual context itself. The paper frames this as Speech-Singing Code-Switching (SCS) Synthesis.

The motivation is that human communication often blends modes naturally, but existing systems are usually specialized: TTS systems generate only speech, SVS/music systems generate only singing or music, and unified audio systems either select one mode per prompt or require explicit control tags. The paper argues that this leaves a gap for semantically driven intra-utterance switching.

Common audio generation tasks, categorized into specialized tasks on the left and unified tasks on the right. Notably, in addition to its capabilities for regular speech and singing generation, UniVocal can also produce vocal streams where speech and singing naturally code-switch.

The method is built as a post-training adaptation of a strong semantic-token-based TTS backbone, instantiated in the paper with CosyVoice 2. The framework is model-agnostic in principle, but the reported experiments use CosyVoice 2 because it is a competitive open-source TTS system with stable generation quality. The paper’s core claim is that with a curriculum learning strategy, synthetic SCS data, and a refined pitch token, an existing TTS model can be extended to handle speech, singing, and code-switching without explicit mode tags.

2. Claimed Contributions

  • Task definition: the paper defines SCS synthesis as automatic speech-singing switching driven by text semantics.
  • Training strategy: it introduces a two-stage curriculum learning recipe that first aligns speech and singing in a shared latent space and then teaches autonomous switching.
  • Data pipeline: it proposes a scalable synthetic data generation pipeline for semantically and acoustically natural code-switching examples.
  • Benchmark: it introduces SCSBench, a multi-scenario benchmark for evaluating code-switching behavior under different cue types.
  • Prosody modeling: it adds a refined cent token and an interleaved chain-of-thought-style generation order to better plan pitch before generating content.
  • Results: UniVocal achieves state-of-the-art SCS performance on SCSBench while remaining competitive on regular speech and singing tasks.

3. Method Overview

UniVocal is organized around three interacting pieces: a text-to-vocal language model, a refined pitch representation, and a downstream waveform reconstruction module. At a high level, the text input may be prefixed by a task instruction, and the model then autoregressively predicts an interleaved sequence of refined cent tokens and semantic tokens. Those predicted tokens, together with the prompt audio, condition the subsequent acoustic synthesis module.

Overview of UniVocal. The Text-to-Vocal language model receives the text to be generated, along with an optional natural language description of the task. At each timestep, it autoregressively generates a refined cent token and a semantic token in sequence. These two types of predicted tokens are then fed, along with the prompt audio, into a downstream module to synthesize the final voice output.

3.1 Backbone and Instruction Conditioning

The backbone is a 24-layer causal Transformer with about 0.5B parameters. The paper follows the CosyVoice 2 paradigm in which natural-language instructions act as coarse task-level control signals. Regular TTS uses the default instruction-free format. Singing generation uses style-description prompts. For SCS, a global instruction such as “Generate a monologue.<|endofprompt|>” specifies the scenario, but the exact speech-versus-singing switching points are not explicitly tagged in the text and are instead inferred from content.

This design keeps the switching signal weak at the instruction level and pushes the model to learn semantic triggers from data. The paper emphasizes that the instruction defines the task scope, not the per-segment vocal mode.
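
As a rough illustration of the instruction format described above, the sketch below assembles the text fed to the model. The helper name `build_input` and the plain string concatenation are assumptions made for illustration; only the `<|endofprompt|>` marker and the example instruction come from the text.

```python
END_OF_PROMPT = "<|endofprompt|>"  # marker quoted in the section above


def build_input(text, instruction=None):
    """Prefix `text` with an optional task-level instruction.

    Hypothetical helper: the actual preprocessing may differ.
    """
    if instruction is None:
        # Regular TTS: instruction-free format.
        return text
    # The instruction sets the task scope only; no per-segment mode tags
    # are added to `text`.
    return f"{instruction}{END_OF_PROMPT}{text}"


tts_input = build_input("The weather is lovely today.")
scs_input = build_input(
    "I was walking home when that old tune came back to me.",
    instruction="Generate a monologue.",
)
```

Note that the SCS input carries no marker at the point where singing should start, which is the property the section emphasizes.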

3.2 Refined Cent Token

The main technical obstacle the authors identify is that semantic tokenizers preserve linguistic content but discard fine acoustic detail, especially pitch contour. To inject a compact but expressive prosodic signal, they define a refined cent token based on the musical cent scale. The cent scale divides each semitone into 100 parts, giving 1200 bins per octave.

The conversion from frequency in hertz to cent scale is:

$$f_{cent} = 1200 \times \log_2\left(\frac{f_{Hz}}{440}\right).$$

The discretized token is then defined as:

$$I(f_{cent}) = \begin{cases} \lceil f_{cent} \bmod 1200 \rceil & \text{if } f_{Hz} \neq 0,\\ -1 & \text{if } f_{Hz} = 0. \end{cases}$$

Here, $f_{Hz}=0$ denotes unvoiced regions and is mapped to token $-1$. The modulo operation folds pitch into a single octave, and the ceiling operation discretizes it to integer bins. The paper argues that this gives a high-resolution representation with at most 1-cent quantization error, which is small enough to be perceptually negligible.
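
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the pitch values are assumed to come from an external F0 extractor that reports 0 for unvoiced frames.

```python
import math

UNVOICED = -1  # token for frames with f_Hz == 0


def hz_to_cent(f_hz):
    """Cents relative to A4 = 440 Hz."""
    return 1200.0 * math.log2(f_hz / 440.0)


def refined_cent_token(f_hz):
    """Fold pitch into one octave and discretize with a ceiling."""
    if f_hz == 0:
        return UNVOICED
    # Python's % returns a non-negative result for a positive modulus,
    # so pitches below 440 Hz fold into the same octave as those above it.
    folded = hz_to_cent(f_hz) % 1200.0
    return math.ceil(folded)


a4 = refined_cent_token(440.0)
a3 = refined_cent_token(220.0)
silence = refined_cent_token(0.0)
above_a4 = refined_cent_token(440.0 * 2 ** (50.5 / 1200))
above_a3 = refined_cent_token(220.0 * 2 ** (50.5 / 1200))
```

Here A4 and A3 map to the same token because of the octave fold, and the two pitches 50.5 cents above them also share a token. Taken literally, the ceiling can return 1200 as well as 0; how that edge case is handled is not stated in the text above.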

Unlike prior approaches that train an entirely new codec or chromagram tokenizer, the refined cent token is meant as a lightweight augmentation to a semantic-token system. The authors argue that 1200 bins are well matched to their setting: coarse enough to be manageable, but fine enough to preserve micro-prosody in speech and melody detail in singing.

3.3 Interleaved Chain-of-Thought Generation

UniVocal does not predict semantic tokens alone. Instead, it autoregressively predicts a pair $(c_t, s_t)$ at each frame $t$, where $c_t$ is the refined cent token and $s_t$ is the semantic token. Both streams operate at 25 Hz. The factorization used in the paper is:

$$P(\mathbf{Y}\mid \mathbf{X}) = \prod_{t=1}^{T} P(c_t \mid \mathbf{X}, \mathbf{Y}_{<t})\, P(s_t \mid \mathbf{X}, \mathbf{Y}_{<t}, c_t).$$

Conceptually, the model first predicts a pitch contour, then uses that contour to condition the semantic token prediction. This is presented as a chain-of-thought-like planning mechanism for prosody: the model “thinks” about pitch before generating content. During inference, the implementation uses logit masks so that cent-token positions can only sample from the cent vocabulary and semantic-token positions can only sample from the semantic vocabulary.
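
The interleaved order and the logit masking can be illustrated with a toy decoder. This is a sketch under stated assumptions: the vocabulary layout (semantic ids first, cent ids after), the toy sizes, the stand-in `toy_step` model, and greedy selection in place of sampling are all hypothetical choices for illustration.

```python
import math

N_SEMANTIC = 8   # toy size (hypothetical)
N_CENT = 5       # toy size (hypothetical)
VOCAB = N_SEMANTIC + N_CENT  # assumed layout: ids [0, 8) semantic, [8, 13) cent


def mask_logits(logits, lo, hi):
    """Keep logits for ids in [lo, hi); set all others to -inf."""
    return [x if lo <= i < hi else -math.inf for i, x in enumerate(logits)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def decode(step_fn, n_frames):
    """Greedy interleaved decoding: cent token c_t first, then semantic token s_t."""
    history = []
    frames = []
    for _ in range(n_frames):
        # P(c_t | X, Y_<t): only cent ids are allowed at this position.
        c = argmax(mask_logits(step_fn(history), N_SEMANTIC, VOCAB))
        history.append(c)
        # P(s_t | X, Y_<t, c_t): the history now includes c_t.
        s = argmax(mask_logits(step_fn(history), 0, N_SEMANTIC))
        history.append(s)
        frames.append((c, s))
    return frames


def toy_step(history):
    """Stand-in for the language model: deterministic pseudo-logits."""
    n = len(history)
    return [math.sin(1.3 * i + 0.7 * n) for i in range(VOCAB)]


frames = decode(toy_step, n_frames=4)
```

Because the semantic token is chosen after the cent token has been appended to the history, each semantic prediction is conditioned on the pitch decision for the same frame, matching the factorization above.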

The paper also notes a practical tradeoff: the cent-token path can be omitted for alignment-heavy tasks such as ordinary TTS and SCS, but can be enabled for more aesthetic or expressive tasks such as empathetic speech and singing.

3.4 Waveform Reconstruction

After the language-model stage, the authors adapt the CosyVoice 2 flow-matching module by adding a randomly initialized embedding layer to consume the refined cent tokens as an extra conditioning signal. The adapted module generates Mel-spectrograms, which are then converted to waveform with a pre-trained HiFi-GAN vocoder. This keeps the overall synthesis pipeline consistent with the original semantic-token TTS design while adding pitch-aware control.

3.5 Scalable Data Synthesis Pipeline for SCS

Because natural code-switching speech-singing data is scarce, the paper introduces a three-step synthetic pipeline designed to produce semantically coherent and acoustically consistent SCS samples.

  1. Semantic text generation: Gemini 2.5 Pro is used to generate scripts for scenarios such as monologues, podcasts, and audiobooks. The prompts deliberately create “boundary-blurring” contexts. The paper distinguishes implicit cues from explicit cues. Implicit cues rely on the natural semantic difference between prose and lyrics; explicit cues are transitional phrases such as “reminds me of a tune...” that more directly signal a switch.
  2. Unified acoustic synthesis: the stage-1 model is used to synthesize both speech and singing segments. To prevent timbre mismatch, both modes use the same speaker embedding. Speech segments are additionally conditioned on emotion-specific reference audio to better align vocal delivery with textual sentiment. The final sample is created by concatenating the generated speech and singing segments.
  3. Quality control: samples with severe recognition or alignment problems are filtered out using word-error-rate-based heuristics, so that the final synthetic set is both semantically usable and acoustically clean enough for training.

The paper’s appendix provides more detail: for the released training dataset, samples with WER above 20% are removed, while moderate-WER samples may be retained with ASR-transcribed text to preserve alignment.
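
A WER-based triage of this kind might look like the sketch below. The 20% cutoff is taken from the text; the boundary for "moderate" WER (`RELABEL_ABOVE`) and the function names are hypothetical, since the text does not give them.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


DROP_ABOVE = 0.20     # from the text: samples above 20% WER are removed
RELABEL_ABOVE = 0.05  # hypothetical boundary for "moderate" WER


def triage(script_text, asr_text):
    """Return the training text for a sample, or None if it is dropped."""
    wer = word_error_rate(script_text.lower(), asr_text.lower())
    if wer > DROP_ABOVE:
        return None
    if wer > RELABEL_ABOVE:
        # Keep the audio but train on what was actually said.
        return asr_text
    return script_text
```

Using the ASR transcript for moderate-WER samples keeps the text aligned with the audio even when synthesis deviated slightly from the script.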

3.6 Two-Stage Curriculum Learning

The learning strategy is deliberately staged so the model first masters independent speech and singing generation before it learns to switch automatically.

  • Stage 1: latent representation alignment. The authors continue pre-training on speech and singing data so the model learns a unified latent space. The main description specifies a 4:1 singing-to-speech ratio, and both the speech corpus and the singing corpus are used in full.
  • Stage 2: autonomous switching learning. The model is then supervised on synthetic SCS data. To prevent catastrophic forgetting, the stage-2 mixture is balanced across code-switching, speech, and singing data in a 1:1:1 ratio.

This curriculum is central to the paper’s claim. The authors argue that if the model is trained directly on the mixed SCS objective without first aligning the modalities, it struggles to learn the subtle semantic triggers that decide when to switch.
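
One simple way to realize a 1:1:1 stage-2 mixture is to pick the data source for each example with equal probability. The sketch below assumes the ratio is applied per sampled example; whether the paper balances by sample count or by duration is not stated here, and all names and toy data are hypothetical.

```python
import random
from collections import Counter


def mixture_stream(sources, weights, seed=0):
    """Yield (source_name, example) pairs, drawing sources in proportion to weights."""
    rng = random.Random(seed)
    names = list(sources)
    cursors = {name: 0 for name in names}
    while True:
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        pool = sources[name]
        # Cycle through each pool so small corpora are simply repeated.
        example = pool[cursors[name] % len(pool)]
        cursors[name] += 1
        yield name, example


stage2_sources = {
    "scs": ["scs_0", "scs_1"],
    "speech": ["speech_0", "speech_1", "speech_2"],
    "singing": ["sing_0"],
}
stage2_weights = {"scs": 1, "speech": 1, "singing": 1}

stream = mixture_stream(stage2_sources, stage2_weights)
counts = Counter(name for name, _ in (next(stream) for _ in range(3000)))
```

Keeping plain speech and singing in the mixture is what the text credits with preventing catastrophic forgetting of the stage-1 capabilities.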

4. Data, Benchmarks, and Training Setup

4.1 Training Data

The paper trains on three data sources: speech, singing, and synthetic code-switching. The main reported training sets are:

  • Speech: 960 hours from LibriTTS.
  • Singing: about 3,700 hours from Suno, plus a small GTSinger subset for stage-2.
  • Code-switching: 11,769 synthetic samples, totaling 261.9 hours.

The code-switching set covers three narrative scenarios: monologue, personal podcast, and audiobook. It is designed so that speech and singing are interwoven naturally, rather than simply concatenated at arbitrary positions.

| Scenario | Count | Total Duration (h) | Average Duration (s) |
| --- | --- | --- | --- |
| Monologue | 6,247 | 84.3 | 48.6 |
| Podcast | 2,432 | 87.2 | 129.1 |
| Audiobook | 3,090 | 90.4 | 105.3 |
| Total | 11,769 | 261.9 | 80.1 |

Singing data cleaning. The appendix describes a multi-step cleaning pipeline for the Suno-derived singing corpus: source separation with MelBand Roformer, filtering with DNSMOS and SRMR to remove noisy/reverberant tracks, dereverberation, voice activity segmentation, transcription with FastWhisper, and phoneme-per-second filtering to reject hallucinated or pathological lyrics. The authors note that despite this cleaning, the singing data still contains artifacts such as electric tones and weak style-to-performance correlation.

Training schedule. Optimization uses AdamW with $\beta_1=0.9$, $\beta_2=0.95$, weight decay $0.1$, and gradient clipping at $1.0$. Stage-1 uses linear learning-rate decay from $2 \times 10^{-4}$ to zero over 70,000 steps with 5,000 warmup steps. Stage-2 uses a constant learning rate of $1 \times 10^{-4}$ for 30,000 steps. The authors report training on 4 NVIDIA A800 GPUs, with the full process taking about 6 days in total. The flow-matching module is also fine-tuned on stage-2 data.
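
The two learning-rate schedules can be written as small functions of the step index. This sketch assumes a linear warmup from zero and assumes that the 70,000 steps include the 5,000 warmup steps; neither detail is stated in the text, and the function names are hypothetical.

```python
def stage1_lr(step, peak=2e-4, warmup=5_000, total=70_000):
    """Linear warmup to `peak`, then linear decay to zero at `total`."""
    if step < warmup:
        return peak * step / warmup  # assumed warmup shape
    if step >= total:
        return 0.0
    return peak * (total - step) / (total - warmup)


def stage2_lr(step, lr=1e-4):
    """Constant learning rate for all stage-2 steps."""
    return lr


schedule_preview = [stage1_lr(s) for s in (0, 5_000, 37_500, 70_000)]
```

Under these assumptions the stage-1 rate peaks at step 5,000 and passes through half of the peak at step 37,500, the midpoint of the decay phase.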

4.2 SCSBench

The benchmark for code-switching evaluation is SCSBench, a held-out subset of the synthetic data. The appendix reports roughly 1,200 samples, balanced across both cue type and narrative scenario. The main paper splits it into three subsets:

  • SCSBench-Implicit: only implicit semantic cues.
  • SCSBench-Explicit: only explicit trigger phrases.
  • SCSBench-Mixed: both cue types mixed together.

This design separates two abilities: inferring mode changes from subtle semantic contrast, and following direct trigger phrases.

4.3 Evaluation Tasks

  • SCS: transition accuracy on SCSBench using F1 scores, with both Gemini-based objective scoring $F1(O)$ and human scoring $F1(S)$.
  • Regular TTS: SeedTTS-EN.
  • Textual empathy: a 50-sentence test set spanning 10 emotional scenarios.
  • Singing: GTSinger for short phrases and Fullsong for long-form singing.
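
For reference, an F1 score over per-segment mode labels can be computed as in the sketch below. Treating "sing" as the positive class and scoring at the segment level are assumptions for illustration; the exact scoring protocol behind $F1(O)$ and $F1(S)$ is not spelled out here.

```python
def f1_score(reference, predicted, positive="sing"):
    """F1 of the `positive` label over aligned segment labels."""
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(reference, predicted))
    tp = sum(r == positive and p == positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


reference = ["speak", "sing", "speak", "sing"]
predicted = ["speak", "sing", "sing", "speak"]
example_f1 = f1_score(reference, predicted)
```

In this toy case one singing segment is found, one is missed, and one speech segment is wrongly sung, so precision and recall are both 0.5.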

For SCS, the paper uses an in-context-learning calibration strategy for Gemini 2.5 Pro so that the automatic evaluator focuses on acoustic singing versus speech, rather than merely on whether the text looks lyrical. Human agreement is reported as substantial, with Fleiss’ $\kappa = 0.684$. On a 243-sample calibration set, Gemini and human judgments correlate positively ($r=0.343$, $\rho=0.346$, $p<0.05$), and the paper reports perfect system-level rank consistency across SCSBench subsets.
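
For readers unfamiliar with the agreement statistic, Fleiss' $\kappa$ can be computed from a table of rating counts as sketched below. The toy rating tables are invented for illustration and do not reproduce the paper's value of 0.684.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    # Observed agreement: fraction of agreeing rater pairs per item, averaged.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    n_cats = len(counts[0])
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)


# Two items, three raters, categories (speech, singing).
perfect = fleiss_kappa([[3, 0], [0, 3]])
split = fleiss_kappa([[2, 1], [1, 2]])
```

Values between 0.61 and 0.80 are conventionally read as substantial agreement, which is the band the reported 0.684 falls into.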

5. Main Results

5.1 SCSBench: Switching Accuracy

UniVocal is compared against two cascaded baselines: Gemini + Bark and Gemini + Cosy2 + LeVo. The best system differs by subset: Gemini + Cosy2 + LeVo leads on the implicit subset, while UniVocal leads on the explicit and mixed subsets. UniVocal's strongest result is on SCSBench-Mixed, where it reaches F1(O) = 0.871 and F1(S) = 0.810.

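| Model | Implicit F1(O) | Implicit F1(S) | Explicit F1(O) | Explicit F1(S) | Mixed F1(O) | Mixed F1(S) |
| --- | --- | --- | --- | --- | --- | --- |
| Gemini + Bark | 0.414 | 0.142 | 0.533 | 0.250 | 0.465 | 0.199 |
| Gemini + Cosy2 + LeVo | 0.752 | 0.685 | 0.572 | 0.489 | 0.607 | 0.566 |
| UniVocal | 0.626 | 0.595 | 0.714 | 0.635 | 0.871 | 0.810 |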

The qualitative interpretation is important: UniVocal performs best where the semantic cues are explicit enough for the model to infer switching reliably, and it remains stronger than the pure tag-based Bark pipeline on every subset. On the hardest implicit subset, however, the cascade with Gemini segmentation still leads in F1, which aligns with the paper’s later limitation analysis: real-world implicit switching remains difficult without at least some explicit anchor.

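| Model | Implicit WER | Implicit SIM | Implicit UTMOS | Explicit WER | Explicit SIM | Explicit UTMOS | Mixed WER | Mixed SIM | Mixed UTMOS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini + Bark | 21.83 | --- | 3.41 | 29.47 | --- | 3.31 | 29.60 | --- | 3.31 |
| Gemini + Cosy2 + LeVo | 17.97 | 0.758 | 3.42 | 8.18 | 0.763 | 3.62 | 12.43 | 0.773 | 3.54 |
| UniVocal | 5.83 | 0.650 | 4.36 | 8.80 | 0.643 | 4.41 | 10.90 | 0.652 | 4.36 |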

On quality metrics, UniVocal achieves the best UTMOS on every subset and the best WER on the implicit and mixed subsets; on the explicit subset its WER (8.80) is marginally behind Gemini + Cosy2 + LeVo (8.18). Its SIM is lower than that of the cascade baseline (about 0.65 versus 0.76 to 0.77), which the paper attributes to imperfections in the singing data. However, an intra-sample speaker-consistency analysis shows that UniVocal maintains stronger internal identity stability across the temporal segments of each sample than the cascaded baselines, reducing timbre drift during switching.

Intra-sample speaker consistency. Pairwise similarity heatmap between five temporal segments, averaged across all generated samples from each system. Darker colors indicate higher speaker stability.
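
A heatmap of this kind can be built from per-segment speaker embeddings, as in the sketch below. The embeddings would come from a speaker-verification model that is not shown; the toy vectors and function names are hypothetical, and cosine similarity is assumed as the similarity measure.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarity(segment_embeddings):
    """Similarity between every pair of temporal segments of one sample."""
    return [[cosine(u, v) for v in segment_embeddings] for u in segment_embeddings]


def average_matrices(matrices):
    """Element-wise mean over the per-sample similarity matrices."""
    n = len(matrices[0])
    return [
        [sum(m[i][j] for m in matrices) / len(matrices) for j in range(n)]
        for i in range(n)
    ]


# Two toy samples, each split into five segments with 3-dimensional embeddings.
sample_a = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.0, 0.0, 0.1],
            [0.8, 0.3, 0.0], [0.9, 0.1, 0.2]]
sample_b = [[0.2, 1.0, 0.1], [0.1, 0.9, 0.0], [0.3, 1.0, 0.2],
            [0.2, 0.8, 0.1], [0.1, 1.0, 0.1]]
heatmap = average_matrices([pairwise_similarity(sample_a),
                            pairwise_similarity(sample_b)])
```

Off-diagonal entries close to 1 indicate that the speaker identity stays stable across segments, including across speech-singing switches.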

5.2 Regular TTS and Textual Empathy

UniVocal remains competitive on standard zero-shot TTS and improves empathetic speech generation significantly relative to the CosyVoice 2 baseline. On SeedTTS-EN, it slightly trails F5-TTS on WER and SIM but obtains the best UTMOS. On the empathy test set, it closes much of the gap to the commercial ElevenLabs baseline.

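SeedTTS-EN

| Model | WER | SIM | UTMOS |
| --- | --- | --- | --- |
| F5-TTS | 2.15 | 0.755 | 3.68 |
| CosyVoice 2 | 2.96 | 0.744 | 4.18 |
| Vevo 1.5 | 12.58 | 0.718 | 3.68 |
| UniVocal | 2.69 | 0.703 | 4.21 |

Textual Empathy Test Set

| Model | E-MOS | P-MOS | WER |
| --- | --- | --- | --- |
| CosyVoice 2 | 1.78 | 1.74 | 0.53 |
| ElevenLabs multilingual-v2 | 2.30 | 2.47 | 0.24 |
| UniVocal | 2.26 | 2.22 | 0.32 |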

The paper interprets this gain as the result of two factors: stage-1 singing data exposes the model to emotionally richer vocal patterns, and the refined cent token then amplifies prosodic planning during generation. In other words, the empathy gain does not come mainly from stage-2 code-switching data; it is already latent in the stage-1 alignment process and is further strengthened by the pitch-aware tokenization.

5.3 Singing Generation

For singing, UniVocal is evaluated on short-phrase GTSinger and long-form Fullsong. The model achieves the best WER, AES, and QUA on both sets, while its subjective N-MOS and M-MOS on Fullsong are competitive but below those of LeVo. This is an important result because the paper’s goal is not simply to improve SCS, but to keep regular speech and singing performance competitive.

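| Model | GTSinger AES | GTSinger WER | GTSinger SIM | GTSinger QUA | Fullsong AES | Fullsong WER | Fullsong SIM | Fullsong QUA | Fullsong N-MOS | Fullsong M-MOS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vevo 1.5 | 5.38 | 22.79 | 0.709 | 8.71 | 5.46 | 49.55 | 0.66 | 6.97 | 2.17 | 2.08 |
| YuE | 5.32 | 40.32 | 0.352 | 9.33 | 5.14 | 77.60 | 0.46 | 6.34 | 2.22 | 2.24 |
| LeVo | 5.25 | 23.44 | 0.603 | 9.30 | 5.37 | 69.41 | 0.54 | 7.51 | 2.41 | 2.34 |
| UniVocal | 5.44 | 18.07 | 0.703 | 10.70 | 5.58 | 35.88 | 0.72 | 7.75 | 2.23 | 2.18 |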

On GTSinger, UniVocal leads in AES, WER, and QUA and is close to the best SIM. On Fullsong, it is again the best on all objective metrics, while the subjective scores place it behind LeVo but indicate that it remains musically convincing and natural even for longer songs.

6. Ablations and Additional Analyses

6.1 CoT and Curriculum Learning Ablations

The ablation study isolates two design choices: the refined cent token / CoT mechanism and the two-stage curriculum. The key outcome is that these two components trade off differently across tasks. Removing CoT improves switching accuracy on the mixed SCS set but hurts expressive quality. Removing curriculum learning damages switching accuracy substantially, showing that latent-space alignment is a prerequisite for learning autonomous switching.

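| Model | Empathy E-MOS | Empathy P-MOS | Empathy WER | Fullsong N-MOS | Fullsong M-MOS | Fullsong WER | SCSBench-Mixed F1 | SCSBench-Mixed WER |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| UniVocal | 2.26 | 2.22 | 0.32 | 2.23 | 2.18 | 35.30 | 0.716 | 5.99 |
| UniVocal w/o CoT | 2.03 | 1.84 | 0.51 | 2.20 | 1.86 | 35.88 | 0.810 | 10.90 |
| UniVocal w/o CL | 2.24 | 2.23 | 0.52 | 2.29 | 2.17 | 37.21 | 0.496 | 14.46 |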

The most revealing pattern is that w/o CoT improves SCS F1 but sharply degrades emotional and musical quality. The authors therefore interpret the cent-token branch as a prosodic planner: it helps the model generate more expressive audio, but it is not strictly required for mode-switch timing. By contrast, the w/o CL variant confirms that a one-stage mixed training recipe is insufficient for learning reliable semantic switching.

6.2 Cent Token Resolution Ablation

The appendix compares 12-bin, 480-bin, and 1200-bin cent token resolutions. The main conclusion is that low resolution is too coarse for emotional speech, while 1200 bins is the best compromise for UniVocal’s expressive setting.

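| Resolution | E-MOS(O) | P-MOS(O) | WER | AES | WER |
| --- | --- | --- | --- | --- | --- |
| 12 bins | 1.57 | 1.63 | 0.46 | 3.61 | 58.87% |
| 480 bins | 1.82 | 1.97 | 0.51 | 3.42 | 49.72% |
| 1200 bins | 1.85 | 2.06 | 0.42 | 3.45 | 56.13% |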

This table supports the paper’s argument that the refined cent token is meant to capture micro-prosody for speech and fine melodic structure for singing, not just broad pitch categories.

6.3 Prosodic Planning Verification

The paper checks whether the predicted cent tokens truly behave like a planning signal by measuring correlation with ground-truth cent tokens extracted from the generated audio. The reported correlations are positive: SRCC $0.633$ and LCC $0.604$ on the textual empathy set, and SRCC $0.679$ and LCC $0.628$ on Fullsong. The authors interpret this as evidence that the model is drafting the main pitch contour before refining the rest of the utterance.
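
The two correlation measures can be computed as in the sketch below, here on toy pitch sequences. The data are invented, the function names are hypothetical, and the sketch ignores details the text does not specify, such as how unvoiced tokens or octave wrap-around are handled before correlating.

```python
import math


def pearson(x, y):
    """Linear correlation coefficient (LCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


planned = [100, 250, 400, 700, 900]    # toy predicted cent tokens
realized = [110, 240, 500, 650, 1150]  # toy cent tokens extracted from audio
srcc = spearman(planned, realized)
lcc = pearson(planned, realized)
```

SRCC rewards agreement in the ordering of pitch values, while LCC also penalizes deviations from a linear relationship, which is why the toy example yields a perfect SRCC but an LCC below 1.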

6.4 Qualitative Analysis of Switching Cues

Case studies show that explicit cues are very effective anchors. When trigger phrases such as “always the same tune” or “It goes...” appear before a singing segment, UniVocal switches accurately. Purely implicit switching is harder: a lyrical sentence without a trigger may be treated as prose. The exception is humming, whose non-lexical form itself provides a strong signal. This qualitative result is consistent with the benchmark findings: explicit cues make the task much easier than purely implicit switching.

6.5 Real-World Generalization

The appendix evaluates about 30 minutes of real-world human SCS recordings. UniVocal initially performs poorly on the raw real-world set, with F1 = 0.201, but improves sharply to 0.730 once a single explicit semantic cue is manually inserted. This reinforces the paper’s central claim that the model generalizes well when given a modest textual anchor, but remains sensitive to the gap between synthetic training data and uncontrolled real-world switching.

| Model | Real SCS F1 | Enhanced Real SCS F1 |
| --- | --- | --- |
| Gemini + Cosy2 + LeVo | 0.452 | 0.691 |
| UniVocal | 0.201 | 0.730 |

7. Limitations and Ethical Considerations

The paper is unusually explicit about limitations. First, the singing training data is derived from Suno and cleaned through source separation and ASR tooling, which leaves residual artifacts such as electric tones and imperfect lyric alignment. This sets an upper bound on the fidelity of the generated singing voice.

Second, there remains a domain gap between the synthetic training distribution and real-world SCS. The model currently benefits from minor explicit triggers, and purely implicit transitions in unconstrained real text remain difficult. The real-world evaluation confirms this: performance is much better with an added explicit cue than without one.

Third, the automatic F1 metric for SCS, while reliable at the system level, has limited sample-level resolution because short samples often collapse to binary decisions. The authors therefore rely on a combination of Gemini-based scoring, human evaluation, and rank-consistency analysis rather than treating any single scalar as definitive.

On ethics, the paper acknowledges potential misuse for deepfakes and impersonation. It states that the work is intended for academic use, that the training data comes from open sources, and that the released models are under a restrictive license designed to prevent commercial misuse and illegal impersonation.

8. Takeaway

UniVocal’s main technical contribution is a practical recipe for extending a strong semantic-token TTS system into a unified speech-singing generator that can also perform code-switching. The recipe combines: (1) a curriculum that first aligns speech and singing, then teaches switching; (2) a synthetic benchmark/data pipeline that creates semantically natural switching examples; and (3) a pitch-aware refined cent token that improves expressive prosody and melody. The resulting system is strongest on explicit and mixed code-switching, remains competitive on standard speech and singing, and exposes a useful design point for future unified audio models: lightweight prosodic planning can help bridge the gap between intelligible speech and musically structured singing.

Code & Implementation

The UniVocal codebase, found under the UniVocal/ directory, is the official implementation of the proposed unified Speech-Singing Code-Switching (SCS) synthesis framework. As indicated in the UniVocal/README.md, the repository is currently under active development, with upcoming releases planned to include the full pipeline: inference scripts for generating speech, singing, and seamless code-switching audio; pre-trained checkpoints; training code for the two-stage curriculum learning strategy; and data processing scripts for dataset construction.

While the UniVocal directory currently contains mainly documentation and the paper PDF, the repository also hosts companion folders such as VARSTok/, which implements a variable-frame-rate speech tokenizer. How VARSTok relates to UniVocal is not stated in the summary above, which describes fixed-rate 25 Hz semantic tokens inherited from the CosyVoice 2 backbone.

Overall, the repository structure aligns with the paper's description: data-efficient curriculum learning to train a unified model for implicit vocal mode inference and speech-singing code-switching synthesis. Future releases will clarify the exact scripts and checkpoints that reproduce the results presented in the paper.