MeanFlow Token2Wav

One-Step Token-to-Waveform Generation with MeanFlow in Latent Space

This paper presents a one-step Token-to-Waveform (Token2Wav) synthesis method that applies MeanFlow in latent space, cutting generation latency while keeping quality close to a multi-step baseline. It replaces iterative decoding with a single latent prediction and adds refinement strategies that improve output fidelity without slowing inference.

tts
low-latency
llm

Demos

These demos showcase the Token2Wav MeanFlow model for one-step token-to-waveform generation in latent space. Compare how the fine-tuning strategies (No-FT, Decoder-FT, Joint-FT) change spectrogram quality relative to the 10-step baseline and to ground truth. The Joint-FT D=24 model offers the best quality among the one-step variants, with a roughly 17× speedup over the baseline.

Baseline 10-step Token2Wav Mel-spectrogram for sample 1, representing multi-step generation.

Best Joint fine-tuning (Joint-FT) with D=24, closest to ground truth quality for sample 1.

Authors: Zheqi Dai, Guangyan Zhang, Zhen Ye, Jingyu Li, Haolin He, Chunyat Wu, Yiwen Guo, Qiuqiang Kong

Categories: eess.AS

Comment: 5 pages, 1 figure

Published 2026-06-16 · Updated 2026-06-16

Abstract

Neural audio codecs are central to modern LLM-based Text-to-Speech (TTS) and multimodal systems. As low-bitrate semantic codecs gain prominence, the Token-to-Waveform (Token2Wav) decoder becomes a bottleneck determining both perceptual quality and system efficiency. Conventional multi-step flow-matching decoders offer superior quality but suffer from high inference latency due to iterative sampling, creating a severe quality-speed trade-off. In this paper, we propose a novel Token2Wav architecture that overcomes this limitation by applying MeanFlow in a highly compressed latent space. By modeling the average velocity rather than the instantaneous velocity field, MeanFlow enables true one-step generation. Operating in the latent domain mitigates the memory and stability issues of waveform-level flows, yielding up to a 17$\times$ improvement in Real-Time Factor (RTF) compared to multi-step baselines with negligible quality degradation. Furthermore, we introduce refinement strategies that mitigate latent mismatch, including decoder-only fine-tuning with the MeanFlow generator frozen and end-to-end joint fine-tuning, improving fidelity without increasing inference-time cost. Code and demo are publicly available.

Introduction

This paper addresses the inference bottleneck in modern Token-to-Waveform (Token2Wav) decoders used in LLM-based text-to-speech and related multimodal systems. The central observation is that as speech codecs move toward lower bitrate and more semantic tokenizations, the downstream decoder must recover progressively more acoustic detail from less information. That makes the Token2Wav stage both more important and more expensive: it determines perceptual quality, but it can also dominate latency.

The authors focus on the quality-speed trade-off of flow-matching decoders. Multi-step flow models can produce strong speech quality, but ODE sampling typically requires many function evaluations (NFEs), each a full forward pass of the network, which hurts real-time performance. The key idea in this work is to replace iterative waveform-space decoding with MeanFlow in a compressed latent space. By learning average velocity instead of instantaneous velocity, the model can generate a latent speech representation in exactly one network evaluation, then deterministically decode that latent to waveform using a lightweight VAE decoder.

The method is designed for the Token2Wav setting where the conditioning consists of semantic tokens and a speaker embedding. The paper’s main claim is that this latent-space one-step design eliminates the iterative sampling overhead of conventional flow decoders while avoiding some of the memory and stability problems that arise when attempting one-step generation directly in waveform space.

The paper’s stated contributions are threefold: (1) a one-step Token2Wav framework using latent-space MeanFlow, (2) an empirical study of latent dimensionality and model capacity under tight latency constraints, and (3) refinement strategies that reduce mismatch between generated latents and VAE latents without increasing inference-time cost.

Overview of the Proposed System

The proposed decoder is a two-stage cascade. First, a conditional 1D Diffusion Transformer (DiT) trained with MeanFlow generates a latent sequence from semantic tokens and a speaker embedding. Second, a deterministic waveform VAE decoder reconstructs the audio waveform from that latent sequence. At inference time, a noise latent is drawn once from a standard Gaussian and mapped to a speech latent in a single step:

$$\mathbf{z}_{\text{gen}} = \mathbf{z}_1 - f_\theta(\mathbf{z}_1,0,1,\mathbf{c}),$$

where $\mathbf{z}_1 \sim \mathcal{N}(\mathbf{0},\mathbf{I})$ and $\mathbf{c}$ denotes the conditioning pair $(\mathbf{s},\mathbf{e})$ of semantic tokens and speaker embedding. The waveform is then reconstructed as $\hat{\mathbf{x}} = G_\psi(\mathbf{z}_{\text{gen}})$.
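As a minimal sketch of the one-step update above, the following pure-Python toy replaces the trained DiT with a hypothetical oracle, `oracle_mean_velocity`, for a data distribution concentrated on a single latent `Z_DATA`. On the straight path used by the paper, the average velocity over $[0,1]$ starting from $\mathbf{z}_1$ is then $\mathbf{z}_1 - \mathbf{z}_{\text{data}}$, so one evaluation recovers the target. All names, the list-of-floats latent, and the oracle itself are illustrative assumptions, not the paper's model.

```python
import random

# Hypothetical toy target: the "data distribution" is a single latent vector.
Z_DATA = [0.5, -1.0, 2.0]

def oracle_mean_velocity(z, r, t, cond=None):
    """Stand-in for f_theta(z, r, t, c) in the single-point toy case.

    On the straight path z_t = (1 - t) * z_data + t * eps the velocity along a
    trajectory is constant and equals (z_t - z_data) / t, so the average
    velocity over any interval [r, t] has the same value.
    """
    return [(zi - di) / t for zi, di in zip(z, Z_DATA)]

def one_step_sample(f, dim, cond=None, rng=random):
    """Compute z_gen = z_1 - f(z_1, r=0, t=1, c) with one function evaluation."""
    z1 = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # z_1 ~ N(0, I)
    u = f(z1, 0.0, 1.0, cond)                       # the only call to f
    return [a - b for a, b in zip(z1, u)]

z_gen = one_step_sample(oracle_mean_velocity, dim=len(Z_DATA), rng=random.Random(0))
print(z_gen)  # matches Z_DATA up to floating-point rounding
```

In the real system the oracle would be the trained generator and the result would be passed to the VAE decoder $G_\psi$; the sketch only shows that inference cost is fixed at one generator call regardless of the noise draw.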

Overall framework. Left: MeanFlow training in latent space using VAE latents as targets. Middle: refinement during fine-tuning using waveform-domain losses on audio reconstructed from generated latents. Right: inference uses the same one-step sampling plus VAE decoding, with either the original or refined decoder.

Method

Waveform VAE for Latent Representation

To make one-step generation feasible, the authors first compress waveform audio into a short latent sequence using a lightweight waveform variational autoencoder. The encoder $E_\phi$ maps a waveform segment $\mathbf{x}$ to a latent tensor:

$$\mathbf{z}_{\text{data}} = E_\phi(\mathbf{x}) \in \mathbb{R}^{T' \times D},$$

where $T'$ is the downsampled length and $D$ is the latent channel dimension. A deterministic decoder $G_\psi$ reconstructs waveform audio from latents:

$$\hat{\mathbf{x}} = G_\psi(\mathbf{z}).$$

This design intentionally confines the one-step generative problem to a much shorter and lower-dimensional sequence than raw waveform modeling. The paper emphasizes that this improves memory use and training stability relative to direct waveform-space flow models.

Rectified Flow and MeanFlow in Latent Space

Training uses a rectified probability path in latent space. Let $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I})$. The latent interpolation path is

$$\mathbf{z}_t = (1-t)\mathbf{z}_{\text{data}} + t\boldsymbol{\epsilon}, \quad t \in [0,1].$$

Under this path, the target instantaneous velocity is constant:

$$\mathbf{u}_{\text{rf}}(\mathbf{z}_t,t,\mathbf{c}) = \frac{d\mathbf{z}_t}{dt} = \boldsymbol{\epsilon} - \mathbf{z}_{\text{data}}.$$
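The constant-velocity property can be checked numerically. The sketch below, with illustrative values for the latent and the noise sample, evaluates the interpolation path and compares a central finite difference in $t$ against $\boldsymbol{\epsilon} - \mathbf{z}_{\text{data}}$. The function names are hypothetical.

```python
def interpolate(z_data, eps, t):
    """Rectified path: z_t = (1 - t) * z_data + t * eps."""
    return [(1.0 - t) * d + t * e for d, e in zip(z_data, eps)]

def rectified_velocity(z_data, eps):
    """Constant target velocity: dz_t/dt = eps - z_data."""
    return [e - d for d, e in zip(z_data, eps)]

z_data = [0.5, -1.0, 2.0]   # illustrative latent values
eps = [1.5, 0.25, -0.75]    # illustrative noise sample

# Central finite difference of the path at an arbitrary time.
t, h = 0.3, 1e-4
z_plus = interpolate(z_data, eps, t + h)
z_minus = interpolate(z_data, eps, t - h)
fd_velocity = [(p - m) / (2 * h) for p, m in zip(z_plus, z_minus)]

target = rectified_velocity(z_data, eps)
max_err = max(abs(a - b) for a, b in zip(fd_velocity, target))
print(target)   # [1.0, 1.25, -2.75]
print(max_err)  # small; only floating-point rounding remains
```

Because the path is linear in $t$, the finite difference agrees with the target at every $t$, and the endpoints are $\mathbf{z}_0 = \mathbf{z}_{\text{data}}$ and $\mathbf{z}_1 = \boldsymbol{\epsilon}$.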

Standard conditional flow matching would train a velocity field $\mathbf{v}_\theta(\mathbf{z}_t,t,\mathbf{c})$ to match this target via an $\ell_2$ loss, but it still requires multi-step ODE integration at inference. The paper instead adopts MeanFlow, which predicts an average velocity over an interval $[r,t]$:

$$\bar{\mathbf{u}}(\mathbf{z}_t;r,t,\mathbf{c}) = \frac{1}{t-r} \int_r^t \mathbf{v}(\mathbf{z}_\tau,\tau,\mathbf{c})\, d\tau.$$

The model $f_\theta$ is trained to predict this average velocity with a MeanFlow objective using stop-gradient and adaptive reweighting:

$$\mathcal{L}_{\text{MF}}(\theta)=\mathbb{E}_{r,t,\mathbf{x},\boldsymbol{\epsilon}}\Big[w(r,t)\,\lVert f_\theta(\mathbf{z}_t,r,t,\mathbf{c})-\operatorname{sg}(\bar{\mathbf{u}}_{\text{gt}})\rVert_2^2\Big].$$

The paper notes that the needed derivatives with respect to $t$ are computed efficiently with Jacobian-vector products. When $r=t$, the objective reduces to standard conditional flow matching.
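The target $\bar{\mathbf{u}}_{\text{gt}}$ is not written out above. A sketch of how it is usually obtained, assuming the paper follows the original MeanFlow formulation unchanged (which this summary does not confirm): multiplying the average-velocity definition by $(t-r)$ and differentiating with respect to $t$ gives

$$\bar{\mathbf{u}}(\mathbf{z}_t;r,t,\mathbf{c}) = \mathbf{v}(\mathbf{z}_t,t,\mathbf{c}) - (t-r)\,\frac{d}{dt}\bar{\mathbf{u}}(\mathbf{z}_t;r,t,\mathbf{c}), \qquad \frac{d}{dt}\bar{\mathbf{u}} = \big(\partial_{\mathbf{z}}\bar{\mathbf{u}}\big)\,\mathbf{v} + \partial_t\bar{\mathbf{u}}.$$

Replacing $\bar{\mathbf{u}}$ on the right-hand side by the network $f_\theta$ and $\mathbf{v}$ by the rectified-flow velocity yields the regression target

$$\bar{\mathbf{u}}_{\text{gt}} = \mathbf{u}_{\text{rf}} - (t-r)\Big[\big(\partial_{\mathbf{z}} f_\theta\big)\,\mathbf{u}_{\text{rf}} + \partial_t f_\theta\Big], \qquad \mathbf{u}_{\text{rf}} = \boldsymbol{\epsilon} - \mathbf{z}_{\text{data}}.$$

The bracketed term is a single Jacobian-vector product of $f_\theta$ along the tangent $(\mathbf{u}_{\text{rf}}, 0, 1)$ in $(\mathbf{z}, r, t)$, which is why no full Jacobian is needed. At $r=t$ the correction vanishes and the target is $\mathbf{u}_{\text{rf}}$, the conditional flow matching case.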

Conditioning and Architecture

The latent generator is a 1D DiT model conditioned on semantic token embeddings, timestep embeddings, and a speaker embedding. These signals are fused and injected into each transformer block with adaptive layer normalization (adaLN-Zero). Speaker conditioning uses a 192-dimensional embedding extracted from a pretrained CAM++ speaker encoder. Semantic tokens are extracted at 25 Hz with the CosyVoice2 tokenizer, which uses a single codebook with vocabulary size 6,561.

The latent generator is instantiated at two scales: 140M parameters and 600M parameters. The paper studies whether larger capacity helps one-step decoding under the constraints of the compressed latent space.

Refinement for Latent Mismatch

A central practical issue is latent mismatch: the one-step generator may produce latents that differ from the distribution seen by the VAE decoder during its own training. This can introduce audible artifacts even when the latent generator is efficient. To reduce this mismatch, the authors define a waveform-domain refinement loss:

$$\mathcal{L}_{\text{ref}} = \mathcal{L}_{\text{MRSTFT}} + \lambda_{\text{adv}}\mathcal{L}_{\text{adv}} + \lambda_{\text{fm}}\mathcal{L}_{\text{fm}},$$

where $\mathcal{L}_{\text{MRSTFT}}$ is a multi-resolution STFT reconstruction loss, and $\mathcal{L}_{\text{adv}}$ and $\mathcal{L}_{\text{fm}}$ are adversarial and feature-matching losses computed with a multi-scale discriminator. The discriminator is used only during training.

The paper studies two refinement strategies:

Decoder-only refinement: freeze the MeanFlow generator $f_\theta$ and fine-tune only the VAE decoder $G_\psi$ (and discriminator) on generated latents.
End-to-end joint fine-tuning: update both $f_\theta$ and $G_\psi$ by backpropagating waveform-domain losses through the one-step update, while keeping the encoder $E_\phi$ frozen.

Importantly, both strategies preserve the same inference pipeline: one generator pass plus one decoder pass.
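The two strategies differ only in which modules receive gradient updates. The sketch below records that as a small lookup table. The module labels are illustrative, and marking the discriminator as trainable under joint fine-tuning is an assumption: the text states this explicitly only for decoder-only refinement.

```python
# Which modules receive gradient updates under each refinement strategy.
# Labels are illustrative names for f_theta (generator), G_psi (decoder),
# E_phi (encoder), and the training-only discriminator.
TRAINABLE = {
    "no_ft": {"generator": False, "decoder": False,
              "encoder": False, "discriminator": False},
    "decoder_ft": {"generator": False, "decoder": True,
                   "encoder": False, "discriminator": True},
    # Assumption: the discriminator is also updated during joint fine-tuning.
    "joint_ft": {"generator": True, "decoder": True,
                 "encoder": False, "discriminator": True},
}

# Modules executed at inference time; identical for every strategy.
INFERENCE_PATH = ("generator", "decoder")

def updated_modules(strategy):
    """Return the sorted names of modules that are updated for a strategy."""
    return sorted(name for name, on in TRAINABLE[strategy].items() if on)

for name in TRAINABLE:
    print(name, updated_modules(name), "inference:", INFERENCE_PATH)
```

The encoder and discriminator never appear in `INFERENCE_PATH`, which is the sense in which refinement adds no inference-time cost.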

Training and Experimental Setup

All models are trained on LibriTTS and evaluated on the test-clean subset of LibriSpeech. The paper uses the same semantic tokenization as the CosyVoice2 baseline for fair comparison.

The waveform VAE operates on 24 kHz audio and is trained to align its latent frame rate with the token rate of 25 Hz. The encoder uses an Oobleck-style strided convolution stack with strides $[2,4,4,6,5]$, giving a total downsampling ratio of 960. The decoder is deterministic.
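The stride arithmetic can be verified in a few lines. This sketch computes the total downsampling ratio and the resulting latent frame rate, plus the latent length $T'$ for two example segment durations. It assumes segments divide evenly and ignores any padding the real encoder may apply.

```python
import math

SAMPLE_RATE_HZ = 24_000
ENCODER_STRIDES = [2, 4, 4, 6, 5]

downsample_ratio = math.prod(ENCODER_STRIDES)        # 2*4*4*6*5 = 960
latent_rate_hz = SAMPLE_RATE_HZ / downsample_ratio   # 24000 / 960 = 25.0

def latent_frames(seconds):
    """Latent length T' for a segment, assuming no padding effects."""
    return int(seconds * SAMPLE_RATE_HZ) // downsample_ratio

print(downsample_ratio, latent_rate_hz)    # 960 25.0
print(latent_frames(2), latent_frames(5))  # 50 125
```

A latent rate of 25 Hz means one latent frame per semantic token, so the generator's input and output sequences have matching lengths.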

The VAE is trained on 2-second waveform chunks using a multi-resolution STFT loss with FFT sizes from 32 to 2048, an adversarial hinge loss, a feature-matching loss with an EnCodec-style multi-scale discriminator, and a KL regularization term. The paper sets $\lambda_{\text{adv}}=0.1$, $\lambda_{\text{fm}}=5.0$, and $\lambda_{\text{kl}}=10^{-4}$, and turns on the discriminator after 1,000 warmup steps.

The MeanFlow latent generator is trained on 5-second segments to better model longer-range semantic-acoustic dependencies. The interval variables $r$ and $t$ are sampled with the logit-normal scheme from the MeanFlow literature, using $\mu=-0.4$ and $\sigma=1.0$.
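A minimal sketch of logit-normal interval sampling with the stated $\mu$ and $\sigma$ follows. It draws two values as the sigmoid of a Gaussian sample and orders them so that $r \le t$. The function names are hypothetical. The original MeanFlow recipe additionally sets $r=t$ for a fraction of samples; whether this paper does so, and at what ratio, is not stated here, so the sketch omits it.

```python
import math
import random

def sample_logit_normal(mu=-0.4, sigma=1.0, rng=random):
    """Draw sigmoid(n) with n ~ N(mu, sigma^2); the result lies in (0, 1)."""
    n = rng.gauss(mu, sigma)
    return 1.0 / (1.0 + math.exp(-n))

def sample_interval(mu=-0.4, sigma=1.0, rng=random):
    """Draw two logit-normal times and order them so that r <= t."""
    a = sample_logit_normal(mu, sigma, rng)
    b = sample_logit_normal(mu, sigma, rng)
    return min(a, b), max(a, b)

rng = random.Random(0)
pairs = [sample_interval(rng=rng) for _ in range(1000)]
mean_t = sum(t for _, t in pairs) / len(pairs)
print(pairs[0], mean_t)
```

With $\mu<0$ the sampled times are biased toward the data end of the path ($t$ near 0), since the median of each draw is $\operatorname{sigmoid}(-0.4)\approx 0.40$.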

The main baseline is CosyVoice2 Token2Wav, which performs token-to-mel conditional flow matching with 10-step Euler sampling followed by a pretrained vocoder. The paper also reports a VAE-only reconstruction upper bound by encoding and decoding ground-truth waveforms through the VAE.

Metrics are WER, speaker similarity (SpkSim), UTMOS, MOS, and real-time factor (RTF). WER is computed using a fine-tuned HuBERT-Large ASR model, SpkSim uses a fine-tuned WavLM-Large speaker verification model, and UTMOS uses the published UTMOS estimator. For metric computation, generated waveforms are resampled to 16 kHz, because LibriSpeech references are natively 16 kHz while the synthesis models produce 24 kHz audio. RTF is measured with batch size 1 on an NVIDIA H20 GPU using FP16 inference, and includes both generator and decoder time. The subjective MOS test uses 50 randomly selected utterances rated by 20 listeners on a 1–5 scale.

Main Results

The headline result is that latent-space MeanFlow achieves a large latency reduction while preserving much of the baseline quality. The best configuration is the 140M DiT with latent dimension $D=24$ and end-to-end joint fine-tuning. It reaches an end-to-end RTF of 0.0046, compared with 0.0775 for the 10-step CosyVoice2 baseline, corresponding to a roughly 17× speedup under the same measurement protocol.

Quality remains competitive: the best model obtains WER 3.41%, SpkSim 0.932, UTMOS 3.64, and MOS $3.85 \pm 0.03$, compared with the baseline’s WER 3.18%, SpkSim 0.940, UTMOS 3.76, and MOS $4.05 \pm 0.03$. The paper stresses that the speedup is obtained without changing the inference structure beyond replacing multi-step flow sampling with a single latent generation step.

The VAE reconstruction upper bound is useful for diagnosis: with oracle latents, the VAE decoder alone achieves WER 2.14%, SpkSim 0.966, UTMOS 3.67, and MOS $4.10 \pm 0.04$. This indicates that a substantial fraction of the remaining quality gap comes from the latent generator rather than from the waveform decoder itself.

System	Dim	WER (%) ↓	SpkSim ↑	UTMOS ↑	MOS ↑	RTF ↓
CosyVoice2 Token2Wav (10-step)	—	3.18	0.940	3.76	4.05 ± 0.03	0.0775
VAE reconstruction (oracle latent)	24	2.14	0.966	3.67	4.10 ± 0.04	—
Latent MeanFlow + VAE (Joint-FT)	24	3.41	0.932	3.64	3.85 ± 0.03	0.0046
Latent MeanFlow + VAE (Joint-FT)	16	3.62	0.927	3.56	3.72 ± 0.03	0.0046

The paper also reports a stage-wise breakdown for the best system: the DiT contributes 0.0016 RTF and the VAE decoder contributes 0.0030 RTF, summing to 0.0046. For the baseline, the 10-step flow stage plus vocoder total 0.0775 RTF. This decomposition highlights that the speed gain primarily comes from eliminating iterative sampling, not from decoder simplification alone.
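The arithmetic behind these figures is short enough to reproduce. The sketch below sums the stage-wise RTFs, derives the speedup, and converts RTF into wall-clock time using the standard definition RTF = processing time / audio duration. The 10-second utterance is an illustrative choice, not a figure from the paper.

```python
BASELINE_RTF = 0.0775      # 10-step CosyVoice2 Token2Wav, end to end
DIT_RTF = 0.0016           # one-step latent generator
VAE_DECODER_RTF = 0.0030   # waveform VAE decoder

proposed_rtf = DIT_RTF + VAE_DECODER_RTF
speedup = BASELINE_RTF / proposed_rtf

def synthesis_time_ms(rtf, audio_seconds):
    """RTF = processing time / audio duration, so time = RTF * duration."""
    return rtf * audio_seconds * 1000.0

print(round(proposed_rtf, 4))  # 0.0046
print(round(speedup, 2))       # 16.85, reported as roughly 17x
print(round(synthesis_time_ms(proposed_rtf, 10.0), 1))  # 46.0 ms
print(round(synthesis_time_ms(BASELINE_RTF, 10.0), 1))  # 775.0 ms
```

Under these numbers the VAE decoder accounts for about two thirds of the proposed system's runtime, so the decoder, not the one-step generator, is the larger remaining cost.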

Ablation Studies

The ablations are organized around three questions: how latent dimensionality affects quality, whether a larger DiT helps, and how much the refinement strategies matter.

Effect of latent dimensionality

Increasing latent dimension consistently improves all reported quality metrics. Moving from $D=8$ to $D=24$ reduces WER from 4.82% to 3.41%, improves SpkSim from 0.909 to 0.932, increases UTMOS from 3.47 to 3.64, and raises MOS from $3.45 \pm 0.03$ to $3.85 \pm 0.03$. RTF remains essentially unchanged across these settings (the main results table lists 0.0046 for both $D=16$ and $D=24$), which suggests that generator and decoder cost change only mildly at these dimensions.

The authors interpret $D=16$ to $D=24$ as a practical quality-compression trade-off, and use $D=24$ as the default setting in the rest of the paper.
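To make the trade-off concrete, the following sketch computes the per-step metric changes between latent dimensions from the reported ablation numbers. The values are copied from the paper's results; the helper name `deltas` is hypothetical.

```python
# (WER %, SpkSim, UTMOS, MOS) per latent dimension, from the reported ablation.
RESULTS = {
    8: (4.82, 0.909, 3.47, 3.45),
    16: (3.62, 0.927, 3.56, 3.72),
    24: (3.41, 0.932, 3.64, 3.85),
}
METRICS = ("WER", "SpkSim", "UTMOS", "MOS")

def deltas(d_from, d_to):
    """Metric change when moving from one latent dimension to another."""
    return {
        m: round(b - a, 3)
        for m, a, b in zip(METRICS, RESULTS[d_from], RESULTS[d_to])
    }

print(deltas(8, 16))   # WER -1.2, SpkSim +0.018, UTMOS +0.09, MOS +0.27
print(deltas(16, 24))  # WER -0.21, SpkSim +0.005, UTMOS +0.08, MOS +0.13
```

The WER, SpkSim, and MOS gains shrink markedly in the second step, while the UTMOS gain stays about the same size, which is consistent with treating $D=16$ to $D=24$ as the practical operating range.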

Effect of model capacity

The paper compares 140M and 600M DiT generators at $D=24$. Surprisingly, the smaller 140M model matches or slightly outperforms the 600M model on the main perceptual metrics while also running faster. Specifically, the 140M model obtains UTMOS 3.64 and MOS $3.85 \pm 0.03$ versus 3.57 and $3.78 \pm 0.03$ for the 600M model, with RTF 0.0046 versus 0.0075. The authors suggest that simply scaling the model may not improve one-step decoding under the current training recipe and may even make the single-step estimate less robust.

Effect of refinement strategy

The refinement ablation clearly shows the latent mismatch problem. When the pretrained generator and decoder are composed directly without fine-tuning, quality drops noticeably: UTMOS falls to 3.11 and MOS to $3.35 \pm 0.03$. Decoder-only fine-tuning improves reconstruction by adapting the VAE decoder to the generator’s latent distribution, reaching UTMOS 3.43 and MOS $3.70 \pm 0.03$. End-to-end joint fine-tuning yields the best overall results, with WER 3.41%, SpkSim 0.932, UTMOS 3.64, and MOS $3.85 \pm 0.03$.

Configuration	WER (%) ↓	SpkSim ↑	UTMOS ↑	MOS ↑
Latent dimensionality ($D$)
$D=8$	4.82	0.909	3.47	3.45 ± 0.03
$D=16$	3.62	0.927	3.56	3.72 ± 0.03
$D=24$	3.41	0.932	3.64	3.85 ± 0.03
Model capacity ($D=24$)
140M	3.41	0.932	3.64	3.85 ± 0.03
600M	3.44	0.930	3.57	3.78 ± 0.03
Refinement strategy ($D=24$)
No-FT	3.52	0.931	3.11	3.35 ± 0.03
Decoder-FT	3.43	0.931	3.43	3.70 ± 0.03
Joint-FT	3.41	0.932	3.64	3.85 ± 0.03

Interpretation and Takeaways

Conceptually, the paper’s contribution is to move one-step generative decoding into a lower-dimensional latent space, where the single-step update is more manageable than in waveform space. This is paired with a decoder that is lightweight enough to preserve the latency benefit. The result is a fixed-cost inference pipeline: one DiT pass plus one deterministic VAE decode.

The ablations indicate that the dominant quality bottleneck is not the VAE decoder alone, but the generator-induced latent distribution. The decoder-only and joint fine-tuning experiments are especially important because they show that waveform-domain losses can be used to recover quality without giving up the one-step runtime profile.

The empirical evidence also suggests that, under this setup, larger latent generators do not automatically improve one-step performance. Instead, the smaller 140M model is the best reported trade-off. Likewise, the gains from increasing latent dimension shrink at higher dimensions: going from $D=8$ to $D=16$ lowers WER by 1.20 points, while going from $D=16$ to $D=24$ lowers it by a further 0.21 points. $D=24$ gives the strongest reported results, with $D=16$ close behind on several metrics.

Conclusion

The paper presents a practical one-step Token2Wav decoder built from latent-space MeanFlow plus a deterministic VAE decoder. On LibriSpeech test-clean, the proposed system substantially reduces inference latency, achieving up to a 17× RTF speedup over a representative multi-step baseline, while retaining competitive intelligibility and perceptual quality. The refinement strategies are a key part of the story: they reduce latent mismatch and improve speech fidelity without increasing inference-time cost.

The paper’s experimental scope is centered on LibriTTS training and LibriSpeech test-clean evaluation, and the main observed trade-offs are between latent compression, generator capacity, and fidelity. Within that scope, latent-space MeanFlow appears to be a compelling direction for real-time and on-device Token2Wav decoding under tight latency budgets.

Code & Implementation

The accompanying repository implements the Token2Wav system described in the paper "One-Step Token-to-Waveform Generation with MeanFlow in Latent Space." The core method lives in the code/ directory, which contains the training and fine-tuning code.

The speechflow/ subdirectory houses the MeanFlow latent generator, built on the DiT1D backbone, which performs the one-step generation of speech latents. The stable_audio_tools/ subdirectory includes VAE and audio autoencoder utilities adapted from the Stability-AI stable-audio-tools repository, which are used to encode audio into latents and decode latents back to waveforms.

Two key training scripts demonstrate the refinement strategies introduced in the paper to mitigate latent mismatch:

finetune_decoder.py: Implements decoder-only refinement by freezing the MeanFlow generator and fine-tuning the VAE decoder using losses such as MRSTFT, adversarial hinge loss, and feature matching.
finetune_joint.py: Implements end-to-end joint refinement where both the MeanFlow generator and VAE decoder are trained together using a differentiable sampling method, allowing waveform-domain losses to update both components.

The repository also includes a self-contained HTML demo under the demo/ directory showcasing generated audio samples.

Overall, the codebase is a research reference implementation of the proposed one-step Token2Wav system, organized into separate modules for MeanFlow training, refinement, and demonstration.