MELD

MELD: Mel-Spectrogram-Based Speech Language Modeling with Discrete Latent Variables

MELD uses discrete latent variables on mel-spectrograms to jointly train speech encoders and autoregressive models. This approach improves zero-shot text-to-speech and speech-to-text performance, and reduces issues like prolonged silence common in previous models.

tts
asr
autoregressive
voice-cloning

Authors: Sung-Lin Yeh, Wei Zhou, Gil Keren, Duc Le, Zhong Meng, Hao Tang, Jay Mahadeokar, Ozlem Kalinli, Alexandre Mourachko

Categories: eess.AS, cs.CL

Published 2026-05-28 · Updated 2026-05-28

Abstract

Recent speech language models rely on encoders that are optimized separately from autoregressive models. Since these encoders are unaware of the downstream objectives, the extracted representations may not be optimal for downstream tasks. To address this limitation, we introduce a discrete latent variable model on mel spectrograms that jointly optimizes the encoder and the speech language model. Joint optimization not only brings improvements over codec-based and other mel-spectrogram-based baselines on zero-shot Text-to-Speech (TTS) and Speech-to-Text (STT) tasks, but also effectively alleviates common issues in autoregressive mel-spectrogram modeling, such as prolonged silence generation and word omissions.

Introduction and motivation

The paper addresses a recurring limitation in speech language modeling: many modern systems are trained in two stages, where a speech encoder or codec is optimized independently from the autoregressive model that consumes its outputs. The authors argue that this decoupling makes the intermediate representation suboptimal for downstream objectives such as zero-shot text-to-speech (TTS), speech-to-text (STT), and joint TTS-STT modeling. Their central claim is that if the speech representation is learned jointly with the autoregressive predictor, then the representation can retain exactly the information needed for the target tasks, rather than whatever a standalone encoder happens to preserve.

MELD stands for Mel-Spectrogram-Based Speech Language Modeling with Discrete Latent Variables. It is built around mel-spectrogram frames rather than speech codecs, but unlike a plain mel autoregressive model, it introduces a discrete latent variable at every time step. The model is designed to combine two desirable properties: (1) discrete sampling, which previous work has found useful for suppressing stretched silence and improving generation control, and (2) direct optimization on mel-spectrograms, which avoids the task-information loss that can occur when speech is discretized by an upstream encoder that is not trained for the downstream objective.

The paper evaluates three settings: zero-shot TTS, STT, and a single model that can do both TTS and STT. Across these settings, the reported gains are strongest for STT, where joint optimization over mel-spectrograms consistently improves over codec-based and discretized-mel baselines. For TTS, MELD matches or exceeds codec-based and continuous-mel baselines while reducing the common failure mode of prolonged silence.

Core idea: a discrete latent autoregressive model over mel-spectrograms

Let $y=(y_1,\dots,y_M)$ be a byte-pair-encoding (BPE) text sequence and let $x=(x_1,\dots,x_T)$ be the sequence of mel-spectrogram frames. A standard autoregressive mel model factorizes the conditional likelihood as

$$ -\log p(x\mid y)=\sum_{t=1}^{T}-\log p(x_t\mid x_{<t},y). $$

The paper emphasizes that such models often behave like deterministic next-frame predictors and therefore need an auxiliary stop predictor, while also being prone to infinite silence or repeated low-information frames during generation. MELD modifies the generative process by inserting a discrete latent variable $z_t$ at each time step:

$$ p(x_t,z_t\mid x_{<t},y)=p(z_t\mid x_{<t},y)\,p(x_t\mid z_t,x_{<t},y). $$

The key modeling choice is that the latent $z_t$ is sampled from a categorical distribution over a codebook of size $K$, while the spectrogram frame is reconstructed conditioned on both the latent token and the autoregressive context. This keeps the latent space discrete, enabling top-$k$ / top-$p$ sampling and repetition penalties, but still lets the spectrogram decoder directly model continuous acoustic detail.

Variational training objective

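Because the marginal $\log \sum_{z_t} p(x_t,z_t\mid x_{<t},y)$ places a sum over all $K$ codes inside the logarithm at every time step, the paper trains with a variational bound instead. The approximate posterior is a frame-wise quantization network that assigns each mel frame a softmax over negative squared distances to the codebook vectors: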

$$ q(z_t\mid x_t)=\frac{\exp\left(-\lVert x_t-c_{z_t}\rVert^2/\tau\right)}{\sum_{k=1}^{K}\exp\left(-\lVert x_t-c_k\rVert^2/\tau\right)}, $$

where $c_k$ is the $k$-th codebook vector and $\tau=1$ in the reported experiments. The codebook is initialized with $k$-means centroids on mel frames and then frozen during training. The variational lower bound is

$$ \mathcal{L}_{\mathrm{VLB}}=\sum_{t=1}^{T}\Big[\mathrm{KL}\big(q(z_t\mid x_t)\,\|\,p(z_t\mid x_{<t},y)\big)-\mathbb{E}_{z_t\sim q(z_t\mid x_t)}\big[\log p(x_t\mid z_t,x_{<t},y)\big]\Big]. $$

The first term aligns the autoregressive latent predictor with the frame-wise quantizer. The second term reconstructs the next mel frame from the latent token and the history. The appendix gives the Jensen-based derivation of this bound.
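For readers without the appendix at hand, the textbook Jensen argument behind a bound of this form is sketched below. It writes $c=(x_{<t},y)$ for the context and assumes the factorization $p(x_t,z_t\mid c)=p(z_t\mid c)\,p(x_t\mid z_t,c)$; the paper's own derivation may differ in notation or detail.

$$ \log p(x_t\mid c)=\log\sum_{z_t}q(z_t\mid x_t)\,\frac{p(z_t\mid c)\,p(x_t\mid z_t,c)}{q(z_t\mid x_t)}\;\ge\;\sum_{z_t}q(z_t\mid x_t)\,\log\frac{p(z_t\mid c)\,p(x_t\mid z_t,c)}{q(z_t\mid x_t)} $$

$$ =\mathbb{E}_{z_t\sim q(z_t\mid x_t)}\big[\log p(x_t\mid z_t,c)\big]-\mathrm{KL}\big(q(z_t\mid x_t)\,\|\,p(z_t\mid c)\big). $$

Negating both sides and summing over $t$ turns this into an upper bound on the negative log-likelihood, made of one KL term and one reconstruction term per frame.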

The reconstruction term is implemented as mean squared error (MSE) under a Gaussian likelihood assumption. The paper models the reconstruction in two stages: a direct prediction plus a convolutional refinement, similar in spirit to a Tacotron 2 postnet.

$$ \mathrm{MSE}(x,\hat{x})=\frac{1}{T}\sum_{t=1}^{T}\Big[\lVert x_t-\hat{x}_t\rVert_2^2+\lVert x_t-(\hat{x}_t+\operatorname{conv}(\hat{x})_t)\rVert_2^2\Big]. $$
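As a rough illustration of the quantizer posterior and the KL term, here is a minimal pure-Python sketch on toy data. The function names `quantizer_posterior` and `kl_categorical`, the 2-dimensional frames, the 3-entry codebook, and the uniform stand-in for the model's predicted distribution are all assumptions for illustration; the paper uses 80-dimensional frames and $K=8192$.

```python
import math

def quantizer_posterior(frame, codebook, tau=1.0):
    """Softmax over negative squared distances from one frame to each codeword."""
    logits = [-sum((x - c) ** 2 for x, c in zip(frame, code)) / tau
              for code in codebook]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_categorical(q, p, eps=1e-12):
    """KL(q || p) for two categorical distributions given as probability lists."""
    return sum(qi * (math.log(qi + eps) - math.log(pi + eps))
               for qi, pi in zip(q, p) if qi > 0.0)

# Toy data (hypothetical): three codewords and one frame close to the second one.
codebook = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]]
frame = [0.9, 1.1]

q = quantizer_posterior(frame, codebook)
predicted = [1.0 / 3] * 3            # stand-in for the latent predictor's output
kl = kl_categorical(q, predicted)    # shrinks as the predictor matches q
```

The posterior puts most of its mass on the nearest codeword, and the KL term is zero only when the predicted distribution equals `q`.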

The model also uses a slowness regularizer

$$ \mathcal{L}_{\mathrm{slow}}=-\frac{1}{T-1}\sum_{t=1}^{T-1}\lVert \hat{x}_t-\hat{x}_{t+1}\rVert_2^2, $$

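which is added as an auxiliary loss. Because of the leading minus sign, minimizing this term rewards frame-to-frame change in the predicted spectrogram, so it acts as a penalty on overly static output. The authors report that this term is important for suppressing prolonged silence and for improving generation diversity.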
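A toy computation can make the sign convention concrete. The sketch below implements the formula as written above on plain Python lists; the function name `slowness_loss` and the example sequences are hypothetical.

```python
def slowness_loss(frames):
    """Negative mean squared difference between consecutive predicted frames."""
    num_frames = len(frames)
    if num_frames < 2:
        return 0.0
    total = 0.0
    for current, following in zip(frames[:-1], frames[1:]):
        total += sum((a - b) ** 2 for a, b in zip(current, following))
    return -total / (num_frames - 1)

# A constant sequence (for example, sustained silence) versus a changing one.
static = [[0.0, 0.0]] * 4
moving = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

static_loss = slowness_loss(static)   # no change between frames
moving_loss = slowness_loss(moving)   # each step changes by squared distance 1
```

The changing sequence receives the lower loss, which is the behavior one would want from a term meant to discourage long static stretches.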

Why the discrete latent space matters

A major empirical claim is that the discrete latent space is not just a modeling convenience. The paper reports that when the latent contribution is removed at inference time, quality collapses sharply. This supports the interpretation that the discrete variables carry essential acoustic information rather than being ignored by the decoder.

The authors also argue that discrete sampling gives better control over failure modes than a single Gaussian latent space. In particular, they contrast MELD with MELLE, whose Gaussian latent formulation can become stuck in silence once the latent distribution enters a silence region. By comparison, MELD can sample among multiple discrete codes, including codes associated with non-silent frames, and can therefore escape silence loops more reliably.

Architecture and parameterization

MELD uses a single decoder-only Transformer for both TTS and STT-style autoregressive prediction. Unless otherwise stated, the base configuration is a 12-layer Transformer with 16 attention heads, a model dimension of 1024, a feed-forward dimension of 4096, and dropout 0.2. The resulting model size is roughly 200M parameters.

Input and output representations

Text: BPE tokenization with a vocabulary size of 4096, used for both TTS and STT.
Speech: 80-dimensional log mel-spectrogram frames.
Latent space: discrete tokens from a codebook of size $K=8192$.
Special tokens: <TTS>, <STT>, and <EOS>. The end-of-sequence token is merged into the discrete vocabulary, so the model does not need a separate stop predictor.

Mel-spectrogram frontend

The paper extracts mel features using settings matched to the pre-trained HiFi-GAN vocoder: 80-dimensional log mel-spectrograms at 62.5 Hz, i.e. a 16 ms frame shift, 64 ms frame length, Hann windowing, and a 1024-point Fourier transform. The mel filterbank spans 80 Hz to 7600 Hz. Features are normalized with the global mean and variance computed from the training set.
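The frame settings above can be cross-checked with a little arithmetic. This sketch assumes 16 kHz audio (the LibriSpeech sampling rate) and ignores any padding at utterance edges; the helper `num_frames` is hypothetical.

```python
SAMPLE_RATE_HZ = 16_000   # assumption: 16 kHz audio, as in LibriSpeech
FRAME_SHIFT_MS = 16
FRAME_LENGTH_MS = 64

hop_samples = SAMPLE_RATE_HZ * FRAME_SHIFT_MS // 1000      # samples between frames
window_samples = SAMPLE_RATE_HZ * FRAME_LENGTH_MS // 1000  # samples per analysis window
frame_rate_hz = 1000 / FRAME_SHIFT_MS                      # frames per second

def num_frames(duration_s):
    """Approximate number of mel frames for a clip, ignoring edge padding."""
    return int(duration_s * frame_rate_hz)

prompt_frames = num_frames(3.0)  # a 3-second prompt, as used in the zero-shot TTS setup
```

Under these assumptions the 64 ms window is exactly 1024 samples, which matches the 1024-point Fourier transform, and a 3-second prompt is about 187 frames.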

For mel-to-waveform synthesis, the authors use a HiFi-GAN vocoder pre-trained on 585 hours of LibriTTS. Predicted mel frames are rescaled before vocoding because the model is trained on normalized features.

Quantization network and latent prediction

The quantization network $q(z_t\mid x_t)$ is not used at inference time; it only defines the variational target during training. The codebook is initialized with $k$-means and then frozen. The authors explicitly note that freezing the codebook avoids some of the instability that can occur with learned vector quantization.

The latent predictor and speech encoder both map their inputs to the Transformer dimension. Text tokens use a standard embedding lookup. Mel frames are mapped through a 3-layer MLP encoder $g_{\mathrm{Mel}}$ with hidden dimension 1024, GELU activations, and dropout 0.5. The authors also apply test-time dropout to $g_{\mathrm{Mel}}$ during zero-shot TTS inference to reduce training-inference mismatch; they report that this is necessary for stable synthesis.

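During generation, the model samples $z_t$ from the predicted categorical distribution $p(z_t\mid x_{<t},y)$ and passes the selected codeword to the reconstruction module. Because <EOS> is part of the discrete vocabulary, sampling it ends generation without a separate stop predictor.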
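To show how a discrete latent space allows standard language-model sampling controls, here is a minimal sketch of top-$k$ sampling with a repetition penalty. The function `sample_latent`, the toy logits, and the CTRL-style form of the penalty are assumptions for illustration; the source does not specify how the paper implements its repetition penalty.

```python
import math
import random

def sample_latent(logits, history, top_k=3, penalty=1.2, rng=None):
    """Sample one code index after a repetition penalty and top-k filtering."""
    rng = rng or random.Random()
    adjusted = list(logits)
    for idx in set(history):
        # Assumed CTRL-style penalty: lower the score of codes already emitted.
        if adjusted[idx] > 0:
            adjusted[idx] /= penalty
        else:
            adjusted[idx] *= penalty
    # Keep only the top_k highest-scoring codes.
    keep = sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)[:top_k]
    m = max(adjusted[i] for i in keep)
    weights = [math.exp(adjusted[i] - m) for i in keep]  # unnormalized softmax
    return rng.choices(keep, weights=weights, k=1)[0]

rng = random.Random(0)
logits = [2.0, 1.5, 0.1, -1.0, -3.0]   # toy scores over a 5-entry codebook

samples = [sample_latent(logits, history=[0, 0, 0], top_k=2, rng=rng) for _ in range(200)]
greedy = sample_latent(logits, history=[], top_k=1, rng=rng)
penalized = sample_latent(logits, history=[0], top_k=1, penalty=2.0, rng=rng)
```

With `top_k=1` the sampler is greedy, so the last two calls show the penalty moving the choice away from a code that was just emitted, which is the mechanism one would rely on to break out of a repeated silence code.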

Mel reconstruction module

Reconstruction is handled by a Tacotron-2-style module. The paper describes it as a linear layer plus residual MLP processing and a convolutional postnet. In the appendix, the postnet is specified as three convolutional layers with 512 filters of shape $5\times 1$, batch normalization, and a $\tanh$ activation on every layer except the last. The effective receptive field is $5\times 16$ ms.

The model predicts the next frame as

$$ \hat{x}_t = \mathrm{SpecNet}\big(h_t + g_{\mathrm{Mel}}(c_{z_t})\big), $$

where $h_t$ summarizes the autoregressive context and $c_{z_t}$ is the selected codeword. This same reconstruction pathway is used during training, while at inference the latent token is sampled from the model’s predicted distribution rather than taken from the posterior quantizer.

How MELD extends to STT and joint TTS-STT modeling

A key contribution of the paper is that the same discrete-latent formulation can be used for STT. The authors reinterpret next-token prediction over a union vocabulary $\mathcal{V}=\mathcal{V}_{\text{text}}\cup\mathcal{V}_{\text{latent}}$, where the speech side uses discrete latent codes and the text side uses BPE tokens. The same decoder-only Transformer can therefore support both modalities through different prompt tokens.

In the TTS configuration, the model predicts the next latent code and then reconstructs the next mel frame. In the STT configuration, the model is given speech input and predicts the BPE sequence. The paper’s notation writes the STT objective as a cross-entropy over the target text tokens, with the discrete latent formulation providing the shared modeling framework rather than a separate ASR architecture.

For joint TTS-STT training, the model is initialized on TTS for 80k steps and then trained with mixed speech-text sequences for both modes, sampled equally within each batch. The paper uses dropout 0.5 in TTS mode during both training and inference, while dropout is disabled for STT. SpecAugment is used on the speech side, with a weaker configuration for the joint model than in the STT-only setting because the original settings were found to be too aggressive.

Training setup and data

Training corpus: the 960-hour LibriSpeech training subset, LS960.
Optimization: Adam.
Batching: up to 50k frames per batch.
Gradient clipping: 10.
Learning rate: warmed up linearly for 1000 steps to $5\times10^{-4}$, held constant for 100k steps, then linearly decayed over the final 100k steps.
Total steps: 200k for all models.
Hardware: 16 NVIDIA V100 GPUs with 16 GB memory each.
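The learning-rate schedule in the list above can be sketched as a small function. Two details are assumptions: the constant phase is taken to end at step 100k so that the phases sum to the stated 200k total, and the linear decay is taken to end at zero. The name `learning_rate` is hypothetical.

```python
PEAK_LR = 5e-4
WARMUP_STEPS = 1_000
CONSTANT_UNTIL = 100_000   # assumption: constant phase ends at step 100k
TOTAL_STEPS = 200_000

def learning_rate(step):
    """Linear warmup, constant plateau, then linear decay to zero (assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    if step < CONSTANT_UNTIL:
        return PEAK_LR
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - CONSTANT_UNTIL)

schedule = {step: learning_rate(step) for step in (0, 500, 50_000, 150_000, 200_000)}
```

Under these assumptions the rate is half the peak at step 500 and again at step 150k.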

The appendix notes that longer training did not bring notable improvements. This is relevant because some compared systems, especially their MELLE reproduction, were sensitive to optimization details and did not clearly benefit from substantially longer training.

Codec baseline details

The paper compares MELD against a codec-based baseline, Codec-LM, built with a DAC encoder. The DAC setup uses 12 codebooks with a codebook size of 1024 and produces 12-level RVQ codes at 50 Hz. To capture residual dependencies, the authors apply a one-code delay to each level of the RVQ stack, resulting in a total delay of $(12-1)\times20$ ms. Twelve linear heads predict the 12 levels of RVQ codes for TTS; a separate linear head predicts BPE tokens for STT.
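The delay pattern described above can be illustrated on a toy code matrix. This is a sketch of a generic one-step-per-level delay, not the authors' implementation; the padding id `PAD` and the function `apply_delay` are hypothetical.

```python
NUM_LEVELS = 12
FRAME_MS = 1000 / 50   # 20 ms per code frame at 50 Hz
PAD = -1               # hypothetical padding id for positions with no code yet

def apply_delay(codes):
    """Shift RVQ level i right by i steps so each level lags the previous by one."""
    num_levels = len(codes)
    delayed = []
    for level, row in enumerate(codes):
        delayed.append([PAD] * level + list(row) + [PAD] * (num_levels - 1 - level))
    return delayed

# Toy example with 3 levels and 2 time steps instead of 12 levels.
toy = [[1, 2], [3, 4], [5, 6]]
toy_delayed = apply_delay(toy)

total_delay_ms = (NUM_LEVELS - 1) * FRAME_MS   # delay of the last level
```

With 12 levels at 50 Hz, the last level trails the first by 11 steps, i.e. 220 ms, which matches the $(12-1)\times20$ ms figure above.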

For fairer comparison on STT, the codec baseline’s codebooks are initialized and frozen from a pre-trained TTS-only Codec-LM. The paper reports that this initialization is crucial: without codebook initialization, the codec baseline fails to converge, with WERs above 100 on dev and test.

Evaluation protocol

Zero-shot TTS

The TTS task uses the first 3 seconds of an utterance as a prompt and asks the model to continue the utterance given the transcription. The evaluation uses LibriSpeech test-clean samples with durations between 4 and 10 seconds, totaling 2.2 hours. Each sample is generated 3 times and the scores are averaged.

The paper reports both subjective and objective metrics. Subjectively, it uses Similarity MOS (SMOS) for speaker similarity and Comparison MOS (CMOS) for naturalness relative to MELD. Objectively, it uses WER from a Conformer-Transducer and from Whisper-large, plus speaker-embedding cosine similarity (SIM) from a WavLM-finetuned speaker verifier. The authors use SIM purely as a proxy for speaker similarity.

The appendix describes the subjective setup in more detail: 43 tests across 40 speakers from LibriSpeech test-clean, with 5 listeners per screen, collected via Amazon Mechanical Turk. Each screen has two subtasks, and each receives 15-20 scores overall.

STT and joint modeling

For STT, the paper reports word error rates on LibriSpeech dev-clean, dev-other, test-clean, and test-other, using beam search with beam size 5. SpecAugment is applied in the STT setting. The joint TTS-STT experiment reports TTS metrics together with STT test-clean and test-other WERs.

Results: zero-shot TTS

The main TTS result is that MELD substantially improves over both codec-based and mel-only baselines while maintaining competitive speaker similarity. Compared with the authors’ codec baseline, MELD reduces WER from 5.3/4.8 to 2.4/1.9 when using BPE tokens and mel spectrograms at 62.5 Hz, while retaining the same SIM reported for the codec baseline with phoneme tokens and exceeding the BPE codec baseline.

The model also compares favorably to the paper’s reproduction of MELLE. MELD’s mel-based discrete latent sampling is presented as a better alternative to MELLE’s single-Gaussian latent space, especially because it avoids the repeated silence behavior that the authors observed in their MELLE reproduction.

Model	Text	Speech	Frequency (Hz)	WER ↓	SIM ↑
Ground truth	--	--	--	2.2 / 1.6	0.925
DAC	--	--	50.0	2.2 / 1.6	0.922
HiFi-GAN	--	--	62.5	2.2 / 1.6	0.903
VALL-E	Phn	Encodec	75.0	- / 5.0	0.868
Codec-LM	Phn	DAC	50.0	5.7 / 4.7	0.872
Codec-LM	BPE	DAC	50.0	5.3 / 4.8	0.864
MELD	BPE	Mel	62.5	2.4 / 1.9	0.872
MELD	BPE	Mel	31.3	2.5 / 1.9	0.855

The authors’ interpretation is that MELD matches the speaker similarity of the codec baselines (0.872, against 0.872 for Codec-LM with phoneme input and 0.864 with BPE input) while significantly improving content fidelity. Reducing the mel frame rate from 62.5 Hz to 31.3 Hz leaves WER almost unchanged (2.5/1.9) but lowers SIM from 0.872 to 0.855, suggesting that the higher frame rate helps preserve speaker characteristics.

In subjective evaluation on 43 samples, MELD also scores better than the codec baseline.

Model	SMOS ↑	CMOS ↑
Ground Truth	4.11 ± 0.10	0.27
Codec-LM	3.72 ± 0.15	-0.31
MELD (joint)	3.81 ± 0.12	-0.20
MELD	3.89 ± 0.06	0.0

In the objective results above, the mel-discrete latent model lowers WER by roughly 3 points absolute relative to the Codec-LM baselines (for example, 5.3/4.8 versus 2.4/1.9 with BPE text) while keeping speaker similarity comparable. The subjective ratings are consistent with this picture: MELD improves SMOS over Codec-LM (3.89 versus 3.72), and Codec-LM's CMOS of -0.31 relative to MELD indicates that listeners prefer MELD's naturalness.

Results: mel-only baselines and the role of discrete latent sampling

The paper compares MELD against two mel-based baselines: Mel-LM, which removes the latent space and behaves like a decoder-only Tacotron-style model, and MELLE, which samples from a Gaussian latent space. Both baselines are trained with the slowness penalty. Despite that, both lag MELD substantially on WER and speaker similarity.

The authors’ explanation is that a Gaussian latent space constrained by reverse KL divergence can become overconfident around silence, making it difficult to sample out of a silence state once the model enters it. Discrete latent sampling, in contrast, gives the model multiple candidate codes, including codes associated with voiced frames, and thus improves robustness against silence loops.

Model	$\mathcal{L}_{\mathrm{slow}}$	Frequency (Hz)	WER ↓	SIM ↑
Mel-LM	✓	62.5	4.7 / 4.2	0.825
MELLE	✓	62.5	4.8 / 4.2	0.826
MELD	✓	62.5	2.4 / 1.9	0.872
MELD	✗	62.5	6.0 / 3.7	0.862
MELD	✓	31.3	2.5 / 1.9	0.855

The sharp degradation without the slowness penalty is notable: WER increases from 2.4/1.9 to 6.0/3.7, and speaker similarity falls slightly. The paper also reports qualitatively that this setting produces more long pauses and silence. This supports the claim that the discrete latent formulation alone is not enough; the auxiliary anti-silence regularization is still important.

Results: STT with direct mel optimization

The STT experiments are where joint optimization over mel-spectrograms has the clearest benefit. The paper extends the decoder-only architecture to transcribe speech into BPE tokens, using the same model family but deactivating dropout in $g_{\mathrm{Mel}}$ and adding SpecAugment. The authors report that optimizing directly over mel-spectrograms gives lower WERs than codec-based baselines, showing that discretizing speech representations upstream can hurt task-relevant information preservation.

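They also compare against dMel, a discretized-mel ASR system. The 200M MELD model is better on the harder dev-other and test-other splits and roughly on par on the clean splits (4.0 versus 3.8 on dev-clean, 4.2 versus 4.2 on test-clean). The 260M MELD model, which is closest in size to the 258M dMel, is better on all four splits, suggesting that the advantage of direct mel modeling persists and may grow with model capacity.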

Model	Size	Hours	Dev clean	Dev other	Test clean	Test other
Moshi	7B	7M	-	-	5.8	-
dMel (ASR)	258M	960	3.8	10.3	4.2	10.4
Codec-LM	200M	960	6.1	16.5	6.4	16.4
MELD	200M	960	4.0	9.8	4.2	10.0
MELD, w/o SpecAug	200M	960	4.3	12.5	4.5	12.5
MELD	260M	960	3.6	9.0	3.5	9.2

The codec baseline is substantially worse than MELD. At the same 200M size, MELD lowers WER by 6.4 points absolute on test-other (16.4 to 10.0), which is the hardest reported split in the table. The 260M MELD variant improves further on every split (for example, 10.0 to 9.2 on test-other), indicating that the gains carry over to a larger model.

The authors also note that codebook initialization matters a great deal for the codec baseline: without it, the model does not converge and all reported WERs exceed 100. That observation is used to support the broader point that codec pipelines can be brittle when the learned representation is not tightly aligned with the downstream objective.

Results: joint TTS-STT modeling

The final experiment asks whether a single MELD model can support both TTS and STT effectively. The answer is yes, but with a modest tradeoff compared with task-specific training. Joint training reduces TTS quality somewhat relative to separately trained MELD, and STT WER increases as well, but the joint model still outperforms the joint dMel baseline and clearly beats the codec baseline.

Model	Training setup	TTS WER ↓	TTS SIM ↑	STT clean	STT other
Moshi	joint	-	-	5.8	-
dMel (ASR)	separate	-	-	4.2	10.4
dMel (ASR-TTS)	joint	-	-	7.5	15.3
Codec-LM	separate	5.3 / 4.8	0.864	6.4	16.4
MELD	separate	2.4 / 1.9	0.872	4.2	10.0
MELD	joint	2.8 / 2.2	0.870	4.9	12.1

This table shows the paper’s main multi-task message: joint optimization over mel-spectrograms reduces the gap between a task-specific system and a unified TTS-STT system. The joint MELD model slightly underperforms the separately trained MELD on both TTS and STT, but still provides a coherent single-model solution that is materially better than the joint dMel baseline.

Ablations and diagnostic analysis

The ablation study is especially important because it isolates the role of the discrete latent space and the repetition penalty. The paper finds that removing the latent signal at inference time causes a dramatic collapse in content fidelity and speaker similarity. This is the clearest evidence that the discrete latent tokens are doing real work rather than being ignored by the decoder.

The repetition penalty also matters. Without it, WER worsens and the synthesized utterances become noticeably longer than the ground truth, which the authors interpret as excessive silence generation and a larger number of deletions. With the penalty, deletion errors drop substantially.

Setting	WER ↓	SIM ↑	Total duration (min)	S / D / I
Ground truth	2.2 / 1.6	0.925	131.8	0 / 0 / 0
MELD	2.4 / 1.9	0.872	129.3	330 / 157 / 63
MELD, w/o repetition penalty	3.1 / 2.6	0.869	137.4	330 / 300 / 65
MELD, w/o $z_t$	52.3 / 51.7	0.520	> 200	--

The authors note that their objective can in principle admit collapsed solutions such as a constant $z_t$, but they did not observe convergence to such degenerate behavior in practice. The drop in deletion errors under the repetition penalty is what the paper connects to the suppression of word omissions.
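The S / D / I column in the table above counts substitutions, deletions, and insertions from a word-level edit-distance alignment. The sketch below shows one common way to obtain such counts; the function `wer_counts` and the example sentences are hypothetical, the reference is assumed to be non-empty, and this is not the scoring tool used in the paper.

```python
def wer_counts(reference, hypothesis):
    """Word error rate with substitution / deletion / insertion counts."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = (errors, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)   # every reference word deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)   # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost, subs, dels, ins = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (cost, subs, dels, ins)          # exact match
            else:
                diagonal = (cost + 1, subs + 1, dels, ins)  # substitution
            cost, subs, dels, ins = dp[i - 1][j]
            deletion = (cost + 1, subs, dels + 1, ins)
            cost, subs, dels, ins = dp[i][j - 1]
            insertion = (cost + 1, subs, dels, ins + 1)
            dp[i][j] = min(diagonal, deletion, insertion, key=lambda t: t[0])
    total, subs, dels, ins = dp[-1][-1]
    return {"wer": total / len(ref), "S": subs, "D": dels, "I": ins}

# The hypothesis omits "sat" (a deletion) and replaces "the" with "a" (a substitution).
result = wer_counts("the cat sat on the mat", "the cat on a mat")
```

A word omission in synthesized speech shows up as a deletion in this alignment, which is why the deletion count is the natural place to look for the effect of the repetition penalty.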

Comparison to related approaches

The related-work discussion clarifies how MELD differs from three major lines of prior work. First, codec-based LMs typically rely on an upstream RVQ encoder with multiple codebooks and hierarchical prediction. This increases complexity and memory use because several codes must be predicted per time step. MELD instead predicts one latent token per frame, which simplifies the autoregressive target while still preserving a discrete sampling interface.

Second, discretized-mel approaches such as dMel quantize mel features before autoregressive modeling. The paper’s experiments indicate that such discretization is not necessary and may in fact harm STT and joint modeling because the quantized representation can discard task-relevant detail.

Third, MELLE is a direct mel-spectrogram autoregressive method with a Gaussian sampling space. The paper positions MELD as a more principled alternative because its variational objective is explicit, its latent space is discrete, it does not require a separate stop predictor, and it extends naturally to STT.

Limitations and practical caveats

The paper is explicit about several limitations. First, codec-based and mel-spectrogram-based systems are not perfectly fair to compare, even when the same Transformer decoder is used, because the final vocoders differ: DAC decoder for codecs versus HiFi-GAN for mel spectrograms. This means some of the reported advantage may reflect differences in the synthesis back-end as well as the front-end representation.

Second, the authors state that they were not able to fully reproduce MELLE’s reported results on LS960, even though they tried to follow the training configuration closely. They suspect that a voice activity detection (VAD) pre-processing step may be important in MELLE, but the original paper did not provide enough detail to verify this. As a result, the comparison against MELLE should be read as a best-effort reproduction rather than a definitive head-to-head result.

Third, the work is focused on conditional speech generation and transcription. The authors explicitly mention that other speech tasks such as question answering and speech translation are outside the current scope and are natural next steps.

Ethical considerations

The paper states that all experiments are conducted on the public LibriSpeech dataset, under its CC BY 4.0 license, and that speakers are anonymized using numeric IDs. The main ethical concern raised is potential misuse for speech cloning and realistic voice generation. The authors note that the systems are trained on read speech only, but could still be adapted to generate speech for unseen speakers, so any deployment should involve explicit protocols for speaker consent and data use.

Bottom line

MELD’s main contribution is a variational, discrete-latent autoregressive model over mel-spectrograms that is jointly optimized for the speech representation and the downstream decoder. Empirically, this combination gives the paper its main benefits: strong zero-shot TTS, better control over silence-related failures than continuous Gaussian sampling, and substantially improved STT compared with codec-based and discretized-mel baselines. The work’s most important technical message is that jointly learned mel representations plus discrete latent sampling can be a practical alternative to two-stage speech language modeling.