Akapulu Labs logo Akapulu Labs Research

Sparse Autoencoder Steering for TTS

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

Sparse Autoencoder Steering for TTS — method overview

The paper uses sparse autoencoders to interpret and control features in a text-to-speech language model's shared text-speech representation. This enables causal steering of speech attributes such as laughter, speaker gender, and speech rate without altering the spoken content.

  • llm
  • tts
  • multimodal
  • voice-cloning
  • prosody
  • autoregressive

Authors: Nikita Koriagin, Georgii Aparin, Nikita Balagansky, Daniil Gavrilov

Categories: cs.LG, cs.AI, cs.CL

Published 2026-06-08 · Updated 2026-06-08

Abstract

Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders on the LM backbone of CosyVoice3 and introduce a modality-aware auto-interp pipeline that labels each feature from where it fires-text-prefix context, 1-second speech clips, or both. The recovered features are interpretable, spanning phonemes, laughter, accent prompts and speaker gender. Steering through the SAE latent space shows these features are causal rather than merely descriptive: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features thus serve both as interpretability objects and as control directions for TTS synthesis.


Overview and main claims

This paper studies mechanistic interpretability for a text-to-speech (TTS) language model backbone, rather than for a text-only LLM. The core setting is CosyVoice3, whose LM backbone processes a mixed sequence containing a text/instruction prefix followed by autoregressively generated discrete speech tokens. The authors ask what sparse autoencoder (SAE) features recover from that shared residual stream, how those features differ by token modality, and whether the recovered features can be used as causal steering directions for speech generation.

The paper’s main claim is that BatchTopK SAEs trained on CosyVoice3 recover interpretable text-modal, audio-modal, and mixed features, and that some of these features are not merely descriptive: latent interventions on the SAE coordinates can materially control perceptual properties of generated speech, including laughter, speaker-gender cues, and speech rate, while preserving spoken content.

At a high level, the paper contributes three things: (1) SAE training on a TTS LM backbone at scale, (2) a modality-aware auto-interpretation pipeline adapted to the mixed text-speech sequence, and (3) a layer-wise analysis showing how feature modality shifts across depth.

Model, data, and SAE training setup

The target model is CosyVoice3, whose LM backbone is Qwen2.5-0.5B with hidden size $896$ and $28$ layers. The backbone receives a text prompt tokenized with BPE and generates $25$ Hz discrete speech tokens autoregressively. The authors attach SAEs to the residual stream of this backbone at multiple layers.

Training uses BatchTopK SAEs with dictionary size $d = 16{,}384$ and $k = 50$ active features per token. The training corpus is approximately $250$ million tokens from the Emilia dataset. The paper states that the SAE objective is the standard reconstruction-plus-sparsity loss with an auxiliary dead-feature loss.

The analysis uses the full layer sweep for modality and reconstruction studies, while layer $20$ is used for most qualitative examples and the steering case studies.

Component Reported setting
Backbone Qwen2.5-0.5B inside CosyVoice3
Hidden size / layers $896$ / $28$
Speech token rate $25$ Hz
SAE dictionary size $16{,}384$
Active features per token $k = 50$
Training data Approximately $250$M tokens from Emilia

Modality-aware interpretation pipeline

CosyVoice-aware view of our modality-aware SAE interpretation pipeline. CosyVoice3 synthesizes speech by passing text and prompt audio through tokenizers, an LLM backbone, discrete speech token generation, DiT flow matching, and a HiFi-GAN vocoder. Our analysis attaches SAEs to the LLM residual stream, uses exact text/speech boundaries to route top activations to text, audio, or mixed evidence, then labels and scores the resulting features.
CosyVoice-aware view of our modality-aware SAE interpretation pipeline. CosyVoice3 synthesizes speech by passing text and prompt audio through tokenizers, an LLM backbone, discrete speech token generation, DiT flow matching, and a HiFi-GAN vocoder. Our analysis attaches SAEs to the LLM residual stream, uses exact text/speech boundaries to route top activations to text, audio, or mixed evidence, then labels and scores the resulting features.

The paper’s interpretation pipeline is designed around a key complication: CosyVoice3 interleaves text prefix tokens and speech tokens in a single residual stream. That means a feature may activate on text-prefix context, on generated speech, or on both. Rather than forcing all evidence into a single format, the pipeline uses the exact token boundaries in each sequence to route the strongest activations to modality-specific evidence.

For each SAE feature, the authors collect the top-20 activating token positions across the dataset by encoding residual-stream activations through the frozen SAE. They use teacher forcing with the sequence layout $[\mathrm{sos} \mid \mathrm{instruct} \mid \mathrm{text} \mid \mathrm{task} \mid \mathrm{speech}]$, where the task token marks the text-to-speech transition. For speech positions, token indices are mapped to timestamps using the $25$ Hz rate, and the source audio is cropped to a $1$-second window centered on the peak position. Padded positions are masked before ranking and extraction.

Feature modality is determined from the composition of the top-20 activations. Let the speech-start position for example $i$ be $b_i$, computed from the prefix lengths and the task marker. A top activation at position $p$ is counted as speech if $p \ge b_i$ and is within the valid sequence length. The speech fraction is then the proportion of the top-20 activations that fall in the speech segment.

The modality thresholds are explicit: features are labeled audio-modal if their speech fraction is at least $0.8$, text-modal if it is at most $0.2$, and mixed otherwise. The paper emphasizes that this is necessary because absolute position is not semantically stable in randomly interleaved text/speech sequences.

Automatic labeling and held-out scoring

The authors adapt LLM-based auto-interpretation to this mixed-modality setting. They use Gemini 3.0 Pro as the labeler, with prompts that differ by feature modality:

  • Text-modal features are shown text evidence only: marked token context, activation value, and source text context.
  • Audio-modal features are shown only $1$-second speech clips centered on activating speech tokens.
  • Mixed features are shown both text evidence and speech clips, and are only labeled when both evidence types are present.

The prompt asks for a single concise sentence describing the property consistently associated with the evidence. For text features, the label may describe lexical, punctuation, language, prompt-style, or transcript properties; for audio features, it may describe acoustic, phonetic, or prosodic properties; for mixed features, it may describe a cross-modal relation if one is visible.

To evaluate label quality, the paper uses a detection-style protocol adapted from prior auto-interp work. Given a proposed label, a held-out scorer rates positive and negative evidence items for how well they match the label. The evaluation reports AUROC and balanced accuracy. To reduce leakage from label-writing examples, the authors use a rank-held-out split: the five strongest activations are reserved for labeling and lower-ranked activations are used for scoring.

For the layer-20 comparison, the paper evaluates all $668$ mixed features together with matched $668$-feature samples of text-modal and audio-modal features. This is the primary quantitative benchmark for auto-interpretation quality in the paper.

Layer sweep across the Qwen2.5-0.5B backbone. (a) SAE explained variance by token type. (b) Feature modality composition. Middle layers mix text and speech evidence, layers 16--20 become audio-heavy, and the final hidden state becomes mostly text-modal.
Layer sweep across the Qwen2.5-0.5B backbone. (a) SAE explained variance by token type. (b) Feature modality composition. Middle layers mix text and speech evidence, layers 16--20 become audio-heavy, and the final hidden state becomes mostly text-modal.

Layer-wise reconstruction and modality dynamics

The layer sweep shows that the SAE remains a strong reconstruction model across the backbone, but the feature dictionary changes its modality composition substantially with depth. The most important qualitative pattern is that early and middle layers are multimodal and mixed, late layers become strongly audio-heavy, and the final hidden state reverts to a mostly text-modal subspace.

The appendix gives concrete modality proportions across depth. At layer $0$, the dictionary is already strongly multimodal: $12.3\%$ text-modal, $45.1\%$ audio-modal, and $42.6\%$ mixed. Mixed features peak at layer $12$ with $47.3\%$. The late layers form an “audio-commitment zone”: audio-modal features dominate at layer $16$ ($76.1\%$), layer $18$ ($65.0\%$), and layer $20$ ($74.3\%$), while mixed features collapse to $4.1\%$ by layer $20$. Layer $23$, the final hidden state, sharply flips back to a text-dominant composition: $83.1\%$ text-modal, $14.3\%$ audio-modal, and $2.6\%$ mixed.

The paper also reports per-token explained variance on a $5{,}000$-sample sweep. Reconstruction quality is very high in early layers, with overall explained variance around $0.97$--$0.99$ at layers $0$--$8$, then declines through the body of the network to a minimum of $0.82$ at layer $20$. The final layer rebounds to $0.945$. Text positions are generally reconstructed at least as well as audio positions, with the largest text-audio gap at the audio-commitment layers.

This layer-wise pattern is one of the central mechanistic findings of the paper: the shared residual stream does not merely preserve the prefix text, but gradually develops sparse directions that are tied to the generated speech-token stream as acoustic prediction is formed.

Layer Text-modal Audio-modal Mixed Notable observation
0 12.3% 45.1% 42.6% Already strongly multimodal
12 Not stated explicitly Not stated explicitly 47.3% Mixed features peak
16 Minority 76.1% Declining Audio-commitment zone begins
18 Minority 65.0% Declining Audio remains dominant
20 Minority 74.3% 4.1% Audio-heavy, mixed nearly gone
23 83.1% 14.3% 2.6% Final hidden state reverts to text-modal
Held-out auto-interp scores for layer-20 features. Text labels are easiest to verify; mixed labels are hardest.
Held-out auto-interp scores for layer-20 features. Text labels are easiest to verify; mixed labels are hardest.

Auto-interpretation quality

The held-out detection-style evaluation confirms that the modality-aware labeling protocol works best for text features, reasonably well for audio features, and least well for mixed features. For layer $20$, the reported AUROC values are $0.921$ for text-modal labels, $0.653$ for audio-modal labels, and $0.558$ for mixed labels.

The layer-wise appendix reports the same ordering throughout the network. Text-modal labels are consistently easiest to verify, with AUROC roughly in the $0.90$--$0.94$ range. Audio-modal labels are weaker but remain above chance, around $0.65$--$0.72$. Mixed labels are the hardest and most variable, ranging roughly from $0.53$ to $0.69$.

This is an important practical conclusion: single-modality features are often well described by one concise sentence, while mixed features more often combine several correlated textual and acoustic properties and are therefore harder to summarize cleanly.

Setting AUROC Interpretation
Layer 20, text-modal 0.921 Easiest to verify
Layer 20, audio-modal 0.653 Above chance, but weaker
Layer 20, mixed 0.558 Hardest in aggregate
Rank-held-out auto-interp scores across completed layers. Text-modal labels remain consistently easiest to verify; audio-modal labels are above chance but weaker; mixed labels are the most variable.
Rank-held-out auto-interp scores across completed layers. Text-modal labels remain consistently easiest to verify; audio-modal labels are above chance but weaker; mixed labels are the most variable.

Representative SAE features

The paper’s qualitative examples show that the dictionary learns a mix of very local linguistic features and speech-acoustic features. Text-modal features often correspond to individual tokens, words, years, punctuation, or prompt-side descriptors of voice style and accent. Audio-modal features often correspond to phonemes, short phoneme sequences, laughter, breathing, stuttering, or accent cues. Mixed features sometimes connect the same word or phoneme-like event across both text and speech.

Examples from the appendix include a text feature for the word “British” in speaker-accent descriptions, an audio feature for human laughter, and a mixed feature for the phoneme sequence /ohl/ in both text and speech. Another representative set includes text features for the substring “ang”, the adjective “shrill”, and four-digit years; audio features for /k/, /if/ or /ef/, /ing/, screaming/shouting/heavy breathing, and a mixed feature for stutters, false starts, and hesitation markers.

Feature Modality Auto-interp label BAcc AUROC
1376 Text Word “British” in speaker-accent descriptions 1.000 1.000
233 Audio The sound of human laughter 0.750 0.750
5543 Mixed The phoneme sequence /ohl/ in text and speech 0.917 0.979

Probe-based feature analysis

Raw-residual probe ROC-AUC as a function of layer for laughter, emotion, and accent. Emotion and accent are shown as mean with min--max bands across their class-specific probes. All three concepts exceed 0.99 ROC-AUC by L=8.
Raw-residual probe ROC-AUC as a function of layer for laughter, emotion, and accent. Emotion and accent are shown as mean with min--max bands across their class-specific probes. All three concepts exceed $0.99$ ROC-AUC by $L = 8$.
Top-1 SAE feature ROC-AUC as a function of layer. This monosemanticity test asks whether a single SAE coordinate is sufficient to separate each concept from neutral speech. Laughter and emotion peak around L=12--16, while accent is most localized around L=8--12.
Top-1 SAE feature ROC-AUC as a function of layer. This monosemanticity test asks whether a single SAE coordinate is sufficient to separate each concept from neutral speech. Laughter and emotion peak around $L = 12$--$16$, while accent is most localized around $L = 8$--$12$.

Beyond auto-interpretation, the paper asks whether individual concepts are linearly decodable from the residual stream and whether those concepts are concentrated in SAE coordinates. The authors train binary logistic-regression probes for three speech-style concepts: laughter, emotion, and accent. Probes are evaluated on either the raw residual vector $h_L$ or the SAE latent vector $z_L = \mathrm{SAE}_L(h_L)$, with ROC-AUC reported under stratified $5$-fold cross-validation.

For all three concepts, raw-residual probes become nearly perfect very early in the network. In the layer-wise table, laughter reaches ROC-AUC $0.980$ by layer $4$ and $1.000$ by layer $8$; the emotion and accent probes likewise reach approximately perfect decodability by layer $8$. SAE-latent probes closely track raw-residual probes from around layer $8$, which the authors interpret as evidence that the sparse code preserves the relevant speech-style information while expressing it in dictionary coordinates.

The top-1 SAE analysis is the monosemanticity test: it keeps only the single SAE coordinate with the largest mean absolute probe coefficient and asks how much concept signal remains. The reported top-1 AUCs are below 0.6 at layer $0$, but rise sharply in the middle layers. Laughter peaks around layers $12$--$16$ at $0.924$--$0.929$, emotion peaks around layers $12$--$16$ at $0.927$--$0.928$, and accent is most localized around layers $8$--$12$ with AUC about $0.923$--$0.925$.

Concept / representation L0 L4 L8 L12 L16 L20 L23
Laughter, raw residual 0.885 0.980 1.000 1.000 1.000 1.000 1.000
Laughter, SAE latent 0.866 0.948 0.998 1.000 1.000 1.000 0.999
Emotion, raw residual mean 0.966 0.998 1.000 1.000 1.000 1.000 1.000
Accent, raw residual mean 0.896 0.986 0.998 1.000 1.000 1.000 1.000
Concept L0 L4 L8 L12 L16 L20 L23
Laughter 0.508 0.805 0.909 0.924 0.929 0.695 0.831
Emotion mean 0.507 0.886 0.763 0.927 0.928 0.887 0.856
Accent mean 0.510 0.762 0.925 0.923 0.835 0.789 0.836

The probing experiments are used in the paper as a bridge from interpretability to steering: they identify SAE features whose activations are predictive of controllable speech properties and suggest that some of these properties are concentrated enough to be manipulated through a small number of SAE coordinates.

Steering in SAE latent space

The paper’s causal test is to intervene on SAE coordinates during generation instead of adding a raw residual vector directly. For a hooked residual vector $h$, the authors compute SAE activations

$$ z = \sigma(W_{\mathrm{enc}} h + b_{\mathrm{enc}}) $$

then perturb selected coordinates with

$$ z' = z + \alpha \cdot s \odot \bar{Z}, $$

where $\alpha$ is the steering strength, $s$ is a sparse sign vector controlling polarity, and $\bar{Z}$ provides feature-wise activation scales. The modified latent vector is decoded back to residual space as

$$ \hat{h}' = W_{\mathrm{dec}} z' + b_{\mathrm{dec}}. $$

The hook is applied only to speech-token positions. During prefill it affects the speech-prompt segment; during autoregressive decoding it affects only the current generated speech token. Instruction, text-prefix, and task-token positions are left untouched.

Feature Auto-interp label Reported steering effect
14834 Laughter-like vocal events More laughter
11402 Speaker-gender cues Male/female shift
3024 Speech-rate variation Slower/faster speech
Steering effects for laughter, speaker-gender cues, and speech rate. Error bars show 1 SEM over the 40-voice by 10-text prompts grid.
Steering effects for laughter, speaker-gender cues, and speech rate. Error bars show $\pm 1$ SEM over the $40$-voice $\times$ $10$-text prompts grid.
Steering effects for laughter, speaker-gender cues, and speech rate. Error bars show 1 SEM over the 40-voice by 10-text prompts grid.
Steering effects for laughter, speaker-gender cues, and speech rate. Error bars show $\pm 1$ SEM over the $40$-voice $\times$ $10$-text prompts grid.
Steering effects for laughter, speaker-gender cues, and speech rate. Error bars show 1 SEM over the 40-voice by 10-text prompts grid.
Steering effects for laughter, speaker-gender cues, and speech rate. Error bars show $\pm 1$ SEM over the $40$-voice $\times$ $10$-text prompts grid.
Gender steering split by prompt-speaker gender. Feature 11402 shifts both male- and female-prompted generations toward the target gender.
Gender steering split by prompt-speaker gender. Feature 11402 shifts both male- and female-prompted generations toward the target gender.

The steering results are strong and specific. Feature $14834$ increases laughter probability from $0.015$ to $0.791$ at $\alpha = +60$. Feature $11402$ changes speaker-gender cues, moving wav2vec2 $P(\mathrm{male})$ from $0.629$ at baseline to $0.944$ at $\alpha = -50$ and $0.063$ at $\alpha = +50$. Feature $3024$ controls speech rate, changing voiced duration from $3.96$ s at baseline to $10.57$ s at $\alpha = -50$ and $2.75$ s at $\alpha = +50$, while preserving spoken content.

The gender plot also breaks the result down by the prompt speaker’s original gender, showing that feature $11402$ shifts both male- and female-prompted generations toward the target gender. The paper presents this as evidence that the SAE features are causal control directions rather than merely descriptive correlates.

How the paper positions the contributions

The discussion argues that the residual stream of a TTS LM backbone is genuinely multimodal. Early and middle layers mix text and speech evidence, late layers become strongly tied to the generated speech-token stream, and the final hidden state re-centers toward text-modal structure. The learned SAE features therefore track the model’s transition from text conditioning to acoustic realization.

The authors also emphasize that the features are useful at two levels simultaneously. First, they serve as interpretability objects: their auto-interpreted labels describe phonemes, laughter, accent prompts, speaker gender, years, and other text/acoustic cues. Second, they can be used as intervention directions in latent space to produce targeted changes in generation.

Limitations and caveats stated in the paper

  • Single-model scope: the results are for CosyVoice3-0.5B and may not transfer to larger or differently designed TTS models.
  • Circular evaluation risk: the labeler and scorer are both Gemini-based, so shared hallucinations could inflate scores; the authors explicitly call for human evaluation and a scorer-model ablation.
  • Partial auto-interp sweep: modality and reconstruction statistics are reported across layers, but detection-style auto-interp scores are available only for the completed subset of layers.
  • Temporal granularity: speech-token timestamps are at the $25$ Hz rate and do not localize sub-token ($40$ ms) acoustic onsets.
  • Negative sampling choice: negatives are drawn from other features, which tests label specificity but not robustness against representation-neighbor confounds.

Bottom-line takeaway

The paper shows that SAEs can be used fruitfully on a TTS language-model backbone, not just on text LLMs. The resulting dictionary learns a structured mix of text-modal, audio-modal, and mixed features; the authors can label those features with a modality-aware auto-interp pipeline; and selected features can be steered to causally alter speech generation. The strongest practical message is that sparse features in the shared residual stream are both interpretable and actionable control handles for TTS synthesis.