Jun 28, 2026

This Week in Conversational AI Research: June 22–28, 2026

A packed week for real-time speech and digital humans: native streaming audiovisual agents moved toward single-model full duplex, while TTS work pushed on scale, latency, controllability, and preference optimization. Several papers also showed that speech systems still struggle to use non-lexical information—emotion, accent, prosody, and expressive intent—when it matters operationally.

Real-time networked conversation recording highlighting low-latency, full-duplex interaction. From Wan-Streamer.

Full-duplex audiovisual agents become model-native

The week’s most ambitious systems stop treating speech, video, timing, and avatar motion as separate post-processing modules. Instead, they either put multimodal streams into one causal model or add tightly synchronized face/video token pathways to existing speech dialogue stacks.

The clearest statement of this shift is Wan-Streamer, which proposes an end-to-end interactive foundation model where text, audio, and video are both inputs and outputs inside a single Transformer. The important engineering idea is not just “multimodal generation” but native streamability: causal encoders and decoders, block-causal attention, multimodal token scheduling, and 160 ms streaming units that together target roughly 200 ms of model-side response latency for full-duplex interaction.

Wan-Streamer

Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

Wan-Streamer is an end-to-end interactive foundation model unifying text, audio, and video in a single Transformer for real-time full-duplex audio-visual interaction. It jointly learns perception and generation without separate modules, enabling sub-second latency streaming with synchronized multimodal responses.

That is a direct challenge to the familiar VAD → ASR → LLM → TTS → animation/video cascade. Wan-Streamer tries to learn response timing, turn management, speech generation, visual generation, and cross-modal synchronization jointly, which is exactly the kind of coupling that cascades usually approximate with hand-built buffering and interruption logic.
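To make the block-causal attention idea concrete, here is a minimal sketch of the mask it implies. It assumes each streaming unit maps to a fixed-size block of tokens, that tokens attend freely within their own block, and that they attend only backwards across blocks. The function name and the tokens-per-block value are hypothetical; the source does not describe Wan-Streamer's actual token layout.

```python
from typing import List


def block_causal_mask(num_tokens: int, block_size: int) -> List[List[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    Tokens see every token in their own block and in all earlier blocks,
    but nothing from later blocks (assumed layout, for illustration only).
    """
    mask = []
    for q in range(num_tokens):
        q_block = q // block_size
        mask.append([(k // block_size) <= q_block for k in range(num_tokens)])
    return mask


if __name__ == "__main__":
    # Three blocks of two tokens each; "#" marks an allowed attention edge.
    for row in block_causal_mask(num_tokens=6, block_size=2):
        print("".join("#" if allowed else "." for allowed in row))
```

Because nothing in a block depends on later blocks, the model can emit one unit as soon as the matching input unit has arrived, which is what makes a per-unit latency budget meaningful.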

Moshi-Face attacks a narrower but very practical version of the same problem: how to add a face to a low-latency full-duplex speech model without breaking streaming. It builds a VQ-VAE face codec over 3D head meshes, then extends Moshi with face token streams and a non-autoregressive Face Transformer that predicts per-frame face tokens in parallel from the dialogue model’s hidden state plus text/audio embeddings.

Moshi-Face

Integrating Facial Generation into Full-Duplex Spoken Dialogue Systems

Moshi-Face extends full-duplex spoken dialogue by adding real-time facial motion generation synchronized with speech. It uses discrete tokens for 3D face motion, enabling natural low-latency audiovisual dialogue without losing semantic quality.

The contrast is useful: Wan-Streamer is a unified audio-video foundation model, while Moshi-Face is a modular extension of a proven audio-first full-duplex architecture. For teams already invested in speech dialogue models, the face-token interface is a plausible migration path toward embodied agents.

InteractiveAvatar focuses on the downstream avatar renderer rather than the dialogue core. It combines autoregressive distillation for real-time diffusion video generation, Long-Short Visual Memory for identity and appearance consistency over arbitrarily long streams, and a Reasoning-Reaction Module that uses user intent to choose speech/actions and stable post-action states.

InteractiveAvatar

InteractiveAvatar: Real-Time Streaming Video Generation for Consistent and Intent-Aware Avatars

A real-time streaming avatar system that ensures long-term visual consistency and enables context-aware speech and actions by understanding user intent. It combines memory mechanisms for visual coherence with an intent reasoning module for interactive, natural avatar behavior beyond standard audio-driven models.

Together, these papers show three layers of the same stack becoming stream-aware: the interaction model, the face token layer, and the video avatar renderer. The common requirement is persistent state under low latency—what the agent has heard, what it is saying, how its face is moving, and what its visual identity should remain over minutes of interaction.

Universal TTS, streaming synthesis, and natural-language control

Speech generation work split between two goals: scaling general-purpose TTS/audio models, and making synthesis start earlier or obey more flexible user instructions. The strongest systems are converging on language-model-like interfaces, but with very different choices about discrete codecs, continuous latents, and conditioning.

ZONOS2 is the major scale-and-release story for TTS this week: an 8B-parameter autoregressive MoE TTS model with 900M active parameters, trained on more than 6M hours and released with weights and inference code. Its design choices are deployment-relevant: byte-level UTF-8 text instead of phonemes, a single ECAPA-derived speaker prefix for zero-shot cloning, and a delayed DAC token pattern that preserves codebook dependencies while keeping streaming practical.

ZONOS2

ZONOS2 Technical Report

ZONOS2 is an advanced text-to-speech model that excels in naturalness, prosody, and zero-shot voice cloning across multiple languages. It uniquely combines a large-scale mixture-of-experts architecture with a massive multilingual training corpus and simplified conditioning for high-quality, low-latency streaming TTS.
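For readers unfamiliar with delayed codec-token patterns, the sketch below shows one common form: codebook k is shifted right by k steps, so that at each decoding step the model predicts a coarser codebook for the current frame alongside finer codebooks for earlier frames. This is an illustration of the general technique, not ZONOS2's exact schedule, which the summary above does not specify. The `PAD` value and function names are hypothetical.

```python
from typing import List

PAD = -1  # hypothetical padding id for positions with no code yet


def apply_delay(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift codebook k right by k steps. Input is [num_codebooks][num_frames]."""
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    delayed = []
    for k, row in enumerate(codes):
        tail = num_codebooks - 1 - k  # padding needed to keep rows equal length
        delayed.append([pad] * k + list(row) + [pad] * tail)
    return delayed


def undo_delay(delayed: List[List[int]]) -> List[List[int]]:
    """Invert apply_delay, recovering frame-aligned codes for the codec decoder."""
    num_codebooks = len(delayed)
    num_frames = len(delayed[0]) - (num_codebooks - 1)
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


if __name__ == "__main__":
    frame_aligned = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    for row in apply_delay(frame_aligned):
        print(row)
```

The cost of this layout is a fixed startup delay of `num_codebooks - 1` steps before the first complete frame can be decoded, which is small compared with waiting for a whole utterance.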

Where ZONOS2 scales an autoregressive codec-token recipe, S5-TTS directly targets the latency bottleneck in cascaded LLM→TTS systems. It modifies T5-TTS for word-by-word streaming with limited lookahead, using monotonic alignment learning, lookahead-causal masking, Conv-based auxiliary attention, and multi-source distillation to recover naturalness and speaker similarity under constrained context.

S5-TTS

Streaming T5-based Text-to-Speech Synthesis with Limited Lookahead

S5-TTS enables streaming, low-latency text-to-speech synthesis by generating speech word-by-word with limited lookahead. It preserves quality and speaker similarity using monotonic alignment and lookahead-causal masks, making it ideal for real-time conversational AI systems.

This matters because upstream LLMs already stream text tokens; if TTS waits for a full sentence, the agent still feels slow. S5-TTS is a concrete recipe for letting speech begin after only a few words while keeping zero-shot speaker conditioning close to the full-context baseline.
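A lookahead-causal mask of the kind described for S5-TTS can be sketched as follows. The assumption here is that a monotonic alignment assigns each speech frame to a word index, and that a frame may attend to text up to a fixed number of words beyond its own. The names `frame_to_word` and `lookahead` are illustrative, not taken from the paper.

```python
from typing import List


def lookahead_causal_mask(
    frame_to_word: List[int], num_words: int, lookahead: int
) -> List[List[bool]]:
    """mask[f][w] is True when speech frame f may attend to word w.

    frame_to_word[f] is the word that frame f is aligned to (assumed to be
    monotonic non-decreasing). Each frame sees its own word, all earlier
    words, and at most `lookahead` future words.
    """
    return [
        [w <= current_word + lookahead for w in range(num_words)]
        for current_word in frame_to_word
    ]


if __name__ == "__main__":
    # Four frames aligned to words 0, 0, 1, 2 of a four-word sentence.
    for row in lookahead_causal_mask([0, 0, 1, 2], num_words=4, lookahead=1):
        print("".join("#" if allowed else "." for allowed in row))
```

Under this mask, synthesis of the first word can begin once `lookahead + 1` words have arrived from the upstream LLM, rather than after a full sentence.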

Bagpiper-TTS pushes on the control interface instead of only latency. It treats natural language as the universal frontend: a user request is internally converted into a planning trace and rich caption that contains transcript, speaker/style/prosody metadata, and task-specific constraints, then speech is generated from that caption. The same model is evaluated across standard TTS, multi-talker dialogue, intent-to-speech, role-play, and singing-style synthesis.

Bagpiper-TTS

Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis

Bagpiper-TTS is a universal speech synthesis system using natural language to guide speech generation with rich details. It replaces rigid metadata inputs and supports diverse tasks like multi-talker dialogue, intent-to-speech, role-play, and singing voice synthesis in a single unified model.

AudioCALM asks whether a single model can cover speech, sound, and music without giving up autoregressive in-context behavior or continuous-audio fidelity. It replaces the softmax token head with a thin flow-matching head over continuous audio latents, uses block-causal AR-Flow attention for arbitrary-length generation, and adds an asymmetric speech-only residual expert to reduce interference between tightly aligned transcript-to-speech conditioning and diffuse sound/music captions.

AudioCALM

AudioCALM: Continuous Autoregressive Language Modeling for Universal Audio Generation

AudioCALM is a universal audio generation model that autoregressively predicts continuous audio latents to unify speech, sound, and music synthesis. It balances modality differences with asymmetric experts and descriptive conditioning for high-quality, variable-length, end-to-end audio generation.

A smaller but important systems paper reminds us that text normalization and pronunciation are still not “solved” by end-to-end modeling. The Japanese G2P benchmark finds that strong LLMs, especially in a parse mode where the model performs morphological analysis and rules handle kana normalization, can beat conventional analyzers and improve kana-input TTS pronunciation.

LLM-Based Japanese G2P Benchmark

Benchmarking Large Language Models for Grapheme-to-Phoneme Conversion: A Japanese Case Study

This work benchmarks over 30 large language models on Japanese grapheme-to-phoneme conversion, comparing novel prompting pipelines to traditional morphological analyzers, and shows that LLMs can outperform classical methods for text-to-speech pronunciation accuracy.

Finally, VoxCPM2 LoRA adaptation gives a pragmatic answer for low-resource deployment: a single zero-initialized LoRA adapter trained on about 26 hours of Khmer/Korean improves Khmer MOS significantly. For Korean, which the base model already handles well, the adapter offers no benefit and even causes some degradation at high rank.

VoxCPM2 LoRA TTS Adaptation

Closing the Quality Gap in Low-Resource Text-to-Speech: LoRA Fine-Tuning of VoxCPM2 for Khmer and Korean

This paper presents a parameter-efficient LoRA fine-tuning method to adapt a large pretrained TTS model for low-resource Khmer and Korean. A single shared adapter improves Khmer speech quality notably while maintaining Korean performance without the cost of full fine-tuning.
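Why zero initialization matters is easiest to see in code. The sketch below follows the standard LoRA formulation, where the effective weight is W + (alpha / r) * B A and B starts at zero, so the adapted model is identical to the base model before training. Which factor the paper zeroes, and its rank and alpha values, are not stated in the summary above; the choices here are assumptions.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def init_lora(d_out: int, d_in: int, rank: int, seed: int = 0) -> Tuple[Matrix, Matrix]:
    """A gets small random values, B is all zeros (assumed convention)."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
    b = [[0.0] * rank for _ in range(d_out)]
    return a, b


def lora_effective_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / rank) * B @ A without modifying W."""
    scale = alpha / len(a)
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


if __name__ == "__main__":
    w = [[1.0, 2.0], [3.0, 4.0]]
    a, b = init_lora(d_out=2, d_in=2, rank=1)
    # With B == 0 the adapter is a no-op, so training starts from the base model.
    print(lora_effective_weight(w, a, b, alpha=8.0) == w)
```

Starting from a no-op also means a language the base model already handles well has nothing to gain at step zero and can only move away from a good solution, which is consistent with the Korean result reported above.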

The takeaway for builders is that “large multilingual TTS” still needs careful interfaces at the edges: streaming masks for latency, rich captions for user control, G2P for pronunciation-critical languages, and parameter-efficient adaptation when pretraining coverage is uneven.

Prosody, accent, voice cloning, and preference optimization

A large cluster of papers treated speech style as something to control, preserve, or optimize—not merely something the model should imitate implicitly. The technical approaches range from codec design and disentangled embeddings to RL, test-time adaptation, and inference-time guidance decomposition.

Representations for controllable style

ProsoCodec redesigns the codec bottleneck for voice conversion. Instead of trying to fully disentangle prosody as an independent stream, it conditions the encoder and decoder on text and speaker embeddings so the discrete latent is pressured to store the conditional residual—prosodic variation not explained by content or speaker identity. That improves prosody preservation while reducing source-timbre leakage.

ProsoCodec

ProsoCodec: Prosody-Oriented Speech Codec for Voice Conversion

ProsoCodec is a prosody-oriented speech codec designed for voice conversion. By conditioning on text and speaker embeddings, it isolates and preserves residual prosody, improving prosody preservation and reducing source timbre leakage during voice conversion.

LombardTTS makes clarity controllable along two axes: vocal effort and articulation. Built on Matcha-TTS, it uses pseudo-labels from speaking-style data to support continuous control and even word-level emphasis, aiming to reproduce the intelligibility gains of Lombard/clear speech in noisy listening conditions.

LombardTTS

Synthesizing the Lombard Effect: Multi-Level Control of Speech Clarity and Vocal Effort in TTS

This paper presents a TTS system that independently controls vocal effort and articulation to simulate the Lombard effect, enhancing speech clarity and intelligibility in noisy conditions. It enables continuous multi-level and word-level control for nuanced, context-specific speech emphasis.

CrossAccent-TTS applies the same spirit to accent. It uses an adversarial accent-suppression module to make the reference-derived speaker/style latents less accent-predictive, then reintroduces accent explicitly through learned language embeddings and an Accent Intensity Controller that interpolates between accents at inference time.

CrossAccent-TTS

CrossAccent-TTS: Cross-Lingual Accent-Intensity Controllable Text-to-Speech via Disentangled Speaker and Accent Representations

CrossAccent-TTS enables precise control of accent and accent intensity in cross-lingual TTS by disentangling speaker and accent features and conditioning synthesis on learned language embeddings. It preserves speaker identity while allowing smooth accent modulation, performing well on Indic and foreign English accents.
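The simplest reading of "interpolates between accents" is a linear blend between two learned accent embeddings, sketched below. This is an assumption made for illustration: the paper's Accent Intensity Controller may use a different parameterization, and the embedding values and function name here are made up.

```python
from typing import List


def blend_accent(
    source_accent: List[float], target_accent: List[float], intensity: float
) -> List[float]:
    """Linearly interpolate accent embeddings.

    intensity = 0.0 keeps the source accent, 1.0 gives the full target accent.
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    return [
        (1.0 - intensity) * s + intensity * t
        for s, t in zip(source_accent, target_accent)
    ]


if __name__ == "__main__":
    neutral = [0.0, 1.0, 0.0]  # toy embedding, not real model output
    accented = [1.0, 0.0, 2.0]
    for level in (0.0, 0.5, 1.0):
        print(level, blend_accent(neutral, accented, level))
```

The blend only gives a clean control if the speaker and style latents carry little accent information themselves, which is the role the adversarial suppression module plays in the description above.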

OscillaTTS goes lower in the stack: it changes the decoder nonlinearity. The proposed adaptive oscillatory activation, `x + tanh(α sin²(x))`, is meant to preserve a harmonic inductive bias while making periodic modulation more stable and adaptable for sharp pitch and energy transitions in expressive speech.

OscillaTTS

Adaptive Oscillatory Inductive Bias for Modeling Sharp Prosodic Dynamics in Diffusion-Based TTS

OscillaTTS enhances diffusion-based text-to-speech by introducing an adaptive oscillatory activation that improves modeling of sharp prosodic transitions and rapid pitch variations, enabling more expressive and stable speech synthesis compared to fixed periodic functions.
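The activation quoted above is short enough to write out directly. The scalar sketch below treats α as a plain float; in a real model it would presumably be a learned parameter, which is an assumption here.

```python
import math


def adaptive_oscillatory(x: float, alpha: float) -> float:
    """Compute x + tanh(alpha * sin(x)^2), the activation quoted for OscillaTTS."""
    return x + math.tanh(alpha * math.sin(x) ** 2)


if __name__ == "__main__":
    for x in (0.0, math.pi / 2, math.pi, 2.0):
        print(f"x={x:.3f}  alpha=0.5 -> {adaptive_oscillatory(x, 0.5):.4f}  "
              f"alpha=4.0 -> {adaptive_oscillatory(x, 4.0):.4f}")
```

Two properties follow from the formula itself: with α = 0 the function reduces to the identity, and because tanh is bounded, the periodic term can never shift the output by more than 1 no matter how large α grows.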

Optimizing speech with rewards and guidance

FlowTTS-GRPO brings online RL to flow-matching TTS by converting deterministic ODE sampling into stochastic SDE trajectories, enabling GRPO-style exploration directly in the FM component. Its reward mix targets speaker similarity, ASR intelligibility, and DNSMOS-style perceptual quality, with experiments on CosyVoice 3.0 and F5-TTS.

FlowTTS-GRPO

FlowTTS-GRPO: Online Reinforcement Learning with Multi-Objective Reward Optimization for Flow-Matching Based Text-to-Speech

FlowTTS-GRPO uses online reinforcement learning to fine-tune flow-matching TTS models with multi-objective rewards for speaker similarity, quality, and intelligibility. It enables exploration via stochastic sampling without auxiliary models, improving voice cloning and cross-lingual transfer.
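The GRPO part of this recipe reduces to a small computation: sample a group of outputs for the same prompt, score each one, and normalize rewards within the group. The sketch below shows that step with a weighted multi-objective reward. The weights and the example scores are hypothetical, and the paper's actual reward scaling may differ.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def combined_reward(
    speaker_sim: float,
    intelligibility: float,
    quality: float,
    weights: Sequence[float] = (1.0, 1.0, 1.0),  # hypothetical weighting
) -> float:
    """Weighted sum of the three reward terms named in the text."""
    w_sim, w_int, w_q = weights
    return w_sim * speaker_sim + w_int * intelligibility + w_q * quality


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: reward standardized within its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four stochastic samples for one prompt, with toy scores in [0, 1].
    samples = [(0.80, 0.95, 0.70), (0.60, 0.90, 0.75), (0.85, 0.50, 0.65), (0.70, 0.92, 0.72)]
    rewards = [combined_reward(*s) for s in samples]
    print(group_relative_advantages(rewards))
```

This is also why the ODE-to-SDE conversion matters: a deterministic sampler would produce identical samples per prompt, every advantage would be zero, and there would be no learning signal.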

VoiceTTA moves the adaptation loop to inference time. For hard prompts—dialects, children’s speech, slurred speech, crosstalk—it optimizes lightweight prefixes with GRPO using rewards for F0 variation, energy variation, speaker similarity, and Whisper-based intelligibility, rather than fine-tuning the whole zero-shot TTS model.

VoiceTTA

VoiceTTA: Enhancing Zero-Shot Text-to-Speech via Reinforcement Learning-Based Test-Time Adaptation

VoiceTTA enhances zero-shot text-to-speech by using reinforcement learning for test-time adaptation that optimizes lightweight prefixes with style and intelligibility rewards. It adapts pretrained models on unseen speech styles at inference without large fine-tuning datasets.

Joint Residual Reweighting is an inference-time alternative for flow-matching zero-shot TTS. By evaluating null, text-only, speaker-only, and full branches, it decomposes classifier-free guidance into text, speaker, and joint residuals, then reweights the speaker and joint terms separately to improve speaker similarity without the usual hit to text correctness.

Joint Residual Reweighting

Joint Residual Reweighting for Classifier Free Guidance in Flow-Matching Zero-Shot TTS

The paper presents a new method for zero-shot TTS that separates classifier-free guidance into text, speaker, and joint residuals, allowing better control over speaker similarity and text correctness. This reduces the trade-off in prior methods by independently weighting speaker and joint terms during inference.
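One way to read this decomposition is as inclusion-exclusion over the four branches. In the sketch below, the text residual is `text - null`, the speaker residual is `speaker - null`, and the joint residual is whatever the full branch adds beyond both. Setting all three weights to 1 recovers the full conditional prediction exactly. This form is an assumption consistent with the description above, not the paper's verified equation, and the weight names are hypothetical.

```python
from typing import List


def reweighted_guidance(
    v_null: List[float],
    v_text: List[float],
    v_spk: List[float],
    v_full: List[float],
    w_text: float = 1.0,
    w_spk: float = 1.0,
    w_joint: float = 1.0,
) -> List[float]:
    """Combine four flow-matching velocity predictions with separate weights."""
    guided = []
    for n, t, s, f in zip(v_null, v_text, v_spk, v_full):
        r_text = t - n           # effect of text conditioning alone
        r_spk = s - n            # effect of speaker conditioning alone
        r_joint = f - t - s + n  # interaction term present only with both
        guided.append(n + w_text * r_text + w_spk * r_spk + w_joint * r_joint)
    return guided


if __name__ == "__main__":
    v_null, v_text = [0.0, 1.0], [1.0, 2.0]
    v_spk, v_full = [0.5, 0.5], [2.0, 3.0]
    print(reweighted_guidance(v_null, v_text, v_spk, v_full))             # equals v_full
    print(reweighted_guidance(v_null, v_text, v_spk, v_full, w_spk=2.0))  # stronger speaker term
```

The practical cost is four forward passes per sampling step instead of the two used by standard classifier-free guidance, in exchange for being able to raise speaker strength without scaling the text term.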

These papers collectively make style control more explicit. Some control it through learned latent factorization, some through reward-driven adaptation, and some through sampler algebra. For production voice cloning, the interesting pattern is that speaker similarity, text correctness, prosody, and intelligibility are increasingly treated as competing objectives that need knobs—not as a single scalar “quality” score.

Speech LLMs, ASR, and cross-modal alignment

Several papers probed how speech representations align with text and semantics, while others improved streaming or low-resource ASR with more structured objectives. A recurring lesson is that speech models benefit when alignment is shaped deliberately rather than left to emerge only at the final layer or final decoding step.

Interleaved Speech Language Models Latently Work In Text provides one of the week’s most illuminating analyses. Using logit-lens probes, it shows that interleaved speech-text LMs often pass through an implicit transcription phase: intermediate layers make the spoken word’s text token decodable, then the model predicts in text space before returning to speech-token generation.

Interleaved SLMs Latent Text

Interleaved Speech Language Models Latently Work In Text

This paper shows that interleaved speech-text language models internally transcribe speech into text tokens within intermediate layers, without explicit transcription training. This latent transcription suggests that the models use text as an intermediate representation when processing speech.

That finding helps explain why text initialization and interleaved data can improve speech-only behavior: the model may be using text as an internal workspace. It also frames the design problem for SpeechLLMs—if latent text is doing much of the reasoning, we need to understand when that is helpful and when it discards paralinguistic information.
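For readers who have not used a logit lens, the probe is simple: take the hidden state at each layer, project it through the model's output embedding, and see which token wins. The sketch below uses toy vectors and a two-token vocabulary purely to show the mechanics. It omits the final layer normalization that real probes usually apply, and the numbers are invented, not results from the paper.

```python
from typing import List


def logit_lens(
    hidden_states: List[List[float]],
    unembedding: List[List[float]],
    vocab: List[str],
) -> List[str]:
    """Decode the top token at every layer by projecting onto the output embedding.

    hidden_states: one vector per layer for a single position.
    unembedding:   one row per vocabulary item.
    """
    decoded = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembedding]
        best = max(range(len(logits)), key=logits.__getitem__)
        decoded.append(vocab[best])
    return decoded


if __name__ == "__main__":
    vocab = ["cat", "<speech_17>"]          # one text token, one speech token
    unembedding = [[1.0, 0.0], [0.0, 1.0]]  # toy output embedding
    layers = [[0.1, 0.9], [0.8, 0.2], [0.2, 0.7]]  # early, middle, late (toy values)
    print(logit_lens(layers, unembedding, vocab))
```

In this toy trace the middle layer decodes to the text token while the first and last decode to a speech token, which is the shape of the pattern the paper reports.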

CORTIS makes the same interface question practical for task-oriented voice agents. It fine-tunes a spoken language model using only text-form structured-output supervision, then runs speech-to-function-call or speech-to-semantic-frame inference without task-specific speech-target pairs. Under acoustic degradation, it can preserve higher-level task semantics better than matched ASR→LLM cascades.

CORTIS

CORTIS: Text-Only Adaptation of Spoken Language Models for Task-Oriented Voice Agents

CORTIS adapts spoken language models for task-oriented voice agents using text-only supervision, enabling direct speech-to-structured-output generation without paired speech-target data. It is more robust than ASR-LLM cascades under noise, especially in preserving high-level semantic understanding.

CAAD tackles a different SpeechLLM weakness: linguistic priors dominating the acoustic signal. It distills contrastive audio-aware decoding into a smaller student by synchronizing audio-aware and text-only teacher passes with a pseudo-ground-truth anchor sequence, improving acoustic grounding without paying dual-path inference cost.

CAAD

CAAD: Contrastive Audio-Aware Distillation for Efficient Speech Language Models

CAAD distills contrastive audio-aware decoding into a student model to improve speech language understanding with efficiency and stronger acoustic grounding. It uses synchronized teacher forcing and metadata-based pseudo-ground truths to distill contrastive reasoning without inference-time overhead.
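The teacher side of this idea can be sketched with the usual contrastive-decoding form: amplify what the audio-conditioned pass predicts and subtract what a text-only pass would have predicted anyway. The formula `(1 + alpha) * audio - alpha * text` and the `alpha` parameter are assumptions borrowed from common contrastive decoding practice. The summary above does not give CAAD's exact scoring rule.

```python
import math
from typing import List


def contrastive_logits(
    audio_aware: List[float], text_only: List[float], alpha: float
) -> List[float]:
    """Boost tokens supported by audio, penalize tokens driven by the text prior."""
    return [(1.0 + alpha) * a - alpha * t for a, t in zip(audio_aware, text_only)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax, giving a distribution a student could match."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    audio_aware = [1.0, 2.0]  # toy logits from the pass that sees the audio
    text_only = [3.0, 0.0]    # toy logits from the pass that sees only text context
    target = softmax(contrastive_logits(audio_aware, text_only, alpha=1.0))
    print(target)  # token 1 dominates once the linguistic prior is subtracted
```

Computing this at inference requires two forward passes per step. Distilling the resulting distribution into a student, as the paper describes, moves that cost to training time.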

Translation-enhanced speech encoder pretraining argues that ASR-only encoders are structurally mismatched with LLMs because they can preserve language-specific spaces, while the LLM expects a more language-agnostic semantic embedding. Adding bidirectional speech translation objectives during encoder pretraining improves downstream SpeechLLM integration.

Translation-Enhanced Speech Encoder

Does Translation-Enhanced Speech Encoder Pre-training Affect Speech LLMs?

This paper studies how adding translation objectives in speech encoder pre-training improves cross-lingual, language-agnostic representations for Speech LLMs. The bidirectional translation task aligns speech embeddings better with the LLM's shared semantic space, boosting downstream speech recognition and translation.

Streaming and alignment for ASR

Online Predictive Coding improves dual-mode self-supervised speech encoders that must support both streaming and offline inference. It regularizes online register tokens to predict future offline representations, while dual-mode LayerNorm reduces distribution mismatch between online and offline paths.

Online Predictive Coding for Dual-Mode Speech

Online Predictive Coding for Dual-Mode Self-Supervised Speech Model

This paper introduces Online Predictive Coding (OPC) to improve dual-mode self-supervised speech models that handle streaming and offline modes using shared parameters. OPC regularizes online registers to predict future offline representations, reducing the performance gap between modes and stabilizing training for low-latency speech recognition.

InterAligner addresses the brittleness of Aligner-Encoder ASR, where the encoder must learn monotonic alignment internally without decoder attention or an RNN-T lattice. By adding intermediate CTC and intermediate Aligner losses, it encourages alignment to form progressively across depth and improves long-utterance WER.

InterAligner

Progressive Alignment Objectives for Aligner-Encoder based ASR

InterAligner enhances Aligner-Encoder ASR by adding progressive alignment objectives at intermediate layers, guiding the encoder to form monotonic alignment gradually. This approach stabilizes training and improves recognition, especially on long utterances, outperforming methods using only final-layer alignment.

Audio–Image Alignment for Low-Resource ASR inserts a transcription-free continued-pretraining stage between SSL pretraining and supervised ASR fine-tuning. Frozen vision encoders provide image semantics for paired picture-prompt speech, and the audio encoder is adapted through contrastive alignment before becoming audio-only again at ASR time.

Audio-Image Alignment for Low-Resource ASR

Audio–Image Alignment as a Continued-Pretraining Stage Improves Low-Resource ASR

This paper proposes an intermediate pretraining step using audio-image pairs to adapt pretrained audio encoders without transcripts. This stage improves ASR performance in low-resource languages by enhancing representation robustness and transferability before supervised fine-tuning.
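Contrastive alignment between audio and frozen image embeddings is typically implemented with a symmetric InfoNCE loss, sketched below in plain Python. Whether the paper uses exactly this loss, and the temperature value, are assumptions. The point is only to show how paired picture-prompt speech can supply a training signal without transcripts.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)


def info_nce(
    audio_embs: List[List[float]], image_embs: List[List[float]], temperature: float = 0.07
) -> float:
    """Symmetric contrastive loss; row i of each input is a matched audio-image pair."""
    sims = [[cosine(a, i) / temperature for i in image_embs] for a in audio_embs]

    def cross_entropy_on_diagonal(rows: List[List[float]]) -> float:
        total = 0.0
        for idx, row in enumerate(rows):
            peak = max(row)
            log_z = peak + math.log(sum(math.exp(s - peak) for s in row))
            total += log_z - row[idx]
        return total / len(rows)

    image_to_audio = [list(col) for col in zip(*sims)]
    return 0.5 * (cross_entropy_on_diagonal(sims) + cross_entropy_on_diagonal(image_to_audio))


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    print(info_nce(audio, [[1.0, 0.0], [0.0, 1.0]]))  # matched pairs: loss near zero
    print(info_nce(audio, [[0.0, 1.0], [1.0, 0.0]]))  # mismatched pairs: large loss
```

Since the vision encoder is frozen, only the audio encoder receives gradients from this loss, and the image branch can be dropped entirely once ASR fine-tuning begins.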

BLoRA code-switching adaptation focuses on preserving a strong multilingual ASR model while adding English-German code-switching. Its Bayesian low-rank adaptation constrains how synthetic code-switching knowledge is written into the model, reducing code-switched-word errors while avoiding the monolingual degradation seen with naive fine-tuning.

BLoRA Code-Switching Adaptation

Adding Robust Code-Switching Capabilities to High Performance Multilingual ASR

BLoRA is a Bayesian method for extending strong multilingual ASR models to handle English-German code-switching. It integrates new knowledge selectively, reducing code-switch errors significantly while preserving monolingual accuracy, unlike naive fine-tuning, which degrades monolingual performance.

Across these ASR and SpeechLLM papers, alignment appears at multiple levels: acoustic-to-token alignment, online-to-offline alignment, speech-to-image semantic alignment, speech-to-LLM space alignment, and multilingual/code-switched alignment. Each paper adds structure to a place where “just fine-tune it” tends to be unstable.

3D avatars, motion, relighting, and geometry priors

Digital-human work this week emphasized controllable 3D structure rather than only image-space realism. The papers cover single-image avatar creation, full-body relighting, sparse-sensor motion reconstruction, and human-preference tuning of implicit face geometry.

FiCA is a feed-forward pipeline for generating a drivable Gaussian Codec Avatar from a single portrait image. It uses human-centric foundation models to unwrap partial UV texture/geometry observations, a diffusion model to complete canonical texture and mesh geometry, a feed-forward UV refinement network for identity fidelity, and a universal prior that decodes the result into real-time 3D Gaussians.

FiCA

FiCA: Feed-forward instant Gaussian Codec Avatars from a Single Portrait Image

FiCA generates photorealistic, animatable 3D Gaussian Codec Avatars from a single portrait image using a feed-forward pipeline. It uniquely combines vision foundation models, diffusion-based completion, and UV refinement for real-time, identity-preserving avatars without offline tracking or optimization.

The key practical point is that FiCA avoids per-person test-time optimization. For avatar products, that is the difference between a capture/reconstruction pipeline and an instant onboarding flow from one portrait.

Generative Relightable Avatars targets full-body humans under novel environment maps. It combines explicit UV-space material optimization and microfacet relighting with learned texture refinement and a fine-tuned video-to-video diffusion model, preserving 3D control while adding the high-frequency detail that deterministic relightable avatars tend to average away.

Generative Relightable Avatars

Generative Relightable Avatars enable photorealistic free-view rendering and relighting of full-body humans by combining physics-based relighting with generative video refinement. This captures fine details and pose-dependent effects, producing temporally coherent avatars under diverse lighting.

MotionMAR reconstructs full-body motion from sparse head/hand tracker observations using a multi-scale autoregressive latent hierarchy. A temporal multi-scale VQ-VAE separates coarse global trajectories from finer jitter/detail, a GPT-style Motion Autoregressive Network predicts tokens scale by scale, and scale-aware control keeps the generated motion anchored to sparse VR observations.

MotionMAR

MotionMAR: Multi-scale Auto-Regressive Human Motion Reconstruction from Sparse Observations

MotionMAR reconstructs full-body human motion from sparse VR sensor data using a multi-scale autoregressive framework that captures everything from global trajectories to fine details. Its coarse-to-fine tokenization and scale-aware control provide accurate, jitter-free human motion that respects the temporal hierarchy.

Sculpting NeRF Geometry brings preference optimization to implicit 3D face geometry. Instead of extracting meshes or relying on text prompts, it trains a reward model directly over EG3D’s radiance-field density volume and fine-tunes the generator toward face geometries preferred by humans, with a density-consistency constraint to limit appearance drift.

Sculpting NeRF Geometry

Sculpting NeRF Geometry: Human-Preference Fine-Tuning of a 3D-Aware Face GAN

This paper fine-tunes a pretrained 3D-aware face GAN's geometry using a reward model trained on human preferences directly from the radiance field's density values. It uniquely improves 3D facial geometry without relying on text prompts, mesh priors, or explicit surface supervision, while preserving 2D appearance.

These works all preserve some explicit handle—UV maps, meshes, sparse kinematic controls, density fields, or Gaussians—while using generative models for the ambiguous parts. For digital humans, that hybridization is becoming essential: realism alone is not enough if lighting, pose, expression, identity, and long-horizon geometry cannot be controlled.

Evaluation and safety beyond transcripts

Two papers made the same uncomfortable point from different angles: speech systems can look strong on lexical metrics while failing at expressive transfer or vocal decision-making. For voice agents, dubbing, and speech-to-speech translation, evaluation needs to cover both what is said and how it is said.

STEB introduces a speech-to-speech translation benchmark that evaluates not only translation fidelity, speaker similarity, and duration alignment, but also emotion, scenario style, and nonverbal vocalization preservation. Its reference-free expressiveness evaluation converts source and hypothesis speech into structured expressive attributes, then compares them with an LLM judge validated against human listeners.

STEB

STEB: A Speech-to-Speech Translation Expressiveness Benchmark for Evaluating Beyond Translation Fidelity

STEB is a speech-to-speech translation benchmark that evaluates both translation fidelity and expressive aspects like emotion, scenario style, and nonverbal vocalizations. It uses a reference-free LLM-based method comparing structured expressive attributes, revealing challenges in preserving expressiveness.

The results are telling: cascaded systems can score well on translation fidelity while still losing emotion and nonverbal events. That is a serious gap for dubbing, virtual presenters, and cross-lingual agents, where laughter, fear, sarcasm, and scene style are part of the message.

Real-Time Voice AI Hears but Does Not Listen evaluates production realtime voice systems in scenarios where words and vocal delivery imply opposite actions: crying callers saying they are fine, frightened callers approving a wire transfer, and sarcastic callers agreeing to volunteer. The systems often identify the emotion when asked directly, yet still act on the words when making decisions.

Emotional Intelligence Gap in Voice AI

Real-Time Voice AI Hears but Does Not Listen

This paper shows that leading realtime voice AI systems detect vocal emotions but ignore them when making decisions, acting only on words. It identifies an "emotional intelligence gap" where AI hears but does not listen, posing risks for applications reliant on tone and emotion.

That disconnect—perceiving vocal cues but not using them for action—should shape how engineers test voice agents. It is not enough to ask whether the model can classify emotion, accent, age, or sarcasm in isolation; we need task evaluations where those acoustic cues change the correct policy.
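A test harness for this kind of evaluation can be small. The sketch below scores an agent on conflict cases where the vocal cue changes the correct action, and reports perception accuracy separately from action accuracy so the gap between them is visible. The schema, field names, and action labels are all hypothetical, and the example data is illustrative rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ConflictCase:
    transcript: str
    vocal_cue: str          # cue carried by the audio, e.g. "fear"
    cue_aware_action: str   # correct action once the cue is taken into account
    words_only_action: str  # what a transcript-only policy would do


@dataclass
class AgentResult:
    reported_cue: str  # agent's answer when asked directly about the caller's state
    action: str        # what the agent actually did


def score(cases: List[ConflictCase], results: List[AgentResult]) -> Dict[str, float]:
    """Separate 'did it hear the cue' from 'did the cue change its behavior'."""
    n = len(cases)
    perceived = sum(r.reported_cue == c.vocal_cue for c, r in zip(cases, results))
    acted = sum(r.action == c.cue_aware_action for c, r in zip(cases, results))
    literal = sum(r.action == c.words_only_action for c, r in zip(cases, results))
    return {
        "perception_acc": perceived / n,
        "action_acc": acted / n,
        "hear_listen_gap": (perceived - acted) / n,
        "words_only_rate": literal / n,
    }


if __name__ == "__main__":
    cases = [
        ConflictCase("Yes, send the wire.", "fear", "pause_and_verify", "send_wire"),
        ConflictCase("I'm fine, really.", "crying", "offer_support", "end_call"),
    ]
    results = [
        AgentResult("fear", "send_wire"),        # heard the cue, acted on the words
        AgentResult("crying", "offer_support"),  # heard the cue and acted on it
    ]
    print(score(cases, results))
```

A large `hear_listen_gap` alongside high `perception_acc` is the signature of the failure described above: the cue is detected but does not reach the policy.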