Jun 27, 2026

From speech to face, code-switching, and pronunciation

Today's digest spans full-duplex talking avatars, stronger multilingual ASR for code-switching, and LLM-based grapheme-to-phoneme benchmarking for Japanese TTS. Together they point to voice systems that sound more natural, adapt more flexibly, and animate more convincingly.

Overview of Moshi-Face. a) The face codec encodes facial motion into $N$ discrete face tokens and decodes them back. b) Moshi-Face appends $N$ face token streams to the existing text and audio token streams. c) The Face Transformer generates the $N$ face tokens from a conditioning vector for step $i+1$ that aggregates the hidden state, text, and audio embeddings. From Moshi-Face.

Talking Avatars & Full-Duplex Dialogue

Moshi-Face

Integrating Facial Generation into Full-Duplex Spoken Dialogue Systems

Moshi-Face extends full-duplex spoken dialogue by adding real-time facial motion generation synchronized with speech. It represents 3D face motion as discrete tokens, which enables natural, low-latency audiovisual dialogue without degrading semantic quality.
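
To make the token layout concrete, here is a minimal Python sketch of how $N$ face token streams could sit alongside the text and audio streams at each time step. The function name `append_face_streams`, the choice of two audio streams, and all token values are illustrative assumptions, not Moshi-Face's actual code or configuration.

```python
from typing import List


def append_face_streams(
    text: List[int],
    audio: List[List[int]],
    face: List[List[int]],
) -> List[List[int]]:
    """Return one row per time step: [text, audio_1..audio_K, face_1..face_N].

    Hypothetical helper: it only illustrates the idea of appending face
    token streams to existing text and audio streams.
    """
    steps = len(text)
    for name, streams in (("audio", audio), ("face", face)):
        for stream in streams:
            if len(stream) != steps:
                raise ValueError(f"{name} stream length does not match text length")

    rows = []
    for t in range(steps):
        # Existing streams first, then the appended face streams.
        row = [text[t]] + [s[t] for s in audio] + [s[t] for s in face]
        rows.append(row)
    return rows


# Toy data: 3 time steps, 2 audio streams (assumed), N = 3 face streams.
text = [5, 6, 7]
audio = [[10, 11, 12], [20, 21, 22]]
face = [[100, 101, 102], [200, 201, 202], [300, 301, 302]]

frames = append_face_streams(text, audio, face)
```

Each row of `frames` holds every stream's token for one time step, so a model that already predicts the text and audio entries would only need to predict the extra $N$ face entries per step.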

Speech Recognition for Voice Systems

BLoRA Code-Switching Adaptation

Adding Robust Code-Switching Capabilities to High Performance Multilingual ASR

BLoRA is a Bayesian method for extending strong multilingual ASR models to handle English-German code-switching. It integrates new knowledge selectively, significantly reducing code-switching errors while preserving monolingual accuracy, whereas naive fine-tuning degrades it.

TTS & Speech Front-End

LLM-Based Japanese G2P Benchmark

Benchmarking Large Language Models for Grapheme-to-Phoneme Conversion: A Japanese Case Study

This work benchmarks over 30 large language models on Japanese grapheme-to-phoneme conversion and compares novel prompting pipelines with traditional morphological analyzers. It shows that LLMs can outperform classical methods in pronunciation accuracy for text-to-speech.
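
As a rough illustration of how pronunciation accuracy can be scored in a G2P comparison, the sketch below computes a phoneme error rate (PER) from edit distance. This digest does not state which metric the benchmark uses, so treating PER as the metric is an assumption, and the phoneme sequences are made-up toy data.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(pairs: List[Tuple[Sequence[str], Sequence[str]]]) -> float:
    """Total edits divided by total reference phonemes over (ref, hyp) pairs."""
    total_ref = sum(len(ref) for ref, _ in pairs)
    if total_ref == 0:
        raise ValueError("reference phoneme sequences are empty")
    total_edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    return total_edits / total_ref


# Toy (reference, hypothesis) pairs; the symbols are illustrative only.
pairs = [
    (["t", "o:", "k", "j", "o:"], ["t", "o", "k", "j", "o:"]),  # one substitution
    (["s", "a", "k", "e"], ["s", "a", "k", "e"]),                # exact match
]
per = phoneme_error_rate(pairs)
```

The same function can score an LLM prompting pipeline and a morphological analyzer against one shared reference set, which keeps the comparison on equal footing.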