EmoZone-Talker

EmoZone-Talker: Regional Semantic Control of Audio-Driven 3DGS Talking Heads via Facial Action Units

EmoZone-Talker enables fine-grained, anatomically interpretable control of facial expressions in audio-driven 3D Gaussian Splatting talking heads by disentangling spatial and temporal interactions of speech and facial Action Units, improving expression realism, controllability, and lip-sync accuracy.

talking-head
audio-driven
face-animation
3d-avatar
gaussian-splatting
emotion
lip-sync
realtime

Authors: Tingting Chen, Shaojun Wang, Huaye Zhang, Diqiong Jiang, Chenglizhao Chen

Categories: cs.CV

Published 2026-06-14 · Updated 2026-06-14

Abstract

3D Gaussian Splatting (3DGS) has shown strong potential for high-fidelity talking head synthesis. However, enabling fine-grained, interpretable, and editable facial expression control remains fundamentally challenging due to intrinsic conflicts between speech-driven facial dynamics and explicit expression signals. Existing methods rely on implicit multimodal fusion, leading to spatial entanglement and temporal instability. We present EmoZone-Talker, a novel framework that reformulates audio-driven facial animation as a structured spatial-temporal coordination problem under cross-modal conflicts. Our approach introduces an explicit spatial disentanglement and temporal dynamics modeling of facial motion. Specifically, we propose Synergy Zones with Prioritized Attention Bias (SZ-PAB) to explicitly decouple modality contributions via region-wise constraints guided by anatomical priors, and a Channel-Independent Temporal AU Encoder (CIT-AE) to model temporally coherent AU dynamics. By integrating these representations into 3D Gaussian deformation, EmoZone-Talker enables precise and interpretable control over facial expressions. Extensive experiments demonstrate that our method improves expression controllability and realism, with notable gains in upper-face accuracy and temporal coherence, while preserving high rendering quality and accurate lip synchronization. Code will be publicly released to facilitate reproducibility and further research.

Overview and Problem Framing

EmoZone-Talker addresses a specific gap in audio-driven 3D Gaussian Splatting (3DGS) talking-head synthesis: the field has largely optimized for photorealism and lip synchronization, but it still lacks an explicit, interpretable mechanism for fine-grained facial expression control. The paper argues that Action Units (AUs) are a better interface than global emotion labels because they are anatomically meaningful and spatially localized, but they also create a harder control problem: speech-driven articulation and AU-driven expression often affect overlapping facial regions, especially around the mouth, cheeks, jaw, eyebrows, eyes, and nose. The core claim is that this is not just a fusion problem, but a cross-modal conflict problem that must be handled with explicit spatial and temporal coordination.

The method is built around two ideas. First, it introduces Synergy Zones with Prioritized Attention Bias (SZ-PAB) to assign different facial regions different modality responsibilities, rather than mixing audio and AU signals implicitly across the whole face. Second, it introduces a Channel-Independent Temporal AU Encoder (CIT-AE) to smooth frame-level AU sequences into temporally coherent latent trajectories, reducing jitter and inconsistent motion. These representations are then injected into the 3DGS deformation pipeline, with an auxiliary AU-consistency branch used only during training.

Fine-grained and temporally coherent expression control enabled by EmoZone-Talker. (a) Independent AU editing: disentangled, intensity-controllable manipulation of individual Action Units for targeted facial motion. (b) Natural synthesis under conflicting AUs: anatomically plausible expressions even when incompatible cues coexist. (c) Smooth expression transitions: temporally coherent interpolation between affective states driven by speech without jitter.

Method Overview

Overview of the AU-conditioned 3D Gaussian talking-head framework. Audio features, AU features, and canonical tri-plane features are fused to predict Gaussian deformations for rendering. SZ-PAB introduces region-prioritized attention bias to decouple upper-face AU control from mouth motion, while CIT-AE models local temporal AU context to improve temporal stability. An auxiliary AU-consistency branch further constrains the rendered results.

The paper formulates the task as audio-driven facial animation with explicit AU control. The subject is represented by canonical 3D Gaussians, and a deformation network predicts per-frame Gaussian offsets from multimodal inputs: speech audio, AU features, and camera-related inputs. The method is designed for upper-face control in particular, because upper-face AUs such as AU1, AU2, AU4, AU6, and AU9 are weakly correlated with articulation but highly important for expressive motion.

The conceptual decomposition is:

$$y = f(x_{audio}, x_{AU})$$

where $y$ denotes the resulting facial motion, speech drives articulation, and AUs explicitly drive expression. Instead of assuming that both signals can be fused uniformly, EmoZone-Talker introduces a region-wise factorization:

$$p(y \mid x_{audio}, x_{AU}) \approx \prod_r p(y_r \mid x_{audio}^{(r)}, x_{AU}^{(r)})$$

This reflects the paper’s main design principle: each facial region should preferentially attend to the modality that should dominate it, while transition regions can remain cooperative.

Synergy Zones with Prioritized Attention Bias (SZ-PAB)

SZ-PAB is the spatial disentanglement module. The face is partitioned into three semantically motivated zones using anatomical priors: an audio-dominant zone $\mathcal{Z}_A$ for the mouth, an expression-dominant zone $\mathcal{Z}_E$ for the upper face, and a synergy zone $\mathcal{Z}_S$ for transition regions. The paper uses these zone assignments to guide cross-attention so that mouth motion stays tied to speech, while AU control is concentrated in the upper face.

The attention rule is modified with a region-dependent bias:

$$\operatorname{Attn}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d}} + B\right)V$$

where $B$ encodes region-specific preference. In effect, AU tokens are suppressed in the audio-dominant mouth zone, and audio tokens are suppressed in the expression-dominant upper-face zone, while the synergy zone is left less constrained. The goal is not hard separation everywhere, but explicit responsibility assignment where conflicts are expected.
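The biased attention rule can be illustrated with a minimal pure-Python sketch. The summary does not spell out how $B$ is constructed, so the constant penalty, the zone labels `A`/`E`/`S`, and the helper names `zone_bias` and `biased_attention` are assumptions made purely for illustration.

```python
import math

def zone_bias(zones, token_types, penalty=4.0):
    """Build B: one row per Gaussian, one column per token.

    Assumed form: a constant negative penalty where a token type should be
    suppressed, and zero elsewhere (the synergy zone "S" is left unbiased).
    """
    bias = []
    for zone in zones:
        row = []
        for tok in token_types:
            suppressed = (zone == "A" and tok == "au") or (zone == "E" and tok == "audio")
            row.append(-penalty if suppressed else 0.0)
        bias.append(row)
    return bias

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention(Q, K, V, B):
    """softmax(Q K^T / sqrt(d) + B) V, returning outputs and attention weights."""
    d = len(Q[0])
    outputs, weights = [], []
    for q, b_row in zip(Q, B):
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) + b
                  for k, b in zip(K, b_row)]
        w = softmax(logits)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

# Three Gaussians (mouth, upper face, transition) attend to 2 audio + 2 AU tokens.
zones = ["A", "E", "S"]
token_types = ["audio", "audio", "au", "au"]
Q = [[0.0, 0.0]] * 3  # zero queries, so only the bias shapes the attention
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
V = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # audio -> dim 0, AU -> dim 1
outputs, weights = biased_attention(Q, K, V, zone_bias(zones, token_types))
for zone, out in zip(zones, outputs):
    print(zone, [round(x, 3) for x in out])
```

With zero queries, the mouth Gaussian's output is dominated by the audio value and the upper-face Gaussian's by the AU value, while the transition Gaussian mixes both evenly. A soft penalty rather than a hard mask matches the stated goal of responsibility assignment without hard separation everywhere.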

To further stabilize this behavior, the paper adds Region-Aware Attention Regularization (RAAR). Let $\alpha_{i,j}$ be the attention weight from Gaussian $i$ to token $j$. RAAR compares average attention response over the upper face and mouth regions, and imposes margin-based hinge losses so that upper-face regions prefer AU tokens over audio tokens, while mouth regions prefer audio tokens over AU tokens. The overall regularizer is written as:

$$\mathcal{L}_{RAAR} = \mathcal{L}_{upper} + \mathcal{L}_{mouth}$$
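One plausible reading of the two hinge terms is sketched below. The exact formulation is not given in this summary, so the margin value, the use of per-token mean attention, and the names `mean_attention` and `raar_loss` are assumptions.

```python
def mean_attention(weights, rows, cols):
    """Average attention weight alpha[i][j] over the given Gaussians and tokens."""
    values = [weights[i][j] for i in rows for j in cols]
    return sum(values) / len(values)

def raar_loss(weights, upper_rows, mouth_rows, au_cols, audio_cols, margin=0.2):
    """L_RAAR = L_upper + L_mouth, each a margin-based hinge (assumed form)."""
    # Upper face: AU tokens should out-weigh audio tokens by at least the margin.
    upper_gap = (mean_attention(weights, upper_rows, au_cols)
                 - mean_attention(weights, upper_rows, audio_cols))
    # Mouth: audio tokens should out-weigh AU tokens by at least the margin.
    mouth_gap = (mean_attention(weights, mouth_rows, audio_cols)
                 - mean_attention(weights, mouth_rows, au_cols))
    l_upper = max(0.0, margin - upper_gap)
    l_mouth = max(0.0, margin - mouth_gap)
    return l_upper + l_mouth

# Columns 0-1 are audio tokens, columns 2-3 are AU tokens.
# Row 0 is an upper-face Gaussian, row 1 a mouth Gaussian.
well_separated = [[0.1, 0.1, 0.4, 0.4],
                  [0.4, 0.4, 0.1, 0.1]]
uniform = [[0.25, 0.25, 0.25, 0.25],
           [0.25, 0.25, 0.25, 0.25]]
loss_separated = raar_loss(well_separated, [0], [1], [2, 3], [0, 1])
loss_uniform = raar_loss(uniform, [0], [1], [2, 3], [0, 1])
print(loss_separated, loss_uniform)
```

In this sketch the loss is zero once both regions prefer their intended modality by at least the margin, and it penalizes uniform (undifferentiated) attention.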

Channel-Independent Temporal AU Encoder (CIT-AE)

CIT-AE addresses temporal conflict. The paper observes that frame-level AU estimates are noisy and fluctuate over time, which can produce jittery facial motion if injected directly into the deformation model. It models each AU vector as a noisy observation:

$$\mathbf{a}_t = \mathbf{s}_t + \boldsymbol{\epsilon}_t$$

where $\mathbf{s}_t$ is the smooth underlying expression trajectory and $\boldsymbol{\epsilon}_t$ is high-frequency noise. Instead of using $\mathbf{a}_t$ directly, CIT-AE builds a center-aligned temporal window:

$$\mathbf{A}_t = [\mathbf{a}_{t-h}, \dots, \mathbf{a}_t, \dots, \mathbf{a}_{t+h-1}] \in \mathbb{R}^{T \times K}$$

and produces a refined center-frame embedding $\tilde{\mathbf{a}}_t \in \mathbb{R}^K$, where $T = 2h$ is the window length and $K$ is the number of AU channels. The key design choice is that the channels are modeled independently, which reduces cross-channel interference while still leveraging local temporal context. The intended effect is a learnable temporal filter that suppresses high-frequency jitter and keeps expression transitions coherent.
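The windowing and channel-independent filtering idea can be sketched as follows. This is not the paper's encoder: fixed uniform kernels stand in for learned weights, edge frames are clamped, and the names `build_window` and `channelwise_filter` are hypothetical.

```python
def build_window(au_seq, t, h):
    """Return frames t-h .. t+h-1 (2h frames), clamping indices at the sequence ends.

    Edge clamping is an assumption; the summary does not say how borders are handled.
    """
    last = len(au_seq) - 1
    return [au_seq[min(max(i, 0), last)] for i in range(t - h, t + h)]

def channelwise_filter(window, kernels):
    """Apply one temporal kernel per AU channel, with no mixing across channels."""
    num_frames = len(window)
    num_channels = len(window[0])
    return [sum(kernels[k][n] * window[n][k] for n in range(num_frames))
            for k in range(num_channels)]

# Two AU channels: channel 0 jitters around 0.5, channel 1 is constant.
au_seq = [[0.4, 1.0], [0.6, 1.0], [0.4, 1.0], [0.6, 1.0], [0.4, 1.0], [0.6, 1.0]]
h = 2
window = build_window(au_seq, t=3, h=h)
# Uniform averaging kernels stand in for the learned per-channel filters.
kernels = [[1.0 / (2 * h)] * (2 * h) for _ in range(len(au_seq[0]))]
refined = channelwise_filter(window, kernels)
print([round(x, 3) for x in refined])
```

Because each output channel reads only its own input channel, noise in one AU cannot leak into another. In a learned version the same structure would correspond to a depthwise temporal filter.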

AU Consistency Supervision

The architecture includes a training-only AU-consistency branch to ensure that the generated frames actually reflect the requested expression semantics. Ground-truth AU labels $\mathbf{a}^{gt}_t$ are extracted with a pretrained AU detector, and the rendered frame $\hat{I}_t$ is evaluated by a pretrained JAA network to obtain predicted AU responses:

$$\hat{\mathbf{a}}_t = \Psi_{JAA}(\hat{I}_t)$$

The AU loss is then:

$$\mathcal{L}_{AU} = \|\hat{\mathbf{a}}_t - \mathbf{a}^{gt}_t\|_1$$

This branch does not add inference cost; it acts as an external semantic constraint during training so that the deformation network cannot ignore AU inputs or collapse them into the speech branch.

Training Objective

The total objective combines reconstruction fidelity, lip synchronization, AU consistency, and region-aware regularization:

$$\mathcal{L}_{total} = \mathcal{L}_{rec} + \lambda_{sync}\mathcal{L}_{sync} + \lambda_{AU}\mathcal{L}_{AU} + \lambda_{RAAR}\mathcal{L}_{RAAR}$$

The paper states that $\mathcal{L}_{rec}$ enforces photometric and perceptual consistency, $\mathcal{L}_{sync}$ uses SyncNet for speech-lip alignment, $\mathcal{L}_{AU}$ enforces AU semantics, and $\mathcal{L}_{RAAR}$ regularizes region-wise attention. The reported weights are $\lambda_{sync}=0.1$, $\lambda_{AU}=0.05$, and $\lambda_{RAAR}=0.01$.
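As a small worked example of how the reported weights combine the terms, the sketch below evaluates the objective on placeholder loss values. The individual loss numbers are made up for illustration, and `au_l1_loss` and `total_loss` are hypothetical helper names.

```python
# Loss weights as reported in the text.
LAMBDA_SYNC = 0.1
LAMBDA_AU = 0.05
LAMBDA_RAAR = 0.01

def au_l1_loss(predicted_aus, target_aus):
    """L1 distance between predicted and target AU vectors."""
    return sum(abs(p - g) for p, g in zip(predicted_aus, target_aus))

def total_loss(l_rec, l_sync, l_au, l_raar):
    """L_total = L_rec + lambda_sync*L_sync + lambda_AU*L_AU + lambda_RAAR*L_RAAR."""
    return l_rec + LAMBDA_SYNC * l_sync + LAMBDA_AU * l_au + LAMBDA_RAAR * l_raar

# Placeholder values, not taken from the paper.
l_au = au_l1_loss([0.2, 0.8], [0.0, 1.0])
loss = total_loss(l_rec=1.0, l_sync=2.0, l_au=l_au, l_raar=0.5)
print(round(l_au, 3), round(loss, 4))
```

With these weights the reconstruction term dominates, and the AU and attention terms act as comparatively light regularizers.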

Experimental Setup

The experiments are conducted on two datasets. For self-reconstruction, the paper follows the TalkingGaussian protocol and uses the neutral subset containing Obama, May, Lieu, and Macron. For emotion-conditioned generation, it uses MEAD, which is a large-scale multi-view emotional talking-head dataset with 60 speakers, 8 emotion categories, and 3 intensity levels. Videos are resized to $256\times256$ at 25 FPS, and audio is sampled at 16 kHz.

For AU annotations, the paper uses JAANet during training and OpenFace for evaluation, with the latter chosen to align with prior work and maintain unbiased comparisons.

Implementation details are brief but concrete: the model is implemented in PyTorch, optimized with Adam, trained in a coarse stage followed by a fine stage for lip and depth refinement, and the learning rate decays exponentially from $10^{-4}$ to $10^{-5}$. Gaussian densification is performed every 100 iterations between 1k and 7k steps.
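The two schedules mentioned above can be sketched as plain functions. The total step count, the inclusive treatment of the 1k and 7k boundaries, and the function names are assumptions; only the learning-rate endpoints and the densification interval come from the text.

```python
def learning_rate(step, total_steps, lr_start=1e-4, lr_end=1e-5):
    """Exponential decay from lr_start at step 0 to lr_end at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_start * (lr_end / lr_start) ** progress

def should_densify(step, start=1000, end=7000, interval=100):
    """True every `interval` iterations within [start, end] (bounds assumed inclusive)."""
    return start <= step <= end and (step - start) % interval == 0

TOTAL_STEPS = 10_000  # hypothetical; the summary does not state the training length
for step in (0, 5_000, 10_000):
    print(step, f"{learning_rate(step, TOTAL_STEPS):.2e}", should_densify(step))
```

Exponential decay means the rate falls by the same factor over equal step intervals, so the midpoint rate is the geometric mean of the two endpoints.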

The evaluation suite is designed to cover multiple axes of quality:

Rendering fidelity: SSIM, PSNR, and LPIPS.
Synchronization and geometry: SyncNet confidence (Sync) and landmark distance (LMD).
Expression controllability: AUE-U and AUE-L for AU reconstruction in upper and lower regions.
Temporal stability: AU-Jerk, defined as the average second-order derivative of AU trajectories.
Emotion-conditioned generation: E-Score from a pretrained emotion classifier.
Identity preservation: cosine similarity (CSIM).
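The AU-Jerk definition above can be approximated with finite differences. The sketch below assumes the metric averages the absolute second-order difference over frames and AU channels; the exact normalization used in the paper is not given here, and `au_jerk` is a hypothetical name.

```python
def au_jerk(au_seq):
    """Mean absolute second-order finite difference over frames and AU channels."""
    if len(au_seq) < 3:
        raise ValueError("need at least three frames for a second-order difference")
    num_channels = len(au_seq[0])
    total, count = 0.0, 0
    for t in range(1, len(au_seq) - 1):
        for k in range(num_channels):
            total += abs(au_seq[t + 1][k] - 2.0 * au_seq[t][k] + au_seq[t - 1][k])
            count += 1
    return total / count

smooth_ramp = [[0.0], [0.25], [0.5], [0.75]]  # constant slope
jittery = [[0.0], [1.0], [0.0], [1.0]]        # frame-to-frame flicker
print(au_jerk(smooth_ramp), au_jerk(jittery))
```

A linear ramp has zero second difference, so it scores zero, while frame-to-frame flicker scores high. Lower values therefore indicate smoother AU trajectories.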

Baselines

For self-reconstruction, the paper compares against ER-NeRF, GaussianTalker, and TalkingGaussian. For emotion-conditioned generation, it compares against StyleTalk, EAT, DreamTalk, and DICE-Talk. The comparison set intentionally spans both reconstruction-oriented methods and emotion-aware methods so that the paper can test whether AU-level control helps without sacrificing lip synchronization or fidelity.

Quantitative Results

Self-Reconstruction

On the neutral subset and on MEAD, EmoZone-Talker reports the best overall performance on most metrics, especially the AU-specific measures. The paper emphasizes that the main gain is not only higher image quality, but a much better balance between upper-face expression control and speech synchronization.

Self-reconstruction results on the neutral subset and MEAD.
Method	Neutral PSNR	Neutral SSIM	Neutral LPIPS	Neutral LMD	Neutral Sync	Neutral CSIM	Neutral AUE-U / AUE-L	MEAD PSNR	MEAD SSIM	MEAD LPIPS	MEAD LMD	MEAD Sync	MEAD CSIM	MEAD AUE-U / AUE-L
ER-NeRF	33.059	0.935	0.027	2.269	5.554	0.930	0.271 / 0.228	31.156	0.931	0.033	2.649	4.492	0.864	0.355 / 0.376
GaussianTalker	33.023	0.939	0.033	2.306	5.741	0.946	0.256 / 0.228	31.952	0.946	0.047	2.410	4.823	0.913	0.338 / 0.330
TalkingGaussian	33.637	0.940	0.026	2.013	5.919	0.945	0.206 / 0.223	32.769	0.942	0.029	2.074	4.794	0.914	0.163 / 0.314
EmoZone-Talker	34.732	0.952	0.030	1.951	5.849	0.956	0.156 / 0.199	33.940	0.957	0.038	1.852	5.228	0.944	0.145 / 0.258

The paper’s main interpretation is that explicit region-aware disentanglement substantially improves upper-face accuracy. On the neutral dataset, EmoZone-Talker reaches AUE-U/AUE-L of $0.156/0.199$, compared with TalkingGaussian’s $0.206/0.223$. On MEAD, it achieves $0.145/0.258$, again better than the baselines. The paper also reports real-time performance at 110.5 FPS, comparable to GaussianTalker and TalkingGaussian, while remaining much faster than ER-NeRF.

One notable nuance is that the method does not win every metric. On LPIPS, TalkingGaussian (0.026 on the neutral subset, 0.029 on MEAD) and ER-NeRF (0.027, 0.033) both score better than EmoZone-Talker (0.030, 0.038), and TalkingGaussian has the highest Sync on the neutral subset (5.919 versus 5.849). The advantage of EmoZone-Talker is therefore best understood as a better overall balance, especially on controllability, upper-face reconstruction, and identity preservation.

Emotion-Conditioned Generation

On the emotion subset of MEAD, the AU-conditioned design outperforms emotion-label-based or global emotion methods by a clear margin. The paper attributes this to the fact that AUs correspond to specific muscle activations, while global emotion categories only approximate a facial state distribution.

Emotion-conditioned generation on MEAD.
Method	E-Score	Neutral	Non-neutral	CSIM
StyleTalk	0.383	0.392	0.378	0.632
EAT	0.552	0.368	0.613	0.616
DreamTalk	0.526	0.585	0.501	0.606
DICE-Talk	0.311	0.329	0.305	0.651
EmoZone-Talker	0.653	0.601	0.685	0.841

The strongest result here is the identity score: EmoZone-Talker’s CSIM of $0.841$ is far above the baselines, indicating that AU-level control can preserve identity more effectively than holistic emotion conditioning. Its E-Score is also the best overall, with strong performance on both neutral and non-neutral cases. The paper interprets this as evidence that fine-grained muscle-level control better captures real emotional expression than categorical emotion embeddings.

Qualitative Results

Qualitative comparison in the self-reconstruction scenario, where each subject is driven by their own speech. The regions highlighted by red solid boxes indicate mismatched lip articulation.

In self-reconstruction, the paper shows that prior methods often produce visible lip articulation errors, particularly at phonetic moments where the mouth should open more clearly. ER-NeRF is described as exhibiting clear lip deformation in some phonemes, while GaussianTalker and TalkingGaussian can produce mismatched lip motion in the highlighted mouth regions. EmoZone-Talker is qualitatively better at maintaining both speech alignment and facial detail, which is consistent with its lower LMD on both datasets and its higher Sync on MEAD.

Qualitative comparison in the emotion-control scenario, where all methods receive the same audio and specified target emotions.

In emotion-controlled generation, the paper highlights that some baselines can create emotional faces but often fail to coordinate local action units with the target affect. StyleTalk and DreamTalk produce generic emotions, EAT improves global emotional conveyance but still lacks local coordination, and DICE-Talk can become over-smoothed or overly exaggerated in the mouth region. EmoZone-Talker better preserves lip articulation while rendering clearer emotion cues.

Ablation Studies

The ablation section is important because it isolates the paper’s central claims: AU conditioning alone is not enough, and both spatial decoupling and temporal modeling are necessary. The authors evaluate four reduced configurations: (A) an audio-only baseline, (B) audio plus AU conditioning, (C) AU conditioning with SZ-PAB, and (D) AU conditioning with CIT-AE. The full model adds both SZ-PAB and CIT-AE on top of AU conditioning.

Ablation study on spatial and temporal controllability.
Config	AU Cond	SZ-PAB	CIT-AE	PSNR	LPIPS	Sync	AUE-U	AUE-L	AU-Jerk
(A)	-	-	-	32.879	0.042	5.741	0.309	0.314	0.137
(B)	✓	-	-	31.795	0.046	5.378	0.324	0.358	0.153
(C)	✓	✓	-	33.947	0.037	5.569	0.152	0.235	0.126
(D)	✓	-	✓	33.921	0.039	5.571	0.305	0.276	0.098
EmoZone-Talker	✓	✓	✓	33.960	0.034	5.712	0.143	0.231	0.095

The ablation results make two points very clear. First, simply adding AU conditioning without region decoupling hurts performance: configuration (B) has worse PSNR than the audio-only baseline (31.795 versus 32.879) and higher AU errors (AUE-U 0.324 versus 0.309, AUE-L 0.358 versus 0.314), which the paper interprets as cross-modal interference. Second, each proposed module fixes a different failure mode. SZ-PAB sharply lowers upper-face error by separating speech and AU responsibilities (AUE-U 0.152 in (C)), while CIT-AE reduces AU-Jerk by smoothing temporal fluctuations (0.098 in (D)). The full model is best on every reported metric except Sync, where the audio-only baseline is marginally higher (5.741 versus 5.712), which indicates that the two modules are complementary rather than redundant.
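To make the size of these effects concrete, the sketch below computes relative error reductions directly from the ablation table. The numbers are copied from the table; the helper name `relative_reduction` and the choice of configuration (B) as the reference point are assumptions of this illustration.

```python
# AUE-U and AU-Jerk values copied from the ablation table.
ablation = {
    "A": {"AUE-U": 0.309, "AU-Jerk": 0.137},
    "B": {"AUE-U": 0.324, "AU-Jerk": 0.153},
    "C": {"AUE-U": 0.152, "AU-Jerk": 0.126},
    "D": {"AUE-U": 0.305, "AU-Jerk": 0.098},
    "Full": {"AUE-U": 0.143, "AU-Jerk": 0.095},
}

def relative_reduction(before, after):
    """Fraction by which a lower-is-better metric drops from `before` to `after`."""
    return (before - after) / before

# Effect of adding SZ-PAB to plain AU conditioning, measured on upper-face error.
szpab_gain = relative_reduction(ablation["B"]["AUE-U"], ablation["C"]["AUE-U"])
# Effect of adding CIT-AE to plain AU conditioning, measured on temporal jitter.
citae_gain = relative_reduction(ablation["B"]["AU-Jerk"], ablation["D"]["AU-Jerk"])
print(f"SZ-PAB cuts AUE-U by {szpab_gain:.1%}")
print(f"CIT-AE cuts AU-Jerk by {citae_gain:.1%}")
```

Relative to plain AU conditioning, SZ-PAB roughly halves upper-face error and CIT-AE removes roughly a third of the jerk, which fits the reading that each module targets a different failure mode.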

Ablation study of SZ-PAB. Blue boxes highlight the upper-face regions.

Figure 5 visualizes the effect of removing SZ-PAB across several AUs. With the full model, AU1 and AU2 produce visible brow elevation, AU4 generates brow lowering with wrinkles, AU6 raises the cheeks, and AU9 activates nose wrinkling. Without SZ-PAB, these upper-face activations largely disappear, which supports the paper’s claim that AU conditioning becomes suppressed when the model is forced to learn everything implicitly.

Temporal Stability Analysis under Stress Testing.

Figure 6 stress-tests the model with noisy AU inputs. The paper shows that the output without temporal modeling follows the noisy input too closely and remains jittery, while CIT-AE produces much smoother trajectories that preserve the underlying intensity trends. This is important because real AU detectors are not perfectly stable frame to frame.

Ablation study of CIT-AE. Ours w/o CIT-AE exhibits erratic activations (red) and temporal blur (orange), highlighting the importance of temporal modeling.

Figure 7 further demonstrates the importance of CIT-AE. Without it, the upper-face activations become erratic and temporal blur appears, especially under noisy AU control. The full model remains stable and preserves sharper facial motion over time.

User Study

The paper also reports a user study with 20 participants, evenly split by gender, who rated videos on a 5-point Likert scale across five dimensions: quality, emotion, lip synchronization, identity, and naturalness. EmoZone-Talker received the best score in every category, although its lip-sync margin over StyleTalk and DreamTalk is small (3.67 versus 3.65).

User study results.
Method	Quality	Emotion	Lip Sync	Identity	Naturalness
StyleTalk	3.38	2.68	3.65	2.34	2.98
EAT	3.62	3.24	3.62	3.53	3.19
DreamTalk	3.45	3.19	3.65	3.65	3.47
DICE-Talk	3.12	2.82	3.45	2.47	3.21
EmoZone-Talker	4.33	3.89	3.67	4.67	4.28

The strongest human-rated gains are in identity and naturalness, which is consistent with the method’s explicit decoupling of expression control from speech-driven articulation. The paper uses this to reinforce the idea that AU-level control is not just more interpretable, but also perceptually more natural when done with region awareness and temporal smoothing.

Limitations

The paper explicitly notes one limitation: facial actions are not strictly independent in practice. Some expression-related AUs, such as AU6, can naturally influence the mouth region because of shared muscle structures. As a result, strong expressions may still produce minor artifacts. The authors frame handling such anatomically coupled effects as an important direction for future work.

Takeaway

EmoZone-Talker’s main contribution is a structured way to think about emotional talking-head generation under conflict: instead of mixing speech and expression signals implicitly, it separates where each signal should act and how expression should evolve over time. In practice, this yields better AU controllability, better temporal stability, strong lip synchronization, and improved identity preservation, while staying efficient enough for real-time rendering.