MindFlow

MindFlow: Harmonizing Cognitive Semantics and Acoustic Dynamics for Facial Animation Generation in Dyadic Conversations

MindFlow generates lifelike facial animations in dyadic conversations by combining evolving emotional state reasoning with precise motion control. It models raw audio as an evolving chain of emotion states and adaptively fuses acoustic cues from both speakers to produce semantically rich and temporally accurate facial animation.

llm
multimodal
dialogue
talking-head
face-animation
emotion
audio-driven
streaming

Demos

The demos show MindFlow generating natural facial animation for dyadic conversations by harmonizing semantic understanding with acoustic dynamics. Look for smooth expressions that track emotional shifts, and for natural transitions between talking and listening without rigid motion. The examples highlight fine-grained, synchronized facial behavior that matches each conversational role and emotion.

MindFlow teaser image illustrating harmonized cognitive semantics and acoustic dynamics in facial animation of dyadic conversations.

Authors: Hejia Chen, Haoxian Zhang, Xu He, Xiaoqiang Liu, Pengfei Wan, Shoulong Zhang, Shuai Li

Categories: cs.CV

Comment: Accepted by ECCV 2026

Published 2026-06-26 · Updated 2026-06-26

Abstract

Generating lifelike facial animation for dyadic conversations requires reconciling high-level cognitive intent with precise low-level motor reflexes, yet existing methods fall short in the semantic understanding of dialogue context and in precise dynamic control. In this paper, we propose MindFlow, a dual-pathway generative framework inspired by the Ventral-Dorsal pathway model in neuroscience, which decouples generation into two collaborative streams, thereby harmonizing deep semantic reasoning with fine-grained control. In the Ventral module, we transform the conventional Sentence-Action approach into a novel Chunk-State approach that models raw acoustic streams as a context-aware, evolving emotional state chain, capturing subtle paralinguistic nuances and mid-utterance emotional shifts missed by sentence-level modeling. The Dorsal module features a conditional autoregressive flow matching network for high-fidelity facial motion, driven by high-frequency acoustic cues and modulated by emotion states, plus a Selective Acoustic Injector for adaptive audio gating to ensure robustness in talking-and-listening dynamics without interference. Extensive experiments demonstrate that MindFlow achieves superior semantic appropriateness and motion naturalness compared to state-of-the-art baselines.

1. Problem Setting and High-Level Idea

MindFlow targets streaming facial animation generation in dyadic conversations, where an avatar must both react naturally while listening and produce synchronized motion while speaking. The paper argues that existing audio-driven or dialogue-aware methods tend to fail in one of two ways: they are either good at low-level acoustic synchronization but semantically hollow, or semantically aware but too coarse, static, or sentence-bound to capture fine conversational timing.

The central design choice is to mirror the neuroscience-inspired Ventral-Dorsal dual-pathway hypothesis. MindFlow splits the problem into two coupled streams: a Ventral module for streaming cognitive-semantic inference over dialogue audio, and a Dorsal module for reflexive, high-fidelity motion synthesis. The key conceptual shift from prior work is the move from a Sentence-Action paradigm to a Chunk-State paradigm: rather than summarizing an entire sentence into a discrete motion command, MindFlow reasons over short raw-audio chunks and maintains an evolving emotion state across time.

Grounded in the Ventral-Dorsal dual-pathway cognitive model, MindFlow introduces a novel framework for streaming facial animation in dyadic conversations, under which digital avatars simultaneously perceive conversational emotions while reflexively synchronizing with acoustic rhythms, naturally yielding interactions that are both semantically rich and physically fluid.

2. Task Formulation

The task is formulated as a causal, streaming generation problem. At time step $t$, the system predicts the facial motion $M_a^t$ for interlocutor $a$ using only information available up to time $t$:

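$$M_a^t = G(A_a^{\le t}, A_b^{\le t}, S_a^{<t}, M_a^{<t})$$

Here, $A_a^{\le t}$ and $A_b^{\le t}$ are the continuous raw audio streams from the two interlocutors, $S_a^{<t}$ is the history of emotion states inferred for interlocutor $a$, and $M_a^{<t}$ is the previously generated motion.

The paper further discretizes time into chunks of size $w$ and writes the coupled reasoning/generation process as:

$$ \begin{cases} S_a^k = \operatorname{Ventral}(A_a^{\le k}, A_b^{\le k}, S_a^{<k}) \\ M_a^t = \operatorname{Dorsal}(A_a^{\le t}, A_b^{\le t}, S_a^k, M_a^{<t}) \end{cases} $$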

where $k = \lfloor t / w \rfloor$ is the chunk index. In the final system, the chunk window is set to 1.5 seconds.
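As a small worked example of the chunk index $k = \lfloor t / w \rfloor$, the sketch below maps timestamps and motion frames to chunks. It assumes the 1.5 s window and the 25 FPS output rate reported for the final system; the function names are illustrative and do not come from the paper.

```python
import math

FPS = 25             # Dorsal output rate reported for the final system
CHUNK_SECONDS = 1.5  # chunk window w reported for the final system


def chunk_index(t_seconds: float, w: float = CHUNK_SECONDS) -> int:
    """Return k = floor(t / w), the index of the chunk containing time t."""
    return math.floor(t_seconds / w)


def chunk_index_for_frame(frame: int, fps: int = FPS,
                          w: float = CHUNK_SECONDS) -> int:
    """Map a 0-based motion frame index to its chunk index."""
    return chunk_index(frame / fps, w)


if __name__ == "__main__":
    for frame in (0, 37, 38, 75):
        print(frame, chunk_index_for_frame(frame))
```

With these settings a chunk spans 37.5 frames, so chunk boundaries do not always coincide with frame boundaries: frame 37 (1.48 s) still belongs to chunk 0, while frame 38 (1.52 s) belongs to chunk 1.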

3. Overall Architecture

MindFlow is organized as a dual-pathway framework with two cooperating modules. The Ventral module acts as a cognitive semantic perceiver that continuously updates emotional state from audio chunks using a multimodal large language model (MLLM). The Dorsal module acts as a sensory-motor executor that uses the semantic state, along with acoustic cues from both interlocutors, to synthesize motion in an autoregressive and streaming fashion.

Inspired by the Ventral-Dorsal dual-pathway model, MindFlow generates lifelike conversational facial animations (both listening and speaking) driven by continuous dialogue audio $A_a$ and $A_b$. 1) The Ventral module functions as a cognitive semantic perceiver, leveraging an MLLM with a streaming Chain-of-State to decode evolving emotion states chunk-by-chunk, conditioned on the historical cognitive trajectory. 2) The Dorsal module functions as a reflexive sensory-motor executor. It incorporates a Selective Acoustic Injector that uses motion queries to adaptively gate unmixed audio streams, and a Stylistic Temporal Modulator that injects the Ventral emotion states as semantic guidance. These combined features condition an autoregressive flow-matching backbone for the continuous generation of high-fidelity, synchronous facial motions.

3.1 Why Chunk-State instead of Sentence-Action?

Prior dialogue-aware facial animation methods often use a Sentence-Action pipeline: an LLM reads text-level dialogue and emits a predefined motion instruction such as a nod, a gaze shift, or a smile. MindFlow argues that this has two structural flaws. First, text-only reasoning discards prosody and other paralinguistic cues that are present in the raw audio. Second, sentence boundaries are too coarse for conversational motion: real human expressions evolve within a sentence, not only at its end. Chunk-State addresses both issues by reasoning directly on fixed-window audio chunks and updating a state chain over time.

The supplementary prompt makes this design explicit: the MLLM is asked to infer one emotion label per chunk, jointly considering semantics and vocal prosody, while maintaining continuity with previous predictions. The label set is restricted to eight emotions: angry, contempt, disgusted, fear, happy, sad, surprised, and neutral.

4. Ventral Module: Streaming Cognitive State Modeling

The Ventral module is the semantic and emotional anchor of the system. Instead of mapping a full sentence to a static action, it consumes short audio chunks and updates an emotion state chain that represents the evolving conversational state. The implementation uses an MLLM in a streaming setting and a Chain-of-State mechanism to preserve context across chunks.

The core motivation is that if each chunk were analyzed independently, the model would lose conversational history. If, conversely, each query simply concatenated the current chunk with all prior chunks, the MLLM would have history but would not explicitly track previous emotion estimates. The paper’s solution is to append predicted state back into the context so that the model reasons over both the sensory history and its own previous emotional trajectory. This makes the inferred state smoother and more stable across time.
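The Chain-of-State loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `infer` callable stands in for the MLLM query, its signature is an assumption, and `toy_infer` is a made-up rule used only to make the loop runnable. The eight-label set is the one listed in the supplementary prompt.

```python
from typing import Callable, List, Sequence

EMOTIONS = ("angry", "contempt", "disgusted", "fear",
            "happy", "sad", "surprised", "neutral")

# Hypothetical signature: (audio chunks so far, states so far) -> label.
InferFn = Callable[[Sequence[bytes], Sequence[str]], str]


def run_chain_of_state(chunks: Sequence[bytes], infer: InferFn) -> List[str]:
    """Infer one emotion label per chunk, feeding earlier labels back in."""
    states: List[str] = []
    for k in range(len(chunks)):
        audio_history = chunks[: k + 1]              # sensory history
        label = infer(audio_history, tuple(states))  # plus own trajectory
        if label not in EMOTIONS:
            raise ValueError(f"unexpected label: {label!r}")
        states.append(label)                         # append state to context
    return states


def toy_infer(audio_history: Sequence[bytes],
              state_history: Sequence[str]) -> str:
    """Stand-in for the MLLM: long chunks count as 'surprised', otherwise
    the previous state is kept (or 'neutral' if there is none yet)."""
    if len(audio_history[-1]) > 4:
        return "surprised"
    return state_history[-1] if state_history else "neutral"


if __name__ == "__main__":
    print(run_chain_of_state([b"a", b"aaaaaa", b"b"], toy_infer))
```

The point of the sketch is the data flow: each query sees both the audio so far and the labels already emitted, which is what lets the state persist across chunks instead of being re-estimated from scratch.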

In the supplementary material, the authors stress that the Ventral module should infer the target speaker’s emotion even during listening: when the target is silent, the system should still infer the listener’s reaction from the other speaker’s speech content and prosody, rather than falling back to neutral by default.

This module therefore provides two forms of guidance to the Dorsal module: a coarse semantic interpretation of the ongoing interaction and a temporally evolving emotion state that acts as a prior for facial dynamics generation.

5. Dorsal Module: Reflexive Motion Generation

The Dorsal module is responsible for physically plausible motion synthesis. It is built as an autoregressive Transformer backbone with two specialized injectors and a flow-matching output head. The overall objective is to keep generation responsive enough for streaming use while still producing diverse, natural motion rather than deterministic regression outputs.

5.1 Stylistic Temporal Modulator

The Stylistic Temporal Modulator injects the Ventral emotion state into the motion backbone. The state embedding is appended to the hidden motion sequence and made visible to subsequent motion tokens through masked causal attention. This allows the model to condition on emotion without peeking into the future, preserving the streaming constraint.
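One way to picture the masked causal attention is as a boolean mask over a sequence made of motion tokens followed by appended state tokens. The sketch below is an assumption about the layout, not the paper's exact design: `state_frames[k]` is a hypothetical value giving the first frame from which state `k` may be used, and each motion query is allowed to attend to itself and to earlier motion tokens.

```python
from typing import List


def build_modulator_mask(num_frames: int,
                         state_frames: List[int]) -> List[List[bool]]:
    """Return mask[t][j]: may the motion query at frame t attend to key j?

    Keys are laid out as [motion_0 .. motion_{T-1}, state_0 .. state_{K-1}].
    """
    mask = []
    for t in range(num_frames):
        # Causal over motion: only frames up to and including t.
        motion_part = [j <= t for j in range(num_frames)]
        # A state token is visible only once its chunk has started.
        state_part = [start <= t for start in state_frames]
        mask.append(motion_part + state_part)
    return mask


if __name__ == "__main__":
    for row in build_modulator_mask(4, [0, 2]):
        print("".join("1" if allowed else "0" for allowed in row))
```

Under this layout a state that becomes available at frame 2 influences frames 2 and 3 but never frames 0 and 1, which is the property the streaming constraint needs.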

Conceptually, this is the semantic bridge between cognition and reflexes: the emotion state shapes the style of the upcoming motion, while the autoregressive backbone handles the exact frame-by-frame realization.

5.2 Selective Acoustic Injector

A major problem in dyadic animation is how to use the two audio streams. Prior work often early-fuses them by concatenation, which can blur the distinction between the target speaker’s own speech and the partner’s speech. MindFlow instead keeps the audio streams unmixed and interleaves them into a source-independent acoustic context $A_{\text{ctx}}$.

The Dorsal module then performs attention-based injection from motion queries to this acoustic context:

$$F_{\text{inject}} = \operatorname{Attn}(H_m, A_{\text{ctx}}, A_{\text{ctx}}) = \operatorname{Softmax}\left(\frac{Q_m K_{\text{ctx}}^\top}{\sqrt{d}}\right) V_{\text{ctx}}$$

This is called the Selective Acoustic Injector. Its purpose is to learn an adaptive gating policy: during speech, it tends to focus on the target speaker’s own audio for lip synchronization; during listening, it shifts toward the partner’s audio to trigger reactive expressions. Importantly, the paper states that this gating behavior is not manually supervised with speaker labels; it emerges from optimization under the flow-matching loss.
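The attention formula above can be illustrated with a dependency-free sketch. It is a simplification under stated assumptions: the learned query, key, and value projections are omitted (the raw vectors are used directly), and the frame-by-frame interleaving order in `interleave` is a guess at how the two unmixed streams form $A_{\text{ctx}}$.

```python
import math
from typing import List, Tuple

Vector = List[float]


def interleave(own: List[Vector], partner: List[Vector]) -> List[Vector]:
    """Build the acoustic context by alternating frames of the two streams."""
    ctx: List[Vector] = []
    for own_frame, partner_frame in zip(own, partner):
        ctx.extend([own_frame, partner_frame])
    return ctx


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector], keys: List[Vector],
                    values: List[Vector]
                    ) -> Tuple[List[Vector], List[List[float]]]:
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


if __name__ == "__main__":
    ctx = interleave([[1.0, 0.0]], [[0.0, 1.0]])
    _, attn = cross_attention([[4.0, 0.0]], ctx, ctx)
    print(attn[0])  # most weight falls on the first (own-speech) frame
```

Because the streams stay separate inside the context, a motion query can place almost all of its weight on one speaker's frames, which is the gating behavior the paper attributes to the injector.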

Attention map showing motion query focusing on relevant audio.

5.3 Autoregressive Transformer Backbone

The backbone stacks the Stylistic Temporal Modulator and the Selective Acoustic Injector in alternating blocks. The paper states that the network uses $L = 6$ identical blocks. During training, teacher forcing is used: the model consumes ground-truth history $M_a^{<t}$ to predict the current motion conditioning vector $C_a^t$ from motion history, emotion state, and audio context:

$$C_a^t = \operatorname{Backbone}(M_a^{<t}, S_a^t, A_{\text{ctx}})$$

This conditioning vector is then passed to the generative head.

5.4 Flow Matching Head

Instead of directly regressing motion, MindFlow uses a flow-matching objective. The motion generator learns a velocity field $v_\theta$ that transports samples from a standard Gaussian prior $\pi_0 = \mathcal{N}(0, I)$ to the data distribution $\pi_1$ along an ordinary differential equation:

$$dZ_\tau = v_\theta(Z_\tau, \tau \mid C)\, d\tau$$

The training loss is a mean-squared error between the predicted velocity and the straight-line displacement between noise and data samples:

$$\mathcal{L}_{\text{flow}} = \mathbb{E}_{\tau \sim \mathcal{U}[0,1],\, Z_0 \sim \pi_0,\, Z_1 \sim \pi_1} \left[ \lVert v_\theta(Z_\tau, \tau \mid C) - (Z_1 - Z_0) \rVert^2 \right]$$
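For a single training sample, the loss above can be computed as in the sketch below. The linear interpolation $Z_\tau = (1 - \tau) Z_0 + \tau Z_1$ is an assumption: it is the usual choice when the regression target is the straight-line displacement $Z_1 - Z_0$, but the summary does not state it explicitly. The velocity function `v` is a placeholder for the learned network.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical signature: (z_tau, tau, condition) -> predicted velocity.
VelocityFn = Callable[[Vector, float, Sequence[float]], Vector]


def interpolate(z0: Vector, z1: Vector, tau: float) -> Vector:
    """Point on the straight line from noise z0 to data z1 at time tau."""
    return [(1.0 - tau) * a + tau * b for a, b in zip(z0, z1)]


def flow_matching_loss(v: VelocityFn, z0: Vector, z1: Vector,
                       tau: float, cond: Sequence[float]) -> float:
    """Squared error between the predicted velocity and z1 - z0."""
    z_tau = interpolate(z0, z1, tau)
    predicted = v(z_tau, tau, cond)
    target = [b - a for a, b in zip(z0, z1)]
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


if __name__ == "__main__":
    def zero_velocity(z, tau, cond):
        return [0.0, 0.0]

    print(flow_matching_loss(zero_velocity, [0.0, 0.0], [3.0, 4.0], 0.5, []))
```

In training, this per-sample value would be averaged over random draws of $\tau$, noise samples, and data samples, matching the expectation in the formula.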

The authors emphasize that this choice avoids lengthy iterative denoising at inference time: the ODE is solved with only 5 Euler steps, enabling real-time operation while still producing diverse, expressive motion.
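A fixed-step Euler integration of the ODE can be sketched as below. The sampler itself follows the equation $dZ_\tau = v_\theta(Z_\tau, \tau \mid C)\, d\tau$; the velocity field `point_target_velocity` is a toy stand-in for the trained network, chosen because its exact flow is known, and it treats the condition as the target point purely for illustration.

```python
from typing import Callable, List, Sequence

Vector = List[float]
VelocityFn = Callable[[Vector, float, Sequence[float]], Vector]


def euler_sample(v: VelocityFn, z0: Vector, cond: Sequence[float],
                 steps: int = 5) -> Vector:
    """Integrate dZ = v(Z, tau | C) dtau from tau = 0 to 1 with Euler steps."""
    z = list(z0)
    dt = 1.0 / steps
    for i in range(steps):
        tau = i * dt
        velocity = v(z, tau, cond)
        z = [zi + dt * vi for zi, vi in zip(z, velocity)]
    return z


def point_target_velocity(z: Vector, tau: float,
                          cond: Sequence[float]) -> Vector:
    """Toy field that moves every sample in a straight line to `cond`."""
    return [(c - zi) / (1.0 - tau) for zi, c in zip(z, cond)]


if __name__ == "__main__":
    print(euler_sample(point_target_velocity, [0.0, 0.0], [1.0, -2.0]))
```

For this toy field the trajectories are exactly straight, so even a coarse Euler discretization lands on the target. A learned field is only approximately straight, which is why the step count becomes a tunable trade-off.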

6. Training and Implementation Details

The Dorsal module is trained on a combined dataset of roughly 20 hours of public data. The paper uses:

HDTF for fundamental motion priors;
VICOX for stronger listening behavior and interaction modeling;
MEAD and VICO for emotional talking and listening behaviors.

The training schedule is two-stage: pretraining on HDTF and VICOX for 90k steps, followed by fine-tuning on MEAD and VICO for 30k steps. Total training time is about 4 days. Optimization uses Adam, batch size $64$, peak learning rate $10^{-5}$, cosine decay, $1\%$ warmup, and no weight decay. The audio encoder is frozen throughout.
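The learning-rate schedule can be sketched as a function of the step count. Several details here are assumptions rather than reported facts: the warmup is taken to be linear, the cosine decay is taken to end at zero, and the schedule is applied to one stage at a time (for example the 90k-step pretraining stage).

```python
import math


def learning_rate(step: int, total_steps: int, peak: float = 1e-5,
                  warmup_frac: float = 0.01) -> float:
    """Linear warmup to `peak`, then cosine decay to zero."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 450, 900, 45_000, 90_000):
        print(step, learning_rate(step, 90_000))
```

With 90k steps and 1% warmup, the rate climbs for the first 900 steps, reaches the peak of $10^{-5}$, and then decays over the remaining 89,100 steps.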

Motion is represented with 51-dimensional ARKit blendshape coefficients for facial expressions and 3D Euler angles for head pose. These are extracted using MediaPipe and FSA-Net.

At inference time, the two modules run asynchronously. The Ventral module processes each 1.5 s chunk in approximately $1.38 \pm 0.10$ s, while the Dorsal module outputs motion at 25 FPS in real time. The paper reports that the system can sustain 2-minute sequences without memory growth and requires about 59 GB VRAM.
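The reported timings imply a simple budget check, worked through below. The numbers are the ones quoted above; the variable and function names are illustrative.

```python
CHUNK_SECONDS = 1.5          # audio covered by one Ventral chunk
VENTRAL_LATENCY_MEAN = 1.38  # reported mean processing time per chunk
VENTRAL_LATENCY_STD = 0.10   # reported standard deviation
FPS = 25                     # Dorsal output rate


def real_time_factor(latency: float, chunk: float = CHUNK_SECONDS) -> float:
    """Processing time divided by audio duration; below 1.0 keeps up."""
    return latency / chunk


frames_per_chunk = FPS * CHUNK_SECONDS

if __name__ == "__main__":
    print(real_time_factor(VENTRAL_LATENCY_MEAN))
    print(real_time_factor(VENTRAL_LATENCY_MEAN + VENTRAL_LATENCY_STD))
    print(frames_per_chunk)
```

At the mean latency the Ventral module uses about 92% of each chunk's duration, and it stays under the budget at one standard deviation above the mean, so the margin is real but narrow.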

6.1 Supplementary implementation refinements

The supplementary material adds several practical details. For head pose, the authors found that directly predicting absolute Euler angles can lead to drift: the pose may gradually rotate to extreme values during autoregressive inference. To mitigate this, the model predicts angular velocity instead. For all frames except the first, the current pose is obtained by adding the predicted velocity to the previous pose; for the first frame, the module still predicts the absolute pose.
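The velocity-based head-pose update amounts to a running sum, as in the sketch below. The function name and the list-based pose format (three Euler angles per frame) are assumptions made for illustration.

```python
from typing import List

Pose = List[float]  # three Euler angles


def integrate_head_pose(first_pose: Pose,
                        velocities: List[Pose]) -> List[Pose]:
    """Recover absolute poses from one absolute pose plus per-frame deltas."""
    poses = [list(first_pose)]  # frame 0 is predicted as an absolute pose
    for velocity in velocities:
        previous = poses[-1]
        poses.append([p + v for p, v in zip(previous, velocity)])
    return poses


if __name__ == "__main__":
    print(integrate_head_pose([0.0, 10.0, 0.0],
                              [[1.0, -2.0, 0.0], [0.5, 0.0, 0.0]]))
```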

They also apply a historical perturbation during training to reduce over-reliance on past motion. Specifically, the input motion $M$ is noised as:

$$M^{\text{noised}} = (1 - \sigma)M + \sigma N$$

where $N$ is Gaussian noise and $\sigma$ is sampled from $[0.01, 0.05]$ during training and fixed to $0.02$ during inference. In addition, the cross-attention injection coefficient $\lambda_{\text{cross}}$ is increased to $2.5$ to strengthen the influence of audio on expression generation.
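The historical perturbation can be sketched directly from the formula. Two points are assumptions: $N$ is taken to be standard Gaussian noise, and $\sigma$ is drawn uniformly from the stated range, since the sampling distribution is not specified here.

```python
import random
from typing import List

Frame = List[float]


def sample_training_sigma(rng: random.Random, low: float = 0.01,
                          high: float = 0.05) -> float:
    """Draw the noise level for one training example (uniform is assumed)."""
    return rng.uniform(low, high)


def perturb_history(motion: List[Frame], sigma: float,
                    rng: random.Random) -> List[Frame]:
    """Apply M_noised = (1 - sigma) * M + sigma * N element-wise."""
    return [[(1.0 - sigma) * x + sigma * rng.gauss(0.0, 1.0) for x in frame]
            for frame in motion]


if __name__ == "__main__":
    rng = random.Random(0)
    history = [[0.1, 0.2], [0.3, 0.4]]
    print(perturb_history(history, sample_training_sigma(rng), rng))
    print(perturb_history(history, 0.02, rng))  # inference-time setting
```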

A failure case of head-pose drift: the drift appears after only a few inference steps. Frames are displayed at $2.5$ fps for clarity.

7. Evaluation Protocol

The paper evaluates both talking and listening generation. For motion realism it uses Fréchet Distance (FD). For talking-phase lip synchronization it uses SyncNet-based metrics: SyncD and SyncC. For listening-phase expression quality it uses FD and mean squared error (MSE). The authors adapt SyncNet from video evaluation to 3D motion by replacing the visual branch with MLP layers and training the adapted model on a high-quality 3D mocap talking dataset.
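To make the two listening metrics concrete, the sketch below computes a Fréchet distance between Gaussians fitted to real and generated motion, together with a plain MSE. It assumes diagonal covariances so that no matrix square root is needed; the general Fréchet distance uses full covariance matrices, and the summary does not say which feature space or covariance form the paper uses.

```python
import math
from typing import List, Tuple

Sample = List[float]


def column_stats(samples: List[Sample]) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and (population) standard deviation."""
    n, dims = len(samples), len(samples[0])
    means = [sum(s[d] for s in samples) / n for d in range(dims)]
    stds = [math.sqrt(sum((s[d] - means[d]) ** 2 for s in samples) / n)
            for d in range(dims)]
    return means, stds


def frechet_distance_diagonal(real: List[Sample],
                              generated: List[Sample]) -> float:
    """||mu_r - mu_g||^2 + sum_d (sigma_r - sigma_g)^2 for diagonal Gaussians."""
    mu_r, sd_r = column_stats(real)
    mu_g, sd_g = column_stats(generated)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    spread_term = sum((a - b) ** 2 for a, b in zip(sd_r, sd_g))
    return mean_term + spread_term


def mean_squared_error(real: List[Sample], generated: List[Sample]) -> float:
    """Average squared difference over all frames and dimensions."""
    diffs = [(a - b) ** 2 for r, g in zip(real, generated)
             for a, b in zip(r, g)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    real = [[0.0, 0.0], [2.0, 0.0]]
    generated = [[2.0, 0.0], [4.0, 0.0]]
    print(frechet_distance_diagonal(real, generated))
    print(mean_squared_error(real, generated))
```

The two metrics answer different questions: the Fréchet distance compares distributions and ignores frame pairing, while MSE compares each generated frame against its ground-truth counterpart.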

The experiments compare against both talking-head and dyadic listening methods. For talking, the paper compares with EmoTalk, UniTalker, DualTalk, and a retrained Audio2Photoreal baseline. For listening, it compares with L2L, RLHG, DIM, DualTalk, and Audio2Photoreal.

8. Main Results

MindFlow reports state-of-the-art performance on both talking and listening metrics. The strongest gains appear in motion naturalness and expression quality, while lip synchronization also improves slightly over the best baseline.

8.1 Talking-state results

Method	SyncD ↓	SyncC ↑	FD Exp ↓	FD Pose ↓
EmoTalk	0.429	0.412	23.52	—
UniTalker	0.480	0.300	29.88	—
DualTalk	0.467	0.346	26.06	0.18
Audio2Photoreal	0.341	0.519	17.64	0.03
MindFlow	0.333	0.520	15.76	0.01

On the HDTF test set, MindFlow achieves the best reported performance across all talking metrics. The largest gain is in facial expression realism, where expression FD drops from $17.64$ (Audio2Photoreal, the strongest baseline) to $15.76$; head-pose FD also improves from $0.03$ to $0.01$. Lip synchronization gains are small: SyncD improves from $0.341$ to $0.333$ and SyncC from $0.519$ to $0.520$.

8.2 Listening-state results

Method	FD Exp ↓	FD Pose ↓	MSE Exp ↓	MSE Pose ↓
L2L	33.93	0.06	0.93	0.01
RLHG	39.02	0.07	0.86	0.01
DIM	23.88	0.06	0.70	0.01
DualTalk	22.27	0.05	0.58	0.01
Audio2Photoreal	14.24	0.03	0.34	0.01
MindFlow	13.86	0.03	0.30	0.01

On the VICO test set, MindFlow again performs best overall, with the lowest expression FD ($13.86$) and expression MSE ($0.30$); its pose metrics tie with Audio2Photoreal. The listening results are especially important because the paper’s thesis is that conversation-aware animation must remain expressive even when the avatar is not talking. MindFlow’s semantic guidance and generative motion model are credited for reducing over-smoothed expressions and improving contextual fit.

8.3 Additional evaluation on HDTF

Method	LVE ↓	Diversity × $10^{-2}$ ↑
EmoTalk	6.212	5.558
UniTalker	7.866	2.242
Audio2Photoreal	9.676	21.332
DualTalk	8.806	2.074
MindFlow	6.305	24.197

The supplementary results show that MindFlow remains competitive in lip accuracy (LVE of $6.305$ against the best value of $6.212$ from EmoTalk) while producing the highest expression diversity among the compared methods ($24.197$, ahead of Audio2Photoreal at $21.332$), supporting the claim that flow matching improves variety without sacrificing synchronization.

Performance under conversational tension: MindFlow captures the intense atmosphere and exhibits semantically and temporally appropriate expressions, whereas baseline methods suffer from unstable and hollow expressions due to boundary artifacts and a lack of semantic guidance.

9. Qualitative Analysis

The qualitative comparisons in the paper support the quantitative results. Against Dorsal-only baselines such as DualTalk and Audio2Photoreal, MindFlow is reported to produce smoother head movement across both speaking and listening phases, with better long-range emotional coherence. In contrast to fixed-length or sliding-window generation systems, its autoregressive flow-matching backbone is used to maintain stability across long sequences.

The paper also compares the proposed Chunk-State approach with the Sentence-Action paradigm used by CustomListener-style systems. The figure and discussion emphasize that MindFlow generates more appropriately timed reactions, such as a smile at the end of an utterance, whereas sentence-level action planning tends to produce static or delayed responses.

Compared to the Sentence-Action approach, our proposed Chunk-State approach yields reactions with more appropriate timing and style.

Visualization of Ventral module reasoning and corresponding generated result. The MetaHuman plugin in Unreal Engine is utilized for rendering.

The authors also report a blind perception study with 24 collected evaluations, where participants chose the best result under naturalness and fitness criteria. The paper states that MindFlow consistently outperformed prior methods in this human study.

10. Ablation Studies

The ablations isolate the contribution of the Ventral module, the Selective Acoustic Injector, and the choice of chunk size and sampling steps. Overall, the ablations support the paper’s main thesis that semantic state, not just audio correlation, is required for good dyadic animation.

10.1 Ventral guidance and Chunk-State

On the VICO test set, the authors compare the Ventral state used to guide the Dorsal module against three alternatives: random state, fixed sentence-level ground truth state, and a Sentence-Action variant. The reported numbers show that the evolving emotion state from the Ventral module performs best.

Method	FD Exp ↓	FD Pose ↓	MSE Exp ↓	MSE Pose ↓
Random	15.21	0.03	0.36	0.01
Fixed	14.15	0.03	0.32	0.01
Sentence-Action variant	14.39	0.03	0.33	0.01
MindFlow	13.86	0.03	0.30	0.01

The paper’s interpretation is that the dynamic emotion trajectory is more useful than static or random guidance because it is temporally aligned with the ongoing conversation and better matches the ground-truth facial dynamics.

10.2 Selective Acoustic Injector

The Selective Acoustic Injector is tested by cross-ablations with Audio2Photoreal. The results show consistent improvement in lip synchronization when the injector is used, indicating that selective access to unmixed audio streams is beneficial for both architectures.

Method	Injector	SyncD ↓	SyncC ↑
Audio2Photoreal	w/o	0.341	0.519
Audio2Photoreal	w/	0.331	0.526
MindFlow	w/o	0.350	0.424
MindFlow	w/	0.333	0.520

The paper notes that the bidirectional Audio2Photoreal variant still performs well, which is consistent with the general advantage of non-causal access, but the injector itself improves both systems. For MindFlow, removing the injector harms SyncC significantly, showing that the module is important not just for lip accuracy but also for better coordination between acoustic context and motion generation.

10.3 Chunk size

Chunk length trades off emotion-state granularity and stability. The user study reported in the paper shows that increasing chunk size improves emotion prediction accuracy because more temporal context is available, but it reduces perceived synchronicity. The authors therefore choose $1.5$ seconds as the default chunk size.

10.4 Sampling steps

The authors also vary the number of ODE sampling steps for the flow-matching solver. The relationship between step count and lip synchronization is reported as non-monotonic. At low step counts, discretization error dominates; at higher step counts, accumulated prediction errors can introduce jitter and degrade SyncNet scores. The best trade-off is obtained at 5 steps, which is the setting adopted in the final system.

Impact of sampling step on synthesis quality and inference latency

11. Reported Limitations and Future Work

The main stated limitation is modality coverage. MindFlow currently relies only on audio to infer emotion and semantics. The authors explicitly note that natural face-to-face conversation also depends on visual cues such as eye contact, facial expression, and body language. As a result, audio-only inference may miss silent but meaningful non-verbal signals.

Their stated future direction is to extend the framework to multimodal sensory inputs, especially by integrating visual cues alongside raw audio so that the model can obtain a more holistic contextual understanding and generate more empathetic and accurate reactions.

12. Conclusion

MindFlow’s core contribution is a streaming, dual-pathway facial animation framework that separates semantic state tracking from reflexive motion generation. The Ventral module performs chunk-level emotional reasoning with a Chain-of-State memory, while the Dorsal module uses a Selective Acoustic Injector plus flow matching to produce high-fidelity motion in real time. The reported results support the paper’s claims that this design improves semantic appropriateness, motion naturalness, and temporal continuity over prior dyadic conversation baselines.