Akapulu Labs logo Akapulu Labs Research

M2S-AVSR

M2S-AVSR: Modality-aware Multi-view Self-supervised Representation for Robust Audio-Visual Speech Recognition

M2S-AVSR — method overview

M2S-AVSR improves audio-visual speech recognition by learning view-invariant visual features and adaptively gating visual input based on quality and timing. It boosts robustness against viewpoint changes, occlusion, and asynchrony, and introduces a real-world multi-view AV dataset for challenging environments.

  • multimodal
  • asr
  • talking-head

Authors: Fei Su, Cancan Li, Ming Li, Juan Liu

Categories: eess.AS, cs.SD

Comment: submitted to IEEE Transactions on Audio, Speech, and Language Processing

Published 2026-06-04 · Updated 2026-06-05

Abstract

Audio-Visual Speech Recognition (AVSR) enhances speech recognition robustness by leveraging visual cues, while real-world scenarios remain challenging due to viewpoint variation, audio distortion, and visual occlusion, which degrade modality quality and increase audio-visual asynchrony. In this paper, we propose a novel Modality-aware Multi-view Self-supervised representation framework for robust Audio-Visual Speech Recognition (M2S-AVSR). First, we introduce a multi-view representation learning encoder to learn view-invariant visual speech representations. Next, we employ a modality-aware module that explicitly models modality quality and cross-modal synchrony to perform fine-grained modality-aware fusion, enabling fine-grained visual information injection during decoding. In addition, we release AISHELL8-RealScene, a public multi-scenario, multi-view conversational audio-visual dataset recorded in real-world environments, and establish a speech recognition benchmark on it. Experiments on English and Mandarin benchmarks demonstrate the effectiveness of the proposed method under challenging conditions. On LRS3, M2S-AVSR achieves up to 29.4% relative improvement under viewpoint perturbation and visual degradation settings. Our method also achieves new state-of-the-art performance on the MISP2021-AVSR test set. On AISHELL8-RealScene, it achieves the best result in outdoor scenes. The proposed method and dataset provide useful support for future research on robust speech and multimodal tasks under realistic conditions.


1. Problem Setting and High-Level Idea

This paper addresses robust audio-visual speech recognition (AVSR) in realistic environments, where the visual stream is not guaranteed to be stable or aligned. The motivating failure modes are threefold: viewpoint variation (non-frontal or changing camera angle), visual degradation (occlusion, blur, missing lip regions), and audio-visual asynchrony or inconsistency. The authors argue that standard AVSR fusion methods are often too uniform: they combine audio and video without explicitly estimating when the visual stream is trustworthy or synchronized.

The proposed solution, M2S-AVSR (Modality-aware Multi-view Self-supervised representation for robust AVSR), has two main components: (1) a multi-view self-supervised visual encoder that learns view-invariant lip representations from real and synthesized views, and (2) a modality-aware fusion mechanism that gates visual injection based on visual quality and cross-modal synchrony.

The paper also contributes a new dataset, AISHELL8-RealScene, a multi-scenario, multi-view conversational AV corpus recorded in real-world environments, and uses it to establish a benchmark for realistic speech recognition.

Illustration of AVSR performance degradation under challenging acoustic and visual conditions, evaluated on the LRS3 test set using Whisper-Flamingo trained on 433 h of labeled LRS3 data.
Illustration of AVSR performance degradation under challenging acoustic and visual conditions, evaluated on the LRS3 test set using Whisper-Flamingo trained on 433 h of labeled LRS3 data.

2. Main Contributions

  • Multi-view self-supervised visual representation learning. The MVL encoder is trained on real and synthesized multi-view pairs using consistency and domain-alignment objectives, so that the visual representation becomes more robust to camera pose changes.
  • Modality-aware fusion. The decoder does not blindly inject visual features. Instead, it estimates a frame-wise visual quality gate $g_q(t)$ and a synchrony gate $g_s(t)$, then combines them into a modality-aware gate $g_{\mathrm{ma}}(t)$ to regulate visual cross-attention.
  • New real-world benchmark. AISHELL8-RealScene contains 102.19 hours of synchronized audio and video from 171 foreground speakers across five locations and both indoor and outdoor settings.
  • Strong empirical results. On LRS3, the method improves robustness under viewpoint perturbation and masking; on MISP2021-AVSR it reaches new state of the art; and on AISHELL8-RealScene it performs best in outdoor scenes.

3. Overall Architecture

The full system uses a Whisper Large encoder as the audio backbone and an AV-HuBERT-initialized multi-view visual encoder as the visual backbone. The decoder is based on Whisper, but each decoder block is modified by inserting visual cross-attention before self-attention and by controlling the visual contribution with modality-aware gates.

In the architecture figure, the visual encoder is called the MVL encoder. During AVSR training and inference, it consumes only multi-view visual features; the extra audio branch used in multi-view pretraining is disabled. The method is trained in three stages: visual multi-view pretraining, audio-only Whisper adaptation, and AVSR fusion training with frozen encoders.

Overview of the proposed M2S-AVSR framework. Left: the overall architecture of the proposed framework. The audio and video front-ends use a two-layer Conv1D network and ResNet-18, respectively. The audio encoder and MVL encoder are initialized from Whisper and AV-HuBERT, respectively, and are built upon Transformer encoder architectures. Right: the three main components of M2S-AVSR, including the Gated Audio-Visual Fusion Decoder, the MVL Encoder, and the Modality-Aware Module. Dashed lines in the MVL encoder denote components used only during self-supervised pretraining. During AVSR training and inference, the MVL encoder takes only multi-view visual features, while the audio branch is disabled. MVC and RDA denote Multi-View Consistency and Representation Domain Alignment, respectively.
Overview of the proposed M2S-AVSR framework. Left: the overall architecture of the proposed framework. The audio and video front-ends use a two-layer Conv1D network and ResNet-18, respectively. The audio encoder and MVL encoder are initialized from Whisper and AV-HuBERT, respectively, and are built upon Transformer encoder architectures. Right: the three main components of M2S-AVSR, including the Gated Audio-Visual Fusion Decoder, the MVL Encoder, and the Modality-Aware Module. Dashed lines in the MVL encoder denote components used only during self-supervised pretraining. During AVSR training and inference, the MVL encoder takes only multi-view visual features, while the audio branch is disabled. MVC and RDA denote Multi-View Consistency and Representation Domain Alignment, respectively.

4. Multi-View Self-Supervised Visual Representation Learning

The visual side is where the paper makes its most distinct representation-learning contribution. Rather than relying on a standard pretrained AV-HuBERT visual encoder, the authors build a multi-view representation learning (MVL) encoder initialized from AV-HuBERT Large and then further optimized using a mix of real and synthesized views. The purpose is to make the lip encoder robust to viewpoint changes and to align representations across domains.

4.1 Multi-View Consistency Loss

For each utterance, the paper pairs a real view $\mathbf{x}_{r,i}$ with a synthesized view $\mathbf{x}_{s,i}$ and compares their embeddings $\mathbf{X}_{r,i}=f(\mathbf{x}_{r,i})$ and $\mathbf{X}_{s,i}=f(\mathbf{x}_{s,i})$. Two alignment terms are used:

$$L_{\mathrm{mse}} = \frac{1}{N}\sum_{i=1}^{N} \lVert \mathbf{X}_{r,i} - \mathbf{X}_{s,i} \rVert_2^2$$

and a correlation-matching term based on normalized embeddings $\mathbf{Z}_{r,i}$ and $\mathbf{Z}_{s,i}$:

$$L_{\mathrm{corr}} = \frac{1}{N}\sum_{i=1}^{N} \left\| \operatorname{Corr}(\mathbf{Z}_{r,i}) - \operatorname{Corr}(\mathbf{Z}_{s,i}) \right\|_F^2$$

These are combined as

$$L_{\mathrm{MVC}} = \alpha L_{\mathrm{corr}} + (1-\alpha)L_{\mathrm{mse}}$$

so the encoder matches both pointwise features and their internal relational structure. The paper emphasizes that the correlation term is intended to preserve structural dependencies, not just low-level similarity.

4.2 Representation Domain Alignment

To reduce the gap between real and synthesized views, the authors add a contrastive representation domain alignment objective. A real sample and its corresponding synthesized sample from the same utterance form a positive pair, while other utterances in the batch serve as negatives. Using cosine similarity, the loss is

$$L_{\mathrm{RDA}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\operatorname{sim}(\mathbf{X}_{r,i},\mathbf{X}_{s,i})/\tau)}{\sum_{j=1}^{M}\exp(\operatorname{sim}(\mathbf{X}_{r,i},\mathbf{X}_{n,j})/\tau)}$$

with temperature $\tau$. This encourages the encoder to focus on speech-related motion patterns rather than view-specific artifacts introduced by simulation.

4.3 Full MVL Objective

The MVL encoder is trained with the weighted sum

$$L_{\mathrm{MVL}} = \lambda_{\mathrm{MVC}}L_{\mathrm{MVC}} + \lambda_{\mathrm{RDA}}L_{\mathrm{RDA}} + \lambda_{\mathrm{MMP}}L_{\mathrm{MMP}}$$

where $L_{\mathrm{MMP}}$ is the masked multimodal cluster prediction objective inherited from AV-HuBERT pretraining. In other words, the encoder is not trained from scratch: it retains AV-HuBERT’s masked prediction behavior while being explicitly adapted to multi-view robustness.

Detailed structure of the modality-aware module. The upper branch estimates the visual quality score g_q(t). The lower branch estimates the synchrony score g_s(t) through contrastive synchrony modeling and local-window l2 distance. The two scores are then fused to produce the final modality-aware gating value g_ma(t).
Detailed structure of the modality-aware module. The upper branch estimates the visual quality score $g_q(t)$. The lower branch estimates the synchrony score $g_s(t)$ through contrastive synchrony modeling and local-window $\ell_2$ distance. The two scores are then fused to produce the final modality-aware gating value $g_{\mathrm{ma}}(t)$.

5. Modality-Aware Fusion

After the MVL encoder produces view-robust visual features $\hat{\mathbf{X}}_v$, the paper tackles the more difficult question of when and how to inject visual information into decoding. This is important because visual input may be unreliable due to occlusion, blur, or changes in viewpoint, and because even high-quality lip features can be misaligned with audio.

5.1 Quality-Aware Gate

The first branch estimates frame-level visual reliability. A temporal convolution is applied to the visual features, followed by a lightweight MLP and sigmoid output to produce a gate $g_q(t) \in [0,1]$. Intuitively, this gate is larger when the visual stream is cleaner and more useful for decoding, and smaller when the lips are occluded or otherwise degraded.

5.2 Synchrony-Aware Gate

The second branch measures cross-modal consistency. The audio and visual streams are projected into a shared embedding space using modality-specific projectors, and their local-window distance is computed as

$$D_s(t) = \frac{1}{2T_w+1}\sum_{k=t-T_w}^{t+T_w} \lVert \mathbf{E}_a(k) - \mathbf{E}_v(k) \rVert_2$$

The corresponding synchrony gate is

$$g_s(t) = \frac{\gamma}{\gamma + D_s(t)}$$

so that higher synchrony yields a larger gate. The local temporal window $T_w$ gives some tolerance to slight misalignment, which is important in practice.

5.3 Fused Modality-Aware Gate

The two gates are combined through logit-space fusion:

$$g_{\mathrm{ma}}(t) = \tanh\Big(w_q\,\operatorname{logit}(g_q(t)) + w_s\,\operatorname{logit}(g_s(t))\Big)$$

This final scalar modulates visual cross-attention in the decoder. The paper’s design choice is to keep the audio backbone as the stable foundation and use the visual stream as an adaptively injected refinement rather than a uniformly fused signal.

5.4 Gated Audio-Visual Fusion Decoder

In each decoder block, the hidden state attends to the visual features via cross-attention, and the result is scaled by $g_{\mathrm{ma}}(t)$ before being added back to the residual stream. The block also uses a learnable feed-forward gate $g_{\mathrm{ff}}$ bounded by $\tanh(\cdot)$, and all gating parameters are initialized near zero to prevent destabilizing the pretrained Whisper decoder early in training.

6. Training Strategy and Objectives

The system is trained in three stages:

  1. MVL pretraining. The visual encoder is initialized from AV-HuBERT Large and trained on multi-view data with the MVL loss.
  2. Whisper adaptation. The Whisper encoder is fine-tuned on audio-only data for domain adaptation, following the paper’s audio-pretraining recipe.
  3. AVSR fusion training. Both encoders are frozen, and only the modality-aware module plus the visual cross-attention layers in the decoder are trained.

During the final fusion stage, the paper adds a contrastive synchrony regularizer for temporally aligned audio-visual segments versus shifted mismatched segments:

$$L_{\mathrm{sync}} = \frac{1}{T_s}\sum_{t=1}^{T_s} \left(y_t D_s^2 + (1-y_t)[\max(m-D_s,0)]^2\right)$$

The AVSR objective is then

$$L_{\mathrm{avsr}} = L_{\mathrm{att}} + \lambda_{\mathrm{sync}}L_{\mathrm{sync}}$$

where $L_{\mathrm{att}}$ is the autoregressive cross-entropy on target tokens. The authors set $\lambda_{\mathrm{sync}} = 0.1$.

Key implementation choices reported in the paper include: initialization from the 5th-iteration AV-HuBERT Large model, pretraining on $\text{ms433h}$ and $\text{ms1759h}$, fine-tuning on $\text{ms30h}$ or $\text{ms433h}$, a two-stage loss schedule with $\tau = 0.07$ and $\alpha = 0.6$, decay of $\lambda_{\mathrm{RDA}}$ from $0.3$ to $0.1$ in the last 70% of training, Adam optimization with weight decay $0.01$, learning rate $0.002$ for MVL pretraining, and a beam size of 25 at inference.

7. AISHELL8-RealScene Dataset

AISHELL8-RealScene is presented as a public, multi-scenario, multi-view conversational AV corpus recorded in real-world environments. It contains 102.19 hours of synchronized audio and video from 171 foreground speakers. The collection includes both indoor and outdoor scenes and is explicitly designed to capture realistic background activity and interference.

Overview of the AISHELL8-RealScene recording scenes. Schematic illustration of the recording scenes, where m foreground speakers (F-speakers, m=1-5) converse in Mandarin, while n background speakers (B-speakers, n=1-5) stand behind them to create natural interfering activities. An example of an actual recording scene (hotel). All recording scenes in AISHELL8-RealScene: outside the office building, residential hall, hotel, park, and street.
Overview of the AISHELL8-RealScene recording scenes. (a) Schematic illustration of the recording scenes, where $m$ foreground speakers (F-speakers, $m=1\text{--}5$) converse in Mandarin, while $n$ background speakers (B-speakers, $n=1\text{--}5$) stand behind them to create natural interfering activities. (b) An example of an actual recording scene (hotel). (c) All recording scenes in AISHELL8-RealScene: outside the office building, residential hall, hotel, park, and street.
Distribution analysis of AISHELL8-RealScene dataset.
Distribution analysis of AISHELL8-RealScene dataset.

7.1 Recording Setup

Audio is captured with a circular microphone array of 16 microphones at 16 kHz / 16-bit, and the release includes 8-channel far-field audio chosen by uniform angular subsampling. Foreground speakers also wear near-field close-talking microphones for transcription. Video is captured by three HD cameras at 25 fps from roughly $30^\circ$ separated viewpoints; face detection, face recognition, and face re-identification are used to extract foreground-speaker face crops at $256\times256$ resolution.

The corpus covers five locations: building, hall, hotel, park, and street. The paper defines five permutation patterns (P1--P5) that vary the number of foreground and background speakers to create different conversational and interference settings. The resulting split is balanced across locations, with no speaker overlap across train/dev/eval subsets.

7.2 Dataset Splits

Split Duration (h) Indoor (h) Outdoor (h) Sessions Groups F-speakers Gender ratio (M/F)
Train79.8436.7443.10133561331:1.18
Dev10.705.035.67187181:0.29
Eval11.655.286.37207201:0.82
Total102.1947.0555.14171701711:0.99

The paper notes that the overlapping-speech ratio is about 25%, making the benchmark suitable for realistic conversational modeling rather than isolated speech only.

8. Experimental Setup

The paper evaluates on LRS3, VoxCeleb2, MISP2021-AVSR, OuluVS2, and the new AISHELL8-RealScene benchmark. For English AVSR, LRS3 is split into 433 h train / 1 h validation / 1 h test, and the 433 h LRS3 training set is combined with 1,326 h of VoxCeleb2 for large-scale pretraining. For LRS3 robustness tests, babble noise at $0$ dB SNR is applied, and synthetic test conditions include view offsets of $5^\circ$ and $15^\circ$ and visual masking ratios of $0.1$ and $0.3$.

The paper also builds multi-view training variants by mixing synthesized and real data. In the reported setup, $\text{ms30h}$ and $\text{ms433h}$ contain 30% and 40% synthesized data, respectively, while $\text{ms1759h}$ is the larger pretraining set. For Mandarin evaluation, the paper reports character error rate (CER), and for LRS3 it reports word error rate (WER).

9. Main Results on LRS3

The LRS3 table shows that M2S-AVSR is particularly strong under adverse conditions. Compared with AV-HuBERT and CMA, it lowers noisy WER from $5.80\%$ and $4.40\%$ to $3.00\%$ in the 433 h setting, and further to $2.12\%$ in the 1,759 h setting. Under viewpoint perturbation and masking, it consistently beats prior AVSR and LLM-based baselines.

The paper also reports that, when multi-view data are used, M2S-AVSR improves further and achieves up to 29.4% relative improvement under challenging visual conditions. In the table below, values are shown exactly as reported.

Model Modality Total params Training data Clean WER (%) Noisy WER (%) View $5^\circ$ View $15^\circ$ Mask 0.1 Mask 0.3
AV-HuBERTA+V477M433/1,759 h1.405.809.6014.087.2911.14
Whisper-FlamingoA+V2.5B433/1,759 h1.50/2.005.60/5.606.86/6.377.39/7.266.72/6.178.52/8.22
Llama-AVSRA+V8.7B1,759/1,759 h0.774.006.009.484.736.75
MMS-LlamaA+V3.2B433/1,759 h0.90/0.722.4/1.94.90/3.956.93/6.194.79/4.536.17/5.90
M2S-AVSRA+V2.6B433/1,759 h0.82/0.683.00/2.125.58/4.006.78/5.955.38/4.647.23/5.84
M2S-AVSR with multi-view dataA+V2.6Bms433/ms1,759 h0.82/0.652.84/2.024.83/3.865.78/4.055.30/4.606.01/5.77

The key takeaway is that the proposed fusion scheme adds robustness without relying on very large language-model backbones. The authors explicitly note that noisy-condition comparisons are not perfectly apples-to-apples across methods because some baselines use different babble-noise generation protocols.

10. Visual Robustness Analysis and Multi-View Learning Results

To understand what the MVL encoder learns, the paper includes both qualitative and quantitative analyses. The occlusion-based sensitivity plot shows that the proposed encoder keeps its attention concentrated on lip regions even when the view changes, whereas AV-HuBERT becomes less stable under simulated viewpoint shifts. The authors interpret this as evidence that MVL learns more view-consistent lip representations.

Occlusion-based lip sensitivity visualization for self-supervised multi-view representation learning on an LRS3 test utterance at SNR=0 dB. Original view and simulated view with yaw +20 degrees. Heatmap encodes occlusion-induced loss change delta L.
Occlusion-based lip sensitivity visualization for self-supervised multi-view representation learning on an LRS3 test utterance at $\mathrm{SNR}=0$ dB. (a) Original view. (b) Simulated view (yaw $=+20^\circ$), with identical audio input. Each row shows 8 uniformly sampled frames from a 60-frame clip. The heatmap color encodes the occlusion-induced loss change $\Delta\mathcal{L}$, where blue indicates lower values and red indicates higher values. GT and Pred denote the ground-truth and predicted transcripts, respectively. Red tokens in Pred indicate mismatches with GT, and [miss] denotes deletion.
Evaluation on multi-view data showing angular offset, random proportions, LRS3 yaw subsets, and OuluVS2 viewpoints.
Evaluation on multi-view data (VSR WER (%)). “30h” and “433h” denote AV-HuBERT Large pretrained on 1759h and fine-tuned on 30h and 433h, respectively. “mv30h” and “mv433h” denote AV-HuBERT with the MVL encoder, pretrained on ms1759h and fine-tuned on ms30h and ms433h, respectively. (a) Angular Offset: gradually introducing different horizontal viewing angles into the test set. (b) Random Proportions: evaluating varying proportions of simulated multi-view data. (c) LRS3: evaluation on three subsets of the LRS3 test set grouped by face yaw angles ($<10^\circ$, $10^\circ$--$30^\circ$, and $>30^\circ$). (d) OuluVS2: evaluation on five viewpoints ($0^\circ$, $30^\circ$, $45^\circ$, $60^\circ$, and $90^\circ$) and the overall set (All), following the protocol of the cited reference.

In supplementary experiments, the MVL encoder consistently outperforms the original AV-HuBERT under synthetic and real multi-view settings. Reported examples include a WER of $41.29\%$ versus $44.09\%$ at $+20^\circ$ on the synthesized-view test, and $3.65\%$ versus $4.54\%$ on the full OuluVS2 set. The paper also reports relative reductions of up to $20.8\%$ on yaw-grouped LRS3 subsets.

11. Modality-Aware Fusion Results on MISP2021-AVSR

The MISP2021-AVSR benchmark is used to test the modality-aware fusion module under realistic indoor conversational conditions. The paper reports a CER of 21.95% for the full M2S-AVSR model, which is better than all listed prior systems, including ModalBiasAVSR at 22.13% and Whisper-Flamingo at 26.08% in the compared setup. With ROVER post-processing, CER drops further to 18.82%, which the paper describes as a 12.6% relative reduction over the previous best ROVER result.

Visualization of modality-aware fusion on the MISP2021 dataset. Test sample with gating value over time, facial snapshots, spectrogram and recognition outputs.
Visualization of modality-aware fusion on the MISP2021 dataset. Test sample: R16_S242243244245_C02_I1_Far, Abs time: 1093--1099 s. Top: Video-aligned modality-aware gating value $g_{\mathrm{ma}}$ over relative time (s). Middle: Facial snapshots illustrating gradual visual occlusion–deocclusion transitions. Bottom: Spectrogram and recognition outputs, including ground truth (GT), Whisper-Flamingo (WF), and the proposed M2S-AVSR (M2S). Red tokens mark transcript mismatches. Shaded regions indicate occlusion–deocclusion transitions (orange) and time spans aligned with recognition errors (blue).
System Year Audio data (h) Visual data Backbone CER (%)
SJTU2022300LRW-1000Conformer34.02
NIO20223300LRW-1000Transformer25.07
USTC2023500w/o extraConformer24.58
ModalBiasAVSR20241000w/o extraConformer22.13
Whisper-Flamingo2025600LRS3+Vox2(En)Transformer26.08
M2S-A2026600--Transformer25.08
M2S-V2026--LRS3+Vox2(En)Transformer76.35
M2S-AV2026600LRS3+Vox2(En)Transformer21.95
M2S-AV with ROVER2026600LRS3+Vox2(En)Transformer18.82

12. AISHELL8-RealScene Benchmark Results

The AISHELL8-RealScene benchmark is split into indoor and outdoor subsets, and the paper reports both overall and camera-view-specific performance. The main conclusion is that the proposed model helps most in the harder outdoor scenes, where both background noise and visual interference are stronger. The authors note that audio-only systems remain competitive indoors, but visual information becomes more helpful in outdoor conditions.

Model Indoor CER (%) Outdoor CER (%) Overall CER (%)
Whisper-finetuned29.4942.0135.75
Fun-ASR-Nano34.6349.9442.74
FireRed-ASR23.6638.6931.19
Qwen3-ASR23.9838.7631.39
Whisper-Flamingo28.3041.6434.97
LLaMA-AVSR27.4340.8234.13
MMS-LLaMA25.5239.1632.34
M2S-AVSR (audio-only)26.7040.5633.64
M2S-AVSR25.3537.4731.41

On the outdoor subset, M2S-AVSR achieves the best CER, and the paper reports that this corresponds to a 7.3% relative reduction compared with the mean CER of Whisper-Flamingo and MMS-LLaMA. The result supports the paper’s thesis that the visual stream is especially useful when the acoustic scene is difficult.

A supplementary view-specific table further shows that training on all three camera views makes M2S-AVSR nearly view-invariant on AISHELL8-RealScene, with identical reported CER across D0/D1/D2 in the final row, while Whisper-Flamingo remains less robust across views.

Model Training views D0 D1 D2 Avg
Whisper-FlamingoSingle, Front D135.0534.4834.7534.76
Whisper-FlamingoAll35.1234.7635.0334.97
M2S-AVSRSingle, Front D132.0431.7331.9931.92
M2S-AVSRAll31.4131.4131.4131.41

13. Ablation Study

The ablation study isolates the effects of the MVL encoder, the consistency/domain-alignment losses, and the modality-aware gates. The baseline without MVL or modality-aware fusion reaches 5.55% noisy WER. Adding only the MVL encoder gives modest gains, but combining both $L_{\mathrm{MVC}}$ and $L_{\mathrm{RDA}}$ is better than using either alone. The biggest improvements come from the modality-aware fusion block, especially the quality-aware gate and synchrony-aware gate together.

System Multi-view data $L_{\mathrm{MVC}}$ $L_{\mathrm{RDA}}$ $g_q$ $g_s$ Clean WER Noisy WER View $15^\circ$ Mask
w/o MVL + MAF----------1.325.557.168.50
w/ MVC------1.325.477.048.41
w/ RDA------1.325.606.948.45
w/ MVC + RDA----1.325.436.058.42
w/ QualityGate--1.102.936.146.37
w/ SynchronyGate--0.882.886.106.51
w/ Modality-Aware--0.822.906.787.23
w/ Modality-Aware0.822.845.786.01

The ablation supports three conclusions. First, multi-view training alone is not sufficient; the MVL objectives are needed to translate synthetic diversity into useful view invariance. Second, the quality and synchrony gates both matter, and each improves robustness in slightly different ways. Third, the combination of multi-view representation learning and modality-aware fusion gives the best overall result, reducing noisy WER by 48.8% relative to the system without these components.

14. Practical Takeaways

  • The paper’s core design principle is to separate representation robustness from fusion reliability: first learn view-invariant lip features, then decide how much to trust them at decode time.
  • The modality-aware module is explicitly designed for realistic failure modes, not just synthetic noise. It uses both visual quality and audio-visual synchrony rather than a single confidence signal.
  • AISHELL8-RealScene fills an important gap by providing a real-world multi-view conversational benchmark with indoor/outdoor settings and interfering background speakers.
  • The results suggest that the method is especially valuable when the visual stream is unstable or the acoustic scene is corrupted; on clean data, gains are smaller, but robustness improves substantially.

15. Limitations and Scope

The paper does not present a dedicated failure-analysis section, so the limitations must be read from the experimental scope. The reported improvements are concentrated in challenging conditions: viewpoint perturbation, visual masking, noisy audio, and real-world scene variation. The method is still built on large pretrained backbones and a multi-stage training pipeline, so its computational footprint is non-trivial. In addition, the strongest gains are demonstrated on the paper’s chosen benchmarks and evaluation protocols; the paper does not claim broader generalization beyond the tested AVSR datasets and conditions.

16. Bottom Line

M2S-AVSR is a robust AVSR system that combines multi-view self-supervised visual representation learning with a modality-aware gated fusion decoder. The method is carefully engineered around the two main sources of real-world AVSR failure: viewpoint-dependent visual variation and unreliable audio-visual coupling. Across LRS3, MISP2021-AVSR, and AISHELL8-RealScene, the paper shows that this design improves robustness, especially in noisy, occluded, and multi-view conditions, and that the newly released AISHELL8-RealScene benchmark is a meaningful testbed for future work in realistic multimodal speech recognition.