Akapulu Labs Research

Audio-Interaction

Audio Interaction Model


Audio-Interaction is a unified streaming audio-language model that listens continuously and decides when to respond in real time. It combines ASR, dialogue, translation, and proactive help, enabling interactive, timely multi-task audio understanding and response beyond offline models.

  • llm
  • audio-driven
  • dialogue
  • realtime
  • streaming
  • agent
  • asr

Demos

These demos show the Audio Interaction Model's ability to listen, decide, and respond to live audio streams in real time. The model unifies tasks such as speech recognition, translation, and proactive chatting in one system. Watch for its timely, context-aware responses and its proactive interventions, which outperform other models while keeping offline task quality.

Authors: Zhifei Xie, Zihang Liu, Ze An, Xiaobin Hu, Yue Liao, Ziyang Ma, Dongchao Yang, Mingbao Lin, Deheng Ye, Shuicheng Yan, Chunyan Miao

Categories: cs.SD, cs.AI, cs.CL, cs.MM, eess.AS

Comment: Next generation of LALMs, work in progress

Published 2026-06-03 · Updated 2026-06-03

Abstract

Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always-on perceive-decide-respond loop, listens to sound, environment, and instructions in real time and reacts on the fly. We formalize this regime as the Audio Interaction Model, and realize it with Audio-Interaction, a unified streaming model that retains offline task execution while adding online general audio instruction following, from dialogue to full voice chatting, deciding when to respond from the semantics of the stream. To enable this, we propose SoundFlow, a framework that instantiates the perceive-decide-respond loop end to end, from data to training to deployment, through streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference for stable real-time interaction. We further construct StreamAudio-2M, a 2.6M-item streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive help.


1. Problem Setting and Main Thesis

The paper argues that audio should not be treated as an offline input that is fully observed before any response is produced. Instead, it frames audio as an inherently interactive and real-time modality that calls for an always-on model that can continuously listen, decide whether to remain silent, and respond only when the semantics of the stream justify intervention. The authors name this new regime the Audio Interaction Model and implement it with Audio-Interaction, a unified streaming audio-language model that aims to preserve standard large audio language model capabilities while adding online behaviors such as streaming ASR, simultaneous interpretation, audio instruction following, voice chatting, and proactive assistance.

The central shift is from the offline formulation $y = f(x, \mathcal{A})$ to a streaming control loop that acts on a chunked audio stream. At each step, the model outputs a control decision and, if appropriate, a response. This is presented as a perceive-decide-respond loop, where the key novelty is not just generating text from audio, but deciding when to speak based on the unfolding acoustic context.

Figure: Audio-Interaction listens to a continuous audio stream and decides at each moment whether to stay silent or speak, unifying conventional capabilities (e.g., dialogue, ASR) and streaming-native capabilities (e.g., simultaneous translation, proactive help) within a single model.

The paper positions this work as moving beyond two unsatisfactory extremes: offline LALMs that wait for a complete clip and then answer once, and task-specific streaming models that handle only one capability at a time. The intended destination is a single streaming audio-language system that can support both conventional audio understanding and genuinely online interaction.

Human listening is a continuous activity. We take in sound moment by moment and judge for ourselves when a reaction is called for. Current audio models work the opposite way: they wait for a finished recording, answer once, and handle only one kind of task per system. Audio-Interaction closes this gap by processing sound as it arrives and judging, step by step, when to speak and when to hold back, so that one model covers what previously took many specialized ones.

2. Problem Formulation and Streaming Interface

The paper formalizes a streaming interaction regime in which the model consumes audio in fixed-length chunks and maintains a running context over prior chunks and prior decisions. Let $a_t$ denote the current audio chunk, $d_t$ the streaming intervention decision, and $r_t$ the generated response. The model is written as a sequential decision process:

$$ (d_t, r_t) = f\!\left(a_{\leq t}, d_{<t}, r_{<t}\right), $$

with the practical implementation reducing this to a control-token prediction over each chunk:

$$ d_t = f_{\mathrm{det}}(a_t, C_t), \qquad r_t = \begin{cases} \varnothing, & d_t = \langle \texttt{silent} \rangle, \\ f_{\mathrm{resp}}(a_t, C_t), & d_t = \langle \texttt{response} \rangle, \end{cases} $$

where $C_t$ is the streaming context accumulated up to step $t$. The decision token is a single special control symbol, and the training objective explicitly supervises both the language model output and the streaming control output.
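The control loop above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `StreamContext` container, the `run_stream` driver, and the two toy policies are hypothetical names introduced here, and text strings stand in for audio chunks.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

SILENT = "<silent>"
RESPONSE = "<response>"


@dataclass
class StreamContext:
    """Running context C_t: prior chunks, decisions, and responses (hypothetical container)."""
    chunks: List[str] = field(default_factory=list)
    decisions: List[str] = field(default_factory=list)
    responses: List[Optional[str]] = field(default_factory=list)


def run_stream(
    chunks: List[str],
    f_det: Callable[[str, StreamContext], str],
    f_resp: Callable[[str, StreamContext], str],
) -> List[Tuple[str, Optional[str]]]:
    """Apply d_t = f_det(a_t, C_t) per chunk and call f_resp only on <response>."""
    ctx = StreamContext()
    outputs = []
    for a_t in chunks:
        d_t = f_det(a_t, ctx)
        r_t = f_resp(a_t, ctx) if d_t == RESPONSE else None  # None plays the role of the empty response
        # Fold the current step into the context used by the next step.
        ctx.chunks.append(a_t)
        ctx.decisions.append(d_t)
        ctx.responses.append(r_t)
        outputs.append((d_t, r_t))
    return outputs


# Toy stand-ins for the learned model: respond only when a chunk contains a question.
def toy_det(a_t: str, ctx: StreamContext) -> str:
    return RESPONSE if "?" in a_t else SILENT


def toy_resp(a_t: str, ctx: StreamContext) -> str:
    return f"answer after {len(ctx.chunks) + 1} chunks"


result = run_stream(["hum", "what time is it?", "hum"], toy_det, toy_resp)
```

In the real model both functions are the same network, and the decision is a single control token; the sketch only makes the data flow of the loop explicit.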

3. SoundFlow: End-to-End Framework for Streaming Audio Interaction

The implementation framework is called SoundFlow. The paper decomposes it into three pieces: (1) streaming-native data construction, (2) comprehension-aware training, and (3) asynchronous low-latency inference. The point is to solve the entire pipeline end to end rather than bolting a streaming wrapper onto an offline audio LLM.

Figure: The training framework. Audio signals, intermediate representations, and supervision signals are organized into a unified temporal sequence, and a streaming training strategy jointly optimizes language modeling and response triggering, enabling Audio-Interaction to decide when to respond or remain silent across diverse real-time tasks.

3.1 Streaming-native data construction

The data pipeline is built to synthesize long, coherent streams from shorter clips while preserving acoustic plausibility and context-dependent intervention cues. It has two main pieces: a time-frequency joint preprocessing module and a hierarchical event curation pipeline.

The time-frequency preprocessing module, called TFJP in the paper, regularizes clip boundaries and reduces artifacts so clips can be stitched into longer streams. In the appendix, TFJP is described as a sequence of operations over a shared STFT representation: silence trimming, stationary-noise estimation from low-energy regions, spectral subtraction, core-span localization, boundary snapping to a half-chunk grid, and short Hann-window smoothing at the edges. The core timing constants are a chunk size of $400$ ms, a half-chunk alignment of $200$ ms, a smoothing window of $20$ ms, and a silence limit of $300$ ms.
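Two of the TFJP steps, boundary snapping and edge smoothing, are simple enough to sketch with the timing constants quoted above. The sketch makes assumptions the summary does not confirm: that snapping expands the span outward to the grid, and that audio is sampled at 16 kHz. The function names are hypothetical.

```python
import math

CHUNK_MS = 400                   # chunk size from the paper
HALF_CHUNK_MS = CHUNK_MS // 2    # 200 ms alignment grid
SMOOTH_MS = 20                   # Hann smoothing window at clip edges
SAMPLE_RATE = 16_000             # assumption: not stated in the summary


def snap_to_half_chunk(start_ms: int, end_ms: int) -> tuple:
    """Expand a clip span outward to the nearest half-chunk grid points (assumed direction)."""
    snapped_start = (start_ms // HALF_CHUNK_MS) * HALF_CHUNK_MS
    snapped_end = -(-end_ms // HALF_CHUNK_MS) * HALF_CHUNK_MS  # ceiling division
    return snapped_start, snapped_end


def hann_fade_in(num_samples: int) -> list:
    """Rising half of a Hann window, used as a fade-in gain ramp from 0 to 1."""
    if num_samples < 2:
        return [1.0] * num_samples
    return [
        0.5 * (1.0 - math.cos(math.pi * n / (num_samples - 1)))
        for n in range(num_samples)
    ]


span = snap_to_half_chunk(730, 1910)           # (600, 2000)
fade_len = SAMPLE_RATE * SMOOTH_MS // 1000     # 320 samples at 16 kHz
fade = hann_fade_in(fade_len)
```

Multiplying the first `fade_len` samples of a clip by `fade` (and the last ones by its reverse) removes clicks when clips are stitched on the 200 ms grid.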

The hierarchical event curation pipeline avoids naive random concatenation, which would create contradictory or implausible scene compositions. Instead, it uses an LLM to plan a coherent scenario, then refines each sub-event into a retrieval query and fallback caption, and finally grounds each clip by verifying whether it matches identity, cleanliness, duration fit, and continuity. Retrieved clips are checked against top-3 candidates, and if retrieval fails the pipeline falls back to generated audio from AudioX or ElevenLabs before verification.

3.2 Comprehension-aware training

The paper identifies two training-time failure modes: insufficient long-range context retention and false triggering on irrelevant sounds. To address them, it adds two forms of auxiliary supervision. First, history review inserts questions about content at least three turns earlier in the stream, forcing the model to retrieve older context. Second, silent-audio verification provides many non-speech examples that should not elicit a response, reducing over-triggering. This is intended to teach the model not just what to say, but whether it should say anything at all.

The training loss is a dual objective:

$$ \mathcal{L} = \frac{1}{N} \sum_{j=1}^{N} \left( -\log p_\theta(t_j \mid \mathcal{H}_j) + \lambda\,[-\log p_\theta(s_j \mid \mathcal{H}_j)] \right), $$

where $t_j$ is the target text token, $s_j$ is the target streaming control token, $\mathcal{H}_j$ is the decoding context, and $\lambda$ controls the balance between language modeling and streaming control. The selected value is $\lambda = 1.0$.
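As a worked instance of the dual objective, the sketch below evaluates $\mathcal{L}$ from per-position probabilities. It assumes the probabilities $p_\theta(t_j \mid \mathcal{H}_j)$ and $p_\theta(s_j \mid \mathcal{H}_j)$ have already been read off the model; the function name `dual_loss` is introduced here for illustration.

```python
import math


def dual_loss(text_probs, control_probs, lam=1.0):
    """Mean over positions of -log p(t_j) + lam * -log p(s_j)."""
    if not text_probs or len(text_probs) != len(control_probs):
        raise ValueError("need two non-empty probability lists of equal length")
    total = 0.0
    for p_text, p_ctrl in zip(text_probs, control_probs):
        total += -math.log(p_text) + lam * -math.log(p_ctrl)
    return total / len(text_probs)


# Two positions: the model is unsure about the text at both, and about control at the second.
loss = dual_loss([0.5, 0.25], [1.0, 0.5], lam=1.0)        # equals log(4)
loss_text_only = dual_loss([0.5, 0.25], [1.0, 0.5], lam=0.0)
```

Setting `lam=0.0` recovers plain language-model cross-entropy, which makes the extra weight placed on the control token easy to see.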

3.3 Asynchronous inference with FIFO scheduling

At inference time, the paper separates encoding and decoding using an asynchronous FIFO queue. The encoder continuously appends chunk embeddings to a queue, and the decoder only consumes them when the last emitted token indicates silence or end-of-response. This avoids the encoder/decoder waiting conflicts that arise in complex interaction patterns and is reported to reduce first-frame latency after a response by $4.5\times$.

[Paper figure 4: inference]

In the ablation table, the FIFO design reduces average first-chunk latency from $831$ ms to $392$ ms and eliminates stalls entirely, whereas removing FIFO introduces a $5.2\%$ stall rate.
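The scheduling idea can be illustrated with a single-threaded simulation. This is a toy model under stated assumptions, not the paper's inference engine: time advances in ticks, the encoder produces one chunk per tick, and a response occupies the decoder for a fixed number of ticks. The function `simulate` and its parameters are hypothetical.

```python
from collections import deque


def simulate(num_chunks, respond_at, response_ticks):
    """Encoder appends one chunk per tick; decoder drains the FIFO only while idle.

    Returns the order in which chunks were consumed and the peak queue depth.
    """
    queue = deque()
    consumed = []
    busy_until = 0      # first tick at which the decoder is idle again
    max_depth = 0
    tick = 0
    while len(consumed) < num_chunks:
        if tick < num_chunks:
            queue.append(tick)          # the encoder never waits for the decoder
        max_depth = max(max_depth, len(queue))
        if tick >= busy_until:          # last emitted token was silence or end-of-response
            while queue:
                chunk = queue.popleft()
                consumed.append(chunk)
                if chunk in respond_at:
                    busy_until = tick + response_ticks  # decoder starts speaking
                    break
        tick += 1
    return consumed, max_depth


order, peak = simulate(num_chunks=6, respond_at={1}, response_ticks=3)
```

While the decoder is responding to chunk 1, chunks 2 to 4 accumulate in the queue and are consumed in arrival order once the response ends, so no audio is dropped and the encoder never stalls.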

4. StreamAudio-2M: The Streaming Training Corpus

The paper’s major data contribution is StreamAudio-2M, a streaming-native corpus designed for the perceive-decide-respond regime rather than the short clip triplets common in conventional audio datasets. The main text describes it as a $2.6$M-item corpus totaling about $302$k hours, spanning $7$ major capability groups and $28$ sub-tasks. The dataset is organized around long, multi-turn streams with sparse response cues, so the model learns both when to intervene and what to generate.

Figure: StreamAudio-2M is a dataset built for streaming audio interaction, pairing long-form, real-world-simulated audio with token-level annotations. It jointly trains the model to interact in real time grounded in context while covering 7 foundational capabilities across 28 sub-tasks.

4.1 Capability taxonomy and composition

The seven reported capability families and their item counts in the main figure are:

| Capability | Items | Share |
|---|---|---|
| Voice chatting | 539k | 23.1% |
| Streaming instruction following | 487k | 20.8% |
| Streaming audio understanding | 382k | 16.4% |
| Streaming translation | 357k | 15.3% |
| Real-time ASR | 270k | 11.6% |
| Proactive response | 171k | 7.3% |
| Environment audio agent | 130k | 5.5% |

The paper also reports a source-data breakdown of 2.34M items, 7.49M rounds, and 66.7K source hours in the figure that summarizes the construction pipeline. The main point is that the corpus is assembled from many shorter resources and then converted into longer streaming interactions with token-level labels.
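A quick arithmetic check relates the per-capability counts to the totals quoted here. The counts are copied from the capability table (in thousands); everything else is plain arithmetic.

```python
# Item counts per capability, in thousands, as listed in the capability table.
items_k = {
    "Voice chatting": 539,
    "Streaming instruction following": 487,
    "Streaming audio understanding": 382,
    "Streaming translation": 357,
    "Real-time ASR": 270,
    "Proactive response": 171,
    "Environment audio agent": 130,
}
# Shares listed in the same table, in percent.
listed_share = {
    "Voice chatting": 23.1,
    "Streaming instruction following": 20.8,
    "Streaming audio understanding": 16.4,
    "Streaming translation": 15.3,
    "Real-time ASR": 11.6,
    "Proactive response": 7.3,
    "Environment audio agent": 5.5,
}

total_k = sum(items_k.values())
computed_share = {name: 100.0 * count / total_k for name, count in items_k.items()}
max_gap = max(abs(computed_share[name] - listed_share[name]) for name in items_k)
```

The counts sum to 2,336k, which matches the 2.34M source-data figure rather than the 2.6M headline number, and every listed share agrees with the recomputed one to within 0.1 percentage point, as expected for counts rounded to the nearest thousand.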

4.2 Source corpora and how they are used

The corpus draws from dialogue sources such as MOSS and GammaCorpus-Fact-QA; speech corpora such as CommonVoice, GigaSpeech, LibriSpeech, and VoxPopuli; speech translation corpora such as CoVoST2 and AISHELL; audio understanding sources such as AudioSet and FMA; event sources such as AudioSet events, AudioX, and ElevenLabs; and noise corpora such as MUSAN, WHAM!, and DNS-Challenge. Text sources are converted into speech with CosyVoice, then checked by ASR and rewriting verification to keep the round trip within a WER threshold of $0.10$.
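The round-trip check can be sketched as a word error rate filter. This is an illustration under assumptions: WER is computed as word-level edit distance divided by reference length after lowercasing, and a sample is kept when its WER is at or below the threshold (whether the boundary is inclusive is not stated). The TTS and ASR steps themselves are omitted, and the function names are hypothetical.

```python
WER_THRESHOLD = 0.10  # threshold quoted in the paper summary


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))           # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def keep_sample(source_text: str, asr_transcript: str) -> bool:
    """Keep a synthesized utterance only if ASR recovers the source text closely enough."""
    return word_error_rate(source_text, asr_transcript) <= WER_THRESHOLD


source = "the quick brown fox jumps over the lazy dog today"
one_error = "the quick brown fox jumped over the lazy dog today"
two_errors = "the quick brown fox jumped over a lazy dog today"
```

Here `keep_sample(source, one_error)` passes with a WER of 0.1, while `keep_sample(source, two_errors)` fails with a WER of 0.2.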

Importantly, the dataset is not just a mixture of tasks. It is composed into long streams with interleaved foreground, background, and ambient events, role-dependent gains, and mixed noise tracks so that the model sees realistic acoustic continuity and has to learn selective silence.

4.3 ProactiveSound-Bench

The second dataset contribution is ProactiveSound-Bench, a benchmark for proactive audio intervention. It contains $644$ human-designed events, arranged into $6$ top-level categories and $17$ sub-categories, and uses two evaluation tiers: Single and Multiple. The model must decide whether to respond at all, and if it does, generate a useful natural-language intervention rather than a generic acknowledgment.

The benchmark targets high-stakes triggers such as acute human distress, severe weather, equipment damage, fire indicators, and other safety-critical signals. It is designed specifically to test the selectivity of response, not just recognition. The paper emphasizes that the benchmark is different from sound event detection or captioning because it jointly evaluates intervention and response quality.

5. Training Recipe and Optimization Details

Audio-Interaction is initialized from Qwen2.5-Omni-3B, chosen as a compact but capable base model. Training is organized into four stages, with progressively more streaming-specific supervision. The paper gives the following recipe:

| Stage | Trainable modules | Main purpose |
|---|---|---|
| Stage 1: format training | LM head + new embedding | Teach sequence formatting and the control token on offline data |
| Stage 2: adapter training | Adapter only | Map chunk-wise acoustic representations into LM space |
| Stage 3: large-scale streaming SFT | Adapter + LM | Jointly train core capabilities: audio understanding, ASR, dialogue, and translation |
| Stage 4: instruction-following fine-tuning | Adapter + LM | Learn complex streaming behavior, proactive response, and interleaved multi-turn interaction |

The implementation uses $400$ ms chunks, a $24$ s maximum stream length ($60$ chunks), bf16 mixed precision, gradient checkpointing, and DeepSpeed ZeRO-2 on $32$ NVIDIA H100 $80$ GB GPUs. The reported total wall-clock training time is about $10$ days. Learning rates range from $10^{-4}$ in the early stages to $10^{-5}$ in the final instruction-following stage, and the optimizer is AdamW with cosine decay and linear warmup.
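The chunk arithmetic and the shape of the learning-rate schedule can be sketched as follows. The schedule is a generic linear-warmup, cosine-decay curve; the warmup length, total step count, and floor learning rate are placeholders chosen for illustration, since the summary does not report them.

```python
import math

CHUNK_MS = 400
MAX_STREAM_MS = 24_000
num_chunks = MAX_STREAM_MS // CHUNK_MS   # 60 chunks per maximum-length stream


def lr_at(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (generic schedule)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 1000 steps with 100 warmup steps at the early-stage peak of 1e-4.
lr_start = lr_at(0, 1000, 1e-4, 100)
lr_peak = lr_at(100, 1000, 1e-4, 100)
lr_mid = lr_at(550, 1000, 1e-4, 100)
lr_end = lr_at(1000, 1000, 1e-4, 100)
```

With these placeholder values the rate climbs from 1e-6 to 1e-4 over the warmup, passes 5e-5 halfway through the decay, and reaches the floor at the final step.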

6. Experimental Setup

The evaluation suite spans $8$ benchmarks covering the core large audio language model space: MMAU for general audio understanding, four spoken-dialogue benchmarks under the VoiceBench setting, LibriSpeech for ASR, CoVoST2 for speech-to-text translation, and ProactiveSound-Bench for proactive intervention. The paper compares against three broad baseline families: specialized task models, audio LLMs, and omni multimodal models, plus streaming spoken-dialogue systems such as Moshi and Freeze-Omni.

7. Main Results

7.1 General audio understanding on MMAU

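| Model | Stream. | Multi-turn | Text instr. avg. | Audio instr. avg. |
|---|---|---|---|---|
| Audio Flamingo 2 | No | No | 62.40 | 1.16 |
| Qwen2-Audio | No | Yes | 49.20 | 19.41 |
| Voxtral-Mini | No | Yes | 50.60 | 37.24 |
| Audio-Reasoner | No | No | 61.71 | 20.57 |
| Qwen2.5-Omni 3B | No | Yes | 57.81 | 42.51 |
| Qwen2.5-Omni 7B | No | Yes | 65.60 | 49.58 |
| Phi-4-multimodal | No | Yes | 55.56 | 31.75 |
| Baichuan-Omni-1.5 | No | Yes | 59.90 | 40.40 |
| Audio-Interaction | Yes | Yes | 55.68 | 58.15 |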

The main claim here is not that the streaming model dominates every offline setting. Rather, it preserves strong general audio understanding while becoming dramatically better under spoken instructions: the audio-instruction average of $58.15$ is the headline result, and it exceeds the base Qwen2.5-Omni-3B audio-instruction score of $42.51$ by a large margin. The paper also notes that the model remains competitive on text-instruction MMAU and is smaller than several 7B baselines.

7.2 Spoken dialogue

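| Model | LLaMA Q. | Web Q. | AlpacaEval | SD-QA |
|---|---|---|---|---|
| Moshi | 62.20 | 26.30 | 2.01 | 15.01 |
| Freeze-Omni | 72.00 | 44.73 | 4.14 | 50.16 |
| Baichuan-Omni-1.5 | 78.50 | 59.10 | 4.50 | 43.40 |
| Qwen2-Audio | 69.67 | 45.20 | 3.74 | 35.71 |
| Qwen2.5-Omni 3B | 66.00 | 27.95 | 4.32 | 49.37 |
| Qwen2.5-Omni 7B | 75.33 | 62.80 | 4.49 | 55.71 |
| Phi-4-multimodal | 60.20 | 26.60 | 3.81 | 39.78 |
| Audio-Interaction | 67.31 | 54.34 | 4.28 | 52.14 |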

On spoken dialogue, Audio-Interaction is a strong 3B-scale system, but not a universal winner. It is competitive with the 3B and mid-scale baselines and notably strong on SD-QA, while some larger offline models still score higher on certain dialogue subsets. The key point is that the streaming model does not collapse on these tasks despite being trained for online interaction.

7.3 ASR and speech translation

| Model | LibriSpeech clean WER | LibriSpeech other WER | CoVoST2 en$\to$zh BLEU | CoVoST2 zh$\to$en BLEU |
|---|---|---|---|---|
| Canary | 1.48 | 2.93 | -- | -- |
| Canary-Qwen | 1.49 | 3.10 | -- | -- |
| Qwen2-Audio | 1.60 | 3.60 | 45.20 | 24.40 |
| Qwen2.5-Omni 3B | 2.87 | 5.90 | 39.50 | 18.17 |
| Qwen2.5-Omni 7B | 1.80 | 3.40 | 41.40 | 29.40 |
| Phi-4-multimodal | 1.69 | 3.82 | 46.30 | 22.39 |
| Baichuan-Omni-1.5 | 5.71 | 10.09 | -- | -- |
| Audio-Interaction | 3.17 | 6.04 | 55.22 | 35.21 |

The translation results are one of the clearest wins in the paper. Audio-Interaction does not beat dedicated ASR systems on WER ($3.17$ clean and $6.04$ other, against $1.48$ and $2.93$ for Canary), which is expected for a general streaming model, but it achieves the strongest reported BLEU in both translation directions and stays close to its base model on recognition. The authors explicitly characterize the ASR WER regression relative to the offline base model ($2.87$ to $3.17$ on clean, $5.90$ to $6.04$ on other) as a modest cost of chunk-wise streaming decoding.

7.4 Proactive intervention

| Model | Single avg. | Multiple avg. | Key pattern |
|---|---|---|---|
| Qwen2.5-Omni 3B | 41.0 | 29.3 | Falls with longer streams |
| Qwen2.5-Omni 7B | 58.2 | 32.1 | Strong single-shot, collapses in long streams |
| Kimi-Audio-Instruct | 39.9 | 28.4 | Moderate, still unstable |
| MiniCPM-o-4.5 | 58.9 | 58.9 | Very high over-triggering; poor selectivity |
| Step-Audio 2 | 8.9 | 3.0 | Very weak on this benchmark |
| Gemini-3-Flash | 37.0 | 50.8 | Improves in multi-round setting |
| Audio-Interaction | 61.2 | 62.8 | Best overall; stable over long streams |

The proactive benchmark is where the streaming paradigm is most clearly differentiated. Audio-Interaction achieves the best average score on both Single and Multiple tiers and is the only model that the paper claims combines balanced category coverage with stability as stream length grows. The paper also reports that its trigger accuracy reaches $96.77\%$ in the ablation setting.

8. Ablations and Internal Analyses

The paper includes several ablations to isolate which pieces of the pipeline matter. The main findings are: FIFO scheduling is necessary for stable low-latency inference; streaming training and the synthetic-stream data pipeline contribute cumulatively; chunk size trades off accuracy against latency; and the dual-loss coefficient balances response triggering and understanding.

Figure: Capability stability of Audio-Interaction as the stream extends from 1 to 5 concatenated segments. We report MMAU average accuracy, dialogue accuracy, and end-to-end latency.

8.1 Core ablation findings

| Variant | MMAU | AlpacaEval | Trigger accuracy | Latency / stall note |
|---|---|---|---|---|
| Baseline | 57.81 | 4.32 | -- | -- |
| + Streaming SFT | 58.56 | 4.17 | 92.42% | Improves control |
| V2 without TFJP | 57.74 | 4.19 | 85.35% | Boundary smoothing matters |
| V2 without event selection | 55.11 | 4.25 | 88.51% | Semantic coherence matters |
| Audio-Interaction | 58.15 | 4.28 | 96.77% | Best overall |

The paper also reports that removing FIFO raises average first-chunk latency from $392$ ms to $831$ ms and increases stall rate from $0.0\%$ to $5.2\%$. For chunk size, $0.2$ s is too small and harms accuracy, while $0.6$ s and $0.8$ s improve accuracy but increase latency to $674$ ms and $786$ ms respectively. The chosen $0.4$ s setting is the best overall trade-off, reaching $58.15$ MMAU and $392$ ms latency.

The dual-loss coefficient ablation shows that increasing $\lambda$ improves trigger accuracy but can slightly harm comprehension at high values. The selected $\lambda=1.0$ is the paper’s balance point: it keeps MMAU near its peak while achieving very high trigger accuracy.

8.2 Continuity reconstruction and control-head localization

Beyond benchmark scores, the authors probe where the offline-to-streaming transition happens inside the model. They compute a continuity ratio defined as the cosine similarity of boundary pairs relative to intra-chunk pairs. The audio encoder starts low, around $0.25$, the projector changes this very little, and the earliest GPT block reconstructs continuity sharply, lifting the ratio to about $0.80$. This suggests that streaming continuity is not imposed by the frontend but reconstructed in the decoder through cross-chunk key-value cache access.
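One way to read the continuity ratio is as the mean cosine similarity of adjacent frames that straddle a chunk boundary, divided by the mean cosine similarity of adjacent frames inside a chunk. The sketch below implements that reading; the choice of adjacent-frame pairs is an assumption, since the summary does not spell out how pairs are sampled, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_u * norm_v)


def continuity_ratio(chunks):
    """chunks: list of chunks, each a list of frame vectors from one layer."""
    boundary, intra = [], []
    for idx, chunk in enumerate(chunks):
        # Adjacent frames inside the same chunk.
        for left, right in zip(chunk, chunk[1:]):
            intra.append(cosine(left, right))
        # Last frame of this chunk against the first frame of the next one.
        if idx + 1 < len(chunks):
            boundary.append(cosine(chunk[-1], chunks[idx + 1][0]))
    if not boundary or not intra:
        raise ValueError("need at least two chunks with two frames each")
    return (sum(boundary) / len(boundary)) / (sum(intra) / len(intra))


# Frames that change abruptly at the boundary versus frames that do not change at all.
broken = continuity_ratio([[(1.0, 0.0), (1.0, 0.0)], [(0.0, 1.0), (0.0, 1.0)]])
smooth = continuity_ratio([[(1.0, 0.0), (1.0, 0.0)], [(1.0, 0.0), (1.0, 0.0)]])
```

A ratio near 1 means boundary frames are as similar as within-chunk frames, which is the regime the paper reports after the earliest GPT block; a ratio near 0.25 corresponds to the encoder output, where chunk boundaries are still visible.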

For the silent-vs-response control decision, the paper performs single-head ablation over $576$ attention heads and finds that one head, identified as L35H14, dominates across tasks. Removing it alone reduces the S2TT token-match score by $0.88$, and the task-specific rankings are highly correlated. The conclusion is that streaming control is concentrated in a narrow, shared pathway rather than split into separate circuits for each task.

Figure: Results of cross-chunk continuity ratio across the audio encoder, audio projector, and GPT blocks on four tasks.

9. Qualitative Case Study and Real-World Validation

The appendix extends the synthetic-stream evaluation to roughly $2$ hours of naturally recorded audio from four deployment scenarios: travel, work, home, and commute. These recordings were captured on consumer devices and were not processed by TFJP, so they test the model in more realistic conditions. The model retains most of its synthetic-stream behavior: average trigger accuracy is $58.9\%$ versus $62.0\%$ on a matched synthetic split; first-chunk latency stays within about $\pm 25$ ms of the synthetic measurement; per-chunk silence rates correlate at $0.91$ with the synthetic split; and the dominant control head and continuity metrics remain similar. Performance degrades most in travel and commute because crowd noise and non-stationary audio make both recognition and trigger selectivity harder.

Figure: Case studies show Audio-Interaction's gains over SOTA streaming models. In the second case, other models detect the cat mostly through the transcribed word "meow", while Audio-Interaction handles the audio cue directly via native streaming training.

The qualitative case study underscores a recurring theme: some streaming baselines rely too heavily on transcript-like cues, while Audio-Interaction is intended to react to the audio event itself. This matches the broader motivation of the paper, which is to move from spoken-text surrogates to genuinely audio-native interaction.

10. Error Analysis and Practical Weaknesses Reported by the Paper

The paper does not present a formal limitations section, but it does report several failure modes in the appendix that are useful for understanding where the model is still weak.

  • LibriSpeech ASR: the largest error class is local token deviation, followed by rare-word and long-utterance degradation, then function-word bias and decoding loops.
  • CoVoST2 translation: low-BLEU outputs are dominated by semantic hallucinations, with some incomplete or mixed-language translations.
  • MMAU: about $20\%$ of errors are generation collapse, and the rest are genuine recognition or reasoning mistakes.
  • SpokenQA: factual hallucination is the dominant issue, followed by irrelevant or generalized responses and numerical/temporal mistakes.
  • VoiceBench: low-score outputs often reflect hallucination or inappropriate refusal, while SD-QA errors include hallucination, miscomprehension, and over-refusal.
  • ProactiveSound-Bench: false positives dominate the errors, meaning the model sometimes overreacts to benign sounds; false negatives still occur for safety-critical cues.

These error patterns matter because they align with the paper’s central objective: the hard part of streaming audio interaction is not just recognition, but calibrated intervention. The reported weaknesses show that the model is still vulnerable to hallucination, over-triggering, and long-tail recognition errors, especially in noisy or ambiguous contexts.
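The distinction between over-triggering and missed cues can be made concrete by counting decision errors. The sketch below assumes one gold label per step saying whether a response is warranted; this is an illustrative definition, not necessarily the metric used by ProactiveSound-Bench, and `trigger_stats` is a hypothetical helper.

```python
def trigger_stats(gold, pred):
    """Compare gold and predicted respond/silent decisions (True means respond)."""
    if not gold or len(gold) != len(pred):
        raise ValueError("need two non-empty decision lists of equal length")
    false_positives = sum(1 for g, p in zip(gold, pred) if p and not g)  # spoke on a benign sound
    false_negatives = sum(1 for g, p in zip(gold, pred) if g and not p)  # missed a real cue
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return {
        "accuracy": correct / len(gold),
        "false_positives": false_positives,
        "false_negatives": false_negatives,
    }


gold = [False, False, True, False, True]
pred = [False, True, True, False, False]
stats = trigger_stats(gold, pred)
```

In this toy stream the model is right on three of five steps, with one false positive and one false negative. Reporting the two error types separately matters because, in safety-critical settings, a missed cue and an unnecessary interruption carry very different costs.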

11. What the Paper Concludes

The paper’s conclusion is that moving from offline LALMs to an always-on audio interaction model is both feasible and useful. With SoundFlow, StreamAudio-2M, and ProactiveSound-Bench, the authors show that a single model can retain competitive performance on mainstream audio tasks while unlocking streaming-native capabilities that offline models cannot express. The strongest claims supported by the reported experiments are: (1) streaming training can preserve much of the general audio understanding of the base model; (2) audio instruction following and proactive response are materially improved by native streaming supervision; and (3) the streaming control policy can be learned as a compact, internalized decision mechanism rather than an external scheduler.