Can Audio–Language Models Reason Over Time?

Large language models have become remarkably capable with text, and recent large audio language models (LALMs) aim to extend these abilities to sound. But understanding audio is not only about recognising what sound is present. A truly capable audio AI system should also reason about when events occur, how long they last, and how many sound sources are present.

To study this capability, researchers at the LEAP Lab, Indian Institute of Science, developed TREA (Temporal Reasoning Evaluation of Audio), a benchmark designed to test whether audio–language models can perform fine-grained temporal reasoning.

What problem does TREA address?

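Most audio benchmarks focus on tasks such as classification, captioning, speech recognition, or general audio question answering. These tasks mainly test what a recording contains. Temporal reasoning additionally requires tracking when each sound event occurs, how long it lasts, and how events relate to one another.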

For example, an audio model may need to answer:

  • Order: Which sound occurred after the dog bark?
  • Duration: Which event lasted the longest?
  • Count: How many unique sound sources are present?

These questions require the model to go beyond identifying isolated sounds and instead understand the structure of events over time.
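To make the three question types concrete, here is a minimal Python sketch of how reference answers could be derived from a clip whose sound events are annotated with start and end times. The `Event` record, the helper names, and the example clip are all hypothetical illustrations, not the TREA data format or the authors' construction pipeline. The sketch also assumes that "unique sound sources" can be approximated by distinct event labels.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical annotation record (not the TREA schema)."""
    label: str     # sound class, e.g. "dog bark"
    onset: float   # start time in seconds
    offset: float  # end time in seconds


def event_after(events, reference_label):
    """Order: label of the event that starts right after the first
    occurrence of reference_label, or None if it is the last event."""
    ordered = sorted(events, key=lambda e: e.onset)
    labels = [e.label for e in ordered]
    idx = labels.index(reference_label)
    return labels[idx + 1] if idx + 1 < len(labels) else None


def longest_event(events):
    """Duration: label of the event with the largest offset - onset."""
    return max(events, key=lambda e: e.offset - e.onset).label


def count_sources(events):
    """Count: number of distinct labels (assumed proxy for unique sources)."""
    return len({e.label for e in events})


# Made-up example clip, for illustration only.
clip = [
    Event("dog bark", 0.0, 1.5),
    Event("car horn", 2.0, 2.8),
    Event("rain", 3.0, 8.0),
    Event("dog bark", 8.5, 9.5),
]

print(event_after(clip, "dog bark"))  # sound that follows the first dog bark
print(longest_event(clip))            # event with the longest duration
print(count_sources(clip))            # number of distinct sound labels
```

None of these answers can be obtained by recognising a single sound in isolation: each one depends on the timing of the whole sequence of events.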

At a glance

Dataset: TREA — Temporal Reasoning Evaluation of Audio
Task format: Multiple-choice audio question answering
Total samples: 600
Subtasks: Ordering, duration, and counting
Models evaluated: Qwen2-Audio, SALMONN, WavLLM
Human baseline: Evaluated on a subset of TREA questions
Additional focus: Confidence and uncertainty estimation for audio–language models

Key findings

  • Current open-source LALMs remain far behind humans on temporal audio reasoning.
  • Ordering is relatively easy, while duration and counting remain especially challenging.
  • Accuracy alone is insufficient: models that perform better are not always better calibrated or more reliable.
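The gap between accuracy and calibration can be illustrated with expected calibration error (ECE), a generic and widely used calibration measure. It is shown here only as an illustration and is not necessarily the confidence or uncertainty measure used in the paper. The function name, the binning scheme, and the toy numbers below are assumptions made for this sketch.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Generic ECE: weighted average, over equal-width confidence bins,
    of |mean confidence - accuracy| within each bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Toy numbers, invented for illustration (not results from the paper).
# Model A: 75% accurate but always 90% confident (overconfident).
ece_a = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
# Model B: 50% accurate and always 50% confident (well calibrated).
ece_b = expected_calibration_error([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])

print(ece_a, ece_b)
```

In this toy case the more accurate model has the larger calibration error, which is the kind of mismatch the last finding describes.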

Paper details

Title: Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning
Authors: Debarpan Bhattacharya, Apoorva Kulkarni, and Sriram Ganapathy
Accepted at: INTERSPEECH 2025, Rotterdam, The Netherlands

Paper link: https://arxiv.org/pdf/2505.13115